BARTSmiles: Generative Masked Language Models for Molecular Representations

11/29/2022
by Gayane Chilingaryan, et al.

We discover a robust self-supervised strategy, tailored to molecular representations, for generative masked language models through a series of in-depth ablations. Using this pre-training strategy, we train BARTSmiles, a BART-like model with an order of magnitude more compute than previous self-supervised molecular representations. In-depth evaluations show that BARTSmiles consistently outperforms other self-supervised representations across classification, regression, and generation tasks, setting a new state of the art on 11 tasks. We then quantitatively show that, when applied to the molecular domain, the BART objective learns representations that implicitly encode our downstream tasks of interest. For example, by selecting seven neurons from a frozen BARTSmiles, we obtain a model whose performance is within two percentage points of the fully fine-tuned model on the ClinTox task. Lastly, we show that standard attribution interpretability methods, when applied to BARTSmiles, highlight the substructures that chemists use to explain specific properties of molecules. The code and the pretrained model are publicly available.
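The BART objective mentioned in the abstract corrupts the input sequence and trains an encoder-decoder model to reconstruct it. A minimal sketch of that corruption step on a SMILES string, assuming a naive character-level tokenizer (real chemical tokenizers keep multi-character atoms such as "Cl" intact, and BART draws span lengths from a Poisson distribution rather than the short uniform spans used here), might look like the following; this is illustrative only, not the authors' released code.

```python
import random

MASK = "<mask>"

def corrupt_smiles(smiles: str, mask_ratio: float = 0.3, seed: int = 0):
    """BART-style text infilling on a SMILES string.

    Contiguous token spans are replaced by a single <mask> token; the
    seq2seq model is then trained to reconstruct the original string.
    """
    rng = random.Random(seed)
    tokens = list(smiles)  # naive char-level tokens (illustrative only)
    n_to_mask = max(1, int(len(tokens) * mask_ratio))
    out, i = [], 0
    while i < len(tokens):
        if n_to_mask > 0 and rng.random() < mask_ratio:
            span = min(rng.randint(1, 3), n_to_mask)
            out.append(MASK)        # the whole span collapses to one mask
            i += span
            n_to_mask -= span
        else:
            out.append(tokens[i])
            i += 1
    return "".join(out), smiles    # (corrupted input, reconstruction target)

corrupted, target = corrupt_smiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
print(corrupted, "->", target)
```

The (corrupted, target) pair would then feed a standard encoder-decoder cross-entropy loss during pre-training.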
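The seven-neuron ClinTox result is a probing experiment: a small readout trained on a few coordinates of the frozen representation. The sketch below reproduces the shape of such a probe on synthetic features; the arrays, the univariate selection criterion, and the logistic readout are stand-ins for illustration, not the paper's protocol.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for pooled frozen BARTSmiles features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1024))                 # 500 molecules, 1024 dims
y = (X[:, 3] + 0.5 * X[:, 17] > 0).astype(int)   # toy binary label

# Select 7 "neurons" and fit a linear probe on them alone.
selector = SelectKBest(f_classif, k=7).fit(X, y)
probe = LogisticRegression(max_iter=1000).fit(selector.transform(X), y)
print("train accuracy:", probe.score(selector.transform(X), y))
```

If a probe this small recovers most of the fine-tuned performance, the property is already encoded almost linearly in a handful of dimensions of the frozen representation, which is the paper's point.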
