MolXPT: Wrapping Molecules with Text for Generative Pre-training

by   Zequn Liu, et al.
Renmin University of China
Peking University

Generative pre-trained Transformers (GPT) have demonstrated great success in natural language processing, and related techniques have been adapted to molecular modeling. Since text is the most important record of scientific discovery, in this paper we propose MolXPT, a unified language model of text and molecules, pre-trained on SMILES (a sequence representation of molecules) wrapped by text. Briefly, we detect the molecule names in each sequence and replace them with the corresponding SMILES. In this way, the SMILES can leverage information from the surrounding text, and vice versa. The wrapped sequences, text sequences from PubMed, and SMILES sequences from PubChem are all fed into a language model for pre-training. Experimental results demonstrate that MolXPT outperforms strong baselines for molecular property prediction on MoleculeNet, performs comparably to the best model in text-molecule translation while using less than half of its parameters, and enables zero-shot molecular generation without finetuning.
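The "wrapping" step described in the abstract can be sketched as below. This is a minimal illustration only: the `NAME_TO_SMILES` table and the example molecule names are hypothetical stand-ins for the named-entity detector and PubChem lookup that the paper actually uses.

```python
import re

# Illustrative name-to-SMILES table; MolXPT links detected molecule
# names to database entries rather than using a fixed dictionary.
NAME_TO_SMILES = {
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "caffeine": "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",
}

def wrap_with_smiles(sentence: str) -> str:
    """Replace each detected molecule name with its SMILES string,
    producing a 'wrapped' sequence that mixes text and SMILES tokens."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, NAME_TO_SMILES)) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(
        lambda m: NAME_TO_SMILES[m.group(0).lower()], sentence
    )

print(wrap_with_smiles("Aspirin inhibits COX enzymes."))
# -> CC(=O)OC1=CC=CC=C1C(=O)O inhibits COX enzymes.
```

Sequences wrapped this way, together with plain text and plain SMILES corpora, form the mixed pre-training stream the abstract describes.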


SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery

In drug-discovery-related tasks such as virtual screening, machine learn...

Pre-training Transformers for Molecular Property Prediction Using Reaction Prediction

Molecular property prediction is essential in chemistry, especially for ...

Automated 3D Pre-Training for Molecular Property Prediction

Molecular property prediction is an important problem in drug discovery ...

Geometry-aware Line Graph Transformer Pre-training for Molecular Property Prediction

Molecular property prediction with deep learning has gained much attenti...

Interactive Molecular Discovery with Natural Language

Natural language is expected to be a key medium for various human-machin...

Dual-view Molecule Pre-training

Inspired by its success in natural language processing and computer visi...

On Automatic Text Extractive Summarization Based on Graph and pre-trained Language Model Attention

Representing text as graph to solve the summarization task has been disc...
