Keeping it Simple: Language Models can learn Complex Molecular Distributions

by Daniel Flam-Shepherd, et al.

Deep generative models of molecules have grown immensely in popularity; trained on relevant datasets, these models are used to search through chemical space. The downstream utility of generative models for the inverse design of novel functional compounds depends on their ability to learn a training distribution of molecules. The simplest example is a language model that takes the form of a recurrent neural network and generates molecules using a string representation. More sophisticated are graph generative models, which sequentially construct molecular graphs and typically achieve state-of-the-art results. However, recent work has shown that language models are more capable than once thought, particularly in the low-data regime. In this work, we investigate the capacity of simple language models to learn distributions of molecules. For this purpose, we introduce several challenging generative modeling tasks by compiling especially complex distributions of molecules. On each task, we evaluate the ability of language models as compared with two widely used graph generative models. The results demonstrate that language models are powerful generative models, capable of adeptly learning complex molecular distributions, and that they yield better performance than the graph models. Language models can accurately generate distributions of the highest-scoring penalized LogP molecules in ZINC15, multi-modal molecular distributions, and the largest molecules in PubChem.



