PoET: A generative model of protein families as sequences-of-sequences

06/09/2023
by Timothy F. Truong Jr, et al.

Generative protein language models are a natural way to design new proteins with desired functions. However, current models are either difficult to direct toward a specific family of interest, or must be trained on a large multiple sequence alignment (MSA) from that family, which prevents them from benefiting from transfer learning across families. To address this, we propose the Protein Evolutionary Transformer (PoET), an autoregressive generative model of whole protein families that learns to generate sets of related proteins as sequences-of-sequences across tens of millions of natural protein sequence clusters. PoET can be used as a retrieval-augmented language model to generate and score arbitrary modifications conditioned on any protein family of interest, and it can extrapolate from short context lengths to generalize well even for small families. This is enabled by a unique Transformer layer that models tokens sequentially within sequences while attending between sequences in an order-invariant manner, which allows PoET to scale to context lengths beyond those used during training. In extensive experiments on deep mutational scanning datasets, PoET outperforms existing protein language models and evolutionary sequence models for variant function prediction, improving variant effect prediction across proteins of all MSA depths.
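
The sequences-of-sequences idea can be illustrated with a toy sketch. It is not the paper's implementation: the abstract does not specify how order invariance is achieved, so the sketch assumes one plausible mechanism, namely that each token carries only its position within its own sequence and no index of where that sequence sits in the family. The function names `flatten_family` and `context_view` and the stop token `$` are hypothetical and used only for illustration. Under that assumption, the collection of (token, relative offset) pairs that a query token sees is the same for every ordering of the context sequences.

```python
from itertools import permutations


def flatten_family(seqs, stop="$"):
    """Concatenate a family into one token stream.

    Returns three parallel lists: the tokens, the index of the sequence each
    token came from, and each token's position *within its own sequence*.
    The stop token is a hypothetical end-of-sequence marker.
    """
    tokens, seq_ids, positions = [], [], []
    for sid, seq in enumerate(seqs):
        for pos, tok in enumerate(seq + stop):
            tokens.append(tok)
            seq_ids.append(sid)
            positions.append(pos)
    return tokens, seq_ids, positions


def context_view(context_seqs, query_pos):
    """Unordered view of the context for a query token at `query_pos`.

    Each context token is described only by its identity and its offset
    relative to the query's within-sequence position (an assumption of this
    sketch). Sorting turns the result into a canonical multiset, which is
    what a softmax-weighted sum over keys effectively consumes.
    """
    tokens, _, positions = flatten_family(context_seqs)
    return sorted((tok, query_pos - pos) for tok, pos in zip(tokens, positions))


# Toy family of three homologous fragments (made-up sequences).
context = ["MKV", "MRVL", "MK"]

# Every ordering of the context yields the same view for the query token.
views = {tuple(context_view(list(p), query_pos=2)) for p in permutations(context)}
print(len(views) == 1)
```

Because the view does not change under permutation, a model built on such features would have no way to tell how many sequences preceded a given one, which is consistent with the abstract's claim that the model can be applied to contexts longer than those seen in training.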

