Generative power of a protein language model trained on multiple sequence alignments

04/14/2022
by Damiano Sgarbossa, et al.

Computational models trained on large ensembles of evolutionarily related protein sequences capture a representation of protein families and learn constraints associated with protein structure and function. They thus open the possibility of generating novel sequences belonging to these families. Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates to this end. We propose and test an iterative method that directly uses the masked language modeling objective to generate sequences with MSA Transformer. We demonstrate that the resulting sequences generally score better than those generated by Potts models, and even better than natural sequences, on homology, coevolution, and structure-based measures. Moreover, MSA Transformer reproduces the higher-order statistics and the distribution of natural sequences in sequence space better than Potts models do, although Potts models better reproduce first- and second-order statistics. MSA Transformer is thus a strong candidate for protein sequence generation and protein design.
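The iterative masked-language-modeling generation procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the real method queries MSA Transformer, which conditions on the entire alignment, whereas here a hypothetical `dummy_masked_lm` stand-in (uniform random probabilities) takes its place so the loop structure is self-contained and runnable.

```python
import numpy as np

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues


def dummy_masked_lm(msa, masked_positions):
    """Stand-in for MSA Transformer's masked-LM head.

    Returns one probability distribution over amino acids per masked
    position. The real model conditions on the whole MSA; this mock
    just returns normalized random weights.
    """
    rng = np.random.default_rng(0)
    probs = rng.random((len(masked_positions), len(AMINO_ACIDS)))
    return probs / probs.sum(axis=1, keepdims=True)


def iterative_masked_sampling(msa, seq_idx, n_iters=10, mask_frac=0.1, seed=0):
    """Iteratively mask a fraction of one sequence's positions and
    resample them from the masked-LM distribution.

    Each iteration: (1) pick random positions to mask, (2) query the
    model for per-position amino-acid probabilities, (3) sample new
    residues at the masked positions.
    """
    rng = np.random.default_rng(seed)
    seq = list(msa[seq_idx])
    length = len(seq)
    n_mask = max(1, int(mask_frac * length))
    for _ in range(n_iters):
        positions = rng.choice(length, size=n_mask, replace=False)
        probs = dummy_masked_lm(msa, positions)
        for row, pos in enumerate(positions):
            seq[pos] = AMINO_ACIDS[rng.choice(len(AMINO_ACIDS), p=probs[row])]
    return "".join(seq)


# Toy alignment of two related sequences; generate a variant of the first.
toy_msa = ["ACDEFGHIKL", "ACDEFGHIKV"]
generated = iterative_masked_sampling(toy_msa, seq_idx=0)
```

With the actual MSA Transformer, the sampled residues are informed by coevolutionary signal across the alignment, which is what lets the generated sequences score well on the homology and structure-based measures discussed above.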


Related research

08/14/2023 · Pairing interacting protein sequences using masked language modeling
06/17/2022 · Transformer Neural Networks Attending to Both Sequence and Structure for Protein Prediction Tasks
03/29/2022 · Protein language models trained on multiple sequence alignments learn phylogenetic relationships
11/23/2020 · Sparse generative modeling of protein-sequence families
05/18/2023 · Vaxformer: Antigenicity-controlled Transformer for Vaccine Design Against SARS-CoV-2
12/03/2020 · Generative Capacity of Probabilistic Protein Sequence Models
11/30/2022 · xTrimoABFold: Improving Antibody Structure Prediction without Multiple Sequence Alignments
