Retrieved Sequence Augmentation for Protein Representation Learning

02/24/2023
by Chang Ma, et al.
Protein language models have excelled in a variety of tasks, ranging from structure prediction to protein engineering. However, proteins are highly diverse in function and structure, and current state-of-the-art models, including the latest version of AlphaFold, rely on Multiple Sequence Alignments (MSA) to supply evolutionary knowledge. Despite the success of these models, heavy computational overhead and the handling of de novo and orphan proteins remain major challenges in protein representation learning. In this work, we show that MSA-augmented models inherently belong to retrieval-augmented methods. Motivated by this finding, we introduce Retrieved Sequence Augmentation (RSA) for protein representation learning without additional alignment or pre-processing. RSA links query protein sequences to a set of sequences with similar structures or properties in the database and combines these sequences for downstream prediction. We show that protein language models benefit from the retrieval enhancement on both structure prediction and property prediction tasks, with a 5 [...] In addition, we show that our model transfers better to new protein domains and outperforms MSA Transformer on de novo protein prediction. Our study fills a much-encountered gap in protein prediction and brings us a step closer to demystifying the domain knowledge needed to understand protein sequences. Code is available at https://github.com/HKUNLP/RSA.
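The retrieve-then-combine idea described in the abstract can be illustrated with a minimal sketch. This is not the RSA implementation from the repository: the amino-acid composition embedding, the cosine scoring, the softmax weighting, and every function name here (`embed`, `retrieve`, `predict_with_retrieval`, `pair_predictor`) are assumptions chosen only to keep the example self-contained. A real system would presumably use a learned dense encoder and a trained prediction head.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def embed(seq):
    """Toy embedding: L2-normalized amino-acid composition.

    Assumption: a stand-in for a learned sequence encoder.
    """
    counts = Counter(seq)
    vec = [counts.get(a, 0) for a in AMINO_ACIDS]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(u, v):
    """Dot product of two already-normalized vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(query, database, k=2):
    """Return the k database sequences most similar to the query, as (score, seq)."""
    q = embed(query)
    scored = [(cosine(q, embed(s)), s) for s in database]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:k]


def softmax(xs):
    """Numerically stable softmax over a non-empty list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_retrieval(query, database, pair_predictor, k=2):
    """Combine per-pair predictions, weighted by retrieval similarity.

    pair_predictor(query, retrieved) is a hypothetical callable returning a float.
    """
    hits = retrieve(query, database, k)
    if not hits:
        raise ValueError("database is empty; nothing to retrieve")
    weights = softmax([score for score, _ in hits])
    preds = [pair_predictor(query, s) for _, s in hits]
    return sum(w * p for w, p in zip(weights, preds))


if __name__ == "__main__":
    database = ["MKTAYIAKQR", "MKTAYIAKQK", "GGGGGGGGGG"]
    # Hypothetical predictor: fraction of alanine in the concatenated pair.
    toy = lambda q, s: (q + s).count("A") / len(q + s)
    print(retrieve("MKTAYIAKQR", database, k=2))
    print(predict_with_retrieval("MKTAYIAKQR", database, toy, k=2))
```

No alignment step appears anywhere in the sketch: the retrieved sequences are scored and combined directly, which is the property the abstract contrasts with MSA-based pipelines.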


