Reprogramming Pretrained Language Models for Protein Sequence Representation Learning

01/05/2023
by   Ria Vinod, et al.

Machine learning-guided solutions for protein learning tasks have made significant headway in recent years. However, success in scientific discovery tasks is limited by the accessibility of well-defined and labeled in-domain data. To tackle the low-data constraint, recent adaptations of deep learning models pretrained on millions of protein sequences have shown promise; however, constructing such domain-specific large-scale models is computationally expensive. Here, we propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework in which we reprogram deep models for alternate-domain tasks so that they perform well on protein property prediction with significantly fewer training samples. R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences by learning a sparse linear mapping between English and protein sequence vocabulary embeddings. Our model can attain better accuracy and improve data efficiency by up to 10^5 times over the baselines set by pretrained and standard supervised methods. We reprogram an off-the-shelf pretrained English language transformer and benchmark it on a set of protein physicochemical prediction tasks (secondary structure, stability, homology) as well as on a biomedically relevant set of protein function prediction tasks (antimicrobial, toxicity, antibody affinity).
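The sparse linear mapping between vocabularies can be pictured with a small NumPy sketch. This is an illustration under stated assumptions, not the paper's implementation. The matrix sizes, the random stand-in for the English embedding matrix, the names `theta` and `soft_threshold`, and the use of soft-thresholding to obtain sparsity are all hypothetical choices made here. The sketch shows only the forward mapping from a frozen English embedding matrix to protein token embeddings; training the mapping is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Protein vocabulary: the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical sizes chosen for illustration only (not taken from the paper).
N_ENGLISH, DIM = 1000, 64

# Random stand-in for the frozen token embedding matrix of a pretrained
# English language model (one row per English token).
english_emb = rng.normal(size=(N_ENGLISH, DIM))

# Mapping weights: one row per protein token, one column per English token.
# In a real setup these would be learned; here they are random.
theta = rng.normal(scale=0.1, size=(len(AMINO_ACIDS), N_ENGLISH))


def soft_threshold(w, lam):
    """Shrink entries toward zero and zero out those with |w| <= lam."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)


# Assumed sparsification step: each protein token ends up as a combination
# of only a subset of English tokens.
theta_sparse = soft_threshold(theta, lam=0.15)

# Protein token embeddings expressed as sparse linear combinations of
# English token embeddings.
protein_emb = theta_sparse @ english_emb  # shape: (20, DIM)


def embed_sequence(seq):
    """Look up the mapped embedding of each residue in a protein sequence."""
    idx = [AMINO_ACIDS.index(residue) for residue in seq]
    return protein_emb[idx]


seq_emb = embed_sequence("MKTAYIAK")           # shape: (8, DIM)
sparsity = float(np.mean(theta_sparse == 0.0))  # fraction of zero weights
```

In this sketch the resulting sequence embeddings would be passed to the frozen English model in place of its usual token embeddings, so only `theta` and a task head would need training; that division of labor is an assumption drawn from the description above rather than a confirmed detail.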


