Epigenomic language models powered by Cerebras

by Meredith V. Trotter, et al.

Large-scale self-supervised pre-training of Transformer language models has advanced the field of Natural Language Processing and shown promise in cross-application to the biological "languages" of proteins and DNA. Learning effective representations of DNA sequences from large genomic sequence corpora may accelerate the development of models of gene regulation and function through transfer learning. However, to accurately model cell type-specific gene regulation and function, it is necessary to consider not only the information contained in DNA nucleotide sequences, which is mostly invariant between cell types, but also how the local chemical and structural "epigenetic state" of chromosomes varies between cell types. Here, we introduce a Bidirectional Encoder Representations from Transformers (BERT) model that learns representations based on both DNA sequence and paired epigenetic state inputs, which we call Epigenomic BERT (or EBERT). We pre-train EBERT with a masked language model objective across the entire human genome and across 127 cell types. Training this complex model on a dataset that was previously prohibitively large was made possible for the first time by a partnership with Cerebras Systems, whose CS-1 system powered all pre-training experiments. We show EBERT's transfer learning potential by demonstrating strong performance on a cell type-specific transcription factor binding prediction task. Our fine-tuned model exceeds state-of-the-art performance on 4 of 13 evaluation datasets from the ENCODE-DREAM benchmarks and ranks 3rd overall on the challenge leaderboard. We also explore how the inclusion of epigenetic data and task-specific feature augmentation affect transfer learning performance.
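
To make the masked language model objective on paired inputs concrete, here is a minimal sketch of how one training example might be prepared. It is not the authors' implementation. The single-nucleotide tokens, the binary per-position epigenetic track, the choice to leave that track unmasked, the `[MASK]` symbol, and the function name `mask_paired_inputs` are all assumptions made for illustration; the abstract does not specify any of them.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def mask_paired_inputs(dna, epi, mask_prob=0.15, seed=0):
    """Mask random DNA positions while keeping the paired epigenetic track.

    dna: string of nucleotides (assumed tokenization: one token per base).
    epi: per-position epigenetic signal of the same length (assumed binary).
    Returns (masked_tokens, epi_track, labels); labels[i] holds the original
    base at masked positions and None everywhere else.
    """
    if len(dna) != len(epi):
        raise ValueError("DNA and epigenetic tracks must be the same length")
    rng = random.Random(seed)
    tokens = list(dna)
    labels = [None] * len(tokens)
    for i, base in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = base   # the model would be trained to recover this base
            tokens[i] = MASK   # hide it from the model's input
    # Assumption: the epigenetic track is passed through unmasked as context.
    return tokens, list(epi), labels


dna = "ACGTTGCAACGT"
epi = [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0]
masked, track, labels = mask_paired_inputs(dna, epi, mask_prob=0.3, seed=1)
```

In this sketch the loss would be computed only at positions where `labels` is not `None`, so the model has to infer hidden bases from the surrounding sequence together with the cell type-specific epigenetic signal.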

