Embed-Search-Align: DNA Sequence Alignment using Transformer Models

09/20/2023
by Pavan Holur, et al.

DNA sequence alignment involves assigning short DNA reads to the most probable locations on an extensive reference genome. This process is crucial for various genomic analyses, including variant calling, transcriptomics, and epigenomics. Conventional methods, refined over decades, tackle this challenge in two steps: genome indexing followed by efficient search to locate likely positions for given reads. Building on the success of Large Language Models (LLMs) in encoding text into embeddings, where the distance metric captures semantic similarity, recent efforts have explored whether the same Transformer architecture can produce numerical representations for DNA sequences. Such models have shown early promise in tasks involving classification of short DNA sequences, such as the detection of coding vs non-coding regions, as well as the identification of enhancer and promoter sequences. Performance on sequence classification tasks does not, however, translate to sequence alignment, where it is necessary to conduct a genome-wide search to successfully align every read. We address this open problem by framing it as an Embed-Search-Align task. In this framework, a novel encoder model, DNA-ESA, generates representations of reads and fragments of the reference, which are projected into a shared vector space where the read-fragment distance is used as a surrogate for alignment. In particular, DNA-ESA introduces: (1) a contrastive loss for self-supervised training of DNA sequence representations, facilitating rich sequence-level embeddings, and (2) a DNA vector store to enable search across fragments on a global scale. DNA-ESA is >97 […] human reference genome of 3 gigabases (single-haploid), far exceeds the performance of 6 recent DNA-Transformer model baselines, and shows task transfer across chromosomes and species.
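
The embed-then-search idea described in the abstract can be illustrated with a minimal sketch. This is not DNA-ESA: the learned Transformer encoder is replaced here by a toy k-mer count embedding, the vector store by a brute-force NumPy matrix product, and the final alignment step is omitted. All function names, the k-mer size, and the fragment length and stride are assumptions chosen for illustration only.

```python
import itertools
import numpy as np

K = 3  # k-mer size for the toy embedding (assumption, not from the paper)
KMERS = {"".join(p): i for i, p in enumerate(itertools.product("ACGT", repeat=K))}


def embed(seq):
    """Toy stand-in for a learned encoder: L2-normalised k-mer count vector."""
    vec = np.zeros(len(KMERS))
    for i in range(len(seq) - K + 1):
        idx = KMERS.get(seq[i:i + K])
        if idx is not None:  # skip k-mers containing non-ACGT symbols
            vec[idx] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def build_store(reference, frag_len, stride):
    """Embed overlapping reference fragments; return start offsets and vectors."""
    starts = list(range(0, max(len(reference) - frag_len, 0) + 1, stride))
    vectors = np.stack([embed(reference[s:s + frag_len]) for s in starts])
    return starts, vectors


def search(read, starts, vectors, top_k=3):
    """Rank fragments by cosine similarity to the read (brute-force search)."""
    scores = vectors @ embed(read)  # rows are unit-norm, so this is cosine
    order = np.argsort(-scores)[:top_k]
    return [(starts[i], float(scores[i])) for i in order]


# Hypothetical usage on a random synthetic reference.
rng = np.random.default_rng(0)
reference = "".join(rng.choice(list("ACGT"), size=2000))
starts, vectors = build_store(reference, frag_len=100, stride=50)
read = reference[500:560]  # a read sampled from a known location
for start, score in search(read, starts, vectors):
    print(f"candidate fragment at offset {start}: similarity {score:.3f}")
```

In the framework the abstract describes, the retrieved fragments would serve as candidate locations for the read; a k-mer count vector ignores ordering beyond length K, which is one reason a trained sequence encoder is used instead.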
