AirLift: A Fast and Comprehensive Technique for Translating Alignments between Reference Genomes

by Jeremie S. Kim, et al.

As genome sequencing tools and techniques improve, researchers are able to incrementally assemble more accurate reference genomes. A more accurate reference genome enables more accurate read mappings, which in turn provide more accurate variant information and thus better health data on the donor. Therefore, read data sets from sequenced samples should ideally be mapped to the latest available reference genome. Unfortunately, the increasingly large amount of available genomic data makes it prohibitively expensive to fully map each read data set to its respective reference genome every time the reference is updated. Several tools that attempt to reduce the cost of updating a read data set from one reference to another (i.e., remapping) have been published. These tools identify regions of similarity across the two references and update the mapping location of a read based on the locations of similar regions in the new reference genome. The main drawback of existing approaches is that a read that maps to a region in the old reference with no similar region in the new reference cannot be remapped. We find that, as a result of this drawback, a significant portion of annotations are lost when using state-of-the-art remapping tools. To address this major limitation in existing tools, we propose AirLift, a fast and comprehensive technique for moving alignments from one genome to another. AirLift can reduce 1) the number of reads that need to be mapped from the entire read set by up to 99.9% and 2) the time to remap the reads between the two most recent reference versions by 6.94x, 44.0x, and 16.4x for large (human), medium (C. elegans), and small (yeast) references, respectively.
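
The general remapping idea described in the abstract (translate a read's position through regions that are similar in both references, and flag reads that fall outside any such region) can be illustrated with a minimal Python sketch. This is not AirLift's implementation. The `Block` type, the function names `lift_position` and `split_reads`, and the example coordinates are all hypothetical, and the sketch only checks a read's start position rather than its full extent.

```python
from bisect import bisect_right
from typing import Dict, List, NamedTuple, Optional, Tuple


class Block(NamedTuple):
    """Hypothetical record of one region that is similar in both references."""
    old_start: int  # 0-based start of the region in the old reference
    new_start: int  # 0-based start of the matching region in the new reference
    length: int     # number of bases covered by the region


def lift_position(blocks: List[Block], old_pos: int) -> Optional[int]:
    """Translate old_pos to the new reference, or return None if no block covers it.

    Assumes blocks are sorted by old_start and do not overlap.
    """
    starts = [b.old_start for b in blocks]
    i = bisect_right(starts, old_pos) - 1  # last block starting at or before old_pos
    if i < 0:
        return None
    block = blocks[i]
    offset = old_pos - block.old_start
    if offset >= block.length:
        return None  # position falls in a gap between similar regions
    return block.new_start + offset


def split_reads(
    blocks: List[Block], read_positions: Dict[str, int]
) -> Tuple[Dict[str, int], List[str]]:
    """Separate reads that can be translated from reads that cannot."""
    lifted: Dict[str, int] = {}
    unliftable: List[str] = []
    for name, pos in read_positions.items():
        new_pos = lift_position(blocks, pos)
        if new_pos is None:
            unliftable.append(name)
        else:
            lifted[name] = new_pos
    return lifted, unliftable


# Toy data: two similar regions, with old positions 100..149 absent from the new reference.
blocks = [Block(0, 0, 100), Block(150, 120, 50)]
reads = {"r1": 10, "r2": 160, "r3": 120}
lifted, unliftable = split_reads(blocks, reads)
```

In this toy example, `r3` starts in a region of the old reference that has no counterpart in the new one, so a purely coordinate-based tool has nowhere to place it. Reads of this kind are the ones the abstract says existing tools cannot remap.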


Accelerating Genome Analysis: A Primer on an Ongoing Journey

Genome analysis fundamentally starts with a process known as read mappin...

GenStore: A High-Performance and Energy-Efficient In-Storage Computing System for Genome Sequence Analysis

Read mapping is a fundamental, yet computationally-expensive step in man...

GenPIP: In-Memory Acceleration of Genome Analysis via Tight Integration of Basecalling and Read Mapping

Nanopore sequencing is a widely-used high-throughput genome sequencing t...

Specified Certainty Classification, with Application to Read Classification for Reference-Guided Metagenomic Assembly

Specified Certainty Classification (SCC) is a new paradigm for employing...

Fully scalable online-preprocessing algorithm for short oligonucleotide microarray atlases

Accumulation of standardized data collections is opening up novel opport...

Taming Large-Scale Genomic Analyses via Sparsified Genomics

Searching for similar genomic sequences is an essential and fundamental ...

MetaCache-GPU: Ultra-Fast Metagenomic Classification

The cost of DNA sequencing has dropped exponentially over the past decad...
