Taming Large-Scale Genomic Analyses via Sparsified Genomics

11/15/2022
by   Mohammed Alser, et al.
0

Searching for similar genomic sequences is an essential and fundamental step in biomedical research and an overwhelming majority of genomic analyses. State-of-the-art computational methods performing such comparisons fail to cope with the exponential growth of genomic sequencing data. We introduce the concept of sparsified genomics where we systematically exclude a large number of bases from genomic sequences and enable much faster and more memory-efficient processing of the sparsified, shorter genomic sequences, while providing similar or even higher accuracy compared to processing non-sparsified sequences. Sparsified genomics provides significant benefits to many genomic analyses and has broad applicability. We show that sparsifying genomic sequences greatly accelerates the state-of-the-art read mapper (minimap2) by 1.54-8.8x using real Illumina, HiFi, and ONT reads, while providing a higher number of mapped reads and more detected small and structural variations. Sparsifying genomic sequences makes containment search through very large genomes and very large databases 72.7-75.88x faster and 723.3x more storage-efficient than searching through non-sparsified genomic sequences (with CMash and KMC3). Sparsifying genomic sequences enables robust microbiome discovery by providing 54.15-61.88x faster and 720x more storage-efficient taxonomic profiling of metagenomic samples over the state-of-art tool (Metalign). We design and open-source a framework called Genome-on-Diet as an example tool for sparsified genomics, which can be freely downloaded from https://github.com/CMU-SAFARI/Genome-on-Diet.

READ FULL TEXT

page 3

page 6

page 11

page 13

page 24

page 29

research
12/09/2022

TargetCall: Eliminating the Wasted Computation in Basecalling via Pre-Basecalling Filtering

Basecalling is an essential step in nanopore sequencing analysis where t...
research
02/21/2022

GenStore: A High-Performance and Energy-Efficient In-Storage Computing System for Genome Sequence Analysis

Read mapping is a fundamental, yet computationally-expensive step in man...
research
05/21/2021

GapPredict: A Language Model for Resolving Gaps in Draft Genome Assemblies

Short-read DNA sequencing instruments can yield over 1e+12 bases per run...
research
04/17/2023

Lossy Compressor preserving variant calling through Extended BWT

A standard format used for storing the output of high-throughput sequenc...
research
12/18/2019

AirLift: A Fast and Comprehensive Technique for Translating Alignments between Reference Genomes

As genome sequencing tools and techniques improve, researchers are able ...
research
09/27/2019

Telescope: an interactive tool for managing large scale analysis from mobile devices

In today's world of big data, computational analysis has become a key dr...
research
06/07/2019

Datalog Disassembly

Disassembly is fundamental to binary analysis and rewriting. We present ...

Please sign up or login with your details

Forgot password? Click here to reset