Lossy Compressor preserving variant calling through Extended BWT

04/17/2023
by Veronica Guerrini, et al.

FASTQ is a standard format for storing the output of high-throughput sequencing experiments. A FASTQ file comprises three main components: (i) headers, (ii) bases (nucleotide sequences), and (iii) quality scores. FASTQ files are widely used for variant calling, where sequencing data are mapped to a reference genome to discover variants that may be used for further analysis. Many specialized compressors exploit redundancy in FASTQ data, but they focus on only one component, either the bases or the quality scores. In this paper we consider the novel problem of lossy, reference-free compression of FASTQ data that modifies both components at the same time while preserving the important information of the original FASTQ. We introduce a general strategy, based on the Extended Burrows-Wheeler Transform (EBWT) and positional clustering, and we present implementations in both internal memory and external memory. Experimental results show that the lossy compression performed by our tool achieves good compression while preserving information relevant to variant calling better than its competitors. Availability: the software is freely available at https://github.com/veronicaguerrini/BFQzip.
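
To make the two ingredients named above concrete, here is a minimal sketch, not the BFQzip implementation. It splits FASTQ text into its three components and builds the EBWT of the bases naively. It assumes the common end-marker variant of the EBWT, in which read i is terminated by a distinct marker $_i that sorts before every nucleotide, with $_i < $_j for i < j. The function names `parse_fastq` and `naive_ebwt` are hypothetical, and the quadratic suffix sort is for illustration only. Positional clustering and the internal- and external-memory constructions described in the paper are not shown.

```python
from typing import List, Tuple


def parse_fastq(text: str) -> List[Tuple[str, str, str]]:
    """Split FASTQ text into (header, bases, qualities) triples.

    Assumes the plain 4-line-per-record layout (no wrapped sequences).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    records = []
    for i in range(0, len(lines), 4):
        header, bases, plus, quals = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(bases) != len(quals):
            raise ValueError("bases and quality scores differ in length")
        records.append((header[1:], bases, quals))
    return records


def naive_ebwt(reads: List[str]) -> str:
    """EBWT of a read collection, end-marker variant (illustrative only).

    Every suffix of reads[i] + $_i is sorted lexicographically and the
    symbol preceding each suffix is emitted; the suffix that starts a
    read is preceded (cyclically) by that read's end marker, printed '$'.
    """
    m = len(reads)
    keyed = []
    for i, read in enumerate(reads):
        codes = [ord(c) for c in read]
        for j in range(len(read) + 1):
            # The end marker $_i is encoded as the negative integer i - m,
            # so it is smaller than any character code and ordered by i.
            key = tuple(codes[j:]) + (i - m,)
            prev = read[j - 1] if j > 0 else "$"
            keyed.append((key, prev))
    keyed.sort(key=lambda pair: pair[0])
    return "".join(prev for _, prev in keyed)


if __name__ == "__main__":
    demo = "@r1\nAC\n+\nII\n@r2\nAG\n+\nI#\n"
    records = parse_fastq(demo)
    print(naive_ebwt([bases for _, bases, _ in records]))
```

In the output string, symbols that precede suffixes sharing a long common context end up next to each other, which is the kind of redundancy an EBWT-based strategy can exploit. A production tool would build the transform with a dedicated construction algorithm rather than by sorting explicit suffixes.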


