Apollo: A Sequencing-Technology-Independent, Scalable, and Accurate Assembly Polishing Algorithm

by Can Firtina, et al.

A large proportion of the basepairs in the long reads that third-generation sequencing technologies produce contain sequencing errors. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize error propagation by fixing errors in the assembly using information from alignments between reads and the assembly (i.e., read-to-assembly alignment). However, current assembly polishing algorithms can only polish an assembly using reads either from a certain sequencing technology or from a small genome. This technology and genome-size dependency prevents assembly polishing algorithms from either 1) using all the available read sets from multiple sequencing technologies or 2) polishing large genomes. We introduce Apollo, a new assembly polishing algorithm that can 1) scale to polish assemblies of large genomes and 2) use multiple sets of reads from any sequencing technology to polish an assembly. Our goal is to provide a single algorithm that uses read sets from all sequencing technologies to polish assemblies and that can polish large genomes. Apollo 1) models an assembly as a profile hidden Markov model (pHMM), 2) uses read-to-assembly alignment to train the pHMM with the Forward-Backward algorithm, and 3) decodes the trained model with the Viterbi algorithm to produce a polished assembly. Our experiments with real read sets demonstrate that 1) Apollo is the only algorithm that can use reads from multiple sequencing technologies within a single run and that can polish an assembly of any size, 2) using reads from multiple sequencing technologies produces a more accurate assembly than using reads from a single sequencing technology, and 3) Apollo performs better than or comparably to the state-of-the-art algorithms in terms of accuracy even when using reads from a single sequencing technology.
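The decoding step named in the abstract can be illustrated with a generic Viterbi routine. The sketch below is not Apollo's implementation and does not build a profile HMM. It assumes a toy two-state model whose state names (`X`, `Y`), symbols (`a`, `b`), and probabilities are all hypothetical. It shows only how Viterbi recovers the most likely hidden-state path from a model whose parameters are already fixed (in Apollo's case, after training).

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely hidden-state path for a non-empty observation list."""
    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    # back[t][s]: predecessor of s on that best path
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][observations[t]]
            back[t][s] = prev
    # Trace back from the best final state to recover the full path
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical toy model (illustrative values, not taken from the paper)
STATES = ["X", "Y"]
LOG_START = {"X": math.log(0.5), "Y": math.log(0.5)}
LOG_TRANS = {
    "X": {"X": math.log(0.8), "Y": math.log(0.2)},
    "Y": {"X": math.log(0.2), "Y": math.log(0.8)},
}
LOG_EMIT = {
    "X": {"a": math.log(0.9), "b": math.log(0.1)},
    "Y": {"a": math.log(0.1), "b": math.log(0.9)},
}

decoded = viterbi(list("aab"), STATES, LOG_START, LOG_TRANS, LOG_EMIT)
```

Working in log space avoids numerical underflow on long sequences, which matters for any HMM spanning many positions.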




