On the Maximal Independent Sets of k-mers with the Edit Distance

03/20/2023
by   Leran Ma, et al.

In computational biology, k-mers and edit distance are fundamental concepts. However, little is known about the metric space of all k-mers equipped with the edit distance. In this work, we explore the structure of the k-mer space by studying its maximal independent sets (MISs). An MIS is a sparse sketch of all k-mers with nice theoretical properties, and therefore admits critical applications in clustering, indexing, hashing, and sketching large-scale sequencing data, particularly data with high error rates. Finding an MIS is a challenging problem, as the size of a k-mer space grows geometrically with respect to k. We propose three algorithms for this problem. The first and most intuitive one uses a greedy strategy. The second implements two techniques to avoid redundant comparisons, taking advantage of the locality property of the k-mer space and of estimated bounds on the edit distance. The last algorithm avoids expensive edit distance calculations by translating the edit distance into a shortest path in a specifically designed graph. These algorithms are implemented, and the calculated MISs of k-mer spaces and their statistical properties are reported and analyzed for k up to 15. Source code is freely available at https://github.com/Shao-Group/kmerspace.
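To make the greedy strategy concrete, the following is a minimal sketch and not the authors' implementation. It assumes, since the abstract does not spell it out, that independence is parameterized by a threshold `d`: two k-mers may both be in the set only if their edit distance exceeds `d`, and the set is maximal when every other k-mer lies within distance `d` of some member. The names `edit_distance`, `greedy_mis`, and the parameter `d` are illustrative choices.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = cur
    return prev[-1]


def greedy_mis(k: int, d: int, alphabet: str = "ACGT") -> list[str]:
    """Scan all k-mers in lexicographic order and keep each one whose
    edit distance to every k-mer kept so far is greater than d
    (assumed independence criterion)."""
    mis: list[str] = []
    for kmer in map("".join, product(alphabet, repeat=k)):
        if all(edit_distance(kmer, m) > d for m in mis):
            mis.append(kmer)
    return mis


if __name__ == "__main__":
    print(greedy_mis(2, 1))
```

This brute-force scan compares every candidate against every kept k-mer, so it is only practical for small k; the number of k-mers over a four-letter alphabet is 4^k, which is the cost the other two algorithms in the abstract aim to reduce.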


