An Algorithmic Bridge Between Hamming and Levenshtein Distances

11/22/2022
by   Elazar Goldenberg, et al.
0

The edit distance between strings classically assigns unit cost to every character insertion, deletion, and substitution, whereas the Hamming distance only allows substitutions. In many real-life scenarios, insertions and deletions (abbreviated indels) appear frequently but significantly less so than substitutions. To model this, we consider substitutions being cheaper than indels, with cost 1/a for a parameter a≥ 1. This basic variant, denoted ED_a, bridges classical edit distance (a=1) with Hamming distance (a→∞), leading to interesting algorithmic challenges: Does the time complexity of computing ED_a interpolate between that of Hamming distance (linear time) and edit distance (quadratic time)? What about approximating ED_a? We first present a simple deterministic exact algorithm for ED_a and further prove that it is near-optimal assuming the Orthogonal Vectors Conjecture. Our main result is a randomized algorithm computing a (1+ϵ)-approximation of ED_a(X,Y), given strings X,Y of total length n and a bound k≥ ED_a(X,Y). For simplicity, let us focus on k≥ 1 and a constant ϵ > 0; then, our algorithm takes Õ(n/a + ak^3) time. Unless a=Õ(1) and for small enough k, this running time is sublinear in n. We also consider a very natural version that asks to find a (k_I, k_S)-alignment – an alignment with at most k_I indels and k_S substitutions. In this setting, we give an exact algorithm and, more importantly, an Õ(nk_I/k_S + k_S· k_I^3)-time (1,1+ϵ)-bicriteria approximation algorithm. The latter solution is based on the techniques we develop for ED_a for a=Θ(k_S / k_I). These bounds are in stark contrast to unit-cost edit distance, where state-of-the-art algorithms are far from achieving (1+ϵ)-approximation in sublinear time, even for a favorable choice of k.

READ FULL TEXT

Please sign up or login with your details

Forgot password? Click here to reset