Deep Squared Euclidean Approximation to the Levenshtein Distance for DNA Storage

07/11/2022
by Alan J. X. Guo, et al.

Storing information in DNA molecules is of great interest because of its longevity, high storage density, and low maintenance cost. A key step in the DNA storage pipeline is to efficiently cluster the retrieved DNA sequences according to their similarity. The Levenshtein distance is the most suitable measure of similarity between two DNA sequences, but it is expensive to compute and poorly compatible with mature clustering algorithms. In this work, we propose a novel deep squared Euclidean embedding for DNA sequences that combines a Siamese neural network, squared Euclidean embedding, and chi-squared regression. The Levenshtein distance is approximated by the squared Euclidean distance between the embedding vectors, which is fast to compute and friendly to clustering algorithms. The proposed approach is analyzed theoretically and experimentally. The results show that the proposed embedding is efficient and robust.
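To make the contrast in the abstract concrete, the following minimal Python sketch compares the two distances. It is an illustration only, not the authors' implementation: the function names `levenshtein` and `squared_euclidean` are chosen here for the example, and the embedding vectors in the usage lines are hypothetical placeholders rather than outputs of the proposed Siamese network. The Levenshtein routine is the textbook dynamic program, whose cost grows with the product of the two sequence lengths, while the squared Euclidean distance is a single pass over the vector coordinates.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Textbook dynamic-programming edit distance, O(len(a) * len(b)) time."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def squared_euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(u, v))


# One deletion turns "ACGT" into "AGT".
print(levenshtein("ACGT", "AGT"))  # 1

# Hypothetical embedding vectors, made up for this example only.
print(squared_euclidean([0.5, 1.0], [0.5, 0.0]))  # 1.0
```

In the setting the abstract describes, a trained embedding would map each DNA sequence to a vector so that the second quantity approximates the first; the vectors above are not such an embedding and only show the form of the computation.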

research
06/22/2023

MQ-Coder inspired arithmetic coder for synthetic DNA data storage

Over the past years, the ever-growing trend on data storage demand, more...
research
09/16/2019

Unaligned Sequence Similarity Search Using Deep Learning

Gene annotation has traditionally required direct comparison of DNA sequ...
research
10/20/2022

Robust Multi-Read Reconstruction from Contaminated Clusters Using Deep Neural Network for DNA Storage

DNA has immense potential as an emerging data storage medium. The princi...
research
11/27/2018

Tackling Early Sparse Gradients in Softmax Activation Using Leaky Squared Euclidean Distance

Softmax activation is commonly used to output the probability distributi...
research
08/31/2021

Deep DNA Storage: Scalable and Robust DNA Storage via Coding Theory and Deep Learning

The concept of DNA storage was first suggested in 1959 by Richard Feynma...
research
05/14/2020

Thermodynamically Stable DNA Code Design using a Similarity Significance Model

DNA code design aims to generate a set of DNA sequences (codewords) with...
research
12/27/2022

Efficiently Supporting Hierarchy and Data Updates in DNA Storage

We propose a novel and flexible DNA-storage architecture that provides t...
