Optimal Reference for DNA Synthesis

04/14/2022
by Ohad Elishco, et al.

In recent years, DNA has emerged as a potentially viable storage technology. DNA synthesis, the task of writing data into DNA, is perhaps the most costly part of existing storage systems, and the high cost and low throughput of available synthesis technologies limit the practical use of DNA storage. It has been found that homopolymer runs (i.e., repetitions of the same nucleotide) are a major factor affecting synthesis and sequencing errors. Quite recently, [26] studied the role of batch optimization in reducing the cost of large-scale DNA synthesis for a given pool 𝒮 of random quaternary strings of fixed length. Among other things, it was shown that the asymptotic cost savings of batch optimization are significantly greater when the strings in 𝒮 contain no repeats of the same character (homopolymer run length of one) than when the strings are unconstrained. Following the lead of [26], in this paper we take a step toward the theoretical understanding of DNA synthesis and study the homopolymer run-length constraint for general k≥1. Specifically, we wish to synthesize a set of DNA strands 𝒮 drawn at random from a natural Markovian distribution that models a general homopolymer run-length constraint. For this problem, we prove that for any k≥1, the optimal reference strand, i.e., the one minimizing the cost of DNA synthesis, is, perhaps surprisingly, the periodic sequence 𝖠𝖢𝖦𝖳. Tackling the homopolymer constraint for k≥2 turns out to be challenging; our main technical contribution is a representation of the DNA synthesis process as a certain constrained system, to which string techniques can be applied.
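
To make the notion of synthesis cost concrete, here is a minimal Python sketch. It assumes the cost model commonly used in this line of work: the synthesizer steps through a fixed reference strand one nucleotide per cycle, a strand is extended whenever its next nucleotide matches the current reference character, and the cost is the number of cycles until every strand in the pool is complete. The function names (`synthesis_cost`, `batch_cost`, `max_run_length`) are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from itertools import cycle


def synthesis_cost(strand: str, period: str = "ACGT") -> int:
    """Cycles needed to synthesize `strand` against the periodic reference.

    Assumed model: the reference is `period` repeated indefinitely, and the
    cost is the length of the shortest reference prefix that contains
    `strand` as a subsequence.
    """
    if not strand:
        return 0
    if any(ch not in period for ch in strand):
        raise ValueError("strand contains a character missing from the reference period")
    cost = 0
    pos = 0  # index of the next nucleotide of `strand` still to be written
    for ref_char in cycle(period):
        cost += 1
        if ref_char == strand[pos]:
            pos += 1
            if pos == len(strand):
                return cost


def batch_cost(strands, period: str = "ACGT") -> int:
    """Cost of synthesizing a pool in parallel: the slowest strand decides."""
    return max((synthesis_cost(s, period) for s in strands), default=0)


def max_run_length(strand: str) -> int:
    """Length of the longest homopolymer run in `strand`."""
    longest = current = 0
    prev = None
    for ch in strand:
        current = current + 1 if ch == prev else 1
        longest = max(longest, current)
        prev = ch
    return longest


if __name__ == "__main__":
    pool = ["ACGT", "AA", "TGCA"]
    for s in pool:
        print(s, synthesis_cost(s), max_run_length(s))
    print("batch cost:", batch_cost(pool))
```

Under this assumed model, a strand that follows the reference order, such as `ACGT`, finishes in as many cycles as it has nucleotides, while a repeated nucleotide forces a full extra period, which is one way to see why run-length constraints affect cost.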


