Sequence-Subset Distance and Coding for Error Control in DNA Data Storage
The process of DNA data storage can be modelled mathematically as a communication channel, termed the DNA storage channel, whose inputs and outputs are sets of unordered sequences. To design error-correcting codes for this channel, we introduce a new metric, termed the sequence-subset distance, which generalizes the Hamming distance to a distance function defined between any two sets of unordered vectors and helps to establish a uniform framework for designing such codes. We further introduce a family of error-correcting codes for DNA storage, termed sequence-subset codes, and show that the error-correcting ability of such codes is completely determined by their minimum distance. We derive upper bounds on the size of sequence-subset codes, including a Singleton-like bound and a Plotkin-like bound. We also propose some constructions, which imply lower bounds on the size of such codes.