Optimal Reference for DNA Synthesis

Ohad Elishco, Wasim Huleihel

Research output: Contribution to journal › Article › peer-review

6 Scopus citations

Abstract

In recent years, DNA has emerged as a potentially viable storage technology. DNA synthesis, the task of writing data into DNA, is perhaps the most costly part of existing storage systems, and its high cost and low throughput limit the practical use of available synthesis technologies. It has been found that the homopolymer run (i.e., the repetition of the same nucleotide) is a major factor affecting synthesis and sequencing errors. Recently, [26] raised and studied the coding problem of efficient synthesis for DNA-based storage systems; among other things, the authors studied the maximal code size under synthesis constraints. In [29], the authors studied the role of batch optimization in reducing the cost of large-scale DNA synthesis for a given pool S of random quaternary strings of fixed length. This problem is related to the one posed in [26] and can be viewed as the other side of the same coin: instead of seeking the largest code in which every codeword can be synthesized within a certain amount of time, [29] asks for the average synthesis time of a randomly chosen string. Following the lead of [29], in this paper we take a step towards the theoretical understanding of DNA synthesis and study homopolymer runs of length k > 1. Specifically, we are given a set S of DNA strands that we wish to synthesize, randomly drawn from a Markovian distribution modeling a general homopolymer run-length constraint. For this problem, we derive asymptotically tight high-probability lower and upper bounds on the cost of DNA synthesis, for any k > 1. Our bounds imply that, perhaps surprisingly, the periodic sequence ACGT is asymptotically optimal in the sense of achieving the smallest possible cost. Our main technical contribution is the representation of the DNA synthesis process as a certain constrained system, for which string techniques can be applied.
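To make the notion of synthesis cost concrete, the following is a minimal Python sketch, not the authors' method. It assumes the commonly used model in which the synthesizer steps through a fixed reference sequence one nucleotide per cycle, and a strand grows whenever the current reference nucleotide matches its next base; the cost of a pool is then the number of cycles until every strand is complete. The function names `synthesis_cost` and `batch_cost` are hypothetical, introduced only for illustration.

```python
from itertools import cycle


def synthesis_cost(strand: str, period: str = "ACGT") -> int:
    """Cycles needed to synthesize `strand` against the periodic reference.

    Assumed model (not taken from the abstract): one reference nucleotide
    is offered per cycle, and the strand is extended greedily whenever the
    offered nucleotide equals its next base.
    """
    if not set(strand) <= set(period):
        raise ValueError("strand contains a base missing from the reference")
    reference = cycle(period)  # the infinite periodic sequence ACGTACGT...
    cycles = 0
    for base in strand:
        # Advance the reference until it offers the base we need next.
        while True:
            cycles += 1
            if next(reference) == base:
                break
    return cycles


def batch_cost(strands, period: str = "ACGT") -> int:
    """Cost of a pool: all strands grow in parallel, so the slowest one decides."""
    return max((synthesis_cost(s, period) for s in strands), default=0)
```

Under this assumed model, each repeated base in a homopolymer run forces the periodic reference to go through a full period of four cycles before the same nucleotide is offered again, which is why run lengths matter for the cost.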

Original language: English
Pages (from-to): 6941-6955
Number of pages: 15
Journal: IEEE Transactions on Information Theory
Volume: 69
Issue number: 11
State: Published - 19 Jun 2023

Keywords

  • DNA
  • Information entropy
  • biological information theory
  • codes
  • decoding
  • encoding

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Library and Information Sciences
