TY - JOUR
T1 - The Entropy Rate of Some Pólya String Models
AU - Elishco, Ohad
AU - Hassanzadeh, Farzad Farnoud
AU - Schwartz, Moshe
AU - Bruck, Jehoshua
N1 - Funding Information:
Dr. Bruck is a recipient of the Feynman Prize for Excellence in Teaching, the Sloan Research Fellowship, the National Science Foundation Young Investigator Award, the IBM Outstanding Innovation Award and the IBM Outstanding Technical Achievement Award.
Funding Information:
Manuscript received August 20, 2018; revised May 15, 2019; accepted August 9, 2019. Date of publication August 22, 2019; date of current version November 20, 2019. This work was supported in part by the National Science Foundation under Grant CCF-1317694, Grant CCF-1755773, Grant CCF-1816409, Grant CCF-1816965, and Grant CCF-1717884, and in part by the United States–Israel Binational Science Foundation (BSF) under Grant 2017652. This article was presented in part at the 2016 IEEE International Symposium on Information Theory.
Publisher Copyright:
© 1963-2012 IEEE.
PY - 2019/12/1
Y1 - 2019/12/1
N2 - We study random string-duplication systems, which we call Pólya string models. These are motivated by a class of mutations that are common in most organisms and lead to an abundance of repeated sequences in their genomes. Unlike previous works that study the combinatorial capacity of string-duplication systems or, in a probabilistic setting, various string statistics, this work provides the exact entropy rate, or bounds on it, for several probabilistic models. The entropy rate determines the compressibility of the resulting sequences, as well as quantifying the amount of sequence diversity that these mutations can create. In particular, we study the entropy rate of noisy string-duplication systems, including the tandem-duplication, end-duplication, and interspersed-duplication systems, where in all cases we study duplication of length 1 only. Interesting connections are drawn between some systems and the signature of random permutations, as well as to the beta distribution common in population genetics.
AB - We study random string-duplication systems, which we call Pólya string models. These are motivated by a class of mutations that are common in most organisms and lead to an abundance of repeated sequences in their genomes. Unlike previous works that study the combinatorial capacity of string-duplication systems or, in a probabilistic setting, various string statistics, this work provides the exact entropy rate, or bounds on it, for several probabilistic models. The entropy rate determines the compressibility of the resulting sequences, as well as quantifying the amount of sequence diversity that these mutations can create. In particular, we study the entropy rate of noisy string-duplication systems, including the tandem-duplication, end-duplication, and interspersed-duplication systems, where in all cases we study duplication of length 1 only. Interesting connections are drawn between some systems and the signature of random permutations, as well as to the beta distribution common in population genetics.
KW - DNA storage
KW - Pólya string models
KW - entropy rate
KW - string-duplication systems
UR - http://www.scopus.com/inward/record.url?scp=85077357828&partnerID=8YFLogxK
U2 - 10.1109/TIT.2019.2936556
DO - 10.1109/TIT.2019.2936556
M3 - Article
AN - SCOPUS:85077357828
SN - 0018-9448
VL - 65
SP - 8180
EP - 8193
JO - IEEE Transactions on Information Theory
JF - IEEE Transactions on Information Theory
IS - 12
M1 - 8809682
ER -