TY - JOUR
T1 - Evolution of k-Mer Frequencies and Entropy in Duplication and Substitution Mutation Systems
AU - Lou, Hao
AU - Schwartz, Moshe
AU - Bruck, Jehoshua
AU - Farnoud, Farzad
N1 - Funding Information:
Dr. Bruck is a recipient of the Feynman Prize for Excellence in Teaching, the Sloan Research Fellowship, the National Science Foundation Young Investigator Award, the IBM Outstanding Innovation Award and the IBM Outstanding Technical Achievement Award.
Funding Information:
Manuscript received November 28, 2018; revised June 25, 2019; accepted September 28, 2019. Date of publication October 10, 2019; date of current version April 21, 2020. This work was supported in part by the United States–Israel Binational Science Foundation (BSF) under Grant 2017652 and in part by NSF under Grant CCF-1816409, Grant CCF-1755773, Grant CCF-1816965, and Grant CCF-1717884. This article was presented in part at ISIT 2018 and ISIT 2015.
Publisher Copyright:
© 1963-2012 IEEE.
PY - 2020/5/1
Y1 - 2020/5/1
N2 - Genomic evolution can be viewed as string-editing processes driven by mutations. An understanding of the statistical properties resulting from these mutation processes is of value in a variety of tasks related to biological sequence data, e.g., estimation of model parameters and compression. At the same time, due to the complexity of these processes, designing tractable stochastic models and analyzing them are challenging. In this paper, we study two kinds of systems, each representing a set of mutations. In the first system, tandem duplications and substitution mutations are allowed and in the other, interspersed duplications. We provide stochastic models and, via stochastic approximation, study the evolution of substring frequencies for these two systems separately. Specifically, we show that $k$-mer frequencies converge almost surely and determine the limit set. Furthermore, we present a method for finding upper bounds on entropy for such systems.
AB - Genomic evolution can be viewed as string-editing processes driven by mutations. An understanding of the statistical properties resulting from these mutation processes is of value in a variety of tasks related to biological sequence data, e.g., estimation of model parameters and compression. At the same time, due to the complexity of these processes, designing tractable stochastic models and analyzing them are challenging. In this paper, we study two kinds of systems, each representing a set of mutations. In the first system, tandem duplications and substitution mutations are allowed and in the other, interspersed duplications. We provide stochastic models and, via stochastic approximation, study the evolution of substring frequencies for these two systems separately. Specifically, we show that $k$-mer frequencies converge almost surely and determine the limit set. Furthermore, we present a method for finding upper bounds on entropy for such systems.
KW - String-duplication systems
KW - entropy
KW - substitution mutation
UR - http://www.scopus.com/inward/record.url?scp=85084127783&partnerID=8YFLogxK
U2 - 10.1109/TIT.2019.2946846
DO - 10.1109/TIT.2019.2946846
M3 - Article
AN - SCOPUS:85084127783
SN - 0018-9448
VL - 66
SP - 3171
EP - 3186
JO - IEEE Transactions on Information Theory
JF - IEEE Transactions on Information Theory
IS - 5
M1 - 8864099
ER -