TY - GEN
T1 - Evolution of N-Gram Frequencies under Duplication and Substitution Mutations
AU - Lou, Hao
AU - Schwartz, Moshe
AU - Hassanzadeh, Farzad Farnoud
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2018/8/15
Y1 - 2018/8/15
N2 - The driving force behind the generation of biological sequences are genomic mutations that shape these sequences throughout their evolutionary history. An understanding of the statistical properties that result from mutation processes is of value in a variety of tasks related to biological sequence data, e.g., estimation of model parameters and compression. At the same time, due to the complexity of these processes, designing tractable stochastic models and analyzing them are challenging. In this paper, we study two types of mutations, tandem duplication and substitution. These play a critical role in forming tandem repeat regions, which are common features of the genome of many organisms. We provide a stochastic model and, via stochastic approximation, study the behavior of the frequencies of N-grams in resulting sequences. Specifically, we show that N-gram frequencies converge almost surely to a set which we identify as a function of model parameters. From these frequencies, other statistics can be derived. In particular, we present a method for finding upper bounds on entropy.
AB - The driving force behind the generation of biological sequences are genomic mutations that shape these sequences throughout their evolutionary history. An understanding of the statistical properties that result from mutation processes is of value in a variety of tasks related to biological sequence data, e.g., estimation of model parameters and compression. At the same time, due to the complexity of these processes, designing tractable stochastic models and analyzing them are challenging. In this paper, we study two types of mutations, tandem duplication and substitution. These play a critical role in forming tandem repeat regions, which are common features of the genome of many organisms. We provide a stochastic model and, via stochastic approximation, study the behavior of the frequencies of N-grams in resulting sequences. Specifically, we show that N-gram frequencies converge almost surely to a set which we identify as a function of model parameters. From these frequencies, other statistics can be derived. In particular, we present a method for finding upper bounds on entropy.
UR - http://www.scopus.com/inward/record.url?scp=85052484694&partnerID=8YFLogxK
U2 - 10.1109/ISIT.2018.8437507
DO - 10.1109/ISIT.2018.8437507
M3 - Conference contribution
AN - SCOPUS:85052484694
SN - 9781538647806
T3 - IEEE International Symposium on Information Theory - Proceedings
SP - 2246
EP - 2250
BT - 2018 IEEE International Symposium on Information Theory, ISIT 2018
PB - Institute of Electrical and Electronics Engineers
T2 - 2018 IEEE International Symposium on Information Theory, ISIT 2018
Y2 - 17 June 2018 through 22 June 2018
ER -