Evolution of N-Gram Frequencies under Duplication and Substitution Mutations

Hao Lou, Moshe Schwartz, Farzad Farnoud Hassanzadeh

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

5 Scopus citations

Abstract

The driving force behind the generation of biological sequences is genomic mutations that shape these sequences throughout their evolutionary history. An understanding of the statistical properties that result from mutation processes is of value in a variety of tasks related to biological sequence data, e.g., estimation of model parameters and compression. At the same time, due to the complexity of these processes, designing tractable stochastic models and analyzing them are challenging. In this paper, we study two types of mutations, tandem duplication and substitution. These play a critical role in forming tandem repeat regions, which are common features of the genome of many organisms. We provide a stochastic model and, via stochastic approximation, study the behavior of the frequencies of N-grams in resulting sequences. Specifically, we show that N-gram frequencies converge almost surely to a set which we identify as a function of model parameters. From these frequencies, other statistics can be derived. In particular, we present a method for finding upper bounds on entropy.
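To make the objects in the abstract concrete, the following is a minimal simulation sketch, not the authors' model. The abstract does not give the mutation probabilities, duplication lengths, or seed sequence, so `dup_prob`, `dup_len`, the seed string "ACGT", and all function names here are illustrative assumptions. The sketch applies random tandem duplications and substitutions to a sequence, tabulates empirical N-gram frequencies, and computes the difference of block entropies H(N) - H(N-1). For a stationary process that difference upper-bounds the entropy rate; computed from a finite sample as here, it is only an estimate and is not claimed to be the paper's bounding method.

```python
import math
import random
from collections import Counter

ALPHABET = "ACGT"  # assumed alphabet; the abstract does not fix one


def mutate(seq, rng, dup_prob=0.7, dup_len=2):
    """Apply one mutation in place (hypothetical parameters).

    With probability dup_prob, tandem-duplicate a random substring of
    length dup_len; otherwise substitute one random position.
    """
    if rng.random() < dup_prob and len(seq) >= dup_len:
        i = rng.randrange(len(seq) - dup_len + 1)
        # Insert a copy of seq[i:i+dup_len] right after the original.
        seq[i + dup_len:i + dup_len] = seq[i:i + dup_len]
    else:
        i = rng.randrange(len(seq))
        seq[i] = rng.choice([c for c in ALPHABET if c != seq[i]])


def simulate(steps, seed=0, **params):
    """Run `steps` mutations starting from the assumed seed 'ACGT'."""
    rng = random.Random(seed)
    seq = list("ACGT")
    for _ in range(steps):
        mutate(seq, rng, **params)
    return seq


def ngram_frequencies(seq, n):
    """Empirical frequency of each n-gram (as a tuple) in seq."""
    counts = Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def entropy(freqs):
    """Shannon entropy in bits of a frequency dictionary."""
    return -sum(p * math.log2(p) for p in freqs.values())


def conditional_entropy_estimate(seq, n):
    """Empirical H(N-gram) - H((N-1)-gram), in bits per symbol."""
    return entropy(ngram_frequencies(seq, n)) - entropy(ngram_frequencies(seq, n - 1))


if __name__ == "__main__":
    sequence = simulate(steps=20000, seed=1)
    for n in (1, 2, 3):
        print(n, round(conditional_entropy_estimate(sequence, n), 4))
```

Running the script with a fixed seed for increasing numbers of steps gives a rough visual check of whether the empirical frequencies settle, which is the behavior the abstract's almost-sure convergence result describes for the authors' own model.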

Original language: English
Title of host publication: 2018 IEEE International Symposium on Information Theory, ISIT 2018
Publisher: Institute of Electrical and Electronics Engineers
Pages: 2246-2250
Number of pages: 5
ISBN (Print): 9781538647806
State: Published - 15 Aug 2018
Event: 2018 IEEE International Symposium on Information Theory, ISIT 2018 - Vail, United States
Duration: 17 Jun 2018 to 22 Jun 2018

Publication series

Name: IEEE International Symposium on Information Theory - Proceedings
Volume: 2018-June
ISSN (Print): 2157-8095

Conference

Conference: 2018 IEEE International Symposium on Information Theory, ISIT 2018
Country/Territory: United States
City: Vail
Period: 17/06/18 to 22/06/18

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Information Systems
  • Modeling and Simulation
  • Applied Mathematics
