Abstract
We study random string-duplication systems, which we call Pólya string models. These are motivated by a class of mutations that are common in most organisms and lead to an abundance of repeated sequences in their genomes. Previous works study either the combinatorial capacity of string-duplication systems or, in a probabilistic setting, various string statistics. In contrast, this work provides the exact entropy rate, or bounds on it, for several probabilistic models. The entropy rate determines the compressibility of the resulting sequences and quantifies the amount of sequence diversity that these mutations can create. In particular, we study the entropy rate of noisy string-duplication systems, including the tandem-duplication, end-duplication, and interspersed-duplication systems, in all cases restricted to duplications of length 1. Interesting connections are drawn between some of these systems and the signature of random permutations, as well as the beta distribution common in population genetics.
| Original language | English |
|---|---|
| Article number | 8809682 |
| Pages (from-to) | 8180-8193 |
| Number of pages | 14 |
| Journal | IEEE Transactions on Information Theory |
| Volume | 65 |
| Issue number | 12 |
| State | Published - 1 Dec 2019 |
Keywords
- DNA storage
- Pólya string models
- entropy rate
- string-duplication systems
ASJC Scopus subject areas
- Information Systems
- Computer Science Applications
- Library and Information Sciences