TY - JOUR
T1 - Design of shortest double-stranded DNA sequences covering all k-mers with applications to protein-binding microarrays and synthetic enhancers
AU - Orenstein, Yaron
AU - Shamir, Ron
N1 - Funding Information:
This study was supported in part by the Israel Science Foundation (grant no. 802/08), and by the I-CORE Program of the Planning and Budgeting Committee and the Israel Science Foundation (grant no. 41/11). Y.O. was supported in part by a fellowship from the Edmond J. Safra Center for Bioinformatics at Tel Aviv University and by a Dan David PhD Fellowship.
PY - 2013/7/1
Y1 - 2013/7/1
N2 - Motivation: Novel technologies can generate large sets of short double-stranded DNA sequences that can be used to measure their regulatory effects. Microarrays can measure in vitro the binding intensity of a protein to thousands of probes. Synthetic enhancer sequences inserted into an organism's genome allow us to measure in vivo the effect of such sequences on the phenotype. In both applications, by using sequence probes that cover all k-mers, a comprehensive picture of the effect of all possible short sequences on gene regulation is obtained. The value of k that can be used in practice is, however, severely limited by cost and space considerations. A key challenge is, therefore, to cover all k-mers with a minimal number of probes. The standard way to do this uses the de Bruijn sequence of length 4^k. However, as probes are double stranded, when a k-mer is included in a probe, its reverse complement k-mer is accounted for as well. Results: Here, we show how to efficiently create a shortest possible sequence with the property that it contains each k-mer or its reverse complement, but not necessarily both. The length of the resulting sequence approaches half that of the de Bruijn sequence as k increases, resulting in a more efficient array that can cover more and longer sequences; alternatively, additional sequences with redundant k-mers of interest can be added.
AB - Motivation: Novel technologies can generate large sets of short double-stranded DNA sequences that can be used to measure their regulatory effects. Microarrays can measure in vitro the binding intensity of a protein to thousands of probes. Synthetic enhancer sequences inserted into an organism's genome allow us to measure in vivo the effect of such sequences on the phenotype. In both applications, by using sequence probes that cover all k-mers, a comprehensive picture of the effect of all possible short sequences on gene regulation is obtained. The value of k that can be used in practice is, however, severely limited by cost and space considerations. A key challenge is, therefore, to cover all k-mers with a minimal number of probes. The standard way to do this uses the de Bruijn sequence of length 4^k. However, as probes are double stranded, when a k-mer is included in a probe, its reverse complement k-mer is accounted for as well. Results: Here, we show how to efficiently create a shortest possible sequence with the property that it contains each k-mer or its reverse complement, but not necessarily both. The length of the resulting sequence approaches half that of the de Bruijn sequence as k increases, resulting in a more efficient array that can cover more and longer sequences; alternatively, additional sequences with redundant k-mers of interest can be added.
UR - http://www.scopus.com/inward/record.url?scp=84879903891&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btt230
DO - 10.1093/bioinformatics/btt230
M3 - Article
AN - SCOPUS:84879903891
SN - 1367-4803
VL - 29
SP - i71-i79
JO - Bioinformatics
JF - Bioinformatics
IS - 13
ER -