TY - GEN
T1 - Efficient design of compact unstructured RNA libraries covering all k-mers
AU - Orenstein, Yaron
AU - Berger, Bonnie
N1 - Publisher Copyright: © Springer-Verlag Berlin Heidelberg 2015.
PY - 2015/1/1
Y1 - 2015/1/1
AB - Current microarray technologies for determining RNA structure or measuring protein-RNA interactions rely on single-stranded, unstructured RNA probes on a chip that together cover all k-mers. Since space on the array is limited, the problem is to efficiently design a compact library of unstructured ℓ-long RNA probes in which each k-mer is covered at least p times. Ray et al. designed such a library for specific values of k, ℓ and p using ad-hoc rules. To our knowledge, no general method for solving this problem exists to date. Here, we address the problem of finding a minimum-size covering of all k-mers by ℓ-long sequences with the desired properties for any value of k, ℓ and p. Because we prove that the problem is NP-hard, we give two solutions: the first is a greedy algorithm with a logarithmic approximation ratio; the second is a heuristic greedy approach based on random walks in de Bruijn graphs. The heuristic algorithm works well in practice and produces a library of unstructured RNA probes that is only ∼1.1 times larger than the theoretical lower bound. We present results for typical values of k and probe lengths ℓ and show that our algorithm generates a library that is significantly smaller than the library of Ray et al.; moreover, we show that our algorithm outperforms naive methods. Our approach can be generalized and extended to generate RNA or DNA oligo libraries with other desired properties. The software is freely available at curlcake.csail.mit.edu.
KW - De Bruijn graph
KW - Microarray library design
KW - RNA secondary structure
UR - http://www.scopus.com/inward/record.url?scp=84947743427&partnerID=8YFLogxK
U2 - 10.1007/978-3-662-48221-6_23
DO - 10.1007/978-3-662-48221-6_23
M3 - Conference contribution
AN - SCOPUS:84947743427
SN - 9783662482209
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 308
EP - 325
BT - Algorithms in Bioinformatics - 15th International Workshop, WABI 2015, Proceedings
A2 - Pop, Mihai
A2 - Touzet, Hélène
PB - Springer Verlag
T2 - 15th International Workshop on Algorithms in Bioinformatics, WABI 2015
Y2 - 10 September 2015 through 12 September 2015
ER -