Compact universal k-mer hitting sets

Yaron Orenstein, David Pellow, Guillaume Marçais, Ron Shamir, Carl Kingsford

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

14 Scopus citations

Abstract

We address the problem of finding a minimum-size set of k-mers that hits L-long sequences, meaning that every such sequence contains at least one k-mer from the set. The problem arises in the design of compact hash functions and other data structures for efficient handling of large sequencing datasets. We prove that the problem of hitting a given set of L-long sequences is NP-hard and give a heuristic solution that finds a compact universal k-mer set that hits any set of L-long sequences. The algorithm, called DOCKS (design of compact k-mer sets), works in two phases: (i) finding a minimum-size k-mer set that hits every infinite sequence; (ii) greedily adding k-mers such that together they hit all remaining L-long sequences. We show that DOCKS works well in practice and produces a set of k-mers that is much smaller than a random choice of k-mers. We present results for various values of k and sequence length L and, by applying the resulting sets to two bacterial genomes, show that universal hitting k-mers improve on minimizers. The software and example sets are freely available at acgt.cs.tau.ac.il/docks/.
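To make the notion of "hitting" concrete, the following is a minimal brute-force sketch, not the DOCKS implementation. It enumerates every L-long sequence over a small alphabet, reports those that contain no k-mer from a candidate set, and builds a hitting set with a naive greedy rule. The function names (`kmers_of`, `unhit_sequences`, `greedy_hitting_set`) are hypothetical, and the greedy loop only loosely mirrors the spirit of phase (ii); it does not reproduce phase (i) and is feasible only for very small k and L.

```python
from itertools import product

ALPHABET = "ACGT"


def kmers_of(seq, k):
    """Return the set of all k-mers (length-k substrings) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unhit_sequences(kmer_set, k, L, alphabet=ALPHABET):
    """List every L-long sequence that contains no k-mer from kmer_set.

    Brute force over all len(alphabet)**L sequences (illustration only).
    """
    unhit = []
    for chars in product(alphabet, repeat=L):
        seq = "".join(chars)
        if kmers_of(seq, k).isdisjoint(kmer_set):
            unhit.append(seq)
    return unhit


def greedy_hitting_set(k, L, alphabet=ALPHABET):
    """Naive greedy heuristic (an assumption of this sketch, not DOCKS):
    repeatedly add the k-mer that occurs in the most remaining sequences.
    Requires L >= k so that every sequence contains at least one k-mer.
    """
    remaining = ["".join(c) for c in product(alphabet, repeat=L)]
    chosen = set()
    while remaining:
        counts = {}
        for seq in remaining:
            for km in kmers_of(seq, k):
                counts[km] = counts.get(km, 0) + 1
        # Sort first so ties are broken deterministically.
        best = max(sorted(counts), key=counts.get)
        chosen.add(best)
        remaining = [s for s in remaining if best not in kmers_of(s, k)]
    return chosen


if __name__ == "__main__":
    k, L = 2, 4
    hitting = greedy_hitting_set(k, L, alphabet="AC")
    print(sorted(hitting))
    print(unhit_sequences(hitting, k, L, alphabet="AC"))
```

A set is universal for a given k and L exactly when `unhit_sequences` returns an empty list for it. Because the search space grows as the alphabet size raised to the power L, this check serves only to illustrate the definition on toy inputs.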

Original language: English
Title of host publication: Algorithms in Bioinformatics - 16th International Workshop, WABI 2016, Proceedings
Editors: Martin Frith, Christian Nørgaard Storm Pedersen
Publisher: Springer Verlag
Pages: 257-268
Number of pages: 12
ISBN (Print): 9783319436807
DOIs
State: Published - 1 Jan 2016
Externally published: Yes
Event: 16th International Workshop on Algorithms in Bioinformatics, WABI 2016 - Aarhus, Denmark
Duration: 22 Aug 2016 - 24 Aug 2016

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9838 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 16th International Workshop on Algorithms in Bioinformatics, WABI 2016
Country/Territory: Denmark
City: Aarhus
Period: 22/08/16 - 24/08/16
