Improving the efficiency of de Bruijn graph construction using compact universal hitting sets

Yael Ben-Ari, Dan Flomin, Lianrong Pu, Yaron Orenstein, Ron Shamir

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

High-throughput sequencing techniques generate large volumes of DNA sequencing data at ultra-fast speed and extremely low cost. As a consequence, sequencing techniques have become ubiquitous in biomedical research and are used in hundreds of genomic applications. Efficient data structures and algorithms have been developed to handle the large datasets produced by these techniques. The prevailing method to index DNA sequences in those data structures and algorithms is by using k-mers (k-long substrings) known as minimizers. Minimizers are the smallest k-mers selected in every consecutive window of a fixed length in a sequence, where the smallest is determined according to a predefined order, e.g., lexicographic. Recently, a new k-mer order based on a universal hitting set (UHS) was suggested. While several studies have shown that orders based on a small UHS have improved properties, the utility of using them in high-throughput sequencing analysis tasks has been demonstrated to date in only one application of k-mer counting. Here, we demonstrate the practical benefit of UHSs in the genome assembly task. Reconstructing a genome from billions of short reads is a fundamental task in high-throughput sequencing analyses. De Bruijn graph construction is a key step in genome assembly, which often requires very large amounts of memory and long computation time. A critical bottleneck lies in the partitioning of DNA sequences into bins. The sequences in each bin are assembled separately, and the final de Bruijn graph is constructed by merging the bin-specific subgraphs. We incorporated a UHS-based order in the bin partition step of the Minimum Substring Partitioning algorithm. Using a UHS-based order instead of lexicographic- or random-ordered minimizers produced lower density minimizers with more balanced bin partitioning, which led to a reduction in both runtime and memory usage.
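
The minimizer definition and the bin partitioning step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation or the actual Minimum Substring Partitioning code. The function names (`kmers`, `minimizer`, `priority_key`, `partition`), the parameters `k` and `window_len`, and the toy priority set are all hypothetical and chosen for illustration only. In particular, the toy set below is not a real universal hitting set; it only shows how an order that ranks a chosen set of k-mers first changes which minimizers are selected and how windows fall into bins.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return all k-long substrings of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizer(window, k, key=None):
    """Return the smallest k-mer in window under the given order.

    With key=None the order is plain lexicographic.
    """
    return min(kmers(window, k), key=key)


def priority_key(priority_set):
    """Order in which k-mers from priority_set rank before all others.

    Ties are broken lexicographically. This mimics a set-based order;
    the set passed in here is an assumption, not a computed UHS.
    """
    return lambda kmer: (kmer not in priority_set, kmer)


def partition(seq, k, window_len, key=None):
    """Group every window of length window_len by its minimizer."""
    bins = defaultdict(list)
    for i in range(len(seq) - window_len + 1):
        window = seq[i:i + window_len]
        bins[minimizer(window, k, key)].append(window)
    return dict(bins)


if __name__ == "__main__":
    seq = "ACGTACGT"
    # Lexicographic order
    print(partition(seq, k=2, window_len=4))
    # Toy set-based order (hypothetical set, for illustration only)
    print(partition(seq, k=2, window_len=4, key=priority_key({"GT", "TA"})))
```

Passing the order as a `key` function keeps the selection and partitioning logic unchanged when the order is swapped, which mirrors the abstract's comparison of lexicographic, random, and UHS-based orders within the same partitioning step.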

Original language: English
Title of host publication: Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021
Publisher: Association for Computing Machinery, Inc
ISBN (Electronic): 9781450384506
State: Published - 18 Jan 2021
Event: 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021 - Virtual, Online, United States
Duration: 1 Aug 2021 → 4 Aug 2021

Publication series

Name: Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021

Conference

Conference: 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2021
Country/Territory: United States
City: Virtual, Online
Period: 1/08/21 → 4/08/21

Keywords

  • assembly
  • de Bruijn graph
  • minimum substring partitioning
  • universal hitting set

ASJC Scopus subject areas

  • Computer Science Applications
  • Software
  • Biomedical Engineering
  • Health Informatics
