Needle in a haystack queries in cloud data lakes

Grisha Weintraub, Ehud Gudes, Shlomi Dolev

Research output: Contribution to journal › Conference article › peer-review

8 Scopus citations

Abstract

Cloud data lakes are a modern approach for storing large amounts of data in a convenient and inexpensive way. Query engines (e.g. Hive, Presto, SparkSQL) are used to run SQL queries on data lakes. Their main focus is on analytical queries while random reads are overlooked. In this paper, we present our approach for optimizing needle in a haystack queries in cloud data lakes. The main idea is to maintain an index structure that maps indexed column values to their files. According to our analysis and experimental evaluation, our solution imposes a reasonable storage overhead while providing an order of magnitude performance improvement.
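
The index idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the in-memory `files` dictionary standing in for data lake files, the file names, and the helpers `build_index` and `lookup` are all hypothetical, chosen only to show how a value-to-files mapping lets a point lookup skip files that cannot contain the requested value.

```python
from collections import defaultdict


def build_index(files, column):
    """Map each value of `column` to the set of file names that contain it."""
    index = defaultdict(set)
    for name, rows in files.items():
        for row in rows:
            index[row[column]].add(name)
    return index


def lookup(files, index, column, value):
    """Return matching rows, scanning only the files the index points to."""
    candidates = index.get(value, ())  # .get avoids creating empty entries
    return [
        row
        for name in sorted(candidates)
        for row in files[name]
        if row[column] == value
    ]


# Hypothetical toy "data lake": file name -> rows stored in that file.
files = {
    "part-0.parquet": [{"id": 1, "city": "Haifa"}, {"id": 2, "city": "Nicosia"}],
    "part-1.parquet": [{"id": 3, "city": "Haifa"}],
    "part-2.parquet": [{"id": 4, "city": "Paris"}],
}

index = build_index(files, "id")
rows = lookup(files, index, "id", 3)  # reads part-1.parquet only
```

Without the index, the same lookup would have to scan every file. The trade-off is the extra storage for the index, which the abstract describes as reasonable.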

Original language: English
Journal: CEUR Workshop Proceedings
Volume: 2841
State: Published - 1 Jan 2021
Event: 2021 Workshops of the EDBT/ICDT Joint Conference, EDBT/ICDT-WS 2021 - Nicosia, Cyprus
Duration: 23 Mar 2021 → …

ASJC Scopus subject areas

  • General Computer Science
