Needle in a haystack queries in cloud data lakes

Grisha Weintraub, Ehud Gudes, Shlomi Dolev

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Cloud data lakes are a modern approach for storing large amounts of data in a convenient and inexpensive way. Query engines (e.g. Hive, Presto, SparkSQL) are used to run SQL queries on data lakes. Their main focus is on analytical queries while random reads are overlooked. In this paper, we present our approach for optimizing needle in a haystack queries in cloud data lakes. The main idea is to maintain an index structure that maps indexed column values to their files. According to our analysis and experimental evaluation, our solution imposes a reasonable storage overhead while providing an order of magnitude performance improvement.
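
The indexing idea in the abstract can be illustrated with a minimal in-memory sketch. This is not the authors' implementation: the file names, the row layout (lists of dicts), and the helper names `build_index` and `lookup` are all hypothetical, and a real data lake would keep both the files and the index in cloud storage. The sketch only shows how a mapping from column values to files lets a point query skip files that cannot contain the value.

```python
from collections import defaultdict


def build_index(files, column):
    """Map each value of `column` to the set of files containing it.

    `files` is a hypothetical stand-in for a data lake: a dict from
    file name to a list of row dicts.
    """
    index = defaultdict(set)
    for file_name, rows in files.items():
        for row in rows:
            index[row[column]].add(file_name)
    return index


def lookup(files, index, column, value):
    """Answer a point query by scanning only the files the index names."""
    matches = []
    # .get avoids creating an empty entry for values that are not indexed
    for file_name in sorted(index.get(value, ())):
        matches.extend(row for row in files[file_name] if row[column] == value)
    return matches


# Hypothetical toy data: three "files" with a few rows each.
files = {
    "file-0": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
    "file-1": [{"id": 3, "name": "c"}],
    "file-2": [{"id": 2, "name": "d"}],
}
index = build_index(files, "id")
```

With this toy data, `lookup(files, index, "id", 3)` reads only `file-1` instead of all three files; the trade-off is the extra storage for the index, which grows with the number of distinct values and files.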
Original language: English
Title of host publication: Proceedings of the Workshops of the EDBT/ICDT 2021 Joint Conference, Nicosia, Cyprus, March 23, 2021
Editors: Constantinos Costa, Evaggelia Pitoura
Publisher: CEUR-WS.org
Volume: 2841
State: Published - 2021
Event: 2021 Workshops of the EDBT/ICDT Joint Conference, EDBT/ICDT-WS 2021 - Nicosia, Cyprus
Duration: 23 Mar 2021 → …

Publication series

Name: CEUR Workshop Proceedings
Publisher: CEUR-WS.org
ISSN (Print): 1613-0073

Conference

Conference: 2021 Workshops of the EDBT/ICDT Joint Conference, EDBT/ICDT-WS 2021
Country/Territory: Cyprus
Period: 23/03/21 → …
