Abstract
Cloud data lakes are a modern approach to storing large amounts of data conveniently and inexpensively. Query engines (e.g., Hive, Presto, SparkSQL) are used to run SQL queries on data lakes. Their main focus is analytical queries, while random reads are overlooked. In this paper, we present our approach to optimizing needle-in-a-haystack queries in cloud data lakes. The main idea is to maintain an index structure that maps indexed column values to the files that contain them. According to our analysis and experimental evaluation, our solution imposes a reasonable storage overhead while providing an order-of-magnitude performance improvement.
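
The core idea of mapping indexed column values to files can be illustrated with a minimal sketch. This is not the paper's implementation: the in-memory `files` dictionary, the file names, and the helpers `build_index` and `lookup` are hypothetical stand-ins for data lake files and a persisted index, chosen only to show how a lookup can skip files that cannot contain the requested value.

```python
from collections import defaultdict


def build_index(files, column):
    """Map each value of `column` to the set of file names containing it."""
    index = defaultdict(set)
    for path, rows in files.items():
        for row in rows:
            index[row[column]].add(path)
    return index


def lookup(files, index, column, value):
    """Read only the files the index lists for `value`, then filter rows."""
    candidates = index.get(value, set())  # .get avoids creating empty entries
    return [
        row
        for path in sorted(candidates)
        for row in files[path]
        if row[column] == value
    ]


# Hypothetical stand-in for data lake files: file name -> list of rows.
files = {
    "part-0.parquet": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
    "part-1.parquet": [{"id": 3, "name": "c"}],
    "part-2.parquet": [{"id": 2, "name": "d"}],
}

index = build_index(files, "id")
matches = lookup(files, index, "id", 2)
```

In this sketch, a query for `id = 2` touches two of the three files instead of scanning all of them; the index itself is the storage overhead traded for that reduction.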
Original language | English |
---|---|
Journal | CEUR Workshop Proceedings |
Volume | 2841 |
State | Published - 1 Jan 2021 |
Event | 2021 Workshops of the EDBT/ICDT Joint Conference, EDBT/ICDT-WS 2021 - Nicosia, Cyprus; Duration: 23 Mar 2021 → … |
ASJC Scopus subject areas
- General Computer Science