Cloud data lakes are a modern approach for storing large amounts of data in a convenient and inexpensive way. Query engines (e.g. Hive, Presto, SparkSQL) are used to run SQL queries on data lakes. Their main focus is on analytical queries while random reads are overlooked. In this paper, we present our approach for optimizing needle in a haystack queries in cloud data lakes. The main idea is to maintain an index structure that maps indexed column values to their files. According to our analysis and experimental evaluation, our solution imposes a reasonable storage overhead while providing an order of magnitude performance improvement.
|Name||CEUR Workshop Proceedings|
|Conference||2021 Workshops of the EDBT/ICDT Joint Conference, EDBT/ICDT-WS 2021|
|Period||23/03/21 → …|