TY - JOUR
T1 - Optimizing Cloud Data Lakes Queries
AU - Weintraub, Grisha
N1 - Funding Information:
This research was partially supported by the Israeli Council for Higher Education (CHE) via Data Science Research Center, the Israel Data Science Initiative (IDSI), and the Lynne and William Frankel Center for Computer Science. Email: grishaw@post.bgu.ac.il (G. Weintraub). ORCID: 0000-0003-4823-4757 (G. Weintraub). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).
Publisher Copyright:
© 2023 Copyright for this paper by its authors.
PY - 2023/1/1
Y1 - 2023/1/1
AB - Cloud data lakes have emerged as an inexpensive solution for storing very large amounts of data. The main idea is the separation of the compute and storage layers: cheap cloud storage holds the data, while compute engines run analytics on that data in "on-demand" mode. However, to perform any computation in this architecture, the data must be moved from the storage layer to the compute layer over the network for each calculation. This hurts calculation performance and requires huge network bandwidth. Our research focuses on three related topics: (1) identifying the key challenges to improving query performance in cloud data lakes, (2) providing a theoretical model that formally defines the problem of poor query performance in cloud data lakes, and (3) designing a practical solution to the problem and demonstrating its efficiency via large-scale experimental evaluation.
KW - cloud storage
KW - data lakes
KW - query optimization
UR - http://www.scopus.com/inward/record.url?scp=85169472236&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85169472236
SN - 1613-0073
VL - 3452
SP - 13
EP - 16
JO - CEUR Workshop Proceedings
JF - CEUR Workshop Proceedings
T2 - 49th International Conference on Very Large Data Bases PhD Workshop, VLDB-PhD Workshop 2023
Y2 - 28 August 2023
ER -