Abstract
Cloud data lakes have emerged as an inexpensive solution for storing very large amounts of data. The main idea is the separation of the compute and storage layers: cheap cloud storage holds the data, while compute engines run analytics on it on demand. However, to perform any computation in this architecture, the data must be moved from the storage layer to the compute layer over the network for each calculation. This hurts query performance and requires very high network bandwidth. Our research focuses on three related goals: (1) identifying the key challenges to improving query performance in cloud data lakes, (2) providing a theoretical model that formally defines the problem of poor query performance in cloud data lakes, and (3) designing a practical solution to the problem and demonstrating its efficiency via large-scale experimental evaluation.
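To make the data-movement cost concrete, the following is a minimal back-of-the-envelope sketch, not the theoretical model from the paper. It assumes that network transfer is the only cost, that nothing is cached between queries, and that the link runs at its full nominal bandwidth. The function names and the example numbers (1 TB scanned per query, a 10 Gbit/s link) are hypothetical and chosen only for illustration.

```python
def transfer_seconds(scanned_bytes: int, bandwidth_bits_per_s: float) -> float:
    """Lower bound on the time to move scanned data from storage to compute.

    Assumption: the link is fully utilized and there is no other overhead.
    """
    if bandwidth_bits_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return scanned_bytes * 8 / bandwidth_bits_per_s


def total_transfer_seconds(
    scanned_bytes: int, bandwidth_bits_per_s: float, num_queries: int
) -> float:
    """Without caching, every query pays the full transfer cost again."""
    return num_queries * transfer_seconds(scanned_bytes, bandwidth_bits_per_s)


# Hypothetical workload: each query scans 1 TB over a 10 Gbit/s link.
ONE_TB = 10**12            # bytes
TEN_GBIT = 10 * 10**9      # bits per second

per_query = transfer_seconds(ONE_TB, TEN_GBIT)
five_queries = total_transfer_seconds(ONE_TB, TEN_GBIT, num_queries=5)
print(per_query, five_queries)
```

Under these assumptions the transfer time grows linearly with both the volume of data scanned and the number of queries, which is why reducing the bytes moved per query is a natural target for optimization.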
Field | Value |
---|---|
Original language | English |
Pages (from-to) | 13-16 |
Number of pages | 4 |
Journal | CEUR Workshop Proceedings |
Volume | 3452 |
State | Published - 1 Jan 2023 |
Event | 49th International Conference on Very Large Data Bases PhD Workshop, VLDB-PhD Workshop 2023, Vancouver, Canada; Duration: 28 Aug 2023 → … |
Keywords
- cloud storage
- data lakes
- query optimization
ASJC Scopus subject areas
- General Computer Science