TY - GEN
T1 - Predicting Fact Contributions from Query Logs with Machine Learning
AU - Arad, Dana
AU - Deutch, Daniel
AU - Frost, Nave
N1 - Publisher Copyright:
© 2024 Copyright held by the owner/author(s).
PY - 2024/3/18
Y1 - 2024/3/18
N2 - A recent line of work has proposed quantifying the contribution of database tuples to query answers using Shapley values, a game-theoretic function that has been used extensively as a means of attribution in other areas, notably Machine Learning. In this paper, we analyze and evaluate LearnShapley, a solution that employs Machine Learning to rank input facts based on their estimated (Shapley-based) contribution to query answers. LearnShapley is trained on a corpus of SPJU queries, their outputs, and the Shapley values of each input tuple with respect to each output tuple. At inference time, LearnShapley is given a new SPJU query over the same database schema, an output tuple of interest, and its lineage (i.e., the set of all facts that have contributed in some way to the generation of the tuple). Our experiments evaluate to what extent LearnShapley is able to leverage similarity measures applied to the query at hand and the queries stored in the repository to compute a ranking of the facts in the lineage based on their contribution. Overall, our experiments indicate that a log of past queries, output tuples, and their Shapley values contains a reasonably relevant signal for predicting the ranking of fact contributions for a new SPJU query over the same database. Both DBShap and our code are publicly available and may serve for further investigation of Machine Learning approaches for explainability in databases.
AB - A recent line of work has proposed quantifying the contribution of database tuples to query answers using Shapley values, a game-theoretic function that has been used extensively as a means of attribution in other areas, notably Machine Learning. In this paper, we analyze and evaluate LearnShapley, a solution that employs Machine Learning to rank input facts based on their estimated (Shapley-based) contribution to query answers. LearnShapley is trained on a corpus of SPJU queries, their outputs, and the Shapley values of each input tuple with respect to each output tuple. At inference time, LearnShapley is given a new SPJU query over the same database schema, an output tuple of interest, and its lineage (i.e., the set of all facts that have contributed in some way to the generation of the tuple). Our experiments evaluate to what extent LearnShapley is able to leverage similarity measures applied to the query at hand and the queries stored in the repository to compute a ranking of the facts in the lineage based on their contribution. Overall, our experiments indicate that a log of past queries, output tuples, and their Shapley values contains a reasonably relevant signal for predicting the ranking of fact contributions for a new SPJU query over the same database. Both DBShap and our code are publicly available and may serve for further investigation of Machine Learning approaches for explainability in databases.
UR - http://www.scopus.com/inward/record.url?scp=85191000601&partnerID=8YFLogxK
U2 - 10.48786/edbt.2024.60
DO - 10.48786/edbt.2024.60
M3 - Conference contribution
AN - SCOPUS:85191000601
T3 - Advances in Database Technology - EDBT
SP - 704
EP - 716
BT - Advances in Database Technology - EDBT
PB - OpenProceedings.org
T2 - 27th International Conference on Extending Database Technology, EDBT 2024
Y2 - 25 March 2024 through 28 March 2024
ER -