TY - GEN
T1 - Reverse-engineering conjunctive queries from provenance examples
AU - Deutch, Daniel
AU - Gilad, Amir
N1 - Publisher Copyright:
© 2019 Copyright held by the owner/author(s).
PY - 2019/1/1
Y1 - 2019/1/1
N2 - The provenance of a query result details relevant parts of the input data as well as the computation leading to each output tuple. Multiple lines of work have studied the tracking and presentation of provenance, showing its effectiveness in explaining and justifying query results. The willingness of application owners to share provenance information for these purposes may however be hindered by the resulting exposure of the underlying query logic, which may be proprietary and confidential. We therefore formalize and study the following problem: when a (small) subset of the query results along with their provenance is given, what information is revealed about the underlying query? Our model is based on the provenance semiring framework and applies to many previously proposed provenance models. We analyze two flavors of the problem: (1) how many queries may be consistent with a given provenance example? and (2) what is the complexity of inferring a consistent query, or one that is a “best fit"? Our theoretical analysis shows that there may be many (for some models, even infinitely many in presence of self-joins) consistent queries, yet we provide practically efficient algorithms to find (best-fit) such queries. We experimentally show that the algorithms are generally successful in correctly reverse engineering queries, even when given only a few output examples and their provenance.
AB - The provenance of a query result details relevant parts of the input data as well as the computation leading to each output tuple. Multiple lines of work have studied the tracking and presentation of provenance, showing its effectiveness in explaining and justifying query results. The willingness of application owners to share provenance information for these purposes may however be hindered by the resulting exposure of the underlying query logic, which may be proprietary and confidential. We therefore formalize and study the following problem: when a (small) subset of the query results along with their provenance is given, what information is revealed about the underlying query? Our model is based on the provenance semiring framework and applies to many previously proposed provenance models. We analyze two flavors of the problem: (1) how many queries may be consistent with a given provenance example? and (2) what is the complexity of inferring a consistent query, or one that is a “best fit"? Our theoretical analysis shows that there may be many (for some models, even infinitely many in presence of self-joins) consistent queries, yet we provide practically efficient algorithms to find (best-fit) such queries. We experimentally show that the algorithms are generally successful in correctly reverse engineering queries, even when given only a few output examples and their provenance.
UR - http://www.scopus.com/inward/record.url?scp=85064905540&partnerID=8YFLogxK
U2 - 10.5441/002/edbt.2019.25
DO - 10.5441/002/edbt.2019.25
M3 - Conference contribution
AN - SCOPUS:85064905540
T3 - Advances in Database Technology - EDBT
SP - 277
EP - 288
BT - Advances in Database Technology - EDBT 2019
A2 - Kaoudi, Zoi
A2 - Herschel, Melanie
A2 - Binnig, Carsten
A2 - Reinwald, Berthold
A2 - Galhardas, Helena
A2 - Fundulaki, Irini
PB - OpenProceedings.org
T2 - 22nd International Conference on Extending Database Technology, EDBT 2019
Y2 - 26 March 2019 through 29 March 2019
ER -