TY - GEN
T1 - Finding optimal probabilistic generators for XML collections
AU - Abiteboul, Serge
AU - Amsterdamer, Yael
AU - Deutch, Daniel
AU - Milo, Tova
AU - Senellart, Pierre
PY - 2012/7/23
Y1 - 2012/7/23
N2 - We study the problem of, given a corpus of XML documents and its schema, finding an optimal (generative) probabilistic model, where optimality here means maximizing the likelihood of the particular corpus to be generated. Focusing first on the structure of documents, we present an efficient algorithm for finding the best generative probabilistic model, in the absence of constraints. We further study the problem in the presence of integrity constraints, namely key, inclusion, and domain constraints. We study in this case two different kinds of generators. First, we consider a continuation-test generator that performs, while generating documents, tests of schema satisfiability; these tests prevent the generation of a document that violates the constraints but, as we will see, they are computationally expensive. We also study a restart generator that may generate an invalid document and, when this is the case, restarts and tries again. Finally, we consider the injection of data values into the structure, to obtain a full XML document. We study different approaches for generating these values.
AB - We study the problem of, given a corpus of XML documents and its schema, finding an optimal (generative) probabilistic model, where optimality here means maximizing the likelihood of the particular corpus to be generated. Focusing first on the structure of documents, we present an efficient algorithm for finding the best generative probabilistic model, in the absence of constraints. We further study the problem in the presence of integrity constraints, namely key, inclusion, and domain constraints. We study in this case two different kinds of generators. First, we consider a continuation-test generator that performs, while generating documents, tests of schema satisfiability; these tests prevent the generation of a document that violates the constraints but, as we will see, they are computationally expensive. We also study a restart generator that may generate an invalid document and, when this is the case, restarts and tries again. Finally, we consider the injection of data values into the structure, to obtain a full XML document. We study different approaches for generating these values.
KW - Constraints
KW - Generator
KW - Probabilistic model
KW - Schema
KW - XML
UR - http://www.scopus.com/inward/record.url?scp=84863967419&partnerID=8YFLogxK
U2 - 10.1145/2274576.2274591
DO - 10.1145/2274576.2274591
M3 - Conference contribution
AN - SCOPUS:84863967419
SN - 9781450307918
T3 - ACM International Conference Proceeding Series
SP - 127
EP - 139
BT - Database Theory - ICDT 2012
T2 - 15th International Conference on Database Theory, ICDT 2012
Y2 - 26 March 2012 through 29 March 2012
ER -