TY - GEN
T1 - Indexing a dictionary for subset matching queries
AU - Landau, Gad M.
AU - Tsur, Dekel
AU - Weimann, Oren
PY - 2010/12/28
Y1 - 2010/12/28
N2 - We consider a subset matching variant of the Dictionary Query problem. Consider a dictionary D of n strings, where each string location contains a set of characters drawn from some alphabet Σ = {1, ..., |Σ|}. Our goal is to preprocess D so when given a query pattern p, where each location in p contains a single character from Σ, we answer if p matches to D. p is said to match to D if there is some s ∈ D where |p| = |s| and p[i] ∈ s[i] for every 1 ≤ i ≤ |p|. To achieve a query time of O(|p|), we construct a compressed trie of all possible patterns that appear in D. Assuming that for every s ∈ D there are at most k locations where |s[i]| > 1, we present two constructions of the trie that yield a preprocessing time of O(nm + |Σ|^k n log(min{n,m})), where n is the number of strings in D and m is the maximum length of a string in D. The first construction is based on divide and conquer and the second construction uses ideas introduced in [2] for text fingerprinting. Furthermore, we show how to obtain O(nm + |Σ|^k n + |Σ|^(k/2) n log(min{n,m})) preprocessing time and O(|p| log log |Σ| + min{|p|, log(|Σ|^k n)} log log(|Σ|^k n)) query time by cutting the dictionary strings and constructing two compressed tries. Our problem is motivated by haplotype inference from a library of genotypes [13, 16]. There, D is a known library of genotypes (|Σ| = 2), and p is a haplotype. Indexing all possible haplotypes that can be inferred from D as well as gathering statistical information about them can be used to accelerate various haplotype inference algorithms.
AB - We consider a subset matching variant of the Dictionary Query problem. Consider a dictionary D of n strings, where each string location contains a set of characters drawn from some alphabet Σ = {1, ..., |Σ|}. Our goal is to preprocess D so when given a query pattern p, where each location in p contains a single character from Σ, we answer if p matches to D. p is said to match to D if there is some s ∈ D where |p| = |s| and p[i] ∈ s[i] for every 1 ≤ i ≤ |p|. To achieve a query time of O(|p|), we construct a compressed trie of all possible patterns that appear in D. Assuming that for every s ∈ D there are at most k locations where |s[i]| > 1, we present two constructions of the trie that yield a preprocessing time of O(nm + |Σ|^k n log(min{n,m})), where n is the number of strings in D and m is the maximum length of a string in D. The first construction is based on divide and conquer and the second construction uses ideas introduced in [2] for text fingerprinting. Furthermore, we show how to obtain O(nm + |Σ|^k n + |Σ|^(k/2) n log(min{n,m})) preprocessing time and O(|p| log log |Σ| + min{|p|, log(|Σ|^k n)} log log(|Σ|^k n)) query time by cutting the dictionary strings and constructing two compressed tries. Our problem is motivated by haplotype inference from a library of genotypes [13, 16]. There, D is a known library of genotypes (|Σ| = 2), and p is a haplotype. Indexing all possible haplotypes that can be inferred from D as well as gathering statistical information about them can be used to accelerate various haplotype inference algorithms.
UR - http://www.scopus.com/inward/record.url?scp=78650463049&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-12476-1_11
DO - 10.1007/978-3-642-12476-1_11
M3 - Conference contribution
AN - SCOPUS:78650463049
SN - 3642124755
SN - 9783642124754
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 158
EP - 169
BT - Algorithms and Applications - Essays Dedicated to Esko Ukkonen on the Occasion of His 60th Birthday
A2 - Elomaa, Tapio
A2 - Mannila, Heikki
A2 - Orponen, Pekka
ER -