## Abstract

We consider a subset matching variant of the Dictionary Query problem. Consider a dictionary D of n strings, where each string location contains a set of characters drawn from some alphabet Σ. Our goal is to preprocess D so that, given a query pattern p in which each location contains a single character from Σ, we can answer whether p matches D. We say that p matches D if there is some s ∈ D where |p| = |s| and p[i] ∈ s[i] for every 1 ≤ i ≤ |p|. To achieve a query time of O(|p|), we construct a compressed trie of all possible patterns that appear in D. Assuming that for every s ∈ D there are at most k locations where |s[i]| > 1, we present two constructions of the trie that yield a preprocessing time of O(nm + |Σ|^{k} n lg(min{n, m})), where n is the number of strings in D and m is the maximum length of a string in D. The first construction is based on divide and conquer, and the second uses ideas introduced in [2] for text fingerprinting. Furthermore, we show how to obtain O(nm + |Σ|^{k} n + |Σ|^{k/2} n lg(min{n, m})) preprocessing time and O(|p| lg lg|Σ| + min{|p|, lg(|Σ|^{k}n)} lg lg(|Σ|^{k}n)) query time by cutting the dictionary strings and constructing two compressed tries. Our problem is motivated by haplotype inference from a library of genotypes [14,17]. There, D is a known library of genotypes (|Σ| = 2), and p is a haplotype. Indexing all possible haplotypes that can be inferred from D, as well as gathering statistical information about them, can be used to accelerate various haplotype inference algorithms. In particular, this applies to algorithms based on the "pure parsimony criteria" [13,16], greedy heuristics such as "Clark's rule" [6,18], EM-based algorithms [1,11,12,20,26,30], and algorithms for inferring haplotypes from a set of Trios [4,27].
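The matching definition and the idea of indexing every pattern that appears in D can be illustrated with a small sketch. This is not the construction from the paper: it enumerates all expanded patterns naively and stores them in a plain, uncompressed trie built from nested dictionaries, so it does not achieve the stated preprocessing bounds. The names `matches`, `build_trie`, `query`, `END`, and the sample dictionary `D` are hypothetical and chosen only for illustration.

```python
from itertools import product

END = "$end"  # hypothetical end-of-pattern marker; never a single character


def matches(p, dictionary):
    """Direct check of the definition: some s in D with |p| = |s|
    and p[i] in s[i] at every position."""
    return any(
        len(s) == len(p) and all(c in pos for c, pos in zip(p, s))
        for s in dictionary
    )


def build_trie(dictionary):
    """Insert every pattern obtainable from every dictionary string
    into an uncompressed trie (naive enumeration, illustration only)."""
    root = {}
    for s in dictionary:
        # With at most k positions holding more than one character,
        # each string expands to at most |Sigma|^k patterns.
        for pattern in product(*s):
            node = root
            for c in pattern:
                node = node.setdefault(c, {})
            node[END] = True
    return root


def query(trie, p):
    """Walk the trie one character of p at a time."""
    node = trie
    for c in p:
        node = node.get(c)
        if node is None:
            return False
    return END in node


# Assumed toy genotype library over the alphabet {"0", "1"} (|Sigma| = 2).
D = [
    [{"0"}, {"0", "1"}, {"1"}],
    [{"1"}, {"1"}, {"0", "1"}],
]
trie = build_trie(D)

for p in ["001", "011", "110", "100", "01"]:
    # The trie lookup agrees with the definition on every query.
    assert query(trie, p) == matches(p, D)
```

In this sketch a query touches one trie node per character of p, which mirrors the O(|p|) query time mentioned above (assuming constant-time dictionary lookups); the cost that the paper's constructions address is the preprocessing, which the naive enumeration here makes no attempt to optimize.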

Original language | English |
---|---|
Title of host publication | String Processing and Information Retrieval - 14th International Symposium, SPIRE 2007, Proceedings |
Publisher | Springer Verlag |
Pages | 195-204 |
Number of pages | 10 |
ISBN (Print) | 9783540755296 |
State | Published - 1 Jan 2007 |
Event | 14th International Symposium on String Processing and Information Retrieval, SPIRE 2007 - Santiago, Chile. Duration: 29 Oct 2007 → 31 Oct 2007 |

### Publication series

Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 4726 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |

### Conference

Conference | 14th International Symposium on String Processing and Information Retrieval, SPIRE 2007 |
---|---|
Country/Territory | Chile |
City | Santiago |
Period | 29/10/07 → 31/10/07 |