Abstract
We introduce regular expression constrained sequence alignment as the problem of finding the maximum alignment score between given strings $S_1$ and $S_2$ over all alignments such that in these alignments there exists a segment where some substring $s_1$ of $S_1$ is aligned to some substring $s_2$ of $S_2$, and both $s_1$ and $s_2$ match a given regular expression $R$, i.e. $s_1, s_2 \in L(R)$, where $L(R)$ is the regular language described by $R$. For complexity results we assume, without loss of generality, that $n = |S_1| \ge m = |S_2|$. A motivation for the problem is that protein sequences can be aligned in a way that known motifs guide the alignments. We present an $O(nmr)$ time algorithm for the regular expression constrained sequence alignment problem, where $r = O(t^4)$ and $t$ is the number of states of a nondeterministic finite automaton $N$ that accepts $L(R)$. Our algorithm uses a nondeterministic weighted finite automaton $M$ that we construct from $N$. $M$ has $O(t^2)$ states, where the transition-weights are obtained from the given costs of edit operations, and state-weights correspond to optimum alignment scores that we compute using the underlying dynamic programming solution for sequence alignment. If we are given a deterministic finite automaton $D$ accepting $L(R)$ with $t_d$ states, then our construction creates a deterministic finite automaton $M_d$ with $t_d^2$ states. In this case, our algorithm takes $O(t_d^2 nm)$ time. Using $M_d$ results in faster computation than using $M$ when $t_d < t^2$. If we only want to compute the optimum score, the space required by our algorithm is $O(t^2 n)$ ($O(t_d^2 m)$ if we use a given $M_d$). If we also want to compute an optimal alignment, then our algorithm uses $O(t^2 m + t^2 |s_1| |s_2|)$ space ($O(t_d^2 m + t_d^2 |s_1| |s_2|)$ space if we use a given $M_d$), where $s_1$ and $s_2$ are substrings of $S_1$ and $S_2$, respectively, $s_1, s_2 \in L(R)$, and $s_1$ and $s_2$ are aligned together in the optimal alignment that we construct.
We also show that our method generalizes to the problem with affine gap penalties, and to finding optimal regular expression constrained local sequence alignments.
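To make the deterministic case concrete, the following is a minimal Python sketch of a score-only dynamic program over pairs of DFA states. It is an illustration under stated assumptions, not the construction from the paper: the function name `constrained_alignment_score`, the dictionary encoding of the DFA (`delta`, `start`, `accepting`), and the linear scoring defaults (`match`, `mismatch`, `gap`) are all hypothetical. It tracks three phases (before the constrained segment, inside it with one DFA state per string, and after it), and it stores the full table rather than applying the space reductions described above.

```python
NEG = float("-inf")  # score of an infeasible configuration


def constrained_alignment_score(s1, s2, delta, start, accepting,
                                match=1, mismatch=-1, gap=-1):
    """Best global alignment score of s1 and s2 with a segment whose two
    aligned substrings are both accepted by the DFA (delta, start, accepting).

    delta maps (state, char) -> state; a missing entry means "reject".
    Returns NEG when no alignment satisfies the constraint.
    """
    n, m = len(s1), len(s2)

    def sub(a, b):
        return match if a == b else mismatch

    def relax(cell, state, score):
        # Keep the best score seen for this pair of DFA states.
        if score > cell.get(state, NEG):
            cell[state] = score

    before = [[NEG] * (m + 1) for _ in range(n + 1)]   # segment not started
    after = [[NEG] * (m + 1) for _ in range(n + 1)]    # segment finished
    inside = [[{} for _ in range(m + 1)] for _ in range(n + 1)]

    for i in range(n + 1):
        for j in range(m + 1):
            a = s1[i - 1] if i > 0 else None
            b = s2[j - 1] if j > 0 else None

            # Phase 0: ordinary global alignment before the segment.
            if i == 0 and j == 0:
                before[i][j] = 0
            else:
                best = NEG
                if i > 0:
                    best = max(best, before[i - 1][j] + gap)
                if j > 0:
                    best = max(best, before[i][j - 1] + gap)
                if i > 0 and j > 0:
                    best = max(best, before[i - 1][j - 1] + sub(a, b))
                before[i][j] = best

            # Phase 1: inside the segment; (p, q) are the DFA states reached
            # on the parts of s1 and s2 consumed by the segment so far.
            cell = inside[i][j]
            relax(cell, (start, start), before[i][j])  # segment may start here
            if i > 0:  # column (a, gap): only the s1 automaton advances
                for (p, q), sc in inside[i - 1][j].items():
                    p2 = delta.get((p, a))
                    if p2 is not None:
                        relax(cell, (p2, q), sc + gap)
            if j > 0:  # column (gap, b): only the s2 automaton advances
                for (p, q), sc in inside[i][j - 1].items():
                    q2 = delta.get((q, b))
                    if q2 is not None:
                        relax(cell, (p, q2), sc + gap)
            if i > 0 and j > 0:  # column (a, b): both automata advance
                for (p, q), sc in inside[i - 1][j - 1].items():
                    p2, q2 = delta.get((p, a)), delta.get((q, b))
                    if p2 is not None and q2 is not None:
                        relax(cell, (p2, q2), sc + sub(a, b))

            # Phase 2: the segment may end when both states accept.
            best = NEG
            for (p, q), sc in cell.items():
                if p in accepting and q in accepting:
                    best = max(best, sc)
            if i > 0:
                best = max(best, after[i - 1][j] + gap)
            if j > 0:
                best = max(best, after[i][j - 1] + gap)
            if i > 0 and j > 0:
                best = max(best, after[i - 1][j - 1] + sub(a, b))
            after[i][j] = best

    return after[n][m]
```

Each cell holds at most $t_d^2$ state pairs and each pair is relaxed a constant number of times, so this sketch runs in $O(t_d^2 nm)$ time, in line with the bound stated for the deterministic case. A DFA for the single-letter motif `a` could be passed as `delta={(0, "a"): 1}`, `start=0`, `accepting={1}`.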
Original language | English |
---|---|
Pages (from-to) | 647-661 |
Number of pages | 15 |
Journal | Journal of Discrete Algorithms |
Volume | 5 |
Issue number | 4 |
State | Published - 9 May 2011 |
Keywords
- Dynamic programming
- Finite automaton
- Pattern matching
- Regular expression
- Sequence alignment
ASJC Scopus subject areas
- Theoretical Computer Science
- Discrete Mathematics and Combinatorics
- Computational Theory and Mathematics