Regular expression constrained sequence alignment revisited

Gregory Kucherov, Tamar Pinhas, Michal Ziv-Ukelson

Research output: Contribution to journal › Article › peer-review

12 Scopus citations

Abstract

We introduce regular expression constrained sequence alignment as the problem of finding the maximum alignment score between given strings S_1 and S_2 over all alignments such that in these alignments there exists a segment where some substring s_1 of S_1 is aligned to some substring s_2 of S_2, and both s_1 and s_2 match a given regular expression R, i.e. s_1, s_2 ∈ L(R), where L(R) is the regular language described by R. For complexity results we assume, without loss of generality, that n = |S_1| ≥ m = |S_2|. A motivation for the problem is that protein sequences can be aligned in a way that known motifs guide the alignments. We present an O(nmr) time algorithm for the regular expression constrained sequence alignment problem, where r = O(t^4) and t is the number of states of a nondeterministic finite automaton N that accepts L(R). We use in our algorithm a nondeterministic weighted finite automaton M that we construct from N. M has O(t^2) states, where the transition-weights are obtained from the given costs of edit operations, and state-weights correspond to optimum alignment scores we compute using the underlying dynamic programming solution for sequence alignment. If we are given a deterministic finite automaton D accepting L(R) with t_d states, then our construction creates a deterministic finite automaton M_d with t_d^2 states. In this case, our algorithm takes O(t_d^2 nm) time. Using M_d results in faster computation than using M when t_d < t^2. If we only want to compute the optimum score, the space required by our algorithm is O(t^2 n) (O(t_d^2 m) if we use a given M_d). If we also want to compute an optimal alignment, then our algorithm uses O(t^2 m + t^2 |s_1| |s_2|) space (O(t_d^2 m + t_d^2 |s_1| |s_2|) space if we use a given M_d), where s_1 and s_2 are substrings of S_1 and S_2, respectively, s_1, s_2 ∈ L(R), and s_1 and s_2 are aligned together in the optimal alignment that we construct.
We also show that our method generalizes to the problem with affine gap penalties, and to finding optimal regular expression constrained local sequence alignments.
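
The deterministic case described in the abstract can be illustrated with a small dynamic program. The following is a minimal sketch, not the paper's construction: it assumes the DFA is given as a plain transition dictionary, uses hypothetical unit scores (`match=1`, `mismatch=-1`, `gap=-1`), and tracks a pair of DFA states (one for s_1, one for s_2) while inside the constrained segment. The function name `constrained_alignment_score` and the phase labels `"pre"` and `"post"` are illustrative choices. For readability it stores the full table, so it does not achieve the space bounds quoted above, but its running time is proportional to t_d^2 nm.

```python
NEG_INF = float("-inf")

def constrained_alignment_score(s1, s2, delta, start, accepting,
                                match=1, mismatch=-1, gap=-1):
    """Best global alignment score with one segment whose two sides are in L(R).

    delta maps (state, char) -> state; a missing entry means a dead state.
    Scores are illustrative assumptions, not values from the paper.
    """
    n, m = len(s1), len(s2)
    # cells[i][j] maps a phase key to the best score for s1[:i] vs s2[:j].
    # Keys: "pre" (segment not opened), (p, q) (inside it), "post" (closed).
    cells = [[{} for _ in range(m + 1)] for _ in range(n + 1)]
    cells[0][0]["pre"] = 0

    def relax(cell, key, score):
        if score > cell.get(key, NEG_INF):
            cell[key] = score

    def advance(key, a, b):
        # Outside the segment the phase key is unchanged.
        if not isinstance(key, tuple):
            return key
        p, q = key
        if a is not None:
            p = delta.get((p, a))
        if b is not None:
            q = delta.get((q, b))
        return None if p is None or q is None else (p, q)

    for i in range(n + 1):
        for j in range(m + 1):
            cell = cells[i][j]
            # Zero-cost moves: open the segment, then close it if both accept.
            if "pre" in cell:
                relax(cell, (start, start), cell["pre"])
            for key, score in list(cell.items()):
                if (isinstance(key, tuple)
                        and key[0] in accepting and key[1] in accepting):
                    relax(cell, "post", score)
            # The three edit operations leaving cell (i, j).
            moves = []
            if i < n and j < m:
                sub = match if s1[i] == s2[j] else mismatch
                moves.append((1, 1, s1[i], s2[j], sub))
            if i < n:
                moves.append((1, 0, s1[i], None, gap))
            if j < m:
                moves.append((0, 1, None, s2[j], gap))
            for key, score in cell.items():
                for di, dj, a, b, cost in moves:
                    nxt = advance(key, a, b)
                    if nxt is not None:
                        relax(cells[i + di][j + dj], nxt, score + cost)

    # -inf signals that no alignment satisfies the constraint.
    return cells[n][m].get("post", NEG_INF)


# Hypothetical DFA for the regular expression "a+" (one or more a's).
A_PLUS = {(0, "a"): 1, (1, "a"): 1}
print(constrained_alignment_score("baab", "bab", A_PLUS, 0, {1}))
```

In this usage example the segment aligns "aa" from the first string with "a" from the second, and both substrings match `a+`. Keeping only two rows of `cells` would reduce the memory footprint if just the score is needed.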

Original language: English
Pages (from-to): 647-661
Number of pages: 15
Journal: Journal of Discrete Algorithms
Volume: 5
Issue number: 4
State: Published - 9 May 2011

Keywords

  • Dynamic programming
  • Finite automaton
  • Pattern matching
  • Regular expression
  • Sequence alignment

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Discrete Mathematics and Combinatorics
  • Computational Theory and Mathematics
