TY - JOUR

T1 - Efficient edit distance with duplications and contractions

AU - Pinhas, Tamar

AU - Zakov, Shay

AU - Tsur, Dekel

AU - Ziv-Ukelson, Michal

N1 - Funding Information:
We would like to thank Prof. Yefim Dinitz for kindly pointing us to some relevant references. The research of T.P., S.Z. and M.Z.U. was partially supported by ISF grant 478/10 and by the Frankel Center for Computer Science at Ben Gurion University of the Negev. The research of D.T. was partially supported by ISF grant 981/11 and by the Frankel Center for Computer Science at Ben Gurion University of the Negev. The authors thank the anonymous reviewers for their very helpful comments.
Publisher Copyright:
© 2013 Pinhas et al.; licensee BioMed Central Ltd.

PY - 2013/10/29

Y1 - 2013/10/29

N2 - We propose three algorithms for string edit distance with duplications and contractions. These include an efficient general algorithm and two improvements which apply under certain constraints on the cost function. The new algorithms solve a more general problem variant and obtain better time complexities with respect to previous algorithms. Our general algorithm is based on min-plus multiplication of square matrices and has time and space complexities of O(|Σ| MP(n)) and O(|Σ| n^2), respectively, where |Σ| is the alphabet size, n is the length of the strings, and MP(n) is the time bound for the computation of min-plus matrix multiplication of two n × n matrices (currently, MP(n) = O(n^3 log^3 log n / log^2 n) due to an algorithm by Chan). For integer cost functions, the running time is further improved to O(|Σ| n^3 / log^2 n). In addition, this variant of the algorithm is online, in the sense that the input strings may be given letter by letter, and its time complexity bounds the processing time of the first n given letters. This acceleration is based on our efficient matrix-vector min-plus multiplication algorithm, intended for matrices and vectors for which differences between adjacent entries are from a finite integer interval D. Choosing a constant 1/log_{|D|} n < λ < 1, the algorithm preprocesses an n × n matrix in O(n^{2+λ} / (|D| λ^2 log_{|D|}^2 n)) space. Then, it may multiply the matrix with any given n-length vector in O(n^2 / (λ^2 log_{|D|}^2 n)) time. Under some discreteness assumptions, this matrix-vector min-plus multiplication algorithm applies to several problems from the domains of context-free grammar parsing and RNA folding and, in particular, implies the asymptotically fastest O(n^3 / log^2 n) time algorithm for single-strand RNA folding with discrete cost functions.
Finally, assuming a different constraint on the cost function, we present another version of the algorithm that exploits the run-length encoding of the strings and runs in O(|Σ| n MP(ñ) / ñ) time and O(|Σ| n ñ) space, where ñ is the length of the run-length encoding of the strings.

AB - We propose three algorithms for string edit distance with duplications and contractions. These include an efficient general algorithm and two improvements which apply under certain constraints on the cost function. The new algorithms solve a more general problem variant and obtain better time complexities with respect to previous algorithms. Our general algorithm is based on min-plus multiplication of square matrices and has time and space complexities of O(|Σ| MP(n)) and O(|Σ| n^2), respectively, where |Σ| is the alphabet size, n is the length of the strings, and MP(n) is the time bound for the computation of min-plus matrix multiplication of two n × n matrices (currently, MP(n) = O(n^3 log^3 log n / log^2 n) due to an algorithm by Chan). For integer cost functions, the running time is further improved to O(|Σ| n^3 / log^2 n). In addition, this variant of the algorithm is online, in the sense that the input strings may be given letter by letter, and its time complexity bounds the processing time of the first n given letters. This acceleration is based on our efficient matrix-vector min-plus multiplication algorithm, intended for matrices and vectors for which differences between adjacent entries are from a finite integer interval D. Choosing a constant 1/log_{|D|} n < λ < 1, the algorithm preprocesses an n × n matrix in O(n^{2+λ} / (|D| λ^2 log_{|D|}^2 n)) space. Then, it may multiply the matrix with any given n-length vector in O(n^2 / (λ^2 log_{|D|}^2 n)) time. Under some discreteness assumptions, this matrix-vector min-plus multiplication algorithm applies to several problems from the domains of context-free grammar parsing and RNA folding and, in particular, implies the asymptotically fastest O(n^3 / log^2 n) time algorithm for single-strand RNA folding with discrete cost functions.
Finally, assuming a different constraint on the cost function, we present another version of the algorithm that exploits the run-length encoding of the strings and runs in O(|Σ| n MP(ñ) / ñ) time and O(|Σ| n ñ) space, where ñ is the length of the run-length encoding of the strings.

KW - Edit distance

KW - Four Russians

KW - Min-plus matrix multiplication

KW - Minisatellites

UR - http://www.scopus.com/inward/record.url?scp=84886420985&partnerID=8YFLogxK

U2 - 10.1186/1748-7188-8-27

DO - 10.1186/1748-7188-8-27

M3 - Article

C2 - 24168705

AN - SCOPUS:84886420985

VL - 8

JO - Algorithms for Molecular Biology

JF - Algorithms for Molecular Biology

SN - 1748-7188

IS - 1

M1 - 27

ER -