TY - GEN
T1 - Efficient indexing of versioned document sequences
AU - Herscovici, Michael
AU - Lempel, Ronny
AU - Yogev, Sivan
PY - 2007/1/1
Y1 - 2007/1/1
N2 - Many information systems keep multiple versions of documents. Examples include content management systems, version control systems (e.g. ClearCase, CVS), Wikis, and backup and archiving solutions. Often, it is desired to enable free-text search over such repositories, i.e. to enable submitting queries that may match any version of any document. We propose an indexing method that takes advantage of the inherent redundancy present in versioned documents by solving a variant of the multiple sequence alignment problem. The scheme produces an index that is much more compact than a standard index that treats each version independently. In experiments over publicly available versioned data, our method achieved compaction ratios of 81% as compared with standard indexing, while supporting the same retrieval capabilities.
AB - Many information systems keep multiple versions of documents. Examples include content management systems, version control systems (e.g. ClearCase, CVS), Wikis, and backup and archiving solutions. Often, it is desired to enable free-text search over such repositories, i.e. to enable submitting queries that may match any version of any document. We propose an indexing method that takes advantage of the inherent redundancy present in versioned documents by solving a variant of the multiple sequence alignment problem. The scheme produces an index that is much more compact than a standard index that treats each version independently. In experiments over publicly available versioned data, our method achieved compaction ratios of 81% as compared with standard indexing, while supporting the same retrieval capabilities.
UR - http://www.scopus.com/inward/record.url?scp=37149018391&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-71496-5_10
DO - 10.1007/978-3-540-71496-5_10
M3 - Conference contribution
AN - SCOPUS:37149018391
SN - 3540714944
SN - 9783540714941
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 76
EP - 87
BT - Advances in Information Retrieval - 29th European Conference on IR Research, ECIR 2007, Proceedings
PB - Springer Verlag
T2 - 29th European Conference on IR Research, ECIR 2007
Y2 - 2 April 2007 through 5 April 2007
ER -