Efficient indexing of versioned document sequences

Michael Herscovici, Ronny Lempel, Sivan Yogev

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

8 Scopus citations

Abstract

Many information systems keep multiple versions of documents. Examples include content management systems, version control systems (e.g. ClearCase, CVS), Wikis, and backup and archiving solutions. Often, it is desired to enable free-text search over such repositories, i.e. to enable submitting queries that may match any version of any document. We propose an indexing method that takes advantage of the inherent redundancy present in versioned documents by solving a variant of the multiple sequence alignment problem. The scheme produces an index that is much more compact than a standard index that treats each version independently. In experiments over publicly available versioned data, our method achieved compaction ratios of 81% as compared with standard indexing, while supporting the same retrieval capabilities.
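To make the redundancy argument concrete, the following minimal sketch compares a standard index, which stores one posting per term per version, with an index that stores one posting per run of consecutive versions containing the term. It is an illustration under simplifying assumptions, not the authors' method: it ignores term positions and performs no sequence alignment, and the function names (`build_interval_index`, `search`, `naive_posting_count`, `interval_posting_count`) are hypothetical.

```python
from collections import defaultdict


def build_interval_index(versions):
    """Map each term to a list of [first, last] version ranges containing it.

    `versions` is a list of strings, one per version, in chronological order.
    Simplification: each version is treated as a set of lowercase terms.
    """
    index = defaultdict(list)
    for v, text in enumerate(versions):
        for term in set(text.lower().split()):
            ranges = index[term]
            if ranges and ranges[-1][1] == v - 1:
                # Term was also in the previous version: extend the current run.
                ranges[-1][1] = v
            else:
                # Term is new, or reappears after a gap: start a new run.
                ranges.append([v, v])
    return dict(index)


def search(index, term):
    """Return the sorted list of version numbers that contain `term`."""
    hits = []
    for first, last in index.get(term.lower(), []):
        hits.extend(range(first, last + 1))
    return hits


def naive_posting_count(versions):
    """Postings in a standard index: one per distinct term per version."""
    return sum(len(set(text.lower().split())) for text in versions)


def interval_posting_count(index):
    """Postings in the interval index: one per run of consecutive versions."""
    return sum(len(ranges) for ranges in index.values())
```

For example, indexing the three versions "the quick brown fox", "the quick red fox", and "the quick red fox jumps" this way yields 6 interval postings against 13 in the per-version index, while `search` still reports exactly which versions match a term. The saving in this toy case says nothing about the 81% figure reported above, which comes from the authors' own scheme and data.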

Original language: English
Title of host publication: Advances in Information Retrieval - 29th European Conference on IR Research, ECIR 2007, Proceedings
Publisher: Springer Verlag
Pages: 76-87
Number of pages: 12
ISBN (Print): 3540714944, 9783540714941
State: Published - 1 Jan 2007
Externally published: Yes
Event: 29th European Conference on IR Research, ECIR 2007 - Rome, Italy
Duration: 2 Apr 2007 to 5 Apr 2007

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 4425 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 29th European Conference on IR Research, ECIR 2007
Country/Territory: Italy
City: Rome
Period: 2/04/07 to 5/04/07

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
