Using Wikipedia links to construct word segmentation corpora

David Gabay, Ben Eliahu Ziv, Michael Elhadad

Research output: Chapter in Book/Report/Conference proceeding - Conference contribution - peer-review

3 Scopus citations


Tagged corpora are essential for evaluating and training natural language processing tools. The cost of constructing sufficiently large manually tagged corpora is high, even when the annotation level is shallow. This article describes a simple method to automatically create a partially tagged corpus, using Wikipedia hyperlinks. The resulting corpus contains information about the correct segmentation of 523,599 non-consecutive words in 363,090 sentences. We used our method to construct a corpus of Modern Hebrew (which we have made available). The method can also be applied to other languages where word segmentation is difficult to determine, such as East and South-East Asian languages.
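
The abstract does not spell out how hyperlinks reveal segmentation, so the following is only a minimal sketch under one assumption: that in wiki markup a prefix (for example, a Hebrew preposition) is written directly before the link brackets, as in `ב[[ירושלים]]`, so the bracket marks a boundary inside the written word. The function name `prefixed_links` and the regular expression are hypothetical illustrations, not the authors' implementation.

```python
import re

# Assumption: letters glued directly before "[[" are a prefix of the same
# written word, and the link text is the rest of that word (or phrase).
# Matches [[target]] and [[target|anchor]], capturing any letters before it.
LINK = re.compile(r"(\w*)\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def prefixed_links(wikitext):
    """Return (prefix, linked_text) pairs for links with an attached prefix.

    Links with no letters directly before the brackets carry no
    segmentation information under this assumption and are skipped.
    """
    pairs = []
    for match in LINK.finditer(wikitext):
        prefix, target, anchor = match.group(1), match.group(2), match.group(3)
        if prefix:
            # For piped links the displayed anchor is what the reader sees.
            pairs.append((prefix, anchor if anchor else target))
    return pairs


if __name__ == "__main__":
    sample = "הוא גר ב[[ירושלים]] ליד [[תל אביב]]"
    for prefix, word in prefixed_links(sample):
        print(prefix, "+", word)
```

Each returned pair records one word whose internal boundary is known, which fits the abstract's description of a partially tagged corpus: only the linked words are annotated, and the rest of each sentence is left untagged.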

Original language: English
Title of host publication: Wikipedia and Artificial Intelligence
Subtitle of host publication: An Evolving Synergy - Papers from the 2008 AAAI Workshop
Number of pages: 3
State: Published - 1 Dec 2008
Event: 2008 AAAI Workshop - Chicago, IL, United States
Duration: 13 Jul 2008 - 13 Jul 2008

Publication series

Name: AAAI Workshop - Technical Report


Conference: 2008 AAAI Workshop
Country/Territory: United States
City: Chicago, IL

ASJC Scopus subject areas

  • Engineering (all)

