An HMM Approach to Vowel Restoration in Arabic and Hebrew

Research output: Contribution to journal › Conference article › peer-review

58 Scopus citations

Abstract

Semitic languages pose a problem for Natural Language Processing because most vowels are omitted from written prose, resulting in considerable ambiguity at the word level. However, while reading text, native speakers can generally vocalize each word based on their familiarity with the lexicon and the context of the word. Previous methods for vowel restoration that involved morphological analysis concentrated on a single language and relied on a parsed corpus, which is difficult to create for many Semitic languages. We show that Hidden Markov Models are a useful tool for the task of vowel restoration in Semitic languages. Our technique is simple to implement, does not require any language-specific knowledge to be embedded in the model, and generalizes well to both Hebrew and Arabic. Using publicly available versions of the Bible and the Qur'an as corpora, we achieve a success rate of 86% for restoring the exact vowel pattern in Arabic and 81% in Hebrew. For Hebrew, we also report an 87% success rate for restoring the correct phonetic value of the words.
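The abstract does not spell out the model, so the following is only a minimal sketch of one way an HMM can be applied to this task, not the authors' implementation. It assumes that hidden states are vocalized words, that observations are their vowel-stripped skeletons, that transitions come from add-alpha smoothed bigram counts, and that decoding uses Viterbi. The toy Latin-transliterated corpus, the vowel set, and the names `strip_vowels`, `train`, and `restore` are all hypothetical and chosen for illustration.

```python
import math
from collections import defaultdict

VOWELS = set("aeiou")  # assumption: toy Latin transliteration, not real diacritics


def strip_vowels(word):
    """Map a vocalized word to its consonant skeleton (the observed form)."""
    return "".join(ch for ch in word if ch not in VOWELS)


def train(sentences):
    """Collect candidate vocalizations per skeleton plus unigram/bigram counts."""
    candidates = defaultdict(set)
    unigram = defaultdict(int)
    bigram = defaultdict(int)
    for sent in sentences:
        prev = "<s>"
        unigram[prev] += 1
        for w in sent:
            candidates[strip_vowels(w)].add(w)
            unigram[w] += 1
            bigram[(prev, w)] += 1
            prev = w
    return candidates, unigram, bigram


def log_prob(prev, w, unigram, bigram, alpha=0.1):
    """Add-alpha smoothed log P(w | prev); alpha is an arbitrary choice."""
    vocab = len(unigram)
    num = bigram.get((prev, w), 0) + alpha
    den = unigram.get(prev, 0) + alpha * vocab
    return math.log(num / den)


def restore(skeletons, model):
    """Viterbi decoding: pick the most likely vocalized word sequence."""
    candidates, unigram, bigram = model
    best = {"<s>": (0.0, [])}  # state -> (log score, best path ending there)
    for sk in skeletons:
        # An unseen skeleton is passed through unchanged.
        states = candidates.get(sk) or {sk}
        new_best = {}
        for w in sorted(states):
            score, path = max(
                ((s + log_prob(p, w, unigram, bigram), pth)
                 for p, (s, pth) in best.items()),
                key=lambda t: t[0],
            )
            new_best[w] = (score, path + [w])
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]


# Hypothetical three-sentence training corpus of vocalized text.
corpus = [
    ["huwa", "kataba", "risala"],
    ["hadhihi", "kutub", "jadida"],
    ["huwa", "kataba", "kitab"],
]
model = train(corpus)

print(restore(["hw", "ktb", "rsl"], model))    # ['huwa', 'kataba', 'risala']
print(restore(["hdhh", "ktb", "jdd"], model))  # ['hadhihi', 'kutub', 'jadida']
```

In this sketch the skeleton `ktb` has three candidate vocalizations, and the neighboring words decide which one is chosen. That mirrors the abstract's point that context resolves word-level ambiguity without language-specific rules.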

Original language: English
Journal: Proceedings of the Annual Meeting of the Association for Computational Linguistics
State: Published - 1 Jan 2002
Externally published: Yes
Event: ACL 2002 Workshop on Computational Approaches to Semitic Languages, SEMITIC@ACL 2002 - Philadelphia, United States
Duration: 11 Jul 2002 → …

ASJC Scopus subject areas

  • Computer Science Applications
  • Linguistics and Language
  • Language and Linguistics
