Diacritization for the World's Scripts

Project Details

Description

In many writing systems, full knowledge about word identity and pronunciation is encoded by diacritic marks attached or adjacent to the core characters. Often, these marks are dropped, either deliberately by the writer or during automatic text processing, leaving technologies like translation services and text-to-speech engines without information necessary for their operation. We propose to systematically quantify the role of diacritics in various types of scripts and languages, and to evaluate a wide array of systems proposed to date whose goal is to restore diacritics to text lacking them. We will examine methods that make use of linguistic knowledge, methods that rely mostly on vast computational power operating over large corpora, and a novel method inspired by recent advances in text-free processing of language through its rendered visual form. We will cover a range of language families and assess the importance of preserving or restoring diacritics across languages in multilingual applications.

Status: Active
Effective start/end date: 1/01/22 → …

Funding

  • United States-Israel Binational Science Foundation (BSF)
