Døñ’t Tòůčḥ Mý Dïąçŗıtīcs

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The common practice of preprocessing text before feeding it into NLP models introduces many decision points which have unintended consequences on model performance. In this opinion piece, we focus on the handling of diacritics in texts originating in many languages and scripts. We demonstrate, through several case studies, the adverse effects of inconsistent encoding of diacritized characters and of removing diacritics altogether. We call on the community to adopt simple but necessary steps across all models and toolkits in order to improve handling of diacritized text and, by extension, increase equity in multilingual NLP.
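The two failure modes named in the abstract, inconsistent encoding and diacritic removal, can be illustrated with Python's standard `unicodedata` module. This is a minimal sketch and not the authors' code: the helper name `strip_diacritics` and the sample strings are assumptions chosen for illustration.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Hypothetical helper: drop combining marks after NFD decomposition (lossy)."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The same visible character, encoded two different ways.
composed = "\u00e9"      # precomposed LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Without normalization the strings compare unequal, so a vocabulary
# lookup or tokenizer may treat them as different inputs.
same_raw = composed == decomposed

# Normalizing both sides to one form (NFC here) makes them comparable.
same_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

# Stripping diacritics collapses distinct words into a single string.
stripped = strip_diacritics("r\u00e9sum\u00e9")

print(same_raw, same_nfc, stripped)
```

Applying a single normalization form consistently keeps the diacritics intact while removing the encoding ambiguity, whereas stripping discards information that cannot be recovered downstream.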

Original language: English
Title of host publication: Short Papers
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Publisher: Association for Computational Linguistics (ACL)
Pages: 285-291
Number of pages: 7
ISBN (Electronic): 9798891761902
State: Published - 1 Jan 2025
Event: 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2025 - Hybrid, Albuquerque, United States
Duration: 29 Apr 2025 to 4 May 2025

Publication series

Name: Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies: Long Papers, NAACL-HLT 2025
Volume: 2

Conference

Conference: 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2025
Country/Territory: United States
City: Hybrid, Albuquerque
Period: 29/04/25 to 4/05/25

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
  • Information Systems
  • Software
