Multilingual and Explainable Text Detoxification with Parallel Corpora

  • Daryna Dementieva
  • Nikolay Babakov
  • Amit Ronen
  • Abinew Ali Ayele
  • Naquee Rizwan
  • Florian Schneider
  • Xintong Wang
  • Seid Muhie Yimam
  • Daniil Moskovskiy
  • Elisei Stakovskii
  • Eran Kaufman
  • Ashraf Elnagar
  • Animesh Mukherjee
  • Alexander Panchenko

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-reviewed

6 Scopus citations

Abstract

Even with various regulations in place across countries and social media platforms (Government of India, 2021; European Parliament and Council of the European Union, 2022), digital abusive speech remains a significant issue. One potential way to address this challenge is automatic text detoxification, a text style transfer (TST) approach that transforms toxic language into a more neutral or non-toxic form. To date, the availability of parallel corpora for the text detoxification task (Logacheva et al., 2022; Atwell et al., 2022; Dementieva et al., 2024a) has proven to be crucial for state-of-the-art approaches. With this work, we extend the parallel text detoxification corpus to new languages (German, Chinese, Arabic, Hindi, and Amharic) and test TST baselines in an extensive multilingual setup. Next, we conduct a first-of-its-kind automated, explainable analysis of the descriptive features of both toxic and non-toxic sentences, diving deeply into the nuances, similarities, and differences of toxicity and detoxification across 9 languages. Finally, based on the obtained insights, we experiment with a novel text detoxification method inspired by the Chain-of-Thoughts reasoning approach, enhancing the prompting process through clustering on relevant descriptive attributes.

Original language: English
Title of host publication: Main Conference
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Publisher: Association for Computational Linguistics (ACL)
Pages: 7998-8025
Number of pages: 28
ISBN (Electronic): 9798891761964
State: Published - 1 Jan 2025
Externally published: Yes
Event: 31st International Conference on Computational Linguistics, COLING 2025 - Abu Dhabi, United Arab Emirates
Duration: 19 Jan 2025 - 24 Jan 2025

Publication series

Name: Proceedings - International Conference on Computational Linguistics, COLING
ISSN (Print): 2951-2093

Conference

Conference: 31st International Conference on Computational Linguistics, COLING 2025
Country/Territory: United Arab Emirates
City: Abu Dhabi
Period: 19/01/25 - 24/01/25

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science Applications
  • Computational Theory and Mathematics
