Hebrew offensive language taxonomy and dataset

Chaya Liebeskind, Natalia Vanetik, Marina Litvak

Research output: Contribution to journal › Article › peer-review

Abstract

This paper introduces a streamlined taxonomy for categorizing offensive language in Hebrew, addressing a gap in the literature that has, until now, largely focused on Indo-European languages. Our taxonomy divides offensive language into seven levels (six explicit and one implicit). We based our work on the simplified offensive language (SOL) taxonomy introduced in Lewandowska-Tomaszczyk et al. (2021a), hoping that our adaptation of SOL to Hebrew will reflect the language's unique linguistic and cultural nuances. The study involves both linguistic and cultural analysis beyond Natural Language Processing (NLP). We employed manual linguistic analysis to understand the nuances of offensive language in Hebrew. An accompanying dataset, gathered from Twitter and manually curated by human annotators, is described in detail. This dataset was constructed both to validate the taxonomy and to serve as a foundation for future research on offensive language detection and analysis in Hebrew. Preliminary analysis of the dataset reveals intriguing patterns and distributions, underscoring the complexity and specificity of offensive expressions in the Hebrew language. The aim of our work is to capture the complexity and specificity of offensive expressions in Hebrew beyond what automated NLP methods alone can provide. Our findings highlight the importance of considering linguistic and cultural variations when researching and addressing abusive language online. We believe that our streamlined taxonomy and associated dataset will be crucial in improving research in Hebrew-language sociocultural studies, natural language processing, and offensive language detection. Our study also makes a substantial contribution to the study of low-resource languages and can be used as a model for future research on other languages.

Original language: English
Pages (from-to): 325-351
Number of pages: 27
Journal: Lodz Papers in Pragmatics
Volume: 19
Issue number: 2
DOIs
State: Published - 1 Dec 2023
Externally published: Yes

Keywords

  • Hebrew offensive language dataset
  • low-resource languages
  • offensive language
  • taxonomy

ASJC Scopus subject areas

  • Language and Linguistics
  • Communication
  • Linguistics and Language
