HARALD: Augmenting Hate Speech Data Sets with Real Data

Tal Ilan, Dan Vilenchik

    Research output: Contribution to conference › Paper › peer-review

    2 Scopus citations

    Abstract

    Successful hate speech detection hinges on the availability of rich and varied labeled data, which is hard to obtain. In this work, we present a new approach to data augmentation that takes as input real unlabelled data, carefully selected from online platforms where invited hate speech is abundant. We show that by automatically harvesting and processing this data, one can augment existing manually labeled datasets and improve the performance of hate speech classification models. We observed an improvement in F1-score ranging from 2.7% to 9.5%, depending on the task (in- or cross-domain) and the model used.

    Original language: English
    Pages: 2241-2248
    Number of pages: 8
    State: Published - 1 Jan 2022
    Event: 2022 Findings of the Association for Computational Linguistics: EMNLP 2022 - Abu Dhabi, United Arab Emirates
    Duration: 7 Dec 2022 → 11 Dec 2022

    Conference

    Conference: 2022 Findings of the Association for Computational Linguistics: EMNLP 2022
    Country/Territory: United Arab Emirates
    City: Abu Dhabi
    Period: 7/12/22 → 11/12/22

    ASJC Scopus subject areas

    • Computational Theory and Mathematics
    • Computer Science Applications
    • Information Systems
