TY - GEN
T1 - HateVersarial: Adversarial Attack Against Hate Speech Detection
T2 - 30th ACM Conference on User Modeling, Adaptation and Personalization, UMAP2022
AU - Grolman, Edita
AU - Binyamini, Hodaya
AU - Shabtai, Asaf
AU - Elovici, Yuval
AU - Morikawa, Ikuya
AU - Shimizu, Toshiya
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/4/7
Y1 - 2022/4/7
N2 - Machine learning (ML) models are commonly used to detect hate speech, which is considered one of the main challenges of online social networks. However, ML models have been shown to be vulnerable to well-crafted input samples referred to as adversarial examples. In this paper, we present an adversarial attack against hate speech detection models and explore the attack's ability to: (1) prevent the detection of a hateful user, which should result in the termination of the user's account, and (2) cause a normal user to be classified as hateful, which may lead to the termination of a legitimate user's account. The attack targets ML models trained on tabular, heterogeneous datasets (such as those used for hate speech detection) and attempts to determine the minimal number of the most influential mutable features that must be altered in order to create a successful adversarial example. To demonstrate and evaluate the attack, we used the open and publicly available "Hateful Users on Twitter" dataset. We show that under a black-box assumption (i.e., the attacker has no knowledge of the attacked model), the attack has a 75% success rate, whereas under a white-box assumption (i.e., the attacker has full knowledge of the attacked model), it has an 88% success rate.
AB - Machine learning (ML) models are commonly used to detect hate speech, which is considered one of the main challenges of online social networks. However, ML models have been shown to be vulnerable to well-crafted input samples referred to as adversarial examples. In this paper, we present an adversarial attack against hate speech detection models and explore the attack's ability to: (1) prevent the detection of a hateful user, which should result in the termination of the user's account, and (2) cause a normal user to be classified as hateful, which may lead to the termination of a legitimate user's account. The attack targets ML models trained on tabular, heterogeneous datasets (such as those used for hate speech detection) and attempts to determine the minimal number of the most influential mutable features that must be altered in order to create a successful adversarial example. To demonstrate and evaluate the attack, we used the open and publicly available "Hateful Users on Twitter" dataset. We show that under a black-box assumption (i.e., the attacker has no knowledge of the attacked model), the attack has a 75% success rate, whereas under a white-box assumption (i.e., the attacker has full knowledge of the attacked model), it has an 88% success rate.
KW - adversarial attack
KW - hate speech
KW - social media
KW - Twitter
UR - http://www.scopus.com/inward/record.url?scp=85135175816&partnerID=8YFLogxK
U2 - 10.1145/3503252.3531309
DO - 10.1145/3503252.3531309
M3 - Conference contribution
AN - SCOPUS:85135175816
T3 - UMAP2022 - Proceedings of the 30th ACM Conference on User Modeling, Adaptation and Personalization
SP - 143
EP - 152
BT - UMAP2022 - Proceedings of the 30th ACM Conference on User Modeling, Adaptation and Personalization
PB - Association for Computing Machinery, Inc
Y2 - 4 July 2022 through 7 July 2022
ER -