Exploring Straightforward Methods for Automatic Conversational Red-Teaming

  • George Kour
  • Naama Zwerdling
  • Marcel Zalmanovici
  • Ateret Anaby-Tavor
  • Ora Nova Fandina
  • Eitan Farchi

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

Large language models (LLMs) are increasingly used in business dialogue systems, but they also pose security and ethical risks. Multi-turn conversations, in which context influences the model's behavior, can be exploited to generate undesired responses. In this paper, we investigate the use of off-the-shelf LLMs in conversational red-teaming settings, where an attacker LLM attempts to elicit undesired outputs from a target LLM. Our experiments address critical questions and offer valuable insights regarding the effectiveness of using LLMs as automated red-teamers, shedding light on key strategies and usage approaches that significantly impact their performance. Our findings demonstrate that off-the-shelf models can serve as effective red-teamers, capable of adapting their attack strategies based on prior attempts. Allowing these models to freely steer conversations and conceal their malicious intent further increases attack success. However, their effectiveness decreases as the alignment of the target model improves.

Original language: English
Title of host publication: Industry Track
Editors: Weizhu Chen, Yi Yang, Mohammad Kachuee, Xue-Yong Fu
Publisher: Association for Computational Linguistics (ACL)
Pages: 112-128
Number of pages: 17
ISBN (Electronic): 9798891761940
DOIs
State: Published - 1 Jan 2025
Externally published: Yes
Event: 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2025 - Hybrid, Albuquerque, United States
Duration: 29 Apr 2025 to 4 May 2025

Publication series

Name: Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies: Long Papers, NAACL-HLT 2025
Volume: 3

Conference

Conference: 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2025
Country/Territory: United States
City: Hybrid, Albuquerque
Period: 29/04/25 to 4/05/25

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
  • Information Systems
  • Software
