Jailbreak Attack Initializations as Extractors of Compliance Directions

  • Amit LeVi
  • Rom Himelstein
  • Yaniv Nemcovsky
  • Avi Mendelson
  • Chaim Baskin

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Safety-aligned LLMs respond to prompts with either compliance or refusal, each corresponding to a distinct direction in the model’s activation space. Recent studies have shown that initializing attacks via self-transfer from other prompts significantly enhances their performance. However, the underlying mechanisms of these initializations remain unclear, and attacks rely on arbitrary or hand-picked initializations. This work shows that gradient-based jailbreak attacks, and the initializations derived from them, gradually converge to a single compliance direction that suppresses refusal, thereby enabling an efficient transition from refusal to compliance. Based on this insight, we propose CRI, an initialization framework that aims to project unseen prompts further along compliance directions. We demonstrate our approach on multiple attacks, models, and datasets, achieving an increased attack success rate (ASR) and reduced computational overhead, highlighting the fragility of safety-aligned LLMs.
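The abstract's notion of a compliance direction can be illustrated with a common technique from the activation-steering literature: estimating a direction as the difference of mean hidden states between compliant and refusing responses, then scoring new prompts by their projection onto it. This is a minimal sketch under that assumption, not the paper's exact method; the function names and synthetic activations are illustrative only.

```python
import numpy as np

def compliance_direction(compliant_acts, refusal_acts):
    """Estimate a compliance direction as the difference of mean
    activations (a standard difference-of-means heuristic).
    compliant_acts, refusal_acts: (n, d) arrays of hidden states."""
    direction = compliant_acts.mean(axis=0) - refusal_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def projection_score(activation, direction):
    """Scalar projection of a prompt's activation onto the direction;
    larger values mean the activation lies further along compliance."""
    return float(activation @ direction)

# Toy example with synthetic activations (hypothetical dimensionality d = 8).
rng = np.random.default_rng(0)
d = 8
compliant = rng.normal(0.5, 1.0, size=(32, d))   # stand-in compliant states
refusal = rng.normal(-0.5, 1.0, size=(32, d))    # stand-in refusal states

v = compliance_direction(compliant, refusal)
# An initialization framework like CRI could rank or seed attack
# initializations by how far a prompt's activation projects along v.
score = projection_score(compliant.mean(axis=0), v)
```

Under this sketch, moving an unseen prompt's activation further along `v` corresponds to the refusal-to-compliance transition the abstract describes.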
Original language: English
Title of host publication: Findings of the Association for Computational Linguistics: EMNLP 2025
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Place of Publication: Suzhou, China
Publisher: Association for Computational Linguistics
Pages: 6672-6705
Number of pages: 34
ISBN (Print): 9798891763357
DOIs
State: Published - 1 Nov 2025
