We intuitively know that a truly robust security defense must be tested against attackers who understand the defense mechanism. Traditional computer security treats this as Kerckhoffs's principle: cryptographic systems like RSA assume the attacker knows everything except the key, and network defenses are expected to withstand red teams who study the architecture before attempting intrusion. Yet when evaluating LLM defenses against jailbreaks and prompt injections, the field has largely abandoned this principle.
LLM defense papers typically report near-zero attack success rates by testing against static datasets of known attacks or by applying generic optimization methods like GCG without modification. Both approaches operate under tight computational budgets and, more importantly, ignore the specific mechanism each defense employs. The paper demonstrates that when attackers actually adapt their methods to each defense's design, success rates jump above 90% for most systems.
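To make the contrast concrete, here is a minimal sketch of the static evaluation style these papers rely on. Every name here (`defense`, `judge`, the attack list) is a hypothetical placeholder, not any paper's actual harness:

```python
from typing import Callable, Iterable

def static_eval(defense: Callable[[str], str],
                judge: Callable[[str, str], bool],
                known_attacks: Iterable[str]) -> float:
    """Replay a fixed set of known attack prompts against a defended model
    and report the attack success rate (ASR). Nothing here adapts to the
    defense: the same prompts are scored no matter how the defense works."""
    attacks = list(known_attacks)
    successes = sum(judge(prompt, defense(prompt)) for prompt in attacks)
    return successes / len(attacks)
```

A defense can drive this number to zero simply by blocking the prompts in the fixed dataset, which says nothing about how it fares against an attacker who probes its specific mechanism.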
Likewise, a different paper observes that current AI Control protocols break under adaptive attacks that exploit knowledge of the specific protocol. We should always assume that security through obscurity is insufficient. It fails for LLM defenses in particular because the defense mechanisms themselves will eventually appear in internet-scale training data: in a world where the boundary between train and test sets blurs, we can't design a defense that is guaranteed to remain unknown to the model being defended against.
The paper unifies these attack methods under a common framework: an iterative loop that proposes a candidate attack, scores it against the defended system, and uses the feedback to refine the next attempt. Gradient-based methods propose token substitutions using embedding gradients. Reinforcement learning approaches treat prompt generation as an environment to optimize. Search-based methods use LLMs to suggest mutations of successful attacks. Human red-teamers iterate through creative trial and error. All four instantiations of this loop achieve success rates above 90% against defenses that claimed near-zero vulnerability.
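A minimal sketch of that shared loop, assuming a black-box `defense`, a continuous `score` signal, and a pluggable `propose` step; every name is illustrative rather than the paper's actual interface. Swapping `propose` for a gradient step, an RL policy, an LLM mutator, or a human is what distinguishes the four attack families:

```python
from typing import Callable, List, Tuple

def adaptive_attack(defense: Callable[[str], str],
                    score: Callable[[str, str], float],          # higher = closer to success
                    propose: Callable[[str, float], List[str]],  # defense-aware mutation step
                    seed_prompt: str,
                    budget: int = 500,
                    threshold: float = 0.9) -> Tuple[str, float]:
    """Generic propose-score-update loop. Gradient-based, RL, search-based,
    and human red-teaming attacks all instantiate `propose` differently
    while sharing this outer structure."""
    best_prompt = seed_prompt
    best_score = score(best_prompt, defense(best_prompt))
    for _ in range(budget):
        if best_score >= threshold:
            break  # the current prompt already defeats the defended model
        for candidate in propose(best_prompt, best_score):
            s = score(candidate, defense(candidate))
            if s > best_score:  # greedy update toward the strongest candidate
                best_prompt, best_score = candidate, s
    return best_prompt, best_score
```

The crucial difference from the static harness above is the feedback edge: each query's score shapes the next proposal, so the attack gradually learns the specific failure modes of the defense it is facing.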