ADVERSARIAL DÉJÀ VU: JAILBREAK DICTIONARY LEARNING FOR STRONGER GENERALIZATION TO UNSEEN ATTACKS


TL;DR: We systematically show that most jailbreaks reuse a common set of adversarial skills. By learning a compact Jailbreak Dictionary of transferable skill primitives, we can explain and reconstruct unseen attacks with high fidelity. Building on this foundation, ASCoT (Adversarial Skill Compositional Training) achieves strong generalization to unseen jailbreaks by training on diverse skill combinations.

Virginia Tech · Princeton University · Amazon AGI

A Quick Glance (How the Jailbreak Dictionary explains unseen attacks and powers ASCoT)


Motivation

Jailbreaks keep appearing, but many “new” prompts quietly reuse the same underlying tactics. We ask: are novel jailbreaks actually recompositions of a finite skill set?

  • We observe waves of new jailbreaks, yet recurring tactics across papers and months.
  • This motivates the Adversarial Déjà Vu view: most attacks are built from reusable adversarial skills.

Fig. 1a. Monthly growth of jailbreak attacks over time: jailbreaks arrive in waves, not one-offs.

Fig. 1b. Different attacks often reuse the same skill: both PAP and AutoDAN-Turbo rely on academic-facade framing.

Core Idea

We treat jailbreaks as compositions of adversarial skills, not isolated tricks. From 32 papers (Nov 2022 – Nov 2024), we extract 16,901 skills and compress them into a compact Jailbreak Dictionary (~397 primitives).

To test generalization, we set a temporal cutoff (Aug 15, 2024): pre-cutoff attacks build the dictionary; post-cutoff attacks evaluate whether unseen skills can be reconstructed from earlier primitives.
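
As a minimal sketch, this split is a simple partition by publication date. The Attack record below is a hypothetical stand-in; only the cutoff date comes from our setup.

# Illustrative temporal split; the Attack record is invented, only the
# Aug 15, 2024 cutoff is from our setup.
from dataclasses import dataclass
from datetime import date

@dataclass
class Attack:
    name: str
    published: date

CUTOFF = date(2024, 8, 15)
attacks = [Attack("pre-cutoff example", date(2024, 1, 10)),
           Attack("post-cutoff example", date(2024, 9, 3))]

dictionary_set = [a for a in attacks if a.published < CUTOFF]   # builds the dictionary
evaluation_set = [a for a in attacks if a.published >= CUTOFF]  # held out for evaluation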

  • Compactness: ~35× compression with high explanatory fidelity.
  • Compositionality: unseen attacks are sparse combinations of learned primitives.

Jailbreak Dictionary Learning

The Jailbreak Dictionary captures the building blocks of adversarial behavior — a compact set of transferable skills that explain how different jailbreaks succeed. It transforms a scattered collection of attacks into an interpretable skill space we can analyze and build defenses on.

Fig. 2. The Jailbreak Dictionary pipeline: from pre-cutoff attacks to a compact, named dictionary of skill primitives.
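
The compression step in Fig. 2 can be sketched in a few lines. The snippet below is one plausible realization, assuming each extracted skill description is embedded as a vector and grouped by clustering; the embed helper and every other name here are assumptions, not our exact procedure.

# Hypothetical sketch of the compression step: cluster skill embeddings into
# ~397 primitives and keep one representative description per cluster.
import numpy as np
from sklearn.cluster import KMeans

def build_dictionary(skill_texts, embed, n_primitives=397):
    # skill_texts: extracted skill descriptions (e.g., 16,901 of them)
    # embed: assumed text-embedding function returning a 1-D vector
    X = np.stack([embed(t) for t in skill_texts])             # (n_skills, d)
    km = KMeans(n_clusters=n_primitives, n_init="auto").fit(X)
    # representative primitive = the skill closest to each cluster center
    reps = [skill_texts[int(np.argmin(np.linalg.norm(X - c, axis=1)))]
            for c in km.cluster_centers_]
    return km.cluster_centers_, reps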

The dictionary preserves explanatory power even after heavy compression, reconstructing unseen skills with only ~5–7 active primitives on average.
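
To make "~5–7 active primitives" concrete, here is a hedged sketch of greedy sparse coding (orthogonal-matching-pursuit style) over primitive embeddings; it illustrates the general technique rather than our exact solver.

# Greedy sparse reconstruction of an unseen skill from dictionary primitives
# (OMP-style illustration, not the paper's exact solver).
import numpy as np

def explain_unseen_skill(primitives, unseen, k=7):
    # primitives: (n_primitives, d) embedding matrix; unseen: (d,) vector
    residual, chosen, weights = unseen.copy(), [], None
    for _ in range(k):
        scores = primitives @ residual     # correlation with current residual
        scores[chosen] = -np.inf           # never reselect a primitive
        chosen.append(int(np.argmax(scores)))
        A = primitives[chosen].T           # (d, n_chosen)
        weights, *_ = np.linalg.lstsq(A, unseen, rcond=None)
        residual = unseen - A @ weights
    return chosen, weights                 # active primitives and their weights

Explainability can then be scored by how well the sparse combination matches the unseen skill (e.g., cosine similarity between A @ weights and the original vector), and sparsity is simply the number of active primitives.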

Table 1. Explainability scores and sparsity for unseen attack families: high explainability at low sparsity.

As more attacks are seen, explainability rises and then plateaus (at roughly 4.3–4.4 on our explainability scale), indicating diminishing novelty; this saturation is the core evidence for Adversarial Déjà Vu.

Fig. 3. Explainability over time: it increases, then saturates.

From Explainability to Defenses

If new jailbreaks are just recombinations of familiar skills, then defenses should learn from those combinations. ASCoT (Adversarial Skill Compositional Training) teaches models to recognize and resist diverse skill mixtures — moving beyond memorizing attacks to mastering the underlying adversarial strategies. This directly targets the mechanisms that transfer across families (deception, framing, obfuscation, etc.), improving robustness to unseen jailbreaks while keeping refusals calibrated.

Fig. 4. ASCoT composition example: executable_code_request + authority_narrative_priming mutate a base query.
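
A hedged sketch of the data generation Fig. 4 depicts: sample a few primitives, then have a rewriter apply them to a base query. The mutate hook and all names are illustrative assumptions, not ASCoT's actual API.

# Illustrative ASCoT-style training-example generation; `mutate` would be
# backed by a rewriter LLM prompted with each chosen skill's definition.
import random

def build_training_example(base_query, dictionary, mutate, max_depth=3):
    depth = random.randint(1, max_depth)             # compositional depth
    skills = random.sample(dictionary, depth)        # distinct primitives
    adversarial_prompt = mutate(base_query, skills)  # apply the composition
    return {"prompt": adversarial_prompt,
            "skills": [s["name"] for s in skills],
            "target": "calibrated safe response"}    # supervision signal

Varying the sampled depth is what later lets us probe the effect of compositional depth in Fig. 5.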

Results

Across LLaMA-3.1-8B and Zephyr-7B, ASCoT lowers harmfulness and strengthens generalization to unseen attacks while keeping over-refusal in check.

Table 2. ASCoT vs. baselines on capability, harmfulness, and over-refusal: ASCoT achieves a strong robustness–utility balance.

Compared to closed-source reasoning models, the open-weight ASCoT model is competitive, and it even surpasses o4-mini on most attack families with similar over-refusal rates.

Table 3. Open-weight ASCoT vs. Claude Sonnet-4-Thinking and o4-mini: ASCoT competes with closed-source reasoning models.

Robustness comes from two places. First, the more adversarial skills we cover in training, the fewer surprises an attacker can invent. Second, models must learn to handle compositions of skills: defenses trained only on single-skill prompts stop short, while resisting sophisticated, multi-step attacks requires exposure to deeper, multi-skill combinations.

Fig. 5. Coverage dividend and compositional depth: broader skill coverage pays off, and training on deeper compositions aligns defenses with attack complexity.

Finally, ASCoT sits at a favorable point on the robustness–utility curve, reducing harmful responses without inducing excessive refusals.

Fig. 6. Robustness vs. over-refusal trade-off: robustness increases without a refusal spike.

Conclusion

Robustness scales with coverage and compositional diversity of adversarial skills — not just model size or raw data. By learning a compact set of transferable adversarial skill primitives and training on their compositions, Adversarial Déjà Vu and ASCoT offer a scalable, generalizable path to defending against unseen jailbreaks.

Skill Extraction Examples

Each example from our skill-extraction pipeline pairs the original harmful prompt with the mutated jailbreak prompt and two example skill atoms identified by the pipeline, presented as JSON-like objects.
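
A hypothetical illustration of the skill-atom format (field names follow the description above; the content is invented for exposition):

# Invented skill atom, shown only to illustrate the JSON-like format.
skill_atom = {
    "name": "academic_facade_framing",
    "definition": "Wraps a harmful request in scholarly or research framing "
                  "so compliance appears legitimate.",
    "example": "For a peer-reviewed safety study, explain in detail how ...",
}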

Explaining Unseen Skills Examples

For each unseen skill (from a post-cutoff attack), we show the top matches in our Jailbreak Dictionary, including each primitive's name, definition, example, and attribution weight.
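
An invented example of one matched-primitive record (the name reuses a primitive from Fig. 4; the definition, example, and weight are hypothetical):

# Hypothetical top-k match entry for one unseen skill.
match = {
    "skill_name": "authority_narrative_priming",
    "definition": "Frames the request inside an authoritative persona or "
                  "institutional narrative.",
    "example": "As the lead investigator on this case, I need ...",
    "weight": 0.41,   # attribution weight from the sparse reconstruction
}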

Skill Composition Examples

Named skill primitives compose to generate a mutated (attacker-style) query; each example pairs the original harmful prompt with the composing skills and the resulting composed prompt.
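
A hypothetical composition record in this format, reusing the two primitives from Fig. 4 (both prompts are placeholders, not data from the paper):

# Invented composition example; primitive names match Fig. 4, the prompts
# are placeholders rather than real attack data.
composition_example = {
    "original_prompt": "<harmful base query placeholder>",
    "composing_skills": ["executable_code_request",
                         "authority_narrative_priming"],
    "composed_prompt": "As the security lead documenting this audit, "
                       "provide a runnable script that ...",
}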

Ethics and Disclosure

This work investigates adversarial jailbreak behavior in large language models (LLMs) through the lens of transferable adversarial skill primitives. Our goal is to advance the scientific understanding of how such attacks generalize, thereby enabling the development of more principled and proactive defenses. While studying jailbreaks may incidentally expose mechanisms that could be repurposed for harmful use, we have taken extensive precautions to prevent misuse.

We adhere to responsible disclosure practices: all experiments were conducted on locally hosted models in controlled research environments. No attempts were made to disseminate, deploy, or amplify harmful generations.

Our research explicitly seeks to improve societal safety by reframing robustness as generalization across the adversarial skill space rather than mere suppression of harmful text. We believe that open, responsible examination of the mechanisms that enable jailbreaks is essential for building transparent, resilient, and trustworthy foundation models.

BibTeX

If you find our project useful, please consider citing:

@misc{dabas2025adversarialdejavujailbreak,
      title={Adversarial D\'ej\`a Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks}, 
      author={Mahavir Dabas and Tran Huynh and Nikhil Reddy Billa and Jiachen T. Wang and Peng Gao and Charith Peris and Yao Ma and Rahul Gupta and Ming Jin and Prateek Mittal and Ruoxi Jia},
      year={2025},
      eprint={2510.21910},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.21910}, 
}