LLM Jailbreak Taxonomy — 40 patterns · 10 categories

THE TEN CATEGORIES

40 patterns mapped to 10 mechanism-grounded categories

CAT 01 · HIGH

Role-Play & Persona

Fictional framing turns instruction-following against safety training. A structural weakness (Wei 2023).

5 patterns · 43.0% ASR

CAT 02 · HIGH

Direct Prompt Injection

The model confuses authorized and adversarial instructions. Indirect prompt injection (Greshake 2023) remains mostly unmitigated.

5 patterns · 29.0% ASR

CAT 03 · MED-HIGH

GCG / Adversarial Suffix

Gradient-based suffix optimization (Zou 2023). Renamed from "Token Smuggling" in v4.0.1.

7 patterns · 34.3% ASR

CAT 04 · MED

Context Manipulation

Many-shot attacks (Anil 2024) scale monotonically with context window.

4 patterns · 28.1% ASR

CAT 05 · HIGH

Multi-Turn Deception

Largest benchmark gap: most benchmarks evaluate single-turn prompts only. Reported ASR: DRA 91.1% · FITD 94% · Crescendo (figure unverified).

4 patterns · 54.4% ASR

CAT 06 · MED

System Prompt Extraction

Low direct severity, but amplifies subsequent attacks across categories.

5 patterns · 30.0% ASR

CAT 07 · CRITICAL

LRM Autonomous Attacks

Reasoning models autonomously plan multi-turn jailbreaks. Hagendorff et al. (arXiv 2025, Nature Comms 2026): 97.14% across 9 models.

3 patterns · 93.3% ASR

CAT 08 · CRITICAL

Fuzzing-Based

Mutation engines vs guardrails. JBFuzz v1 — 99% ASR across 9 LLMs at ~60s/bypass.

3 patterns · 92.5% ASR

CAT 09 · HIGH

Multimodal Injection

Safety classifiers don't transfer across modalities. UltraBreak 2026 transfers across labs.

2 patterns · 36.3% ASR

CAT 10 · CRITICAL

Agentic Chain Exploitation

Tool-chain hijack + cross-session memory poisoning. PoisonedRAG 90% · MINJA 95% · Sleeper Memory: no documented defense.

2 patterns · 66.3% ASR
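
The ten cards above share one shape (id, name, severity, pattern count, ASR), so they can be held in a small record type and sanity-checked. The sketch below is illustrative only: the `Category` class and both helper names are hypothetical, and the pattern-weighted mean it computes is a derived figure, not one reported by the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Category:
    cat_id: int
    name: str
    severity: str
    patterns: int
    asr_pct: float  # attack success rate as printed on the card


# Values transcribed from the ten category cards.
CATEGORIES = [
    Category(1, "Role-Play & Persona", "HIGH", 5, 43.0),
    Category(2, "Direct Prompt Injection", "HIGH", 5, 29.0),
    Category(3, "GCG / Adversarial Suffix", "MED-HIGH", 7, 34.3),
    Category(4, "Context Manipulation", "MED", 4, 28.1),
    Category(5, "Multi-Turn Deception", "HIGH", 4, 54.4),
    Category(6, "System Prompt Extraction", "MED", 5, 30.0),
    Category(7, "LRM Autonomous Attacks", "CRITICAL", 3, 93.3),
    Category(8, "Fuzzing-Based", "CRITICAL", 3, 92.5),
    Category(9, "Multimodal Injection", "HIGH", 2, 36.3),
    Category(10, "Agentic Chain Exploitation", "CRITICAL", 2, 66.3),
]


def total_patterns(cats):
    """Sum of per-category pattern counts (the header claims 40)."""
    return sum(c.patterns for c in cats)


def pattern_weighted_asr(cats):
    """Mean ASR weighted by pattern count; a derived summary, not a source figure."""
    return sum(c.patterns * c.asr_pct for c in cats) / total_patterns(cats)


if __name__ == "__main__":
    print("patterns:", total_patterns(CATEGORIES))
    print("pattern-weighted ASR (%):", round(pattern_weighted_asr(CATEGORIES), 2))
```

The pattern counts on the cards do add up to the 40 stated in the header. Weighting by pattern count is one choice among several; weighting by trial count or by model would give a different aggregate.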

RESULTS

8,000 bootstrap trials · 10 seeds · seeded reproducibility

BY MODEL — 95% CI

claude-opus-4-8 (Anthropic) · 19.65%

95% CI [17.25, 23.25] · σ 1.85

gpt-5.5 (OpenAI) · 41.48%

95% CI [39.50, 44.00] · σ 1.61

gemini-3.5-flash (Google) · 53.15%

95% CI [50.00, 56.75] · σ 1.89

deepseek-v4-pro (DeepSeek) · 73.65%

95% CI [71.50, 77.00] · σ 1.85
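
The source does not say which bootstrap variant produced these intervals or how the 10 seeds were combined. As a rough illustration, a seeded percentile bootstrap over binary trial outcomes could look like the sketch below. The function name, the percentile method, and the synthetic input are all assumptions. The by-model interval endpoints are all multiples of 0.25, which would be consistent with 400 outcomes per model (40 patterns × 10 seeds), but the sample size is not stated, so 400 is also an assumption.

```python
import random
import statistics


def bootstrap_asr_ci(outcomes, n_boot=8000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for an attack success rate, in percent.

    outcomes: sequence of 0/1 trial results (1 = attack succeeded).
    Returns the bootstrap mean, the (lower, upper) interval, and the
    standard deviation of the resampled rates.
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)  # fixed seed makes the run reproducible
    n = len(outcomes)
    # Resample with replacement n_boot times and record each resampled rate.
    rates = sorted(
        100.0 * sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot)
    )
    lo_idx = max(0, round(alpha / 2 * n_boot))
    hi_idx = min(n_boot - 1, round((1 - alpha / 2) * n_boot) - 1)
    return {
        "mean": statistics.fmean(rates),
        "ci": (rates[lo_idx], rates[hi_idx]),
        "sigma": statistics.pstdev(rates),
    }


if __name__ == "__main__":
    # Synthetic placeholder data (80 successes in 400 trials), NOT study data.
    synthetic = [1] * 80 + [0] * 320
    print(bootstrap_asr_ci(synthetic, seed=0))
```

With 400 binary outcomes every resampled rate is a multiple of 0.25%, so the interval endpoints from this sketch fall on the same grid as the ones listed above.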

BY CATEGORY — 95% CI

LRM Autonomous · 89.75%

[85.83, 93.33]

Fuzzing-Based · 91.42%

[84.17, 95.00]

Agentic Chain · 65.12%

[57.50, 71.25]

Multi-Turn Deception · 58.94%

[54.37, 65.62]

CORRECTIONS

claims refuted or unverified after direct WebFetch audit 2026-06-01

01

✗ PoisonedRAG: 97–99% ASR with 5 poisoned docs

✓ Actual: 90% ASR, verbatim from the arXiv:2402.07867 abstract

02

✗ Cat 3 labeled "Token-Level Smuggling"

✓ Renamed to "GCG / Adversarial Suffix": Zou 2023 is gradient-based, not encoding-based

03

✗ Constitutional Classifiers v1: "86% → 4.4% bypass"

✓ Verified: 0.38% production refusal increase · 23.7% inference overhead · 3,000+ hours of red teaming

04

✗ Liu DRA: USENIX Security 2025

✓ Actual: arXiv:2402.18104 (Feb 2024) · USENIX venue not publicly verifiable

05

✗ arXiv:2601.05504 cited as the MINJA paper

✓ Actual: Devarangadi Sunil et al., a defense paper that cites MINJA

06

✗ Hagendorff dated "2026"

✓ arXiv August 2025 · Nature Comms DOI 10.1038/s41467-026-69010-1 assigned
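
Corrections 04 and 06 both turn on dates. arXiv identifiers in the scheme used since 2007 begin with YYMM, so the month an identifier was assigned can be cross-checked offline from the ID itself. The sketch below shows that check; the function name is hypothetical, and it says nothing about journal venue, title, or authorship, which still need a manual lookup.

```python
import re

# New-style arXiv identifier: YYMM.NNNN or YYMM.NNNNN, optional version suffix.
_ARXIV_NEW = re.compile(
    r"^(?:arxiv:)?(\d{2})(\d{2})\.(\d{4,5})(v\d+)?$", re.IGNORECASE
)


def arxiv_assigned_month(identifier):
    """Return (year, month) encoded in a new-style arXiv identifier."""
    match = _ARXIV_NEW.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a new-style arXiv identifier: {identifier!r}")
    year = 2000 + int(match.group(1))
    month = int(match.group(2))
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in identifier: {identifier!r}")
    return year, month


if __name__ == "__main__":
    for ident in ("arXiv:2402.18104", "arXiv:2508.04039", "arXiv:2503.08990v1"):
        print(ident, arxiv_assigned_month(ident))
```

For the two corrected entries this gives February 2024 for arXiv:2402.18104 and August 2025 for arXiv:2508.04039, matching the dates stated in corrections 04 and 06.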

CITATIONS

every entry direct-WebFetch verified — see paper/references.bib

Cat 7 · Hagendorff, Derner, Oliver · Large Reasoning Models Are Autonomous Jailbreak Agents · arXiv:2508.04039 · Nature Comms 2026 · VERIFIED

Cat 8 · Gohil · JBFuzz: Jailbreaking LLMs Using Fuzzing · arXiv:2503.08990v1 · 99% across 9 LLMs · VERIFIED v1

Cat 5 · Liu et al. · Disguise and Reconstruction (DRA) · arXiv:2402.18104 · 91.1% ASR on GPT-4 · VERIFIED

Cat 5 · Weng et al. · Foot-in-the-Door Multi-Turn · arXiv:2502.19820 · 94% avg across 7 models · VERIFIED

Cat 10 · W. Zou et al. · PoisonedRAG · arXiv:2402.07867 · 90% with 5 poisoned docs (corrected) · VERIFIED

Cat 2 · Brodt, Feldman, Schneier, Nassi · The Promptware Kill Chain · arXiv:2601.09625 · 7-stage framework · VERIFIED

Cat 10 · Pulipaka et al. · Hidden in Memory: Sleeper Memory Poisoning · arXiv:2605.15338 · 99.8% on GPT-5.5 · VERIFIED

Cat 10 · Huang et al. · Blindfold: Embodied LLM Action-Level · arXiv:2603.01414 · 53% higher ASR, 6DoF arm · VERIFIED

Defense · Cunningham et al. · Constitutional Classifiers++ · arXiv:2601.04603 · 40× cost reduction, 0.05% production refusal · VERIFIED

Cat 5 · Russinovich et al. · Crescendo Multi-Turn · arXiv:2404.01833 · "100% ASR" claim not in abstract · UNVERIFIED

A MECHANISM-GROUNDED FRAMEWORK