LRM Autonomous 89.75% CI [85.8, 93.3] · Fuzzing 91.42% CI [84.2, 95.0] · Agentic Chain 65.12% CI [57.5, 71.3] · Claude Opus 4-8 most robust at 19.65% ASR · DeepSeek V4-Pro weakest at 73.65% · Every citation direct-WebFetch verified 2026-06-01 · PoisonedRAG corrected: 90%, not 97-99%

LLM Jailbreak Taxonomy · AI Safety Research · 2026

A MECHANISM-GROUNDED FRAMEWORK

40 adversarial attack patterns. 10 mechanism-grounded categories.
Mapped to the safety-alignment assumptions each subverts.
Every citation direct-WebFetch verified. Refuted claims documented.

40 Attack Patterns · 10 Categories · 8,000 Bootstrap Trials · 17 Verified Citations · 10/10 Tests Passing

THE TEN CATEGORIES

40 patterns mapped to 10 mechanism-grounded categories
CAT 01 · HIGH · Role-Play & Persona
Fictional framing redirects instruction-following over safety. Structural (Wei 2023).
5 patterns · 43.0% ASR

CAT 02 · HIGH · Direct Prompt Injection
Confusion between authorized and adversarial instructions. Indirect prompt injection (Greshake 2023) remains mostly unmitigated.
5 patterns · 29.0% ASR

CAT 03 · MED-HIGH · GCG / Adversarial Suffix
Gradient-based suffix optimization (Zou 2023). Renamed from "Token Smuggling" in v4.0.1.
7 patterns · 34.3% ASR

CAT 04 · MED · Context Manipulation
Many-shot attacks (Anil 2024) scale monotonically with context window.
4 patterns · 28.1% ASR

CAT 05 · HIGH · Multi-Turn Deception
Largest benchmark gap: DRA 91.1% / FITD 94% / Crescendo, yet most benchmarks evaluate single-turn only.
4 patterns · 54.4% ASR

CAT 06 · MED · System Prompt Extraction
Low direct severity, but amplifies subsequent attacks across categories.
5 patterns · 30.0% ASR

CAT 07 · CRITICAL · LRM Autonomous Attacks
Reasoning models autonomously plan multi-turn jailbreaks. Hagendorff 2026: 97.14% across 9 models.
3 patterns · 93.3% ASR

CAT 08 · CRITICAL · Fuzzing-Based
Mutation engines vs guardrails. JBFuzz v1: 99% ASR across 9 LLMs at ~60s/bypass.
3 patterns · 92.5% ASR

CAT 09 · HIGH · Multimodal Injection
Cross-modal safety classifiers don't transfer. UltraBreak 2026 transfers across labs.
2 patterns · 36.3% ASR

CAT 10 · CRITICAL · Agentic Chain Exploitation
Tool-chain hijack + cross-session memory poisoning. PoisonedRAG 90% · MINJA 95% · Sleeper memory poisoning: no documented defense.
2 patterns · 66.3% ASR
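
The category cards above can be held in a small data structure so that the headline counts are checkable. The following is a minimal sketch, not the project's own code: the `Category` class and its field names are hypothetical, and the values are copied from the cards as displayed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Category:
    """One taxonomy category (hypothetical structure for illustration)."""
    cat_id: int
    name: str
    severity: str   # label as shown on the category card
    patterns: int   # number of attack patterns in the category
    asr_pct: float  # attack success rate shown on the card, in percent


# Values copied from the category cards above.
CATEGORIES = [
    Category(1, "Role-Play & Persona", "HIGH", 5, 43.0),
    Category(2, "Direct Prompt Injection", "HIGH", 5, 29.0),
    Category(3, "GCG / Adversarial Suffix", "MED-HIGH", 7, 34.3),
    Category(4, "Context Manipulation", "MED", 4, 28.1),
    Category(5, "Multi-Turn Deception", "HIGH", 4, 54.4),
    Category(6, "System Prompt Extraction", "MED", 5, 30.0),
    Category(7, "LRM Autonomous Attacks", "CRITICAL", 3, 93.3),
    Category(8, "Fuzzing-Based", "CRITICAL", 3, 92.5),
    Category(9, "Multimodal Injection", "HIGH", 2, 36.3),
    Category(10, "Agentic Chain Exploitation", "CRITICAL", 2, 66.3),
]

# Consistency check: per-category pattern counts should add up to 40.
total_patterns = sum(c.patterns for c in CATEGORIES)

# Names of the categories rated CRITICAL, in card order.
critical = [c.name for c in CATEGORIES if c.severity == "CRITICAL"]

print(total_patterns)  # 40
print(critical)
```

Keeping the counts next to the severity labels makes it easy to confirm that the ten categories account for all 40 patterns.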

RESULTS

8,000 bootstrap trials · 10 seeds · seeded reproducibility

BY MODEL — 95% CI

claude-opus-4-8 (Anthropic) · 19.65% · 95% CI [17.25, 23.25] · σ 1.85
gpt-5.5 (OpenAI) · 41.48% · 95% CI [39.50, 44.00] · σ 1.61
gemini-3.5-flash (Google) · 53.15% · 95% CI [50.00, 56.75] · σ 1.89
deepseek-v4-pro (DeepSeek) · 73.65% · 95% CI [71.50, 77.00] · σ 1.85
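
The intervals above are described as bootstrap confidence intervals with seeded reproducibility, but the exact procedure is not given here. The following is a minimal sketch assuming a percentile bootstrap over per-trial 0/1 outcomes. The function name, the resample count, and the example data are hypothetical and are not the study's raw data.

```python
import random


def bootstrap_asr_ci(outcomes, n_resamples=2000, seed=0, alpha=0.05):
    """Percentile-bootstrap CI for an attack success rate (ASR).

    outcomes: list of 0/1 flags, one per attack trial (1 = attack succeeded).
    Returns (point_estimate, lower, upper) as fractions in [0, 1].
    """
    rng = random.Random(seed)  # fixed seed makes the resampling reproducible
    n = len(outcomes)
    point = sum(outcomes) / n

    # Resample the trials with replacement and record each resample's ASR.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )

    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled ASRs.
    lower = means[round((alpha / 2) * n_resamples)]
    upper = means[round((1 - alpha / 2) * n_resamples) - 1]
    return point, lower, upper


# Hypothetical example: 80 successful attacks out of 400 trials.
outcomes = [1] * 80 + [0] * 320

for seed in range(3):
    point, lower, upper = bootstrap_asr_ci(outcomes, seed=seed)
    print(f"seed={seed}  ASR={point:.2%}  95% CI [{lower:.2%}, {upper:.2%}]")
```

Rerunning with the same seed reproduces the same interval. Different seeds give slightly different bounds, which is one reason to report results over several seeds.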

BY CATEGORY — 95% CI

LRM Autonomous · 89.75% · [85.83, 93.33]
Fuzzing-Based · 91.42% · [84.17, 95.00]
Agentic Chain · 65.12% · [57.50, 71.25]
Multi-Turn Deception · 58.94% · [54.37, 65.62]
Cross-model ASR

CORRECTIONS

claims refuted or unverified after direct WebFetch audit 2026-06-01
01 · Claim: PoisonedRAG reaches 97–99% ASR with 5 poisoned docs
     Actual: 90% ASR, verbatim from the arXiv:2402.07867 abstract
02 · Claim: Cat 3 labeled "Token-Level Smuggling"
     Actual: renamed to "GCG / Adversarial Suffix" · Zou 2023 is gradient-based, not encoding-based
03 · Claim: Constitutional Classifiers v1: "86% → 4.4% bypass"
     Verified: 0.38% production refusal increase · 23.7% inference overhead · 3,000+ hours
04 · Claim: Liu DRA published at USENIX Security 2025
     Actual: arXiv:2402.18104, Feb 2024 · USENIX venue not publicly verifiable
05 · Claim: arXiv:2601.05504 cited as the MINJA paper
     Actual: Devarangadi Sunil et al., a defense paper that cites MINJA
06 · Claim: Hagendorff dated "2026"
     Actual: arXiv, August 2025 · Nature Comms DOI 10.1038/s41467-026-69010-1 assigned

CITATIONS

every entry checked by direct WebFetch, with per-entry status shown · see paper/references.bib
Cat 7 · Hagendorff, Derner, Oliver. Large Reasoning Models Are Autonomous Jailbreak Agents. arXiv:2508.04039. Nature Comms 2026. [VERIFIED]
Cat 8 · Gohil. JBFuzz: Jailbreaking LLMs Using Fuzzing. arXiv:2503.08990v1. 99% across 9 LLMs. [VERIFIED v1]
Cat 5 · Liu et al. Disguise and Reconstruction (DRA). arXiv:2402.18104. 91.1% ASR on GPT-4. [VERIFIED]
Cat 5 · Weng et al. Foot-in-the-Door Multi-Turn. arXiv:2502.19820. 94% avg across 7 models. [VERIFIED]
Cat 10 · W. Zou et al. PoisonedRAG. arXiv:2402.07867. 90% with 5 poisoned docs (corrected). [VERIFIED]
Cat 2 · Brodt, Feldman, Schneier, Nassi. The Promptware Kill Chain. arXiv:2601.09625. 7-stage framework. [VERIFIED]
Cat 10 · Pulipaka et al. Hidden in Memory: Sleeper Memory Poisoning. arXiv:2605.15338. 99.8% on GPT-5.5. [VERIFIED]
Cat 10 · Huang et al. Blindfold: Embodied LLM Action-Level. arXiv:2603.01414. 53% higher ASR · 6DoF arm. [VERIFIED]
Defense · Cunningham et al. Constitutional Classifiers++. arXiv:2601.04603. 40× cost reduction · 0.05% production refusal. [VERIFIED]
Cat 5 · Russinovich et al. Crescendo Multi-Turn. arXiv:2404.01833. "100% ASR" claim not in abstract. [UNVERIFIED]
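
The arXiv identifiers in the entries above can be turned into abstract-page links for manual re-checking. This is a minimal sketch: `arxiv_abs_url` is a hypothetical helper, not part of the project's tooling. It only builds the URL and performs no network access.

```python
import re

# Matches new-style arXiv identifiers such as "arXiv:2402.18104" or
# "arXiv:2503.08990v1" (optional version suffix), case-insensitively.
ARXIV_ID = re.compile(r"arXiv:(\d{4}\.\d{4,5})(v\d+)?", re.IGNORECASE)


def arxiv_abs_url(citation):
    """Return the arXiv abstract URL for the first ID in `citation`, or None."""
    match = ARXIV_ID.search(citation)
    if match is None:
        return None
    identifier = match.group(1) + (match.group(2) or "")
    return "https://arxiv.org/abs/" + identifier


entry = "Gohil. JBFuzz: Jailbreaking LLMs Using Fuzzing. arXiv:2503.08990v1."
print(arxiv_abs_url(entry))  # https://arxiv.org/abs/2503.08990v1
```

A citation with no arXiv identifier returns `None`, so such entries can be flagged for a different lookup path.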