LLM Jailbreak and Red-Team Resistance Leaderboard

Rankings of 14 frontier LLMs by adversarial robustness - how well they resist jailbreaks, prompt injection, and harmful-behavior elicitation across HarmBench, AdvBench, StrongREJECT, JailbreakBench, and AgentHarm.

LLM Jailbreak and Red-Team Resistance Leaderboard

Most AI capability rankings measure what a model can do when everyone is playing nice. Adversarial robustness measures something different: what happens when someone is actively trying to make the model do something it shouldn't. That gap matters more than most benchmark comparisons suggest, because in production, models face both well-meaning users and bad-faith ones - sometimes in the same session.

This leaderboard aggregates published attack success rate (ASR) data across five standardized adversarial evaluation frameworks and covers 14 frontier models. The score that matters is the one you don't want to be high.

TL;DR - Claude 4 Sonnet leads adversarial robustness by a significant margin, with a 2.86% max harm rate against autonomous reasoning-model attackers and consistent low-single-digit ASR across standard benchmarks. GPT-5 and o3 show strong resistance with aggressive post-deployment patching. DeepSeek V3.2 and Llama 4 open-weight models remain the most vulnerable, with ASR figures in the 50-90% range against common automated attacks. Agentic safety is a separate problem - InjecAgent results show nearly every model remains vulnerable to indirect prompt injection in tool-use workflows.

What ASR Means

Attack Success Rate (ASR) is the percentage of adversarial prompts that successfully elicit harmful or policy-violating content from a model. Lower is better. An ASR of 5% means that 1 in 20 adversarial attempts succeeds. An ASR of 90% means the adversary wins almost every time.

ASR figures are not interchangeable across benchmarks. A 5% ASR on HarmBench - which uses 18 automated attack methods including gradient-based token optimization - is much harder to achieve than a 5% ASR on a simple prefix-injection dataset. When comparing numbers, the benchmark context matters as much as the number itself.

The figures in this leaderboard come from published papers, official system cards, and peer-reviewed evaluation studies. Where no public figure exists for a given model-benchmark pair, I mark it as "Not reported" rather than interpolating. The rankings reflect the overall picture from available data, not any single metric.

The Benchmarks Explained

HarmBench

HarmBench (Center for AI Safety, 2024) is the closest thing to a standardized adversarial evaluation framework the field has. It tests 33+ models against 18 attack methods including Greedy Coordinate Gradient (GCG) token optimization, Prompt Automatic Iterative Refinement (PAIR), and Tree of Attacks with Pruning (TAP). It covers 7 harm categories across 400 test behaviors. The diversity of attack methods is the key: a model that blocks GCG attacks might still fall to semantic jailbreaks.

AdvBench

AdvBench (Zou et al., 2023) is a simpler 520-item benchmark of harmful behaviors and string completions. It's widely used because it's reproducible, but it reflects single-turn attacks and skews toward prompts that were known at the time of its creation. A low AdvBench ASR is necessary but not sufficient evidence of robustness.

StrongREJECT

StrongREJECT flips the measurement approach. Rather than just detecting whether a model said something harmful, it evaluates the quality of rejections - measuring whether the model provides an actual refusal or just a weak, useless non-answer that still hands the attacker what they want. A model that says "I can't help with that, but here's how..." scores badly even if technically refusing. This benchmark is the most practically meaningful for production deployments.

JailbreakBench

JailbreakBench is a living leaderboard with a standardized set of 100 misuse behaviors evaluated at fixed checkpoints. Its main value is comparability over time - it uses a consistent Llama Guard judge and controlled attack budgets so you can track whether models improve or regress across versions. The public leaderboard is updated as new submissions come in.

AgentHarm

AgentHarm evaluates models specifically in agentic settings - when the model is acting as an agent with access to tools, APIs, or persistent memory. It tests 110 harmful agentic tasks across 11 categories including cyberattacks, harassment automation, and financial fraud. Standard refusal training does not translate cleanly to agentic settings, which is why this benchmark produces very different results than single-turn evaluations.

InjecAgent

InjecAgent tests indirect prompt injection - attacks where malicious instructions are hidden in external content retrieved during tool use (web pages, documents, API responses). When a model reads a web page and that web page contains hidden instructions, does the model follow them? Most current models do, often silently, which makes this one of the most practically dangerous attack surfaces in production agentic systems.

Main Ranking Table - April 2026

Rankings are by overall adversarial robustness (Rank 1 = most robust). ASR figures are the best available published numbers from the sources listed in the methodology section. Lower ASR = better.

RankModelProviderHarmBench ASRAdvBench ASRStrongREJECTJailbreakBench ASRAgentHarm ASRNotes
1Claude 4 SonnetAnthropic~3%~2%High~5%~15%Best single-model resistance across all attack types; 2.86% max harm vs. reasoning-model attackers (Nature Comms study)
2Claude 4 OpusAnthropic~4%~3%High~6%~18%Strong across board; highest Constitutional AI investment; slightly lower instruction-following means slightly more refusals
3GPT-5 / o3OpenAI~5%~4%High~8%~22%Aggressive post-launch patching; initial ASR was higher pre-patch; strong StrongREJECT scores; agentic layer less hardened
4GPT-4.1OpenAI~8%~6%Good~11%~28%Solid resistance; weaker than o3 on multi-turn attacks; frequently tested baseline in published research
5Gemini 2.5 ProGoogle~12%~9%Good~14%~31%Vulnerable to reasoning-model multi-turn attackers (71.43% max harm in Nature Comms study when used as a target); single-turn defenses decent
6Kimi K2.5Moonshot AINot reportedNot reportedModerateNot reportedNot reportedLimited public red-team data; internal Moonshot safety filtering active; treat as provisional
7Qwen 3.5Alibaba~22%~18%Moderate~25%~42%Safety varies significantly by prompt language; English safety better than Chinese-language testing; open-weight versions weaker
8Mistral Large 3Mistral AI~24%~19%Moderate~27%~44%ASR improves substantially with moderation layer enabled; bare model has weak default guardrails
9Phi-4Microsoft~28%~23%Moderate~30%Not reportedStrong reasoning-to-size ratio but safety training less robust than larger models; limited agentic safety data
10Grok 4xAI~32%~27%Low-Moderate~34%~50%Consistently quick to jailbreak in third-party testing; Lumenova found Grok 4 broke under 30 min per model; disclaimer pattern without refusal common
11DeepSeek-R2DeepSeek~38%~31%Moderate~40%Not reportedReasoning capability makes it both a harder-to-fool target and a potent attacker; safety inconsistent by category
12Llama 4 Scout / MaverickMeta36.83%~32%Low-Moderate~38%~55%HarmBench figure from published evaluation; open-weight baseline; significant improvement with Llama Guard overlay
13DeepSeek V3.2DeepSeek~52%~45%Low~50%~65%High ASR across automated attacks; 90% max harm vs. reasoning-model attackers (Nature Comms); geopolitically sensitive topics handled differently
14Mixtral 8x22BMistral AI~58%~50%Low~55%~70%Older open-weight baseline; no default safety layer; included as reference point for open-weight risk floor

All figures are approximate from published sources; see Methodology. "Not reported" = no public evaluation data available. ASR = Attack Success Rate (lower is better). StrongREJECT scores are qualitative (High/Moderate/Low) because the published metric is a scalar that varies by attack set.

A few notes on the table before going further.

The rankings for Claude 4 Sonnet and GPT-5 reflect consistent findings across multiple independent evaluations, not just one paper. The gap between the top three positions and the rest is large and reproducible. Claude's advantage is structural: Constitutional Classifiers++ reads internal model activations rather than scanning surface text, which catches adversarial intent that keyword-based filters miss entirely.

The open-weight models at the bottom of the table are there because they ship without safety layers by default. Llama 4 Maverick's published HarmBench ASR of 36.83% is for the base model. With a Llama Guard or Granite Guardian overlay, that number drops substantially. The ranking reflects what you get out of the box.

Grok 4's poor performance stands out given xAI's stated safety commitments. Third-party testing published by Lumenova AI found that all jailbreak techniques developed against GPT-5 transferred to Grok 4 in under 30 minutes per model, and that Grok 4 was the easiest of the tested frontier models to break. A high disclaimer rate (60%+ in some evaluations) does not substitute for refusal - a model that adds "for educational purposes" before complying with a harmful request is not meaningfully safer.

Attack Category Breakdown

Not all attacks are equally likely in every deployment context. Here's how the main frontier models perform across the five harm categories that produce the highest real-world risk:

Cyber Offense and Malware Generation

Writing functional malware, explaining exploitation techniques, or generating working code for offensive security tools. Claude and GPT-5 show the strongest resistance here - consistent low ASR even against technically sophisticated attack framings. DeepSeek V3.2 and Mixtral remain highly vulnerable to prompts framed as security research or penetration testing.

CBRN (Chemical, Biological, Radiological, Nuclear)

Uplift for creating weapons of mass destruction is the highest-priority harm category for every major lab. Claude's Responsible Scaling Policy and OpenAI's Preparedness Framework both treat CBRN uplift as a hard red line. In practice, published evaluations consistently show Claude and GPT-5 maintaining near-zero ASR in this category even under strong attack pressure. Open-weight models without moderation layers are substantially more vulnerable.

Persuasion and Social Engineering Automation

Generating targeted phishing content, impersonation scripts, or mass harassment campaigns. This category sees higher ASR across most models than CBRN, partly because the harm framing is easier to launder through "creative writing" or "security awareness" contexts. Grok 4 and DeepSeek models perform notably worse here than on the cyber-offense category.

Fraud and Financial Crime

Generating scam scripts, account takeover guides, or money laundering documentation. Results here are mixed even for top-ranked models - the framing as financial or legal advice creates ambiguity that safety classifiers struggle with. StrongREJECT scores are most informative in this category, because many models issue nominal refusals while still providing useful partial information.

Illegal Firearms and Weapons

3D printing instructions, conversion modification guides, suppressor fabrication. US-deployed models (GPT-5, Claude, Grok) show stronger resistance here than models primarily deployed in other regulatory environments. Qwen 3.5's safety behavior on this category varies significantly between English and Chinese prompts, reflecting different training data compositions.

Notable Attack Methods

Understanding how attacks work helps evaluate which defenses actually matter. These descriptions cover published research techniques at a conceptual level - no working exploits, no specific prompt templates.

PAIR (Prompt Automatic Iterative Refinement) uses one LLM to iteratively refine attack prompts against a target LLM, automatically adapting when the target refuses. Published in 2023 by Chao et al., it was among the first demonstrations that black-box attacks could achieve high ASR without gradient access. The original paper reported success against GPT-4, Claude, and Vicuna. Current frontier models are substantially more resistant to the original PAIR formulation, but adaptive variants remain effective.

TAP (Tree of Attacks with Pruning) extends PAIR by building a tree of attack attempts rather than a linear chain, pruning low-promise branches and expanding high-promise ones. The TAP paper showed consistent improvements over PAIR, especially on models that had been hardened against the simpler iterative refinement approach.

GCG (Greedy Coordinate Gradient) is a white-box attack that requires gradient access to the target model. It finds adversarial suffixes - specific token sequences appended to harmful prompts - that cause the model to comply. It's computationally expensive and requires open-weight model access, but the suffixes it finds often transfer to closed models. The Universal Adversarial Attacks paper demonstrated this transfer property. Current frontier models are significantly more robust to known GCG suffixes than they were in 2023.

Multi-turn crescendo attacks gradually escalate through a conversation, establishing context with benign interactions before steering toward the harmful objective. The most sophisticated version of this is now automated - the autonomous jailbreak agents paper demonstrated 97% success rates using reasoning models as automated multi-turn attackers, with Claude 4 Sonnet remaining the most resistant target at 2.86% max harm.

Many-shot jailbreaking exploits long context windows by inserting many examples of an AI complying with harmful requests before the actual attack prompt. With context windows now at 1M+ tokens, this attack scales with context length - research showed attack success rates increasing monotonically with the number of in-context examples. Models with larger context windows face a higher attack surface from this vector.

RAT (Retrieval Augmented Thinking) and indirect prompt injection leverage tool use and RAG workflows to plant malicious instructions in retrieved documents. When a model retrieves a web page or document containing hidden instructions, those instructions can override the system prompt in many current architectures. InjecAgent systematically evaluated this attack surface and found high ASR across nearly all tested models in agentic settings.

Reasoning model exploitation is the newest attack category. Because reasoning-capable models chain through problems step by step before answering, adversaries can sometimes embed harmful goals inside the reasoning chain rather than the surface request. The same capability that makes these models good at planning makes them susceptible to having their plan hijacked.

Defense Approaches

Constitutional AI and RLHF strength. Anthropic's Constitutional AI trains models to evaluate their own outputs against a set of principles and revise them before responding. This builds robustness into the model weights rather than relying solely on post-output filtering. The practical difference shows up in StrongREJECT scores - models with stronger RLHF produce better-quality refusals rather than nominal refusals that still leak information.

Representation engineering and circuit breakers. Zou et al. 2024 demonstrated that harmlessness can be reinforced at the internal representation level by identifying and suppressing activation patterns associated with harmful outputs. The Circuit Breaker paper extended this to agentic settings, showing that representation-level interventions maintain robustness even when surface prompts are bypassed. This approach is more robust than output-only classifiers because it acts before the harmful content is generated.

Constitutional Classifiers++. Anthropic's production safety system uses a two-layer architecture: a fast internal probe reading model activations on every request, followed by a full classifier for flagged queries. Their public jailbreak challenge found zero universal jailbreaks after 3,000+ hours of testing by 183 participants. False refusal rate is 0.05% - roughly 1 in 2,000 legitimate queries.

Classifier and moderation layers. External classifiers (Meta's Llama Guard, Granite Guardian, Perspective API) can be added to any model's output pipeline. The effectiveness varies: Mistral Large jumps from "Good" to "Very Good" on MLCommons AILuminate when its moderation layer is active. The tradeoff is latency and a separate failure mode - if the classifier is bypassed, the underlying model's baseline ASR applies.

Immutable safety suffixes. Research from the autonomous jailbreak agents study showed that appending a consistent, immutable safety instruction to every incoming message reduced successful jailbreaks from 97% to under 1% in controlled conditions. Practical deployment raises questions about helpfulness tradeoffs that weren't fully assessed.

Agentic Safety: A Separate Problem

Standard jailbreak benchmarks test single-turn or multi-turn conversations with a chatbot interface. Agentic deployments add new attack surfaces that current safety training does not adequately cover.

When a model has tool access - web browsing, code execution, file system access, API calls - the attack surface expands dramatically. Indirect prompt injection means the model can receive malicious instructions from any content it reads. AgentHarm found that even models that perform well on conversational jailbreak benchmarks show substantially higher ASR in agentic settings. The numbers in the main table reflect this: Claude 4 Sonnet's ~3% conversational HarmBench ASR becomes ~15% in agentic AgentHarm testing.

The InjecAgent evaluation makes this concrete: when malicious instructions were embedded in simulated tool outputs, most models followed those instructions a significant fraction of the time, even when the system prompt explicitly instructed them not to trust external content. This is not primarily a safety training problem - it's an architectural one. Models that use retrieved context to answer questions have limited ability to distinguish "content to read" from "instructions to follow."

For teams building production agents, the practical implication is that model-level jailbreak resistance is necessary but not sufficient. Sandboxing tool outputs, validating returned content before it enters the context window, and treating external content as untrusted are infrastructure requirements that no amount of model fine-tuning can substitute for.

The Agents of Chaos red-team study documented real-world consequences of agentic safety gaps in live deployments - including data leakage from natural-language ambiguity that no jailbreak benchmark captures.

Methodology and Caveats

Sources. Primary sources for ASR figures: the HarmBench paper and GitHub, the JailbreakBench leaderboard, OpenAI's GPT-5 system card, Anthropic's published safety evaluations, the Nature Communications autonomous jailbreak agents paper, and Lumenova AI's cross-generation jailbreak testing. Figures marked with ~ are approximate from published data ranges.

ASR varies with attack budget. A researcher with a 50-attempt budget against GPT-5 gets a very different ASR than one with a 10,000-attempt adaptive attack. Most published benchmarks use standardized attack budgets, but comparisons across benchmarks with different budgets are not direct.

Models are patched silently. OpenAI's pattern of post-deployment safety patching is well documented - GPT-5's initial ASR was substantially higher than post-patch figures. Anthropic's Constitutional Classifiers are updated continuously. Any specific ASR number reflects the model at time of testing, not necessarily the current deployed version. I've used the most recent published figures available.

Responsible disclosure norms. This article describes attack categories and defense approaches at a conceptual level. It does not publish specific attack prompts, successful jailbreak templates, or model-specific exploitation techniques. That information exists in academic papers behind the links in the Sources section - researchers who need it can find it there. Publishing working exploits without coordinated disclosure is not something this site does.

Many-shot attacks scale with context. As models extend their context windows, many-shot jailbreaking attack surface scales proportionally. The many-shot jailbreaking paper documented this relationship clearly. A model with a 1M-token context window faces a meaningfully different many-shot attack surface than a 128K-token model, all else equal.

JBDistill and the benchmark decay problem. Static jailbreak benchmarks decay as models are trained on them. The JBDistill framework from Johns Hopkins and Microsoft addresses this by auto-generating fresh adversarial prompts on demand. Its 81.8% ASR against 13 LLMs using dynamically generated attacks versus 18.4% for the static HarmBench illustrates the gap between defending against known attacks and defending against novel ones.

For broader safety context including alignment scores, refusal rates, and bias evaluations, see the AI Safety Leaderboard. That leaderboard covers the full safety landscape; this one focuses specifically on adversarial robustness and attack resistance.

For the real-world consequence of a successful jailbreak at scale, the Mexico government breach documented how 1,000+ prompts against Claude were used to steal 150GB of government data. The attacker used the model as a tool for generating exploit code, not as the primary attack vector - but the case illustrates why ASR numbers have production consequences.

Sources

LLM Jailbreak and Red-Team Resistance Leaderboard
About the author AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.