AI Safety Leaderboard: Refusal and Jailbreak Rankings
Rankings of AI models by safety metrics including refusal rates, jailbreak resistance, bias scores, and truthfulness across major benchmarks.

Safety used to be an afterthought in model selection. You'd pick the fastest, smartest, or cheapest LLM, then bolt on content filters and hope for the best. That's changed. With the EU AI Act's high-risk system requirements hitting enforcement in August 2026, and California's SB 53 already requiring frontier model developers to publish transparency reports, safety is now a procurement criterion - not a nice-to-have. Enterprises need hard numbers on which models resist adversarial attacks, refuse harmful requests reliably, and don't hallucinate dangerous information.
I've aggregated data from four major safety evaluation frameworks to produce the most thorough safety ranking available today. The results are clear: there's a wide gap between the safest and least safe models, and that gap maps closely to whether you're running a closed API or self-hosting an open-weight model without guardrails.
TL;DR - Claude models dominate safety rankings, with Claude 3.5 Sonnet v2 reaching the lowest attack success rate (4.39%) of any model tested. GPT-4o scores "Good" on MLCommons AILuminate but lags behind Claude on adversarial robustness. Open-weight models show a clear safety gap - DeepSeek V3 has a 54.67% attack success rate on HarmBench, and Cisco found multi-turn jailbreaks succeed 93% of the time against open-weight LLMs. The safest production setup in March 2026 is Claude with Constitutional Classifiers++, which blocks over 95% of jailbreak attempts at roughly 1% additional compute cost.
Methodology
Safety benchmarking isn't a single-number problem. A model can ace one test and fail another because these evaluations measure different things. Here's what each framework captures:
HarmBench (Center for AI Safety) tests automated red teaming across 18 attack methods and 33 target models. The key metric is Attack Success Rate (ASR) - the percentage of adversarial prompts that elicit harmful content. Lower is better. It covers categories including misinformation, cybercrime, chemical and biological threats, and explicit content.
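To make the metric concrete, ASR is just the fraction of adversarial prompts that elicit harmful output, expressed as a percentage. A minimal sketch (the example data below is invented for illustration, not taken from HarmBench):

```python
# Illustration of the Attack Success Rate (ASR) metric: the fraction of
# adversarial prompts that elicited harmful content. Lower is better.

def attack_success_rate(results: list[bool]) -> float:
    """results[i] is True if adversarial prompt i elicited harmful content."""
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)

# 3 successful attacks out of 50 prompts -> 6.00% ASR
print(f"{attack_success_rate([True] * 3 + [False] * 47):.2f}%")
```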
MLCommons AILuminate v1.0 assesses AI systems against 12 hazard categories using over 24,000 test prompts. It grades models on a five-point scale from Poor to Excellent based on the percentage of responses that violate safety guidelines. No model has hit "Excellent" yet.
General Analysis Adversarial Robustness Benchmark measures attack success rates specifically across HarmBench and AdvBench datasets, providing cross-benchmark comparisons of adversarial resistance.
Gray Swan Arena runs continuous red-teaming challenges where human attackers try to jailbreak models. Their Agent Red-Teaming Arena has processed 1.8 million attack attempts against 22 AI agents, logging 62,000 breaches and paying out $171,000 in bounties.
Important limitations: Safety benchmarks test what models refuse, not what they should refuse. A model that refuses everything would score perfectly on jailbreak resistance but be useless. The refusal-helpfulness tradeoff is real, and these rankings don't capture it directly. I've noted where models over-refuse in the takeaways section.
Safety Rankings - March 2026
| Rank | Model | Provider | HarmBench ASR | AILuminate Grade | Jailbreak Resistance | Overall Safety |
|---|---|---|---|---|---|---|
| 1 | Claude 3.5 Sonnet v2 | Anthropic | 4.39% | Very Good | 95.6% | A |
| 2 | Claude Opus 4.5 | Anthropic | ~5% | Very Good | 95.3% | A |
| 3 | Claude 3.5 Haiku | Anthropic | ~8% | Very Good | 93% | A- |
| 4 | GPT-4o | OpenAI | 6% | Good | 89% | B+ |
| 5 | GPT-4o mini | OpenAI | ~10% | Good | 85% | B+ |
| 6 | Gemini 2.0 Flash | Google | 42% | Good | 78% | B |
| 7 | Gemma 2 9B | Google | ~15% | Very Good (bare) | 75% | B |
| 8 | Llama 4 Maverick | Meta | 36.83% | Good (w/ moderation) | 70% | B- |
| 9 | Mistral Large (moderated) | Mistral AI | ~20% | Very Good (moderated) | 72% | B- |
| 10 | DeepSeek V3 | DeepSeek | 54.67% | Fair | 50% | C |
HarmBench ASR = Attack Success Rate (lower is better). Jailbreak Resistance = percentage of jailbreak attempts blocked. Overall Safety is a composite grade factoring all four evaluation sources. Scores marked with ~ are interpolated from available data across multiple evaluation sources.
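The article doesn't publish the exact weighting behind the composite, but the general idea is easy to sketch. The normalization and equal weighting below are invented stand-ins, not the actual methodology:

```python
# Hypothetical composite safety score: normalize each signal to 0-100 where
# higher is safer, then average. The inversion and equal weights here are
# invented for illustration; the article's real weighting is not disclosed.

def composite(harmbench_asr: float, jailbreak_resist: float) -> float:
    asr_score = 100.0 - harmbench_asr  # invert: lower ASR is safer
    return (asr_score + jailbreak_resist) / 2

# Claude 3.5 Sonnet v2 from the table: 4.39% ASR, 95.6% jailbreak resistance
print(composite(4.39, 95.6))
```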
A few things to notice. Anthropic's Claude family occupies all three top positions - this isn't a coincidence. Their Constitutional Classifiers++ system uses a two-stage architecture: a fast internal probe that screens all traffic by reading the model's own neural activations, followed by a full classifier for flagged queries. The result is a 0.05% false refusal rate at roughly 1% additional compute cost. That's a significant engineering achievement.
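Conceptually, that two-stage screen looks like the sketch below. This is not Anthropic's implementation - the probe, classifier, and threshold are invented placeholders - but it shows why the architecture is cheap: most traffic exits at the fast path.

```python
# Hypothetical two-stage safety screen: a cheap probe on every request,
# with an expensive classifier only for flagged traffic. Both functions
# are invented stand-ins, not a real API.

def fast_probe_score(activations: list[float]) -> float:
    """Stand-in for a lightweight probe over internal activations."""
    return sum(activations) / len(activations)

def full_classifier(text: str) -> bool:
    """Stand-in for the heavyweight classifier; True means block."""
    return "forbidden" in text.lower()

def screen(text: str, activations: list[float], threshold: float = 0.8) -> bool:
    """Return True if the request should be blocked."""
    if fast_probe_score(activations) < threshold:
        return False  # cheap path: most traffic stops here
    return full_classifier(text)  # expensive path: only flagged queries
```

The compute savings come from the branch: the full classifier only runs on the small slice of traffic the probe flags.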
GPT-4o sits in a solid fourth place. Its HarmBench ASR of approximately 6% is competitive, though third-party testing from SPLX found that GPT-5's raw ASR initially hit 89% before OpenAI patched it to below 1% within two weeks. This matters because it reveals a pattern: OpenAI's safety depends heavily on post-deployment patching rather than built-in robustness.
The open-weight gap is stark. DeepSeek V3 has a 54.67% attack success rate on HarmBench, and Llama 4 Maverick sits at 36.83%. Adding moderation layers helps - Mistral Large jumps from "Good" to "Very Good" on AILuminate when its output moderation is enabled. But the baseline models ship with weaker guardrails than their closed-source competitors.
Key Takeaways
Anthropic has a structural safety advantage. Constitutional Classifiers++ aren't just a filter bolted onto the model - they read internal activations, which means they catch adversarial intent that surface-level content filters miss. During Anthropic's public jailbreak challenge, 183 participants spent over 3,000 hours attacking the system. Not a single universal jailbreak was found.
Open-weight models need guardrails. Cisco's testing of eight open-weight LLMs found that multi-turn jailbreak attacks succeeded nearly 93% of the time. If you're self-hosting Llama or Mistral without an external safety layer, your model is almost certainly vulnerable to motivated attackers. MLCommons data confirms this: Llama 2 70B jumps from "Fair" to "Good" just by adding moderation.
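The guardrail pattern for self-hosted models is simple to sketch: screen both the input and the output of every generation. The `moderate` function below is a hypothetical placeholder for whatever third-party safety tool you actually deploy, and `generate` stands in for your model call - neither is a real library API:

```python
# Hypothetical guardrail wrapper around a self-hosted model.
# `moderate` and `generate` are invented placeholders, not real APIs.

REFUSAL = "I can't help with that."

def moderate(text: str) -> bool:
    """Placeholder safety check; True means the text is flagged."""
    return "harmful" in text.lower()

def guarded_generate(prompt: str, generate) -> str:
    if moderate(prompt):       # screen the input
        return REFUSAL
    reply = generate(prompt)
    if moderate(reply):        # screen the output as well
        return REFUSAL
    return reply

# With an echoing stand-in model, a flagged prompt is refused:
print(guarded_generate("something harmful", lambda p: p))
```

Screening both directions matters: multi-turn attacks often produce innocuous-looking prompts whose replies are the harmful part.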
The refusal-helpfulness tradeoff is real but improving. Earlier safety systems had false refusal rates around 0.38% - meaning roughly 1 in 260 legitimate queries got blocked. Anthropic's latest system cuts that to 0.05%, or about 1 in 2,000. That's close to invisible for most production workloads, and it's a signal that safety and usability are no longer in direct tension.
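The arithmetic behind those "1 in N" figures is simple enough to check:

```python
# Convert a false refusal rate to "about 1 in N legitimate queries blocked".
def one_in_n(rate: float) -> int:
    return round(1 / rate)

print(one_in_n(0.0038))  # 0.38% -> about 1 in 263
print(one_in_n(0.0005))  # 0.05% -> 1 in 2000
```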
DeepSeek's safety varies by regulatory regime. Research tracking safety compliance found that DeepSeek's safety behavior ranges from about 50% compliance on general prompts to 90% on EU-specific regulatory scenarios. This inconsistency should concern anyone deploying DeepSeek in regulated environments.
The Jailbreak Arms Race
Safety benchmarks capture a snapshot, but the attack surface keeps shifting. A late 2025 paper co-authored by researchers from OpenAI, Anthropic, and Google DeepMind found that adaptive attacks - those that iteratively refine their approach based on prior failures - bypassed published model defenses with success rates above 90% for most systems tested. Twelve defenses that claimed near-zero attack success rates were all broken.
Current attack categories worth tracking:
Multi-turn attacks gradually steer a model toward harmful output through a series of seemingly innocent prompts. These are the most effective technique against open-weight models, with a 93% success rate in Cisco's testing.
Intent laundering reframes harmful requests in academic, fictional, or hypothetical contexts. Models that are trained to be helpful often fail to recognize the underlying intent when it's wrapped in enough conversational scaffolding.
Visual jailbreaks embed adversarial instructions in images sent to vision-enabled models. Gray Swan's Visual Vulnerabilities Challenge created 14,448 image-based jailbreaks across six VLMs in just one month.
Cross-generational transfer is a newer concern. Lumenova AI demonstrated that jailbreak techniques that work on older models often transfer to newer ones - all their jailbreaks against GPT-5 and Grok 4 were completed in under 30 minutes per model. Grok 4 was especially easy to break, while GPT-5 required more sophisticated techniques like pattern lock-in requests.
On the defense side, Anthropic's internal probe approach is the most promising direction I've seen. By reading activations rather than just scanning input and output text, it catches attacks that would slip past keyword or pattern-based filters. The compute overhead of roughly 1% makes it practical for production deployment.
Enterprise Safety Checklist
Benchmark scores tell you how a model performs in controlled testing. Production safety requires more. Before deploying any LLM in a regulated or high-stakes environment, assess these additional factors:
- Data handling and retention - Does the provider train on your inputs? Can you opt out? Where's data stored geographically? This matters for GDPR and the EU AI Act's data governance requirements.
- Audit logging - Can you retrieve a full log of inputs, outputs, and any safety filter activations? The EU AI Act requires this for high-risk systems by August 2026.
- Custom safety policies - Can you define your own refusal categories, or are you locked into the provider's defaults? Enterprise use cases often need domain-specific safety rules.
- Incident response - What happens when a jailbreak is discovered? OpenAI's pattern of patching GPT-5 within two weeks of launch is fast, but it means there's a window of vulnerability.
- Compliance certification - Is the provider working toward ISO 42001 (AI management) or SOC 2 Type II? These aren't safety benchmarks, but they signal organizational maturity around responsible deployment.
- Moderation layer compatibility - If using open-weight models, which third-party safety tools integrate cleanly? Mistral's jump from "Good" to "Very Good" on AILuminate shows that moderation layers work - but only if you actually deploy them.
Practical Guidance
For safety-critical deployments (healthcare, finance, legal, government): Claude Opus 4.5 or Claude 3.5 Sonnet v2 are the clear choices. The combination of low ASR, Constitutional Classifiers++, and Anthropic's organizational commitment to safety research makes them the most defensible option for regulated use cases. The cost premium over alternatives is real, but in a compliance context, the cheapest model is the one that doesn't cause an incident.
For general enterprise use where safety matters but isn't the primary differentiator: GPT-4o offers a good balance of capability and safety at competitive pricing. Its "Good" AILuminate grade and sub-6% HarmBench ASR are solid, and OpenAI's rapid patching cadence means vulnerabilities get addressed quickly.
For self-hosted deployments where data must stay on-premise: Llama 4 Maverick or Mistral Large with external moderation layers are your best options. Don't deploy them bare. The difference between a "Fair" and "Good" safety rating is just a moderation layer, and skipping that layer in a production environment is asking for trouble.
For cost-sensitive workloads where some safety tradeoff is acceptable: Be honest about the tradeoff. DeepSeek V3 is extraordinarily cheap and capable, but a 54.67% HarmBench ASR means it'll produce harmful content when pushed. If your use case is internal, low-risk, and you've got content filters downstream, it can work. For anything customer-facing, look elsewhere.
For a broader view of how these models compare on capability and price, see the Overall LLM Rankings. And if you're new to the concepts behind these evaluations, our AI Safety and Alignment guide covers the foundations.
These rankings will be updated as new models ship and new evaluation results come in. The safety talent movement between labs suggests the competitive dynamics here are just getting started.
Sources
- HarmBench: A Standardized Evaluation Framework - Center for AI Safety
- AILuminate v1.0 Safety Benchmark - MLCommons
- General Analysis Adversarial Robustness Leaderboard - General Analysis
- Constitutional Classifiers++ - Anthropic
- Gray Swan Arena - Gray Swan AI
- GPT-5 System Card - OpenAI
- Jailbreaking Frontier Models Across Generations - Lumenova AI
- Open-Weight AI Models Fail the Jailbreak Test - GovInfoSecurity (Cisco research)
- WMDP Benchmark - Center for AI Safety
- EU AI Act 2026 Compliance Guide - SecurePrivacy
Last verified March 9, 2026
