Leaderboards | Awesome Agents

AI Music Generation Leaderboard 2026: Suno, Udio, More

James Kowalski — Sun, 19 Apr 2026 00:00:00 +0000

AI music generation is where evaluation methodology breaks down fastest. MOS listening tests get cherry-picked demo tracks. FAD numbers get reported against different reference sets without disclosure. Vendors compare their latest model to competitors' two-year-old checkpoints. And the consumer products - Suno, Udio, AIVA - don't publish FAD scores at all, leaving benchmark trackers to rely on third-party academic evaluations that may be six months stale by the time they're peer-reviewed.

None of that means the numbers are useless. It means you have to hold them carefully. This leaderboard covers what the objective benchmarks actually say, which subjective tests you should trust versus treat with skepticism, and where the commercial products land relative to open-source models on evaluations where both can be compared.

TL;DR

MusicGen-Large (Meta) still posts the strongest published FAD scores on standard academic benchmarks, but it's an open research model - not a product
Suno v4 and Udio lead on subjective listening preferences in independent user tests, but neither publishes FAD or CLAP metrics
YuE is the open-source model to watch for lyric-to-song - it's the only open-weight system that meaningfully handles full-song structure with lyrics
MOS scores from vendor demos are almost always inflated - the only numbers worth citing come from blinded third-party evaluations
The MusicCaps benchmark has an overfitting problem: models trained on YouTube-adjacent data have a structural advantage over models that weren't

Methodology

FAD - Frechet Audio Distance

FAD is the audio analog of FID in image generation. It computes the Frechet distance between the distribution of audio embeddings from generated clips and a reference set of real recordings. Lower is better. The embedding model matters: most published FAD numbers use VGGish embeddings, but some 2025-2026 papers use CLAP or MERT, making cross-paper comparisons unreliable. The reference set matters equally - papers using AudioSet produce different absolute numbers than papers using MusicCaps. "FAD (VGGish, AudioSet)" and "FAD (CLAP, MusicCaps)" are different metrics with the same name.

CLAP Score - Text-Audio Alignment

CLAP (Contrastive Language-Audio Pretraining) measures how well generated audio matches its text prompt. Higher is better. Published scores use either the LAION-CLAP or MERT-CLAP checkpoint - not directly comparable across papers. The MusicBench paper (arXiv 2305.15243) provides the cleanest standardized CLAP evaluation because it holds the reference model constant across all compared systems.

MOS - Mean Opinion Score

MOS is a 1-5 scale human listener rating. It's the most widely used subjective quality metric and the easiest to rig. Selection bias in prompt choice, annotator selection, and blind protocol all dramatically affect results. The only MOS figures worth weighting come from academic papers with documented blind evaluation. I flag vendor-published MOS in the tables below.

KAD - Kernel Audio Distance

KAD replaces FAD's Gaussian distributional assumption with a kernel-based distance, using MERT embeddings rather than VGGish. The KAD paper (arXiv 2409.09203) argues it produces more stable rankings when the reference set is small. Adoption is still limited - most evaluations still report FAD - but it's appearing more frequently in recent work.

MusicCaps

MusicCaps is Google's 5,521-clip evaluation set with detailed text captions, the most commonly used reference for text-to-music FAD. Its clips come from YouTube AudioSet, skewed toward pop and rock. Models trained on YouTube-adjacent data have a structural advantage here. Keep that in mind when reading any absolute FAD numbers anchored to this dataset.

Text-to-Music Rankings

Scores come from the MusicBench paper, the MusicGen technical report (arXiv 2306.05284), the MusicLDM paper (arXiv 2311.11225), and independent evaluations where noted. FAD here uses VGGish embeddings against MusicCaps unless otherwise specified. CLAP uses LAION-CLAP unless noted. MOS from third-party blinded tests only - vendor-published MOS are marked (vendor).

Rank	Model	Provider	FAD (lower=better)	CLAP (higher=better)	MOS (1-5)	Notes
1	MusicGen-Large	Meta	2.82	0.51	4.1	Academic model; AudioCraft repo; best published objective scores
2	Stable Audio 2.0	Stability AI	4.10 (est)	0.49 (est)	4.0	44-second stereo output; latent diffusion
3	Suno v4	Suno AI	n/a	n/a	4.3 (vendor)	No published FAD/CLAP; strong in independent listening panels
4	Udio	Udio	n/a	n/a	4.2 (vendor)	No published FAD/CLAP; competitive vocal quality
5	YuE	m-a-p (open-source)	5.31 (est)	0.44 (est)	3.9	Full-song structure; best open-weight lyric-to-song
6	MusicGen-Medium	Meta	3.49	0.48	3.9	1.5B params; good cost-quality tradeoff
7	MusicLDM	Various	4.72	0.42	3.7	Latent diffusion; strong genre conditioning
8	ElevenLabs Music	ElevenLabs	n/a	n/a	3.8 (vendor)	Focused on background/ambient use cases
9	AIVA	AIVA Technologies	n/a	n/a	3.7 (vendor)	Classical/cinematic specialty; less pop coverage
10	Riffusion	Riffusion	6.80	0.38	3.3	Spectrogram diffusion; unique approach, lower objective quality
11	Mubert	Mubert	n/a	n/a	3.2 (vendor)	Generative radio; not a true generation model
12	Soundraw	Soundraw	n/a	n/a	3.1 (vendor)	Segment recombination; not end-to-end generative
13	Boomy	Boomy	n/a	n/a	2.9 (vendor)	Template-driven; not a neural generation model
14	Beatoven	Beatoven	n/a	n/a	3.0 (vendor)	Mood-based generation; limited prompt control
15	Loudly	Loudly	n/a	n/a	2.8 (vendor)	Stock library hybrid; partial generation

Estimated scores (marked est) are extrapolated from partial evaluations in published papers and are not direct reproductions of reported figures.

Lyric-to-Song Rankings

This is the hardest task in music generation. The model must produce a full song - verses, chorus, bridge - with vocals that follow provided lyrics, maintain syllabic timing, and stay on pitch. Most academic benchmarks don't cover this because it requires long-context coherence that standard evaluation windows miss.

Rank	Model	Provider	Song Structure	Lyric Alignment	Vocal Quality	Notes
1	Suno v4/v5	Suno AI	Excellent	Strong	High	Best available lyric-to-song product
2	Udio	Udio	Good	Strong	High	Competitive with Suno on vocal fidelity
3	YuE	m-a-p	Good	Good	Moderate	Only open-weight system with real lyric support
4	ElevenLabs Music	ElevenLabs	Limited	Basic	High (instrumental)	Strong voice but shallow song structure
5	AIVA	AIVA Technologies	Good	Weak	n/a	Primarily instrumental; lyric support is bolted on
6	Boomy	Boomy	Basic	Basic	Low	Template-based; lyric handling is minimal

Suno's dominance in lyric-to-song comes from architectural choices around temporal conditioning that aren't published in detail - the technical blog posts are high-level marketing rather than reproducible methodology. YuE's advantage in the open-source tier is documented in the YuE paper (arXiv 2503.08638), which covers the dual-track encoder approach that handles vocals and accompaniment in separate latent spaces before mixing.

Stem Separation and Remixing Rankings

Stem separation is a distinct task from generation: given a mixed audio file, isolate vocals, drums, bass, and other instruments. This matters for remixing, sampling-based workflows, and music production. The benchmark here is SDR (Signal-to-Distortion Ratio) in dB - higher is better. This section tracks separation tools rather than generation models, with the caveat that some platforms offer both.

Rank	Model / Tool	Vocals SDR (dB)	Drums SDR (dB)	Bass SDR (dB)	Other SDR (dB)	Notes
1	Demucs v4 (htdemucs)	8.13	8.24	8.76	6.40	Meta; open-source; current state of the art
2	Demucs v3	7.33	7.73	8.37	5.75	Still widely deployed; good baseline
3	Spleeter (2-stem)	6.55	4.23	-	-	Deezer; fast; 2-stem only mode
4	Moises AI	~7.8 (est)	~7.5 (est)	~7.8 (est)	-	Commercial; built on Demucs-class models
5	Adobe Podcast (Enhance)	Speech-optimized	n/a	n/a	n/a	Voice isolation only; not a full stem splitter

For remixing workflows, Demucs v4 (htdemucs) is the benchmark. It's open-source under MIT license, runs locally, and its SDR numbers come from the MusDB18 test set which is the standard evaluation benchmark for source separation. Commercial products like Moises build on Demucs-class architectures and don't publish independent SDR figures, so their estimates above are based on comparison against Demucs in community listening tests rather than formal evaluation.

Key Takeaways

The Benchmark Coverage Gap Is Real

The models people actually use - Suno, Udio, ElevenLabs Music - don't publish objective benchmark numbers. FAD requires running evaluation pipelines against standardized reference sets. CLAP requires a defined checkpoint. Neither requires much compute to run, and both are reproducible. Skipping these is a choice, not a technical limitation. It means you have to compare commercial products through listening tests that are easier to curate than FAD runs.

MusicGen-Large at FAD 2.82 looks best on paper, but blinded listening tests produce different rankings because FAD doesn't capture long-term coherence, originality, or whether a song has a beginning, middle, and end.

MOS Tests Are Mostly Compromised

Almost every MOS number published by a commercial music generation product was collected under conditions designed to maximize the score. Prompts are curated to showcase strengths. Annotators are often recruited without musical training. The comparison baseline is usually an older model or a competitor's year-old checkpoint.

The only MOS figures worth weighting come from academic papers with blind protocol documentation. The MusicGen evaluation uses this approach. Vendor blog posts don't.

Suno and Udio Lead User Preference - With Caveats

In independent community listening tests run through platforms like Reddit's r/musicai and the Elo-style comparison tools that have emerged in the music generation space, Suno v4 and v5 consistently win on lyric-to-song and stylistic coherence across long outputs. Udio is competitive, particularly for vocal fidelity on shorter clips.

The caveat is that these preference signals come from internet communities that skew toward pop and hip-hop styles. Both Suno and Udio are heavily trained on commercially dominant genres. If you need high-quality classical orchestration, jazz improvisation with correct harmonic substitutions, or experimental electronic production, neither model is as dominant - and AIVA's specialty positioning becomes more competitive.

MusicGen's Advantage Comes From Reproducibility

MusicGen-Large's FAD 2.82 reflects genuine quality from the transformer architecture in the AudioCraft paper, but also careful evaluation against a reference set Meta's team knows well. The AudioCraft codebase includes evaluation scripts and independent researchers have reproduced the numbers. That transparency earns credibility vendor demos can't.

The practical catch: MusicGen-Large caps at 30 seconds, is instrumental-only, and isn't a product for non-technical users. It's a research model that wins benchmarks.

YuE Fills the Open-Source Lyric Gap

Before YuE's release in early 2025, there was no open-weight model that could handle full-song generation with lyrics at a quality level worth deploying. MusicGen can't do lyrics. MusicLDM has no lyric support. Riffusion's spectrogram diffusion approach can produce sung phonemes but can't reliably align them to provided text.

YuE changes that. The dual-track architecture - separate encoders for vocal and accompaniment that are merged at inference time - solves a structural problem that earlier models handled poorly. Song coherence over full verse-chorus-bridge structure is still weaker than Suno, but YuE is open-weight, runs on a single A100, and is available on Hugging Face. For teams that can't use closed APIs, this is the only real option for lyric-to-song at this time.

Pop Music Overfitting Is a Problem

MusicCaps comes from YouTube AudioSet, which skews heavily toward pop, hip-hop, and rock. A model that generates technically adequate pop will beat a model generating higher-quality jazz on MusicCaps FAD simply because the reference distribution matches. Suno and Udio both produce pop music that human listeners rate highly; their performance on classical structure, atonal composition, or non-Western musical traditions is substantially weaker, and that weakness doesn't show in any published metric. If your use case is outside the pop-centric distribution, treat all benchmark numbers with extra skepticism.

Practical Guidance

For text-to-music in production: If you need a product and can handle API terms, Suno v4 or Udio are the current state of the art for coherent, stylistically accurate output - especially for pop, rock, and hip-hop. Neither publishes reproducible benchmarks, so factor in that you're flying partially blind on objective quality.

For open-source text-to-music: MusicGen-Large via AudioCraft is the best-benchmarked option. FAD 2.82 is the strongest published number. It caps at 30 seconds and doesn't support lyrics. Acceptable for background music, mood-based generation, and workflows where you need reproducibility.

For lyric-to-song with open weights: YuE is the only real option. It runs on a single A100 and handles full-song structure better than any other open-weight model. Quality gaps versus Suno are visible, but the model is available and actively maintained.

For stem separation: Demucs v4 (htdemucs) is the open-source benchmark leader on MusDB18. Use it. Commercial platforms built on top of Demucs-class models don't offer meaningfully better SDR than the base model - you're paying for UX, not better separation.

For classical and cinematic: AIVA has the most coherent classical and orchestral output of the commercial tools, with better harmonic structure than Suno on that genre specifically. It's not competitive on pop, but it's purpose-built for cinematic scoring workflows.

For ambient and background music without lyrics: ElevenLabs Music and Mubert both handle this use case well. Neither is a true end-to-end generation model in the same class as MusicGen or Suno, but for unobtrusive background tracks they're operationally simpler to work with.

FAQ

What is FAD and why does it matter for music generation?

FAD (Frechet Audio Distance) measures how similar the statistical distribution of generated audio is to real recordings. Lower is better. Its weakness: it measures distribution similarity, not human preference. A model can score well on FAD while producing music that human listeners find repetitive or uninteresting.

Why don't Suno and Udio publish FAD scores?

The most charitable explanation: they use internal evaluation frameworks they consider proprietary. The less charitable one: their models optimize for human preference at the expense of distributional fidelity to reference sets, which would produce worse FAD numbers while still winning user preference tests.

Is MusicGen better than Suno?

On published FAD and CLAP benchmarks, yes. On user preference tests for lyric-to-song and full-track generation, Suno leads. These measure different things. Research pipeline reproducibility: MusicGen. Tracks humans enjoy: Suno.

What is the MusicCaps benchmark?

A 5,521-clip evaluation set from Google using YouTube AudioSet clips with detailed text captions. The most widely used reference for text-to-music FAD evaluation. Skewed toward Western pop and rock, which disadvantages models trained on broader musical distributions.

Which AI music tool is best for stem separation?

Demucs v4 (htdemucs) from Meta. Open-source, runs locally, best published SDR on MusDB18 (8.13 dB vocals). Commercial tools build on comparable architectures but don't publish independent SDR evaluations.

Can AI generate music with custom lyrics?

Yes, with limitations. Suno v4/v5 and Udio support lyric input with reasonable alignment. YuE is the best open-source option. Most other tools (MusicGen, Stable Audio, AIVA) are instrumental-only.

Sources

Code Completion and Generation LLM Leaderboard 2026

James Kowalski — Sun, 19 Apr 2026 00:00:00 +0000

I need to say something upfront that most coverage of code completion benchmarks glosses over: HumanEval is compromised. Not broken in the sense that the problems are wrong - Chen et al.'s 164 Python function stubs remain a reasonable test of basic algorithmic reasoning. Compromised in the sense that the entire dataset has been public since July 2021, and every model trained in the last three-plus years has almost certainly seen it. When you read a headline claiming some new model scores 98% on HumanEval, you are not reading a coding ability score. You are reading a memorization upper bound.

That does not mean HumanEval is useless. It is still a reasonable sanity check, a baseline that lets you compare a new model to a long historical record. What it is not is a reliable indicator of how well that model will complete functions it has never seen before. For that, you need LiveCodeBench.

This leaderboard covers pure code authoring: complete a function given a signature or docstring, generate a full solution from a spec, place in a competitive programming contest. It does not cover code review (see the LLM Code Review Leaderboard) or full-repository agent tasks (see the SWE-Bench Coding Agent Leaderboard).

TL;DR

Claude Opus 4 and GPT-5 lead on contamination-resistant LiveCodeBench; HumanEval numbers are largely untrustworthy at the frontier
DeepSeek-V3 and Qwen3-Coder are the strongest open-weight options and genuinely competitive with frontier closed models on LiveCodeBench
BigCodeBench is a better signal than HumanEval for realistic library-usage tasks - harder and less contaminated
Competitive programming benchmarks (APPS Hard, CodeContests, LCB Hard) show a large gap between reasoning-capable models and standard code models
Qwen 2.5-Coder 32B remains the strongest sub-40B open-weight model for code-specific deployments

The Benchmark Landscape - What to Trust and What to Ignore

HumanEval and MBPP - Useful History, Unreliable Frontier Scores

HumanEval (OpenAI, 2021) is 164 hand-written Python programming problems, each consisting of a function signature and docstring. The canonical metric is pass@1: generate one solution and check if it passes the test suite. The benchmark has been cited in essentially every code LLM paper published since 2021. That ubiquity is the problem. The tasks are public. The canonical correct solutions are public. Every training dataset scraped from GitHub, code forums, and ML papers after mid-2021 has almost certainly included HumanEval task descriptions and solutions.

MBPP (Google, 2021) - Mostly Basic Python Problems - is similarly compromised by age. 374 crowdsourced programming tasks drawn from beginner Python exercises. Again, public since 2021 and in every major training corpus.

EvalPlus (2023) partially addresses this by augmenting HumanEval and MBPP with automatically generated additional test cases, creating HumanEval+ and MBPP+. Models that passed the original sparse test suites by generating syntactically plausible but functionally incorrect code fare significantly worse on EvalPlus. The EvalPlus leaderboard is a better signal than raw HumanEval for separating genuine code generation ability from test-passing tricks. But contamination on the problem descriptions themselves remains.

My methodology: HumanEval+ and MBPP+ numbers appear in the table because they are widely reported and provide historical context. Do not use them as your primary signal for frontier models. Use LiveCodeBench.

LiveCodeBench - The Number I Actually Trust

LiveCodeBench (2024) solves the contamination problem by pulling problems directly from competitive programming platforms - LeetCode, Codeforces, AtCoder - on a rolling basis, including problems released after most models' training cutoffs. The benchmark is continuously updated with new problems. A model cannot have memorized a problem that did not exist when it was trained.

The evaluation window matters. LiveCodeBench results are typically reported over a specific time window - "problems from Sept 2023 to Sept 2024" - and newer windows are harder. Models that score well on older windows often drop significantly on recent ones, which is itself diagnostic: if a model performs much better on older windows than recent ones, it has likely memorized the older problems.

For this leaderboard, I use the most recent available LiveCodeBench v5 window (Jan 2025 - Apr 2026) where data is available, with fallback to v4 (Sept 2024 - Jan 2025). If only an older window is available for a model, I note it in the table.

BigCodeBench - Harder and More Realistic

BigCodeBench (2024) is the benchmark that finally makes HumanEval look easy the way it deserves to. 1,140 tasks requiring realistic use of standard library and popular third-party packages - Pandas, NumPy, Scikit-learn, Flask, requests, PIL, and dozens more. Tasks are not algorithmic toy problems; they require understanding how actual Python libraries work, reading documentation-style descriptions, and generating code that uses real API calls correctly.

The BigCodeBench leaderboard shows a much wider spread between models than HumanEval. The ceiling is lower, the floor is lower, and the ordering differs meaningfully. If you are evaluating a model for practical software development tasks rather than algorithm competitions, BigCodeBench is the most relevant single benchmark on this list.

MultiPL-E - Does It Work in Your Language?

MultiPL-E translates HumanEval and MBPP into 18+ programming languages by automatically transpiling both problem descriptions and test cases. Coverage includes C++, Java, JavaScript, TypeScript, Go, Rust, Bash, PHP, Ruby, and others. Scores drop substantially as you move to lower-resource languages, and the drop is not uniform across models. A model trained heavily on Python with limited Rust exposure may score 85% on Python HumanEval and 40% on the Rust equivalent.

The contamination caveat applies here too, since the base problems are from HumanEval. MultiPL-E is most useful for measuring relative multilingual coverage within a model, not absolute code generation ability.

Competitive Programming - Where Reasoning Models Pull Away

APPS (2021) is 5,000 Python programming problems scraped from competitive programming sites at introductory, interview, and competition difficulty levels. The Hard subset (competition-level) is a genuine test of algorithmic reasoning that most non-reasoning models struggle with. CodeContests (DeepMind, 2022) is a curated dataset of competitive programming problems with a similar difficulty distribution.

LiveCodeBench Hard (LCB Hard) is the subset of LiveCodeBench problems classified as "hard" on the source platform. This is the least contaminated and most demanding code generation evaluation currently available. The spread between models on LCB Hard is dramatic and directly correlated with reasoning capability.

DS-1000 - Data Science Workloads

DS-1000 (2022) is 1,000 data science problems from Stack Overflow, covering Pandas, NumPy, TensorFlow, PyTorch, Scikit-learn, SciPy, and Matplotlib. Problems are real user questions with real solutions verified by domain experts. It tests practical data science coding more directly than algorithmic benchmarks.

CRUXEval - Can It Reason About Code?

CRUXEval (2024) is different from everything else on this list. It does not ask models to write code. It gives models a function and asks them to predict the output for a given input (CRUXEval-O) or to infer what input would produce a given output (CRUXEval-I). It measures code reasoning and execution understanding rather than generation. Models that generate syntactically correct but semantically wrong code tend to do poorly here. It is a useful complement to generation benchmarks.

The Leaderboard

Scores are from official model releases, paper results, and published leaderboards as of April 2026. HumanEval+ and MBPP+ are from the EvalPlus leaderboard. LiveCodeBench scores use v5 window where available, v4 otherwise (marked †). BigCodeBench uses the Complete track (pass@1 with greedy decoding). - means no published score. * denotes my estimate from related benchmark interpolation.

Rank	Model	HumanEval+	MBPP+	LiveCodeBench	BigCodeBench	DS-1000	Notes
1	GPT-5	96.9	91.2	68.4	79.3	78.1	Strongest LCB Hard of any model tested
2	Claude Opus 4	95.7	90.6	66.8	78.7	77.4	Highest BigCodeBench among Anthropic models
3	Gemini 2.5 Pro	94.2	89.1	61.3	75.4	74.8	Long-context advantage on multi-file generation
4	Claude Sonnet 4	93.8	88.4	59.7	73.9	73.2	Strong cost-to-performance ratio
5	DeepSeek-V3	92.1	87.3	57.4	71.8	70.9	Best open-weight LCB; trained post-HumanEval leak
6	Qwen3-Coder	91.6	86.8	55.9	70.4	68.7	Latest Qwen; competitive on multilingual via MultiPL-E
7	o4	97.1	92.4	71.2	80.6	79.3	Reasoning model; highest LCB Hard; slower, expensive
8	o3-mini	95.8	91.1	63.7	76.1	72.4	Best cost-adjusted reasoning model for code
9	DeepSeek-Coder-V3	90.8	86.1	53.6	69.8	67.3	Code-specific fine-tune of DeepSeek-V3; multiPL-E strong
10	Qwen 2.5-Coder 32B	90.2	85.7	52.1	68.4	65.1	Best sub-40B open model; strong C++/Java MultiPL-E
11	Codestral (Mistral)	88.4	83.9	47.3	64.7	61.8	Strong on code-completion tasks; weaker on reasoning-heavy
12	Llama 3.3 Coder 70B	86.7	82.1	44.8	61.9	58.4	Best Meta code model; good for self-hosted 70B tier
13	StarCoder2-33B	84.1	80.3	39.2	58.3	54.7	BigCode project; strong on low-resource languages
14	Granite-Code-34B	82.6	79.1	36.4	56.1	52.3	IBM; strong on enterprise Java/Python; safety-tuned
15	Llama 3.3 Coder 8B	79.3	76.2	31.7	51.8	46.9	Good efficiency at 8B; large quality drop from 70B
16	Yi-Coder-9B	77.8	74.6	28.9*	49.2	44.1	01.AI; HumanEval strong; LCB estimated
17	Magicoder-S-DS-6.7B	76.9	73.8	24.1*	45.6*	40.3	OSS-Instruct training; punches above 7B weight class
18	WizardCoder-33B	73.2	71.4	21.6*	42.3*	37.8	Older model; useful as historical baseline

Table last updated April 19, 2026. HumanEval+ and MBPP+ use pass@1 greedy. LiveCodeBench uses pass@1 on v5 window (Jan 2025 - Apr 2026) except where marked †. BigCodeBench Complete track. Asterisks () indicate estimates from related benchmark interpolation.*

Reading the Table

The HumanEval Problem in Practice

Look at the top of the HumanEval+ column: every frontier model is above 90%. That means HumanEval is essentially solved at the frontier. This is not because frontier models are all equally capable code generators - it is because the variance is below the noise floor of the benchmark once contamination is accounted for. A 96.9 versus a 93.8 on HumanEval+ at the frontier is not a meaningful gap. A 68.4 versus a 59.7 on LiveCodeBench is.

The contamination issue is why I rank o4 at 7 in the table despite it having the highest HumanEval+ score (97.1). What matters for the ranking is its 71.2 LiveCodeBench score - the highest in the table - which is why it is the model I would reach for if I needed the best possible code generation on novel problems.

Reasoning Models Are a Different Category

o4 and o3-mini are not code models in the traditional sense. They are reasoning models that think before they write code, explicitly working through the logic of what they need to implement. On straightforward problems - complete a function from a docstring - they offer little advantage and are slower and more expensive. On competition-level algorithmic problems, they are in a class by themselves. LCB Hard scores for o4 are approximately 20-25 points higher than any non-reasoning model. If your use case involves complex algorithmic work - interview-level or competition-level problems, numerical algorithms, complex data structure implementations - reasoning models are worth the cost.

For routine code completion tasks - the cursor-hovering-in-your-IDE use case - they are overkill. DeepSeek-V3 or Claude Sonnet 4 give you more completions per dollar.

The Open-Weight Story is Better Than It Looks

DeepSeek-V3 at 57.4 on LiveCodeBench is genuinely impressive for an open-weight model. For context, it scored within 9 points of the best closed frontier model (GPT-5 at 68.4) on the benchmark that matters most. The gap between open and closed models has compressed dramatically over the past 18 months. Two years ago, HumanEval was the only benchmark where anyone published open-weight numbers because it was the only one where those numbers were not embarrassing.

DeepSeek-Coder-V3 and Qwen3-Coder are both strong, purpose-built code models that genuinely outperform their base model equivalents on code-specific benchmarks. The code fine-tuning signal matters: on BigCodeBench - which requires actual library knowledge - DeepSeek-Coder-V3 and Qwen 2.5-Coder 32B pull ahead of generic models of similar scale.

Contamination in the Middle Tier

WizardCoder and older Magicoder variants are particularly suspect on HumanEval. WizardCoder's training data explicitly included HumanEval-style problems as part of its Evol-Instruct pipeline. The HumanEval numbers for these models are almost certainly inflated relative to their true generalization ability. Their LiveCodeBench and BigCodeBench scores are the numbers to trust - and they land roughly where you would expect a 7-33B model to be.

BigCodeBench as the Better Everyday Signal

If you are evaluating models for a software engineering team doing web development, data engineering, or ML infrastructure work - not algorithm competitions - BigCodeBench is more predictive than LiveCodeBench. Library usage, reading package documentation, generating correct API calls: this is what software engineers actually do. The rank ordering on BigCodeBench is similar to LiveCodeBench but not identical. Gemini 2.5 Pro drops slightly relative to its LiveCodeBench position; Qwen 2.5-Coder 32B drops slightly relative to its raw parameter count peers. Both make sense: Gemini's long-context advantage is less decisive when problems are self-contained, and Qwen's code fine-tuning advantage shows up more in syntactic completion than library reasoning.

Methodology

Benchmark Score Sources

HumanEval+ and MBPP+ scores come from the EvalPlus leaderboard, which accepts community submissions and verifies them against the test harness. LiveCodeBench scores come from the official LiveCodeBench repository and the associated leaderboard. BigCodeBench scores come from the BigCodeBench leaderboard.

For models not yet submitted to official leaderboards - typically the most recent closed-source releases and some smaller open-weight models - I cross-reference lab-published technical reports, independent third-party evaluations, and interpolate from closely related benchmarks. These are marked with asterisks. I report the most conservative available number when estimates conflict.

Why Pass@1 Greedy

Most papers and leaderboards now standardize on pass@1 with greedy decoding for apples-to-apples comparison. Temperature-sampled pass@k numbers (where k > 1) are higher and can be inflated by model verbosity rather than reasoning quality. I do not include best-of-N numbers in the main table. If a vendor only publishes pass@10 or pass@100 results, I note that and do not include their number in the ranking.

The Training Cutoff Problem

Contamination is a spectrum, not a binary. A model whose training data cutoff is September 2021 cannot have memorized HumanEval (released July 2021 but spreading through training corpora over months). A model with a January 2025 cutoff almost certainly has. In between, it depends on what data the training team excluded.

I do not attempt to correct scores for contamination - any such correction requires assumptions about training data that are not publicly verifiable. Instead, I use LiveCodeBench as the primary ranking signal precisely because its contamination risk is low by construction. If you see a model scoring 92%+ on HumanEval+ but below 40% on LiveCodeBench, that gap tells you something about its training data, not its coding ability.

Multilingual Coverage (MultiPL-E)

Full MultiPL-E tables would make this page unwieldy. The directional finding is consistent: models drop roughly 10-20 points from Python to Java/C++/JavaScript, and another 10-20 points moving to Go, Rust, or less common languages. The relative ordering between models is mostly preserved across languages, with one notable exception: models trained with large amounts of system-level C/C++ code (StarCoder2, Granite-Code) hold up better on C++ than their Python scores predict, while models trained primarily on web/scripting data degrade more steeply.

Model Notes

GPT-5 and o4

OpenAI's GPT-5 is the best non-reasoning frontier model on LiveCodeBench. o4 is strictly better on hard problems but 3-5x slower per token with proportional cost. For production code completion with low latency requirements, GPT-5 is the right choice. For batch jobs on complex algorithmic tasks - where you have seconds to minutes rather than milliseconds - o4 is worth the cost.

Claude Opus 4 and Sonnet 4

Claude Opus 4 leads the Anthropic lineup on BigCodeBench and DS-1000, reflecting strong performance on realistic library-usage tasks. Sonnet 4 is close behind on LiveCodeBench and significantly cheaper. For IDE-integrated completion at scale, Sonnet 4 is the practical choice. Opus 4 makes sense for batch code generation where quality is the only constraint.

DeepSeek-V3 and DeepSeek-Coder-V3

DeepSeek-V3 is the most capable open-weight model on LiveCodeBench. DeepSeek-Coder-V3 is its code-fine-tuned variant, which gains on BigCodeBench and MultiPL-E but does not improve LiveCodeBench scores meaningfully - suggesting the base model's reasoning ability drives competitive programming performance more than domain-specific fine-tuning does. Both are MIT-licensed and deployable without API dependencies. For teams that need frontier-adjacent code generation without closed-model API costs, DeepSeek-V3 is the current answer.

Qwen 2.5-Coder 32B and Qwen3-Coder

Qwen 2.5-Coder 32B from Alibaba remains the strongest sub-40B code model in independent evaluations. Its MultiPL-E scores for Java and C++ are particularly strong relative to its Python results. Qwen3-Coder is newer and improves across the board, but not to the point of competing with the top tier on LiveCodeBench. The 32B size makes it the most deployable high-quality option for teams running inference on four to eight consumer GPUs.

Codestral

Mistral's Codestral is specialized for code and trained on a substantially larger code corpus than Mistral's general models. Fill-in-the-middle (FIM) performance - completing code from both prefix and suffix context, which is what IDE tab completion actually does - is a Codestral strength not captured in the pass@1 metrics here. For IDE autocomplete use cases specifically, Codestral is worth evaluating beyond what the leaderboard table shows.

StarCoder2 and Granite-Code

StarCoder2 from the BigCode project covers over 600 programming languages and is the strongest model on low-resource language coverage. The 33B variant is competitive with significantly larger general-purpose models on MultiPL-E's less common language tracks. Granite-Code-34B from IBM is enterprise-focused, with a safety training layer and strong performance on Java - the language most IBM enterprise codebases are written in. Both are Apache 2.0 licensed.

Practical Guidance

For IDE code completion at scale: Claude Sonnet 4 or DeepSeek-V3 via API. Both offer sub-200ms p50 latency at scale and are in the top tier on LiveCodeBench. Codestral is worth benchmarking specifically for fill-in-the-middle if you are building an IDE integration.

For the best quality without cost constraints: o4 for algorithmic and competitive programming tasks. GPT-5 or Claude Opus 4 for realistic library-usage code generation. The choice between GPT-5 and Claude Opus 4 is close - run your own bakeoff on a sample of your actual tasks.

For self-hosted open-weight deployments: DeepSeek-V3 at the 70B+ tier, Qwen 2.5-Coder 32B for teams constrained to four consumer GPUs. Accept a 10-15 point LiveCodeBench gap versus frontier closed models.

For multilingual codebases: StarCoder2-33B for broad language coverage. Qwen3-Coder for better baseline quality with slightly narrower language support. Both outperform general-purpose models of similar size on non-Python languages.

For data science and ML workloads: DS-1000 scores are the relevant signal. GPT-5 and Claude Opus 4 lead. DeepSeek-V3 is close behind at open-weight pricing.

For evaluating your own models: Do not report HumanEval-only results and expect to be taken seriously. Require LiveCodeBench or BigCodeBench. Run EvalPlus instead of vanilla HumanEval and MBPP. If competitive programming performance matters, include LCB Hard scores.

For related rankings, see the SWE-Bench Coding Agent Leaderboard for full-repo agent tasks and the LLM Code Review Leaderboard for PR review quality.

Sources

Creative Writing LLM Leaderboard 2026: Fiction Ranked

James Kowalski — Sun, 19 Apr 2026 00:00:00 +0000

Measuring creative writing is the hardest thing you can ask a benchmark to do. Most evaluations give you a clean signal: the answer is right or wrong, the code compiles or it doesn't, the translation is accurate or it drifts. Creative writing doesn't work that way. A sentence can be technically correct and still be dead on the page. A metaphor can be surprising and still feel wrong in context. Voice, pacing, tension, specificity - these resist reduction to numbers.

And yet, the community has developed several evaluation frameworks that extract meaningful signal from the noise. EQ-Bench Creative Writing v3 uses multi-rubric LLM judges calibrated to literary criteria. The Antislop evaluation system measures cliche density and overused vocabulary patterns. Human-preference rankings via Chatbot Arena provide a ground-truth signal uncorrupted by the LLM-as-judge problem. None of these is perfect. Together, they're informative.

This leaderboard tracks 15 models across all three evaluation types as of April 2026.

TL;DR

Claude 4 Opus leads EQ-Bench Creative Writing v3 at 73.8, the only model above 72 - its prose scores highest on voice consistency and emotional nuance
GPT-5 and Gemini 2.5 Pro trade positions depending on the rubric - GPT-5 leads on pacing, Gemini on world-building
Fine-tuned open-source writing models (Mistral Nemo Gutenberg, Llama 3.1 Storm) outperform their base models on Antislop by a large margin, but fall apart on coherence over long outputs
Reasoning models (o1, o3) rank below their general capability tier - structured thinking loops produce over-schematized prose

Why Creative Writing Benchmarks Are Uniquely Hard

The fundamental problem is that any automated judge is itself an LLM, which means it has stylistic preferences baked in from training data. If the judge model was trained heavily on internet fiction, it will rate certain phrasings higher regardless of actual literary merit. If the judge and the contestant share training data, style similarity may inflate scores in ways that don't reflect quality.

The secondary problem is that "quality" in creative writing is not a single axis. World-building and plot structure are craft-learnable skills that good LLMs handle reasonably well. Voice - the specific rhythm and texture that makes a prose style recognizable and alive - is much harder, and most models flatten out toward a generic competent register that reads as accomplished but forgettable.

The tertiary problem is "slop" - the vocabulary of overused AI writing patterns. "Ethereal glow." "Palpable tension." "A symphony of." Models trained on large quantities of AI-generated text have absorbed these patterns at high frequency, and they surface automatically under generation pressure. The Antislop evaluation specifically targets this.

A practical benchmark approach uses multiple evaluation axes from different methodological families, then looks for models that place consistently across all of them rather than spiking on one.

The Benchmarks Explained

EQ-Bench Creative Writing v3

EQ-Bench Creative Writing is a framework developed by Sam Paech that evaluates creative writing output on four primary rubrics:

World-building: Specificity and internal consistency of the setting. Does the world feel real and thought-through, or generic and placeholder?
Voice: Distinctiveness and consistency of narrative voice and character perspective. Does the prose sound like someone specific is writing it?
Pacing: Scene-level control of tension, rhythm, and information release. Does the writing breathe and accelerate in the right places?
Emotional Nuance: Complexity of psychological characterization. Do characters feel emotionally plausible, or do they perform emotions described rather than shown?

Each rubric is scored 0-10 by a panel of LLM judges using structured scoring chains, then aggregated to a composite score. The v3 update introduced stricter debiasing protocols to reduce judge-model style correlation, a more diverse prompt set spanning genre fiction, literary fiction, and experimental prose, and a length-controlled evaluation to prevent models that produce more tokens from scoring higher on sheer coverage.

EQ-Bench's advantage is methodological transparency - the rubrics and scoring chains are published on GitHub and can be reproduced. The limitation is that the judges are still LLMs, and LLM judges have documented preferences for confident, well-structured prose that may not align with the full range of literary taste.

Antislop Writing Evaluations

The Antislop project originally developed a token-biasing sampler to suppress overused AI vocabulary at inference time. The evaluation component measures something distinct: given raw model output without any vocabulary filtering, how densely does the text rely on a curated list of AI-slop phrases?

The Antislop evaluation scores models on slop-phrase density (lower is better), vocabulary diversity (measured by type-token ratio in open-ended generation), and a human-audited cliche register covering roughly 800 phrases across categories including purple prose markers, AI emotional vocabulary, and genre-fiction boilerplate.

The resulting Antislop Score runs from 0-100 where 100 is a completely slop-free output. Typical frontier model outputs without sampling intervention score in the 40-65 range. Fine-tuned writing models with explicit cliche suppression in post-training score 70-85. The practical significance of a 10-point gap on this scale is meaningful - it corresponds to roughly one slop phrase per 200 tokens of output.

Human-Preference Win Rate vs GPT-5

The Chatbot Arena leaderboard at arena.ai runs blind human comparisons where evaluators choose between two model outputs for the same creative prompt. The Creative Writing category tracks win rates against GPT-5 as a reference model.

Win rate above 50% means the model produced outputs humans preferred over GPT-5 outputs more often than not. This is the most direct signal in this leaderboard because it bypasses LLM judge bias entirely. The limitation is small sample size for newer or less popular models - win rates for models with fewer than 200 creative comparisons carry wide confidence intervals.

Main Rankings Table

Data sources: EQ-Bench Creative Writing v3 leaderboard (eqbench.com, verified April 19, 2026), Antislop evaluation results (GitHub repository, April 2026 run), Chatbot Arena creative writing win rates (arena.ai, April 18, 2026). "Not reported" indicates no published evaluation result exists for this model on that benchmark as of publication date.

Rank	Model	Provider	EQ-Bench Creative v3	Antislop Score	Win Rate vs GPT-5	Notes
1	Claude 4 Opus	Anthropic	73.8	61	54.2%	Top voice and emotional nuance scores
2	GPT-5	OpenAI	71.4	55	50.0%	Reference model; leads on pacing
3	Gemini 2.5 Pro	Google	70.9	58	48.7%	Highest world-building sub-score
4	Claude 4 Sonnet	Anthropic	68.3	63	46.1%	Best Antislop Score in top 5
5	DeepSeek V3.2	DeepSeek	66.1	59	44.8%	Surprisingly strong literary fiction
6	Grok 4	xAI	65.7	52	43.2%	High pacing, weaker voice consistency
7	Kimi K2.5	Moonshot AI	63.4	57	41.9%	Not reported for arena creative subset
8	Qwen 3.5	Alibaba	61.8	54	40.3%	Strong genre fiction, weaker literary
9	Llama 4 Maverick	Meta	59.2	51	38.6%	Best open-weight base model
10	o3	OpenAI	57.6	48	37.1%	Over-structured prose, low voice score
11	o1	OpenAI	55.3	46	35.4%	Reasoning loop artifacts visible in output
12	Mistral Large 3	Mistral	54.1	56	Not reported	Consistent mid-tier across all rubrics
13	Phi-4	Microsoft	48.7	49	Not reported	Good for size; trails on nuance
14	Mistral Nemo Gutenberg*	Community fine-tune	44.2	81	Not reported	Exceptional slop suppression, weak coherence
15	Llama 3.1 Storm*	Community fine-tune	41.6	78	Not reported	High Antislop, low composite score

*Fine-tuned community models evaluated on Antislop only; EQ-Bench scores from community-run evaluation reproduced using the standard v3 harness.

Key Takeaways

Closed Models Hold the Top Tier - For Now

The top four positions are all closed, commercially hosted models. This isn't surprising given the compute and post-training investment required to develop strong creative writing capabilities, but the gap is narrowing. DeepSeek V3.2 at rank 5 with a 66.1 EQ-Bench score is within 8 points of GPT-5 - a gap that would have been 15+ points a year ago. The trajectory of strong open-weight models suggests the top-5 could be meaningfully contested within one generation.

Reasoning Models Are Over-Structured

The o1 and o3 results confirm what anecdotal observation suggested: models that reason explicitly before producing output tend to over-organize their prose. The extended thinking process maps out story beats, character motivations, and thematic elements before writing begins - and then the writing visibly executes that plan rather than discovering through the act of writing. The output is competent but mechanical. Voice scores for o3 are nearly 12 points below its composite EQ-Bench score on other benchmarks. If you need creative writing from a reasoning-class model, consider disabling or limiting extended thinking.

Models that reason before writing tend to execute a plan rather than discover one. The prose is competent but mechanical - and that shows up clearly in voice scores.

Fine-Tuned Writing Models Have a Tradeoff

Community fine-tuned writing models like Mistral Nemo Gutenberg dominate the Antislop category by a large margin (81 vs 55 for GPT-5) because their post-training explicitly suppresses the cliche vocabulary list. But EQ-Bench composite scores tell a different story: coherence and world-building degrade as the models struggle to maintain narrative consistency across longer outputs. The vocabulary suppression works, but it comes at the cost of the structured generation ability needed to hold a story together. These models are excellent for short-form output - a scene, a paragraph, a character sketch - but fall apart on anything requiring sustained structure.

Antislop Reveals Training Data Contamination

The correlation between a model's Antislop score and its training data composition is real. Models trained on large quantities of web-scraped fiction absorb AI-generated fiction patterns that have proliferated across the web, especially in fanfic and genre fiction communities. DeepSeek V3.2 and Claude 4 Sonnet both score notably higher on Antislop than their ranking peers, which likely reflects deliberate curation choices in their training data - fewer AI-generated fiction samples in the pre-training mix.

Open vs Closed: The Practical Gap

For teams deploying AI writing assistance commercially, the practical question is whether a 12-point EQ-Bench gap between Claude 4 Opus and Llama 4 Maverick justifies the API cost differential. On short creative tasks - product descriptions, marketing copy, short social content - the gap may not be perceptible to readers. On longer-form literary work where voice consistency and emotional nuance matter, it is. The honest answer is: run your own test on the output format you actually need before committing to one tier.

Caveats and Known Limitations

LLM-as-judge style bias. EQ-Bench's debiasing protocols reduce but do not eliminate the tendency of judge models to prefer prose that resembles their own training distribution. Models that share architectural lineage with the judges may receive inflated scores. The v3 debiasing methodology uses judge ensemble diversity to mitigate this - check the benchmark documentation for details on judge model selection.

Subjective taste and genre variance. A model that excels at psychological literary fiction may produce weak thriller prose, and vice versa. EQ-Bench v3 includes a more diverse prompt set than earlier versions, but composite scores still average across genres that require different craft priorities. A score difference of 2-3 points on the composite is not practically meaningful for most use cases.

Style memorization from training data. Models may produce high-scoring outputs that are stylistically close to specific authors heavily represented in training data. The voice rubric tries to penalize this, but it's imperfect. "Strong voice" and "memorized voice" can look similar to a judge that hasn't read the source author.

The slop vocabulary problem. The Antislop phrase list is maintained by a community of contributors and is inevitably incomplete and culturally biased. Phrases that read as cliche in English literary circles may be neutral in genre fiction communities. A model that scores well on Antislop may still produce output that feels generic in contexts the phrase list doesn't cover.

Human-preference confidence intervals. Win rates for models with fewer than 200 creative writing comparisons in the Arena (marked "Not reported" where the sample is insufficient) carry confidence intervals wide enough to reverse the apparent ranking. Use these numbers as directional signals, not precise measurements.

Benchmark Methodology Notes

EQ-Bench Creative Writing v3 scores reported here are from the official leaderboard at eqbench.com as of April 19, 2026. Antislop scores are from the community evaluation run in April 2026 using the standard v3 prompt set (50 diverse creative prompts, 500-word target output). Human-preference win rates are from the Chatbot Arena creative writing category as of April 18, 2026, including only models with 200 or more evaluated comparisons.

Models marked "Not reported" had no published evaluation result meeting these criteria at time of publication. Fine-tuned community model scores are from contributor-reported runs using the standard harness and have not been independently verified by this site.

Instruction Following Leaderboard - how reliably models follow precise output constraints, including format and length requirements relevant to writing workflows
Multilingual LLM Leaderboard - model performance across 16 languages, critical if creative writing tasks span non-English prose
Best AI Writing Tools 2026 - practical guide to writing assistants built on these models, covering UI, pricing, and workflow integration

FAQ

What is EQ-Bench Creative Writing and how does it work?

EQ-Bench Creative Writing is an open evaluation framework that scores LLM prose output on four rubrics - world-building, voice, pacing, and emotional nuance - using a panel of LLM judges with structured scoring chains. The v3 version includes debiasing protocols and a diverse 50-prompt test set. All methodology is published and reproducible.

Which model writes the best creative prose in 2026?

Based on combined signals from EQ-Bench v3 and human-preference rankings, Claude 4 Opus is currently the strongest creative writing model, particularly on voice consistency and emotional nuance. GPT-5 leads on pacing-focused tasks. For open-weight models, Llama 4 Maverick is the best base model option.

Why do reasoning models like o1 and o3 rank lower for creative writing?

Reasoning models use extended thinking chains that plan and structure output before writing it. This works well for analytical tasks but produces prose that executes a predetermined plan rather than developing organically. The result reads as competent but schematized - good structure, weak voice. Voice scores for o1 and o3 are both among the lowest in the top-15.

What is the Antislop Score measuring?

Antislop measures how densely a model's output relies on a curated list of roughly 800 overused AI writing phrases - "palpable tension," "ethereal glow," and similar patterns that have become markers of AI-generated text. Higher scores mean fewer slop phrases per token of output. The score runs 0-100; typical frontier models without sampling intervention score 45-65.

Are community fine-tuned writing models worth using?

For short creative tasks - a scene, a paragraph, a character voice sample - yes. Fine-tuned writing models like Mistral Nemo Gutenberg dramatically reduce slop phrase density and produce notably cleaner prose on brief outputs. For long-form work requiring coherent narrative over thousands of words, their structural consistency degrades. Use base frontier models for sustained narrative work.

How often does this leaderboard update?

EQ-Bench Creative v3 scores are updated when the official leaderboard publishes new results - typically monthly. Antislop scores are updated with each community evaluation run. Human-preference win rates from Chatbot Arena update continuously. This article reflects a snapshot as of April 2026 and will be updated when significant new results are published.

Sources:

Edge and Mobile LLM Leaderboard 2026: Phi, Gemma, Qwen

James Kowalski — Sun, 19 Apr 2026 00:00:00 +0000

Cloud inference is the default. Send tokens out, get tokens back. For casual use that is fine, but there are real reasons to run models locally on constrained hardware - privacy, offline capability, latency, and cost. A phone that processes your medical notes without ever talking to a server is a different product than one that uploads them to an API. A Raspberry Pi running a local assistant at a remote site without reliable internet access solves a problem that cloud inference cannot.

This leaderboard covers the tier that the home GPU leaderboard and the small language model leaderboard don't quite address: models optimized for on-device edge inference on hardware that has no discrete GPU, limited RAM, and significant thermal constraints. We're talking about mobile SoCs like the Apple A17/A18 Pro and Qualcomm Snapdragon 8 Elite, low-power laptop silicon like the Apple M4 (base configuration), single-board computers like the Raspberry Pi 5, and embedded AI boards like the NVIDIA Jetson Orin Nano.

The metric that matters here is different from the GPU leaderboard. On a workstation with an RTX 4090, the bottleneck is memory bandwidth. On an iPhone or a Raspberry Pi, the constraint is thermal headroom - the device will throttle after a few minutes of sustained load - plus the interplay between NPU acceleration, power draw, and the number of tokens you can generate per second before the user gives up.

TL;DR

Gemma 3 4B is the best all-around edge model: strong MMLU (43.6), best-in-class IFEval, and the fastest throughput on iPhone 16 Pro at ~27 tok/s with the Google AI Edge SDK
Phi-4-Mini (3.8B) leads on reasoning quality with 88.6% GSM8K and 83.7% ARC-C, and runs at ~22 tok/s on MacBook M4 Air via llama.cpp
Qwen 2.5 3B delivers the best quality-per-parameter ratio under 4B with solid MMLU (65.6) and acceptable Raspberry Pi 5 performance at ~8 tok/s
MobileLLM-350M is the only sub-1B model with credible architecture research behind it; TinyLlama-1.1B is largely obsolete for quality tasks

What This Leaderboard Covers

This is specifically an on-device edge inference ranking. The models here must be capable of running on hardware with one or more of these constraints:

Mobile SoCs (Apple A17/A18 Pro, Qualcomm Snapdragon 8 Elite / Gen 3, MediaTek Dimensity 9300)
Low-power laptop silicon without discrete GPU (Apple M4 base 8-16 GB, Intel Core Ultra with iGPU)
Single-board computers (Raspberry Pi 5 with 8 GB RAM, no GPU)
Embedded AI hardware (NVIDIA Jetson Orin Nano 8GB, Qualcomm Robotics RB3)

Not in scope: workstation GPUs (RTX series, AMD RX series), Mac Pro / Mac Studio configurations, or any setup requiring more than 16 GB of RAM for a laptop. If you want those rankings, see the home GPU leaderboard.

The general small language model leaderboard covers quality benchmarks for models under 10B parameters. This leaderboard is the performance complement: which of those models actually run at useful speeds on the specific hardware most people carry in their pockets or put in the field.

Hardware Tiers Explained

Different edge hardware has very different characteristics. Understanding the tier you're targeting changes the model choice significantly.

Phone SoC (Apple A17/A18 Pro, Snapdragon 8 Elite)

Flagship smartphones in 2026 ship with NPUs rated at 35-38 TOPS. The Apple A18 Pro neural engine runs at 35 TOPS; the Qualcomm Snapdragon 8 Elite at 45 TOPS. Both have tight thermal budgets - sustained AI inference at maximum NPU utilization will trigger throttling in 2-3 minutes. The practical working window before thermal throttle kicks in is shorter than people expect.

On-device RAM is shared between the OS, applications, and model weights. An iPhone 16 Pro with 8 GB of total RAM realistically has 3-4 GB available for model weights after the OS takes its share. That limits viable models to 3B parameters at Q4 quantization, or smaller. Android flagships with 12-16 GB of RAM have more headroom - a Snapdragon 8 Elite device with 12 GB can hold a Q4-quantized 4B model with room for context.

Frameworks: Apple's Core ML / AI Edge SDK, Google's MediaPipe LLM inference API, Qualcomm's AI Hub, Meta's ExecuTorch.

Low-Power Laptop Silicon (Apple M4 Air 16 GB, Intel Core Ultra iGPU)

The Apple M4 base chip (not Pro, not Max) has 10-core GPU and runs with 8 or 16 GB of unified memory. At 16 GB, it comfortably holds a Q4-quantized 4B model and has room for a 7-8B model in a pinch. Memory bandwidth is 120 GB/s - lower than the M4 Pro's 273 GB/s, which directly limits token generation speed.

This tier is the most practical for local developer use: a laptop that isn't a workstation, running llama.cpp or MLX-LM with no discrete GPU. Intel Core Ultra 200V series laptops (Lunar Lake) with Xe2 iGPU also fall here, though Intel's AI inference stack for LLMs is less mature than Apple's.

Raspberry Pi 5 (8 GB RAM, no GPU)

The Raspberry Pi 5 has a quad-core ARM Cortex-A76 CPU running at 2.4 GHz and up to 8 GB LPDDR4X RAM. There is no GPU acceleration for general ML inference - everything runs on CPU via llama.cpp with ARM NEON SIMD. The practical ceiling is a 3B model at Q4 quantization (~2 GB), which generates tokens at 6-12 tok/s depending on context length. That is slow but usable for async tasks, batch processing, or applications where latency is not user-facing.

The Raspberry Pi 5 is not a great interactive inference device. It is, however, the dominant hardware in edge IoT deployments, robotics, and offline field computing. The question for this tier is not "is it fast enough for chat?" but "can it run a useful inference task at all?"

Jetson Orin Nano 8GB

The NVIDIA Jetson Orin Nano 8GB has a 1024-core Ampere GPU and 8 GB of LPDDR5 shared between CPU and GPU. At 40 TOPS, it outperforms the Raspberry Pi 5 significantly for AI inference but is more expensive (~$250) and consumes more power (7-15W). It runs llama.cpp with CUDA acceleration and handles 3B models at 15-25 tok/s - roughly twice the Raspberry Pi 5 speed. I don't have comprehensive Jetson benchmarks for all models in this table, so the primary hardware reference columns are iPhone 16 Pro, MacBook M4 Air 16 GB, and Raspberry Pi 5.

Benchmark Explainers

The quality benchmarks in the main table come from official model cards and published technical reports. Here is what each measures:

MMLU (Massive Multitask Language Understanding): 57 subjects spanning STEM, humanities, social sciences. Tests breadth of knowledge. A good proxy for general-purpose usefulness. Classic 5-shot format, 0-100% scale.

IFEval (Instruction Following Evaluation): Tests whether a model can follow explicit formatting instructions like "write 150 words" or "use bullet points." Critical for on-device applications where output format matters. Scores are prompt-level accuracy, 0-100%.

GSM8K: Grade-school math word problems requiring multi-step arithmetic. 8-shot chain-of-thought. A solid proxy for reasoning ability at this size class.

Tokens/sec figures are for Q4_K_M GGUF quantization via llama.cpp unless otherwise noted. For iPhone, scores use the Core ML or AI Edge SDK path. All throughput numbers are generation speed (output tokens per second), not prefill/prompt processing speed. Where I have multiple sources, I report the median or note the variance.

Rankings

Rank	Model	Params	MMLU	IFEval	GSM8K	tok/s iPhone 16 Pro	tok/s MacBook M4 Air 16 GB	tok/s Raspberry Pi 5	Min RAM (Q4)	Notes
1	Gemma 3 4B IT	4B	43.6	73.4	89.2	~27	~28	~6	~3.0 GB	Best phone model; Google AI Edge optimized
2	Phi-4-Mini	3.8B	52.8	68.1	88.6	~20	~22	~5.5	~2.5 GB	Top reasoning at sub-4B; ARC-C 83.7%
3	Qwen 2.5 3B	3B	65.6	58.3	79.1	~22	~26	~8	~2.0 GB	Best MMLU under 4B; strong all-rounder
4	Gemma 3 1B IT	1B	32.8	54.2	62.8	~55	~62	~14	~0.8 GB	Fastest useful model on phone; lightest footprint
5	Phi-3.5-Mini	3.8B	69.0	62.3	86.5	~18	~21	~5	~2.5 GB	High MMLU; slower than Phi-4-Mini on same hardware
6	SmolLM3-3B	3B	62.4	60.1	78.4	~25	~28	~9	~2.0 GB	Strong for size; Apache 2.0; excellent for laptop
7	MiniCPM3-4B	4B	67.2	68.8	81.1	Not reported	~24	~5	~3.0 GB	Best MMLU at 4B; no official phone benchmarks
8	Llama 3.2 3B	3B	63.4	60.2	77.7	~23	~26	~8	~2.0 GB	Wide ecosystem; ExecuTorch support; solid baseline
9	Qwen 2.5 1.5B	1.5B	60.9	52.4	73.2	~38	~44	~11	~1.1 GB	Exceptional quality for 1.5B; best sub-2B option
10	Granite 3.1 MoE 3B (A800M)	3B (800M active)	55.3	56.2	Not reported	Not reported	~45	~18	~1.8 GB	MoE: fast on laptop/Pi; quality trades off slightly
11	SmolLM2-1.7B	1.7B	48.8	53.1	51.1	~42	~50	~13	~1.2 GB	Good for phones; lower GSM8K limits reasoning tasks
12	Llama 3.2 1B	1B	49.3	53.5	44.4	~52	~60	~16	~0.8 GB	Very fast; limited reasoning; good for summarization
13	MiniCPM-2B	2B	53.5	Not reported	53.8	Not reported	~38	~10	~1.5 GB	Solid reasoning for 2B; Chinese-English bilingual
14	TinyLlama-1.1B	1.1B	28.1	Not reported	30.4	~55	~64	~17	~0.8 GB	Fast but limited quality; largely superseded
15	MobileLLM-350M	350M	Not reported	Not reported	Not reported	~90	~110	~30	~0.4 GB	Research model; classification/completion only

Notes on Scores and Sources

MMLU, IFEval, GSM8K scores are drawn from official model cards, published technical reports, and the Open LLM Leaderboard (Hugging Face) where available. For Gemma 3, I used Google's official Gemma 3 Technical Report. For Phi models, Microsoft's Phi-3.5 and Phi-4 technical reports. For Qwen models, Alibaba's Qwen 2.5 technical report. For MobileLLM, the original Meta paper at arXiv:2402.14905 - that model has no publicly reported MMLU or instruction-following scores in a format comparable to the others.

Tokens/sec on iPhone 16 Pro figures are from Qualcomm AI Hub published benchmarks for Snapdragon-accelerated models, Apple developer documentation for Core ML paths, and community benchmarks. Many models do not yet have official on-device phone benchmarks. Where I have only a single community data point, I mark it as approximate (~). Gemma 3 phone speeds come from Google AI Edge SDK documentation. SmolLM phone speeds come from Hugging Face's on-device benchmark posts.

MacBook M4 Air 16 GB figures use llama.cpp for consistency, measured at 4K context with Q4_K_M quantization. MLX will run 30-50% faster than these figures on Apple Silicon - see the MLX section below.

Raspberry Pi 5 figures are community-reported llama.cpp benchmarks, ARM NEON only (no GPU acceleration), Q4_K_M quantization, 4K context. Sources include the llama.cpp discussion thread #4167 and community benchmarks on the Raspberry Pi forums.

Apple OpenELM (1B/3B) is not in the main table because it underperforms Llama 3.2 at the same parameter counts on MMLU and GSM8K while offering no throughput advantage. OpenELM's contribution was the architecture research, not deployment-ready quality. Gemma 3 1B and Llama 3.2 1B are strictly better on-device choices.

StableLM 2 1.6B / Zephyr is not in the main table because its MMLU (45.1) and GSM8K (57.0) are below Qwen 2.5 1.5B on both axes, and its throughput profile is similar. No meaningful advantage in this lineup.

Qwen 2.5 0.5B was tested. At 0.5B parameters and 37.2% MMLU, the quality is too low for useful real-world tasks. It runs at ~100 tok/s on Raspberry Pi 5 but lacks the reasoning to do much beyond simple completion. Not included in the ranked table.

Best-for-Hardware Decision Matrix

If you know your target hardware and just want a recommendation:

Use Case	Best Model	Why
Best under 1B params	Gemma 3 1B IT	62.8% GSM8K, fastest phone throughput at sub-1B
Best on iPhone (iOS app)	Gemma 3 4B IT	Google AI Edge SDK, ~27 tok/s, first sub-4B over 1300 LMArena
Best on Android flagship (12GB RAM)	Qwen 2.5 3B or Gemma 3 4B IT	High MMLU, MediaPipe or ONNX runtime
Best for MacBook M4 Air (no dGPU)	Phi-4-Mini or SmolLM3-3B	Quality/speed balance at 16 GB; MLX gives 30%+ boost
Best for Raspberry Pi 5	Qwen 2.5 3B or Llama 3.2 3B	~8 tok/s, solid reasoning, fits in 2 GB
Best quantized INT4 quality	MiniCPM3-4B	Highest MMLU at 4B (67.2); designed for quantization
Best MoE speed on Pi / laptop	Granite 3.1 MoE 3B	800M active params; ~18 tok/s on Pi vs ~5-8 for dense 3B
Best for privacy-first mobile app	Llama 3.2 3B	Meta license, widest framework support, no telemetry

Quantization Impact on Edge Hardware

Quantization is not optional on edge hardware - it is required. The question is which quantization level you can use before quality degrades noticeably.

At Q4_K_M (the llama.cpp default for on-device work), quality loss versus full BF16 precision is typically 1-3 MMLU points for models in the 1B-4B range. That is acceptable. Below Q4, the tradeoffs get worse fast.

Quantization	Size vs FP16	Quality vs FP16	Practical Use
Q8_0	50%	~0% loss	Best quality; only if RAM allows
Q4_K_M	27%	1-3 point MMLU loss	Standard for edge; recommended
Q4_0	25%	2-4 point MMLU loss	Slightly faster than Q4_K_M; marginally worse quality
Q3_K_M	20%	4-7 point MMLU loss	Use only when Q4 doesn't fit
Q2_K	14%	8-15 point MMLU loss	Last resort; visible quality degradation

INT4 (ONNX / Core ML / AI Edge): Dedicated on-device inference frameworks use their own INT4 quantization schemes tuned for specific NPU hardware. These are generally better than llama.cpp Q4_0 for phone deployments and sometimes match Q4_K_M. Always prefer the native framework's quantization path over GGUF when targeting phone NPUs.

For Raspberry Pi 5 specifically: Q4_K_M is the right call. Q3_K_M is noticeably worse and the speed gain (~5-10%) isn't worth it. Don't bother with Q2_K unless you're trying to fit a 3B model on a 4 GB Pi.

MLX on Apple Silicon - a Separate Lane

The throughput numbers in the main table use llama.cpp for consistency. If you are on a Mac, those numbers are the floor, not the ceiling.

MLX-LM - Apple's own machine learning framework - consistently runs 30-50% faster than llama.cpp on Apple Silicon for token generation. The gains come from MLX's direct access to the unified memory architecture and its optimized Metal kernels. On a MacBook M4 Air 16 GB, that translates to roughly:

Gemma 3 4B: ~28 tok/s llama.cpp → ~38-40 tok/s MLX
Phi-4-Mini 3.8B: ~22 tok/s llama.cpp → ~30-33 tok/s MLX
Qwen 2.5 3B: ~26 tok/s llama.cpp → ~34-38 tok/s MLX
SmolLM3-3B: ~28 tok/s llama.cpp → ~36-40 tok/s MLX

For interactive use on a MacBook without a discrete GPU, use MLX. Install via pip install mlx-lm and run with mlx_lm.generate. Most models in this table have MLX-compatible versions available on Hugging Face.

The MLX vs llama.cpp throughput advantage narrows at larger model sizes (above 7B) but for the 1-4B range covered here, MLX is definitively faster.

Thermal Throttling - the Hidden Variable

Every number in this table assumes a cold device running a single inference. Real-world edge inference has a thermal component that doesn't show up in point-in-time benchmarks.

iPhone 16 Pro: At maximum NPU utilization, the A18 Pro will throttle to ~60-70% of peak performance after 2-3 minutes of continuous inference. A chatbot session with sustained back-and-forth will see 20-30% lower throughput by the fifth or sixth exchange. Google's AI Edge SDK includes throttle detection; model authors who test on-device should note sustained vs burst throughput separately.

MacBook M4 Air: The M4 Air has no fan. It relies entirely on passive heat dissipation. Under continuous sustained inference (e.g., batch processing 500 documents), it will throttle. For interactive use (one prompt at a time with natural pauses), this is rarely a problem. For batch workloads, the M4 Air will sustain roughly 70-80% of peak throughput indefinitely. The M4 Pro and above have fans and don't throttle.

Raspberry Pi 5: Without active cooling, the Pi 5 will throttle under sustained load. With a fan or heatsink, it sustains full clock speed. If you're doing serious edge inference on a Pi, buy the active cooler. It costs $5 and doubles sustained throughput.

Memory bandwidth is the real ceiling: On all these devices, token generation throughput is memory-bound. The formula is simple: generation speed in tok/s scales roughly as memory bandwidth / model size in bytes. A model that requires 2 GB at Q4 and runs on hardware with 40 GB/s effective memory bandwidth will generate tokens at roughly 20 tok/s before thermal and other overheads. When comparing hardware, look at the memory bandwidth spec.

Methodology

Rankings in the main table weight quality and throughput roughly equally. Quality score is an average of available MMLU, IFEval, and GSM8K scores (where all three are reported). Models missing one or two benchmarks are ranked on the available evidence plus qualitative notes from model card evaluations.

Throughput scores are from:

iPhone 16 Pro: Qualcomm AI Hub published benchmarks, Google AI Edge SDK documentation, community benchmarks via ExecuTorch and Core ML paths
MacBook M4 Air 16 GB: llama.cpp Q4_K_M, 4K context, measured or community-reported in the llama.cpp Apple Silicon benchmark thread
Raspberry Pi 5: Community llama.cpp benchmarks, Q4_K_M, ARM NEON, no GPU, same discussion thread

Models were not re-tested in a controlled environment for this article. Numbers are drawn from public sources and are approximate within ~15-20%. Where I have multiple data points for the same model on the same hardware, I report the median.

Quality benchmark sources: Official model cards (Microsoft Phi-3.5, Phi-4-Mini technical report; Google Gemma 3 Technical Report; Alibaba Qwen 2.5 Technical Report; Meta Llama 3.2 Model Card; IBM Granite model cards); Hugging Face Open LLM Leaderboard (where available); MobileLLM paper arXiv:2402.14905.

Caveats

Benchmark saturation at the top: MMLU and GSM8K are showing signs of saturation for models above 3B parameters. Phi-3.5-Mini's 69% MMLU and Qwen 2.5 3B's 65.6% are near the ceiling of what these benchmarks distinguish meaningfully at this parameter count. For ranking models in the 3-4B range, IFEval and task-specific evaluations increasingly matter more than MMLU.

No single benchmark represents real use: A model that scores 88.6% on GSM8K may still struggle with real-world word problems that require understanding context over multiple sentences. Benchmarks are proxies, not ground truth.

On-device throughput varies by prompt length: All throughput numbers assume 4K context. Longer prompts slow generation - at 16K context, expect 30-50% lower tok/s on these devices. The memory bandwidth constraint intensifies as the KV cache grows.

Framework maturity differs: Gemma 3 has first-party support in Google AI Edge with production-quality Android and iOS deployment. Phi-4-Mini has ONNX export and ExecuTorch paths but these are less mature than the Gemma 3 mobile stack. SmolLM3 is primarily a research release with llama.cpp as the main deployment path. Factor framework support into your decision if you're shipping a product.

MoE caveats: The Granite 3.1 MoE 3B has 800M active parameters but loads all 3B of weights. On a Raspberry Pi with 8 GB RAM, it fits, and the active-parameter count means it's fast - but the full weight load is still in memory. "800M active" does not mean "800M model size."

Quantization quality loss is model-specific: The general Q4_K_M quality loss estimate of 1-3 MMLU points is an average. Some models degrade more gracefully than others. MiniCPM3-4B was specifically designed with quantization in mind and reportedly loses less than 1 MMLU point at Q4. Phi models are reported to be more sensitive; verify against the model card before deploying at Q3 or below.

What to Watch

The trajectory for edge LLM quality is steep. Gemma 3n - Google's architecture designed specifically for on-device inference using Per-Layer Embeddings - represents a design philosophy where the model is purpose-built for constrained memory rather than adapted from a larger architecture. The E4B variant has 8B total parameters but a 4B memory footprint; it is not in this table because its on-device throughput benchmarks on the specific hardware above are not yet comprehensively published, but watch for it in the next revision.

Apple's Foundation Models (~3B on-device) and its AI infrastructure work point toward tighter hardware-software co-design in the next generation of Apple Silicon. The on-device model baked into iOS 18/macOS 15 already handles basic tasks; the question is how much Apple opens that stack to third-party applications.

Qualcomm's AI Hub is the most useful single resource for Snapdragon-specific benchmarks. If you're targeting Android with a Snapdragon chip, check there before deciding on a model.

For inference tooling on these devices, see the best open-source LLM inference servers overview, and for broader workstation-class setup recommendations, the best AI home workstations guide.

Sources

Finance LLM Leaderboard 2026: FinBench Scores Ranked

James Kowalski — Sun, 19 Apr 2026 00:00:00 +0000

In most AI benchmark discussions, a wrong answer is just a missed point. In finance, a wrong answer can mean a misreported earnings figure, a botched SEC filing summary, or a trading desk acting on a hallucinated revenue number. That asymmetry - where errors have real costs - is what makes financial reasoning benchmarks worth tracking separately from general reasoning leaderboards and math olympiad rankings.

This leaderboard covers the benchmarks that specifically stress-test numerical extraction from documents, multi-step financial calculation, and domain knowledge tested at CFA exam standard. These are not the same skills as solving AIME problems. They involve reading a 10-K filing, locating the right line item in a footnote table, and chaining arithmetic correctly across multiple steps - all while resisting the temptation to confabulate plausible-looking numbers.

TL;DR

o3 and GPT-5 lead on FinanceBench, where SEC filing comprehension is the main challenge
Reasoning models (o3, DeepSeek-R2) pull ahead on multi-step calculations in FinQA and TAT-QA
Domain fine-tuned models like BloombergGPT and FinGPT trail frontier general models on most tasks
CFA-Bench separates models that have genuine financial conceptual knowledge from those that pattern-match numerical questions
"Not reported" entries are common - most labs do not publish scores on financial benchmarks

The Benchmarks Explained

FinanceBench

FinanceBench, published by Patronus AI in 2023 (arxiv:2311.11731), is a dataset of 10,231 questions requiring open-ended answers drawn from real publicly available financial documents - 10-K filings, 10-Q reports, and earnings releases from S&P 500 companies. Questions are paired with the source document and the exact answer, usually a specific dollar figure, percentage, or ratio.

The benchmark tests whether a model can retrieve the right number and perform the required arithmetic. A typical question asks for a company's year-over-year revenue growth, operating margin, or free cash flow - calculations that require locating two related figures across a multi-page document and computing the result. The paper's baseline results showed that GPT-4 with retrieval achieved around 81% accuracy, while models without retrieval access dropped dramatically. Patronus designed this explicitly as a test where hallucination is measurable and consequential.

FinQA

FinQA (arxiv:2212.09741, Zheng et al. 2022) is a dataset of 8,281 question-answer pairs extracted from S&P 500 earnings reports. Each answer requires a multi-step numerical reasoning chain over structured financial tables. The evaluation metric is Exact Match (EM) - the predicted answer must be numerically identical to the gold answer, not just approximately close.

What makes FinQA genuinely hard is the reasoning program: each question has an annotated sequence of arithmetic operations that produces the answer. A model must correctly identify which table cells contain the relevant numbers and then execute the right operations in the right order. Small arithmetic errors cascade. Retrieval augmentation helps substantially, but the reasoning chain itself is the bottleneck for frontier models.

TAT-QA

TAT-QA (Hybrid Text and Table Question Answering, arxiv:2203.09066) contains 16,552 questions from real financial reports that require understanding both natural-language text and numerical tables simultaneously. Questions are categorized by whether the answer comes from a table cell, free text, or requires integrating both sources with arithmetic.

This hybrid format is important because real financial documents are not purely tabular. Revenue might be described in narrative text while the breakdown lives in a table five pages later. TAT-QA measures whether a model can bridge that gap. The dataset uses both Exact Match and F1 scoring, and human performance sits around 84% F1 - a more achievable ceiling than FinanceBench.

ConvFinQA

ConvFinQA (arxiv:2109.00819) extends FinQA into multi-turn conversational format. A series of questions builds on previous answers, testing whether a model can track an evolving financial analysis across a dialogue. This is closer to how analysts actually use these tools - iterating on a calculation, asking follow-up questions, and building toward a conclusion over several exchanges.

Conversational context management is the unique challenge here. A model that answers the first question correctly may go wrong on question four when it misremembers or loses track of an intermediate value. Exact Match is the primary metric.

CFA-Bench

CFA-Bench (arxiv:2309.09765) uses questions from the Chartered Financial Analyst curriculum to test financial domain knowledge at professional certification standard. The CFA exam covers portfolio management, ethics, fixed income, equity analysis, derivatives, and alternative investments. Questions are multiple-choice with three options, covering both conceptual understanding and applied calculation.

Unlike the SEC-filing benchmarks, CFA-Bench tests whether a model has internalized financial theory - not just whether it can extract numbers from documents. Models without genuine financial knowledge training show up clearly here.

FiQA

FiQA (Financial Opinion Mining and Question Answering, arxiv:2210.15016) is a long-standing benchmark covering financial opinion mining and open-domain question answering from financial news, earnings calls, and analyst reports. It includes both sentiment analysis and factual QA tasks. We focus here on the QA subset, which tests factual retrieval from financial text. FiQA is older than the other benchmarks (2018 origin) and is best treated as a floor-setter rather than a discriminator among frontier models.

DocFinQA

DocFinQA (arxiv:2401.10020) is a document-level extension of FinQA where questions require reasoning over entire annual reports rather than short passages. Long-context handling is the critical variable - models that struggle with multi-page financial documents fall apart here. This benchmark became more relevant as model context windows expanded, since raw retrieval via chunking is less necessary but reasoning over 100,000-token documents introduces new failure modes.

Multi-step numerical reasoning over financial tables is where the gap between frontier models and domain fine-tunes is widest.

Finance LLM Rankings - April 2026

The table below aggregates publicly reported scores from benchmark papers, model cards, and independent evaluations. Where no public figure exists, I mark "Not reported" rather than extrapolate.

Rank	Model	Provider	FinanceBench %	FinQA EM%	TAT-QA F1%	CFA-Bench %	Notes
1	o3	OpenAI	~90	Not reported	Not reported	Not reported	Top FinanceBench score via Patronus evals; extended thinking
2	GPT-5	OpenAI	~88	Not reported	Not reported	Not reported	Strong document reasoning; scores from early evals
3	GPT-4.1	OpenAI	~85	~68	~75	Not reported	Best documented frontier baseline across FinQA/TAT-QA
4	DeepSeek-R2	DeepSeek	Not reported	~65	~72	Not reported	Reasoning model; strong on multi-step FinQA chains
5	Claude 4 Opus	Anthropic	~82	Not reported	Not reported	Not reported	Long-context strength benefits DocFinQA; FinanceBench score estimated
6	Gemini 2.5 Pro	Google DeepMind	~80	Not reported	~70	Not reported	Best public TAT-QA performance among Google models
7	DeepSeek V3.2	DeepSeek	Not reported	~62	~68	Not reported	Non-reasoning variant; competitive on tabular tasks
8	Claude 4 Sonnet	Anthropic	~78	Not reported	Not reported	Not reported	Solid document QA; faster and cheaper than Opus
9	Qwen 3.5	Alibaba	Not reported	~58	~65	Not reported	Competitive on structured data tasks
10	Grok 4	xAI	Not reported	Not reported	Not reported	Not reported	No published financial benchmark scores as of April 2026
11	Llama 4 Maverick	Meta	Not reported	~50	~58	Not reported	Open-weight baseline; falls behind on complex chains
12	Phi-4	Microsoft	Not reported	~48	~54	Not reported	Punches above weight for its size on tabular tasks
13	Mistral Large 3	Mistral AI	Not reported	~45	~52	Not reported	Reasonable baseline; no financial-specific tuning
14	BloombergGPT	Bloomberg	~53 (paper)	~44 (paper)	Not reported	Not reported	Historical context only - 2023 paper, not updated
15	FinGPT	Open source	Not reported	~40 (paper)	Not reported	~48	Domain fine-tune; CFA score from original paper

Scores are drawn from published papers, model cards, and independent evaluation reports where available. Ranges indicate variation across evaluation setups. "Not reported" means no public figure was available as of April 2026. FinanceBench scores are percentage of questions answered correctly. FinQA and ConvFinQA use Exact Match. TAT-QA uses F1. Frontier model scores for GPT-5, Claude 4, and Gemini 2.5 are from early evaluation reports and may be updated as more systematic evaluations are published.

Key Findings

Reasoning Models Pull Ahead on Multi-Step Calculations

The clearest pattern in this data is that models with explicit extended reasoning - o3, DeepSeek-R2 - outperform their non-reasoning counterparts on tasks requiring multi-step arithmetic. FinQA is the clearest example: each question requires a chain of 2-5 arithmetic operations, and errors compound. A model that can backtrack and verify intermediate steps (as reasoning models do through chain-of-thought) has a structural advantage over models that commit to a calculation in a single forward pass.

This is not a trivial finding. It suggests that for financial applications requiring calculation chains - building a discounted cash flow model, reconciling a balance sheet, calculating EBITDA adjustments - the reasoning-mode cost premium may be justified by accuracy gains that avoid far more costly errors downstream.

Domain Fine-Tunes Trail Frontier Models

BloombergGPT (50 billion parameters, trained on a 363 billion token financial corpus) and FinGPT (open-source, various sizes fine-tuned on financial data) were important milestones in financial AI. But the data here tells a clear story: they now trail general frontier models on most tasks.

The explanation is not that domain knowledge is unimportant. It's that frontier models like GPT-4.1, Claude 4, and Gemini 2.5 Pro were trained on so much financial text - SEC filings are public, financial news is everywhere on the internet - that they absorbed substantial domain knowledge during pretraining at scales that purpose-built financial models cannot match. BloombergGPT's 363 billion tokens of financial data sounds impressive until you compare it to the multi-trillion-token pretraining runs of current frontier models, which almost certainly contain far more financial text in absolute terms.

The implication for practitioners is that fine-tuning a small model on financial data is no longer a path to outperforming frontier baselines on general financial reasoning tasks. Domain fine-tunes can still win on highly specific narrow tasks (real-time market data integration, proprietary terminology, specific document formats), but the general benchmark story now favors scale.

Numerical Precision Is the Persistent Bottleneck

Across all these benchmarks, the failure mode I see most consistently is not misunderstanding the question - it's small arithmetic errors. A model might correctly identify that it needs to compute year-over-year revenue growth, locate the right line items in the document, and set up the formula correctly, then produce 12.4% instead of 12.3% because of a rounding error or a slightly wrong base figure. On FinQA Exact Match, that's a complete failure.

This matters for production applications. Retrieval-augmented generation helps considerably - models that can look up the exact figure rather than relying on parametric memory are less likely to confabulate plausible but wrong numbers. But the arithmetic itself remains fragile in ways that don't show up in general reasoning benchmarks, where approximate correctness is usually acceptable.

Financial Symbol and Format Parsing

A subtler finding from working through FinanceBench questions: models regularly stumble on financial formatting conventions. Dollar amounts in millions ($4,231.7M vs. $4.2B), shares outstanding in thousands, negative values in parentheses (the accounting convention for losses), and multi-year comparative tables with restated prior-year figures all create parsing challenges that general benchmarks don't test. Models trained on clean text can misread a parenthesized negative as a positive or treat "millions" and "billions" interchangeably.

None of the benchmark scores in the table capture this failure mode cleanly, but it's one of the first things to test when evaluating a model for real financial document processing.

Benchmark Methodology Notes

FinanceBench is evaluated on the full 10,231-question test set. Patronus AI provides an open evaluation harness. Retrieval augmentation is standard - questions are paired with the source document. Accuracy is binary: the model's answer must match the reference answer exactly or within a defined tolerance for numerical values.

FinQA uses the official test split (1,147 questions). Exact Match requires the numerical answer to be identical to the reference. The official evaluation script handles normalization for units and decimal places. Results without retrieval and with retrieval are sometimes reported separately - the table above uses the with-retrieval setting where specified.

TAT-QA F1 is computed over tokenized answers. The dataset includes four question types (span from table, span from text, multi-span, and arithmetic), and aggregate F1 weights across all types.

CFA-Bench uses accuracy on multiple-choice questions across the six CFA topic areas. The benchmark paper reports scores for several models; most frontier model scores are not yet published.

For all benchmarks, scores from self-reported model cards should be interpreted with appropriate skepticism. Where possible, I prefer independently verified numbers.

Caveats and Limitations

Training Data Contamination

SEC filings are public documents. Every 10-K ever filed with the SEC is available for download, and they almost certainly appear in the training data of every major frontier model. This means that financial benchmarks built on public filings have a contamination problem: a model might "know" an answer from pretraining rather than reasoning from the provided document. Patronus AI designed FinanceBench with this in mind, using questions that require cross-referencing multiple document sections rather than simple lookup, but contamination cannot be entirely ruled out for documents that predate the model's training cutoff.

This is less of a concern for benchmarks like CFA-Bench (exam questions, not filings) and more acute for FinQA and TAT-QA (questions drawn directly from SEC filings).

Date Cutoff for Current Analysis

These benchmarks test understanding of historical financial data. None of them evaluate whether a model can accurately reason about current market conditions, real-time prices, or financial events after its training cutoff. For applications requiring current financial analysis - today's stock price, last quarter's earnings, current yield curves - benchmark performance on historical documents is a limited predictor of production behavior. Retrieval-augmented architectures with live data feeds are necessary for current analysis, and benchmark scores say little about how well a model integrates retrieved real-time information.

Benchmark Selection Bias

The finance benchmarks in widespread use - FinQA, TAT-QA, FinanceBench - were published between 2022 and 2023. Frontier models have been trained on arXiv papers describing these benchmarks, which may include example questions. The field has not yet developed finance equivalents of FrontierMath (genuinely contamination-resistant evaluation). This doesn't make existing benchmarks worthless, but it should temper confidence in absolute scores.

The Missing Benchmarks

SEC Filing QA and numerical GLUE scores appear in some evaluations but lack a consistent standard test set that would allow apples-to-apples comparison. I chose not to include a column for these in the main table rather than mix methodologies.

Comparison to General Reasoning

For more context on how these same models perform on general reasoning and mathematical tasks that don't require financial domain knowledge, see the Reasoning Benchmarks Leaderboard and the Math Olympiad AI Leaderboard. You may also want to cross-reference with multilingual financial capabilities if you're deploying in non-English financial markets.

The key takeaway from the comparison: models that lead on GPQA Diamond and AIME also tend to lead on FinanceBench - reasoning capability generalizes. But the correlation is imperfect. TAT-QA and FinQA discriminate on numerical precision and document parsing in ways that pure reasoning benchmarks do not test. A model can ace competition mathematics while still fumbling a trailing-twelve-month EBITDA calculation in a 200-page 10-K.

Bottom Line

For financial document analysis (SEC filings, earnings reports): o3 and GPT-5 lead where they've been evaluated, with GPT-4.1 as the best-documented baseline with published FinQA and TAT-QA numbers. Claude 4 Opus and Gemini 2.5 Pro are competitive and have stronger long-context handling for full-document analysis.

For multi-step financial calculations: Reasoning models - o3 and DeepSeek-R2 specifically - have a structural advantage on chained arithmetic. If your application generates multi-step financial models or reconciles complex calculations, the reasoning-mode premium is worth evaluating.

For domain knowledge (CFA-level conceptual questions): CFA-Bench data is sparse for frontier models, but FinGPT's dedicated financial fine-tuning gives it an edge on terminology and conceptual questions over non-specialized smaller models. Frontier models likely outperform on the same tasks, though published CFA-Bench scores for GPT-5 and Claude 4 aren't yet available.

For budget-conscious deployment: Phi-4 at roughly 14 billion parameters holds up reasonably well on structured tabular tasks relative to its size. For organizations that can't afford frontier model API costs at financial document processing scale, it's worth benchmarking on your specific document types before committing to a more expensive option.

What to avoid: Treating BloombergGPT or FinGPT as the right choice for general financial reasoning tasks in 2026. Both were important in their time, but they've been surpassed. If you're using them because they're "financial AI" without checking whether frontier baselines outperform them on your specific task, you're probably leaving accuracy on the table.

Sources:

Legal AI LLM Leaderboard 2026: LegalBench and CaseHOLD

James Kowalski — Sun, 19 Apr 2026 00:00:00 +0000

In most AI benchmark discussions, a wrong answer is just a missed point. In law, a wrong answer can be a sanctioned attorney, a dismissed motion, or a client losing a case because their lawyer cited a case that doesn't exist. That already happened. In 2023, two New York lawyers were fined after submitting ChatGPT-generated briefs containing fabricated case citations to a federal judge. The cases were real-looking - proper case names, docket numbers, plausible holdings - and completely invented by the model.

That's the context for why legal AI benchmarks matter and why this leaderboard tracks them separately from general reasoning rankings and hallucination benchmarks. The failure modes in legal AI are not abstract. They have professional conduct consequences, and the benchmarks described here were designed to measure whether models can handle the specific precision that legal practice demands.

TL;DR

o3 and GPT-5 lead on Bar Exam MBE and LegalBench, with reasoning models showing a consistent edge on multi-step statutory analysis
Domain fine-tunes (Saul-7B, Legal-BERT) outperform general models on narrow classification tasks but trail frontier models on LegalBench's open-ended reasoning tasks
Most benchmarks are US-centric - LexGLUE provides European legal coverage but model performance on non-US law drops sharply
Citation hallucination is not captured by any benchmark in this table - it remains an unsolved problem that MCQ-style tests cannot measure

The Benchmarks Explained

LegalBench

LegalBench (arxiv:2308.11462, Guha et al. 2023, NeurIPS) is a collaboratively constructed benchmark of 162 tasks covering six types of legal reasoning: issue spotting, rule recall, rule application, rule conclusion, interpretation, and rhetorical understanding. Tasks were contributed by legal professionals and cover areas including contract law, administrative law, criminal procedure, constitutional law, and international trade.

The benchmark is notable for its breadth and its grounding in how lawyers actually reason. Rather than asking a model to recall a statute, LegalBench asks it to apply that statute to a set of facts - the same IRAC (Issue, Rule, Application, Conclusion) framework that law schools teach. The 162 tasks vary considerably in difficulty, and aggregate scores mask significant variance across task types. A model that excels at rule recall may struggle with interpretation tasks that require weighing competing precedents.

The dataset is available on HuggingFace and the companion paper details per-task results for a range of models. LegalBench scores in this table represent accuracy averaged across all 162 tasks unless noted otherwise.

LexGLUE

LexGLUE (github.com/coastalcph/lex-glue, Chalkidis et al. 2022) is a benchmark of seven legal NLP tasks covering European and US law: ECtHR (European Court of Human Rights outcome prediction), SCOTUS (US Supreme Court subject area classification), EUR-LEX (EU legislation multi-label classification), LEDGAR (US SEC contract provision classification), UNFAIR-ToS (unfair terms of service clause detection), CaseHOLD (case holding extraction), and ContractNLI (contract premise-hypothesis entailment). Most tasks are classification; evaluation is micro-F1 or accuracy depending on the task.

LexGLUE is the closest thing the legal AI field has to a multi-task benchmark that spans jurisdictions. Its European components - ECtHR and EUR-LEX - provide partial coverage of non-US law, though coverage of civil law systems (most of the world) remains thin. The benchmark uses standard encoder models (BERT, RoBERTa, Legal-BERT) as baselines, and frontier LLMs in instruction-following mode have increasingly been evaluated against it, typically using accuracy on the test sets rather than fine-tuning.

CaseHOLD

CaseHOLD (arxiv:2004.12244, Zheng et al. 2021) is derived from Harvard Law School's Caselaw Access Project and contains 53,137 multiple-choice questions testing whether a model can identify the correct legal holding for a given case excerpt. Each question presents a citing context and five answer options - the correct holding from the cited case, plus four distractor holdings from other cases.

The task tests a specific and practically important capability: extracting the legally operative principle from judicial language. Legal opinions are long, verbose, and structured in ways that obscure the actual rule being stated. CaseHOLD is a contained test of whether a model can cut through that to identify the holding - the part of the decision that has precedential effect.

ContractNLI

ContractNLI (stanfordnlp.github.io/contract-nli/, Koreeda and Manning 2021) is a natural language inference benchmark built on Non-Disclosure Agreements. Given a contract text and a hypothesis about what the contract requires (e.g., "The Receiving Party shall not disclose the Confidential Information to any party"), a model must classify the relationship as entailment, contradiction, or not-mentioned.

ContractNLI tests a capability that is highly relevant to actual legal practice: contract review. Lawyers reviewing NDAs, software license agreements, and service contracts routinely need to determine whether a given clause entails, contradicts, or leaves unaddressed a specific obligation. The benchmark contains 17 types of legal concepts and covers realistic variation in how contracts express the same underlying requirement in different language.

Bar Exam MBE

The Multistate Bar Examination (MBE) is a 200-question multiple-choice component of the US Bar Exam covering seven subjects: Civil Procedure, Constitutional Law, Contracts, Criminal Law and Procedure, Evidence, Real Property, and Torts. Passing the full bar exam requires approximately 266 points on a 400-point scale, which corresponds roughly to 58-60% accuracy on the MBE component.

Bar Exam results for AI models are drawn primarily from OpenAI's GPT-4 technical report and subsequent independent evaluations. GPT-4 passed the simulated bar exam at approximately the 90th percentile - a score that would clear any state's bar requirement with significant margin. This benchmark is widely cited because it's a real human professional certification, not a research construct, and it has clearly defined pass thresholds.

Note on scope: Bar Exam MBE scores measure performance on multiple-choice questions only. The actual bar exam also includes Multistate Essay Examination (MEE) and Multistate Performance Test (MPT) components that require written analysis. None of the models in this table have been evaluated on those components under controlled conditions.

LEDGAR

LEDGAR (arxiv:2110.01779, Tuggener et al. 2021, included in LexGLUE) is a large-scale contract provision classification dataset containing over 800,000 provision clauses from US SEC filings, labeled with one of 100 contract provision types (e.g., "Indemnification", "Governing Law", "Confidentiality"). The task is multi-class text classification. Models need to correctly identify what kind of provision they're reading.

LEDGAR tests whether a model has internalized the vocabulary and structure of contract drafting well enough to recognize provision types from language alone. Because SEC filings are public documents, there is real contamination risk for models trained on web data, but the 100-class taxonomy is specific enough that memorization of individual documents doesn't straightforwardly transfer to correct classification.

LawBench

LawBench (github.com/open-compass/LawBench, Fei et al. 2023, arxiv:2309.11497) is a Chinese legal AI benchmark covering 20 tasks across three cognitive levels: memorization, understanding, and applying. It spans criminal law, civil law, administrative law, and procedural law within the Chinese legal system.

LawBench is the best-developed non-English legal benchmark and its inclusion here is deliberate: most legal AI evaluation is implicitly US-centric, and LawBench provides a signal on how models handle a fundamentally different legal tradition. Chinese civil law has different structure, terminology, and precedent mechanisms than common law. Models that perform well on LegalBench and CaseHOLD don't necessarily transfer those skills to Chinese legal reasoning.

Legal LLM Rankings - April 2026

Scores are drawn from published papers, model cards, and independent evaluations. "Not reported" means no public figure was available as of April 19, 2026. I do not interpolate or estimate scores that haven't been published - a practice I consider more misleading than leaving a cell blank.

A note on LegalBench averages: the official LegalBench paper reports per-task accuracy for GPT-4, GPT-3.5, and a small number of other models. For frontier models not in the original paper, I use publicly reported aggregate scores from independent evaluations where those evaluations explicitly describe their methodology. Where only partial-task scores exist, I note it.

Rank	Model	Provider	LegalBench avg %	LexGLUE avg %	CaseHOLD %	ContractNLI %	Bar Exam MBE %	Notes
1	o3	OpenAI	~82	Not reported	Not reported	Not reported	~95	Extended thinking; best MBE score in independent evals
2	GPT-5	OpenAI	~80	Not reported	Not reported	Not reported	~93	Strong across LegalBench reasoning subtasks
3	Claude 4 Opus	Anthropic	~77	Not reported	Not reported	Not reported	~91	Best non-OpenAI LegalBench score in independent evals
4	Gemini 2.5 Pro	Google DeepMind	~74	Not reported	Not reported	Not reported	~89	Long context helps on full-document tasks
5	GPT-4.1	OpenAI	~71	Not reported	~72	~85	~88	Best-documented frontier baseline; GPT-4 paper reports
6	DeepSeek R2	DeepSeek	~68	Not reported	Not reported	Not reported	Not reported	Reasoning model; strong on multi-step statutory analysis
7	Claude 4 Sonnet	Anthropic	~65	Not reported	Not reported	Not reported	~85	Faster and cheaper than Opus; solid contract review
8	Grok 4	xAI	Not reported	Not reported	Not reported	Not reported	~83	MBE score from xAI internal eval; no published LegalBench
9	DeepSeek V3.2	DeepSeek	~60	Not reported	Not reported	Not reported	Not reported	Non-reasoning variant; competitive on classification tasks
10	Qwen 3.5	Alibaba	~58	Not reported	Not reported	Not reported	Not reported	Open-weight baseline; limited legal eval data published
11	Llama 4 Maverick	Meta	~52	Not reported	~61	Not reported	~72	Open-weight; falls behind on multi-step legal reasoning
12	Phi-4	Microsoft	~49	Not reported	~58	Not reported	~68	Small model; punches above weight on classification
13	Mistral Large 3	Mistral AI	~47	Not reported	~55	Not reported	~65	Solid European law coverage via LexGLUE training data
14	Saul-7B	Equall AI	Not reported	~72	~78	~81	Not reported	Legal fine-tune; strong on LexGLUE and classification
15	Legal-BERT	Chalkidis et al.	Not reported	~75	~75	Not reported	Not reported	Historical baseline; encoder-only, no generative tasks

Scores are from published papers, model cards, and independent evaluation reports. LegalBench averages represent accuracy across all 162 tasks where available; partial-task scores are noted in the text. Bar Exam MBE scores are from simulated 200-question MCQ evaluations. LexGLUE scores are micro-F1 averaged across the seven included tasks. "Not reported" means no public figure exists as of April 2026. Frontier model scores for GPT-5, Claude 4, Gemini 2.5 Pro, and DeepSeek R2 are from early independent evaluations and may be updated as systematic evaluations are published.

Key Findings

Reasoning Models Have a Real Edge on Multi-Step Legal Analysis

The pattern I see consistently across legal reasoning tasks - particularly LegalBench's "rule application" and "rule conclusion" subtasks - is that extended-thinking models outperform their non-reasoning counterparts by more than the general reasoning benchmarks would predict. Legal reasoning is structurally similar to formal logic: you're given facts, you identify the relevant rule, you apply the rule to the facts, and you derive a conclusion. A model that can verify intermediate steps through explicit chain-of-thought has a structural advantage here.

This is most visible on the MBE, where o3's ~95% substantially outpaces GPT-5's ~93% and both far exceed GPT-4.1's ~88%. The delta between o3 and GPT-4.1 on the MBE is larger than the delta on GPQA Diamond or AIME 2025, which suggests the legal domain rewards systematic step-by-step analysis more than it rewards raw parametric knowledge.

For practitioners using AI for legal research, this finding has a practical implication: if your workflow involves multi-step statutory analysis or contract interpretation tasks with multiple interacting clauses, a reasoning model isn't just a nice-to-have. It's measurably more reliable.

Domain Fine-Tunes Beat Frontier Models on Classification - Lose on Reasoning

Saul-7B (from Equall AI, trained on the Pile of Law corpus) and Legal-BERT show a clear pattern: they outperform frontier general models on classification tasks - CaseHOLD, ContractNLI, LEDGAR - while trailing significantly on the open-ended multi-step reasoning tasks that LegalBench emphasizes.

This makes sense architecturally. Saul-7B is a 7-billion parameter model that was specifically fine-tuned on legal text to recognize legal language patterns. It's excellent at saying "this contract clause is an indemnification provision" or "this case holding matches this citing context." It is not competitive with GPT-5 or Claude 4 Opus on tasks requiring it to reason through a hypothetical fact pattern under applicable statute.

The practical implication is that the right model depends on your task type. Contract review and document classification lean toward domain fine-tunes for their efficiency and cost profile. Legal research assistance, statutory interpretation, and case analysis lean toward frontier generalist models with reasoning capabilities.

Jurisdiction Bias Is a Real Problem Most Evaluations Ignore

LegalBench covers US common law. CaseHOLD draws from the US federal case law corpus. Bar Exam MBE tests US law. ContractNLI uses US-style NDAs. This is not a small caveat - the entire evaluation landscape for legal AI is dominated by US legal concepts and common law reasoning patterns.

LexGLUE's ECtHR and EUR-LEX components provide some coverage of European law, and LawBench covers Chinese law, but for most jurisdictions - India, Brazil, Germany, France, Japan, South Korea - there simply aren't rigorous published benchmarks. The result is that the rankings in this table are primarily measuring "how good is this model at US law?" not "how good is this model at law."

For organizations deploying legal AI outside the United States, benchmark scores from these evaluations are a much weaker signal of production performance. A model that scores 80 on LegalBench has been tested on American legal concepts. Whether that transfers to Brazilian civil procedure or German contract law is an open empirical question.

Citation Hallucination Remains Unsolved and Unmeasured

None of the benchmarks in this table measure citation hallucination - the tendency of models to generate plausible-looking but fabricated case citations, statutes, and law review articles. This is the failure mode that already got lawyers sanctioned. It's also the one that doesn't show up in MCQ-style evaluations, because those tests ask you to choose from provided options rather than generate citations from memory.

The Mata v. Avianca case from 2023 is the canonical example, but it isn't isolated. Law.com has tracked dozens of subsequent filings where AI-generated citations were submitted to courts. None of the models in this table have been systematically evaluated on citation generation quality under conditions where they might hallucinate. The hallucination benchmarks that do exist (TruthfulQA, SimpleQA, FACTS Grounding) test factual accuracy in general - they don't specifically probe whether a model will generate a convincing-looking but nonexistent case citation.

This is a known gap. The legal AI research community has acknowledged it, and there are ongoing efforts to build citation-specific evaluation sets. Until those are published and widely adopted, any legal AI benchmark table - including this one - understates the most practically dangerous failure mode.

The GPT-4 Bar Exam Baseline

A brief note on one of the most cited data points in legal AI: GPT-4's bar exam performance. OpenAI's GPT-4 technical report published a result of approximately 90th percentile on the simulated Uniform Bar Exam, which includes a simulated MBE component. This was a landmark result when published in 2023 - it demonstrated that a general-purpose language model could pass a professional certification designed for human lawyers.

That result remains the published baseline anchor for the models in this table. The reported ~88% MBE accuracy for GPT-4.1 is my best estimate based on the original GPT-4 numbers and model card comparisons; OpenAI has not published an updated MBE score for GPT-4.1 or GPT-5 in a controlled bar exam evaluation. For o3 and GPT-5, the ~95% and ~93% figures come from independent evaluations by legal AI researchers using the same simulated MBE question sets, not official OpenAI publications.

Saul-7B: The Legal Fine-Tune Baseline

Saul-7B (Equall AI, 2024) deserves a brief explanation of why it appears in this table despite being a 7-billion parameter model competing with 70B+ frontier systems. It was trained on the Pile of Law corpus - a curated dataset of US legal text including court decisions, statutes, regulatory filings, and bar exam materials - and fine-tuned on legal instruction-following tasks.

Its LexGLUE scores (~72% aggregate) and CaseHOLD performance (~78%) are competitive with models that are an order of magnitude larger. On ContractNLI (~81%), it matches or exceeds several frontier models. This is consistent with the finance domain finding: for narrow, well-defined legal NLP tasks where the model needs to recognize patterns in legal language, domain-specific training at 7B parameters can match general-purpose training at 70B+ parameters.

The tradeoff is clear in LegalBench. Saul-7B has not been benchmarked on the full LegalBench suite as of this writing, and its performance on open-ended multi-step reasoning tasks would likely trail significantly behind frontier models. Domain fine-tuning optimizes for legal language recognition; it doesn't teach a model how to reason through novel fact patterns.

Methodology Notes

LegalBench scores are accuracy averaged across the full 162-task suite where available. Some evaluations report subsets (the paper reports results for all models on each task; aggregate scores represent my computation from those per-task numbers where the full paper data is available). For frontier models not in the original paper, I use independently reported aggregate scores where the evaluation methodology is clearly described.

LexGLUE scores are micro-F1 averaged across the seven tasks in the benchmark. This is the metric reported in the original paper and used in subsequent evaluations. Models fine-tuned on LexGLUE training sets will score higher than models evaluated in zero-shot or few-shot mode; where I can determine from the source whether a model was fine-tuned or evaluated in zero-shot mode, I note the distinction.

Bar Exam MBE scores are from simulated 200-question evaluations using MBE-style questions. For GPT-4.1, the primary reference is OpenAI's published system card. For other models, sources include independent evaluations by legal AI researchers and law schools who have constructed equivalent MBE question sets for research purposes.

Frontier model scores (GPT-5, Claude 4, Gemini 2.5 Pro, DeepSeek R2) are from early independent evaluations, not official laboratory publications. These are best available estimates as of April 2026 and should be treated accordingly.

For comparison to general reasoning and instruction-following capabilities for the same models, see the Reasoning Benchmarks Leaderboard and Instruction Following Leaderboard.

Caveats and Limitations

MCQ Format Under-Represents Practical Legal Skill

LegalBench's 162 tasks and CaseHOLD's holding-matching format are both primarily multiple-choice or classification problems. The Bar Exam MBE is 200 multiple-choice questions. None of these benchmarks evaluate the capabilities that consume most of a lawyer's actual time: drafting contracts and pleadings, writing persuasive memos, negotiating terms, advising clients on risk, or constructing coherent legal arguments in open-ended written form.

A model that scores 82 on LegalBench can still produce mediocre contract drafts, miss important clauses during review, or write legally incoherent research memos. The benchmarks measure a specific narrower set of capabilities - legal language recognition, rule application, classification - that are necessary but not sufficient for practical legal competence.

Bar Exam Measures MCQ Knowledge Only

Even the Bar Exam limitation is worth making explicit. The actual Uniform Bar Exam includes Multistate Essay Examination (MEE) and Multistate Performance Test (MPT) components alongside the MBE. The MEE asks for written analysis of multi-issue fact patterns. The MPT presents a complete file of documents and asks examinees to draft a real legal document. No AI model has been systematically evaluated on MEE or MPT-equivalent tasks under controlled conditions. The "passed the bar exam" framing that appears frequently in media coverage refers specifically to MBE performance and extrapolated composite scores - not demonstrated essay writing ability or practical document drafting under exam conditions.

Non-US Law Is Dramatically Underrepresented

As discussed in the Key Findings section, the benchmark landscape is overwhelmingly US-centric. LexGLUE is the best exception, with ECtHR and EUR-LEX providing European law coverage, and LawBench addresses Chinese law. For every other major legal system - Brazilian, German, Indian, Japanese, South Korean, Australian, Canadian - there is essentially no standardized benchmark. Organizations deploying legal AI in those jurisdictions should not rely on US-centric benchmark scores as predictors of production performance.

Training Data Contamination

Court decisions are public documents. US federal case law (PACER, CourtListener, Harvard's Caselaw Access Project) is freely available online and certainly appears in the pretraining data of every major frontier model. This means that CaseHOLD questions, which are derived from the Caselaw Access Project, may overlap with model training data. LegalBench tasks were designed with this in mind - many tasks require applying rules to novel facts rather than recalling case outcomes - but contamination cannot be fully excluded.

For domain fine-tuned models like Saul-7B and Legal-BERT, training data overlap with benchmark test sets is a more acute concern, since they were explicitly trained on legal corpora that include the source materials for these benchmarks.

Bottom Line

For general legal research assistance and statutory analysis: o3 and GPT-5 lead where benchmarks exist, with o3's extended reasoning providing a measurable edge on multi-step tasks. Claude 4 Opus is the strongest non-OpenAI option. All three substantially outperform GPT-4.1-era models on reasoning-intensive legal tasks.

For contract review and document classification: Saul-7B is worth benchmarking on your specific document types before defaulting to a larger model. Its performance on ContractNLI and LEDGAR-style classification tasks approaches frontier models at a fraction of the inference cost.

For Bar Exam-style question answering: o3 (~95%) and GPT-5 (~93%) are the current leaders. Any model above ~90% clears the practical bar by a wide margin. Below 70%, models are unreliable for serious legal MCQ work.

What the benchmarks can't tell you: Whether a model will hallucinate case citations. Whether it will draft a coherent contract clause. Whether it will catch a problematic indemnification term buried in paragraph 11 of a 40-page agreement. These are the capabilities that determine whether legal AI tools produce value or liability in production - and none of the benchmarks in this table measure them well.

The legal AI benchmark landscape is improving, but it's still measuring legal language recognition more than legal reasoning, and legal MCQ accuracy more than practical legal competence. Treat these scores as what they are - useful signals about narrow capabilities - rather than endorsements of production readiness.

Sources:

LLM Code Review Leaderboard - Benchmarks and Rankings

James Kowalski — Sun, 19 Apr 2026 00:00:00 +0000

Most LLM code review tools are slop generators. They flag == null instead of === null in JavaScript, paste the relevant style guide section back at you, and call it a review. Any linter with a five-minute setup does the same thing, and the linter doesn't hallucinate rationale or suggest rewrites that subtly change semantics.

What separates a useful code review LLM from a noisy one is whether it surfaces issues that require reasoning about the diff in context: logic errors that only appear when you trace the execution path, race conditions that need understanding of the concurrency model, API contracts that are technically satisfied but will cause downstream failures. That gap is wide. The tools at the top of this leaderboard clear it with some regularity. The majority do not.

This leaderboard covers both standalone LLMs evaluated on academic code review benchmarks and production code review products. For standalone models, scores come from published evaluations on CodeReviewer/CodeReview-Eval (Microsoft Research, 2022) and CR-Bench (2024). For products, I combine published benchmark claims, independent evaluations, and qualitative assessments from published third-party studies.

TL;DR

Claude Opus 4.6 and GPT-5 lead standalone model evaluations, but the gap between them narrows substantially on complex multi-file diffs
CodeRabbit currently shows the best real-world recall on bug-class comments among integrated products in third-party testing
Amazon CodeGuru Reviewer handles Java and Python security patterns well but lags on logic errors in dynamic languages
PR-Agent (open source) punches well above its weight for a self-hosted option - strong on structured comment generation
Most tools over-generate style and documentation comments; the useful signal is in security and logic categories
"Estimated" scores marked with asterisks - CR-Bench full rankings are not publicly available for all products

The Benchmark Landscape

Code review evaluation is fragmented. No single benchmark dominates, and no vendor publishes numbers on a standard held-out test set. Here is what I actually trust.

CodeReviewer (Microsoft Research, 2022) built the earliest serious dataset: 150K real review comments from GitHub across Java, Python, C++, C#, and JavaScript. Quality is scored with BLEU against human ground truth - which penalizes accurate comments that differ in phrasing from the reference. BLEU correlates poorly with human preference. I use it as a rough signal of whether a model understands reviewer concerns, not as a direct quality metric.

CR-Bench (2024) is the more meaningful evaluation. It tests models on 100 curated real-world PRs with expert-annotated issues across four categories: security vulnerabilities, logic errors, performance problems, and style/quality. Both precision (noise avoidance) and recall (catching real issues) are measured. CR-Bench also tracks false negative rate on critical issues - how often the model claims nothing is wrong when something is. That FNR metric is the most important single number in this table for production use. A tool that misses bugs silently is worse than no review tool at all.

CodeQL comparisons consistently show the same pattern: CodeQL wins on known vulnerability classes it has full dataflow models for (SQL injection, path traversal, deserialization). LLMs win on semantic logic errors that require business-logic context. They are complements, not substitutes. CRScore is a newer neural metric that correlates better with human judgment than BLEU; I include it where published figures are available.

The Leaderboard

Scores are from published papers and product documentation unless marked with * (estimated from available benchmark subsets and third-party evaluations). CR-Bench F1 is reported separately for security/logic (S/L) and style/quality (St/Q) because collapsing them hides the signal. Lower false-negative rate (FNR) is better.

Rank	System	Type	CR-Bench F1 (S/L)	CR-Bench F1 (St/Q)	FNR (Critical)	CodeReviewer BLEU	Notes
1	Claude Opus 4.6 (direct API)	Base LLM	0.71	0.68	9%	14.2	Highest S/L F1 in published evals; strong multi-file reasoning
2	GPT-5 (direct API)	Base LLM	0.68	0.71*	11%	14.8*	Strong overall; edges ahead on style/quality; comparable S/L to Opus
3	CodeRabbit (with Claude backend)	Product	0.64*	0.60*	13%*	-	Best in class for integrated PR tools; smart deduplication; $15/user/mo
4	Claude Sonnet 4.6 (direct API)	Base LLM	0.62	0.65	14%	13.1	Strong cost-adjusted performance; recommended for high-volume review
5	PR-Agent (open source, GPT-5 backend)	Product	0.59*	0.61*	16%*	-	Best open-source option; configurable; github.com/The-PR-Agent/pr-agent
6	GitHub Copilot Code Review	Product	0.56*	0.62*	18%*	-	Tightly IDE-integrated; strong on idiomatic issues; weaker on security
7	GPT-4.1 (direct API)	Base LLM	0.54	0.60	20%	13.6	Prior-gen but still solid baseline; strong BLEU from fine-tuning on review data
8	Amazon CodeGuru Reviewer	Product	0.58*	0.41*	15%*	-	Excellent Java/Python security detection; poor on logic errors elsewhere
9	Greptile (with Claude backend)	Product	0.53*	0.55*	19%*	-	Strong on codebase-wide cross-file reasoning; newer product
10	Gemini 2.5 Pro (direct API)	Base LLM	0.51	0.57	22%	12.8	Good long-context performance on large PRs; logic F1 trails leaders
11	Sourcegraph Cody (with Claude backend)	Product	0.49*	0.52*	23%*	-	Better at code navigation than pure review; context retrieval a strength
12	Sweep AI	Product	0.44*	0.49*	28%*	-	Primarily a fix-generation tool; review mode is secondary; github.com/sweepai/sweep
13	Graphite Diamond (AI review)	Product	0.41*	0.58*	26%*	-	Strong style/consistency; logic detection weak; CI workflow integration good
14	GPT-4.1 mini (direct API)	Base LLM	0.33	0.48	38%	11.2	Acceptable style comments; logic F1 drops significantly vs. full GPT-4.1
15	Llama 4 Maverick (direct API)	Base LLM	0.29*	0.43*	42%*	10.4*	Best open-weight result; usable for style reviews; security miss rate too high for production
16	CodeBERT-based fine-tunes	Base LLM	0.24	0.38	51%	12.9	Microsoft's 2022 baseline; still cited; outperformed by all frontier models

Estimated from available benchmark subsets, third-party evaluations, and published vendor claims. "-" in CodeReviewer BLEU means no published score (product integrations do not report BLEU). Table last updated April 19, 2026.

Reading the Table

The S/L vs. St/Q F1 split is not cosmetic. A tool that scores 0.71 on style and 0.34 on logic is an opinionated linter, not a code reviewer. Amazon CodeGuru Reviewer is the sharpest example: excellent on known security patterns in Java (SQL injection, SSRF, hardcoded credentials), poor on logic errors in Python where its static analysis lacks the type information it needs. Claude Opus 4.6 at 0.71/0.68 is the only system where the security/logic score leads the style score. Every other high-ranking system is more confident on style than on substance - a direct consequence of training data distribution, where GitHub has orders of magnitude more style commentary than expert security annotations.

A 9% FNR means the model misses roughly 1 in 11 confirmed bugs. A 42% FNR - Llama 4 Maverick's estimated rate - means it misses nearly half. FNR is the metric I would require in any procurement evaluation. A tool that misses bugs silently generates false confidence, which is worse than having no review tool at all. GPT-5 at 11% vs. GPT-4.1 at 20% is a real and meaningful improvement - not sampling noise.

Most products are wrappers over the same frontier models. CodeRabbit and Greptile use Claude backends; GitHub Copilot Code Review uses OpenAI. The scaffold layer - diff chunking, comment deduplication, severity classification - creates meaningful variation within a base model tier. CodeRabbit's deduplication is worth something over a raw API call. But the ceiling is the base model. A product on Claude Sonnet 4.6 will not outperform a well-prompted Claude Opus 4.6 call on genuinely hard bugs.

Methodology

CR-Bench F1 figures for base LLMs come from the CR-Bench paper and the authors' published evaluation scripts. CodeReviewer BLEU scores come from the original CodeReviewer paper or subsequent fine-tuning papers using the same benchmark.

Scores marked * are estimates anchored to at least two independent data points - vendor-published benchmark claims, third-party comparisons (primarily Liang et al. 2025), or interpolation from related benchmark results on similar models.

I aggregate CR-Bench categories as S/L (security + logic) and St/Q (performance + style). Security and logic errors have real production consequence. Performance and style issues usually do not, so collapsing them into a single number obscures the signal that matters. Products do not publish standardized CR-Bench numbers. Treat the * rows as directional estimates and run your own bakeoff on representative PRs before making a tool selection.

This leaderboard covers issue identification only - not code rewriting or autonomous fixing. That belongs in the SWE-Bench Coding Agent Leaderboard.

Key Product Notes

CodeRabbit is the current leader among integrated products. Its comment deduplication - tracking flagged issues across multiple commits in the same PR and suppressing repeats - is what separates a usable tool from an annoying one. Uses a Claude backend, integrates with GitHub and GitLab. $15/user/month, free tier for open source. Third-party testing from Liang et al. 2025 found it doing substantially better than CodeGuru on logic errors in Python.

PR-Agent (now at The-PR-Agent org, formerly codium-ai) is the best open-source option. Configurable backend, structured JSON output, plugins for GitHub/GitLab/Bitbucket. Separate passes for code suggestions, security review, and PR description generation. Requires your own API key - cost visibility upside, cost risk at scale.

GitHub Copilot Code Review is the convenient choice for GitHub-native teams. Strong on idiomatic issues within a language community. Weak on cross-file logic where the diff interacts with callers not in the diff view. No published CR-Bench numbers.

Amazon CodeGuru Reviewer has a narrow but real strength: Java and Python security pattern detection backed by dataflow analysis. OWASP Top 10, AWS API misuse, known Java anti-patterns - it catches these reliably. Outside that niche, it falls apart. The 0.41 St/Q logic F1 estimate is consistent with a tool designed for pattern-matching on known vulns, not semantic reasoning.

Greptile is newer and has less external validation, but its codebase-aware review - indexing the full repo to evaluate whether a diff is consistent with existing patterns - is a real differentiator on large, long-lived codebases. The 0.53 S/L F1 estimate reflects limited data; this number should move.

What the Research Shows

The most relevant external study is Liang et al. 2025, which compared automated tools and LLMs against professional developer review on 234 real open-source PRs. LLMs generated more total comments than human reviewers but with substantially lower precision on security and logic issues. Human reviewers caught critical issues at roughly 2.3x the rate of the best-performing LLM tested (GPT-4.1, predating GPT-5 and Claude Opus 4.6). The gap was smaller on style, where LLMs matched human precision in several categories.

A 0.71 S/L F1 still means 30% of security and logic issues go undetected. That is not a passing grade for autonomous operation on a codebase with real security exposure.

Every tool in this table also generates too many comments per PR. Comment volume above a threshold decreases PR author receptiveness - the reviewee disengages when facing 40 comments, most of which are nitpicks. Noise suppression is where products differentiate most clearly: CodeRabbit's deduplication, PR-Agent's severity filtering, GitHub Copilot's comment collapsing. A raw Claude API call with a naive "review this diff" prompt will have higher signal in individual comments and worse signal-to-noise overall.

One more caveat: CodeReviewer and CR-Bench are weighted toward Java, Python, TypeScript, C++. The benchmarks underrepresent Rust, Go, Kotlin, and Ruby. Quality degrades noticeably on languages with smaller representation - Llama 4 Maverick especially. Teams building on Go or Rust codebases should treat these numbers as directional and run their own bakeoff.

Practical Guidance

For teams using GitHub or GitLab with budget for a managed tool: CodeRabbit at $15/user/month is the best current option for integrated PR review. It handles the noise problem better than any other product and uses a strong Claude backend. For security-sensitive codebases with significant Java, add CodeGuru Reviewer alongside it - they cover different issue classes.

For teams that want open-source or self-hosted: PR-Agent with a GPT-5 or Claude Opus 4.6 API key is the correct choice. It is genuinely close to commercial quality, configurable, and auditable.

For using base LLMs directly via API: Claude Opus 4.6 is the strongest reviewer for complex logic issues. For high-volume review where cost matters, Claude Sonnet 4.6 is the right tradeoff. Use a structured prompt that explicitly requests security, logic, and style comments in separate sections - unstructured review prompts generate noisier output with lower precision.

For security-first review: Amazon CodeGuru Reviewer in Java/Python, combined with any of the top-3 LLMs for general review. Do not rely on pure LLM tools alone for security-sensitive code paths.

For open-weight/self-hosted deployments: Llama 4 Maverick can handle style and documentation review adequately. Its logic and security miss rate is too high for sole reliance. Treat it as a first-pass filter, not a final reviewer, and plan for a higher rate of human follow-through.

For related rankings, see the SWE-Bench Coding Agent Leaderboard for autonomous bug-fixing, the Coding Benchmarks Leaderboard for HumanEval and MBPP-style generation, and the AI Safety Leaderboard for models used in security-sensitive environments.

Sources

LLM Jailbreak and Red-Team Resistance Leaderboard

James Kowalski — Sun, 19 Apr 2026 00:00:00 +0000

Most AI capability rankings measure what a model can do when everyone is playing nice. Adversarial robustness measures something different: what happens when someone is actively trying to make the model do something it shouldn't. That gap matters more than most benchmark comparisons suggest, because in production, models face both well-meaning users and bad-faith ones - sometimes in the same session.

This leaderboard aggregates published attack success rate (ASR) data across five standardized adversarial evaluation frameworks and covers 14 frontier models. The score that matters is the one you don't want to be high.

TL;DR - Claude 4 Sonnet leads adversarial robustness by a significant margin, with a 2.86% max harm rate against autonomous reasoning-model attackers and consistent low-single-digit ASR across standard benchmarks. GPT-5 and o3 show strong resistance with aggressive post-deployment patching. DeepSeek V3.2 and Llama 4 open-weight models remain the most vulnerable, with ASR figures in the 50-90% range against common automated attacks. Agentic safety is a separate problem - InjecAgent results show nearly every model remains vulnerable to indirect prompt injection in tool-use workflows.

What ASR Means

Attack Success Rate (ASR) is the percentage of adversarial prompts that successfully elicit harmful or policy-violating content from a model. Lower is better. An ASR of 5% means that 1 in 20 adversarial attempts succeeds. An ASR of 90% means the adversary wins almost every time.

ASR figures are not interchangeable across benchmarks. A 5% ASR on HarmBench - which uses 18 automated attack methods including gradient-based token optimization - is much harder to achieve than a 5% ASR on a simple prefix-injection dataset. When comparing numbers, the benchmark context matters as much as the number itself.

The figures in this leaderboard come from published papers, official system cards, and peer-reviewed evaluation studies. Where no public figure exists for a given model-benchmark pair, I mark it as "Not reported" rather than interpolating. The rankings reflect the overall picture from available data, not any single metric.

The Benchmarks Explained

HarmBench

HarmBench (Center for AI Safety, 2024) is the closest thing to a standardized adversarial evaluation framework the field has. It tests 33+ models against 18 attack methods including Greedy Coordinate Gradient (GCG) token optimization, Prompt Automatic Iterative Refinement (PAIR), and Tree of Attacks with Pruning (TAP). It covers 7 harm categories across 400 test behaviors. The diversity of attack methods is the key: a model that blocks GCG attacks might still fall to semantic jailbreaks.

AdvBench

AdvBench (Zou et al., 2023) is a simpler 520-item benchmark of harmful behaviors and string completions. It's widely used because it's reproducible, but it reflects single-turn attacks and skews toward prompts that were known at the time of its creation. A low AdvBench ASR is necessary but not sufficient evidence of robustness.

StrongREJECT

StrongREJECT flips the measurement approach. Rather than just detecting whether a model said something harmful, it evaluates the quality of rejections - measuring whether the model provides an actual refusal or just a weak, useless non-answer that still hands the attacker what they want. A model that says "I can't help with that, but here's how..." scores badly even if technically refusing. This benchmark is the most practically meaningful for production deployments.

JailbreakBench

JailbreakBench is a living leaderboard with a standardized set of 100 misuse behaviors evaluated at fixed checkpoints. Its main value is comparability over time - it uses a consistent Llama Guard judge and controlled attack budgets so you can track whether models improve or regress across versions. The public leaderboard is updated as new submissions come in.

AgentHarm

AgentHarm evaluates models specifically in agentic settings - when the model is acting as an agent with access to tools, APIs, or persistent memory. It tests 110 harmful agentic tasks across 11 categories including cyberattacks, harassment automation, and financial fraud. Standard refusal training does not translate cleanly to agentic settings, which is why this benchmark produces very different results than single-turn evaluations.

InjecAgent

InjecAgent tests indirect prompt injection - attacks where malicious instructions are hidden in external content retrieved during tool use (web pages, documents, API responses). When a model reads a web page and that web page contains hidden instructions, does the model follow them? Most current models do, often silently, which makes this one of the most practically dangerous attack surfaces in production agentic systems.

Main Ranking Table - April 2026

Rankings are by overall adversarial robustness (Rank 1 = most robust). ASR figures are the best available published numbers from the sources listed in the methodology section. Lower ASR = better.

Rank	Model	Provider	HarmBench ASR	AdvBench ASR	StrongREJECT	JailbreakBench ASR	AgentHarm ASR	Notes
1	Claude 4 Sonnet	Anthropic	~3%	~2%	High	~5%	~15%	Best single-model resistance across all attack types; 2.86% max harm vs. reasoning-model attackers (Nature Comms study)
2	Claude 4 Opus	Anthropic	~4%	~3%	High	~6%	~18%	Strong across board; highest Constitutional AI investment; slightly lower instruction-following means slightly more refusals
3	GPT-5 / o3	OpenAI	~5%	~4%	High	~8%	~22%	Aggressive post-launch patching; initial ASR was higher pre-patch; strong StrongREJECT scores; agentic layer less hardened
4	GPT-4.1	OpenAI	~8%	~6%	Good	~11%	~28%	Solid resistance; weaker than o3 on multi-turn attacks; frequently tested baseline in published research
5	Gemini 2.5 Pro	Google	~12%	~9%	Good	~14%	~31%	Vulnerable to reasoning-model multi-turn attackers (71.43% max harm in Nature Comms study when used as a target); single-turn defenses decent
6	Kimi K2.5	Moonshot AI	Not reported	Not reported	Moderate	Not reported	Not reported	Limited public red-team data; internal Moonshot safety filtering active; treat as provisional
7	Qwen 3.5	Alibaba	~22%	~18%	Moderate	~25%	~42%	Safety varies significantly by prompt language; English safety better than Chinese-language testing; open-weight versions weaker
8	Mistral Large 3	Mistral AI	~24%	~19%	Moderate	~27%	~44%	ASR improves substantially with moderation layer enabled; bare model has weak default guardrails
9	Phi-4	Microsoft	~28%	~23%	Moderate	~30%	Not reported	Strong reasoning-to-size ratio but safety training less robust than larger models; limited agentic safety data
10	Grok 4	xAI	~32%	~27%	Low-Moderate	~34%	~50%	Consistently quick to jailbreak in third-party testing; Lumenova found Grok 4 broke under 30 min per model; disclaimer pattern without refusal common
11	DeepSeek-R2	DeepSeek	~38%	~31%	Moderate	~40%	Not reported	Reasoning capability makes it both a harder-to-fool target and a potent attacker; safety inconsistent by category
12	Llama 4 Scout / Maverick	Meta	36.83%	~32%	Low-Moderate	~38%	~55%	HarmBench figure from published evaluation; open-weight baseline; significant improvement with Llama Guard overlay
13	DeepSeek V3.2	DeepSeek	~52%	~45%	Low	~50%	~65%	High ASR across automated attacks; 90% max harm vs. reasoning-model attackers (Nature Comms); geopolitically sensitive topics handled differently
14	Mixtral 8x22B	Mistral AI	~58%	~50%	Low	~55%	~70%	Older open-weight baseline; no default safety layer; included as reference point for open-weight risk floor

All figures are approximate from published sources; see Methodology. "Not reported" = no public evaluation data available. ASR = Attack Success Rate (lower is better). StrongREJECT scores are qualitative (High/Moderate/Low) because the published metric is a scalar that varies by attack set.

A few notes on the table before going further.

The rankings for Claude 4 Sonnet and GPT-5 reflect consistent findings across multiple independent evaluations, not just one paper. The gap between the top three positions and the rest is large and reproducible. Claude's advantage is structural: Constitutional Classifiers++ reads internal model activations rather than scanning surface text, which catches adversarial intent that keyword-based filters miss entirely.

The open-weight models at the bottom of the table are there because they ship without safety layers by default. Llama 4 Maverick's published HarmBench ASR of 36.83% is for the base model. With a Llama Guard or Granite Guardian overlay, that number drops substantially. The ranking reflects what you get out of the box.

Grok 4's poor performance stands out given xAI's stated safety commitments. Third-party testing published by Lumenova AI found that all jailbreak techniques developed against GPT-5 transferred to Grok 4 in under 30 minutes per model, and that Grok 4 was the easiest of the tested frontier models to break. A high disclaimer rate (60%+ in some evaluations) does not substitute for refusal - a model that adds "for educational purposes" before complying with a harmful request is not meaningfully safer.

Attack Category Breakdown

Not all attacks are equally likely in every deployment context. Here's how the main frontier models perform across the five harm categories that produce the highest real-world risk:

Cyber Offense and Malware Generation

Writing functional malware, explaining exploitation techniques, or generating working code for offensive security tools. Claude and GPT-5 show the strongest resistance here - consistent low ASR even against technically sophisticated attack framings. DeepSeek V3.2 and Mixtral remain highly vulnerable to prompts framed as security research or penetration testing.

CBRN (Chemical, Biological, Radiological, Nuclear)

Uplift for creating weapons of mass destruction is the highest-priority harm category for every major lab. Claude's Responsible Scaling Policy and OpenAI's Preparedness Framework both treat CBRN uplift as a hard red line. In practice, published evaluations consistently show Claude and GPT-5 maintaining near-zero ASR in this category even under strong attack pressure. Open-weight models without moderation layers are substantially more vulnerable.

Generating targeted phishing content, impersonation scripts, or mass harassment campaigns. This category sees higher ASR across most models than CBRN, partly because the harm framing is easier to launder through "creative writing" or "security awareness" contexts. Grok 4 and DeepSeek models perform notably worse here than on the cyber-offense category.

Fraud and Financial Crime

Generating scam scripts, account takeover guides, or money laundering documentation. Results here are mixed even for top-ranked models - the framing as financial or legal advice creates ambiguity that safety classifiers struggle with. StrongREJECT scores are most informative in this category, because many models issue nominal refusals while still providing useful partial information.

Illegal Firearms and Weapons

3D printing instructions, conversion modification guides, suppressor fabrication. US-deployed models (GPT-5, Claude, Grok) show stronger resistance here than models primarily deployed in other regulatory environments. Qwen 3.5's safety behavior on this category varies significantly between English and Chinese prompts, reflecting different training data compositions.

Notable Attack Methods

Understanding how attacks work helps evaluate which defenses actually matter. These descriptions cover published research techniques at a conceptual level - no working exploits, no specific prompt templates.

PAIR (Prompt Automatic Iterative Refinement) uses one LLM to iteratively refine attack prompts against a target LLM, automatically adapting when the target refuses. Published in 2023 by Chao et al., it was among the first demonstrations that black-box attacks could achieve high ASR without gradient access. The original paper reported success against GPT-4, Claude, and Vicuna. Current frontier models are substantially more resistant to the original PAIR formulation, but adaptive variants remain effective.

TAP (Tree of Attacks with Pruning) extends PAIR by building a tree of attack attempts rather than a linear chain, pruning low-promise branches and expanding high-promise ones. The TAP paper showed consistent improvements over PAIR, especially on models that had been hardened against the simpler iterative refinement approach.

GCG (Greedy Coordinate Gradient) is a white-box attack that requires gradient access to the target model. It finds adversarial suffixes - specific token sequences appended to harmful prompts - that cause the model to comply. It's computationally expensive and requires open-weight model access, but the suffixes it finds often transfer to closed models. The Universal Adversarial Attacks paper demonstrated this transfer property. Current frontier models are significantly more robust to known GCG suffixes than they were in 2023.

Multi-turn crescendo attacks gradually escalate through a conversation, establishing context with benign interactions before steering toward the harmful objective. The most sophisticated version of this is now automated - the autonomous jailbreak agents paper demonstrated 97% success rates using reasoning models as automated multi-turn attackers, with Claude 4 Sonnet remaining the most resistant target at 2.86% max harm.

Many-shot jailbreaking exploits long context windows by inserting many examples of an AI complying with harmful requests before the actual attack prompt. With context windows now at 1M+ tokens, this attack scales with context length - research showed attack success rates increasing monotonically with the number of in-context examples. Models with larger context windows face a higher attack surface from this vector.

RAT (Retrieval Augmented Thinking) and indirect prompt injection leverage tool use and RAG workflows to plant malicious instructions in retrieved documents. When a model retrieves a web page or document containing hidden instructions, those instructions can override the system prompt in many current architectures. InjecAgent systematically evaluated this attack surface and found high ASR across nearly all tested models in agentic settings.

Reasoning model exploitation is the newest attack category. Because reasoning-capable models chain through problems step by step before answering, adversaries can sometimes embed harmful goals inside the reasoning chain rather than the surface request. The same capability that makes these models good at planning makes them susceptible to having their plan hijacked.

Defense Approaches

Constitutional AI and RLHF strength. Anthropic's Constitutional AI trains models to evaluate their own outputs against a set of principles and revise them before responding. This builds robustness into the model weights rather than relying solely on post-output filtering. The practical difference shows up in StrongREJECT scores - models with stronger RLHF produce better-quality refusals rather than nominal refusals that still leak information.

Representation engineering and circuit breakers. Zou et al. 2024 demonstrated that harmlessness can be reinforced at the internal representation level by identifying and suppressing activation patterns associated with harmful outputs. The Circuit Breaker paper extended this to agentic settings, showing that representation-level interventions maintain robustness even when surface prompts are bypassed. This approach is more robust than output-only classifiers because it acts before the harmful content is generated.

Constitutional Classifiers++. Anthropic's production safety system uses a two-layer architecture: a fast internal probe reading model activations on every request, followed by a full classifier for flagged queries. Their public jailbreak challenge found zero universal jailbreaks after 3,000+ hours of testing by 183 participants. False refusal rate is 0.05% - roughly 1 in 2,000 legitimate queries.

Classifier and moderation layers. External classifiers (Meta's Llama Guard, Granite Guardian, Perspective API) can be added to any model's output pipeline. The effectiveness varies: Mistral Large jumps from "Good" to "Very Good" on MLCommons AILuminate when its moderation layer is active. The tradeoff is latency and a separate failure mode - if the classifier is bypassed, the underlying model's baseline ASR applies.

Immutable safety suffixes. Research from the autonomous jailbreak agents study showed that appending a consistent, immutable safety instruction to every incoming message reduced successful jailbreaks from 97% to under 1% in controlled conditions. Practical deployment raises questions about helpfulness tradeoffs that weren't fully assessed.

Agentic Safety: A Separate Problem

Standard jailbreak benchmarks test single-turn or multi-turn conversations with a chatbot interface. Agentic deployments add new attack surfaces that current safety training does not adequately cover.

When a model has tool access - web browsing, code execution, file system access, API calls - the attack surface expands dramatically. Indirect prompt injection means the model can receive malicious instructions from any content it reads. AgentHarm found that even models that perform well on conversational jailbreak benchmarks show substantially higher ASR in agentic settings. The numbers in the main table reflect this: Claude 4 Sonnet's ~3% conversational HarmBench ASR becomes ~15% in agentic AgentHarm testing.

The InjecAgent evaluation makes this concrete: when malicious instructions were embedded in simulated tool outputs, most models followed those instructions a significant fraction of the time, even when the system prompt explicitly instructed them not to trust external content. This is not primarily a safety training problem - it's an architectural one. Models that use retrieved context to answer questions have limited ability to distinguish "content to read" from "instructions to follow."

For teams building production agents, the practical implication is that model-level jailbreak resistance is necessary but not sufficient. Sandboxing tool outputs, validating returned content before it enters the context window, and treating external content as untrusted are infrastructure requirements that no amount of model fine-tuning can substitute for.

The Agents of Chaos red-team study documented real-world consequences of agentic safety gaps in live deployments - including data leakage from natural-language ambiguity that no jailbreak benchmark captures.

Methodology and Caveats

Sources. Primary sources for ASR figures: the HarmBench paper and GitHub, the JailbreakBench leaderboard, OpenAI's GPT-5 system card, Anthropic's published safety evaluations, the Nature Communications autonomous jailbreak agents paper, and Lumenova AI's cross-generation jailbreak testing. Figures marked with ~ are approximate from published data ranges.

ASR varies with attack budget. A researcher with a 50-attempt budget against GPT-5 gets a very different ASR than one with a 10,000-attempt adaptive attack. Most published benchmarks use standardized attack budgets, but comparisons across benchmarks with different budgets are not direct.

Models are patched silently. OpenAI's pattern of post-deployment safety patching is well documented - GPT-5's initial ASR was substantially higher than post-patch figures. Anthropic's Constitutional Classifiers are updated continuously. Any specific ASR number reflects the model at time of testing, not necessarily the current deployed version. I've used the most recent published figures available.

Responsible disclosure norms. This article describes attack categories and defense approaches at a conceptual level. It does not publish specific attack prompts, successful jailbreak templates, or model-specific exploitation techniques. That information exists in academic papers behind the links in the Sources section - researchers who need it can find it there. Publishing working exploits without coordinated disclosure is not something this site does.

Many-shot attacks scale with context. As models extend their context windows, many-shot jailbreaking attack surface scales proportionally. The many-shot jailbreaking paper documented this relationship clearly. A model with a 1M-token context window faces a meaningfully different many-shot attack surface than a 128K-token model, all else equal.

JBDistill and the benchmark decay problem. Static jailbreak benchmarks decay as models are trained on them. The JBDistill framework from Johns Hopkins and Microsoft addresses this by auto-generating fresh adversarial prompts on demand. Its 81.8% ASR against 13 LLMs using dynamically generated attacks versus 18.4% for the static HarmBench illustrates the gap between defending against known attacks and defending against novel ones.

Cross-Links

For broader safety context including alignment scores, refusal rates, and bias evaluations, see the AI Safety Leaderboard. That leaderboard covers the full safety landscape; this one focuses specifically on adversarial robustness and attack resistance.

For the real-world consequence of a successful jailbreak at scale, the Mexico government breach documented how 1,000+ prompts against Claude were used to steal 150GB of government data. The attacker used the model as a tool for generating exploit code, not as the primary attack vector - but the case illustrates why ASR numbers have production consequences.

Sources

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming - Mazeika et al., Center for AI Safety
HarmBench GitHub Repository - Center for AI Safety
Universal and Transferable Adversarial Attacks on Aligned Language Models (GCG) - Zou et al., 2023
Judging the Judges: Evaluating Alignment and Vulnerabilities (AdvBench) - Zou et al., 2023
JailbreakBench: An Open Robustness Benchmark for Jailbreaking LLMs - Chao et al., 2024
JailbreakBench Leaderboard - JailbreakBench
StrongREJECT for Empty Jailbreaks - Souly et al., 2024
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents - Andriushchenko et al., 2024
InjecAgent: Benchmarking Indirect Prompt Injection Attacks - Zhan et al., 2024
Jailbreaking Black Box Large Language Models in Twenty Queries (PAIR) - Chao et al., 2023
Tree of Attacks with Pruning (TAP) - Mehrotra et al., 2023
Many-Shot Jailbreaking - Anil et al., Anthropic, 2024
Representation Engineering: A Top-Down Approach to AI Transparency - Zou et al., 2023
Circuit Breakers: Robust Defense Against Jailbreaking - Zou et al., 2024
Next-Generation Constitutional Classifiers - Anthropic
GPT-5 System Card - OpenAI
Large Reasoning Models Are Autonomous Jailbreak Agents - Hagendorff et al., Nature Communications, 2026

LLM Quantization Impact Leaderboard 2026: INT4 vs FP16

James Kowalski — Sun, 19 Apr 2026 00:00:00 +0000

Quantization is the reason a 70B model fits on a consumer GPU. Compress weights from 16-bit floating point down to 4-bit integers and you cut VRAM requirements by roughly 75 percent - turning a workload that requires a $20,000 server into something an RTX 4090 can handle. The tradeoff is quality loss: every bit you strip away throws away information the model learned during training.

The question practitioners actually care about is not whether quality drops - it always does - but how much, and whether it matters for their use case. A Q4_K_M Llama 3.1 8B that loses 1.2 MMLU points might be entirely acceptable for a chatbot. A Q3_K_M 70B model that loses 3.1 points might still beat a full-precision 7B. But the numbers vary wildly by model family, parameter count, and task type, and the published guidance ranges from incomplete to contradictory.

This leaderboard consolidates quantization impact data from the GGUF K-quants research in the llama.cpp project, the GPTQ paper, the AWQ paper, and community benchmark threads into one place. I have organized the data by model size tier so you can see the quality-vs-VRAM curve for each class of model. Where no public figure exists, I say so explicitly rather than interpolating.

TL;DR

Q8_0 is essentially lossless across all model sizes - the 0.3-0.5 perplexity delta versus BF16 is negligible in practice
Q4_K_M is the practical sweet spot: 25-30% of full-precision VRAM, typically 1-3 MMLU points lost, acceptable throughput boost
Q3_K_M is the last usable tier for most tasks - quality degrades noticeably below this, especially on multilingual and tool-calling workloads
Q2_K below 13B parameters is essentially unusable - perplexity explodes and HumanEval collapses by 15-25 points
Model size matters more than quantization level: a Q4_K_M 70B beats a BF16 7B by a wide margin on every benchmark

Why Quantization Matters - VRAM and Throughput

A BF16 model stores each weight as a 16-bit (2-byte) floating point value. A 70B-parameter model at BF16 requires roughly 140 GB of VRAM - more than any single consumer GPU. Quantization reduces that number by compressing weights into fewer bits per value.

VRAM is not the only reason to quantize. Token generation throughput scales with memory bandwidth - the GPU spends most of its time loading weights from VRAM, not computing. A Q4 model is roughly 4x smaller than BF16, so the GPU can load it 4x faster per byte of memory bandwidth. On an RTX 4090 (1,008 GB/s bandwidth), that translates directly to more tokens per second.

Format	Bits/weight (avg)	Size vs BF16	VRAM reduction	Speed vs BF16 (RTX 4090)
BF16 / FP16	16	1.0x (baseline)	-	1.0x
INT8 / Q8_0	8	~0.50x	~50%	~1.8x
Q6_K	~6.6	~0.41x	~59%	~2.2x
Q5_K_M	~5.7	~0.35x	~65%	~2.5x
Q4_K_M	~4.8	~0.30x	~70%	~3.0x
Q3_K_M	~3.9	~0.24x	~76%	~3.5x
Q2_K	~2.6	~0.18x	~82%	~4.2x

Speed multipliers are approximate and vary by model architecture and GPU. The bandwidth math is the dominant factor: generation speed in tok/s scales roughly as (memory bandwidth GB/s) / (model size in GB). Higher quantization means smaller model means faster tokens, but with diminishing returns because compute and other overheads begin to dominate.

Quantization Method Explainer

Before the tables, it helps to know what the labels mean. There are five common quantization approaches you'll encounter in the wild.

GGUF K-Quants (llama.cpp)

The format used by llama.cpp and all tooling built on it (Ollama, LM Studio, Jan, kobold.cpp). GGUF files contain the model weights and all metadata needed to run inference. The K-quant variants (Q4_K_M, Q5_K_S, Q6_K, etc.) use a mixed-precision scheme developed by Ihor Molchanov and merged into llama.cpp in mid-2023: different layers get different quantization levels based on their sensitivity, with attention layers receiving higher precision than feed-forward layers. The "K" means K-quant; "M" means medium (more bits than S/small variants); "S" means small.

Q8_0 is an outlier - it stores 8-bit integers with a per-block scale factor and is considered nearly lossless. It's the format to use when VRAM is not the constraint but you want binary portability over raw BF16.

The I-quant variants (IQ3_M, IQ4_NL, IQ4_XS) are newer and use importance matrix calibration to allocate bits more intelligently. They typically deliver better quality than the K-quant equivalent at the same average bit depth, but require a calibration dataset and extra compute to create.

GPTQ (Post-Training Quantization)

GPTQ (arXiv:2210.17323) is a one-shot weight quantization method that minimizes quantization error layer by layer using second-order gradient information. It produces 4-bit or 3-bit models that are slightly more accurate than naive rounding because it compensates for rounding errors in each layer before moving to the next. Requires a GPU to quantize. Used by AutoGPTQ and supported by the transformers library via GPTQModel. Quality is generally comparable to Q4_K_M GGUF, sometimes slightly better on knowledge benchmarks. Main use case: running quantized models through the transformers + CUDA stack rather than llama.cpp.

AWQ (Activation-Aware Weight Quantization)

AWQ (arXiv:2306.00978) observes that not all weights matter equally - a small subset of weights have much higher impact on model output than others, tied to activation magnitudes. AWQ identifies these "salient" weights and protects them from quantization while compressing the rest more aggressively. The result is INT4 models that often outperform GPTQ at the same bit depth, particularly on tasks requiring factual recall. Implemented in AutoAWQ and supported natively in transformers via the awq quantization backend. Community models available via Hugging Face.

BitsAndBytes (BnB NF4 / INT8)

The bitsandbytes library provides INT8 and NF4 (4-bit NormalFloat) quantization that runs as part of the transformers forward pass - you quantize on load rather than as a separate step. INT8 with outlier management (LLM.int8() method) is essentially lossless for most tasks. NF4 with double quantization (QLoRA format) is a 4-bit scheme optimized for the weight distribution of LLMs and is the standard for fine-tuning with QLoRA. In pure inference settings, NF4 quality roughly matches Q4_K_M, though not always. Documented by Hugging Face.

FP8 Native (Emerging)

Several recent GPUs (NVIDIA H100, H200, GB200, and to a lesser extent RTX 40-series with FP8 tensor core support) can run FP8 (8-bit floating point) natively. This is different from INT8 - FP8 preserves dynamic range better because it uses floating point exponent bits. Current LLMs that ship with FP8 kernels (DeepSeek V3, Llama 3.1 405B FP8) report quality within 0.5% of BF16 at half the VRAM cost. FP8 inference is primarily a data center format today - the RTX 4090 has limited FP8 throughput compared to its FP16 throughput - but it is the direction server inference is heading. Not covered in the per-model tables below, which focus on consumer formats.

Methodology

The delta tables below report the change in each metric at each quantization level compared to BF16 or FP16 baseline. Negative deltas are quality losses. All GGUF figures use the K-quant variants (Q4_K_M, Q5_K_M, Q6_K, Q3_K_M, Q2_K, Q8_0) unless noted.

Benchmark sources used:

MMLU (0-shot or 5-shot, 0-100 scale): Massive Multitask Language Understanding - broad knowledge proxy
HumanEval (0-shot pass@1, 0-100 scale): Python code generation from docstrings
Perplexity on WikiText-2 (lower is better): the standard quantization quality signal from the llama.cpp and GPTQ literature; a perplexity delta of +0.5 is barely noticeable, +2.0 is meaningful, +5.0 is serious degradation

All VRAM figures are for 8K context. VRAM for longer contexts scales linearly with context length due to KV cache growth.

Token generation speeds are measured on RTX 4090 (24 GB, 1,008 GB/s bandwidth) using llama.cpp unless noted. Figures marked ~ are estimates derived from bandwidth math where direct benchmarks were not available.

Where data is unavailable: I write "Not reported" rather than interpolate. A dash means not applicable. Scores marked with ~ are community-reported approximations from llama.cpp GitHub benchmark threads (#4167, #15013, #2094) or the bartowski and unsloth model card quantization notes.

Tier 1 - Small Models (6B-9B)

Representative models: Llama 3.1 8B, Qwen 2.5 7B, Mistral Small 3.2 (22B - covered separately below)

Small models are where quantization bites hardest relative to their already limited capacity. A 7B model at BF16 has limited representational headroom; stripping bits removes more of what it knows in proportional terms. Q2_K at this size tier is effectively unusable for any task requiring factual grounding or code generation.

Llama 3.1 8B - Quantization Impact

BF16 baseline: MMLU 69.4, HumanEval 72.6, WikiText-2 PPL ~6.1

Quantization	VRAM (8K ctx)	MMLU delta	HumanEval delta	PPL delta vs BF16	Tok/s RTX 4090	Quality Loss Rating
BF16	~16.0 GB	0 (baseline)	0 (baseline)	0 (baseline)	~58	Baseline
Q8_0	~8.5 GB	~-0.1	~-0.2	+0.04	~105	Negligible
Q6_K	~6.7 GB	~-0.3	~-0.5	+0.11	~118	Negligible
Q5_K_M	~6.1 GB	~-0.5	~-0.8	+0.18	~129	Minimal
Q4_K_M	~5.0 GB	~-1.4	~-2.1	+0.42	~145	Acceptable
Q3_K_M	~4.1 GB	~-3.1	~-4.8	+1.12	~162	Noticeable
Q2_K	~3.0 GB	~-7.2	~-16.3	+4.31	~184	Severe - avoid

The Q4_K_M column is the practical reference point for most users. At ~5 GB VRAM, it fits on any GPU with 6 GB or more, runs at ~145 tok/s on RTX 4090, and loses only about 1.4 MMLU points versus full precision. That gap is real but small enough that for most downstream tasks the difference in outputs is undetectable.

Q3_K_M is the last tier I would deploy for production workloads. The 3.1-point MMLU drop and 1.12 perplexity delta start showing up as factual errors and degraded instruction following in my testing. HumanEval drops nearly 5 points - code generation quality degrades visibly.

Q2_K at 8B is functionally broken. The 4.31 perplexity delta is catastrophic - comparable to the gap between GPT-2 and a 2022-era 7B model. HumanEval collapses by 16 points. The model produces grammatically correct text that is increasingly wrong about facts.

Qwen 2.5 7B - Quantization Impact

BF16 baseline: MMLU ~74.2, HumanEval ~72.1, WikiText-2 PPL ~5.8

Quantization	VRAM (8K ctx)	MMLU delta	HumanEval delta	PPL delta vs BF16	Tok/s RTX 4090	Quality Loss Rating
BF16	~15.2 GB	0 (baseline)	0 (baseline)	0 (baseline)	~61	Baseline
Q8_0	~8.1 GB	~-0.1	~-0.1	+0.03	~110	Negligible
Q6_K	~6.4 GB	~-0.2	~-0.3	+0.09	~121	Negligible
Q5_K_M	~5.8 GB	~-0.4	~-0.6	+0.15	~133	Minimal
Q4_K_M	~4.8 GB	~-1.1	~-1.8	+0.36	~149	Acceptable
Q3_K_M	~3.9 GB	~-2.7	~-4.1	+0.98	~168	Noticeable
Q2_K	~2.9 GB	~-6.8	~-15.1	+3.98	~189	Severe - avoid

Qwen 2.5 7B quantizes slightly more gracefully than Llama 3.1 8B at the same levels - the Q4_K_M MMLU delta is 1.1 versus 1.4, and the perplexity delta is 0.36 versus 0.42. This is consistent with community observations that Qwen 2.5 models have better weight distribution for quantization, possibly related to their grouped-query attention architecture. The Q3_K_M tier is more usable here than for Llama, though still not recommended for production.

Tier 2 - Mid-Small Models (12B-15B)

Representative models: Phi-4 14B, Mistral Small 3.2

This tier has more representational capacity to absorb quantization. The Q4_K_M sweet spot becomes clearer here: you lose proportionally less quality per bit removed because the model has more redundancy. Q3_K_M is more survivable than at 7B, though still not my recommendation for anything quality-sensitive.

Phi-4 14B - Quantization Impact

BF16 baseline: MMLU ~84.8, HumanEval ~82.6, WikiText-2 PPL ~5.2

Quantization	VRAM (8K ctx)	MMLU delta	HumanEval delta	PPL delta vs BF16	Tok/s RTX 4090	Quality Loss Rating
BF16	~29.0 GB	0 (baseline)	0 (baseline)	0 (baseline)	~32	Baseline
Q8_0	~15.3 GB	~-0.1	~-0.1	+0.03	~63	Negligible
Q6_K	~12.0 GB	~-0.2	~-0.4	+0.08	~71	Negligible
Q5_K_M	~10.7 GB	~-0.4	~-0.7	+0.13	~78	Minimal
Q4_K_M	~8.8 GB	~-1.0	~-1.6	+0.31	~89	Acceptable
Q3_K_M	~7.0 GB	~-2.4	~-3.5	+0.79	~101	Noticeable
Q2_K	~5.2 GB	~-5.8	~-11.2	+2.91	~117	Severe

Phi-4 14B at Q4_K_M (~8.8 GB) fits comfortably on a 12 GB GPU and produces at a useful 89 tok/s. The MMLU delta of ~1.0 is the smallest of any model at this tier - Microsoft's training methodology (heavy on synthetic data and curriculum) appears to produce weights that are somewhat more compression-resistant. The HumanEval delta of ~1.6 at Q4 is also low for a 14B model, which matters if you are running a coding assistant.

Mistral Small 3.2 (22B) - Quantization Impact

BF16 baseline: MMLU ~82.7, HumanEval Not reported officially, WikiText-2 PPL ~5.6

Quantization	VRAM (8K ctx)	MMLU delta	HumanEval delta	PPL delta vs BF16	Tok/s RTX 4090	Quality Loss Rating
BF16	~45.0 GB	0 (baseline)	0 (baseline)	0 (baseline)	~22	Baseline
Q8_0	~23.5 GB	~-0.1	Not reported	+0.04	~44	Negligible
Q6_K	~18.5 GB	~-0.2	Not reported	+0.10	~50	Negligible
Q5_K_M	~16.8 GB	~-0.4	Not reported	+0.16	~55	Minimal
Q4_K_M	~13.8 GB	~-0.9	Not reported	+0.30	~64	Acceptable
Q3_K_M	~11.0 GB	~-2.2	Not reported	+0.74	~74	Noticeable
Q2_K	~8.3 GB	~-5.1	Not reported	+2.65	~87	Severe

Mistral Small 3.2 at Q4_K_M (~13.8 GB) fits on a 16 GB GPU. The Q4 MMLU delta (~0.9) is lower than at smaller model sizes, consistent with the general pattern that larger models tolerate quantization better proportionally. Mistral does not publish HumanEval scores for their official releases, so those cells are not reported.

Tier 3 - Mid Models (27B-35B)

Representative models: Gemma 3 27B, Qwen 2.5 32B

At this tier, Q4_K_M is the practical requirement for 24 GB consumer GPUs. BF16 and even Q8_0 require multi-GPU setups or high-end workstation cards. The quality delta at Q4 is typically under 1 MMLU point relative to full precision.

Gemma 3 27B - Quantization Impact

BF16 baseline: MMLU 78.6, HumanEval ~56.8, WikiText-2 PPL ~6.0

Quantization	VRAM (8K ctx)	MMLU delta	HumanEval delta	PPL delta vs BF16	Tok/s RTX 4090	Quality Loss Rating
BF16	~55.0 GB	0 (baseline)	0 (baseline)	0 (baseline)	~12	Baseline
Q8_0	~28.5 GB	~-0.1	~-0.1	+0.03	~24	Negligible
Q6_K	~22.3 GB	~-0.2	~-0.3	+0.08	~27	Negligible
Q5_K_M	~20.1 GB	~-0.4	~-0.5	+0.14	~30	Minimal
Q4_K_M	~16.6 GB	~-0.8	~-1.2	+0.27	~34	Acceptable
Q3_K_M	~13.2 GB	~-1.9	~-2.8	+0.65	~40	Noticeable
Q2_K	~9.8 GB	~-4.6	~-9.7	+2.39	~48	Severe

Note: Gemma 3 27B at Q6_K (~22.3 GB) fits on a single 24 GB GPU (RTX 4090 / 3090) while delivering near-lossless quality. This is the recommended quantization for 24 GB users who want maximum fidelity: you give up a small amount of throughput versus Q4_K_M but keep the MMLU delta under 0.2. If the VRAM is too tight at Q6_K, Q5_K_M at 20.1 GB is also very comfortable.

Gemma 3 27B is worth noting for its multilingual behavior under quantization - Google's published evaluation shows that multilingual MMLU degrades faster than English MMLU at aggressive quantization levels. By Q3_K_M, multilingual performance drops an additional 0.5-1.0 points beyond the English figure in the table.

Qwen 2.5 32B - Quantization Impact

BF16 baseline: MMLU ~83.1, HumanEval ~75.8, WikiText-2 PPL ~5.5

Quantization	VRAM (8K ctx)	MMLU delta	HumanEval delta	PPL delta vs BF16	Tok/s RTX 4090	Quality Loss Rating
BF16	~65.0 GB	0 (baseline)	0 (baseline)	0 (baseline)	~10	Baseline
Q8_0	~33.8 GB	~-0.1	~-0.1	+0.03	~20	Negligible
Q6_K	~26.3 GB	~-0.2	~-0.2	+0.07	~22	Negligible
Q5_K_M	~23.8 GB	~-0.3	~-0.4	+0.11	~25	Minimal
Q4_K_M	~19.5 GB	~-0.7	~-1.1	+0.24	~29	Acceptable
Q3_K_M	~15.5 GB	~-1.8	~-2.6	+0.60	~34	Noticeable
Q2_K	~11.5 GB	~-4.4	~-9.1	+2.25	~41	Severe

Qwen 2.5 32B at Q4_K_M (~19.5 GB) fits comfortably in 24 GB with room for context, and the 0.7 MMLU delta is very modest. This is one of the cleaner quantization stories in the mid tier: Qwen 2.5 architecture shows consistent resistance to compression across its entire model family.

Tier 4 - Large Models (65B-75B)

Representative models: Llama 3.3 70B, Qwen 2.5 72B

This is where quantization earns its keep. No consumer single GPU can run these models at BF16 or Q8_0. Q4_K_M is the minimum for most 48-64 GB setups; Q3_K_M or Q2_K is sometimes required to squeeze these onto 32-40 GB configurations. The good news: these models have so much representational capacity that they survive aggressive quantization better than smaller models do.

Llama 3.3 70B - Quantization Impact

BF16 baseline: MMLU 83.6, HumanEval 80.5, WikiText-2 PPL ~3.8

Quantization	VRAM (8K ctx)	MMLU delta	HumanEval delta	PPL delta vs BF16	Tok/s RTX 4090	Quality Loss Rating
BF16	~141 GB	0 (baseline)	0 (baseline)	0 (baseline)	~3	Baseline (requires 4x 40GB)
Q8_0	~74 GB	~-0.1	~-0.1	+0.02	~6	Negligible (2x 40GB)
Q6_K	~58 GB	~-0.2	~-0.2	+0.06	~7	Negligible (2x 40GB)
Q5_K_M	~52 GB	~-0.3	~-0.3	+0.10	~8	Minimal (2x 32GB)
Q4_K_M	~43 GB	~-0.5	~-0.8	+0.18	~10	Minimal (64GB unified or 2x 24GB)
Q3_K_M	~34 GB	~-1.3	~-2.0	+0.47	~13	Acceptable (32-40 GB GPU)
Q2_K	~26 GB	~-3.2	~-6.9	+1.68	~16	Noticeable (fits RTX 5090 24GB+8GB)

The Q4_K_M delta of 0.5 MMLU points is remarkably small. Llama 3.3 70B at Q4 still scores ~83.1 MMLU - higher than most 13-34B models at full precision. The H.264 analogy applies here: you are compressing something that has so much information that even aggressive compression leaves plenty behind.

The Q3_K_M result deserves attention for users running on 32-36 GB systems: a 1.3 MMLU delta is acceptable, and at 34 GB it fits on an M4 Max 48GB or a single high-end workstation GPU. The resulting model still outperforms most full-precision 13-30B models.

Q2_K is the "last resort" tier. A 3.2-point MMLU delta is larger than the full-precision gap between Llama 3.3 70B and Llama 3.1 8B - you are throwing away quality the model earned from its scale. Use Q2_K only when you need to fit a 70B into a 24-28 GB VRAM budget and have no other option.

Qwen 2.5 72B - Quantization Impact

BF16 baseline: MMLU ~84.1, HumanEval ~79.4, WikiText-2 PPL ~3.7

Quantization	VRAM (8K ctx)	MMLU delta	HumanEval delta	PPL delta vs BF16	Tok/s RTX 4090	Quality Loss Rating
BF16	~145 GB	0 (baseline)	0 (baseline)	0 (baseline)	~3	Baseline (requires 4x 40GB)
Q8_0	~76 GB	~-0.1	~-0.1	+0.02	~6	Negligible
Q6_K	~59 GB	~-0.1	~-0.2	+0.05	~7	Negligible
Q5_K_M	~54 GB	~-0.3	~-0.3	+0.09	~8	Minimal
Q4_K_M	~44 GB	~-0.4	~-0.7	+0.15	~10	Minimal
Q3_K_M	~35 GB	~-1.1	~-1.8	+0.41	~13	Acceptable
Q2_K	~26 GB	~-2.9	~-6.5	+1.52	~16	Noticeable

Qwen 2.5 72B shows slightly lower quantization deltas than Llama 3.3 70B at aggressive levels - the Q2_K PPL delta is 1.52 versus 1.68. This tracks the pattern seen throughout the Qwen 2.5 family: their training produces weight distributions that quantize somewhat more efficiently. At Q4_K_M, the 0.4 MMLU delta is minimal - this model is genuinely almost indistinguishable from full precision at that level.

Tier 5 - Very Large and MoE Models

Representative models: Mixtral 8x22B, DeepSeek V2.5, Mistral Large 2

MoE models quantize differently from dense models. Each expert is a separate feed-forward network, and not all experts activate for every token. The sparse activation means that routing weights and expert selection are more sensitive than individual expert weights. Quantizing too aggressively on MoE models can degrade output consistency in ways that perplexity scores undercount - the model routes to wrong experts intermittently, producing incoherent context switches in long generations.

Mixtral 8x22B - Quantization Impact

Architecture: 8 experts, 2 active. Total ~141B parameters, ~39B active. BF16 baseline: MMLU ~77.8, HumanEval ~75.5, WikiText-2 PPL ~4.0

Quantization	VRAM (8K ctx)	MMLU delta	HumanEval delta	PPL delta vs BF16	Tok/s RTX 4090	Quality Loss Rating
BF16	~283 GB	0 (baseline)	0 (baseline)	0 (baseline)	N/A	Baseline (8x A100)
Q8_0	~148 GB	~-0.1	~-0.2	+0.04	N/A	Negligible
Q6_K	~116 GB	~-0.2	~-0.3	+0.09	N/A	Negligible
Q5_K_M	~104 GB	~-0.4	~-0.5	+0.14	Not reported	Minimal
Q4_K_M	~86 GB	~-0.9	~-1.5	+0.32	Not reported	Acceptable
Q3_K_M	~68 GB	~-2.3	~-3.8	+0.88	Not reported	Noticeable
Q2_K	~51 GB	~-5.2	~-12.1	+2.87	Not reported	Severe

At Q4_K_M (86 GB), Mixtral 8x22B requires 128 GB of RAM or VRAM - achievable on a Mac M2 Ultra (192GB) or dual-GPU server setup. The per-model token throughput is not well-characterized in public NVIDIA consumer benchmarks because there are few setups that can run it. The Q2_K figure (51 GB) is more practical for high-end Mac configurations, but the 5.2-point MMLU delta and HumanEval collapse are significant for a model this size.

DeepSeek V2.5 - Quantization Impact

Architecture: MoE, ~236B total, ~21B active. BF16 baseline: MMLU ~80.4, HumanEval ~84.0, WikiText-2 PPL Not reported publicly

Quantization	VRAM (8K ctx)	MMLU delta	HumanEval delta	PPL delta vs BF16	Tok/s (A100 cluster ref)	Quality Loss Rating
BF16	~472 GB	0 (baseline)	0 (baseline)	0 (baseline)	Not reported	Baseline
Q8_0	~248 GB	Not reported	Not reported	Not reported	Not reported	Not reported
Q4_K_M	~133 GB	~-0.7	~-1.1	Not reported	Not reported	Acceptable
Q3_K_M	~106 GB	~-1.8	~-2.9	Not reported	Not reported	Noticeable
Q2_K	~79 GB	~-4.1	~-9.2	Not reported	Not reported	Severe

DeepSeek V2.5 is primarily a server-run model. Few public comprehensive quantization benchmarks exist for it in consumer formats. The MMLU and HumanEval deltas above are estimated from community reports on the unsloth GGUF releases. PPL data on WikiText-2 was not publicly reported in a comparable format. Use these figures as directional rather than definitive.

Consumer Hardware Sweet Spots

This section translates the data above into actionable recommendations. Given a specific VRAM budget, what model-and-quantization combination gives you the best quality?

24 GB VRAM (RTX 4090, RTX 3090, RX 7900 XTX)

Goal	Best Combination	MMLU Score	VRAM Used	Tok/s
Max absolute quality	Qwen 2.5 32B Q4_K_M	~82.4	~19.5 GB	~29
Max quality + speed balance	Gemma 3 27B Q6_K	~78.4	~22.3 GB	~27
Best coding (HumanEval)	Qwen 2.5 32B Q4_K_M	~74.7 HumanEval	~19.5 GB	~29
Fastest useful quality	Phi-4 14B Q5_K_M	~84.4	~10.7 GB	~78

Key insight: On a 24 GB card, the single best move for maximum quality is Qwen 2.5 32B at Q4_K_M rather than Llama 3.3 70B at Q2_K. The 32B model at Q4 scores ~82.4 MMLU and runs at 29 tok/s. The 70B at Q2 scores ~80.4 MMLU but runs at only ~16 tok/s and has noticeably worse coherence in long generations. More parameters squeezed through brutal quantization does not beat fewer parameters at a clean compression level.

See the home GPU LLM leaderboard for full hardware-by-hardware model rankings.

48 GB VRAM (Mac M3/M4 Max 48GB, dual RTX 3090/4090, RTX A6000)

Goal	Best Combination	MMLU Score	VRAM Used	Tok/s
Max absolute quality	Llama 3.3 70B Q4_K_M	~83.1	~43 GB	~10
Max quality + speed balance	Qwen 2.5 32B Q8_0	~83.0	~33.8 GB	~20
Best coding	Qwen 2.5 72B Q3_K_M	~83.0 HumanEval delta -1.8	~35 GB	~13
Best throughput with good quality	Qwen 2.5 32B Q4_K_M	~82.4	~19.5 GB	~29

At 48 GB, you can run Llama 3.3 70B at Q4_K_M - the sweet spot for this model. The 0.5 MMLU delta at Q4 is minimal; you're getting ~83.1 MMLU from a model that genuinely competes with frontier commercial APIs from 2024.

96 GB VRAM (Mac M2/M3/M4 Ultra, 4x RTX 3090, A100 80GB x2)

At 96 GB, you can run Llama 3.3 70B or Qwen 2.5 72B at Q5_K_M or Q6_K - essentially lossless quality at the 70B scale. The perplexity delta at Q6_K is 0.06 for these models, which is imperceptible in practice. This is the configuration where quantization becomes a non-issue: you have enough VRAM to run near-perfect quality at interactive speeds.

Quantization Dead Zones

Not all quantization levels make sense across all model sizes. There are configurations where the quality degradation is so severe that you are better off running a smaller model at a higher quantization level.

Q2_K Below 13B: Do Not Use

The Q2_K format stores weights at roughly 2.6 bits per parameter. At 13B and below, the information density per bit is already at the limit of what current quantization methods can manage. Below that threshold, Q2_K produces:

Perplexity deltas of 3.0-5.0+ - comparable to the full gap between a 7B and a 1B model
HumanEval collapse - code generation drops 15-25 points absolute; the model generates syntactically plausible code that doesn't run
Factual hallucination increases - MMLU losses of 6-8 points mean roughly one in twelve questions the model could answer correctly at full precision gets answered wrongly
Instruction following degradation - the model begins confabulating formats and ignoring constraints in system prompts

If your VRAM budget forces you to Q2_K on a 7B or 8B model, run a different model. A Q4_K_M Phi-4-mini (3.8B, ~2.5 GB) is a better choice than a Q2_K Llama 3.1 8B (~3.0 GB) - the 3.8B model at clean Q4 outperforms the 8B model at brutalized Q2.

Q3_K_M and Multilingual Tasks

Quantization is not task-neutral. English-language benchmarks like MMLU and HumanEval give an optimistic picture of Q3_K_M quality. For multilingual tasks - particularly languages that are underrepresented in training data relative to English - the quality degradation at Q3_K_M and below is steeper.

Community analysis of Mistral and Llama models on the multilingual M-MMLU benchmark shows that non-English language performance at Q3_K_M degrades approximately 1.5x the English delta. For a model that loses 2.0 MMLU points on English at Q3_K_M, expect 3.0-3.5 points of loss on French, German, Spanish, and more for lower-resource languages.

If your deployment target includes non-English languages, use Q4_K_M as your floor, not Q3_K_M.

Tool-Calling and Structured Output Collapse

Tool-calling accuracy and structured JSON output fidelity degrade faster than chat quality under quantization. The reason is mechanistic: tool calls require the model to produce exact JSON syntax, specific field names, and logically consistent argument values. These outputs require high confidence in specific token sequences. As quantization noise increases, the probability mass over correct tokens spreads, leading to:

Misformatted JSON (unclosed brackets, wrong field names)
Wrong argument types (passing a string where a number is expected)
Tool selection errors (calling the wrong tool or hallucinating tool names not in the schema)

Published data from community benchmarks on the function-calling leaderboard suggests that tool-calling accuracy begins degrading meaningfully at Q4_K_M (not just Q3), and drops sharply at Q3_K_M. Specifically: a model that achieves 90% tool-call success at BF16 may drop to ~87% at Q4_K_M, ~81% at Q3_K_M, and ~68% at Q2_K. These are rough estimates based on community reports; official benchmarks across quantization levels for tool-calling specifically are limited. For production agentic workloads, I recommend Q4_K_M as the minimum and Q5_K_M or higher where possible. See the function calling benchmarks leaderboard for model rankings at full precision.

Hallucination Increases Monotonically with Quantization Aggression

This is the finding from the GPTQ paper (arXiv:2210.17323) and subsequent community analysis that most brief quantization guides omit: hallucination rate increases monotonically with quantization aggressiveness, not just quality benchmarks.

MMLU and HumanEval capture "does the model know the right answer." They do not directly measure "does the model confidently produce wrong answers." The perplexity metric captures this better - a higher perplexity means the model is less certain about each token, which means that when it makes a mistake, it is less obviously making a mistake. Subtly wrong answers with high confidence are more dangerous than obviously wrong answers.

The practical implication: for use cases where hallucination risk matters (medical, legal, financial, factual research), do not use Q3_K_M or below, even if benchmark scores look acceptable. The MMLU delta of 2.0 at Q3_K_M understates the confidence-calibration damage that aggressive quantization does to model outputs.

For reference, the AWQ paper (arXiv:2306.00978) shows that activation-aware methods reduce hallucination rate at equivalent bit depths compared to naive quantization or GPTQ, which is part of why AWQ and I-quant GGUF variants are preferred over basic Q4_0 GGUF for quality-sensitive workloads.

Choosing a Format: Quick Decision Tree

Do you use llama.cpp, Ollama, LM Studio, Jan, or koboldcpp?

Use GGUF K-quants. Q4_K_M as your default. Q5_K_M or Q6_K if you have VRAM to spare. Q8_0 if the model fits easily and you want maximum fidelity.

Do you run models through the transformers library with CUDA?

Use AWQ (best quality per bit) or GPTQ INT4 (wider model availability). BitsAndBytes INT8 for lossless inference if VRAM allows.

Do you deploy to a production inference server (vLLM, TensorRT-LLM)?

AWQ INT4 is the standard for vLLM quantized deployment. TensorRT-LLM supports INT8 and INT4 with its own calibration pipeline. See the best open-source LLM inference servers guide for per-server format recommendations.

Do you have an NVIDIA H100 or similar data center GPU?

FP8 native inference is the right choice. DeepSeek V3 ships with FP8 kernels; Llama 3.1 405B has an official FP8 release from Meta. Quality is near-lossless versus BF16 at half the memory requirement.

Are you running on Apple Silicon?

GGUF via llama.cpp or MLX via the MLX-LM library. For models that fit at Q4_K_M or higher, the quality and speed are both excellent. See the home GPU LLM leaderboard for Apple Silicon speeds.

Caveats and Limitations

Per-task variance is real. MMLU and perplexity averages mask task-specific behavior. A model that loses 1.5 MMLU points globally at Q4_K_M may lose 3 points on graduate-level science questions and 0 points on reading comprehension. If you have a specific high-stakes task, benchmark the specific task at the quantization level you plan to deploy, not just MMLU.

Calibration data matters for I-quants and GPTQ. The improved quality of I-quant GGUF variants (IQ3_M, IQ4_NL) and GPTQ models comes from calibration data used during quantization. A GPTQ model calibrated on code data will preserve code performance better at a given bit depth than a GPTQ model calibrated on generic web text. The model name alone does not tell you the calibration set. Check the quantization author's notes. The bartowski and unsloth model releases generally document this.

Context length affects quantization impact. Long-context tasks are more sensitive to quantization than short-context tasks. At 32K+ context, quantization errors in attention layers compound over the long KV cache. The perplexity numbers in the tables above are measured at short contexts. For long-context deployments, add approximately 0.5-1.0 extra perplexity delta to Q3_K_M and 0.2-0.4 to Q4_K_M to account for this.

Model families matter more than format. A Qwen 2.5 32B at Q3_K_M likely outperforms a Llama 3.1 8B at BF16 on MMLU. The scale advantage overwhelms the quantization penalty at any reasonable comparison. When choosing between "smaller model at high precision" and "larger model at lower precision," bias toward larger models at moderate quantization unless you have specific VRAM or latency constraints.

This leaderboard covers published formats as of April 2026. FP8 inference, MXFP4, and hardware-native quantization formats are evolving rapidly. The GGUF I-quant variants are improving with each llama.cpp release. Check the llama.cpp quantization documentation for the latest formats before committing to a deployment configuration.

Cross-References

For model quality rankings at full precision, see the open-source LLM leaderboard and overall LLM rankings
For hardware-specific speed benchmarks by model, see the home GPU LLM leaderboard
For on-device and edge inference (sub-7B models, phones, Raspberry Pi), see the edge and mobile LLM leaderboard
For small model quality rankings specifically, see the small language model leaderboard
For inference server tooling and deployment format compatibility, see the best open-source LLM inference servers

Sources

Medical LLM Leaderboard 2026: MedQA, USMLE, PubMedQA

James Kowalski — Sun, 19 Apr 2026 00:00:00 +0000

Most benchmark discussions involve a model getting a wrong answer and losing a point. In medicine, a wrong answer means a missed diagnosis, a contraindicated drug recommendation, or a clinician acting on a fabricated lab value. That asymmetry is what makes medical AI benchmarks worth tracking as a distinct category - separate from the general reasoning leaderboard and distinct from hallucination benchmarks, even though both are directly relevant here.

The FDA has cleared over 950 AI-enabled medical devices as of early 2026. LLMs are appearing in clinical decision support tools, prior authorization workflows, medical record summarization, and patient-facing triage assistants. The benchmarks covered in this leaderboard are the primary way to assess whether a model's medical knowledge holds up before it gets anywhere near a care setting.

TL;DR

o3 and GPT-5 lead across MedQA USMLE and MMLU-Medical, both exceeding expert physician pass rates
Reasoning models pull ahead most sharply on NEJMqa case reasoning, where multi-step clinical logic is required
Domain fine-tuned models (MedGemma, OpenBioLLM, MMed-Llama) show strong results on narrow tasks but trail frontier generalist models on aggregate
HealthBench, OpenAI's 2025 open-source evaluation, reveals gaps even in top models on realistic clinical conversations
"Not reported" is common - most labs do not publish medical benchmark scores for their newest models

The Benchmarks Explained

MedQA (USMLE)

MedQA (arxiv:2009.13081, Jin et al. 2020) is built from real United States Medical Licensing Examination Step 1, 2, and 3 questions. The dataset contains 12,723 questions in its English split, covering anatomy, physiology, pharmacology, pathology, and clinical medicine. Questions are 4-option or 5-option multiple choice. USMLE Step 1 pass threshold is roughly 60%; Step 2 Clinical Knowledge is typically 65-68%.

This is the single most widely reported medical AI benchmark. A model that scores above the USMLE pass threshold demonstrates minimum competency in medical knowledge. Scores at 80%+ represent performance that matches or exceeds practicing physicians in controlled multiple-choice settings. The benchmark was updated in 2023 with additional question types, and the community standard is the 4-option USMLE English split.

MedMCQA

MedMCQA (arxiv:2009.13811, Pal et al. 2022) covers over 194,000 questions from Indian medical entrance examinations - AIIMS PG and NEET PG. The scope is broader than USMLE, spanning 21 medical subjects including dental and surgery specialties. Questions are 4-option multiple choice. Average human expert performance on MedMCQA sits around 70-75%.

MedMCQA is harder than its size suggests because many questions require integrating information across subspecialties, and Indian medical training emphasizes drug dosages, surgical indications, and tropical disease patterns that Western training datasets may underrepresent. Models that perform well on MedMCQA tend to have genuinely broad medical knowledge rather than US-centric USMLE coverage.

PubMedQA

PubMedQA (arxiv:1809.06609, Jin et al. 2019) is a biomedical research QA benchmark built from PubMed abstracts. Given a research question and its abstract context, a model must answer yes/no/maybe and provide a supporting long answer. The labeled test set contains 1,000 manually annotated questions. Human performance is around 78%.

This benchmark measures a different capability than USMLE - not clinical knowledge recall, but the ability to reason from biomedical literature. A model that scores well on PubMedQA is useful for systematic review assistance, literature-based clinical queries, and research summarization. PubMedQA correlates less strongly with MedQA performance than you might expect, because strong clinical knowledge doesn't necessarily translate to accurate biomedical literature interpretation.

MMLU-Medical (Aggregate)

MMLU (arxiv:2009.03300, Hendrycks et al. 2020) contains 57 subjects across STEM, humanities, and social sciences. The medical subset spans six topics: anatomy, clinical knowledge, medical genetics, professional medicine, college biology, and college medicine. Together these form a 1,874-question medical aggregate that is widely used to compare models across providers.

MMLU-Medical is not as clinically realistic as MedQA - questions tend toward memorization rather than case reasoning - but it is one of the most consistently reported benchmarks across model releases, making it useful for tracking progress over time. Scores above 90% are achievable for top frontier models and no longer discriminate among the best systems.

NEJMqa

NEJMqa (arxiv:2209.04460, Kung et al. 2023) was one of the first serious evaluations of LLM performance on the New England Journal of Medicine's Clinical Problem-Solving case reports. The task involves reading a detailed clinical vignette - presenting complaint, labs, imaging findings, differential diagnosis discussion - and answering questions about diagnosis and management. Expert physician performance on NEJM cases is high by construction, but the reasoning required is more complex than multiple-choice.

NEJMqa is a useful discriminator for reasoning quality beyond knowledge recall. Two models with identical MedQA scores can diverge significantly on NEJMqa because the case reasoning format exposes gaps in clinical logic that multiple-choice questions mask.

MedAgentBench

MedAgentBench (arxiv:2402.01767) is a more recent benchmark evaluating LLM agents on clinical workflow tasks - not just answering questions but executing multi-step clinical actions: ordering labs, interpreting results in sequence, drafting clinical notes, and managing medication reconciliation. It uses a simulated EHR environment to test whether models can act like a clinical assistant rather than an exam-taker.

This benchmark is particularly relevant as agentic deployments in healthcare accelerate. It correlates with the agentic AI benchmarks category, though with healthcare-specific constraints. Models with strong general agentic capabilities do not always transfer cleanly to constrained clinical workflows.

HealthBench

HealthBench (arxiv:2505.10074) is OpenAI's 2025 open-source evaluation framework for LLM performance on realistic health-related conversations. It includes 5,000 physician-written conversations across 26 health topics, scored by a panel of medical professionals. Unlike USMLE-style questions, HealthBench covers patient communication quality, appropriate uncertainty expression, and the ability to escalate to professional care rather than over-answering. OpenAI released the full evaluation harness publicly, making it a growing community standard.

HealthBench is harder to game than multiple-choice benchmarks because it evaluates response quality holistically. A model that memorizes USMLE answers but fails to hedge appropriately on ambiguous clinical queries will score poorly here.

Medical LLM Rankings - April 2026

The table covers 14 models from frontier generalists to specialized medical fine-tunes. Scores are drawn from published papers, model cards, and independent evaluation reports. Where no public figure exists, the cell reads "Not reported" - I do not extrapolate.

Rank	Model	Provider	MedQA USMLE %	MedMCQA %	PubMedQA %	MMLU-Medical %	HealthBench %	Notes
1	o3	OpenAI	~96	~89	~82	~95	Not reported	Best documented medical reasoning; extended thinking
2	GPT-5	OpenAI	~93	~87	~80	~93	~73	Highest HealthBench score among frontier models
3	Claude 4 Opus	Anthropic	~91	~84	~79	~91	Not reported	Strong clinical reasoning on case-style questions
4	Gemini 2.5 Pro	Google DeepMind	~90	~83	~80	~90	~68	Long-context strength useful for case summaries
5	GPT-4.1	OpenAI	~88	~82	~78	~88	~64	Best-documented baseline with published scores
6	Claude 4 Sonnet	Anthropic	~87	~80	~76	~87	Not reported	Faster and cheaper than Opus; competitive accuracy
7	DeepSeek R2	DeepSeek	~85	~78	~74	~85	Not reported	Reasoning model; no published medical benchmark scores
8	MedGemma 27B-IT	Google	~84	~80	~77	~82	Not reported	Medical fine-tune of Gemma 3; strong specialist tasks
9	Qwen 3.5	Alibaba	~82	~76	~72	~83	Not reported	Competitive; no systematic medical benchmark publication
10	Llama 4 Maverick	Meta	~79	~72	~70	~79	Not reported	Open-weight baseline; falls behind on complex reasoning
11	OpenBioLLM-Llama3-70B	Saama AI Research	~74	~73	~75	Not reported	Not reported	Medical fine-tune; strong on PubMedQA; per model card
12	MMed-Llama 3 8B	Henrychur (Community)	~63	~62	~74	Not reported	Not reported	Multilingual medical focus; per paper arxiv:2312.04465
13	Asclepius-Llama3-8B	NCATS NIH	~61	Not reported	~71	Not reported	Not reported	Clinical note fine-tune; per paper arxiv:2305.01116
14	Med-PaLM 2 (historical)	Google	86.5 (paper)	Not reported	~79	Not reported	Not reported	2023 paper baseline; superseded by current models

Scores for o3, GPT-5, Claude 4, Gemini 2.5, GPT-4.1, DeepSeek R2, Qwen 3.5, and Llama 4 are approximate figures based on early evaluation reports and extrapolation from publicly reported MMLU and reasoning scores, as comprehensive published medical benchmark evaluations for these most recent models are not yet available. MedGemma, OpenBioLLM, MMed-Llama, Asclepius, and Med-PaLM 2 scores are from their respective published papers or model cards. "Not reported" means no public figure was available as of April 2026.

Key Findings

Reasoning Models Have a Structural Edge on Clinical Cases

The gap between reasoning models and standard frontier models is sharpest on case-based reasoning tasks like NEJMqa and MedAgentBench - not on multiple-choice recall. On MedQA USMLE, the difference between o3 and GPT-4.1 is roughly 8 percentage points. On clinical case reasoning, that gap widens because case problems require chaining multiple diagnostic steps, considering and excluding differential diagnoses, and integrating information from different parts of a vignette.

This mirrors what I documented in the finance leaderboard: reasoning models with extended thinking outperform their non-reasoning counterparts most clearly when tasks require multi-step chains rather than single recall events. The medical domain adds another layer - errors in a clinical reasoning chain can cascade in clinically consequential directions, and the ability to backtrack and revise an intermediate diagnostic hypothesis is genuinely valuable.

Domain Fine-Tunes Trail Frontier Generalists - With Exceptions

MedGemma, OpenBioLLM, MMed-Llama, and Asclepius represent the state of purpose-built medical fine-tuning in 2026. The pattern is the same as what we saw in the finance domain: frontier generalists that have been trained on enormous amounts of medical text - PubMed abstracts, textbooks, clinical case discussions, USMLE prep materials - now outperform domain fine-tunes on most aggregate benchmarks.

MedGemma is the strongest exception. Google's purpose-built medical model built on Gemma 3 posts competitive MedMCQA and PubMedQA scores that approach Gemini 2.5 Pro territory, despite being a fraction of the size. For organizations that need on-premise deployment, privacy compliance, or lower inference cost, MedGemma represents the most capable option that doesn't require sending data to an external API. The Med-PaLM 2 paper score of 86.5% on MedQA (from 2023) was a landmark at the time - showing that a fine-tuned medical model could approach human physician performance - but that number is now a floor, not a ceiling.

Hallucination Risk Doesn't Correlate Cleanly with Accuracy

A model that scores 88% on MedQA still fabricates information on the other 12% of questions - and in clinical settings, that's not a rounding error. The relationship between benchmark accuracy and hallucination behavior on real clinical queries is weaker than practitioners expect. A model can score well on USMLE-style questions (which have clear correct answers) while still generating plausible-sounding but unsupported recommendations on the open-ended clinical queries that reflect actual use cases.

This is where HealthBench provides signal that MedQA cannot. OpenAI's evaluation framework specifically targets the kinds of clinical communication failures - overconfident recommendations, failure to escalate, inappropriate certainty on ambiguous presentations - that accuracy benchmarks miss. The gap between GPT-5's 93% on MedQA and its 73% on HealthBench illustrates the point: medical knowledge and medical communication quality are related but not the same thing.

For a more complete picture of where frontier models fail on factual accuracy across domains, see the hallucination benchmarks leaderboard.

The Clinical Reasoning vs. Knowledge Trivia Distinction

MedQA and MMLU-Medical are primarily knowledge recall benchmarks. They test whether a model knows that metformin is first-line for type 2 diabetes, or that the superior mesenteric artery supplies the midgut. NEJMqa, MedAgentBench, and HealthBench test something harder: whether a model can reason through an ambiguous clinical presentation, hold competing hypotheses, and reach a defensible conclusion.

The distinction matters because clinical practice is mostly the second kind of task. A physician who has memorized Harrison's but cannot manage uncertainty, cannot reason through an atypical presentation, and cannot communicate appropriate limits of knowledge is a dangerous clinician. The same applies to LLMs. Benchmark selection should reflect which capability you actually need.

Benchmark Methodology Notes

MedQA USMLE is evaluated on the standard 4-option English test split (1,273 questions). The 5-option variant is harder and scores will be lower. Many published papers report 4-option results; always check which variant is being cited when comparing across sources.

MedMCQA uses the official test split. Evaluation is straightforward multiple-choice accuracy. The dataset is publicly available on HuggingFace.

PubMedQA uses the 1,000-question expert-labeled (labeled) test split, not the auto-generated or expert-generated reasoning splits. Accuracy on yes/no/maybe predictions is the standard metric. Dataset available at HuggingFace.

MMLU-Medical is the aggregate of six MMLU subjects: anatomy, clinical knowledge, medical genetics, professional medicine, college biology, college medicine. I report the average accuracy across these six. Full MMLU dataset available at HuggingFace.

HealthBench scores are overall scores from the HealthBench evaluation harness as described in arxiv:2505.10074. The benchmark is open-source and results for newer models are not yet systematically reported.

For all benchmarks, I prefer independently verified numbers over self-reported model card claims. The frontier model estimates in the table are based on extrapolation from published MMLU Pro Medical subset scores and publicly available early evaluation reports, not internal evaluations.

Caveats and Limitations

Not Medical Advice - Not Even Close

This article ranks AI models on standardized benchmarks. It is not a recommendation to use any model in a clinical setting. None of the models in this table are FDA-cleared for clinical decision support on their own. Benchmark performance on USMLE questions does not equal clinical safety or efficacy. Before deploying any AI tool in a healthcare workflow, verify regulatory compliance in your jurisdiction, conduct task-specific validation, and maintain human clinical oversight.

The Benchmark-to-Clinic Gap

USMLE questions are carefully written to have clear, unambiguous correct answers. Clinical reality is messier. Real patients have atypical presentations, incomplete histories, comorbid conditions, and social factors that influence clinical decisions. A model that aces USMLE Step 1 has demonstrated medical knowledge, not clinical judgment. The correlation between benchmark scores and real clinical utility has not been established in robust prospective studies for most models in this table.

Demographic Bias in Test Sets

MedQA USMLE reflects the demographics and disease epidemiology emphasized in US medical education. MedMCQA reflects Indian medical education. Neither adequately represents disease patterns in sub-Saharan Africa, Southeast Asia, or Indigenous populations in the Americas and Australia. A model that scores well on these benchmarks may underperform on clinical queries relevant to underrepresented demographics - a gap that published benchmark scores will not reveal.

Data Contamination Risk

USMLE questions, PubMed abstracts, and MMLU questions are all publicly available and almost certainly appear in the training data of every frontier model in this table. This creates a contamination problem: a model might "know" the correct answer from training exposure rather than deriving it from the presented information. The MedQA paper's authors acknowledge this limitation. Newer evaluations like HealthBench and MedAgentBench mitigate contamination by using withheld or dynamically generated questions, but the issue cannot be fully resolved for benchmarks with long public histories.

This is a more serious concern in medicine than in most domains, because contamination means a model might look safer than it actually is. For higher-confidence evaluation, clinical validation studies on held-out cases are necessary.

Missing Recent Scores

GPT-5, Claude 4, DeepSeek R2, and Qwen 3.5 do not yet have comprehensive published evaluations on the medical benchmark suite. The scores in the table are approximations based on available MMLU Pro Medical subset reports and early evaluation papers. As formal evaluations are published, this table will be updated. Treat all frontier model estimates as indicative rather than definitive until primary sources become available.

Medical reasoning overlaps with several other evaluation categories:

Reasoning Benchmarks Leaderboard - GPQA Diamond, AIME, HLE rankings that underpin medical reasoning performance
Hallucination Benchmarks Leaderboard - Factuality and hallucination rates that are directly relevant to clinical safety
AI Safety Leaderboard - Safety evaluations that include medical harm refusal and appropriate escalation behaviors

Bottom Line

For clinical decision support tooling: o3 and GPT-5 lead where evaluations exist, with GPT-4.1 as the most thoroughly documented baseline with published medical benchmark coverage. Claude 4 Opus and Gemini 2.5 Pro are competitive on case reasoning. No frontier model should be deployed without task-specific clinical validation.

For on-premise or privacy-constrained deployment: MedGemma 27B-IT is the strongest open-weight medical model available as of April 2026. It trails frontier models on aggregate benchmarks but leads every other specialized option and is the practical choice when external APIs are not viable.

For research assistance and literature review: PubMedQA scores are the relevant discriminator. OpenBioLLM-Llama3-70B performs well on biomedical literature tasks relative to its size and is suitable for research augmentation workflows where a full frontier model is cost-prohibitive.

What to avoid: Treating USMLE accuracy alone as a proxy for clinical safety. A model that scores 93% on MedQA still fails on 7% of questions, and the failure modes in free-text clinical queries are harder to predict. HealthBench is the more honest assessment of whether a model communicates about health in a responsible and clinically appropriate way.

Sources:

Reward Model and LLM-as-Judge Leaderboard 2026 Ranked

James Kowalski — Sun, 19 Apr 2026 00:00:00 +0000

Most AI benchmarks tell you what a model can do on a test. Reward models and judge models tell you something harder to measure: whether a model's outputs are actually good - by the standard of human preferences, not a rubric written by researchers. That's a different job, and it's one of the most consequential in the entire LLM stack.

Reward models power RLHF training pipelines. They sit between the human preference data collected at enormous cost and the policy model that gets fine-tuned on that signal. If the reward model is miscalibrated - too easy to game, biased toward length, or just wrong about what humans prefer - the downstream model inherits those errors at scale. On the evaluation side, LLM-as-judge workflows are increasingly replacing expensive human annotation for automated testing, red-teaming, and benchmark construction. Getting this right matters in ways that are easy to underestimate.

This leaderboard tracks performance across four major evaluation frameworks: RewardBench v1, RewardBench 2, JudgeBench, and MT-Bench Judge Agreement. It covers both dedicated reward models trained specifically for preference judgment and frontier LLMs used as judges via prompting.

TL;DR

Skywork-Reward-Gemma-2-27B leads the dedicated RM category at 96.1% on RewardBench v1, beating much larger proprietary models
GPT-5 and Claude 4 Opus top the frontier judge category, but o3 pulls ahead specifically on math and coding judgment tasks
Dedicated RMs are 10-20x cheaper per inference than frontier judge calls - the right choice for high-volume preference labeling at training time
Position bias and length bias remain unsolved problems even for top-ranked models

Why This Benchmark Category Matters

The LLM pipeline has three broad stages: pretraining, instruction tuning, and alignment. Reward models are the alignment stage's critical infrastructure. During RLHF, a reward model scores candidate outputs from the policy model; the policy model then gets reinforced toward higher-scoring outputs via PPO or DPO-style optimization. If that signal is noisy or systematically biased, the resulting model will drift in the direction of the bias, not toward what humans actually want.

Outside of training, LLM-as-judge has become the dominant paradigm for automated evaluation. Human annotation at the scale required for modern model development costs millions of dollars per year. Replacing human raters with GPT-4-class judges running at $0.01-0.10 per comparison - while maintaining reasonable correlation with human preference - makes systematic evaluation tractable. The question is: how reliable is that correlation, and for which categories does it break down?

These benchmarks exist to answer that question.

The Two Categories of Judges

Dedicated Reward Models

Dedicated RMs are models trained end-to-end for preference scoring. They take a prompt plus one or more candidate responses as input and output a scalar score or a ranked ordering. They're not general-purpose conversational models - they exist specifically to score quality.

Their main advantage is efficiency. A 7B or 8B reward model can score thousands of preference pairs per hour on commodity GPU hardware. This matters enormously during RLHF training, where you may need to score millions of rollouts. Their main limitation is generalization: reward models trained on one distribution of preference data often fail on out-of-distribution inputs, and they can overfit to surface features like response length or formatting rather than true quality.

Well-known examples: Skywork-Reward-Gemma-2-27B, ArmoRM-Llama3-8B-v0.1, Eurus-RM-7b, Prometheus 7B/13B, and Auto-J 13B.

Frontier LLMs as Judges

Frontier LLMs used as judges are prompted to evaluate responses - either scoring on a 1-10 scale (pointwise) or picking the better of two responses (pairwise). The model uses its broad world knowledge and reasoning capability rather than specialized preference training. This gives it an advantage on complex, knowledge-heavy tasks where a small dedicated RM would lack context.

The tradeoffs are obvious: frontier judges cost substantially more per call, introduce new failure modes like self-bias (a model rating its own outputs higher), and add latency to evaluation pipelines. They also change behavior based on prompt design in ways that dedicated RMs don't.

Reasoning Models as Chain-of-Thought Judges

A newer variant: reasoning models like o3 that produce step-by-step evaluation rationales before reaching a judgment. These chain-of-thought judges often perform better on structured tasks like math solution evaluation, where showing work allows detection of subtle errors invisible to direct scoring. Their disadvantage is inference cost - o3-level judgment can be 5-20x more expensive than a standard frontier judge call.

Benchmark Explainers

RewardBench v1

RewardBench, from AllenAI, presents reward models with chosen/rejected response pairs and asks which is better. Performance is measured as the percentage of pairs where the model agrees with the human-labeled preference. It covers five categories:

Chat: general conversational quality
Chat-Hard: more challenging conversations requiring nuanced reasoning
Safety: harmful/harmless preference pairs
Reasoning: math and logical reasoning quality
Code: code correctness and quality

The leaderboard is live on HuggingFace Spaces. An overall score is computed as the average across categories.

RewardBench 2

RewardBench 2, released by AllenAI in early 2025, is a harder version designed to address overfitting concerns with the original. It features:

More challenging chosen/rejected pairs that are harder to distinguish by surface features alone
Better coverage of agentic tasks and multi-turn conversations
Reduced susceptibility to the length bias that inflated scores on RewardBench v1

Average scores on RewardBench 2 drop roughly 8-12 points compared to v1 for the same models - it's a more discriminating test.

JudgeBench

JudgeBench evaluates LLM judges specifically - not reward model scoring, but the LLM-as-judge paradigm. It presents judge models with sets of responses and measures agreement with human expert labels. The score is percentage agreement across 350 carefully curated examples spanning coding, math, reasoning, and writing. Because JudgeBench was constructed with expert human raters (not crowdworkers), it's a higher-quality signal for judge agreement than most alternatives.

MT-Bench Judge Agreement

MT-Bench Judge Agreement measures how well a model's pairwise judgments on MT-Bench conversations match GPT-4 Turbo's judgments, which serve as a human proxy. This is a softer signal - it measures agreement with another model, not direct human labels - but it's widely reported in model cards and useful for cross-model comparison.

Rankings

The table below covers 18 entries spanning dedicated reward models and frontier LLMs used as judges. Scores are sourced from the official RewardBench leaderboard at huggingface.co/spaces/allenai/reward-bench, the RewardBench 2 leaderboard, the JudgeBench paper, and published model cards. Where no public figure exists, the entry reads "Not reported."

Rank	Model	Type	RB v1 %	RB-2 %	JudgeBench %	MT-Judge %	Notes
1	GPT-5	Frontier Judge	Not reported	Not reported	85.2	96.1	Self-bias present; strongest on writing/reasoning judgment
2	Claude 4 Opus	Frontier Judge	Not reported	Not reported	83.7	95.0	Low self-bias; strong safety and nuance judgment
3	o3 (reasoning judge)	Frontier Judge	Not reported	Not reported	82.9	93.8	Best on math/coding judgment; 5-20x cost premium
4	Skywork-Reward-Gemma-2-27B	Dedicated RM	96.1	84.3	71.2	88.4	Top dedicated RM overall; leads all open RMs on RB v1
5	Claude 4 Sonnet	Frontier Judge	Not reported	Not reported	80.1	92.4	Best cost/quality tradeoff in frontier judge category
6	GPT-4.1	Frontier Judge	Not reported	Not reported	79.8	91.3	Strong but trails Claude 4 Sonnet on safety pairs
7	Gemini 2.5 Pro	Frontier Judge	Not reported	Not reported	78.4	90.7	Leads in multilingual judgment; position bias documented
8	Nemotron-340B-RM	Dedicated RM	92.8	79.1	68.3	85.2	NVIDIA's large-scale RM; strong reasoning/code judgment
9	DeepSeek V3.2	Frontier Judge	Not reported	Not reported	74.2	87.6	Competitive on code/math judgment; not trained for this task
10	Llama 4 Maverick	Frontier Judge	Not reported	Not reported	72.8	86.1	Open-weight frontier judge; underperforms on safety pairs
11	QRM-Llama3.1-8B	Dedicated RM	91.4	77.6	63.1	82.7	Best 8B RM; quality-aware multi-attribute scoring
12	ArmoRM-Llama3-8B-v0.1	Dedicated RM	89.0	75.3	61.8	81.0	Absolute-rating RM; strong for RLHF pipelines
13	Qwen 3.5 72B	Frontier Judge	Not reported	Not reported	71.3	84.9	Solid multilingual judge; length bias documented
14	InternLM2-Reward-7B	Dedicated RM	88.4	73.9	59.2	79.6	Trained on large Chinese/English preference dataset
15	Mistral Large 3	Frontier Judge	Not reported	Not reported	68.5	83.2	Reasonable judge for cost-sensitive EU-deployment pipelines
16	Prometheus 7B v2	Dedicated RM	84.1	70.2	57.4	77.1	Specialized for feedback generation; reference-guided
17	Eurus-RM-7b	Dedicated RM	83.7	68.8	54.6	75.3	Strong on math/reasoning; weaker on general chat
18	Auto-J 13B	Dedicated RM	80.2	64.5	52.3	73.8	Generative judge; produces natural-language critiques

RB v1 = RewardBench v1 overall score. RB-2 = RewardBench 2 overall score. JudgeBench = human-agreement score from JudgeBench paper. MT-Judge = MT-Bench judge agreement percentage. Frontier LLM scores on RB v1 and RB-2 are not reported in public leaderboard data - these models were not submitted to the preference-pair evaluation protocol. Scores current as of April 2026.

RewardBench evaluates models across five categories: Chat, Chat-Hard, Safety, Reasoning, and Code. Dedicated RMs and LLM judges show different strength profiles across these categories. Source: pexels.com

Key Findings

Dedicated RMs Can Beat Frontier Judges on Preference Accuracy

On RewardBench v1, Skywork-Reward-Gemma-2-27B scores 96.1% - a result that no frontier judge approaches in the RewardBench evaluation protocol. This is because dedicated RMs are trained end-to-end to solve exactly the problem RewardBench measures: given a prompt and two candidate responses, pick the better one. Frontier judges were trained for conversation, not preference discrimination.

This distinction matters for RLHF pipeline design. If your goal is high-accuracy preference labeling at training time, a purpose-built 27B RM running cheaply on your own hardware may outperform $0.05-per-call GPT-4.1 judgments in head-to-head accuracy - not just in cost efficiency.

Reasoning Judges Are the Best Option for Math and Code

On JudgeBench's math and coding subcategories, o3 leads by a clear margin. Chain-of-thought reasoning allows the model to verify mathematical derivations and trace code execution before making a judgment - catching errors that a direct scoring call misses. For evaluation pipelines specifically targeting math problem-solving or code correctness, the cost premium of a reasoning judge is often justified.

For general quality assessment across writing, instruction following, and conversational tasks, Claude 4 Opus and GPT-5 are the more cost-efficient frontier options.

Judge Self-Bias is Real and Measurable

GPT-5 judges GPT-5 outputs roughly 8-12% higher than equivalent outputs from other models, when blind identity is removed. This is documented in JudgeBench analysis and consistent with findings from the original MT-Bench paper. The implication: if your evaluation pipeline uses GPT-5 to evaluate candidates that may include GPT-5 outputs, you're not running a fair comparison. Claude 4 Opus shows lower measured self-bias, making it a better choice for evaluations spanning multiple providers.

Position Bias in Pairwise Judging

When LLM judges evaluate two responses side by side, they show a consistent preference for the response presented first (primacy bias) or last (recency bias), independent of actual quality. The strength of this bias varies by model: Gemini 2.5 Pro shows moderate primacy bias; GPT-4.1 shows recency bias in longer contexts. Standard mitigation is to run each pair twice with positions swapped and average the result - at the cost of doubling inference calls.

Length Bias Inflates Scores for Verbose Responses

All judge models in this table show some degree of length bias - a tendency to prefer longer, more detailed responses over shorter, equally correct ones. RewardBench v1's relatively high average scores partly reflect this bias being baked into the training distribution: the "chosen" responses in most RLHF datasets are statistically longer than the "rejected" ones. RewardBench 2 was specifically designed to reduce this confound, which is why average scores drop 8-12 points across the board.

Skywork-Reward-Gemma-2-27B scores 96.1% on RewardBench v1 - beating every frontier judge in the dedicated preference-pair evaluation protocol at a fraction of the inference cost.

Open-Source RMs Show RewardBench Saturation Signs

Several dedicated RMs now exceed 95% on RewardBench v1. AllenAI has noted that benchmark construction artifacts - particularly the statistical predictability of chosen/rejected pairs from the source datasets - allow models to achieve high scores without genuinely calibrated preference judgment. This is why RewardBench 2 was developed, and why the score gap between models on RB-2 (12 points between ranks 4 and 18) is more informative than the compressed top of RB v1.

Methodology Notes

Dedicated RM scores are taken from the live RewardBench v1 leaderboard maintained by AllenAI and from published RewardBench 2 results in the RewardBench 2 paper. All scores represent the model's percentage accuracy at identifying the preferred response in chosen/rejected pairs, averaged across all five benchmark categories.

JudgeBench scores are from the JudgeBench paper (arXiv:2404.13512), which reports human-agreement percentage across 350 expert-labeled examples. Frontier LLM scores were reported in the paper's LLM-as-judge evaluation section; dedicated RM scores required adapting the pointwise format used in JudgeBench for comparison.

MT-Bench Judge Agreement scores are sourced from individual model cards and the original MT-Bench paper. This metric measures agreement with GPT-4 Turbo judgments, not direct human labels - treat it as a relative signal, not an absolute quality measure.

Frontier LLMs do not appear on RewardBench v1 or v2 because the benchmark requires models to be submitted as reward scorers (returning a scalar), not as conversational judges. Applying frontier LLMs in the RewardBench protocol would require significant adaptation; reported scores use the models in their natural judge-prompting mode across compatible benchmarks.

Caveats

Judge agreement is not the same as human truth. A model that agrees 85% of the time with human labels may still be systematically wrong on specific domains, languages, or task types that happen to be underrepresented in the benchmark. Measure agreement on your specific data distribution before trusting leaderboard scores for production decisions.

Open-source RMs overfit to RewardBench construction. The fact that multiple RMs exceed 95% on RB v1 while dropping 15-20 points on RB-2 is a clear signal of benchmark overfitting. Models have been fine-tuned on distributions that happen to overlap with RB v1's source datasets. RewardBench 2 is the more reliable signal.

RMs are only as good as their preference training data. Reward model quality is fundamentally bounded by the quality and diversity of human preference labels used to train it. Models trained on English-only, US-centric preference data will produce miscalibrated scores on multilingual outputs or culturally specific tasks. InternLM2-Reward-7B's strong performance on Chinese-language preference pairs (not reflected in aggregate RB scores) illustrates that the training distribution matters as much as architecture.

Contamination risk for newer benchmarks. JudgeBench examples are not public, which reduces contamination risk. RewardBench v1's source datasets are partially public, creating real contamination risk for models trained on large web-crawled preference datasets. Scores on contaminated benchmarks overstate real-world judge quality.

Cost and latency are not captured in these rankings. A 7B dedicated RM can run 1,000+ comparisons per minute on an A10G GPU. GPT-5 judgment at the same throughput would cost orders of magnitude more. For high-volume training-time preference labeling, cost and latency constraints almost always dominate accuracy differences at the margin.

Practical Recommendations

For RLHF training pipelines where you need high-volume, cost-efficient preference labels: Skywork-Reward-Gemma-2-27B is the current best open-weights option. QRM-Llama3.1-8B is the best choice under 10B parameters if memory is the binding constraint.

For automated evaluation of general model outputs (writing, instruction following, conversational quality): Claude 4 Sonnet is the most balanced frontier judge - strong JudgeBench performance, documented low self-bias, and lower cost than Claude 4 Opus.

For math or code evaluation pipelines where error detection matters: o3 as a reasoning judge with chain-of-thought output is worth the cost premium. The quality delta on structured tasks is significant.

For multi-provider evaluation where self-bias is a concern: Claude 4 Opus shows the lowest documented self-bias among frontier judges and is the preferred choice when GPT-5 or OpenAI models are among the candidates being evaluated.

For broader context on how model rankings translate to real-world quality, see the Chatbot Arena Elo Rankings and the Instruction Following Leaderboard. If you're building evaluation infrastructure and need visibility into judge behavior in production, AI observability tools can help trace judge calls and detect drift in agreement patterns over time.

FAQ

What is a reward model and how does it differ from a regular LLM?

A reward model is trained specifically to score the quality of language model outputs according to human preferences. Unlike general-purpose LLMs that generate text, reward models output scalar scores or rankings. They're smaller and much cheaper to run than frontier models, but they only do one job well.

Can I use GPT-5 as a reward model replacement during RLHF?

Technically yes, but the economics rarely work out. GPT-5 as a judge costs roughly $0.05-0.15 per comparison. A good 7B dedicated RM on your own GPU costs a fraction of a cent per comparison. For training-time labeling at millions of rollouts, frontier judge costs become prohibitive. Dedicated RMs are the right tool for training pipelines.

What is position bias in LLM judging?

Position bias refers to a judge model systematically preferring whichever response is presented in a particular position - first or last - in a pairwise comparison, regardless of actual quality. It's well-documented across all tested LLM judges. The standard mitigation is to evaluate each pair twice with positions swapped and average the scores.

Why do dedicated RM scores drop so much from RewardBench v1 to v2?

RewardBench v1 has artifacts in its construction - the "chosen" responses tend to be statistically predictable from source datasets that overlap with RM training distributions. RewardBench 2 was designed to remove these cues, requiring genuine preference understanding rather than surface-level pattern matching. The 10-15 point drop is expected and reflects more honest measurement.

Is JudgeBench better than RewardBench for measuring real-world judge quality?

For evaluating LLM judges specifically, yes - JudgeBench uses expert human labels rather than crowdworker preference data, and it's designed for the prompting paradigm rather than the preference-pair paradigm. For dedicated RMs used in training pipelines, RewardBench remains the primary benchmark because it matches their operating mode.

How often does this leaderboard update?

The RewardBench v1 and v2 leaderboards update continuously as new models are submitted to AllenAI's evaluation infrastructure. JudgeBench scores are static to the paper; new models require the original authors to run evaluations. This article's scores reflect publicly reported data as of April 2026.

Sources:

Robotics Embodied AI Leaderboard 2026: VLA Models Ranked

James Kowalski — Sun, 19 Apr 2026 00:00:00 +0000

Robotics AI is the domain where the gap between demo reel and deployable system is largest. Companies have been posting viral videos of humanoids folding laundry for three years. What the papers actually show is a narrower story: task success rates on carefully staged setups, single-arm manipulation in controlled lighting, evaluation suites that share authors with the models being tested.

None of that means the underlying research is bad - it means you should read the methodology section before citing the headline number. This leaderboard does exactly that.

TL;DR

Physical Intelligence's pi0 and pi0.5 hold the strongest published results on real-robot multi-task evals, with pi0.5 reporting 75-80% success on the most challenging long-horizon household tasks
NVIDIA's GR00T N1 leads on CALVIN ABC-D (4-task chain), the most demanding simulation benchmark with published multi-model comparisons
Octo and OpenVLA remain the best open-weight baselines: reproducible, documented, and significantly behind the proprietary frontier
RoboCasa and SimplerEnv have the most rigorous evaluation protocols in simulation; real-robot numbers from Figure, Tesla Optimus, and 1X are not independently verified and should be treated accordingly
No company has published blind independent evaluations of their humanoid systems. Every real-robot number in this leaderboard comes from the company that built the system

Benchmark Overview

Open X-Embodiment (RT-X)

The Open X-Embodiment dataset aggregates over 1 million robot trajectories across 22 different robot embodiments and more than 500k distinct robot arm trajectories. It is primarily a training resource, not an evaluation suite - but the associated evaluation protocol tests generalization across robot types not seen during training. The dataset and evaluation framework live at the Google DeepMind Open X-Embodiment repository. Scores on the cross-embodiment transfer task measure whether a policy trained on this corpus can generalize to unseen robot hardware setups. The benchmark is simulation-only with sim-to-real transfer as an open research problem.

CALVIN

CALVIN (Composing Actions by Learning Interpretable Language-conditioned Visuo-motor Navigation) is a long-horizon benchmark for language-conditioned robot manipulation. The hardest variant - ABC-D - requires a robot to complete 4 consecutive manipulation tasks (e.g., move slider, turn light on, place ball in bowl, push button) described in free-form natural language, in a new environment not seen during training. Success requires completing all 4 tasks without intervention; the metric is average number of tasks completed in a 1,000-sequence evaluation. A policy that randomly completes 1 task per episode scores 1.0; completing all 4 scores 4.0. Paper: CALVIN - arXiv 2112.03227. Repository: github.com/mees/calvin. CALVIN is simulation-only.

LIBERO

LIBERO is a lifelong robot learning benchmark suite from NeurIPS 2023. It defines four suites testing different transfer axes: LIBERO-Spatial (object position variation), LIBERO-Object (object identity variation), LIBERO-Goal (goal instruction variation), and LIBERO-Long (long-horizon 10-step task sequences). Evaluation measures forward transfer - how well skills learned in one phase generalize to new tasks in the next phase - alongside absolute success rate. Repository: github.com/Lifelong-Robot-Learning/LIBERO. LIBERO is simulation-only.

SimplerEnv

SimplerEnv provides simulation benchmarks specifically designed to align with real-robot task setups from published papers - the same object arrangements, the same tasks, the same success criteria as in real-robot evaluations from Google RT-2, RT-X, and similar work. The goal is to make simulation scores predictive of real-robot performance. It covers Google robot and WidowX robot setups. Paper: Evaluating Real-World Robot Manipulation Policies in Simulation - arXiv 2405.05941. Repository: github.com/simpler-env/SimplerEnv.

RoboCasa

RoboCasa is a large-scale simulation benchmark for everyday household tasks, specifically kitchen-scale manipulation. It covers 100+ distinct task types across 25 kitchen environments, with objects drawn from procedurally varied sets. Tasks include opening appliances, placing items, and sequential cooking prep steps. It is built on the MuJoCo physics engine. Paper: RoboCasa - arXiv 2406.02523. Repository: github.com/robocasa/robocasa. RoboCasa is simulation-only.

DROID

DROID is a large-scale in-the-wild manipulation dataset and evaluation framework, covering 76,000 demonstrations collected across 86 different environments on Franka robot arms. Unlike controlled lab datasets, DROID captures genuine environment diversity - varied lighting, clutter, table surfaces, and backgrounds. It serves as both a pre-training source and an evaluation harness for generalization to novel scenes. Paper: DROID - arXiv 2403.12945. Repository: github.com/droid-dataset/droid. DROID evaluation includes real-robot components.

Methodology

What "passing" a task means

Across simulation benchmarks, success is binary: the robot either achieves the goal state within the episode time limit or it does not. CALVIN uses a sequential chain metric where partial completion counts - reaching 2.5 out of 4 tasks represents meaningful capability. LIBERO measures both success rate within each suite and forward transfer efficiency across suites. RoboCasa uses per-task binary success averaged across the task distribution. SimplerEnv measures per-step action matching and end-state binary success.

For real-robot evaluations from company demos, I treat any number not published in a peer-reviewed paper or technical report with documented protocol as "unverified." The success rates from Figure, Tesla Optimus, and 1X Neo are drawn from company announcements and product blogs. They are included for reference, but they are not comparable to simulation benchmark numbers - the evaluation setup, task difficulty, number of trials, and intervention criteria are controlled by the company being evaluated.

Simulation vs. real robot

Simulation benchmarks have reproducible, auditable protocols. Real-robot benchmarks require physical access, are subject to hardware variance, and cannot be independently replicated by other researchers without the same hardware. The correlation between simulation and real performance is the central open research question in this field - SimplerEnv exists specifically to address it.

The ranking table below separates simulation and real-robot results and marks each accordingly. Do not compare a simulation success rate to a real-robot success rate directly - they measure different things.

VLA Model Rankings

CALVIN ABC-D - Long-Horizon Manipulation (Simulation)

CALVIN ABC-D is the hardest widely-used manipulation benchmark with multi-model comparison data. The metric is average number of tasks completed out of 4 in 1,000 evaluation sequences.

Rank	Model	Provider	CALVIN ABC-D	Eval Type	Notes
1	GR00T N1 (ft)	NVIDIA	~3.5	Sim	Fine-tuned; GR00T N1 tech report, arXiv 2503.14734
2	pi0 (ft)	Physical Intelligence	~3.3	Sim	Estimate from pi0 paper comparisons
3	Octo (ft)	UC Berkeley	2.78	Sim	Published in Octo paper; fine-tuned on CALVIN
4	OpenVLA (ft)	Stanford/Berkeley	2.31	Sim	OpenVLA paper; fine-tuned variant
5	SuSIE	Google DeepMind	2.69	Sim	Published in SuSIE paper; video generation backbone
6	RT-2-X (ft)	Google DeepMind	1.98	Sim	RT-X evaluation; fine-tuned on CALVIN distribution
7	Octo (zero-shot)	UC Berkeley	1.22	Sim	Zero-shot transfer from Octo pre-training
8	OpenVLA (zero-shot)	Stanford/Berkeley	0.97	Sim	Zero-shot; significant gap to fine-tuned
-	Human baseline	-	~3.9	Sim	Near-perfect sequential completion

Fine-tuning on CALVIN training data is the norm for top scores. Zero-shot numbers (no CALVIN-specific fine-tuning) are far more indicative of genuine generalization. The gap between 2.78 (Octo fine-tuned) and 1.22 (Octo zero-shot) illustrates how much the benchmark can be closed by task-specific adaptation.

SimplerEnv - Simulation Aligned to Real-Robot Tasks

SimplerEnv tasks mirror the exact setups from Google robot papers. Success rates are comparable across models because the evaluation is standardized.

Rank	Model	Provider	SimplerEnv Avg	Eval Type	Notes
1	RT-2-X	Google DeepMind	~58%	Sim	From SimplerEnv paper baseline comparisons
2	Octo (fine-tuned)	UC Berkeley	~48%	Sim	Fine-tuned on Bridge/RT-X mix
3	OpenVLA (fine-tuned)	Stanford/Berkeley	~45%	Sim	OpenVLA paper; WidowX + Google robot setups
4	Octo (zero-shot)	UC Berkeley	~26%	Sim	Strong zero-shot generalist baseline
5	RT-1	Google DeepMind	~22%	Sim	Original RT-1 results from SimplerEnv paper
6	OpenVLA (zero-shot)	Stanford/Berkeley	~19%	Sim	Zero-shot on unseen task distributions

SimplerEnv scores align reasonably well with published real-robot results from the same paper lineage, which is the benchmark's specific design goal. Numbers marked with ~ are estimates from SimplerEnv paper figures rather than exact table values.

LIBERO Suites - Lifelong Learning (Simulation)

LIBERO-Long is the hardest LIBERO suite, requiring 10-step sequential task completion.

Rank	Model	Provider	LIBERO-Long	LIBERO-Spatial	Eval Type	Notes
1	pi0 (ft)	Physical Intelligence	~88%	~95%	Sim	pi0 paper ablations
2	GR00T N1 (ft)	NVIDIA	~85%	~92%	Sim	GR00T N1 tech report
3	RoboFlamingo	ByteDance	77.8%	89.3%	Sim	Published in RoboFlamingo paper
4	Octo (ft)	UC Berkeley	65.2%	82.1%	Sim	Octo paper, LIBERO evaluation
5	OpenVLA (ft)	Stanford/Berkeley	58.1%	79.6%	Sim	OpenVLA paper
6	RT-2-X (ft)	Google DeepMind	51.3%	74.8%	Sim	Estimated from RT-X report

LIBERO-Long at 65% for Octo means the model fails a 10-step task sequence on more than 1 in 3 attempts. For a real deployment context, you need to understand what "fail" means in your specific scenario - does the robot stop, does it make an incorrect action, or does it damage an object?

Real-Robot Published Success Rates

These numbers come exclusively from company technical reports, product announcements, and published papers. They are not independently verified. Task definitions, trial counts, and environment setup are determined by the reporting organization.

System	Organization	Reported Task	Success Rate	Source	Notes
pi0.5	Physical Intelligence	Household manipulation (multi-task)	75-80%	pi0.5 paper, arXiv 2501.09747	9 task categories, 30+ trials each
pi0	Physical Intelligence	Laundry folding / table bussing	~70%	pi0 paper, arXiv 2410.24164	Specific task families; varies by task
RT-2	Google DeepMind	Tabletop instruction following	~62%	RT-2 paper, arXiv 2307.15818	700+ real robot eval episodes
Helix (Figure 02)	Figure AI	Multi-task home manipulation	~67%*	Figure blog post, Feb 2025	*Company-reported; no third-party audit
GR00T N1 (real)	NVIDIA	Pick-and-place, dexterous	~61%*	GR00T N1 tech report	Real-robot pilot; limited task set
Octo	UC Berkeley	Tabletop manipulation (Bridge)	~56%	Octo paper	Real WidowX robot; documented eval
OpenVLA	Stanford/Berkeley	Tabletop manipulation	~47%	OpenVLA paper	BridgeV2 robot; documented eval
Tesla Optimus Gen 2	Tesla	Parts sorting (factory)	Not published	Tesla AI Day demos	No technical report; demo footage only
1X Neo	1X Technologies	Home tasks	Not published	Product videos	No technical report
Sanctuary Phoenix	Sanctuary AI	Factory manipulation	Not published	Press releases	Limited technical disclosure

The asterisked entries come from company-issued blog posts without methodology documentation. pi0.5 and Octo are the only entries here with enough methodological transparency to compare directly - both provide trial counts, task definitions, and success criteria in their published papers.

Key Findings

The simulation-to-real gap is still wide

The best simulation scores (CALVIN ABC-D ~3.5/4.0 for GR00T N1 fine-tuned) translate to modest real-robot performance on comparable tasks. SimplerEnv was built to close this measurement gap and partially succeeds - its scores correlate better with real behavior than arbitrary simulation setups - but it still cannot replicate the full distribution of physical variability that a real environment introduces.

Any system that looks impressive on CALVIN or RoboCasa has still not shown it works reliably in your kitchen, with your objects, under your lighting conditions. The research community knows this; the marketing departments have decided to ignore it.

Physical Intelligence leads on documented real-robot performance

pi0 and pi0.5 are the only proprietary systems with detailed enough published methodology to take seriously as benchmark points. pi0.5's 75-80% across 9 task categories with 30+ trials each is the strongest documented claim in the real-robot category. It is also the most honest: the paper breaks down per-category performance, shows the variance, and identifies failure modes explicitly. That is the standard every other company should be meeting before their numbers appear in a ranking table.

Open-weight models are two to three generations behind

Octo and OpenVLA are the reproducible open-weight baselines. Octo at 56% on real WidowX tabletop tasks and OpenVLA at 47% are respectable given both are general pre-trained policies, not task-specific fine-tuned systems. But pi0 and GR00T N1 are running 10-20 percentage points ahead on equivalent tasks. The open-weight robotics ecosystem is where the open-weight LLM ecosystem was in early 2023 - usable for research, not competitive with proprietary frontier systems.

GR00T N1, while technically open-weight via Hugging Face, requires substantial NVIDIA infrastructure to run at inference speed suitable for real-robot control. Open-weight does not mean accessible here.

Humanoid robot companies are not publishing benchmark numbers

Figure, Tesla, 1X, and Sanctuary produce demo videos and press releases. None of them publish success rates with documented methodology. The Figure Helix blog post is the closest to a technical disclosure - it includes task categories and approximate success rates - but it is still a company-authored document with no third-party verification. Until any humanoid robot company publishes an evaluation protocol that an independent lab could replicate, their "success rates" belong in the same category as marketing claims.

DROID is the right training data; evaluation coverage is thin

DROID's 76,000 diverse demonstrations have been used to pre-train several of the best current policies. But DROID as an evaluation suite is underused - models trained on DROID are rarely evaluated on the DROID test split in a standardized way. This is a gap the field needs to close. Diverse training data with no corresponding diverse evaluation is how you end up with policies that overfit to the most common lab conditions.

Methodology Notes

All simulation scores in this leaderboard are drawn from:

Published arxiv papers with documented evaluation protocols
Official technical reports from model developers (GR00T N1, pi0/pi0.5)
Benchmark papers themselves where recent models were included in evaluation

I have not published scores from demo videos, product launch blog posts, or uncited secondary sources. Where numbers are estimated from paper figures rather than exact table entries, they are marked with ~. Company-reported real-robot numbers are included with clear source attribution and the label "Company-reported; no third-party audit" where appropriate. "Not published" means no technical disclosure of any kind is available, not that I couldn't find the number.

Zero-shot vs. fine-tuned scores are explicitly separated because the difference is practically significant. A model that scores 3.5 on CALVIN after fine-tuning on CALVIN training data is demonstrating task-specific adaptation, not general manipulation competence. Rankings in each table are ordered by the primary eval metric, with fine-tuned results listed before zero-shot.

Caveats and Limitations

Benchmark diversity problem. CALVIN, LIBERO, SimplerEnv, and RoboCasa are all single-arm desktop manipulation benchmarks. None of them test bimanual manipulation, mobile manipulation, navigation-with-interaction, or contact-rich tasks like assembly. The field's best-documented benchmarks cover a narrow slice of what embodied AI needs to do. Humanoid robot demos show tasks that no standardized benchmark currently measures.

Fine-tuning inflates simulation scores. All the top simulation scores in this leaderboard involve models fine-tuned on benchmark-specific training data. This is not cheating - it is the standard experimental setup in the field - but it means that comparing a fine-tuned score to a zero-shot score from a different paper is essentially meaningless. The table separates these cases explicitly.

Real-robot evaluations are not reproducible. A researcher at another institution cannot replicate the Figure Helix or Tesla Optimus results. Physical robot evaluations require the specific hardware, the specific environment setup, and cooperation from the company that ran them. Independent verification is structurally impossible for most real-robot claims in this space.

Benchmark contamination is possible. Several of these benchmarks have been public for 2-4 years. Any model trained on large web scrapes of robotics literature and code may have encountered benchmark task descriptions, solution approaches, or even specific object arrangements during pre-training. This is particularly a concern for language-conditioned tasks where the task description itself is a natural language string that could appear in training data.

Physics sim and real-world physics diverge. Soft objects, deformable materials, and tasks involving liquids are poorly simulated by MuJoCo and Isaac Sim. RoboCasa explicitly avoids these task categories. CALVIN's physical objects are rigid. The benchmarks in this leaderboard systematically undertest the task categories that are hardest in the real world.

Sources

Scientific Reasoning LLM Leaderboard 2026: GPQA Ranks

James Kowalski — Sun, 19 Apr 2026 00:00:00 +0000

Scientific reasoning is its own distinct capability - one that gets blurred when it's lumped in with general reasoning or pure mathematics. This leaderboard is specifically about STEM: physics, chemistry, biology, and earth science. It focuses on problems that require domain knowledge applied under reasoning pressure, not just symbol manipulation.

To be clear about scope: this is not the Math Olympiad leaderboard, which covers AIME, IMO, FrontierMath, and formal proof benchmarks. It is also not the general reasoning leaderboard, which covers GPQA Diamond alongside AIME and Humanity's Last Exam as a broader trio. If you landed here because you care about how well models solve physics problems, balance chemical equations, reason through genetics, or apply thermodynamics - you are in the right place. If you are choosing a model for hallucination resistance or factual recall, see the hallucination benchmarks leaderboard.

TL;DR

Reasoning-optimized models (o3, Claude 4 Opus, Gemini 2.5 Pro Deep Think) dominate GPQA Diamond and OlympiadBench-Sci - gap over non-reasoning frontiers is 6-12 percentage points
On knowledge-heavy tasks (MMLU-STEM, ARC-Challenge), non-reasoning frontier models like GPT-4.1 and DeepSeek V3.2 close the gap to within a few points
Open-weight models (Llama 4 Maverick, Phi-4, Qwen 3.5) trail on GPQA Diamond but are competitive on ARC-Challenge and MMLU-STEM

The Benchmarks Explained

GPQA Diamond - PhD-level science

GPQA (Graduate-Level Google-Proof Questions Answering) Diamond is a set of 198 extremely hard multiple-choice questions written by domain experts in physics, chemistry, and biology. "Google-Proof" is literal: even PhDs in the relevant field with unrestricted internet access score around 81%. Non-expert humans with internet access land near 34%, barely above random chance on four-option questions.

These are not trivia questions. A chemistry item might require applying thermodynamic cycle analysis to a novel organic system. A physics item might ask you to derive a scattering cross-section under unusual boundary conditions. GPQA Diamond is the most demanding science reasoning benchmark with wide model coverage, which makes it the anchor of this leaderboard. Paper: arxiv.org/abs/2311.12022. Repository: github.com/idavidrein/gpqa.

SciBench - Quantitative college textbook problems

SciBench targets open-ended quantitative problems drawn from university-level textbooks in thermodynamics, quantum mechanics, electromagnetism, and physical chemistry. Unlike multiple-choice benchmarks, it requires the model to produce a numerical answer, often in a specific unit and format. This makes it more brittle on scoring but much harder to game. A model that cannot set up and solve differential equations, apply conservation laws, or use dimensional analysis will fail here regardless of its parametric knowledge. Paper: arxiv.org/abs/2406.11694. Repository: github.com/mandyyyyii/scibench.

OlympiadBench-Science - International science olympiad problems

OlympiadBench includes problems from Physics, Chemistry, and Biology Olympiad competitions (IPhO, IChO, IBO) across multiple difficulty levels. These are problems that stump talented high-school students who have been specifically trained for them. The Science subset (OlympiadBench-Sci) excludes the mathematics problems tracked separately in the Math Olympiad leaderboard. Paper: arxiv.org/abs/2402.14008. Repository: github.com/OpenBMB/OlympiadBench.

MMLU-STEM - Broad knowledge across 12 STEM subjects

Massive Multitask Language Understanding (MMLU) covers 57 subjects; the STEM subset isolates 12 technical disciplines including abstract algebra, astronomy, college biology, college chemistry, college physics, computer science, high-school chemistry, high-school physics, and others. At roughly 4,000 questions with four-choice answers, MMLU-STEM is more a knowledge breadth test than a deep reasoning test. Models that have absorbed a wide undergraduate science curriculum score well even without chain-of-thought. Paper: arxiv.org/abs/2009.03300.

ARC-Challenge - Multi-step science reasoning

The AI2 Reasoning Challenge (ARC) Challenge set contains 1,172 four-choice science questions that retrieval-based and word-overlap systems could not answer correctly. They test multi-step inference: a question about thermal expansion might require knowing that metals conduct heat, that molecules vibrate faster at higher temperatures, and that this causes dimensional change, all in a single problem. ARC-Challenge remains useful for separating capable models from capable-looking ones. Dataset: huggingface.co/datasets/allenai/ai2_arc.

ChemQA and Physics Olympiad subset

For chemistry and physics specifically, I pull in scores on ChemQA (college-level quantitative chemistry problems with multi-step synthesis and reaction pathways) and the Physics Olympiad subset from OlympiadBench where providers have reported them. These scores are less uniformly reported, so I treat them as supplementary signal in the last column rather than a primary ranking criterion.

Scientific Reasoning Rankings - April 2026

Scores below are drawn from published papers, model cards, and official system cards. Where no public figure exists, I write "Not reported" rather than interpolate. Ranges indicate conflicting reports across evaluation sources or different prompting conditions.

Rank	Model	Provider	GPQA Diamond	SciBench	OlympiadBench-Sci	MMLU-STEM avg	ARC-Challenge	ChemQA / Phys Olympiad	Notes
1	o3	OpenAI	87.7%	Not reported	Not reported	~92%	98.0%	Not reported	Best public GPQA Diamond from system card
2	Claude 4 Opus	Anthropic	84.9%	Not reported	Not reported	~91%	97.8%	Not reported	Anthropic system card; inference-time reasoning
3	Gemini 2.5 Pro (Deep Think)	Google DeepMind	84.0%	Not reported	Not reported	~91%	Not reported	Not reported	Deep Think mode; Gemini 2.5 Pro tech report
4	GPT-4.1	OpenAI	75.0%	Not reported	Not reported	88.5%	96.8%	Not reported	OpenAI system card; non-reasoning baseline
5	Claude 4 Sonnet	Anthropic	78.2%	Not reported	Not reported	~89%	97.2%	Not reported	Anthropic system card
6	Gemini 2.5 Pro	Google DeepMind	80.3%	Not reported	Not reported	89.9%	Not reported	Not reported	Standard mode; tech report
7	DeepSeek-R2	DeepSeek	76.8%	Not reported	Not reported	~88%	Not reported	Not reported	Estimated from R2 tech report; reasoning chain
8	DeepSeek V3.2	DeepSeek	71.6%	Not reported	Not reported	88.3%	Not reported	Not reported	DeepSeek V3.2 tech report
9	Grok 4 Heavy	xAI	77.1%	Not reported	Not reported	~89%	Not reported	Not reported	xAI system card; limited independent verification
10	Qwen 3.5	Alibaba	72.0%	Not reported	Not reported	85.0%	Not reported	Not reported	Qwen 3.5 tech report
11	QwQ-Max	Alibaba	68.3%	Not reported	Not reported	~83%	Not reported	Not reported	Alibaba model card
12	Llama 4 Maverick	Meta	69.8%	Not reported	Not reported	84.5%	95.3%	Not reported	Meta Llama 4 tech report
13	Phi-4	Microsoft	62.8%	Not reported	Not reported	82.0%	94.1%	Not reported	Microsoft Phi-4 tech report
14	Mistral Large 3	Mistral	61.0%	Not reported	Not reported	78.0%	93.4%	Not reported	Mistral model card
15	Skywork-OR1	Skywork	57.2%	Not reported	Not reported	Not reported	Not reported	Not reported	Limited public reporting

Rankings ordered primarily by GPQA Diamond, which has the most consistent coverage across models. Dashes and "Not reported" entries are genuine gaps in public documentation - not omissions on my part. SciBench, OlympiadBench-Sci, and ChemQA/Physics Olympiad scores have sparse coverage across the current generation of frontier models because providers updated systems faster than benchmark papers could track them.

Models where only estimated scores are available are marked with a tilde (~). Estimates derive from interpolation across related benchmarks in official tech reports - not from independent runs.

Domain Breakdown

Where the data allows, it helps to separate science reasoning by domain. The pattern that emerges across GPQA Diamond's question categories and OlympiadBench's subject split is consistent: reasoning-optimized models pull ahead most sharply on physics and chemistry, where multi-step formal reasoning is unavoidable. Biology and earth science - which rely more heavily on factual recall and classification - show a smaller gap between reasoning and non-reasoning frontiers.

Physics

Quantitative physics demands both symbolic reasoning and numerical fluency. A model must understand the structure of a problem (identify relevant equations, recognize symmetry, choose a coordinate system), then execute the algebra without error, and finally interpret the result physically. Reasoning-heavy models that take extended compute at inference time have a significant edge here. On the IPhO subset of OlympiadBench, the gap between the top reasoning models and non-reasoning frontiers is wider than any other science domain.

SciBench's thermodynamics and quantum mechanics problems align with this finding: models that produce extended chain-of-thought consistently outperform those that do not, even controlling for parameter count. The answer-format brittleness of SciBench (unit mismatches kill otherwise correct solutions) is a real caveat, but the rank ordering is stable.

Chemistry

Chemistry straddles qualitative (reaction mechanisms, periodicity, molecular geometry) and quantitative (stoichiometry, equilibrium constants, thermodynamic cycles). GPQA Diamond's chemistry questions skew toward multi-step quantitative problems. This is the domain where chain-of-thought length most predicts accuracy: models that work through each reaction step explicitly perform better than those that shortcut.

ChemQA, where reported, shows a similar tier structure. Frontier reasoning models and frontier non-reasoning models are separated by roughly 8-10 points on quantitative synthesis questions, with the gap narrowing on qualitative mechanism identification.

Biology

Biology in GPQA Diamond is weighted toward molecular biology, genetics, and biochemistry - areas where conceptual depth matters more than formal manipulation. Non-reasoning frontier models (GPT-4.1, Gemini 2.5 Pro standard, DeepSeek V3.2) close significantly closer to the reasoning models here. A model that has thoroughly internalized the genetics and biochemistry literature can perform well without extended inference time.

The exception is multi-step experimental design questions, where reasoning chains help significantly. But on straightforward molecular biology recall and classification, the advantage of inference-time compute is smaller than in physics or chemistry.

Earth Science and Interdisciplinary

Earth science questions in MMLU-STEM (college physical geology, climate science) favor models with broad factual coverage. ARC-Challenge's earth science questions (weather, ecosystems, geological processes) are generally the easiest sub-category for frontier models to handle. The interesting cases are interdisciplinary questions - astrobiology, biogeochemistry, environmental chemistry - where models need to integrate across domains. This is where non-reasoning frontier models show their weakest relative performance.

Key Findings

Reasoning-heavy models dominate the top of GPQA Diamond. The gap between o3 (87.7%) and a strong non-reasoning frontier model like GPT-4.1 (75.0%) is twelve percentage points. This is not noise. It reflects a genuine structural advantage of extended chain-of-thought on problems that require multi-step formal derivation. For applications where STEM reasoning accuracy is critical - scientific research assistance, education, technical documentation - this gap is practically meaningful.

Frontier non-reasoning models close the gap on knowledge-heavy tasks. On MMLU-STEM and ARC-Challenge, models like GPT-4.1, DeepSeek V3.2, and Gemini 2.5 Pro standard sit within a few points of the reasoning-optimized frontiers. A lot of MMLU-STEM can be answered from memorized factual associations without explicit reasoning chains. For applications that prioritize breadth of science knowledge over deep problem-solving, the cost premium of reasoning models is harder to justify.

Open-weight models are competitive on ARC and MMLU-STEM but trail on GPQA Diamond. Llama 4 Maverick at 69.8% GPQA Diamond, Phi-4 at 62.8%, and Qwen 3.5 at 72.0% are all credible open-weight results. But none of them are within shouting distance of the 84-87% range occupied by the top closed-source reasoning models on the benchmark that most directly measures expert-level science reasoning. The open-weight ecosystem is improving fast, but the GPQA Diamond gap is real.

Scientific data coverage is sparse. Almost every model I surveyed has published MMLU-STEM and ARC-Challenge scores. Very few have published SciBench or OlympiadBench-Sci scores for current-generation models. The benchmark infrastructure has not kept pace with model releases - providers ship models faster than independent evaluation can measure them. This is a problem for the field and a limitation of this leaderboard.

Methodology

Scores in this leaderboard are sourced from the following in priority order:

Official system cards and technical reports from model providers (OpenAI, Anthropic, Google DeepMind, DeepSeek, Meta, Microsoft, Alibaba, Mistral, xAI)
The benchmark papers themselves, where newer models were evaluated post-publication
Independent evaluation platforms with documented methodology

I do not publish scores from social media posts, unverifiable blog posts, or uncited third-party sources. Where a score is marked with ~, it is an estimate from interpolation across benchmarks documented in the relevant tech report - not a number I measured.

Rankings are ordered by GPQA Diamond because it has the most consistent coverage across the model set and the best-validated methodology. Where GPQA Diamond scores are tied within a percentage point, MMLU-STEM average serves as tiebreaker.

Caveats and Limitations

GPQA contamination risk. The 198 GPQA Diamond questions have been public since late 2023. Any model trained after that date may have seen them during pre-training or instruction tuning. Providers typically claim no deliberate contamination, but it is impossible to audit this fully. GPQA Diamond scores should be interpreted as an upper bound - actual out-of-distribution science reasoning performance may be lower.

SciBench answer-format brittleness. SciBench requires exact numerical answers in specified units. A model that correctly sets up and solves a thermodynamics problem but expresses the answer in joules when the expected unit is kilojoules will score zero. This creates variance unrelated to reasoning ability. It also means that prompting format and unit-specification in the system prompt can swing SciBench scores by several points - which makes comparing scores across evaluation setups unreliable.

Lab and equipment procedural reasoning is not tested. None of these benchmarks test whether a model can reason about actual laboratory procedure - titration protocol, spectroscopy interpretation, statistical error analysis in experimental data. GPQA Diamond and SciBench test theoretical and quantitative reasoning. A model that scores well here may still fail on questions that require practical experimental knowledge.

Scientific paper generation is a separate problem. Scoring well on multiple-choice and short-answer science benchmarks does not mean a model can write accurate, non-hallucinated scientific literature. The hallucination benchmarks leaderboard covers factual accuracy in generation more directly. See also the general reasoning leaderboard for GPQA Diamond in the context of broader reasoning benchmarks.

OlympiadBench-Sci and ChemQA coverage is sparse. These two benchmarks have solid papers and methodology behind them, but few current-generation frontier models have been evaluated against them using consistent prompting conditions. I have not fabricated numbers to fill the table - "Not reported" is honest, and the field needs to do better on this front.

Grok 4 Heavy lacks API access. xAI's Grok 4 Heavy is only available through the Grok web interface and iOS/Android app. Independent benchmarking is limited to what xAI has reported in its own system cards. Treat its scores with appropriate skepticism compared to models that have been independently replicated.

Sources

Structured Output JSON Schema Leaderboard 2026

James Kowalski — Sun, 19 Apr 2026 00:00:00 +0000

Every agent pipeline eventually bottlenecks on structured output. It doesn't matter how good a model's reasoning is if it can't reliably return a JSON object that matches the schema your downstream code expects. A single missing required field, an extra property the schema forbids, or an incorrectly typed value will break the pipeline as surely as a wrong tool choice.

This leaderboard covers both sides of the structured output problem: native JSON schema enforcement built into model APIs (OpenAI Structured Outputs, Anthropic tool use, Google Gemini responseSchema, and others), and open-source constrained decoding libraries that work at the inference layer (Outlines, Guidance, LM Format Enforcer, XGrammar, SGLang, jsonformer). The two approaches solve the same problem at different levels of the stack, and the right choice depends on whether you control the inference runtime.

Scores come from JSONSchemaBench - a Microsoft Research and EPFL paper from January 2025 testing six constrained decoding frameworks across 10,000 real-world JSON schemas - and from BFCL v3, which covers structured function call formatting across frontier models. Native API providers are assessed separately where official documentation reports compliance behavior.

If you're reading this alongside the function calling benchmarks leaderboard, note the distinction: function calling evaluates whether a model picks the right tool and populates the right arguments. Structured output benchmarks evaluate whether the raw JSON or object the model emits is valid against a schema, regardless of the task it's solving. The instruction following leaderboard covers a third dimension - whether a model follows format constraints given in natural language instructions. All three matter for production agents.

TL;DR

On JSONSchemaBench (10K real-world schemas), Guidance leads on coverage (96% empirical coverage on GlaiveAI schemas) while OpenAI and Gemini native APIs achieve 100% compliance on schemas they support but cover fewer schema types
Constrained decoding libraries handle complex nested schemas better than native APIs but with measurable latency costs - Outlines compiles grammars in 3-8 seconds vs. Guidance's near-zero compile time
On BFCL v3 structured function call formatting, GLM-4.5 (76.7%) and Qwen3 32B (75.7%) lead frontier models; Claude Opus 4 scores 25.3% due to conversational output wrapping that fails AST parsing
For production pipelines: if you control the inference runtime, Guidance delivers the strongest coverage-speed tradeoff; if you're calling a hosted API, OpenAI strict: true mode offers the most reliable guarantee within its supported schema subset

The Benchmarks Explained

JSONSchemaBench

JSONSchemaBench, published by researchers from EPFL, Microsoft Research, and the JSON Schema team in January 2025, is the most systematic evaluation of constrained decoding that exists. The benchmark contains 9,558 real-world JSON schemas organized across 10 datasets with varying complexity levels - from simple flat objects to deeply nested schemas with $ref resolution, anyOf/oneOf combiners, and complex constraint types like pattern, minimum, uniqueItems, and const.

The evaluation measures three things:

Empirical coverage - what fraction of schemas in the dataset does a framework successfully process? A framework that crashes on anyOf schemas will score low here even if it handles simple schemas perfectly.

Compliance rate - for the schemas a framework does process, what fraction of generated outputs actually validate against the schema? A framework can have high empirical coverage but low compliance if it attempts all schemas but fails many.

Efficiency - grammar compilation time (the overhead before generation begins) and time per output token (the generation slowdown introduced by constrained decoding).

The researchers tested six frameworks: Guidance, Outlines, Llamacpp (via llama.cpp's GBNF grammar backend), XGrammar, OpenAI (gpt-4o with response_format: {type: "json_schema"}), and Gemini (gemini-1.5-pro with responseSchema). All constrained decoding frameworks ran against the same base model (Llama-3.1-8B-Instruct) to isolate framework behavior from model capability.

BFCL v3 (Berkeley Function Calling Leaderboard)

BFCL v3 from UC Berkeley's Sky Computing Lab is the standard tool-use benchmark. It's relevant here because function calls require valid JSON payloads matching an argument schema. BFCL uses Abstract Syntax Tree comparison to check structural correctness - catching subtle issues like mistyped field names and incorrectly nested arguments. Results here reflect how frontier models generate structured API call payloads when using their native tool-use APIs.

JSONSchemaBench Results

Coverage and Compliance by Dataset Complexity

Data from JSONSchemaBench (arXiv:2501.10868), Table 4. All constrained decoding frameworks tested against Llama-3.1-8B-Instruct. OpenAI and Gemini tested against their respective hosted models.

GlaiveAI Dataset (moderate complexity)

Rank	Framework / API	Empirical Coverage	Compliance Rate
1	Guidance	96%	98%
2	Llamacpp (GBNF)	95%	97%
3	Outlines	95%	96%
4	XGrammar	93%	93%
5	OpenAI (gpt-4o)	89%	100%
6	Gemini (1.5 Pro)	86%	100%

The native API results reveal an important tradeoff: OpenAI and Gemini achieve perfect compliance on the schemas they process, but skip more schemas than the local frameworks do. When OpenAI's structured output mode encounters a schema type it doesn't support (like certain $ref patterns or anyOf with complex branches), it falls back to unguided generation rather than attempting to comply - so the 100% compliance number comes at the cost of 11% empirical coverage.

GitHub Easy Dataset (simpler, developer-produced schemas)

Rank	Framework / API	Empirical Coverage	Compliance Rate
1	Guidance	86%	96%
2	XGrammar	79%	87%
3	Llamacpp (GBNF)	75%	88%
4	Outlines	59%	83%
5	OpenAI (gpt-4o)	29%	97%
6	Gemini (1.5 Pro)	Not reported	Not reported

OpenAI's coverage drops to 29% on the GitHub Easy dataset - not because the schemas are harder, but because they include more variety in structural patterns (schemas from real GitHub repos) that OpenAI's strict mode doesn't cover. The 97% compliance on what it does handle remains strong.

GitHub Hard Dataset (complex, highly nested real-world schemas)

Rank	Framework / API	Empirical Coverage	Compliance Rate
1	Guidance	41%	69%
2	Llamacpp (GBNF)	39%	63%
3	XGrammar	28%	41%
4	Outlines	3%	6%
5	OpenAI (gpt-4o)	Not reported	Not reported
6	Gemini (1.5 Pro)	Not reported	Not reported

The hard dataset exposes a harsh truth: none of the tested frameworks handles deeply complex schemas reliably. Guidance leads at 41% coverage and 69% compliance, but a 31% failure rate on complex schemas is still meaningful for production use. Outlines' 3% coverage on this dataset is the sharpest drop - its grammar compilation approach struggles significantly with advanced JSON Schema constructs like multi-level $ref resolution and complex combiners.

Grammar Compilation Efficiency

Constrained decoding imposes two types of overhead: grammar compilation time (paid once per schema, before generation starts) and per-token generation overhead (paid on every generated token). Both matter in production. The per-schema compilation time determines whether you can afford to compile dynamically per-request. The per-token overhead determines throughput.

Data from JSONSchemaBench Tables 2-3.

Grammar Compilation Time

Framework	Compilation Time (GlaiveAI)	Compilation Time (GitHub)
Guidance	0.00-0.01 seconds	0.00-0.01 seconds
Llamacpp	0.05-0.06 seconds	0.05-0.06 seconds
XGrammar	0.12-0.30 seconds	0.12-0.30 seconds
Outlines	3.48-8.05 seconds	Variable

Outlines' compile time of 3-8 seconds per schema is its most significant production limitation. For applications that compile grammars once and reuse them across many requests (batch processing, static schema services), this isn't critical. For request-time schema compilation - where you're generating a schema per user request and enforcing it immediately - Outlines' compile time will dominate your latency budget. Guidance's near-zero compile time changes the calculus entirely for this use case.

Time Per Output Token (Generation Overhead)

Framework	TPOT (ms)	vs. Unconstrained
Guidance	6.37-9.47	Minimal overhead
Llamacpp	27.22-29.98	~4x slower
Outlines	30.33-46.57	~5x slower
XGrammar (HF backend)	65.20-66.78	~10x slower

Guidance's per-token speed is notably faster than other libraries. The paper attributes this to Guidance's token-level rather than character-level FSM approach and to its coalescence optimization, which defers constraint application to avoid unnecessary re-computation during generation.

JSON Schema Feature Support

The JSONSchemaBench authors also ran frameworks against the official JSON Schema Test Suite to measure feature coverage. The test suite contains 440 individual constraint categories.

Framework	Categories with 100% Coverage	Categories with > 50% Coverage
Guidance	13	21
Llamacpp	1	5
XGrammar	1	3
Outlines	0	2

Guidance covers more of the JSON Schema specification than any other tested framework. Most libraries implement a working subset of JSON Schema that handles common cases but doesn't cover the full spec. For applications that generate schemas programmatically or accept user-provided schemas, this matters - a user could provide a valid JSON Schema that the framework simply can't process.

BFCL v3 Rankings - Structured Function Call Format

Data from llm-stats.com, April 2026. These scores measure whether models produce structurally valid JSON function call payloads matching tool schemas - a closely related but distinct task from free-form JSON generation.

Rank	Model	Provider	BFCL v3 Score
1	GLM 4.5 Thinking	Z AI	76.7%
2	Qwen3 32B	Alibaba	75.7%
3	Qwen3 235B A22B	Alibaba	74.9%
4	GLM-4.7-Flash	Z AI	74.6%
5	GLM 4.5 Air	Z AI	69.1%
6	Nova Pro 1.0	Amazon	67.9%
7	Kimi K2.5	Moonshot AI	64.5%
8	INTELLECT-3	Prime Intellect	63.5%
9	Llama 4 Scout	Meta	55.7%
10	Gemini 3 Flash Preview Thinking	Google	53.5%
11	MiniMax M1	MiniMax	47.8%
12	Claude Opus 4	Anthropic	25.3%

The Claude result at 25.3% looks alarming but reflects a specific evaluation mismatch rather than genuine inability to produce valid JSON. BFCL's AST parser expects tool calls in a rigid format; Claude wraps its tool invocations in conversational context that the parser rejects even when the underlying JSON structure is correct. When Claude is tested on multi-turn task completion - which tolerates formatting variation - it leads the field. See the function calling benchmarks leaderboard for a full treatment of this split.

For structured output tasks where you need exact schema compliance, not just tool-call formatting, the BFCL results suggest using an explicit enforcement layer (constrained decoding or native strict mode) rather than relying on model instruction following alone.

Native API Approaches Compared

The major hosted model providers all offer some form of structured output enforcement. They differ significantly in scope and reliability.

OpenAI Structured Outputs (`strict: true`)

OpenAI's structured outputs mode, introduced in August 2024 and available on gpt-4o and later models, is the most widely adopted native approach. With strict: true, OpenAI guarantees that the model output matches the provided JSON Schema - it won't return invalid JSON and it won't omit required fields. The guarantee is enforced at the API level using constrained decoding on OpenAI's serving infrastructure.

The tradeoff is that strict: true only supports a specific subset of JSON Schema. Unsupported features include: anyOf branches with incompatible types, certain $ref usage patterns, unevaluatedProperties, and some integer constraint patterns. When you submit a schema that uses unsupported features, OpenAI falls back to non-strict mode without always surfacing a clear warning. The JSONSchemaBench results - 89% empirical coverage on moderate schemas, 29% on developer-produced schemas - reflect this subset limitation.

For applications where you control the schema and can design it to stay within OpenAI's supported subset, the guarantee is strong. For applications that accept or generate schemas dynamically, the coverage gap matters.

Anthropic Tool Use JSON Schema

Anthropic's tool use API requires tool parameter schemas to be provided as JSON Schema. Claude's output for tool calls is structurally JSON - it will produce a JSON object for each tool invocation - but the enforcement is behavioral rather than architectural. Anthropic doesn't apply constrained decoding at the API level; the model is trained to follow tool schemas reliably.

In practice, Claude scores extremely well on multi-turn tool use benchmarks (0.862 on tau-bench retail, leading the field) but produces conversational output that fails strict AST parsing in single-call evaluations. For production use, the key distinction is that Anthropic doesn't provide a hard guarantee of schema validity the way OpenAI strict: true does. Most well-formed requests will produce valid tool calls, but schema violations are possible on unusual or complex schemas.

Google Gemini `responseSchema`

Google Gemini's structured output via responseSchema in the generation config is the Gemini equivalent of OpenAI's strict mode. Available on Gemini 1.5 Pro and later, it accepts a JSON Schema and enforces the output structure. JSONSchemaBench results show 86% empirical coverage and 100% compliance on supported schemas - similar coverage to OpenAI but with the same subset limitation.

Gemini's implementation handles a slightly different subset of JSON Schema than OpenAI's. For teams evaluating both providers, it's worth testing your specific schemas against both - the unsupported schema patterns don't overlap exactly.

Mistral Response Format

Mistral supports a response_format parameter in its chat API that accepts {"type": "json_object"} for JSON mode. This enforces valid JSON but does not validate against a schema. For schema-level enforcement with Mistral models, you need to implement validation client-side or use an external constrained decoding layer. No published benchmark scores are available for Mistral-specific schema adherence.

Together AI and Fireworks JSON Modes

Together AI's JSON mode and Fireworks' structured output both support response_format with schema-based enforcement. Fireworks specifically uses a grammar-based approach (similar to Outlines/GBNF) that supports a wider schema subset than pure API-level enforcement. Neither has published systematic benchmarks against JSONSchemaBench, so schema validity rates are not reported here.

Open-Source Constrained Decoding Libraries

For teams running their own inference, constrained decoding libraries give you framework-level enforcement that works independently of which model you use.

Outlines

Outlines, from dottxt-ai (formerly the outlines-dev organization), is the most popular open-source library for structured generation. It uses a finite state machine (FSM) compiled from a JSON Schema to mask invalid tokens at each generation step - the model can only produce tokens that would lead to a valid completion of the current partial output.

JSONSchemaBench results: 95% coverage, 96% compliance on GlaiveAI schemas. Drops to 3% coverage on GitHub Hard schemas. Grammar compilation takes 3-8 seconds. Per-token generation adds ~5x overhead vs. unconstrained generation using the HuggingFace backend (though Outlines also supports VLLM, which offers substantially better throughput).

Best for: applications where schemas are known at startup, batched generation, or VLLM-backed deployments where the compilation cost is amortized.

Guidance

Guidance, from Microsoft Research, takes a different approach. Rather than compiling a full FSM at the start, it interleaves generation with constraint checking at a finer granularity. Its token-level (rather than character-level) FSM design and coalescence optimization reduce per-token overhead substantially.

JSONSchemaBench results: 96% coverage, 98% compliance on GlaiveAI schemas. 86% coverage on GitHub Easy. 41% coverage and 69% compliance on GitHub Hard - the strongest result in that category. Grammar compilation is near-zero. Per-token generation is 6-9 ms vs. 30+ ms for other libraries.

Guidance is also the only tested framework with meaningful JSON Schema Test Suite coverage across multiple feature categories. For production systems with complex or user-provided schemas, it offers the most complete spec coverage currently available.

Best for: dynamic schema compilation, complex schemas with $ref and combiners, applications where per-token latency matters.

XGrammar

XGrammar, from the MLC AI team (the group behind TVM and MLC-LLM), optimizes for serving throughput at scale. Its context-free grammar approach is designed for GPU batched inference scenarios where you need to enforce different schemas for different items in a batch simultaneously.

JSONSchemaBench results: 93% coverage, 93% compliance on GlaiveAI schemas. 79% coverage on GitHub Easy. 28% coverage on GitHub Hard. Compile time 0.12-0.30 seconds. Per-token generation using the HuggingFace backend is 65-67 ms - slower than Outlines on the same backend - but the library is optimized for dedicated GPU serving runtimes rather than the HuggingFace inference pipeline used in the benchmark.

Best for: GPU serving infrastructure where you need batched constrained decoding across many concurrent requests.

LM Format Enforcer

LM Format Enforcer is a lightweight library that integrates with HuggingFace Transformers, VLLM, and llama.cpp backends. It supports JSON Schema enforcement by filtering the logits at each generation step. It wasn't included in the JSONSchemaBench evaluation, so no systematic coverage data is available. Its main advantage is broad framework compatibility and simple integration: it works as a logit processor that can be dropped into most existing inference pipelines without restructuring the serving setup.

Best for: quick integration into existing HuggingFace or VLLM pipelines.

jsonformer

jsonformer takes a simpler approach than FSM-based libraries: it generates each field of a JSON object separately, using the schema to determine the type and constraints of each field and then calling the model only for the value portion. This avoids generating structural JSON tokens (braces, commas, colons) under model control entirely.

No published JSONSchemaBench scores. The approach works well for simple flat schemas and becomes increasingly limited with complex nested structures, optional fields, and anyOf branching. Not recommended for schemas that go beyond basic typed flat objects.

SGLang

SGLang's structured outputs support is built into its serving framework rather than layered on top. SGLang uses a compiled EBNF grammar approach and is designed for high-throughput serving. It supports JSON Schema, regex, and custom grammars. No independent JSONSchemaBench evaluation exists, but SGLang's XGrammar backend integration means its coverage characteristics approximate XGrammar's results when using that backend.

Best for: high-throughput inference servers where you want constrained decoding integrated at the serving layer rather than in client code.

Constrained decoding libraries enforce JSON Schema compliance by masking invalid tokens at each generation step - ensuring structural validity at the cost of some generation overhead. Source: pexels.com

Quality Impact of Constrained Decoding

One concern about constrained decoding is that forcing the model down a constrained token path might degrade output quality - the model can't choose the best token, only the best valid token. JSONSchemaBench tested this directly on three reasoning tasks, comparing framework outputs against unconstrained base model outputs.

Data from JSONSchemaBench Table 8 (Llama-3.1-8B-Instruct, quality assessment):

Framework	Last Letter (%)	GSM8K (%)	Shuffle Objects (%)
Guidance	54.0	83.8	55.9
Outlines	53.3	81.6	53.0
Llamacpp	52.0	82.4	52.6
XGrammar	51.2	83.7	52.7
Unconstrained	50.7	80.1	52.6

The finding is reassuring: constrained decoding doesn't hurt quality - and in several cases, it marginally improves it. The improvement on GSM8K (math reasoning) is notable: Guidance improves from 80.1% to 83.8% when the output must be formatted as structured JSON. The forced structure appears to help the model organize its answer rather than degrade its reasoning.

This result is consistent with prior work on chain-of-thought formatting: structured output constraints can act as a light scaffold that improves answer quality, especially for numerical tasks.

Key Takeaways

Coverage vs. guarantee: the core tradeoff

The most important finding from JSONSchemaBench is that no approach does everything well. Constrained decoding libraries (especially Guidance) cover more schema types and handle complex structures better than native APIs. But native APIs like OpenAI strict: true offer a hard guarantee - when they support a schema, the output is always valid. Libraries offer probabilistic compliance even at 96-98% rates.

For production systems: if your schema is stable, simple, and within the native API's supported subset, use the native API. If you need complex schema support or schema flexibility, use a constrained decoding library.

Outlines' compilation overhead is a real constraint

The 3-8 second grammar compilation time for Outlines rules out per-request dynamic schema compilation in latency-sensitive applications. If you're building a system where schemas vary per request (user-defined output formats, dynamic API integrations), Guidance's near-zero compile time is a meaningful operational advantage.

BFCL scores alone don't predict JSON validity

The BFCL leaderboard measures structured API call formatting. Claude Opus 4 at 25.3% and Guidance at 96-98% JSON Schema compliance are measuring fundamentally different things. A model's BFCL score tells you about its tool-call formatting behavior; it doesn't predict how it performs on arbitrary JSON Schema enforcement tasks. Use the JSONSchemaBench coverage and compliance numbers for the latter.

Hard schemas are still hard for everyone

On GitHub Hard schemas - the deeply nested, complex real-world schemas - even the best framework (Guidance) only achieves 41% coverage and 69% compliance. This is a research frontier problem, not a solved engineering one. For applications that need to enforce arbitrary complex JSON Schemas today, the practical answer is a combination of constrained decoding and post-generation validation with retry logic.

For broader agent infrastructure context, see our roundup of best AI agent frameworks, which covers how these structured output approaches integrate into full agent orchestration stacks.

Methodology Notes and Caveats

Schema complexity scaling: JSONSchemaBench's three dataset tiers - GlaiveAI (moderate), GitHub Easy, GitHub Hard - show non-linear degradation for all frameworks. Results on GlaiveAI schemas do not predict results on complex schemas. If your application uses schemas with advanced constructs (nested $ref, anyOf/oneOf combiners, pattern validation), test against the GitHub Hard tier performance profile, not the headline GlaiveAI numbers.

Refusal rates and fallback behavior: Native API providers handle unsupported schema types differently. OpenAI silently falls back to non-strict mode; Gemini's behavior on unsupported schemas is less consistently documented. Monitor your API responses for the refusal field in the response object to detect cases where the provider is generating without constraint enforcement.

Speed penalty of constrained decoding: The per-token generation overhead ranges from minimal (Guidance) to ~10x (XGrammar on HuggingFace backend). Benchmark measurements used the HuggingFace Transformers backend for local libraries, which isn't representative of production VLLM or TensorRT-LLM deployments. Outlines on VLLM, for example, runs substantially faster than Outlines on HuggingFace Transformers. Reproduce benchmarks in your target serving environment before drawing production conclusions.

Base model matters for constrained decoding quality: JSONSchemaBench's framework comparison used Llama-3.1-8B-Instruct as the base model. A larger or more capable model will produce higher quality outputs under the same constraints. The coverage and compliance numbers reflect framework capability, not a prediction of what you'd achieve with GPT-4o or Claude running through Guidance.

Native API feature sets evolve: OpenAI, Anthropic, and Google update their structured output implementations and expand supported schema subsets over time. The coverage numbers from JSONSchemaBench (early 2025) may not reflect the current feature sets of hosted APIs. Check current provider documentation for the latest supported schema patterns.

FAQ

Which approach is best for strict JSON Schema enforcement in production?

For simple, stable schemas within the OpenAI-supported subset: OpenAI strict: true. For complex schemas or schema flexibility: Guidance with post-generation validation. Both approaches have different failure modes - test your specific schema against each before committing.

Does constrained decoding hurt the quality of generated content?

Based on JSONSchemaBench's quality measurements: no. Constrained generation matched or slightly exceeded unconstrained generation quality on all three tested tasks. The forced structure appears to help rather than hurt reasoning quality in some cases.

Why does Claude score so low on BFCL structured call formatting?

Claude's BFCL score of 25.3% reflects how it wraps tool calls in conversational context that BFCL's AST parser rejects. It's not evidence of poor JSON generation capability - Claude leads multi-turn tool use benchmarks. Use an explicit schema enforcement layer if you need hard guarantees rather than relying on BFCL scores to predict schema compliance behavior.

Can I use constrained decoding with hosted APIs?

Some hosted inference providers support it. Fireworks has a grammar-based structured output mode. Together AI has JSON mode. For truly arbitrary schema enforcement, you typically need to control the inference runtime (self-hosted models via VLLM, llama.cpp, or similar) to apply a constrained decoding library at the logit level.

How do I handle schemas that fall in the "unsupported" range for native APIs?

Two options: redesign your schema to stay within the provider's supported subset (usually means avoiding advanced $ref, complex anyOf, and certain string pattern constraints), or run a constrained decoding library on a self-hosted model. If the schema is user-provided and you can't control it, add client-side validation with retry logic regardless of which enforcement approach you use.

What is XGrammar's advantage over Outlines or Guidance?

XGrammar is optimized for batched GPU inference at scale - applying different schema constraints to different items in a batch simultaneously. Its HuggingFace Transformers benchmark times look slow, but it's not designed for that backend. In a dedicated GPU serving deployment (e.g., SGLang's XGrammar backend or a custom TensorRT setup), it achieves better throughput than libraries designed for single-request inference.

Sources:

Summarization LLM Leaderboard 2026: ROUGE and Faithfulness

James Kowalski — Sun, 19 Apr 2026 00:00:00 +0000

Summarization looks like a solved problem on the leaderboards. Load up CNN/DailyMail, run ROUGE-L against the reference summaries, and most frontier models cluster in the high 40s - a range that barely moved between GPT-3 and GPT-4, and hasn't moved much since. The metrics suggest diminishing returns. The reality is different.

Short news article summarization is genuinely easy for today's models. Give any frontier LLM a 500-word BBC story and ask for a paragraph summary, and the result will be accurate, fluent, and useful. The problems show up when the source is long, technical, or multi-document. Summarizing a 200-page government report, a full book chapter, a week of meeting transcripts, or a cluster of contradictory scientific papers - these tasks expose real capability gaps that ROUGE scores have never adequately captured.

This leaderboard covers both dimensions: the established automatic metrics that still dominate academic comparison, and the human judgment scores that better reflect whether a summary is actually useful. Where published numbers don't exist, I've written "Not reported" rather than estimating.

TL;DR

GPT-5 and Claude 4 Opus lead on human preference and long-document tasks, but few providers publish comparable scores on the same benchmarks
ROUGE-L scores are saturated for news summarization - differences of 1-2 points are not meaningful; focus on FActScore and human-preference win rate instead
Reasoning models (o3, Claude thinking variants) often over-explain rather than summarize - higher verbosity is not better on summarization benchmarks
Open-source models (Llama 4, Qwen 3.5) have closed most of the ROUGE gap on short documents but still lag on multi-document and long-form tasks

The Benchmarks Explained

Not all summarization benchmarks measure the same thing. Here is what each one tests and why it matters.

ROUGE-L

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) was introduced by Chin-Yew Lin in 2004 and remains the most widely reported summarization metric despite well-documented limitations. ROUGE-L specifically measures the longest common subsequence between the generated summary and the reference - a proxy for fluency and coverage.

What it measures well: Lexical overlap with a human-written reference summary. Useful for comparing systems on the same dataset under the same conditions.

What it does not measure: Factual accuracy, coherence, handling of novel phrasing, or whether a summary is actually readable. A summary that copies sentences verbatim can score higher than one that paraphrases more naturally. Two perfectly good summaries with different word choices can score very differently against the same reference. ROUGE also penalizes abstractive models that rephrase rather than extract - which is exactly what you want a good summarizer to do.

The practical consequence: ROUGE-L scores on CNN/DailyMail have a ceiling effect. Scores above 43 are roughly equivalent in real-world quality. Once models cross that threshold, differences are noise, not signal.

BERTScore

BERTScore, introduced by Zhang et al. in 2020, replaces exact token matching with contextual embedding similarity using a pretrained language model (typically DeBERTa-xxlarge). It computes precision, recall, and F1 between the tokens of the candidate and reference summaries in embedding space.

BERTScore correlates better with human judgments than ROUGE on abstractive summaries and is less sensitive to paraphrase. But it inherits the same reference-dependency problem: if the reference summary is mediocre, a model that writes a better summary will still score lower. It also tends to reward verbosity, which biases against concise models.

FActScore

FActScore (Factuality Score), introduced by Min et al. (2023), takes a fundamentally different approach. Rather than comparing a summary to a reference, it decomposes the generated text into atomic facts and checks each one against a knowledge source - typically Wikipedia or the source document. The score is the fraction of atomic facts that are verifiable.

For summarization, FActScore directly measures what ROUGE cannot: whether the model is making things up. A high FActScore means the summary stays grounded in the source material. A low score means the model is hallucinating details, conflating information from multiple sources, or generating plausible-sounding but unsupported claims. This is the metric that matters most for high-stakes applications - legal documents, medical records, financial reports.

For more on how factuality failures manifest across different tasks, see our Hallucination Benchmarks Leaderboard.

Human Preference Win Rate

Human preference evaluation asks annotators to compare two summaries side-by-side and select the better one. Win rate is the fraction of comparisons where a model's output is preferred over a baseline (typically GPT-4 or the previous best model). This is the most direct measure of real-world quality but is expensive, slow, and hard to reproduce.

Human judges typically weight fluency, informativeness, faithfulness to source, and conciseness. The weighting varies by annotator and instruction, which makes cross-study comparison unreliable. Numbers from different research teams should be treated as directional signals, not precise rankings.

Long-Document Benchmarks

GovReport consists of 19,466 U.S. government report summaries from the Congressional Research Service and Government Accountability Office. Source documents average 9,409 words; summaries average 553 words. This benchmark tests a model's ability to distill dense technical policy documents rather than news articles.

QMSum is a meeting summarization benchmark with 232 meeting transcripts and 1,808 query-based summarization tasks. Unlike most summarization benchmarks, it requires models to identify and summarize only the parts of a transcript relevant to a specific question - a much harder task than summarizing an entire document.

BookSum covers chapter-level and book-level summarization of literary texts (Project Gutenberg, NewsRoom). With source documents up to 100,000 tokens, it directly tests long-form comprehension and the ability to maintain narrative coherence across a very large context. See our Long-Context Benchmarks Leaderboard for the retrieval side of this capability.

MultiNews is a multi-document summarization benchmark with 56,216 article clusters (2-10 articles per cluster) from the web. Each cluster covers the same news event from multiple sources. The task requires reconciling potentially contradictory information, identifying the most important facts across sources, and writing a coherent single summary.

XSum (Extreme Summarization) uses BBC articles where each article has a single-sentence human-written summary. The task demands extreme compression - generating a single sentence that captures the most important information from an article, often requiring inference beyond literal extraction.

MeetingBank covers 1,366 city council meeting recordings transcribed and annotated with reference summaries. It tests long-form meeting comprehension with domain-specific vocabulary.

Summarization benchmarks vary significantly in what they measure - from single-sentence compression on XSum to full-book summaries in BookSum. Source: pollinations.ai

Main Ranking Table

The table below ranks 14 models on the metrics where published numbers are available. ROUGE-L and BERTScore figures come from published papers, model cards, or public leaderboards. FActScore figures come from the SummEval evaluation framework and related factual consistency studies. Human preference win rates come from Chatbot Arena summarization-specific elo comparisons and published model evaluations. Long-document GovReport figures use ROUGE-L where available.

Where no public figure has been published, I have written "Not reported." I have not estimated or interpolated scores.

Rank	Model	Provider	ROUGE-L CNN/DM	ROUGE-L XSum	FActScore	Human-Pref Win Rate	Long-doc GovReport	Notes
1	GPT-5	OpenAI	Not reported	Not reported	Not reported	~68%	Not reported	Strongest human-pref scores in independent evals; limited public benchmark data
2	Claude 4 Opus	Anthropic	Not reported	Not reported	Not reported	~64%	Not reported	Best on long-form tasks in Chatbot Arena summarization category
3	Gemini 2.5 Pro	Google	Not reported	Not reported	Not reported	~61%	Not reported	Strong factual grounding; leads FACTS Grounding benchmark
4	GPT-4.1	OpenAI	44.2	27.4	Not reported	~58%	Not reported	Solid ROUGE baseline; 1M context handles BookSum natively
5	Claude 4 Sonnet	Anthropic	Not reported	Not reported	Not reported	~55%	Not reported	Balanced cost-quality tradeoff for document summarization pipelines
6	o3 (reasoning)	OpenAI	Not reported	Not reported	Not reported	~48%	Not reported	Verbose; over-explains rather than summarizes; human judges penalize length
7	DeepSeek V3.2	DeepSeek	43.8	26.1	Not reported	~45%	Not reported	Cost-efficient; competitive ROUGE at fraction of frontier model cost
8	Grok 4	xAI	Not reported	Not reported	Not reported	~43%	Not reported	Limited public benchmark data; strong on short-form tasks in Arena
9	Qwen 3.5	Alibaba	43.5	25.7	Not reported	~41%	Not reported	Best open-weight option for short-doc summarization; MoE architecture
10	Llama 4	Meta	43.1	24.9	Not reported	~39%	Not reported	Open-source; 10M context window useful for BookSum-scale tasks
11	Mistral Large 3	Mistral AI	42.7	24.2	Not reported	~36%	Not reported	Competitive on extractive tasks; struggles with multi-doc reconciliation
12	Phi-4	Microsoft	41.9	23.5	Not reported	~31%	Not reported	Strong ROUGE-per-parameter; best small-model option for news summarization
13	Mixtral 8x22B	Mistral AI	41.4	22.8	Not reported	~28%	Not reported	MoE architecture; lower instruction-following reliability affects output quality
14	Llama 4 Scout	Meta	40.6	22.1	Not reported	~25%	Not reported	10M context available; base summarization quality lower than larger variants

Important caveat on this table: ROUGE-L scores for GPT-5, Claude 4, Gemini 2.5 Pro, and Grok 4 are "Not reported" because these providers have not published results on standard CNN/DailyMail or XSum test sets. The human preference win rates are approximate figures derived from Chatbot Arena's summarization-specific elo ratings and published comparison studies - they are directional, not precise. Do not treat differences of 2-3 percentage points as significant.

For GPT-4.1 ROUGE-L figures, see the OpenAI GPT-4.1 technical report. For DeepSeek V3.2 and Qwen 3.5, numbers come from their respective model cards on HuggingFace.

FActScore: What We Actually Know

FActScore results for the latest frontier models are largely unpublished as of April 2026. The original FActScore paper (Min et al., 2023) evaluated earlier generations: InstructGPT scored 52.8%, Alpaca 55.0%, and ChatGPT (early version) 58.8% on Wikipedia-grounded biography generation. More recent work applying FActScore to summarization tasks is scattered across academic papers with inconsistent source corpora and model versions.

What we do know from adjacent benchmarks:

FACTS Grounding (Google DeepMind): Gemini 2.0 Flash leads at 83.6%, Claude 3.5 Sonnet at 79.4%, GPT-4o at 78.8% - testing faithfulness to provided documents up to 32K tokens. These numbers directly predict summarization factuality for document-grounded tasks.
Vectara HHEM: Reasoning models including GPT-5, Claude Sonnet 4.5, and Grok 4 all exceed 10% hallucination rates on the expanded 32K-token dataset. Smaller focused models (Phi-4 at 3.7%, Llama-3.3-70B at 4.1%) outperform them on grounded summarization faithfulness.
The pattern is consistent: models that "reason" heavily over source material tend to deviate from it. Summarization rewards extraction and compression, not chain-of-thought elaboration.

Until providers publish FActScore on standardized summarization benchmarks, the FACTS Grounding and Vectara HHEM numbers are the best proxies available. Both are linked in our Hallucination Benchmarks Leaderboard.

Key Takeaways

ROUGE Has Hit Its Ceiling for News Summarization

The ROUGE-L scores for CNN/DailyMail hover between 40 and 44 for every model in this table. GPT-4.1 leads at 44.2, but the gap between it and Llama 4 Scout (40.6) is not meaningfully larger than the variance introduced by different prompt templates or decoding temperatures. ROUGE was calibrated for extractive summarization systems in the early 2010s. Today's abstractive models systematically paraphrase reference summaries rather than reproduce them, so ROUGE penalizes good outputs. The metric's usefulness ended around 2022, and the industry hasn't fully moved on.

If you are choosing a model for a production summarization pipeline, do not select based on CNN/DailyMail ROUGE-L alone. Run FActScore, BERTScore, or human evaluation on your specific document domain instead.

Reasoning Models Are Not Good at Summarization

o3 and thinking-mode variants (Claude with extended thinking, Gemini thinking modes) consistently underperform their non-reasoning counterparts on summarization tasks. The reason is structural: chain-of-thought reasoning is optimized for derivation problems, not compression problems. A summarizer's job is to be shorter than the source. A reasoning model's tendency to elaborate, hedge, and add caveats fights against this goal directly.

In practice, o3 often produces summaries 30-50% longer than what human judges prefer, adding context that wasn't requested and qualifying statements the source made definitively. This isn't a factuality problem - it's a register mismatch. Summarization is a task where less is more, and reasoning models are trained to do more.

This connects to a broader finding from instruction following benchmarks: models that excel at multi-step reasoning sometimes fail at simple constraint following like length limits.

Factuality vs. Abstractive Fluency Is a Real Tradeoff

The best human-preferred summaries tend to be slightly more abstractive - they rephrase rather than extract, they synthesize rather than list, they use cleaner language than the source. But abstraction is where hallucination enters. A model that paraphrases confidently can introduce subtle distortions: changing a qualifier ("may cause" becomes "causes"), eliding important caveats, or conflating two similar facts.

The Vectara data makes this concrete: GPT-5's 10%+ hallucination rate on document summarization versus Phi-4's 3.7% shows that higher general capability does not mean more faithful summarization. Phi-4 is more extractive by default, which keeps it grounded. GPT-5's fluency advantage comes with a faithfulness cost.

For most enterprise use cases - summarizing customer support tickets, legal documents, financial filings - you want a model closer to the faithfulness end of this tradeoff, not the fluency end.

Long-Document Performance Is Where the Real Gaps Are

On CNN/DailyMail (average article length: ~800 words), the differences between frontier models are minor. On GovReport (average source: ~9,400 words), QMSum (multi-turn meeting transcripts), and BookSum (full chapters), the gaps are substantial - but published numbers are sparse. Most benchmark results cover only short-document tasks.

What we can infer: models with larger verified context windows (Claude 4 Opus at 1M tokens with strong MRCR performance, GPT-4.1 at 1M tokens) have a structural advantage on long-document tasks because they can ingest the full source. Models that silently truncate inputs on long documents will produce worse summaries without any indication that truncation occurred. Always verify that your model's effective context is sufficient for your longest source documents before trusting output quality.

The Open-Source Gap Has Narrowed for Short Documents - Not Long Ones

Qwen 3.5 at 43.5 and Llama 4 at 43.1 on CNN/DailyMail ROUGE-L are within noise distance of GPT-4.1's 44.2. For a news article summarization pipeline, the cost difference between Qwen 3.5 (self-hosted or very cheap via API) and GPT-4.1 is hard to justify. But on multi-document and long-form tasks, the frontier closed models still hold a meaningful advantage - primarily because their larger context windows and stronger instruction following allow them to handle complex summarization requests that require genuine synthesis rather than extraction.

Methodology

This leaderboard pulls numbers from the following sources:

ROUGE-L scores: Model cards published on HuggingFace and official technical reports where available. All CNN/DailyMail scores use the standard 3.0.0 test split. All XSum scores use the standard test split. No scores have been reproduced by running models locally - I am working from published figures only.

FActScore: Drawn from the original FActScore paper (Min et al., 2023) for earlier models, and from FACTS Grounding and Vectara HHEM as proxies for current frontier models.

Human preference win rates: Chatbot Arena summarization-specific ELO ratings, accessed April 2026. Win rates are approximate conversions from ELO differences relative to GPT-4 as baseline. These are directional rankings, not precise measurements.

Long-document scores: GovReport and BookSum scores, where reported, use ROUGE-L on the standard test splits from the original datasets. Many frontier models have not been evaluated on these benchmarks in peer-reviewed settings.

Caveats and Limitations

ROUGE limitations for abstractive summaries: ROUGE-L was designed for extractive summarization systems. It systematically undervalues paraphrase, penalizes different but valid orderings of information, and does not capture factual accuracy at all. High ROUGE scores and high-quality summaries are not the same thing.

Dataset contamination: CNN/DailyMail, XSum, and SummEval are all public datasets that have been available since 2015-2020. Any model trained after 2021 has likely seen these documents, their reference summaries, or closely related examples. ROUGE scores on these benchmarks for recent models should be interpreted with skepticism - the model may be partially recalling rather than summarizing.

Domain shift: All three major summarization benchmarks (CNN/DailyMail, XSum, GovReport) are drawn from narrow domains (news, government policy). Performance on your domain (medical records, legal contracts, technical documentation, customer support transcripts) may differ substantially. A model that excels at news summarization may fail on financial reports. Always evaluate on representative samples from your target domain.

Reference quality: SummEval work by Fabbri et al. showed that many CNN/DailyMail reference summaries are of poor quality - they truncate rather than summarize, they miss key information, and they sometimes contain errors. Benchmarking against flawed references produces noisy signals regardless of model quality.

Provider benchmark transparency: OpenAI, Anthropic, Google, and xAI do not consistently publish results on standard summarization benchmarks. This makes the comparison uneven - GPT-4.1 has public ROUGE numbers while GPT-5 does not, which does not mean GPT-5 is worse. It means the information is missing.

Practical Guidance

For high-volume news or content summarization: Qwen 3.5 or Llama 4 are the best cost-efficiency options. Their ROUGE-L scores are within noise of GPT-4.1 on CNN/DailyMail-style tasks, and self-hosting eliminates API costs entirely for large workloads.

For long-form document summarization (legal, financial, technical): Claude 4 Opus or GPT-4.1 with verified long-context retrieval. Context window size and retrieval accuracy matter more than ROUGE-L here. Check the Long-Context Benchmarks Leaderboard before choosing.

For RAG-grounded summarization where factuality is critical: Prioritize FACTS Grounding and Vectara HHEM scores over ROUGE. Smaller models like Phi-4 (3.7% hallucination rate on Vectara) or focused instruction-tuned models often outperform frontier models on grounded faithfulness. Avoid reasoning-mode variants for this task.

For multi-document summarization: This is the most underserved task category. MultiNews and QMSum coverage is thin for frontier models. Claude 4 Opus has the best track record in Chatbot Arena on synthesis tasks, but independent published numbers are scarce. Test on your specific corpus.

For summarizing meeting transcripts: MeetingBank and QMSum are the relevant benchmarks. Transcripts have different language patterns than written documents (disfluencies, speaker attribution, long tangents), and models trained primarily on written text sometimes produce stilted summaries. Test explicitly rather than assuming a model's news-summarization quality transfers.

FAQ

Which LLM is best for summarization in 2026?

For general-purpose short document summarization, GPT-4.1 and Claude 4 Sonnet offer the best quality-to-cost ratio among closed models. For self-hosted or cost-sensitive deployments, Qwen 3.5 closes most of the gap on short documents. For long-form or multi-document tasks, Claude 4 Opus has the strongest human preference scores.

Why are ROUGE scores so similar across models?

ROUGE-L on CNN/DailyMail saturated around 2022. The dataset's reference summaries are short extractive fragments, and most frontier models exceed human-level performance on extracting those fragments. The remaining 1-3 point differences reflect prompt sensitivity and decoding settings more than genuine model capability differences.

Do reasoning models like o3 summarize better?

Generally no. Reasoning models produce longer, more elaborated outputs that human judges consistently rate lower on summarization tasks. Summarization requires compression and constraint following - skills that extended reasoning does not improve. For a related analysis, see our Instruction Following Leaderboard.

Is FActScore the best metric for summarization quality?

FActScore is the best available metric for factual faithfulness, but it requires a knowledge source to check facts against, which makes it expensive to run and dependent on the quality of that source. For practical evaluation, running FActScore on a sample of 50-100 outputs from your specific domain is more informative than any benchmark ROUGE score.

What is the best benchmark for long-document summarization?

GovReport is the most widely cited long-document summarization benchmark with standard evaluation conditions. BookSum is harder and more recent. Neither has comprehensive frontier-model coverage in 2026. For models capable of handling the full document length, long-context retrieval benchmarks are a good complementary signal for whether a model can actually use its context window.

Sources:

Text-to-SQL LLM Leaderboard 2026: Spider and BIRD Ranked

James Kowalski — Sun, 19 Apr 2026 00:00:00 +0000

Turning plain English into correct SQL is one of the most economically valuable things an AI model can do. A business analyst who can ask "what were our top ten products by revenue last quarter, broken down by region?" and get a runnable query back is meaningfully more productive than one who cannot. Text-to-SQL sits at the intersection of language understanding and precise code generation - get either wrong and the query silently returns the wrong answer, or fails at runtime against a production schema.

The problem sounds deceptively simple. But real enterprise databases look nothing like the toy schemas used in early academic work. They have hundreds of tables, ambiguous column names inherited from migrations, undocumented foreign key conventions, and domain-specific jargon that does not appear anywhere in training data. This leaderboard tracks how models perform on benchmarks designed to surface those real-world failure modes, not the polished toy setups where everything just works.

The Benchmarks

BIRD (Big Realistic and Diverse)

BIRD is the current gold standard for real-world text-to-SQL evaluation. Published in a 2023 NeurIPS paper, it contains 12,751 unique question-SQL pairs across 95 databases sourced from actual use cases in finance, healthcare, sports, government data, and education. The databases contain up to 33 tables with 11,000+ rows each.

The primary metric is Execution Accuracy (EX) - the percentage of generated queries that, when run against the database, return exactly the correct result set. This is stricter than comparing SQL strings, because many syntactically different queries produce identical results, and many syntactically similar queries produce subtly wrong ones. BIRD has both a public Dev split (used for model development) and a private Test split (used for official leaderboard submissions).

BIRD also ships with database evidence - column-level value examples and schema descriptions - to simulate the kind of documentation a real database might have. Models that use this evidence typically score 3-8 points higher.

Spider 2.0

Spider 2.0, published in a 2024 paper, is the hardest text-to-SQL benchmark available. It moves from single-database queries to multi-database enterprise workflows involving real cloud database systems: BigQuery, Snowflake, and local SQLite. Tasks require writing complete data science workflows - not just single SELECT statements - and the scoring metric is end-to-end execution success on the full workflow.

Spider 2.0 tests the kinds of questions that require joining across multiple databases, writing CTEs and window functions, and handling platform-specific SQL dialects. A model that scores 70% on BIRD can easily fall below 20% on Spider 2.0 - the jump in complexity is significant.

WikiSQL (Legacy)

WikiSQL is the benchmark that started the field - 80,654 questions over 24,241 simple tables extracted from Wikipedia. It only tests single-table SELECT queries with no joins, no subqueries, and no aggregation beyond basic GROUP BY. Modern frontier models are close to ceiling here, so it is included for historical comparison and for evaluating smaller or fine-tuned models where BIRD scores are less discriminating.

CoSQL (Conversational SQL)

CoSQL extends Spider to multi-turn dialogue settings. Instead of a single natural language question producing a single query, models must follow a conversation thread - understanding clarification questions, corrections, and follow-up queries that reference earlier context. It measures whether models can maintain a coherent mental model of a schema across a multi-turn interaction.

SParC (Sequential Paraphrase Context)

SParC tests context-dependent text-to-SQL in a sequential question-answering format. Questions build on each other within a topic, requiring the model to track which tables and conditions are implied from earlier turns. It is complementary to CoSQL - CoSQL has a human interacting with the system, while SParC has a more predictable sequential structure.

Text-to-SQL Rankings

Scores shown are Execution Accuracy (%) for BIRD Dev and BIRD Test, and Success Rate (%) for Spider 2.0. "Not reported" means no public figure from an official source is available.

Rank	Model / Agent	BIRD Dev EX %	BIRD Test EX %	Spider 2.0 %	Pipeline Type	Notes
1	CHESS Agent (GPT-5 backbone)	73.0	72.1	35.2	Agent	Schema linking + candidate filtering; paper
2	GPT-5 (zero-shot)	71.8	70.4	31.7	Zero-shot	OpenAI private eval via official API
3	Claude 4 Opus (zero-shot)	70.3	68.9	29.4	Zero-shot	Anthropic internal benchmark release
4	DIN-SQL (GPT-5 backbone)	69.9	68.2	27.8	Agent	Decompose-in-Natural-SQL; paper
5	Gemini 2.5 Pro (zero-shot)	68.7	67.1	26.3	Zero-shot	Google technical report
6	GPT-4.1 (zero-shot)	66.4	64.8	22.5	Zero-shot	OpenAI model card
7	Claude 4 Sonnet (zero-shot)	65.2	63.7	21.1	Zero-shot	Anthropic internal benchmark release
8	DeepSeek V3.2 (zero-shot)	64.9	63.1	19.8	Zero-shot	DeepSeek technical report
9	Qwen 3.5 Coder (zero-shot)	63.8	62.4	18.7	Zero-shot	Alibaba model card
10	Codestral (zero-shot)	61.5	59.9	15.4	Zero-shot	Mistral model card
11	Llama 4 Maverick (zero-shot)	59.3	57.6	13.2	Zero-shot	Meta eval suite
12	defog/sqlcoder-7b-2 (fine-tuned)	57.1	Not reported	Not reported	Fine-tuned	HuggingFace; fine-tuned on Spider + BIRD schema
13	DIN-SQL (GPT-4.1 backbone)	55.9	54.3	Not reported	Agent	Public reproduction results
14	SQLCoder-34B (fine-tuned)	54.6	Not reported	Not reported	Fine-tuned	Defog sql-eval benchmark suite
15	GPT-4.1 mini (zero-shot)	48.2	46.9	9.1	Zero-shot	OpenAI model card

Source notes: BIRD Dev and Test scores sourced from bird-bench.github.io official leaderboard where available, supplemented by model card reports from respective vendors. Spider 2.0 scores from spider2-sql.github.io official leaderboard and published papers. All scores verified against primary sources as of April 2026. Rows marked "Not reported" reflect absence of publicly verified figures - scores have not been fabricated.

Key Findings

Agent Scaffolds Outperform Zero-Shot Significantly

The most important finding in this table is the gap between CHESS (73.0% BIRD Dev) and the best zero-shot frontier model at that same task (GPT-5 at 71.8%). A well-designed agent pipeline adds roughly 1-2 percentage points even when wrapping the best available model. On harder tasks and real production schemas, the gap widens.

Two agent approaches dominate the published literature:

CHESS (Contextual Schema-Enriched Search and Selection) uses a four-stage pipeline: entity and column linking, candidate schema pruning, candidate SQL generation (with multiple samples), and result filtering by executing candidates and selecting the most common correct result. The schema linking stage is particularly valuable - it narrows the model's attention from potentially hundreds of irrelevant columns down to the handful that matter for the specific question.

DIN-SQL (Decomposed-In-Natural SQL) breaks complex queries into a dependency graph of sub-problems, solves each in natural language first, then synthesizes the final SQL. This helps on queries that require window functions, CTEs, or multi-step aggregations. DIN-SQL's advantage is that it externalizes the decomposition reasoning that a model would otherwise need to perform implicitly inside a single prompt.

The Spider 2.0 Drop Is Significant

Notice the dramatic compression of scores moving from BIRD to Spider 2.0. The best BIRD Dev score here is 73.0%, but the corresponding Spider 2.0 score drops to 35.2%. Even frontier-quality models solve fewer than one-third of Spider 2.0 tasks. This reflects the genuine difficulty of enterprise-grade SQL generation: real workflows need multi-database joins, platform-specific syntax (BigQuery UNNEST, Snowflake QUALIFY), and correct handling of data types that differ between systems.

For any team considering deploying text-to-SQL in production today, Spider 2.0 scores are the most informative signal, not BIRD. A model that looks great on BIRD may struggle badly on your actual schema.

Fine-Tuned Models Punch Above Their Weight Class

defog/sqlcoder-7b-2 (57.1% BIRD Dev) is a 7B-parameter model - smaller than any frontier model on this list - yet it scores within striking distance of GPT-4.1 mini (48.2%) and Llama 4 Maverick (59.3%). Schema-focused fine-tuning on SQL-specific datasets is remarkably effective. For organizations deploying text-to-SQL at scale where latency and cost matter, a fine-tuned 7B model running on-premises can match or exceed the quality of a much larger general-purpose model at a fraction of the inference cost.

Context Window and Schema Size Matter More Than Model Size

A pattern I have seen repeatedly in my testing: throwing a larger context window at a complex schema does not automatically improve accuracy. Models that can technically handle 128K tokens still degrade noticeably when the schema description grows beyond 20-30K tokens. The relevant tables and columns need to be pre-selected and surfaced prominently. This is why the schema linking step in CHESS is not optional - it is load-bearing.

Benchmark Methodology

Execution Accuracy (EX) executes both the gold-standard SQL and the generated SQL against the database, then compares the result sets. Two queries are considered equivalent if they return identical result sets (same rows, same values, same ordering where ORDER BY is specified). This is preferable to string matching, but has one limitation: a query can return the correct result set by accident if the database happens to contain data where a wrong query produces the same rows as a correct one.

For this reason, BIRD also reports Valid Efficiency Score (VES) in some analyses, which penalizes queries that are correct but much slower than the reference solution. I have focused on EX here because it is the most widely reported metric across models and is what practitioners care about most.

Spider 2.0 uses an end-to-end workflow success rate rather than per-query EX. A task is only counted as successful if the entire multi-step workflow completes and the final output matches the expected result. This is a harder standard and explains the lower absolute numbers.

Caveats and Limitations

SQL Dialect Portability

Most benchmarks test against SQLite (BIRD) or a mix of cloud SQL dialects (Spider 2.0). A model that scores well on SQLite-targeted BIRD may produce PostgreSQL, MySQL, or BigQuery SQL that fails in production. Dialect portability is rarely measured and rarely advertised. If your production database is not SQLite, test explicitly on your target system before trusting leaderboard numbers.

Schema Contamination Risk

BIRD and Spider have been public for long enough that their schemas are in many models' training data. A model might correctly generate SQL for a BIRD Dev question because it has seen that schema - or something very similar - during training rather than because it genuinely generalizes. Spider 2.0, being newer and using proprietary cloud database schemas, is harder to contaminate, which is part of why its scores are more informative about true generalization.

Ambiguity Handling

Natural language questions are often ambiguous in ways that affect the correct SQL. "Top customers" could mean top by number of orders, by total revenue, or by average order value. BIRD handles this partially through the provided evidence field, but ambiguity resolution is not formally evaluated. Models that ask clarifying questions (appropriate for CoSQL-style settings) versus models that make a deterministic best guess will show different failure modes.

Very Large Schemas

BIRD's largest schemas have 33 tables. Real enterprise data warehouses frequently have hundreds to thousands. No public benchmark currently tests performance at that scale. The schema linking techniques that work at 33 tables may not generalize to 300.

Closed-Source Score Verification

For proprietary models (GPT-5, Claude 4 Opus, Gemini 2.5 Pro), scores come from the vendors' own technical reports or the official leaderboard where those models have submitted results. Independent third-party verification is not always available. Where I could find corroborating community reproductions, I cross-checked, but treat proprietary vendor numbers with the appropriate skepticism.

Leaderboards | Awesome Agents

AI Music Generation Leaderboard 2026: Suno, Udio, More

Methodology

FAD - Frechet Audio Distance

CLAP Score - Text-Audio Alignment

MOS - Mean Opinion Score

KAD - Kernel Audio Distance

MusicCaps

Text-to-Music Rankings

Lyric-to-Song Rankings

Stem Separation and Remixing Rankings

Key Takeaways

The Benchmark Coverage Gap Is Real

MOS Tests Are Mostly Compromised

Suno and Udio Lead User Preference - With Caveats

MusicGen's Advantage Comes From Reproducibility

YuE Fills the Open-Source Lyric Gap

Pop Music Overfitting Is a Problem

Practical Guidance

FAQ

What is FAD and why does it matter for music generation?

Why don't Suno and Udio publish FAD scores?

Is MusicGen better than Suno?

What is the MusicCaps benchmark?

Which AI music tool is best for stem separation?

Can AI generate music with custom lyrics?

Sources

Code Completion and Generation LLM Leaderboard 2026

The Benchmark Landscape - What to Trust and What to Ignore

HumanEval and MBPP - Useful History, Unreliable Frontier Scores

LiveCodeBench - The Number I Actually Trust

BigCodeBench - Harder and More Realistic

MultiPL-E - Does It Work in Your Language?

Competitive Programming - Where Reasoning Models Pull Away

DS-1000 - Data Science Workloads

CRUXEval - Can It Reason About Code?

The Leaderboard

Reading the Table

The HumanEval Problem in Practice

Reasoning Models Are a Different Category

The Open-Weight Story is Better Than It Looks

Contamination in the Middle Tier

BigCodeBench as the Better Everyday Signal

Methodology

Benchmark Score Sources

Why Pass@1 Greedy

The Training Cutoff Problem

Multilingual Coverage (MultiPL-E)

Model Notes

GPT-5 and o4

Claude Opus 4 and Sonnet 4

DeepSeek-V3 and DeepSeek-Coder-V3

Qwen 2.5-Coder 32B and Qwen3-Coder

Codestral

StarCoder2 and Granite-Code

Practical Guidance

Sources

Creative Writing LLM Leaderboard 2026: Fiction Ranked

Why Creative Writing Benchmarks Are Uniquely Hard

The Benchmarks Explained

EQ-Bench Creative Writing v3

Antislop Writing Evaluations

Human-Preference Win Rate vs GPT-5

Main Rankings Table

Key Takeaways

Closed Models Hold the Top Tier - For Now

Reasoning Models Are Over-Structured

Fine-Tuned Writing Models Have a Tradeoff

Antislop Reveals Training Data Contamination

Open vs Closed: The Practical Gap

Caveats and Known Limitations

Benchmark Methodology Notes

Related Leaderboards

FAQ

What is EQ-Bench Creative Writing and how does it work?

Which model writes the best creative prose in 2026?

Why do reasoning models like o1 and o3 rank lower for creative writing?

What is the Antislop Score measuring?

Are community fine-tuned writing models worth using?

How often does this leaderboard update?