Two months since our April rankings, the frontier has moved faster than any comparable period we've tracked. Four major new models shipped. The open-weight tier posted scores that were frontier-exclusive six months ago. And the model that set new records across every major coding and reasoning benchmark is currently unavailable to every user on the planet.

Claude Fable 5 launched on June 9, 2026. Three days later, US Commerce Secretary Howard Lutnick issued an export control directive ordering Anthropic to disable it globally - the first such order ever issued against a commercial LLM. As of June 15, Fable 5 remains offline everywhere. We've left it at #1 because the benchmark numbers are real. We've also flagged every row clearly, because the practical question of what you can actually use today has a different answer.

TL;DR

Benchmark #1 is Claude Fable 5: 95% SWE-bench Verified, ~95% GPQA Diamond - and completely unavailable as of June 15 due to a US export ban
Effective #1 for available models: Claude Opus 4.8 ($5/$25), which leads the Artificial Analysis Intelligence Index at 61.4 and posts 88.6% SWE-bench Verified
Biggest open-weight story: Kimi K2.6 hits 80.2% on SWE-bench Verified under an Apache-style license, within half a point of Gemini 3.1 Pro (80.6%) on coding, while MiniMax M3 posts 92.7% on GPQA Diamond

What We Measure and Why

These rankings combine three benchmarks because no single metric captures the full picture:

GPQA Diamond tests graduate-level reasoning across physics, chemistry, and biology. Questions are intentionally designed to stump Google searches and to require actual understanding rather than pattern recall. Human experts in the relevant field score around 69%, so models above 90% are operating more than 20 points past that expert baseline.

SWE-bench Verified measures whether a model can resolve real GitHub issues in popular Python repositories, end-to-end without scaffolding assistance. Independent evaluators vetted every sample. It's one of the few benchmarks that has held up reasonably well against contamination concerns because the tasks are drawn from real codebases with real commits.

Chatbot Arena Elo (from LMArena) is a human-preference score derived from millions of blind pairwise comparisons. It's noisy and reflects what people subjectively prefer, which isn't always what's most capable - but it captures something the capability benchmarks miss: whether the model is actually useful in practice.

Where a score is marked N/P (Not Published), the provider hasn't released an independent evaluation. Vendor-reported numbers are marked with an asterisk.

The June 2026 Overall Rankings

Rank	Model	Provider	GPQA Diamond	SWE-bench Verified	Arena Elo	Price (In/Out per 1M)	Open?
1**	Claude Fable 5	Anthropic	~95%	95.0%	1510	$10 / $50	No
2	Claude Opus 4.8	Anthropic	93.6%	88.6%	1512	$5 / $25	No
3	GPT-5.5	OpenAI	93.6%	N/P	1506	$5 / $30	No
4	Gemini 3.1 Pro	Google	94.3%	80.6%	1505	$2 / $12	No
5	MiniMax M3	MiniMax	92.7%	N/P	TBD	N/P	Yes
6	Kimi K2.6	Moonshot AI	90.5%	80.2%	TBD	N/P	Yes
7	Grok 4.3	xAI	90.1%	N/P	TBD	$1.25 / $2.50	No
8	Qwen 3.7 Max	Alibaba	TBD	TBD	~1475	Free (Apache 2.0)	Yes
9	DeepSeek V4	DeepSeek	72.8%*	80.6%*	TBD	$0.27 / $0.87	Yes
10	Gemini 3.5 Flash	Google	TBD	TBD	TBD	$1.50 / $9.00	No
11	Llama 4 Maverick	Meta	87.6%	~24%	TBD	Free (Llama)	Yes

**Claude Fable 5 is currently offline globally due to a US Commerce Department export control directive issued June 12, 2026. Benchmark numbers are published; the model is not accessible.

*DeepSeek V4 scores are vendor-reported and not yet independently reproduced. The SWE-bench Verified figure is for the V4-Pro variant. All scores without an asterisk are from third-party evaluations.

N/P = Not Published by provider. TBD = evaluation pending at time of writing. Arena Elo from LMArena as of June 2026 per Swfte composite. Pricing reflects standard API rates.
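For readers who want to load the table into their own analysis, the cell conventions described above (N/P, TBD, a trailing asterisk for vendor-reported figures, a leading tilde for approximate ones) can be parsed mechanically. The following is a minimal Python sketch; the `Score` type and `parse_score` function are hypothetical names introduced here for illustration, not part of any published tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Score:
    value: Optional[float]         # percentage, or None when no number exists
    status: str                    # "reported", "not_published", or "pending"
    approximate: bool = False      # cell started with "~"
    vendor_reported: bool = False  # cell ended with "*"


def parse_score(cell: str) -> Score:
    """Parse one benchmark cell from the rankings table (hypothetical helper)."""
    text = cell.strip()
    if text == "N/P":
        return Score(None, "not_published")
    if text == "TBD":
        return Score(None, "pending")
    vendor = text.endswith("*")
    text = text.rstrip("*")
    approx = text.startswith("~")
    text = text.lstrip("~").rstrip("%")
    return Score(float(text), "reported", approx, vendor)


for cell in ["~95%", "72.8%*", "N/P", "TBD"]:
    print(cell, "->", parse_score(cell))
```

Keeping "not published" and "pending" as separate statuses, rather than collapsing both to a missing value, preserves the distinction the table draws between a provider withholding a number and an evaluation that simply hasn't finished.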

Key Takeaways

Fable 5 rewrote the record books in three days

Anthropic's June 9 launch was the biggest single-model performance jump we've seen on SWE-bench Verified - 95.0%, a 6.4-point leap over Opus 4.8 and a 14.4-point gap over the best publicly available non-Anthropic model. On the Swfte composite, Fable 5 scored 100/100, the first time any model has topped that index.
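The gaps quoted above are plain subtraction on the table's SWE-bench Verified column. The sketch below reproduces them and adds one derived view that is our own framing rather than a published metric: the relative drop in unresolved tasks. It assumes the 80.6% figure (Gemini 3.1 Pro in the table) is the non-Anthropic reference point; the function names are hypothetical.

```python
# SWE-bench Verified scores (percent) taken from the rankings table.
swe_bench_verified = {
    "Claude Fable 5": 95.0,
    "Claude Opus 4.8": 88.6,
    "Gemini 3.1 Pro": 80.6,
}


def point_gap(leader: str, other: str) -> float:
    """Difference in percentage points, rounded to one decimal."""
    return round(swe_bench_verified[leader] - swe_bench_verified[other], 1)


def unresolved_reduction(leader: str, other: str) -> float:
    """Relative drop in unresolved tasks when moving from `other` to `leader`."""
    before = 100.0 - swe_bench_verified[other]
    after = 100.0 - swe_bench_verified[leader]
    return round((before - after) / before, 3)


print(point_gap("Claude Fable 5", "Claude Opus 4.8"))  # 6.4
print(point_gap("Claude Fable 5", "Gemini 3.1 Pro"))   # 14.4
print(unresolved_reduction("Claude Fable 5", "Claude Opus 4.8"))
```

Read this way, the 6.4-point gap means the share of unresolved issues falls from 11.4% to 5.0%, a reduction of a little over half.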

The GPQA Diamond number (~95%) needs a caveat: Anthropic's announcement stressed third-party evaluator results over a standard GPQA run, so the figure is sourced from Vellum's post-launch breakdown rather than Anthropic's own documentation. We treat it as credible but preliminary until a canonical source confirms it.

Fable 5 was also positioned as something different from prior Claude releases. The launch announcement called for a global AI safety pause while releasing the very model it described as a potential concern. The export ban that followed three days later suspended both Fable 5 and Claude Mythos 5 - the research-tier model that Fable 5 is derived from.

The best LLM on earth right now has zero users. That's new territory for a rankings table.

Claude Opus 4.8 is the effective frontier

Shipped May 28, Opus 4.8 is the model Anthropic users fall back to after the Fable 5 suspension. At 88.6% on SWE-bench Verified and 93.6% on GPQA Diamond, it ties GPT-5.5 on science reasoning and sits above it on coding, though the coding comparison rests on SWE-bench Pro because OpenAI hasn't published a SWE-bench Verified score for GPT-5.5. Pricing stays at the $5/$25 input/output rate that Anthropic has held since Opus 4.6.

The Artificial Analysis Intelligence Index puts it at 61.4 as of early June, ahead of GPT-5.5 (60.2) and Gemini 3.1 Pro (57). The SWE-bench Pro leaderboard tells a similar story: 69.2% for Opus 4.8 versus 58.6% for GPT-5.5 and 54.2% for Gemini 3.1 Pro.

One wrinkle: Opus 4.8's Arena Elo (1512 on the Swfte composite) technically edges Fable 5's text Arena Elo (1510). This likely reflects Fable 5's limited exposure before the shutdown - the Arena estimates Elo from pairwise votes, and three days of availability doesn't yield enough of them for a stable estimate.
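To put a 2-point Elo gap in perspective, the sketch below uses the standard Elo expectation formula and a simple binomial standard error. It is an illustration under stated assumptions, not LMArena's actual rating procedure: the function names are hypothetical, ties are ignored, and the vote counts are invented purely to show how the noise shrinks with volume.

```python
import math


def expected_win_rate(elo_a: float, elo_b: float) -> float:
    """Standard Elo expectation that A is preferred over B (ties ignored)."""
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))


def win_rate_std_error(p: float, n_votes: int) -> float:
    """Binomial standard error of an observed win rate after n_votes votes."""
    return math.sqrt(p * (1.0 - p) / n_votes)


p = expected_win_rate(1512, 1510)  # Opus 4.8 vs Fable 5, scores from the table
print(f"expected win rate for a 2-point lead: {p:.4f}")

# Hypothetical vote counts, chosen only to illustrate the scaling.
for n in (500, 5_000, 50_000):
    print(n, f"+/- {win_rate_std_error(p, n):.4f}")
```

A 2-point lead corresponds to an expected head-to-head preference rate of roughly 50.3%, which is smaller than the sampling noise on a few hundred or even a few thousand votes. That is why the ordering of the top two rows on Arena Elo shouldn't be read as meaningful.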

Open-weight models breach the frontier tier

The April rankings described the open-weight tier as "closing fast." June 2026 is the month it arrived. Kimi K2.6 from Moonshot AI posts 80.2% on SWE-bench Verified - within half a point of Gemini 3.1 Pro's independently verified 80.6%. MiniMax M3 posts 92.7% on GPQA Diamond, higher than Grok 4.3's 90.1% and within one point of Opus 4.8's 93.6%.

Neither provider has published a full SWE-bench run, which is the gap that still separates them from the closed-source leaders on our ranking. But on what's published, both models are well inside what we'd have called "frontier" in February 2026. Both are distributable under permissive licenses.

Figure: Benchmark comparison (GPQA Diamond scores) across the top LLMs as of June 2026. Source: benchlm.ai

Qwen 3.7 Max is harder to place - Alibaba released it with a reasoning-agent framing and a 1M-token context window, but hasn't published GPQA or SWE-bench numbers. Its Arena Elo of approximately 1475 on text tasks is strong for an open-weight model, putting it above Llama 4 Maverick in human preference even though Maverick has a published GPQA Diamond score (87.6%) and Qwen 3.7 Max has none.

The Llama 4 Maverick situation deserves a direct note. Meta published an 87.6% GPQA Diamond score, which puts it within 2.5 points of Grok 4.3 (90.1%) on that metric alone. But its SWE-bench Verified score is about 24% - dramatically below every other model in this table with a published score. The GPQA/SWE-bench gap suggests Maverick optimized for science reasoning at the expense of code generation. For pure reasoning tasks it's competitive; for software engineering workflows it's not in the same tier.

The pricing floor dropped again

Grok 4.3 at $1.25/$2.50 per million tokens is the cheapest frontier-adjacent model we've tracked. Gemini 3.5 Flash at $1.50/$9.00 is positioned below it on reasoning but produces output at 166 tokens per second - roughly 4x faster than other frontier models - which sets it apart for latency-sensitive applications. GPT-5.5's batch pricing mode drops to $2.50/$15 for high-volume use, halving the standard rate.

The gap between best-available and cheapest-available is still significant: Opus 4.8 at $5/$25 costs 4x Grok 4.3 on input tokens and 10x on output tokens. The practical question is whether that premium buys enough capability for your use case.
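Per-token multiples only become concrete once they're applied to a workload. The sketch below prices a hypothetical monthly volume at the standard API rates from the table; the `monthly_cost` helper and the 200M-input / 40M-output workload are assumptions made up for illustration, so substitute your own token counts.

```python
# Standard API prices in USD per 1M tokens (input, output), from the table.
PRICES = {
    "Claude Opus 4.8": (5.00, 25.00),
    "Gemini 3.1 Pro": (2.00, 12.00),
    "Grok 4.3": (1.25, 2.50),
}


def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a given token volume at standard rates."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical workload: 200M input tokens and 40M output tokens per month.
workload = (200_000_000, 40_000_000)
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, *workload):,.2f}")
```

For this particular mix the totals come to $2,000 for Opus 4.8, $880 for Gemini 3.1 Pro, and $350 for Grok 4.3. The effective multiple between two models depends on the input/output ratio, so output-heavy workloads widen the gap and input-heavy ones narrow it.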

Practical Guidance

For agentic coding and software engineering

Claude Opus 4.8 is the correct choice at the $5/$25 tier. It leads on SWE-bench Pro (69.2%), has the Dynamic Parallel Workflows research preview for multi-agent orchestration, and its Arena Coding Elo (from BenchLM) ranks it first among currently available models. If you're self-hosting an open-weight model, Kimi K2.6 at 80.2% SWE-bench Verified is the strongest available option.

For science, research, and reasoning

Gemini 3.1 Pro holds the highest published GPQA Diamond score of any currently available model at 94.3%. At $2/$12 it's also the most cost-efficient frontier model on reasoning tasks. Opus 4.8 is its closest competitor at 93.6%. The practical difference between 94.3% and 93.6% GPQA Diamond is small; the price difference (2.5x on input tokens, roughly 2.1x on output tokens) isn't.

For high-volume or latency-sensitive work

Gemini 3.5 Flash at $1.50/$9.00 with 166 tokens/second output is built for this use case. On Google's published agentic benchmarks (MCP Atlas, Toolathlon), it matches or leads every frontier model including Opus 4.7. The reasoning ceiling is lower, but for structured tasks where speed and cost matter more than PhD-level science reasoning, it's a strong pick.

Grok 4.3 at $1.25/$2.50 is the cheapest frontier-adjacent option. xAI hasn't published SWE-bench Verified numbers, which makes the coding capability harder to confirm, but its 90.1% GPQA Diamond is independently cited and its 1M token context window is one of the largest available.

For open-weight deployment

Kimi K2.6 and MiniMax M3 are the top choices for self-hosted deployments. On coding, Kimi leads with the 80.2% SWE-bench Verified score. On science reasoning, MiniMax M3 edges Kimi with 92.7% GPQA Diamond. Both are distributable under permissive licenses. Qwen 3.7 Max under Apache 2.0 is worth watching - its Arena Elo shows strong human preference even without published GPQA benchmarks.

FAQ

What is the best LLM available right now in June 2026?

Claude Opus 4.8 leads among available models. Fable 5 has higher benchmarks (95% SWE-bench Verified vs 88.6%) but is offline globally due to a US export order. Gemini 3.1 Pro is the best-value frontier option at $2/$12.

Why is Claude Fable 5 unavailable?

On June 12, 2026, US Commerce Secretary Howard Lutnick issued an export control directive ordering Anthropic to disable Fable 5 and Claude Mythos 5 globally. Because Anthropic can't verify caller nationality at API query time, it shut both models down for all users. No restoration date has been announced.

Which model has the best GPQA Diamond score?

Gemini 3.1 Pro holds the highest published score for an available model at 94.3%. Claude Fable 5's score is approximately 95% per third-party breakdowns, but that model is currently offline. Claude Opus 4.8 and GPT-5.5 both sit at 93.6%.

What is the best open-source LLM for coding right now?

Kimi K2.6 from Moonshot AI posts 80.2% on SWE-bench Verified under a permissive license - the highest independently verified score in the open-weight tier. MiniMax M3 is ahead on GPQA Diamond (92.7% versus Kimi's 90.5%) but hasn't published SWE-bench numbers.

Has GPT-5.5 published SWE-bench Verified results?

No. OpenAI has published SWE-bench Pro (58.6%) and GPQA Diamond (93.6%) for GPT-5.5 but not SWE-bench Verified. That gap makes direct comparison with Claude models harder and is one reason GPT-5.5 ranks #3 despite matching Opus 4.8 on GPQA.

How often do these rankings update?

We publish a new overall rankings article monthly. Individual model scores get updated when third-party evaluations or major provider benchmark releases change a model's standing materially. The lastVerified field in the front matter reflects when we last confirmed the scores above.

Sources: