Item: Claude Sonnet 5
Author: Elena Marchetti

Anthropic released Claude Sonnet 5 on June 30, 2026, and made it the default model for every Free and Pro user on claude.ai from day one. That's a statement of intent. Anthropic isn't positioning this as a cheaper compromise - they're saying Sonnet 5 is good enough to be the first thing most people reach for. After spending time with the benchmarks, the system card, and the pattern of what moved and what didn't, I think they're mostly right, with a few honest caveats worth knowing before you switch.

TL;DR

8.5/10 - the best agentic mid-tier model right now, with real FrontierCode and Terminal-Bench gains that matter for developers
Key strength: FrontierCode v1 jumped from 15.1% to 38.8% - a 2.6x improvement in one generation on tasks derived from real open-source PRs
Key weakness: Verbose by default (roughly 1/3 more output than GPT-5.5) and a tokenizer update that silently increases token counts by up to 35%
Use it if you run coding agents, autonomous browser tasks, or professional workflow automation at scale. Stick with Claude Opus 4.8 if you need frontier-grade mathematical reasoning or offensive security research

What Actually Changed from Sonnet 4.6

Anthropic's last Sonnet, 4.6, made a name for itself by matching Opus on office productivity tasks. The bet for Sonnet 5 was extending that pattern into harder territory - specifically agentic coding, long-horizon browser research, and computer use. The results show that bet mostly paid off.

The benchmark that catches my eye is FrontierCode v1: 38.8% versus 15.1% for Sonnet 4.6. FrontierCode, built by Cognition, derives tasks from real pull requests in actively maintained open-source repositories, with no human intervention allowed and only binary pass/fail scoring. A 2.6x gain on a benchmark like that isn't noise. It reflects something truly different about how the model handles unfamiliar codebases and multi-file reasoning under pressure.

Terminal-Bench 2.1 moved from 67.0% to 80.4% - a 13-point gain that puts Sonnet 5 ahead of Claude Opus 4.8 on terminal-based workflows (Opus 4.8 scores 74.6%). That inversion is striking. On SWE-bench Verified, the most widely cited coding benchmark, Sonnet 5 reaches 85.2% versus Sonnet 4.6's 79.6%. On the SWE-bench Pro variant, which draws harder problems from actively maintained repos to reduce leakage, Sonnet 5 scores 63.2% - beating GPT-5.5's 58.6%.
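
To keep "points" and "multiples" straight, here is a minimal sketch that recomputes both from the Sonnet 4.6 and Sonnet 5 scores quoted above. The dictionary and the helper name `gains` are illustrative only; the numbers are the ones cited in this review.

```python
# Scores quoted in this review, in percent: (Sonnet 4.6, Sonnet 5)
SCORES = {
    "FrontierCode v1": (15.1, 38.8),
    "Terminal-Bench 2.1": (67.0, 80.4),
    "SWE-bench Verified": (79.6, 85.2),
}


def gains(old: float, new: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative multiple)."""
    return round(new - old, 1), round(new / old, 2)


for name, (old, new) in SCORES.items():
    points, multiple = gains(old, new)
    print(f"{name}: +{points} points, {multiple}x the previous score")
```

The relative multiple flatters benchmarks with a low starting score, which is why FrontierCode looks so dramatic next to SWE-bench Verified; the absolute point gain is the fairer comparison across rows.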


Agentic Search and Computer Use

BrowseComp - OpenAI's benchmark for hard-to-find research via autonomous web browsing - shows Sonnet 5 at 84.7% single-agent, up from 76.2% on Sonnet 4.6 and effectively tied with GPT-5.5 (84.4%). On OSWorld-Verified, which measures autonomous desktop computer use across structured tasks, the score is 81.2%, ahead of Sonnet 4.6 (78.5%) and GPT-5.5 (78.7%).

On the computer use leaderboard, these two results together put Sonnet 5 in a different category from previous mid-tier models. Earlier Sonnet versions were credible for supervised computer use with human checkpoints. Sonnet 5 starts to look plausible for production autonomous workflows - not Opus-grade, but close enough that the cost savings justify the tradeoff in most commercial deployments.

The system card highlights improved prompt injection resistance as part of this. When a model is autonomously browsing and executing tasks, its ability to recognize and ignore injected instructions in web content matters significantly. Anthropic reports that Sonnet 5 beats Sonnet 4.6 on this dimension, which is the right thing to focus on as agentic deployments scale.

The coding benchmarks leaderboard tells a consistent story: Sonnet 5 is now truly competitive in the tier below Opus, not just a cheaper fallback.

Real-World Developer Experience

Early access reports from developers are consistent on a few points. The model finishes tasks where previous Sonnet versions stalled. One engineer described testing it on a legacy codebase bug: Sonnet 5 wrote a reproducing test, applied the fix, then stashed the change to verify the regression - all in a single pass, and without being asked to verify its own work.

Sonnet 5 is also available in GitHub Copilot for Pro, Pro+, Max, Business, and Enterprise users, and performs well on CLI-style tasks in that context. For Claude Code users, it defaults to high reasoning effort on the API, which is a reasonable default for most coding tasks. Dropping to medium or low effort is worth doing for simpler tasks where cost matters.

The weaknesses that come through clearly: Sonnet 5 is verbose. Testing by The Human Co found it averaging roughly a third more output tokens than GPT-5.5 and about double Gemini 3.1 Pro's response length on comparable tasks, even when explicitly asked to be concise. For interactive use, this is mildly annoying. For high-volume API calls, it's a real cost factor.

On knowledge work, the GDPval-AA v2 Elo score of 1,609 leads the benchmark table, ahead of GPT-5.5 (1,492) and Sonnet 4.6 (1,381). HealthBench Professional reaches 57.8%, up from 44.2% on Sonnet 4.6. Legal Agent Benchmark scores 8.9 on the public set. These are specialized, but the consistency of gains across professional domains suggests the underlying capability improvements aren't limited to software engineering.

Where Sonnet 5 clearly trails is mathematical reasoning. USAMO 2026 - olympiad-level proof problems judged by frontier models - scores 79.5% for Sonnet 5 versus 96.7% for Opus 4.8. That's a 17-point gap, and it's not closing at a pace that makes Sonnet 5 a substitute for Opus on hard scientific or mathematical work. If your workflows involve deep formal reasoning, Opus remains the right tier.

Pricing and What the Tokenizer Change Means

The headline pricing is $2 per million input tokens and $10 per million output tokens through August 31, 2026, reverting to $3/$15 from September onward. Compared to Opus 4.8 at $5/$25, Sonnet 5 costs 40% less at standard rates and 60% less during the intro period.
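
The discount figures above are easy to check. This is a small sketch using only the per-million-token prices quoted in this review; the tier labels and the helper name `pct_cheaper` are made up for illustration.

```python
# USD per million tokens, as quoted in this review: (input, output)
PRICES = {
    "Sonnet 5 intro": (2.00, 10.00),
    "Sonnet 5 standard": (3.00, 15.00),
    "Opus 4.8": (5.00, 25.00),
}


def pct_cheaper(tier: str, baseline: str = "Opus 4.8") -> tuple[float, float]:
    """Percent saved on (input, output) price relative to the baseline tier."""
    tier_in, tier_out = PRICES[tier]
    base_in, base_out = PRICES[baseline]
    return (
        round((base_in - tier_in) / base_in * 100, 1),
        round((base_out - tier_out) / base_out * 100, 1),
    )


for tier in ("Sonnet 5 standard", "Sonnet 5 intro"):
    saved_in, saved_out = pct_cheaper(tier)
    print(f"{tier}: {saved_in}% less on input, {saved_out}% less on output")
```

Because input and output prices scale by the same ratio in each tier, the saving is the same on both sides: 40% at standard rates and 60% during the intro period.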

That framing, though, doesn't account for the tokenizer update. Anthropic notes in the system card that the updated tokenizer maps the same input text to roughly 1.0 to 1.35 times as many tokens as Sonnet 4.6's tokenizer. On inputs that hit the upper end of that range, you're billed for 35% more tokens for the same text, before the per-token price even enters the calculation. For teams with established token budgets built around Sonnet 4.6, the effective price increase is worth modeling carefully before assuming the intro pricing represents a straight discount.
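
One way to model this is to restate the list price per million tokens as counted by the old tokenizer. The sketch below does only that arithmetic: the 1.0 to 1.35 range and the input prices come from this review, while the function name and the "Sonnet 4.6-equivalent token" framing are my own assumptions, not anything Anthropic publishes.

```python
# Range reported in the review: the new tokenizer yields 1.0x to 1.35x
# as many tokens as Sonnet 4.6's tokenizer for the same text.
MULTIPLIER_RANGE = (1.0, 1.35)

# Sonnet 5 input prices in USD per million tokens, as quoted in the review.
INPUT_PRICE_PER_M = {"intro": 2.00, "standard": 3.00}


def effective_price(price_per_m: float, token_multiplier: float) -> float:
    """Price per million tokens as the old (Sonnet 4.6) tokenizer would count them."""
    return round(price_per_m * token_multiplier, 2)


for tier, price in INPUT_PRICE_PER_M.items():
    low, high = (effective_price(price, m) for m in MULTIPLIER_RANGE)
    print(f"{tier}: ${low:.2f} to ${high:.2f} per million Sonnet 4.6-equivalent input tokens")
```

Where a given workload lands in that range depends on its text, so the practical step is to re-tokenize a representative sample of real prompts rather than plan around either endpoint.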

The 90% savings from prompt caching remain available and are especially valuable for agentic workloads where large system prompts repeat across many calls. Batch API pricing at 50% off standard is also available for asynchronous processing.

Tier	Input	Output
Intro pricing (through Aug 31, 2026)	$2.00/M	$10.00/M
Standard pricing (from Sep 1, 2026)	$3.00/M	$15.00/M
Batch API (50% off standard)	$1.50/M	$7.50/M
Opus 4.8 standard	$5.00/M	$25.00/M
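
For budgeting, the rates in the table can be folded into a rough estimator. This is a sketch under stated assumptions, not Anthropic's billing logic: it reads "90% savings" as cached input being billed at 10% of the input price, ignores any cost for writing to the cache, and assumes the batch discount applies to the whole bill and stacks with caching. All names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    """USD per million tokens."""
    input_per_m: float
    output_per_m: float


INTRO = Rate(2.00, 10.00)     # through Aug 31, 2026
STANDARD = Rate(3.00, 15.00)  # from Sep 1, 2026

CACHE_READ_FACTOR = 0.10  # assumption: "90% savings" = cached input billed at 10%
BATCH_FACTOR = 0.50       # 50% off for batch processing


def estimate_cost(
    rate: Rate,
    input_tokens: int,
    output_tokens: int,
    cached_fraction: float = 0.0,
    batch: bool = False,
) -> float:
    """Rough USD cost for a workload; cache-write charges are not modeled."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    input_cost = (fresh + cached * CACHE_READ_FACTOR) * rate.input_per_m / 1_000_000
    output_cost = output_tokens * rate.output_per_m / 1_000_000
    total = input_cost + output_cost
    # Assumption: the batch discount stacks with caching and covers the whole bill.
    return total * BATCH_FACTOR if batch else total


# 10M input tokens and 1M output tokens at standard rates
print(estimate_cost(STANDARD, 10_000_000, 1_000_000))
print(estimate_cost(STANDARD, 10_000_000, 1_000_000, cached_fraction=0.8))
print(estimate_cost(STANDARD, 10_000_000, 1_000_000, batch=True))
```

Note that caching only touches the input side. For a verbose model, output tokens can dominate the bill, so trimming response length may save more than raising the cache hit rate.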

Strengths and Weaknesses

Strengths:

FrontierCode v1 jumped from 15.1% to 38.8% - the most significant gain in the benchmark set
Terminal-Bench 2.1 at 80.4% beats Opus 4.8's 74.6% on terminal-based agentic tasks
SWE-bench Verified at 85.2%, the highest score for any Sonnet-class model
BrowseComp 84.7% single-agent - effectively tied with GPT-5.5 on autonomous research
GDPval-AA v2 Elo of 1,609 leads the professional knowledge work benchmark table
Improved prompt injection resistance over Sonnet 4.6 (critical for production agentic use)
1M context window with adaptive thinking across five effort levels

Weaknesses:

Verbose by default - averages roughly 1/3 more output than GPT-5.5 on comparable tasks
Tokenizer update silently increases token counts by up to 35% vs Sonnet 4.6
USAMO 2026 at 79.5% trails Opus 4.8 (96.7%) by 17 points - not a substitute for hard math
Cybersecurity capabilities intentionally reduced vs Opus 4.7/4.8
Standard pricing reverts September 1 - the intro period advantage is time-limited
At max reasoning effort (xhigh), costs can exceed Opus 4.8 for comparable quality

Verdict

Sonnet 5 is the clearest upgrade in the Sonnet line since Anthropic started releasing tiered models. The FrontierCode and Terminal-Bench jumps are large enough to make a difference on real workflows, not just benchmark sheets. Developers who pushed against the ceiling of Sonnet 4.6 on autonomous coding and browser research will find Sonnet 5 clears those cases more reliably.

The verbosity and tokenizer issues are real and worth tracking, especially for high-volume deployments. Anthropic is also transparent about the safety tradeoffs: cybersecurity capabilities are intentionally reduced versus the Opus tier, and the system card - at 145 pages - is notably more focused on agentic behavior evaluation than benchmark tables.

Score: 8.5/10. The right model for most developer workflows in July 2026. Use Opus 4.8 for deep math, formal reasoning, or security research. Use Sonnet 5 for everything else that previously pushed you toward the Opus tier on grounds of capability rather than preference.

Claude Sonnet 5 Review: Near-Opus at Half the Price