Claude Sonnet 4.6: Mid-Tier Model, Flagship Results
Anthropic's mid-tier model matches Opus 4.6 on computer use, leads all models on office productivity tasks, and costs one-fifth as much as the flagship at $3/$15 per million tokens.

Overview
Claude Sonnet 4.6, released February 17, 2026, occupies an unusual position: a mid-tier model that beats its own flagship on the benchmark most enterprise teams care about. On GDPval-AA, the office productivity leaderboard, Sonnet 4.6 scores 1,633 Elo - 27 points ahead of Claude Opus 4.6. On OSWorld, which measures autonomous computer use, it scores 72.5% against Opus's 72.7%. The first is a clear lead; the second is effective parity. The gap between the two tiers is collapsing, and Sonnet 4.6 is the proof.
TL;DR
- Best-in-class on office tasks (GDPval-AA Elo 1,633) and near-parity on computer use (72.5% OSWorld) at $3/$15 per million tokens
- 1M token context window now GA; 64K max output tokens; model ID claude-sonnet-4-6
- Five times cheaper than Opus 4.6, within 4 points on SWE-bench Verified, and the default model for Free and Pro users on claude.ai
Anthropic priced Sonnet 4.6 identically to its predecessor - $3 per million input tokens, $15 per million output tokens - while shipping major improvements: a 1M token context window (beta at launch, GA on March 13, 2026), adaptive thinking with adjustable effort levels, native tool calling and code execution, and context compaction for long-running sessions. The default context is 200K tokens; the full million requires an explicit API flag.
The model sits below Opus 4.6 on pure reasoning benchmarks (GPQA Diamond: 74.1% vs 91.3%) and on deep scientific tasks. But for the daily work of building and maintaining software, automating workflows, and processing large document sets, the benchmark data suggests Sonnet 4.6 is the better default choice - and Opus is the surgical tool for tasks that specifically demand frontier reasoning depth.
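The routing logic implied above can be sketched as a simple dispatch function. The model IDs come from this article; the task categories and the set of "deep reasoning" tasks are illustrative assumptions, not an official taxonomy:

```python
# Illustrative model router based on the benchmark gaps discussed above.
# Task category names are assumptions made for this sketch.

def pick_model(task: str) -> str:
    """Route deep scientific reasoning to Opus; default everything else to Sonnet."""
    deep_reasoning_tasks = {"phd_science", "novel_research", "long_proof"}
    if task in deep_reasoning_tasks:
        return "claude-opus-4-6"    # leads GPQA Diamond (91.3% vs 74.1%)
    return "claude-sonnet-4-6"      # better default for coding, office, computer use

print(pick_model("code_review"))    # claude-sonnet-4-6
print(pick_model("phd_science"))    # claude-opus-4-6
```

In practice the routing signal would come from a classifier or an explicit user choice; the point is that Opus becomes the exception path, not the default.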
Key Specifications
| Specification | Details |
|---|---|
| Provider | Anthropic |
| Model Family | Claude |
| Parameters | Not disclosed |
| Context Window | 1,000,000 tokens (GA March 13, 2026); default 200K; 64K max output |
| Input Price | $3.00/M tokens |
| Output Price | $15.00/M tokens |
| Release Date | February 17, 2026 |
| License | Proprietary (API, claude.ai, cloud platforms) |
| Model ID | claude-sonnet-4-6 |
| Input Modalities | Text, images |
| Adaptive Thinking | Yes (effort: low/medium/high/max; medium recommended for most tasks) |
Benchmark Performance
| Benchmark | Sonnet 4.6 | Sonnet 4.5 | Opus 4.6 | GPT-5.2 |
|---|---|---|---|---|
| SWE-bench Verified (coding) | 79.6% | 70.3% | 83.8% | 80.0% |
| OSWorld (computer use) | 72.5% | 42.0% | 72.7% | - |
| ARC-AGI-2 (novel reasoning) | 58.3% | 24.4% | 68.8% | 54.2% |
| GDPval-AA Elo (office tasks) | 1,633 | - | 1,606 | 1,462 |
| GPQA Diamond (PhD science) | 74.1% | - | 91.3% | 93.2% |
| Terminal-bench (agentic coding) | 52.5% | 40.5% | 56.7% | 64.7% |
Three numbers stand out. The OSWorld jump from 42.0% to 72.5% in a single generation - a 30.5-point gain - reflects a real architectural improvement in how the model plans multi-step interface interactions, not benchmark noise. The ARC-AGI-2 score of 58.3% is more than double Sonnet 4.5's 24.4% - a 33.9-point gain that Anthropic internally describes as the largest single-generation improvement it has published on any benchmark. And the GDPval-AA result - where Sonnet outscores Opus on the office productivity leaderboard - is the first time a Sonnet model has beaten its Opus counterpart on any major benchmark.
Where Sonnet 4.6 falls short is instructive. The GPQA Diamond gap (74.1% vs Opus's 91.3%) is 17 points - that's a meaningful difference for deep scientific reasoning. Terminal-bench shows GPT-5.2 ahead at 64.7%, which lines up with feedback that GPT-5.x models handle terminal-based multi-language workflows better. Both gaps are real, and teams that need consistent performance on those specific tasks should route accordingly.
Claude Sonnet 4.6 runs on Anthropic's inference infrastructure across multiple cloud providers.
Key Capabilities
Coding and Agentic Workflows
At 79.6% on SWE-bench Verified, Sonnet 4.6 is within 4.2 points of Opus on production-grade coding tasks. In Claude Code internal testing, developers preferred it over Sonnet 4.5 in 70% of head-to-head comparisons and over the previous flagship Opus 4.5 in 59% of comparisons. The improvement isn't mostly about raw correctness - reviewers consistently note that Sonnet 4.6 reads broader context before making changes, avoids duplicating shared logic, and produces cleaner frontend output than its predecessor. For teams using AI coding assistants like Cursor or Windsurf, this model replaced Opus 4.5 as the recommended default in most workflows.
The 1M token context window makes a practical difference for codebase-level work. Dropping an entire repository into context and running targeted analysis was technically possible before, but expensive and unreliable. At $3/M input tokens with a working 1M window, it's now viable for routine tasks. For context on how Sonnet 4.6 compares in coding-specific rankings, see the coding benchmarks leaderboard.
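Since the full window requires an explicit API flag, a request helper might look like the sketch below. The exact beta header value is an assumption (modeled on Anthropic's earlier long-context betas); check the current API docs before relying on it:

```python
# Sketch: opting into the 1M-token window via an explicit flag.
# The "anthropic-beta" header value here is a placeholder assumption.

def build_request(prompt: str, long_context: bool = False) -> dict:
    """Build Messages API parameters plus any beta headers."""
    params = {
        "model": "claude-sonnet-4-6",
        "max_tokens": 64_000,  # the article's stated output ceiling
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {}
    if long_context:
        # Without this flag the default window stays at 200K tokens.
        headers["anthropic-beta"] = "context-1m"  # placeholder value
    return {"params": params, "headers": headers}

r = build_request("Summarize this repository's architecture", long_context=True)
```

The same pattern works through the official SDK by passing the header via its extra-headers mechanism; only long-context calls should carry the flag, since extended-context requests bill at a premium.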
Computer Use
The 72.5% OSWorld score is the spec that most developers underestimate. When Anthropic launched computer use with Claude 3.5 Sonnet in October 2024, the OSWorld score was 14.9%. Sonnet 4.6 is at 72.5% - nearly a fivefold improvement in 16 months. On the Pace insurance benchmark, which tests computer use in a real enterprise environment, Sonnet 4.6 achieves 94% task accuracy. These scores translate to real automation: navigating legacy portals, completing multi-step form workflows, interacting with software that has no API.
The computer use leaderboard shows where the remaining reliability limits are. Anthropic is candid that the model "still lags behind most skilled humans" and that prompt injection remains a risk for web-facing workflows. The 2026 API addition of a Zoom Action - letting the model inspect small UI elements at higher resolution before clicking - addressed one of the most common failure modes from the beta.
Sonnet 4.6's adaptive thinking engine adjusts reasoning depth based on task complexity, from simple queries to multi-step agentic chains.
Adaptive Thinking and Tool Use
Sonnet 4.6 adds the effort parameter (low/medium/high/max) to the Sonnet family. Anthropic recommends medium for most use cases as the best balance between speed, cost, and performance. At high and max, the model sustains 100+ step reasoning chains, approaching Opus-class behavior for complex multi-step problems. Token consumption is the cost: at max effort, Sonnet 4.6 can use 4-5x more tokens than Sonnet 4.5 on the same task, which at scale can close the price gap with Opus 4.6 more than the raw per-token comparison suggests.
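The token-consumption caveat is easy to quantify. Assuming Opus 4.6 output pricing at $75 per million tokens (five times Sonnet's $15, consistent with the "five times cheaper" comparison in this article), the break-even multiplier falls out directly:

```python
# Worked example: at what token multiplier does max-effort Sonnet
# cost as much as Opus on the same task? Opus's $75/M output rate
# is an assumption derived from the article's 5x price comparison.

SONNET_OUT = 15.0   # $/M output tokens
OPUS_OUT = 75.0     # $/M output tokens (assumed)

base_tokens = 10_000                         # output tokens Opus would use
opus_cost = base_tokens / 1e6 * OPUS_OUT     # $0.75

for mult in (1, 4, 5, 6):
    sonnet_cost = base_tokens * mult / 1e6 * SONNET_OUT
    print(f"{mult}x tokens: Sonnet ${sonnet_cost:.2f} vs Opus ${opus_cost:.2f}")
```

At a 5x token multiplier the output costs are exactly equal, so the article's observed 4-5x range at max effort means the per-token price advantage can nearly vanish on reasoning-heavy tasks.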
Tool use is GA: code execution in a sandboxed environment, web search with automatic result filtering, memory across sessions, and function calling via MCP-Atlas (61.3% benchmark, first among tracked models). Context compaction - automatic summarization of older conversation history as sessions approach the context limit - ships as a beta feature that extends practical usable context without hitting the window ceiling.
Pricing and Availability
| Tier | Input | Output |
|---|---|---|
| Standard (≤200K tokens) | $3.00/M | $15.00/M |
| Extended context (>200K tokens) | Premium pricing | Premium pricing |
| Batch API | $1.50/M (50% off) | $7.50/M (50% off) |
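For high-volume offline workloads, the standard-versus-batch difference from the table above compounds quickly. A short cost helper makes the comparison concrete (rates from the table; the example token volumes are arbitrary):

```python
# Cost comparison using the standard vs Batch API rates from the
# pricing table above ($3/$15 standard, $1.50/$7.50 batch, per million).

def cost(in_tok: int, out_tok: int, batch: bool = False) -> float:
    """Dollar cost for a workload at standard or Batch API rates."""
    in_rate, out_rate = (1.50, 7.50) if batch else (3.00, 15.00)
    return in_tok / 1e6 * in_rate + out_tok / 1e6 * out_rate

standard = cost(50_000_000, 10_000_000)           # $300.00
batched = cost(50_000_000, 10_000_000, batch=True)  # $150.00
print(f"standard ${standard:.2f} vs batch ${batched:.2f}")
```

For a nightly 50M-in / 10M-out document-processing job, the Batch API saves $150 per run - the 50% discount applies to both input and output tokens.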
Sonnet 4.6 is the default model for Free and Pro users on claude.ai, available in Claude Code, the Anthropic API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Azure Foundry. Free tier users on claude.ai get limited access; the Pro plan ($20/month) provides standard access.
Compared to the competition: Gemini 3.1 Pro prices at $2/$12 per million tokens, which is cheaper, but trails on office tasks (GDPval-AA Elo of 1,317 vs Sonnet's 1,633) and computer use. GPT-5.2 sits at $2.50/$10 with better Terminal-bench scores. Sonnet 4.6 is the most expensive mid-tier option on paper, though the Batch API halves costs for high-volume offline workloads. For quality-adjusted cost comparisons across providers, see our cost efficiency leaderboard.
Strengths
- GDPval-AA Elo 1,633 - first Sonnet to outscore Opus on a major benchmark
- Computer use at 72.5% OSWorld, 94% on Pace insurance tasks, within 0.2% of Opus 4.6
- 1M context window that's now GA and retrievable (MRCR v2 performance comparable to Opus)
- ARC-AGI-2 score more than doubled in a single generation (24.4% to 58.3%, a 33.9-point gain)
- Default free-tier model - frontier-class coding and computer use accessible without an API key
- Batch API at $1.50/$7.50 makes high-volume workloads financially viable
Weaknesses
- GPQA Diamond (74.1%) trails Opus 4.6 (91.3%) and GPT-5.2 (93.2%) - deep science tasks should route to Opus
- Token consumption at max effort can be 4-5x higher than Sonnet 4.5, eroding the cost advantage
- Terminal-bench (52.5%) lags GPT-5.2 (64.7%) and Opus 4.6 (56.7%) on terminal-based workflows
- Extended context pricing (>200K tokens) adds a premium that Gemini 3.1 Pro doesn't charge
- Parameters not disclosed; no self-hosted or open-weight option
- Early post-launch reports of structured output errors and hallucinated function signatures (reportedly fixed in a later patch)
FAQ
Is Claude Sonnet 4.6 better than Opus 4.6?
It depends on the task. Sonnet 4.6 leads on office productivity and ties on computer use at 1/5 the price. Opus 4.6 leads by 17 points on GPQA Diamond and is better for deep scientific reasoning chains. For most software and knowledge work, Sonnet 4.6 is the right default.
What is the context window for Claude Sonnet 4.6?
The default is 200K tokens. A 1M token window is available via API flag and became GA on March 13, 2026. See our 1M context GA announcement for details on availability across platforms.
How does Sonnet 4.6 compare to GPT-5.2 for coding?
Both score within 0.4 points on SWE-bench Verified (79.6% vs 80.0%). GPT-5.2 leads on Terminal-bench (64.7% vs 52.5%). For practical daily development work - code review, complex fixes, multi-file projects - they're roughly equivalent. GPT-5.2 is slightly cheaper on list price ($2.50 vs $3.00 per million input tokens), though Sonnet's Batch API at $1.50 makes it competitive for offline workloads.
What is the model ID for Claude Sonnet 4.6?
The API model ID is claude-sonnet-4-6. On Amazon Bedrock, it's anthropic.claude-sonnet-4-6. On Google Vertex AI and Azure Foundry, consult each platform's model catalog for the versioned ID.
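A small lookup table keeps these platform-specific IDs in one place. Only the two IDs stated above are included; for platforms whose versioned IDs this article doesn't list, the helper fails loudly rather than guessing:

```python
# Platform-specific model IDs from this article. Vertex AI and Azure
# Foundry IDs are intentionally omitted (not listed here), so the
# lookup raises instead of inventing one.

MODEL_IDS = {
    "anthropic_api": "claude-sonnet-4-6",
    "bedrock": "anthropic.claude-sonnet-4-6",
}

def model_id(platform: str) -> str:
    """Return the Sonnet 4.6 model ID for a platform, or fail loudly."""
    try:
        return MODEL_IDS[platform]
    except KeyError:
        raise ValueError(
            f"No known ID for {platform!r}; consult its model catalog"
        ) from None
```

Failing loudly beats a silent fallback here: a wrong model ID typically surfaces as a confusing provisioning error rather than an obvious 404.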
Does Claude Sonnet 4.6 have computer use?
Yes. Sonnet 4.6 scores 72.5% on OSWorld and 94% accuracy on the Pace enterprise insurance benchmark. It supports the Computer Use API with the Zoom Action feature for inspecting small UI elements before interacting with them.
Is Claude Sonnet 4.6 free?
Free-tier users on claude.ai get access with usage limits. The Anthropic API and cloud deployments require a paid account. See our guide to choosing an LLM in 2026 for a breakdown of which tier fits different use cases.
Related Coverage
- Claude Sonnet 4.6 Review: The Workhorse That Ate the Flagship - Elena Marchetti's hands-on review covering coding, computer use, and the token efficiency problem in depth
- Claude Sonnet 4.6 Arrives With 1M Context and Near-Opus Coding Performance - Original launch news coverage
- Claude Opus 4.6 - The flagship model this Sonnet most directly competes with
- Chatbot Arena Elo Rankings - Current human-preference leaderboard (Sonnet 4.6 Elo: ~1,438; Opus 4.6: ~1,503)
Sources
- Introducing Claude Sonnet 4.6 - Anthropic
- Claude Sonnet 4.6 System Card - Anthropic (PDF)
- Models Overview - Anthropic API Docs
- Claude Sonnet 4.6: Benchmarks, Pricing & Complete Guide - Digital Applied
- Claude Sonnet 4.6: Pricing, Benchmarks & Best Uses - NxCode
- Anthropic's Sonnet 4.6 Matches Flagship AI Performance at One-Fifth the Cost - VentureBeat
- Gemini 3.1 Pro vs Claude Sonnet 4.6 vs GPT-5.3 Coding Comparison - Bind AI
- Claude Sonnet 4.6 Model Specs, Costs and Benchmarks - Galaxy.ai
- Claude Sonnet 4.6 in Production: Capability, Safety, and Cost Explained - Caylent
- Claude Sonnet 4.6 - Intelligence, Performance and Price Analysis - Artificial Analysis
✓ Last verified March 19, 2026
