Claude Opus 4.6
Anthropic's flagship model leads on agentic coding, enterprise knowledge work, and long-context retrieval, pairing a 1M-token context window, 128K-token output, and agent teams with $5/$25 per-million-token pricing.

Overview
Anthropic released Claude Opus 4.6 on February 5, 2026, and the benchmark data tells a clear story: this is the most capable model for agentic work and enterprise knowledge tasks available today. It scores 1,606 Elo on GDPval-AA (a measure of real-world office task performance), putting it 144 points ahead of GPT-5.2 and 411 points ahead of Gemini 3.1 Pro. It takes the #1 spot on Terminal-Bench 2.0, Humanity's Last Exam, and BrowseComp, and it holds the top Chatbot Arena ranking at roughly 1,496 Elo.
The model ships with a 1M-token context window (in beta) and supports up to 128K tokens of output - double the previous Opus limit. But the headline feature is agent teams: the ability to spawn and coordinate parallel sub-agents through Claude Code. Anthropic demonstrated this by having 16 parallel agents write a 100,000-line C compiler in two weeks that passes 99% of the GCC test suite. That is not a benchmark score. That is a shipped artifact.
Opus 4.6 does not win everywhere. Gemini 3.1 Pro beats it on GPQA Diamond (94.3% vs 91.3%) and ARC-AGI-2 (77.1% vs 68.8%). GPT-5.3 Codex posts a higher Terminal-Bench 2.0 score (77.3%) in its specialized agentic coding mode. But on the combined picture of reasoning, coding, tool use, and long-context retrieval, Opus 4.6 is the most consistently strong model across the board. See our overall LLM rankings for the full picture.
Key Specifications
| Specification | Details |
|---|---|
| Provider | Anthropic |
| Model Family | Claude |
| Parameters | Not disclosed |
| Context Window | 1,000,000 tokens input (beta) / 128,000 tokens output |
| Input Price | $5.00/M tokens (≤200K), $10.00/M tokens (>200K) |
| Output Price | $25.00/M tokens (≤200K), $37.50/M tokens (>200K) |
| Release Date | February 5, 2026 |
| License | Proprietary (API access, claude.ai, cloud platforms) |
| Input Modalities | Text, images |
| Output Modality | Text |
| Adaptive Thinking | Enabled by default (low/medium/high/max effort levels) |
| Model ID | claude-opus-4-6 |
Benchmark Performance
| Benchmark | Claude Opus 4.6 | GPT-5.2 | Gemini 3.1 Pro | Claude Opus 4.5 |
|---|---|---|---|---|
| Terminal-Bench 2.0 (agentic coding) | 65.4% | 64.7% | 56.2% | 59.8% |
| SWE-bench Verified (GitHub issues) | 80.8% | 80.0% | 76.2% | 80.9% |
| GPQA Diamond (PhD-level science) | 91.3% | 93.2% | 94.3% | 87.0% |
| ARC-AGI-2 (novel reasoning) | 68.8% | 54.2% | 77.1% | 37.6% |
| Humanity's Last Exam (with tools) | 53.1% | 50.0% | 45.8% | 43.4% |
| BrowseComp (web research) | 84.0% | 77.9% | 59.2% | 67.8% |
| GDPval-AA Elo (office tasks) | 1,606 | 1,462 | 1,195 | 1,416 |
| OSWorld (GUI automation) | 72.7% | - | - | 66.3% |
| tau2-bench Retail (tool calling) | 91.9% | 82.0% | 85.3% | 88.9% |
| MRCR v2 @ 1M (long-context retrieval) | 76.0% | - | 26.3% | - |
| MMMU Pro (visual reasoning, with tools) | 77.3% | 80.4% | 81.0% | 73.9% |
| MMMLU (multilingual) | 91.1% | 89.6% | 91.8% | 90.8% |
The numbers separate into two stories. On reasoning-heavy academic benchmarks (GPQA, ARC-AGI-2, MMMU Pro), Gemini 3.1 Pro leads. On agentic tasks that require sustained execution, tool use, and real-world knowledge work (Terminal-Bench, GDPval-AA, BrowseComp, tau2-bench), Opus 4.6 is consistently #1. The GDPval-AA gap of 144 Elo points over GPT-5.2 translates to noticeably better performance on the kinds of tasks companies actually pay for: financial analysis, document synthesis, form automation, and multi-application workflows.
The long-context story is particularly strong. At 1M tokens, Opus 4.6 scores 76.0% on MRCR v2 needle-in-haystack retrieval versus Gemini 3.1 Pro's 26.3% on the same test. Anthropic's 1M context window actually works - it is not just an advertised number that degrades under real use.
Key Capabilities
Agent Teams. The most significant new capability is native multi-agent coordination through Claude Code. Opus 4.6 can spawn parallel sub-agents, delegate sub-tasks, and synthesize their outputs. This is not prompt chaining - it is environment-level coordination where multiple model instances operate independently and merge results. The 100K-line C compiler experiment demonstrated the ceiling. For typical development workflows, agent teams translate to end-to-end multi-file project builds with less human oversight than any prior model.
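The fan-out/merge pattern behind agent teams can be sketched in ordinary code. This is an illustrative sketch only: `run_subagent` is a stub standing in for a model call, and the real coordination happens inside Claude Code's environment, not through a helper function like this.

```python
# Illustrative sketch of the fan-out/merge pattern behind agent teams.
# `run_subagent` is a stub; a real version would invoke a model instance.
from concurrent.futures import ThreadPoolExecutor

def run_subagent(task: str) -> str:
    """Stand-in for a sub-agent working one delegated sub-task."""
    return f"result for {task!r}"

def run_agent_team(tasks: list[str], max_workers: int = 4) -> list[str]:
    """Fan each sub-task out to a parallel worker, then merge results in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_subagent, tasks))

results = run_agent_team(["lexer", "parser", "codegen"])
```

The merge step here is a trivial ordered collect; the point is the shape of the workflow (delegate, run independently, synthesize), not the mechanics.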
Adaptive Thinking. Rather than offering discrete reasoning modes, Opus 4.6 dynamically adjusts its internal reasoning depth based on task complexity. You can override this with explicit effort controls (low, medium, high, max), but the automatic mode correctly identifies when deeper analysis is needed roughly 90% of the time. At the default setting, the model almost always engages some form of extended reasoning. Read our full review for hands-on testing of this feature.
Enterprise Integration. Opus 4.6 adds native PowerPoint and Excel support - reading, analyzing, and generating .pptx and .xlsx files directly. Combined with the 1M context window (which can ingest entire codebases, legal document sets, or multi-hundred-page research papers) and tool-calling scores above 90% on retail and telecom benchmarks, this makes it the strongest model for professional back-office automation. The Finance Agent benchmark score of 60.7% leads the field.
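Before loading an entire codebase or document set into the 1M window, a rough pre-flight size check helps. The ~4 characters-per-token heuristic and the `fits_in_context` helper below are illustrative assumptions, not Anthropic's tokenizer.

```python
# Rough pre-flight check: will a document set fit the 1M-token window?
# Uses the common ~4 chars/token heuristic, not the model's actual tokenizer.
CONTEXT_WINDOW = 1_000_000
CHARS_PER_TOKEN = 4  # coarse approximation

def estimated_tokens(text: str) -> int:
    return len(text) // CHARS_PER_TOKEN + 1

def fits_in_context(documents: list[str], reserve: int = 128_000) -> bool:
    """Leave `reserve` tokens of headroom for the model's output."""
    total = sum(estimated_tokens(d) for d in documents)
    return total <= CONTEXT_WINDOW - reserve

fits_in_context(["short brief"])  # True
```

The default `reserve` mirrors the 128K output limit, so a passing check leaves room for a maximum-length response.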
Pricing and Availability
| Tier | Input | Output |
|---|---|---|
| Standard (≤200K tokens) | $5.00/M tokens | $25.00/M tokens |
| Long context (>200K tokens) | $10.00/M tokens | $37.50/M tokens |
| Fast mode (research preview) | $30.00/M tokens | $150.00/M tokens |
| Batch API | $2.50/M tokens (50% off) | $12.50/M tokens (50% off) |
| US-only inference | 1.1x multiplier on all tiers | 1.1x multiplier on all tiers |
Claude Opus 4.6 is available on claude.ai (Pro and Max plans), the Anthropic API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Azure Foundry. Free-tier users on claude.ai get limited access; the Pro plan ($20/month) provides standard access; the Max plan removes most usage caps.
Compared to competitors: Gemini 3.1 Pro undercuts at $2/$12 per million tokens, making it 2.5x cheaper on input. GPT-5.2 prices at $2.50/$10 for standard queries. Opus 4.6 is the most expensive option at the frontier tier, which makes the Batch API (50% discount) and model routing strategies important for cost-conscious deployments. For high-volume workloads, check our cost efficiency leaderboard to compare quality-adjusted costs across providers.
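The tier table above can be turned into a quick cost estimator. One assumption to flag: this sketch bills a request entirely at the long-context rate once input exceeds 200K tokens; the table does not spell out whether the premium applies to all tokens or only the excess.

```python
# Cost sketch for the Opus 4.6 tier table, assuming a request over 200K input
# tokens is billed entirely at the long-context rate (an assumption; the
# pricing table does not specify per-token splitting).
LONG_CONTEXT_THRESHOLD = 200_000

def request_cost(input_tokens: int, output_tokens: int,
                 batch: bool = False, us_only: bool = False) -> float:
    """Return USD cost for one request under the listed tiers."""
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        in_rate, out_rate = 10.00, 37.50   # long-context tier, $/M tokens
    else:
        in_rate, out_rate = 5.00, 25.00    # standard tier
    cost = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    if batch:
        cost *= 0.5    # Batch API: 50% off
    if us_only:
        cost *= 1.1    # US-only inference multiplier
    return round(cost, 6)

standard = request_cost(100_000, 4_000)    # $0.50 in + $0.10 out = $0.60
long_ctx = request_cost(500_000, 10_000)   # $5.00 in + $0.375 out = $5.375
```

Running the same numbers against a competitor's rates makes the quality-adjusted cost comparison concrete.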
Strengths
- Best-in-class agentic execution: #1 on Terminal-Bench 2.0, GDPval-AA, BrowseComp, and Humanity's Last Exam
- Agent teams enable genuine multi-agent coordination, not just prompt chaining
- 1M-token context window that actually retrieves accurately (76% MRCR v2 at 1M tokens)
- 128K output tokens - double the previous Opus limit, enabling longer code generation and analysis
- Strongest tool-calling performance (91.9% tau2-bench Retail, 99.3% Telecom)
- Native PowerPoint/Excel integration for enterprise workflows
- Lowest over-refusal rate among recent Claude models - less likely to refuse legitimate requests
Weaknesses
- Most expensive frontier model at $5/$25 per million tokens (2.5x Gemini 3.1 Pro on input, 2.5x GPT-5.2 on output)
- Long context pricing jumps to $10/$37.50 above 200K tokens
- Falls behind Gemini 3.1 Pro on pure reasoning benchmarks (GPQA Diamond, ARC-AGI-2)
- Visual reasoning (MMMU Pro) trails both GPT-5.2 and Gemini 3.1 Pro
- Parameters not disclosed - no open-source or self-hosted option
- Agent teams currently experimental and limited to Claude Code workflows
- Latency higher than GPT-5.2 and Gemini on standard queries due to adaptive thinking overhead
Related Coverage
- Claude Opus 4.6 Review: Anthropic's Best-Aligned Frontier Model - Our full hands-on review covering adaptive thinking, agent teams, and the 1M context window
- Claude Opus 4.6 Launches with Agent Teams - Launch coverage and announcement details
- Claude Sonnet 4.6 - The more affordable Sonnet variant released 12 days later
- Claude Max Opus 4.6 Usage Limits Backlash - Coverage of the Max plan usage cap controversy
- Chatbot Arena Elo Rankings - Current human-preference leaderboard where Opus 4.6 holds #1
- Coding Benchmarks Leaderboard - Terminal-Bench 2.0 and SWE-bench rankings
- Codex vs Claude Code vs OpenCode - Tool comparison featuring Claude Code with Opus 4.6
- ChatGPT vs Claude vs Gemini - Three-way platform comparison
Sources
- Introducing Claude Opus 4.6 - Anthropic
- Claude Opus 4.6 System Card - Anthropic (PDF)
- Claude API Pricing - Anthropic
- Claude Opus 4.6 Benchmarks Explained - Vellum
- Claude Opus 4.6 vs Opus 4.5 - Medium/Barnacle Goose
- Anthropic Releases Opus 4.6 with Agent Teams - TechCrunch
- Claude Opus 4.6 Intelligence & Performance Analysis - Artificial Analysis
- LMSYS Chatbot Arena Leaderboard - February 2026
- Claude Opus 4.6 on Azure Foundry - Microsoft
