Claude Opus 4.6
Anthropic's flagship model leads on agentic coding, enterprise knowledge work, and long-context retrieval, pairing a 1M-token context window, 128K-token output, and agent teams with $5/$25 per-million-token pricing.

Overview
Anthropic released Claude Opus 4.6 on February 5, 2026, and the benchmark data tells a clear story: this is the most capable model for agentic work and enterprise knowledge tasks available today. It scores 1,606 Elo on GDPval-AA (a measure of real-world office task performance), putting it 144 points ahead of GPT-5.2 and 411 points ahead of Gemini 3.1 Pro. It takes the #1 spot on Terminal-Bench 2.0, Humanity's Last Exam, and BrowseComp, and it holds the top Chatbot Arena ranking at roughly 1,496 Elo.
The model ships with a 1M-token context window (in beta) and supports up to 128K tokens of output - double the previous Opus limit. But the headline feature is agent teams: the ability to spawn and coordinate parallel sub-agents through Claude Code. Anthropic demonstrated this by having 16 parallel agents write a 100,000-line C compiler in two weeks that passes 99% of the GCC test suite. That is not a benchmark score. That is a shipped artifact.
Opus 4.6 does not win everywhere. Gemini 3.1 Pro beats it on GPQA Diamond (94.3% vs 91.3%) and ARC-AGI-2 (77.1% vs 68.8%). GPT-5.3 Codex posts a higher Terminal-Bench 2.0 score (77.3%) in its specialized agentic coding mode. But on the combined picture of reasoning, coding, tool use, and long-context retrieval, Opus 4.6 is the most consistently strong model across the board. See our overall LLM rankings for the full picture.
Key Specifications
| Specification | Details |
|---|---|
| Provider | Anthropic |
| Model Family | Claude |
| Parameters | Not disclosed |
| Context Window | 1,000,000 tokens input (beta) / 128,000 tokens output |
| Input Price | $5.00/M tokens (≤200K), $10.00/M tokens (>200K) |
| Output Price | $25.00/M tokens (≤200K), $37.50/M tokens (>200K) |
| Release Date | February 5, 2026 |
| License | Proprietary (API access, claude.ai, cloud platforms) |
| Input Modalities | Text, images |
| Output Modality | Text |
| Adaptive Thinking | Enabled by default (low/medium/high/max effort levels) |
| Model ID | claude-opus-4-6 |
Benchmark Performance
| Benchmark | Claude Opus 4.6 | GPT-5.2 | Gemini 3.1 Pro | Claude Opus 4.5 |
|---|---|---|---|---|
| Terminal-Bench 2.0 (agentic coding) | 65.4% | 64.7% | 56.2% | 59.8% |
| SWE-bench Verified (GitHub issues) | 80.8% | 80.0% | 76.2% | 80.9% |
| GPQA Diamond (PhD-level science) | 91.3% | 93.2% | 94.3% | 87.0% |
| ARC-AGI-2 (novel reasoning) | 68.8% | 54.2% | 77.1% | 37.6% |
| Humanity's Last Exam (with tools) | 53.1% | 50.0% | 45.8% | 43.4% |
| BrowseComp (web research) | 84.0% | 77.9% | 59.2% | 67.8% |
| GDPval-AA Elo (office tasks) | 1,606 | 1,462 | 1,195 | 1,416 |
| OSWorld (GUI automation) | 72.7% | - | - | 66.3% |
| tau2-bench Retail (tool calling) | 91.9% | 82.0% | 85.3% | 88.9% |
| MRCR v2 @ 1M (long-context retrieval) | 76.0% | - | 26.3% | - |
| MMMU Pro (visual reasoning, with tools) | 77.3% | 80.4% | 81.0% | 73.9% |
| MMMLU (multilingual) | 91.1% | 89.6% | 91.8% | 90.8% |
The numbers separate into two stories. On reasoning-heavy academic benchmarks (GPQA, ARC-AGI-2, MMMU Pro), Gemini 3.1 Pro leads. On agentic tasks that require sustained execution, tool use, and real-world knowledge work (Terminal-Bench, GDPval-AA, BrowseComp, tau2-bench), Opus 4.6 is consistently #1. The GDPval-AA gap of 144 Elo points over GPT-5.2 translates to noticeably better performance on the kinds of tasks companies actually pay for: financial analysis, document synthesis, form automation, and multi-application workflows.
The long-context story is particularly strong. At 1M tokens, Opus 4.6 scores 76.0% on MRCR v2 needle-in-haystack retrieval versus Gemini 3.1 Pro's 26.3% on the same test. Anthropic's 1M context window actually works - it is not just an advertised number that degrades under real use.
Key Capabilities
Agent Teams. The most significant new capability is native multi-agent coordination through Claude Code. Opus 4.6 can spawn parallel sub-agents, delegate sub-tasks, and synthesize their outputs. This is not prompt chaining - it is environment-level coordination where multiple model instances operate independently and merge results. The 100K-line C compiler experiment demonstrated the ceiling. For typical development workflows, agent teams translate to end-to-end multi-file project builds with less human oversight than any prior model.
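The fan-out/merge pattern behind agent teams can be sketched in ordinary code. This is an illustrative sketch only: `run_subagent` is a stub standing in for a model call, and the real coordination happens inside Claude Code's environment, not through a helper function like this.

```python
# Illustrative sketch of the fan-out/merge pattern behind agent teams.
# `run_subagent` is a stub; a real version would invoke a model instance.
from concurrent.futures import ThreadPoolExecutor

def run_subagent(task: str) -> str:
    """Stand-in for a sub-agent working one delegated sub-task."""
    return f"result for {task!r}"

def run_agent_team(tasks: list[str], max_workers: int = 4) -> list[str]:
    """Fan each sub-task out to a parallel worker, then merge results in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_subagent, tasks))

results = run_agent_team(["lexer", "parser", "codegen"])
```

The merge step here is a trivial ordered collect; the point is the shape of the workflow (delegate, run independently, synthesize), not the mechanics.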
Adaptive Thinking. Rather than offering discrete reasoning modes, Opus 4.6 dynamically adjusts its internal reasoning depth based on task complexity. You can override this with explicit effort controls (low, medium, high, max), but the automatic mode correctly identifies when deeper analysis is needed roughly 90% of the time. At the default setting, the model almost always engages some form of extended reasoning. Read our full review for hands-on testing of this feature.
Enterprise Integration. Opus 4.6 adds native PowerPoint and Excel support - reading, analyzing, and generating .pptx and .xlsx files directly. Combined with the 1M context window (which can ingest entire codebases, legal document sets, or multi-hundred-page research papers) and tool-calling scores above 90% on retail and telecom benchmarks, this makes it the strongest model for professional back-office automation. The Finance Agent benchmark score of 60.7% leads the field.
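Before loading an entire codebase or document set into the 1M window, a rough pre-flight size check helps. The ~4 characters-per-token heuristic and the `fits_in_context` helper below are illustrative assumptions, not Anthropic's tokenizer.

```python
# Rough pre-flight check: will a document set fit the 1M-token window?
# Uses the common ~4 chars/token heuristic, not the model's actual tokenizer.
CONTEXT_WINDOW = 1_000_000
CHARS_PER_TOKEN = 4  # coarse approximation

def estimated_tokens(text: str) -> int:
    return len(text) // CHARS_PER_TOKEN + 1

def fits_in_context(documents: list[str], reserve: int = 128_000) -> bool:
    """Leave `reserve` tokens of headroom for the model's output."""
    total = sum(estimated_tokens(d) for d in documents)
    return total <= CONTEXT_WINDOW - reserve

fits_in_context(["short brief"])  # True
```

The default `reserve` mirrors the 128K output limit, so a passing check leaves room for a maximum-length response.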
Pricing and Availability
| Tier | Input | Output |
|---|---|---|
| Standard (≤200K tokens) | $5.00/M tokens | $25.00/M tokens |
| Long context (>200K tokens) | $10.00/M tokens | $37.50/M tokens |
| Fast mode (research preview) | $30.00/M tokens | $150.00/M tokens |
| Batch API | $2.50/M tokens (50% off) | $12.50/M tokens (50% off) |
| US-only inference | 1.1x multiplier on all tiers | 1.1x multiplier on all tiers |
Claude Opus 4.6 is available on claude.ai (Pro and Max plans), the Anthropic API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Azure Foundry. Free-tier users on claude.ai get limited access; the Pro plan ($20/month) provides standard access; the Max plan removes most usage caps.
Compared to competitors: Gemini 3.1 Pro undercuts at $2/$12 per million tokens, making it 2.5x cheaper on input. GPT-5.2 prices at $2.50/$10 for standard queries. Opus 4.6 is the most expensive option at the frontier tier, which makes the Batch API (50% discount) and model routing strategies important for cost-conscious deployments. For high-volume workloads, check our cost efficiency leaderboard to compare quality-adjusted costs across providers.
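The tier table above can be turned into a quick cost estimator. One assumption to flag: this sketch bills a request entirely at the long-context rate once input exceeds 200K tokens; the table does not spell out whether the premium applies to all tokens or only the excess.

```python
# Cost sketch for the Opus 4.6 tier table, assuming a request over 200K input
# tokens is billed entirely at the long-context rate (an assumption; the
# pricing table does not specify per-token splitting).
LONG_CONTEXT_THRESHOLD = 200_000

def request_cost(input_tokens: int, output_tokens: int,
                 batch: bool = False, us_only: bool = False) -> float:
    """Return USD cost for one request under the listed tiers."""
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        in_rate, out_rate = 10.00, 37.50   # long-context tier, $/M tokens
    else:
        in_rate, out_rate = 5.00, 25.00    # standard tier
    cost = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    if batch:
        cost *= 0.5    # Batch API: 50% off
    if us_only:
        cost *= 1.1    # US-only inference multiplier
    return round(cost, 6)

standard = request_cost(100_000, 4_000)    # $0.50 in + $0.10 out = $0.60
long_ctx = request_cost(500_000, 10_000)   # $5.00 in + $0.375 out = $5.375
```

Running the same numbers against a competitor's rates makes the quality-adjusted cost comparison concrete.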
Strengths
- Best-in-class agentic execution: #1 on Terminal-Bench 2.0, GDPval-AA, BrowseComp, and Humanity's Last Exam
- Agent teams enable genuine multi-agent coordination, not just prompt chaining
- 1M-token context window that actually retrieves accurately (76% MRCR v2 at 1M tokens)
- 128K output tokens - double the previous Opus limit, enabling longer code generation and analysis
- Strongest tool-calling performance (91.9% tau2-bench Retail, 99.3% Telecom)
- Native PowerPoint/Excel integration for enterprise workflows
- Lowest over-refusal rate among recent Claude models - less likely to refuse legitimate requests
Weaknesses
- Most expensive frontier model at $5/$25 per million tokens (2.5x Gemini 3.1 Pro on input, 2.5x GPT-5.2 on output)
- Long context pricing jumps to $10/$37.50 above 200K tokens
- Falls behind Gemini 3.1 Pro on pure reasoning benchmarks (GPQA Diamond, ARC-AGI-2)
- Visual reasoning (MMMU Pro) trails both GPT-5.2 and Gemini 3.1 Pro
- Parameters not disclosed - no open-source or self-hosted option
- Agent teams currently experimental and limited to Claude Code workflows
- Latency higher than GPT-5.2 and Gemini on standard queries due to adaptive thinking overhead
Related Coverage
- Claude Opus 4.6 Review: Anthropic's Best-Aligned Frontier Model - Our full hands-on review covering adaptive thinking, agent teams, and the 1M context window
- Claude Opus 4.6 Launches with Agent Teams - Launch coverage and announcement details
- Claude Sonnet 4.6 - The more affordable Sonnet variant released 12 days later
- Claude Max Opus 4.6 Usage Limits Backlash - Coverage of the Max plan usage cap controversy
- Chatbot Arena Elo Rankings - Current human-preference leaderboard where Opus 4.6 holds #1
- Coding Benchmarks Leaderboard - Terminal-Bench 2.0 and SWE-bench rankings
- Codex vs Claude Code vs OpenCode - Tool comparison featuring Claude Code with Opus 4.6
- ChatGPT vs Claude vs Gemini - Three-way platform comparison
Sources
- Introducing Claude Opus 4.6 - Anthropic
- Claude Opus 4.6 System Card - Anthropic (PDF)
- Claude API Pricing - Anthropic
- Claude Opus 4.6 Benchmarks Explained - Vellum
- Claude Opus 4.6 vs Opus 4.5 - Medium/Barnacle Goose
- Anthropic Releases Opus 4.6 with Agent Teams - TechCrunch
- Claude Opus 4.6 Intelligence & Performance Analysis - Artificial Analysis
- LMSYS Chatbot Arena Leaderboard - February 2026
- Claude Opus 4.6 on Azure Foundry - Microsoft
