Claude 3.7 Sonnet
Anthropic's first hybrid reasoning model with togglable extended thinking, a 200K context window, and state-of-the-art SWE-bench performance at $3/$15 per million tokens.

Overview
Claude 3.7 Sonnet, released February 24, 2025, is Anthropic's first hybrid reasoning model. Rather than committing to either fast responses or deep reasoning at training time, it supports both from a single set of weights and lets you pick the mode at inference. Standard mode gives near-instant answers. Extended thinking mode unlocks a configurable reasoning budget of up to 128K tokens, with the model's chain-of-thought fully visible rather than hidden behind a curtain, the way OpenAI's o1 handled it at launch.
TL;DR
- First Claude model with hybrid reasoning - toggle extended thinking on or off per request via API
- 200K input context window, up to 128K output tokens (including thinking tokens), same $3/$15 per million pricing as predecessors
- Hit 70.3% on SWE-bench Verified with scaffolding - well ahead of o1 (48.9%) at launch - but has since been deprecated in favor of Claude Sonnet 4.6
The model shipped alongside Claude Code, Anthropic's terminal-based coding agent. The pairing was intentional: Claude Code routes hard multi-step tasks through extended thinking automatically, treating the hybrid architecture as infrastructure rather than a user-facing toggle. Cognition, Cursor, and several other coding tool vendors cited the model as the best they'd tested at the time of release, specifically on planning code changes and handling full-stack updates across large repositories.
Claude 3.7 Sonnet succeeded Claude 3.5 Sonnet within the 3.x generation before Anthropic moved to the 4.x naming scheme. It has since been deprecated - the API retired it in late 2025 - but it remains historically relevant as the model that introduced hybrid reasoning to the Claude family and set the SWE-bench records that later models tried to beat.
Key Specifications
| Specification | Details |
|---|---|
| Provider | Anthropic |
| Model Family | Claude |
| Parameters | Not disclosed |
| Context Window | 200K input tokens |
| Max Output Tokens | 128K (including thinking tokens, beta at launch) |
| Input Price | $3.00/M tokens |
| Output Price | $15.00/M tokens (thinking tokens priced the same) |
| API Identifier | claude-3-7-sonnet-20250219 (deprecated) |
| Release Date | February 24, 2025 |
| Deprecation Date | Late 2025 |
| License | Proprietary |
The pricing model deserves a note: thinking tokens cost the same as regular output tokens ($15/M). That's cheaper than OpenAI's o1 ($60/M output at launch) and significantly more than DeepSeek-R1 ($0.55/M output), but it avoids the ambiguity of pricing schemes that surcharge reasoning compute separately.
Benchmark Performance
| Benchmark | Claude 3.7 Sonnet | OpenAI o1 | DeepSeek-R1 |
|---|---|---|---|
| MMLU | 88.8% | 92.3% | ~90% |
| GPQA Diamond | 84.8% (extended thinking) | 78.0% | ~65% |
| SWE-bench Verified | 70.3% (with scaffolding) | 48.9% | 49.2% |
| HumanEval | 94.0% | ~92% | 92.9% |
| AIME 2025 | 80.0% | ~85% | 79.8% |
| GSM8K | 96.4% | ~96% | ~95% |
| TAU-bench Retail | 81.2% | 73.5% | Not reported |
| GAIA (HAL agent) | 64.24% | Not reported | Not reported |
The SWE-bench number is the headline. At 70.3% with a custom scaffold and 62.1-63.7% vanilla, Claude 3.7 Sonnet was the highest-scoring model on real-world software engineering tasks when it shipped, by a wide margin over o1 and DeepSeek-R1. The SWE-bench coding agent leaderboard has moved on considerably since, but the gap Claude 3.7 opened over the competition in early 2025 helped cement the reputation for coding work that the Claude 4.x series inherited.
GPQA Diamond is where extended thinking makes the biggest difference. The 84.8% score places Claude 3.7 above the typical PhD expert baseline on this benchmark, and above o1's 78%. MMLU is the one area where o1 held a clear edge. The 80% score on AIME 2025 is solid, but competition math remained a harder case for a model optimized for real-world problem-solving rather than pure symbolic manipulation.
Key Capabilities
Extended Thinking
The core feature is the API parameter thinking: { type: "enabled", budget_tokens: N }. You set N between 1,024 and 128,000. Claude uses up to N tokens to reason before producing its visible answer. The thinking block is returned in the response, so you can inspect the reasoning chain - unlike o1 at launch, which hid the scratchpad. Anthropic recommends starting at 5,000-10,000 tokens for most tasks and going to 20,000+ for genuinely hard problems. Claude often doesn't exhaust the full budget on easier questions.
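A minimal sketch of that request shape using the Anthropic Python SDK as it worked while the model was live; the prompt, budget, and truncation length are illustrative, and the model ID no longer resolves now that the model is retired.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",  # retired ID, shown for historical accuracy
    max_tokens=16000,                    # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[
        {"role": "user", "content": "How many primes below 1,000 end in 3?"},
    ],
)

# The response interleaves "thinking" blocks (the visible reasoning chain)
# with "text" blocks (the final answer).
for block in response.content:
    if block.type == "thinking":
        print("REASONING:", block.thinking[:500], "...")
    elif block.type == "text":
        print("ANSWER:", block.text)
```

Omitting the thinking parameter entirely gives standard mode from the same endpoint - the per-request toggle is the whole hybrid reasoning story.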
Extended thinking helps most on tasks with hard constraints or multiple interacting sub-problems: AIME-level math, multi-file refactoring, long-form analysis with conflicting evidence. For simple retrieval or summarization, standard mode is faster and cheaper.
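That guidance folds naturally into a small routing helper. This is an illustrative heuristic, not part of the Anthropic SDK; the tier names and thresholds are assumptions based on the recommendations above.

```python
# Illustrative heuristic: map a rough task-difficulty label to a thinking
# budget following the recommended ranges (5K-10K typical, 20K+ hard).
BUDGETS = {
    "simple": 0,        # skip extended thinking; standard mode is faster and cheaper
    "moderate": 8_000,  # typical tasks fall in the 5K-10K range
    "hard": 20_000,     # multi-file refactors, AIME-level math
}

def thinking_config(difficulty: str) -> dict | None:
    budget = BUDGETS.get(difficulty, 8_000)
    if budget == 0:
        return None  # caller omits the `thinking` parameter entirely
    # The API floor is 1,024 tokens; the ceiling is bounded by max_tokens.
    return {"type": "enabled", "budget_tokens": max(budget, 1_024)}
```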
Agentic Coding
Claude 3.7 Sonnet is the model that launched with Claude Code as its companion tool. The agentic coding performance is visible in the TAU-bench numbers: 81.2% on retail tasks and 58.4% on airline tasks, both above o1's 73.5% and 54.2%. On the Princeton HAL leaderboard using the HAL Generalist Agent, the high-compute variant reached 64.24% on GAIA with an average cost of $122 per run.
Cursor and Cognition both noted specific improvements in handling large codebases, planning gradual changes across multiple files, and following complex multi-step tool-use chains. The 45% reduction in unnecessary refusals compared to Claude 3.5 Sonnet also made it more usable in automated pipelines where the model previously aborted tasks prematurely.
Multimodal and Computer Use
The model handles text, code, and images. On ChartQA it scores 91.2% and on DocVQA 93.5%, making it reliable for extracting structured data from documents and visual charts. Anthropic described it as their most accurate model for computer use at launch - directing Claude to interact with desktop UIs, click elements, and navigate software. The computer use leaderboard now tracks several successors, but Claude 3.7 Sonnet's computer use capability was notably ahead of where 3.5 Sonnet left off.
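The document-extraction use case those scores describe was a plain messages call with an image content block. A minimal sketch, with a hypothetical filename and prompt:

```python
import base64

import anthropic

client = anthropic.Anthropic()

# Read a chart image and ask the model to pull out its underlying data.
with open("quarterly_revenue.png", "rb") as f:  # hypothetical file
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",  # retired ID
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_b64,
                },
            },
            {"type": "text", "text": "Extract this chart's data as JSON."},
        ],
    }],
)
print(response.content[0].text)
```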
Pricing and Availability
Claude 3.7 Sonnet launched at $3/M input and $15/M output - identical to Claude 3.5 Sonnet. Thinking tokens count as output tokens, so a request that uses 10,000 thinking tokens costs $0.15 more than a request that doesn't. The minimum thinking budget is 1,024 tokens; the maximum is 128K.
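The arithmetic is simple enough to sanity-check in a few lines. A back-of-the-envelope helper at launch pricing, not an official calculator:

```python
INPUT_RATE = 3.00 / 1_000_000    # $ per input token
OUTPUT_RATE = 15.00 / 1_000_000  # $ per output token; thinking tokens bill here too

def request_cost(input_tokens: int, output_tokens: int, thinking_tokens: int = 0) -> float:
    """Estimated dollar cost of a single request at launch pricing."""
    return input_tokens * INPUT_RATE + (output_tokens + thinking_tokens) * OUTPUT_RATE

# A 2K-token prompt, 1K-token answer, and 10K thinking tokens:
print(f"${request_cost(2_000, 1_000, 10_000):.4f}")  # $0.1710
```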
Availability covered all Claude plans (Free, Pro, Team, Enterprise), the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI. Extended thinking wasn't available on the free tier.
Deprecation: The model (claude-3-7-sonnet-20250219) was retired from the Anthropic API in late 2025. Vertex AI deprecated it November 11, 2025 with a shutdown date of May 11, 2026. All API requests to the retired identifier now return errors. The recommended migration path is Claude Sonnet 4.6, which succeeded it in the mid-tier slot.
Compared to its competition at launch: o1 cost $15/M input and $60/M output - four times Claude's output price for a model that scored lower on SWE-bench. DeepSeek-R1 undercut both at $0.14/M in and $0.55/M out, but with much weaker agentic task results. Claude 3.7 Sonnet occupied the middle ground: competitive reasoning for far less than o1, while leading on real-world coding and agent benchmarks.
Strengths and Weaknesses
Strengths
- Configurable reasoning depth without a separate model - one API endpoint for both fast and deep responses
- SWE-bench Verified record-holder at launch: 70.3% with scaffolding, far ahead of o1 (48.9%) and DeepSeek-R1 (49.2%)
- GPQA Diamond 84.8% in extended thinking mode - above PhD expert baseline
- Transparent reasoning chain - full thinking tokens returned in API response
- 45% fewer unnecessary refusals versus Claude 3.5 Sonnet, improving agentic pipeline reliability
- Strong document and visual understanding: ChartQA 91.2%, DocVQA 93.5%
- Thinking tokens priced the same as output tokens - no premium surcharge for reasoning
Weaknesses
- Deprecated as of late 2025 - no longer available via Anthropic API
- Scored just 12% on ARC-AGI, making abstract reasoning one of its clearest failure modes
- Extended thinking not available on free tier at launch
- MMLU trailed o1 (88.8% vs 92.3%)
- Can't interleave extended thinking with tool calls without a non-tool_result turn in between - a constraint that complicated some agentic workflows
- At $15/M output, more expensive than DeepSeek-R1 and comparable open-source reasoning alternatives
Related Coverage
- Claude Sonnet 4.6 - the current mid-tier successor
- Claude Opus 4.6 - current flagship
- SWE-bench Coding Agent Leaderboard - where 3.7 Sonnet set its early record
- Agentic AI Benchmarks Leaderboard - TAU-bench and GAIA context
- Computer Use Leaderboard - Claude 3.7's computer use scores in context
- Reasoning Benchmarks Leaderboard - GPQA and AIME comparisons
FAQ
What made Claude 3.7 Sonnet different from Claude 3.5 Sonnet?
Claude 3.7 Sonnet added hybrid reasoning - a configurable extended thinking mode that lets developers set a token budget for step-by-step reasoning per request. Claude 3.5 Sonnet had no reasoning mode. The 3.7 release also reduced unnecessary refusals by 45% and improved coding scores significantly.
Can I still use Claude 3.7 Sonnet?
No. Anthropic retired the model (claude-3-7-sonnet-20250219) from its API in late 2025. Requests to that model ID return errors. Migrate to claude-sonnet-4-6 or a later identifier.
Does extended thinking cost extra?
No separate surcharge. Thinking tokens are billed at the standard output token rate of $15 per million. The cost scales with how many thinking tokens you actually use, not a flat premium on enabling the feature.
What is the minimum thinking budget?
1,024 tokens is the minimum. Anthropic recommends 5,000-10,000 tokens for most use cases and 20,000+ for hard problems. Claude doesn't always use the full budget - it stops reasoning when it has enough to answer.
How does Claude 3.7 Sonnet compare to DeepSeek-R1?
On real-world coding (SWE-bench: 70.3% vs 49.2%) and agentic task benchmarks, Claude 3.7 Sonnet was clearly ahead. DeepSeek-R1 undercut it heavily on price ($0.55/M vs $15/M output) and was competitive on some math benchmarks. For production agentic workloads, the SWE-bench and TAU-bench gaps justified the cost premium at launch.
Why was it called 3.7 and not 4.0?
Anthropic's versioning at the time used minor increments for significant capability improvements within a generation. Claude 3.7 Sonnet was a major improvement over 3.5 Sonnet but shipped before Anthropic moved to the 4.x generation naming scheme with Claude Sonnet 4 (released mid-2025).
Sources:
- Claude 3.7 Sonnet and Claude Code - Anthropic
- Claude 3.7 Sonnet - HAL Leaderboard (Princeton)
- Claude 3.7 Sonnet benchmarks - automatio.ai
- Claude 3.7 Sonnet vs OpenAI o1 vs DeepSeek R1 - Vellum
- Model deprecations - Anthropic API Docs
- Claude 3.7 Sonnet extended thinking - Simon Willison
- Building with extended thinking - Claude Docs
- Evaluating Claude 3.7 Sonnet - W&B
Last verified: May 15, 2026
