Claude 3.7 Sonnet
Anthropic's first hybrid reasoning model with togglable extended thinking, a 200K context window, and state-of-the-art SWE-bench performance at $3/$15 per million tokens.

Overview
Claude 3.7 Sonnet, released February 24, 2025, is Anthropic's first hybrid reasoning model. Rather than committing to either fast responses or deep reasoning at training time, it supports both from a single set of weights and lets you pick the mode at inference. Standard mode gives near-instant answers. Extended thinking mode unlocks a configurable reasoning budget of up to 128K tokens, with the model's chain-of-thought fully visible rather than hidden behind a curtain, the way OpenAI's o1 handled it at launch.
TL;DR
- First Claude model with hybrid reasoning - toggle extended thinking on or off per request via API
- 200K input context window, up to 128K output tokens (including thinking tokens), same $3/$15 per million pricing as predecessors
- Hit 70.3% on SWE-bench Verified with scaffolding - well ahead of o1 (48.9%) at launch - but has since been deprecated in favor of Claude Sonnet 4.6
The model shipped alongside Claude Code, Anthropic's terminal-based coding agent. The pairing was intentional: Claude Code routes hard multi-step tasks through extended thinking automatically, treating the hybrid architecture as infrastructure rather than a user-facing toggle. Cognition, Cursor, and several other coding tool vendors cited the model as the best they'd tested at the time of release, specifically on planning code changes and handling full-stack updates across large repositories.
Claude 3.7 Sonnet succeeded Claude 3.5 Sonnet within the 3.x generation before Anthropic moved to the 4.x naming scheme. It has since been deprecated - the API retired it in late 2025 - but it remains historically relevant as the model that introduced hybrid reasoning to the Claude family and set the SWE-bench records that later models tried to beat.
Key Specifications
| Specification | Details |
|---|---|
| Provider | Anthropic |
| Model Family | Claude |
| Parameters | Not disclosed |
| Context Window | 200K input tokens |
| Max Output Tokens | 128K (including thinking tokens, beta at launch) |
| Input Price | $3.00/M tokens |
| Output Price | $15.00/M tokens (thinking tokens priced the same) |
| API Identifier | claude-3-7-sonnet-20250219 (deprecated) |
| Release Date | February 24, 2025 |
| Deprecation Date | Late 2025 |
| License | Proprietary |
The pricing model deserves a note: thinking tokens cost the same as regular output tokens ($15/M). That's cheaper than OpenAI's o1 ($60/M output at launch) and significantly more than DeepSeek-R1 ($0.55/M output), but it avoids the ambiguity of pricing schemes that surcharge reasoning compute separately.
Benchmark Performance
| Benchmark | Claude 3.7 Sonnet | OpenAI o1 | DeepSeek-R1 |
|---|---|---|---|
| MMLU | 88.8% | 92.3% | ~90% |
| GPQA Diamond | 84.8% (extended thinking) | 78.0% | ~65% |
| SWE-bench Verified | 70.3% (with scaffolding) | 48.9% | 49.2% |
| HumanEval | 94.0% | ~92% | 92.9% |
| AIME 2025 | 80.0% | ~85% | 79.8% |
| GSM8K | 96.4% | ~96% | ~95% |
| TAU-bench Retail | 81.2% | 73.5% | Not reported |
| GAIA (HAL agent) | 64.24% | Not reported | Not reported |
The SWE-bench number is the headline. At 70.3% with a custom scaffold and 62.1-63.7% vanilla, Claude 3.7 Sonnet was the highest-scoring model on real-world software engineering tasks when it shipped, by a wide margin over o1 and DeepSeek-R1. The SWE-bench coding agent leaderboard has moved on considerably since, but the gap Claude 3.7 opened over the competition in early 2025 helped cement the reputation for coding work that the Claude 4.x series inherited.
GPQA Diamond is where extended thinking makes the biggest difference. The 84.8% score places Claude 3.7 above the typical PhD expert baseline on this benchmark, and above o1's 78%. MMLU is the one area where o1 held a clear edge. The 80% score on AIME 2025 is solid, but competition math remained a harder case for a model optimized for real-world problem-solving rather than pure symbolic manipulation.
Key Capabilities
Extended Thinking
The core feature is the API parameter thinking: { type: "enabled", budget_tokens: N }. You set N between 1,024 and 128,000. Claude uses up to N tokens to reason before producing its visible answer. The thinking block is returned in the response, so you can inspect the reasoning chain - unlike o1 at launch, which hid the scratchpad. Anthropic recommends starting at 5,000-10,000 tokens for most tasks and going to 20,000+ for genuinely hard problems. Claude often doesn't exhaust the full budget on easier questions.
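A minimal sketch of that request shape using the Anthropic Python SDK as it worked while the model was live; the prompt, budget, and truncation length are illustrative, and the model ID no longer resolves now that the model is retired.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",  # retired ID, shown for historical accuracy
    max_tokens=16000,                    # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[
        {"role": "user", "content": "How many primes below 1,000 end in 3?"},
    ],
)

# The response interleaves "thinking" blocks (the visible reasoning chain)
# with "text" blocks (the final answer).
for block in response.content:
    if block.type == "thinking":
        print("REASONING:", block.thinking[:500], "...")
    elif block.type == "text":
        print("ANSWER:", block.text)
```

Omitting the thinking parameter entirely gives standard mode from the same endpoint - the per-request toggle is the whole hybrid reasoning story.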
Extended thinking helps most on tasks with hard constraints or multiple interacting sub-problems: AIME-level math, multi-file refactoring, long-form analysis with conflicting evidence. For simple retrieval or summarization, standard mode is faster and cheaper.
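That guidance folds naturally into a small routing helper. This is an illustrative heuristic, not part of the Anthropic SDK; the tier names and thresholds are assumptions based on the recommendations above.

```python
# Illustrative heuristic: map a rough task-difficulty label to a thinking
# budget following the recommended ranges (5K-10K typical, 20K+ hard).
BUDGETS = {
    "simple": 0,        # skip extended thinking; standard mode is faster and cheaper
    "moderate": 8_000,  # typical tasks fall in the 5K-10K range
    "hard": 20_000,     # multi-file refactors, AIME-level math
}

def thinking_config(difficulty: str) -> dict | None:
    budget = BUDGETS.get(difficulty, 8_000)
    if budget == 0:
        return None  # caller omits the `thinking` parameter entirely
    # The API floor is 1,024 tokens; the ceiling is bounded by max_tokens.
    return {"type": "enabled", "budget_tokens": max(budget, 1_024)}
```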
Agentic Coding
Claude 3.7 Sonnet is the model that launched with Claude Code as its companion tool. The agentic coding performance is visible in the TAU-bench numbers: 81.2% on retail tasks and 58.4% on airline tasks, both above o1's 73.5% and 54.2%. On the Princeton HAL leaderboard using the HAL Generalist Agent, the high-compute variant reached 64.24% on GAIA with an average cost of $122 per run.
Cursor and Cognition both noted specific improvements in handling large codebases, planning gradual changes across multiple files, and following complex multi-step tool-use chains. The 45% reduction in unnecessary refusals compared to Claude 3.5 Sonnet also made it more usable in automated pipelines where the model previously aborted tasks prematurely.
Multimodal and Computer Use
The model handles text, code, and images. On ChartQA it scores 91.2% and on DocVQA 93.5%, making it reliable for extracting structured data from documents and visual charts. Anthropic described it as their most accurate model for computer use at launch - directing Claude to interact with desktop UIs, click elements, and navigate software. The computer use leaderboard now tracks several successors, but Claude 3.7 Sonnet's computer use capability was notably ahead of where 3.5 Sonnet left off.
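The document-extraction use case those scores describe was a plain messages call with an image content block. A minimal sketch, with a hypothetical filename and prompt:

```python
import base64

import anthropic

client = anthropic.Anthropic()

# Read a chart image and ask the model to pull out its underlying data.
with open("quarterly_revenue.png", "rb") as f:  # hypothetical file
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",  # retired ID
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_b64,
                },
            },
            {"type": "text", "text": "Extract this chart's data as JSON."},
        ],
    }],
)
print(response.content[0].text)
```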
Pricing and Availability
Claude 3.7 Sonnet launched at $3/M input and $15/M output - identical to Claude 3.5 Sonnet. Thinking tokens count as output tokens, so a request that uses 10,000 thinking tokens costs $0.15 more than a request that doesn't. The minimum thinking budget is 1,024 tokens; the maximum is 128K.
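The arithmetic is simple enough to sanity-check in a few lines. A back-of-the-envelope helper at launch pricing, not an official calculator:

```python
INPUT_RATE = 3.00 / 1_000_000    # $ per input token
OUTPUT_RATE = 15.00 / 1_000_000  # $ per output token; thinking tokens bill here too

def request_cost(input_tokens: int, output_tokens: int, thinking_tokens: int = 0) -> float:
    """Estimated dollar cost of a single request at launch pricing."""
    return input_tokens * INPUT_RATE + (output_tokens + thinking_tokens) * OUTPUT_RATE

# A 2K-token prompt, 1K-token answer, and 10K thinking tokens:
print(f"${request_cost(2_000, 1_000, 10_000):.4f}")  # $0.1710
```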
Availability covered all Claude plans (Free, Pro, Team, Enterprise), the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI. Extended thinking wasn't available on the free tier.
Deprecation: The model (claude-3-7-sonnet-20250219) was retired from the Anthropic API in late 2025. Vertex AI deprecated it November 11, 2025 with a shutdown date of May 11, 2026. All API requests to the retired identifier now return errors. The recommended migration path is Claude Sonnet 4.6, which succeeded it in the mid-tier slot.
Compared to its competition at launch: o1 cost $15/M input and $60/M output - four times Claude's output price for a model that scored lower on SWE-bench. DeepSeek-R1 undercut both at $0.14/M in and $0.55/M out, but with much weaker agentic task results. Claude 3.7 Sonnet occupied the middle ground: competitive reasoning for far less than o1, while leading on real-world coding and agent benchmarks.
Strengths and Weaknesses
Strengths
- Configurable reasoning depth without a separate model - one API endpoint for both fast and deep responses
- SWE-bench Verified record-holder at launch: 70.3% with scaffolding, far ahead of o1 (48.9%) and DeepSeek-R1 (49.2%)
- GPQA Diamond 84.8% in extended thinking mode - above PhD expert baseline
- Transparent reasoning chain - full thinking tokens returned in API response
- 45% fewer unnecessary refusals versus Claude 3.5 Sonnet, improving agentic pipeline reliability
- Strong document and visual understanding: ChartQA 91.2%, DocVQA 93.5%
- Thinking tokens priced the same as output tokens - no premium surcharge for reasoning
Weaknesses
- Deprecated as of late 2025 - no longer available via Anthropic API
- Scored just 12% on ARC-AGI, making abstract reasoning one of its clearest failure modes
- Extended thinking not available on free tier at launch
- MMLU trailed o1 (88.8% vs 92.3%)
- Can't interleave extended thinking with tool calls without a non-tool_result turn in between - a constraint that complicated some agentic workflows
- At $15/M output, more expensive than DeepSeek-R1 and comparable open-source reasoning alternatives
Related Coverage
- Claude Sonnet 4.6 - the current mid-tier successor
- Claude Opus 4.6 - current flagship
- SWE-bench Coding Agent Leaderboard - where 3.7 Sonnet set its early record
- Agentic AI Benchmarks Leaderboard - TAU-bench and GAIA context
- Computer Use Leaderboard - Claude 3.7's computer use scores in context
- Reasoning Benchmarks Leaderboard - GPQA and AIME comparisons
FAQ
What made Claude 3.7 Sonnet different from Claude 3.5 Sonnet?
Claude 3.7 Sonnet added hybrid reasoning - a configurable extended thinking mode that lets developers set a token budget for step-by-step reasoning per request. Claude 3.5 Sonnet had no reasoning mode. The 3.7 release also reduced unnecessary refusals by 45% and improved coding scores significantly.
Can I still use Claude 3.7 Sonnet?
No. Anthropic retired the model (claude-3-7-sonnet-20250219) from its API in late 2025. Requests to that model ID return errors. Migrate to claude-sonnet-4-6 or a later identifier.
Does extended thinking cost extra?
No separate surcharge. Thinking tokens are billed at the standard output token rate of $15 per million. The cost scales with how many thinking tokens you actually use, not a flat premium on enabling the feature.
What is the minimum thinking budget?
1,024 tokens is the minimum. Anthropic recommends 5,000-10,000 tokens for most use cases and 20,000+ for hard problems. Claude doesn't always use the full budget - it stops reasoning when it has enough to answer.
How does Claude 3.7 Sonnet compare to DeepSeek-R1?
On real-world coding (SWE-bench: 70.3% vs 49.2%) and agentic task benchmarks, Claude 3.7 Sonnet was clearly ahead. DeepSeek-R1 undercut it heavily on price ($0.55/M vs $15/M output) and was competitive on some math benchmarks. For production agentic workloads, the SWE-bench and TAU-bench gaps justified the cost premium at launch.
Why was it called 3.7 and not 4.0?
Anthropic's versioning at the time used minor increments for significant capability improvements within a generation. Claude 3.7 Sonnet was a major improvement over 3.5 Sonnet but shipped before Anthropic moved to the 4.x generation naming scheme with Claude Sonnet 4 (released mid-2025).
Sources:
- Claude 3.7 Sonnet and Claude Code - Anthropic
- Claude 3.7 Sonnet - HAL Leaderboard (Princeton)
- Claude 3.7 Sonnet benchmarks - automatio.ai
- Claude 3.7 Sonnet vs OpenAI o1 vs DeepSeek R1 - Vellum
- Model deprecations - Anthropic API Docs
- Claude 3.7 Sonnet extended thinking - Simon Willison
- Building with extended thinking - Claude Docs
- Evaluating Claude 3.7 Sonnet - W&B
Last verified: May 15, 2026
