Gemini 3 Deep Think

Google DeepMind's reasoning mode scores 84.6% on ARC-AGI-2, 3455 Codeforces Elo, and solves 18 previously unsolved research problems - outpacing Claude Opus 4.6 and GPT-5.2 on reasoning-heavy tasks.

Overview

Google DeepMind shipped a major upgrade to Gemini 3 Deep Think on February 12, 2026, and the benchmark results demand attention. The reasoning mode scored 84.6% on ARC-AGI-2 - verified by the ARC Prize Foundation - which is 15.8 percentage points above Claude Opus 4.6 (68.8%) and 31.7 points above GPT-5.2 (52.9%). It hit a 3,455 Elo on Codeforces, earning Legendary Grandmaster status. And it solved 18 previously unsolved research problems, including disproving a mathematical conjecture that had stood for a decade.

TL;DR

  • Best-in-class reasoning model for science, math, and competitive programming - 84.6% ARC-AGI-2, gold-medal IMO performance, Legendary Grandmaster on Codeforces
  • Built on Gemini 3 Pro with inference-time compute scaling - 1M-token context, priced at $2/$12 per million tokens through the Gemini API
  • Controls reasoning benchmarks but trails Claude Opus 4.6 on agentic enterprise tasks and general-purpose coding

Deep Think is not a separate model. It is a reasoning mode layered on top of Gemini 3 Pro that scales inference-time compute - giving the model more time to decompose problems, explore parallel solution paths, and verify its answers before responding. Think of it as trading latency for accuracy, and on hard problems, the tradeoff pays off substantially. The approach is philosophically similar to OpenAI's o1 reasoning models, but Google's implementation highlights scientific and mathematical domains specifically.

The competitive picture depends on what you need the model for. Deep Think crushes abstract reasoning and competitive programming. Gemini 3.1 Pro - which adds dynamic thinking by default - offers a better general-purpose package at the same price. And Claude Opus 4.6 still leads on enterprise knowledge work, agentic coding, and tool-augmented workflows by a wide margin. Deep Think is the specialist that wins when the problem is genuinely hard and domain-specific. See our overall LLM rankings for the broader competitive landscape.

Key Specifications

| Specification | Details |
|---|---|
| Provider | Google DeepMind |
| Model Family | Gemini 3 (reasoning mode) |
| Architecture | Transformer-based (inference-time compute scaling) |
| Parameters | Not disclosed (estimated ~30B active of 300-700B total) |
| Context Window | 1,000,000 tokens input / 64,000 tokens output |
| Input Price | $2.00/M tokens (<=200K), $4.00/M tokens (>200K) |
| Output Price | $12.00/M tokens (<=200K), $18.00/M tokens (>200K) |
| Release Date | February 12, 2026 |
| License | Proprietary |
| Input Modalities | Text, images, audio, video |
| Output Modality | Text |
| Thinking Mode | Deep Think (extended reasoning chains) |

Benchmark Performance

| Benchmark | Deep Think | Claude Opus 4.6 | GPT-5.2 | Gemini 3 Pro (standard) |
|---|---|---|---|---|
| ARC-AGI-2 (abstract reasoning) | 84.6% | 68.8% | 52.9% | 31.1% |
| ARC-AGI-1 | 96.0% | - | - | - |
| GPQA Diamond (PhD-level science) | 93.8% | 91.3% | 93.2% | 91.9% |
| Humanity's Last Exam (no tools) | 48.4% | - | - | - |
| Humanity's Last Exam (search + code) | 53.4% | 53.1% | 50.0% | - |
| Codeforces Elo | 3,455 | 2,352 | - | 2,512 |
| IMO 2025 | 81.5% (gold) | - | - | - |
| IPhO 2025 | 87.7% | - | - | - |
| IChO 2025 | 82.8% | - | - | - |
| MMMU-Pro (visual reasoning) | 81.5% | 77.3% | 80.4% | - |
| SWE-bench Verified (coding) | 76.2% | 80.8% | 80.0% | 76.2% |
| GDPval-AA Elo (enterprise tasks) | ~1,200 | 1,606 | 1,462 | - |

The benchmark story splits cleanly along task type. On abstract reasoning (ARC-AGI-2), scientific knowledge (GPQA Diamond), competitive programming (Codeforces), and olympiad-level math and science, Deep Think is the undisputed leader. The 53.5-point gap over standard Gemini 3 Pro on ARC-AGI-2 shows just how much inference-time compute scaling matters for truly hard problems.

But the model's advantage narrows sharply on practical engineering tasks. SWE-bench Verified - which measures ability to resolve real GitHub issues - shows Deep Think at 76.2% versus Claude Opus 4.6's 80.8%. And on GDPval-AA, the enterprise knowledge work benchmark, the gap is even wider. Deep Think's extended reasoning chains help with problems that have verifiable correct answers. They help less with the ambiguous, context-dependent work that makes up most professional workflows. For a breakdown of what these benchmarks actually measure, see our guide to understanding AI benchmarks.

Key Capabilities

Inference-Time Compute Scaling. Deep Think operates in three stages: problem decomposition (breaking queries into sub-problems), parallel solution search (exploring multiple approaches simultaneously), and verification (checking work before producing output). This isn't just "thinking longer" - the model constructs internal reasoning chains, assesses candidate solutions in parallel, and discards dead ends. On ARC-AGI-2, this approach turned a 31.1% score (standard Gemini 3 Pro) into an 84.6% - nearly tripling accuracy by spending more compute at inference time.
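Google hasn't published the mechanism behind Deep Think, but the decompose/search/verify loop described above resembles a generic best-of-n strategy with a verifier. The sketch below is purely illustrative: `propose` and `verify` are hypothetical stand-ins for sampling the model along different reasoning paths and scoring the candidates, not real Gemini API calls.

```python
import concurrent.futures

def propose(problem: int, seed: int) -> int:
    # Hypothetical candidate generator. A real system would sample the
    # model with a different reasoning trace per seed; here each seed
    # just yields a different stand-in "solution".
    return (problem + seed) % 7

def verify(problem: int, candidate: int) -> float:
    # Hypothetical verifier: scores a candidate, higher is better.
    # Stand-in scoring: reward candidates near the true answer.
    return -abs(candidate - problem % 7)

def deep_think(problem: int, n_paths: int = 8) -> int:
    # Stage 1 (decomposition) is elided. Stages 2-3: explore n_paths
    # candidate solutions in parallel, then keep whichever one the
    # verifier ranks highest, discarding the dead ends.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        candidates = list(pool.map(lambda s: propose(problem, s), range(n_paths)))
    return max(candidates, key=lambda c: verify(problem, c))
```

The key design point is that accuracy scales with `n_paths`: more parallel attempts cost more compute and latency per query but raise the chance that at least one candidate survives verification, which is the latency-for-accuracy trade described above.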

Scientific Research. The February upgrade was built in partnership with researchers to handle problems where there is no single correct answer and the data is messy. Deep Think solved 18 previously unsolved research problems and disproved a mathematical conjecture that had stood since 2015. Researchers have used it to identify logical flaws in peer-reviewed papers, optimize crystal growth parameters, and convert hand-drawn sketches into 3D-printable models. It scored gold-medal level on the 2025 International Math Olympiad and posted 87.7% on the International Physics Olympiad.

Competitive Programming. A Codeforces Elo of 3,455 puts Deep Think at Legendary Grandmaster level - a rating only a handful of human competitors have ever reached. For context, Claude Opus 4.6 sits at 2,352 and standard Gemini 3 Pro at 2,512. If your workload involves algorithmic problem-solving, constraint optimization, or mathematical proof generation, Deep Think is the current best option by a significant margin. Check our coding benchmarks leaderboard for full competitive programming rankings.

Pricing and Availability

Deep Think uses the same token pricing as Gemini 3 Pro through the API:

| Tier | Input | Output |
|---|---|---|
| Standard (<=200K tokens) | $2.00/M | $12.00/M |
| Extended (>200K tokens) | $4.00/M | $18.00/M |
| Batch API (50% discount) | $1.00/M | $6.00/M |
| Context Caching | $0.20/M | - |

Note that Deep Think produces significantly more thinking tokens than standard Gemini 3 Pro, which inflates effective output costs. A query that produces 500 output tokens in standard mode might produce 5,000-10,000 thinking tokens in Deep Think mode, all billed at output rates. Plan for 3-10x higher effective costs compared to standard Gemini 3 Pro depending on problem complexity.
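The arithmetic behind that overhead is easy to sketch. The helper below applies the pricing tiers from the table above and bills thinking tokens at the output rate; the function name and the example token counts are illustrative, not from any official SDK.

```python
def gemini_cost_usd(input_tokens: int, output_tokens: int,
                    thinking_tokens: int = 0) -> float:
    """Estimate one request's cost from the pricing table above.

    Thinking tokens are billed at the output rate, which is what
    drives Deep Think's higher effective cost.
    """
    extended = input_tokens > 200_000           # >200K context tier
    in_rate = 4.00 if extended else 2.00        # $ per million input tokens
    out_rate = 18.00 if extended else 12.00     # $ per million output tokens
    billed_output = output_tokens + thinking_tokens
    return (input_tokens * in_rate + billed_output * out_rate) / 1_000_000

# A 10K-token prompt with 500 visible output tokens:
base = gemini_cost_usd(10_000, 500)                          # standard mode
deep = gemini_cost_usd(10_000, 500, thinking_tokens=8_000)   # Deep Think
```

With these illustrative numbers, the Deep Think request costs $0.122 versus $0.026 for standard mode - about 4.7x, squarely inside the 3-10x range quoted above.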

Compared to competitors: Claude Opus 4.6 at $5/$25 per million tokens is 2.5x more expensive on input. GPT-5.2 at $2.50/$10 sits in similar territory. On raw token price, Deep Think is the cheapest reasoning model at the frontier - but the thinking token overhead closes that gap quickly on hard problems. See our cost efficiency leaderboard for quality-adjusted cost comparisons.

Access: Available in the Gemini app for Google AI Ultra subscribers ($125/3 months). API access is through the Gemini API (AI Studio) and Vertex AI, but currently limited to early access for select researchers, engineers, and enterprises. This isn't a general availability launch - you need to express interest and be approved for API use.

Strengths

  • Dominant reasoning benchmarks. 84.6% ARC-AGI-2 is 15.8 points above the next best model - a massive lead in abstract reasoning
  • Unmatched competitive programming. Codeforces 3,455 Elo (Legendary Grandmaster) is more than 1,100 points above Claude Opus 4.6
  • Scientific research utility. Solved 18 unsolved problems, gold-medal IMO performance, and strong scores across physics and chemistry olympiads
  • Cost-competitive base pricing. $2/$12 per million tokens matches Gemini 3 Pro and undercuts both Claude and GPT-5.2
  • 1M-token context window. Processes large research paper collections and codebases in a single pass
  • Multimodal inputs. Handles text, images, audio, and video natively

Weaknesses

  • Thinking token overhead. Extended reasoning chains inflate effective costs 3-10x above base pricing on complex queries
  • Enterprise task weakness. GDPval-AA score of ~1,200 trails Claude Opus 4.6 (1,606) significantly on business-critical work
  • Limited API access. Not generally available - requires early access approval for API use
  • Software engineering gap. 76.2% SWE-bench trails Claude (80.8%) on practical coding tasks
  • High latency. Inference-time scaling means response times measured in minutes, not seconds, for hard problems
  • No agentic coordination. Lacks native multi-agent capabilities like Claude's agent teams or Codex's sub-agents
  • Parameters undisclosed. No self-hosted or fine-tuning options available
