Gemini 3 Deep Think

Google DeepMind's reasoning mode scores 84.6% on ARC-AGI-2, 3455 Codeforces Elo, and solves 18 previously unsolved research problems - outpacing Claude Opus 4.6 and GPT-5.2 on reasoning-heavy tasks.

Overview

Google DeepMind shipped a major upgrade to Gemini 3 Deep Think on February 12, 2026, and the benchmark results demand attention. The reasoning mode scored 84.6% on ARC-AGI-2 - verified by the ARC Prize Foundation - which is 15.8 percentage points above Claude Opus 4.6 (68.8%) and 31.7 points above GPT-5.2 (52.9%). It hit a 3,455 Elo on Codeforces, earning Legendary Grandmaster status. And it solved 18 previously unsolved research problems, including disproving a mathematical conjecture that had stood for a decade.

TL;DR

  • Best-in-class reasoning model for science, math, and competitive programming - 84.6% ARC-AGI-2, gold-medal IMO performance, Legendary Grandmaster on Codeforces
  • Built on Gemini 3 Pro with inference-time compute scaling - 1M-token context, priced at $2/$12 per million tokens through the Gemini API
  • Controls reasoning benchmarks but trails Claude Opus 4.6 on agentic enterprise tasks and general-purpose coding

Deep Think is not a separate model. It is a reasoning mode layered on top of Gemini 3 Pro that scales inference-time compute - giving the model more time to decompose problems, explore parallel solution paths, and verify its answers before responding. Think of it as trading latency for accuracy, and on hard problems, the tradeoff pays off substantially. The approach is philosophically similar to OpenAI's o1 reasoning models, but Google's implementation highlights scientific and mathematical domains specifically.

The competitive picture depends on what you need the model for. Deep Think crushes abstract reasoning and competitive programming. Gemini 3.1 Pro - which adds dynamic thinking by default - offers a better general-purpose package at the same price. And Claude Opus 4.6 still leads on enterprise knowledge work, agentic coding, and tool-augmented workflows by a wide margin. Deep Think is the specialist that wins when the problem is genuinely hard and domain-specific. See our overall LLM rankings for the broader competitive landscape.

Key Specifications

| Specification | Details |
|---|---|
| Provider | Google DeepMind |
| Model Family | Gemini 3 (reasoning mode) |
| Architecture | Transformer-based (inference-time compute scaling) |
| Parameters | Not disclosed (estimated ~30B active of 300-700B total) |
| Context Window | 1,000,000 tokens input / 64,000 tokens output |
| Input Price | $2.00/M tokens (<=200K), $4.00/M tokens (>200K) |
| Output Price | $12.00/M tokens (<=200K), $18.00/M tokens (>200K) |
| Release Date | February 12, 2026 |
| License | Proprietary |
| Input Modalities | Text, images, audio, video |
| Output Modality | Text |
| Thinking Mode | Deep Think (extended reasoning chains) |

Benchmark Performance

| Benchmark | Deep Think | Claude Opus 4.6 | GPT-5.2 | Gemini 3 Pro (standard) |
|---|---|---|---|---|
| ARC-AGI-2 (abstract reasoning) | 84.6% | 68.8% | 52.9% | 31.1% |
| ARC-AGI-1 | 96.0% | - | - | - |
| GPQA Diamond (PhD-level science) | 93.8% | 91.3% | 93.2% | 91.9% |
| Humanity's Last Exam (no tools) | 48.4% | - | - | - |
| Humanity's Last Exam (search + code) | 53.4% | 53.1% | 50.0% | - |
| Codeforces Elo | 3,455 | 2,352 | - | 2,512 |
| IMO 2025 | 81.5% (gold) | - | - | - |
| IPhO 2025 | 87.7% | - | - | - |
| IChO 2025 | 82.8% | - | - | - |
| MMMU-Pro (visual reasoning) | 81.5% | 77.3% | 80.4% | - |
| SWE-bench Verified (coding) | 76.2% | 80.8% | 80.0% | 76.2% |
| GDPval-AA Elo (enterprise tasks) | ~1,200 | 1,606 | 1,462 | - |

The benchmark story splits cleanly along task type. On abstract reasoning (ARC-AGI-2), scientific knowledge (GPQA Diamond), competitive programming (Codeforces), and olympiad-level math and science, Deep Think is the undisputed leader. The 53.5-point gap over standard Gemini 3 Pro on ARC-AGI-2 shows just how much inference-time compute scaling matters for truly hard problems.

But the model's advantage narrows sharply on practical engineering tasks. SWE-bench Verified - which measures ability to resolve real GitHub issues - shows Deep Think at 76.2% versus Claude Opus 4.6's 80.8%. And on GDPval-AA, the enterprise knowledge work benchmark, the gap is even wider. Deep Think's extended reasoning chains help with problems that have verifiable correct answers. They help less with the ambiguous, context-dependent work that makes up most professional workflows. For a breakdown of what these benchmarks actually measure, see our guide to understanding AI benchmarks.

Key Capabilities

Inference-Time Compute Scaling. Deep Think operates in three stages: problem decomposition (breaking queries into sub-problems), parallel solution search (exploring multiple approaches simultaneously), and verification (checking work before producing output). This isn't just "thinking longer" - the model constructs internal reasoning chains, assesses candidate solutions in parallel, and discards dead ends. On ARC-AGI-2, this approach turned a 31.1% score (standard Gemini 3 Pro) into an 84.6% - nearly tripling accuracy by spending more compute at inference time.
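Google hasn't published the mechanism behind Deep Think, but the decompose/search/verify loop described above resembles a generic best-of-n strategy with a verifier. The sketch below is purely illustrative: `propose` and `verify` are hypothetical stand-ins for sampling the model along different reasoning paths and scoring the candidates, not real Gemini API calls.

```python
import concurrent.futures

def propose(problem: int, seed: int) -> int:
    # Hypothetical candidate generator. A real system would sample the
    # model with a different reasoning trace per seed; here each seed
    # just yields a different stand-in "solution".
    return (problem + seed) % 7

def verify(problem: int, candidate: int) -> float:
    # Hypothetical verifier: scores a candidate, higher is better.
    # Stand-in scoring: reward candidates near the true answer.
    return -abs(candidate - problem % 7)

def deep_think(problem: int, n_paths: int = 8) -> int:
    # Stage 1 (decomposition) is elided. Stages 2-3: explore n_paths
    # candidate solutions in parallel, then keep whichever one the
    # verifier ranks highest, discarding the dead ends.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        candidates = list(pool.map(lambda s: propose(problem, s), range(n_paths)))
    return max(candidates, key=lambda c: verify(problem, c))
```

The key design point is that accuracy scales with `n_paths`: more parallel attempts cost more compute and latency per query but raise the chance that at least one candidate survives verification, which is the latency-for-accuracy trade described above.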

Scientific Research. The February upgrade was built in partnership with researchers to handle problems where there is no single correct answer and the data is messy. Deep Think solved 18 previously unsolved research problems and disproved a mathematical conjecture that had stood since 2015. Researchers have used it to identify logical flaws in peer-reviewed papers, optimize crystal growth parameters, and convert hand-drawn sketches into 3D-printable models. It scored gold-medal level on the 2025 International Math Olympiad and posted 87.7% on the International Physics Olympiad.

Competitive Programming. A Codeforces Elo of 3,455 puts Deep Think at Legendary Grandmaster level - a rating only a handful of human competitors have ever reached. For context, Claude Opus 4.6 sits at 2,352 and standard Gemini 3 Pro at 2,512. If your workload involves algorithmic problem-solving, constraint optimization, or mathematical proof generation, Deep Think is the current best option by a significant margin. Check our coding benchmarks leaderboard for full competitive programming rankings.

Pricing and Availability

Deep Think uses the same token pricing as Gemini 3 Pro through the API:

| Tier | Input | Output |
|---|---|---|
| Standard (<=200K tokens) | $2.00/M | $12.00/M |
| Extended (>200K tokens) | $4.00/M | $18.00/M |
| Batch API (50% discount) | $1.00/M | $6.00/M |
| Context Caching | $0.20/M | - |

Note that Deep Think produces significantly more thinking tokens than standard Gemini 3 Pro, which inflates effective output costs. A query that produces 500 output tokens in standard mode might produce 5,000-10,000 thinking tokens in Deep Think mode, all billed at output rates. Plan for 3-10x higher effective costs compared to standard Gemini 3 Pro depending on problem complexity.
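The arithmetic behind that overhead is easy to sketch. The helper below applies the pricing tiers from the table above and bills thinking tokens at the output rate; the function name and the example token counts are illustrative, not from any official SDK.

```python
def gemini_cost_usd(input_tokens: int, output_tokens: int,
                    thinking_tokens: int = 0) -> float:
    """Estimate one request's cost from the pricing table above.

    Thinking tokens are billed at the output rate, which is what
    drives Deep Think's higher effective cost.
    """
    extended = input_tokens > 200_000           # >200K context tier
    in_rate = 4.00 if extended else 2.00        # $ per million input tokens
    out_rate = 18.00 if extended else 12.00     # $ per million output tokens
    billed_output = output_tokens + thinking_tokens
    return (input_tokens * in_rate + billed_output * out_rate) / 1_000_000

# A 10K-token prompt with 500 visible output tokens:
base = gemini_cost_usd(10_000, 500)                          # standard mode
deep = gemini_cost_usd(10_000, 500, thinking_tokens=8_000)   # Deep Think
```

With these illustrative numbers, the Deep Think request costs $0.122 versus $0.026 for standard mode - about 4.7x, squarely inside the 3-10x range quoted above.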

Compared to competitors: Claude Opus 4.6 at $5/$25 per million tokens is 2.5x more expensive on input. GPT-5.2 at $2.50/$10 sits in similar territory. On raw token price, Deep Think is the cheapest reasoning model at the frontier - but the thinking token overhead closes that gap quickly on hard problems. See our cost efficiency leaderboard for quality-adjusted cost comparisons.

Access: Available in the Gemini app for Google AI Ultra subscribers ($125/3 months). API access is through the Gemini API (AI Studio) and Vertex AI, but currently limited to early access for select researchers, engineers, and enterprises. This isn't a general availability launch - you need to express interest and be approved for API use.

Strengths

  • Dominant reasoning benchmarks. 84.6% ARC-AGI-2 is 15.8 points above the next best model - a massive lead in abstract reasoning
  • Unmatched competitive programming. Codeforces 3,455 Elo (Legendary Grandmaster) is more than 1,100 points above Claude Opus 4.6
  • Scientific research utility. Solved 18 unsolved problems, gold-medal IMO performance, and strong scores across physics and chemistry olympiads
  • Cost-competitive base pricing. $2/$12 per million tokens matches Gemini 3 Pro and undercuts both Claude and GPT-5.2
  • 1M-token context window. Processes large research paper collections and codebases in a single pass
  • Multimodal inputs. Handles text, images, audio, and video natively

Weaknesses

  • Thinking token overhead. Extended reasoning chains inflate effective costs 3-10x above base pricing on complex queries
  • Enterprise task weakness. GDPval-AA score of ~1,200 trails Claude Opus 4.6 (1,606) significantly on business-critical work
  • Limited API access. Not generally available - requires early access approval for API use
  • Software engineering gap. 76.2% SWE-bench trails Claude (80.8%) on practical coding tasks
  • High latency. Inference-time scaling means response times measured in minutes, not seconds, for hard problems
  • No agentic coordination. Lacks native multi-agent capabilities like Claude's agent teams or Codex's sub-agents
  • Parameters undisclosed. No self-hosted or fine-tuning options available
