Gemini 3 Deep Think
Google DeepMind's reasoning mode scores 84.6% on ARC-AGI-2, posts a 3,455 Codeforces Elo, and solves 18 previously unsolved research problems - outpacing Claude Opus 4.6 and GPT-5.2 on reasoning-heavy tasks.

Overview
Google DeepMind shipped a major upgrade to Gemini 3 Deep Think on February 12, 2026, and the benchmark results demand attention. The reasoning mode scored 84.6% on ARC-AGI-2 - verified by the ARC Prize Foundation - which is 15.8 percentage points above Claude Opus 4.6 (68.8%) and 31.7 points above GPT-5.2 (52.9%). It hit a 3,455 Elo on Codeforces, earning Legendary Grandmaster status. And it solved 18 previously unsolved research problems, including disproving a mathematical conjecture that had stood for a decade.
TL;DR
- Best-in-class reasoning model for science, math, and competitive programming - 84.6% ARC-AGI-2, gold-medal IMO performance, Legendary Grandmaster on Codeforces
- Built on Gemini 3 Pro with inference-time compute scaling - 1M-token context, priced at $2/$12 per million tokens through the Gemini API
- Dominates reasoning benchmarks but trails Claude Opus 4.6 on agentic enterprise tasks and general-purpose coding
Deep Think is not a separate model. It is a reasoning mode layered on top of Gemini 3 Pro that scales inference-time compute - giving the model more time to decompose problems, explore parallel solution paths, and verify its answers before responding. Think of it as trading latency for accuracy, and on hard problems, the tradeoff pays off substantially. The approach is philosophically similar to OpenAI's o1 reasoning models, but Google's implementation is tuned specifically for scientific and mathematical domains.
The competitive picture depends on what you need the model for. Deep Think crushes abstract reasoning and competitive programming. Gemini 3.1 Pro - which adds dynamic thinking by default - offers a better general-purpose package at the same price. And Claude Opus 4.6 still leads on enterprise knowledge work, agentic coding, and tool-augmented workflows by a wide margin. Deep Think is the specialist that wins when the problem is genuinely hard and domain-specific. See our overall LLM rankings for the broader competitive landscape.
Key Specifications
| Specification | Details |
|---|---|
| Provider | Google DeepMind |
| Model Family | Gemini 3 (reasoning mode) |
| Architecture | Transformer-based (inference-time compute scaling) |
| Parameters | Not disclosed (estimated ~30B active of 300-700B total) |
| Context Window | 1,000,000 tokens input / 64,000 tokens output |
| Input Price | $2.00/M tokens (<=200K), $4.00/M tokens (>200K) |
| Output Price | $12.00/M tokens (<=200K), $18.00/M tokens (>200K) |
| Release Date | February 12, 2026 |
| License | Proprietary |
| Input Modalities | Text, images, audio, video |
| Output Modality | Text |
| Thinking Mode | Deep Think (extended reasoning chains) |
Benchmark Performance
| Benchmark | Deep Think | Claude Opus 4.6 | GPT-5.2 | Gemini 3 Pro (standard) |
|---|---|---|---|---|
| ARC-AGI-2 (abstract reasoning) | 84.6% | 68.8% | 52.9% | 31.1% |
| ARC-AGI-1 | 96.0% | - | - | - |
| GPQA Diamond (PhD-level science) | 93.8% | 91.3% | 93.2% | 91.9% |
| Humanity's Last Exam (no tools) | 48.4% | - | - | - |
| Humanity's Last Exam (search + code) | 53.4% | 53.1% | 50.0% | - |
| Codeforces Elo | 3,455 | 2,352 | - | 2,512 |
| IMO 2025 | 81.5% (gold) | - | - | - |
| IPhO 2025 | 87.7% | - | - | - |
| IChO 2025 | 82.8% | - | - | - |
| MMMU-Pro (visual reasoning) | 81.5% | 77.3% | 80.4% | - |
| SWE-bench Verified (coding) | 76.2% | 80.8% | 80.0% | 76.2% |
| GDPval-AA Elo (enterprise tasks) | ~1,200 | 1,606 | 1,462 | - |
The benchmark story splits cleanly along task type. On abstract reasoning (ARC-AGI-2), scientific knowledge (GPQA Diamond), competitive programming (Codeforces), and olympiad-level math and science, Deep Think is the undisputed leader. The 53.5-point gap over standard Gemini 3 Pro on ARC-AGI-2 shows just how much inference-time compute scaling matters for truly hard problems.
But the model's advantage narrows sharply on practical engineering tasks. SWE-bench Verified - which measures ability to resolve real GitHub issues - shows Deep Think at 76.2% versus Claude Opus 4.6's 80.8%. And on GDPval-AA, the enterprise knowledge work benchmark, the gap is even wider. Deep Think's extended reasoning chains help with problems that have verifiable correct answers. They help less with the ambiguous, context-dependent work that makes up most professional workflows. For a breakdown of what these benchmarks actually measure, see our guide to understanding AI benchmarks.
Key Capabilities
Inference-Time Compute Scaling. Deep Think operates in three stages: problem decomposition (breaking queries into sub-problems), parallel solution search (exploring multiple approaches simultaneously), and verification (checking work before producing output). This isn't just "thinking longer" - the model constructs internal reasoning chains, assesses candidate solutions in parallel, and discards dead ends. On ARC-AGI-2, this approach turned standard Gemini 3 Pro's 31.1% score into 84.6% - nearly tripling accuracy by spending more compute at inference time.
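The general pattern can be illustrated with a toy loop. This is purely a sketch of the best-of-N-with-verification idea - Deep Think's actual internals are not public, and `propose`, `verify`, and `solve` are hypothetical stand-ins:

```python
# Toy sketch of explore-then-verify inference-time scaling.
# Not Deep Think's actual mechanism: these functions are illustrative only.

def propose(problem, path_id):
    """One candidate solution path: here, a guess with a path-dependent error."""
    return sum(problem) + (path_id % 5) - 2  # only paths with path_id % 5 == 2 are exact

def verify(problem, candidate):
    """Verification stage: check a candidate against ground truth."""
    return candidate == sum(problem)

def solve(problem, n_paths):
    """Spend more compute (more paths) to raise the odds of a verified answer."""
    candidates = [propose(problem, i) for i in range(n_paths)]
    for c in candidates:
        if verify(problem, c):
            return c, True        # verified answer found
    return candidates[0], False   # fall back to an unverified guess

# One path fails verification; eight paths include a correct, verified answer.
print(solve([3, 7, 11], n_paths=1))  # (19, False)
print(solve([3, 7, 11], n_paths=8))  # (21, True)
```

This also hints at why the gains concentrate in domains like math and competitive programming: extra solution paths only help when a verifier can reliably pick out the correct one.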
Scientific Research. The February upgrade was built in partnership with researchers to handle problems where there's no single correct answer and data is messy. Deep Think solved 18 previously unsolved research problems and disproved a mathematical conjecture that had stood since 2015. Researchers have used it to identify logical flaws in peer-reviewed papers, optimize crystal growth parameters, and convert hand-drawn sketches into 3D-printable models. It scored gold-medal level on the 2025 International Math Olympiad and posted 87.7% on the International Physics Olympiad.
Competitive Programming. A Codeforces Elo of 3,455 puts Deep Think at Legendary Grandmaster level - above most human competitive programmers. For context, Claude Opus 4.6 sits at 2,352 and standard Gemini 3 Pro at 2,512. If your workload involves algorithmic problem-solving, constraint optimization, or mathematical proof generation, Deep Think is the current best option by a significant margin. Check our coding benchmarks leaderboard for full competitive programming rankings.
Pricing and Availability
Deep Think uses the same token pricing as Gemini 3 Pro through the API:
| Tier | Input | Output |
|---|---|---|
| Standard (<=200K tokens) | $2.00/M | $12.00/M |
| Extended (>200K tokens) | $4.00/M | $18.00/M |
| Batch API (50% discount) | $1.00/M | $6.00/M |
| Context Caching | $0.20/M | - |
Note that Deep Think produces significantly more thinking tokens than standard Gemini 3 Pro, which inflates effective output costs. A query that produces 500 output tokens in standard mode might produce 5,000-10,000 thinking tokens in Deep Think mode, all billed at output rates. Plan for 3-10x higher effective costs compared to standard Gemini 3 Pro depending on problem complexity.
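A quick back-of-envelope estimate using the rates above shows how thinking tokens move effective cost. The token counts here are the illustrative figures from this section, not measurements, and `request_cost` is a hypothetical helper:

```python
# Back-of-envelope cost estimator using the published Deep Think rates.
# Tier cutoff and prices come from the pricing table above; the thinking-token
# count is the illustrative range given in the text, not a measured value.

def request_cost(input_tokens, output_tokens, thinking_tokens=0):
    """Cost in USD for one request. Thinking tokens bill at output rates."""
    extended = input_tokens > 200_000        # >200K context uses the higher tier
    in_rate = 4.00 if extended else 2.00     # $ per million input tokens
    out_rate = 18.00 if extended else 12.00  # $ per million output tokens
    billable_output = output_tokens + thinking_tokens
    return (input_tokens * in_rate + billable_output * out_rate) / 1_000_000

# Same query: standard mode vs. Deep Think emitting 5,000 thinking tokens.
standard = request_cost(10_000, 500)                         # $0.026
deep_think = request_cost(10_000, 500, thinking_tokens=5_000)  # $0.086
multiplier = deep_think / standard  # roughly 3.3x, the low end of the 3-10x range
```

At the 10,000-thinking-token end of the range the same arithmetic lands near 6x, which is why per-query cost estimates for Deep Think should be driven by expected thinking tokens, not the sticker price.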
Compared to competitors: Claude Opus 4.6 at $5/$25 per million tokens is 2.5x more expensive on input. GPT-5.2 at $2.50/$10 sits in similar territory. On input price, Deep Think is the cheapest reasoning model at the frontier - though GPT-5.2's output rate is slightly lower, and the thinking token overhead closes the gap quickly on hard problems. See our cost efficiency leaderboard for quality-adjusted cost comparisons.
Access: Available in the Gemini app for Google AI Ultra subscribers ($125 for three months). API access is through the Gemini API (AI Studio) and Vertex AI, but currently limited to early access for select researchers, engineers, and enterprises. This isn't a general availability launch - you need to express interest and be approved for API use.
Strengths
- Dominant reasoning benchmarks. 84.6% ARC-AGI-2 is 15.8 points above the next best model - a massive lead in abstract reasoning
- Unmatched competitive programming. Codeforces 3,455 Elo (Legendary Grandmaster) is more than 1,100 points above Claude Opus 4.6
- Scientific research utility. Solved 18 unsolved problems, gold-medal IMO performance, and strong scores across physics and chemistry olympiads
- Cost-competitive base pricing. $2/$12 per million tokens matches Gemini 3 Pro and undercuts both Claude and GPT-5.2
- 1M-token context window. Processes large research paper collections and codebases in a single pass
- Multimodal inputs. Handles text, images, audio, and video natively
Weaknesses
- Thinking token overhead. Extended reasoning chains inflate effective costs 3-10x above base pricing on complex queries
- Enterprise task weakness. GDPval-AA score of ~1,200 trails Claude Opus 4.6 (1,606) significantly on business-critical work
- Limited API access. Not generally available - requires early access approval for API use
- Software engineering gap. 76.2% SWE-bench trails Claude (80.8%) on practical coding tasks
- High latency. Inference-time scaling means response times measured in minutes, not seconds, for hard problems
- No agentic coordination. Lacks native multi-agent capabilities like Claude's agent teams or Codex's sub-agents
- Parameters undisclosed. No self-hosted or fine-tuning options available
Related Coverage
- Gemini 3 Deep Think Review - Our full hands-on review of the reasoning mode
- Gemini 3.1 Pro - The general-purpose sibling model with dynamic thinking
- Claude Opus 4.6 - The leading model for enterprise and agentic tasks
- GPT-5.3 Codex - OpenAI's agentic coding model
- Google Launches Gemini 3.1 Pro - Latest Gemini Pro release
- Gemini 3.1 Pro Doubles Reasoning Performance - ARC-AGI-2 progress across the Gemini family
- Overall LLM Rankings: February 2026 - Full competitive landscape
- ChatGPT vs Claude vs Gemini - Head-to-head platform comparison
- How to Choose the Right LLM in 2026 - Decision framework for model selection
- Reasoning Benchmarks Leaderboard - Where Deep Think ranks on reasoning tasks
Sources
- Gemini 3 Deep Think: Advancing science, research and engineering - Google Blog
- Gemini 3 Deep Think gets 'major upgrade' - 9to5Google
- Google Gemini 3 Deep Think Beats Opus 4.6 and GPT-5.2 - WinBuzzer
- Gemini 3 Deep Think: Reasoning Benchmarks and Complete Guide - Digital Applied
- Gemini Developer API Pricing - Google AI
- Gemini 3 Deep Think - Google DeepMind
- Gemini 3 Deep Think Complete Guide - NxCode
- Google's Gemini 3 Deep Think update pushes boundaries of AI reasoning - Chrome Unboxed
