Gemini 3.1 Pro
Google DeepMind's Gemini 3.1 Pro leads on 13 of 16 benchmarks with 77.1% ARC-AGI-2, 94.3% GPQA Diamond, and a 1M-token context window at $2/M input.

Overview
Google DeepMind released Gemini 3.1 Pro on February 19, 2026, and the numbers back up the hype. The model posted leading scores on 13 of 16 benchmarks evaluated by Google, including 77.1% on ARC-AGI-2 - roughly 2.5 times the score of its predecessor, Gemini 3 Pro. That single number tells you this is not a minor point release. It is a generational leap in reasoning capability.
Built on a Transformer-based Mixture-of-Experts architecture, Gemini 3.1 Pro processes text, images, audio, video, and code across a 1M-token context window with up to 64K tokens of output. It ships with dynamic thinking by default, meaning the model allocates more internal reasoning to harder problems rather than producing immediate responses. This puts it squarely in the "thinking model" category alongside Claude Opus 4.6 and GPT-5.2, trading latency for accuracy on complex tasks.
The competitive picture is tight at the top. On Chatbot Arena, Gemini 3.1 Pro sits at roughly 1500 Elo versus Claude Opus 4.6's 1504 - essentially tied in human preference voting. But the benchmark story diverges sharply depending on the task category. Gemini dominates on reasoning and scientific knowledge. Claude holds the edge on software engineering and economically valuable enterprise tasks. Picking between them now requires knowing exactly what you need the model for, which is a healthy sign for the field. See our overall LLM rankings for the full picture.
Key Specifications
| Specification | Details |
|---|---|
| Provider | Google DeepMind |
| Model Family | Gemini |
| Architecture | Transformer-based Mixture of Experts |
| Parameters | Not disclosed |
| Context Window | 1,000,000 tokens input / 64,000 tokens output |
| Input Price | $2.00/M tokens (<=200K), $4.00/M tokens (>200K) |
| Output Price | $12.00/M tokens (<=200K), $18.00/M tokens (>200K) |
| Release Date | February 19, 2026 |
| License | Proprietary (API access only) |
| Input Modalities | Text, images, audio, video |
| Output Modality | Text |
| Dynamic Thinking | Enabled by default |
Benchmark Performance
| Benchmark | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.2 |
|---|---|---|---|
| ARC-AGI-2 (reasoning) | 77.1% | 68.8% | 52.9% |
| GPQA Diamond (science) | 94.3% | 91.3% | 92.4% |
| MMLU (knowledge) | 92.6% | 91.1% | 89.6% |
| HLE - no tools (reasoning) | 44.4% | 40.0% | 34.5% |
| SWE-Bench Verified (coding) | 80.6% | 80.8% | 80.0% |
| Terminal-Bench 2.0 (agentic) | 68.5% | 65.4% | 54.0% |
| AIME 2025 (math) | 100% | - | - |
| LiveCodeBench Pro (Elo) | 2887 | - | - |
| GDPval-AA (enterprise) | 1317 | 1606 | 1462 |
| MRCR v2 128K (long context) | 84.9% | - | - |
The headline stat is ARC-AGI-2 at 77.1%. This benchmark tests a model's ability to solve entirely novel logic patterns it has never seen, and Gemini 3.1 Pro's score represents a 2.5x improvement over its predecessor. On GPQA Diamond, a graduate-level science reasoning test, it hits 94.3% - the highest recorded score. It matches or exceeds competitors on most knowledge and reasoning benchmarks.
The cracks show on enterprise tasks. The GDPval-AA benchmark measures performance on economically valuable work like financial modeling, strategic planning, and business documentation. Gemini 3.1 Pro's Elo of 1317 trails Claude Opus 4.6 (1606) by a wide margin, and even Claude Sonnet 4.6 (1633) outscores it decisively. If your workload is business-critical document synthesis, the benchmarks point you toward Claude. For a deeper dive into what these numbers actually measure, see our guide to understanding AI benchmarks.
Key Capabilities
Gemini 3.1 Pro's strongest suit is deep reasoning across multiple modalities. It can ingest entire code repositories, lengthy research papers, hours of video, and complex multi-source datasets within its 1M-token context window, then reason across all of them simultaneously. The dynamic thinking system lets you tune reasoning depth via the `thinking_level` parameter, balancing quality against latency and cost depending on your use case.
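In the Gemini API, that depth is set per request through the generation config. A minimal sketch of what the request body looks like, assuming the field names `thinkingConfig`/`thinkingLevel` and the preview model ID `gemini-3.1-pro-preview` - both are assumptions, so check the API reference before relying on them:

```python
import json

# Hypothetical generateContent request body; the thinkingConfig field
# names and the model ID are assumptions, not confirmed API surface.
payload = {
    "model": "gemini-3.1-pro-preview",
    "contents": [
        {"role": "user", "parts": [{"text": "Summarize the attached repo."}]}
    ],
    "generationConfig": {
        # Higher levels spend more internal reasoning per answer,
        # trading latency and output cost for accuracy on hard problems.
        "thinkingConfig": {"thinkingLevel": "high"},
        "maxOutputTokens": 64000,  # the model's documented output ceiling
    },
}

print(json.dumps(payload, indent=2))
```

Dropping the level for latency-sensitive paths (autocomplete, classification, routing) is the main lever for keeping the thinking tax under control.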
Agentic performance is a highlight. On Terminal-Bench 2.0, which evaluates multi-step tool-use workflows, Gemini 3.1 Pro scored 68.5% - above both Claude Opus 4.6 (65.4%) and GPT-5.2 (54.0%). It also scored 69.2% on MCP Atlas, a tool coordination benchmark. Google has been pushing agentic capabilities hard, and the numbers suggest it is paying off. The model is available in GitHub Copilot and Google's own Gemini CLI for coding workflows.
Scientific and mathematical reasoning is where this model genuinely separates from the pack. A perfect 100% on AIME 2025 with code execution, 94.3% on GPQA Diamond, and 44.4% on HLE (a test designed to be too hard for current models) paint a clear picture: if your use case involves research synthesis, complex math, or domain-expert reasoning, Gemini 3.1 Pro is the current leader.
Pricing and Availability
Gemini 3.1 Pro uses tiered pricing based on prompt length:
| Tier | Input | Output |
|---|---|---|
| Standard (<=200K tokens) | $2.00/M | $12.00/M |
| Extended (>200K tokens) | $4.00/M | $18.00/M |
| Batch API (50% discount) | $1.00/M | $6.00/M |
| Context Caching | $0.20/M | - |
At the standard tier, Gemini 3.1 Pro is significantly cheaper than Claude Opus 4.6 ($15.00/$75.00 per M tokens) and GPT-5.2 ($10.00/$30.00 per M tokens). The Batch API halves costs for asynchronous workloads, and context caching drops the price of repeated input tokens to $0.20/M. For cost-conscious teams, this pricing makes Gemini 3.1 Pro the best value among frontier reasoning models. Check our cost efficiency leaderboard for full comparisons.
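The tiered schedule rewards keeping prompts under the 200K boundary. A small illustrative calculator, assuming (as with earlier Gemini Pro pricing) that the extended rate applies to the entire request once input exceeds 200K tokens, and that the 50% batch discount applies at both tiers:

```python
def gemini_31_pro_cost(input_tokens: int, output_tokens: int,
                       batch: bool = False) -> float:
    """Estimated USD cost for a single Gemini 3.1 Pro request.

    Assumes the extended rate covers the whole request once input
    exceeds 200K tokens, and that the 50% Batch API discount applies
    at both tiers - check the live pricing page before budgeting.
    """
    if input_tokens <= 200_000:
        in_rate, out_rate = 2.00, 12.00   # standard tier, $ per M tokens
    else:
        in_rate, out_rate = 4.00, 18.00   # extended tier, $ per M tokens
    if batch:
        in_rate, out_rate = in_rate / 2, out_rate / 2
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A 150K-token prompt with an 8K-token answer lands in the standard tier:
print(f"${gemini_31_pro_cost(150_000, 8_000):.3f}")  # -> $0.396
```

Note how crossing the boundary roughly doubles the bill for the same work: the same request with a 250K-token prompt costs $1.144 rather than the $0.496 you would pay if only the overage were charged at the higher rate.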
The model is available in preview through the Gemini API via AI Studio, Vertex AI, Gemini Enterprise, Gemini CLI, Android Studio, and NotebookLM. It is also in public preview in GitHub Copilot. Google Gemini app users with AI Pro and Ultra plans get higher usage limits. There is no free tier for API access.
Note: "preview" status means Google is still validating agentic workflows before general availability. Developers have reported capacity issues including 99-hour lockouts and phantom quota drain during the preview period, so plan accordingly if you are building production systems.
Strengths
- Dominant reasoning benchmarks. 77.1% ARC-AGI-2 and 94.3% GPQA Diamond are the highest scores recorded by any model, with clear leads over Claude Opus 4.6 and GPT-5.2
- Best price-to-performance ratio. At $2/$12 per M tokens, it undercuts both Opus and GPT-5.2 while matching or exceeding them on most benchmarks
- 1M-token context window. Processes entire codebases, research paper collections, and long video in a single pass
- Strong agentic tool use. Leading scores on Terminal-Bench 2.0 and MCP Atlas for multi-step autonomous workflows
- Multimodal native. Text, image, audio, and video inputs processed natively, not bolted on
Weaknesses
- Enterprise task performance. GDPval-AA score of 1317 trails Claude significantly (1606), suggesting weakness on business-critical document synthesis and strategic analysis
- Long-context degradation. Reports of accuracy degradation beyond 120-150K tokens, with fabricated details appearing at extreme context lengths despite the 1M-token window
- Preview stability issues. Capacity constraints, quota bugs, and broken tool calling reported by developers during the preview rollout
- Structured output inconsistency. Only 84% of responses were schema-valid JSON without retries in testing - a problem for production pipelines that require strict output formats
- Latency at scale. Time to first token averaging 29 seconds, and multi-file queries with images taking 7-9 seconds per response via API
Related Coverage
- Google Launches Gemini 3.1 Pro, Claims Top Spot on 13 of 16 Benchmarks - Launch coverage with benchmark details
- Google's Gemini 3.1 Pro Doubles Reasoning Performance - Deep dive on the ARC-AGI-2 breakthrough
- Gemini 3.1 Pro Is the Best Model You Can't Use - Real-world availability and quota issues
- Gemini 3 Pro Review - Our review of the predecessor model
- Gemini 3 Deep Think Review - Review of Google's reasoning-focused variant
- Overall LLM Rankings: February 2026 - Where Gemini 3.1 Pro fits in the competitive landscape
- ChatGPT vs Claude vs Gemini - Head-to-head assistant comparison
- How to Choose the Right LLM in 2026 - Decision framework for model selection
Sources
- Gemini 3.1 Pro Model Card - Google DeepMind
- Gemini 3.1 Pro Announcement - Google Blog
- Gemini Developer API Pricing - Google AI
- Gemini 3.1 Pro Intelligence & Price Analysis - Artificial Analysis
- Behind Gemini 3.1 Pro's '13 out of 16 Wins' - SmartScope
- Google Releases Gemini 3.1 Pro Benchmarks - OfficeChai
- Gemini 3.1 Pro Review - Medium
- Google's Gemini 3.1 Pro Is Mostly Great - The New Stack
- Gemini 3.1 Pro on Vertex AI - Google Cloud
