Gemini 3.1 Pro
Google DeepMind's Gemini 3.1 Pro leads on 13 of 16 benchmarks with 77.1% ARC-AGI-2, 94.3% GPQA Diamond, and a 1M-token context window at $2/M input.

Overview
Google DeepMind released Gemini 3.1 Pro on February 19, 2026, and the numbers back up the hype. The model posted leading scores on 13 of 16 benchmarks evaluated by Google, including 77.1% on ARC-AGI-2 - roughly 2.5 times the score of its predecessor, Gemini 3 Pro. That single number tells you this is not a minor point release. It is a generational leap in reasoning capability.
Built on a Transformer-based Mixture-of-Experts architecture, Gemini 3.1 Pro processes text, images, audio, video, and code across a 1M-token context window with up to 64K tokens of output. It ships with dynamic thinking by default, meaning the model allocates more internal reasoning to harder problems rather than producing immediate responses. This puts it squarely in the "thinking model" category alongside Claude Opus 4.6 and GPT-5.2, trading latency for accuracy on complex tasks.
The competitive picture is tight at the top. On Chatbot Arena, Gemini 3.1 Pro sits at roughly 1500 Elo versus Claude Opus 4.6's 1504 - essentially tied in human preference voting. But the benchmark story diverges sharply depending on the task category. Gemini dominates on reasoning and scientific knowledge. Claude holds the edge on software engineering and economically valuable enterprise tasks. Picking between them now requires knowing exactly what you need the model for, which is a healthy sign for the field. See our overall LLM rankings for the full picture.
Key Specifications
| Specification | Details |
|---|---|
| Provider | Google DeepMind |
| Model Family | Gemini |
| Architecture | Transformer-based Mixture of Experts |
| Parameters | Not disclosed |
| Context Window | 1,000,000 tokens input / 64,000 tokens output |
| Input Price | $2.00/M tokens (<=200K), $4.00/M tokens (>200K) |
| Output Price | $12.00/M tokens (<=200K), $18.00/M tokens (>200K) |
| Release Date | February 19, 2026 |
| License | Proprietary (API access only) |
| Input Modalities | Text, images, audio, video |
| Output Modality | Text |
| Dynamic Thinking | Enabled by default |
Benchmark Performance
| Benchmark | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.2 |
|---|---|---|---|
| ARC-AGI-2 (reasoning) | 77.1% | 68.8% | 52.9% |
| GPQA Diamond (science) | 94.3% | 91.3% | 92.4% |
| MMLU (knowledge) | 92.6% | 91.1% | 89.6% |
| HLE - no tools (reasoning) | 44.4% | 40.0% | 34.5% |
| SWE-Bench Verified (coding) | 80.6% | 80.8% | 80.0% |
| Terminal-Bench 2.0 (agentic) | 68.5% | 65.4% | 54.0% |
| AIME 2025 (math) | 100% | - | - |
| LiveCodeBench Pro (Elo) | 2887 | - | - |
| GDPval-AA (enterprise) | 1317 | 1606 | 1462 |
| MRCR v2 128K (long context) | 84.9% | - | - |
The headline stat is ARC-AGI-2 at 77.1%. This benchmark tests a model's ability to solve entirely novel logic patterns it has never seen, and Gemini 3.1 Pro's score represents a 2.5x improvement over its predecessor. On GPQA Diamond, a graduate-level science reasoning test, it hits 94.3% - the highest recorded score. It matches or exceeds competitors on most knowledge and reasoning benchmarks.
The cracks show on enterprise tasks. The GDPval-AA benchmark measures performance on economically valuable work like financial modeling, strategic planning, and business documentation. Gemini 3.1 Pro's Elo of 1317 trails Claude Opus 4.6 (1606) by a wide margin, and even Claude Sonnet 4.6 (1633) outscores it decisively. If your workload is business-critical document synthesis, the benchmarks point you toward Claude. For a deeper dive into what these numbers actually measure, see our guide to understanding AI benchmarks.
Key Capabilities
Gemini 3.1 Pro's strongest suit is deep reasoning across multiple modalities. It can ingest entire code repositories, lengthy research papers, hours of video, and complex multi-source datasets within its 1M-token context window, then reason across all of them simultaneously. The dynamic thinking system lets you tune reasoning depth via the `thinking_level` parameter, balancing quality against latency and cost depending on your use case.
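In the Gemini API, that depth is set per request through the generation config. A minimal sketch of what the request body looks like, assuming the field names `thinkingConfig`/`thinkingLevel` and the preview model ID `gemini-3.1-pro-preview` - both are assumptions, so check the API reference before relying on them:

```python
import json

# Hypothetical generateContent request body; the thinkingConfig field
# names and the model ID are assumptions, not confirmed API surface.
payload = {
    "model": "gemini-3.1-pro-preview",
    "contents": [
        {"role": "user", "parts": [{"text": "Summarize the attached repo."}]}
    ],
    "generationConfig": {
        # Higher levels spend more internal reasoning per answer,
        # trading latency and output cost for accuracy on hard problems.
        "thinkingConfig": {"thinkingLevel": "high"},
        "maxOutputTokens": 64000,  # the model's documented output ceiling
    },
}

print(json.dumps(payload, indent=2))
```

Dropping the level for latency-sensitive paths (autocomplete, classification, routing) is the main lever for keeping the thinking tax under control.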
Agentic performance is a highlight. On Terminal-Bench 2.0, which evaluates multi-step tool-use workflows, Gemini 3.1 Pro scored 68.5% - above both Claude Opus 4.6 (65.4%) and GPT-5.2 (54.0%). It also scored 69.2% on MCP Atlas, a tool coordination benchmark. Google has been pushing agentic capabilities hard, and the numbers suggest it is paying off. The model is available in GitHub Copilot and Google's own Gemini CLI for coding workflows.
Scientific and mathematical reasoning is where this model genuinely separates from the pack. A perfect 100% on AIME 2025 with code execution, 94.3% on GPQA Diamond, and 44.4% on HLE (a test designed to be too hard for current models) paint a clear picture: if your use case involves research synthesis, complex math, or domain-expert reasoning, Gemini 3.1 Pro is the current leader.
Pricing and Availability
Gemini 3.1 Pro uses tiered pricing based on prompt length:
| Tier | Input | Output |
|---|---|---|
| Standard (<=200K tokens) | $2.00/M | $12.00/M |
| Extended (>200K tokens) | $4.00/M | $18.00/M |
| Batch API (50% discount) | $1.00/M | $6.00/M |
| Context Caching | $0.20/M | - |
At the standard tier, Gemini 3.1 Pro is significantly cheaper than Claude Opus 4.6 ($15.00/$75.00 per M tokens) and GPT-5.2 ($10.00/$30.00 per M tokens). The Batch API halves costs for asynchronous workloads, and context caching drops the price of repeated input tokens to $0.20/M. For cost-conscious teams, this pricing makes Gemini 3.1 Pro the best value among frontier reasoning models. Check our cost efficiency leaderboard for full comparisons.
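The tiered schedule rewards keeping prompts under the 200K boundary. A small illustrative calculator, assuming (as with earlier Gemini Pro pricing) that the extended rate applies to the entire request once input exceeds 200K tokens, and that the 50% batch discount applies at both tiers:

```python
def gemini_31_pro_cost(input_tokens: int, output_tokens: int,
                       batch: bool = False) -> float:
    """Estimated USD cost for a single Gemini 3.1 Pro request.

    Assumes the extended rate covers the whole request once input
    exceeds 200K tokens, and that the 50% Batch API discount applies
    at both tiers - check the live pricing page before budgeting.
    """
    if input_tokens <= 200_000:
        in_rate, out_rate = 2.00, 12.00   # standard tier, $ per M tokens
    else:
        in_rate, out_rate = 4.00, 18.00   # extended tier, $ per M tokens
    if batch:
        in_rate, out_rate = in_rate / 2, out_rate / 2
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A 150K-token prompt with an 8K-token answer lands in the standard tier:
print(f"${gemini_31_pro_cost(150_000, 8_000):.3f}")  # -> $0.396
```

Note how crossing the boundary roughly doubles the bill for the same work: the same request with a 250K-token prompt costs $1.144 rather than the $0.496 you would pay if only the overage were charged at the higher rate.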
The model is available in preview through the Gemini API via AI Studio, Vertex AI, Gemini Enterprise, Gemini CLI, Android Studio, and NotebookLM. It is also in public preview in GitHub Copilot. Google Gemini app users with AI Pro and Ultra plans get higher usage limits. There is no free tier for API access.
Note: "preview" status means Google is still validating agentic workflows before general availability. Developers have reported capacity issues including 99-hour lockouts and phantom quota drain during the preview period, so plan accordingly if you are building production systems.
Strengths
- Dominant reasoning benchmarks. 77.1% ARC-AGI-2 and 94.3% GPQA Diamond are the highest scores recorded by any model, with clear leads over Claude Opus 4.6 and GPT-5.2
- Best price-to-performance ratio. At $2/$12 per M tokens, it undercuts both Opus and GPT-5.2 while matching or exceeding them on most benchmarks
- 1M-token context window. Processes entire codebases, research paper collections, and long video in a single pass
- Strong agentic tool use. Leading scores on Terminal-Bench 2.0 and MCP Atlas for multi-step autonomous workflows
- Multimodal native. Text, image, audio, and video inputs processed natively, not bolted on
Weaknesses
- Enterprise task performance. GDPval-AA score of 1317 trails Claude significantly (1606), suggesting weakness on business-critical document synthesis and strategic analysis
- Long-context degradation. Reports of accuracy degradation beyond 120-150K tokens, with fabricated details appearing at extreme context lengths despite the 1M-token window
- Preview stability issues. Capacity constraints, quota bugs, and broken tool calling reported by developers during the preview rollout
- Structured output inconsistency. Only 84% of responses were schema-valid JSON without retries in testing - a problem for production pipelines that require strict output formats
- Latency at scale. Time to first token averaging 29 seconds, and multi-file queries with images taking 7-9 seconds per response via API
Related Coverage
- Google Launches Gemini 3.1 Pro, Claims Top Spot on 13 of 16 Benchmarks - Launch coverage with benchmark details
- Google's Gemini 3.1 Pro Doubles Reasoning Performance - Deep dive on the ARC-AGI-2 breakthrough
- Gemini 3.1 Pro Is the Best Model You Can't Use - Real-world availability and quota issues
- Gemini 3 Pro Review - Our review of the predecessor model
- Gemini 3 Deep Think Review - Review of Google's reasoning-focused variant
- Overall LLM Rankings: February 2026 - Where Gemini 3.1 Pro fits in the competitive landscape
- ChatGPT vs Claude vs Gemini - Head-to-head assistant comparison
- How to Choose the Right LLM in 2026 - Decision framework for model selection
Sources
- Gemini 3.1 Pro Model Card - Google DeepMind
- Gemini 3.1 Pro Announcement - Google Blog
- Gemini Developer API Pricing - Google AI
- Gemini 3.1 Pro Intelligence & Price Analysis - Artificial Analysis
- Behind Gemini 3.1 Pro's '13 out of 16 Wins' - SmartScope
- Google Releases Gemini 3.1 Pro Benchmarks - OfficeChai
- Gemini 3.1 Pro Review - Medium
- Google's Gemini 3.1 Pro Is Mostly Great - The New Stack
- Gemini 3.1 Pro on Vertex AI - Google Cloud
