Gemini 3.1 Pro Review: Google's Reasoning Leap Is Real - With Caveats
Google's Gemini 3.1 Pro more than doubles its predecessor's reasoning scores and introduces adjustable thinking modes, but latency issues and preview-status quirks keep it from a clean sweep.

Google's point-release naming convention usually signals minor polish. Gemini 3.1 Pro is not that. Released on February 19, this is the most significant mid-cycle upgrade Google has ever shipped for its flagship model family - a 46 percentage point jump on ARC-AGI-2, a new three-tier thinking system, and benchmark leadership on 13 of 16 published tasks. It arrives at the same price as Gemini 3 Pro, which makes it one of the best value propositions in the current model landscape. But benchmark charts and real-world utility aren't the same thing, and after a week of intensive testing, the picture is more complicated than the headlines suggest.
TL;DR
- 8.5/10 - a genuine reasoning breakthrough at an aggressive price point, held back by preview-status rough edges
- Adjustable three-tier thinking is the standout feature, letting you dial reasoning depth per request instead of routing to separate models
- Latency on high thinking mode can exceed 90 seconds, and structured output reliability drops under pressure
- Best for developers who need strong reasoning and multimodal capabilities on a budget; skip if you need rock-solid agentic execution or maximum output length
The Reasoning Leap
Let's start with the number that grabbed everyone's attention. On ARC-AGI-2 - a benchmark specifically designed to test novel pattern recognition that models can't brute-force through memorization - Gemini 3.1 Pro scored 77.1%. Its predecessor, Gemini 3 Pro, scored 31.1%. That's not an incremental improvement; it's a 2.5x jump on a benchmark built to be hard.
For context, Claude Opus 4.6 scores 68.8% on the same test, and GPT-5.2 sits at 52.9%. Only Gemini 3 Deep Think - Google's specialized reasoning mode that trades latency for accuracy - scores higher at 84.6%. Getting within striking distance of Deep Think performance from a general-purpose model is remarkable. If you have been following our reasoning benchmarks leaderboard, you know how rare jumps of this magnitude are.
The reasoning gains extend beyond abstract puzzles. On GPQA Diamond, a benchmark of PhD-level science questions, Gemini 3.1 Pro hits 94.3% - ahead of Claude Opus 4.6 at roughly 90% and GPT-5.2 at 93.2%. On Humanity's Last Exam without tools, it scores 44.4%, though Claude pulls ahead once tool use is introduced (53.1% vs 51.4%), suggesting Anthropic's model remains better at tapping external capabilities during complex reasoning.
Adjustable Thinking - The Real Innovation
The benchmark numbers are impressive, but the feature that'll actually change how developers work with this model is the three-tier thinking system. Previous Gemini versions offered a binary choice: low or high compute. Gemini 3.1 Pro adds a medium tier and fundamentally redefines what "high" means.
Low mode is fast and cheap - suitable for summarization, classification, and straightforward Q&A. Medium mode is the new default sweet spot, roughly matching what "high" used to be on Gemini 3 Pro. High mode is where things get interesting: it effectively turns 3.1 Pro into a "Deep Think Mini," applying extended inference-time compute with multi-step verification.
This matters because it eliminates an operational headache that has plagued production AI deployments. Instead of maintaining separate model endpoints for different task complexities - fast model for simple queries, reasoning model for hard ones, with a routing layer in between - you can now hit a single API endpoint and adjust the thinking_level parameter per request. A customer support chatbot handles routine questions on low, escalates ambiguous cases to medium, and pushes truly complex diagnostic queries to high. One model, one deployment, three performance tiers.
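The single-endpoint routing pattern described above can be sketched in a few lines. To be clear about what's assumed: the request shape, the model name string, and the `classify_complexity` heuristic are all illustrative stand-ins, not the official SDK surface - only the `thinking_level` parameter name comes from Google's documented API.

```python
# Sketch of per-request thinking-tier routing against one endpoint.
# The classify_complexity heuristic and request dict are illustrative
# assumptions; only the thinking_level parameter name is from the docs.

def classify_complexity(prompt: str) -> str:
    """Crude heuristic: route by trigger keywords, then prompt length."""
    hard_markers = ("diagnose", "prove", "root cause", "step by step")
    if any(m in prompt.lower() for m in hard_markers):
        return "high"      # complex diagnostic queries
    if len(prompt.split()) > 200:
        return "medium"    # ambiguous or lengthy cases
    return "low"           # routine questions

def build_request(prompt: str) -> dict:
    """One model, one deployment: only thinking_level varies per request."""
    return {
        "model": "gemini-3.1-pro",  # hypothetical model identifier
        "contents": prompt,
        "thinking_level": classify_complexity(prompt),
    }

print(build_request("What are your support hours?")["thinking_level"])  # low
```

In a real deployment you would replace the keyword heuristic with something sturdier - a lightweight classifier, or user-facing escalation signals - but the operational win is the same: no routing layer between separate model endpoints.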
VentureBeat called it a "Deep Think Mini," and that framing is accurate. At the high setting, I observed the model constructing parallel reasoning chains, checking its own logic, and backtracking when it hit contradictions. It's noticeably slower - responses at high can take 30 to 90 seconds - but the quality gap between medium and high on complex prompts is substantial.
Coding Performance
Gemini 3.1 Pro's coding benchmarks are strong across the board, and they matter more than ever given how central code generation has become to model evaluation. On LiveCodeBench Pro, it posts an Elo of 2,887 - notably ahead of GPT-5.2 at 2,393 and its own predecessor at 2,439. On SWE-Bench Verified, it scores 80.6%, basically tied with Claude Opus 4.6 at 80.8%.
Where it falls behind is terminal-heavy agentic execution. On Terminal-Bench 2.0, Gemini 3.1 Pro scores 68.5% - respectable, but trailing GPT-5.3-Codex at 77.3%. This aligns with developer feedback on Hacker News, where multiple users reported that while Gemini excels at one-shot code generation, it struggles with iterative, multi-step coding sessions where it needs to modify files, run tests, and adapt. As one former Google engineer put it, it "falls over a lot when actually trying to get things done" in agentic workflows.
For a deeper dive into how these scores stack up, check our coding benchmarks leaderboard.
Multimodal Capabilities
Gemini 3.1 Pro inherits its predecessor's natively multimodal architecture, and the expanded specs are worth noting. It processes up to 900 individual images per prompt, 8.4 hours of continuous audio, and one hour of video. The 1 million token input context window remains, with output capacity expanded to 65,536 tokens - a welcome improvement over the previous generation's limits.
In hands-on testing, the vision capabilities remain among the best available. The model handled complex diagram interpretation, multi-page document analysis, and video summarization with consistent accuracy. Google's demo videos showing it producing an aerospace dashboard from live telemetry data and converting static SVGs into animated web graphics are impressive, and in my testing the model did deliver on these kinds of multimodal-to-code tasks. For a broader comparison of vision models, see our multimodal benchmarks leaderboard.
Where the 1M context window gets complicated is reliability at scale. Multiple users have reported "lost in the middle" problems with very long documents - the model occasionally invents details or drifts in its summaries when processing content beyond roughly 120,000 to 150,000 tokens. Google's own MRCR v2 128k benchmark score of 84.9% is solid but not dominant, and the real-world gap between "128k benchmark" and "1M context window" remains a source of frustration.
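Until that gap closes, the pragmatic workaround is to stay under the degradation zone by chunking long inputs yourself. A minimal sketch, with one loud assumption: the 4-characters-per-token estimate is a rough approximation, not the model's actual tokenizer, so the safety margin should be generous.

```python
# Conservative mitigation for "lost in the middle" drift: split long
# inputs into chunks well below the ~120-150k-token range where
# reliability was reported to degrade. The chars-per-token ratio is
# an approximation, not the real tokenizer.

APPROX_CHARS_PER_TOKEN = 4
SAFE_TOKENS = 100_000  # stay under the observed degradation zone

def chunk_text(text: str, max_tokens: int = SAFE_TOKENS) -> list[str]:
    """Split text into consecutive chunks of at most max_tokens (approx)."""
    max_chars = max_tokens * APPROX_CHARS_PER_TOKEN
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

Summarize each chunk separately, then summarize the summaries; it costs extra calls, but it sidesteps the drift reported past 120k tokens.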
Pricing and Value
This is where Gemini 3.1 Pro gets genuinely compelling. API pricing is $2.00 per million input tokens and $12.00 per million output tokens for prompts up to 200k tokens. For longer prompts, those numbers rise to $4.00 and $18.00 respectively. Context caching is available at $0.20 per million tokens, with batch mode offering a 50% discount on all prices.
For comparison, Claude Opus 4.6 charges $15.00 per million input tokens and $75.00 per million output tokens. That means Gemini 3.1 Pro's input cost is roughly 13% of Opus 4.6's, and output is about 16% as expensive. Even Claude Sonnet 4.6 at $3.00/$15.00 costs more on both dimensions. If you're running high-volume API workloads, this isn't a marginal difference - it's a fundamentally different cost structure.
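The arithmetic is easy to sanity-check against the published rates. The workload figures below are a made-up example, not measured usage, and the calculation ignores caching and batch discounts.

```python
# Back-of-envelope cost comparison using the published per-million-token
# rates (prompts under 200k tokens; caching and batch discounts ignored).

PRICES = {  # (input $/M tokens, output $/M tokens)
    "gemini-3.1-pro": (2.00, 12.00),
    "claude-opus-4.6": (15.00, 75.00),
    "claude-sonnet-4.6": (3.00, 15.00),
}

def monthly_cost(model: str, in_tokens_m: float, out_tokens_m: float) -> float:
    """Dollar cost for a workload measured in millions of tokens."""
    inp, out = PRICES[model]
    return in_tokens_m * inp + out_tokens_m * out

# Hypothetical workload: 500M input / 100M output tokens per month.
gemini = monthly_cost("gemini-3.1-pro", 500, 100)   # $2,200
opus = monthly_cost("claude-opus-4.6", 500, 100)    # $15,000
print(f"Gemini: ${gemini:,.0f}  Opus: ${opus:,.0f}  ratio: {gemini / opus:.0%}")
```

At that volume the blended bill is roughly 15% of what Opus 4.6 would cost, which is why "fundamentally different cost structure" is not hyperbole.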
Consumer access comes through Google AI Pro at $19.99/month or Google AI Ultra at $249.99/month. The Pro subscription gets you full access to 3.1 Pro with the three thinking tiers. Ultra adds Deep Think, higher usage limits, and priority access.
What Doesn't Work
Latency. High thinking mode is slow. In my testing, I saw responses take up to 104 seconds for inputs that shouldn't have required that much compute. Timeout errors from the API occurred regularly during extended testing sessions. For interactive applications where responsiveness matters, this is a real problem.
Structured output. Even with a JSON schema and strict examples, Gemini 3.1 Pro occasionally slipped in extra fields, reordered keys, or produced schema-invalid responses. One analysis found only 84% schema-valid responses without retries. For production pipelines that depend on deterministic structured output, this needs improvement.
Agentic workflows. The gap between benchmark performance and real-world agent behavior is wider for Gemini than for Claude. The APEX-Agents score of 33.5% is not competitive. Developer reports consistently describe issues with tool usage, file editing, and multi-step task completion. The model excels at producing correct answers in a single pass but struggles when it needs to iterate, adapt, and maintain state across multiple turns.
Thinking token transparency. Several developers noted that Gemini's thinking output - the visible reasoning chain - reads more like summarized narration ("I'm now completely immersed in the problem") than actual step-by-step logic. This makes debugging and understanding model failures harder than it should be.
Preview status. As of late February 2026, Gemini 3.1 Pro is still technically in preview. Google is verifying agentic flows and refining the model. Expect rough edges, and don't build production-critical systems on it without thorough testing.
Strengths
- Massive reasoning improvement over Gemini 3 Pro, competitive with or surpassing Claude Opus 4.6 on abstract reasoning benchmarks
- Three-tier adjustable thinking eliminates the need for multi-model routing in production deployments
- Exceptional value: frontier-class performance at a fraction of Opus 4.6's price
- Strong multimodal capabilities with support for images, audio, video, and code in a single context
- 1M token input context window with expanded 65k output
- Same pricing as Gemini 3 Pro - a free upgrade for existing users
Weaknesses
- High thinking mode latency makes it impractical for interactive or real-time applications
- Agentic task execution clearly trails Claude Opus 4.6 and GPT-5.3-Codex
- Structured output reliability is inconsistent, especially under complex schema constraints
- Long-context reliability degrades past 120-150k tokens with "summary drift" and hallucinated details
- Still in preview with potential for behavior changes before general availability
- Thinking tokens lack the transparency needed for effective debugging
Verdict: 8.5/10
Gemini 3.1 Pro is a genuinely impressive upgrade that delivers on its central promise: dramatically better reasoning at the same price. The ARC-AGI-2 improvement alone would be newsworthy; combined with the three-tier thinking system, competitive coding benchmarks, and aggressive pricing, it makes a strong case as the default model for teams that need strong reasoning without the per-token cost of Claude Opus 4.6.
But the caveats are real. If your workload involves agentic execution - multi-step tool use, file manipulation, iterative debugging - Claude remains the safer choice. If you need deterministic structured output for production pipelines, test extensively before committing. And if latency matters for your use case, the high thinking tier may not be practical.
The smartest play right now, as several analysts have suggested, is a hybrid approach: route 80% of your workload through Gemini 3.1 Pro on medium thinking for cost-efficient reasoning, and reserve Claude Opus 4.6 for the 20% of tasks that demand elite agentic performance and tool coordination. That split captures the best of both models. As we discussed in our guide on how to choose the right LLM in 2026, the era of one model to rule them all is over. Gemini 3.1 Pro makes that reality more economically viable than ever.
Sources
- Google Blog: Gemini 3.1 Pro announcement
- Google DeepMind: Gemini 3.1 Pro Model Card
- VentureBeat: Google launches Gemini 3.1 Pro, retaking AI crown with 2X+ reasoning performance boost
- SmartScope: Behind Gemini 3.1 Pro's "13 out of 16 Wins" - The Benchmarks Published and the Benchmarks Left Out
- Gemini Developer API Pricing
- Artificial Analysis: Gemini 3.1 Pro Preview - Intelligence, Performance and Price Analysis
- VentureBeat: Google Gemini 3.1 Pro first impressions - a 'Deep Think Mini' with adjustable reasoning on demand
- Hacker News: Gemini 3.1 Pro discussion thread
- TrendinTopics: Gemini 3.1 Pro Leads Most Benchmarks But Trails Claude Opus 4.6 in Some Tasks
- The New Stack: Google's Gemini 3.1 Pro is mostly great
