
Google's Gemini 3.1 Pro Doubles Reasoning Performance and Retakes the AI Crown

Google releases Gemini 3.1 Pro with 77.1% on ARC-AGI-2, more than doubling the reasoning capability of its predecessor and beating Claude Opus 4.6 and GPT-5.2 on most benchmarks.

Google just shipped the most capable AI model in the world - and it did it with a point release.

Gemini 3.1 Pro, announced on February 19, scored 77.1% on the ARC-AGI-2 benchmark, more than doubling the 31.1% achieved by Gemini 3 Pro just three months ago. It tops Anthropic's Opus 4.6 and OpenAI's GPT-5.2 on 13 of 16 benchmarks Google evaluated. The price? Unchanged from the previous generation.

This is not a new generation. It is a surgical reasoning upgrade that turns Google's Pro tier into the benchmark leader - and it marks the first time Google has used a ".1" version increment, breaking its own release cadence.

The Numbers

Key Specs

| Spec | Value |
| --- | --- |
| ARC-AGI-2 | 77.1% (vs 31.1% for 3 Pro) |
| Context Window | 1M tokens |
| Output Limit | 64K tokens |
| GPQA Diamond | 94.3% |
| Input Pricing | $2.00 / 1M tokens |
| Output Pricing | $12.00 / 1M tokens |

The improvement on ARC-AGI-2 is the headline, but the gains are broad. Gemini 3.1 Pro leads on GPQA Diamond (94.3% vs Claude Opus 4.6's 91.3% and GPT-5.2's 92.4%), Terminal-Bench 2.0 (68.5%), LiveCodeBench Pro (2887 Elo), and SciCode (59%). These are not cherry-picked metrics - they span reasoning, coding, scientific knowledge, and agentic tasks.

The Full Benchmark Picture

| Benchmark | Gemini 3.1 Pro | Claude Opus 4.6 | GPT-5.2 |
| --- | --- | --- | --- |
| ARC-AGI-2 | 77.1% | 68.8% | 52.9% |
| GPQA Diamond | 94.3% | 91.3% | 92.4% |
| Humanity's Last Exam (no tools) | 44.4% | 42.1% | - |
| SWE-Bench Verified | 80.6% | 82.0% | - |
| Terminal-Bench 2.0 | 68.5% | - | 65.2% |
| GDPval-AA Elo | 1317 | 1606 | - |

The table tells the real story. Gemini 3.1 Pro dominates reasoning and scientific benchmarks, but Claude Opus 4.6 still holds ground on SWE-Bench Verified and expert-level evaluation tasks. OpenAI's GPT-5.3-Codex, meanwhile, leads on SWE-Bench Pro (56.8% vs 54.2%) and its own Terminal-Bench harness (77.3%).

As we noted in our overall LLM rankings, benchmark leadership has become a rotating crown. This release does not change that pattern - it accelerates it.

What Changed Under the Hood

Deep Think Goes Mainstream

Google credits the reasoning leap to "upgraded core intelligence" first introduced in Gemini 3 Deep Think, released just one week earlier on February 12. VentureBeat described 3.1 Pro as a "Deep Think Mini" - the same extended reasoning capability, packaged for the Pro tier with adjustable thinking depth.

This is significant. Deep Think was a specialized model for hard problems. Gemini 3.1 Pro brings that reasoning to everyday use across the Gemini app, NotebookLM, Google AI Studio, Vertex AI, and even GitHub Copilot.
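For developers reaching the model through Google AI Studio or Vertex AI, the adjustable thinking depth maps naturally onto the Gemini API's thinking configuration. The sketch below uses the google-genai Python SDK; the model identifier and whether 3.1 Pro exposes a thinking budget at all are assumptions based on how earlier Gemini thinking models are configured, not details confirmed in Google's announcement.

```python
# Minimal sketch: calling Gemini 3.1 Pro with an explicit thinking budget.
# Assumptions: the "gemini-3.1-pro-preview" model ID and its support for
# thinking_budget are illustrative only - check Google's model docs for the
# actual identifier and supported config fields.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # hypothetical model ID
    contents="Outline an approach to an ARC-style grid transformation puzzle.",
    config=types.GenerateContentConfig(
        # Larger budgets allow deeper reasoning at the cost of latency,
        # mirroring how thinking budgets work on earlier Gemini models.
        thinking_config=types.ThinkingConfig(thinking_budget=8192),
    ),
)

print(response.text)
```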

The Speed Trade-Off

Artificial Analysis ranks Gemini 3.1 Pro #1 for intelligence among the 116 models it tracks, with an intelligence score of 57 - but notes a cost. Time to first token averages 33.99 seconds, nearly 30x the 1.19-second median for models in its price tier. Output speed is a respectable 104.7 tokens per second, but the initial thinking latency will frustrate developers building real-time applications.

This is the classic reasoning-model tax. Extended thinking produces better answers but makes the model feel slow on first response. Developers choosing between Gemini 3.1 Pro and, say, Claude Sonnet 4.6 will need to decide whether raw intelligence or snappy interaction matters more for their use case.
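A rough back-of-the-envelope using the Artificial Analysis figures quoted above shows how the thinking latency dominates perceived response time for short outputs; the token counts below are illustrative, not measured.

```python
# Back-of-the-envelope perceived latency, using the figures cited above.
TTFT_SECONDS = 33.99        # time to first token (Artificial Analysis)
TOKENS_PER_SECOND = 104.7   # output throughput (Artificial Analysis)

def total_latency(output_tokens: int) -> float:
    """Approximate wall-clock time for one completion."""
    return TTFT_SECONDS + output_tokens / TOKENS_PER_SECOND

for tokens in (100, 500, 2000):
    print(f"{tokens:>5} output tokens -> ~{total_latency(tokens):.1f} s")

# Even a short 100-token reply takes roughly 35 s end to end, and an agent
# that chains three such calls sequentially spends over 100 s on thinking alone.
```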

Developer Reception

Vladislav Tankov, Director of AI at JetBrains, highlighted the model's token efficiency - fewer output tokens while maintaining reliable results. "We're seeing up to a 15% improvement in evaluation runs compared to the best previews of the previous Gemini 3 Pro model," Tankov said.

But independent testers have flagged friction. Some report latency spikes during high-demand periods, with rudimentary inputs occasionally taking over 100 seconds to process. More concerning: developers using Gemini CLI have documented instances of state degradation during long coding sessions, where the model inadvertently deleted functional code during file modifications.

Availability and Pricing

Gemini 3.1 Pro is rolling out now in preview across:

  • Gemini app (Pro and Ultra plan users get higher limits)
  • NotebookLM (Pro and Ultra only)
  • Google AI Studio and Vertex AI
  • Gemini CLI and Android Studio
  • GitHub Copilot (public preview)
  • Microsoft Visual Studio and VS Code

Pricing holds at $2.00 per million input tokens and $12.00 per million output tokens for contexts under 200K tokens. Above 200K, rates rise to $4.00 and $18.00 respectively. This makes it roughly half the cost of Claude Opus 4.6 at comparable quality on most tasks - a pricing advantage Google will not let you forget.
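Because the rate changes at the 200K-token boundary, cost estimates need to check which tier a request falls into. Here is a minimal sketch; the assumption that the higher tier applies per request once the prompt exceeds 200K tokens is my reading of the pricing note above, not an official billing formula.

```python
# Rough per-request cost estimate under the tiered pricing quoted above.
# Assumption: the higher tier applies when the prompt exceeds 200K tokens;
# consult the official pricing page for exact billing rules.
def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    if input_tokens <= 200_000:
        input_rate, output_rate = 2.00, 12.00    # USD per 1M tokens
    else:
        input_rate, output_rate = 4.00, 18.00
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Example: a 50K-token prompt with an 8K-token response.
print(f"${estimate_cost(50_000, 8_000):.4f}")   # ~$0.20
```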

For a broader comparison of where Gemini fits in the current model landscape, see our Gemini 3 Pro review and ChatGPT vs Claude vs Gemini comparison.

What It Does Not Tell You

Google's benchmark table is impressive, but there are gaps worth noting.

First, the ARC-AGI-2 score of 77.1% was achieved with the model's full extended reasoning mode. Google has not disclosed how long the model thinks per problem or the compute cost of that thinking. A model that takes 10 minutes per ARC puzzle is not the same as one that solves it in seconds.

Second, the areas where Gemini 3.1 Pro trails are not trivial. Claude Opus 4.6 leads on GDPval-AA Elo (1606 vs 1317) - a measure of expert-level task quality that matters more for real-world professional use than abstract reasoning puzzles. Claude Sonnet 4.6 in Thinking Max mode scores even higher at 1633. If your workload is less "solve novel logic puzzles" and more "produce expert-quality analysis," the benchmark crown may matter less than you think.

Third, the 33-second time-to-first-token is a real constraint. For AI coding assistants that need to feel responsive, or agent frameworks that chain multiple model calls, that latency compounds quickly. The reasoning benchmarks leaderboard captures raw capability, but latency-adjusted performance is a different ranking entirely.

Finally, the model is in preview. Google says it is validating updates before general availability, particularly for "ambitious agentic workflows." Preview means things can change - and in Google's history, they sometimes do.


Gemini 3.1 Pro is the strongest reasoning model available today by most measures, at a price that undercuts the competition. Google broke its own release cadence to ship it, and the ARC-AGI-2 result is genuinely impressive. But the 33-second thinking penalty, the preview status, and Claude's persistent edge on expert-level tasks mean the AI crown remains contested. The best model depends on what you are building - and that is unlikely to change anytime soon.

About the author: Elena is a Senior AI Editor and Investigative Journalist with over eight years of experience covering artificial intelligence, machine learning, and the startup ecosystem.