Google Launches Gemini 3.1 Pro, Claims Top Spot on 13 of 16 Benchmarks
Google releases Gemini 3.1 Pro with dramatically improved reasoning, topping Claude Opus 4.6 and GPT-5.2 on most industry benchmarks.

Google just dropped Gemini 3.1 Pro, and the benchmark numbers are hard to ignore.
The model, which went live today across the Gemini API, Vertex AI, Google AI Studio, and the consumer Gemini app, is the first incremental 0.1 release in the Gemini family's history. Previous upgrades jumped in 0.5 steps. This smaller increment belies a substantial leap: Google claims first place on 13 of the 16 benchmarks it evaluated, beating both Claude Opus 4.6 and GPT-5.2 in most categories.
The Numbers That Matter
The headline result is ARC-AGI-2, a benchmark that tests a model's ability to solve novel abstract-reasoning puzzles it has never encountered in training. Gemini 3.1 Pro scored 77.1%, more than double what Gemini 3 Pro managed on the same test. That is a massive jump for a point release.
On GPQA Diamond, which evaluates expert-level scientific reasoning, 3.1 Pro hit 94.3%. For context, GPT-5.2 scored 92.4% and Claude Opus 4.6 came in at 91.3%. Not a blowout, but a consistent edge.
The coding benchmarks tell a similar story. Gemini 3.1 Pro posted 80.6% on SWE-Bench Verified, a test of real-world agentic coding ability, and 68.5% on Terminal-Bench 2.0. Both are first-place finishes.
Where things get really interesting is APEX-Agents, a benchmark for long-horizon professional tasks that chain together multiple reasoning steps. Gemini 3.1 Pro scored 33.5%, nearly doubling Gemini 3 Pro's 18.4% and beating both GPT-5.2 (23.0%) and Claude Opus 4.6 (29.8%). If you care about AI agent capabilities, this is the number to watch.
Where Competitors Still Win
Google's benchmark sweep is not total. Claude Opus 4.6 held its ground on Humanity's Last Exam (with-tools), scoring 53.1% against Gemini's 51.4%. Opus also dominated on GDPval-AA Elo, posting 1633 versus Gemini's 1317 - a significant gap on expert task evaluation.
On long-context retrieval (MRCR v2 at 128k tokens), Claude Sonnet 4.6 with extended thinking tied Gemini 3.1 Pro at 84.9%. And although Gemini's 68.5% leads the general-purpose frontier models on Terminal-Bench 2.0, OpenAI's specialized coding model GPT-5.3-Codex still tops that leaderboard at 77.3%.
So while the "13 of 16" headline is real, the benchmarks where Gemini fell short are not trivial ones. As always, which benchmarks actually matter depends entirely on your use case.
What Is Actually New Under the Hood
Google is positioning this as "a smarter, more capable baseline for complex problem-solving." The model retains the 1 million token input context window and up to 64k tokens of output from Gemini 3 Pro. It handles multimodal inputs - text, audio, images, video, and full code repositories.
The real story seems to be a step change in the model's reasoning chain quality. The ARC-AGI-2 improvement, the APEX-Agents leap, and the GPQA Diamond gains all point to a model that has gotten significantly better at chaining together multiple reasoning steps without losing coherence. Whether that came from architecture changes, training data improvements, or post-training techniques, Google is not saying.
This launch follows Gemini 3 Deep Think by about a week, suggesting Google is now iterating rapidly on its frontier model line rather than waiting for large generational jumps.
Availability and Pricing
Gemini 3.1 Pro is rolling out now across multiple surfaces:
- Consumers: Gemini app and NotebookLM, with expanded rate limits for Google AI Pro and Ultra subscribers
- Enterprise: Vertex AI and Gemini Enterprise
- Developers: Preview access via the Gemini API in Google AI Studio, Gemini CLI, Google Antigravity, and Android Studio
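For developers who just want to kick the tires on the preview, a minimal request through the google-genai Python SDK looks something like the sketch below. The model identifier gemini-3.1-pro-preview is an assumption on our part; check Google AI Studio for the exact string Google publishes.

```python
# Minimal sketch: calling the new model through the Gemini API.
# Requires the google-genai SDK (pip install google-genai) and an API key
# from Google AI Studio. The model name "gemini-3.1-pro-preview" is a
# placeholder guess, not a confirmed identifier.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # hypothetical preview model name
    contents="Explain the difference between SWE-Bench Verified and Terminal-Bench.",
)

print(response.text)
```

The same SDK can also target Vertex AI by constructing the client with vertexai=True plus a Google Cloud project and location, so moving between the developer and enterprise surfaces is mostly a configuration change.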
Pricing has not been updated for the 3.1 release specifically. Gemini 3 Pro's existing API pricing sits at $2/M input tokens and $12/M output tokens for contexts under 200k tokens, rising to $4/$18 for contexts above that threshold. Whether 3.1 Pro will carry a premium remains unclear.
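To make that tiering concrete, here is a rough back-of-the-envelope cost helper using Gemini 3 Pro's published rates. It assumes the tier is determined by the prompt's input token count and that 3.1 Pro inherits these prices unchanged; Google has confirmed neither for the new model.

```python
# Back-of-the-envelope API cost estimate using Gemini 3 Pro's published rates.
# Assumptions: 3.1 Pro keeps the same pricing, and the higher tier kicks in
# when the input context exceeds 200k tokens. Rates are USD per million tokens.
def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    if input_tokens <= 200_000:
        input_rate, output_rate = 2.00, 12.00   # standard-context tier
    else:
        input_rate, output_rate = 4.00, 18.00   # long-context tier (>200k)
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Example: a 150k-token prompt with a 4k-token response costs roughly $0.35,
# while the same request padded to 300k input tokens jumps to about $1.27.
print(f"${estimate_cost_usd(150_000, 4_000):.2f}")
print(f"${estimate_cost_usd(300_000, 4_000):.2f}")
```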
For developers trying to decide between frontier models, the ChatGPT vs Claude vs Gemini comparison is worth revisiting now that the competitive landscape has shifted again.
The Bigger Picture
The AI model race has entered a phase where incremental releases deliver more meaningful gains than previous generational leaps. A 0.1 version bump that roughly doubles scores on multiple reasoning benchmarks would have counted as a full generational upgrade a year ago.
Google CEO Sundar Pichai framed the release around bringing "this underlying leap in intelligence to your everyday applications right away." The immediate rollout to the consumer Gemini app and NotebookLM suggests Google is no longer content to let its best models sit behind API paywalls while ChatGPT dominates consumer mindshare.
Combined with the Qwen 3.5 open-source challenge from Alibaba earlier this week and xAI's Grok 4.20 multi-agent launch, February 2026 is shaping up as one of the most competitive months in the AI model wars. The frontier is getting crowded, and the gap between first and third place keeps shrinking.
Sources:
- Gemini 3.1 Pro: A smarter model for your most complex tasks - Google Blog
- Google Releases Gemini 3.1 Pro - Thurrott.com
- Google Releases Gemini 3.1 Pro, Beats Claude Opus 4.6, GPT 5.2 On Most Benchmarks - OfficeChai
- Google Launches Gemini 3.1 Pro for Complex Enterprise Tasks - TechBuzz
- Google launches Gemini 3.1 Pro - Constellation Research
- Gemini 3.1 Pro Model Card - Google DeepMind