Gemini 3.1 Pro Tops Benchmarks but Developers Can't Rely on It

Gemini 3.1 Pro leads ARC-AGI-2, LiveCodeBench, and 11 other benchmarks, and Gemini's chatbot share has climbed to 21.5% with 750 million users - but developers report stalled responses, leaked thinking tokens, and API outages that make it unusable for production coding and agent workflows.

Two weeks after launch, Gemini 3.1 Pro holds the top position on more AI benchmarks than any other model. It also holds the top position on the Google AI Developer Forum's list of complaint threads.

The model, released February 19, scores 77.1% on ARC-AGI-2 (more than doubling its predecessor), holds a LiveCodeBench Pro Elo of 2,887 (448 points above GPT-5.2), and leads in 13 of 16 benchmarks Google evaluated. Gemini's chatbot market share has surged to 21.5% with 750 million users, up from 5.4% in January 2025. On paper, it's the best model available at $2 per million input tokens.

In practice, developers are switching to Claude.

Benchmarks: #1 on ARC-AGI-2 (77.1%), LiveCodeBench Pro (2,887 Elo), GPQA Diamond (94.3%), MMMU-Pro (80.5%), SWE-Bench Verified (80.6%)
Market share: 21.5% of AI chatbot users (750M), up from 5.4% in Jan 2025
Pricing: $2/1M input, $8/1M output (unchanged from Gemini 3 Pro)
Developer complaints: API 503 errors during peak hours, leaked _thought blocks, stalled responses in Cursor and Gemini CLI, thinking tokens that narrate instead of reason
Status: Still in "Preview" - no GA date announced

The Numbers Are Real

The benchmark gains aren't cherry-picked. The DeepMind model card shows broad improvements:

| Benchmark | Gemini 3.1 Pro | Previous best |
| --- | --- | --- |
| ARC-AGI-2 | 77.1% | 31.1% (Gemini 3 Pro) |
| LiveCodeBench Pro | 2,887 Elo | 2,439 (Gemini 3 Pro) |
| SWE-Bench Verified | 80.6% | - |
| GPQA Diamond | 94.3% | - |
| MMMU-Pro | 80.5% | - |
| BrowseComp | 85.9% | - |
| Humanity's Last Exam | 51.4% | - |

The coding numbers are particularly strong. An Elo jump of 448 points on LiveCodeBench is massive: under the Elo model, a player rated 448 points higher is expected to win roughly 93% of head-to-head matchups. SWE-Bench Verified at 80.6% puts it ahead of every other model on real-world GitHub issue resolution.
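Checking that figure is a one-line Elo calculation; the sketch below (purely illustrative Python) converts the reported 448-point gap into an expected win rate.

```python
# Elo expected score for the higher-rated side, given the reported 448-point gap.
gap = 448
p_win = 1 / (1 + 10 ** (-gap / 400))
print(f"{p_win:.0%}")  # prints 93%
```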

The Experience Is Not

The Google AI Developer Forum tells a different story. Developers report less than 10% API success rates during peak hours, with 503 errors lasting for hours while Google's status page shows green.

"Less than 10% success rate, making the Gemini API unusable currently," wrote one developer. Another replied: "how am I supposed to migrate my apps to a model that's constantly unavailable?"

The reliability problems go beyond simple capacity constraints. A bug report documents the model leaking raw _thought blocks in its output and getting stuck in infinite loops. In coding environments like Cursor and VS Code Copilot, the model exhibits a pattern that a former Google engineer described on Hacker News: "Gemini will almost completely use thinking tokens, and then just do something but not tell you what it did. If you don't look at the thinking tokens you can't tell what happened, but the thinking token stream is crap. It's all 'I'm now completely immersed in the problem...'"

The same developer - who said he "mildly roots for them" given personal connections to the team - described trying a "plan-in-Gemini, execute-in-Claude" workflow before concluding: "while I'm doing that I might as well just stay in Claude. The experience is just so much better."
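Until the leaked-thought bug is fixed, teams that do stay on Gemini can at least keep reasoning traces out of user-facing output by filtering response parts before display. A rough sketch with the google-genai Python SDK, assuming thought parts are marked by a part-level thought flag and using a placeholder model ID:

```python
from google import genai

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

def visible_text(prompt: str, model: str = "gemini-3.1-pro-preview") -> str:
    """Return only the answer text, dropping parts flagged as internal thoughts."""
    response = client.models.generate_content(model=model, contents=prompt)
    parts = response.candidates[0].content.parts or []
    # Keep text parts that are not marked as thought/reasoning output.
    return "".join(p.text for p in parts if p.text and not getattr(p, "thought", False))
```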

The Marketing Gap

Google's Gemini 3.1 Pro has a marketing problem that compounds its reliability issues. Despite leading benchmarks, the model's strongest praise from developers often comes with caveats. Users praise its math skills and the Gemini app's UI design, but note that the actual coding and agent experience lags behind ChatGPT and Claude.

Part of this is structural. The model is still labeled "Preview" with no general availability date announced. The Gemini 3 Pro deprecation on March 9 forces developers onto the Preview whether they're ready or not. And Google's infrastructure team is, in Logan Kilpatrick's words, "battling right now" to handle the demand spike.

Part of it is perception. OpenAI and Anthropic have invested heavily in developer relations and coding-specific tuning. One Hacker News commenter summarized the gap: "Claude is definitively trained on the process of coding not just the code... Google is probably oriented towards a more general solution, and so it's stuck in 'jack of all trades master of none' mode."

What Comes Next

The leaked Gemini 3.1 Flash-Lite Preview that appeared today suggests Google is expanding the 3.1 lineup - and the compute defragmentation from retiring Gemini 3 Pro should free capacity for the models that remain. Whether that's enough to close the reliability gap is an open question.

For developers evaluating Gemini 3.1 Pro: the raw capability is there, and the price-to-performance ratio is unmatched. But if you need consistent uptime for production workloads or predictable behavior in agentic coding pipelines, test it thoroughly before committing. Benchmark charts don't capture the experience of watching a model narrate its own immersion in your problem instead of solving it.
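One concrete way to hedge that bet is an explicit fallback path: try Gemini first for the price advantage, and route to Claude when the call errors out or comes back empty. A rough sketch using the google-genai and anthropic Python SDKs; the model IDs, token limit, and error handling are illustrative assumptions rather than a tested recipe.

```python
from google import genai
from google.genai import errors as genai_errors
import anthropic

gemini = genai.Client()          # assumes GEMINI_API_KEY in the environment
claude = anthropic.Anthropic()   # assumes ANTHROPIC_API_KEY in the environment

def complete(prompt: str) -> str:
    """Prefer Gemini for cost; fall back to Claude when Gemini errors out or stalls."""
    try:
        resp = gemini.models.generate_content(
            model="gemini-3.1-pro-preview",  # placeholder Preview model ID
            contents=prompt,
        )
        if resp.text:
            return resp.text
    except genai_errors.APIError:
        pass  # 503s and other API errors fall through to the fallback
    msg = claude.messages.create(
        model="claude-sonnet-4-5",  # placeholder; pin whichever Claude model you actually use
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text
```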


About the author: AI Infrastructure & Open Source Reporter

Sophie is a journalist and former systems engineer who covers AI infrastructure, open-source models, and the developer tooling ecosystem.