Gemini 3.1 Pro Is the Best Model You Can't Use: 99-Hour Lockouts, Phantom Quota Drain, and Broken Tool Calling
Four days after launch, Gemini 3.1 Pro's benchmark-topping performance is overshadowed by 90-hour lockouts for paying subscribers, quota that drains while idle, and tool-calling bugs that break LangChain4j, n8n, and RooCode. Developers are switching to Claude.

Gemini 3.1 Pro launched on February 19 with the strongest benchmark numbers Google has ever posted. Its 77.1% on ARC-AGI-2 more than doubled its predecessor's score and beat both Claude Opus 4.6 and GPT-5.3 in abstract reasoning. Its LiveCodeBench Elo of 2887 sits 448 points above GPT-5.2. At $2 per million input tokens, it is the cheapest frontier model available.
Four days later, the Google AI Developer Forum is full of paying subscribers who can't use it.
TL;DR
| Issue | Details |
|---|---|
| Lockout duration | 90-99 hours reported by paid Pro subscribers |
| Phantom quota drain | Quota drops from 100% to 0% with no activity |
| Quota consumption rate | ~2x faster than Gemini 3.0 on identical workloads |
| Launch-day latency | Up to 104 seconds for basic inputs |
| Tool-calling broken in | LangChain4j, n8n, RooCode, Cursor |
| Google's response | Acknowledged "phased rollout," no fix timeline |
90-hour lockouts for paying customers
The most severe reports come from Google AI Pro subscribers -- people paying for premium access.
After exhausting the model's 5-hour quota window three times within 24 hours, users found themselves locked out for 90 hours. Others report 99-hour lockouts after a single hour of use:
"Just updated to 3.1 Pro and I'm already locked out for 99 hours after only an hour of use."
A separate thread documents a 3-day lockout where the system applied a hard weekly cap -- typically reserved for Free Tier accounts -- to a paying Pro subscriber after the 3.1 migration. Multiple users in the thread confirmed similar experiences, with lockouts ranging from 3 to 7 days.
Quota draining while idle
Beyond the lockouts, the quota system itself appears bugged. Multiple users report quota dropping from 100% to 0% with no corresponding activity. Reset timers show 35-50 hours. From the forum:
"Mine was at 100% and then 80% for quite a while, maybe 30 minutes of working, and then I sent one more prompt and it bumped to 0% all of a sudden."
Even when the model is reachable, the 5-hour quota is consumed roughly 2x faster with 3.1 Pro compared to 3.0 on identical workloads. A second thread confirms the pattern: the new model simply eats quota faster, with no explanation from Google.
Launch-day latency
Early reviewers documented the model taking up to 104 seconds for rudimentary inputs on launch day, with frequent timeout errors and "This model is currently experiencing high demand" messages. The API returned MODEL_CAPACITY_EXHAUSTED errors throughout the first 48 hours. In Cursor, the agent would get stuck on "Planning next moves" or "Taking longer than expected," frequently showing "Reconnecting" for minutes at a time.
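Until capacity stabilizes, the standard workaround for transient capacity errors is jittered exponential backoff. A minimal, client-agnostic sketch (the `send_request` callable and the exception type are placeholders; the error string is the one reported above):

```python
import random
import time

def call_with_backoff(send_request, max_attempts=5, base_delay=1.0):
    """Retry a request on capacity errors with jittered exponential backoff.

    send_request: zero-arg callable that raises RuntimeError on failure
    (stand-in for whatever client/exception your SDK actually uses).
    """
    for attempt in range(max_attempts):
        try:
            return send_request()
        except RuntimeError as err:
            if "MODEL_CAPACITY_EXHAUSTED" not in str(err):
                raise  # unrelated error: don't retry
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the capacity error
            # 1s, 2s, 4s, ... plus jitter to avoid synchronized retries
            time.sleep(base_delay * (2 ** attempt) + random.random() * base_delay)

# Example: a fake client that fails twice with a capacity error, then succeeds
calls = {"n": 0}
def fake_send():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("MODEL_CAPACITY_EXHAUSTED")
    return "ok"
```

This does not fix quota accounting, of course; it only smooths over transient capacity rejections.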
Tool calling: broken across frameworks
For a model positioned as an agentic reasoning engine, the tool-calling bugs are the most damaging issue.
The root cause: when Gemini 3.1 Pro triggers a tool call, the API returns a thoughtSignature field that must be passed back exactly in the next request. If the signature is missing or misplaced, the API responds with a 400 error: "Function call is missing a thought_signature in functionCall parts." This breaks multi-turn tool use in every major framework that has attempted integration:
- LangChain4j: Multi-turn conversations impossible after any tool call
- n8n: AI Agent node crashes immediately on tool use
- RooCode: Tool calls fail with signature errors
- RooCode via OpenRouter: API handler strips the content array but preserves reasoning_details, causing every follow-up to fail
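The fix, for clients talking to the REST API directly, is to echo the model's functionCall part back verbatim, signature included, when appending the tool turn to the conversation history. A minimal sketch of the request shape (the tool name, arguments, and signature value are illustrative; field names follow the Gemini REST API):

```python
# Sketch: round-tripping thoughtSignature across a tool-call turn.
# The model's functionCall part arrives carrying a thoughtSignature;
# the next request must include that part unmodified, followed by the
# tool result, or the API rejects it with a 400.

def append_tool_turn(history, model_part, tool_result):
    """Append the model's function call (signature intact) plus our response."""
    # Pass the functionCall part through exactly as returned --
    # do not strip or rebuild it, or the thoughtSignature is lost.
    history.append({"role": "model", "parts": [model_part]})
    history.append({
        "role": "user",
        "parts": [{
            "functionResponse": {
                "name": model_part["functionCall"]["name"],
                "response": tool_result,
            }
        }],
    })
    return history

# Hypothetical part as returned by the model:
part = {
    "functionCall": {"name": "get_weather", "args": {"city": "Berlin"}},
    "thoughtSignature": "opaque-signature-from-model",  # must survive verbatim
}
history = append_tool_turn([], part, {"temp_c": 7})
```

The framework bugs above are all variants of the same mistake: rebuilding the message history from parsed fields and dropping the opaque signature along the way.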
Additional bugs include an infinite "Working" loop in Canvas code mode where generation ran for 88 minutes without completing or returning an error, and persistent infinite thinking loops in Android Studio.
Developer sentiment on Hacker News is blunt:
"It's bad at using tools and tries to edit files in weird ways instead of using the provided text editing tools."
"Gemini is superb at incredibly hard stuff, but falls apart on some of the most basic things (like tool calling)."
Developers are switching to Claude
The reliability gap is pushing developers toward alternatives. From the same HN thread:
"So I've tried to adopt a plan-in-Gemini, execute-in-Claude approach, but while I'm doing that I might as well just stay in Claude."
"I have a paid Antigravity subscription and most of the time I use Claude models due to the exact issues you have pointed out."
A head-to-head comparison by Shipyard found Claude Code finished tasks faster with full autonomy, while Gemini CLI required manual nudging and retries at higher cost ($7.06 vs $4.80 for equivalent tasks). Shipyard's verdict: "Claude simply generates better code with fewer issues."
The emerging consensus among developers using both models: use Claude 4.5 as the default for planning and reliable execution, bring in Gemini for multimodal tasks and UI work where its vision capabilities have an edge.
Google's response
Google acknowledged the phased rollout in a GitHub discussion on February 20, confirming that AI Ultra subscribers get full access while other tiers receive partial rollout "as capacity permits." No timeline was given for completing the rollout.
What is notably absent: any Google employee response on the multiple forum threads documenting 90-hour lockouts, phantom quota drain, or migration bugs. The threads contain only user-to-user troubleshooting. Google's generic quota policy states that "when there's a large increase in activity, we may change limits to maintain quality" and that specified rate limits are "not guaranteed."
A separate issue compounded the frustration: gemini-3-pro-preview became unreachable after the 3.1 announcement, forcing users to migrate to a model they couldn't reliably access, with no transition period.
The pattern is familiar. Google launches a model with benchmark numbers that lead on 12 of 18 evaluations. The technical achievement is real. The ARC-AGI-2 score is not disputed. But the infrastructure to serve it at scale is not ready, the tool-calling integration is not tested against the frameworks developers actually use, and the paying customers who are supposed to have priority access are the ones filing bug reports.
Benchmarks measure what a model can do. Quotas determine what it will do. Right now, for many paying users, the answer is nothing.
Sources:
- Gemini 3.1 Pro launch announcement - Google Blog
- ARC-AGI-2 score analysis - blockchain.news
- Benchmark breakdown - Digital Applied
- Pricing - Macaron
- 90-hour lockout - Google AI Forum
- 99-hour lockout bug - Google AI Forum
- 3-day migration lockout - Google AI Forum
- Phantom quota drain - Google AI Forum
- Faster quota drain - Google AI Forum
- Launch-day latency - Medium
- MODEL_CAPACITY_EXHAUSTED - GitHub
- LangChain4j tool-calling bug - GitHub
- n8n tool-calling bug - n8n Community
- RooCode tool-calling bug - GitHub
- RooCode OpenRouter bug - GitHub
- Cursor stuck bug - Cursor Forum
- Infinite working loop - Google AI Forum
- Android Studio infinite thinking - Google AI Forum
- Hacker News discussion
- Claude Code vs Gemini CLI - Shipyard
- Gemini 3 Pro vs Claude 4.5 - GLB GPT
- Leads most benchmarks, trails Opus 4.6 in some - Trending Topics
- Google quota policy - Google Support
- gemini-3-pro-preview unreachable - GitHub
