Name: Gemini 2.5 Pro
Author: Google DeepMind

Gemini 2.5 Pro is Google DeepMind's most capable model as of its 2025 launch - a natively multimodal reasoning model that handles text, images, audio, and video within a 1-million-token context window. Google first released it as an experimental preview on March 25, 2025, then pushed it to general availability on June 17, 2025. It was the first Gemini model built from the ground up with thinking capabilities as a core feature rather than a bolt-on addition.

TL;DR

Google's flagship reasoning model: 84% GPQA Diamond, 63.8% SWE-bench Verified, 86.7% AIME 2025
1M-token context window, $1.25/M input tokens for prompts under 200K
Topped the LMArena leaderboard at launch by 40 points over competitors; later models have since surpassed it

At release, Gemini 2.5 Pro landed at the top of the LMArena (Chatbot Arena) leaderboard with roughly 1370 Elo - a 40-point jump over the next-highest models at the time. Epoch AI independently verified its 84% GPQA Diamond score. The model sits above the measured PhD expert level on that benchmark. For developers and enterprises, it's available through the Gemini API and Vertex AI.

Key Specifications

Specification	Details
Provider	Google DeepMind
Model Family	Gemini 2.5
Parameters	Not disclosed
Context Window	1,048,576 tokens input / 65,536 tokens output
Input Price	$1.25/M tokens (prompts ≤200K); $2.50/M tokens (prompts >200K)
Output Price	$10.00/M tokens (prompts ≤200K); $15.00/M tokens (prompts >200K)
Release Date	June 17, 2025 (GA); March 25, 2025 (experimental preview)
License	Proprietary
Availability	Gemini API, Google AI Studio, Vertex AI, Gemini Advanced

Benchmark Performance

Google benchmarked the model against Claude 3.7 Sonnet, OpenAI o3-mini, DeepSeek R1, and Grok 3 Beta at launch. Numbers below are from the March-June 2025 evaluation period.

Benchmark	Gemini 2.5 Pro	Claude 3.7 Sonnet	o3-mini
GPQA Diamond	84.0%	84.8%	79.7%
AIME 2025 (pass@1)	86.7%	-	86.5%
SWE-bench Verified	63.8%	70.3%	49.3%
Humanity's Last Exam	18.8%	8.9%	14.0%
MRCR (128K context)	91.5%	-	36.3%
LiveCodeBench v5	70.4%	-	74.1%

The results tell a mixed story. Gemini 2.5 Pro dominates on long-context tasks - its MRCR score of 91.5% compares to 36.3% for o3-mini, which confirms real comprehension advantage with large documents. On Humanity's Last Exam, it's the only model in this group to break 18%, ahead of o3-mini's 14% and well above Claude 3.7's 8.9%.

On pure coding tasks, Claude 3.7 Sonnet's SWE-bench score of 70.3% beats Gemini 2.5 Pro's 63.8%. LiveCodeBench v5 also shows o3-mini edging ahead at 74.1% versus 70.4%. Where Gemini 2.5 Pro wins is reasoning under ambiguity and long-context coherence - not raw code generation speed.

Google AI Studio interface showing Gemini 2.5 Pro selected as the active reasoning model Google AI Studio with Gemini 2.5 Pro selected. The Thinking mode toggle and "Set thinking budget" controls are visible in the settings panel on the right. Source: commons.wikimedia.org

Key Capabilities

Native Multimodal Understanding

Gemini 2.5 Pro processes text, images, audio, video, and PDF documents in a single prompt. The video processing capability covers up to 3 hours of content per call, with state-of-the-art performance on video understanding benchmarks. Audio understanding extends to live streaming audio with the model distinguishing device-directed speech from background noise. On MMMU visual reasoning, it scores 79.6%, above Claude Opus 4's 76.5%.

Thinking Mode and Deep Think

The model ships with an integrated "thinking mode" that exposes its chain-of-thought reasoning before delivering a final answer. Google added a more aggressive variant called "Deep Think" - currently in testing with select users - which assesses multiple hypotheses before committing to a response. Thinking tokens count as standard output tokens, so there's no separate pricing tier for reasoning compute unlike some competitors. The model decides internally how long to deliberate based on the prompt's complexity.

Long-Context Coherence

The 1,048,576-token input window is the headline number. The more relevant figure is the MRCR score: 91.5% at 128K context versus 36.3% for o3-mini. Gemini 2.5 Pro can ingest entire code repositories, legal filings, or large research corpora while maintaining coherent reference across the full context - something the long-context benchmarks leaderboard shows most models struggle with past 100K tokens.

Function Calling and Agentic Use

The model supports structured outputs, function calling, batch APIs, and context caching. For agentic workflows, the SWE-bench coding agent leaderboard shows it competitive but not the top pick for pure coding agents. Its advantage in agentic settings is coherence over multi-step tasks that require both reasoning and reading large documents together.

Android XR smart glasses and headset collage from Google I/O 2025, showing Gemini-powered wearable devices Android XR platform announced at Google I/O 2025. Gemini 2.5 Pro powers the AI backbone for both smart glasses and XR headsets. Source: blog.google

Pricing and Availability

Pricing uses a two-tier input model based on prompt length. Prompts under 200K tokens cost $1.25/M input and $10.00/M output. Longer prompts (over 200K tokens) step up to $2.50/M input and $15.00/M output. At 1/8th the cost of OpenAI o3 at comparable performance levels for many tasks, this pricing is competitive.

Batch processing drops to $0.625/M input and $5.00/M output under 200K tokens. Priority tier (guaranteed capacity) costs $2.25/M input and $18.00/M output.

The free tier through Google AI Studio gives developers access to Gemini 2.5 Pro at no cost, capped at 5 requests per minute and 100 requests per day. Note that Google cut the free tier limits in December 2025 - 100 RPD is roughly usable for personal experimentation but inadequate for production testing.

Compared to Gemini 2.5 Flash-Lite ($0.10/M input), Pro costs 12.5x more. For most high-volume tasks where reasoning depth isn't required, Flash or Flash-Lite wins on economics. Pro is the right choice for complex multi-step analysis, legal document review, scientific research, and tasks that actually need the 1M context window.

Context caching is available at a 90% discount on cached tokens ($0.125/M input) for repeated large-context operations.

Strengths

Long-context coherence at 1M tokens, with MRCR 91.5% at 128K - the clearest measurable advantage over most competitors
Native multimodal in a single model: no separate image/audio/video model to route to
Thinking mode with transparent chain-of-thought at no extra cost
Strong science and math reasoning: 84% GPQA Diamond (verified by Epoch AI), 86.7% AIME 2025
Competitive pricing relative to o3 and Claude Opus for similar reasoning capability

Weaknesses

SWE-bench Verified score of 63.8% trails Claude 3.7 Sonnet (70.3%) for pure agentic coding
High time-to-first-token at 21.98 seconds versus a median of 2.76 seconds for non-thinking models
User reports (June-July 2025) of hallucination rate increases and context abandonment in long multi-turn conversations, particularly for complex instruction following
Free tier reduced to 100 RPD in December 2025, limiting developer evaluation
Parameters not disclosed, which makes independent architectural analysis impossible

Gemini 3.1 Pro - Google's subsequent flagship, for comparison
Gemini 2.5 Flash-Lite - The budget tier in the same model family
SWE-bench Coding Agent Leaderboard - Full coding benchmark rankings
Chatbot Arena Elo Rankings - Human preference rankings where this model debuted at #1

Sources: