Gemini 2.5 Pro
Google DeepMind's flagship thinking model with 1M-token context, 84% GPQA Diamond, and native multimodal understanding of text, images, audio, and video.

Gemini 2.5 Pro is Google DeepMind's most capable model as of its 2025 launch - a natively multimodal reasoning model that handles text, images, audio, and video within a 1-million-token context window. Google first released it as an experimental preview on March 25, 2025, then pushed it to general availability on June 17, 2025. It was the first Gemini model built from the ground up with thinking capabilities as a core feature rather than a bolt-on addition.
TL;DR
- Google's flagship reasoning model: 84% GPQA Diamond, 63.8% SWE-bench Verified, 86.7% AIME 2025
- 1M-token context window, $1.25/M input tokens for prompts under 200K
- Topped the LMArena leaderboard at launch by 40 points over competitors; later models have since surpassed it
At release, Gemini 2.5 Pro landed at the top of the LMArena (Chatbot Arena) leaderboard with roughly 1370 Elo - a 40-point jump over the next-highest models at the time. Epoch AI independently verified its 84% GPQA Diamond score. The model sits above the measured PhD expert level on that benchmark. For developers and enterprises, it's available through the Gemini API and Vertex AI.
Key Specifications
| Specification | Details |
|---|---|
| Provider | Google DeepMind |
| Model Family | Gemini 2.5 |
| Parameters | Not disclosed |
| Context Window | 1,048,576 tokens input / 65,536 tokens output |
| Input Price | $1.25/M tokens (prompts ≤200K); $2.50/M tokens (prompts >200K) |
| Output Price | $10.00/M tokens (prompts ≤200K); $15.00/M tokens (prompts >200K) |
| Release Date | June 17, 2025 (GA); March 25, 2025 (experimental preview) |
| License | Proprietary |
| Availability | Gemini API, Google AI Studio, Vertex AI, Gemini Advanced |
Benchmark Performance
Google benchmarked the model against Claude 3.7 Sonnet, OpenAI o3-mini, DeepSeek R1, and Grok 3 Beta at launch. Numbers below are from the March-June 2025 evaluation period.
| Benchmark | Gemini 2.5 Pro | Claude 3.7 Sonnet | o3-mini |
|---|---|---|---|
| GPQA Diamond | 84.0% | 84.8% | 79.7% |
| AIME 2025 (pass@1) | 86.7% | - | 86.5% |
| SWE-bench Verified | 63.8% | 70.3% | 49.3% |
| Humanity's Last Exam | 18.8% | 8.9% | 14.0% |
| MRCR (128K context) | 91.5% | - | 36.3% |
| LiveCodeBench v5 | 70.4% | - | 74.1% |
The results tell a mixed story. Gemini 2.5 Pro dominates on long-context tasks - its MRCR score of 91.5% compares to 36.3% for o3-mini, which confirms real comprehension advantage with large documents. On Humanity's Last Exam, it's the only model in this group to break 18%, ahead of o3-mini's 14% and well above Claude 3.7's 8.9%.
On pure coding tasks, Claude 3.7 Sonnet's SWE-bench score of 70.3% beats Gemini 2.5 Pro's 63.8%. LiveCodeBench v5 also shows o3-mini edging ahead at 74.1% versus 70.4%. Where Gemini 2.5 Pro wins is reasoning under ambiguity and long-context coherence - not raw code generation speed.
Google AI Studio with Gemini 2.5 Pro selected. The Thinking mode toggle and "Set thinking budget" controls are visible in the settings panel on the right.
Source: commons.wikimedia.org
Key Capabilities
Native Multimodal Understanding
Gemini 2.5 Pro processes text, images, audio, video, and PDF documents in a single prompt. The video processing capability covers up to 3 hours of content per call, with state-of-the-art performance on video understanding benchmarks. Audio understanding extends to live streaming audio with the model distinguishing device-directed speech from background noise. On MMMU visual reasoning, it scores 79.6%, above Claude Opus 4's 76.5%.
Thinking Mode and Deep Think
The model ships with an integrated "thinking mode" that exposes its chain-of-thought reasoning before delivering a final answer. Google added a more aggressive variant called "Deep Think" - currently in testing with select users - which assesses multiple hypotheses before committing to a response. Thinking tokens count as standard output tokens, so there's no separate pricing tier for reasoning compute unlike some competitors. The model decides internally how long to deliberate based on the prompt's complexity.
Long-Context Coherence
The 1,048,576-token input window is the headline number. The more relevant figure is the MRCR score: 91.5% at 128K context versus 36.3% for o3-mini. Gemini 2.5 Pro can ingest entire code repositories, legal filings, or large research corpora while maintaining coherent reference across the full context - something the long-context benchmarks leaderboard shows most models struggle with past 100K tokens.
Function Calling and Agentic Use
The model supports structured outputs, function calling, batch APIs, and context caching. For agentic workflows, the SWE-bench coding agent leaderboard shows it competitive but not the top pick for pure coding agents. Its advantage in agentic settings is coherence over multi-step tasks that require both reasoning and reading large documents together.
Android XR platform announced at Google I/O 2025. Gemini 2.5 Pro powers the AI backbone for both smart glasses and XR headsets.
Source: blog.google
Pricing and Availability
Pricing uses a two-tier input model based on prompt length. Prompts under 200K tokens cost $1.25/M input and $10.00/M output. Longer prompts (over 200K tokens) step up to $2.50/M input and $15.00/M output. At 1/8th the cost of OpenAI o3 at comparable performance levels for many tasks, this pricing is competitive.
Batch processing drops to $0.625/M input and $5.00/M output under 200K tokens. Priority tier (guaranteed capacity) costs $2.25/M input and $18.00/M output.
The free tier through Google AI Studio gives developers access to Gemini 2.5 Pro at no cost, capped at 5 requests per minute and 100 requests per day. Note that Google cut the free tier limits in December 2025 - 100 RPD is roughly usable for personal experimentation but inadequate for production testing.
Compared to Gemini 2.5 Flash-Lite ($0.10/M input), Pro costs 12.5x more. For most high-volume tasks where reasoning depth isn't required, Flash or Flash-Lite wins on economics. Pro is the right choice for complex multi-step analysis, legal document review, scientific research, and tasks that actually need the 1M context window.
Context caching is available at a 90% discount on cached tokens ($0.125/M input) for repeated large-context operations.
Strengths
- Long-context coherence at 1M tokens, with MRCR 91.5% at 128K - the clearest measurable advantage over most competitors
- Native multimodal in a single model: no separate image/audio/video model to route to
- Thinking mode with transparent chain-of-thought at no extra cost
- Strong science and math reasoning: 84% GPQA Diamond (verified by Epoch AI), 86.7% AIME 2025
- Competitive pricing relative to o3 and Claude Opus for similar reasoning capability
Weaknesses
- SWE-bench Verified score of 63.8% trails Claude 3.7 Sonnet (70.3%) for pure agentic coding
- High time-to-first-token at 21.98 seconds versus a median of 2.76 seconds for non-thinking models
- User reports (June-July 2025) of hallucination rate increases and context abandonment in long multi-turn conversations, particularly for complex instruction following
- Free tier reduced to 100 RPD in December 2025, limiting developer evaluation
- Parameters not disclosed, which makes independent architectural analysis impossible
Related Coverage
- Gemini 3.1 Pro - Google's subsequent flagship, for comparison
- Gemini 2.5 Flash-Lite - The budget tier in the same model family
- SWE-bench Coding Agent Leaderboard - Full coding benchmark rankings
- Chatbot Arena Elo Rankings - Human preference rankings where this model debuted at #1
Sources:
- Google DeepMind: Gemini 2.5 release announcement
- Gemini 2.5 Flash and Pro GA - Google Cloud Blog
- Gemini Developer API pricing
- Epoch AI GPQA Diamond verification
- LMArena #1 ranking announcement
- Google I/O 2025: Gemini on Android XR
- Gemini 2.5 Pro model card (Google DeepMind)
- Artificial Analysis - Gemini 2.5 Pro performance
- Gemini API rate limits
✓ Last verified May 19, 2026
