Gemini 2.5 Flash-Lite
Google's cheapest Gemini model pairs a 1M-token context window with $0.10/$0.40 per million token pricing, multimodal input, and 359 tokens/second throughput for high-volume production workloads.

Overview
Google released Gemini 2.5 Flash-Lite in preview on June 17, 2025, and promoted it to stable general availability on February 19, 2026. The pitch is straightforward: this is the cheapest model in the Gemini 2.5 family, optimized for latency and throughput rather than peak intelligence. At $0.10 per million input tokens and $0.40 per million output tokens, it matches Qwen3.5-Flash on price while offering Google's infrastructure, a 1M-token context window, and native multimodal processing of text, images, video, and audio.
The performance numbers tell a specific story. Flash-Lite scores 81.1% on Global-MMLU-Lite, 64.6% on GPQA Diamond, and 49.8% on AIME 2025 in non-thinking mode. These are not frontier-tier results - they are solidly mid-range. But the model generates output at 358.9 tokens per second with a time-to-first-token of 0.23 seconds, which makes it one of the fastest proprietary models available. For workloads where latency matters more than peak reasoning - classification, extraction, summarization, routing - that speed advantage is the real product.
What makes Flash-Lite interesting in the current market is the combination of 1M context, multimodal input, and rock-bottom pricing. You can feed it an entire codebase, a long video, or hundreds of pages of documents at a cost that makes experimentation essentially free. The tradeoff is clear: you are not getting Claude Opus 4.6 or Gemini 3.1 Pro reasoning quality. You are getting a fast, cheap workhorse that handles volume.
TL;DR
- $0.10/$0.40 per million tokens - tied with Qwen3.5-Flash for cheapest frontier-family API
- 1M token context window with multimodal input (text, images, video, audio)
- 358.9 tokens/second output speed, 0.23s time-to-first-token
- Mid-range intelligence (GPQA: 64.6%, AIME 2025: 49.8%) - built for throughput, not peak reasoning
Key Specifications
| Specification | Details |
|---|---|
| Provider | Google DeepMind |
| Model Family | Gemini 2.5 |
| Architecture | Not disclosed |
| Parameters | Not disclosed |
| Context Window | 1,048,576 tokens (1M) input |
| Max Output | 65,536 tokens |
| Input Modalities | Text, Image, Video, Audio, PDF |
| Output Modality | Text |
| Thinking Mode | Supported (toggleable with budgets) |
| Tool Use | Google Search grounding, code execution |
| Knowledge Cutoff | January 2025 |
| Input Price | $0.10/M tokens |
| Output Price | $0.40/M tokens |
| Release Date | June 17, 2025 (preview), February 19, 2026 (GA) |
| License | Proprietary (hosted API only) |
| Availability | Google AI Studio, Gemini API, Vertex AI |
Benchmark Performance
Flash-Lite sits in the budget tier, so the right comparisons are against GPT-4o mini and other cost-optimized models, not against frontier reasoning systems. The benchmarks below use non-thinking mode results, which is the default configuration for latency-sensitive production use:
| Benchmark | Gemini 2.5 Flash-Lite | GPT-4o mini |
|---|---|---|
| Global-MMLU-Lite (multilingual) | 81.1 | 82.0 (MMLU) |
| GPQA Diamond (PhD-level science) | 64.6 | 40.2 |
| AIME 2025 (competition math) | 49.8 | - |
| MMMU (visual reasoning) | 72.9 | - |
| LiveCodeBench (coding) | 33.7 | - |
| SWE-bench Verified (agentic coding) | 31.6 | - |
| FACTS Grounding (factual accuracy) | 84.1 | - |
The GPQA Diamond gap is striking: 64.6% versus GPT-4o mini's 40.2%. That is a 24-point lead on graduate-level science reasoning, which suggests Flash-Lite inherits meaningful reasoning capability from the Gemini 2.5 architecture even at its reduced size. The FACTS Grounding score of 84.1% (ranked #3 overall) indicates the model is unusually reliable at staying grounded in provided context - a valuable property for retrieval-augmented generation pipelines.
The coding numbers are less impressive. LiveCodeBench at 33.7% and SWE-bench Verified at 31.6% are functional but not competitive with dedicated coding models. For code-heavy workloads, GPT-5.3 Codex or the Qwen3.5 series are better choices.
Key Capabilities
Flash-Lite's primary value proposition is throughput at scale. At 358.9 tokens per second, it processes requests roughly 2-3x faster than most competing models in the same price tier. The 0.23-second time-to-first-token means users get near-instant responses, which matters for interactive applications and real-time pipelines. If you are building a classification service, a document extraction pipeline, or a routing layer that needs to process thousands of requests per minute, Flash-Lite's speed profile is purpose-built for that use case.
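As a back-of-envelope check on that speed profile, end-to-end latency can be approximated as time-to-first-token plus output length divided by throughput. This is a rough sketch using the figures above; it ignores network round-trips and queuing:

```python
def estimate_latency_s(output_tokens: int,
                       ttft_s: float = 0.23,
                       tokens_per_s: float = 358.9) -> float:
    """Approximate end-to-end latency: time to first token plus generation time."""
    return ttft_s + output_tokens / tokens_per_s

# A 200-token classification response completes in well under a second.
print(round(estimate_latency_s(200), 2))  # → 0.79
```

At these speeds, the TTFT term dominates for short responses, which is exactly the profile a routing or classification layer cares about.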
The 1M-token context window at $0.10 per million input tokens changes the economics of long-context applications. Processing a 500-page PDF costs roughly $0.05 in input tokens. Ingesting an entire medium-sized codebase for analysis costs under $1. Combined with multimodal input support - images, video frames, audio transcription - this opens up high-volume document processing, video analysis, and multi-format extraction workflows that would be cost-prohibitive with frontier models. The FACTS Grounding score of 84.1% suggests the model stays anchored to supplied context, though grounding accuracy is a different property from needle-in-haystack retrieval across the full window.
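The economics above reduce to a simple per-request cost formula at the published rates. The token counts in the example are illustrative assumptions, not measurements:

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_per_m: float = 0.10,
                     output_per_m: float = 0.40) -> float:
    """Cost of one request at Flash-Lite's per-million-token list prices."""
    return (input_tokens / 1_000_000) * input_per_m \
         + (output_tokens / 1_000_000) * output_per_m

# ~400k input tokens (a large document set) plus a 2k-token summary:
print(round(request_cost_usd(400_000, 2_000), 4))  # → 0.0408
```

Even a request that nearly fills half the context window costs a few cents, which is what makes bulk ingestion workflows viable.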
Thinking mode adds an optional reasoning layer. When enabled, the model generates internal reasoning chains before producing output, trading latency for improved accuracy on harder problems. Google provides thinking budgets to control this tradeoff. For most production deployments, non-thinking mode will be the default - the speed advantage is why you choose Flash-Lite in the first place. But having the option to enable deeper reasoning for specific request types within the same model simplifies architecture decisions.
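One way to exploit that per-request flexibility is a routing policy that assigns a thinking budget by task type, with 0 keeping the fast non-thinking path. This is a hypothetical sketch: the task labels and budget values are illustrative choices, not from Google's documentation:

```python
# Illustrative budgets; tune against your own latency/accuracy measurements.
THINKING_BUDGETS = {
    "classification": 0,   # latency-sensitive: keep non-thinking mode
    "extraction": 0,
    "analysis": 2048,      # harder requests get an internal reasoning budget
}

def thinking_budget(task_type: str) -> int:
    """Token budget for internal reasoning; 0 keeps the fast non-thinking path."""
    return THINKING_BUDGETS.get(task_type, 0)
```

The returned value would then be passed as the thinking budget in the generation config for that request, so one model serves both the fast path and the occasional hard case.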
Pricing and Availability
Flash-Lite is available through Google AI Studio (free tier for experimentation), the Gemini API, and Vertex AI for enterprise deployments.
| Provider | Input Cost/M | Output Cost/M | Context |
|---|---|---|---|
| Gemini 2.5 Flash-Lite | $0.10 | $0.40 | 1M |
| Qwen3.5-Flash | $0.10 | $0.40 | 1M |
| GPT-4o mini | $0.15 | $0.60 | 128K |
| Claude Haiku 3.5 | $0.80 | $4.00 | 200K |
Flash-Lite ties with Qwen3.5-Flash on raw token pricing but offers the advantage of Google's global infrastructure, which typically means lower latency for users in North America and Europe. GPT-4o mini is 50% more expensive on input and has a significantly smaller context window. Claude Haiku 3.5 costs 8x as much on input. For pure cost optimization, Flash-Lite and Qwen3.5-Flash are the current price floor for capable multimodal APIs.
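At fleet scale, those per-million rates compound quickly. A sketch of monthly cost using the table's prices, with an illustrative (assumed) workload of 1B input and 100M output tokens:

```python
# $ per million tokens (input, output), from the comparison table above.
PRICES = {
    "Gemini 2.5 Flash-Lite": (0.10, 0.40),
    "Qwen3.5-Flash": (0.10, 0.40),
    "GPT-4o mini": (0.15, 0.60),
    "Claude Haiku 3.5": (0.80, 4.00),
}

def monthly_cost_usd(model: str, input_m: float, output_m: float) -> float:
    """Monthly bill for a workload measured in millions of tokens."""
    in_rate, out_rate = PRICES[model]
    return input_m * in_rate + output_m * out_rate

# 1B input / 100M output tokens per month:
print(monthly_cost_usd("Gemini 2.5 Flash-Lite", 1_000, 100))  # → 140.0
print(monthly_cost_usd("GPT-4o mini", 1_000, 100))            # → 210.0
```

At this volume the gap to GPT-4o mini is $70/month and the gap to Claude Haiku 3.5 is over $1,000/month, which is why the price floor matters for high-throughput services.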
Strengths
- Cheapest Gemini model at $0.10/$0.40 per million tokens - tied for lowest-cost multimodal API
- 1M token context window for document-, video-, and codebase-scale inputs
- 358.9 tokens/second throughput and 0.23s TTFT - one of the fastest proprietary models
- Native multimodal: text, images, video, and audio in a single endpoint
- Strong factual grounding (84.1% FACTS Grounding, ranked #3 overall)
- Optional thinking mode with configurable budgets for harder tasks
- Google infrastructure with global availability and enterprise compliance
Weaknesses
- Parameters and architecture not disclosed - no self-hosting or fine-tuning option
- Coding benchmarks are mediocre (LiveCodeBench: 33.7%, SWE-bench: 31.6%)
- Intelligence ceiling is noticeably below Flash and Pro variants on reasoning tasks
- Long-context retrieval at 128K scored only 16.6% on MRCR v2 - the 1M window may degrade on needle-in-haystack tasks
- No open-weight variant - you are locked into Google's API ecosystem
- Limited community tooling compared to OpenAI's ecosystem
Related Coverage
- Qwen3.5-Flash - Alibaba's price-matched competitor with identical $0.10/$0.40 pricing
- Gemini 3.1 Pro - Google's frontier reasoning model for harder tasks
- Open Source vs Proprietary AI - The tradeoff between API convenience and self-hosting control
- Open Source LLM Leaderboard - Where open alternatives stand
Sources
- Gemini 2.5 Flash-Lite is now stable and generally available - Google Developers Blog
- Gemini 2.5 Flash-Lite - Google DeepMind
- Gemini Developer API Pricing - Google AI for Developers
- Gemini 2.5 Flash-Lite Intelligence & Performance Analysis - Artificial Analysis
- Gemini 2.5 Flash-Lite: Pricing, Context Window, Benchmarks - LLM Stats
- Gemini Models Documentation - Google AI for Developers
