Gemini 2.5 Flash-Lite

Google's cheapest Gemini model pairs a 1M-token context window with $0.10/$0.40 per million token pricing, multimodal input, and 359 tokens/second throughput for high-volume production workloads.

Overview

Google released Gemini 2.5 Flash-Lite in preview on June 17, 2025, and promoted it to stable general availability on February 19, 2026. The pitch is straightforward: this is the cheapest model in the Gemini 2.5 family, optimized for latency and throughput rather than peak intelligence. At $0.10 per million input tokens and $0.40 per million output tokens, it matches Qwen3.5-Flash on price while offering Google's infrastructure, a 1M-token context window, and native multimodal processing of text, images, video, and audio.

The performance numbers tell a specific story. Flash-Lite scores 81.1% on Global-MMLU-Lite, 64.6% on GPQA Diamond, and 49.8% on AIME 2025 in non-thinking mode. These are not frontier-tier results - they are solidly mid-range. But the model generates output at 358.9 tokens per second with a time-to-first-token of 0.23 seconds, which makes it one of the fastest proprietary models available. For workloads where latency matters more than peak reasoning - classification, extraction, summarization, routing - that speed advantage is the real product.

What makes Flash-Lite interesting in the current market is the combination of 1M context, multimodal input, and rock-bottom pricing. You can feed it an entire codebase, a long video, or hundreds of pages of documents at a cost that makes experimentation essentially free. The tradeoff is clear: you are not getting Claude Opus 4.6 or Gemini 3.1 Pro reasoning quality. You are getting a fast, cheap workhorse that handles volume.

TL;DR

  • $0.10/$0.40 per million tokens - tied with Qwen3.5-Flash for cheapest frontier-family API
  • 1M token context window with multimodal input (text, images, video, audio)
  • 358.9 tokens/second output speed, 0.23s time-to-first-token
  • Mid-range intelligence (GPQA: 64.6%, AIME 2025: 49.8%) - built for throughput, not peak reasoning

Key Specifications

| Specification | Details |
| --- | --- |
| Provider | Google DeepMind |
| Model Family | Gemini 2.5 |
| Architecture | Not disclosed |
| Parameters | Not disclosed |
| Context Window | 1,048,576 tokens (1M) input |
| Max Output | 65,536 tokens |
| Input Modalities | Text, Image, Video, Audio, PDF |
| Output Modality | Text |
| Thinking Mode | Supported (toggleable with budgets) |
| Tool Use | Google Search grounding, code execution |
| Knowledge Cutoff | January 2025 |
| Input Price | $0.10/M tokens |
| Output Price | $0.40/M tokens |
| Release Date | June 17, 2025 (preview), February 19, 2026 (GA) |
| License | Proprietary (hosted API only) |
| Availability | Google AI Studio, Gemini API, Vertex AI |

Benchmark Performance

Flash-Lite sits in the budget tier, so the right comparisons are against GPT-4o mini and other cost-optimized models, not against frontier reasoning systems. The benchmarks below report non-thinking mode results, the default configuration for latency-sensitive production use:

| Benchmark | Gemini 2.5 Flash-Lite | GPT-4o mini | Gemini 2.0 Flash |
| --- | --- | --- | --- |
| Global-MMLU-Lite (multilingual) | 81.1 | 82.0 (MMLU) | - |
| GPQA Diamond (PhD-level science) | 64.6 | 40.2 | - |
| AIME 2025 (competition math) | 49.8 | - | - |
| MMMU (visual reasoning) | 72.9 | - | - |
| LiveCodeBench (coding) | 33.7 | - | - |
| SWE-bench Verified (agentic coding) | 31.6 | - | - |
| FACTS Grounding (factual accuracy) | 84.1 | - | - |

The GPQA Diamond gap is striking: 64.6% versus GPT-4o mini's 40.2%. That is a 24-point lead on graduate-level science reasoning, which suggests Flash-Lite inherits meaningful reasoning capability from the Gemini 2.5 architecture even at its reduced size. The FACTS Grounding score of 84.1% (ranked #3 overall) indicates the model is unusually reliable at staying grounded in provided context - a valuable property for retrieval-augmented generation pipelines.

The coding numbers are less impressive. LiveCodeBench at 33.7% and SWE-bench Verified at 31.6% are functional but not competitive with dedicated coding models. For code-heavy workloads, GPT-5.3 Codex or the Qwen 3.5 series are better choices.

Key Capabilities

Flash-Lite's primary value proposition is throughput at scale. At 358.9 tokens per second, it processes requests roughly 2-3x faster than most competing models in the same price tier. The 0.23-second time-to-first-token means users get near-instant responses, which matters for interactive applications and real-time pipelines. If you are building a classification service, a document extraction pipeline, or a routing layer that needs to process thousands of requests per minute, Flash-Lite's speed profile is purpose-built for that use case.
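A back-of-the-envelope latency model makes the speed profile concrete. The sketch below uses only the published figures quoted above (0.23s time-to-first-token, 358.9 tokens/second); the function names and the concurrency model are illustrative assumptions, not part of any Google SDK.

```python
# Rough latency model for Gemini 2.5 Flash-Lite, built from the published
# figures in this article. Illustrative only: real latency varies with
# prompt size, region, and load.

TTFT_SECONDS = 0.23   # published time-to-first-token
OUTPUT_TPS = 358.9    # published output tokens per second


def estimated_latency(output_tokens: int) -> float:
    """Approximate end-to-end generation time for one request, in seconds."""
    return TTFT_SECONDS + output_tokens / OUTPUT_TPS


def requests_per_minute(output_tokens: int, concurrency: int = 1) -> float:
    """Approximate sustained throughput for a pool of concurrent requests."""
    return concurrency * 60.0 / estimated_latency(output_tokens)


# A ~10-token classification label completes in roughly a quarter second;
# even a 500-token summary lands well under two seconds.
print(f"{estimated_latency(10):.2f}s")
print(f"{estimated_latency(500):.2f}s")
```

Under this model, a classification service running 50 concurrent requests with short labels clears thousands of requests per minute on latency alone, before any batching.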

The 1M-token context window at $0.10 per million input tokens changes the economics of long-context applications. Processing a 500-page PDF costs roughly $0.05 in input tokens. Ingesting an entire medium-sized codebase for analysis costs under $1. Combined with multimodal input support - images, video frames, audio transcription - this opens up high-volume document processing, video analysis, and multi-format extraction workflows that would be cost-prohibitive with frontier models. The FACTS Grounding score of 84.1% suggests the model actually uses that long context effectively rather than losing information in the middle.
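The cost arithmetic above is easy to verify. This sketch assumes roughly 1,000 tokens per PDF page (actual tokenization varies with layout and content) and treats the "medium-sized codebase" as a few million tokens spread across multiple requests, since a single request is capped at the 1M-token window.

```python
# Input-cost arithmetic for Gemini 2.5 Flash-Lite at $0.10 per million
# input tokens. The tokens-per-page estimate is an assumption.

INPUT_PRICE_PER_M = 0.10  # USD per million input tokens


def input_cost(tokens: int) -> float:
    """Input cost in USD for a given token count."""
    return tokens / 1_000_000 * INPUT_PRICE_PER_M


pdf_tokens = 500 * 1_000        # ~500-page PDF at ~1,000 tokens/page
codebase_tokens = 5_000_000     # a medium-sized codebase, batched over requests

print(f"${input_cost(pdf_tokens):.2f}")       # ~$0.05
print(f"${input_cost(codebase_tokens):.2f}")  # $0.50
```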

Thinking mode adds an optional reasoning layer. When enabled, the model generates internal reasoning chains before producing output, trading latency for improved accuracy on harder problems. Google provides thinking budgets to control this tradeoff. For most production deployments, non-thinking mode will be the default - the speed advantage is why you choose Flash-Lite in the first place. But having the option to enable deeper reasoning for specific request types within the same model simplifies architecture decisions.
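One way to exploit that flexibility is a thin routing layer that picks a thinking budget per request type before calling the API. The task names and budget values below are illustrative assumptions; the Gemini API exposes this control as a thinking budget measured in tokens, with 0 disabling thinking entirely.

```python
# Illustrative per-request-type thinking-budget router. One model serves
# both fast and hard paths; only the budget changes. Task names and budget
# values are assumptions for this sketch.

THINKING_BUDGETS = {
    "classify": 0,     # latency-critical: thinking disabled
    "extract": 0,
    "summarize": 0,
    "analyze": 1024,   # harder tasks: allow a reasoning budget
}


def thinking_budget_for(task: str) -> int:
    """Pick a thinking budget for a request type (default: disabled)."""
    return THINKING_BUDGETS.get(task, 0)


print(thinking_budget_for("classify"))  # 0
print(thinking_budget_for("analyze"))   # 1024
```

The returned budget would then be passed into the request's thinking configuration, so the fast path and the reasoning path share one model, one prompt format, and one deployment.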

Pricing and Availability

Flash-Lite is available through Google AI Studio (free tier for experimentation), the Gemini API, and Vertex AI for enterprise deployments.

| Provider | Input Cost/M | Output Cost/M | Context |
| --- | --- | --- | --- |
| Gemini 2.5 Flash-Lite | $0.10 | $0.40 | 1M |
| Qwen3.5-Flash | $0.10 | $0.40 | 1M |
| GPT-4o mini | $0.15 | $0.60 | 128K |
| Claude Haiku 3.5 | $0.80 | $4.00 | 200K |

Flash-Lite ties with Qwen3.5-Flash on raw token pricing but offers the advantage of Google's global infrastructure, which typically means lower latency for users in North America and Europe. GPT-4o mini is 50% more expensive on input and has a significantly smaller context window. Claude Haiku 3.5 costs 8x more on input. For pure cost optimization, Flash-Lite and Qwen3.5-Flash are the current price floor for capable multimodal APIs.
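The price gaps compound quickly at volume. This sketch computes daily spend for a hypothetical workload of 10M input and 2M output tokens per day, using the prices from the comparison table; the workload size is an assumption chosen for illustration.

```python
# Daily API spend comparison using the pricing table above.
# Workload: 10M input tokens and 2M output tokens per day (hypothetical).

PRICING = {  # (input $/M tokens, output $/M tokens)
    "Gemini 2.5 Flash-Lite": (0.10, 0.40),
    "Qwen3.5-Flash": (0.10, 0.40),
    "GPT-4o mini": (0.15, 0.60),
    "Claude Haiku 3.5": (0.80, 4.00),
}


def daily_cost(model: str, input_m: float, output_m: float) -> float:
    """Cost in USD for input_m million input and output_m million output tokens."""
    inp, out = PRICING[model]
    return input_m * inp + output_m * out


for model in PRICING:
    print(f"{model}: ${daily_cost(model, 10, 2):.2f}/day")
```

At this volume, Flash-Lite runs $1.80/day against $2.70 for GPT-4o mini and $16.00 for Claude Haiku 3.5, which is the difference between a rounding error and a line item over a year.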

Strengths

  • Cheapest Gemini model at $0.10/$0.40 per million tokens - tied for lowest-cost multimodal API
  • 1M token context window with genuine long-context retrieval capability
  • 358.9 tokens/second throughput and 0.23s TTFT - one of the fastest proprietary models
  • Native multimodal: text, images, video, and audio in a single endpoint
  • Strong factual grounding (84.1% FACTS Grounding, ranked #3 overall)
  • Optional thinking mode with configurable budgets for harder tasks
  • Google infrastructure with global availability and enterprise compliance

Weaknesses

  • Parameters and architecture not disclosed - no self-hosting or fine-tuning option
  • Coding benchmarks are mediocre (LiveCodeBench: 33.7%, SWE-bench: 31.6%)
  • Intelligence ceiling is noticeably below Flash and Pro variants on reasoning tasks
  • Long-context retrieval at 128K scored only 16.6% on MRCR v2 - the 1M window may degrade on needle-in-haystack tasks
  • No open-weight variant - you are locked into Google's API ecosystem
  • Limited community tooling compared to OpenAI's ecosystem

About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.