Gemini 2.5 Flash-Lite

Google's cheapest Gemini model pairs a 1M-token context window with $0.10/$0.40 per million token pricing, multimodal input, and 359 tokens/second throughput for high-volume production workloads.

Overview

Google released Gemini 2.5 Flash-Lite in preview on June 17, 2025, and promoted it to stable general availability on February 19, 2026. The pitch is straightforward: this is the cheapest model in the Gemini 2.5 family, optimized for latency and throughput rather than peak intelligence. At $0.10 per million input tokens and $0.40 per million output tokens, it matches Qwen3.5-Flash on price while offering Google's infrastructure, a 1M-token context window, and native multimodal processing of text, images, video, and audio.

The performance numbers tell a specific story. Flash-Lite scores 81.1% on Global-MMLU-Lite, 64.6% on GPQA Diamond, and 49.8% on AIME 2025 in non-thinking mode. These are not frontier-tier results - they are solidly mid-range. But the model generates output at 358.9 tokens per second with a time-to-first-token of 0.23 seconds, which makes it one of the fastest proprietary models available. For workloads where latency matters more than peak reasoning - classification, extraction, summarization, routing - that speed advantage is the real product.

What makes Flash-Lite interesting in the current market is the combination of 1M context, multimodal input, and rock-bottom pricing. You can feed it an entire codebase, a long video, or hundreds of pages of documents at a cost that makes experimentation essentially free. The tradeoff is clear: you are not getting Claude Opus 4.6 or Gemini 3.1 Pro reasoning quality. You are getting a fast, cheap workhorse that handles volume.

TL;DR

  • $0.10/$0.40 per million tokens - tied with Qwen3.5-Flash for cheapest frontier-family API
  • 1M token context window with multimodal input (text, images, video, audio)
  • 358.9 tokens/second output speed, 0.23s time-to-first-token
  • Mid-range intelligence (GPQA: 64.6%, AIME 2025: 49.8%) - built for throughput, not peak reasoning

Key Specifications

| Specification | Details |
| --- | --- |
| Provider | Google DeepMind |
| Model Family | Gemini 2.5 |
| Architecture | Not disclosed |
| Parameters | Not disclosed |
| Context Window | 1,048,576 tokens (1M) input |
| Max Output | 65,536 tokens |
| Input Modalities | Text, Image, Video, Audio, PDF |
| Output Modality | Text |
| Thinking Mode | Supported (toggleable with budgets) |
| Tool Use | Google Search grounding, code execution |
| Knowledge Cutoff | January 2025 |
| Input Price | $0.10/M tokens |
| Output Price | $0.40/M tokens |
| Release Date | June 17, 2025 (preview), February 19, 2026 (GA) |
| License | Proprietary (hosted API only) |
| Availability | Google AI Studio, Gemini API, Vertex AI |

Benchmark Performance

Flash-Lite sits in the budget tier, so the right comparisons are against GPT-4o mini and other cost-optimized models, not against frontier reasoning systems. The benchmarks below report non-thinking mode results, the default configuration for latency-sensitive production use:

| Benchmark | Gemini 2.5 Flash-Lite | GPT-4o mini | Gemini 2.0 Flash |
| --- | --- | --- | --- |
| Global-MMLU-Lite (multilingual) | 81.1 | 82.0 (MMLU) | - |
| GPQA Diamond (PhD-level science) | 64.6 | 40.2 | - |
| AIME 2025 (competition math) | 49.8 | - | - |
| MMMU (visual reasoning) | 72.9 | - | - |
| LiveCodeBench (coding) | 33.7 | - | - |
| SWE-bench Verified (agentic coding) | 31.6 | - | - |
| FACTS Grounding (factual accuracy) | 84.1 | - | - |

The GPQA Diamond gap is striking: 64.6% versus GPT-4o mini's 40.2%. That is a 24-point lead on graduate-level science reasoning, which suggests Flash-Lite inherits meaningful reasoning capability from the Gemini 2.5 architecture even at its reduced size. The FACTS Grounding score of 84.1% (ranked #3 overall) indicates the model is unusually reliable at staying grounded in provided context - a valuable property for retrieval-augmented generation pipelines.

The coding numbers are less impressive. LiveCodeBench at 33.7% and SWE-bench Verified at 31.6% are functional but not competitive with dedicated coding models. For code-heavy workloads, GPT-5.3 Codex or the Qwen 3.5 series are better choices.

Key Capabilities

Flash-Lite's primary value proposition is throughput at scale. At 358.9 tokens per second, it processes requests roughly 2-3x faster than most competing models in the same price tier. The 0.23-second time-to-first-token means users get near-instant responses, which matters for interactive applications and real-time pipelines. If you are building a classification service, a document extraction pipeline, or a routing layer that needs to process thousands of requests per minute, Flash-Lite's speed profile is purpose-built for that use case.
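A back-of-the-envelope latency model makes the speed profile concrete. The sketch below uses only the published figures quoted above (0.23s time-to-first-token, 358.9 tokens/second); the function names and the concurrency model are illustrative assumptions, not part of any Google SDK.

```python
# Rough latency model for Gemini 2.5 Flash-Lite, built from the published
# figures in this article. Illustrative only: real latency varies with
# prompt size, region, and load.

TTFT_SECONDS = 0.23   # published time-to-first-token
OUTPUT_TPS = 358.9    # published output tokens per second


def estimated_latency(output_tokens: int) -> float:
    """Approximate end-to-end generation time for one request, in seconds."""
    return TTFT_SECONDS + output_tokens / OUTPUT_TPS


def requests_per_minute(output_tokens: int, concurrency: int = 1) -> float:
    """Approximate sustained throughput for a pool of concurrent requests."""
    return concurrency * 60.0 / estimated_latency(output_tokens)


# A ~10-token classification label completes in roughly a quarter second;
# even a 500-token summary lands well under two seconds.
print(f"{estimated_latency(10):.2f}s")
print(f"{estimated_latency(500):.2f}s")
```

Under this model, a classification service running 50 concurrent requests with short labels clears thousands of requests per minute on latency alone, before any batching.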

The 1M-token context window at $0.10 per million input tokens changes the economics of long-context applications. Processing a 500-page PDF costs roughly $0.05 in input tokens. Ingesting an entire medium-sized codebase for analysis costs under $1. Combined with multimodal input support - images, video frames, audio transcription - this opens up high-volume document processing, video analysis, and multi-format extraction workflows that would be cost-prohibitive with frontier models. The FACTS Grounding score of 84.1% suggests the model actually uses that long context effectively rather than losing information in the middle.
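The cost arithmetic above is easy to verify. This sketch assumes roughly 1,000 tokens per PDF page (actual tokenization varies with layout and content) and treats the "medium-sized codebase" as a few million tokens spread across multiple requests, since a single request is capped at the 1M-token window.

```python
# Input-cost arithmetic for Gemini 2.5 Flash-Lite at $0.10 per million
# input tokens. The tokens-per-page estimate is an assumption.

INPUT_PRICE_PER_M = 0.10  # USD per million input tokens


def input_cost(tokens: int) -> float:
    """Input cost in USD for a given token count."""
    return tokens / 1_000_000 * INPUT_PRICE_PER_M


pdf_tokens = 500 * 1_000        # ~500-page PDF at ~1,000 tokens/page
codebase_tokens = 5_000_000     # a medium-sized codebase, batched over requests

print(f"${input_cost(pdf_tokens):.2f}")       # ~$0.05
print(f"${input_cost(codebase_tokens):.2f}")  # $0.50
```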

Thinking mode adds an optional reasoning layer. When enabled, the model generates internal reasoning chains before producing output, trading latency for improved accuracy on harder problems. Google provides thinking budgets to control this tradeoff. For most production deployments, non-thinking mode will be the default - the speed advantage is why you choose Flash-Lite in the first place. But having the option to enable deeper reasoning for specific request types within the same model simplifies architecture decisions.
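One way to exploit that flexibility is a thin routing layer that picks a thinking budget per request type before calling the API. The task names and budget values below are illustrative assumptions; the Gemini API exposes this control as a thinking budget measured in tokens, with 0 disabling thinking entirely.

```python
# Illustrative per-request-type thinking-budget router. One model serves
# both fast and hard paths; only the budget changes. Task names and budget
# values are assumptions for this sketch.

THINKING_BUDGETS = {
    "classify": 0,     # latency-critical: thinking disabled
    "extract": 0,
    "summarize": 0,
    "analyze": 1024,   # harder tasks: allow a reasoning budget
}


def thinking_budget_for(task: str) -> int:
    """Pick a thinking budget for a request type (default: disabled)."""
    return THINKING_BUDGETS.get(task, 0)


print(thinking_budget_for("classify"))  # 0
print(thinking_budget_for("analyze"))   # 1024
```

The returned budget would then be passed into the request's thinking configuration, so the fast path and the reasoning path share one model, one prompt format, and one deployment.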

Pricing and Availability

Flash-Lite is available through Google AI Studio (free tier for experimentation), the Gemini API, and Vertex AI for enterprise deployments.

| Provider | Input Cost/M | Output Cost/M | Context |
| --- | --- | --- | --- |
| Gemini 2.5 Flash-Lite | $0.10 | $0.40 | 1M |
| Qwen3.5-Flash | $0.10 | $0.40 | 1M |
| GPT-4o mini | $0.15 | $0.60 | 128K |
| Claude Haiku 3.5 | $0.80 | $4.00 | 200K |

Flash-Lite ties with Qwen3.5-Flash on raw token pricing but offers the advantage of Google's global infrastructure, which typically means lower latency for users in North America and Europe. GPT-4o mini is 50% more expensive on input and has a significantly smaller context window. Claude Haiku 3.5 costs 8x more on input. For pure cost optimization, Flash-Lite and Qwen3.5-Flash are the current price floor for capable multimodal APIs.
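The price gaps compound quickly at volume. This sketch computes daily spend for a hypothetical workload of 10M input and 2M output tokens per day, using the prices from the comparison table; the workload size is an assumption chosen for illustration.

```python
# Daily API spend comparison using the pricing table above.
# Workload: 10M input tokens and 2M output tokens per day (hypothetical).

PRICING = {  # (input $/M tokens, output $/M tokens)
    "Gemini 2.5 Flash-Lite": (0.10, 0.40),
    "Qwen3.5-Flash": (0.10, 0.40),
    "GPT-4o mini": (0.15, 0.60),
    "Claude Haiku 3.5": (0.80, 4.00),
}


def daily_cost(model: str, input_m: float, output_m: float) -> float:
    """Cost in USD for input_m million input and output_m million output tokens."""
    inp, out = PRICING[model]
    return input_m * inp + output_m * out


for model in PRICING:
    print(f"{model}: ${daily_cost(model, 10, 2):.2f}/day")
```

At this volume, Flash-Lite runs $1.80/day against $2.70 for GPT-4o mini and $16.00 for Claude Haiku 3.5, which is the difference between a rounding error and a line item over a year.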

Strengths

  • Cheapest Gemini model at $0.10/$0.40 per million tokens - tied for lowest-cost multimodal API
  • 1M token context window with genuine long-context retrieval capability
  • 358.9 tokens/second throughput and 0.23s TTFT - one of the fastest proprietary models
  • Native multimodal: text, images, video, and audio in a single endpoint
  • Strong factual grounding (84.1% FACTS Grounding, ranked #3 overall)
  • Optional thinking mode with configurable budgets for harder tasks
  • Google infrastructure with global availability and enterprise compliance

Weaknesses

  • Parameters and architecture not disclosed - no self-hosting or fine-tuning option
  • Coding benchmarks are mediocre (LiveCodeBench: 33.7%, SWE-bench: 31.6%)
  • Intelligence ceiling is noticeably below Flash and Pro variants on reasoning tasks
  • Long-context retrieval at 128K scored only 16.6% on MRCR v2 - the 1M window may degrade on needle-in-haystack tasks
  • No open-weight variant - you are locked into Google's API ecosystem
  • Limited community tooling compared to OpenAI's ecosystem

About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.