Gemini Omni Flash
Google DeepMind's multimodal video generation model that creates 10-second clips with native audio from text, images, or video inputs - and lets you refine results through conversation.

TL;DR
- Creates 3-10 second 720p videos with native audio from text, images, or existing video clips - and takes iterative edits through chat prompts
- $0.10 per second via API ($1 per 10-second clip); free for YouTube Shorts users and included in Google AI Plus/Pro/Ultra subscriptions
- Behind Veo 3.1 on raw output quality and max resolution, but unique for conversational editing - you can refine a shot without regenerating from scratch
Overview
Gemini Omni Flash is Google DeepMind's first model in the Gemini Omni family, announced at Google I/O in May 2026 and released to the API on June 30, 2026. Its core idea differs from every other video generation model on the market: rather than treating each generation as a one-shot text-to-video prompt, Omni Flash is designed for iterative, conversational workflows where you point at what's wrong and the model fixes only that.
The model accepts any combination of text, images, and video as input and outputs 720p video with synchronized audio, up to 10 seconds per clip. What's unusual is the previous_interaction_id API parameter, which lets you chain edits across multiple turns. You can create a clip, tell it "change the background to a rainy street but keep the character," and it carries the original context forward instead of starting fresh. No other widely available video generation model offers this natively through an API today.
Google positions Omni Flash as a complement to Veo 3.1 rather than a replacement. Veo handles single-shot quality generation up to 4K. Omni handles iterative, context-aware production that might involve five or six rounds of edits before a clip is finalized. Both exist within the Google ecosystem, and the intended use case is running Omni for storyboarding and drafts, then passing the best candidate to Veo for a final high-resolution render.
Key Specifications
| Specification | Details |
|---|---|
| Provider | Google DeepMind |
| Model Family | Gemini Omni |
| Parameters | Not disclosed |
| Context Window | 1,048,576 tokens |
| Input Modalities | Text, image (JPEG/PNG), video (via Files API) |
| Output Format | MP4, 720p, 24fps |
| Max Clip Length | 10 seconds |
| Aspect Ratios | 16:9 (landscape), 9:16 (portrait) |
| API Name | gemini-omni-flash-preview |
| Release Date | 2026-05-19 (consumer); 2026-06-30 (API preview) |
| License | Proprietary; SynthID watermark on all outputs |
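
The limits in the table are easy to check before sending a request. The sketch below is a local pre-flight check only, not part of any Google SDK: the function name and error messages are hypothetical, the 10-second cap and aspect ratios come from the table, and the 3-second minimum is taken from the TL;DR.

```python
# Limits as stated in this article; they may change while the model is in preview.
MIN_SECONDS = 3            # lower bound quoted in the TL;DR
MAX_SECONDS = 10           # "Max Clip Length" row
ASPECT_RATIOS = {"16:9", "9:16"}


def check_clip_request(duration_seconds: float, aspect_ratio: str) -> list[str]:
    """Return a list of problems; an empty list means the request fits the limits."""
    problems = []
    if not MIN_SECONDS <= duration_seconds <= MAX_SECONDS:
        problems.append(
            f"duration {duration_seconds}s is outside {MIN_SECONDS}-{MAX_SECONDS}s"
        )
    if aspect_ratio not in ASPECT_RATIOS:
        problems.append(
            f"aspect ratio {aspect_ratio!r} is not one of {sorted(ASPECT_RATIOS)}"
        )
    return problems
```

Returning a list rather than raising on the first failure lets a caller report every problem with a request at once.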
Benchmark Performance
No formal public benchmarks are yet available for Gemini Omni Flash on the major video generation evaluation suites. It was added to the Artificial Analysis Video Arena Text-to-Video leaderboard on June 11, 2026 and the Video Edit leaderboard on June 30, 2026, but rankings need time to accumulate enough votes to be meaningful. The table below uses available third-party qualitative assessments and timing data.
| Model | Max Resolution | Max Duration | Native Audio | API Price/Sec | Conversational Editing | Arena Status |
|---|---|---|---|---|---|---|
| Gemini Omni Flash | 720p | 10s | Yes | $0.10 | Yes | Ranking building up |
| Veo 3.1 | 4K | 60s+ | Yes | $0.40 (1080p) | No | Top 3 |
| Sora 2 | 1080p | 25s | Yes | $0.10-$0.70 | No | Sunsetting Sept 2026 |
| Seedance 2.0 | 1080p | 15s | Yes | Varies | No | #1 Text-to-Video |
| Kling 3.0 | 1080p | 15s | Yes | Varies | No | Top 5 |
The relevant benchmark for Omni Flash isn't a standard text-to-video quality score. It's the Video Edit arena, where the ability to apply targeted changes matters more than peak generation quality. Early user testing reports that Omni Flash handles background swaps, style transfers, and simple object replacements well, but struggles with complex motion edits where two moving elements need to interact differently after the edit.
Pure generation quality - judged on a single prompt with no editing - places Omni Flash below Seedance 2.0 and Veo 3.1 by most accounts. The model's physics handling on complex scenes also trails Sora 2, which led most blind tests on physics realism before its announced sunsetting.
Gemini Omni Flash is built around the idea that video creation is a conversation, not a single prompt.
Key Capabilities
Omni Flash is best understood through its four distinct task types, each specified via the task parameter in the API: text_to_video, image_to_video, reference_to_video, and edit. The reference_to_video task is the most novel - you supply up to seven reference images and up to three short video clips, and the model synthesizes a new clip that maintains character identity, setting, and visual style across all of them. For anyone doing character-consistent narrative video, this is hard to replicate with other tools.
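The task names and reference limits above can be encoded as a small guard. This is a minimal sketch under stated assumptions: validate_task is a hypothetical helper, not a function from Google's SDK, and it applies the seven-image and three-clip caps only to reference_to_video because that is the only task the article ties them to.

```python
# Task names and reference limits as quoted in the article.
TASKS = ("text_to_video", "image_to_video", "reference_to_video", "edit")
MAX_REFERENCE_IMAGES = 7
MAX_REFERENCE_VIDEOS = 3


def validate_task(task: str, image_count: int = 0, video_count: int = 0) -> None:
    """Raise ValueError if the task name or reference counts are out of range.

    Hypothetical local check; the real API may enforce different rules.
    """
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    if task == "reference_to_video":
        if image_count > MAX_REFERENCE_IMAGES:
            raise ValueError(
                f"{image_count} reference images exceeds the limit of {MAX_REFERENCE_IMAGES}"
            )
        if video_count > MAX_REFERENCE_VIDEOS:
            raise ValueError(
                f"{video_count} reference clips exceeds the limit of {MAX_REFERENCE_VIDEOS}"
            )
```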
The edit task, paired with previous_interaction_id, is what separates Omni Flash from every other API on this list. You generate a clip, get the interaction ID back, and pass it in your next call along with a natural-language edit instruction. The model uses the prior context to apply only the change you specified. This isn't perfect - complex structural edits still sometimes cause unintended changes to other elements - but it dramatically reduces the number of full regenerations needed to get a clip right.
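The generate-then-edit loop described above can be sketched without any network calls. Only the model name, the task values, and previous_interaction_id come from this article. The "model" and "prompt" field names, the "interaction_id" response field, and the sender functions are assumptions made for illustration, so check the developer docs for the real request shape.

```python
import itertools
from typing import Callable, Optional

MODEL = "gemini-omni-flash-preview"


def build_request(task: str, prompt: str,
                  previous_interaction_id: Optional[str] = None) -> dict:
    """Build a request body; field names other than task and
    previous_interaction_id are hypothetical."""
    body = {"model": MODEL, "task": task, "prompt": prompt}
    if previous_interaction_id is not None:
        body["previous_interaction_id"] = previous_interaction_id
    return body


def run_session(send: Callable[[dict], dict], first_prompt: str,
                edit_instructions: list[str]) -> list[dict]:
    """Generate once, then chain each edit to the turn before it."""
    request = build_request("text_to_video", first_prompt)
    requests = [request]
    interaction_id = send(request)["interaction_id"]  # assumed response field
    for instruction in edit_instructions:
        request = build_request("edit", instruction, interaction_id)
        requests.append(request)
        interaction_id = send(request)["interaction_id"]
    return requests


def make_fake_sender() -> Callable[[dict], dict]:
    """Stand-in for a real client: hands back sequential interaction IDs."""
    counter = itertools.count(1)

    def send(request: dict) -> dict:
        return {"interaction_id": f"interaction-{next(counter)}"}

    return send
```

Passing the sender in as a function keeps the chaining logic testable offline; a real client would replace make_fake_sender with a call to the API.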
Native audio generation is built in, not bolted on. Every clip includes synchronized audio by default. You can't upload audio as a separate input track yet (the documentation notes this is unsupported), but the model produces ambient sound, dialogue sync, and background music automatically based on the visual context and any text cues you provide.
Pricing and Availability
Gemini Omni Flash arrived at consumer level first: all Google AI Plus, Pro, and Ultra subscribers get access through the Gemini app and Google Flow as of the May 19, 2026 launch. YouTube Shorts users get access at no cost. API and enterprise access rolled out on June 30, 2026 via Google AI Studio and the Gemini Enterprise Agent Platform.
API pricing works out to $0.10 per second of 720p video output, calculated at 5,792 output tokens per second under the standard $17.50/M video token rate. Input tokens are $1.50/M. A 10-second clip - the current maximum - costs roughly $1.00 in output tokens plus input token costs. The 50% Batch API discount applies when processing video at scale, bringing batch cost to $0.05/second.
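The arithmetic behind those figures is simple enough to reproduce. The sketch below uses only the rates quoted above and covers output tokens alone; input tokens at $1.50/M are extra and depend on the prompt and reference media. The function name is illustrative.

```python
# Rates as quoted in this article; preview pricing may change.
OUTPUT_TOKENS_PER_SECOND = 5_792
OUTPUT_PRICE_PER_MILLION_USD = 17.50
BATCH_DISCOUNT = 0.50


def output_cost_usd(seconds: float, batch: bool = False) -> float:
    """Estimate the output-token cost of a 720p clip of the given length."""
    tokens = seconds * OUTPUT_TOKENS_PER_SECOND
    cost = tokens / 1_000_000 * OUTPUT_PRICE_PER_MILLION_USD
    if batch:
        cost *= 1 - BATCH_DISCOUNT
    return cost


print(round(output_cost_usd(1), 4))               # 0.1014 per second
print(round(output_cost_usd(10), 4))              # 1.0136 for a 10-second clip
print(round(output_cost_usd(10, batch=True), 4))  # 0.5068 with the batch discount
```

The exact per-second figure is slightly above $0.10, which is why a full 10-second clip comes to roughly $1.01 before input costs.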
For comparison, Veo 3.1 Standard charges $0.40/second at 1080p. Sora 2 ranges from $0.10/second (720p) to $0.70/second (1080p Pro). Omni Flash is competitively priced for 720p work and costs a quarter of Veo 3.1 Standard's per-second rate, though that Veo figure is for 1080p output rather than 720p.
Free access through YouTube Shorts and the consumer Gemini app has no direct API equivalent - you can't get free video generation through the API even on free-tier accounts.
Conversational video editing is the distinguishing capability: edit a clip by describing what to change, not by regenerating from scratch.
Strengths and Weaknesses
Strengths
- Conversational editing via previous_interaction_id is unique among production-grade APIs
- Accepts up to seven reference images and three video clips for cross-modal consistency
- Native synchronized audio on every clip, no post-processing required
- $0.10/second places it at parity with Sora 2's base tier while offering editing features Sora doesn't have
- SynthID watermarking built in for provenance verification
- 1M-token context window supports long multi-turn edit sessions
- Free access at consumer level through YouTube Shorts and Google AI subscriptions
Weaknesses
- Capped at 720p - no 1080p or 4K path currently, unlike Veo 3.1
- Maximum 10 seconds per clip, below the 15-25 second limits of most competitors
- Pure generation quality trails Seedance 2.0 and Veo 3.1 on blind evaluation
- Still in preview; pricing and availability may change
- English only at full quality; other languages vary
- No system instructions, temperature control, or top_p settings - less configurable than text models
Related Coverage
- Veo 3.1 - Google's dedicated high-quality video model, produces up to 4K, 60+ second clips
- Seedance 2.0 - Current #1 on the Artificial Analysis Video Arena text-to-video leaderboard
- Sora 2 - OpenAI's video generation model; sunsetting September 2026
- Kling 3.0 - Strong competitor at 1080p and 15-second clips
- Runway Gen 4.5 - Established video generation platform with broader feature set
- Nano Banana 2 (Gemini 3.1 Flash Image) - Google's companion image generation model, also available via API
- Video Generation Benchmarks Leaderboard - Current rankings across all major video generation models
Sources:
- Introducing Gemini Omni - Google Blog
- Gemini Omni Flash Developer Docs
- Gemini Omni Flash Model Specification
- Gemini API Pricing
- Start building with Nano Banana 2 Lite and Gemini Omni Flash - Google Blog
- Google's Gemini Omni turns images, audio, and text into video - TechCrunch
- Runware launches developer API access for Gemini Omni Flash
- Gemini Omni Flash (Public Preview) - Google Cloud Docs
- Gemini Omni vs Veo 3.1: What Changed, Which Is Better?
- Omni Flash vs Veo, Sora 2 & Seedance 2.0 - WaveSpeed
✓ Last verified July 2, 2026
