Name: Gemini Omni Flash
Author: Google DeepMind

TL;DR

Creates 3-10 second 720p videos with native audio from text, images, or existing video clips - and takes iterative edits through chat prompts
$0.10 per second via API ($1 per 10-second clip); free for YouTube Shorts users and included in Google AI Plus/Pro/Ultra subscriptions
Behind Veo 3.1 on raw output quality and max resolution, but unique for conversational editing - you can refine a shot without regenerating from scratch

Overview

Gemini Omni Flash is Google DeepMind's first model in the Gemini Omni family, announced at Google I/O in May 2026 and released to the API on June 30, 2026. Its core idea differs from that of every other video generation model on the market: rather than treating each generation as a one-shot text-to-video prompt, Omni Flash is designed for iterative, conversational workflows where you point at what's wrong and the model fixes only that.

The model accepts any combination of text, images, and video as input and outputs 720p video with synchronized audio up to 10 seconds per clip. What's unusual is the previous_interaction_id API parameter, which lets you chain edits across multiple turns. You can create a clip, tell it "change the background to a rainy street but keep the character," and it carries the original context forward instead of starting fresh. No other widely launched video generation model offers this natively through an API today.

Google positions Omni Flash as a complement to Veo 3.1 rather than a replacement. Veo handles single-shot quality generation up to 4K. Omni handles iterative, context-aware production that might involve five or six rounds of edits before a clip is finalized. Both exist within the Google ecosystem, and the intended use case is running Omni for storyboarding and drafts, then passing the best candidate to Veo for a final high-resolution render.

Key Specifications

Specification	Details
Provider	Google DeepMind
Model Family	Gemini Omni
Parameters	Not disclosed
Context Window	1,048,576 tokens
Input Modalities	Text, image (JPEG/PNG), video (via Files API)
Output Format	MP4, 720p, 24fps
Max Clip Length	10 seconds
Aspect Ratios	16:9 (landscape), 9:16 (portrait)
API Name	`gemini-omni-flash-preview`
Release Date	2026-05-19 (consumer); 2026-06-30 (API preview)
License	Proprietary; SynthID watermark on all outputs
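The output limits in the table can be checked on the client side before a request is sent. The following is a minimal sketch, assuming the 3-10 second range from the TL;DR and the two aspect ratios listed above; the `ClipSettings` class and its field names are hypothetical and are not part of any Google SDK.

```python
from dataclasses import dataclass

ASPECT_RATIOS = ("16:9", "9:16")   # from the specification table
MIN_SECONDS, MAX_SECONDS = 3, 10   # 3-10 second clips, per the TL;DR


@dataclass(frozen=True)
class ClipSettings:
    """Hypothetical container for output settings; not an SDK type."""
    aspect_ratio: str = "16:9"
    seconds: int = 10

    def __post_init__(self) -> None:
        # Reject values outside the limits quoted in this article.
        if self.aspect_ratio not in ASPECT_RATIOS:
            raise ValueError(f"aspect ratio must be one of {ASPECT_RATIOS}")
        if not MIN_SECONDS <= self.seconds <= MAX_SECONDS:
            raise ValueError(
                f"clip length must be {MIN_SECONDS}-{MAX_SECONDS} seconds"
            )


# A portrait clip within the limits constructs without error.
portrait = ClipSettings(aspect_ratio="9:16", seconds=8)
```

Constructing `ClipSettings(seconds=30)` raises `ValueError`, which catches an over-length request before any tokens are billed.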

Benchmark Performance

No formal public benchmarks are yet available for Gemini Omni Flash on the major video generation evaluation suites. It was added to the Artificial Analysis Video Arena Text-to-Video leaderboard on June 11, 2026 and the Video Edit leaderboard on June 30, 2026, but rankings take time to build up votes. The table below uses available third-party qualitative assessments and timing data.

Model	Max Resolution	Max Duration	Native Audio	API Price/Sec	Conversational Editing	Arena Status
Gemini Omni Flash	720p	10s	Yes	$0.10	Yes	Ranking building up
Veo 3.1	4K	60s+	Yes	$0.40 (1080p)	No	Top 3
Sora 2	1080p	25s	Yes	$0.10-$0.70	No	Sunsetting Sept 2026
Seedance 2.0	1080p	15s	Yes	Varies	No	#1 Text-to-Video
Kling 3.0	1080p	15s	Yes	Varies	No	Top 5

The relevant benchmark for Omni Flash isn't a standard text-to-video quality score. It's the Video Edit arena, where the ability to apply targeted changes matters more than peak generation quality. Early user testing reports that Omni Flash handles background swaps, style transfers, and simple object replacements well, but struggles with complex motion edits where two moving elements need to interact differently after the edit.

Pure generation quality - judged on a single prompt with no editing - places Omni Flash below Seedance 2.0 and Veo 3.1 by most accounts. The model's physics handling on complex scenes also trails Sora 2, which led most blind tests on physics realism before its announced sunsetting.

Image: a video camera lens. Caption: Gemini Omni Flash is built around the idea that video creation is a conversation, not a single prompt. Source: unsplash.com

Key Capabilities

Omni Flash is best understood through its four distinct task types, each specified via the task parameter in the API: text_to_video, image_to_video, reference_to_video, and edit. The reference_to_video task is the most novel - you supply up to seven reference images and up to three short video clips, and the model synthesizes a new clip that maintains character identity, setting, and visual style across all of them. For anyone doing character-consistent narrative video, this is hard to replicate with other tools.
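The task names and reference limits described above lend themselves to a simple pre-flight check. This is a sketch only: the four task values and the seven-image and three-clip limits come from the paragraph above, while the `check_inputs` helper and its argument names are hypothetical.

```python
from typing import List, Sequence

TASKS = ("text_to_video", "image_to_video", "reference_to_video", "edit")
MAX_REFERENCE_IMAGES = 7  # limit quoted for reference_to_video
MAX_REFERENCE_CLIPS = 3   # limit quoted for reference_to_video


def check_inputs(task: str,
                 images: Sequence[str] = (),
                 clips: Sequence[str] = ()) -> List[str]:
    """Return a list of problems; an empty list means the inputs look fine."""
    problems = []
    if task not in TASKS:
        problems.append(f"unknown task: {task}")
    if task == "reference_to_video":
        if len(images) > MAX_REFERENCE_IMAGES:
            problems.append(
                f"too many reference images: {len(images)} > {MAX_REFERENCE_IMAGES}"
            )
        if len(clips) > MAX_REFERENCE_CLIPS:
            problems.append(
                f"too many reference clips: {len(clips)} > {MAX_REFERENCE_CLIPS}"
            )
    return problems


# Eight reference images exceed the seven-image limit.
too_many = check_inputs(
    "reference_to_video",
    images=[f"ref_{i}.png" for i in range(8)],
)
```

Here `too_many` holds a single message about the image count, while a call with seven images and three clips returns an empty list.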

The edit task, paired with previous_interaction_id, is what separates Omni Flash from every other API in the comparison table above. You generate a clip, get the interaction ID back, and pass it in your next call along with a natural-language edit instruction. The model uses the prior context to apply only the change you specified. This isn't perfect - complex structural edits still sometimes cause unintended changes to other elements - but it dramatically reduces the number of full regenerations needed to get a clip right.
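The two-turn flow described above might be sketched as a pair of request payloads. Only the model name `gemini-omni-flash-preview`, the `task` values, and the `previous_interaction_id` parameter come from this article; the `prompt` field, the overall payload shape, the `build_request` helper, and the placeholder interaction ID are assumptions for illustration, not the documented request format. No network call is made.

```python
from typing import Optional

MODEL = "gemini-omni-flash-preview"  # API name from the specification table


def build_request(task: str, prompt: str,
                  previous_interaction_id: Optional[str] = None) -> dict:
    """Build a hypothetical request payload (the field layout is assumed)."""
    payload = {"model": MODEL, "task": task, "prompt": prompt}
    if previous_interaction_id is not None:
        # Chain this turn to an earlier one so its context carries forward.
        payload["previous_interaction_id"] = previous_interaction_id
    return payload


# Turn 1: generate a clip from text.
first = build_request("text_to_video", "A courier cycling down a sunny street")

# Assume the service returned this ID with the first clip (placeholder value).
interaction_id = "interaction-123"

# Turn 2: describe only the change and point at the prior interaction.
edit = build_request(
    "edit",
    "Change the background to a rainy street but keep the character",
    previous_interaction_id=interaction_id,
)
```

The design point is that the second payload carries no description of the original scene: the interaction ID stands in for it, so the instruction can stay as short as the change itself.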

Native audio generation is built in, not bolted on. Every clip includes synchronized audio by default. You can't upload audio as a separate input track yet (the documentation notes this is unsupported), but the model produces ambient sound, dialogue sync, and background music automatically based on the visual context and any text cues you provide.

Pricing and Availability

Gemini Omni Flash arrived at consumer level first: all Google AI Plus, Pro, and Ultra subscribers get access through the Gemini app and Google Flow as of the May 19, 2026 launch. YouTube Shorts users get access at no cost. API and enterprise access rolled out on June 30, 2026 via Google AI Studio and the Gemini Enterprise Agent Platform.

API pricing works out to $0.10 per second of 720p video output, calculated at 5,792 output tokens per second under the standard $17.50/M video token rate. Input tokens are $1.50/M. A 10-second clip - the current maximum - costs roughly $1.00 in output tokens plus input token costs. The 50% Batch API discount applies when processing video at scale, bringing batch cost to $0.05/second.
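The arithmetic behind these figures can be reproduced directly. The sketch below uses only the rates quoted above (5,792 output tokens per second, $17.50 per million output tokens, 50% batch discount) and leaves out input token costs, which depend on the prompt and reference media; the function name is illustrative.

```python
OUTPUT_TOKENS_PER_SECOND = 5_792   # quoted for 720p output
OUTPUT_RATE_PER_MILLION = 17.50    # USD per million video output tokens
BATCH_DISCOUNT = 0.50              # Batch API discount


def output_cost(seconds: float, batch: bool = False) -> float:
    """Output-token cost in USD for a clip; input tokens are not included."""
    tokens = seconds * OUTPUT_TOKENS_PER_SECOND
    cost = tokens * OUTPUT_RATE_PER_MILLION / 1_000_000
    return cost * (1 - BATCH_DISCOUNT) if batch else cost


print(f"{output_cost(1):.4f}")               # 0.1014
print(f"{output_cost(10):.2f}")              # 1.01
print(f"{output_cost(10, batch=True):.2f}")  # 0.51
```

The unrounded per-second figure is $0.10136, which is why a 10-second clip comes to slightly more than $1.00 before input tokens are added.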

For comparison, Veo 3.1 Standard charges $0.40/second at 1080p. Sora 2 ranges from $0.10/second (720p) to $0.70/second (1080p Pro). Omni Flash is competitively priced for 720p work: it matches Sora 2's 720p rate and costs a quarter of Veo 3.1 Standard's per-second rate, though that Veo rate buys 1080p rather than 720p output.

Free access through YouTube Shorts and the consumer Gemini app has no direct API equivalent - you can't get free video generation through the API even on free-tier accounts.

Image: a professional video editing workflow. Caption: Conversational video editing is the distinguishing capability: edit a clip by describing what to change, not by regenerating from scratch. Source: unsplash.com

Strengths and Weaknesses

Strengths

Conversational editing via previous_interaction_id is unique among production-grade APIs
Accepts up to seven reference images and three video clips for cross-modal consistency
Native synchronized audio on every clip, no post-processing required
$0.10/second places it at parity with Sora 2's base tier while offering editing features Sora doesn't have
SynthID watermarking built in for provenance verification
1M-token context window supports long multi-turn edit sessions
Free access at consumer level through YouTube Shorts and Google AI subscriptions

Weaknesses

Capped at 720p - no 1080p or 4K path currently, unlike Veo 3.1
Maximum 10 seconds per clip, below the 15-25 second limits of most competitors
Pure generation quality trails Seedance 2.0 and Veo 3.1 on blind evaluation
Still in preview; pricing and availability may change
Full quality in English only; results in other languages vary
No system instructions, temperature control, or top_p settings - less configurable than text models

Related Models and Resources

Veo 3.1 - Google's dedicated high-quality video model, produces up to 4K, 60+ second clips
Seedance 2.0 - Current #1 on the Artificial Analysis Video Arena text-to-video leaderboard
Sora 2 - OpenAI's video generation model; sunsetting September 2026
Kling 3.0 - Strong competitor at 1080p and 15-second clips
Runway Gen 4.5 - Established video generation platform with broader feature set
Nano Banana 2 (Gemini 3.1 Flash Image) - Google's companion image generation model, also available via API
Video Generation Benchmarks Leaderboard - Current rankings across all major video generation models
