Name: Gemini 3.1 Flash TTS
Author: Google

Overview

Gemini 3.1 Flash TTS is Google's text-to-speech model, launched in public preview on April 15, 2026 through the Gemini API, Google AI Studio, Vertex AI, and Google Vids for Workspace. The model ID is gemini-3.1-flash-tts-preview. Its headline pitch isn't raw voice naturalness. It's inline control: 200-plus audio tags that let you direct emotion, pacing, and non-verbal sounds mid-sentence using square-bracket syntax embedded in the input text.

TL;DR

Best for multi-language voice content where directors need mid-sentence control over tone, pace, and non-verbal sounds.
30 voices, 70-plus languages, 200-plus audio tags, PCM 24kHz mono 16-bit output, 2-speaker cap, SynthID watermarked, no streaming.
Elo 1,211 at launch on the Artificial Analysis TTS Arena - #2 behind Inworld TTS 1.5 Max, ahead of ElevenLabs v3.

Flash TTS is a batch synthesis system, not a streaming one. For audiobooks, video voice-overs, and IVR scripts where you can afford a full-response wait, this is the product. For live phone agents you want Gemini 3.1 Flash Live, the real-time sibling model that runs interactive conversation at roughly 960ms latency. Both shipped in the same generation. Google Senior Product Manager Vilobh Meshram and Principal Research Engineer Max Gubin are credited on the announcement post, and the model also shipped inside Google Vids with 16 new languages on top of the eight Vids already supported.

Image: the official announcement graphic from Google's Gemini 3.1 Flash TTS launch post on April 15, 2026. Source: blog.google

Key Specifications

Specification	Details
Provider	Google
Model Family	Gemini
Model ID	`gemini-3.1-flash-tts-preview`
Parameters	Not disclosed
Input Token Limit	8,192
Output Token Limit	16,384
Input Types	Text
Output Type	Audio (PCM 24kHz mono, 16-bit)
Prebuilt Voices	30 (named after astronomical objects)
Languages	70-plus
Audio Tags	200-plus
Multi-Speaker	Up to 2 speakers per request
Streaming	Not supported
SynthID Watermarking	Enabled on all outputs
Input Price	$1.00/M tokens (text)
Output Price	$20.00/M tokens (audio), 25 audio tokens per second
Batch Price	$0.50/M input, $10.00/M output (50% discount)
Release Date	April 15, 2026
Status	Public preview
License	Proprietary (API access only)
Knowledge Cutoff	January 2025

The 30 voices carry astronomical names (Zephyr, Puck, Charon, Kore, Fenrir, Leda, Orus, Aoede, and 22 others). Audio tokens convert at 25 tokens per second, so a minute of speech is 1,500 output tokens: roughly $0.030 at the standard $20.00/M rate and $0.015 at the $10.00/M batch rate.
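The output format listed in the table (PCM, 24kHz, mono, 16-bit) needs a container before most players will open it. The sketch below wraps such samples in a WAV header using Python's standard library. It assumes the API hands back raw, headerless little-endian samples; the function names are illustrative, not part of any Google SDK.

```python
import io
import wave

SAMPLE_RATE_HZ = 24_000      # from the spec table
CHANNELS = 1                 # mono
SAMPLE_WIDTH_BYTES = 2       # 16-bit samples


def pcm_to_wav_bytes(pcm: bytes) -> bytes:
    """Wrap headerless 16-bit mono PCM in a WAV container (in memory)."""
    if len(pcm) % (SAMPLE_WIDTH_BYTES * CHANNELS) != 0:
        raise ValueError("PCM length is not a whole number of 16-bit samples")
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav_file.setframerate(SAMPLE_RATE_HZ)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


def pcm_duration_seconds(pcm: bytes) -> float:
    """Playback length implied by the byte count at 24kHz mono 16-bit."""
    return len(pcm) / (SAMPLE_RATE_HZ * CHANNELS * SAMPLE_WIDTH_BYTES)
```

At these settings one second of audio is 48,000 bytes, so the byte count of a response is a quick sanity check on how much speech came back.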

Benchmark Performance

The main benchmark is the Artificial Analysis TTS Arena, a blind human preference test. At launch on April 15, Google reported Elo 1,211, which was the #2 spot on the board.

Model	TTS Arena Elo	Languages	Style Control	Streaming
Inworld TTS 1.5 Max	1,212	15	Natural-language prompts	Yes
Gemini 3.1 Flash TTS	1,211 (launch) / 1,206 (current)	70-plus	200+ audio tags	No
Eleven v3	1,179	30-plus	Style sliders, voice cloning	Yes
Speech 2.8 HD	1,165	Not published	Prompt-level style	Yes
Inworld TTS 1 Max	1,164	15	Natural-language prompts	Yes

Elo numbers on the TTS Arena move with each new vote, so the 1,211 launch figure and the current 1,206 reading are both accurate snapshots from the same leaderboard. Against the 1,212 listed for Inworld TTS 1.5 Max, that is a gap of one point on the launch figure and six on the current one.
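To put gaps of this size in perspective, the sketch below applies the textbook Elo expected-score formula to the ratings in the table. Whether Artificial Analysis computes its ratings with exactly this formula is an assumption; the point is only that a few Elo points translate into a near coin-flip in head-to-head preference.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Textbook Elo: probability that A is preferred over B in one comparison."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings taken from the table above.
pairs = {
    "Flash TTS (launch) vs Inworld TTS 1.5 Max": (1211, 1212),
    "Flash TTS (current) vs Inworld TTS 1.5 Max": (1206, 1212),
    "Flash TTS (launch) vs Eleven v3": (1211, 1179),
}
for label, (a, b) in pairs.items():
    print(f"{label}: {elo_expected_score(a, b):.3f}")
```

Under this formula a one-point deficit corresponds to an expected preference rate just below 50%, while the 32-point lead over Eleven v3 corresponds to roughly 55%.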

The Elo score undersells Flash TTS as a category outlier. Inworld's 15-language range sits well short of Google's 70-plus. ElevenLabs charges roughly $0.12 per 1,000 characters for Eleven v3, about $0.108 per minute at 900 characters per minute against roughly $0.030 for Flash TTS (1,500 audio tokens at $20.00/M), or three to four times the cost. And the Arena ranks general voice attractiveness, not task-specific performance: naturalness on medical terminology, stability on 30-minute documents, and behavior under adversarial Unicode aren't captured. Our AI Voice and Speech Leaderboard tracks the broader set.

Image: audio waveform illustration. Flash TTS outputs PCM 24kHz mono 16-bit audio, a format most downstream audio pipelines accept. Source: unsplash.com

Key Capabilities

The audio tags system is what sets Flash TTS apart from every other commercial TTS API. Tags use square brackets embedded inline with the text and cover four categories: emotional state ([determination], [frustration], [excitement], [nervousness]), non-verbal sounds ([laughs], [sighs], [gasp], [whispers]), pacing ([slow], [fast], [short pause], [long pause]), and tone ([sarcastic], [positive], [serious], [neutral]). Two rules apply: tags can't sit adjacent without intervening text or punctuation, and the tag vocabulary is English-only even when the spoken output is in another language.
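The two rules above are easy to check before a request is sent. The following is a minimal pre-flight check, not an official validator: it flags tags separated only by whitespace, and uses an ASCII-only test as a crude stand-in for "English-only" (an assumption, since the full tag vocabulary is not reproduced here). The function and pattern names are hypothetical.

```python
import re

TAG = re.compile(r"\[([^\[\]]+)\]")      # e.g. [laughs], [short pause]
ADJACENT = re.compile(r"\]\s*\[")         # two tags with only whitespace between


def find_tag_problems(text: str) -> list[str]:
    """Return human-readable problems; an empty list means no rule was tripped."""
    problems = []
    for match in ADJACENT.finditer(text):
        problems.append(f"adjacent tags at offset {match.start()}")
    for match in TAG.finditer(text):
        tag = match.group(1)
        if not tag.isascii():
            # Proxy check only: non-ASCII tags cannot be English vocabulary.
            problems.append(f"non-English tag: [{tag}]")
    return problems


print(find_tag_problems("[excitement] We won! [laughs]"))
print(find_tag_problems("[sighs] [slow] Fine, have it your way."))
```

Putting any word or punctuation mark between two tags, as in "[sighs] Fine. [slow] Have it your way.", satisfies the first rule.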

The Voice Director framework shipped with the model documents the recommended prompting pattern: an Audio Profile for the speaker, a Scene for the environment, Director's Notes on tone and accent, and then the transcript with tags embedded. Simon Willison's launch-day writeup confirmed accent steering works - swapping a "Brixton, London" character description to "Newcastle" or "Exeter, Devon" produced audibly different regional British accents from the same script.
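The four-part pattern can be kept as structured data and rendered into a prompt at call time, which makes accent swaps like the one above a one-field change. This is a sketch under assumptions: the section labels and layout are illustrative rather than Google's documented template, and the character text is invented for the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VoiceDirection:
    audio_profile: str      # who is speaking
    scene: str              # where they are
    directors_notes: str    # tone and accent guidance
    transcript: str         # the script, with inline [tags]

    def to_prompt(self) -> str:
        sections = [
            ("Audio Profile", self.audio_profile),
            ("Scene", self.scene),
            ("Director's Notes", self.directors_notes),
            ("Transcript", self.transcript),
        ]
        return "\n\n".join(f"{title}:\n{body.strip()}" for title, body in sections)


# Hypothetical character, used only to show the accent swap.
brixton = VoiceDirection(
    audio_profile="Jaz, a late-night radio host from Brixton, London.",
    scene="A small studio, close microphone, rain outside.",
    directors_notes="Warm, unhurried delivery with a local accent.",
    transcript="[slow] Good evening. [short pause] You're listening to the late show.",
)
newcastle = replace(
    brixton, audio_profile="Jaz, a late-night radio host from Newcastle."
)
print(newcastle.to_prompt())
```

Keeping the transcript identical across variants isolates the effect of the profile change when comparing outputs by ear.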

Multi-speaker output caps at two speakers per API call. You assign names and voice configurations, and the model handles turn-taking. Podcast content with three-plus voices requires multiple calls and your own stitching. SynthID watermarking is enabled on every output, though Google doesn't publish false-positive or false-negative rates, and robustness under adversarial re-encoding isn't documented.
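For scripts with three or more voices, the "multiple calls and your own stitching" step starts with cutting the script into runs that never involve more than two speakers. A minimal greedy sketch follows; the function name and the (speaker, line) tuple format are assumptions for illustration, not part of the API.

```python
def split_by_speaker_cap(
    turns: list[tuple[str, str]], max_speakers: int = 2
) -> list[list[tuple[str, str]]]:
    """Cut a script into consecutive segments with at most max_speakers voices each."""
    segments: list[list[tuple[str, str]]] = []
    current: list[tuple[str, str]] = []
    speakers: set[str] = set()
    for speaker, line in turns:
        # A new voice would exceed the cap: close the segment and start another.
        if speaker not in speakers and len(speakers) == max_speakers:
            segments.append(current)
            current, speakers = [], set()
        speakers.add(speaker)
        current.append((speaker, line))
    if current:
        segments.append(current)
    return segments


script = [
    ("Ana", "Welcome back to the show."),
    ("Ben", "Great to be here."),
    ("Ana", "We have a guest today."),
    ("Cleo", "[laughs] Thanks for having me."),
    ("Ana", "Let's get started."),
    ("Ben", "First question is mine."),
]
for segment in split_by_speaker_cap(script):
    print(sorted({speaker for speaker, _ in segment}), len(segment), "turns")
```

Each segment then becomes one request. Assuming every response is headerless PCM in the same 24kHz mono 16-bit format, the returned byte strings can be concatenated in order, provided each speaker is mapped to the same voice in every call.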

Pricing and Availability

Standard paid tier: $1.00/M input tokens (text), $20.00/M output tokens (audio). With audio tokens at 25 per second, a minute of speech is 1,500 output tokens, or roughly $0.030 per minute; a typical input prompt adds well under a cent on top. Batch mode halves both rates to $0.50 and $10.00, landing around $0.015 per minute. A free tier exists on the developer API, subject to preview rate limits Google hasn't published numerically.
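The per-minute arithmetic can be recomputed from the token rates directly, which is useful because preview pricing may change. The sketch below uses only the rates quoted in this article; the helper names are illustrative, and treating the whole 16,384-token output limit as audio tokens is an assumption.

```python
AUDIO_TOKENS_PER_SECOND = 25

# (input USD per million tokens, output USD per million tokens)
PRICES_PER_MILLION = {
    "standard": (1.00, 20.00),
    "batch": (0.50, 10.00),
}


def estimate_cost_usd(
    audio_seconds: float, input_tokens: int = 0, tier: str = "standard"
) -> float:
    """Cost of one request: text tokens in, audio tokens out."""
    input_price, output_price = PRICES_PER_MILLION[tier]
    output_tokens = audio_seconds * AUDIO_TOKENS_PER_SECOND
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def max_audio_seconds(output_token_limit: int = 16_384) -> float:
    """Longest clip one request could return if every output token is audio."""
    return output_token_limit / AUDIO_TOKENS_PER_SECOND


for tier in PRICES_PER_MILLION:
    per_minute = estimate_cost_usd(60, input_tokens=200, tier=tier)
    print(f"{tier}: ${per_minute:.4f} per minute of audio")
print(f"max clip: {max_audio_seconds() / 60:.1f} minutes")
```

Under these assumptions a single request tops out at a little under eleven minutes of audio, so audiobook-length material has to be chunked regardless of the input limit.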

Competitor pricing for context:

OpenAI gpt-4o-mini-tts: $0.60/M input tokens, $12.00/M audio output, roughly $0.015 per minute.
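ElevenLabs v3: $0.12 per 1,000 characters at the API tier, about $0.108 per minute at a cadence of 900 characters per minute - roughly 3.6 times the $0.030 per minute implied by Flash TTS's $20.00/M output rate at 25 audio tokens per second.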
Inworld TTS 1.5 Max: Roughly $0.01 per minute ($10 per million characters), the cheapest of the top-ranked options.
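The character-priced entries above only become comparable with token-priced ones after converting to cost per minute, and that conversion depends on how many characters are spoken per minute. A small sketch, using the 900-characters-per-minute cadence from this comparison as the default; the function name is hypothetical and other cadences are shown only to illustrate sensitivity.

```python
def per_minute_from_characters(
    usd_per_1k_chars: float, chars_per_minute: int = 900
) -> float:
    """Convert a per-1,000-character price into cost per minute of speech."""
    return usd_per_1k_chars * chars_per_minute / 1000


# Prices as quoted in the list above.
character_priced = {
    "ElevenLabs v3": 0.12,               # USD per 1,000 characters
    "Inworld TTS 1.5 Max": 10 / 1000,    # $10 per million characters
}
for name, price in character_priced.items():
    for cadence in (750, 900, 1050):
        cost = per_minute_from_characters(price, cadence)
        print(f"{name} at {cadence} chars/min: ${cost:.4f} per minute")
```

Slower narration lowers the per-minute figure for character-priced vendors, while a vendor billing audio tokens per second charges the same per minute regardless of pace.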

Access channels: the Gemini API endpoint at generativelanguage.googleapis.com/v1beta/models/gemini-3.1-flash-tts-preview:generateContent, the AI Studio audio playground, and Vertex AI for enterprise accounts. A Colab quickstart sits in the google-gemini/cookbook repository. Google Vids ships 30 conversational voices across 24 Workspace languages, a subset of the full 70-plus API list. Preview status means pricing, rate limits, and feature set can shift before GA.
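For the REST channel, a request might be assembled as below. This is a sketch under a significant assumption: the JSON field names (responseModalities, speechConfig, prebuiltVoiceConfig, inlineData) follow the request shape documented for Google's earlier Gemini TTS preview models, and it is assumed rather than confirmed that gemini-3.1-flash-tts-preview accepts the same shape. The helper names are illustrative. Check the current API reference or the cookbook quickstart before relying on it.

```python
import base64
import json
import urllib.request

ENDPOINT = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    "gemini-3.1-flash-tts-preview:generateContent"
)


def build_tts_request(text: str, voice_name: str = "Kore") -> dict:
    """Request body for single-speaker synthesis (field names are assumed)."""
    return {
        "contents": [{"parts": [{"text": text}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice_name}}
            },
        },
    }


def extract_pcm(response: dict) -> bytes:
    """Decode the base64 audio payload from the first candidate (assumed layout)."""
    part = response["candidates"][0]["content"]["parts"][0]
    return base64.b64decode(part["inlineData"]["data"])


def synthesize(text: str, api_key: str, voice_name: str = "Kore") -> bytes:
    """Send one blocking request and return raw PCM bytes. Needs network access."""
    request = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(build_tts_request(text, voice_name)).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return extract_pcm(json.load(reply))
```

Because the model does not stream, the call returns only once the full clip has been generated, so long scripts are better sent from a background job than from an interactive request handler.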

Strengths and Weaknesses

Strengths

Unique inline audio tags system with 200-plus entries. No other commercial TTS API ships this pattern at this scope.
70-plus languages, the broadest language coverage among the top-ranked commercial TTS models.
30 prebuilt voices with explicit regional accent control via the Voice Director framework.
Token pricing in the same range as OpenAI gpt-4o-mini-tts ($1.00/$20.00 versus $0.60/$12.00 per million input/output tokens) and far below ElevenLabs at the API tier.
Native two-speaker dialogue in a single API call.
Every output carries a SynthID watermark.
Near the top of the leaderboard at launch (Elo 1,211 on Artificial Analysis TTS Arena, #2 behind Inworld TTS 1.5 Max), still #2 on current readings.

Weaknesses

No streaming support. For live voice applications use Gemini 3.1 Flash Live or OpenAI's streaming TTS instead.
Two-speaker cap on multi-speaker output. Podcast content with three-plus distinct voices needs multiple API calls and manual stitching.
No voice cloning. ElevenLabs and Mistral's Voxtral both support zero-shot or short-sample cloning; Flash TTS doesn't.
Audio tag vocabulary is English-only even when the spoken output is in another language - extra cognitive load for non-English workflows.
Still preview status. Pricing, rate limits, and feature set may change before GA.
SynthID robustness under adversarial post-processing isn't publicly documented.
Proprietary API only, no open weights or on-premises option.

News: Google Ships Gemini 3.1 Flash TTS With 200 Audio Tags - full launch report.
Sibling model: Gemini 3.1 Pro - the non-audio flagship in the same generation.
Voice rival: Mistral Voxtral - open-weights TTS alternative. Our Voxtral TTS review has hands-on notes.
Leaderboard: AI Voice and Speech Leaderboard - TTS and STT rankings across the market.
Tool comparison: Best AI Voice Generators 2026 covers how Flash TTS stacks against ElevenLabs, OpenAI TTS, and the rest.
Streaming counterpart: Gemini Flash Live edges GPT-4 Realtime - the real-time voice companion model.