Gemini 3.1 Flash TTS - Google Voice With 200 Audio Tags
Google's Gemini 3.1 Flash TTS ships in preview with 30 voices, 70-plus languages, 200-plus inline audio tags, and Elo 1,211 on the Artificial Analysis TTS Arena.

Overview
Gemini 3.1 Flash TTS is Google's text-to-speech model, launched in public preview on April 15, 2026 through the Gemini API, Google AI Studio, Vertex AI, and Google Vids for Workspace. The model ID is gemini-3.1-flash-tts-preview. Its headline pitch isn't raw voice naturalness. It's inline control: 200-plus audio tags that let you direct emotion, pacing, and non-verbal sounds mid-sentence using square-bracket syntax embedded in the input text.
TL;DR
- Best for multi-language voice content where directors need mid-sentence control over tone, pace, and non-verbal sounds.
- 30 voices, 70-plus languages, 200-plus audio tags, PCM 24kHz mono 16-bit output, 2-speaker cap, SynthID watermarked, no streaming.
- Elo 1,211 at launch on the Artificial Analysis TTS Arena - #2 behind Inworld TTS 1.5 Max, ahead of ElevenLabs v3.
Flash TTS is a batch synthesis system, not a streaming one. For audiobooks, video voice-overs, and IVR scripts where you can afford a full-response wait, this is the product. For live phone agents you want Gemini 3.1 Flash Live, the real-time sibling model that runs interactive conversation at roughly 960ms latency. Both shipped in the same generation. Google Senior Product Manager Vilobh Meshram and Principal Research Engineer Max Gubin are credited on the announcement post, and the model also shipped inside Google Vids with 16 new languages on top of the eight Vids already supported.
The official announcement graphic from Google's Gemini 3.1 Flash TTS launch post on April 15, 2026.
Source: blog.google
Key Specifications
| Specification | Details |
|---|---|
| Provider | Google |
| Model Family | Gemini |
| Model ID | gemini-3.1-flash-tts-preview |
| Parameters | Not disclosed |
| Input Token Limit | 8,192 |
| Output Token Limit | 16,384 |
| Input Types | Text |
| Output Type | Audio (PCM 24kHz mono, 16-bit) |
| Prebuilt Voices | 30 (named after astronomical objects) |
| Languages | 70-plus |
| Audio Tags | 200-plus |
| Multi-Speaker | Up to 2 speakers per request |
| Streaming | Not supported |
| SynthID Watermarking | Enabled on all outputs |
| Input Price | $1.00/M tokens (text) |
| Output Price | $20.00/M tokens (audio), 25 audio tokens per second |
| Batch Price | $0.50/M input, $10.00/M output (50% discount) |
| Release Date | April 15, 2026 |
| Status | Public preview |
| License | Proprietary (API access only) |
| Knowledge Cutoff | January 2025 |
The 30 voices carry astronomical names (Zephyr, Puck, Charon, Kore, Fenrir, Leda, Orus, Aoede, and 22 others). Audio tokens convert at 25 tokens per second, or 1,500 per minute, which puts standard-tier output at roughly $0.03 per minute of audio and batch at roughly $0.015 per minute.
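The token figures in the table also bound how much audio a single request can return. A minimal sketch of that arithmetic, assuming the full 16,384-token output limit is available for audio tokens (the table does not say so explicitly); the function name is illustrative:

```python
AUDIO_TOKENS_PER_SECOND = 25   # from the specification table
OUTPUT_TOKEN_LIMIT = 16_384    # from the specification table


def max_audio_seconds(output_tokens: int = OUTPUT_TOKEN_LIMIT) -> float:
    """Upper bound on audio length for one request, in seconds."""
    return output_tokens / AUDIO_TOKENS_PER_SECOND


seconds = max_audio_seconds()
print(f"{seconds:.1f} s, about {seconds / 60:.1f} minutes")
```

Under that assumption, a script that runs longer than roughly eleven minutes of speech would need to be split across requests.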
Benchmark Performance
The main benchmark is the Artificial Analysis TTS Arena, a blind human preference test. At launch on April 15, Google reported Elo 1,211, which was the #2 spot on the board.
| Model | TTS Arena Elo | Languages | Style Control | Streaming |
|---|---|---|---|---|
| Inworld TTS 1.5 Max | 1,212 | 15 | Natural-language prompts | Yes |
| Gemini 3.1 Flash TTS | 1,211 (launch) / 1,206 (current) | 70-plus | 200+ audio tags | No |
| Eleven v3 | 1,179 | 30-plus | Style sliders, voice cloning | Yes |
| Speech 2.8 HD | 1,165 | Not published | Prompt-level style | Yes |
| Inworld TTS 1 Max | 1,164 | 15 | Natural-language prompts | Yes |
Elo numbers on the TTS Arena move with each new vote, so the 1,211 launch figure and the current 1,206 reading are both accurate snapshots from the same leaderboard. Inworld TTS 1.5 Max's 1,212 sat one point above the launch figure and is six points above the current reading.
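For a sense of scale, gaps this small are close to a coin flip in head-to-head terms. A minimal sketch, assuming the Arena's ratings follow the standard Elo logistic curve on a 400-point scale (the leaderboard's exact rating method is not described here):

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected share of pairwise preferences won by A under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


# Ratings taken from the table above.
inworld_vs_flash = elo_expected_score(1212, 1206)
flash_vs_eleven = elo_expected_score(1206, 1179)
print(f"{inworld_vs_flash:.3f} {flash_vs_eleven:.3f}")
```

Under that assumption, a six-point lead corresponds to winning about 51% of pairwise votes, and Flash TTS's 27-point lead over Eleven v3 to about 54%.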
The Elo score undersells Flash TTS as a category outlier. Inworld's 15-language range sits well short of Google's 70-plus. ElevenLabs charges roughly $0.12 per 1,000 characters for Eleven v3, which works out to about $0.108 per minute at 900 characters per minute, several times Flash TTS's per-minute cost. And the Arena ranks general voice attractiveness, not task-specific performance: naturalness on medical terminology, stability on 30-minute documents, or behavior under adversarial Unicode isn't captured. Our AI Voice and Speech Leaderboard tracks the broader set.
Flash TTS outputs PCM 24kHz mono 16-bit audio, the same format used by most downstream audio pipelines.
Source: unsplash.com
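Raw PCM carries no header, so most players and editors need it wrapped in a container first. A minimal sketch using Python's standard `wave` module, assuming the samples are little-endian (the byte order is not stated in the source); the helper names are illustrative:

```python
import io
import wave

SAMPLE_RATE_HZ = 24_000   # per the specification table
CHANNELS = 1              # mono
SAMPLE_WIDTH_BYTES = 2    # 16-bit samples


def pcm_to_wav_bytes(pcm: bytes) -> bytes:
    """Wrap headerless PCM in a WAV container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav_file.setframerate(SAMPLE_RATE_HZ)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


def pcm_duration_seconds(pcm: bytes) -> float:
    """Playback length implied by the byte count at the fixed output format."""
    return len(pcm) / (SAMPLE_RATE_HZ * CHANNELS * SAMPLE_WIDTH_BYTES)


# Stand-in for model output: one second of silence.
one_second_of_silence = bytes(SAMPLE_RATE_HZ * SAMPLE_WIDTH_BYTES)
wav_bytes = pcm_to_wav_bytes(one_second_of_silence)
```

The same byte-count arithmetic gives a quick sanity check on a response: 48,000 bytes per second of audio at this format.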
Key Capabilities
The audio tags system is what sets Flash TTS apart from every other commercial TTS API. Tags use square brackets embedded inline with the text and cover four categories: emotional state ([determination], [frustration], [excitement], [nervousness]), non-verbal sounds ([laughs], [sighs], [gasp], [whispers]), pacing ([slow], [fast], [short pause], [long pause]), and tone ([sarcastic], [positive], [serious], [neutral]). Two rules apply: tags can't sit adjacent without intervening text or punctuation, and the tag vocabulary is English-only even when the spoken output is in another language.
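The adjacency rule is easy to break when tags are inserted programmatically. A minimal pre-flight check, written as a sketch: it treats any bracketed span as a tag and flags two tags separated only by whitespace. It does not validate tag names against Google's vocabulary, and the function name is illustrative:

```python
import re

# Two bracketed tags with nothing but whitespace between them.
ADJACENT_TAGS = re.compile(r"\[[^\[\]]+\]\s*\[[^\[\]]+\]")


def find_adjacent_tags(text: str) -> list[str]:
    """Return each run where one tag directly follows another."""
    return [match.group(0) for match in ADJACENT_TAGS.finditer(text)]


ok_script = "[excitement] We shipped it! [short pause] Now the hard part starts."
bad_script = "[sighs] [slow] Let me start over."

print(find_adjacent_tags(ok_script))
print(find_adjacent_tags(bad_script))
```

An empty list means no adjacent tags were found; anything else points at the span to fix, for example by moving one tag after the first word or adding punctuation between them.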
The Voice Director framework shipped with the model documents the recommended prompting pattern: an Audio Profile for the speaker, a Scene for the environment, Director's Notes on tone and accent, and then the transcript with tags embedded. Simon Willison's launch-day writeup confirmed accent steering works - swapping a "Brixton, London" character description to "Newcastle" or "Exeter, Devon" produced noticeably different regional British accents from the same script.
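The four-part pattern can be assembled with a small helper. This is a sketch only: the section order follows the description above, but the exact heading text and layout Google recommends are not given here, so the labels and the `build_director_prompt` name are assumptions:

```python
def build_director_prompt(
    audio_profile: str, scene: str, directors_notes: str, transcript: str
) -> str:
    """Join the four Voice Director parts into one prompt string."""
    sections = [
        ("Audio Profile", audio_profile),
        ("Scene", scene),
        ("Director's Notes", directors_notes),
        ("Transcript", transcript),  # audio tags stay inline in this part
    ]
    return "\n\n".join(f"{title}:\n{body.strip()}" for title, body in sections)


prompt = build_director_prompt(
    audio_profile="A radio host in her 40s from Brixton, London.",
    scene="A small studio late at night, close to the microphone.",
    directors_notes="Warm and unhurried, with a clear local accent.",
    transcript="[positive] Good evening. [short pause] Thanks for staying up with us.",
)
print(prompt)
```

Keeping the character description in its own argument makes the accent experiment described above a one-line change: swap the place name in `audio_profile` and leave the transcript untouched.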
Multi-speaker output caps at two speakers per API call. You assign names and voice configurations, and the model handles turn-taking. Podcast content with three-plus voices requires multiple calls and your own stitching. SynthID watermarking is enabled on every output, though Google doesn't publish false-positive or false-negative rates, and robustness under adversarial re-encoding isn't documented.
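One way to plan the multiple calls is to cut the script wherever a third voice would enter. The sketch below is an illustration of that idea, not a documented workflow: it groups consecutive turns into batches with at most two distinct speakers each, preserving order. All names are hypothetical:

```python
Turn = tuple[str, str]  # (speaker name, line of dialogue)


def split_by_speaker_cap(turns: list[Turn], cap: int = 2) -> list[list[Turn]]:
    """Group consecutive turns so no batch has more than `cap` speakers."""
    batches: list[list[Turn]] = []
    current: list[Turn] = []
    speakers: set[str] = set()
    for speaker, line in turns:
        # A new voice that would exceed the cap starts a fresh batch.
        if speaker not in speakers and len(speakers) == cap:
            batches.append(current)
            current, speakers = [], set()
        current.append((speaker, line))
        speakers.add(speaker)
    if current:
        batches.append(current)
    return batches


script = [
    ("Ana", "Welcome back to the show."),
    ("Ben", "Glad to be here."),
    ("Ana", "Let's bring in our guest."),
    ("Cleo", "Thanks for having me."),
    ("Ben", "We have a lot of questions."),
    ("Cleo", "Go ahead."),
]
batches = split_by_speaker_cap(script)
```

Each batch maps to one request. Because the output is headerless PCM in a fixed format, the returned audio segments can be concatenated in order to rebuild the full episode.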
Pricing and Availability
Standard paid tier: $1.00/M input tokens (text), $20.00/M output tokens (audio). With audio tokens at 25 per second (1,500 per minute), output costs roughly $0.03 per minute of audio; a typical input prompt adds a fraction of a cent on top. Batch mode halves both rates to $0.50 and $10.00, landing around $0.015 per minute. A free tier exists on the developer API, subject to preview rate limits Google hasn't published numerically.
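The per-request cost follows directly from the token rates. A minimal sketch of the calculation using the published prices and the 25-tokens-per-second conversion; the 200-token prompt in the example is an illustrative assumption, as is the function name:

```python
AUDIO_TOKENS_PER_SECOND = 25  # published conversion rate


def request_cost_usd(
    input_tokens: int,
    audio_seconds: float,
    input_price_per_million: float = 1.00,
    output_price_per_million: float = 20.00,
) -> float:
    """Estimated cost of one request: text input plus generated audio."""
    output_tokens = audio_seconds * AUDIO_TOKENS_PER_SECOND
    return (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000


# One minute of audio from a hypothetical 200-token prompt.
standard = request_cost_usd(200, 60)
batch = request_cost_usd(200, 60, 0.50, 10.00)
print(f"standard ${standard:.4f}, batch ${batch:.4f}")
```

Output audio dominates the total: at these rates the text prompt contributes well under one percent of the cost of a one-minute clip.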
Competitor pricing for context:
- OpenAI gpt-4o-mini-tts: $0.60/M input tokens, $12.00/M audio output, roughly $0.015 per minute.
- ElevenLabs v3: $0.12 per 1,000 characters at the API tier, about $0.108 per minute for a 900-character cadence, several times Flash TTS's per-minute cost.
- Inworld TTS 1.5 Max: Roughly $0.01 per minute ($10 per million characters), the cheapest of the top-ranked options.
Access channels: the Gemini API endpoint at generativelanguage.googleapis.com/v1beta/models/gemini-3.1-flash-tts-preview:generateContent, the AI Studio audio playground, and Vertex AI for enterprise accounts. A Colab quickstart sits in the google-gemini/cookbook repository. Google Vids ships 30 conversational voices across 24 Workspace languages, a subset of the full 70-plus API list. Preview status means pricing, rate limits, and feature set can shift before GA.
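A request against the REST endpoint might look like the sketch below. The request and response shapes are assumptions carried over from Google's speech generation guide for earlier Gemini TTS models (`responseModalities`, `speechConfig`, base64 audio in `inlineData`) and may differ for this preview; the `GEMINI_API_KEY` environment variable and the function names are also illustrative:

```python
import base64
import json
import os
import urllib.request

ENDPOINT = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    "gemini-3.1-flash-tts-preview:generateContent"
)


def build_tts_request(text: str, voice_name: str = "Kore") -> dict:
    """Build the JSON body (assumed shape, see the note above)."""
    return {
        "contents": [{"parts": [{"text": text}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice_name}}
            },
        },
    }


def synthesize(text: str, voice_name: str = "Kore") -> bytes:
    """POST the request and return raw PCM bytes (assumed response path)."""
    request = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(build_tts_request(text, voice_name)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-goog-api-key": os.environ["GEMINI_API_KEY"],
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    encoded = body["candidates"][0]["content"]["parts"][0]["inlineData"]["data"]
    return base64.b64decode(encoded)


payload = build_tts_request("[excitement] Hello from the preview!", "Puck")
```

Since there is no streaming, `synthesize` blocks until the whole clip is ready, so callers should set timeouts and retries to match the length of the script.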
Strengths and Weaknesses
Strengths
- Unique inline audio tags system with 200-plus entries. No other commercial TTS API ships this pattern at this scope.
- 70-plus languages, the broadest language coverage among the top-ranked commercial TTS models.
- 30 prebuilt voices with explicit regional accent control via the Voice Director framework.
- Pricing close to OpenAI gpt-4o-mini-tts and far below ElevenLabs at the API tier.
- Native two-speaker dialogue in a single API call.
- Every output carries a SynthID watermark.
- Near the top of the benchmark at launch (Elo 1,211, #2 on the Artificial Analysis TTS Arena), and still #2 on current readings.
Weaknesses
- No streaming support. For live voice applications use Gemini 3.1 Flash Live or OpenAI's streaming TTS instead.
- Two-speaker cap on multi-speaker output. Podcast content with three-plus distinct voices needs multiple API calls and manual stitching.
- No voice cloning. ElevenLabs and Mistral's Voxtral both support zero-shot or short-sample cloning; Flash TTS doesn't.
- Audio tag vocabulary is English-only even when the spoken output is in another language - extra cognitive load for non-English workflows.
- Still in preview. Pricing, rate limits, and feature set may change before GA.
- SynthID robustness under adversarial post-processing isn't publicly documented.
- Proprietary API only, no open weights or on-premises option.
Related Coverage
- News: Google Ships Gemini 3.1 Flash TTS With 200 Audio Tags - full launch report.
- Sibling model: Gemini 3.1 Pro - the non-audio flagship in the same generation.
- Voice rival: Mistral Voxtral - open-weights TTS alternative. Our Voxtral TTS review has hands-on notes.
- Leaderboard: AI Voice and Speech Leaderboard - TTS and STT rankings across the market.
- Tool comparison: Best AI Voice Generators 2026 covers how Flash TTS stacks against ElevenLabs, OpenAI TTS, and the rest.
- Streaming counterpart: Gemini Flash Live edges GPT-4 Realtime - the real-time voice companion model.
Sources
- Google Blog: Gemini 3.1 Flash TTS announcement
- Google Cloud Blog: Gemini 3.1 Flash TTS on Vertex AI
- Gemini API: Gemini 3.1 Flash TTS model card
- Gemini API: Speech generation guide
- Gemini API: Pricing
- Artificial Analysis: TTS Arena leaderboard
- Google Workspace Updates: Google Vids integration
- Simon Willison: Gemini 3.1 Flash TTS hands-on
- MarkTechPost: Gemini 3.1 Flash TTS launch coverage
- Google-Gemini cookbook repository
✓ Last verified April 21, 2026
