Gemini 3.5 Live Translate Rolls Out With 70+ Languages
Google's new streaming audio model translates speech in real time across 70+ languages - available now in Google Translate and via the Gemini Live API.

Google shipped Gemini 3.5 Live Translate on June 9, 2026 - a streaming audio model that translates spoken conversation in real time across 70+ languages. It's live now in Google Translate on Android and iOS, available to developers via the Gemini Live API, and entering private enterprise preview in Google Meet. The upgrade replaces a system that had been limited to five languages in Meet and required English as an intermediate pivot.
Key Specs
| Spec | Value |
|---|---|
| Model ID | gemini-3.5-live-translate-preview |
| Base model | Gemini 3 Pro architecture |
| Languages | 70+ (2,000+ combinations in one session) |
| Context - input | 128K tokens |
| Context - output | 64K tokens |
| Input audio format | 16-bit PCM, 16kHz mono, 100ms chunks |
| Output audio format | 16-bit PCM, 24kHz mono |
| API pricing (paid) | $3.50 / M input tokens, $21.00 / M audio output tokens |
| Effective per-minute cost | ~$0.037/min (25 tokens/sec billing) |
| Safety | SynthID watermark in all output audio |
How the Model Works
From Turn-Based to Streaming
Most translation systems wait for the speaker to finish a sentence, run inference on the complete utterance, then output the result. That approach produces better accuracy but inserts a noticeable pause - enough to break conversational flow.
Gemini 3.5 Live Translate doesn't wait. It processes the audio stream continuously, trading some initial context for lower latency. The output runs a few seconds behind the speaker, close enough that two people can hold a real back-and-forth without losing the thread of a conversation. The model handles the tradeoff internally, using context from what's already been said to improve quality on words it hasn't fully committed to yet.
The API takes audio in 100ms chunks. Output comes back at 24kHz - slightly higher quality than the 16kHz input, since the synthesized translation can be rendered cleanly without the noise floor of a live microphone feed.
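As a rough illustration of what a 100ms chunk means at the stated input format, the sketch below slices raw 16-bit, 16kHz mono PCM into 100ms pieces. The helper name `iter_chunks` is hypothetical and is not part of any Google SDK; only the sample rate, bit depth, and chunk duration come from the spec table.

```python
# Input format from the spec table: 16-bit PCM, 16 kHz, mono, 100 ms chunks.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # 16-bit samples
CHUNK_MS = 100

# 16,000 samples/s * 2 bytes * 0.1 s = 3,200 bytes per chunk
CHUNK_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHUNK_MS // 1000


def iter_chunks(pcm: bytes, chunk_bytes: int = CHUNK_BYTES):
    """Yield successive 100 ms slices of raw PCM (hypothetical helper).

    The final slice may be shorter if the audio does not divide evenly.
    """
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


if __name__ == "__main__":
    one_second_of_silence = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
    chunks = list(iter_chunks(one_second_of_silence))
    print(len(chunks), len(chunks[0]))  # prints: 10 3200
```

Each chunk is 1,600 samples, or 3,200 bytes, so one second of microphone audio becomes ten chunks.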
Tone and Identity Preservation
The model aims to preserve the speaker's intonation, pacing, and pitch in the translated output - not just transfer meaning, but carry the feel of how something was said. Google describes this as a "natural-sounding" output, though the model card is candid that voice inconsistency and occasional gender shifts can appear during long pauses in speech.
SynthID watermarking embeds an inaudible signal into every audio output. The watermark is there to flag AI-produced audio for downstream detection systems, not for users - conversations played back through a SynthID-aware detector will show the content as machine-generated.
Deployment Channels
Consumer: Google Translate
The Google Translate app on Android and iOS has a new "Live translate" button in the lower left of the interface. Tap it, put on headphones, and the app starts translating what it hears through the earpiece. Android adds a "listening mode" that routes translated audio directly to whatever earpiece is connected, without requiring the phone to be held up as a physical translator.
Enterprise: Google Meet
Google Meet's translation feature was previously restricted to five language pairs, all of which ran through English as a pivot. Gemini 3.5 Live Translate removes both constraints - 70+ languages, 2,000+ pairings in a single session, and no forced English intermediary. A host on an eligible Workspace plan enables it once with a new button in the Meet control row; it applies to all participants automatically.
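The "2,000+ pairings" figure is consistent with simple counting. The sketch below assumes exactly 70 languages (the article says "70+") and shows both ways of counting a pair; how Google counts is not stated in the source.

```python
from math import comb, perm

LANGUAGES = 70  # assumption: exactly 70; the article says "70+"

# Unordered: French<->Thai counts once.
unordered_pairs = comb(LANGUAGES, 2)
# Directed: French->Thai and Thai->French count separately.
directed_pairs = perm(LANGUAGES, 2)

print(unordered_pairs, directed_pairs)  # prints: 2415 4830
```

Either count clears 2,000, so the claim holds without any English pivot.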
Enterprise rollout is in private preview this month for select Workspace customers on Business Standard and Plus, Enterprise Standard and Plus, and Frontline Plus. Google is targeting a broader rollout later in 2026.
Developer: Gemini Live API
Developers access the model using the ID gemini-3.5-live-translate-preview through the Gemini Live API or Google AI Studio. It's in public preview. The API accepts the audio stream and handles language detection automatically - no need to specify the source language before starting. Input comes in as 16-bit PCM at 16kHz; the API returns 24kHz PCM, ready to push to an audio output or record.
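For the "record" case, raw PCM needs a container before most players will open it. The sketch below wraps 16-bit, 24kHz mono PCM in a WAV header using Python's standard `wave` module. It does not call the Gemini Live API; `pcm_to_wav_bytes` is a hypothetical helper, and the bytes passed in stand in for whatever the API returns.

```python
import io
import wave

# Output format from the spec table: 16-bit PCM, 24 kHz, mono.
OUTPUT_RATE_HZ = 24_000
SAMPLE_WIDTH_BYTES = 2
CHANNELS = 1


def pcm_to_wav_bytes(pcm: bytes) -> bytes:
    """Wrap raw PCM in a WAV container (hypothetical helper)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(OUTPUT_RATE_HZ)
        wav.writeframes(pcm)
    return buf.getvalue()


if __name__ == "__main__":
    # Placeholder audio: one second of silence at the output format.
    silence = bytes(OUTPUT_RATE_HZ * SAMPLE_WIDTH_BYTES)
    with open("translated.wav", "wb") as f:
        f.write(pcm_to_wav_bytes(silence))
```

The 16kHz input side is handled the same way, with only the frame rate changed.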
Billing runs at 25 tokens per second of audio, which works out to roughly $0.037 per minute of translated conversation at standard paid tier rates. Google AI Studio testing is free.
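The per-minute figure can be reproduced from the listed rates. The sketch below assumes that one minute of conversation bills one minute of input audio and one minute of output audio, both at 25 tokens per second; that assumption is not spelled out in the source, but it matches the quoted $0.037. The `output_ratio` parameter is hypothetical, added to show how the cost moves if the output runs longer or shorter than the input.

```python
TOKENS_PER_SECOND = 25       # billing rate from the article
INPUT_USD_PER_MTOK = 3.50    # paid tier, input tokens
OUTPUT_USD_PER_MTOK = 21.00  # paid tier, audio output tokens


def cost_per_minute_usd(output_ratio: float = 1.0) -> float:
    """Estimated cost of one minute of translated conversation.

    output_ratio is output audio duration divided by input duration
    (assumed to be 1.0 by default).
    """
    tokens_in = TOKENS_PER_SECOND * 60  # 1,500 tokens per minute
    tokens_out = tokens_in * output_ratio
    total = tokens_in * INPUT_USD_PER_MTOK + tokens_out * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


if __name__ == "__main__":
    print(f"{cost_per_minute_usd():.5f}")  # prints: 0.03675
```

Under these assumptions, input accounts for about $0.005 of each minute and output for about $0.032, so audio output makes up most of the bill.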
What Builders Are Connecting
The release comes with integrations from five real-time media infrastructure companies: Agora, Fishjam, LiveKit, Pipecat, and Vision Agents. These platforms handle the routing of live audio streams in applications like video calls, customer support tools, and ambient translation. Adding Gemini 3.5 Live Translate to an existing LiveKit or Agora setup means connecting it where the audio already flows.
Grab is testing the model in production. The Southeast Asian ride-hailing company runs over 10 million voice calls monthly between drivers and passengers, many of them crossing language boundaries. The company is using the API for driver-passenger communication - a use case where latency matters and the consequences of a misunderstanding are concrete.
How It Compares to Prior Google Translation
| Capability | Google Meet (before June 2026) | Gemini 3.5 Live Translate |
|---|---|---|
| Languages in Meet | 5 | 70+ |
| Language combinations | ~10 pairs, English pivot only | 2,000+ per session |
| Delivery mode | Turn-based (full utterance) | Streaming, seconds behind speaker |
| Developer API | Not available | gemini-3.5-live-translate-preview |
| Voice preservation | Not applied | Intonation, pacing, pitch carried |
| Audio watermarking | None | SynthID on all output |
| Consumer rollout | Not available | Android + iOS (available now) |
The previous Gemini 3.1 Flash TTS covered text-to-speech generation; this model sits further up the stack, handling full speech-to-speech without requiring a separate ASR step or intermediate text representation.
What To Watch
The model card is specific about where it fails. Voice inconsistency - including occasional gender shifts - shows up during long pauses, when the model loses its grip on a speaker's vocal profile. Language detection is weaker for non-native accents and rapid switching between languages; users who code-switch mid-sentence may see accuracy drops.
Background noise handling is described as "incomplete." The model filters noise, but mixing the translated output with the target-language source audio can introduce echo artifacts in noisy environments. Multi-speaker sessions carry a "voice entanglement" risk - translated voices bleeding into each other when two people talk at once.
AutoMQM - an error-based metric that classifies translation mistakes by type and severity - is the primary quality evaluation benchmark Google uses internally. Word-level latency measures the time between the end of a source word and the start of its corresponding translated word. The launch materials publish no figures for either metric; Google has disclosed its evaluation methodology, not its results.
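The word-level latency definition reduces to one subtraction per aligned word pair. The sketch below is a minimal reading of that definition, not Google's evaluation code: the `AlignedWord` type, the function names, and the timestamps are all hypothetical, and how words are aligned across languages is left out entirely.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class AlignedWord:
    source_end_s: float    # when the source word finished being spoken
    target_start_s: float  # when its translated counterpart started playing


def word_latencies(words: list[AlignedWord]) -> list[float]:
    """Per-word latency: target start minus source end, in seconds."""
    return [w.target_start_s - w.source_end_s for w in words]


def mean_word_latency(words: list[AlignedWord]) -> float:
    """Average latency across all aligned word pairs."""
    return mean(word_latencies(words))


if __name__ == "__main__":
    # Made-up timestamps for illustration only, not measured results.
    sample = [AlignedWord(1.0, 3.0), AlignedWord(2.0, 5.0)]
    print(word_latencies(sample), mean_word_latency(sample))
    # prints: [2.0, 3.0] 2.5
```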
For developers building voice-first applications, the practical question is whether their use case is single-speaker or multi-speaker, and whether the ambient noise floor in their deployment environment is predictable. Grab's taxi scenario - driver and passenger, controlled noise, defined roles - is closer to the model's design target than a noisy group call with unknown speakers.
Sources:
- Fluid, natural voice translation with Gemini 3.5 Live Translate - Google Blog
- Gemini 3.5 Audio (Live Translate) - Model Card - Google DeepMind
- Gemini 3.5 Live Translate rolling out to Google Meet and Translate - 9to5Google
- Gemini API Pricing - Google AI for Developers
- Google rolls out Gemini 3.5 Live Translate with real-time speech translation - FoneArena
