Gemini 3.5 Live Translate Rolls Out With 70+ Languages

Google shipped Gemini 3.5 Live Translate on June 9, 2026 - a streaming audio model that translates spoken conversation in real time across more than 70 languages. It's live now in Google Translate on Android and iOS, available to developers via the Gemini Live API, and entering private enterprise preview in Google Meet. The upgrade replaces a system that had been limited to five languages in Meet and required English as an intermediate pivot.

Key Specs

Spec	Value
Model ID	`gemini-3.5-live-translate-preview`
Base model	Gemini 3 Pro architecture
Languages	70+ (2,000+ combinations in one session)
Context - input	128K tokens
Context - output	64K tokens
Input audio format	16-bit PCM, 16kHz mono, 100ms chunks
Output audio format	16-bit PCM, 24kHz mono
API pricing (paid)	$3.50 / M input tokens, $21.00 / M audio output tokens
Effective per-minute cost	~$0.037/min (25 tokens/sec billing)
Safety	SynthID watermark in all output audio

How the Model Works

From Turn-Based to Streaming

Most translation systems wait for the speaker to finish a sentence, run inference on the complete utterance, then output the result. That approach produces better accuracy but inserts a noticeable pause - enough to break conversational flow.

Gemini 3.5 Live Translate doesn't wait. It processes the audio stream continuously, trading some initial context for latency. The output runs a few seconds behind the speaker, close enough that two people can hold a real back-and-forth without losing the thread of a conversation. The model handles the tradeoff internally, using context from what's already been said to improve quality on words it hasn't fully committed to yet.

The API takes audio in 100ms chunks. Output comes back at 24kHz, a higher sample rate than the 16kHz input (room for roughly 12kHz of audio bandwidth instead of 8kHz). Because the translation is synthesized rather than captured from a live microphone, it can be rendered cleanly, without a microphone's noise floor.
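The spec-table formats translate directly into buffer sizes. The sketch below, assuming raw 16-bit mono PCM with no container header, computes how many bytes one 100ms chunk occupies at each rate, the highest frequency each rate can represent, and how long a received output buffer plays. The helper names are illustrative, not part of any Google API.

```python
# Sketch: buffer-size arithmetic for the PCM formats in the spec table.
# Assumes raw 16-bit mono PCM with no container header (no WAV wrapper).

BYTES_PER_SAMPLE = 2  # 16-bit samples

INPUT_RATE_HZ = 16_000   # microphone audio sent to the API
OUTPUT_RATE_HZ = 24_000  # translated audio returned by the API


def chunk_bytes(sample_rate_hz: int, chunk_ms: int) -> int:
    """Number of bytes in one mono 16-bit PCM chunk of the given duration."""
    samples = sample_rate_hz * chunk_ms // 1000
    return samples * BYTES_PER_SAMPLE


def nyquist_hz(sample_rate_hz: int) -> int:
    """Highest frequency the sample rate can represent (half the rate)."""
    return sample_rate_hz // 2


def playback_seconds(num_bytes: int, sample_rate_hz: int) -> float:
    """How long a mono 16-bit PCM buffer plays at the given sample rate."""
    return num_bytes / (sample_rate_hz * BYTES_PER_SAMPLE)


print(chunk_bytes(INPUT_RATE_HZ, 100))    # 3200
print(chunk_bytes(OUTPUT_RATE_HZ, 100))   # 4800
print(nyquist_hz(INPUT_RATE_HZ), nyquist_hz(OUTPUT_RATE_HZ))  # 8000 12000
print(playback_seconds(48_000, OUTPUT_RATE_HZ))  # 1.0
```

In other words, every 100ms of speech sent up is 3,200 bytes, and every 100ms of translation coming back is 4,800 bytes.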

Tone and Identity Preservation

The model aims to preserve the speaker's intonation, pacing, and pitch in the translated output - not just transfer meaning, but carry the feel of how something was said. Google describes this as a "natural-sounding" output, though the model card is candid that voice inconsistency and occasional gender shifts can appear during long pauses in speech.

SynthID embeds an inaudible watermark in every audio output. The watermark is meant for downstream detection systems rather than listeners: audio run through a SynthID-aware detector will be flagged as machine-generated.

Deployment Channels

Consumer: Google Translate

The Google Translate app on Android and iOS has a new "Live translate" button in the lower left of the interface. Tap it, put on headphones, and the app translates what it hears and plays the result through the earpiece. Android adds a "listening mode" that routes translated audio directly to whatever earpiece is connected, without requiring the phone to be held up as a physical translator.

Enterprise: Google Meet

Google Meet's translation feature was previously restricted to five languages, with every pair routed through English as a pivot. Gemini 3.5 Live Translate removes both constraints: 70+ languages, 2,000+ pairings in a single session, and no forced English intermediary. A host on an eligible Workspace plan enables it once with a new button in the Meet control row, and it then applies to all participants automatically.
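The "2,000+ pairings" figure follows from simple counting. Google has not said exactly how it counts, so the sketch below shows both conventions, assuming any of 70 languages can pair with any other; both exceed 2,000. It also contrasts this with a pivot design like the old Meet setup, where a pair not involving English needed two translation legs. The language codes are hypothetical examples.

```python
from math import comb


def directed_pairs(n_languages: int) -> int:
    """Source->target combinations when every language can translate into every other."""
    return n_languages * (n_languages - 1)


def unordered_pairs(n_languages: int) -> int:
    """Language pairs when A->B and B->A count as one pairing."""
    return comb(n_languages, 2)


def pivot_route(source: str, target: str, pivot: str = "en") -> list[tuple[str, str]]:
    """Translation legs in a pivot design: pairs not involving the pivot take two hops."""
    if pivot in (source, target):
        return [(source, target)]
    return [(source, pivot), (pivot, target)]


print(directed_pairs(70))       # 4830
print(unordered_pairs(70))      # 2415
print(pivot_route("ja", "es"))  # [('ja', 'en'), ('en', 'es')]
print(pivot_route("en", "es"))  # [('en', 'es')]
```

Each extra leg in a pivot design adds its own latency and its own chance of error, which is why dropping the English intermediary matters beyond the language count.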

Image: Real-time translation now routes to earpieces on Android, keeping the phone in a pocket rather than between two people. (Source: unsplash.com)

Enterprise rollout starts as a private preview in June 2026 for select Workspace customers on Business Standard and Plus, Enterprise Standard and Plus, and Frontline Plus. Google is targeting a broader rollout later in 2026.

Developer: Gemini Live API

Developers access the model through the Gemini Live API or Google AI Studio using the model ID `gemini-3.5-live-translate-preview`, currently in public preview. The API detects the spoken language automatically, so there is no need to specify a source language before starting. Input is 16-bit PCM at 16kHz; the API returns 24kHz PCM, ready to play back or record.
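Before anything reaches the API, the client has to turn microphone samples into 16-bit PCM and cut the stream into 100ms pieces. The sketch below covers only that preparation step. It assumes floating-point samples already captured at 16kHz mono, little-endian byte order, and an `audio/pcm;rate=16000` MIME type as used elsewhere in the Gemini Live API; all helper names are hypothetical, and the actual send call is omitted because it depends on the SDK and its version.

```python
import struct
from typing import Iterator, Sequence

INPUT_RATE_HZ = 16_000
CHUNK_MS = 100
SAMPLES_PER_CHUNK = INPUT_RATE_HZ * CHUNK_MS // 1000  # 1,600 samples per chunk
BYTES_PER_SAMPLE = 2  # 16-bit PCM

# Assumption: the MIME type the Gemini Live API uses for raw PCM input;
# confirm against the docs for the translate preview model.
INPUT_MIME_TYPE = f"audio/pcm;rate={INPUT_RATE_HZ}"


def float_to_pcm16(samples: Sequence[float]) -> bytes:
    """Clamp float samples to [-1.0, 1.0] and pack them as little-endian int16."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def pcm_chunks(pcm: bytes, samples_per_chunk: int = SAMPLES_PER_CHUNK) -> Iterator[bytes]:
    """Yield fixed-duration chunks of a PCM buffer; the final chunk may be shorter."""
    step = samples_per_chunk * BYTES_PER_SAMPLE
    for start in range(0, len(pcm), step):
        yield pcm[start:start + step]


# One second of silence at 16kHz becomes ten 100ms chunks of 3,200 bytes each.
pcm = float_to_pcm16([0.0] * INPUT_RATE_HZ)
chunks = list(pcm_chunks(pcm))
print(len(chunks), len(chunks[0]), INPUT_MIME_TYPE)  # 10 3200 audio/pcm;rate=16000
```

In a live session the same chunker would run over a microphone callback rather than a prebuilt buffer, sending each chunk as soon as it fills so the model can keep translating continuously.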

Billing runs at 25 tokens per second of audio. Counting 1,500 tokens per minute on both the input and output side, that works out to roughly $0.037 per minute of translated conversation at standard paid-tier rates. Testing in Google AI Studio is free.
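The per-minute figure can be reproduced from the published rates. The sketch below assumes that one minute of input speech produces roughly one minute of translated output and that both sides are billed at 25 tokens per second; under that assumption the arithmetic lands on the ~$0.037/min quoted above. The function name is illustrative.

```python
TOKENS_PER_SECOND = 25     # audio billing rate stated by Google
INPUT_USD_PER_M = 3.50     # paid tier, per million input tokens
OUTPUT_USD_PER_M = 21.00   # paid tier, per million audio output tokens


def cost_usd(minutes: float) -> float:
    """Estimated cost of translating `minutes` of speech.

    Assumes each minute of input audio yields about one minute of
    translated output audio, both billed at TOKENS_PER_SECOND.
    """
    tokens = TOKENS_PER_SECOND * 60 * minutes  # 1,500 tokens per minute, each side
    input_cost = tokens * INPUT_USD_PER_M / 1_000_000
    output_cost = tokens * OUTPUT_USD_PER_M / 1_000_000
    return input_cost + output_cost


print(round(cost_usd(1), 5))  # 0.03675
```

Output audio dominates: about $0.0315 of each minute is the translated speech, versus about $0.005 for the input.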

What Builders Are Connecting

The release comes with integrations from five real-time media platforms and frameworks: Agora, Fishjam, LiveKit, Pipecat, and Vision Agents. These platforms handle the routing of live audio streams in applications like video calls, customer support tools, and ambient translation. Adding Gemini 3.5 Live Translate to an existing LiveKit or Agora setup means connecting it where the audio already flows.

Image: Gemini 3.5 Live Translate enters enterprise preview in Google Meet, expanding from 5 languages to 70+. (Source: unsplash.com)

Grab is testing the model in production. The Southeast Asian ride-hailing company runs over 10 million voice calls monthly between drivers and passengers, many of them crossing language boundaries. The company is using the API for driver-passenger communication - a use case where latency matters and the consequences of a misunderstanding are concrete.

How It Compares to Prior Google Translation

Capability	Google Meet (before June 2026)	Gemini 3.5 Live Translate
Languages in Meet	5	70+
Language combinations	~10 pairs, English pivot only	2,000+ per session
Delivery mode	Turn-based (full utterance)	Streaming, seconds behind speaker
Developer API	Not available	`gemini-3.5-live-translate-preview`
Voice preservation	Not applied	Intonation, pacing, pitch carried
Audio watermarking	None	SynthID on all output
Consumer rollout	Not available	Android + iOS (available now)

Google's earlier Gemini 3.1 Flash TTS model covered text-to-speech generation. This model covers the whole speech-to-speech path, with no separate ASR step or intermediate text representation.

What To Watch

The model card is specific about where it fails. Voice inconsistency - including occasional gender shifts - shows up during long pauses, when the model loses its grip on a speaker's vocal profile. Language detection is weaker for non-native accents and rapid switching between languages; users who code-switch mid-sentence may see accuracy drops.

Background noise handling is described as "incomplete." The model filters noise, but mixing the translated output with the target-language source audio can introduce echo artifacts in noisy environments. Multi-speaker sessions carry a "voice entanglement" risk - translated voices bleeding into each other when two people talk at once.

AutoMQM, an automatic error-based metric that classifies translation mistakes by type and severity, is the primary quality benchmark Google uses internally. Word-level latency measures the time between the end of a source word and the start of its corresponding translated word. The launch materials publish no external numbers for either metric: Google disclosed its evaluation methodology, not its performance.
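Both metrics are simple to compute once the hard parts (word alignment and error annotation) are done. The sketch below is an illustration, not Google's evaluation code. It assumes a one-to-one word alignment with known timestamps, and it scores errors with a common MQM-style weighting (major = 5, minor = 1) that Google has not confirmed for AutoMQM. All timestamps and annotations are hypothetical.

```python
from statistics import mean
from typing import Sequence

# Hypothetical severity weights following a common MQM convention
# (major = 5, minor = 1); Google has not published its AutoMQM weights.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0}


def word_latencies(source_word_ends: Sequence[float],
                   target_word_starts: Sequence[float]) -> list[float]:
    """Seconds from the end of each source word to the start of its translation.

    Assumes a one-to-one alignment: index i in both lists is the same word.
    Real cross-lingual alignments can be many-to-many or reordered.
    """
    if len(source_word_ends) != len(target_word_starts):
        raise ValueError("expected a one-to-one word alignment")
    return [t - s for s, t in zip(source_word_ends, target_word_starts)]


def mqm_penalty(errors: Sequence[tuple[str, str]]) -> float:
    """Sum severity weights over (category, severity) annotations; lower is better."""
    return sum(SEVERITY_WEIGHTS[severity] for _category, severity in errors)


# Hypothetical timestamps (seconds) for three aligned words.
lat = word_latencies([0.5, 1.0, 1.5], [1.5, 2.25, 3.0])
print(lat, mean(lat))  # [1.0, 1.25, 1.5] 1.25

# Hypothetical annotations: one major accuracy error, two minor fluency errors.
print(mqm_penalty([("accuracy", "major"), ("fluency", "minor"), ("fluency", "minor")]))  # 7.0
```

A streaming system trades between the two numbers: waiting longer before committing to a word lowers the error penalty but raises the latency.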

For developers building voice-first applications, the practical question is whether their use case is single-speaker or multi-speaker, and whether the ambient noise floor in their deployment environment is predictable. Grab's taxi scenario - driver and passenger, controlled noise, defined roles - is closer to the model's design target than a noisy group call with unknown speakers.

Sources:

Fluid, natural voice translation with Gemini 3.5 Live Translate - Google Blog
Gemini 3.5 Audio (Live Translate) - Model Card - Google DeepMind
Gemini 3.5 Live Translate rolling out to Google Meet and Translate - 9to5Google
Gemini API Pricing - Google AI for Developers
Google rolls out Gemini 3.5 Live Translate with real-time speech translation - FoneArena