OpenAI's Realtime API Goes GA with Three New Models

OpenAI's Realtime API exits beta with GPT-Realtime-2, Translate, and Whisper - three specialized voice models splitting reasoning, translation, and transcription into distinct endpoints.

OpenAI's Realtime API exits beta on May 7, and it didn't arrive with a single upgraded model. Three specialized endpoints ship together: GPT-Realtime-2 for voice agents with GPT-5-class reasoning, GPT-Realtime-Translate for live speech-to-speech translation, and GPT-Realtime-Whisper for streaming transcription. The previous approach of routing all real-time audio through one endpoint is gone.

TL;DR

  • Realtime API leaves beta and goes generally available as of May 7, 2026
  • GPT-Realtime-2 scores 96.6% on Big Bench Audio vs 81.4% for GPT-Realtime-1.5 - a 15.2-point jump
  • Three session types now: voice-agent, translation, transcription - each routed to a purpose-built model
  • Early adopters report 26-43% gains; Zillow saw call success rates jump from 69% to 95%

Model                    Use Case                  Pricing
GPT-Realtime-2           Voice agents              $32 / $64 per 1M tokens (input/output)
GPT-Realtime-Translate   Live translation          $0.034/min
GPT-Realtime-Whisper     Streaming transcription   $0.017/min

Three Models, Three Jobs

The split is deliberate. OpenAI's previous Realtime API tried to handle voice agents, translation, and transcription through a single endpoint. The GA release drops that design for three distinct session types - voice-agent, translation, and transcription - each routed to a model built for its specific task.
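
In code, that routing decision reduces to a single field at session setup. Here's a minimal sketch of what the three session types might look like as JSON payloads, assuming the GA API keeps JSON-event conventions similar to the beta's - the field names are inferred from the article, not taken from documentation:

```python
import json

# Hypothetical session payloads for the three GA session types. The "type"
# values mirror the article's names; the exact field names are assumptions.
sessions = [
    {"type": "voice-agent", "model": "gpt-realtime-2"},
    {"type": "translation", "model": "gpt-realtime-translate"},
    {"type": "transcription", "model": "gpt-realtime-whisper"},
]

for payload in sessions:
    print(json.dumps(payload))
```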

GPT-Realtime-2 - Reasoning in Real Time

GPT-Realtime-2 is the anchor of the release. It's the first OpenAI voice model with GPT-5-class reasoning, which means it handles multi-step tasks without losing the thread, calls tools in parallel, and narrates what it's doing as it does it. The context window jumps from 32K to 128K tokens, meaning longer conversations stay coherent where previous Realtime models would start dropping context.

Five reasoning levels let developers trade latency for intelligence. At "low" - the default - first audio arrives in 1.12 seconds. At "xhigh," that stretches to 2.33 seconds, but instruction-following improves sharply. OpenAI's own Audio MultiChallenge numbers put the xhigh variant at 48.5% versus 34.7% for GPT-Realtime-1.5. Scale AI's independent Audio MultiChallenge S2S leaderboard tracked instruction retention rising from 36.7% to 70.8%. Two new voices ship with the model - Cedar and Marin.
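
Here's a minimal sketch of how that dial might look in a session config. Hedge accordingly: the `reasoning` field is an assumed parameter name, and only three of the five level names (low, high, xhigh) appear in the published figures.

```python
import json

# Hypothetical voice-agent session turning reasoning up from the default.
# "reasoning" and its level names are assumptions based on the article,
# not documented parameters.
session_update = {
    "type": "session.update",
    "session": {
        "model": "gpt-realtime-2",  # assumed model identifier
        "voice": "marin",           # one of the two new voices (Cedar, Marin)
        "reasoning": "xhigh",       # ~1.2s more first-audio latency, sharply
                                    # better instruction-following
    },
}
print(json.dumps(session_update, indent=2))
```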

Pricing is $32 per million input tokens ($0.40 for cached inputs) and $64 per million output tokens. That's token-based, not minute-based - you pay for the compute consumed by reasoning, not just wall-clock conversation time. See the GPT-Realtime-2 model card for the full spec breakdown.

GPT-Realtime-Translate - Live Interpretation

This isn't a general-purpose assistant that also speaks other languages. GPT-Realtime-Translate is a dedicated pipeline: speech in, speech out, in real time, without text translation as an intermediate step. It handles 70+ input languages and outputs in 13 - designed for bilingual support queues, international events, and live dubbing pipelines.

At $0.034 per minute, it's competitive with enterprise translation APIs. BolnaAI, which builds voice AI for Indian language markets, reported 12.5% lower word error rates on Hindi, Tamil, and Telugu in early testing.
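
A sketch of what a translation session might look like. The language fields are illustrative guesses - only the 70+ input / 13 output language counts come from the announcement:

```python
import json

# Hypothetical translation session: speech in, speech out, no text step.
# Field names below are assumptions for illustration.
translate_session = {
    "type": "translation",
    "model": "gpt-realtime-translate",
    "input_language": "auto",  # assumed: auto-detect across 70+ input languages
    "output_language": "hi",   # assumed: one of the 13 output languages
}
print(json.dumps(translate_session))
```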

GPT-Realtime-Whisper - Streaming Transcription

GPT-Realtime-Whisper produces text from audio as the audio arrives. Standard transcription APIs wait for an utterance to finish before processing; this model transcribes continuously as the stream comes in. Latency is configurable, with a tradeoff between responsiveness and accuracy depending on the application. At $0.017 per minute, it undercuts most third-party transcription services while adding the real-time streaming that batch processors don't offer.
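
A minimal sketch of what consuming that stream could look like, assuming the endpoint emits incremental delta-style JSON events as audio arrives - the event type name "transcription.delta" is hypothetical:

```python
import json

# Accumulate hypothetical "transcription.delta" events into a running
# partial transcript. The event name and payload shape are assumptions.
def handle_event(raw, parts):
    event = json.loads(raw)
    if event.get("type") == "transcription.delta":
        parts.append(event["delta"])  # append the new text fragment
        print("".join(parts))         # emit the running partial transcript

# Simulated event stream standing in for a live WebSocket connection.
parts = []
for raw in (
    '{"type": "transcription.delta", "delta": "Realtime "}',
    '{"type": "transcription.delta", "delta": "API is "}',
    '{"type": "transcription.delta", "delta": "now GA."}',
):
    handle_event(raw, parts)
```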

[Image: a condenser microphone in a recording setup] The three new Realtime models split the voice stack by task: reasoning agents, live translation, and streaming transcription each get a dedicated endpoint. Source: unsplash.com

Benchmark Performance

Benchmark                                GPT-Realtime-2 (high)   GPT-Realtime-1.5
Big Bench Audio                          96.6%                   81.4%
Audio MultiChallenge - xhigh (OpenAI)    48.5%                   34.7%
Audio MultiChallenge - S2S (Scale AI)    70.8%                   36.7%
Time to first audio (low reasoning)      1.12s                   -
Time to first audio (xhigh reasoning)    2.33s                   -

Big Bench Audio tests audio-native reasoning - the model receives audio input and responds in audio without converting to text in between. The 15.2-percentage-point improvement over GPT-Realtime-1.5 is the largest jump between successive Realtime models. The xhigh variant is where the gap widens most: longer latency but substantially better multi-turn coherence, which is where earlier Realtime models were weakest.

What Customers Are Seeing

Production data from early-access customers is unusually specific for an OpenAI launch. Zillow used GPT-Realtime-2 on adversarial benchmarks and measured call-success rates moving from 69% to 95%. Glean, which builds enterprise search, saw 42.9% relative helpfulness improvement. Genspark's Call for Me agent - which handles phone calls for users - reported 26% higher effective conversation rates with fewer dropped calls.

The consistency across different domains suggests the gains aren't application-specific. Tasks requiring multiple clarifying turns now often resolve in one.

"The first OpenAI speech-to-speech model good enough for real work in complex agents." - developer Kyle Windland, via Latent Space

[Image: a microphone positioned for recording in front of a monitor] Zillow, Glean, and Genspark were among the early adopters who tested GPT-Realtime-2 before the GA launch. Source: unsplash.com

What To Watch

Token pricing compounds fast at volume

GPT-Realtime-2's token-based model looks reasonable until you run the math on production workloads. A one-hour conversation at normal speaking pace creates roughly 300,000 tokens. For high-volume deployments - call centers, support pipelines, real-time meeting tools - the cost modeling requires careful analysis before committing. Per-minute pricing from specialized voice AI vendors may still win on pure economics at scale.
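
The arithmetic at the quoted rates, under the assumption of an even input/output token split (real ratios vary by application):

```python
# Back-of-envelope hourly cost for GPT-Realtime-2 at the quoted rates.
# The 50/50 input/output split is an assumption, not a measured ratio.
TOKENS_PER_HOUR = 300_000     # rough figure for an hour of conversation
INPUT_RATE = 32 / 1_000_000   # $32 per 1M input tokens
OUTPUT_RATE = 64 / 1_000_000  # $64 per 1M output tokens

input_tokens = output_tokens = TOKENS_PER_HOUR / 2
hourly = input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE
print(f"~${hourly:.2f}/hour")  # ~$14.40 under these assumptions
# For contrast: GPT-Realtime-Whisper at $0.017/min is a flat ~$1.02/hour.
```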

The reasoning tradeoff constrains architecture

Five reasoning levels give real control, but the 2.33-second first-audio delay at xhigh is noticeable in conversational contexts. Applications that need responses under 1.5 seconds will stay at low or medium, which means giving up some of the coherence improvements that make GPT-Realtime-2 worth the upgrade in the first place. Benchmark your latency budget against the reasoning levels before choosing an architecture - the right choice depends entirely on your application's tolerance for pauses.
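
One way to frame that decision, using only the two published time-to-first-audio figures (the three intermediate levels have no public numbers, so they're omitted):

```python
# Pick the deepest reasoning level that fits a latency budget. Only the
# two published first-audio figures are included; the rest are unknown.
FIRST_AUDIO_S = {"low": 1.12, "xhigh": 2.33}

def deepest_level_within(budget_s):
    fits = [(latency, level) for level, latency in FIRST_AUDIO_S.items()
            if latency <= budget_s]
    return max(fits)[1] if fits else None

print(deepest_level_within(1.5))  # -> low (xhigh's 2.33s blows the budget)
print(deepest_level_within(3.0))  # -> xhigh
```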

Open source is closing the gap

Mistral's Voxtral showed earlier this year that capable open-source voice models are achievable. GPT-Realtime-2's real advantage is the GPT-5 reasoning layer and the tight platform integration, not the audio processing itself. For teams running their own infrastructure, that premium is only worth paying if the reasoning capabilities actually matter for the task at hand. The WebRTC engineering work OpenAI published last week shows how much infrastructure sustains that advantage - substantial, but also reproducible.


Sophie Zhang
AI Infrastructure & Open Source Reporter

Sophie is a journalist and former systems engineer who covers AI infrastructure, open-source models, and the developer tooling ecosystem.