GPT-Realtime-2

OpenAI's second-generation real-time audio model with GPT-5-class reasoning, 128K context, five reasoning levels, and parallel tool calling - now generally available in the Realtime API.

OpenAI's Realtime API exited beta on May 7, 2026, and GPT-Realtime-2 is the model at the center of that release. The launch also introduced two companion models - GPT-Realtime-Translate and GPT-Realtime-Whisper - giving developers three distinct session types for voice-agent, translation, and transcription workloads.

TL;DR

  • GPT-5-class reasoning in a real-time audio model with 128K context (up from 32K in the prior version)
  • Scored 96.6% on Big Bench Audio vs. 81.4% for GPT-Realtime-1.5 - a 15.2-point gain; priced at $32/$64 per million tokens in/out
  • Significantly stronger than GPT-Realtime-1.5 on audio reasoning; no direct competitor ships a publicly comparable model at this latency tier

Overview

GPT-Realtime-2 is the successor to GPT-Realtime-1.5, designed for applications that need low-latency voice interaction - customer service bots, voice agents, live translation, and real-time transcription. It runs inside OpenAI's Realtime API, which the company rebuilt at scale earlier in 2026 to handle 900 million weekly active users on a WebRTC/Kubernetes architecture.

The model ships with five adjustable reasoning levels (minimal, low, medium, high, xhigh), defaulting to low for latency-sensitive deployments. That flexibility lets developers dial in the trade-off between response quality and first-audio latency: the low setting delivers a first audio token in roughly 1.12 seconds, while xhigh pushes that to 2.33 seconds in exchange for stronger reasoning. The 128K context window - four times larger than GPT-Realtime-1.5's 32K - opens up document-grounded voice workflows that weren't practical before.
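That trade-off reduces to a simple budgeting rule. Here is a minimal sketch using only the two published latency figures (1.12s at low, 2.33s at xhigh); latencies for the intermediate levels are not disclosed, so the mapping below is an illustrative assumption, not a documented API surface:

```python
# Published first-audio latencies in seconds. Intermediate levels
# (minimal, medium, high) are not disclosed, so only the two
# documented points are used here.
FIRST_AUDIO_LATENCY = {"low": 1.12, "xhigh": 2.33}

def pick_reasoning_level(latency_budget_s: float) -> str:
    """Pick the strongest documented reasoning level whose expected
    first-audio latency fits inside the caller's latency budget."""
    # Walk from strongest reasoning to fastest response.
    for level in ("xhigh", "low"):
        if FIRST_AUDIO_LATENCY[level] <= latency_budget_s:
            return level
    # Nothing fits: fall back to the fastest documented level.
    return "low"

print(pick_reasoning_level(1.5))   # a 1.5s budget only fits "low"
print(pick_reasoning_level(3.0))   # a 3s budget allows "xhigh"
```

A production agent would likely also downgrade the level dynamically when the user starts interrupting, but that heuristic is beyond what the launch materials describe.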

Two new voices, Cedar and Marin, ship alongside the model. Parallel tool calling with audible narration lets the model announce what it's doing mid-turn ("let me check that"), which is materially useful for production voice agents where silence during a tool lookup sounds like a dropped call.

Key Specifications

Specification          Details
Provider               OpenAI
Model Family           GPT-Realtime
Parameters             Not disclosed
Context Window         128K tokens
Input Price            $32.00/M tokens ($0.40/M cached)
Output Price           $64.00/M tokens
Release Date           2026-05-07
License                Proprietary (API access only)
API Access             OpenAI Realtime API (GA)
Session Types          Voice-agent, translation, transcription
New Voices             Cedar, Marin
First-Audio Latency    1.12s (low) to 2.33s (xhigh)

Benchmark Performance

OpenAI published two benchmark comparisons against GPT-Realtime-1.5 at launch. Both use audio-native evaluations rather than text-based proxies, which makes them more representative of how the model actually performs in voice workflows.

Benchmark                                GPT-Realtime-2    GPT-Realtime-1.5    Delta
Big Bench Audio (high reasoning)         96.6%             81.4%               +15.2 pp
Audio MultiChallenge (xhigh reasoning)   48.5%             34.7%               +13.8 pp

The Big Bench Audio result is strong - 96.6% is near ceiling on that benchmark. The Audio MultiChallenge score is harder to contextualize without a broader competitive field: 48.5% at xhigh reasoning means the model still misses more than half the hardest audio reasoning tasks. That said, the 13.8 percentage point gain over the prior version suggests the architecture improvements aren't marginal. For context on where these benchmarks sit relative to the broader field, see our audio understanding benchmarks leaderboard.

No third-party audio reasoning evaluations had published results as of launch day. The numbers above come directly from OpenAI's release materials.

Key Capabilities

The reasoning level system is the most developer-relevant addition. Setting reasoning to minimal or low keeps latency under 1.5 seconds for applications where speed matters more than accuracy - think simple FAQ bots or appointment schedulers. Cranking it to high or xhigh is appropriate for technical support, medical triage, or any domain where the model needs to reason across a long conversation before responding. This isn't a new idea (OpenAI has used similar budgeting in its text models), but it's the first time it's shipped in a production real-time audio context.
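In practice the level would be just another session parameter. A hypothetical configuration payload, sketched in the style of the Realtime API's `session.update` event - the `reasoning_level` field name is an assumption for illustration, not a confirmed schema:

```python
import json

def build_session_update(reasoning_level: str, voice: str = "cedar") -> dict:
    """Build a session.update event selecting a reasoning level.

    The field names below (notably `reasoning_level`) are illustrative
    assumptions; consult the Realtime API reference for the real schema.
    """
    allowed = {"minimal", "low", "medium", "high", "xhigh"}
    if reasoning_level not in allowed:
        raise ValueError(f"unknown reasoning level: {reasoning_level}")
    return {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-2",
            "voice": voice,                      # "cedar" or "marin"
            "reasoning_level": reasoning_level,  # defaults to "low"
        },
    }

event = build_session_update("high")
print(json.dumps(event, indent=2))
```

Validating the level client-side, as above, avoids burning a round trip on a malformed session update.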

Parallel tool calling with narration addresses a real pain point in voice agent design. In prior implementations, a tool call meant silence: the model would pause, make the API call, and resume. GPT-Realtime-2 can narrate mid-turn ("I'm pulling up your account now"), which keeps the conversation feeling alive during processing. Combined with preamble support - phrases the model inserts before launching a longer response - this makes voice agents built on the model sound less robotic. For a practical walkthrough of building on top of models like this, our AI voice agent setup guide covers the architecture end-to-end.
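The narration flow can be sketched as a small handler: when a tool call arrives, the agent speaks a filler line, runs the tool, then hands the result back. Everything below is simulated locally - the event shapes and the `lookup_account` tool are illustrative assumptions, not the API's real wire format:

```python
def handle_tool_call(event: dict, tools: dict, speak) -> dict:
    """Narrate while a tool call executes, then return a result event.

    `speak` stands in for streaming a short audio utterance to the user;
    in a real agent the model produces this narration itself mid-turn.
    """
    name = event["name"]
    speak(f"Let me check that - looking up {name.replace('_', ' ')} now.")
    result = tools[name](**event["arguments"])  # the slow backend call
    return {"type": "tool.result", "call_id": event["call_id"], "output": result}

# Hypothetical tool: an account lookup that would normally hit a backend.
tools = {"lookup_account": lambda account_id: {"id": account_id, "status": "active"}}

spoken = []
result = handle_tool_call(
    {"type": "tool.call", "call_id": "c1", "name": "lookup_account",
     "arguments": {"account_id": "A-42"}},
    tools,
    speak=spoken.append,  # capture narration instead of playing audio
)
print(spoken[0])
print(result["output"]["status"])  # prints active
```

The point of the sketch is the ordering: the narration goes out before the tool runs, so the user never hears dead air.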

The 128K context window opens up use cases that weren't viable with 32K. A voice agent can now hold a full customer interaction history in context, reference a loaded knowledge base, or maintain state across a long troubleshooting session without needing to summarize and compress. That matters most in enterprise deployments where conversations can run 30-60 minutes.
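A rough capacity check makes the difference concrete. Assuming, purely for illustration, about 600 tokens consumed per minute of conversation (the actual audio tokenization rate is not published here), a sketch of how long a session each window sustains:

```python
CONTEXT_WINDOW = 128_000   # GPT-Realtime-2
PREV_CONTEXT = 32_000      # GPT-Realtime-1.5

# Assumed for illustration only: tokens consumed per minute of audio.
TOKENS_PER_MINUTE = 600

def max_session_minutes(context_tokens: int, reserved_tokens: int = 8_000) -> float:
    """Minutes of conversation that fit after reserving room for the
    system prompt, tool schemas, and the next response."""
    return (context_tokens - reserved_tokens) / TOKENS_PER_MINUTE

print(round(max_session_minutes(CONTEXT_WINDOW)))  # 200 minutes
print(round(max_session_minutes(PREV_CONTEXT)))    # 40 minutes
```

Under these assumed numbers a 32K window is already tight for the 30-60 minute enterprise sessions mentioned above once a knowledge base is loaded, while 128K leaves ample headroom.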

Companion Models

GPT-Realtime-2 ships with two specialized models on the same Realtime API infrastructure:

GPT-Realtime-Translate targets live interpretation: 70+ input languages, 13 output languages, priced at $0.034 per minute. The pricing is designed for session-length billing rather than token counting, which makes it straightforward to estimate costs for real-time translation features. It's a dedicated pipe - not conversational - so it's purpose-built for the translation task rather than a general model with translation bolted on.

GPT-Realtime-Whisper is a streaming speech-to-text model at $0.017 per minute. Latency is controllable, which is the practical differentiator over batch transcription APIs. For workloads that need both transcription and understanding, the session types let developers route traffic cleanly without running multiple API calls.

Pricing and Availability

GPT-Realtime-2 uses token-based billing: $32/M input tokens, $64/M output tokens, with cached input at $0.40/M. Audio tokens are priced higher than text because they encode more information per token, but the cache discount is substantial - applications with repeated context (system prompts, knowledge bases) can recover a lot of that cost.
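The cache discount is easiest to see with numbers. A sketch of per-request cost under the published rates ($32/M input, $0.40/M cached input, $64/M output); the token counts are made-up examples:

```python
# Published rates, dollars per million tokens.
INPUT_PER_M, CACHED_PER_M, OUTPUT_PER_M = 32.00, 0.40, 64.00

def request_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request, billing cached input at the
    discounted rate and the remaining input at full price."""
    fresh = input_tokens - cached_tokens
    cost = (fresh * INPUT_PER_M + cached_tokens * CACHED_PER_M
            + output_tokens * OUTPUT_PER_M) / 1_000_000
    return round(cost, 4)

# 10K-token prompt, 8K of it cached (system prompt + knowledge base),
# 2K tokens of audio out:
print(request_cost(10_000, 8_000, 2_000))  # 0.1952
# The same request with a cold cache:
print(request_cost(10_000, 0, 2_000))      # 0.448
```

With a warm cache the example request costs less than half the cold-cache price, which is why the cache discount dominates cost modeling for agents with large static context.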

Pricing is essentially unchanged from GPT-Realtime-1.5; what changes is the feature set per dollar - 4x the context, stronger reasoning, and better tooling at roughly equivalent token rates.

GPT-Realtime-Translate at $0.034/min and GPT-Realtime-Whisper at $0.017/min both use minute-based billing. A 5-minute translation session costs $0.17; a 5-minute transcription session costs $0.085. Those numbers are low enough that cost isn't the primary decision factor for most applications.
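The per-minute math above, as a small helper using the launch rates:

```python
# Published per-minute rates for the companion models.
RATE_PER_MINUTE = {"translate": 0.034, "whisper": 0.017}

def session_cost(model: str, minutes: float) -> float:
    """Dollar cost of a minute-billed session."""
    return round(RATE_PER_MINUTE[model] * minutes, 4)

print(session_cost("translate", 5))  # 0.17
print(session_cost("whisper", 5))    # 0.085
```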

All three models are generally available through the OpenAI Realtime API - no beta waitlist. See our AI voice and speech leaderboard for how these models compare on quality and cost metrics against other voice AI providers.


Strengths

  • GPT-5-class reasoning in a real-time audio context, with five adjustable levels
  • 128K context - 4x GPT-Realtime-1.5 - practical for long-session voice agents
  • Parallel tool calling with live narration reduces dead-air during tool lookups
  • Big Bench Audio score of 96.6% at high reasoning is near ceiling
  • Generally available right away - no beta queue

Weaknesses

  • Audio MultiChallenge score of 48.5% means it still fails most hard audio reasoning tasks
  • Pricing at $32/$64 per million tokens is expensive relative to text APIs - production scale requires careful cost modeling
  • No open-weight alternative - entirely proprietary
  • Latency at xhigh reasoning (2.33s) may be too slow for latency-sensitive voice applications
  • Limited to 13 output languages for translation, despite 70+ input language support


Sources

Last verified May 8, 2026

About the author
James Kowalski, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.