Thinking Machines Builds AI That Listens While Talking

Mira Murati's startup unveils TML-Interaction-Small, a 276B MoE model that hits 0.40-second response latency by listening and generating speech at the same time.

The standard approach to voice AI is built on a flawed assumption: that a computer needs to hear everything before it can respond. Mira Murati's Thinking Machines Lab is betting that design choice is exactly why every existing voice assistant feels robotic - and on May 11, the startup published its answer.

TL;DR

  • TML-Interaction-Small is a 276B MoE model (12B active parameters) built for full-duplex conversation: it processes audio, video, and text in 200ms micro-turns while producing speech at the same time
  • Response latency hits 0.40 seconds against GPT-4o Realtime-2's 1.18 seconds - nearly 3x faster on the same benchmark
  • Proactive capabilities (TimeSpeak, CueSpeak) score 64.7% and 81.7% respectively; commercial competitors score near zero because they don't monitor input between user turns
  • Research preview only with no pricing or public API; limited access expected within months, wider release later in 2026

| System | Turn Latency | FD-bench V1.5 Quality | TimeSpeak |
| --- | --- | --- | --- |
| TML-Interaction-Small | 0.40s | 77.8 | 64.7% |
| GPT-4o Realtime-2 | 1.18s | ~47.8-54.3 | 4.3% |
| Gemini Live | not disclosed | ~47.8-54.3 | ~0% |

The numbers look strong. Whether they hold outside the company's own evaluation framework is a different question.

The Architecture Behind Full-Duplex

The conventional voice AI pipeline - speech detection, transcription, model inference, text-to-speech synthesis - is a production harness, not an integrated system. Thinking Machines calls this the "harness problem": each component adds latency, and the seams between them produce the stilted, turn-by-turn feel of every existing voice assistant.

TML-Interaction-Small takes a different route. Rather than stacking pretrained components, the model is trained end-to-end from scratch as a single system.

Encoder-Free Input Fusion

Audio arrives as dMel representations through a lightweight embedding layer. Video splits into 40x40 image patches processed via a small MLP (called hMLP). Both streams merge with text tokens and flow into a 276-billion-parameter transformer directly - no separate audio encoder, no vision model bolted on.

This matters because encoder-based approaches lock in pretraining assumptions. Training all modalities together forces the model to integrate audio, video, and text at the representation level rather than at the interface.
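
To make the fusion concrete, here is a minimal PyTorch sketch of the encoder-free path. The dimensions, module names, vocabulary size, and the three-channel patch assumption are all illustrative; the post specifies only dMel audio features, 40x40 video patches through a small MLP (hMLP), and direct fusion with text tokens.

```python
# A minimal sketch of encoder-free input fusion. All sizes are assumed.
import torch
import torch.nn as nn

D_MODEL = 1024          # transformer width (illustrative)
N_MEL = 80              # dMel bins per audio frame (assumed)
PATCH = 40              # 40x40 image patches, as described in the post

audio_embed = nn.Linear(N_MEL, D_MODEL)        # lightweight audio embedding
hmlp = nn.Sequential(                          # small patch MLP ("hMLP")
    nn.Linear(PATCH * PATCH * 3, D_MODEL),
    nn.GELU(),
    nn.Linear(D_MODEL, D_MODEL),
)
text_embed = nn.Embedding(32_000, D_MODEL)     # ordinary token embedding

def fuse(audio_frames, video_patches, text_ids):
    """Merge all three modalities into one token stream for the transformer."""
    a = audio_embed(audio_frames)              # (T_audio, D_MODEL)
    v = hmlp(video_patches.flatten(1))         # (T_video, D_MODEL)
    t = text_embed(text_ids)                   # (T_text, D_MODEL)
    # No separate audio encoder or vision tower: the fused sequence goes
    # straight into the shared transformer stack.
    return torch.cat([a, v, t], dim=0)

tokens = fuse(torch.randn(20, N_MEL),          # 20 audio frames
              torch.randn(4, PATCH, PATCH, 3), # 4 video patches
              torch.tensor([1, 2, 3]))         # 3 text tokens
print(tokens.shape)                            # torch.Size([27, 1024])
```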

200ms Micro-Turns

The core mechanism is a processing loop running in 200ms chunks. TML-Interaction-Small takes in 200ms of combined audio and video input, creates 200ms of audio output, then starts on the next chunk right away - without waiting for a turn to finish. The model can be mid-sentence when a user starts speaking, and it decides on its own whether to yield, pause, or keep going.
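
A minimal sketch of what that loop looks like, assuming the model exposes a single step that consumes one chunk and emits the next. Every name here (step, new_session, the "yield" action, the I/O objects) is hypothetical scaffolding, not a published API.

```python
# A sketch of the 200ms micro-turn loop. `model`, `mic`, `camera`, and
# `speaker` are hypothetical stand-ins for APIs the preview has not published.
CHUNK_MS = 200

def run_session(model, mic, camera, speaker):
    state = model.new_session()                # persistent conversation state
    while state.active:
        audio_in = mic.read(CHUNK_MS)          # 200ms of user audio
        video_in = camera.read(CHUNK_MS)       # 200ms of user video
        # One step both ingests the chunk and emits the next 200ms of speech;
        # there is no turn boundary, so input and output overlap every cycle.
        audio_out, action = model.step(state, audio_in, video_in)
        if action == "yield":                  # user barged in: fall silent
            continue
        speaker.play(audio_out)                # possibly mid-sentence speech
```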

For inference, the team built persistent streaming sessions that keep audio context in GPU memory across chunks rather than allocating fresh memory per request. They upstreamed those kernel optimizations to SGLang, the open-source inference framework.
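
The persistent-session idea, sketched below: the conversation's attention keys and values stay resident between chunks, so each micro-turn only pays for its new tokens. The class, shapes, and sizes are assumptions for illustration; the actual SGLang kernel work is not public in this form.

```python
# A sketch of a persistent streaming session: KV caches stay resident across
# 200ms chunks instead of being rebuilt per request. Shapes are illustrative.
import torch

class StreamingSession:
    def __init__(self, n_layers, n_heads, head_dim, device="cpu"):  # "cuda" in practice
        # One growing KV cache per layer, allocated once and reused.
        self.kv = [(torch.empty(0, n_heads, head_dim, device=device),
                    torch.empty(0, n_heads, head_dim, device=device))
                   for _ in range(n_layers)]

    def append(self, layer, k_new, v_new):
        # Extend the resident cache with this chunk's keys/values rather than
        # re-encoding the whole conversation on every request.
        k, v = self.kv[layer]
        self.kv[layer] = (torch.cat([k, k_new]), torch.cat([v, v_new]))

sess = StreamingSession(n_layers=2, n_heads=8, head_dim=128)
sess.append(0, torch.randn(30, 8, 128), torch.randn(30, 8, 128))  # one chunk
print(sess.kv[0][0].shape)  # torch.Size([30, 8, 128])
```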

[Image: studio microphone on a desk, representing the voice-first architecture of TML-Interaction-Small. Caption: Full-duplex AI requires rethinking the voice pipeline from the ground up, not just cutting inference latency. Source: unsplash.com]

Two Models, One Conversation

TML-Interaction-Small isn't a single monolith doing everything at once. The system splits work between two models that share full conversation context throughout.

The Interaction Model

This half stays live during a session. It handles immediate audio and video responses, maintains conversational presence, and manages micro-turn flow. When a user speaks over the model mid-sentence, the interaction model handles that collision directly - it doesn't route to a higher-level controller to decide what to do.

The Background Model

The background model handles asynchronous work: web search, tool calls, extended reasoning. Results stream back into the interaction model's context rather than triggering a hard context switch. From a user's perspective, the conversation continues while the model is mid-lookup.
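
The coordination pattern reads like two cooperating loops over shared state. Here is a minimal asyncio sketch of that split; all object names and the queue protocol are invented for illustration, inferred from the post's description rather than taken from any published interface.

```python
# A sketch of the interaction/background split. `model`, `tools`, and
# `session` are hypothetical objects; only the pattern is the point.
import asyncio

async def interaction_loop(model, session, results):
    while session.active:
        chunk = await session.next_input()         # 200ms of audio+video
        # Drain finished background work into shared context first, so
        # results surface mid-conversation without a hard context switch.
        while not results.empty():
            session.context.append(results.get_nowait())
        await session.speak(model.step(session.context, chunk))

async def background_loop(tools, session, results):
    while session.active:
        task = await session.pending_tasks.get()   # e.g. web search, tool call
        results.put_nowait(await tools.run(task))  # stream result back

async def run(model, tools, session):
    results = asyncio.Queue()
    await asyncio.gather(interaction_loop(model, session, results),
                         background_loop(tools, session, results))
```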

The split resembles how a skilled phone interviewer handles a live conversation while also tracking notes and queuing follow-up questions - two parallel threads with shared context.

Proactive Capabilities - Where Competitors Score Near Zero

The most remarkable claims aren't about latency. They're about capabilities that existing real-time voice systems don't have at all.

TimeSpeak tests whether a model can launch speech based on timing instructions - "remind me to breathe in and out every four seconds," for instance. TML-Interaction-Small scores 64.7% on this benchmark. GPT-4o Realtime-2 scores 4.3%. That gap isn't marginal; it's an entirely different capability class.
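
To illustrate what passing TimeSpeak requires, the sketch below shows a timing cue firing across silent 200ms micro-turns. The trigger logic is purely illustrative, not how the model implements it internally.

```python
# Illustration only: a timing instruction honored chunk by chunk, with no
# user prompt between cues.
CHUNK_MS = 200

def timespeak_demo(period_s=4.0, duration_s=12.0):
    elapsed_ms, next_cue_ms = 0, int(period_s * 1000)
    while elapsed_ms < duration_s * 1000:
        elapsed_ms += CHUNK_MS                 # one micro-turn of silence
        if elapsed_ms >= next_cue_ms:
            print(f"[{elapsed_ms/1000:.1f}s] model speaks: 'breathe'")
            next_cue_ms += int(period_s * 1000)

timespeak_demo()   # cues at 4.0s, 8.0s, 12.0s without any user turn
```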

CueSpeak tests visual proactivity: whether the model notices a visual change and decides on its own to comment on it. TML-Interaction-Small scores 81.7%. Other commercial real-time APIs score near zero on this because they don't monitor video between user turns.

Audio MultiChallenge, a broader audio intelligence test, puts the model at 43.4% APR - competitive with non-thinking models on audio reasoning tasks. FD-bench V3 overall response quality: 82.8%.

Thinking Machines' committed gigawatt of Nvidia Vera Rubin compute and the Google Cloud GB300 partnership suggest the company placed its infrastructure bets well before this public preview. Running 276B MoE models at interactive latency isn't cheap.

[Image: Thinking Machines Lab's interaction models blog post. Caption: The Thinking Machines Lab research blog published the full technical architecture on May 11, 2026. Source: thinkingmachines.ai]

What It Does Not Tell You

FD-bench Is a Thinking Machines Benchmark

The primary evaluation framework here - FD-bench in all three versions (V1, V1.5, V3) - was developed by Thinking Machines. The scores haven't been independently replicated and the evaluation code hasn't been published. That doesn't mean the numbers are wrong, but the competitive comparison table shows TML-Interaction-Small evaluated on a benchmark its own team designed. External researchers haven't confirmed these figures yet.

No Access, No Pricing, No API

The research preview comes with no public API, no pricing, and no enterprise SLA details. OpenAI's Realtime API went GA with three models and established pricing last week. TML-Interaction-Small can't be integrated or budgeted for by anyone outside the invited preview group right now.

Long Sessions Are an Acknowledged Problem

Thinking Machines' own blog post notes that very long sessions create context management challenges. The 200ms micro-turn loop accumulates context faster than turn-based systems, which means GPU memory use grows faster too. For short conversations this is invisible. For hour-long sessions or complex multi-tool workflows, context management is still an open problem the team says it's working on.
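
Some back-of-the-envelope numbers show why. The token rates and KV-cache sizes below are assumptions for illustration; the post publishes neither.

```python
# Assumed figures throughout: tokens per 200ms chunk, and per-token KV bytes
# (K+V x layers x heads x head_dim x fp16). None of these are published.
CHUNKS_PER_SEC = 5            # 200ms micro-turns
TOKENS_PER_CHUNK = 30         # assumed audio+video tokens per chunk
KV_BYTES_PER_TOKEN = 2 * 61 * 8 * 128 * 2    # ~244 KiB/token, assumed

for minutes in (5, 60):
    tokens = CHUNKS_PER_SEC * TOKENS_PER_CHUNK * minutes * 60
    print(f"{minutes:3d} min -> {tokens:,} tokens, "
          f"~{tokens * KV_BYTES_PER_TOKEN / 2**30:.1f} GiB KV cache")
# Under these assumptions: 5 min is ~10 GiB, an hour is ~126 GiB.
```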

What Larger Models Will Actually Do

TML-Interaction-Small runs 12B active parameters. The blog post says the company is constrained to this scale by current infrastructure, not by architecture. Larger versions are planned for later in 2026. There's no data yet on whether proactive and voice capabilities scale with model size or plateau at some threshold.


The core architectural argument - that interactivity should be native to the model rather than stitched together with a harness - is well-reasoned and the latency benchmarks support it. The proactive scores, if they hold up under independent review, represent a genuinely different capability class. What Thinking Machines hasn't done yet is let anyone outside the company test any of this. Both the evaluation and the access are controlled by the same team. That's a standard position for a research preview, but it means the distance between the benchmark and the actual product remains exactly as wide as Thinking Machines says it is.

About the author

Elena Marchetti, Senior AI Editor & Investigative Journalist. Elena is a technology journalist with over eight years of experience covering artificial intelligence, machine learning, and the startup ecosystem.