Speech Turing Tests, Smart Routing, Pseudocode Agents
New research reveals no speech AI passes a Turing test, adaptive routing slashes LLM costs 82%, and pseudocode planning transforms agent reliability.

Three papers caught my attention in today's arXiv batch. One runs the first proper Turing test on voice AI and finds every system failing. Another borrows from neuroscience to route queries between fast and slow reasoning paths - cutting costs 82%. And a third replaces the reactive decision loops that plague AI agents with upfront pseudocode plans.
TL;DR
- Speech-to-Speech Turing Test - Researchers ran 2,968 human judgments across nine voice AI systems. None passed. The bottleneck is not understanding - it's prosody, emotion, and the absence of human speech quirks like "um" and "probably."
- ODAR-Expert - A neuroscience-inspired routing framework sends easy queries to a fast model and hard ones to a slow deliberator. It hits 98.2% on MATH and 54.8% on Humanity's Last Exam while slashing compute costs by 82%.
- PseudoAct - Instead of letting agents react step-by-step, this framework creates a full pseudocode plan before executing anything. It lifts success rates on FEVER by 20.93% and sets a new state-of-the-art on HotpotQA.
No Voice AI Passes the Turing Test
Can an AI hold a phone call and fool you into thinking it's human? A team led by Xiang Li and Benyou Wang set out to answer that question with the first systematic Turing test for speech-to-speech (S2S) systems. The result, now accepted at ICLR 2026, is a clear no.
The researchers recruited 28 participants from 10 countries and collected 2,968 human judgments through a gamified online platform. They tested nine systems: GPT-4o, Gemini 2.5-Pro, Qwen3, Kimi-K1.5, ChatGLM-4.5, Hunyuan-TurboS, Doubao-Pro 1.5, Claude-Sonnet 4, and iFLYTEK-Spark.
Every system failed. Success rates ranged from 7% to 31%, well below the 50% threshold required to "pass" as human. GPT-4o performed best at 23-25.9% across conditions, while iFLYTEK-Spark scored as low as 0% in some settings.
Despite rapid progress, voice AI systems still struggle to sound convincingly human in open conversation.
The most revealing finding is where these systems fail. The team developed an 18-dimension evaluation framework spanning five categories: semantic coherence, non-physiological paralinguistic cues (rhythm, intonation, stress), physiological markers (breathing sounds), mechanical persona traits (excessive formality, sycophancy), and emotional expression.
The bottleneck is not understanding. These models handle semantics and logic reasonably well. They fail on the human stuff - rigid prosody, the absence of filler and hedging words like "um" and "probably," monotone emotional delivery, and a persistent pattern of being too polished. Acoustic emotion scores were even lower than textual sentiment scores, meaning the models understood emotions conceptually but couldn't express them through voice.
For anyone building voice products, this is a useful reality check. The path to convincing speech AI runs through paralinguistics and embodied expression, not just better language understanding. If you are working with AI agents that need voice interfaces, plan for the uncanny valley to persist for a while.
ODAR-Expert Cuts Reasoning Costs 82%
Not every question deserves the same computational budget. That's the core insight behind ODAR-Expert, a framework from Siyuan Ma and colleagues that adaptively routes queries between fast and slow reasoning paths - borrowing directly from the dual-process theory of human cognition.
The system works in three stages. First, a lightweight Difficulty Estimator scores each query on a 0-1 complexity scale. Easy queries (below 0.3, roughly 41% of traffic) get a single fast model call. Medium queries (0.3-0.7, about 35%) run through the fast agent with slow verification. Hard queries (above 0.7, about 24%) trigger a best-of-five expansion using the slow deliberative model.
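The routing logic above can be sketched in a few lines. The thresholds and traffic shares come from the paper; the function and class names are my own placeholders, not the authors' implementation, and the difficulty score is assumed to come from a separate lightweight estimator.

```python
from dataclasses import dataclass

@dataclass
class Route:
    path: str       # "fast", "verify", or "deliberate"
    n_samples: int  # how many candidate answers to draw

def route(difficulty: float) -> Route:
    """Map an estimated 0-1 difficulty score to a reasoning path
    using ODAR's published thresholds."""
    if difficulty < 0.3:
        return Route("fast", 1)        # single fast-model call (~41% of traffic)
    if difficulty <= 0.7:
        return Route("verify", 1)      # fast answer plus slow verification (~35%)
    return Route("deliberate", 5)      # best-of-five slow expansion (~24%)
```

The point of the dataclass is that downstream code can dispatch on `path` without re-deriving the thresholds anywhere else.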
The results are striking. Using GPT-5.1 as the fast agent and Claude-4.5 Sonnet as the slow agent, ODAR hits 98.2% on MATH, 99.1% on GSM8K, 54.8% on Humanity's Last Exam, and 74.2% on SWE-bench. That's 89.6% average accuracy across 23 benchmarks, beating Self-Consistency by 6 percentage points.
But the real story is the open-source configuration. Using Llama 4 Scout and DeepSeek V3.2, the system hits 84.4% average accuracy while reducing computational cost by 82% compared to uniform sampling - dropping from 25x to 4.5x normalized cost.
The secret sauce is Free-Energy Fusion, a selection mechanism that picks the best answer by minimizing variational free energy rather than simple majority voting. This balances raw accuracy (log-likelihood) against a "varentropy" penalty that filters hallucinations by detecting high uncertainty in the model's own token predictions.
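A minimal sketch of that selection rule, under my own simplifying assumptions: each candidate answer is represented only by its per-token log-probabilities, varentropy is the variance of per-token surprisal, and the penalty weight is a free hyperparameter. The paper's actual free-energy formulation may differ in detail.

```python
def varentropy(token_logprobs: list[float]) -> float:
    """Variance of per-token surprisal (-log p); high values flag
    erratic confidence, a common hallucination signature."""
    surprisals = [-lp for lp in token_logprobs]
    mean = sum(surprisals) / len(surprisals)
    return sum((s - mean) ** 2 for s in surprisals) / len(surprisals)

def free_energy(token_logprobs: list[float], penalty_weight: float = 1.0) -> float:
    """Lower is better: negative log-likelihood plus a varentropy penalty."""
    nll = -sum(token_logprobs)
    return nll + penalty_weight * varentropy(token_logprobs)

def select_answer(candidates: list[list[float]]) -> int:
    """Pick the index of the candidate minimizing free energy,
    instead of taking a majority vote over answer strings."""
    return min(range(len(candidates)), key=lambda i: free_energy(candidates[i]))
```

Unlike majority voting, this can reject a popular answer if the model was visibly uncertain while generating it.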
For practitioners tracking reasoning benchmarks and cost efficiency, this reframes the scaling question. Instead of throwing more compute at every query, you can route intelligently and get most of the performance at a fraction of the cost. The 82% reduction with open-source models is especially relevant for teams running inference at scale.
The limitations are worth noting. The hard path takes over 60 seconds for 24% of queries, ruling out real-time applications. And 66% of the remaining errors come from intrinsic model limitations that no routing scheme can fix.
PseudoAct Gives Agents a Blueprint
If you have built anything with AI agents, you know the failure mode. The agent calls a tool, reads the result, decides what to do next, calls another tool, gets confused by the output, loops back, tries a different approach, burns through tokens, and sometimes just gives up.
Structured planning through pseudocode synthesis removes the chaotic tool-calling loops that plague reactive agent designs.
PseudoAct, from Yihan Wen and Xin Chen, tackles this head-on by replacing reactive decision-making with upfront pseudocode synthesis. Before executing a single action, the agent generates a complete pseudocode plan that decomposes the task into subtasks with explicit control flow.
The framework supports seven logic primitives: EXECUTE for atomic actions, IF-ELIF-ELSE for branching, FOR and WHILE for iteration, TRY-ON_FAILURE for fault tolerance, PARALLEL for concurrent operations, and DATA-FLOW for inter-step dependencies. This isn't a rough sketch - it's a detailed blueprint with termination conditions and iteration bounds baked in.
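To make the primitives concrete, here is an illustrative plan of my own invention for a fact-checking task (the kind FEVER evaluates). It is not an example from the paper, just a sketch of how the primitives compose, with the termination bound written directly into the loop:

```
PLAN verify_claim(claim):
    EXECUTE queries = generate_search_queries(claim)     # atomic action
    FOR q IN queries (max_iter=3):                       # bounded iteration
        TRY:
            EXECUTE docs = search(q)
        ON_FAILURE:
            EXECUTE docs = search(rephrase(q))           # fault tolerance
        DATA-FLOW docs -> evidence_pool                  # inter-step dependency
    IF evidence_pool supports claim:
        EXECUTE label = "SUPPORTED"
    ELSE:
        EXECUTE label = "REFUTED"
```

Because `max_iter` and the branch structure are fixed before execution, the agent cannot talk itself into an unbounded search loop.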
A deterministic Control-Flow Executor then walks through the plan step by step, maintaining global memory state. Each step gets a composite context combining global constraints (workflow topology, termination criteria) with local context (step objective, resolved inputs). The executor delegates atomic tasks to a ReAct-style agent but enforces structural constraints that prevent the wandering and looping that plagues purely reactive approaches.
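A toy version of that executor, assuming a linear plan for simplicity (no branching or parallelism) and my own field names for steps. The key property it demonstrates is the one the paper claims: each delegated call sees a compact composite context, never the full conversation history.

```python
def run_plan(plan_steps, react_agent, global_ctx):
    """Deterministically walk a linear plan, maintaining global memory.
    Each atomic step is delegated to a ReAct-style agent with a
    compact composite context: global constraints + local objective
    + resolved inputs from earlier steps."""
    memory = {}
    for step in plan_steps:
        context = {
            "global": global_ctx,                          # topology, termination criteria
            "objective": step["objective"],                # this step's local goal
            "inputs": {k: memory[k] for k in step["inputs"]},
        }
        memory[step["output"]] = react_agent(context)      # atomic execution
    return memory
```

A real executor would also interpret branches, loops with iteration bounds, and TRY/ON_FAILURE handlers; the sketch only shows the memory-threading discipline.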
The payoff: a 20.93% absolute improvement in success rate on FEVER and a new state-of-the-art on HotpotQA. The approach also eliminates infinite loops - one of the most common failure modes in autonomous agents - by structurally enforcing termination conditions rather than relying on the model to generate stop tokens.
Token efficiency also improves significantly. Standard reactive agents build up context at O(nL) complexity as their history grows. PseudoAct operates at O(L_plan + n(L_step + L_global)), where compact step contexts are much smaller than accumulated conversation histories.
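The difference is easy to see with illustrative numbers (mine, not the paper's): a reactive agent's per-step context grows with history length, while the executor's stays constant.

```python
def reactive_step_context(n_steps: int, L_hist: int) -> int:
    """Context fed to a reactive agent at step n: the entire
    accumulated history, O(nL)."""
    return n_steps * L_hist

def pseudoact_step_context(L_step: int, L_global: int) -> int:
    """Context fed to PseudoAct's executor at any step: a compact,
    constant-size composite (the plan itself is read once up front)."""
    return L_step + L_global
```

With, say, 2,000-token history chunks over 20 steps versus 300-token step contexts and 200 tokens of global constraints, the reactive agent is reading 80x more context per call by the end of the run.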
The Common Thread
All three papers share a theme: the brute-force era of AI is giving way to structured intelligence. Voice AI needs more than bigger models - it needs to learn the messy, imperfect patterns of human speech. Reasoning systems needn't throw equal compute at every problem - they need to triage. And agents needn't stumble through tasks reactively - they need to plan.
The shift from "more compute" to "smarter compute" is accelerating. ODAR's 82% cost reduction with open-source models is a preview of where inference economics is heading. PseudoAct's structured planning is a preview of how agent architectures will mature. And the S2S Turing test is a reminder that some problems - like sounding human - resist the standard scaling playbook completely.
Sources:
- Human or Machine? A Preliminary Turing Test for Speech-to-Speech Interaction - Li et al., ICLR 2026
- ODAR: Principled Adaptive Routing for LLM Reasoning via Active Inference - Ma et al., 2026
- PseudoAct: Leveraging Pseudocode Synthesis for Flexible Planning and Action Control in Large Language Model Agents - Wen & Chen, 2026
