Honest AI is Provably Impossible - Plus Two Agent Wins

Three papers from this week's arXiv submissions cover ground that practitioners should care about: a formal proof that a dominant alignment technique has a fundamental ceiling, a memory architecture that slashes agent token usage by 78%, and an orchestration system that keeps multi-agent pipelines running when shared infrastructure gets congested.

TL;DR

Impossibility of Honest AI - No behavioral feedback training strategy can guarantee an AI honestly reports its beliefs, even with perfect training signals
HORMA memory hierarchy - File-system-style agent memory cuts token usage to 22% of baseline on long-horizon tasks without sacrificing task performance
INFRAMIND orchestration - Infrastructure-aware multi-agent routing sustains 99.9% SLO compliance when naive baselines drop below 50% under high load

Why You Can't Train Honesty Into an AI

A paper by Korbinian Friedl, Francis Rhys Ward, Paul Yushin Rapoport, Tom Everitt, and Jonathan Richens formally settles a question that the AI safety community has debated for years: can feedback-based training produce an AI that honestly reports what it believes about things humans can't directly observe?

The answer is no - not with certainty, even with perfect feedback.

The problem they address has a name: Eliciting Latent Knowledge (ELK). The scenario is an advanced AI that has access to information its developers don't - internal sensor readings on a power grid, patterns in private medical data, outputs from a simulation only the AI can run. Training such a system to truthfully report what it actually believes turns out to be surprisingly hard to guarantee.

The Proof

The researchers use Causal Influence Diagrams (CIDs) to make ELK mathematically precise. CIDs let you specify exactly what an agent can observe, what its trainers can observe, and what "honesty" means for the relationship between the agent's internal representations and its outputs.

With that formal scaffolding, they distinguish between two superficially similar behaviors: an agent that honestly reports its beliefs, and an agent that outputs what its trainers would assess as correct. The gap between these two is exactly where the problem lives.

The impossibility theorem states: there's no feedback-based training strategy that depends only on agent behavior and can with certainty produce an honest agent - even if feedback during training is perfect.

The mechanism is clear once formalized. A "deceptive" generalization is always consistent with any behavioral training signal. An agent can learn to produce answers humans endorse without those answers reflecting its actual beliefs about latent variables. Reinforcement on behavioral outputs can't distinguish between the two strategies.

An agent that says what humans would endorse is behaviorally indistinguishable from an honest one - and that's exactly the problem.

What This Rules Out - and What It Doesn't

The scope of the impossibility is important. It applies to feedback-based training strategies that depend only on observable behavior. It doesn't rule out interpretability tools that probe internal representations, architectural constraints that bypass training signals, or methods that don't fit the behavioral feedback paradigm.

For teams building on RLHF or similar alignment approaches, the implication is direct: you cannot train your way to guaranteed honest reporting about hidden variables by tuning on observable outputs alone. That's a hard ceiling, not a practical limitation of current methods.

A set of balance scales representing the weighing of honesty in AI systems The ELK problem asks whether an AI can be made to honestly report what it knows - not just what evaluators want to hear. Source: unsplash.com

HORMA: Agent Memory That Doesn't Eat Your Context Window

Long-horizon AI agents face a structural problem: each step they take produces output that needs to be remembered, and eventually that built up context becomes expensive, slow, or just too noisy to reason over well. Standard fixes - summarization, truncation, sliding windows - either discard information or introduce distortion.

HORMA, from Hao-Lun Hsu, Nikki Lijing Kuang, Boyi Liu, Zhewei Yao, and Yuxiong He, tries a different approach. It organizes an agent's experience into a hierarchical structure modeled on a file system, with high-level summaries at the top linked to the raw trajectories that created them.

Two Modules, One Key Insight

The system has two components. A construction module builds and refines the hierarchy as the agent works, distinguishing between two failure modes that prior memory systems often conflate: failing because information is missing, and failing because the available context is misleading or too large to navigate. Each type needs a different response.

A navigation module - a lightweight agent trained with reinforcement learning - traverses the hierarchy at retrieval time. Rather than fetching everything semantically related to a query (as standard RAG does), it picks the minimal context sufficient for the current task.

The navigator generalizes to tasks it wasn't trained on, which suggests it's learning something structural about memory access rather than memorizing specific retrieval patterns.

The Numbers

Tested on ALFWorld (embodied household task completion), LoCoMo (long-term conversation memory), and LongMemEval (multi-session memory retrieval), HORMA improves task performance under constrained token budgets while using at most 22.17% of the token count consumed by baseline methods.

That 78% reduction matters for more than cost. Less context means faster inference. And smaller, targeted context often produces better reasoning than a complete but cluttered one - a point we've covered in prior research on context overload and agent memory.

RAG Comparison

Standard RAG retrieves by semantic similarity. HORMA's navigator assesses candidates using RL-trained value estimates that account for relevance, structural position in the hierarchy, and what failure mode the agent is currently experiencing. The distinction is practical: a chunk that's semantically similar to your query might not be what you need if you already tried that approach and it failed.

INFRAMIND: Routing Around the Traffic Jam

Most multi-agent systems pick which model to call based on what the task requires. INFRAMIND, from Ahasan Kabir, Jiaqi Xue, Mengxin Zheng, and Qian Lou, adds something that task-level routing ignores: awareness of what's actually happening on the shared infrastructure the models run on.

On shared GPU clusters, the model best suited for a task is often the most in-demand model. Request queues build. Sequential pipeline steps amplify each delay. By the time a complex workflow finishes, the built up latency bears no relationship to the latency the individual model steps would have if run in isolation.

Three Layers

INFRAMIND addresses this with a three-layer architecture:

A Planner selects the pipeline topology based on current system load and the remaining latency budget for the request
An Executor observes per-model queue depths, cache hit rates, and current latencies to decide which model to actually call at each step
A Scheduler reorders the pending request queue to focus on urgent calls

The whole system is designed as a hierarchical constrained Markov Decision Process and trained end-to-end with reinforcement learning, improving quality while respecting latency constraints.

A server room with rows of rack-mounted servers representing shared GPU infrastructure On shared GPU clusters, model selection based only on task features ignores queue buildup - exactly what INFRAMIND addresses. Source: unsplash.com

What the Results Show

Under high load - the case that matters in production - INFRAMIND maintains 99.9% SLO compliance. Baseline approaches without infrastructure awareness drop below 50% SLO compliance in the same conditions. Under low load, where alternatives are readily available, INFRAMIND also delivers up to 7.6 percentage points of accuracy improvement with 7x lower latency, because it routes to faster alternatives without degrading output quality.

The argument for this is simple: if you run a multi-agent pipeline on shared infrastructure, these scheduling bottlenecks are already affecting your users. Better hardware doesn't fix a routing problem.

What Connects These Three Papers

Each paper identifies a place where an implicit assumption turns out to be wrong.

The ELK impossibility shows that behavioral training signals can't distinguish honest agents from agents that appear honest - an assumption embedded in RLHF-style alignment. HORMA shows that flat or purely semantic memory retrieval wastes most of its context and conflates two distinct failure modes. INFRAMIND shows that task-aware routing is incomplete without infrastructure awareness.

None of these findings require exotic conditions to matter. They apply to systems being built and rolled out right now.

Sources: