Reasoning Leaks, Hard Limits, and Self-Aware LLMs

Three papers landed today that circle the same uncomfortable question from different angles: what do models actually know about their own reasoning, what can outsiders learn by watching them think, and where does the thinking simply stop working?

TL;DR

Hidden Thoughts Aren't Secret - Reasoning Exposure Prompting extracts a model's hidden chain-of-thought even when providers conceal it, and the traces are good enough for distillation
The Deterministic Horizon - Neural CoT fails architecturally at 19-31 steps; tool-integrated agents reach 86-94% accuracy versus 24-42% for pure chain-of-thought, and fine-tuning barely makes a difference
Capability Self-Assessment - RL (not supervised fine-tuning) teaches models to recognise tasks beyond their competence, and the skill transfers out of distribution

The Lock Isn't as Tight as It Looks

Paper: "Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs"
Authors: Yu-An Lu, Ci-Yang Tsai, Yu-Lin Tsai, Raluca Ada Popa, Chia-Mu Yu

This one should concern anyone building products on top of reasoning models. OpenAI and Anthropic both offer "hidden reasoning" modes - the model thinks through a problem internally, then delivers only a final answer. The pitch: competitors can't extract proprietary reasoning patterns, and users can't game the model by watching its working.

A new paper argues this protection is weaker than advertised. The authors introduce Reasoning Exposure Prompting (REP), a method that elicits reasoning traces from a model even when the provider has suppressed them. REP works by producing demonstrations from a "shadow model" - a smaller, unrestricted model that reasons openly - then wrapping those demonstrations in auxiliary code-like formatting to nudge the target model into mirroring that reasoning style in its visible output.

The attack is inference-only. No weights, no special API access - just standard user prompts. And the extracted traces aren't noise: the paper shows they "substantially increase similarity between exposed and REP-conditioned internal traces while preserving useful reasoning signals." That makes them good enough for distillation. A competitor could learn from your reasoning patterns without touching your weights.

Why This Matters Beyond IP

This lands alongside Matthew Green's recent analysis of encrypted reasoning blocks - the kind OpenAI and Anthropic ship back to servers as authenticated ciphertexts. Green found those blocks can be replayed across sessions and accounts, and that even encrypted, they leak metadata including block size and token counts. A patient attacker can extract individual bits of hidden information by measuring whether the model performed a simple or a complex reasoning step, whenever that choice depends on the secret content.

Neither OpenAI nor Anthropic announced protocol changes in response. Anthropic suggested improved developer documentation. The combined picture from REP and Green's analysis: interface-level concealment doesn't constitute a reliable IP or security boundary.
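The size leak described above is easy to picture with a toy model. The sketch below performs no real encryption and is not Green's method: `sealed_length`, the 28-byte overhead, and the 200-byte threshold are all assumptions made for illustration. It only shows that when ciphertext length tracks plaintext length, an observer can sort reasoning blocks into "simple" and "complex" without decrypting anything.

```python
# Toy model of a length side channel. No real encryption happens here:
# sealed_length() only mimics a scheme whose output is the plaintext length
# plus a fixed overhead (28 bytes is an assumed nonce + tag size).
ASSUMED_OVERHEAD_BYTES = 28


def sealed_length(reasoning_text: str) -> int:
    """Length an observer would see for a sealed reasoning block."""
    return len(reasoning_text.encode("utf-8")) + ASSUMED_OVERHEAD_BYTES


def guess_effort(observed_length: int, threshold: int = 200) -> str:
    """Observer-side guess that uses nothing but the observed length."""
    return "complex" if observed_length > threshold else "simple"


short_trace = "The answer is immediate."
long_trace = "Step: check the constraint, then backtrack. " * 12

for trace in (short_trace, long_trace):
    size = sealed_length(trace)
    print(size, guess_effort(size))
```

Padding blocks to a small set of fixed sizes is the usual general mitigation for this kind of leak; whether either provider does so is not something the sources here address.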

[Image: A Kensington security lock clamped to a laptop]
Even hardware-level locks have limits. The REP attack on hidden reasoning requires only standard user API access. Source: commons.wikimedia.org

Thinking Has a Wall

Paper: "The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary" (ICML 2026)
Authors: Dongxin Guo, Jikun Wu, Siu Ming Yiu

This is the most actionable paper of the three. The core claim: extended chain-of-thought reasoning doesn't just slow down on hard state-tracking tasks - it hits a hard architectural wall, not a soft capability ceiling.

The authors derive the Attention Bottleneck Theorem, which bounds how many distinct states a decoder-only transformer can reliably track. The bound scales with roughly O(H · log(L/H) · √d_h), where H is the number of attention heads, L the context length, and d_h the head dimension. From this bound they derive a Deterministic Horizon d* - the maximum reasoning depth before accuracy collapses super-exponentially.
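To get a feel for how that bound scales, here is a minimal sketch that evaluates only the growth term H · log(L/H) · √d_h. The constants hidden by the big-O, the log base, and the step from this bound to d* are not reproduced here, so the absolute numbers mean nothing; only the ratios are informative. The function name and the model configurations are hypothetical.

```python
import math


def bottleneck_term(num_heads: int, context_len: int, head_dim: int) -> float:
    """H * log(L / H) * sqrt(d_h), with constant factors omitted."""
    if context_len <= num_heads:
        raise ValueError("the term is only meaningful for L > H")
    return num_heads * math.log(context_len / num_heads) * math.sqrt(head_dim)


# Hypothetical configurations, chosen only to show the scaling behaviour.
base = bottleneck_term(num_heads=32, context_len=8_192, head_dim=128)
longer_context = bottleneck_term(num_heads=32, context_len=131_072, head_dim=128)
wider_heads = bottleneck_term(num_heads=32, context_len=8_192, head_dim=512)

print(f"16x context -> {longer_context / base:.2f}x")  # 1.50x
print(f"4x head dim -> {wider_heads / base:.2f}x")     # 2.00x
```

Under this reading of the bound, a 16-fold longer context buys only a 1.5-fold larger term, because context length enters through a logarithm.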

The Numbers

Across 12 models tested - including GPT-4o, Claude Opus 4.5, o3, DeepSeek-R1, and Llama 3.1 and Llama 3.3 at 8B and 70B scale - d* falls between 19 and 31 steps. For GPT-4o specifically, d* ≈ 22.3 steps.

The empirical results confirm this across SWE-Bench, WebArena, and SQL-Multi:

Condition	Result
Neural chain-of-thought	24-42% accuracy
Tool-integrated reasoning	86-94% accuracy
Fine-tuned CoT (best traces)	under 5% improvement

That last row is the key one. If the ceiling were about training quality or instruction following, fine-tuning should help far more. An improvement of less than 5% after fine-tuning on optimal-length traces is what an architectural ceiling would predict. The authors introduce a State-Space Jaccard metric to distinguish this hard capability failure from a preference failure (where a model could succeed but doesn't try), and find cross-model correlation of r = 0.81-0.91 - the failures generalise across models, not just within one family.
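The paper's definition of the State-Space Jaccard metric is not reproduced here. For orientation only, this is the ordinary Jaccard index that the name suggests, applied to two sets of states; the state labels are hypothetical, and the authors' metric may differ in how states are extracted and compared.

```python
def jaccard(a: set, b: set) -> float:
    """Plain Jaccard index |a & b| / |a | b|; 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical example: states a task requires vs. states a model visited.
required_states = {"s0", "s1", "s2", "s3", "s4"}
visited_by_model = {"s0", "s1", "s2", "x9"}

print(jaccard(required_states, visited_by_model))  # 0.5
```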

What This Means for Agent Builders

For any agentic task requiring deterministic state tracking across more than roughly 20 steps - code execution spanning multiple files, multi-hop database queries, long procedural workflows - pure neural reasoning will fail regardless of model size or fine-tuning. The theorem says this is baked into the decoder-only architecture. Building reliable agents past this horizon requires external tools, not a smarter model.
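In practice that advice comes down to a routing decision plus a tool that does the state tracking exactly. The sketch below is one hypothetical way to wire it up, not the paper's system: `choose_executor`, `run_counter_program`, and the default horizon (the low end of the reported 19-31 range, picked to be conservative) are all illustrative assumptions.

```python
# Low end of the 19-31 step range reported above, used as a cautious default.
CONSERVATIVE_HORIZON = 19


def choose_executor(estimated_steps: int, horizon: int = CONSERVATIVE_HORIZON) -> str:
    """Delegate to a tool once a task is expected to reach the horizon."""
    return "tool" if estimated_steps >= horizon else "neural_cot"


def run_counter_program(ops: list[tuple[str, int]]) -> int:
    """Deterministic tool: apply add/mul updates exactly, however many there are."""
    value = 0
    for name, operand in ops:
        if name == "add":
            value += operand
        elif name == "mul":
            value *= operand
        else:
            raise ValueError(f"unknown op: {name}")
    return value


ops = [("add", 1), ("mul", 2)] * 20  # 40 sequential state updates

print(choose_executor(len(ops)))  # tool
print(run_counter_program(ops))   # 2097150
```

The interpreter loop never loses track of `value`, which is the point: the model only has to produce the list of operations, not carry the state through 40 updates itself.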

This is also relevant context for the current wave of reasoning-model benchmarks. Performance gaps between models on deterministic tasks may partly reflect proximity to d* rather than general intelligence differences. Once a task runs past a model's d* (19-31 steps across the models tested, about 22 for GPT-4o), most models fall off similarly.

[Image: A decision tree flowchart showing branching paths and outcomes]
A decision tree diagram illustrates branching logic of the kind that collapses past the Deterministic Horizon in pure neural chain-of-thought systems. Source: commons.wikimedia.org

Teaching Models Their Own Limits

Paper: "Capability Self-Assessment: Teaching LLMs to Know Their Limits"
Authors: Haoyan Yang, Reza Shirkavand, Yukai Jin, Jiawei Zhou, Shangqian Gao, Heng Huang

Anyone deploying LLMs in production will recognise the problem this paper addresses: models systematically overestimate what they can do. They don't just hallucinate facts - they attempt tasks they should decline, producing confident-sounding wrong answers rather than signalling uncertainty or deferring to a better resource.

The authors frame Capability Self-Assessment (CSA) as a learnable trait and test two training approaches head to head. Their key finding: reinforcement learning teaches CSA reliably, while supervised fine-tuning doesn't. SFT on self-assessment examples actually degrades performance on the tasks themselves - likely because the model overfits to a meta-pattern of declining, losing calibration on the underlying problem. RL, by contrast, teaches the model to recognise its own failure modes through policy learning rather than imitation.
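One intuition for why an outcome-based signal can teach declining where imitation cannot is sketched below. This is not the paper's reward design, which is not described here; the reward values (+1 for a correct attempt, -1 for a wrong one, 0 for declining) are assumptions chosen to make the arithmetic visible.

```python
def expected_reward(
    p_success: float,
    attempt: bool,
    wrong_penalty: float = 1.0,
    decline_reward: float = 0.0,
) -> float:
    """Expected reward under the assumed scheme: +1 correct, -penalty wrong."""
    if not attempt:
        return decline_reward
    return p_success * 1.0 - (1.0 - p_success) * wrong_penalty


def best_action(p_success: float, wrong_penalty: float = 1.0) -> str:
    """Attempt only when attempting has strictly higher expected reward."""
    attempt_value = expected_reward(p_success, True, wrong_penalty)
    decline_value = expected_reward(p_success, False, wrong_penalty)
    return "attempt" if attempt_value > decline_value else "decline"


for p in (0.2, 0.5, 0.8):
    print(p, best_action(p))
```

With these assumed values, attempting pays off only when the model's true success probability exceeds 0.5, so maximising reward requires an accurate internal estimate of that probability rather than a memorised pattern of refusals.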

Generalisation and Downstream Uses

Learned CSA transfers out of distribution well. A model trained to self-assess in one domain applies that skill to new domains without retraining, which the authors treat as evidence of a transferable trait rather than surface-level pattern matching.

The practical applications are significant:

Local-cloud routing during inference: flag likely failures and route only those to a larger model
Targeted training data selection: use self-assessment signals to identify where more data would help
Agentic safety: an agent that knows when to defer makes fewer confident mistakes in high-stakes contexts
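The routing use case is the easiest of the three to picture in code. Here is a minimal sketch, assuming the self-assessment is exposed as a score between 0 and 1; the `route` function, the 0.5 threshold, and the stub models are hypothetical and stand in for whatever a real deployment would use.

```python
from typing import Callable, Tuple

Model = Callable[[str], str]


def route(
    task: str,
    self_assess: Callable[[str], float],
    local_model: Model,
    cloud_model: Model,
    threshold: float = 0.5,
) -> Tuple[str, str]:
    """Answer locally when self-assessed success is high enough, else escalate."""
    if self_assess(task) >= threshold:
        return "local", local_model(task)
    return "cloud", cloud_model(task)


# Stand-ins so the sketch runs; real models would replace all three.
def stub_self_assess(task: str) -> float:
    return 0.9 if len(task.split()) <= 5 else 0.2


def stub_local(task: str) -> str:
    return f"[local] {task}"


def stub_cloud(task: str) -> str:
    return f"[cloud] {task}"


for task in ("Add 2 and 3", "Trace this 40-step migration across six services"):
    print(route(task, stub_self_assess, stub_local, stub_cloud))
```

Only the second task is escalated, which is the cost argument in miniature: the larger model is paid for only where the smaller one expects to fail.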

This connects to the broader research on agent reliability and self-correction. The CSA paper is one of the cleaner demonstrations that RL can instil a specific epistemic property - not just performance, but calibrated self-knowledge - and that standard supervised training actively works against it.


[Image: An analytical balance scale on a laboratory bench]
Capability self-assessment is a calibration problem. Like a laboratory balance, the goal is accurate measurement, not just a reading that looks plausible. Source: commons.wikimedia.org

Three Papers, One Blindspot

All three papers point to a gap between what a model appears to do and what it can actually do.

The REP paper shows that hidden reasoning isn't reliably hidden - it can be reconstructed from visible outputs with the right prompting. The Deterministic Horizon shows that extended reasoning isn't really reasoning past a model's d* (19-31 steps in the paper's tests); it's something closer to structured guessing applied to state tracking. The CSA paper shows that models don't natively know where their own limits sit on any given task.

Together, these results suggest that current transparency practices - whether designed to protect IP or to improve safety - assume a degree of opacity and legibility that doesn't hold. The reasoning trace is more accessible than providers intended, more brittle than practitioners assumed, and less self-aware than a reliable agent requires.

Sources: