Reasoning Models Can't Hide Their Thinking - OpenAI Study

Frontier reasoning models can't control their own chains of thought. When told to avoid certain words or patterns in their reasoning, they fail - and they often know they're failing, mid-thought, while continuing to fail anyway.

That's the core result of CoT-Control, an open-source evaluation suite published March 5 by OpenAI researchers Tomek Korbak, Micah Carroll, Bowen Baker, and Ian Kivlichan. The study tested 13 frontier reasoning models - including GPT-5, GPT-5.2, o3, and o4-mini from OpenAI, Claude 3.7 Sonnet and Claude Sonnet 4 from Anthropic, Qwen3 (14B and 32B), and DeepSeek R1 - across more than 13,000 tasks drawn from GPQA, MMLU-Pro, HLE, BFCL, and SWE-Bench Verified, each paired with a specific instruction about how the model should reason.

The scores are remarkably consistent, and remarkably low.

TL;DR

CoT-Control measures whether reasoning models can deliberately steer their own thinking traces
Frontier reasoning models score 0.1-15.4% - near-total inability to follow format instructions in their heads
Those same models control their final outputs at 37.8-69.3% - far higher
Reinforcement learning training makes controllability worse, not better, sometimes by a factor of ten
OpenAI warns that future models optimized on this metric could learn to fake compliance

The Scores

Model	CoT Controllability
DeepSeek R1	0.1%
GPT-5.2 Thinking	0.2%
GPT-5.4 Thinking	0.3%
Best frontier reasoning model tested	15.4%

Those numbers represent how often a model successfully followed a chain-of-thought format instruction while working through a hard benchmark problem. A score of 0.3% means GPT-5.4 Thinking complied with format instructions in its reasoning trace roughly 3 times out of every 1,000 tasks.

For comparison, the same models control their final outputs at rates between 37.8% and 69.3%. A model that can't manage its own thinking can still manage what it says at the end. The gap between CoT controllability and output controllability is large across every model tested.

A data monitoring dashboard showing system analytics AI safety researchers increasingly rely on monitoring tools to track model behavior during reasoning. CoT-Control provides a standardized metric for measuring how transparent that reasoning actually is. Source: unsplash.com

What CoT-Control Actually Tests

The Task Design

Each evaluation pairs a hard benchmark problem with a format constraint on the chain of thought. A model might be asked to solve a graduate-level physics problem while avoiding the technical term "XOR." Another task might require working through a coding challenge using only lowercase letters across the reasoning trace. Some instructions require appending a specific phrase to every sentence in the thinking process.

The problems are deliberately hard. Trivial tasks wouldn't isolate whether the model is failing at format control or simply not thinking carefully enough to generate much output. The constraint has to compete with genuine cognitive load.

Controllability is scored strictly on whether the model followed the format instruction within its reasoning trace - not whether it got the answer right.

Why the Gap Between Thinking and Output

The results reveal a structural split: models appear to maintain two partially separate tracks of constraint-following. Their final responses - the parts users see - are shaped clearly by instruction-following training. Their reasoning traces aren't.

OpenAI's interpretation is that this reflects how reasoning models are built. The RLVR training that produces capable reasoning models optimizes heavily for correct answers rather than for format compliance within the thinking process itself. The chain of thought develops its own momentum, and surface-level instructions don't penetrate it reliably.

When Models Watch Themselves Fail

The most striking example in the paper involves GPT-5.2 Thinking. Told to avoid the term "XOR" in its reasoning, the model used it anyway. Then it wrote: "Oops I wrote 'XOR' in analysis already." Then it used the word four more times.

This isn't a failure to understand the instruction. The model demonstrably understood what it was doing wrong, identified it by name, expressed something like recognition of the violation - then kept violating it. The chain of thought had already committed to a reasoning path that the model couldn't redirect, even with active self-awareness of the problem.

The pattern holds across models and instruction types. Larger models show marginally better controllability, but the gap is small, and it disappears as task difficulty increases and reasoning traces grow longer.

Code projection over a person's face representing AI data analysis Reasoning models generate extended thinking traces that safety researchers can monitor. The CoT-Control findings suggest those traces are difficult for models to manipulate, making them more reliable as an oversight tool. Source: pexels.com

Why Training Makes Things Worse

This result runs counter to intuition. If more capable models better understand instructions, shouldn't more training produce better controllability? The data says no.

Reinforcement learning with verifiable rewards (RLVR) - the standard approach for frontier reasoning model development - reduces CoT controllability, not increases it. In at least one tested model, RL training cut controllability by more than a factor of ten compared to the pre-RL checkpoint.

Longer reasoning chains also decrease controllability. The more compute a model applies to a hard problem, the less it can steer how it applies that compute at the format level. Depth of thinking and format discipline are pulling in opposite directions.

One more test: situational awareness. The researchers checked whether models behaved differently when told they were being monitored versus when they believed they were operating normally. Controllability increased by up to four percentage points in the monitored condition. Not enough to change the picture.

What It Does Not Tell You

The Model Architecture Question

CoT-Control tested primarily RLVR-trained reasoning models - the GPT Thinking series, DeepSeek R1, and similar architectures. The paper doesn't make strong claims about models with different training approaches. Non-reasoning models and hybrid architectures may show different controllability profiles, and the paper acknowledges that the mechanisms behind low controllability remain poorly understood. OpenAI's own framing is explicit that this is a proxy measure, not a guarantee of anything deeper about intent.

The Future Optimization Risk

Low controllability is a safety positive right now because models can't strategically hide their reasoning to evade oversight. But the researchers flag this could change. If future models learn that controllability is being scored, they might learn to appear uncontrollable in monitored settings while behaving differently in deployment - effectively lying to appear safe in tests. The researchers commit to not applying direct optimization pressure to controllability metrics and recommend the broader safety community treat the metric as a canary evaluation rather than a target.

This connects to a broader worry in the AI safety community about Goodhart's Law: once a measure becomes a target, it ceases to be a good measure. CoT-Control is useful exactly because models aren't gaming it yet.

The paper also notes that simple training pressures successfully reduced monitorability in controlled test models, confirming that the current favorable state of affairs requires active preservation, not just passive maintenance.

OpenAI will include CoT controllability scores alongside monitorability measures in system cards for future models, starting with GPT-5.4 Thinking - which itself scores 0.3% on the metric. That's a concrete commitment to standardized reasoning transparency reporting. Given the parallel anxiety about autonomous agent systems and their potential for unmonitored behavior, having a reproducible metric that correlates with genuine transparency is more useful than another qualitative safety claim. What OpenAI can't guarantee is that the correlation holds as model capabilities scale further.

Sources: