Balanced Thinking, Broken Judges, Opaque Reasoning
Three new papers expose cracks in how AI models think, how multimodal reasoning gets evaluated, and why LLM judge scores that look good can still mislead.

Three papers landed this week with something in common: each attacks a practice the field has quietly assumed was working. Reasoning models are over- and under-thinking at the same time, and nobody had a clean fix. A new benchmark finds that every competitive multimodal model cherry-picks its reasoning steps. And the LLM-as-judge approach turns out to look good on paper while failing in practice far more often than people realize.
TL;DR
- ReBalance fixes overthinking and underthinking in reasoning models without retraining, using confidence signals as a real-time steering dial
- CRYSTAL reveals that no competitive multimodal model preserves more than 60% of its matched reasoning steps in the correct order
- LLM judges with decent global correlation capture only 21% of the improvement that perfect response selection would achieve - pairwise comparison recovers much of that loss
ReBalance: One Dial, Two Failure Modes Fixed
Paper: "Efficient Reasoning with Balanced Thinking"
Authors: Yulin Li, Tengyao Tu, Li Ding, Junjie Wang, Huiling Zhen, Yixin Chen, Yong Li, Zhuotao Tian
Venue: ICLR 2026
The oversimplified story about reasoning models is that more thinking steps equal better answers. The more accurate story is that these models fail in two opposite directions at once. They over-think easy problems, burning tokens on reasoning chains that add nothing. They under-think hard ones, producing confident but shallow answers before exploring enough paths. The problem is that both failure modes look like reasoning.
ReBalance addresses both with a single mechanism: confidence signals as a real-time dial. The system doesn't require retraining the underlying model. Instead, it monitors the model's confidence as it reasons, treating high variance in confidence as a signal of overthinking (the model keeps second-guessing itself, adding unnecessary steps) and consistent overconfidence as a signal of underthinking (the model locks in too early without enough exploration).
The technical approach involves extracting hidden states from a small calibration dataset to construct "reasoning mode prototypes" - basically learned templates for what overthinking and underthinking look like inside the model's activation space. Those prototypes become steering vectors that get applied dynamically during inference, scaled by how strongly the current confidence pattern resembles each failure mode.
The diagnosis is what sets this apart from earlier budget-forcing methods: overthinking shows up as high confidence variance, underthinking as consistent overconfidence, and steering is applied from that real-time signal rather than from a fixed token budget.
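The shape of the mechanism can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the activations are random stand-ins, and the thresholds mapping confidence statistics to mode scores are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 64  # toy hidden-state dimension

# Stand-in calibration: average hidden states from traces labeled as
# overthinking / underthinking to form "reasoning mode prototypes".
proto_over = rng.normal(1.0, 0.5, (100, HIDDEN)).mean(axis=0)
proto_under = rng.normal(-1.0, 0.5, (100, HIDDEN)).mean(axis=0)

def steering_vector(confidences, alpha=0.5):
    """Score each failure mode from the recent confidence pattern and
    steer away from the matching prototype, scaled by that score."""
    conf = np.asarray(confidences, dtype=float)
    # High confidence variance -> overthinking (invented 0.3 scale).
    over_score = min(conf.std() / 0.3, 1.0)
    # Low-variance overconfidence -> underthinking (invented 0.8 cutoff).
    under_score = min(max(conf.mean() - 0.8, 0.0) / 0.2, 1.0) * (1 - over_score)
    return -alpha * (over_score * proto_over + under_score * proto_under)

v = steering_vector([0.9, 0.3, 0.8, 0.2, 0.7])       # oscillating confidence
w = steering_vector([0.97, 0.96, 0.98, 0.97, 0.96])  # flat overconfidence
```

In the real system these vectors would be added to hidden states during decoding; here they only demonstrate how a single dial can push against two opposite failure modes.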
The team tested across four model sizes (0.5B to 32B parameters) and nine benchmarks spanning math reasoning, general QA, and coding. Across the board, output length dropped while accuracy held or improved - the two goals that usually trade off against each other.
The training-free nature matters a lot here. Models at 32B scale are expensive to retrain, and any approach that requires it faces a steep adoption barrier. ReBalance slots in at inference time, which means it can be applied to existing launched models without touching weights.
What the paper doesn't fully answer is how sensitive the calibration dataset is. The prototypes depend on having a representative sample of overthinking and underthinking examples - if that set is too narrow, the steering vectors may not generalize. That's worth watching as practitioners try this on domain-specific models.
Code is available under a Creative Commons BY-NC-ND 4.0 license.
The ReBalance framework applies dynamic steering vectors based on real-time confidence signals, scaling intervention strength based on how strongly the current pattern resembles overthinking or underthinking.
Source: arxiv.org
CRYSTAL: Multimodal Models That Can't Follow Their Own Reasoning
Paper: "Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation"
Authors: Wayner Barrios, SouYoung Jin
arXiv: 2603.13099
The standard way to assess multimodal models is to check whether the final answer is correct. CRYSTAL argues that's not enough - and backs the argument with 6,372 benchmark instances designed to assess whether intermediate reasoning steps are accurate, complete, and ordered correctly.
The methodology for building reference solutions is careful. Four independent multimodal large language models create reasoning trajectories for each instance. Those trajectories are aggregated through semantic clustering, then human reviewers validate the output. The result is a reference set that doesn't depend on any single model's idiosyncrasies.
Two metrics do the actual evaluation. Match F1 measures step-level precision and recall via semantic similarity - it answers whether the model produced the right reasoning content. Ordered Match F1 adds a constraint: the steps must appear in logical sequence. A model that produces all the right reasoning steps in scrambled order fails this second metric even if it passes the first.
| Metric | What it measures |
|---|---|
| Match F1 | Whether correct reasoning steps are present (precision + recall) |
| Ordered Match F1 | Whether correct steps appear in logical sequence |
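As a rough illustration of how the two metrics come apart, here is a toy implementation. The similarity function is a token-overlap stand-in for the benchmark's semantic matching, and the 0.5 threshold is invented; only the precision/recall/ordering logic mirrors the description above.

```python
def jaccard(a, b):
    """Token-overlap stand-in for the benchmark's semantic similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def match_f1(pred_steps, ref_steps, threshold=0.5):
    """Greedily match each predicted step to its most similar unused
    reference step above the threshold; report F1 plus the match list."""
    used, matches = set(), []
    for i, p in enumerate(pred_steps):
        best_j, best_sim = None, threshold
        for j, r in enumerate(ref_steps):
            if j not in used and jaccard(p, r) >= best_sim:
                best_j, best_sim = j, jaccard(p, r)
        if best_j is not None:
            used.add(best_j)
            matches.append((i, best_j))
    prec = len(matches) / len(pred_steps) if pred_steps else 0.0
    rec = len(matches) / len(ref_steps) if ref_steps else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return f1, matches

def ordered_match_f1(pred_steps, ref_steps, threshold=0.5):
    """Credit only the longest run of matches whose reference indices
    appear in order (a longest-increasing-subsequence constraint)."""
    _, matches = match_f1(pred_steps, ref_steps, threshold)
    ref_idx = [j for _, j in matches]  # already sorted by predicted position
    best = [1] * len(ref_idx)
    for i in range(len(ref_idx)):
        for k in range(i):
            if ref_idx[k] < ref_idx[i]:
                best[i] = max(best[i], best[k] + 1)
    kept = max(best, default=0)
    prec = kept / len(pred_steps) if pred_steps else 0.0
    rec = kept / len(ref_steps) if ref_steps else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

ref = ["identify the objects in the image",
       "count the red objects",
       "compare the counts to answer"]
scrambled = [ref[2], ref[0], ref[1]]  # right steps, wrong order

f1, _ = match_f1(scrambled, ref)        # every step matches: F1 = 1.0
of1 = ordered_match_f1(scrambled, ref)  # ordering penalty kicks in
```

The scrambled trajectory passes Match F1 perfectly but loses a third of its credit under the ordered variant, which is exactly the gap the benchmark is probing.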
The results across 20 multimodal systems are uncomfortable. Every model cherry-picks - precision reliably exceeds recall, meaning models tend to produce a few solid reasoning steps while omitting others. Scaling trade-offs were non-monotonic: larger models didn't always outperform smaller ones on reasoning transparency. And the ordering finding is stark: no competitive model preserved more than 60% of matched reasoning steps in the correct logical order.
That last point deserves attention. A model that generates good reasoning steps but sequences them arbitrarily isn't reasoning transparently - it's producing plausible-looking process text while the actual computation happens elsewhere. This matters for AI safety and interpretability because much of the case for trusting a model's output rests on being able to audit its reasoning chain.
CRYSTAL evaluates 20 multimodal systems on reasoning step completeness and ordering - the results show universal cherry-picking behavior.
Source: arxiv.org
Barrios and Jin don't just describe the problem. They propose Causal Process Reward (CPR), a training objective that couples answer correctness with step-level alignment multiplicatively rather than additively. Additive reward strategies, where the model gets partial credit for either correct answers or correct reasoning steps, apparently let models game the training signal by focusing on whichever is easier. The multiplicative coupling forces both to be satisfied simultaneously.
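The difference between the two reward shapes fits in a few lines. This is a sketch of the idea as described, not the paper's training code; the 0.5 weight in the additive version is an invented example.

```python
def additive_reward(answer_correct, step_alignment, w=0.5):
    """Partial credit for either term - a model can collect w just by
    getting the answer right while ignoring its reasoning steps."""
    return w * float(answer_correct) + (1 - w) * step_alignment

def multiplicative_reward(answer_correct, step_alignment):
    """CPR-style coupling: reward is nonzero only when the answer is
    correct AND the steps align with the reference."""
    return float(answer_correct) * step_alignment

# Correct answer reached through garbage reasoning:
shortcut_add = additive_reward(True, 0.0)        # still earns 0.5 - gameable
shortcut_mul = multiplicative_reward(True, 0.0)  # earns 0.0 - no shortcut
```

Under the additive shape, a model that finds answer-only shortcuts keeps collecting reward; under the multiplicative shape, that gradient disappears.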
With a curriculum variant (CPR-Curriculum) that progressively increases reasoning difficulty during training, the team hit a 32% Match F1 improvement via GRPO (Group Relative Policy Optimization) in cases where additive reward strategies failed completely.
The benchmark is CC BY 4.0, meaning it's freely usable for further research. Given how much of the multimodal benchmarking space focuses on final answer accuracy, CRYSTAL should get significant adoption once people realize what they've been missing.
The LLM Judge Trap
Paper: "When LLM Judge Scores Look Good but Best-of-N Decisions Fail"
Author: Eddie Landesberg
arXiv: 2603.12520
The LLM-as-judge approach has become standard in model evaluation and RLHF pipelines. The basic premise is that you train a model to score or rank responses, then use those scores to pick the best output or provide a training signal. The scores are validated against human preferences, global correlation looks acceptable, and everyone moves on.
Landesberg's paper asks a sharper question: if you use an LLM judge to pick the best response from N candidates, how much of the possible improvement do you actually capture?
The answer is sobering. A judge with moderate global correlation (r = 0.47) captures only 21.0% of the improvement that perfect selection would achieve. In a 5,000-prompt benchmark, that leaves nearly four-fifths of the achievable gain on the table.
The diagnosis is precise. Global correlation is dominated by prompt-level variation - the judge reliably scores hard prompts lower than easy ones, which inflates the aggregate correlation metric. But within a prompt, where you're comparing five candidate responses to the same question, the discriminative signal is far weaker (r_within = 0.27). Coarse scoring scales make it worse: when a judge uses a 1-10 scale, it creates ties in 67% of pairwise comparisons, making selection essentially random in those cases.
Global correlation metrics mask poor within-prompt ranking ability. The judge scores hard prompts lower than easy ones, but that's not the skill needed for selecting the best of five responses to one question.
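A toy simulation makes the mechanism concrete. All numbers below are invented, not the paper's data; the point is only that a judge can track prompt difficulty well (high global correlation) while discriminating poorly within a prompt, and that integer rounding manufactures ties.

```python
import numpy as np

rng = np.random.default_rng(1)
P, N = 2000, 5  # prompts, candidate responses per prompt

difficulty = rng.normal(0.0, 1.5, (P, 1))            # prompt-level spread
quality = difficulty + rng.normal(0.0, 1.0, (P, N))  # true response quality
# Toy judge: follows difficulty closely, but its within-prompt signal
# is attenuated (x0.3) and noisy (both factors invented).
raw = difficulty + 0.3 * (quality - difficulty) + rng.normal(0.0, 0.6, (P, N))
score = np.clip(np.round(raw + 5.5), 1, 10)          # coarse 1-10 scale

# Global correlation pools all prompts, so difficulty dominates it.
global_r = np.corrcoef(quality.ravel(), score.ravel())[0, 1]

# Within-prompt correlation is what best-of-N selection actually needs.
within = [np.corrcoef(quality[p], score[p])[0, 1]
          for p in range(P) if score[p].std() > 0]
within_r = float(np.mean(within))

# Tie rate among the N*(N-1)/2 within-prompt pairs.
tied = sum(score[p, i] == score[p, j]
           for p in range(P) for i in range(N) for j in range(i + 1, N))
tie_rate = tied / (P * N * (N - 1) // 2)
```

The same judge produces a flattering global correlation and a much weaker within-prompt one, with a large fraction of pairwise comparisons tied outright - the pattern Landesberg diagnoses.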
The fix Landesberg tests is explicit pairwise comparison - instead of scoring each response independently and picking the highest, the judge directly compares candidates against each other. This sidesteps the coarse-scoring tie problem and focuses the judge on within-prompt discrimination. Recovery jumped from 21.0% to 61.2% in matched-pair evaluations.
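A small simulation shows why pairwise comparison helps when coarse scores tie. The qualities, the 1-10 rounding, and the 90% pairwise accuracy are all invented for illustration, not taken from the paper.

```python
import random
random.seed(0)

quality = {"A": 0.81, "B": 0.79, "C": 0.35, "D": 0.33, "E": 0.30}
cands = list(quality)

def coarse_score(c):
    return round(10 * quality[c])  # A and B both round to 8: a tie

def prefers(a, b):
    """Noisy pairwise judge: picks the truly better response 90% of the time."""
    right = quality[a] > quality[b]
    return right if random.random() < 0.9 else not right

def pick_by_score():
    """Independent scoring: argmax with random tie-breaking."""
    best = max(coarse_score(c) for c in cands)
    return random.choice([c for c in cands if coarse_score(c) == best])

def pick_pairwise():
    """Round-robin tournament: count pairwise wins, take the leader."""
    wins = {c: 0 for c in cands}
    for i, a in enumerate(cands):
        for b in cands[i + 1:]:
            wins[a if prefers(a, b) else b] += 1
    return max(cands, key=lambda c: wins[c])  # ties go to earlier list order

T = 2000
score_hits = sum(pick_by_score() == "A" for _ in range(T)) / T
pair_hits = sum(pick_pairwise() == "A" for _ in range(T)) / T
```

Score-based selection degenerates to a coin flip between the tied top two, while even a noisy pairwise judge separates them most of the time.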
For teams using LLM judges in best-of-N sampling or RLHF reward modeling, this has immediate practical effects. Reporting global correlation as the primary validation metric actively obscures whether the judge is doing its job. Within-prompt signal strength, tie rates, and top-1 recovery are better measures of what actually matters.
This dovetails with longer-standing concerns about AI benchmark validity and whether the metrics being optimized actually correspond to the capabilities being claimed. A judge that looks well-calibrated by standard metrics while capturing less than a quarter of available signal is a reliability problem disguised as a success story.
The paper is a single-author preprint, so it hasn't cleared peer review, but the methodology is transparent and the finding is direct enough to check independently. The recommendation to shift toward pairwise comparison for selection tasks is worth testing even before the paper gets published.
Tie structure distribution from Landesberg's analysis: coarse scoring scales create ties in 67% of pairwise comparisons, collapsing selection ability.
Source: arxiv.org
All three papers land in a similar place: the metrics we use to declare things working are optimized for the wrong thing. Confidence-based steering works as a training-free fix because token length alone doesn't tell you which failure mode you're in. CRYSTAL exposes reasoning opacity because final answer accuracy doesn't tell you whether the model's process is correct. And pairwise judging catches what global correlation misses because aggregate correlation doesn't tell you how well a judge ranks within a single prompt.
The common thread is evaluation. Getting evaluation right is harder than it looks, and the costs of getting it wrong are invisible until someone runs the right experiment.