Better Planning, Faster Benchmarks, CFO Reality Check
Three new arXiv papers show how to build more reliable planning agents, how to cut benchmark costs by up to 70%, and why LLMs fail at long-horizon financial decision-making.

Today's arXiv batch covers a useful spread: a team from China shows why you should stop asking LLMs to plan and start making them extract instead, a solo researcher cuts benchmark evaluation costs by up to 70% without losing ranking accuracy, and a third group drops a sobering number - only 16% of LLM agent runs survive a simulated 132-month CFO role.
TL;DR
- DUPLEX - Restricting LLMs to semantic extraction rather than planning beats end-to-end approaches across 12 domains
- Cut benchmark evaluation costs by 44-70% by filtering to intermediate-difficulty tasks - rankings stay stable with a fraction of the runs
- EnterpriseArena - Only 16% of LLM agent runs survive a 132-month financial simulation, and larger models don't consistently do better
DUPLEX: Stop Asking LLMs to Plan
For years the standard approach to LLM-based task planning has been to throw a capable model at the full problem: give it a goal, let it reason, and hope the output is executable. DUPLEX, from Keru Hua, Ding Wang, Yaoying Gu, and Xiaoguang Ma, argues this is the wrong framing.
"The key is not to make the LLM plan better, but to restrict the LLM to the part it is good at - structured semantic grounding - and leave logical plan synthesis to a symbolic planner."
The system uses PDDL - Planning Domain Definition Language, a decades-old standardized format for expressing automated planning problems - as the bridge between natural language and formal logic. DUPLEX splits the work into two modes. A lightweight "fast" LLM reads natural language goals and translates them into PDDL problem files using schema-guided extraction. It doesn't plan at all. A classical symbolic solver handles the actual reasoning.
The "slow" mode activates only when the symbolic solver fails. At that point, a heavier model reviews solver diagnostics and repairs the PDDL description, then hands it back. The loop continues until the planner finds a valid sequence of actions.
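The fast/slow loop can be sketched in a few lines. This is a hypothetical sketch, not the paper's code: the three helpers are stand-ins for the lightweight extractor, the classical planner, and the heavy repair model, with a deliberately flawed extraction so the repair path gets exercised.

```python
def fast_llm_extract(goal_text: str) -> str:
    # Stand-in for schema-guided extraction by the lightweight model.
    # We deliberately emit a flawed problem file (no (:init ...) section)
    # to exercise the repair path below.
    return f"(:goal {goal_text})"

def symbolic_solve(problem_pddl: str):
    # Stand-in for a classical planner: succeeds only if the problem
    # file has an (:init ...) section, else returns diagnostics.
    if "(:init" in problem_pddl:
        return ["pick", "move", "place"], None
    return None, "parse error: missing (:init ...) section"

def slow_llm_repair(problem_pddl: str, diagnostics: str) -> str:
    # Stand-in for the heavy model reading solver diagnostics and
    # patching the PDDL description.
    if "missing (:init" in diagnostics:
        return "(:init (empty-hand)) " + problem_pddl
    return problem_pddl

def duplex_plan(goal_text: str, max_repairs: int = 3):
    problem = fast_llm_extract(goal_text)  # fast mode: extract, never plan
    for _ in range(max_repairs):
        plan, diagnostics = symbolic_solve(problem)  # symbolic reasoning
        if plan is not None:
            return plan  # every returned plan is solver-verified
        problem = slow_llm_repair(problem, diagnostics)  # slow mode: repair
    return None

print(duplex_plan("(holding block-a)"))  # ['pick', 'move', 'place']
```

The point of the structure is visible even in the toy version: the LLM output is never trusted directly, so an extraction error surfaces as a solver failure instead of a bad plan.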
Warehouse automation tasks - like sorting and rearrangement - are a standard testbed for long-horizon planning research.
Source: commons.wikimedia.org
Tested across 12 planning domains spanning classical puzzles and household robotics tasks, DUPLEX outperforms both pure LLM baselines and prior hybrid approaches in success rate and reliability. The result holds because the LLM's errors get caught by the solver rather than silently propagating into the final plan.
What Practitioners Should Take From This
This isn't just a robotics paper. Any multi-step agentic workflow where correctness matters - booking systems, logistics, code generation pipelines - can apply the same principle. Identify what the LLM is actually good at (parsing natural language into structured schemas), offload the combinatorial reasoning to something that can verify its own output, and reserve the expensive model call for error recovery only.
The paper was submitted March 25, 2026 (arXiv:2603.23909).
Efficient Benchmarking: You Don't Need All the Tasks
Agent evaluation is expensive. Running 33 scaffolds across 70 model configurations on a full benchmark suite can take days of compute. Franck Ndzomga's paper asks a simple question: does it have to?
The answer is no. By filtering benchmark tasks to those with historical pass rates between 30% and 70% - neither trivially easy nor consistently impossible - you can reduce the evaluation workload by 44-70% while preserving ranking stability across agents. This filtering approach comes from Item Response Theory (IRT), a psychometric framework originally designed to select exam questions that best discriminate between test-takers of different ability levels.
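The filter itself is simple enough to sketch. Field names here are assumptions; the only input you actually need is each task's historical pass rate across prior evaluation runs.

```python
def filter_discriminating(tasks, low=0.30, high=0.70):
    """Keep tasks that are neither trivially easy nor consistently
    impossible, following the 30-70% pass-rate window."""
    return [t for t in tasks if low <= t["pass_rate"] <= high]

tasks = [
    {"id": "t1", "pass_rate": 0.95},  # too easy: almost everyone solves it
    {"id": "t2", "pass_rate": 0.50},  # discriminating
    {"id": "t3", "pass_rate": 0.05},  # near-impossible: no signal either
    {"id": "t4", "pass_rate": 0.35},  # discriminating
]

kept = filter_discriminating(tasks)
print([t["id"] for t in kept])  # ['t2', 't4'] - a 50% reduction here
```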
The key finding is that rank-order stability is more durable than it looks. Even when absolute scores vary between different scaffolding frameworks or time periods, the relative ordering of agents stays consistent. That means leaderboards don't need full evaluation to be trustworthy. They need discriminating evaluation.
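One way to sanity-check that claim on your own evaluations - an assumption about the protocol, not the paper's exact method - is to compare the agent ordering from the full suite against the filtered subset:

```python
def ranking(scores: dict) -> list:
    # Agents ordered best-first by score.
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative numbers: absolute scores drop on the harder filtered
# subset, but the leaderboard ordering is what has to stay stable.
full_scores     = {"agent_a": 0.71, "agent_b": 0.64, "agent_c": 0.52}
filtered_scores = {"agent_a": 0.58, "agent_b": 0.49, "agent_c": 0.33}

print(ranking(full_scores) == ranking(filtered_scores))  # True
```

If the orderings diverge, the filtered subset is dropping tasks that discriminate between your particular agents, and the window may need widening.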
The Numbers Behind It
| Metric | Result |
|---|---|
| Benchmarks studied | 8 |
| Agent scaffolds covered | 33 |
| Model configurations | 70+ |
| Task reduction range | 44-70% |
| Pass-rate filter window | 30-70% |
The approach outperforms random sampling (which shows high variance) and greedy task selection (which breaks down under distribution shift). It's also optimization-free: no training required, just historical pass-rate data from prior evaluations.
For teams running their own agent evaluations, this is directly actionable. If you're spending compute on a benchmark where your agent scores either 0% or 100% on most tasks, those tasks aren't telling you anything. We've covered problems with agent benchmark scores before, but this is one of the first papers to quantify exactly how much you can cut without losing signal. An earlier paper on dynamic agent benchmarks touched on related issues around static evaluation - Ndzomga's work complements that line of thinking with a practical efficiency protocol.
The paper runs 22 pages and is released under a CC BY 4.0 license (arXiv:2603.23749).
EnterpriseArena: LLMs Fail as CFOs
The last paper is the bleakest of the three. Yi Han, Sophia Ananiadou, and 11 co-authors built EnterpriseArena - described as the first benchmark for assessing agents on long-horizon enterprise resource allocation - and ran 11 advanced LLMs through a simulated 132-month business environment.
Only 16% of experimental runs completed the full timeline.
The simulation combines financial data, business documents, economic indicators, and expert-confirmed decision rules. Agents access information through a "budgeted" set of organizational tools: they have to pay (in simulated resources) to query data, which forces a trade-off between information gathering and conserving the budget they're supposed to be managing. This creates partial observability - the same condition that makes real executive decisions difficult.
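The budgeted-tool mechanic is easy to picture in code. Tool names and costs below are invented for illustration; the structural point is that every query spends the same simulated resources the agent is supposed to be managing.

```python
class BudgetedTools:
    """Toy version of budgeted organizational tool access."""

    def __init__(self, budget: float):
        self.budget = budget
        self.costs = {"financials": 5.0, "market_report": 8.0}  # hypothetical

    def query(self, tool: str):
        cost = self.costs[tool]
        if cost > self.budget:
            return None  # partial observability: can't afford to look
        self.budget -= cost  # information gathering eats the budget
        return f"{tool}-data"

tools = BudgetedTools(budget=10.0)
print(tools.query("financials"))     # 'financials-data'; budget now 5.0
print(tools.query("market_report"))  # None: costs 8.0, only 5.0 remains
```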
Larger models didn't consistently outperform smaller ones. Scale doesn't buy you CFO-level judgment.
The researchers characterize long-horizon resource allocation under uncertainty as "a distinct capability gap for current LLM agents." That framing matters. It isn't a benchmark the models nearly pass. They fail at the structural challenge - maintaining coherent strategy across a multi-year horizon while accounting for changing conditions and resource constraints.
Why This Matters Beyond Finance
Earlier research on enterprise agent stalling showed that agent success rates in realistic work environments top out around 37%. The CFO benchmark suggests the ceiling is even lower when the time horizon extends and the environment changes over months rather than days.
The failure mode isn't hallucination or poor recall. It's the inability to trade off short-term costs against long-term goals consistently. Models that optimize locally - spending budget to gather information when it's locally rational - deplete resources before the simulation ends. Models that conserve budget miss decisions that compound over time.
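A toy illustration of that trade-off, not from the paper: with diminishing returns on information spending, a "local" policy that always buys depletes the budget mid-horizon, a frugal policy forgoes compounding value, and a paced policy beats both.

```python
def run(policy, months: int = 12, budget: float = 100.0):
    value = 0.0
    for m in range(months):
        spend = policy(m, budget)
        if spend > budget + 1e-9:
            break  # resources depleted before the horizon ends
        budget -= spend
        value += 2.0 * spend ** 0.5  # diminishing returns on information
    return value, budget

local  = lambda m, b: 25.0          # always buys: budget gone in 4 months
frugal = lambda m, b: 0.0           # never buys: zero value accrued
paced  = lambda m, b: 100.0 / 12    # spreads the budget over the horizon

print(run(local)[0], run(frugal)[0], round(run(paced)[0], 1))
```

The concave returns are the assumption doing the work here, but it is the standard one for information value, and it reproduces both failure modes the paper describes.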
This is worth keeping in mind for anyone planning to deploy agents in roles that require sustained strategic judgment. The paper was submitted in March 2026 (arXiv:2603.23638).
The Common Thread
All three papers point toward the same underlying issue: current LLMs struggle with tasks that require structured, multi-step reasoning over long horizons. DUPLEX's solution is architectural - remove the LLM from the part it's bad at. The benchmarking paper helps teams measure the gap more efficiently. EnterpriseArena makes the gap impossible to ignore.
None of the three papers offer a fix for the CFO problem. That may be the most honest contribution of the batch.
Sources:
- DUPLEX: Agentic Dual-System Planning via LLM-Driven Information Extraction (arXiv:2603.23909)
- Efficient Benchmarking of AI Agents (arXiv:2603.23749)
- Can LLM Agents Be CFOs? A Benchmark for Resource Allocation in Dynamic Enterprise Environments (arXiv:2603.23638)
- Amazon warehouse robot image - Wikimedia Commons
