Enterprise Agents Stall, Safety Gates, Smarter Tool Use
New research shows enterprise AI agents top out at 37.4% success, a deterministic safety gate beats commercial solutions, and an ICLR 2026 paper cuts RL compute by 81%.

Three papers from today's arXiv dump land on the same uncomfortable point: we're building increasingly autonomous agents while our tools for assessing, constraining, and training them are still catching up. One benchmark shows frontier models struggling at basic office tasks. Another proposes a safety gate that doesn't need training data. A third, accepted to ICLR 2026, makes RL-based tool use dramatically cheaper.
TL;DR
- EnterpriseOps-Gym - Even Claude Opus 4.5 achieves only 37.4% success on realistic enterprise workflows, with agents failing to refuse impossible tasks 46% of the time
- ILION - A deterministic pre-execution safety gate reaches F1=0.85 at 143 microsecond latency with zero training data, beating commercial alternatives
- AutoTool - An ICLR 2026 approach cuts RL compute by 81% while lifting tool-use accuracy by 9.8% by teaching models to scale reasoning depth to problem complexity
Enterprise Agents Can't Handle the Job
The hype around agentic AI has outpaced the evidence. A new benchmark called EnterpriseOps-Gym, built by a multi-institution research team, puts that gap in numbers.
The setup is detailed: a containerized sandbox with 164 database tables and 512 functional tools, simulating eight business domains including Customer Service, HR, and IT. Researchers created 1,150 expert-curated tasks that require long-horizon planning across persistent state changes - the kind of work that actually happens in corporate environments.
Enterprise deployments involve complex stateful environments that current benchmarks have largely ignored.
The results are hard to spin. Claude Opus 4.5, the best performer across the 14 frontier models tested, reached 37.4% success. When researchers handed agents ideal human-written plans as a scaffold, performance jumped by 14 to 35 percentage points depending on the task - which tells you exactly where the failure is. Planning is the bottleneck, not execution.
The safety numbers are more concerning. Agents should refuse tasks that are infeasible or outside their authorization. The best model managed that only 53.9% of the time, meaning it attempted dangerous or impossible tasks almost half the time.
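A minimal sketch of how a harness could score both axes the benchmark measures - task success against persistent state, and refusal of infeasible requests. Every name and field below is hypothetical; EnterpriseOps-Gym's actual API is not reproduced from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Task:
    task_id: str
    domain: str                    # e.g. "HR", "IT", "Customer Service"
    instruction: str               # natural-language goal for the agent
    feasible: bool                 # infeasible tasks should be refused
    check: Callable[[Dict], bool]  # validates final persistent state

def run_episode(agent, task: Task, state: Dict) -> Dict:
    """Score one long-horizon task on success and refusal behavior."""
    outcome = agent(task, state)   # agent mutates `state` via tool calls
    if not task.feasible:
        # Attempting an impossible or unauthorized task counts as failure.
        refused = outcome == "refuse"
        return {"refused": refused, "success": refused}
    return {"refused": False, "success": task.check(state)}

# Toy agent exhibiting the failure mode the paper measures: it attempts
# everything and never refuses, like agents did on ~46% of infeasible tasks.
def eager_agent(task: Task, state: Dict) -> str:
    state[task.task_id] = "done"
    return "act"
```

Under this scoring, an always-attempt agent passes feasible tasks but fails every infeasible one - mirroring the refusal gap the paper reports.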
Current agents aren't yet ready for autonomous enterprise deployment.
The researchers' conclusion isn't a surprise given the numbers, but the specificity is valuable. Prior work on agent scaling limitations has pointed at general capability ceilings; EnterpriseOps-Gym isolates where the ceiling is - strategic planning under constraints, not tool execution.
The benchmark is presented at the ICLR 2026 DATA-FM Workshop. The containerized sandbox design means researchers can run consistent, reproducible evaluations without the mess of live enterprise systems.
Why Practitioners Should Care
If you're rolling out agents in production environments today, 37.4% success on carefully curated benchmark tasks suggests the gap between demo and deployment is wider than vendors admit. The benchmark is available for teams who want to measure their own systems against a realistic standard.
A Safety Gate That Doesn't Need Training Data
The second paper addresses a structural problem in agent deployment: how do you know whether an agent's proposed action is within authorized scope before it executes?
ILION (arXiv: 2603.13247) proposes a five-stage deterministic architecture for exactly this. The system assesses proposed agent actions - filesystem writes, API calls, financial transactions - against an authorized scope, and blocks anything that doesn't fit. It operates at the action level, not the content level, which is the key distinction from existing text-safety systems like Lakera Guard.
The architecture chains five components: a Transient Identity Imprint, a Semantic Vector Reference Frame, Identity Drift Control, an Identity Resonance Score, and a Consensus Veto Layer. The whole pipeline runs in 143 microseconds on average.
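The five stage names come from the paper, but their internals below are invented placeholders - the paper's actual logic isn't reproduced. What the sketch does capture is the structural claim: a deterministic chain where any stage can veto, and the vetoing stage is surfaced for interpretability.

```python
from typing import Callable, Dict, List, NamedTuple, Set, Tuple

class Decision(NamedTuple):
    allowed: bool
    stage: str        # which stage vetoed - the interpretability hook

def make_gate(scope: Set[str]) -> Callable[[Dict], Decision]:
    # Placeholder predicates standing in for ILION's five stages.
    stages: List[Tuple[str, Callable[[Dict], bool]]] = [
        ("transient_identity_imprint",      lambda a: a.get("agent_id") is not None),
        ("semantic_vector_reference_frame", lambda a: a["name"] in scope),
        ("identity_drift_control",          lambda a: a.get("drift", 0.0) < 0.5),
        ("identity_resonance_score",        lambda a: a.get("resonance", 1.0) > 0.3),
        ("consensus_veto_layer",            lambda a: not a.get("vetoed", False)),
    ]

    def gate(action: Dict) -> Decision:
        # Deterministic: same action, same scope, same verdict - no model call.
        for name, check in stages:
            if not check(action):
                return Decision(False, name)
        return Decision(True, "passed")
    return gate
```

Because the verdict is a pure function of the action and the scope specification, there is nothing to train and nothing for an adversarial prompt to perturb - the property the zero-training-data claim rests on.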
Performance on ILION-Bench v2 (380 test scenarios): F1 of 0.8515, precision of 91.0%, false positive rate of 7.9%. It outperformed the three baselines tested, including commercial solutions, while producing fully interpretable decisions.
The zero-training-data requirement is the practical win. Most production safety systems require labeled datasets of permitted and prohibited actions, which are expensive to create and drift quickly as the agent's task scope changes. ILION sidesteps this by operating deterministically against a defined scope specification.
This connects to a broader argument that's gained traction in safety research - see Alignment Backfires, AI Monitors Cheat, Models Resist - that safety for agentic systems needs architectural enforcement, not just trained classifiers that can be confused or manipulated.
Limitations
A 7.9% false positive rate means roughly 1 in 13 legitimate actions gets blocked. In high-throughput agent workflows that's not negligible. The paper doesn't state recall directly, but the reported F1 and precision pin it at roughly 0.80 - so about one in five genuinely dangerous actions slips through the gate. And ILION-Bench v2 is the authors' own benchmark - independent replication will matter.
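One editorial check worth running on any paper that omits recall: it follows algebraically from the reported F1 and precision, since F1 = 2PR/(P+R) can be inverted for R.

```python
def implied_recall(f1: float, precision: float) -> float:
    """Invert F1 = 2PR/(P+R): R = F1*P / (2P - F1)."""
    return f1 * precision / (2 * precision - f1)

# ILION's reported numbers: F1 = 0.8515, precision = 0.910
r = implied_recall(0.8515, 0.910)   # ≈ 0.800
```

An implied recall of about 0.80 means roughly one in five genuinely out-of-scope actions gets past the gate - a number readers can hold the system to even though it isn't in the paper's headline metrics.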
Scaling RL Tool Use Without Blowing the Compute Budget
The third paper takes a different angle: if agents are going to get better, training needs to be more efficient.
AutoTool (arXiv: 2603.13348), accepted as a poster at ICLR 2026, targets a specific failure mode in reinforcement learning for tool use. Models either under-reason on complex problems (missing correct answers) or over-reason on simple ones (burning tokens on unnecessary chain-of-thought). Neither is good.
Training efficient tool-using agents requires balancing reasoning depth against computational cost.
The solution is a two-stage approach. First, supervised fine-tuning gives the model a sense of problem complexity. Second, RL training with decoupled entropy constraints lets the model dynamically scale reasoning length to match what each problem actually needs. Entropy-based diversity preservation prevents the policy from collapsing to a single reasoning style.
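The paper's exact objective isn't reproduced here, but the shape of a budget-aware, decoupled-entropy loss can be sketched. Every coefficient and term below is an illustrative stand-in: a length penalty anchored to an SFT-calibrated complexity budget, plus separate entropy bonuses for tool selection and reasoning style.

```python
def autotool_style_objective(reward: float,
                             reasoning_tokens: int,
                             target_budget: int,       # SFT complexity estimate
                             tool_entropy: float,
                             reasoning_entropy: float,
                             alpha: float = 0.01,      # length-mismatch weight
                             beta_tool: float = 0.05,  # decoupled entropy weights
                             beta_reason: float = 0.02) -> float:
    """Scalar objective to maximize: task reward, minus a penalty for
    reasoning longer or shorter than the calibrated budget, plus separate
    entropy bonuses that keep tool choice and reasoning style from collapsing."""
    length_penalty = alpha * abs(reasoning_tokens - target_budget)
    entropy_bonus = beta_tool * tool_entropy + beta_reason * reasoning_entropy
    return reward - length_penalty + entropy_bonus
```

Under this shape, over-reasoning on a simple problem (say 800 tokens against a 100-token budget) scores strictly worse than a matched-length solution with the same reward - which is the behavior the method trains for.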
Results across three benchmarks: 9.8% accuracy improvement with ~81% reduction in computational overhead compared to standard RL scaling approaches. That's not a minor efficiency gain - it's the difference between a training run that's practical and one that isn't.
The paper lists Yirong Zeng, Xiao Ding, Yufei Liu, and nine further collaborators as authors - a team size consistent with the compute needed to validate the claims across benchmarks.
What This Means for RL Practitioners
The decoupled entropy constraint idea is worth pulling out. Most RL training either fixes the reasoning budget or lets it roam free, neither of which is ideal. The two-stage approach (SFT for complexity calibration, then RL for policy optimization) is straightforward enough to adapt to other tool-use settings beyond the three benchmarks tested here.
The Common Thread
These three papers don't obviously belong together until you look at what's missing from each. EnterpriseOps-Gym shows agents failing at planning. ILION tries to catch failures before they cause harm. AutoTool makes training more efficient so we can iterate faster on closing the planning gap.
The gap between the 37.4% enterprise success rate and the autonomy that vendors are selling is real and measurable. Anthropic's own agent autonomy study from February showed agents running 45 minutes without human input on average - longer sessions don't help if success rates stay this low. The infrastructure for safer, smarter agents is being built, but the performance numbers haven't caught up yet.