Corrupt Agent Scores, Memory Bottlenecks, Skill Evolution
New research exposes hidden failures in agent benchmarks, finds retrieval quality dominates memory pipeline performance, and shows evolutionary skill discovery beats manual curation.

Three papers from today's arXiv batch zero in on failure modes that practitioners actually encounter: benchmark scores that flatter rather than measure, memory pipelines that optimize in the wrong place, and agent skill systems that require too much manual work to scale. Each offers a concrete fix that doesn't require changing your base model.
TL;DR
- Beyond Task Completion - Between 27% and 78% of agent "successes" on tau-bench conceal procedural violations that human supervisors would immediately flag
- Diagnosing Memory Bottlenecks - Retrieval quality drives 20 percentage points of accuracy variance in agent memory; write strategy barely matters
- EvoSkill - Automated failure analysis grows reusable agent skills that transfer zero-shot to different task classes
The Benchmark Is Lying to You
Most agent evaluation today reduces to a binary: task completed or not. On the tau-bench benchmark - which tests customer-service agents across retail and airline domains - that approach produces clean numbers. It also, as it turns out, produces numbers that are systematically misleading.
A paper from Hongliu Cao, Ilias Driouich, and Eoin Thomas introduces Procedure-Aware Evaluation (PAE), a framework that watches how an agent reaches its answer, not just whether it gets there. PAE assesses agent trajectories across four dimensions: Utility (did it solve the user's problem?), Efficiency (how many tokens and turns did it consume?), Interaction Quality (was the conversation coherent and appropriate?), and Procedural Integrity (did it follow its guidelines?).
Scrutinizing agent behavior requires more than task completion metrics - it demands looking at each step of the decision process.
Running PAE across several state-of-the-art models on tau-bench, the researchers found that between 27 and 78 percent of reported successes were what they call "corrupt successes" - task completions that violated procedural requirements somewhere along the way. The range varies significantly by model. One of the systems tested concentrated nearly 78% of its failures in policy faithfulness, meaning it completed tasks but did so by ignoring the guidelines it was given. Another distributed failures more broadly across the four evaluation dimensions.
The framework uses multi-dimensional gating: any outcome that fails a procedural check gets disqualified outright, regardless of whether the user's immediate request was technically fulfilled. An agent that cancels a flight reservation correctly while sidestepping the cancellation policy it was supplied doesn't count as a success under PAE.
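The gating logic described above can be sketched in a few lines. This is a hypothetical illustration of the idea, not the paper's implementation: the four dimension names follow PAE, but the scoring interface and threshold are assumptions.

```python
from dataclasses import dataclass

@dataclass
class TrajectoryScores:
    """Per-dimension scores for one agent trajectory, each in [0.0, 1.0].

    Dimension names follow PAE; this scoring interface is a hypothetical
    sketch, not the paper's actual implementation.
    """
    utility: float               # did it solve the user's problem?
    efficiency: float            # token and turn cost
    interaction_quality: float   # conversational coherence
    procedural_integrity: float  # did it follow its guidelines?

def pae_success(scores: TrajectoryScores, gate: float = 1.0) -> bool:
    """Multi-dimensional gating: a trajectory counts as a success only if
    it solves the task AND passes the procedural check. A completion that
    violates procedure is a 'corrupt success' and is disqualified."""
    solved = scores.utility >= 1.0
    compliant = scores.procedural_integrity >= gate
    return solved and compliant

# A flight cancellation that ignores the supplied cancellation policy:
corrupt = TrajectoryScores(utility=1.0, efficiency=0.8,
                           interaction_quality=0.9,
                           procedural_integrity=0.0)
print(pae_success(corrupt))  # False: disqualified despite task completion
```

The point of the gate is that procedural integrity is not averaged against the other dimensions - a single procedural failure vetoes the success outright, which is what separates PAE from a weighted-score rubric.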
The implications for anyone maintaining an agent evaluation pipeline are uncomfortable. Our agentic AI benchmarks leaderboard tracks pass rates across tasks like these - PAE suggests that a major share of those "passed" results wouldn't survive scrutiny of the underlying decision process. Beyond model behavior, the paper also exposes structural flaws in tau-bench itself: task scope gaps where required actions weren't captured, and simulator artifacts that allow agents to exploit benchmark mechanics without doing the right thing. The problem isn't just how agents are behaving - it's that the tools we use to measure them are giving us a distorted picture.
For teams deploying agents in regulated environments - financial services, healthcare, legal - this isn't a theoretical concern. An agent that books correctly while violating its own policy is a compliance problem, not a success.
Memory Has a Bottleneck - and You Are Probably Fixing the Wrong End
If you have spent engineering cycles building sophisticated write-time processing for a RAG-based agent - structured fact extraction, knowledge graph construction, layered summarization - there's a reasonable chance you have been tuning the wrong part of the pipeline.
Boqin Yuan, Yue Su, and Kun Yao ran a systematic 3x3 study crossing three write strategies against three retrieval methods on the LoCoMo conversational memory benchmark. The write strategies tested were: raw chunked storage (zero LLM calls at write time), Mem0-style fact extraction, and MemGPT-style summarization. The retrieval methods were: cosine similarity, BM25 keyword search, and hybrid reranking.
The memory challenge for AI agents mirrors the library problem: the bigger the archive, the more retrieval quality determines what you can actually use.
The outcome is unambiguous. Retrieval method produced a 20 percentage point spread in accuracy - from 57.1% at the bottom of the range to 77.2% at the top. Write strategy moved the needle by only 3 to 8 points across conditions. Raw chunked storage, which requires no inference at write time, matched or beat its more expensive alternatives in most configurations. Failure analysis confirmed that when agents gave wrong answers, the breakdown was almost always at the retrieval stage, not at the point where the model used retrieved content.
The practical upshot is direct. If your memory-augmented agent is underperforming, the most productive engineering investment is improving how it retrieves - better embedding models, hybrid reranking that combines dense and sparse signals, smarter query construction. Our guide on RAG covers the fundamentals; this paper gives empirical support for where those fundamentals matter most. Elaborate write-time summarization may actively hurt performance by discarding context that a better retrieval system could have found and used. The paper's code is publicly available, which means you can run the same 3x3 study on your own domain before committing to an architecture.
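One cheap way to combine dense and sparse signals is reciprocal rank fusion, which merges ranked lists without needing to calibrate their raw scores against each other. The sketch below assumes you already have a cosine-similarity ranking and a BM25 ranking; the document ids are made up for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list.

    Each document scores sum(1 / (k + rank)) across the lists where it
    appears; k=60 is the conventional smoothing constant. Higher fused
    score means the document ranked well in more of the input lists.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d3", "d1", "d7"]   # cosine-similarity ranking (hypothetical)
sparse = ["d3", "d9", "d1"]   # BM25 keyword ranking (hypothetical)
print(reciprocal_rank_fusion([dense, sparse]))  # ['d3', 'd1', 'd9', 'd7']
```

Rank-based fusion like this sidesteps the fact that cosine similarities and BM25 scores live on incompatible scales, which is why it is a common first step before investing in a learned reranker.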
Teaching Agents New Skills Without Writing a Single Workflow
The third paper tackles a scaling problem that every team building multi-agent systems eventually hits: expanding what agents can do requires someone to manually design and test capabilities for each new task class. That engineering bottleneck grows proportionally with the breadth of deployment.
EvoSkill, from Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu, treats skill creation as an evolutionary problem. The system runs an agent on a task set, collects execution failures, analyzes what went wrong, then proposes new skill packages or modifications to existing ones. Each proposed skill is assessed against the Pareto frontier - only capabilities that demonstrably improve performance without degrading other metrics survive. The key thing: the base model is never modified. Everything happens at the skill level, making the approach modular and compatible with any underlying LLM.
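The Pareto-frontier acceptance test at the heart of that loop is easy to state precisely. The sketch below is an illustration of the general dominance criterion, not EvoSkill's code; the metric names and values are hypothetical.

```python
def dominates(a, b):
    """True if metrics dict `a` Pareto-dominates `b`: at least as good on
    every metric (higher is better) and strictly better on at least one."""
    keys = a.keys()
    return all(a[k] >= b[k] for k in keys) and any(a[k] > b[k] for k in keys)

def accept_skill(candidate_metrics, current_metrics):
    """A proposed skill survives only if it demonstrably improves some
    metric without degrading any other - i.e. it dominates the current
    configuration."""
    return dominates(candidate_metrics, current_metrics)

# Hypothetical metric values for illustration:
current   = {"accuracy": 0.606, "efficiency": 0.70}
candidate = {"accuracy": 0.679, "efficiency": 0.70}  # gain, no regression
regressed = {"accuracy": 0.679, "efficiency": 0.60}  # gain, but costlier

print(accept_skill(candidate, current))  # True
print(accept_skill(regressed, current))  # False
```

The strictness of the criterion is the point: a skill that trades one metric against another never enters the pool, so the skill library can only ratchet forward.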
Performance gains across three benchmarks are substantial. On OfficeQA, a financial document reasoning task, accuracy improved from 60.6% to 67.9%, a 7.3 point gain. On SealQA, a search-augmented question-answering task, the improvement was larger: 26.6% to 38.7%, a 12.1 point gain. The most interesting result is the zero-shot transfer test: skills evolved completely on SealQA improved accuracy on BrowseComp - a distinct task class the system had never been trained on - by 5.3 percentage points without any additional tuning.
That transfer result matters because it rules out the obvious concern: that the evolved skills are just memorizing patterns from the training task rather than capturing something reusable. If they were task-specific hacks, they wouldn't generalize. For teams building the kinds of multi-agent pipelines described in our guide to building your first AI agent, EvoSkill points toward a plausible path for ongoing capability expansion that doesn't require a human to design every new tool.
A Common Thread
Three papers, three failure modes, three practical fixes.
Evaluation is broken in ways that systematically overstate agent performance - PAE offers a framework for catching what task-completion metrics miss. Memory pipelines are optimizing at write time when the binding constraint is at retrieval - hybrid reranking alone can recover 20 percentage points of accuracy. Skill creation is a manual bottleneck that doesn't scale - evolutionary failure analysis can discover and validate new capabilities automatically, and the resulting skills transfer to tasks the system has never seen.
None of these interventions require swapping your base model or rebuilding from scratch. All three slot into existing architectures. That's exactly why this batch of research is worth reading this week.
Sources:
