Agent Safety Gaps, Memory Learning, and Leaner Inference

Three papers this week hit very different parts of the AI stack. One tests what happens when production agent frameworks meet systematic adversarial pressure. Another revisits a gap in how RLVR training uses the data it already creates. The third makes the case that calibrated confidence during training can shrink your inference bill by more than a factor of twelve.

TL;DR

Vera - Automated safety testing finds 93.9% average attack success across four production agent frameworks, with 1,600 test cases now public
Procedural Memory Distillation - Capturing cross-episode signals during RLVR training yields 7.9-13.6% gains on code benchmarks with zero inference overhead
C3RL + CAS - Adding confidence calibration to RL training enables adaptive compute at inference, cutting costs up to 12.33x against majority voting

When Agents Break: Vera's 93.9% Attack Rate

Yunhao Feng and fourteen co-authors built Vera specifically because existing safety evaluations for LLM agents are fragmented and hard to reproduce. They released it alongside Vera-Bench: 1,600 executable test cases spanning 124 risk categories, covering three distinct execution environments.

The headline result is stark. Across four production frameworks - OpenClaw, Hermes, Codex, and Claude Code - multi-channel attacks succeeded 93.9% of the time on average.

How Vera Works

The testing pipeline runs in three stages. First, it performs literature-driven risk discovery and builds a structured taxonomy from identified threat categories. Second, it uses combinatorial composition to generate executable safety cases from those categories. Third, it runs adaptive sandbox execution where dedicated control agents steer multi-turn interactions toward the target scenario, verifying outcomes through observable artifacts rather than model self-reporting.

Asking a LLM whether it behaved safely isn't the same as watching what it actually did.

That evidence-grounding distinction matters. Self-reported compliance is unreliable by design - a model that was just manipulated into unsafe behavior can still describe its actions in reassuring terms. Vera sidesteps this by measuring what tools were called, what data was accessed, and what side effects occurred.

The 124 risk categories cover the range you'd expect - prompt injection, privilege escalation, tool misuse, data exfiltration - but organized into a taxonomy that makes gaps visible. The fact that attacks worked 93.9% of the time across frameworks maintained by serious teams isn't evidence of negligence; it's evidence that no systematic harness like Vera existed before.

Vera-Bench is public. If you're building on top of any of these frameworks, that benchmark is now the floor test you don't have an excuse to skip.

Security testing tools arranged on a desk surface Systematic adversarial testing requires reproducible scaffolding - exactly what Vera-Bench provides for agent frameworks. Source: pexels.com

Procedural Memory Distillation: What RLVR Forgets

Reinforcement learning with verifiable rewards has driven a lot of recent progress - especially on math and code tasks where a ground-truth verifier exists. But RLVR is episode-local by design. A policy update from one training episode doesn't carry forward observations about which strategies keep failing across many episodes, which failure modes recur on structurally similar problems, or which approaches consistently pass verification.

Ye Liu, Srijan Bansal, Bo Pang, and colleagues at Salesforce Research call this cross-episode signal "procedural memory," and their paper proposes a way to capture and distill it.

Three Levels of Memory

PMD organizes experience at three abstraction tiers:

Raw trajectories - the actual rollout sequences
Self-reflected strategies - extracted lessons from what worked and what didn't
Behavioral patterns - recurring structures across multiple problems

A memory-conditioned teacher uses this built up signal to guide a student model on its own rollouts. The memory is a training scaffold that gradually gets absorbed into the student's weights. At inference time, the student needs no external memory - the learned patterns are already part of the model.

The gains on Qwen3-8B and OLMo3-Instruct-7B are worth noting because they come from the same training data RLVR already uses. PMD doesn't require additional labeling, additional problems, or a larger dataset. It improves over the SDPO baseline by 3.8-5.5% on SCIKNOWEVAL and by 7.9-13.6% on LIVECODEBENCH, which is a harder, more realistic benchmark for code tasks.

That 13.6% ceiling is notable in context: agent memory has been a persistent bottleneck in multi-step reasoning. PMD addresses a more specific version of that problem - not runtime working memory for launched agents, but learning memory during training.

Abstract representation of neural network connections and memory storage Cross-episode learning captures patterns that single-episode updates discard - the basis for PMD's training improvements. Source: unsplash.com

Why It Matters for Practitioners

If you're running RLVR fine-tuning on any structured task, ignoring cross-episode signals wastes signal you're already creating. The cost to enable PMD is added training complexity; the payoff is a better model without additional data collection.

Confidence Calibration Cuts Inference 12x

Test-time compute scaling works: running a model multiple times and combining answers improves accuracy on hard reasoning tasks. The catch is economics - majority voting over N samples costs N forward passes every time.

Xuqing Yang, Yi Yuan, Shanzhe Lei, and Xuhong Wang propose fixing this from the training side rather than the inference side.

The Calibration Gap

RL training improves for correctness. A model trained this way gets better at solving problems but doesn't necessarily know when it's solved them correctly. That missing calibration is what forces practitioners into blind majority voting - without confidence signals, you can't allocate compute selectively.

C3RL (Correctness and Confidence Calibration Reinforcement Learning) adds calibration objectives to the RL reward with correctness rewards. The training signal simultaneously pushes toward correct answers and toward accurate confidence estimates.

Adaptive Inference with CAS

The inference companion is CAS (Confidence-based Adaptive Test Time Scaling). Once a model is trained with C3RL, CAS routes each query based on confidence. High-confidence responses go out right away. Lower-confidence ones trigger additional sampling before a final answer is committed.

Across eight text and multimodal datasets, CAS reduces inference budget by up to 12.33x compared to majority voting, while improving both calibration and accuracy scores. The accuracy improvement comes from the better model; the budget reduction comes from not running expensive multi-sample voting on problems the model already knows it has right.

AI chip and circuit board representing compute infrastructure Adaptive inference allocation targets expensive sampling at the queries that actually need it. Source: unsplash.com

The Trade-off

C3RL requires more sophisticated training with multi-objective reward design. That's real complexity. But for systems where inference runs continuously at scale, a 12x compute reduction justifies sizable upfront training investment. The economics are especially clear for multimodal tasks, which are expensive to run in bulk.

Common Thread

All three papers expose the same pattern: current AI training pipelines don't use all available signal.

Vera shows that safety coverage is incomplete because no structured testing harness forced systematic exploration of the risk space - not because frameworks are poorly built. PMD shows that RLVR training produces cross-episode signal and then ignores it. C3RL shows that correctness optimization leaves calibration signal on the table, forcing downstream workarounds.

The solutions are all variations on the same move: look at the gap between what training actually uses and what the data contains, then close it. Vera-Bench is now public. PMD code accompanies the paper. C3RL and CAS give inference engineers a concrete path from training investment to runtime savings.

Sources: