Runtime Safety, Alignment Gaps, and Elastic Context

Three new papers deliver a runtime safety firewall for agent tools, challenge how we measure AI alignment, and introduce elastic context management for long-horizon search agents.

Three papers from today's arXiv drop cover the mechanics of keeping agents safe at runtime, a structural problem with how the field measures alignment, and a new architecture for agents that have to think for a very long time.

TL;DR

  • AgentTrust - A runtime safety layer that intercepts agent tool calls before execution, hitting 95% verdict accuracy at millisecond latency
  • Deployment-Relevant Alignment - Model-level benchmarks cannot predict real-world alignment behavior; evaluation must happen at interaction and deployment levels
  • LongSeeker - Elastic context management for search agents achieves 61.5% on BrowseComp by teaching agents to actively reshape their working memory

AgentTrust: A Firewall for Agent Tool Calls

The scenario motivating AgentTrust is straightforward and slightly terrifying: your agent decides to run a destructive shell command, or gets tricked by a prompt injection into exfiltrating data, and by the time you notice, the action is already done.

Chenglin Yang's paper proposes wrapping every agent tool call in a safety evaluation layer before execution. AgentTrust intercepts the call, classifies it, and returns one of four verdicts: allow, warn, block, or review.

What makes this practical is the layered approach. Shell commands get deobfuscated first - base64 and hex encoding tricks that fool naive pattern matchers are decoded before analysis. The system also detects RiskChain attacks, where individual steps look benign but combine into something dangerous. For ambiguous inputs, it falls back to an LLM judge with response caching to keep latency low.
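
To make the flow concrete, here is a minimal sketch of that layered pipeline - decode obfuscation, match known-bad patterns, then fall back to a cached judge. Every name in it (Verdict, deobfuscate, evaluate_tool_call) is illustrative, not AgentTrust's actual API:

```python
import base64
import binascii
import re
from enum import Enum
from functools import lru_cache


class Verdict(Enum):
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"
    REVIEW = "review"


def deobfuscate(command: str) -> str:
    """Append decoded base64/hex payloads so patterns can match them."""
    for token in command.split():
        try:  # base64-encoded fragment?
            command += " " + base64.b64decode(token, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            pass
        try:  # hex-encoded fragment?
            command += " " + bytes.fromhex(token).decode("utf-8")
        except (ValueError, UnicodeDecodeError):
            pass
    return command


# Toy destructive-command patterns; a real system would use a curated ruleset.
BLOCK_PATTERNS = [r"rm\s+-rf\s+/", r"curl\s+.*\|\s*sh", r"mkfs\."]


@lru_cache(maxsize=4096)  # response caching keeps judge latency low
def llm_judge(normalized_call: str) -> Verdict:
    # Stand-in for the paper's LLM fallback; here we just flag for review.
    return Verdict.REVIEW


def evaluate_tool_call(tool: str, args: str) -> Verdict:
    normalized = deobfuscate(args) if tool == "shell" else args
    if any(re.search(p, normalized) for p in BLOCK_PATTERNS):
        return Verdict.BLOCK
    # Ambiguous inputs fall through to the cached LLM judge.
    return llm_judge(f"{tool}:{normalized}")
```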

The Numbers

On the authors' internal benchmark of 300 scenarios across six risk categories, AgentTrust hit 95.0% verdict accuracy and 73.7% risk-level accuracy. On 630 independently constructed real-world adversarial scenarios it held at 96.7%, including roughly 93% on shell-obfuscated payloads.

Those are strong numbers for a domain where false negatives cost real damage and false positives break useful workflows. The latency is described as "low millisecond" - viable for production use.

Why This Is Worth Deploying Now

Most agent safety work operates after the fact: sandboxes, logging, rollback mechanisms. AgentTrust intervenes before execution, which matters because some actions can't be undone. The architecture is framework-agnostic and ships with Model Context Protocol (MCP) server support, meaning it integrates with existing MCP-based agent pipelines without rewriting the underlying agent.
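
Because the safety check is just a function that runs before the tool does, any framework can adopt the pattern. A hypothetical, framework-agnostic decorator - the Verdict enum repeats the sketch above, and the evaluator is pluggable rather than AgentTrust's real interface:

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"
    REVIEW = "review"


def guarded(tool_name: str, evaluate: Callable[[str, str], Verdict]):
    """Wrap any tool function so the safety check runs before execution."""
    def wrap(fn):
        def inner(args: str):
            verdict = evaluate(tool_name, args)
            if verdict is Verdict.BLOCK:
                raise PermissionError(f"{tool_name} call blocked: {args!r}")
            if verdict in (Verdict.WARN, Verdict.REVIEW):
                print(f"[{verdict.value}] {tool_name}: {args!r}")  # surface to an operator
            return fn(args)  # the tool only runs once the check passes
        return inner
    return wrap


# Usage: gate a shell tool behind any evaluator with this signature.
@guarded("shell", evaluate=lambda tool, args: Verdict.ALLOW)
def run_shell(args: str) -> str:
    import subprocess
    return subprocess.run(args, shell=True, capture_output=True, text=True).stdout
```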

The project is released under AGPL-3.0. For anyone running agents in production that touch files, shell commands, or external APIs, this belongs on the short list of things to assess this week. This builds on a pattern we have been tracking in the literature - see our earlier coverage on agent security and attack surfaces.

[Image: a padlock resting on a laptop keyboard] Runtime safety intercepts agent actions before they execute - catching destructive or malicious tool calls at the point of no return. Source: pexels.com

Alignment Isn't What Your Benchmarks Say It Is

Varad Vishwarupe, Nigel Shadbolt, Marina Jirotka, and Ivan Flechais released a paper that takes aim at one of the field's most comfortable assumptions: that model-level benchmark scores tell you something meaningful about real-world alignment.

Their argument is precise. Alignment claims are only valid at the level where the evidence was collected. A model that scores well on a static alignment benchmark has demonstrated alignment at the model level - which doesn't automatically translate to alignment at the interaction level (how it behaves across a multi-turn conversation) or at the deployment level (how it behaves integrated into a real product with real users).

Four Tiers, Zero Translation

The paper formalizes four evaluation tiers: model, response, interaction, and deployment. Their audit of 16 widely used alignment benchmarks found that user-facing verification support is absent from every one of them. Interactive benchmarks exist but remain sparse - the authors call out tau-bench, CURATe, Rifts, and Common Ground as rare exceptions in a largely static landscape.

Their stress test across 180 transcripts, three frontier models, and four scaffolding approaches found sharp model dependence: a verification scaffold that worked for one model produced no categorical change in another. This matters because scaffolding approaches are often presented as universal solutions.

A benchmark score is evidence of alignment at the level where it was measured - nothing more. Claiming deployment-level safety from model-level evals is an inference gap the field has quietly accepted.

The proposed fix is deliberately unglamorous: alignment profiles that specify which tier each piece of evidence addresses, standardized interactive evaluation protocols, and reporting norms that make the gaps explicit. It is governance work, but it's the kind that prevents the next wave of "the model aced the benchmark but failed in production" incidents. Related: our look at how compound AI systems introduce new trust problems.
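
An alignment profile could be as simple as a list of claims, each tagged with the tier its evidence was collected at. Here is a sketch under that assumption - the paper proposes the reporting norm, not this particular schema:

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    MODEL = 1        # static benchmark scores
    RESPONSE = 2     # single-output grading
    INTERACTION = 3  # multi-turn behavior
    DEPLOYMENT = 4   # integrated product with real users


@dataclass
class Evidence:
    claim: str
    tier: Tier
    source: str


profile = [
    Evidence("refuses harmful requests", Tier.MODEL, "static safety benchmark"),
    Evidence("maintains refusals under pressure", Tier.INTERACTION, "multi-turn red-team"),
]

# The gap the paper highlights: nothing above covers Tier.DEPLOYMENT,
# so this profile licenses no deployment-level safety claim.
covered = {e.tier for e in profile}
assert Tier.DEPLOYMENT not in covered
```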

What Changes in Practice

This paper is an argument, not a tool - but it's a well-documented one. If your team is making alignment claims in a product spec or safety disclosure, the four-tier framework gives you a principled way to qualify what your evidence actually covers. It also gives evaluators a rubric for asking harder questions of vendors who wave benchmark numbers as proof of deployment safety.

[Image: a clipboard with a checklist on a desk] Alignment evaluation needs to be as structured as any other audit - specifying what tier of evidence each benchmark actually covers. Source: pexels.com

LongSeeker: Teaching Agents to Manage Their Own Memory

Search agents face a problem that gets worse the harder the task. Every tool call adds context. By the time an agent has done a dozen web searches and read ten pages, it is hauling mostly irrelevant text through every subsequent inference step - burning tokens, increasing hallucination risk, and degrading the signal-to-noise ratio.

Yijun Lu et al. propose Context-ReAct, a paradigm built on five atomic operations for dynamically reshaping context: Skip (ignore a chunk), Compress (summarize), Rollback (discard recent additions), Snippet (extract a key passage), and Delete (remove completely). The result is LongSeeker, a search agent that treats its working memory as an active resource rather than a passive accumulator.
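
To see what "actively reshaping working memory" means in practice, here is a toy version of the five operators applied to a list of context chunks. The method names mirror the paper's operators; the implementations are illustrative guesses, not LongSeeker's code:

```python
class WorkingContext:
    def __init__(self):
        self.chunks: list[str] = []

    def add(self, chunk: str) -> None:
        self.chunks.append(chunk)

    def skip(self, i: int) -> None:
        # Mark a chunk as ignorable without losing its position in history.
        self.chunks[i] = "[skipped]"

    def compress(self, i: int, summary: str) -> None:
        # Replace a long chunk with a model-written summary.
        self.chunks[i] = summary

    def rollback(self, n: int) -> None:
        # Discard the n most recent additions (e.g., a dead-end search branch).
        if n > 0:
            self.chunks = self.chunks[:-n]

    def snippet(self, i: int, passage: str) -> None:
        # Keep only the key passage extracted from a chunk.
        self.chunks[i] = passage

    def delete(self, i: int) -> None:
        # Remove a chunk entirely.
        del self.chunks[i]


ctx = WorkingContext()
ctx.add("search results: 40 links, mostly off-topic...")
ctx.add("page text: 8,000 words with two relevant sentences...")
ctx.snippet(1, "Key fact: the statute was amended in 2019.")
ctx.delete(0)  # the agent decides the first search was a dead end
```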

The Results on BrowseComp

LongSeeker scores 61.5% on BrowseComp and 62.5% on BrowseComp-ZH - deep, multi-step web research benchmarks that require chasing chains of evidence across dozens of pages. The authors report 40-50% improvement over baselines including Tongyi DeepResearch and AgentFold.

The model is fine-tuned from Qwen3-30B on 10,000 synthesized trajectories. The paper includes a mathematical proof that the operator set is expressively complete - meaning any context transformation can be expressed using just these five operations. For the theoretically inclined, this suggests the framework is structurally sound rather than just empirically convenient.

Beyond Bigger Windows

The dominant approach to long-context agents is to throw more context window at the problem. LongSeeker's contribution is showing that selective memory management - actively discarding irrelevant information rather than building up everything - beats passive accumulation even when windows are large enough to hold everything.

The 40-50% gain over strong baselines is significant. This isn't a marginal architectural tweak - it's evidence that memory hygiene is a first-class capability for long-horizon agents, not a nice-to-have. For practitioners building research assistants or multi-step information retrieval pipelines, the Qwen3-30B base makes this accessible, and the training recipe is described in enough detail to adapt.

[Image: rows of books on library shelves] Like a well-organized library, LongSeeker's five operators let agents actively prune and compress their context rather than building up everything. Source: unsplash.com


The common thread across all three papers is that agent capability has outrun the field's tooling. Tool calls need pre-execution safety checks, not just sandboxes after the fact. Alignment evaluations need to match the tier where claims are being made. Memory management needs to be active, not passive. None of these are theoretical contributions - they're arguments and tools you can act on today.

About the author

Elena Marchetti, Senior AI Editor & Investigative Journalist

Elena is a technology journalist with over eight years of experience covering artificial intelligence, machine learning, and the startup ecosystem.