Safety Evals Break Under Attack, Agents Work 87% Faster

Today's arxiv submissions cover a lot of ground: one team shows that how you time an attack matters far more than the attack's raw power, a Harvard-Perplexity study puts hard numbers on what AI agents actually do to knowledge work, and a third group applies Lean4 - the theorem-proving language used in competitive mathematics - to verify that agent workflows behave as intended.

TL;DR

Attack Selection in AI Control Evals - Strategic timing of attacks cuts measured AI safety by up to 28 percentage points, without improving the attack itself
How Agents Reshape Knowledge Work - Perplexity's Computer agent completes tasks in 36 minutes vs 269 for human-led workflows, cutting estimated cost by 94%
Lean4Agent - Applying formal verification to agent workflows improves task performance by 11.94% and reveals which parts of a pipeline are structurally sound

When Timing Beats Talent: Attack Selection in AI Control Evals

The "AI control" research program asks a pointed question: if a model were deliberately trying to subvert its deployment, could current monitoring systems catch it? A new paper from Catherine Ge-Wang, Tyler Crosse, Benjamin Hadad IV, Joachim Schaeffer, Ram Potham, and Tyler Tracy tests a narrower version of that question - does it matter when a rogue AI chooses to act?

What they did

The team decomposed attack strategy into two components: a start policy (deciding when to begin an attack) and a stop policy (deciding when to abort one that risks detection). They tested both across two environments, BashArena and LinuxArena, while keeping the audit budget fixed at 1%.

Overview diagram showing attack start and stop policy decomposition in AI control evaluations Figure 1 from the paper shows how start and stop policies interact with the audit pipeline. Source: arxiv.org

What they found

Start policies alone reduced measured safety by 20 percentage points in both environments. Stop policies cut safety by another 20 points on BashArena and 28 points on LinuxArena. Neither improvement required the attacker to become more capable - just more strategic about timing.

The authors are careful to call these upper bounds, not expected outcomes. But the implication is uncomfortable: current safety evaluations assume attackers who are unsophisticated about when to act. A model with any ability to read its context could do better than the baseline the evaluations assume.

For practitioners building on AI safety and alignment frameworks, this is a calibration problem. The safety number produced by a control evaluation now comes with an asterisk: it's the safety level against a naive attacker, not a strategic one.

What Autonomous Agents Actually Do to Knowledge Work

Jeremy Yang (Harvard Business School) and Jerry Ma (Perplexity) published a study using production data from Perplexity's Search and Computer products. It's one of the first papers to compare conversational AI against fully autonomous agents using real user sessions at scale, rather than lab tasks.

The setup

Perplexity Search is the familiar query-response product. Computer is the company's autonomous agent - it can browse, execute code, fill forms, and work through multi-step tasks without waiting for user input. The research team analyzed sessions from both products to measure differences in autonomy, quality, and the type of work users ask for.

The numbers

Computer agents average 26 minutes of autonomous work per session. Search averages 33 seconds. That gap represents genuine task execution: the agent decomposes the work, handles sub-steps, and returns a result rather than waiting for the user to coordinate everything manually.

Task completion time dropped from 269 minutes to 36 minutes. Estimated cost fell by 94%.

Dissatisfaction rates on Computer run 55% lower than on Search. The paper attributes this to agents handling the parts of a task users find most tedious: sequential tool calls, data retrieval, and writing intermediate outputs.

Chart from the Perplexity study showing task time and cognitive level distribution across Search vs Computer agent sessions Figure from Yang et al. (2026) showing how agent sessions differ from traditional search in task complexity and time investment. Source: arxiv.org

What shifts for users

The paper uses Bloom's Taxonomy to categorize the cognitive level of tasks. Among Computer sessions, 76% involve higher-order cognition (analysis, evaluation, creation) versus 55% for Search. Half of agent tasks sit at the "Create" level - users aren't just asking for information, they're asking the agent to produce something.

The human role shifts accordingly. Users spend less time operating steps and more time specifying goals, checking outputs, and extending results. If you've started building with AI agents, this matches the pattern most practitioners notice: the interesting part of the work changes, not disappears.

One caveat worth noting: this is data from Perplexity's own platform, studied by researchers with institutional ties to the company. The methodology looks sound, and the data is production rather than synthetic, but independent replication would strengthen the conclusions.

Bringing Math's Proof Language to Agent Workflows

The third paper takes a completely different angle. Ruida Wang, Jerry Huang, Pengcheng Wang, Xuanqing Liu, Luyang Kong, and Tong Zhang ask whether Lean4 - the dependently typed programming language used for formal mathematical proofs - can verify that an agent's workflow is structurally sound.

The problem they're solving

Most agent systems are specified in natural language: a prompt describes what the agent should do, in what order, and under what conditions. Natural language is ambiguous by design, which means subtle errors in workflow logic often go undetected until something fails in production.

Formal methods eliminate ambiguity by forcing every statement into a language where correctness can be machine-checked. Lean4 is the current tool of choice for this in mathematics - it was used in AlphaProof and in competition math environments. The question is whether it can be applied to agent behavior rather than theorem proving.

What they built

The team produced two artifacts. FormalAgentLib is a Lean4 library for modeling agent workflows - think of it as a type system for what an agent is allowed to do and in what order. LeanEvolve is a tool that takes verification results from Lean4 and applies them to improve the workflow itself: if a workflow fails a proof, the tool iterates toward a version that passes.

System diagram of the Lean4Agent framework showing FormalAgentLib and LeanEvolve components The Lean4Agent framework applies formal verification to agent workflows and uses proof results to drive iterative improvement. Source: arxiv.org

Results across five models

The researchers tested on five leading LLMs, using SWE-Bench-Verified and ELAIP-Bench. Workflows that passed formal verification beat failing ones by an average of 11.94%. Running LeanEvolve on top of that - using proof failures to drive rewrites - added another 7.47%.

The gains are consistent across models, which suggests the verification is catching something structural rather than something model-specific. This connects to earlier work on agent behavioral contracts and certified pipelines, where formal constraints on agent behavior showed similar reliability improvements.

The approach isn't cheap. Lean4 proofs take time to write and check, and right now the tooling assumes a developer can formalize the workflow in the first place. But as LLMs get better at generating formal proofs - the AlphaProof work suggested this is a tractable direction - the cost of entry should fall.

The Common Thread

Three very different papers, but one shared concern: we don't have reliable ways to measure whether AI systems behave as intended.

The control evaluation paper shows that safety measurements are optimistic against strategic attackers. The Perplexity study shows that agent autonomy produces measurable gains but also shifts what "verifying the output" means for users. Lean4Agent shows that formal verification can catch workflow errors that testing alone misses, at the cost of upfront specification work.

None of these papers claims the problem is solved. The control evaluation researchers call their findings upper bounds on attack effectiveness and recommend that future evaluations include strategic timing in their threat model. Yang et al. acknowledge their data is platform-specific. The Lean4Agent authors note the tooling gap for formalizing workflows at scale.

What they collectively point toward is a field that's getting more rigorous about the gap between "the system passed our eval" and "the system does what we think it does." That gap is real, and closing it's harder than it looks.

Sources: