Smarter Trees, Hidden Attacks, Drug Design Gaps

Three papers out today span very different corners of AI research: one cuts peak memory for Tree-of-Thoughts search by up to 4x, another shows that safety refusals are more brittle than most teams assume, and a third puts a hard number on how well current frontier models handle real drug design. None of them makes comfortable reading for anyone who assumed these problems were close to solved.

TL;DR

ArborKV - Structure-aware KV cache management cuts peak memory by up to 4x for Tree-of-Thoughts reasoning while preserving near-full accuracy
Latent-space jailbreaks - Controlled Latent-space Evasion outperforms prior refusal-suppression attacks on 15 models by projecting internal representations past linear refusal boundaries
SMDD-Bench - GPT-5.4, the top scorer across seven frontier models, completes only 40.2% of guaranteed-solvable drug design tasks

ArborKV: 4x Memory Savings for Tree Reasoning

Authors: Yeqiu Chen, Ziyan Liu, Zhenxin Huang, Runquan Gui, Hong Wang, Lei Liu

Tree-of-Thoughts (ToT) is one of the better tools for hard inference problems. Instead of committing to a single reasoning chain, the model branches, explores, and prunes - finding solutions that flat chain-of-thought misses. The catch is obvious to anyone who has tried to run it at scale: the KV cache grows roughly exponentially with search depth, at a rate set by the branching width. A configuration that fits neatly on a 40GB GPU in chain-of-thought mode can blow past 160GB under ToT. In practice, most practitioners either cap search depth severely or skip ToT completely.
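
To make the growth concrete, here is a back-of-the-envelope sketch. It is not from the paper: the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements), the 256 tokens per thought, and the branching factor of 3 are all hypothetical, and it assumes every node's tokens stay cached while shared prefixes are stored once.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache for one token: one key and one value vector
    per layer and per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def tree_node_count(branching: int, depth: int) -> int:
    """Nodes in a full tree with the given branching factor, levels 0..depth."""
    return sum(branching ** level for level in range(depth + 1))


def chain_node_count(depth: int) -> int:
    """A single reasoning chain visits one node per level."""
    return depth + 1


def cache_gib(nodes: int, tokens_per_node: int, per_token_bytes: int) -> float:
    """Total cache size in GiB if every node keeps its own tokens resident."""
    return nodes * tokens_per_node * per_token_bytes / 2**30


# Hypothetical model shape and thought length, chosen only for illustration.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
for depth in (2, 4, 6):
    chain = cache_gib(chain_node_count(depth), 256, per_token)
    tree = cache_gib(tree_node_count(3, depth), 256, per_token)
    print(f"depth={depth}: chain={chain:.2f} GiB, tree={tree:.2f} GiB")
```

The chain grows linearly with depth while the full tree grows geometrically, which is why a depth cap is usually the first thing practitioners reach for.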

ArborKV attacks this by treating the KV cache as what it actually is during tree search - a structured resource aligned to the tree's topology, not a flat buffer. Three mechanisms work together to do this efficiently.

How ArborKV Works

A lightweight value estimator predicts how useful each cached token state is likely to be given the current search dynamics. A tree-aware allocation policy uses those scores to assign memory budget proportionally to promising branches. A token-extractive eviction mechanism with lazy rehydration handles the rest: tokens on low-priority branches get evicted, and state is restored only when backtracking actually requires it.
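
The allocation step can be pictured with a small sketch. This is an illustration under assumptions, not ArborKV's actual policy: the function name `allocate_budget`, the per-branch `floor`, and the idea of taking branch scores as a plain dictionary are all hypothetical, and the scores stand in for whatever the paper's value estimator produces.

```python
from fractions import Fraction
from typing import Dict


def allocate_budget(scores: Dict[str, float], total_tokens: int,
                    floor: int = 0) -> Dict[str, int]:
    """Split a token budget across branches in proportion to their scores.

    Every branch first receives `floor` tokens; the rest is divided
    proportionally. Scores must be non-negative (an assumption of this sketch).
    """
    if not scores:
        return {}
    if any(s < 0 for s in scores.values()):
        raise ValueError("scores must be non-negative")
    remaining = total_tokens - floor * len(scores)
    if remaining < 0:
        raise ValueError("per-branch floor exceeds the total budget")

    # Exact rational arithmetic avoids float rounding pushing the sum over budget.
    exact = {b: Fraction(s) for b, s in scores.items()}
    total = sum(exact.values())
    if total == 0:
        # No signal from the estimator: fall back to an even split.
        exact = {b: Fraction(1) for b in scores}
        total = Fraction(len(scores))

    shares = {b: int(remaining * w / total) for b, w in exact.items()}
    # Truncation leaves fewer than len(scores) tokens unassigned;
    # hand them to the highest-scoring branches first.
    leftover = remaining - sum(shares.values())
    for b in sorted(scores, key=scores.get, reverse=True)[:leftover]:
        shares[b] += 1
    return {b: floor + n for b, n in shares.items()}


# Hypothetical branch names and scores.
print(allocate_budget({"root/a": 3, "root/b": 1}, total_tokens=100, floor=10))
```

The floor is one possible design choice: it keeps a minimal footprint for weak branches so that backtracking to them does not always start from nothing.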

The key insight from the paper is that "KV reuse in ToT-style inference is governed by search dynamics." Tokens on an active, high-scoring branch deserve their cache space. Tokens on a branch that scored poorly and isn't currently being explored don't - and can be cheaply restored from the base context if the search ever returns to that node.
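
Evict-then-restore-on-demand can be sketched as follows. This is a simplified, node-level illustration rather than the paper's token-extractive mechanism: `LazyNodeCache` and `recompute_kv` are hypothetical names, the list of floats is a stand-in for real key/value tensors, and a real recompute would be a model prefill pass that also needs the node's prefix context.

```python
from typing import Callable, Dict, List, Optional, Sequence

KV = List[float]  # stand-in for real key/value tensors


class LazyNodeCache:
    """Keeps token ids for every node, but KV state only for resident nodes."""

    def __init__(self, recompute_kv: Callable[[Sequence[int]], KV]):
        self._recompute_kv = recompute_kv
        self._tokens: Dict[str, List[int]] = {}  # always kept: cheap to store
        self._kv: Dict[str, Optional[KV]] = {}   # None means evicted
        self.rehydrations = 0

    def put(self, node_id: str, tokens: Sequence[int], kv: KV) -> None:
        self._tokens[node_id] = list(tokens)
        self._kv[node_id] = kv

    def evict(self, node_id: str) -> None:
        """Drop the expensive KV state but remember how to rebuild it."""
        self._kv[node_id] = None

    def is_resident(self, node_id: str) -> bool:
        return self._kv.get(node_id) is not None

    def get(self, node_id: str) -> KV:
        """Return KV state, recomputing it only if the node was evicted."""
        kv = self._kv[node_id]
        if kv is None:
            kv = self._recompute_kv(self._tokens[node_id])
            self._kv[node_id] = kv
            self.rehydrations += 1
        return kv


def fake_prefill(tokens: Sequence[int]) -> KV:
    """Toy recompute function; a real one would run the model."""
    return [float(t) for t in tokens]


cache = LazyNodeCache(fake_prefill)
cache.put("branch-7", [11, 12, 13], fake_prefill([11, 12, 13]))
cache.evict("branch-7")    # low-priority branch loses its KV state
_ = cache.get("branch-7")  # the search backtracks: state is rebuilt once
print(cache.rehydrations)
```

The cost model behind this trade is that token ids are tiny compared with KV tensors, so paying an occasional recompute is cheaper than keeping every branch resident.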

Standard KV eviction policies (recency-based, frequency-based, attention-score-based) rank tokens without regard to where the search currently sits. ArborKV ties eviction to the search state instead, which is how it reaches up to 4x peak memory reduction with near-full accuracy retention. At that ratio, a ToT configuration that previously needed 80GB of KV memory can run in 20GB.

For practitioners building reasoning models or agentic systems that use multi-step planning, this is a direct operational improvement. It doesn't change when you should use ToT - but it removes hardware cost as the main reason not to.

[Image: computer memory modules on a white surface. Caption: KV cache growth is the primary constraint on tree-based reasoning at scale. ArborKV shrinks peak memory by 4x. Source: pexels.com]

Latent-space Attacks: Jailbreaks That Bypass Safety on 15 Models

Authors: Giorgio Piras, Raffaele Mura, Fabio Brau, Maura Pintor, Luca Oneto, Fabio Roli, Battista Biggio | University of Cagliari and University of Genoa

The conventional picture of LLM safety is a model that "refuses" harmful requests. The actual picture, as this paper makes explicit, is a linear classifier running over the model's internal representations. The team frames refusal suppression as an adversarial attack against that classifier - and builds a method that beats every existing approach by taking that framing seriously.

Prior work had already established that the boundary between refused and answered prompts is roughly linearly separable in latent space. The "difference-in-means" direction between the two populations defines a decision boundary. Existing ablation-style attacks try to push the model's representation past that boundary - but they reach it and stop, leaving representations in an ambiguous region where behavior is inconsistent.
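
The geometry described here can be illustrated on toy vectors. The sketch below is not the paper's code: it uses hand-picked 2D points in place of real model activations, places the boundary at the midpoint between the two class means (an assumption), and the helper names are hypothetical. It only computes the difference-in-means direction and a point's signed distance from the boundary.

```python
from statistics import fmean
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Component-wise mean of a set of vectors."""
    return [fmean(col) for col in zip(*rows)]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def difference_in_means(refused: Sequence[Sequence[float]],
                        answered: Sequence[Sequence[float]]) -> Tuple[Vector, float]:
    """Unit direction from the 'answered' mean toward the 'refused' mean,
    plus an offset that puts the boundary at the midpoint of the two means."""
    mu_r, mu_a = mean_vector(refused), mean_vector(answered)
    diff = [r - a for r, a in zip(mu_r, mu_a)]
    norm = dot(diff, diff) ** 0.5
    direction = [d / norm for d in diff]
    midpoint = [(r + a) / 2 for r, a in zip(mu_r, mu_a)]
    return direction, dot(direction, midpoint)


def signed_distance(x: Sequence[float], direction: Sequence[float],
                    offset: float) -> float:
    """Positive on the refusal side, negative on the answered side, 0 on the boundary."""
    return dot(x, direction) - offset


# Toy stand-ins for hidden states of refused and answered prompts.
refused = [[3.0, 1.0], [5.0, -1.0]]
answered = [[-3.0, 1.0], [-5.0, -1.0]]
direction, offset = difference_in_means(refused, answered)
print(direction, offset, signed_distance([2.0, 7.0], direction, offset))
```

A representation whose signed distance is close to zero sits in the ambiguous region the paragraph describes, where behavior is inconsistent.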

What CLE Does Differently

Their Controlled Latent-space Evasion (CLE) method projects representations a calibrated distance beyond the boundary, into a region where the model confidently treats the input as something to answer. The confidence level is tunable: dial low for stealth (barely over the boundary, harder to detect), dial high for reliability (deeply embedded in the compliant region, consistent behavior).
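
One simple way to write the difference between stopping at the boundary and going a calibrated distance past it is shown below. This is an assumed formalization for illustration, not the paper's stated objective: $\hat{r}$ is a unit vector pointing toward the refusal side, $b$ is the boundary offset, $h$ is a hidden representation, and $\kappa \ge 0$ is a hypothetical symbol for the tunable confidence margin.

```latex
% Signed distance of a representation h from the linear refusal boundary
d(h) = \hat{r}^{\top} h - b

% Projection onto the boundary (where ablation-style attacks stop): d(h') = 0
h' = h - d(h)\,\hat{r}

% Projection a margin \kappa past the boundary: d(h'_{\kappa}) = -\kappa
h'_{\kappa} = h - \bigl(d(h) + \kappa\bigr)\,\hat{r}
```

Under these assumptions, a small $\kappa$ corresponds to the stealth setting and a large $\kappa$ to the reliability setting; $\kappa = 0$ recovers the boundary projection.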

CLE achieves state-of-the-art attack success rates across 15 models - instruction-tuned, multimodal, and reasoning architectures. That breadth is the most important result. It suggests that linear separability of refusals isn't a quirk of one training pipeline. It's a structural property of how alignment is currently built across the field.

The defensive implication is direct. Safety training that leaves refusal linearly separable in latent space is inherently vulnerable to attackers who know that geometry. You can check current model jailbreak resistance on the jailbreak and red-team leaderboard. The long-term fix requires either making the refusal geometry non-linear, or reducing how cleanly harmful vs. harmless content separates in representation space.

For security teams running AI safety evaluations, the practical takeaway is that prompt-level red-teaming is insufficient. Attacks that operate at the representation level require defenses that operate at the same level.

SMDD-Bench: The Drug Design Reality Check

Authors: Kevin Han, Renfei Zhang, Kathy Wei, Hamed Mahdavi, Niloofar Mireshghallah, Amir Farimani

AI acceleration for drug discovery has been a recurring pitch for at least five years. The argument is clean: drug design is computational, has clear success criteria, and sits in a domain where even modest speedups are worth billions. SMDD-Bench is the first benchmark designed to test that argument against tasks with the complexity of actual pharmaceutical workflows.

What SMDD-Bench Tests

502 guaranteed-solvable tasks across five categories - pharmacophore identification, interaction point discovery, scaffold hopping, lead optimization, and fragment assembly - covering 102 protein targets. "Guaranteed solvable" is deliberate: these aren't adversarial trick questions. They're tasks that human chemists can complete, designed to isolate whether LLMs can do the same work.

The benchmark is explicitly multi-turn and long-horizon. Models get limited "oracle calls" (expensive simulated lab validations), which forces strategic planning rather than brute-force inference. Tasks require chemical and biological reasoning, 3D spatial intuition, and proficiency with specialized tools - not just natural language fluency.
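
The effect of a hard oracle budget on strategy can be sketched in a few lines. Nothing here reflects SMDD-Bench's actual interface: `BudgetedOracle`, `best_validated`, and the toy scoring functions are hypothetical, and string length stands in for an expensive simulated lab validation.

```python
from typing import Callable, Iterable, Optional, Tuple


class OracleBudgetExceeded(RuntimeError):
    """Raised when a caller asks for more validations than the task allows."""


class BudgetedOracle:
    """Wraps an expensive scoring function and enforces a hard call limit."""

    def __init__(self, score_fn: Callable[[str], float], max_calls: int):
        self._score_fn = score_fn
        self.max_calls = max_calls
        self.calls_used = 0

    @property
    def remaining(self) -> int:
        return self.max_calls - self.calls_used

    def __call__(self, candidate: str) -> float:
        if self.remaining <= 0:
            raise OracleBudgetExceeded(f"budget of {self.max_calls} calls is spent")
        self.calls_used += 1
        return self._score_fn(candidate)


def best_validated(candidates: Iterable[str],
                   cheap_proxy: Callable[[str], float],
                   oracle: BudgetedOracle) -> Optional[Tuple[str, float]]:
    """Rank candidates with a cheap proxy, then spend oracle calls only on
    the top-ranked ones. Returns the best (candidate, score) pair, or None."""
    ranked = sorted(candidates, key=cheap_proxy, reverse=True)
    best: Optional[Tuple[str, float]] = None
    for candidate in ranked[: oracle.remaining]:
        score = oracle(candidate)
        if best is None or score > best[1]:
            best = (candidate, score)
    return best


# Toy example: both the proxy and the "oracle" are just string length.
oracle = BudgetedOracle(score_fn=len, max_calls=2)
print(best_validated(["CC", "CCCC", "C"], cheap_proxy=len, oracle=oracle))
print(oracle.remaining)
```

The point of the wrapper is that an agent cannot validate everything it proposes, so the quality of its pre-screening and planning decides how far a fixed budget goes.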

[Image: laboratory equipment and scientific glassware. Caption: Drug design is one of AI's most-cited potential applications. SMDD-Bench is the first benchmark built to test frontier models against real workflow complexity. Source: pexels.com]

Seven frontier models were assessed, both open- and closed-source. GPT-5.4, the top scorer, reached 40.2% completion. None of the others came close to the kind of score that would support autonomous deployment without heavy human oversight.

Why the Models Fail

The failure modes the paper identifies are instructive. 3D spatial reasoning about molecular structure doesn't emerge naturally from text-only training data. Long-horizon planning with limited validation steps punishes greedy inference strategies. Tool integration - calling chemistry software, parsing outputs, folding results into a running plan - is inconsistent across current agentic frameworks.

40.2% isn't a reason to abandon LLM-based drug design tooling. But it puts a firm ceiling on what current models can do autonomously, and it pinpoints where the capability gaps actually are. Targeted improvements in 3D reasoning and planning with constrained oracle budgets would move this number more than general scaling.

All three papers point at the same underlying gap: the distance between current benchmark performance and the reliability needed for real deployment. ArborKV makes tree reasoning cheaper - but it doesn't change when that reasoning actually works. The latent-space attack paper shows alignment is structurally fragile in ways that model scale doesn't fix. And SMDD-Bench puts a specific number - 40.2% - on the ceiling of what today's best model achieves in a domain that the field has been claiming AI will transform.

The number worth tracking over the next few quarters is whether any model crosses 50% on SMDD-Bench. That would be the first data point suggesting the gap is closing.

Sources: