AI Diagnosis, Cache Efficiency, and Agent Security

Today's arXiv drop has three papers worth your attention: one testing an AI diagnostic assistant in a randomized controlled trial, one compressing LLM KV caches to a few percent of their size without losing much quality, and one building a systematic framework for attacking deployed agent systems.

TL;DR

RaDaR - A 32B rare disease model beats DeepSeek-R1 (671B) on rare disease tasks, improves physician accuracy in a randomized controlled trial, and in 61% of cases identifies the right diagnosis before clinicians document it, by 1.87 months on average
CompressKV - Semantic Retrieval Heads compress KV caches to 3% of their size while keeping over 97% accuracy on QA benchmarks
RIFT-Bench - Graph-driven red-teaming across 45 agent architectures finds vulnerabilities that static methods miss completely

RaDaR: The 32B Model Beating a 671B Giant in Rare Disease Diagnosis

Rare disease diagnosis is one of medicine's hardest problems. Patients average years of misdiagnoses before getting an accurate label, cycling through specialists while the window for intervention narrows. A new paper from a team of 31 researchers led by Haichao Chen introduces RaDaR, a 32-billion-parameter open-source model trained specifically for rare disease cases - and its results are hard to argue with.

[Image: A medical research laboratory with scientific equipment and samples] RaDaR was trained on nearly 50,000 real clinical cases and more than 100,000 synthetic cases built around phenotype-anchored patient narratives. Source: unsplash.com

The team trained RaDaR on nearly 50,000 real clinical cases plus more than 100,000 synthetic cases, using a technique they call phenotype-anchored synthetic narratives to work around the chronic shortage of labelled rare disease data. The model was then put through a randomized controlled trial - not just a benchmark comparison against other models.

What the Trial Found

Physicians using RaDaR as an assistant improved their diagnostic accuracy by 21.44 percentage points compared to physicians using internet search alone. That's a meaningful clinical delta.

The lead time result is equally notable. RaDaR identified the correct diagnosis in 61% of cases before that diagnosis appeared in the official clinical documentation, with an average lead time of 1.87 months. For rare disease patients, that's the difference between months of additional wandering and starting the right treatment.

What earns this paper particular attention is the size comparison. RaDaR at 32B parameters consistently beat DeepSeek-R1 at 671B parameters on rare disease tasks. The authors attribute this to domain specialization: a compact model trained on the right data beats a much larger general-purpose model in constrained clinical settings. The development framework is reproducible and the model is open-source, which matters for deployment in under-resourced healthcare systems where access to large compute is limited.

If you're thinking about AI for health applications, this paper is a concrete reminder that scale alone doesn't win every benchmark.

CompressKV: 97% Performance With 3% of the Cache

Long-context inference has a well-known bottleneck: the KV cache (key-value cache), which stores the attention keys and values of past tokens, grows linearly with context length. For a 128K-token context, that's a large block of GPU memory tied up per sequence, driving up cost and shrinking batch sizes.
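To make the linear growth concrete, here is a back-of-the-envelope sketch of KV cache size per sequence. The model configuration (32 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical one chosen for illustration, not a figure from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical GQA model configuration, chosen only for illustration.
config = dict(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

for tokens in (8 * 1024, 32 * 1024, 128 * 1024):
    gib = kv_cache_bytes(tokens, **config) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.1f} GiB per sequence")
```

Under these assumptions a single 128K-token sequence needs 16 GiB of cache, and every doubling of context doubles that figure.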

[Image: Close-up of a GPU processor showing memory chips and compute components] KV cache memory grows linearly with context length in transformer inference - a real barrier to efficient long-context deployment. Source: unsplash.com

Xiaolin Lin and colleagues propose CompressKV, which takes a different approach to cache eviction than most existing methods. Instead of scoring all attention heads uniformly, the system identifies a small subset they call Semantic Retrieval Heads - attention heads that are especially good at locating relevant tokens across long contexts - and uses those heads to decide what to keep and what to drop.

How the Method Works

Most KV cache eviction methods apply the same eviction logic across every head in grouped-query attention (GQA) architectures. CompressKV argues this is wasteful because not all heads play the same role. A Semantic Retrieval Head has already figured out which tokens matter; the system exploits that signal to make better eviction decisions for everything else.
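The idea of letting a few heads decide what stays in the cache can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the function name, the score aggregation (a plain sum), and the attention values are all assumptions, and the set of retrieval heads is taken as given rather than identified the way the paper does it.

```python
def select_tokens_to_keep(attn, retrieval_heads, budget):
    """Return sorted positions of the `budget` cached tokens to keep.

    attn[h][t] is the attention weight head h assigns to cached token t.
    Only heads listed in `retrieval_heads` vote on token importance.
    """
    n_tokens = len(attn[0])
    # Sum the votes of the chosen heads for every cached token.
    scores = [sum(attn[h][t] for h in retrieval_heads) for t in range(n_tokens)]
    # Rank positions by score (highest first) and keep the top `budget`.
    ranked = sorted(range(n_tokens), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:budget])


# Toy attention weights: 3 heads over 6 cached tokens (made-up numbers).
attn = [
    [0.05, 0.60, 0.05, 0.05, 0.20, 0.05],  # head 0: focuses on tokens 1 and 4
    [0.20, 0.15, 0.15, 0.20, 0.15, 0.15],  # head 1: diffuse, little signal
    [0.05, 0.50, 0.05, 0.05, 0.30, 0.05],  # head 2: focuses on tokens 1 and 4
]

kept = select_tokens_to_keep(attn, retrieval_heads=[0, 2], budget=2)
print(kept)  # [1, 4]
```

In this sketch the kept positions would then be applied to every head's cache, so the diffuse head 1 inherits a decision made by heads with a clearer signal.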

On top of per-head selection, the method adds layer-wise cache budget allocation based on eviction error estimates. Layers that would lose more information from compression get proportionally more budget. The result is smarter compression rather than uniform squeezing.
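A minimal sketch of error-proportional budgeting follows. The proportional rule, the per-layer minimum, and the error values are assumptions made for illustration; the paper's exact error estimate and allocation rule are not reproduced here.

```python
def allocate_layer_budgets(errors, total_budget, min_per_layer=1):
    """Split `total_budget` cached tokens across layers in proportion to `errors`."""
    n = len(errors)
    if total_budget < n * min_per_layer:
        raise ValueError("total_budget too small for the per-layer minimum")
    # Reserve the minimum for each layer, then share the rest proportionally.
    spare = total_budget - n * min_per_layer
    total_error = sum(errors)
    if total_error == 0:
        shares = [spare / n] * n
    else:
        shares = [spare * e / total_error for e in errors]
    budgets = [min_per_layer + int(s) for s in shares]
    # Hand out tokens lost to rounding down, largest fractional part first.
    leftover = total_budget - sum(budgets)
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_fraction[:leftover]:
        budgets[i] += 1
    return budgets


# Hypothetical per-layer eviction error estimates for a 4-layer model.
errors = [1, 4, 3, 2]
budgets = allocate_layer_budgets(errors, total_budget=104)
print(budgets)  # [11, 41, 31, 21]
```

The layer with the largest estimated error (layer 1) receives the largest share, while the per-layer minimum keeps any layer from being emptied entirely.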

The numbers: on question-answering tasks, CompressKV maintains over 97% of baseline performance while using only 3% of the original KV cache. On Needle-in-a-Haystack retrieval benchmarks - the hardest test for long-context methods - it hits 90% accuracy with just 0.7% of KV storage. Both figures beat existing eviction methods across multiple memory budgets.

Code is publicly available at TUDa-HWAI/CompressKV. We covered related KV cache research earlier this year in our piece on programmatic agents and CoT compression. The difference here is the Semantic Retrieval Head signal, which is much more targeted than prior approaches. If you're running 128K-context inference at scale, this is a concrete place to cut costs.

RIFT-Bench: Graph-Driven Red-Teaming for Deployed Agents

As AI agents become more autonomous, the attack surface grows. An agent that can browse the web, execute code, and call external APIs doesn't just face the usual LLM safety risks - it faces vulnerabilities tied to its architecture, tool access, and decision-making chain. Standard LLM red-teaming frameworks weren't built for that combination.

[Image: Close-up view of computer code on a terminal screen in a security context] Red-teaming agent architectures requires systematic, automated tooling. Manual testing doesn't scale to 45 distinct system configurations. Source: pexels.com

Yarin Yerushalmi Levi and colleagues introduce RIFT-Bench, a red-teaming methodology designed specifically for agentic systems. The core idea is to represent an agent system as a graph - capturing its components, tool interfaces, information flows, and decision points - and use that graph to guide automated adversarial attacks.

How RIFT-Bench Works

The framework runs in two automated phases. Discovery extracts the system's structure: what tools are exposed, how they interact, what trust boundaries exist between components. Scanning then deploys adaptive adversarial attacks guided by what Discovery found.

The framework doesn't require manual attack scripting: both phases run automatically, and the attacks launched in Scanning are chosen based on the structure Discovery extracted.
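To show why a graph is a useful guide, here is a small sketch that models an agent system as a directed graph of information flows and lists the routes by which untrusted input can reach a sensitive tool. Every name in it (the components, the trust labels, the `attack_paths` function) is hypothetical, and static path enumeration is a stand-in: it is not RIFT-Bench's Discovery or Scanning logic, which the paper describes as automated and adaptive.

```python
from collections import deque

# Hypothetical agent system: nodes are components, edges are information flows.
# Node names and trust labels are invented for this example.
flows = {
    "web_page": ["browser_tool"],
    "browser_tool": ["planner"],
    "user_prompt": ["planner"],
    "planner": ["code_executor", "email_api"],
    "code_executor": ["planner"],
    "email_api": [],
}
untrusted_sources = {"web_page"}
sensitive_sinks = {"code_executor", "email_api"}


def attack_paths(flows, sources, sinks):
    """Breadth-first search for cycle-free paths from untrusted sources to sensitive sinks."""
    paths = []
    for source in sorted(sources):
        queue = deque([[source]])
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node in sinks:
                paths.append(path)
                continue  # stop at the first sensitive component reached
            for nxt in flows.get(node, []):
                if nxt not in path:  # avoid cycles
                    queue.append(path + [nxt])
    return paths


for path in attack_paths(flows, untrusted_sources, sensitive_sinks):
    print(" -> ".join(path))
```

Each printed path is a candidate for targeted testing: in this toy system, content on a web page can reach both the code executor and the email API through the planner, so those two routes would be the first places to aim adversarial inputs.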

The team validated RIFT-Bench across 45 different agent architectures with varying implementations - different tool sets, different orchestration patterns, different memory mechanisms. It worked across all of them without needing system-specific customization, which is the whole point: a method that only works on one agent configuration isn't a methodology.

The paper also covers using RIFT-Bench to evaluate mitigation strategies, not just find vulnerabilities, which makes it useful for hardening during development rather than just testing at the end. You can see how this fits into the broader capability picture by checking the agentic AI benchmarks leaderboard - most of those evaluations focus on task completion, not resilience to adversarial inputs. RIFT-Bench is a practical complement to AI safety and alignment work that has mostly stayed theoretical.

The Common Thread

Three papers, three domains, one pattern: specificity beats generality. RaDaR shows a purpose-built 32B model beating a 671B general-purpose one in a narrow clinical domain. CompressKV shows that targeting the right attention heads beats treating all heads identically. RIFT-Bench shows that graph-structured knowledge of a specific agent system beats generic attack templates.

The assumption that scaling general systems always wins is getting tested on multiple fronts at once. These three papers add concrete data points to that case.

Sources: