Science Articles

Agents Hit 89%, Evals Get a Schema, Memory Falls Short

Three papers from today's arXiv: workplace agents jumped from 43% to 89% task completion in two years, a 47-researcher coalition ships a unified eval schema, and agent memory only helps when similarity tops 0.8.

Tool Blindness, Tree Search, and the Road to ASI

Three new papers expose a 50-point gap in agent tool knowledge, show tree search tripling inference throughput, and map the research between AGI and superintelligence.

Honest AI Is Provably Impossible - Plus Two Agent Wins

A new impossibility theorem proves feedback-based training can't guarantee honest AI, while two papers cut agent memory costs 78% and multi-agent latency 7x.

Context Overload, Memory Leaks, and Agent Safety

Three new arXiv papers expose how context bloat tanks agent performance, agent memory bleeds private data, and misaligned behavior spreads through multi-agent systems.

MCP Exploit Risk, Sycophancy Scores, and Agent Self-Harm

New research reveals MCP error messages triple agent attack success rates, ranks eight models on sycophancy with Claude scoring best, and finds self-evolving agents make 30-42% false edits.

Safety Evals Break Under Attack, Agents Work 87% Faster

Three papers: strategic attack timing exposes gaps in AI control evaluations, Perplexity's agents slash task time by 87%, and Lean4 formal proofs make agent workflows more reliable.

AI Sabotage Blind Spots, Code Drift, and ZK Proofs

Three new arXiv papers expose how developers miss AI sabotage 94% of the time, why LLMs converge structurally in code evolution, and how ZK proofs could verify frontier AI training.

AI Attachment, Smarter Spending, and Cascading RAG Errors

Three new papers tackle how routine AI use quietly rewires emotional habits, how to spend compute where failures cost most, and why agentic RAG errors compound before anyone notices.

When to Stop - Overthinking, Handoffs, and Abstention

Three new papers show that AI agents fail not by doing the wrong thing, but by doing things when they should have stopped.

Reasoning Leaks, Hard Limits, and Self-Aware LLMs

Three new papers expose how reasoning traces can be extracted from supposedly hidden model internals, where chain-of-thought hits an architectural ceiling, and how RL teaches models to know when to quit.

Cut CoT Costs, Fix Agent Memory, Test Clinical AI

Three papers: smarter CoT trimming cuts reasoning length by 50%, a plug-in context manager rescues frozen agents on long tasks, and a 960K-item clinical benchmark exposes LLM gaps in hospitals.

Reasoning Capitulation, Faster Guardrails, Curation Risk

Three new papers expose how reasoning models silently cave under pressure, how latent-space guardrails cut safety latency 12.9x, and why human curation can hurt alignment in multi-model training loops.
