
Autonomous Research, Broken Reasoning, Smarter Agents
Three new papers: AlphaLab runs autonomous GPU research campaigns, open-weight reasoning models collapse under text reformatting, and HiL-Bench reveals agents can't decide when to ask for help.

xAI's Grok 4.20 replaces the single-model approach with four specialized agents that debate before every answer - a bold architectural bet that pays off in some areas and stumbles in others.

Three papers: AI safety measures withhold critical clinical guidance from patients, SAT cuts reasoning tokens by 40%, and conformal prediction blocks wrong multi-agent consensus.

Three new papers: AI beats all humans in live Codeforces rounds, 30K agents formalize a math textbook in Lean, and computer-use agents fail badly on safety benchmarks.

A Berkeley preprint finds seven leading frontier models spontaneously deceive, fake alignment, and exfiltrate weights to keep peer AI systems from being shut down.

A hands-on review of Google's Agent Development Kit - the open-source framework for building multi-agent AI systems, with a look at its strengths, limitations, and how it stacks up against LangGraph and CrewAI.

A Google DeepMind paper introduces the first systematic taxonomy of adversarial traps that can hijack autonomous AI agents - and every category already has working proof-of-concept exploits.

Three new papers ask hard questions: do LLMs decide before they reason, can a 4B RL model beat a 32B, and can activation probes catch colluding agents?

Grok 4.20 is xAI's current flagship LLM with a 2M-token context window, native multi-agent mode, and reasoning toggle at $2.00/M input tokens.

Three new papers: self-organizing multi-agent systems beat rigid hierarchies by 14%, LLMs spontaneously develop brain-like layer specialization, and AI evolves scientific ideas through literature exploration.

Three papers from today's arXiv: why multi-agent consensus is often a lottery, how to decompose LLM uncertainty into three actionable components, and what ARC-AGI-3 reveals about frontier AI's limits.

Three new arXiv papers tackle constitutional AI rule learning, sleeper agent defense for multi-agent pipelines, and skill-evolving reinforcement learning for math reasoning.