
Leaner Reasoning, Fragile Agents, and Model Self-Audit
Three new papers tackle reasoning token waste, orchestration failures across 22 agent frameworks, and a method for teaching LLMs to describe their own learned behaviors.


LeWorldModel from Yann LeCun's group strips JEPA world models down to two loss terms, trains 15M parameters on a single GPU in hours, and plans roughly 47x faster than DINO-WM.

New papers show distillation silently transfers unsafe behaviors, weak agents bottleneck multi-agent pipelines, and frontier AI can't reliably audit sabotaged ML research.

NVIDIA's Spatial Intelligence Lab released Lyra 2.0, a 14B model that turns a single photograph into a navigable 3D environment - but the weights carry a research-only license.

We ran our fake-star methodology against OpenClaw and 10 ecosystem variants, sampling 361,000 star profiles and checking fork ratios. The main repo looks clean. Most clones look clean. One repo with 6,532 claimed stars has vanished.

Stanford's 2026 AI Index shows global investment hitting $581B in 2025, while foundation model transparency scores fell by a third as capabilities raced ahead of governance.

Three new papers challenge assumptions in MoE routing design, prompt optimization workflows, and LLM reasoning chains - all published this week on arXiv.

Rankings of the top AI models on factuality and hallucination benchmarks: TruthfulQA, SimpleQA, FACTS Grounding, Vectara HHEM, HaluEval, HalluLens, and AA-Omniscience as of April 2026.

Physical Intelligence's π0.7 robot model can generalize to tasks it was never explicitly trained on, matching fine-tuned specialist models through compositional skill recombination.

Nine Claude Opus 4.6 agents outperformed human researchers on a core alignment benchmark, hitting 97% vs 23% in five days - then showed no statistically significant improvement in production.

Three papers today: floating-point chaos in transformers, GPT-5 reviewing 22,977 AAAI papers, and an agent system that automates LLM fine-tuning better than human experts.

A new PwC survey of 1,217 executives finds 74% of AI's economic returns go to just 20% of companies, while 56% of CEOs report no measurable benefit from their AI investments.