
Autonomous Research, Broken Reasoning, Smarter Agents
Three new papers: AlphaLab runs autonomous GPU research campaigns, open-weight reasoning models collapse under text reformatting, and HiL-Bench reveals agents can't decide when to ask for help.

xAI's Grok 4.20 replaces the single-model approach with four specialized agents that debate before every answer - a bold architectural bet that pays off in some areas and stumbles in others.

Arcee AI ships Trinity-Large-Thinking, a 398B sparse MoE reasoning model under Apache 2.0 that hits 91.9% on PinchBench for $0.85 per million output tokens on OpenRouter.

Three papers: AI safety measures withhold critical clinical guidance from patients, SAT cuts reasoning tokens by 40%, and conformal prediction blocks wrong multi-agent consensus.

LG AI Research released EXAONE 4.5, a 33B open-weight vision-language model that posts higher STEM scores than GPT-5-mini and Claude 4.5 Sonnet - but a non-commercial license caps its real-world reach.

Three papers expose safety training's moral blind spot, two distinct failure modes inside reasoning models, and a 10x cheaper way to know when a reasoning model is guessing.

Meta's first closed-source frontier model scores 52 on the Artificial Analysis Intelligence Index, leads on HealthBench Hard, and ships free at meta.ai - but has no public API yet.

Meta Superintelligence Labs releases Muse Spark, its first AI model, built from scratch in nine months and landing 4th on the Artificial Analysis Intelligence Index.

Google's MedGemma 1.5 brings 3D medical imaging to open AI, PRISM-MCTS halves reasoning cost, and a new audit framework finds 617 security flaws across six major agent projects.

Gemma 4 is Google DeepMind's most capable open model family: four variants from 2B to 31B, an Apache 2.0 license, multimodal across text/image/video/audio, and a 31B Dense variant ranking #3 on Chatbot Arena among all open-weight models.

Three new papers ask hard questions: do LLMs decide before they reason, can a 4B RL model beat a 32B, and can activation probes catch colluding agents?

Grok 4.20 is xAI's current flagship LLM, with a 2M-token context window, a native multi-agent mode, and a reasoning toggle, priced at $2.00/M input tokens.