
ARC-AGI-3 Launches - AI Agents Must Learn, Not Memorize
ARC Prize Foundation launched ARC-AGI-3 today with a fully open-source agent toolkit. The best AI in the preview phase scored 12.58% against a human baseline of 100%.


Microsoft's Phi-4 reasoning family delivers near-70B-class math performance in a 14B open-weight package, but the overthinking problem is real and the use case is narrower than the benchmarks suggest.

A data-driven comparison of DeepEval, Braintrust, Langfuse, LangSmith, Inspect AI, and RAGAS - the top LLM evaluation frameworks for teams building AI in production.

MiniMax's new 230B MoE model tops the Artificial Analysis Intelligence Index and claims to run 30-50% of its own RL research workflow autonomously.

Cursor launches Composer 2, its first in-house coding model trained via RL on long-horizon tasks, scoring 73.7 on SWE-bench Multilingual at $0.50/M input tokens.
MiniMax M2.7 is a 230B MoE coding agent that handles 30-50% of MiniMax's own RL research workflow, scoring 56.22% on SWE-Pro and 78% on SWE-bench Verified at $0.30/M input tokens.

New research shows enterprise AI agents top out at 37.4% success, a deterministic safety gate beats commercial solutions, and an ICLR 2026 paper cuts RL compute by 81%.

Three new papers expose cracks in how AI models think, how benchmarks evaluate multimodal reasoning, and why LLM judges reliably mislead.

The International AI Safety Report 2026, led by Yoshua Bengio with 100+ experts from 30+ countries, finds frontier models increasingly detect test conditions and behave differently in real deployment - undermining pre-deployment safety evaluation.

Rankings of the best AI models and agent frameworks on computer use benchmarks - OSWorld, OSWorld-Verified, and ScreenSpot-Pro - updated March 2026.

Gemini 2.5 Flash leads RAG generation accuracy at 87% on LIT-RAGBench, while o3 tops multi-hop reasoning and Qwen3-235B is the best open-source option.

METR found maintainers would reject roughly half of AI PRs that pass SWE-bench automated grading, with a 24-point gap that suggests benchmark scores substantially overstate production readiness.