
Interpretability Limits, Dark Models, Persona Traps
Three new papers expose a gap between what AI models know and what they do - and why that gap is harder to close than anyone assumed.

Anthropic's mid-tier model matches Opus 4.6 on computer use, leads all models on office productivity tasks, and costs a fifth of the flagship's price at $3/$15 per million tokens.

Google's Gemini 3.1 Flash-Lite delivers frontier-class benchmarks at a fraction of the cost of Pro - but a sluggish first-token response and preview-only status mean it's not for every workload.

We ran the GitHub search query from a researcher's blog post and confirmed 300+ malicious repositories with AI-generated READMEs distributing info-stealers - with the real number likely north of 1,000.

Gemini 2.5 Flash leads RAG generation accuracy at 87% on LIT-RAGBench, while o3 tops multi-hop reasoning and Qwen3-235B is the best open-source option.

Gemini 2.5 Flash Lite leads the Vectara hallucination leaderboard at 3.3% error rate while GPT-4o and Gemini 2.5 Pro dominate long-document tasks - full rankings, benchmark scores, and pricing.