Articles Tagged "Benchmarks"

Agent Phase Collapse, Reasoning Exits, Preference Gaps

Three new arXiv papers map capability cliffs in agent world models, the narrow benefit of learned reasoning stops, and a 56% accuracy ceiling when agents help users build preferences.

Claude Sonnet 5 Review: Near-Opus at Half the Price

Anthropic's Sonnet 5 is the first mid-tier model that genuinely competes with Opus-class agents on coding and computer use, released June 30 at $2/$10 per million tokens.

Chatbot Arena Elo Rankings: Who Wins the Human Vote?

Updated July 2026 Chatbot Arena Elo rankings from Arena.ai: 7M+ votes across 368 models, Claude Opus 4.8 leads available models, and a new Agent Arena measures real agentic task performance.

Claude Sonnet 5

Anthropic's latest Sonnet-class model brings near-Opus coding performance to mid-tier pricing, with major agentic search and computer use gains over Sonnet 4.6.

Claude Sonnet 5 Is Anthropic's New Agentic Default

Anthropic's Claude Sonnet 5 becomes the default model across all plans, promising near-Opus agentic performance at a third less than Sonnet 4.6's standard price.

Grok 4.5 Enters Beta with Unverified Opus Claims

Elon Musk announced Grok 4.5 is in private beta at SpaceX and Tesla, claiming it rivals Claude Opus - but the only benchmarks cited are internal ones run at Musk's own companies.

GLM-5.2 Review: Best Open-Weight Coder at 1/6 Cost

Z.ai's GLM-5.2 delivers frontier coding performance with open weights and an MIT license at roughly one-sixth the cost of GPT-5.5 - but can it replace Claude Opus 4.8?

Gemini 3.5 Pro

Google DeepMind's upcoming flagship model with a 2M-token context window and Deep Think reasoning, announced at Google I/O 2026 and expected in July.

Google Delays Gemini 3.5 Pro Past Its I/O Promise

Gemini 3.5 Pro missed its June launch window promised at Google I/O, slipping to July as four senior researchers departed for rivals.

Claude Mythos 5 is the full release of Anthropic's restricted Mythos family - same weights as Fable 5 but without safety classifiers for cybersecurity and biology, at $10/M input and $50/M output tokens.

Grok 4.3 Review: xAI Bets on Price Over Prestige

Grok 4.3 slashes prices by up to 83%, adds native video input and voice cloning, and carves out a credible position as the most cost-efficient frontier model - with real caveats on coding and latency.

North Mini Code

Cohere's first developer-focused model - a 30B sparse MoE with 3B active parameters, a free Apache 2.0 license, a 256K context window, and a score of 33.4 on the AA Coding Index.
