Benchmarks

GPT-5.4 Lands with Computer Use and 1M Token Context

OpenAI ships GPT-5.4 with built-in computer use that beats human desktop performance, a 1 million token context window, and native Excel and Google Sheets integrations.

Corrupt Agent Scores, Memory Bottlenecks, Skill Evolution

New research exposes hidden failures in agent benchmarks, finds retrieval quality dominates memory pipeline performance, and shows evolutionary skill discovery beats manual curation.

Mercury 2 Review: 1,000 Tokens per Second, Tested

Mercury 2 by Inception Labs is the fastest reasoning LLM available, built on diffusion architecture. We tested the speed, quality, and real-world trade-offs.

Qwen3.5 MoE vs Kimi K2.5 for Coding - Price Breakdown

Kimi K2.5 leads every coding benchmark, but Qwen3.5-35B-A3B delivers 87-93% of that performance at 3-4x lower cost and runs on a single consumer GPU. Here is the full breakdown.

Mercury 2 Is 13x Faster Than Claude Haiku - Verified

Inception Labs' Mercury 2 hits 1,196 tokens per second in independent testing - a diffusion architecture that rewires how inference works.

Gemini 3.1 Pro Tops Benchmarks but Developers Can't Rely on It

Gemini 3.1 Pro leads ARC-AGI-2, LiveCodeBench, and 11 other benchmarks with 750 million users and 21.5% market share - but developers report stalled responses, leaked thinking tokens, and API outages that make it unusable for production coding and agent workflows.

← Previous