
AI Speed and Latency Leaderboard: Tokens/s Rankings
Rankings of the fastest AI models and inference providers by tokens per second, time to first token, and end-to-end latency.
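The three metrics the leaderboard ranks by can be computed from simple client-side timestamps. A minimal sketch, assuming timestamps in seconds; the function name and the convention of measuring output speed from the first token onward are illustrative assumptions, not the leaderboard's published methodology:

```python
def speed_metrics(request_start, first_token_time, last_token_time, output_tokens):
    """Illustrative computation of the three ranking metrics.

    request_start, first_token_time, last_token_time: timestamps in seconds.
    output_tokens: total tokens in the completion.
    """
    ttft = first_token_time - request_start   # time to first token
    e2e = last_token_time - request_start     # end-to-end latency
    # Output speed: tokens streamed after the first one, per second of generation
    tokens_per_sec = (output_tokens - 1) / (last_token_time - first_token_time)
    return ttft, e2e, tokens_per_sec

# Example: first token after 0.5 s, 501 tokens finishing at 10.5 s
ttft, e2e, tps = speed_metrics(0.0, 0.5, 10.5, 501)
# ttft = 0.5 s, e2e = 10.5 s, tps = 50.0 tokens/s
```

Excluding the first token from the throughput calculation separates queueing/prefill delay (captured by TTFT) from steady-state decoding speed.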

Rankings of the best AI models for multilingual tasks, covering 16 languages across the Artificial Analysis Multilingual Index and MGSM benchmarks.

New research shows that reasoning models can't suppress their chain-of-thought, that they commit to answers internally long before their CoT reveals it, and that static benchmarks are inadequate for measuring real-world agent adaptability.

Alibaba's SWE-CI benchmark tested 18 AI models on 100 real codebases across 233 days of maintenance. Most agents accumulate technical debt and break previously working code; only Claude Opus stays above a 50% zero-regression rate.

GPT-5.4 brings native computer use, a 1M token context window, and serious coding muscle to OpenAI's mainline model, but at a premium price.

New research reveals that models can fake poor performance under adversarial prompts, that a smarter critic improves SWE-bench scores by 15 points, and that Microsoft's compact vision models can punch above their weight.