Benchmarks

Speech Turing Tests, Smart Routing, Pseudocode Agents

New research finds that no speech AI passes a Turing test, that adaptive routing cuts LLM costs by 82%, and that pseudocode planning improves agent reliability.

Math Olympiad AI Leaderboard - March 2026 Rankings

Rankings of AI models on competition mathematics benchmarks including AIME 2025, IMO, MathArena, and FrontierMath, measuring the cutting edge of mathematical reasoning.

China's GLM-5 Rivals GPT-5.2 on Zero Nvidia Silicon

Zhipu AI's 744B-parameter open-source model, GLM-5, was trained entirely on Huawei Ascend chips and now competes with GPT-5.2 and Claude Opus on major benchmarks.

Agentic AI Benchmarks Leaderboard - GAIA, WebArena, BFCL, and Tau2-Bench

Rankings of the best AI models and agent frameworks on agentic benchmarks measuring real-world task completion, web navigation, function calling, and multi-turn tool use.

QuiverAI's Arrow 1 Hits #1 on SVG Arena With 1583 Elo - First Model to Break 1500 on Any Design Arena Leaderboard

Arrow 1, a purpose-built SVG generation model from a16z-backed QuiverAI, reached the top of SVG Arena with an Elo of 1583 one day after launch - shattering Gemini 3.1 Pro's previous record of 1421 by 162 points.

Gemini 3.1 Pro Review: Google's Reasoning Leap Is Real - With Caveats

Google's Gemini 3.1 Pro more than doubles its predecessor's reasoning scores and introduces adjustable thinking modes, but latency issues and preview-status quirks keep it from a clean sweep.
