
Reasoning Benchmarks: GPQA, AIME, and Humanity's Last Exam
Rankings of AI models on the hardest reasoning benchmarks available: GPQA Diamond, AIME competition math, and the notoriously difficult Humanity's Last Exam.

Anthropic launches Claude Opus 4.6, featuring agent teams, adaptive thinking, a 1M-token context window, and state-of-the-art performance on Terminal-Bench 2.0 and Humanity's Last Exam.

A comprehensive review of xAI's Grok 4, the first model to score 50% on Humanity's Last Exam, featuring Heavy and Coding variants with built-in tool use.