
Reasoning Benchmarks: GPQA, AIME, and Humanity's Last Exam
Rankings of AI models on the hardest reasoning benchmarks available: GPQA Diamond, AIME competition math, and the notoriously difficult Humanity's Last Exam.

Rankings of the best multimodal AI models for image understanding, video analysis, and visual reasoning, covering MMMU-Pro, Video-MMMU, and more.

Rankings of the best open-weight and open-source large language models in February 2026, including DeepSeek V3.2, Qwen 3.5, Llama 4 Maverick, GLM-5, and Mistral 3.

Complete MMLU-Pro benchmark rankings measuring graduate-level knowledge across 14 subjects with 12,000 questions and 10 answer options per question.

Rankings of the best AI image generation models including GPT Image 1.5, Gemini 3 Pro, Midjourney v7, FLUX 2 Max, Stable Diffusion 3.5, and Ideogram 2.0 across text rendering, photorealism, and artistic quality.

Rankings of AI models by cost efficiency, comparing performance per dollar across frontier and budget models. See which models deliver GPT-4-level performance at 1/100th the cost.