Best AI Benchmarks to Watch in 2026
A practical guide to 30+ active AI benchmarks - what each one tests, who publishes it, how to read the scores, and where it breaks down. Organized by capability.

Every model release comes with a scoreboard. MMLU: 92%. SWE-Bench Verified: 70%. GPQA-Diamond: 78%. The numbers keep climbing. But if you don't know what each benchmark actually measures, who designed it, and where it falls apart, those scores tell you almost nothing.
This guide is the one I wish existed when I started tracking model capabilities seriously. It covers 30+ active benchmarks organized by capability - what they test, how scoring works, who publishes them, which models lead, and what the known failure modes are. Think of it as a field guide for reading the scoreboards.
If you want the deeper argument about why benchmarks are increasingly unreliable in general, I've covered that separately in Do AI Benchmarks Still Matter? - that piece is required reading alongside this one. This guide assumes you've accepted that benchmarks are imperfect but still useful, and focuses on helping you use them correctly.
TL;DR - The benchmarks that matter most by capability
- Coding: SWE-Bench Verified (real GitHub issues), LiveCodeBench (contamination-resistant)
- Reasoning: GPQA-Diamond (expert-level science), AIME 2025 (competition math), FrontierMath (unsolved research math)
- Multimodal: MMMU-Pro (multi-discipline visual), MathVista (visual math reasoning)
- Agentic: GAIA (real-world task completion), tau-bench (tool use in realistic environments)
- Safety: HarmBench (adversarial robustness), HolisticBias (representation fairness)
- Long context: HELMET (holistic evaluation across use cases), RULER (synthetic needle tests)
- The most honest benchmark right now: FrontierMath - frontier models score 2-5%, and you can't fake that
How to Read a Benchmark Score
Before digging into individual benchmarks, here is the framework I use when a benchmark score appears in a model release post.
Check the release date of the benchmark. Benchmarks saturate. Once a test's top scores cluster above 90%, it stops differentiating. GPT-3 scored roughly 45% on MMLU in 2020; by 2025, frontier models were at 90%+. An "industry-leading" MMLU score in 2026 means nothing.
Check N (number of questions). Benchmarks with fewer than 500 questions have high variance. On GPQA-Diamond's 198 questions, five answers flipping moves the score by about 2.5 points; on AIME's 30 problems, a single answer is worth more than 3 points. Take small-N scores with appropriate skepticism.
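To make the small-N problem concrete, here is the binomial standard error of an accuracy score - a rough sketch that treats each question as an independent coin flip (real run-to-run variance also depends on prompting and decoding temperature):

```python
import math

def score_stderr(accuracy: float, n: int) -> float:
    """Standard error of a benchmark accuracy score, treating each
    question as an independent Bernoulli trial."""
    return math.sqrt(accuracy * (1 - accuracy) / n)

# GPQA-Diamond: 198 questions at ~80% accuracy
print(round(score_stderr(0.80, 198) * 100, 1))  # 2.8 points per standard error

# AIME: 30 problems at ~80% accuracy
print(round(score_stderr(0.80, 30) * 100, 1))   # 7.3 points
```

At N = 30, two models whose true capability differs by five points can easily swap leaderboard positions between runs.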
Check the eval framework. The same model can score differently depending on whether the evaluator uses zero-shot, few-shot, chain-of-thought prompting, or an agentic scaffold. Labs routinely choose the setup that favors their model. Apples-to-apples comparisons require identical eval frameworks.
Check for cherry-picking. A lab reporting "best-in-class on coding benchmarks" may be citing only the benchmarks where their model leads. Ask which benchmarks they didn't report.
Check contamination risk. Static benchmarks with public questions are contamination risks. If a model trained after a benchmark's release date, the questions may have appeared in its training data. See our contamination analysis for documented cases.
Check who ran the eval. Self-reported scores are less credible than third-party evaluations. The overall LLM rankings use standardized frameworks to reduce this variance.
Reasoning Benchmarks
View leaderboard data for reasoning benchmarks
Full rankings tracked at /leaderboards/reasoning-benchmarks-leaderboard/.
MMLU-Pro
What it tests: Graduate and professional knowledge across 14 disciplines - law, medicine, engineering, mathematics, history, psychology, and more. Harder than the original MMLU because wrong answer options are more plausible (10 choices vs. 4), and many questions require domain-specific reasoning rather than retrieval.
Who publishes it: TIGER-Lab (University of Waterloo). Released 2024.
How scoring works: Multiple-choice accuracy. Top frontier models score 85-91%. Scores below 65% indicate mid-tier capability.
Top scorers (as of April 2026): GPT-5.4 (~91%), Claude Opus 4.6 (~90%), Gemini 3.1 Pro (~89%).
Known limitations: Still susceptible to contamination on older discipline-specific questions. The 10-choice format reduces but doesn't eliminate option elimination shortcuts. Does not test applying knowledge to novel problems.
Official URL: huggingface.co/datasets/TIGER-Lab/MMLU-Pro
GPQA-Diamond
What it tests: Expert-level science questions (biology, chemistry, physics) written by PhD-level domain experts. The "Diamond" subset is the hardest tier - questions that non-expert PhDs typically answer wrong even with Google access. Designed to be Google-proof.
Who publishes it: David Rein et al. (NYU, Cohere, and Anthropic). Released 2023.
How scoring works: Multiple-choice, 4 options. Human expert baseline is ~65%. Non-expert humans with internet access score ~34%. Top frontier models now exceed the expert human baseline.
Top scorers: Frontier models cluster 75-84%. Human expert PhDs average ~65%.
Known limitations: Only 198 questions in the Diamond set - high variance. Expert-written questions cluster in hard science; no humanities or social science coverage. Possible soft contamination via arXiv papers describing the dataset.
Official URL: arxiv.org/abs/2311.12022 - github.com/idavidrein/gpqa
ZebraLogic
What it tests: Constraint satisfaction puzzles (logic grid puzzles similar to Einstein's riddle). Models must deduce unique assignments from a set of clues. Scales from 2x2 grids to 6x6 grids. Tests pure deductive reasoning without factual knowledge.
Who publishes it: Allen Institute for AI. Released 2024.
How scoring works: Exact-match accuracy per grid. Top models score ~60-75% on large grids; humans given enough time score near 100%. Contamination risk is low because grids are procedurally generated.
Top scorers: Models with extended thinking (Claude Opus 4.6, o3, Gemini 3.1 Pro Thinking) score significantly higher than base models.
Known limitations: A February 2026 study of soft contamination found that 50% of ZebraLogic problems had semantic near-duplicates in common training corpora. Procedural generation helps but doesn't fully solve this.
Official URL: huggingface.co/datasets/allenai/ZebraLogic
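The grading model is easy to sketch: generate a grid from clues, brute-force the unique satisfying assignment, and exact-match the model's answer against it. A toy 3-house example (the clues are illustrative, not from the actual dataset):

```python
from itertools import permutations

pets = ("cat", "dog", "fish")

def satisfies(order):
    """order[i] is the pet living in house i+1."""
    pos = {pet: i + 1 for i, pet in enumerate(order)}
    # Clue 1: the cat is not in house 1. Clue 2: the dog is directly right of the cat.
    return pos["cat"] != 1 and pos["dog"] == pos["cat"] + 1

# A well-formed puzzle has exactly one satisfying assignment.
solutions = [p for p in permutations(pets) if satisfies(p)]
assert len(solutions) == 1
print(solutions[0])  # ('fish', 'cat', 'dog')

def grade(model_answer, solution):
    """Exact-match grading: every cell of the grid must be correct."""
    return tuple(model_answer) == solution
```

Scaling the same scheme to 6x6 grids with many attribute categories is what makes the large puzzles hard for models while staying trivially verifiable.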
BIG-Bench Hard (BBH)
What it tests: 23 tasks from the BIG-Bench suite that stumped models when the original benchmark was designed. Includes multistep arithmetic, word sorting, causal judgment, temporal sequences, and logical deduction. Specifically selects tasks where models underperformed humans at release.
Who publishes it: Google Research / DeepMind. BIG-Bench released 2022; the BBH subset was curated later that year.
How scoring works: Averaged exact-match accuracy across 23 tasks. Models that use chain-of-thought prompting score significantly higher. Frontier models now score 90%+ on most tasks - the benchmark is saturating.
Known limitations: The benchmark is aging. Most tasks are now solved by frontier models. Use as a basic capability floor test, not a differentiator.
Official URL: github.com/google/BIG-bench
AIME 2024 / AIME 2025
What it tests: American Invitational Mathematics Examination problems - competition math requiring proof-adjacent reasoning. AIME 2024 has 30 problems from official 2024 exams. AIME 2025 added the 2025 exam problems. Both test symbolic manipulation, number theory, combinatorics, and geometry at competition level.
Who publishes it: Mathematical Association of America (original exams). Third-party evaluators collect and report model scores.
How scoring works: Integer answers 0-999. No partial credit. Frontier models score 20-29/30 on AIME 2024; AIME 2025 (less likely to be contaminated) shows a few points lower.
Known limitations: Only 30 problems per year - high variance per run. AIME 2024 questions appeared in training data for most frontier models. AIME 2025 is the more reliable signal for current reasoning capability.
Official URL: artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions
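Because every AIME answer is an integer from 0 to 999 with no partial credit, grading reduces to extracting a final integer from the model's output and comparing exactly. A minimal sketch - the extraction regex is a hypothetical heuristic, not an official harness:

```python
import re

def grade_aime(model_output: str, gold: int) -> bool:
    """Take the last standalone 1-3 digit integer in the output as the
    final answer; exact match only, no partial credit."""
    matches = re.findall(r"\b\d{1,3}\b", model_output)
    return bool(matches) and int(matches[-1]) == gold

print(grade_aime("Summing the cases gives N = 204.", 204))  # True
print(grade_aime("So the answer is 205.", 204))             # False
```

Real harnesses typically prompt for a fixed answer format (e.g. a boxed final answer) so extraction errors don't masquerade as reasoning errors.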
MATH
What it tests: 12,500 competition math problems across 5 difficulty levels (1-5) and 7 subjects (prealgebra, algebra, intermediate algebra, counting & probability, geometry, number theory, precalculus). Tests symbolic mathematical reasoning and multi-step problem solving.
Who publishes it: Dan Hendrycks et al. (UC Berkeley). Released 2021.
How scoring works: Exact match on symbolic answers, normalized for equivalent forms. Frontier models now score 85-95% overall. The Level 5 subset remains meaningful for differentiating frontier models.
Known limitations: Significantly contaminated - MATH problems and solutions are extensively indexed online. The Level 5 subset is more useful than overall MATH score. AIME is a better 2026 proxy for math reasoning.
Official URL: github.com/hendrycks/math
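"Normalized for equivalent forms" means the grader must treat 1/2, 0.5, and equivalent radicals as the same answer. The official grader does this symbolically; here is a crude numeric stand-in (stdlib only - the tolerance and fallback behavior are assumptions for illustration):

```python
import math

def answers_equivalent(pred: str, gold: str, tol: float = 1e-9) -> bool:
    """Numerically evaluate two closed-form answers and compare.
    Falls back to string match if an expression doesn't evaluate."""
    allowed = {"sqrt": math.sqrt, "pi": math.pi}
    try:
        return abs(eval(pred, {"__builtins__": {}}, allowed)
                   - eval(gold, {"__builtins__": {}}, allowed)) < tol
    except Exception:
        return pred.strip() == gold.strip()

print(answers_equivalent("1/2", "0.5"))            # True
print(answers_equivalent("sqrt(8)", "2*sqrt(2)"))  # True
print(answers_equivalent("3/4", "0.5"))            # False
```

Numeric comparison can produce false positives when distinct expressions coincide numerically, which is why production graders simplify symbolically instead.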
FrontierMath
What it tests: Novel, expert-crafted mathematics problems at research frontier level - problems involving original mathematical constructions that have never appeared online. Designed specifically to resist contamination. Problems span algebraic geometry, number theory, complex analysis, and other advanced domains.
Who publishes it: Epoch AI. Released late 2024. Problems verified by Fields Medal-level mathematicians.
How scoring works: Exact symbolic answer verification. Frontier models score approximately 2-5%. Human professional mathematicians working with tools solve 60-80%.
Top scorers: Best current models score under 5% - this benchmark is doing exactly what it's supposed to do.
Why it matters: This is the most honest benchmark currently available for mathematical reasoning. No contamination possible (problems are novel). No pattern-matching shortcuts. A 5% score on FrontierMath is more informative than a 92% score on MATH.
Official URL: arxiv.org/abs/2411.04872
Coding Benchmarks
View full coding benchmark leaderboard
Rankings tracked at /leaderboards/coding-benchmarks-leaderboard/.
SWE-Bench Verified
What it tests: Real GitHub issues from popular open-source Python repositories. Each task requires a model (or agent) to read an issue description, understand the codebase, write a patch, and pass the repo's existing test suite. "Verified" means human annotators have confirmed the issue and reference patch are unambiguous.
Who publishes it: Princeton NLP (original SWE-Bench, 2023); the Verified subset was human-annotated in collaboration with OpenAI in 2024. Full details at swebench.com.
How scoring works: % of issues resolved (patch passes tests). The agentic leaderboard is tracked at /leaderboards/swe-bench-coding-agent-leaderboard/. Top scaffolded agents score 60-72%.
Top scorers: Top coding agents (Devin, SWE-agent variants, agent scaffolds on Claude/GPT-5) cluster 60-72%.
Known limitations: Research by Aleithan et al. found 60.83% of resolved issues involved solution leakage - the fix was stated or hinted in the issue itself. Filtering those drops resolution rates by roughly half. Still the most meaningful coding benchmark in widespread use, but treat absolute scores skeptically.
Official URL: github.com/SWE-bench/SWE-bench
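The patch-passes-tests metric is mechanically simple. A simplified single-issue check (the real harness also resets repo state, installs pinned dependencies, and distinguishes the issue's failing tests from the pre-existing passing ones):

```python
import subprocess

def resolves_issue(repo_dir: str, patch_file: str, test_cmd: list) -> bool:
    """Apply a model-generated patch, then run the repo's test suite.
    Resolved = patch applies cleanly AND the tests pass."""
    applied = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if applied.returncode != 0:
        return False
    return subprocess.run(test_cmd, cwd=repo_dir).returncode == 0

def resolution_rate(outcomes: list) -> float:
    """Benchmark score: percentage of issues resolved."""
    return 100 * sum(outcomes) / len(outcomes)

print(f"{resolution_rate([True, False, True, True]):.1f}% resolved")  # 75.0% resolved
```

Note that the metric says nothing about patch quality beyond the test suite - a patch that games weak tests still counts as resolved, which is one root of the leakage problem above.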
SWE-Bench Multimodal
What it tests: Extension of SWE-Bench where issues include visual context - screenshots of bugs, UI failures, rendering artifacts. Models must interpret images alongside text to generate correct patches. Tests multimodal coding capability, not just text-based debugging.
Who publishes it: Princeton NLP. Released 2024.
How scoring works: Same patch-pass-tests metric as SWE-Bench Verified. Scores are notably lower than text-only due to the added visual reasoning requirement.
Official URL: arxiv.org/abs/2410.03859
LiveCodeBench
What it tests: Competitive programming problems from LeetCode, Codeforces, and AtCoder, continuously updated since July 2023. By pulling new problems each month, it is structurally resistant to contamination. Tests algorithmic problem-solving across difficulty levels.
Who publishes it: LiveCodeBench team (UC Berkeley, MIT, Cornell). Ongoing rolling release.
How scoring works: Pass@1 on auto-graded test cases. Scores stratified by problem difficulty. Rolling updates mean current scores reflect recently released problems.
Top scorers: Top models score 55-70% on the full problem set; harder recent problems show lower rates.
Known limitations: Problems cluster in competitive programming, which has a narrower distribution than real-world software engineering. Does not test repository-level reasoning.
Official URL: livecodebench.github.io - github.com/LiveCodeBench/LiveCodeBench
HumanEval+ (EvalPlus)
What it tests: Extension of OpenAI's original HumanEval coding benchmark with significantly more test cases per problem (up to 80x more). The extra tests catch solutions that pass the minimal original tests but fail edge cases - measuring code correctness more rigorously.
Who publishes it: EvalPlus team. Released 2023.
How scoring works: Pass@1 - does the model's first attempt pass all test cases? Scores on HumanEval+ are 5-15% lower than on the original HumanEval due to stricter testing.
Known limitations: 164 problems is a small sample. Problems are Python function synthesis - not representative of real-world engineering tasks. Heavily contaminated.
Official URL: github.com/evalplus/evalplus
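Pass@1 is the k=1 case of the pass@k family. To avoid sampling bias, most evals use the unbiased estimator from the original HumanEval paper (Chen et al., 2021): draw n samples, count c correct, and compute the probability that at least one of k random picks passes:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample with all misses
    return 1.0 - comb(n - c, k) / comb(n, k)

# 200 samples drawn, 50 correct:
print(pass_at_k(200, 50, 1))             # 0.25
print(round(pass_at_k(200, 50, 10), 3))  # much higher - more tries help
```

Comparing pass@1 numbers across papers requires checking the sampling temperature too: greedy decoding and temperature-sampled pass@1 are not the same measurement.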
MultiPL-E
What it tests: Translation of HumanEval and MBPP coding problems into 18+ programming languages beyond Python. Tests whether coding capability generalizes across languages or is Python-centric.
Who publishes it: Northeastern University / Wellesley College. Released 2022.
How scoring works: Pass@1 per language. Scores vary widely by language - models perform best in Python and TypeScript, noticeably worse in Lua, Perl, and R.
Known limitations: Inherits the contamination and narrow problem distribution of HumanEval. Better for testing language coverage than general coding quality.
Official URL: github.com/nuprl/MultiPL-E
AlgoTune
What it tests: Algorithmic efficiency - models must generate code that solves a computational problem both correctly and faster than a reference solution. Tests optimization and algorithmic thinking, not just correctness.
Who publishes it: Released 2025. Paper at arxiv.org/abs/2503.08048.
How scoring works: Solutions must pass correctness tests and achieve a speedup factor over the baseline solver. Scoring captures both correctness and efficiency.
Why it matters: Most coding benchmarks reward any correct solution. AlgoTune tests whether models can actually think algorithmically - a harder and more practically relevant skill.
TerminalBench
What it tests: Long-horizon terminal tasks requiring multi-step command-line operations - file manipulation, shell scripting, system administration, package management, and network tasks. Models operate directly in a terminal environment.
Who publishes it: Released 2025. Paper at arxiv.org/abs/2504.01990.
How scoring works: Task completion rate. Models must issue sequences of correct commands to achieve a defined end state.
Why it matters: Evaluates practical agentic capability in a realistic shell environment, closer to real-world developer use cases than synthetic function-synthesis benchmarks.
Agentic and Tool Use Benchmarks
View agentic benchmark leaderboard
Full rankings at /leaderboards/agentic-ai-benchmarks-leaderboard/.
BFCL v3 (Berkeley Function-Calling Leaderboard)
What it tests: Function calling / tool use. Given a user query and a set of available function signatures, can the model call the right function with the right arguments? Tests accuracy of tool selection and parameter extraction across diverse APIs and languages.
Who publishes it: UC Berkeley Sky Computing Lab. Continuously updated.
How scoring works: AST (abstract syntax tree) match between model output and correct function call. Covers single-turn, multi-turn, parallel, and nested function calls.
Top scorers: Leading models score 85-93% on simple single-call tasks, with significant drops on multi-step and parallel calls.
Known limitations: Controlled API signatures don't reflect the messiness of real-world API docs. Does not test error recovery when a tool call fails.
Official URL: gorilla.cs.berkeley.edu/leaderboard.html
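The point of AST matching is that two calls differing only in formatting should grade identically, while any change to the function name or arguments should fail. A simplified Python-only sketch (the `get_weather` signature is hypothetical; real BFCL also normalizes types and keyword-argument order):

```python
import ast

def calls_match(pred: str, gold: str) -> bool:
    """Parse both call strings as expressions and compare their syntax
    trees, so whitespace and formatting differences don't matter."""
    try:
        trees = [ast.parse(s, mode="eval").body for s in (pred, gold)]
    except SyntaxError:
        return False
    return ast.dump(trees[0]) == ast.dump(trees[1])

print(calls_match("get_weather(city='Paris', unit='C')",
                  "get_weather( city = 'Paris', unit = 'C' )"))  # True
print(calls_match("get_weather(city='Paris')",
                  "get_weather(city='Lyon')"))                   # False
```

Structural matching also means a model emitting a syntactically invalid call scores zero immediately, without any test execution.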
tau-bench
What it tests: End-to-end task completion in realistic customer service and software developer environments. Models use tools (databases, APIs, code interpreters) to complete multi-step tasks. Evaluates whether models can recover from errors and maintain task focus over extended interaction.
Who publishes it: Sierra AI. Released 2024.
How scoring works: Success rate on task completion. Both retail (customer service) and airline (booking/management) domains. Top models score 55-70%.
Why it matters: One of the most realistic agentic benchmarks. Tasks require planning, tool use, error recovery, and domain knowledge - closer to real agent deployment than synthetic benchmarks.
Official URL: arxiv.org/abs/2406.12045
GAIA
What it tests: Real-world web tasks requiring multi-step information gathering, tool use, and reasoning. Tasks are designed so that a capable human with web access completes them in minutes, but they require careful planning and multiple tool calls. The benchmark is stratified by difficulty level.
Who publishes it: Meta AI / Hugging Face. Released 2023.
How scoring works: Exact answer match on 450+ tasks. Level 1 (simple factual) through Level 3 (complex multi-step). Frontier models score 65-80% on Level 1, dropping significantly on Level 3.
Known limitations: Answers are often factual lookups that could be contaminated by training data. The web-browsing component is only meaningful when evaluated with real browsing capability.
Official URL: arxiv.org/abs/2311.12983 - huggingface.co/spaces/gaia-benchmark/leaderboard
WebArena
What it tests: Autonomous web browsing tasks in realistic website simulations. Models must navigate realistic replicas of sites like Reddit, GitLab, CMS platforms, and online stores to complete tasks like "post a comment," "file a bug," or "find the cheapest product." Tests end-to-end web agent capability.
Who publishes it: Carnegie Mellon University. Released 2023.
How scoring works: Task success rate. Environments are local simulations to ensure reproducibility. Frontier agents score 30-55%.
Known limitations: Simulated environments differ from live websites. The benchmark does not capture real-world UI variability or dynamic content changes.
Official URL: webarena.dev
OSWorld
What it tests: GUI-based computer tasks on real operating systems (Linux, Windows, macOS). Models must interpret screenshots and issue keyboard/mouse actions to complete real software tasks - editing documents, navigating file systems, configuring settings.
Who publishes it: The University of Hong Kong's XLANG Lab and collaborators. Released 2024.
How scoring works: Task success rate across 369 tasks in real OS environments. Frontier models score 20-40%. Human success rate is ~70%.
Why it matters: The most realistic computer-use benchmark available. Tasks run on actual OSes, not simulations.
Known limitations: Evaluation requires significant compute infrastructure. Results vary significantly based on the visual grounding method used.
Official URL: os-world.github.io
AgentBench
What it tests: Eight distinct agentic environments: OS manipulation, database operations, web browsing, web shopping, household-task simulation, lateral thinking puzzles, digital card games, and knowledge graph queries. Each environment tests a different dimension of agent capability.
Who publishes it: Tsinghua University. Released 2023.
How scoring works: Task success rate averaged across environments. Frontier models score 35-55%. The multi-environment design makes it hard to game with narrow optimizations.
Official URL: arxiv.org/abs/2308.03688 - github.com/THUDM/AgentBench
Multimodal Benchmarks
View multimodal benchmark leaderboard
Full rankings at /leaderboards/multimodal-benchmarks-leaderboard/.
MMMU (Massive Multidiscipline Multimodal Understanding)
What it tests: College-level questions requiring both image understanding and disciplinary knowledge. 11,500 questions across 30 subjects and 6 disciplines (Art, Business, Science, Health, Tech, Social Science). Each question contains at least one image that is essential for answering.
Who publishes it: MMMU team (CMU, UCSB, U Michigan, etc.). Released 2023.
How scoring works: Multiple-choice accuracy. Human expert performance is ~88%. Top models score 75-85%.
Known limitations: Image-centric questions can sometimes be answered from text alone by strong language models. The benchmark is saturating for frontier models.
Official URL: mmmu-benchmark.github.io
MMMU-Pro
What it tests: Harder version of MMMU with more visually complex images, more plausible wrong answer options (10 choices), and an additional vision-only mode where questions are presented entirely as images. Designed to force genuine visual reasoning.
Who publishes it: Same team as MMMU. Released 2024.
How scoring works: Multiple-choice accuracy. Top models score 55-70% - well below MMMU saturation. The vision-only mode drops scores a further 5-15%.
Why it matters: The vision-only mode in particular is a strong test of genuine multimodal capability vs. text-heavy shortcuts.
Official URL: arxiv.org/abs/2409.02813
MathVista
What it tests: Mathematical reasoning in visual contexts - charts, geometry diagrams, scientific figures, natural images. 6,141 examples across 19 mathematical tasks and 5 image types. Tests whether models can apply math to visual problems, not just text.
Who publishes it: UCLA and collaborators. Released 2023.
How scoring works: Exact answer accuracy across task types. Human performance is ~60%. Frontier models score 65-80%.
Official URL: github.com/lupantech/MathVista
ChartQA
What it tests: Question answering on real-world charts (bar charts, line charts, pie charts, scatter plots) from business reports and news. Combines visual chart interpretation with numerical reasoning and comparative questions.
Who publishes it: York University (Masry et al.). Released 2022. Paper at arxiv.org/abs/2203.10244.
How scoring works: Relaxed accuracy (within 5% tolerance for numerical answers). Frontier models score 85-93%. Human performance is ~80%.
Known limitations: Real charts introduce OCR and layout parsing challenges that differ from clean diagram benchmarks. Scores vary significantly with chart rendering quality.
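"Relaxed accuracy" is worth seeing concretely: numeric answers pass within 5% relative tolerance (reading "41.8" off a bar labeled 42 counts), while non-numeric answers need an exact match. A minimal sketch:

```python
def relaxed_match(pred: str, gold: str, tol: float = 0.05) -> bool:
    """ChartQA-style scoring: 5% relative tolerance for numbers,
    exact (case-insensitive) match otherwise."""
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        return pred.strip().lower() == gold.strip().lower()
    if g == 0:
        return p == 0
    return abs(p - g) / abs(g) <= tol

print(relaxed_match("41.8", "42"))    # True  (~0.5% off)
print(relaxed_match("39", "42"))      # False (~7% off)
print(relaxed_match("Asia", "asia"))  # True
```

The 5% tolerance forgives imprecise chart reading but still penalizes genuinely wrong numerical reasoning.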
DocVQA
What it tests: Document understanding - answering questions about scanned document images including forms, tables, reports, and contracts. Tests OCR capability, layout understanding, and question answering over document structure.
Who publishes it: CVIT (IIIT Hyderabad) and the Computer Vision Center (UAB Barcelona). Released 2020.
How scoring works: Normalized Levenshtein similarity between predicted and reference answer. Frontier models score 92-97%. The benchmark is largely saturated.
Official URL: arxiv.org/abs/2007.00398
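Normalized Levenshtein similarity (ANLS) gives partial credit for near-miss OCR ("Invoce Total" vs. "Invoice Total") but zeroes out anything below a 0.5 similarity threshold. A self-contained sketch of the metric:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def anls(pred: str, gold: str, threshold: float = 0.5) -> float:
    """Normalized Levenshtein similarity with a hard cutoff, as DocVQA
    uses: low-similarity answers score zero, not partial credit."""
    pred, gold = pred.strip().lower(), gold.strip().lower()
    if not pred and not gold:
        return 1.0
    sim = 1 - levenshtein(pred, gold) / max(len(pred), len(gold))
    return sim if sim >= threshold else 0.0

print(round(anls("Invoce Total", "Invoice Total"), 2))  # 0.92
print(anls("grand total", "Invoice Total"))             # 0.0 (below threshold)
```

The threshold is what keeps ANLS honest: a one-character OCR slip costs a few points, but answering with the wrong field of the document scores nothing.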
OCRBench
What it tests: Comprehensive OCR and text recognition across 29 sub-tasks: handwriting, scene text, document text, artistic text, multilingual text, and math formulas. Tests the full spectrum of visual text understanding.
Who publishes it: Huazhong University of Science and Technology and collaborators. Released 2023.
How scoring works: Normalized accuracy across all sub-tasks. Frontier vision models score 70-85%. Handwriting and artistic text remain genuinely hard.
Official URL: github.com/Yuliang-Liu/MultimodalOCR
BrowseComp
What it tests: Hard web research tasks requiring multi-hop information gathering across multiple web pages. Each task has a unique, verifiable answer that can only be found by following several browsing steps. Designed to test deep research capability rather than single-page lookup.
Who publishes it: OpenAI. Released 2025.
How scoring works: Exact answer accuracy on 1,266 tasks. Base models with browsing score 5-15%. Agents specifically optimized for web research score 30-50%.
Why it matters: One of the most practically relevant benchmarks for evaluating deep research tools. High difficulty means it's far from saturated.
Official URL: arxiv.org/abs/2504.12516
Instruction Following Benchmarks
IFEval
What it tests: Instruction following fidelity. Tasks include formatting instructions (use bullet points, include specific words, keep under N characters), structural requirements (include specific headers), and style constraints. Tests whether models reliably follow explicit instructions - a basic but often underperforming capability.
Who publishes it: Google Research. Released 2023.
How scoring works: Prompt-level and instruction-level accuracy. Both strict (exact) and loose (lenient) scoring. Frontier models score 85-95% strict instruction level; smaller models show dramatic drops.
Why it matters: A model that scores 95% on reasoning but 70% on IFEval is unreliable in production. Instruction following is a practical prerequisite for most real-world deployments.
Official URL: arxiv.org/abs/2311.07911
MT-Bench
What it tests: Multi-turn conversational quality across 8 categories: writing, roleplay, extraction, reasoning, math, coding, knowledge, STEM. Each task has a follow-up question to test whether models maintain context and quality across turns. Uses GPT-4 as the judge.
Who publishes it: UC Berkeley LMSYS. Released 2023.
How scoring works: GPT-4 scores each response 1-10. Average across all tasks. Benchmark is now partially saturated - frontier models cluster at 9.0-9.7.
Known limitations: LLM-as-judge is susceptible to positional bias and verbosity preferences. The "judge" model's own biases affect results. See hallucination benchmark analysis for LLM-as-judge limitations.
Official URL: github.com/lm-sys/FastChat
AlpacaEval 2.0
What it tests: Open-ended instruction following quality, evaluated with length-controlled win rate against a reference model (GPT-4 Turbo). 805 diverse instructions. The length-controlled version penalizes verbosity gaming.
Who publishes it: Stanford CRFM. Released 2023, v2.0 updated 2024.
How scoring works: Length-controlled win rate (LC WR%) against the reference model. Corrects for models that win by being longer rather than better. Top models score 55-70% LC win rate.
Known limitations: GPT-4-as-judge carries GPT-4's biases. Models that mimic GPT-4's style score artificially high.
Official URL: github.com/tatsu-lab/alpaca_eval
Long Context Benchmarks
NIAH (Needle in a Haystack)
What it tests: The original long-context stress test. A specific fact (the "needle") is hidden at a random position inside a long document (the "haystack"). The model must locate and report the needle. Tests whether context windows actually work throughout their stated length.
Who publishes it: Greg Kamradt. Released 2023. Widely replicated.
How scoring works: Binary success/fail per needle placement. Usually visualized as a 2D heatmap across document length and needle position depth. Most frontier models pass at 128K context; degradation appears at higher depths.
Known limitations: Too easy for frontier models at normal context lengths. Any model claiming a 1M+ context window should be tested with multi-needle variants (many needles, conflicting needles) rather than single-needle NIAH.
Official URL: github.com/gkamradt/LLMTest_NeedleInAHaystack
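The harness is simple enough to sketch in full: embed a needle sentence at a chosen relative depth in filler text, ask the model to retrieve it, and grade by substring match. (Filler and needle below are illustrative.)

```python
def build_haystack(filler_words, needle: str, n_words: int, depth: float) -> str:
    """Place the needle at a relative depth (0.0 = start, 1.0 = end)
    inside n_words of repeated filler text."""
    words = (filler_words * (n_words // len(filler_words) + 1))[:n_words]
    pos = int(len(words) * depth)
    return " ".join(words[:pos] + [needle] + words[pos:])

def found(model_answer: str, needle_fact: str) -> bool:
    """Binary grading: did the answer contain the needle fact?"""
    return needle_fact.lower() in model_answer.lower()

filler = "the quick brown fox jumps over the lazy dog".split()
doc = build_haystack(filler, "The secret code is 7421.", n_words=200, depth=0.5)
print(found("I found it: the secret code is 7421.", "7421"))  # True
```

The familiar 2D heatmap comes from sweeping n_words across context lengths and depth across 0.0-1.0, scoring each cell pass/fail.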
RULER
What it tests: Extends NIAH with 13 task types across four categories: retrieval (single and multi-key), multi-hop tracing, aggregation, and question answering. Tests long context comprehension beyond simple needle retrieval. Designed to expose the performance cliff that NIAH doesn't catch.
Who publishes it: NVIDIA Research. Released 2024.
How scoring works: Accuracy per task type across context lengths from 4K to 128K+. Aggregated score shows where models fall off.
Why it matters: Models that pass basic NIAH often fail multi-hop RULER tasks at the same context length. RULER is the more diagnostic test.
Official URL: arxiv.org/abs/2404.06654
LongBench v2
What it tests: 503 long-document tasks covering single-document QA, multi-document QA, summarization, code completion, and long-form reasoning. Context lengths range from 8K to 2M tokens. Designed to test practical long-context use cases.
Who publishes it: Tsinghua University. Released 2024 (v2 update).
How scoring works: Accuracy per task type. Average across all tasks. Top models score 55-70%.
Official URL: arxiv.org/abs/2501.02897 - github.com/THUDM/LongBench
InfiniteBench
What it tests: Evaluation at extreme context lengths (100K+ to 1M+ tokens). Tasks include book-length question answering, code debugging in large codebases, mathematical reasoning over long derivations, and retrieval from very long documents. Designed to test whether long-context capability is genuine or degraded.
Who publishes it: OpenBMB. Released 2024.
How scoring works: Accuracy per task type. Notably, top model performance drops significantly on math and code tasks at 500K+ context even if retrieval tasks hold.
Official URL: github.com/OpenBMB/InfiniteBench
HELMET
What it tests: Holistic Evaluation of Long-context Language Models across 11 use cases: summarization, QA, retrieval, RAG, re-ranking, citation generation, and structured reasoning. Uses carefully written, semantically equivalent but lexically diverse test cases to reduce contamination risk.
Who publishes it: Princeton University. Released 2024.
How scoring works: Per-task metrics (F1, accuracy, ROUGE, citation F1) aggregated across 11 task types. Provides a comprehensive capability profile rather than a single score.
Why it matters: The most complete long-context evaluation currently available. The 11-task coverage reveals where models excel (retrieval) vs. fail (structured reasoning over long documents).
Official URL: arxiv.org/abs/2410.02694
Safety and Adversarial Benchmarks
View hallucination and safety benchmark data
Related analysis at /leaderboards/hallucination-benchmarks-leaderboard/.
HarmBench
What it tests: Adversarial robustness of safety systems across 7 functional categories and 18 attack methods. Tests whether a model's safety guardrails hold up under direct requests, jailbreak prompts, indirect prompt injection, and optimized adversarial attacks.
Who publishes it: Center for AI Safety / UIUC. Released 2024.
How scoring works: Attack success rate (ASR) per attack method. Lower is safer. Also measures "refusal effectiveness" on benign queries - safety should not come at the cost of refusing legitimate requests.
Top scorers: Frontier models with RLHF safety training score 3-15% ASR on direct requests but higher on sophisticated attacks.
Official URL: github.com/centerforaisafety/HarmBench
AdvBench
What it tests: Adversarial instruction following - whether models can be prompted to produce harmful content via jailbreaking techniques. 500 harmful behaviors and 500 adversarial string attacks. One of the earlier standardized safety evaluation sets.
Who publishes it: Carnegie Mellon University (Zou et al.). Released 2023.
How scoring works: Harmful behavior rate. Binary classification per prompt. Simpler than HarmBench but widely cited.
Known limitations: The attack library is now well-known and labs optimize against it. More recent attacks not in AdvBench may succeed at higher rates.
Official URL: arxiv.org/abs/2307.15043
HolisticBias
What it tests: Representation and fairness across 13 demographic axes (gender, race, religion, nationality, disability, etc.) and 600+ descriptor terms. Tests whether models produce systematically biased, stereotyped, or harmful content when discussing different demographic groups.
Who publishes it: Meta AI. Released 2022.
How scoring works: Measures regard scores (positive/negative/neutral sentiment toward each demographic group) and toxicity rates. Parity across groups is the goal.
Why it matters: Most safety benchmarks test for refusal; HolisticBias tests for representation quality across a very broad demographic matrix.
Official URL: arxiv.org/abs/2206.07845
BIG-Bench Hard Red Team
What it tests: Adversarial subset of BIG-Bench designed to elicit problematic behavior through seemingly benign instructions. Tests whether models recognize when a task's apparent goal (solving a puzzle) leads to a harmful output (providing instructions for dangerous activities).
Who publishes it: Google Research. Part of the BIG-Bench family.
How scoring works: Refusal rate and accuracy on safe vs. unsafe tasks. Good models should refuse the red team tasks while correctly completing the benign subset.
Official URL: github.com/google/BIG-bench
Creative Writing Benchmarks
EQ-Bench Creative Writing v3
What it tests: Creative writing quality across story prompts, character development, dialogue, and emotional resonance. Scored by an LLM judge (default: GPT-4o) using a structured rubric measuring originality, writing craft, emotional depth, and coherence. One of the few benchmarks specifically targeting generative quality rather than factual accuracy.
Who publishes it: EQ-Bench project. v3 released 2025.
How scoring works: Normalized score 0-100 from LLM judge assessment. Rankings differ meaningfully from instruction-following benchmarks - models that score high on MMLU don't necessarily score high here.
Known limitations: LLM-as-judge is susceptible to stylistic biases. GPT-4o as judge may favor outputs similar to GPT-4o's style.
Official URL: eqbench.com
Antislop Evaluation
What it tests: Whether model outputs avoid "slop" - the overused phrases, clichés, and formulaic constructions that mark AI-generated text. Tests include specific pattern detection (use of phrases like "tapestry of," "testament to," overuse of em dashes, certain sentence openers) and overall prose quality.
Who publishes it: Community-driven, emerged from the antislop-sampler project. Growing as a community standard.
Why it matters: Creative writing quality is impossible to capture in a single metric. Antislop evaluation, while informal, captures a real and observable failure mode in AI writing that other benchmarks miss.
Official URL: github.com/sam-paech/antislop-sampler
Specialized Domain Benchmarks
HealthBench
What it tests: Medical advice quality across 5,000 multi-turn health conversations. Evaluates accuracy, appropriate referral to professionals, safety of advice, and avoidance of dangerous recommendations. Uses physician-written criteria for scoring.
Who publishes it: OpenAI. Released 2025.
How scoring works: Model-graded against physician-written rubric criteria. Each criterion carries a point weight (negative points penalize dangerous advice); a conversation's score is points earned as a share of the maximum achievable, and the benchmark score averages across conversations. Frontier models score 70-85%.
Why it matters: One of the most rigorous domain-specific benchmarks with human expert scoring. Medical AI is a high-stakes deployment context; HealthBench provides a standardized evaluation.
Official URL: arxiv.org/abs/2505.19955 - huggingface.co/datasets/openai/healthbench
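A simplified sketch of rubric-style scoring, assuming point-weighted criteria where negative points penalize dangerous advice (this is a simplification of HealthBench's actual rubric format and grading pipeline):

```python
def rubric_score(criteria):
    """Score one conversation against point-weighted rubric criteria.

    Each criterion: {"points": int, "met": bool}. Negative-point criteria
    fire when the model does something dangerous. Score = points earned
    over max achievable positive points, clipped to [0, 1].
    (Simplified from HealthBench's rubric format.)
    """
    earned = sum(c["points"] for c in criteria if c["met"])
    max_points = sum(c["points"] for c in criteria if c["points"] > 0)
    return max(0.0, min(1.0, earned / max_points))

convo = [
    {"points": 5, "met": True},    # accurate triage advice
    {"points": 3, "met": False},   # missed referral to a professional
    {"points": -4, "met": False},  # no dangerous recommendation given
]
print(rubric_score(convo))  # 0.625
```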
FinanceBench
What it tests: Financial document question answering - publicly available earnings reports, 10-Ks, and financial statements. Tests whether models can read, interpret, and accurately answer numerical and analytical questions from real financial documents.
Who publishes it: Patronus AI. Released 2023.
How scoring works: Exact match or near-exact match on numerical answers. Scoring penalizes hallucinated figures. Frontier models score 75-90%; models with poor numerical grounding score 40-60%.
Official URL: github.com/patronus-ai/financebench
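Near-exact numeric matching can be as simple as a relative-tolerance check; the 1% tolerance below is an illustrative choice, not FinanceBench's official grading rule:

```python
def numeric_match(predicted, gold, rel_tol=0.01):
    """Accept a numeric answer within a small relative tolerance of
    the gold figure. (Tolerance is illustrative; FinanceBench's
    official grading may differ.)"""
    if gold == 0:
        return predicted == 0
    return abs(predicted - gold) / abs(gold) <= rel_tol

print(numeric_match(4.213e9, 4.2e9))  # True  (~0.3% off: rounding, fine)
print(numeric_match(5.0e9, 4.2e9))    # False (~19% off: hallucinated figure)
```

Relative rather than absolute tolerance matters here: a $10M error is noise on a $4B revenue line but catastrophic on a $20M one.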
LegalBench
What it tests: Legal reasoning across 162 tasks including contract clause identification, statutory interpretation, case outcome prediction, and legal writing. Covers diverse legal subfields with tasks contributed by law professors.
Who publishes it: Hazy Research / Stanford Law School. Released 2023.
How scoring works: Task-level accuracy averaged across 162 tasks in 6 categories. Top models score 65-80% overall; performance drops significantly on tasks requiring genuine legal reasoning vs. pattern matching.
Official URL: github.com/HazyResearch/legalbench
ETHICS
What it tests: Moral reasoning across five philosophical frameworks: commonsense ethics, deontology, virtue ethics, utilitarianism, and justice. 130,000+ examples. Tests whether models have internalized ethical reasoning aligned with human moral intuitions.
Who publishes it: Dan Hendrycks et al. (UC Berkeley / Center for AI Safety). Released 2020; published at ICLR 2021.
How scoring works: Multiple-choice and binary classification accuracy per framework. Frontier models score 80-92%; the hard subsets remain more challenging.
Official URL: arxiv.org/abs/2008.02275 - github.com/hendrycks/ethics
Red Flags When a Model Cites a Benchmark
After two years of running evaluations and reading hundreds of model release posts, these are the patterns that reliably indicate a misleading benchmark score.
Cherry-picking without disclosure. A lab shows benchmark scores only on the tests where their model leads. They don't report scores on benchmarks where competitors win. The tell is a selective benchmark table that conveniently omits tests from competing labs' release posts.
Subset reporting instead of full benchmark. "Best on GPQA" might mean best on a 30-question subset of the 448-question dataset, not the full test. Always check N.
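Checking N matters because small subsets carry huge error bars. A quick sketch of the 95% margin of error for a reported accuracy, using the standard normal approximation to the binomial:

```python
import math

def score_margin(p, n, z=1.96):
    """95% margin of error for an accuracy p measured on n questions
    (normal approximation to the binomial)."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (30, 448):
    print(f"n={n}: 75% +/- {100 * score_margin(0.75, n):.1f} points")
```

At n=30 a reported 75% is really 75 ± 15.5 points; on the full 448 questions it tightens to about ±4. A "win" smaller than the margin is noise.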
Unreported eval configuration. The same model can score 5+ percentage points differently depending on prompt format, chain-of-thought, number of examples, and decoding temperature. A score without these details is unverifiable.
Benchmark-adjacent training data. Labs that report enormous gains on a specific benchmark shortly after that benchmark's public release may have used the benchmark's training split or similar problems to fine-tune the model. This is especially common for math and coding benchmarks.
"Experimental" vs. production model. The Llama 4 scandal established that "experimental" versions submitted to arenas can be meaningfully different from the released model. If a lab reports arena scores from a "variant," those scores don't transfer to the production model.
Comparing against older model versions. "Beats GPT-4" in 2026 is not impressive. "Beats GPT-4o-mini" is not impressive. Make sure comparisons are against current equivalents.
LLM-as-judge without calibration. MT-Bench and AlpacaEval use GPT-4 as a judge. Models that mimic GPT-4's style get inflated scores. If a model is fine-tuned on GPT-4 outputs, its MT-Bench score is not comparable.
FAQ
Which benchmark should I trust most right now?
For math: FrontierMath (hardest, no contamination possible). For coding: SWE-Bench Verified + LiveCodeBench together. For general capability: a combination of GPQA-Diamond and MMLU-Pro weighted by your use case. For practical reasoning: GAIA and tau-bench. No single benchmark covers everything.
Is contamination really that bad?
Yes, for static benchmarks. MMLU-CF research showed GPT-4o's MMLU score dropping from 88% to 73.4% on a clean version. See the full contamination analysis for documented cases.
How do I compare two models on benchmarks?
Use the same eval framework, same prompt format, and same generation parameters for both models. Use a benchmark released after both models' training cutoffs if possible. Run on at least 500 questions to reduce variance. See our reasoning and coding leaderboards for standardized comparisons.
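To make the variance point concrete, one common practice is a paired bootstrap over per-question correctness for two models on the same question set. This is a generic statistical sketch, not tied to any particular eval framework:

```python
import random

def bootstrap_gap(scores_a, scores_b, iters=10_000, seed=0):
    """Paired bootstrap on per-question correctness (0/1) for two models
    on the SAME questions; returns a 95% interval for the accuracy gap.
    (Common eval practice, not any one framework's official method.)"""
    rng = random.Random(seed)
    n = len(scores_a)
    gaps = []
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    gaps.sort()
    return gaps[int(0.025 * iters)], gaps[int(0.975 * iters)]

a = [1] * 60 + [0] * 40   # model A: 60% on 100 questions
b = [1] * 55 + [0] * 45   # model B: 55% on the same questions
lo, hi = bootstrap_gap(a, b)
print(f"95% CI for the accuracy gap: [{lo:.2f}, {hi:.2f}]")
```

If the interval straddles zero, the headline "win" is indistinguishable from noise at that sample size.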
What is benchmark saturation?
When top models cluster above 90% accuracy, a benchmark no longer differentiates capability. MMLU is saturated. HumanEval is saturated. When a benchmark saturates, the field needs a harder replacement. FrontierMath was designed specifically because MATH saturated.
Are there benchmarks that can't be gamed?
Dynamic benchmarks (LiveCodeBench, LiveBench) are structurally harder to contaminate because they use new content each month. FrontierMath uses novel unsolved math problems. ARC-AGI-2 uses novel visual reasoning patterns. These are the closest to contamination-resistant evaluation available. None are completely ungameable, but they're significantly harder to manipulate than static benchmarks.
What benchmark should I use to evaluate a model for my specific use case?
None of the above. Build a small private benchmark from your actual tasks, run both models on it, and decide from that data. Public benchmarks measure average capability on distributions that resemble training data; your use case is its own distribution. The most reliable evaluation is always task-specific.
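A private benchmark does not need infrastructure - a few dozen (prompt, expected) pairs plus a grading function is enough to start. A minimal harness sketch, with toy stand-ins for the model call and the grader (replace both with your real API calls and grading logic):

```python
def run_private_eval(tasks, model_fn, grade_fn):
    """Minimal private-benchmark harness.

    tasks:    your own (prompt, expected) pairs
    model_fn: calls the model under test
    grade_fn: decides pass/fail - exact match, rubric, human review
    All three are yours to define; that is the point of a private benchmark.
    """
    results = [(t["prompt"], grade_fn(model_fn(t["prompt"]), t["expected"]))
               for t in tasks]
    passed = sum(ok for _, ok in results)
    return passed / len(tasks), results

# Toy stand-ins; swap in real model calls and your own grader.
tasks = [{"prompt": "2+2", "expected": "4"},
         {"prompt": "capital of France", "expected": "Paris"}]
model_fn = lambda p: {"2+2": "4", "capital of France": "Lyon"}[p]
grade_fn = lambda out, exp: out.strip() == exp

score, details = run_private_eval(tasks, model_fn, grade_fn)
print(score)  # 0.5
```

Keep the task set private and versioned; the moment it leaks into training data it inherits every contamination problem this guide describes.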
Sources
- TIGER-Lab MMLU-Pro Dataset
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark (arXiv 2311.12022)
- GPQA GitHub
- ZebraLogic - Allen AI (HuggingFace)
- BIG-Bench: Beyond the Imitation Game (GitHub)
- AoPS AIME Problems and Solutions
- FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning (arXiv 2411.04872)
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (swebench.com)
- SWE-bench GitHub
- SWE-bench Multimodal (arXiv 2410.03859)
- LiveCodeBench (GitHub)
- EvalPlus: Rigorous Evaluation of LLMs as Code Generators (GitHub)
- MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation (GitHub)
- AlgoTune (arXiv 2503.08048)
- TerminalBench (arXiv 2504.01990)
- Berkeley Function-Calling Leaderboard
- tau-bench: A Benchmark for Tool-Agent-User Interaction (arXiv 2406.12045)
- GAIA: a benchmark for General AI Assistants (arXiv 2311.12983)
- GAIA Leaderboard (Hugging Face)
- WebArena: A Realistic Web Environment for Building Autonomous Agents
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
- AgentBench: Evaluating LLMs as Agents (arXiv 2308.03688)
- AgentBench GitHub
- MMMU: A Massive Multi-discipline Multimodal Understanding Benchmark
- MMMU-Pro (arXiv 2409.02813)
- MathVista: Evaluating Math Reasoning in Visual Contexts (GitHub)
- ChartQA (arXiv 2203.10244)
- DocVQA: A Dataset for VQA on Document Images (arXiv 2007.00398)
- OCRBench (GitHub)
- BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents (arXiv 2504.12516)
- IFEval: Instruction-Following Evaluation for Large Language Models (arXiv 2311.07911)
- MT-Bench / FastChat (GitHub)
- AlpacaEval (GitHub)
- NIAH: Needle In A Haystack Pressure Testing (GitHub)
- RULER: What's the Real Context Size of Your Long Context Language Models? (arXiv 2404.06654)
- LongBench v2 (arXiv 2501.02897)
- LongBench (GitHub)
- InfiniteBench: Extending Long Context Evaluation Beyond 100K Tokens (GitHub)
- HELMET: How to Evaluate Long-context Language Models Effectively and Thoroughly (arXiv 2410.02694)
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming (GitHub)
- Universal and Transferable Adversarial Attacks on Aligned Language Models / AdvBench (arXiv 2307.15043)
- HolisticBias (arXiv 2206.07845)
- EQ-Bench Creative Writing
- Antislop Sampler (GitHub)
- HealthBench (arXiv 2505.19955)
- HealthBench Dataset (Hugging Face)
- FinanceBench (GitHub)
- LegalBench (GitHub)
- Aligning AI With Shared Human Values - ETHICS (arXiv 2008.02275)
- ETHICS Benchmark (GitHub)
