<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>Chemistry | Awesome Agents</title><link>https://awesomeagents.ai/tags/chemistry/</link><description>Your guide to AI models, agents, and the future of intelligence. Reviews, leaderboards, news, and tools - all in one place.</description><language>en-us</language><managingEditor>contact@awesomeagents.ai (Awesome Agents)</managingEditor><lastBuildDate>Sun, 19 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://awesomeagents.ai/tags/chemistry/index.xml" rel="self" type="application/rss+xml"/><image><url>https://awesomeagents.ai/images/logo.png</url><title>Awesome Agents</title><link>https://awesomeagents.ai/</link></image><item><title>Scientific Reasoning LLM Leaderboard 2026: GPQA Ranks</title><link>https://awesomeagents.ai/leaderboards/scientific-reasoning-llm-leaderboard/</link><pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate><guid>https://awesomeagents.ai/leaderboards/scientific-reasoning-llm-leaderboard/</guid><description>&lt;p>Scientific reasoning is its own distinct capability - one that gets blurred when it's lumped in with general reasoning or pure mathematics. This leaderboard is specifically about STEM: physics, chemistry, biology, and earth science. It focuses on problems that require domain knowledge applied under reasoning pressure, not just symbol manipulation.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Scientific reasoning is its own distinct capability - one that gets blurred when it's lumped in with general reasoning or pure mathematics. This leaderboard is specifically about STEM: physics, chemistry, biology, and earth science. It focuses on problems that require domain knowledge applied under reasoning pressure, not just symbol manipulation.</p>
<p>To be clear about scope: this is not the <a href="/leaderboards/math-olympiad-ai-leaderboard/">Math Olympiad leaderboard</a>, which covers AIME, IMO, FrontierMath, and formal proof benchmarks. It is also not the <a href="/leaderboards/reasoning-benchmarks-leaderboard/">general reasoning leaderboard</a>, which covers GPQA Diamond alongside AIME and Humanity's Last Exam as a broader trio. If you landed here because you care about how well models solve physics problems, balance chemical equations, reason through genetics, or apply thermodynamics - you are in the right place. If you are choosing a model for hallucination resistance or factual recall, see the <a href="/leaderboards/hallucination-benchmarks-leaderboard/">hallucination benchmarks leaderboard</a>.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>Reasoning-optimized models (o3, Claude 4 Opus, Gemini 2.5 Pro Deep Think) dominate GPQA Diamond and OlympiadBench-Sci - gap over non-reasoning frontiers is 6-12 percentage points</li>
<li>On knowledge-heavy tasks (MMLU-STEM, ARC-Challenge), non-reasoning frontier models like GPT-4.1 and DeepSeek V3.2 close the gap to within a few points</li>
<li>Open-weight models (Llama 4 Maverick, Phi-4, Qwen 3.5) trail on GPQA Diamond but are competitive on ARC-Challenge and MMLU-STEM</li>
</ul>
</div>
<h2 id="the-benchmarks-explained">The Benchmarks Explained</h2>
<h3 id="gpqa-diamond---phd-level-science">GPQA Diamond - PhD-level science</h3>
<p>GPQA (Graduate-Level Google-Proof Q&amp;A) Diamond is a set of 198 extremely hard multiple-choice questions written by domain experts in physics, chemistry, and biology. &quot;Google-Proof&quot; is literal: even PhDs in the relevant field with unrestricted internet access score around 81%. Skilled non-experts with internet access land near 34%, not far above the 25% random-chance baseline for four-option questions.</p>
<p>These are not trivia questions. A chemistry item might require applying thermodynamic cycle analysis to a novel organic system. A physics item might ask you to derive a scattering cross-section under unusual boundary conditions. GPQA Diamond is the most demanding science reasoning benchmark with wide model coverage, which makes it the anchor of this leaderboard. Paper: <a href="https://arxiv.org/abs/2311.12022">arxiv.org/abs/2311.12022</a>. Repository: <a href="https://github.com/idavidrein/gpqa">github.com/idavidrein/gpqa</a>.</p>
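<p>For readers who want to sanity-check reported numbers, scoring a four-option benchmark like this is mechanically simple. The sketch below is illustrative only - <code>query_model</code> and the example record are hypothetical placeholders, not the official GPQA evaluation harness:</p>
<pre><code class="language-python">import random

# Hypothetical stand-in for a model call; a real harness would query an API
# and parse the chosen letter out of the model's response.
def query_model(question, options):
    return random.choice(["A", "B", "C", "D"])

# Each record mirrors the shape of a GPQA Diamond item: one question,
# four answer options, and the letter of the correct option.
dataset = [
    {"question": "Which step of the catalytic cycle is rate-determining ...?",
     "options": ["option A ...", "option B ...", "option C ...", "option D ..."],
     "answer": "B"},
    # ... 197 more items in the real Diamond set
]

correct = sum(
    query_model(item["question"], item["options"]) == item["answer"]
    for item in dataset
)
print(f"accuracy: {correct / len(dataset):.1%}")
# Reference points: 25% random chance, ~34% skilled non-experts with
# web access, ~81% PhD-level experts.
</code></pre>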
<h3 id="scibench---quantitative-college-textbook-problems">SciBench - Quantitative college textbook problems</h3>
<p>SciBench targets open-ended quantitative problems drawn from university-level textbooks in thermodynamics, quantum mechanics, electromagnetism, and physical chemistry. Unlike multiple-choice benchmarks, it requires the model to produce a numerical answer, often in a specific unit and format. This makes it more brittle on scoring but much harder to game. A model that cannot set up and solve differential equations, apply conservation laws, or use dimensional analysis will fail here regardless of its parametric knowledge. Paper: <a href="https://arxiv.org/abs/2307.10635">arxiv.org/abs/2307.10635</a>. Repository: <a href="https://github.com/mandyyyyii/scibench">github.com/mandyyyyii/scibench</a>.</p>
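<p>The answer-format brittleness is easy to see in a toy grader. This is not SciBench's official checker - just a minimal sketch, assuming answers are accepted within a small relative tolerance - showing how a correct derivation reported in the wrong unit scores zero:</p>
<pre><code class="language-python">import math

def grade(predicted, reference, rel_tol=0.05):
    """Accept the answer if it is within 5% of the reference value."""
    return math.isclose(predicted, reference, rel_tol=rel_tol)

# Hypothetical problem with a reference answer of 12.4 kJ.
reference_kj = 12.4

print(grade(12.4, reference_kj))     # True:  correct value, correct unit
print(grade(12400.0, reference_kj))  # False: correct physics, answer given in J
print(grade(11.9, reference_kj))     # True:  within the 5% tolerance
</code></pre>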
<h3 id="olympiadbench-science---international-science-olympiad-problems">OlympiadBench-Science - International science olympiad problems</h3>
<p>OlympiadBench includes problems from Physics, Chemistry, and Biology Olympiad competitions (IPhO, IChO, IBO) across multiple difficulty levels. These are problems that stump talented high-school students who have been specifically trained for them. The Science subset (OlympiadBench-Sci) excludes the mathematics problems tracked separately in the Math Olympiad leaderboard. Paper: <a href="https://arxiv.org/abs/2402.14008">arxiv.org/abs/2402.14008</a>. Repository: <a href="https://github.com/OpenBMB/OlympiadBench">github.com/OpenBMB/OlympiadBench</a>.</p>
<h3 id="mmlu-stem---broad-knowledge-across-12-stem-subjects">MMLU-STEM - Broad knowledge across 12 STEM subjects</h3>
<p>Massive Multitask Language Understanding (MMLU) covers 57 subjects; the STEM subset isolates 12 technical disciplines including abstract algebra, astronomy, college biology, college chemistry, college physics, computer science, high-school chemistry, high-school physics, and others. At roughly 4,000 questions with four-choice answers, MMLU-STEM is more a knowledge breadth test than a deep reasoning test. Models that have absorbed a wide undergraduate science curriculum score well even without chain-of-thought. Paper: <a href="https://arxiv.org/abs/2009.03300">arxiv.org/abs/2009.03300</a>.</p>
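<p>A typical way to aggregate the subset into a single &quot;MMLU-STEM avg&quot; figure is a macro-average over per-subject accuracy (some setups use a per-question micro-average instead, which is one reason figures differ slightly across sources). A minimal sketch with made-up per-subject numbers:</p>
<pre><code class="language-python"># Hypothetical per-subject accuracies for one model on MMLU STEM subjects.
stem_accuracy = {
    "abstract_algebra": 0.62,
    "astronomy": 0.91,
    "college_biology": 0.93,
    "college_chemistry": 0.71,
    "college_physics": 0.68,
    "high_school_chemistry": 0.82,
    "high_school_physics": 0.74,
    # ... remaining STEM subjects
}

# Macro-average: every subject counts equally, regardless of question count.
macro_avg = sum(stem_accuracy.values()) / len(stem_accuracy)
print(f"MMLU-STEM avg: {macro_avg:.1%}")
</code></pre>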
<h3 id="arc-challenge---multi-step-science-reasoning">ARC-Challenge - Multi-step science reasoning</h3>
<p>The AI2 Reasoning Challenge (ARC) Challenge set contains 1,172 four-choice science questions that retrieval-based and word-overlap systems could not answer correctly. They test multi-step inference: a question about thermal expansion might require knowing that metals conduct heat, that molecules vibrate faster at higher temperatures, and that this causes dimensional change, all in a single problem. ARC-Challenge remains useful for separating capable models from capable-looking ones. Dataset: <a href="https://huggingface.co/datasets/allenai/ai2_arc">huggingface.co/datasets/allenai/ai2_arc</a>.</p>
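<p>ARC-Challenge is also easy to reproduce locally because the dataset is public on the Hugging Face Hub. A minimal sketch using the <code>datasets</code> library - <code>answer_with_model</code> is a hypothetical placeholder for whatever model you want to test:</p>
<pre><code class="language-python">from datasets import load_dataset

# ARC-Challenge test split: 1,172 multiple-choice science questions.
arc = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")

def answer_with_model(question, choices):
    """Hypothetical placeholder: return one of the labels in choices["label"]."""
    return choices["label"][0]

correct = 0
for item in arc:
    prediction = answer_with_model(item["question"], item["choices"])
    if prediction == item["answerKey"]:
        correct += 1

print(f"ARC-Challenge accuracy: {correct / len(arc):.1%}")
</code></pre>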
<h3 id="chemqa-and-physics-olympiad-subset">ChemQA and Physics Olympiad subset</h3>
<p>For chemistry and physics specifically, I pull in scores on ChemQA (college-level quantitative chemistry problems with multi-step synthesis and reaction pathways) and the Physics Olympiad subset from OlympiadBench where providers have reported them. These scores are less uniformly reported, so I treat them as supplementary signal in their own column rather than as a primary ranking criterion.</p>
<h2 id="scientific-reasoning-rankings---april-2026">Scientific Reasoning Rankings - April 2026</h2>
<p>Scores below are drawn from published papers, model cards, and official system cards. Where no public figure exists, I write &quot;Not reported&quot; rather than interpolate. Ranges indicate conflicting reports across evaluation sources or different prompting conditions.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th style="text-align: center">GPQA Diamond</th>
          <th style="text-align: center">SciBench</th>
          <th style="text-align: center">OlympiadBench-Sci</th>
          <th style="text-align: center">MMLU-STEM avg</th>
          <th style="text-align: center">ARC-Challenge</th>
          <th style="text-align: center">ChemQA / Phys Olympiad</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>o3</td>
          <td>OpenAI</td>
          <td style="text-align: center">87.7%</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">~92%</td>
          <td style="text-align: center">98.0%</td>
          <td style="text-align: center">Not reported</td>
          <td>Best public GPQA Diamond from system card</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Claude 4 Opus</td>
          <td>Anthropic</td>
          <td style="text-align: center">84.9%</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">~91%</td>
          <td style="text-align: center">97.8%</td>
          <td style="text-align: center">Not reported</td>
          <td>Anthropic system card; inference-time reasoning</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Gemini 2.5 Pro (Deep Think)</td>
          <td>Google DeepMind</td>
          <td style="text-align: center">84.0%</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">~91%</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">Not reported</td>
          <td>Deep Think mode; Gemini 2.5 Pro tech report</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Gemini 2.5 Pro</td>
          <td>Google DeepMind</td>
          <td style="text-align: center">80.3%</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">89.9%</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">Not reported</td>
          <td>Standard mode; tech report</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Claude 4 Sonnet</td>
          <td>Anthropic</td>
          <td style="text-align: center">78.2%</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">~89%</td>
          <td style="text-align: center">97.2%</td>
          <td style="text-align: center">Not reported</td>
          <td>Anthropic system card</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Grok 4 Heavy</td>
          <td>xAI</td>
          <td style="text-align: center">77.1%</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">~89%</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">Not reported</td>
          <td>xAI system card; limited independent verification</td>
      </tr>
      <tr>
          <td>7</td>
          <td>DeepSeek-R2</td>
          <td>DeepSeek</td>
          <td style="text-align: center">76.8%</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">~88%</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">Not reported</td>
          <td>Estimated from R2 tech report; reasoning chain</td>
      </tr>
      <tr>
          <td>8</td>
          <td>GPT-4.1</td>
          <td>OpenAI</td>
          <td style="text-align: center">75.0%</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">88.5%</td>
          <td style="text-align: center">96.8%</td>
          <td style="text-align: center">Not reported</td>
          <td>OpenAI system card; non-reasoning baseline</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Qwen 3.5</td>
          <td>Alibaba</td>
          <td style="text-align: center">72.0%</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">85.0%</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">Not reported</td>
          <td>Qwen 3.5 tech report</td>
      </tr>
      <tr>
          <td>10</td>
          <td>DeepSeek V3.2</td>
          <td>DeepSeek</td>
          <td style="text-align: center">71.6%</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">88.3%</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">Not reported</td>
          <td>DeepSeek V3.2 tech report</td>
      </tr>
      <tr>
          <td>11</td>
          <td>Llama 4 Maverick</td>
          <td>Meta</td>
          <td style="text-align: center">69.8%</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">84.5%</td>
          <td style="text-align: center">95.3%</td>
          <td style="text-align: center">Not reported</td>
          <td>Meta Llama 4 tech report</td>
      </tr>
      <tr>
          <td>12</td>
          <td>QwQ-Max</td>
          <td>Alibaba</td>
          <td style="text-align: center">68.3%</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">~83%</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">Not reported</td>
          <td>Alibaba model card</td>
      </tr>
      <tr>
          <td>13</td>
          <td>Phi-4</td>
          <td>Microsoft</td>
          <td style="text-align: center">62.8%</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">82.0%</td>
          <td style="text-align: center">94.1%</td>
          <td style="text-align: center">Not reported</td>
          <td>Microsoft Phi-4 tech report</td>
      </tr>
      <tr>
          <td>14</td>
          <td>Mistral Large 3</td>
          <td>Mistral</td>
          <td style="text-align: center">61.0%</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">78.0%</td>
          <td style="text-align: center">93.4%</td>
          <td style="text-align: center">Not reported</td>
          <td>Mistral model card</td>
      </tr>
      <tr>
          <td>15</td>
          <td>Skywork-OR1</td>
          <td>Skywork</td>
          <td style="text-align: center">57.2%</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">Not reported</td>
          <td style="text-align: center">Not reported</td>
          <td>Limited public reporting</td>
      </tr>
  </tbody>
</table>
<p><em>Rankings ordered primarily by GPQA Diamond, which has the most consistent coverage across models. &quot;Not reported&quot; entries are genuine gaps in public documentation - not omissions on my part. SciBench, OlympiadBench-Sci, and ChemQA/Physics Olympiad scores have sparse coverage across the current generation of frontier models because providers ship new models faster than benchmark evaluations can track them.</em></p>
<p><em>Models where only estimated scores are available are marked with a tilde (~). Estimates derive from interpolation across related benchmarks in official tech reports - not from independent runs.</em></p>
<h2 id="domain-breakdown">Domain Breakdown</h2>
<p>Where the data allows, it helps to separate science reasoning by domain. The pattern that emerges across GPQA Diamond's question categories and OlympiadBench's subject split is consistent: reasoning-optimized models pull ahead most sharply on physics and chemistry, where multi-step formal reasoning is unavoidable. Biology and earth science - which rely more heavily on factual recall and classification - show a smaller gap between reasoning and non-reasoning frontiers.</p>
<h3 id="physics">Physics</h3>
<p>Quantitative physics demands both symbolic reasoning and numerical fluency. A model must understand the structure of a problem (identify relevant equations, recognize symmetry, choose a coordinate system), then execute the algebra without error, and finally interpret the result physically. Reasoning-heavy models that take extended compute at inference time have a significant edge here. On the IPhO subset of OlympiadBench, the gap between the top reasoning models and non-reasoning frontiers is wider than in any other science domain.</p>
<p>SciBench's thermodynamics and quantum mechanics problems align with this finding: models that produce extended chain-of-thought consistently outperform those that do not, even controlling for parameter count. The answer-format brittleness of SciBench (unit mismatches kill otherwise correct solutions) is a real caveat, but the rank ordering is stable.</p>
<h3 id="chemistry">Chemistry</h3>
<p>Chemistry straddles qualitative reasoning (reaction mechanisms, periodicity, molecular geometry) and quantitative reasoning (stoichiometry, equilibrium constants, thermodynamic cycles). GPQA Diamond's chemistry questions skew toward multi-step quantitative problems. This is the domain where chain-of-thought length most predicts accuracy: models that work through each reaction step explicitly perform better than those that shortcut.</p>
<p>ChemQA, where reported, shows a similar tier structure. Frontier reasoning models and frontier non-reasoning models are separated by roughly 8-10 points on quantitative synthesis questions, with the gap narrowing on qualitative mechanism identification.</p>
<h3 id="biology">Biology</h3>
<p>Biology in GPQA Diamond is weighted toward molecular biology, genetics, and biochemistry - areas where conceptual depth matters more than formal manipulation. Non-reasoning frontier models (GPT-4.1, Gemini 2.5 Pro standard, DeepSeek V3.2) come much closer to the reasoning models here. A model that has thoroughly internalized the genetics and biochemistry literature can perform well without extended inference time.</p>
<p>The exception is multi-step experimental design questions, where reasoning chains help significantly. But on straightforward molecular biology recall and classification, the advantage of inference-time compute is smaller than in physics or chemistry.</p>
<h3 id="earth-science-and-interdisciplinary">Earth Science and Interdisciplinary</h3>
<p>Earth science questions in MMLU-STEM (college physical geology, climate science) favor models with broad factual coverage. ARC-Challenge's earth science questions (weather, ecosystems, geological processes) are generally the easiest sub-category for frontier models to handle. The interesting cases are interdisciplinary questions - astrobiology, biogeochemistry, environmental chemistry - where models need to integrate across domains. This is where non-reasoning frontier models show their weakest relative performance.</p>
<h2 id="key-findings">Key Findings</h2>
<p><strong>Reasoning-heavy models dominate the top of GPQA Diamond.</strong> The gap between o3 (87.7%) and a strong non-reasoning frontier model like GPT-4.1 (75.0%) is 12.7 percentage points. This is not noise. It reflects a genuine structural advantage of extended chain-of-thought on problems that require multi-step formal derivation. For applications where STEM reasoning accuracy is critical - scientific research assistance, education, technical documentation - this gap is practically meaningful.</p>
<p><strong>Frontier non-reasoning models close the gap on knowledge-heavy tasks.</strong> On MMLU-STEM and ARC-Challenge, models like GPT-4.1, DeepSeek V3.2, and Gemini 2.5 Pro standard sit within a few points of the reasoning-optimized frontiers. A lot of MMLU-STEM can be answered from memorized factual associations without explicit reasoning chains. For applications that prioritize breadth of science knowledge over deep problem-solving, the cost premium of reasoning models is harder to justify.</p>
<p><strong>Open-weight models are competitive on ARC and MMLU-STEM but trail on GPQA Diamond.</strong> Llama 4 Maverick at 69.8% GPQA Diamond, Phi-4 at 62.8%, and Qwen 3.5 at 72.0% are all credible open-weight results. But none of them are within shouting distance of the 84-87% range occupied by the top closed-source reasoning models on the benchmark that most directly measures expert-level science reasoning. The open-weight ecosystem is improving fast, but the GPQA Diamond gap is real.</p>
<p><strong>Science-specific benchmark coverage is sparse.</strong> Almost every model I surveyed has published MMLU-STEM and ARC-Challenge scores. Very few have published SciBench or OlympiadBench-Sci scores for current-generation models. The benchmark infrastructure has not kept pace with model releases - providers ship models faster than independent evaluation can measure them. This is a problem for the field and a limitation of this leaderboard.</p>
<h2 id="methodology">Methodology</h2>
<p>Scores in this leaderboard are sourced from the following in priority order:</p>
<ol>
<li>Official system cards and technical reports from model providers (OpenAI, Anthropic, Google DeepMind, DeepSeek, Meta, Microsoft, Alibaba, Mistral, xAI)</li>
<li>The benchmark papers themselves, where newer models were evaluated post-publication</li>
<li>Independent evaluation platforms with documented methodology</li>
</ol>
<p>I do not publish scores from social media posts, unverifiable blog posts, or uncited third-party sources. Where a score is marked with ~, it is an estimate from interpolation across benchmarks documented in the relevant tech report - not a number I measured.</p>
<p>Rankings are ordered by GPQA Diamond because it has the most consistent coverage across the model set and the best-validated methodology. Where GPQA Diamond scores are tied within a percentage point, MMLU-STEM average serves as tiebreaker.</p>
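<p>The ordering rule is mechanical enough to express directly. A sketch of that sort, with hypothetical scores rather than the real table, assuming &quot;within a percentage point&quot; means GPQA Diamond scores that round to the same whole number:</p>
<pre><code class="language-python"># (model, gpqa_diamond, mmlu_stem) - hypothetical scores for illustration.
models = [
    ("model-a", 84.9, 91.0),
    ("model-b", 87.7, 92.0),
    ("model-c", 78.4, 89.0),
    ("model-d", 78.2, 86.0),
]

def sort_key(row):
    name, gpqa, mmlu = row
    # Bucket GPQA Diamond to whole points so near-ties collapse,
    # then break the tie with the MMLU-STEM average.
    return (round(gpqa), mmlu)

ranked = sorted(models, key=sort_key, reverse=True)
for rank, (name, gpqa, mmlu) in enumerate(ranked, start=1):
    print(rank, name, gpqa, mmlu)
# model-c and model-d round to the same GPQA bucket, so MMLU-STEM decides.
</code></pre>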
<h2 id="caveats-and-limitations">Caveats and Limitations</h2>
<p><strong>GPQA contamination risk.</strong> The 198 GPQA Diamond questions have been public since late 2023. Any model trained after that date may have seen them during pre-training or instruction tuning. Providers typically claim no deliberate contamination, but it is impossible to audit this fully. GPQA Diamond scores should be interpreted as an upper bound - actual out-of-distribution science reasoning performance may be lower.</p>
<p><strong>SciBench answer-format brittleness.</strong> SciBench requires exact numerical answers in specified units. A model that correctly sets up and solves a thermodynamics problem but expresses the answer in joules when the expected unit is kilojoules will score zero. This creates variance unrelated to reasoning ability. It also means that prompting format and unit-specification in the system prompt can swing SciBench scores by several points - which makes comparing scores across evaluation setups unreliable.</p>
<p><strong>Lab and equipment procedural reasoning is not tested.</strong> None of these benchmarks test whether a model can reason about actual laboratory procedure - titration protocol, spectroscopy interpretation, statistical error analysis in experimental data. GPQA Diamond and SciBench test theoretical and quantitative reasoning. A model that scores well here may still fail on questions that require practical experimental knowledge.</p>
<p><strong>Scientific paper generation is a separate problem.</strong> Scoring well on multiple-choice and short-answer science benchmarks does not mean a model can write accurate, non-hallucinated scientific literature. The <a href="/leaderboards/hallucination-benchmarks-leaderboard/">hallucination benchmarks leaderboard</a> covers factual accuracy in generation more directly. See also the <a href="/leaderboards/reasoning-benchmarks-leaderboard/">general reasoning leaderboard</a> for GPQA Diamond in the context of broader reasoning benchmarks.</p>
<p><strong>OlympiadBench-Sci and ChemQA coverage is sparse.</strong> These two benchmarks have solid papers and methodology behind them, but few current-generation frontier models have been evaluated against them using consistent prompting conditions. I have not fabricated numbers to fill the table - &quot;Not reported&quot; is honest, and the field needs to do better on this front.</p>
<p><strong>Grok 4 Heavy lacks API access.</strong> xAI's Grok 4 Heavy is only available through the Grok web interface and iOS/Android app. Independent benchmarking is limited to what xAI has reported in its own system cards. Treat its scores with appropriate skepticism compared to models that have been independently replicated.</p>
<h2 id="sources">Sources</h2>
<ul>
<li><a href="https://arxiv.org/abs/2311.12022">GPQA: A Graduate-Level Google-Proof Q&amp;A Benchmark - arXiv</a></li>
<li><a href="https://github.com/idavidrein/gpqa">GPQA benchmark repository - GitHub</a></li>
<li><a href="https://arxiv.org/abs/2406.11694">SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of LLMs - arXiv</a></li>
<li><a href="https://github.com/mandyyyyii/scibench">SciBench repository - GitHub</a></li>
<li><a href="https://arxiv.org/abs/2402.14008">OlympiadBench: A Challenging Benchmark for Promoting AGI - arXiv</a></li>
<li><a href="https://github.com/OpenBMB/OlympiadBench">OlympiadBench repository - GitHub</a></li>
<li><a href="https://arxiv.org/abs/2009.03300">MMLU: Measuring Massive Multitask Language Understanding - arXiv</a></li>
<li><a href="https://huggingface.co/datasets/allenai/ai2_arc">ARC Dataset - AllenAI via HuggingFace</a></li>
<li><a href="https://github.com/lupantech/ScienceQA">ScienceQA repository - GitHub</a></li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Leaderboards</category><media:content url="https://awesomeagents.ai/images/leaderboards/scientific-reasoning-llm-leaderboard_hu_f1ab40eb28628f6b.jpg" medium="image" width="1200" height="630"/><media:thumbnail url="https://awesomeagents.ai/images/leaderboards/scientific-reasoning-llm-leaderboard_hu_f1ab40eb28628f6b.jpg" width="1200" height="630"/></item></channel></rss>