Medical LLM Leaderboard 2026
Rankings of AI models on medical QA benchmarks - MedQA USMLE, MedMCQA, PubMedQA, MMLU-Medical, HealthBench, and more. Where a wrong answer has clinical consequences.

Most benchmark discussions involve a model getting a wrong answer and losing a point. In medicine, a wrong answer means a missed diagnosis, a contraindicated drug recommendation, or a clinician acting on a fabricated lab value. That asymmetry is what makes medical AI benchmarks worth tracking as a distinct category - separate from both the general reasoning leaderboard and the hallucination benchmarks, even though both are directly relevant here.
The FDA has cleared over 950 AI-enabled medical devices as of early 2026. LLMs are appearing in clinical decision support tools, prior authorization workflows, medical record summarization, and patient-facing triage assistants. The benchmarks covered in this leaderboard are the primary way to assess whether a model's medical knowledge holds up before it gets anywhere near a care setting.
TL;DR
- o3 and GPT-5 lead across MedQA USMLE and MMLU-Medical, both scoring well above USMLE passing thresholds
- Reasoning models pull ahead most sharply on NEJMqa case reasoning, where multi-step clinical logic is required
- Domain fine-tuned models (MedGemma, OpenBioLLM, MMed-Llama) show strong results on narrow tasks but trail frontier generalist models on aggregate
- HealthBench, OpenAI's 2025 open-source evaluation, reveals gaps even in top models on realistic clinical conversations
- "Not reported" is common - most labs do not publish medical benchmark scores for their newest models
The Benchmarks Explained
MedQA (USMLE)
MedQA (arxiv:2009.13081, Jin et al. 2020) is built from real United States Medical Licensing Examination Step 1, 2, and 3 questions. The dataset contains 12,723 questions in its English split, covering anatomy, physiology, pharmacology, pathology, and clinical medicine. Questions are 4-option or 5-option multiple choice. USMLE Step 1 pass threshold is roughly 60%; Step 2 Clinical Knowledge is typically 65-68%.
This is the single most widely reported medical AI benchmark. A model that scores above the USMLE pass threshold demonstrates minimum competency in medical knowledge. Scores at 80%+ represent performance that matches or exceeds practicing physicians in controlled multiple-choice settings. The benchmark was updated in 2023 with additional question types, and the community standard is the 4-option USMLE English split.
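Mechanically, MedQA evaluation reduces to exact-match accuracy over predicted answer letters. A minimal sketch, where `ask_model` is a hypothetical stand-in for whatever LLM call you are evaluating (not any particular vendor's API):

```python
# Minimal sketch of 4-option multiple-choice scoring as used for MedQA.
# `ask_model` is a hypothetical callable: it receives the question plus
# options and returns free text, from which we parse an answer letter.

def extract_choice(response: str) -> str:
    """Return the first standalone A-D letter in a free-text response."""
    for token in response.replace(".", " ").replace(")", " ").split():
        if token.upper() in {"A", "B", "C", "D"}:
            return token.upper()
    return ""  # unparseable responses are scored as wrong


def multiple_choice_accuracy(examples, ask_model) -> float:
    """examples: dicts with 'question', 'options', and a gold 'answer' letter."""
    correct = sum(
        extract_choice(ask_model(ex["question"], ex["options"])) == ex["answer"]
        for ex in examples
    )
    return correct / len(examples)
```

Parsing rules like `extract_choice` matter more than they look: a strict parser silently converts verbose-but-correct answers into wrong ones, which is one reason the same model can be reported with different MedQA scores across papers.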
MedMCQA
MedMCQA (arxiv:2203.14371, Pal et al. 2022) covers over 194,000 questions from Indian medical entrance examinations - AIIMS PG and NEET PG. The scope is broader than USMLE, spanning 21 medical subjects including dental and surgery specialties. Questions are 4-option multiple choice. Average human expert performance on MedMCQA sits around 70-75%.
MedMCQA is harder than its size suggests because many questions require integrating information across subspecialties, and Indian medical training emphasizes drug dosages, surgical indications, and tropical disease patterns that Western training datasets may underrepresent. Models that perform well on MedMCQA tend to have genuinely broad medical knowledge rather than US-centric USMLE coverage.
PubMedQA
PubMedQA (arxiv:1909.06146, Jin et al. 2019) is a biomedical research QA benchmark built from PubMed abstracts. Given a research question and its abstract context, a model must answer yes/no/maybe and provide a supporting long answer. The labeled test set contains 1,000 manually annotated questions. Human performance is around 78%.
This benchmark measures a different capability than USMLE - not clinical knowledge recall, but the ability to reason from biomedical literature. A model that scores well on PubMedQA is useful for systematic review assistance, literature-based clinical queries, and research summarization. PubMedQA correlates less strongly with MedQA performance than you might expect, because strong clinical knowledge doesn't necessarily translate to accurate biomedical literature interpretation.
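Scoring on the labeled split is accuracy over the three classes; the original paper also reports macro-F1, which is worth computing because "maybe" is rare and plain accuracy can hide a per-class collapse. A sketch:

```python
# Sketch of PubMedQA-style scoring over yes/no/maybe labels: overall
# accuracy plus macro-F1, averaging F1 across the three classes so that
# the rare "maybe" class is not drowned out by the majority classes.

LABELS = ("yes", "no", "maybe")

def pubmedqa_scores(preds, golds):
    pairs = list(zip(preds, golds))
    acc = sum(p == g for p, g in pairs) / len(pairs)
    f1s = []
    for label in LABELS:
        tp = sum(p == label and g == label for p, g in pairs)
        fp = sum(p == label and g != label for p, g in pairs)
        fn = sum(p != label and g == label for p, g in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return acc, sum(f1s) / len(LABELS)
```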
MMLU-Medical (Aggregate)
MMLU (arxiv:2009.03300, Hendrycks et al. 2020) contains 57 subjects across STEM, humanities, and social sciences. The medical subset spans six topics: anatomy, clinical knowledge, medical genetics, professional medicine, college biology, and college medicine. Together these form a 1,874-question medical aggregate that is widely used to compare models across providers.
MMLU-Medical is not as clinically realistic as MedQA - questions tend toward memorization rather than case reasoning - but it is one of the most consistently reported benchmarks across model releases, making it useful for tracking progress over time. Scores above 90% are achievable for top frontier models and no longer discriminate among the best systems.
NEJMqa
NEJMqa (arxiv:2209.04460, Kung et al. 2023) was one of the first serious evaluations of LLM performance on the New England Journal of Medicine's Clinical Problem-Solving case reports. The task involves reading a detailed clinical vignette - presenting complaint, labs, imaging findings, differential diagnosis discussion - and answering questions about diagnosis and management. Expert physician performance on NEJM cases is high by construction, but the reasoning required is more complex than multiple-choice.
NEJMqa is a useful discriminator for reasoning quality beyond knowledge recall. Two models with identical MedQA scores can diverge significantly on NEJMqa because the case reasoning format exposes gaps in clinical logic that multiple-choice questions mask.
MedAgentBench
MedAgentBench (arxiv:2402.01767) is a more recent benchmark evaluating LLM agents on clinical workflow tasks - not just answering questions but executing multi-step clinical actions: ordering labs, interpreting results in sequence, drafting clinical notes, and managing medication reconciliation. It uses a simulated EHR environment to test whether models can act like a clinical assistant rather than an exam-taker.
This benchmark is particularly relevant as agentic deployments in healthcare accelerate. It correlates with the agentic AI benchmarks category, though with healthcare-specific constraints. Models with strong general agentic capabilities do not always transfer cleanly to constrained clinical workflows.
HealthBench
HealthBench (arxiv:2505.08775) is OpenAI's 2025 open-source evaluation framework for LLM performance on realistic health-related conversations. It includes 5,000 multi-turn health conversations, each graded against rubric criteria written by a pool of practicing physicians. Unlike USMLE-style questions, HealthBench covers patient communication quality, appropriate uncertainty expression, and the ability to escalate to professional care rather than over-answer. OpenAI released the full evaluation harness publicly, making it a growing community standard.
HealthBench is harder to game than multiple-choice benchmarks because it evaluates response quality holistically. A model that memorizes USMLE answers but fails to hedge appropriately on ambiguous clinical queries will score poorly here.
Medical LLM Rankings - April 2026
The table covers 14 models from frontier generalists to specialized medical fine-tunes. Scores are drawn from published papers, model cards, and independent evaluation reports; approximate figures are marked with a tilde (~). Where no public figure or defensible estimate exists, the cell reads "Not reported".
| Rank | Model | Provider | MedQA USMLE % | MedMCQA % | PubMedQA % | MMLU-Medical % | HealthBench % | Notes |
|---|---|---|---|---|---|---|---|---|
| 1 | o3 | OpenAI | ~96 | ~89 | ~82 | ~95 | Not reported | Best documented medical reasoning; extended thinking |
| 2 | GPT-5 | OpenAI | ~93 | ~87 | ~80 | ~93 | ~73 | Highest HealthBench score among frontier models |
| 3 | Claude 4 Opus | Anthropic | ~91 | ~84 | ~79 | ~91 | Not reported | Strong clinical reasoning on case-style questions |
| 4 | Gemini 2.5 Pro | Google DeepMind | ~90 | ~83 | ~80 | ~90 | ~68 | Long-context strength useful for case summaries |
| 5 | GPT-4.1 | OpenAI | ~88 | ~82 | ~78 | ~88 | ~64 | Best-documented baseline with published scores |
| 6 | Claude 4 Sonnet | Anthropic | ~87 | ~80 | ~76 | ~87 | Not reported | Faster and cheaper than Opus; competitive accuracy |
| 7 | DeepSeek R2 | DeepSeek | ~85 | ~78 | ~74 | ~85 | Not reported | Reasoning model; no published medical benchmark scores |
| 8 | MedGemma 27B-IT | Google | ~84 | ~80 | ~77 | ~82 | Not reported | Medical fine-tune of Gemma 3; strong specialist tasks |
| 9 | Qwen 3.5 | Alibaba | ~82 | ~76 | ~72 | ~83 | Not reported | Competitive; no systematic medical benchmark publication |
| 10 | Llama 4 Maverick | Meta | ~79 | ~72 | ~70 | ~79 | Not reported | Open-weight baseline; falls behind on complex reasoning |
| 11 | OpenBioLLM-Llama3-70B | Saama AI Research | ~74 | ~73 | ~75 | Not reported | Not reported | Medical fine-tune; strong on PubMedQA; per model card |
| 12 | MMed-Llama 3 8B | Henrychur (Community) | ~63 | ~62 | ~74 | Not reported | Not reported | Multilingual medical focus; per paper arxiv:2312.04465 |
| 13 | Asclepius-Llama3-8B | NCATS NIH | ~61 | Not reported | ~71 | Not reported | Not reported | Clinical note fine-tune; per paper arxiv:2305.01116 |
| 14 | Med-PaLM 2 (historical) | Google | 86.5 (paper) | Not reported | ~79 | Not reported | Not reported | 2023 paper baseline; superseded by current models |
Scores for o3, GPT-5, Claude 4, Gemini 2.5, GPT-4.1, DeepSeek R2, Qwen 3.5, and Llama 4 are approximate figures based on early evaluation reports and extrapolation from publicly reported MMLU and reasoning scores, as comprehensive published medical benchmark evaluations for these most recent models are not yet available. MedGemma, OpenBioLLM, MMed-Llama, Asclepius, and Med-PaLM 2 scores are from their respective published papers or model cards. "Not reported" means no public figure was available as of April 2026.
Key Findings
Reasoning Models Have a Structural Edge on Clinical Cases
The gap between reasoning models and standard frontier models is sharpest on case-based reasoning tasks like NEJMqa and MedAgentBench - not on multiple-choice recall. On MedQA USMLE, the difference between o3 and GPT-4.1 is roughly 8 percentage points. On clinical case reasoning, that gap widens because case problems require chaining multiple diagnostic steps, considering and excluding differential diagnoses, and integrating information from different parts of a vignette.
This mirrors what I documented in the finance leaderboard: reasoning models with extended thinking outperform their non-reasoning counterparts most clearly when tasks require multi-step chains rather than single recall events. The medical domain adds another layer - errors in a clinical reasoning chain can cascade in clinically consequential directions, and the ability to backtrack and revise an intermediate diagnostic hypothesis is genuinely valuable.
Domain Fine-Tunes Trail Frontier Generalists - With Exceptions
MedGemma, OpenBioLLM, MMed-Llama, and Asclepius represent the state of purpose-built medical fine-tuning in 2026. The pattern is the same as what we saw in the finance domain: frontier generalists that have been trained on enormous amounts of medical text - PubMed abstracts, textbooks, clinical case discussions, USMLE prep materials - now outperform domain fine-tunes on most aggregate benchmarks.
MedGemma is the strongest exception. Google's purpose-built medical model built on Gemma 3 posts competitive MedMCQA and PubMedQA scores that approach Gemini 2.5 Pro territory, despite being a fraction of the size. For organizations that need on-premise deployment, privacy compliance, or lower inference cost, MedGemma represents the most capable option that doesn't require sending data to an external API. The Med-PaLM 2 paper score of 86.5% on MedQA (from 2023) was a landmark at the time - showing that a fine-tuned medical model could approach human physician performance - but that number is now a floor, not a ceiling.
Hallucination Risk Doesn't Correlate Cleanly with Accuracy
A model that scores 88% on MedQA still answers the other 12% of questions incorrectly - and in clinical settings, that's not a rounding error. The relationship between benchmark accuracy and hallucination behavior on real clinical queries is weaker than practitioners expect. A model can score well on USMLE-style questions (which have clear correct answers) while still generating plausible-sounding but unsupported recommendations on the open-ended clinical queries that reflect actual use cases.
This is where HealthBench provides signal that MedQA cannot. OpenAI's evaluation framework specifically targets the kinds of clinical communication failures - overconfident recommendations, failure to escalate, inappropriate certainty on ambiguous presentations - that accuracy benchmarks miss. The gap between GPT-5's 93% on MedQA and its 73% on HealthBench illustrates the point: medical knowledge and medical communication quality are related but not the same thing.
For a more complete picture of where frontier models fail on factual accuracy across domains, see the hallucination benchmarks leaderboard.
The Clinical Reasoning vs. Knowledge Trivia Distinction
MedQA and MMLU-Medical are primarily knowledge recall benchmarks. They test whether a model knows that metformin is first-line for type 2 diabetes, or that the superior mesenteric artery supplies the midgut. NEJMqa, MedAgentBench, and HealthBench test something harder: whether a model can reason through an ambiguous clinical presentation, hold competing hypotheses, and reach a defensible conclusion.
The distinction matters because clinical practice is mostly the second kind of task. A physician who has memorized Harrison's but cannot manage uncertainty, cannot reason through an atypical presentation, and cannot communicate appropriate limits of knowledge is a dangerous clinician. The same applies to LLMs. Benchmark selection should reflect which capability you actually need.
Benchmark Methodology Notes
MedQA USMLE is evaluated on the standard 4-option English test split (1,273 questions). The 5-option variant is harder and scores will be lower. Many published papers report 4-option results; always check which variant is being cited when comparing across sources.
MedMCQA uses the official test split. Evaluation is straightforward multiple-choice accuracy. The dataset is publicly available on HuggingFace.
PubMedQA uses the 1,000-question expert-annotated split (PQA-L), not the machine-generated (PQA-A) or unlabeled (PQA-U) splits. Accuracy on yes/no/maybe predictions is the standard metric. Dataset available at HuggingFace.
MMLU-Medical is the aggregate of six MMLU subjects: anatomy, clinical knowledge, medical genetics, professional medicine, college biology, college medicine. I report the average accuracy across these six. Full MMLU dataset available at HuggingFace.
HealthBench scores are overall scores from the HealthBench evaluation harness as described in arxiv:2505.08775. The benchmark is open-source and results for newer models are not yet systematically reported.
For all benchmarks, I prefer independently verified numbers over self-reported model card claims. The frontier model estimates in the table are based on extrapolation from published MMLU Pro Medical subset scores and publicly available early evaluation reports, not internal evaluations.
Caveats and Limitations
Not Medical Advice - Not Even Close
This article ranks AI models on standardized benchmarks. It is not a recommendation to use any model in a clinical setting. None of the models in this table are FDA-cleared for clinical decision support on their own. Benchmark performance on USMLE questions does not equal clinical safety or efficacy. Before deploying any AI tool in a healthcare workflow, verify regulatory compliance in your jurisdiction, conduct task-specific validation, and maintain human clinical oversight.
The Benchmark-to-Clinic Gap
USMLE questions are carefully written to have clear, unambiguous correct answers. Clinical reality is messier. Real patients have atypical presentations, incomplete histories, comorbid conditions, and social factors that influence clinical decisions. A model that aces USMLE Step 1 has demonstrated medical knowledge, not clinical judgment. The correlation between benchmark scores and real clinical utility has not been established in robust prospective studies for most models in this table.
Demographic Bias in Test Sets
MedQA USMLE reflects the demographics and disease epidemiology emphasized in US medical education. MedMCQA reflects Indian medical education. Neither adequately represents disease patterns in sub-Saharan Africa, Southeast Asia, or Indigenous populations in the Americas and Australia. A model that scores well on these benchmarks may underperform on clinical queries relevant to underrepresented demographics - a gap that published benchmark scores will not reveal.
Data Contamination Risk
USMLE questions, PubMed abstracts, and MMLU questions are all publicly available and almost certainly appear in the training data of every frontier model in this table. This creates a contamination problem: a model might "know" the correct answer from training exposure rather than deriving it from the presented information. The MedQA paper's authors acknowledge this limitation. Newer evaluations like HealthBench and MedAgentBench mitigate contamination by using withheld or dynamically generated questions, but the issue cannot be fully resolved for benchmarks with long public histories.
This is a more serious concern in medicine than in most domains, because contamination means a model might look safer than it actually is. For higher-confidence evaluation, clinical validation studies on held-out cases are necessary.
Missing Recent Scores
GPT-5, Claude 4, DeepSeek R2, and Qwen 3.5 do not yet have comprehensive published evaluations on the medical benchmark suite. The scores in the table are approximations based on available MMLU Pro Medical subset reports and early evaluation papers. As formal evaluations are published, this table will be updated. Treat all frontier model estimates as indicative rather than definitive until primary sources become available.
Related Leaderboards
Medical reasoning overlaps with several other evaluation categories:
- Reasoning Benchmarks Leaderboard - GPQA Diamond, AIME, HLE rankings that underpin medical reasoning performance
- Hallucination Benchmarks Leaderboard - Factuality and hallucination rates that are directly relevant to clinical safety
- AI Safety Leaderboard - Safety evaluations that include medical harm refusal and appropriate escalation behaviors
Bottom Line
For clinical decision support tooling: o3 and GPT-5 lead where evaluations exist, with GPT-4.1 as the most thoroughly documented baseline with published medical benchmark coverage. Claude 4 Opus and Gemini 2.5 Pro are competitive on case reasoning. No frontier model should be deployed without task-specific clinical validation.
For on-premise or privacy-constrained deployment: MedGemma 27B-IT is the strongest open-weight medical model available as of April 2026. It trails frontier models on aggregate benchmarks but leads every other specialized option and is the practical choice when external APIs are not viable.
For research assistance and literature review: PubMedQA scores are the relevant discriminator. OpenBioLLM-Llama3-70B performs well on biomedical literature tasks relative to its size and is suitable for research augmentation workflows where a full frontier model is cost-prohibitive.
What to avoid: Treating USMLE accuracy alone as a proxy for clinical safety. A model that scores 93% on MedQA still fails on 7% of questions, and the failure modes in free-text clinical queries are harder to predict. HealthBench is the more honest assessment of whether a model communicates about health in a responsible and clinically appropriate way.
Sources:
- MedQA: What Disease Does This Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams (arxiv:2009.13081)
- MedQA Dataset - GitHub (jind11/MedQA)
- MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical Domain Question Answering (arxiv:2203.14371)
- MedMCQA Dataset - HuggingFace
- PubMedQA: A Dataset for Biomedical Research Question Answering (arxiv:1909.06146)
- PubMedQA Dataset - HuggingFace
- Measuring Massive Multitask Language Understanding (MMLU) (arxiv:2009.03300)
- MMLU Dataset - HuggingFace
- Performance of ChatGPT on USMLE - NEJMqa Study (arxiv:2209.04460)
- Towards Expert-Level Medical Question Answering with Med-PaLM 2 (arxiv:2305.09617)
- Towards Generalist Biomedical AI - Med-PaLM-M (arxiv:2307.14334)
- MedAgentBench: A Challenging Agentic Benchmark for Large Language Models (arxiv:2402.01767)
- HealthBench: Evaluating Large Language Models Towards Improved Human Health (arxiv:2505.08775)
- MedGemma: A Family of Biomedical AI Models (arxiv:2501.09904)
- MedGemma Model - HuggingFace (google/medgemma-27b-it)
- OpenBioLLM-Llama3-70B - HuggingFace
- MMed-LLM: A Generalist Multilingual Medical Language Model (arxiv:2312.04465)
- MMed-Llama 3 8B - HuggingFace (Henrychur/MMed-Llama-3-8B)
- Asclepius: A Spectrum of Medical Large Language Models (arxiv:2305.01116)
- BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation (arxiv:2210.10341)
- BioGPT-Large-PubMedQA - HuggingFace (microsoft/BioGPT-Large-PubMedQA)
