
Medical LLM Leaderboard 2026: MedQA, USMLE, PubMedQA
Rankings of AI models on medical QA benchmarks: MedQA USMLE, MedMCQA, PubMedQA, MMLU-Medical, HealthBench, and more. In this domain, a wrong answer has clinical consequences.


Rankings of dedicated reward models and frontier LLMs as judges across RewardBench, RewardBench-2, and JudgeBench - benchmarks that measure preference alignment and human agreement.

Rankings of AI models on STEM benchmarks: GPQA Diamond, SciBench, OlympiadBench-Science, MMLU-STEM, ARC-Challenge, and ChemQA/Physics Olympiad as of April 2026.

Rankings of LLMs and constrained decoding frameworks on JSON schema adherence benchmarks including JSONSchemaBench and BFCL v3, covering native APIs and open-source constraint engines.

Rankings of the top LLMs on summarization benchmarks - ROUGE-L, BERTScore, FActScore, and human preference across CNN/DailyMail, XSum, GovReport, QMSum, and BookSum as of April 2026.

Rankings of the best LLMs and agent pipelines on BIRD, Spider 2.0, CoSQL, and SParC text-to-SQL benchmarks, with execution accuracy scores and analysis.

Three new papers challenge assumptions in MoE routing design, prompt optimization workflows, and LLM reasoning chains - all published this week on arXiv.

Rankings of the top AI models on factuality and hallucination benchmarks: TruthfulQA, SimpleQA, FACTS Grounding, Vectara HHEM, HaluEval, HalluLens, and AA-Omniscience as of April 2026.

Rankings of top LLMs on function calling and tool use benchmarks including BFCL v3, tau-bench, ToolBench, and FinTrace as of April 2026.

OpenAI launched GPT-Rosalind on April 16, a frontier reasoning model for drug discovery that outranked human experts on RNA prediction and competes directly with Google DeepMind's AlphaFold.

Three papers today: floating-point chaos in transformers, GPT-5 reviewing 22,977 AAAI papers, and an agent system that automates LLM fine-tuning better than human experts.

Arcee Trinity-Large-Thinking is a 400B sparse MoE open-source reasoning model that ranks #2 on PinchBench at $0.85/M output tokens, 28x cheaper than Claude Opus 4.6.