Leaderboards

Edge and Mobile LLM Leaderboard 2026: Phi, Gemma, Qwen

Rankings of the best LLMs for on-device edge inference - phones, laptops without GPUs, Raspberry Pi, and Jetson - scored by quality benchmarks and real tokens/sec on iPhone, MacBook, and Raspberry Pi 5.

Finance LLM Leaderboard 2026: FinBench Scores Ranked

Rankings of AI models on financial reasoning benchmarks: FinanceBench, FinQA, TAT-QA, CFA-Bench, and more - where hallucination costs real money.

Legal AI LLM Leaderboard 2026: LegalBench and CaseHOLD

Rankings of AI models on legal benchmarks - LegalBench, LexGLUE, CaseHOLD, ContractNLI, Bar Exam MBE, and more - where hallucinated citations have already gotten lawyers sanctioned.

LLM Code Review Leaderboard - Benchmarks and Rankings

Rankings of the best LLMs and AI agents at automated code review - spotting bugs in diffs, commenting on PRs, and surfacing non-obvious issues across CodeReviewer, CR-Bench, and real-world evaluations.

LLM Jailbreak and Red-Team Resistance Leaderboard

Rankings of 14 frontier LLMs by adversarial robustness - how well they resist jailbreaks, prompt injection, and harmful-behavior elicitation across HarmBench, AdvBench, StrongREJECT, JailbreakBench, and AgentHarm.

LLM Quantization Impact Leaderboard 2026: INT4 vs FP16

How much quality do LLMs lose when quantized from BF16 to INT8, Q6, Q5, Q4, Q3, or Q2? Per-model delta tables across MMLU, HumanEval, and perplexity, with VRAM and throughput data for every major quantization format.

Medical LLM Leaderboard 2026: MedQA, USMLE, PubMedQA

Rankings of AI models on medical QA benchmarks - MedQA USMLE, MedMCQA, PubMedQA, MMLU-Medical, HealthBench, and more - where a wrong answer has clinical consequences.

OCR and Document AI Leaderboard 2026: Top Models Ranked

Rankings of AI models on OCR and document understanding benchmarks - OCRBench, DocVQA, InfographicVQA, ChartQA, TextVQA, and MMMU-Pro. Covers GPT-4.1 Vision, Claude 4 Sonnet/Opus, Gemini 2.5 Pro, Qwen2.5-VL, InternVL3, Mistral OCR, and more.

Reward Model and LLM-as-Judge Leaderboard 2026 Ranked

Rankings of dedicated reward models and frontier LLMs as judges across RewardBench, RewardBench-2, and JudgeBench - benchmarks that measure preference alignment and human agreement.

Robotics Embodied AI Leaderboard 2026: VLA Models Ranked

Rankings of VLA models and embodied AI systems on robotics benchmarks: CALVIN, SimplerEnv, LIBERO, RoboCasa, DROID, and real-robot success rates as of April 2026.

Scientific Reasoning LLM Leaderboard 2026: GPQA Ranks

Rankings of AI models on STEM benchmarks: GPQA Diamond, SciBench, OlympiadBench-Science, MMLU-STEM, ARC-Challenge, and ChemQA/Physics Olympiad as of April 2026.

Structured Output JSON Schema Leaderboard 2026

Rankings of LLMs and constrained decoding frameworks on JSON schema adherence benchmarks including JSONSchemaBench and BFCL v3, covering native APIs and open-source constraint engines.
