Leaderboards

AI Music Generation Leaderboard 2026

AI Music Generation Leaderboard 2026

Ranked benchmarks for AI music generation tools covering FAD, CLAP, MOS listening tests, and MusicCaps evaluation - text-to-music, lyric-to-song, and stem remixing.

Creative Writing LLM Leaderboard 2026

Creative Writing LLM Leaderboard 2026

Rankings of AI models on creative writing quality benchmarks: EQ-Bench Creative Writing v3, Antislop evaluations, and human-preference judging. Which LLMs can actually write?

Edge and Mobile LLM Leaderboard 2026

Edge and Mobile LLM Leaderboard 2026

Rankings of the best LLMs for on-device edge inference - phones, laptops without GPUs, Raspberry Pi, and Jetson - scored by quality benchmarks and real tokens/sec on iPhone, MacBook, and Raspberry Pi 5.

Finance LLM Leaderboard 2026

Finance LLM Leaderboard 2026

Rankings of AI models on financial reasoning benchmarks: FinanceBench, FinQA, TAT-QA, CFA-Bench, and more - where hallucination costs real money.

Legal AI LLM Leaderboard 2026

Legal AI LLM Leaderboard 2026

Rankings of AI models on legal benchmarks - LegalBench, LexGLUE, CaseHOLD, ContractNLI, Bar Exam MBE, and more. Where hallucinated citations already got lawyers sanctioned.

LLM Jailbreak and Red-Team Resistance Leaderboard

LLM Jailbreak and Red-Team Resistance Leaderboard

Rankings of 14 frontier LLMs by adversarial robustness - how well they resist jailbreaks, prompt injection, and harmful-behavior elicitation across HarmBench, AdvBench, StrongREJECT, JailbreakBench, and AgentHarm.

LLM Quantization Impact Leaderboard 2026

LLM Quantization Impact Leaderboard 2026

How much quality do LLMs lose when quantized from BF16 to INT8, Q6, Q5, Q4, Q3, Q2? Per-model delta tables across MMLU, HumanEval, and perplexity, with VRAM and throughput data for every major quantization format.

Medical LLM Leaderboard 2026

Medical LLM Leaderboard 2026

Rankings of AI models on medical QA benchmarks - MedQA USMLE, MedMCQA, PubMedQA, MMLU-Medical, HealthBench, and more. Where a wrong answer has clinical consequences.

Reward Model and LLM-as-Judge Leaderboard

Reward Model and LLM-as-Judge Leaderboard

Rankings of dedicated reward models and frontier LLMs as judges across RewardBench, RewardBench-2, and JudgeBench - benchmarks that measure preference alignment and human agreement.

Robotics Embodied AI Leaderboard 2026

Robotics Embodied AI Leaderboard 2026

Rankings of VLA models and embodied AI systems on real robotics benchmarks: CALVIN, SimplerEnv, LIBERO, RoboCasa, DROID, and real-robot success rates as of April 2026.