Leaderboards Articles

Summarization LLM Leaderboard 2026: ROUGE and Faithfulness

Rankings of the top LLMs on summarization benchmarks - ROUGE-L, BERTScore, FActScore, and human preference across CNN/DailyMail, XSum, GovReport, QMSum, and BookSum as of April 2026.

SWE-Bench Coding Agent Leaderboard 2026: Claude vs GPT

Rankings of the best LLM-powered software engineering agents on SWE-Bench Verified, with pass rates, pricing, scaffold notes, and methodology - updated April 2026.

Text-to-SQL LLM Leaderboard 2026: Spider and BIRD Ranked

Rankings of the best LLMs and agent pipelines on BIRD, Spider 2.0, CoSQL, and SParC text-to-SQL benchmarks, with execution accuracy scores and analysis.

Web Agent Benchmarks Leaderboard: Apr 2026

Rankings across WebArena, WebVoyager, BrowseComp, Mind2Web, WorkArena, and WebChoreArena - every verified score for browser-driving AI agents as of April 2026.

Hallucination Benchmarks Leaderboard: April 2026

Rankings of the top AI models on factuality and hallucination benchmarks: TruthfulQA, SimpleQA, FACTS Grounding, Vectara HHEM, HaluEval, HalluLens, and AA-Omniscience as of April 2026.

Video Generation Benchmarks Leaderboard 2026

Rankings of AI video generation models across VBench, VBench-2.0, and the Artificial Analysis Video Arena Elo system, covering text-to-video and image-to-video performance.

Function Calling Benchmarks Leaderboard 2026

Rankings of top LLMs on function calling and tool use benchmarks including BFCL v3, tau-bench, ToolBench, and FinTrace as of April 2026.

Vision-Language Benchmarks: Image Reasoning Ranked

Rankings of AI models on the key visual reasoning benchmarks - MMMU, MathVista, ChartQA, DocVQA, OCRBench, AI2D, CharXiv, and more - focused on image and document understanding.

RAG Benchmarks Leaderboard: Retrieval Rankings 2026

Rankings of the top embedding and RAG systems across BEIR, MTEB retrieval, MIRACL, MS MARCO, KILT, HotpotQA, and RAGTruth hallucination benchmarks as of April 2026.

Instruction Following Leaderboard: IFEval Rankings 2026

Rankings of AI models on IFEval and IFBench, the two main benchmarks for measuring how reliably LLMs follow precise formatting, length, and content constraints.

MCP Server Ecosystem Leaderboard - Top Servers Ranked

Rankings of the most popular MCP servers across development, data, web automation, and productivity categories based on installs, search volume, and GitHub activity.

Computer Use Leaderboard: Desktop AI Agent Rankings

Rankings of the best AI models and agent frameworks on computer use benchmarks - OSWorld, OSWorld-Verified, and ScreenSpot-Pro - updated March 2026.

← Previous