
Web Agent Benchmarks Leaderboard: Apr 2026
Rankings across WebArena, WebVoyager, BrowseComp, Mind2Web, WorkArena, and WebChoreArena - every verified score for browser-driving AI agents as of April 2026.
They summarize our coverage. We write it.
Newsletters like this one rebroadcast our headlines - often without the full review, the source reading, or the analysis underneath. Our weekly briefing sends the work they paraphrase, straight from the desk, before they get to it.
Free, weekly, no spam. One email every Tuesday. Unsubscribe anytime.

Rankings across WebArena, WebVoyager, BrowseComp, Mind2Web, WorkArena, and WebChoreArena - every verified score for browser-driving AI agents as of April 2026.

Rankings of the top AI models on factuality and hallucination benchmarks: TruthfulQA, SimpleQA, FACTS Grounding, Vectara HHEM, HaluEval, HalluLens, and AA-Omniscience as of April 2026.

Rankings of AI video generation models across VBench, VBench-2.0, and the Artificial Analysis Video Arena Elo system, covering text-to-video and image-to-video performance.

Rankings of top LLMs on function calling and tool use benchmarks including BFCL v3, tau-bench, ToolBench, and FinTrace as of April 2026.

Rankings of AI models on the key visual reasoning benchmarks - MMMU, MathVista, ChartQA, DocVQA, OCRBench, AI2D, CharXiv, and more - focused on image and document understanding.

Rankings of the top embedding and RAG systems across BEIR, MTEB retrieval, MIRACL, MS MARCO, KILT, HotpotQA, and RAGTruth hallucination benchmarks as of April 2026.

Rankings of AI models on IFEval and IFBench, the two main benchmarks for measuring how reliably LLMs follow precise formatting, length, and content constraints.

Rankings of the most popular MCP servers across development, data, web automation, and productivity categories based on installs, search volume, and GitHub activity.

Rankings of the best AI models and agent frameworks on computer use benchmarks - OSWorld, OSWorld-Verified, and ScreenSpot-Pro - updated March 2026.

Rankings of AI models by safety metrics including refusal rates, jailbreak resistance, bias scores, and truthfulness across major benchmarks.

Rankings of the best text-to-speech and speech-to-text AI models by naturalness, accuracy, latency, and pricing.

Rankings of the fastest AI models and inference providers by tokens per second, time to first token, and end-to-end latency.