Articles Tagged "Benchmarks"

Scientific Reasoning LLM Leaderboard 2026: GPQA Ranks

Rankings of AI models on STEM benchmarks: GPQA Diamond, SciBench, OlympiadBench-Science, MMLU-STEM, ARC-Challenge, and ChemQA/Physics Olympiad as of April 2026.

Structured Output JSON Schema Leaderboard 2026

Rankings of LLMs and constrained decoding frameworks on JSON schema adherence benchmarks including JSONSchemaBench and BFCL v3, covering native APIs and open-source constraint engines.

Summarization LLM Leaderboard 2026: ROUGE and Faithfulness

Rankings of the top LLMs on summarization benchmarks - ROUGE-L, BERTScore, FActScore, and human preference across CNN/DailyMail, XSum, GovReport, QMSum, and BookSum as of April 2026.

SWE-Bench Coding Agent Leaderboard 2026: Claude vs GPT

Rankings of the best LLM-powered software engineering agents on SWE-Bench Verified, with pass rates, pricing, scaffold notes, and methodology - updated April 2026.

MoE Routing, Prompt Gambles, and Where Reasoning Breaks

Three new papers challenge assumptions in MoE routing design, prompt optimization workflows, and LLM reasoning chains - all published this week on arXiv.

Web Agent Benchmarks Leaderboard: Apr 2026

Rankings across WebArena, WebVoyager, BrowseComp, Mind2Web, WorkArena, and WebChoreArena - every verified score for browser-driving AI agents as of April 2026.

Hallucination Benchmarks Leaderboard: April 2026

Rankings of the top AI models on factuality and hallucination benchmarks: TruthfulQA, SimpleQA, FACTS Grounding, Vectara HHEM, HaluEval, HalluLens, and AA-Omniscience as of April 2026.

GLM-5.1 Review: Open-Source Model Tops SWE-Bench Pro

Z.ai's GLM-5.1 is a 754B open-weight model that claims the top spot on SWE-Bench Pro without a single NVIDIA chip - here's how it holds up in practice.

Video Generation Benchmarks Leaderboard 2026

Rankings of AI video generation models across VBench, VBench-2.0, and the Artificial Analysis Video Arena Elo system, covering text-to-video and image-to-video performance.

Function Calling Benchmarks Leaderboard 2026

Rankings of top LLMs on function calling and tool use benchmarks including BFCL v3, tau-bench, ToolBench, and FinTrace as of April 2026.

Best AI Vector Databases 2026 - Full Comparison

A data-driven comparison of 12 vector databases for RAG and AI workloads, with verified pricing, benchmark numbers, and honest trade-off analysis.

Best Open-Source LLM Inference Servers 2026

A benchmark-driven comparison of the top open-source LLM inference servers - vLLM, SGLang, TGI, llama.cpp, TensorRT-LLM, LMDeploy, and more.
