
Web Agent Benchmarks Leaderboard: Apr 2026
Rankings across WebArena, WebVoyager, BrowseComp, Mind2Web, WorkArena, and WebChoreArena - every verified score for browser-driving AI agents as of April 2026.

OpenAI's April 16 Codex update adds background computer use on Mac, an Atlas-based in-app browser, gpt-image-1.5 image generation, and 111 new plugins - moving the app far beyond agentic coding.

Z.ai's GLM-5.1 is a 754B open-weight model that claims the top spot on SWE-Bench Pro without a single NVIDIA chip - here's how it holds up in practice.

Rankings of top LLMs on function calling and tool use benchmarks including BFCL v3, tau-bench, ToolBench, and FinTrace as of April 2026.

OpenAI's April 2026 Agents SDK update adds sandboxed execution environments and three-layer guardrails for enterprises building long-horizon agentic systems.

Microsoft released the Agent Governance Toolkit, a seven-package open-source system that enforces policies on autonomous AI agents at sub-millisecond latency and covers all 10 OWASP agentic risks.

Researchers from Google DeepMind, Microsoft, and Columbia propose financial guardrails for AI agents, with simulations showing up to 61% reduction in user losses.

Anthropic released Claude Managed Agents in public beta today, a fully managed platform that handles sandboxing, state, and tool execution so developers can skip building agent infrastructure from scratch.

Claude Opus 4.6 leads SWE-bench Verified at 80.8% and OSWorld at 72.7% for agentic tasks, while GPT-5.4 ties for computer use; no single model dominates every workflow type.

Z.ai's GLM-5.1 scores 58.4 on SWE-bench Pro, edging out GPT-5.4 and Claude Opus 4.6, after being trained on 100,000 Huawei Ascend chips with no US silicon.

Cursor's ground-up IDE rebuild ships parallel agent orchestration, Design Mode for frontend work, and cloud-to-local session handoff - all in one unified workspace.

Claude Sonnet 4.6 and GPT-5.4 cost nearly the same per token but win on opposite benchmarks. Here is where each model leads and which to pick for your workload.