AgentPerf - First Infrastructure Benchmark for Agents
Artificial Analysis released AgentPerf, the first agentic AI infrastructure benchmark, measuring concurrent agents per megawatt. NVIDIA Blackwell leads with 20x gains over Hopper.

Most inference benchmarks in use today measure the same thing: tokens per second on a single request. That number is useful for chatbots. It's nearly useless for coding agents, which chain dozens to hundreds of LLM calls in a single session, build up context across turns, and stall repeatedly while waiting on tool call results. On June 12, Artificial Analysis published AA-AgentPerf, the first benchmark built specifically to assess infrastructure under agentic workloads. The first published results put NVIDIA's Blackwell platform roughly 20x ahead of Hopper, its own previous generation.
TL;DR
- Artificial Analysis launched AA-AgentPerf on June 12 - the first benchmark measuring concurrent agent capacity under real SLA constraints
- Lead metric is Agents per Megawatt; NVIDIA GB300 NVL72 scored 61,400 vs H200's 2,600
- Uses real coding-agent trajectories from public repos across 12+ languages, not synthetic workloads
- Only NVIDIA hardware has been tested so far; other vendors are invited to submit results
- The benchmark currently covers only DeepSeek V4 Pro as the test model
| System | Agents/MW | Agents/GPU | Architecture |
|---|---|---|---|
| NVIDIA GB300 NVL72 | 61,400 | 57.5 | Blackwell, 72-GPU NVLink rack |
| NVIDIA HGX H200 | 2,600 | 1.4 | Hopper, SXM5 |
The 20x gap per megawatt is the headline. For data centers paying energy costs that increasingly define total AI infrastructure spend, that ratio matters more than raw throughput.
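The headline ratio can be checked directly against the table. The following is a minimal sketch that only divides the published figures; the dictionary layout and function name are illustrative, not part of the benchmark.

```python
# Figures copied from the results table above.
results = {
    "GB300 NVL72": {"agents_per_mw": 61_400, "agents_per_gpu": 57.5},
    "HGX H200": {"agents_per_mw": 2_600, "agents_per_gpu": 1.4},
}


def blackwell_to_hopper(metric: str) -> float:
    """Ratio of the GB300 NVL72 figure to the HGX H200 figure for one column."""
    return results["GB300 NVL72"][metric] / results["HGX H200"][metric]


per_mw_ratio = blackwell_to_hopper("agents_per_mw")
per_gpu_ratio = blackwell_to_hopper("agents_per_gpu")

print(f"Agents/MW ratio:  {per_mw_ratio:.1f}x")
print(f"Agents/GPU ratio: {per_gpu_ratio:.1f}x")
```

On these figures the per-megawatt ratio works out to about 23.6x, so "20x" is a rounded-down headline, and the per-GPU ratio is larger still at about 41x.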
Why Existing Benchmarks Miss Agents
Standard LLM inference benchmarks were designed for stateless, single-turn requests. They measure time-to-first-token and output token rate on short, independent prompts. That captures almost nothing of what a deployed coding agent actually does.
A real agentic session looks nothing like a benchmark request. The agent reads files, writes diffs, runs commands, gets errors, re-reads the file, and issues another tool call. Artificial Analysis measured real trajectories: sessions extending up to 200 turns, with sequences that reach 100,000 tokens, and input lengths ranging from 5,000 to 131,000 tokens (mean around 27,000). Output is mostly short tool calls interrupted by longer reasoning bursts. The system has to manage hundreds of partially complete sessions simultaneously, not fire-and-forget a single response.
"Standard inference benchmarks use fixed-length inputs and single-shot outputs. Real agentic workloads are neither. AgentPerf measures what actually happens when you run 500 agents at once."
- Artificial Analysis, AA-AgentPerf launch post
That problem is why the benchmark uses its own unit: Agents per Megawatt, defined as the maximum number of concurrent agents a system can serve while meeting a production service-level target. It's a density metric, not a speed metric.
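To make the definition concrete, here is a minimal sketch of how such a density figure could be derived. The launch post does not describe how Artificial Analysis searches for the maximum, so the binary search, the `meets_slo` predicate, and the toy numbers are all assumptions made for illustration.

```python
from typing import Callable


def max_concurrent_agents(meets_slo: Callable[[int], bool], upper_bound: int) -> int:
    """Largest n in [0, upper_bound] for which meets_slo(n) is True.

    Assumes meets_slo is monotone: once the system misses the SLO at some
    concurrency, it also misses it at every higher concurrency.
    """
    lo, hi = 0, upper_bound
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if meets_slo(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


def agents_per_megawatt(max_agents: int, system_power_watts: float) -> float:
    """Normalize agent capacity by system power, expressed per megawatt."""
    return max_agents * 1_000_000 / system_power_watts


# Toy stand-in for a real load test (hypothetical): pretend the system
# holds its SLO up to 830 concurrent agents while drawing 100 kW.
def toy_meets_slo(n: int) -> bool:
    return n <= 830


capacity = max_concurrent_agents(toy_meets_slo, upper_bound=10_000)
density = agents_per_megawatt(capacity, system_power_watts=100_000)
print(capacity, density)
```

In a real harness, `meets_slo` would run a full load test at the given concurrency, so each probe is expensive; that is the reason to search rather than sweep every value.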
What AgentPerf Actually Measures
The Workload
Artificial Analysis pulled real coding-agent trajectories from public code repositories across 12+ programming languages. The trajectories capture realistic task patterns - file reading, editing, command execution - at the sequence lengths and tool-call frequencies that production agents actually produce. Tool call delays with a one-second median were injected to reflect CPU-side latency. Dynamic prefixes were added to each phase to prevent systems from caching trajectories and gaming results. The test sets themselves remain private.
The benchmark permits production-grade serving strategies during evaluation: KV cache reuse, speculative decoding, and disaggregated prefill/decode are all allowed. That's deliberate - those optimizations are what you'd run in production, so synthetic tests that exclude them would produce misleading numbers.
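The two anti-gaming and realism measures described above can be sketched in a few lines. This is an illustration under stated assumptions, not the benchmark's harness: the log-normal distribution, the `sigma` value, and the prefix format are all hypothetical; only the one-second median and the idea of a dynamic prefix come from the source.

```python
import random
import uuid


def sample_tool_delay(rng: random.Random, sigma: float = 0.5) -> float:
    """Draw a tool-call delay in seconds with a 1-second median.

    A log-normal distribution has median exp(mu), so mu = 0 gives 1.0.
    The distribution family and sigma are assumptions, not from the source.
    """
    return rng.lognormvariate(0.0, sigma)


def with_dynamic_prefix(prompt: str) -> str:
    """Prepend a unique marker so replays of the same trajectory differ.

    The marker format is hypothetical; any per-run unique string would do.
    """
    return f"[run-{uuid.uuid4().hex}]\n{prompt}"


rng = random.Random(0)
delays = sorted(sample_tool_delay(rng) for _ in range(10_001))
median_delay = delays[len(delays) // 2]
print(f"empirical median delay: {median_delay:.2f} s")

first = with_dynamic_prefix("read src/main.py")
second = with_dynamic_prefix("read src/main.py")
print(first != second)
```

Because the marker changes on every run, a serving stack cannot answer from a cache built on an earlier replay, while reuse of the KV cache within a single session is unaffected.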
The SLO Framework
Three service-level tiers define what "good enough" means for an agent:
| Tier | Output Speed | P95 Time-to-First-Token |
|---|---|---|
| SLO #1 | 30 tokens/sec | 10 seconds |
| SLO #2 | 100 tokens/sec | 5 seconds |
| SLO #3 | 300 tokens/sec | 3 seconds |
The tiered structure is useful. An overnight batch agent running code review at SLO #1 has a very different cost profile from an interactive pair-programmer that needs SLO #3. Most published benchmark numbers collapse this distinction entirely.
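As a rough illustration of how the tiers could be applied to measured data, the sketch below checks a set of samples against the thresholds in the table. The thresholds are from the source; the nearest-rank percentile, the choice to judge output speed by the slowest request, and the sample values are assumptions, since the launch post does not say how speed is aggregated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    min_output_tokens_per_sec: float
    max_p95_ttft_sec: float


# Thresholds copied from the table above.
SLOS = [
    Slo("SLO #1", 30, 10),
    Slo("SLO #2", 100, 5),
    Slo("SLO #3", 300, 3),
]


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile, using integer math to avoid float rounding."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def tiers_met(ttft_samples: list[float], output_speeds: list[float]) -> list[str]:
    """Names of every tier satisfied by the samples.

    Assumption: output speed is judged by the slowest request observed.
    """
    ttft_p95 = p95(ttft_samples)
    slowest = min(output_speeds)
    return [
        slo.name
        for slo in SLOS
        if ttft_p95 <= slo.max_p95_ttft_sec
        and slowest >= slo.min_output_tokens_per_sec
    ]


# Made-up measurements for illustration only.
ttft = [0.8, 1.2, 2.5, 4.0, 1.1, 0.9, 1.5, 2.0, 3.5, 4.5]
speeds = [120, 150, 110, 135]
met = tiers_met(ttft, speeds)
print(met)
```

With these made-up samples the P95 time-to-first-token is 4.5 seconds and the slowest request runs at 110 tokens/sec, so the run satisfies SLO #1 and SLO #2 but not SLO #3.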
Image: The GB300 NVL72 connects 72 Blackwell GPUs via NVLink fabric in a rack-scale unit - key to its agent density advantage. (Source: nvidianews.nvidia.com)
The NVIDIA Result
The GB300 NVL72's numbers (61,400 agents per megawatt at SLO #1) come from a combination of architectural advantages that specifically suit agentic workloads. The NVLink fabric connecting all 72 GPUs in the rack lets the system distribute DeepSeek V4 Pro's mixture-of-experts routing across the entire pod rather than within a single accelerator. NVIDIA's TensorRT LLM handles session multiplexing efficiently enough that the system can keep hundreds of contexts warm simultaneously. The kernel stack - WideEP and DeepEP for expert distribution, MXFP4 and MXFP8 for compute - was tuned for the memory-access patterns MoE models produce under long contexts.
The H200 comparison (2,600 agents per megawatt) gives the 20x claim its grounding. Hopper-class hardware wasn't designed for rack-scale expert routing and lacks the NVLink bandwidth that makes that optimization possible. Per-GPU numbers show an even wider gap: 57.5 agents per GB300 versus 1.4 per H200, roughly 41x.
What They Measured - and What They Didn't
What They Measured
The methodology is solid for what it covers: real trajectories, private test sets to prevent gaming, production-grade optimizations permitted, and multi-tier SLO framing. The Agents per Megawatt metric is more operationally meaningful than token throughput for teams actually planning infrastructure spend.
What They Didn't
The results have a real ceiling on how far they generalize. As of the launch post, results have been published only for NVIDIA hardware. There are no numbers from AMD Instinct MI300X, no Google TPU v6 data, no AWS Trainium figures, no Groq or Cerebras results. Artificial Analysis says the benchmark is open for vendor submissions. Until other vendors appear in the table, this is a single-vendor self-assessment with external methodology review, not a competitive market comparison.
The model coverage is equally thin. The entire first round runs on DeepSeek V4 Pro, a large mixture-of-experts model that favors rack-scale NVLink setups by design. Dense transformer models behave differently under long-context agent workloads, and smaller models run on single-node configurations. Neither is represented yet.
Context length also stops at 131K. Artificial Analysis has announced plans to extend to 1M token sessions, which would change the KV cache and memory bandwidth dynamics substantially. The current numbers reflect agentic tasks that fit in a long context window, not multi-session retrieval-augmented workflows that keep state across calls.
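A back-of-the-envelope calculation shows why the jump to 1M tokens matters. The sketch below uses the standard per-token KV cache formula for a transformer with grouped-query attention; the model dimensions are hypothetical and are not those of DeepSeek V4 Pro, and models that compress their KV cache would come out smaller.

```python
def kv_cache_bytes(
    tokens: int, layers: int, kv_heads: int, head_dim: int, bytes_per_value: int
) -> int:
    """KV cache size for one session of `tokens` tokens.

    The leading 2 accounts for one key and one value tensor per layer.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model dimensions, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE = 64, 8, 128, 2

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE)
print(f"per token: {per_token // 1024} KiB")

# Session lengths from the article: mean input, current cap, planned extension.
sizes_gib = {}
for tokens in (27_000, 131_000, 1_000_000):
    total = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE)
    sizes_gib[tokens] = total / 2**30
    print(f"{tokens:>9,} tokens: {sizes_gib[tokens]:.1f} GiB per session")
```

Under these assumed dimensions a single session grows from about 32 GiB of cache at 131K tokens to about 244 GiB at 1M, which is why the number of sessions a system can keep warm would shift substantially.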
Image: Agents per Megawatt comparison across SLO tiers. Only NVIDIA systems appear in the first round of published results. (Source: artificialanalysis.ai)
Should You Care?
If you're running coding agents at any meaningful scale, yes. The critique of existing benchmarks is correct - single-call throughput numbers don't predict agentic workload behavior. Token rate on a short prompt has almost nothing to do with how many concurrent coding sessions your cluster can handle when each session is 30,000 tokens deep and issuing tool calls every few seconds.
The benchmark's methodology - real trajectories, SLO tiers, production optimizations allowed - is the right approach. When other vendors publish results, it will become a useful planning tool for infrastructure decisions. The gap between Blackwell and Hopper on agent density also confirms something practitioners already suspected: the architecture changes that mattered most for agentic AI aren't in raw FLOPS, they're in memory bandwidth, interconnect, and context management. NVIDIA's Vera Rubin platform, projected at 50 PFLOPs of NVFP4 compute with enhanced tool-call acceleration, is being designed with exactly these workloads in mind.
The actual planning question - Blackwell versus AMD versus TPU for an agentic coding fleet - still can't be answered from this benchmark alone. The first round of AgentPerf tells you NVIDIA is very good at running NVIDIA's preferred model on NVIDIA's preferred hardware. That's not useless information, but wait for the second round before restructuring your procurement.
