Claude Hits Azure GA on NVIDIA's Blackwell Ultra Hardware

Anthropic's Claude models are now generally available in Microsoft Foundry on Azure, running on NVIDIA GB300 Blackwell Ultra NVL72 racks - with vendor numbers claiming 40% faster token generation than H100.

Anthropic's Claude models reached general availability in Microsoft Foundry on Azure on June 29, running on NVIDIA GB300 Blackwell Ultra NVL72 racks. The launch marks the first time enterprise teams can access Claude through their Azure billing account with full Azure-native identity, networking, and governance - no separate Anthropic subscription required.

The headline number: Microsoft says Claude Sonnet on GB300 generates tokens 40% faster than on H100 nodes and roughly 15% faster than on B200 systems. At 1-million-token context lengths, Azure reports that inference latency drops by up to 6x versus the previous H100 deployment. Those figures are vendor-measured in Microsoft's own environment, not independently validated, and should be treated accordingly until customers can publish their own runs.

Key Specs

| Component | Detail |
| --- | --- |
| GPUs per rack | 72 Blackwell Ultra (GB300) |
| CPUs per rack | 36 NVIDIA Grace (Arm) |
| Fast memory | 37 TB total; 192 GB HBM3e per GPU |
| FP4 performance | 1,440 petaflops per rack |
| NVLink fabric | 1.8 TB/s per GPU |
| Interconnect | Quantum-X800 InfiniBand |
| vs H100 (vendor claim) | ~40% faster token generation |
| vs B200 (vendor claim) | ~15% faster token generation |

The Hardware

Memory and Compute

The GB300 NVL72 is a 48U rack that unifies 72 Blackwell Ultra GPUs and 36 Arm-based Grace CPUs into a single liquid-cooled unit. Each GPU carries 192 GB of HBM3e memory - more than double the H100's 80 GB - and the rack totals 37 TB of fast memory. That headroom is what makes very long context windows practical without aggressive KV-cache eviction.

FP4 throughput lands at 1,440 petaflops per rack. The cooling is hybrid: GPUs, CPUs, and NVSwitch are liquid-cooled, while OSFP modules and drives stay air-cooled. Power draw runs up to 140 kW per rack, which puts facility requirements firmly in the "serious data center" category.
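The rack-level figures can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this article. Reading the gap between total HBM and the 37 TB "fast memory" figure as memory attached elsewhere in the rack (for example to the Grace CPUs) is an assumption, since the article does not break that total down.

```python
# Figures quoted in the article for one GB300 NVL72 rack.
GPUS_PER_RACK = 72
HBM_PER_GPU_GB = 192
RACK_FAST_MEMORY_TB = 37
RACK_FP4_PETAFLOPS = 1440


def rack_hbm_tb(gpus=GPUS_PER_RACK, hbm_gb=HBM_PER_GPU_GB):
    """Total GPU-attached HBM in the rack, in decimal terabytes."""
    return gpus * hbm_gb / 1000


def fp4_petaflops_per_gpu(rack_pf=RACK_FP4_PETAFLOPS, gpus=GPUS_PER_RACK):
    """Rack-level FP4 throughput divided evenly across the GPUs."""
    return rack_pf / gpus


hbm_tb = rack_hbm_tb()
print(f"HBM across the rack: {hbm_tb:.1f} TB")
# Assumption: whatever HBM does not cover is fast memory attached elsewhere.
print(f"Fast memory not covered by HBM: {RACK_FAST_MEMORY_TB - hbm_tb:.1f} TB")
print(f"FP4 per GPU: {fp4_petaflops_per_gpu():.0f} petaflops")
```

On these figures, GPU HBM accounts for a bit under 14 TB of the 37 TB total, and the per-rack FP4 number works out to 20 petaflops per GPU.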

Fabric

Every GPU in the rack connects over NVLink 5.0, which provides 1.8 TB/s per GPU and 130 TB/s of aggregate NVLink bandwidth across all 72 chips. External connectivity runs over Quantum-X800 InfiniBand. NVIDIA quotes a 50x overall throughput improvement compared with Hopper-generation AI factories, a number that scales from entire-factory comparisons rather than single-model inference jobs.
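The aggregate NVLink figure follows directly from the per-GPU number. A quick check, using only the two values quoted above:

```python
GPUS_PER_RACK = 72
NVLINK_TBPS_PER_GPU = 1.8  # TB/s per GPU, as quoted for NVLink 5.0


def aggregate_nvlink_tbps(gpus=GPUS_PER_RACK, per_gpu=NVLINK_TBPS_PER_GPU):
    """Sum of per-GPU NVLink bandwidth across the rack, in TB/s."""
    return gpus * per_gpu


print(f"Aggregate NVLink bandwidth: {aggregate_nvlink_tbps():.1f} TB/s")
```

That comes to 129.6 TB/s, which rounds to the quoted 130 TB/s.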

Image: The ASUS GB300 NVL72 rack solution - one of the hardware vendors supplying GB300 systems to Azure. Caption: The GB300 NVL72 integrates 72 Blackwell Ultra GPUs and 36 Grace CPUs into a single 48U liquid-cooled rack. Source: press.asus.com

The Performance Numbers

Microsoft's published comparisons put Claude Sonnet on GB300 ahead of both of its predecessor generations on raw throughput and latency:

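| Metric | H100 | B200 | GB300 |
| --- | --- | --- | --- |
| Token generation speed | Baseline | +~22% (implied by the two GB300 claims) | +~40% (and ~15% over B200) |
| Long-context latency (1M tokens) | Baseline | - | Up to 6x lower |
| Memory per GPU | 80 GB HBM3 | 192 GB HBM3e | 192 GB HBM3e |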

The 40% throughput gain over H100 is plausible given the architectural jump - GB300 physically has more memory bandwidth and faster tensor cores. The 6x latency claim for million-token context is harder to assess without knowing batch size and prompt length distributions. Those figures matter enormously for long-context inference jobs, where memory bandwidth, not compute, is usually the bottleneck.
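To see why memory dominates at long context, it helps to estimate the size of a KV cache. The sketch below uses the standard per-token formula (keys plus values, per layer, per KV head). Claude's architecture is not public, so every model dimension here is hypothetical and chosen only to illustrate the scale. The helper also ignores weights and activations, which need memory too.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HypotheticalModel:
    """Illustrative transformer dimensions; not Claude's real configuration."""
    layers: int
    kv_heads: int
    head_dim: int
    bytes_per_value: int  # 2 for FP16/BF16, 1 for FP8


def kv_cache_bytes(model: HypotheticalModel, tokens: int) -> int:
    # Factor of 2: one key vector and one value vector per token, layer and KV head.
    return (2 * model.layers * model.kv_heads * model.head_dim
            * model.bytes_per_value * tokens)


def gpus_needed(cache_bytes: int, hbm_gb_per_gpu: int = 192) -> int:
    """GPUs whose combined HBM would hold the cache alone (ceiling division)."""
    hbm_bytes = hbm_gb_per_gpu * 10**9
    return -(-cache_bytes // hbm_bytes)


model = HypotheticalModel(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)
cache = kv_cache_bytes(model, tokens=1_000_000)
print(f"KV cache for 1M tokens: {cache / 10**9:.1f} GB")
print(f"GPUs needed at 192 GB each: {gpus_needed(cache)}")
print(f"GPUs needed at 80 GB each: {gpus_needed(cache, hbm_gb_per_gpu=80)}")
```

With these made-up dimensions, a single 1M-token request needs roughly 328 GB of cache, which spans two 192 GB GPUs or five 80 GB GPUs before any weights are loaded. Every decoded token has to read that cache, which is why bandwidth rather than compute tends to set the pace.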

Context matters: these are single-vendor benchmarks measured on Azure's infrastructure at Microsoft's discretion. They match what a reasonable architectural analysis would predict, but independent customer validation hasn't surfaced publicly yet.
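Teams that want their own numbers can start with a small timing harness. The sketch below measures time to first chunk and the steady-state chunk rate for any iterable of streamed text. `measure_stream` and `fake_stream` are hypothetical helpers written for this article, and a chunk is not the same as a token, so the rate is only a proxy unless you divide by the token count your client reports.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a stream and report time to first chunk and chunk rate."""
    start = clock()
    first: Optional[float] = None
    last = start
    count = 0
    for _ in chunks:
        last = clock()
        if first is None:
            first = last
        count += 1
    if first is None:
        return {"chunks": 0, "ttft_s": None, "total_s": 0.0, "chunks_per_s": None}
    # Rate is measured after the first chunk so prefill time does not skew it.
    window = last - first
    rate = (count - 1) / window if count > 1 and window > 0 else None
    return {
        "chunks": count,
        "ttft_s": first - start,
        "total_s": last - start,
        "chunks_per_s": rate,
    }


def fake_stream(n: int = 5, delay_s: float = 0.01):
    """Stand-in for a real streaming response, used only for the demo."""
    for i in range(n):
        time.sleep(delay_s)
        yield f"chunk-{i}"


if __name__ == "__main__":
    print(measure_stream(fake_stream()))
```

In a real run you would pass the text iterator from your streaming client in place of `fake_stream()`, repeat the measurement across several prompt lengths and concurrency levels, and compare medians rather than single samples.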

Getting Started

Accessing Claude through Foundry uses the AnthropicFoundry client from Anthropic's SDK. Microsoft Entra ID is the recommended auth method for production; API keys work for everything except Mythos 5 and Mythos Preview.

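from anthropic import AnthropicFoundry
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# Replace <resource-name> with the name of your Foundry resource.
base_url = "https://<resource-name>.services.ai.azure.com/anthropic"
# The deployment name chosen during provisioning.
deployment_name = "claude-sonnet-5"

# Entra ID auth: DefaultAzureCredential resolves a credential from the
# environment, and the provider fetches bearer tokens for the given scope.
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://ai.cognitiveservices.com/.default"
)

client = AnthropicFoundry(
    azure_ad_token_provider=token_provider,
    base_url=base_url
)

message = client.messages.create(
    model=deployment_name,
    messages=[{"role": "user", "content": "Your prompt here"}],
    max_tokens=2048,
)

# message.content is a list of content blocks; print the text blocks only.
for block in message.content:
    if block.type == "text":
        print(block.text)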

The base URL format is fixed: https://<resource-name>.services.ai.azure.com/anthropic. The deployment name you chose during provisioning routes the request to a specific model version.

Available models and deployment options as of GA:

| Model | Hosted on Azure | Auth | Region Scope |
| --- | --- | --- | --- |
| Claude Opus 4.8 | Yes | Entra ID or API key | Global / Data Zone (US) |
| Claude Sonnet 4.6 | Yes | Entra ID or API key | Global Standard |
| Claude Sonnet 5 | Yes | Entra ID or API key | Global Standard |
| Claude Haiku 4.5 | Yes | Entra ID or API key | Global Standard |
| Claude Mythos 5 | Anthropic-hosted only | Entra ID only | Global Standard |

Global Standard deployments land in East US2 or Sweden Central. Claude Opus 4.8 on Azure also supports Data Zone Standard (US) for US data-residency requirements.
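For pipelines that pick a model at runtime, the options above can be encoded as a small lookup. This is a sketch built from the table in this article, not an official API: the dictionary keys are display names rather than deployment names, the `entra_id` and `api_key` labels are made up for illustration, and "Global" in the Opus row is assumed to mean Global Standard.

```python
# Deployment options transcribed from the table above (as of GA).
DEPLOYMENT_OPTIONS = {
    "Claude Opus 4.8": {
        "auth": {"entra_id", "api_key"},
        "scopes": {"Global Standard", "Data Zone Standard (US)"},
    },
    "Claude Sonnet 4.6": {"auth": {"entra_id", "api_key"}, "scopes": {"Global Standard"}},
    "Claude Sonnet 5": {"auth": {"entra_id", "api_key"}, "scopes": {"Global Standard"}},
    "Claude Haiku 4.5": {"auth": {"entra_id", "api_key"}, "scopes": {"Global Standard"}},
    "Claude Mythos 5": {"auth": {"entra_id"}, "scopes": {"Global Standard"}},
}

# Regions where Global Standard deployments land, per the article.
GLOBAL_STANDARD_REGIONS = ("East US2", "Sweden Central")


def supports_auth(model: str, method: str) -> bool:
    """True if the model accepts the given auth method ('entra_id' or 'api_key')."""
    return method in DEPLOYMENT_OPTIONS[model]["auth"]


def models_with_scope(scope: str) -> list:
    """Models that can be deployed under the given region scope, sorted by name."""
    return sorted(
        name for name, opts in DEPLOYMENT_OPTIONS.items() if scope in opts["scopes"]
    )


print(supports_auth("Claude Mythos 5", "api_key"))
print(models_with_scope("Data Zone Standard (US)"))
```

A check like `supports_auth` lets a service fail fast at startup when it is configured with an API key for a model that only accepts Entra ID.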

Image: Data center server infrastructure. Caption: Enterprise Claude workloads in Azure run on Anthropic's own GB300 NVL72 racks rather than shared multi-tenant GPU pools. Source: pexels.com

What Didn't Move

The Azure path isn't a full replacement for Anthropic's direct API. Several features either don't transfer or behave differently:

Data residency caveats exist. Prompts and completions stay within Azure for the "Hosted on Azure" deployment path. However, Microsoft explicitly notes that Excel Agent Mode and Copilot Researcher - two Claude integrations available in Microsoft 365 - run on Anthropic-managed infrastructure outside Azure's data-residency commitment. If your compliance requirement is "Claude traffic stays in Azure," check which specific integration you're using.

Pricing isn't public yet for the GB300 tier. Anthropic's standard API pricing ($2/$10 per million tokens for Sonnet 5 through August 31) is the reference point, but Azure Marketplace billing adds a layer. The faster throughput on GB300 doesn't automatically mean lower cost per token - you're paying for a different infrastructure tier.

Mythos 5 can't use API keys. Claude Mythos 5 and Mythos Preview support Entra ID authentication only. If your pipeline relies on static API key rotation, you need to redesign the auth flow before using those models on Azure.
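Returning to the pricing caveat: until Azure publishes GB300-tier rates, the only way to budget is against the reference price. The sketch below assumes the quoted $2/$10 figures mean dollars per million input and output tokens respectively, which the article does not state explicitly, and it ignores any Azure Marketplace adjustment.

```python
TOKENS_PER_MILLION = 1_000_000


def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float = 2.0,    # assumed: USD per million input tokens
    output_price_per_m: float = 10.0,  # assumed: USD per million output tokens
) -> float:
    """Estimated request cost in USD at per-million-token reference prices."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / TOKENS_PER_MILLION


# One long-context request: a 1M-token prompt with a 2,048-token reply.
cost = estimate_cost_usd(input_tokens=1_000_000, output_tokens=2_048)
print(f"Estimated cost: ${cost:.4f}")
```

Throughput does not appear anywhere in the formula: under per-token pricing, faster hardware shortens the wait but leaves the bill unchanged unless the tier itself is priced differently.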

Where It Falls Short

The claimed 40% speedup over H100 sounds clean, but the conditions matter. That figure likely reflects single-request throughput rather than real-world batch serving, where memory-bandwidth contention and scheduling decisions affect actual throughput. The published figure also doesn't specify the model size or context length the benchmark ran at.

Microsoft and Anthropic report early pilots where Claude+Phi-4 multi-model chains reached 30% accuracy improvements in enterprise workflows. One case study from a nuclear safety firm cites reducing a 200-day human review process to one day using Claude on Foundry. These are compelling anecdotes, but they're cherry-picked success stories from customers motivated to praise the platform. Representative average-case performance data isn't public.

The region limitation also matters. Global Standard deployments are confined to East US2 and Sweden Central as of GA, with more regions scheduled for Q3 2026. Enterprise customers with latency-sensitive workloads in other regions need to wait.


NVIDIA's Justin Boitano described Claude on GB300 as targeting "complex technical work" requiring "strong reasoning and coding capabilities." That framing is accurate - this isn't about cheaper chat; it's about enabling the long-context agent workflows that previously choked on H100 memory limits. Whether the raw performance claims hold under production conditions is the question that only customer benchmark data can answer.

About the author

Sophie Zhang, AI Infrastructure & Open Source Reporter

Sophie is a journalist and former systems engineer who covers AI infrastructure, open-source models, and the developer tooling ecosystem.