Microsoft Maia 200 - Azure's Inference Accelerator


Overview

Microsoft's Maia 200, announced January 26, 2026, is the company's second-generation custom AI accelerator and its first silicon platform built specifically for large-scale inference rather than general-purpose AI training. Maia 200 runs inside Azure's US Central datacenter in Des Moines, Iowa today, with US West 3 in Phoenix next in line and additional regions planned through 2026.

TL;DR

  • Inference-only ASIC on TSMC 3nm with 216GB HBM3e at 7 TB/s and 272MB on-chip SRAM
  • Claims 10+ PFLOPS FP4 and 5+ PFLOPS FP8 in a 750W envelope - currently powering GPT-class models on Azure
  • 3x the FP4 throughput of AWS Trainium3 and higher FP8 than Google TPU v7 Ironwood, by Microsoft's own numbers
  • Inference only - not rolled out for training workloads

The chip arrived as Microsoft's answer to growing internal infrastructure costs and continued dependency on NVIDIA GPUs for serving OpenAI models at Azure scale. Maia 200 runs on a custom Ethernet-based interconnect rather than NVLink or InfiniBand, reflecting a deliberate architectural choice to scale horizontally with standard networking fabric instead of proprietary GPU interconnects. Each accelerator provides 2.8 TB/s bidirectional bandwidth to its peers via an on-die NIC.

With 140 billion transistors and a tile-based micro-architecture, Maia 200 is meaningfully more complex than the original Maia 100 rolled out in 2023. Microsoft says the chip delivers 30% better performance per dollar than the previous hardware generation already running in its fleet.

Key Specifications

| Specification | Details |
|---|---|
| Manufacturer | Microsoft |
| Product Family | Maia |
| Chip Type | ASIC (inference) |
| Process Node | TSMC N3 (3nm) |
| Transistors | 140 billion |
| Memory | 216GB HBM3e |
| Memory Bandwidth | 7,000 GB/s (7 TB/s) |
| On-chip SRAM | 272MB (CSRAM + TSRAM) |
| FP8 Performance | >5,000 TFLOPS (5 PFLOPS) |
| FP4 Performance | >10,000 TFLOPS (10.1 PFLOPS) |
| BF16 Performance | Not disclosed |
| TDP | 750W |
| Scale-up Interconnect | Custom Ethernet, 2.8 TB/s bidirectional per chip |
| Max Cluster Size | 6,144 accelerators |
| Target Workload | Inference only |
| Release Date | January 2026 |

Performance Benchmarks

No independent third-party benchmarks are available yet. All performance data comes from Microsoft's own announcements. The table below summarizes the vendor-reported figures against the two closest hyperscaler competitors:

| Benchmark | Maia 200 | AWS Trainium3 | Google TPU v7 Ironwood |
|---|---|---|---|
| FP4 Performance | >10 PFLOPS | ~3.4 PFLOPS (est.) | Not disclosed |
| FP8 Performance | >5 PFLOPS | ~2.5 PFLOPS | ~4.7 PFLOPS (est.) |
| Memory Capacity | 216GB HBM3e | 144GB HBM3e | 192GB HBM per chip |
| Memory Bandwidth | 7 TB/s | ~4 TB/s | ~7.4 TB/s |
| TDP | 750W | ~600W | ~900W (est.) |
| Max Cluster | 6,144 chips | 65,536 chips | 9,216 chips per pod |

Microsoft's FP4 comparison against Trainium3 is the most specific claim: "three times the FP4 performance." The FP8 vs. TPU v7 comparison is vaguer - "above Google's seventh generation TPU" with no specific multiplier cited. Both comparisons rely on vendor-supplied figures. The Trainium3 comparison is more credible because AWS has published its own FP8 numbers; the TPU v7 comparison rests on Google's undisclosed FP8 figure.

What's clear from the architecture alone: the 7 TB/s HBM3e bandwidth is sizable. NVIDIA's B200 reaches 8 TB/s on 192GB. Maia 200 gives up some bandwidth density but gains 24GB of capacity, which matters for serving large models with long context.
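The bandwidth-versus-capacity trade-off has a simple back-of-the-envelope consequence for decode throughput. The sketch below bounds single-chip tokens per second for a memory-bandwidth-bound workload; the 70B model size and FP4 weight storage are illustrative assumptions, not Microsoft figures.

```python
# Rough upper bound on single-chip decode throughput when inference is
# memory-bandwidth-bound: each generated token must stream the full
# weight set from HBM at least once.

def max_decode_tokens_per_sec(params_billion: float,
                              bytes_per_param: float,
                              hbm_bandwidth_tbs: float) -> float:
    """Bandwidth-bound ceiling, ignoring KV-cache traffic and overlap."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    return hbm_bandwidth_tbs * 1e12 / model_bytes

# A hypothetical 70B-parameter model stored in FP4 (0.5 bytes/param)
maia_ceiling = max_decode_tokens_per_sec(70, 0.5, 7.0)   # Maia 200: 7 TB/s
b200_ceiling = max_decode_tokens_per_sec(70, 0.5, 8.0)   # B200: 8 TB/s

print(f"Maia 200 ceiling: {maia_ceiling:.0f} tok/s")
print(f"B200 ceiling:     {b200_ceiling:.0f} tok/s")
```

Real serving stacks batch requests and also stream KV cache, so actual throughput sits well below this ceiling; the point is that decode is bandwidth-bound, which is why the 7 TB/s figure matters as much as raw PFLOPS.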

[Image] Maia 200 accelerators installed in Azure datacenter racks with liquid cooling sidecars. Source: news.microsoft.com

Key Capabilities

Tile-Based Compute Architecture

Maia 200 uses a hierarchical tile design. Each tile pairs a Tile Tensor Unit (TTU) with a Tile Vector Processor (TVP). The TTU handles high-throughput matrix multiply and convolution in FP8, FP6, and FP4, including mixed-precision modes like FP8 activations multiplied by FP4 weights. The TVP is a fully programmable SIMD engine supporting FP8, BF16, FP16, and FP32 for operations that need higher precision. A lightweight Tile Control Processor (TCP) coordinates both.

The FP4 throughput is 2x that of FP8 and 8x that of BF16 on the same die, which is why Microsoft positions Maia 200 specifically for modern reasoning and large language models where FP4 quantization is becoming standard practice.
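The mixed-precision pattern (higher-precision activations multiplied by 4-bit weights) can be illustrated with a small NumPy sketch. The symmetric int4 grid below is a stand-in for FP4 storage, since real FP4 formats such as E2M1 use a non-uniform grid; nothing here reflects Microsoft's actual quantization scheme.

```python
import numpy as np

# Illustrative symmetric quantization to a 4-bit integer grid, standing in
# for FP4 weight storage. This sketch only shows the mixed-precision
# matmul pattern: weights stored at 4 bits, activations kept wider.

def quantize_int4(w: np.ndarray):
    scale = np.abs(w).max() / 7.0          # symmetric int4 range [-7, 7]
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def mixed_precision_matmul(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    # Dequantize 4-bit weights on the fly; activations stay float32 here,
    # mirroring the "wide activations x narrow weights" mode.
    q, scale = quantize_int4(w)
    return x @ (q.astype(np.float32) * scale)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
w = rng.standard_normal((64, 32)).astype(np.float32)

exact = x @ w
approx = mixed_precision_matmul(x, w)
rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
print(f"relative error from 4-bit weights: {rel_err:.3f}")
```

The error stays modest for well-conditioned weights, which is the practical argument for trading precision for the 2x throughput gain over FP8.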

Three-Tier Memory Subsystem

The 272MB on-chip SRAM divides into two software-managed layers: Cluster SRAM (CSRAM) and Tile SRAM (TSRAM). Both are fully deterministic - the compiler places data explicitly rather than relying on hardware caches. This is the same approach Groq uses in its LPU architecture, and for the same reason: deterministic latency for inference, no cache thrash, no unpredictable bandwidth spikes. Behind the SRAM sits 216GB of HBM3e at 7 TB/s.

Data movement between all three tiers is handled by a custom Network-on-Chip with hierarchical DMA engines at the tile, cluster, and network levels. Microsoft also built a custom AI Transport Layer (ATL) for chip-to-chip communication, supporting packet spraying and multipath routing to minimize congestion at cluster scale.
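The deterministic, compiler-managed SRAM model described above can be sketched as a static tiling plan: every DMA transfer is enumerated ahead of time rather than resolved by a hardware cache at run time. The buffer sizes and tile shapes below are illustrative, not Maia 200's real TSRAM budget.

```python
# Toy model of software-managed tiling: the "compiler" stages fixed-size
# tiles from HBM into a small SRAM buffer before compute, so the DMA
# schedule is fully known at compile time (deterministic latency).

TSRAM_BYTES = 1 * 1024 * 1024   # pretend per-tile SRAM budget (1 MiB)
ELEM_BYTES = 1                  # one byte per FP8 element

def plan_tiles(rows: int, cols: int, tile_rows: int):
    """Return a static list of (row_start, row_end) DMA transfers."""
    tile_bytes = tile_rows * cols * ELEM_BYTES
    assert tile_bytes <= TSRAM_BYTES, "tile must fit in SRAM"
    schedule = [(r, min(r + tile_rows, rows))
                for r in range(0, rows, tile_rows)]
    return schedule, tile_bytes

schedule, tile_bytes = plan_tiles(rows=8192, cols=1024, tile_rows=512)
print(f"{len(schedule)} DMA transfers of {tile_bytes // 1024} KiB each")
```

Because the schedule is fixed, bandwidth use per step is known in advance; that predictability is the property the CSRAM/TSRAM design (and Groq's LPU, cited above) is built around.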

Ethernet-Based Scale-Up

Microsoft explicitly chose standard Ethernet over proprietary interconnects such as NVIDIA's NVLink. Each Maia 200 has an on-die NIC providing 1.4 TB/s unidirectional (2.8 TB/s bidirectional) bandwidth. The first tier of the topology is a "Fully Connected Quad" (FCQ): four accelerators joined by direct switchless links. The second tier extends this to 6,144 accelerators. Ethernet cannot match NVLink's raw chip-to-chip bandwidth, but it gives Microsoft full control of the networking stack and avoids licensing proprietary interconnect technology.
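The first-tier topology math is straightforward graph arithmetic. In the sketch below, the even split of NIC bandwidth across peer links is an assumption for illustration, not a disclosed Maia 200 detail.

```python
from itertools import combinations

# A "Fully Connected Quad": four accelerators, every pair joined by a
# direct switchless link. Link count is n-choose-2.

def fcq_links(n: int = 4):
    return list(combinations(range(n), 2))

links = fcq_links()
per_chip_links = 4 - 1                      # each chip connects to 3 peers
uni_bw_tbs = 1.4                            # on-die NIC, unidirectional
per_link_bw = uni_bw_tbs / per_chip_links   # assumed even split

print(f"{len(links)} direct links in one quad, "
      f"~{per_link_bw:.2f} TB/s per link if bandwidth splits evenly")
```

A switchless quad keeps the lowest-latency hops inside the group of four; the second tier then stitches quads together over switched Ethernet up to the 6,144-chip maximum.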

[Image] The Maia 200 silicon die, showing 140 billion transistors on TSMC N3. Source: news.microsoft.com

Pricing and Availability

Maia 200 is not available for external purchase or cloud rental at a per-chip level. Microsoft uses it exclusively for internal Azure inference workloads, mainly serving OpenAI's production models. There's no announced Azure VM SKU or dedicated pricing for Maia 200 instances as of April 2026.

This differs from Google's approach with Cloud TPUs, where external customers can rent TPU v6e and TPU v7 pods. It also differs from AWS, where Trainium2 and Trainium3 are available as EC2 Trn instances.

Microsoft claims 30% better performance per dollar compared to its own existing hardware fleet, but doesn't disclose the baseline hardware or the specific model and batch size where that improvement holds. External validation of that number isn't possible without access to the systems.

Strengths and Weaknesses

Strengths

  • High FP4 throughput (10.1 PFLOPS) relative to TDP for inference workloads
  • Large 216GB HBM3e capacity serves 70B+ parameter models comfortably
  • 272MB deterministic on-chip SRAM reduces memory latency for attention operations
  • Scales to 6,144 chips in two-tier Ethernet topology without proprietary interconnects
  • 140B transistors on TSMC N3 gives substantial room for future iterations

Weaknesses

  • No independent benchmark data - all performance figures from vendor
  • Inference-only: not usable for training workloads
  • Not externally available - Azure customers cannot directly access Maia 200 resources
  • No disclosed tokens-per-second figures on standard benchmarks like MLPerf
  • Custom software stack (Maia Triton compiler, NPL) adds engineering overhead vs. CUDA ecosystem

Sources

Last verified April 1, 2026

About the author

James is an AI Benchmarks & Tools Analyst and a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.