Hardware

AMD Instinct MI440X - CDNA 5 Enterprise GPU

Complete specs and analysis of the AMD Instinct MI440X - a CDNA 5 enterprise GPU with 432GB HBM4, 19.6 TB/s bandwidth, and TSMC 2nm compute chiplets.

AMD Instinct MI440X - CDNA 5 Enterprise GPU

TL;DR

  • 432GB HBM4 memory across 12 stacks at 19.6 TB/s aggregate bandwidth - 1.5x the capacity of the NVIDIA Rubin R200's 288GB
  • CDNA 5 architecture on TSMC 2nm (N2) compute chiplets - AMD's first 2nm data center GPU, with 3nm (N3P) I/O dies
  • ~320 billion transistors across a 12-die chiplet package (8 XCDs + active interposer/IO dies) on CoWoS-L
  • First AMD GPU with UALink interconnect for open-standard scale-out networking alongside Infinity Fabric
  • MI440X is the enterprise 8-GPU variant of the MI400 family - per-chip performance numbers have not been separately disclosed from the MI455X flagship
  • Expected H2 2026 availability in OAM form factor, compatible with existing UBB chassis

Overview

The AMD Instinct MI440X is the enterprise-focused member of AMD's MI400 accelerator family, which also includes the MI455X (the flagship, designed for the Helios rack-scale system) and the MI430X (an HPC-oriented variant with strong FP64 capabilities). All three share the same CDNA 5 silicon and HBM4 memory subsystem. The difference is the deployment target: the MI440X ships in the OAM form factor for standard 8-GPU server configurations, while the MI455X is optimized for AMD's proprietary Helios 72-GPU rack architecture.

AMD has not disclosed MI440X-specific compute performance numbers. What we know comes from the MI455X, which AMD confirmed at CES 2026 as delivering 40 PFLOPS FP4, 20 PFLOPS FP8, and 10 PFLOPS FP16 per GPU. AMD describes the MI440X as a "scaled-back MI455X," but hasn't clarified whether "scaled back" means lower clock speeds, disabled compute units, reduced interconnect bandwidth, or simply a different thermal/power profile for the OAM form factor. Until AMD publishes MI440X-specific benchmarks, the MI455X numbers serve as an upper bound.

What's confirmed is the memory subsystem - and it's sizable. The MI440X packs 432GB of HBM4 across 12 stacks at 8 Gbps per pin, delivering 19.6 TB/s of aggregate memory bandwidth. That's 1.5x the capacity and roughly 0.89x the bandwidth of NVIDIA's Rubin R200 (288GB at 22 TB/s). The capacity lead is AMD's competitive wedge here: at FP8 precision, a single MI440X can hold a 405B-parameter model (~405GB) with room for KV cache. The R200 can't fit that model at FP8 on a single chip. This is the same strategic playbook AMD ran with the MI300X and MI350X - win on memory, compete on compute.

Key Specifications

SpecificationDetails
ManufacturerAMD
Product FamilyInstinct MI400
ArchitectureCDNA 5
Process NodeTSMC 2nm (N2) compute chiplets + TSMC 3nm (N3P) I/O dies
Transistors~320 billion (package total; confirmed for MI455X)
Chiplet Configuration12 dies total - 8 XCDs (compute) + active interposer/IO dies
GPU Memory432 GB HBM4 (12 stacks x 36GB)
HBM4 Per-Pin Rate8 Gbps
Memory Bandwidth19,600 GB/s (19.6 TB/s)
FP4 PerformanceUp to 40 PFLOPS (MI455X; MI440X TBD)
FP8 PerformanceUp to 20 PFLOPS (MI455X; MI440X TBD)
FP16 / BF16 PerformanceUp to 10 PFLOPS (MI455X; MI440X TBD)
Supported PrecisionsFP4, FP6, FP8, BF16, FP16, FP32
Chip-to-Chip Bandwidth3.6 TB/s
Scale-Out Bandwidth300 GB/s per GPU
InterconnectInfinity Fabric + UALink
PackagingCoWoS-L
Form FactorOAM (fits existing UBB boxes)
TDPNot officially disclosed
Target WorkloadTraining and Inference
Release DateH2 2026
PricingNot disclosed

Note: Performance figures in this table reflect the MI455X flagship. AMD has described the MI440X as a "scaled-back" variant but hasn't published separate per-chip performance numbers. Memory capacity, bandwidth, and architectural features are shared across the MI400 family and are confirmed for the MI440X. This page will be updated when AMD discloses MI440X-specific compute benchmarks.

Performance Benchmarks (Estimated)

MetricMI300XMI350XMI440X (est.)NVIDIA B200NVIDIA Rubin R200
FP8 Peak (TFLOPS)2,610~3,600Up to 20,000*9,000 (dense)17,500
FP4 Peak (TFLOPS)N/A~7,200Up to 40,000*18,000 (sparse)50,000 (sparse)
Memory Capacity192GB288GB432GB192GB288GB
Memory Bandwidth5,300 GB/s~6,000 GB/s19,600 GB/s8,000 GB/s22,000 GB/s
Process NodeTSMC 5nm/6nmTSMC 3nmTSMC 2nm/3nmTSMC 4NPTSMC N3P
Memory TypeHBM3HBM3eHBM4HBM3eHBM4

*MI455X confirmed numbers. MI440X is described as "scaled back" - actual per-chip numbers may be lower.

The memory capacity story is clear. At 432GB, the MI440X offers 2.25x the capacity of the B200 and 1.5x the capacity of the Rubin R200. For inference workloads where model size determines how many GPUs you need, this is a direct cost multiplier. A Llama 405B model at FP8 precision requires ~405GB of weight storage - that fits on a single MI440X but requires two R200s. At FP4, even 800B+ parameter models become single-GPU deployable on the MI440X.

The bandwidth comparison is more nuanced. At 19.6 TB/s, the MI440X trails the R200's 22 TB/s by about 11%. For memory-bandwidth-bound workloads like autoregressive LLM decoding, this gap matters. The MI440X partially compensates with its larger memory pool - more capacity means larger batch sizes and more KV cache headroom, which improves throughput by amortizing per-token bandwidth costs across more concurrent requests.

Key Capabilities

CDNA 5 Architecture on TSMC 2nm. The MI440X is AMD's first data center GPU on TSMC's N2 process node for compute chiplets, paired with N3P I/O dies. The 2nm node delivers meaningful transistor density and power efficiency improvements over the 3nm chiplets in the MI350X. AMD packs about 320 billion transistors across the 12-die package - a significant jump from the MI350X's estimated transistor count and competitive with the Rubin R200's 336 billion. The chiplet approach continues to give AMD a manufacturing flexibility advantage: smaller dies yield better than large monolithic designs, and mixing process nodes lets AMD optimize cost versus performance for each die type.

432GB HBM4 at 19.6 TB/s. The MI440X is one of the first GPUs to ship with HBM4, the next-generation high-bandwidth memory standard. Each of the 12 stacks provides 36GB at 8 Gbps per pin, for an aggregate 432GB and 19.6 TB/s. This is a generational leap from the MI350X's HBM3e (288GB at ~6,000 GB/s) - roughly 1.5x the capacity and 3.3x the bandwidth. HBM4's wider interface and higher per-pin data rates enable this jump without proportionally increasing power consumption. The 432GB capacity is, now, the highest of any single AI accelerator announced for the 2026 generation.

UALink - First AMD GPU with Open Interconnect. The MI440X is AMD's first GPU to support UALink alongside Infinity Fabric. UALink is the open-standard interconnect backed by AMD, Intel, Google, Microsoft, Meta, and others as an alternative to NVIDIA's proprietary NVLink. For the MI440X, this means 300 GB/s per GPU of scale-out bandwidth over an industry-standard interface. The chip-to-chip bandwidth (within a node) remains 3.6 TB/s via Infinity Fabric. UALink matters most for organizations building multi-vendor or open-architecture clusters - it removes the lock-in that NVLink creates in NVIDIA deployments.

12-Die Chiplet Design on CoWoS-L. The MI440X uses TSMC's CoWoS-L (chip-on-wafer-on-substrate with local silicon interconnect) advanced packaging to integrate 8 compute chiplets (XCDs) and 4 I/O/interposer dies into a single package. This is a more complex integration than the MI300X's 12-die CoWoS-S design, with CoWoS-L providing higher inter-chiplet bandwidth and more flexible die placement. The active interposer layer enables all 8 XCDs to communicate with low latency, which is critical for maintaining the illusion of a single unified GPU across what is actually a distributed compute fabric.

OAM Form Factor Compatibility. The MI440X ships in the OAM (OCP Accelerator Module) form factor and is designed to drop into existing Universal Baseboard (UBB) server chassis. This is a deliberate choice for enterprise adoption - organizations with existing 8-GPU MI300X or MI350X server infrastructure can upgrade to MI440X without replacing their server hardware. The MI455X, by contrast, targets AMD's Helios rack architecture, which requires a purpose-built 72-GPU system. The MI440X is the option for customers who want CDNA 5 silicon without committing to a full rack-scale redesign.

Pricing and Availability

AMD hasn't announced pricing for any MI400 series variant. Given the TSMC 2nm compute dies, HBM4 memory (which carries a first-generation cost premium), and the 12-die CoWoS-L packaging, the MI440X will almost certainly be more expensive than the MI350X. Industry analysts expect MI400 series pricing in the $25,000 to $45,000 range per accelerator, though the exact positioning of the MI440X within that range is unclear.

DetailInformation
GPU Price (individual)Not disclosed
Expected Range$25,000-$45,000 (analyst estimates for MI400 series)
Form FactorOAM
Server Configuration8-GPU UBB chassis
Target AvailabilityH2 2026
OEM PartnersExpected from major server OEMs (Dell, HPE, Lenovo, Supermicro)

Availability is targeted for H2 2026. AMD showed MI455X and Helios hardware at CES 2026 and confirmed at a February 2026 analyst event that the MI400 series and Helios racks remain on track for the second half of the year. Whether "H2 2026" means initial shipments in Q3 or volume availability in Q4 remains to be seen - AMD's track record on data center GPU launch timelines has been mixed.

Cloud provider availability for the MI440X specifically is uncertain. Hyperscalers like Microsoft Azure and Oracle Cloud, which have adopted previous Instinct generations, are likely candidates for early MI400 deployment. However, the MI455X in the Helios rack configuration may take priority for cloud deployments given its higher compute density per rack, potentially pushing standalone MI440X cloud instances to a later date.

Strengths and Weaknesses

Strengths

  • 432GB HBM4 memory - the highest capacity of any single GPU in the 2026 generation, enabling single-chip deployment of 405B+ parameter models at FP8
  • 19.6 TB/s memory bandwidth - a 3.3x generational leap over the MI350X, competitive with NVIDIA Rubin
  • TSMC 2nm compute chiplets deliver a full node advantage over competitors still on 3nm-class processes
  • UALink support provides open-standard scale-out interconnect - a first for AMD GPUs and a meaningful alternative to NVLink lock-in
  • OAM form factor fits existing UBB infrastructure, enabling upgrade-in-place for current MI300X/MI350X server deployments
  • 12-die chiplet design on CoWoS-L continues AMD's manufacturing cost and yield advantages
  • FP4, FP6, FP8 precision support covers the full range of modern quantization strategies

Weaknesses

  • MI440X-specific compute performance numbers haven't been disclosed - all TFLOPS figures are MI455X upper bounds
  • "Scaled-back MI455X" positioning is vague - the nature and magnitude of the performance reduction is unknown
  • ROCm software ecosystem continues to trail CUDA in maturity, developer tooling, and optimization depth
  • No official TDP disclosed - power consumption and cooling requirements remain unclear for infrastructure planning
  • No pricing information - cost-per-TFLOP and cost-per-GB comparisons are impossible until AMD publishes numbers
  • HBM4 is a first-generation memory technology with potential yield constraints and supply allocation risk
  • The MI455X in Helios may receive AMD's primary software optimization focus, with MI440X treated as secondary
  • Memory bandwidth (19.6 TB/s) trails the Rubin R200's 22 TB/s by 11%, which matters for bandwidth-bound inference

Sources

AMD Instinct MI440X - CDNA 5 Enterprise GPU
About the author AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.