AMD Instinct MI325X - 256GB CDNA3 for Inference


Overview

The AMD Instinct MI325X is a CDNA3-architecture AI accelerator that launched in October 2024, shipping in volume through Q1 2025 across Dell, HPE, Supermicro, and Lenovo platforms. It's best understood as an evolution of the MI300X rather than a new generation: same CDNA3 chiplet architecture, same compute units, meaningfully upgraded memory - 256GB of HBM3e versus the MI300X's 192GB, with memory bandwidth rising from 5.3 TB/s to 6 TB/s.

TL;DR

  • CDNA3 GPU with 256GB HBM3e (33% more than MI300X) and 6 TB/s bandwidth (13% gain)
  • 2.6 PFLOPS FP8 on the same 304 compute units as MI300X - compute unchanged from the predecessor
  • Comes within 3-7% of H200 on MLPerf v5.0 Llama2 70B, pulls ahead at high batch sizes
  • Cloud rental at $2.00-$2.25/hr; now overshadowed by MI350X on the roadmap

For operators running 70B+ parameter models or Mixture-of-Experts architectures with large KV caches, the 256GB capacity matters: it lets a single card hold substantially larger context windows or heavier model slices without swapping to host memory. AMD positioned the MI325X directly against the NVIDIA H200, which carries 141GB of HBM3e, in the inference market.
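To put the capacity in concrete terms, here is a back-of-envelope KV-cache sizing sketch in Python. The model configuration (80 layers, 8 grouped-query KV heads, head dimension 128) approximates a Llama-3.1-70B-class model and is an illustrative assumption, not a vendor figure:

```python
# Back-of-envelope KV-cache sizing for a 70B-class model.
# Assumed config (Llama-3.1-70B-style): 80 layers, 8 KV heads (GQA),
# head dimension 128, FP16/BF16 cache entries.

N_LAYERS = 80
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # FP16/BF16; an FP8 KV cache would halve this

# Each layer stores one key and one value vector per KV head per token.
kv_bytes_per_token = N_LAYERS * N_KV_HEADS * HEAD_DIM * 2 * BYTES_PER_ELEM
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")  # 320 KiB

# If ~70GB of the 256GB card holds FP8 weights for the 70B model,
# roughly 186GB remains for KV cache (ignoring activations/overhead).
free_bytes = (256 - 70) * 1024**3
print(f"Cacheable tokens on one card: {free_bytes // kv_bytes_per_token:,}")
```

At these assumptions, a single MI325X has room for roughly 600K cached tokens after weights, versus roughly 230K in an H200's 141GB - the difference between serving a handful of long-context sessions per card and many.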

The chip fits into the same OAM (Open Accelerator Module) form factor as the MI300X, making it a drop-in upgrade for existing MI300X platforms when paired with compatible system firmware.

Key Specifications

| Specification | Details |
|---|---|
| Manufacturer | AMD |
| Product Family | Instinct MI300 Series |
| Architecture | CDNA3 |
| Process Node | TSMC 5nm (compute) + 6nm (IO) |
| Memory | 256GB HBM3e |
| Memory Bandwidth | 6,000 GB/s (6 TB/s) |
| FP8 Performance | 2,614.9 TFLOPS |
| FP8 with Sparsity | 5,229.8 TFLOPS |
| FP16 / BF16 | 1,307.4 TFLOPS |
| TF32 | 653.7 TFLOPS |
| FP32 | 163.4 TFLOPS |
| TDP | 1,000W |
| Peak Engine Clock | 2,100 MHz |
| Compute Units | 304 |
| Stream Processors | 19,456 |
| Matrix Cores | 1,216 |
| Form Factor | OAM (Open Accelerator Module) |
| Release Date | October 2024 |

Performance Benchmarks

AMD submitted the MI325X to MLPerf Inference v5.0 (April 2025), the chip's first formal MLPerf appearance. On Llama2 70B, an eight-GPU MI325X system ran within 3-7% of a comparably configured H200 system. Image generation tasks also landed within 10% of H200.

| Benchmark | MI325X (8x GPU) | H200 (8x GPU) | MI300X (8x GPU) |
|---|---|---|---|
| MLPerf v5.0 Llama2 70B (offline) | ~97% of H200 | Baseline | ~82% of H200 (est.) |
| MLPerf v5.0 image gen | ~90% of H200 | Baseline | ~78% of H200 (est.) |
| Memory Capacity | 256GB | 141GB | 192GB |
| Memory Bandwidth | 6 TB/s | 4.8 TB/s | 5.3 TB/s |
| FP8 TFLOPS | 2,615 | 1,979 | 2,610 |

On memory bandwidth, the MI325X actually leads the H200 at 6 TB/s versus 4.8 TB/s. AMD claims 40% higher throughput and 20-40% lower latency compared to H200 on Mixtral and Llama 3.1 in its own benchmarking. The MLPerf results don't reproduce that gap at standard batch sizes, but AMD's advantage does appear at higher concurrency and larger batch configurations where memory bandwidth becomes the binding constraint.

The compute delta between the MI325X and MI300X is essentially zero: both deliver ~2,615 TFLOPS FP8. The upgrade is entirely in memory capacity and bandwidth.
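A quick roofline-style estimate shows why that is the right tradeoff for decode-heavy inference: at batch size 1, every generated token must stream the full weight set from memory, so bandwidth, not FLOPS, sets the ceiling. The ~70GB FP8 weight footprint below is an illustrative assumption:

```python
# Bandwidth-bound upper limit on single-stream decode throughput:
# tokens/s <= memory_bandwidth / bytes_of_weights_read_per_token.
# Weight footprint is an illustrative assumption (~70B params at FP8).

WEIGHTS_GB = 70

for name, bw_tb_s in [("MI325X", 6.0), ("MI300X", 5.3), ("H200", 4.8)]:
    max_tps = bw_tb_s * 1000 / WEIGHTS_GB  # (GB/s) / (GB read per token)
    print(f"{name}: <= {max_tps:.0f} tokens/s per sequence at batch 1")

# Larger batches amortize the weight reads across sequences, which is
# why gaps between cards widen where bandwidth is the binding limit.
```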

[Image: AMD Instinct MI325X OAM module with HBM3e memory stacks visible] AMD Instinct MI325X in OAM form factor, with eight HBM3e stacks providing 256GB at 6 TB/s. Source: servethehome.com

Key Capabilities

Memory Capacity as Differentiation

At 256GB per card, the MI325X holds more memory than any competitor shipping at volume in 2024-2025. The NVIDIA H200 offers 141GB, and the B200 tops out at 192GB. For inference workloads running Mixtral 8x22B, Llama 3.1 405B at reduced precision, or retrieval-augmented systems with large KV caches, fitting more model into fewer cards directly reduces inter-card communication overhead.
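As a minimal sketch of the card-count arithmetic (the weight footprints below are rough params-times-bytes estimates, not vendor numbers, and real deployments need extra headroom for KV cache and activations):

```python
import math

# Rough weight footprints in GB: total params x bytes per param.
models = {
    "Mixtral 8x22B (FP8)": 141,   # ~141B total params, 1 byte each
    "Llama 3.1 405B (FP8)": 405,
    "Llama 3.1 70B (FP16)": 140,
}
cards = {"MI325X": 256, "MI300X": 192, "H200": 141}

for model, gb in models.items():
    need = {card: math.ceil(gb / cap) for card, cap in cards.items()}
    print(f"{model}: {need}")
# Minimum cards to hold weights alone; budget headroom on top of this.
```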

The tradeoff is power: at 1,000W TDP, the MI325X sits in the same league as the NVIDIA B200 (1,000W) but well above the H200 (700W). The rack-level power budget implications are real once multiplied across 8-GPU nodes - eight cards draw 8kW before the host is counted.

CDNA3 Architecture and ROCm

The MI325X uses the same CDNA3 chiplet design as the MI300X: multiple compute dies on a common substrate, unified memory across all chiplets, and a flat memory address space visible to the entire chip. CDNA3 introduced native FP8 matrix operations for the first time in AMD's Instinct line, which is why the FP8 and FP8-with-sparsity numbers matter - they reflect dedicated matrix-core hardware for FP8, analogous to the FP8 tensor cores NVIDIA introduced with Hopper.

Software support runs through ROCm, AMD's open-source GPU compute platform. The MI325X works with PyTorch, JAX, and vLLM through the HIP runtime. Software compatibility has improved substantially since the MI300X shipped, with ROCm 6.x adding better support for attention kernels and flash attention variants. It's still behind CUDA's ecosystem depth, but the gap has narrowed.
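A minimal smoke test of that path, assuming a ROCm build of PyTorch is installed (on ROCm builds the HIP backend surfaces through the familiar `cuda` device API, and `torch.version.hip` is populated):

```python
import torch

# On ROCm builds of PyTorch, HIP devices are exposed via the `cuda`
# API, so most CUDA-targeted PyTorch code runs unchanged.
assert torch.cuda.is_available(), "no ROCm/HIP (or CUDA) device visible"
print("HIP runtime:", torch.version.hip)         # None on CUDA builds
print("Device:", torch.cuda.get_device_name(0))  # e.g. an Instinct part

# A bf16 matmul on-device exercises the CDNA3 matrix cores.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
c = a @ b
torch.cuda.synchronize()
print("matmul OK:", tuple(c.shape))
```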

Comparison to MI300X

The MI325X is not a separate generation. It uses the same chiplet configuration as the MI300X with a higher-density HBM3e stack upgrade. This means:

  • Existing MI300X system designs can accommodate MI325X with firmware updates
  • No new compiler or kernel work required for MI325X vs MI300X
  • Performance per watt actually worsens (1,000W vs the MI300X's 750W for roughly the same peak compute)

The memory upgrade is the sole reason to choose MI325X over MI300X.

Pricing and Availability

Cloud providers including Vultr list MI325X instances, with per-GPU rental running approximately $2.00-$2.25/hr as of early 2026. That places it at roughly the same price as the MI300X, which had dropped from its earlier premium pricing. The MI350X (CDNA4, 288GB HBM3e) is now entering the market, and the MI325X's cloud availability is becoming patchy as providers transition their GPU fleets.
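For a rough sense of the rental economics, a sketch - the throughput figure is a hypothetical placeholder to be replaced with a measured number, not a benchmark result:

```python
# Rental cost per million output tokens at the quoted rate.
# THROUGHPUT_TPS is a hypothetical placeholder; measure your own.
RATE_PER_GPU_HR = 2.10   # midpoint of the $2.00-$2.25/hr range
THROUGHPUT_TPS = 2500    # assumed aggregate tokens/s per GPU under load

tokens_per_hour = THROUGHPUT_TPS * 3600
print(f"~${RATE_PER_GPU_HR / (tokens_per_hour / 1e6):.2f} per million tokens")
```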

No major hyperscaler - AWS, Google Cloud, or Azure - has announced dedicated MI325X VM SKUs. Outside of specialist GPU clouds, deployments remain largely on-premises: organizations rolling out at scale typically buy through HPE, Dell, Lenovo, Supermicro, or Gigabyte platforms.

List pricing for an 8-GPU system hasn't been formally published by AMD. ServeTheHome noted that the MI325X OAM module fits into existing MI300X-compatible baseplates.

Strengths and Weaknesses

Strengths

  • 256GB HBM3e - more capacity than any competitor at the same generation point
  • 6 TB/s bandwidth outpaces NVIDIA H200's 4.8 TB/s, benefiting memory-bound inference
  • Drop-in upgrade path for MI300X-compatible OAM platforms
  • Within 3-7% of H200 on MLPerf at standard batch sizes, ahead at high concurrency
  • ROCm software ecosystem continues maturing with better kernel coverage

Weaknesses

  • 1,000W TDP (vs H200's 700W) increases rack power costs proportionally
  • Compute unchanged vs MI300X - the upgrade is memory only
  • No hyperscaler cloud availability - limited to on-premises OEM deployments
  • Already being superseded by the MI350X (CDNA4), which adds substantially more performance
  • CUDA's software ecosystem still surpasses ROCm in depth for custom kernel development
