AMD Instinct MI350P - CDNA 4 PCIe Inference Card

AMD Instinct MI350P brings CDNA 4 to a standard PCIe slot: 144GB HBM3E, 4 TB/s bandwidth, 2.3 PFLOPS MXFP8, and a 600W passively cooled design for air-cooled servers.

TL;DR

  • First AMD Instinct PCIe accelerator card in nearly five years - fits standard air-cooled server slots without liquid cooling or custom rack infrastructure
  • 144GB HBM3E at 4 TB/s - more memory than an H100 SXM (80GB) and bandwidth approaching the H200's 4.8 TB/s
  • 2.3 PFLOPS MXFP8 / 4.6 PFLOPS MXFP4 - AMD claims roughly 40% faster than the H200 NVL in theoretical FP8/FP16 compute
  • No Infinity Fabric between cards - each PCIe MI350P is a standalone unit, limiting multi-card setups to PCIe bandwidth

Overview

The AMD Instinct MI350P is AMD's first PCIe AI accelerator card since the Instinct MI100 era, and the only CDNA 4 product that drops directly into a standard PCIe server slot as a dual-slot card. Every other MI350-series and MI400-series card requires the OAM form factor, liquid cooling, or purpose-built rack infrastructure. The MI350P changes that for customers who need high-density AI inference without rebuilding their data center.

The card uses half the MI350X silicon - one I/O die paired with four compute dies - giving it 128 compute units, 512 matrix cores, and 144GB of HBM3E memory across a 4096-bit interface. That memory configuration is meaningful: a single MI350P can hold models that require multiple H100 PCIe cards, making it a genuine upgrade path for on-premises operators who built out H100 PCIe infrastructure. The AMD Instinct MI350X uses the full die with 288GB HBM3E for comparison.

AMD launched the MI350P in May 2026, targeting the large base of existing air-cooled servers in enterprise and mid-market data centers. The pitch is straightforward: you don't need to change your power distribution, cooling plant, or rack layout to get CDNA 4 inference performance. The card is cooled passively by server chassis airflow within a 600W board power budget, and an optional 450W mode reduces thermal demand at some performance cost.
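A rough way to see what the two power modes mean for a chassis is to total the accelerator board power for a given card count. The sketch below uses the 600W and 450W figures from this article; the function name is illustrative, and the totals exclude CPUs, fans, and power-supply losses.

```python
# Board power per card, taken from the article's figures.
FULL_POWER_W = 600
REDUCED_POWER_W = 450


def accelerator_power_w(cards: int, per_card_w: int) -> int:
    """Total accelerator board power, excluding CPUs, fans, and PSU losses."""
    return cards * per_card_w


for cards in (1, 2, 4, 8):
    full = accelerator_power_w(cards, FULL_POWER_W)
    reduced = accelerator_power_w(cards, REDUCED_POWER_W)
    print(f"{cards} card(s): {full} W at 600W, {reduced} W at 450W, "
          f"saving {full - reduced} W")
```

At the eight-card maximum, the reduced mode trims accelerator power from 4,800W to 3,600W.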

Key Specifications

Specification | Details
Manufacturer | AMD
Product Family | Instinct MI350
Architecture | CDNA 4
Process Node | TSMC 3nm + 6nm FinFET
Chip Type | GPU
Compute Units | 128
Stream Processors | 8,192
Matrix Cores | 512
Engine Clock | 2,200 MHz
MXFP4 Performance | 4,600 TFLOPS (4.6 PFLOPS)
MXFP8 / OCP-FP8 Performance | 2,299 TFLOPS (2.3 PFLOPS)
FP16 / BF16 Performance | 1,150 TFLOPS (1.15 PFLOPS)
FP64 Performance | 36 TFLOPS
Memory | 144GB HBM3E
Memory Interface | 4096-bit
Memory Bandwidth | 4,000 GB/s (4 TB/s)
Form Factor | PCIe Gen5 x16, FHFL dual-slot, 10.5 inches
Cooling | Passive (fanless), air-cooled via server chassis
TDP | 600W (450W mode available)
Power Connector | 12V-2x6
Interconnect | PCIe Gen5 x16 only (no Infinity Fabric)
Max Cards per Server | 8
Target Workload | Inference
Launch | May 2026
Pricing | Not announced
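Several of the listed figures can be cross-checked against each other with back-of-envelope arithmetic. The sketch below derives per-compute-unit values and the implied HBM pin rate from the table. The helper names are illustrative, and the results are derived quantities, not AMD-published numbers.

```python
# Values copied from the specification table above.
COMPUTE_UNITS = 128
STREAM_PROCESSORS = 8192
MATRIX_CORES = 512
ENGINE_CLOCK_HZ = 2_200e6
FP16_TFLOPS = 1150
MXFP8_TFLOPS = 2299
BUS_WIDTH_BITS = 4096
BANDWIDTH_GB_S = 4000


def flops_per_clock_per_cu(tflops: float) -> float:
    """Implied FLOPs per clock per compute unit at the listed engine clock."""
    return tflops * 1e12 / (COMPUTE_UNITS * ENGINE_CLOCK_HZ)


def pin_rate_gbit_s(bandwidth_gb_s: float, bus_width_bits: int) -> float:
    """Implied per-pin data rate in Gbit/s for a given bus width."""
    return bandwidth_gb_s * 8 / bus_width_bits


print("Stream processors per CU:", STREAM_PROCESSORS // COMPUTE_UNITS)
print("Matrix cores per CU:", MATRIX_CORES // COMPUTE_UNITS)
print(f"FP16 FLOPs/clock/CU: {flops_per_clock_per_cu(FP16_TFLOPS):.0f}")
print(f"MXFP8 FLOPs/clock/CU: {flops_per_clock_per_cu(MXFP8_TFLOPS):.0f}")
print(f"HBM pin rate: {pin_rate_gbit_s(BANDWIDTH_GB_S, BUS_WIDTH_BITS):.2f} Gbit/s")
```

The table is internally consistent: 64 stream processors and 4 matrix cores per compute unit, and a 4096-bit interface at about 7.8 Gbit/s per pin accounts for the 4 TB/s figure.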

Performance Benchmarks

No independent inference-throughput benchmarks were available at launch. The numbers below come from AMD's own publications and the Phoronix review of the MI350P. Independent coverage from ServeTheHome and others confirms the memory and architecture specs but has not yet measured inference throughput.

Metric | MI350P | H100 PCIe | H200 SXM5 | MI350X
MXFP8 / FP8 (TFLOPS) | 2,299 | 1,513 | 1,979 | 4,600
MXFP4 / FP4 (TFLOPS) | 4,600 | N/A | N/A | 9,200
FP16 / BF16 (TFLOPS) | 1,150 | 756 | 989 | 2,300
Memory Capacity | 144GB | 80GB | 141GB | 288GB
Memory Bandwidth | 4,000 GB/s | 2,000 GB/s | 4,800 GB/s | 8,000 GB/s
TDP | 600W | 350W | 700W | 1,000W
Form Factor | PCIe | PCIe | SXM5 | OAM

The MI350P beats the H100 PCIe on every compute metric and carries nearly twice the memory. Against the H200 SXM5, the comparison gets more nuanced: the MI350P wins on MXFP8 compute (2,299 vs 1,979 TFLOPS) but trails on memory bandwidth (4,000 vs 4,800 GB/s). For inference workloads that are more compute-bound than memory-bandwidth-bound - especially quantized models at FP4, which the H100 and H200 do not support natively - the MI350P's advantage grows.
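The compute-bound versus bandwidth-bound distinction can be made concrete with a simple roofline estimate. The sketch below uses the theoretical peak FP8 and bandwidth figures from the table. The two arithmetic-intensity values are illustrative assumptions, and real kernels rarely reach either peak.

```python
# (peak FP8-class TFLOPS, memory bandwidth in GB/s) from the comparison table.
CARDS = {
    "MI350P": (2299, 4000),
    "H100 PCIe": (1513, 2000),
    "H200 SXM5": (1979, 4800),
}


def ridge_point(tflops: float, gb_s: float) -> float:
    """FLOPs per byte above which a kernel is limited by compute, not memory."""
    return (tflops * 1e12) / (gb_s * 1e9)


def attainable_tflops(intensity: float, tflops: float, gb_s: float) -> float:
    """Roofline bound: min(peak compute, arithmetic intensity x bandwidth)."""
    return min(tflops, intensity * gb_s * 1e9 / 1e12)


for name, (tflops, gb_s) in CARDS.items():
    low = attainable_tflops(100, tflops, gb_s)    # assumed bandwidth-bound kernel
    high = attainable_tflops(1000, tflops, gb_s)  # assumed compute-bound kernel
    print(f"{name}: ridge {ridge_point(tflops, gb_s):.0f} FLOP/byte, "
          f"{low:.0f} TFLOPS at 100 FLOP/byte, {high:.0f} TFLOPS at 1000 FLOP/byte")
```

Under this model, the H200 SXM5 leads at low arithmetic intensity (480 vs 400 TFLOPS at 100 FLOP/byte), while the MI350P leads once a kernel crosses its ridge point of roughly 575 FLOP/byte.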

The headline comparison AMD cited is a "roughly 40% faster" claim versus the H200 NVL in FP16/FP8 theoretical compute. This appears to compare to the H200 NVL PCIe card (a single-die NVIDIA PCIe product) rather than the NVL72 rack system, though AMD hasn't clarified the specific configuration used in the comparison.

Figure: The MI350P's 10.5-inch passive heatsink design fits a standard dual-slot PCIe slot with no fans of its own. Source: servethehome.com

Key Capabilities

Drop-in PCIe Deployment

The MI350P's defining feature is what it doesn't require. Module-based accelerators like the OAM MI440X and the SXM NVIDIA H100 need purpose-built server trays, specialized power delivery rails, and often liquid cooling loops. The MI350P plugs into a PCIe Gen5 x16 slot in a standard air-cooled 1U or 2U server, draws power from a standard 12V-2x6 connector, and relies entirely on chassis airflow for cooling, provided the chassis can supply the power and airflow a 600W passive card demands. Otherwise, the installation process is the same as adding a discrete GPU to a workstation.

For enterprise customers who built out standard server infrastructure for CPU-based workloads and now need AI inference capacity, the MI350P is a meaningful upgrade path that doesn't require a forklift refresh.

ROCm Software Ecosystem

The MI350P runs AMD's ROCm stack, with support from PyTorch, TensorFlow, JAX, and major inference frameworks including vLLM, TGI (Text Generation Inference), and Triton. ROCm support for CDNA 4 launched with the MI350X, so the MI350P inherits the same software maturity rather than waiting for new driver development.

The caveat remains the same as always: CUDA-optimized code doesn't automatically run well on ROCm. Models and inference stacks written for NVIDIA hardware require porting work, and optimized kernels for specific model architectures often lag behind NVIDIA equivalents by months.
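One practical consequence of the ROCm design is that PyTorch's ROCm builds reuse the `torch.cuda` namespace, so device-selection code written for NVIDIA hardware often runs unchanged even when tuned kernels do not carry over. The sketch below reports which backend the installed PyTorch build targets. The function name and the returned dictionary shape are illustrative, and nothing here is specific to the MI350P.

```python
def describe_gpu_backend() -> dict:
    """Report which GPU backend the installed PyTorch build targets, if any."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, "backend": None, "devices": []}

    # ROCm builds set torch.version.hip to a version string; other builds leave it None.
    if getattr(torch.version, "hip", None):
        backend = "rocm"
    elif getattr(torch.version, "cuda", None):
        backend = "cuda"
    else:
        backend = "cpu"

    devices = []
    if torch.cuda.is_available():  # also True on ROCm builds with a visible GPU
        devices = [torch.cuda.get_device_name(i)
                   for i in range(torch.cuda.device_count())]
    return {"torch_installed": True, "backend": backend, "devices": devices}


info = describe_gpu_backend()
print(info)
```

A check like this is a reasonable first step when porting, since it separates "the stack sees the card" from "the model's kernels are well optimized for it".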

Memory Capacity for Large Model Inference

Figure: MI350P vs MI350X chip comparison. The MI350P (left) uses one I/O die with four compute dies, compared to the full MI350X dual-IOD design. Source: servethehome.com

144GB in a single PCIe card is substantial. A 70-billion-parameter model in BF16 needs roughly 140GB for weights alone, so it fits on one MI350P without offloading but leaves only about 4GB for KV cache and activations. At FP8, the same model's weights take about 70GB, leaving roughly 74GB for KV cache. For RAG-heavy inference applications with large context windows, that memory headroom matters.
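The sizing arithmetic is easy to reproduce. The sketch below estimates weight storage for a dense model at different precisions and the memory left over on a 144GB card. It assumes weights dominate and ignores runtime overhead and the small per-block scale factors that microscaling formats add, so treat the results as lower bounds on usage.

```python
CARD_MEMORY_GB = 144  # MI350P capacity from the article

# Bytes per parameter; scale-factor overhead for MX formats is ignored here.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "FP4": 0.5}


def weight_gb(params_billion: float, fmt: str) -> float:
    """Weight storage in GB (1e9 bytes) for a dense model."""
    return params_billion * BYTES_PER_PARAM[fmt]


def headroom_gb(params_billion: float, fmt: str,
                capacity_gb: float = CARD_MEMORY_GB) -> float:
    """Memory left for KV cache and activations; negative means it won't fit."""
    return capacity_gb - weight_gb(params_billion, fmt)


for fmt in BYTES_PER_PARAM:
    print(f"70B at {fmt}: {weight_gb(70, fmt):.0f} GB weights, "
          f"{headroom_gb(70, fmt):.0f} GB headroom")
```

For the 70B case this gives 4GB of headroom at BF16, 74GB at FP8, and 109GB at FP4, which is why quantized serving is the more realistic single-card configuration.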

The lack of Infinity Fabric between MI350P cards means multi-card configurations don't share memory transparently - each card is independent from a software perspective. Models that don't fit in 144GB must be split across cards with tensor parallelism over PCIe Gen5 x16, which tops out at roughly 64 GB/s per direction. That is workable for moderately oversized models but won't match the interconnect performance of OAM-based systems.
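To get a feel for the PCIe cost of tensor parallelism, the sketch below estimates the bandwidth-limited time of a ring all-reduce between cards. The 64 GB/s link rate is the theoretical PCIe Gen5 x16 figure per direction, and the model shape (hidden size 8192, BF16 activations, a 4096-token batch) is a hypothetical example, not a measurement. The model ignores latency and host-topology effects, so real numbers will be worse.

```python
PCIE_GEN5_X16_GB_S = 64.0  # assumed theoretical per-direction link rate


def ring_allreduce_seconds(message_bytes: float, cards: int,
                           link_gb_s: float = PCIE_GEN5_X16_GB_S) -> float:
    """Bandwidth-only estimate: each card moves 2*(n-1)/n of the message."""
    if cards < 2:
        return 0.0  # nothing to exchange on a single card
    traffic_bytes = 2 * (cards - 1) / cards * message_bytes
    return traffic_bytes / (link_gb_s * 1e9)


# Hypothetical workload: 4096 tokens x hidden size 8192 x 2 bytes (BF16).
message_bytes = 4096 * 8192 * 2

for cards in (2, 4, 8):
    ms = ring_allreduce_seconds(message_bytes, cards) * 1e3
    print(f"{cards} cards: {ms:.2f} ms per all-reduce")
```

Even this optimistic model puts a single all-reduce of that size at about a millisecond on two cards, and a transformer layer typically needs more than one per forward pass.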

Pricing and Availability

AMD launched the MI350P in May 2026. No MSRP has been announced. Street pricing will emerge as server OEM partners - Dell, Hewlett Packard Enterprise, Supermicro, and others - integrate the card into their product lines.

For context: the NVIDIA H100 PCIe 80GB card retailed in the $15,000-$20,000 range at launch, though prices have declined significantly as supply increased. The H200 PCIe commands similar or slightly higher prices. The MI350P targets the same customer base with more memory and better compute-per-dollar potential, though AMD will need to price below NVIDIA's equivalents to overcome the CUDA ecosystem advantage.

Cloud rental pricing for CDNA 4 isn't established yet. AMD's partnerships with Meta and other hyperscalers signal interest in large-scale deployment, but no public cloud instances running the MI350P have been announced.

Strengths

  • Fits standard PCIe slots in air-cooled servers - no infrastructure changes required
  • 144GB HBM3E far exceeds the H100 PCIe (80GB) and edges out the H200 (141GB)
  • 2.3 PFLOPS MXFP8 beats H100 PCIe by ~52% in theoretical compute
  • Full CDNA 4 feature set: MXFP4, MXFP8, mixed-precision matrix math
  • ROCm support at launch, no wait for driver maturity
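The relative figures quoted in these lists follow directly from the comparison table. The short sketch below recomputes them; the helper name is illustrative, and all inputs are the theoretical peak values listed earlier.

```python
def percent_advantage(a: float, b: float) -> float:
    """How much larger a is than b, as a percentage."""
    return (a / b - 1) * 100


# Inputs are the theoretical peak values from the comparison table.
print(f"MXFP8 vs H100 PCIe FP8: +{percent_advantage(2299, 1513):.0f}%")
print(f"MXFP8 vs H200 SXM5 FP8: +{percent_advantage(2299, 1979):.0f}%")
print(f"Memory vs H100 PCIe: {144 / 80:.1f}x")
print(f"Memory vs H200: {144 / 141:.2f}x")
print(f"Bandwidth vs H200 SXM5: {percent_advantage(4000, 4800):.0f}%")
```

The H100 PCIe gaps are large on both compute and memory, while the H200 comparison is much closer: a modest compute lead, near-identical capacity, and a bandwidth deficit.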

Weaknesses

  • No Infinity Fabric - multi-card setups limited to PCIe bandwidth for model parallelism
  • 600W in a PCIe card requires servers with solid power and airflow; 450W mode reduces performance
  • No MSRP announced - pricing uncertainty for procurement planning
  • CUDA ecosystem advantage means significant software porting work for most production workloads
  • Memory bandwidth (4 TB/s) trails H200 SXM5 (4.8 TB/s) for bandwidth-bound workloads

Sources

✓ Last verified June 1, 2026

About the author: James Kowalski, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.