TL;DR

First AMD Instinct PCIe accelerator card in nearly five years - fits standard air-cooled server slots without liquid cooling or custom rack infrastructure
144GB HBM3E at 4 TB/s - more memory than an H100 SXM (80GB), with bandwidth slightly below the H200 (4.8 TB/s)
2.3 PFLOPS MXFP8 / 4.6 PFLOPS MXFP4 - AMD claims roughly 40% faster than the H200 NVL in theoretical FP8/FP16 compute
No Infinity Fabric between cards - each PCIe MI350P is a standalone unit, limiting multi-card setups to PCIe bandwidth

Overview

The AMD Instinct MI350P is AMD's first PCIe AI accelerator card since the Instinct MI100 era, and the only CDNA 4 product that drops directly into a standard PCIe server slot as a dual-slot card. Every other MI350-series and MI400-series card requires the OAM form factor, liquid cooling, or purpose-built rack infrastructure. The MI350P changes that for customers who need high-density AI inference without rebuilding their data center.

The card uses half the MI350X silicon - one I/O die paired with four compute dies - giving it 128 compute units, 512 matrix cores, and 144GB of HBM3E memory across a 4096-bit interface. That memory configuration is meaningful: a single MI350P can hold models that would otherwise need multiple H100 PCIe cards, making it a genuine upgrade path for on-premises operators who built out H100 PCIe infrastructure. For comparison, the AMD Instinct MI350X uses the full silicon configuration with 288GB of HBM3E.

AMD launched the MI350P in May 2026, targeting the large base of existing air-cooled servers in enterprise and mid-market data centers. The pitch is straightforward: you don't need to change your power distribution, cooling plant, or rack layout to get CDNA 4 inference performance. The card has no fans of its own and is cooled passively by server chassis airflow within a 600W board power budget. An optional 450W mode reduces thermal demand at some performance cost.

Key Specifications

Specification	Details
Manufacturer	AMD
Product Family	Instinct MI350
Architecture	CDNA 4
Process Node	TSMC 3nm + 6nm FinFET
Chip Type	GPU
Compute Units	128
Stream Processors	8,192
Matrix Cores	512
Engine Clock	2,200 MHz
MXFP4 Performance	4,600 TFLOPS (4.6 PFLOPS)
MXFP8 / OCP-FP8 Performance	2,299 TFLOPS (2.3 PFLOPS)
FP16 / BF16 Performance	1,150 TFLOPS (1.15 PFLOPS)
FP64 Performance	36 TFLOPS
Memory	144GB HBM3E
Memory Interface	4096-bit
Memory Bandwidth	4,000 GB/s (4 TB/s)
Form Factor	PCIe Gen5 x16, FHFL dual-slot, 10.5 inches
Cooling	Passive (fanless), air-cooled via server chassis
TDP	600W (450W mode available)
Power Connector	12V-2x6
Interconnect	PCIe Gen5 x16 only (no Infinity Fabric)
Max Cards per Server	8
Target Workload	Inference
Launch	May 2026
Pricing	Not announced

Performance Benchmarks

No independent inference benchmarks were available at launch. The numbers below come from AMD's own publications and the Phoronix review of the MI350P. Independent coverage from ServeTheHome and others confirms the memory and architecture specs but has not yet verified inference throughput.

Metric	MI350P	H100 PCIe	H200 SXM5	MI350X
MXFP8 / FP8 (TFLOPS)	2,299	1,513	1,979	~3,600 (est.)
MXFP4 / FP4 (TFLOPS)	4,600	N/A	N/A	~7,200 (est.)
FP16 / BF16 (TFLOPS)	1,150	756	989	~1,800 (est.)
Memory Capacity	144GB	80GB	141GB	288GB
Memory Bandwidth	4,000 GB/s	2,000 GB/s	4,800 GB/s	~6,000 GB/s (est.)
TDP	600W	350W	700W	750W
Form Factor	PCIe	PCIe	SXM5	OAM

The MI350P beats the H100 PCIe on every compute metric and carries nearly twice the memory. Against the H200 SXM5, the comparison gets more nuanced: the MI350P wins on MXFP8 compute (2,299 vs 1,979 TFLOPS) but trails on memory bandwidth (4,000 vs 4,800 GB/s). For inference workloads that are more compute-bound than memory-bandwidth-bound - especially with quantized models at FP4, which neither NVIDIA card in the table supports - the MI350P's advantage grows.
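One rough way to see where the compute-bound versus bandwidth-bound line falls is the roofline ridge point: peak FLOPS divided by peak memory bandwidth. The sketch below uses only the theoretical FP8 figures from the table above. The function name and dictionary layout are illustrative, and real kernels rarely reach either peak, so treat the results as an orientation aid rather than a performance prediction.

```python
# Roofline ridge point: the arithmetic intensity (FLOP per byte moved)
# above which a kernel is limited by compute rather than memory bandwidth.
# Inputs are theoretical peaks from the comparison table, not measurements.

def ridge_point(peak_tflops: float, bandwidth_gbs: float) -> float:
    """Return the FLOP/byte ratio where peak compute and peak bandwidth balance."""
    return (peak_tflops * 1e12) / (bandwidth_gbs * 1e9)

# (peak FP8 TFLOPS, memory bandwidth in GB/s), copied from the table
CARDS = {
    "MI350P (MXFP8)": (2299, 4000),
    "H100 PCIe (FP8)": (1513, 2000),
    "H200 SXM5 (FP8)": (1979, 4800),
}

for name, (tflops, gbs) in CARDS.items():
    print(f"{name}: ~{ridge_point(tflops, gbs):.0f} FLOP/byte")
```

On these numbers the MI350P needs roughly 575 FLOP per byte to saturate its FP8 units, against roughly 412 for the H200 SXM5. A kernel whose intensity falls between those two values would be compute-bound on the H200 but still bandwidth-bound on the MI350P, which is the nuance the bandwidth gap introduces.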

The headline comparison AMD cited is a "roughly 40% faster" claim versus the H200 NVL in FP16/FP8 theoretical compute. This appears to compare to the H200 NVL PCIe card (a single-die NVIDIA PCIe product) rather than the NVL72 rack system, though AMD hasn't clarified the specific configuration used in the comparison.

Image: The MI350P's 10.5-inch passive heatsink design fits a standard dual-slot PCIe position with no fans of its own. Source: servethehome.com

Key Capabilities

Drop-in PCIe Deployment

The MI350P's defining feature is what it doesn't require. Module-based accelerators like the OAM-form MI440X and the NVIDIA H100 SXM require custom server trays, specialized power delivery rails, and often liquid cooling loops. The MI350P plugs into any PCIe Gen5 x16 slot in a standard air-cooled 1U or 2U server, draws from a standard 12V-2x6 connector, and relies entirely on chassis airflow for cooling. That's the same installation process as adding a discrete GPU to a workstation.

For enterprise customers who built out standard server infrastructure for CPU-based workloads and now need AI inference capacity, the MI350P is a meaningful upgrade path that doesn't require a forklift refresh.

ROCm Software Ecosystem

The MI350P runs AMD's ROCm stack, with support from PyTorch, TensorFlow, JAX, and major inference frameworks including vLLM, TGI (Text Generation Inference), and Triton. ROCm support for CDNA 4 launched with the MI350X, so the MI350P inherits the same software maturity rather than waiting for new driver development.

The caveat remains the same as always: CUDA-optimized code doesn't automatically run well on ROCm. Models and inference stacks written for NVIDIA hardware require porting work, and optimized kernels for specific model architectures often lag behind NVIDIA equivalents by months.
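At the framework level, the gap is narrower than the caveat might suggest: ROCm builds of PyTorch reuse the torch.cuda namespace, so device-level Python code often runs unchanged, while hand-written CUDA kernels are where the porting effort concentrates. The sketch below is a minimal check of which backend an installed PyTorch build targets. The helper name describe_backend is hypothetical, and the function is written to degrade gracefully when PyTorch or a GPU is absent.

```python
import importlib.util

def describe_backend() -> str:
    """Report which GPU backend the installed PyTorch build targets."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"
    import torch
    if not torch.cuda.is_available():
        return "PyTorch found, but no GPU is visible"
    # ROCm builds expose GPUs through torch.cuda; torch.version.hip is a
    # version string on ROCm builds and None on CUDA builds.
    if getattr(torch.version, "hip", None):
        return f"ROCm/HIP build {torch.version.hip}: {torch.cuda.get_device_name(0)}"
    return f"CUDA build {torch.version.cuda}: {torch.cuda.get_device_name(0)}"

print(describe_backend())
```

A check like this is useful at the top of a deployment script, since a model that loads on device "cuda" will load on either vendor's hardware and silent backend mismatches are otherwise easy to miss.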

Memory Capacity for Large Model Inference

Image: AMD MI350P vs MI350X chip comparison. The MI350P (left) uses one I/O die with four compute dies, compared to the full MI350X dual-IOD design. Source: servethehome.com

144GB in a single PCIe card is substantial. A 70-billion-parameter model in BF16 requires roughly 140GB for its weights alone - the MI350P can hold it natively without offloading, though with only about 4GB to spare for KV cache and activations. At FP8, that same model fits in about 70GB, leaving room for KV cache expansion. For RAG-heavy inference applications with large context windows, that memory headroom matters.
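The arithmetic behind those figures is simple enough to sketch. The example below estimates weights-only footprint from parameter count and bytes per parameter, then subtracts it from the card's 144GB. It assumes 1 GB = 1e9 bytes and ignores KV cache, activations, runtime overhead, and any per-block scale metadata in the MX formats, so real headroom will be smaller than shown.

```python
# Weights-only memory estimate for a dense model at different precisions.
# Assumption: every parameter is stored at the stated width, with no
# extra metadata; KV cache and activations are not included.

BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "FP4": 0.5}
CARD_MEMORY_GB = 144  # MI350P capacity from the spec table

def weight_footprint_gb(params_billion: float, dtype: str) -> float:
    """Weights-only size in GB: billions of parameters times bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[dtype]

for dtype in BYTES_PER_PARAM:
    weights = weight_footprint_gb(70, dtype)
    headroom = CARD_MEMORY_GB - weights
    print(f"70B @ {dtype}: {weights:.0f} GB weights, {headroom:.0f} GB left")
```

For a 70B model this gives 140 GB of weights at BF16 (4 GB left), 70 GB at FP8 (74 GB left), and 35 GB at FP4 (109 GB left), which is why quantized serving is the more practical single-card configuration.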

The lack of Infinity Fabric between MI350P cards means multi-card configurations don't share memory transparently - each card is independent from a software perspective. Large-model inference that doesn't fit in 144GB requires tensor-parallel splitting over PCIe Gen5 x16, which offers roughly 64 GB/s per direction in theory against 4 TB/s of on-card HBM3E bandwidth. That link is adequate for moderate model sizes but won't match OAM-based interconnect performance.
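To put the PCIe constraint in proportion, the sketch below derives the theoretical one-way bandwidth of a Gen5 x16 link (32 GT/s per lane, 128b/130b encoding) and compares the time to move a payload across it with the time to read the same payload from local HBM3E. The 64 MB payload is a hypothetical stand-in for one exchange between cards, not a measured tensor size, and the estimate ignores latency, protocol overhead, and collective-algorithm costs.

```python
# Back-of-envelope comparison: PCIe Gen5 x16 versus on-card HBM3E.
# All values are theoretical peaks; real transfers will be slower.

HBM_GBS = 4000.0  # MI350P memory bandwidth from the spec table

def pcie_gbs_per_direction(gt_per_s: float = 32.0, lanes: int = 16) -> float:
    """Theoretical one-way PCIe bandwidth in GB/s with 128b/130b encoding."""
    return gt_per_s * lanes * (128 / 130) / 8

def transfer_ms(megabytes: float, gbs: float) -> float:
    """Milliseconds to move a payload; MB divided by GB/s yields ms directly."""
    return megabytes / gbs

PAYLOAD_MB = 64.0  # hypothetical per-exchange payload, for illustration only
pcie = pcie_gbs_per_direction()

print(f"PCIe Gen5 x16: {pcie:.1f} GB/s per direction")
print(f"{PAYLOAD_MB:.0f} MB over PCIe: {transfer_ms(PAYLOAD_MB, pcie):.2f} ms")
print(f"{PAYLOAD_MB:.0f} MB from HBM3E: {transfer_ms(PAYLOAD_MB, HBM_GBS):.3f} ms")
print(f"HBM3E is ~{HBM_GBS / pcie:.0f}x faster than the link")
```

The gap of roughly 60x between local memory and the inter-card link is why tensor-parallel layouts on this card favor configurations that keep cross-card traffic small relative to on-card compute.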

Pricing and Availability

AMD launched the MI350P in May 2026. No MSRP has been announced. Street pricing will emerge as server OEM partners - Dell, Hewlett Packard Enterprise, Supermicro, and others - integrate the card into their product lines.

For context: the NVIDIA H100 PCIe 80GB card retailed in the $15,000-$20,000 range at launch, though prices have declined significantly as supply increased. The H200 PCIe commands similar or slightly higher prices. The MI350P targets the same customer base with more memory and better compute-per-dollar potential, though AMD will need to price below NVIDIA's equivalents to overcome the CUDA ecosystem advantage.

Cloud rental pricing for CDNA 4 isn't established yet. AMD's partnerships with Meta and other hyperscalers signal interest in large-scale deployment, but public cloud instances running the MI350P haven't been announced.

Strengths

Fits standard PCIe slots in air-cooled servers - no infrastructure changes required
144GB HBM3E far exceeds the H100 PCIe (80GB) and edges out the H200 (141GB)
2.3 PFLOPS MXFP8 beats H100 PCIe by ~52% in theoretical compute
Full CDNA 4 feature set: MXFP4, MXFP8, mixed-precision matrix math
ROCm support at launch, no wait for driver maturity

Weaknesses

No Infinity Fabric - multi-card setups limited to PCIe bandwidth for model parallelism
600W in a PCIe card requires servers with solid power and airflow; 450W mode reduces performance
No MSRP announced - pricing uncertainty for procurement planning
CUDA ecosystem advantage means significant software porting work for most production workloads
Memory bandwidth (4 TB/s) trails H200 SXM5 (4.8 TB/s) for bandwidth-bound workloads

Related Coverage

AMD Instinct MI350X - the full OAM-form CDNA 4 flagship with 288GB HBM3E
AMD Instinct MI440X - AMD's next-gen Instinct with CDNA 5 architecture
NVIDIA H100 - the benchmark baseline for enterprise AI inference
NVIDIA H200 - the direct memory-bandwidth competitor
