TL;DR

Purpose-built inference GPU with 128GB GDDR7 memory - a deliberate trade of HBM bandwidth for massive capacity at lower cost per byte
30 PFLOPS NVFP4 sparse compute and around 20 PFLOPS dense from 192 streaming multiprocessors on TSMC's N3P process
3x faster attention versus GB300 NVL72 through dedicated attention hardware - targeting million-token context windows
PCIe Gen 6 and no NVLink - designed for disaggregated inference racks, not tightly-coupled training clusters
Ships end of 2026 as part of the Vera Rubin NVL144 CPX platform - 144 CPX GPUs plus 144 Rubin GPUs and 36 Vera CPUs in a single rack

Overview

The NVIDIA Rubin CPX is something new from NVIDIA - a GPU designed exclusively for inference, and specifically for the prefill and context-processing phase of large language model serving. Announced at the AI Infrastructure Summit in September 2025, the CPX breaks several NVIDIA conventions simultaneously. It uses GDDR7 instead of HBM. It connects over PCIe Gen 6 without NVLink. And it ships as part of a heterogeneous rack system with standard Rubin GPUs rather than as a standalone product.

The reasoning behind the architecture is straightforward. The prefill phase of LLM inference - where the model processes the full input context before generating tokens - is compute-bound, not memory-bandwidth-bound. A million-token prompt needs enormous compute to process all the attention layers across the full context, but the memory access pattern is sequential and predictable. HBM's extreme bandwidth (8+ TB/s on a GB300 NVL72) is overkill for this access pattern. GDDR7, at roughly 2 TB/s and a fraction of the cost per gigabyte, provides enough bandwidth while enabling 128GB of capacity per GPU at a price point that HBM can't match.

The result is a chip that does one thing exceptionally well: it chews through the compute-intensive context-processing phase of inference, then hands off to the standard Rubin GPUs in the same rack for the memory-bandwidth-sensitive decode phase. This prefill/decode disaggregation is the architectural insight behind the entire Vera Rubin NVL144 CPX platform.

NVIDIA positioned the CPX against the growing trend of million-token context windows in production models. As context lengths scale from 128K to 1M tokens and beyond, the prefill compute cost grows quadratically with standard attention implementations and linearly with flash attention. The CPX's 3x attention performance improvement over Blackwell and its 128GB of model-weight capacity are specifically designed to make million-token prefill economically viable at scale.

Key Specifications

Specification	Details
Manufacturer	NVIDIA
Architecture	Rubin CPX
Process Node	TSMC N3P (3nm-class)
Die Design	Monolithic
Streaming Multiprocessors	192
NVFP4 Compute (sparse)	30 PFLOPS
NVFP4 Compute (dense)	~20 PFLOPS (estimated)
FP8 Compute	~15 PFLOPS (estimated)
Attention Performance	3x vs GB300 NVL72
Memory	128 GB GDDR7
Memory Bandwidth	~2,000 GB/s (~2 TB/s)
Memory Interface	Wide GDDR7 bus
Host Interface	PCIe Gen 6
NVLink	None
TDP	~800-880W
Cooling	Liquid cooling
Rack Configuration	144 CPX GPUs per Vera Rubin NVL144 CPX
Release Date	End of 2026

The monolithic die design is remarkable. While the B200 and B300 use dual-die designs with two chiplets connected via a high-bandwidth bridge, the Rubin CPX is a single die manufactured on TSMC's N3P process. This means simpler manufacturing (no multi-chip module assembly), lower latency between SM clusters, and a potentially higher yield per wafer due to the single-die design - though the exact die size has not been disclosed.

The 192 SMs represent an increase over the B300's 160 SMs, enabled by the N3P process shrink from 4NP. More SMs means more parallel compute, which directly benefits the attention computation that dominates prefill workloads.

Performance Benchmarks

Metric	GB300 NVL72 (per GPU)	Rubin CPX (per GPU)	Comparison
NVFP4 Compute (sparse)	~21 PFLOPS	30 PFLOPS	CPX +43%
Attention Performance	1x (baseline)	3x	CPX 3x faster
Memory Capacity	288 GB HBM3e	128 GB GDDR7	GB300 2.25x
Memory Bandwidth	8,000 GB/s	~2,000 GB/s	GB300 4x
Memory Type	HBM3e	GDDR7	Different class
Host Interface	NVLink-C2C (to Grace)	PCIe Gen 6	Different
Inter-GPU Fabric	NVLink 5 (1,800 GB/s)	None	GB300 only
TDP	1,400W	~800-880W	CPX ~40% lower
Process Node	TSMC 4NP	TSMC N3P	CPX 1 node ahead

Rack-Scale Comparison

Metric	GB300 NVL72	Vera Rubin NVL144 CPX (CPX GPUs only)	NVL144 CPX (full rack)
GPUs	72 (B300)	144 (Rubin CPX)	144 CPX + 144 Rubin
NVFP4 Compute	~1,080 PFLOPS	~4,320 PFLOPS	8 EXAFLOPS
Total Memory	20.7 TB	~18.4 TB (CPX)	~100 TB
Memory Bandwidth	576 TB/s	~288 TB/s	~1,700 TB/s
Power	~120 kW	~127 kW (CPX)	~370 kW

The Vera Rubin NVL144 CPX rack is the full system that puts the CPX in context. It contains 144 Rubin CPX GPUs for prefill processing, 144 standard Rubin GPUs (with HBM4 for high-bandwidth decode), and 36 Vera ARM CPUs. The combined 8 EXAFLOPS of NVFP4 compute and 100TB of total memory represent a generational leap over the GB300 NVL72, though direct comparison is complicated by the heterogeneous architecture.

Attention Performance Analysis

The 3x attention improvement over Blackwell is the CPX's defining feature. NVIDIA hasn't disclosed the exact hardware mechanism, but the improvement likely combines:

Dedicated attention hardware blocks beyond the general-purpose Tensor Cores - potentially hardwired attention patterns similar to how the Groq LPU hardwires the entire transformer computation
Larger on-chip SRAM / Tensor Memory to hold attention intermediates (QK^T products, softmax results) without round-tripping to GDDR7
Architectural improvements in the SM from the N3P process node, providing more transistors per SM for dedicated attention logic

For a million-token prefill on a 70B model, the attention compute is roughly 10x the linear projection compute. A 3x improvement in attention throughput translates to approximately 2.5x faster end-to-end prefill for long-context workloads - a massive improvement that directly reduces time-to-first-token for users waiting on long document processing.

Key Capabilities

GDDR7 Instead of HBM. The CPX is the first NVIDIA datacenter GPU to use GDDR7 since the Tesla line. The choice is economic rather than technical. GDDR7 provides around 2 TB/s of bandwidth from 128GB of capacity at a fraction of the cost of an equivalent HBM3e configuration. For prefill workloads where the memory access pattern is dominated by sequential weight reads (not the random-access patterns that benefit from HBM's 8+ TB/s), GDDR7 bandwidth is sufficient.

The cost advantage is significant. HBM3e at 288GB (as in the GB300) adds an estimated $5,000-8,000 to the GPU's bill of materials. GDDR7 at 128GB likely costs $500-1,000 for the memory modules. This 5-10x cost reduction in memory directly translates to a lower per-GPU price, enabling NVIDIA to pack 144 CPX GPUs into a rack at a price point that would be impossible with HBM.

The trade-off is real: at ~2 TB/s versus ~8 TB/s, the CPX can't serve decode (autoregressive token generation) workloads efficiently. Each generated token requires reading the full KV-cache from memory, and the bandwidth-to-compute ratio of GDDR7 is too low for this access pattern. This is why the CPX is paired with standard Rubin GPUs (which have HBM4) for decode.

Disaggregated Inference Architecture. The Vera Rubin NVL144 CPX platform implements what NVIDIA calls "disaggregated inference" - splitting the prefill and decode phases across specialized hardware within a single rack. The CPX GPUs handle prefill (compute-heavy, bandwidth-light), and the Rubin GPUs handle decode (bandwidth-heavy, compute-light).

This disaggregation addresses a fundamental inefficiency in current LLM serving. On a standard GPU like the H100 or B200, the same hardware serves both prefill and decode, meaning it's alternately compute-underutilized (during decode) and bandwidth-underutilized (during prefill). By splitting these phases across purpose-built hardware, the NVL144 CPX platform achieves higher overall use and better cost efficiency.

PCIe Gen 6, No NVLink. The absence of NVLink is a deliberate design choice. Prefill workloads on a single model can be parallelized across CPX GPUs via PCIe Gen 6 at ~256 GB/s per GPU, which is sufficient for distributing prefill context chunks across GPUs. The tightly-coupled NVLink fabric (at 1,800 GB/s per GPU in the GB300 NVL72) is unnecessary because prefill doesn't require the constant all-reduce communication patterns that training workloads demand.

Removing NVLink also removes the NVLink switches, reduces the die area dedicated to interconnect, and simplifies the rack-level wiring - all of which contribute to fitting 144 GPUs into a single rack at manageable cost and power.

Pricing and Availability

NVIDIA hasn't disclosed per-GPU or per-rack pricing for the Vera Rubin NVL144 CPX platform. Based on the GDDR7 memory cost advantage and the overall system configuration, the per-CPX-GPU cost is expected to be notably lower than Rubin GPUs with HBM4. The full NVL144 CPX rack, containing 288 total GPUs (144 CPX + 144 Rubin) plus 36 Vera CPUs, will likely be priced in the $5-10 million range based on the component count and system complexity.

Configuration	Estimated Range
Vera Rubin NVL144 CPX (full rack)	$5,000,000 - $10,000,000 (estimated)
Per Rubin CPX GPU (estimated)	Notably less than HBM-equipped Rubin
Power per rack	~370 kW
Cooling	Liquid cooling (required)

Availability is expected at end of 2026. This positions the Rubin CPX as a late-2026 product, roughly coinciding with the broader Vera Rubin platform launch.

Who Should Wait for Rubin CPX

Scenario	Recommendation
Serving models with 100K+ token contexts	Strong candidate - 3x attention improvement directly reduces prefill latency
High-throughput inference API provider	Strong candidate - disaggregated architecture improves GPU use
Training workloads	Not relevant - CPX has no NVLink and isn't designed for training
Small-scale inference (single GPU)	Not relevant - CPX only ships as part of NVL144 rack system
Need compute before end of 2026	Deploy GB300 NVL72 or existing Blackwell now

Strengths

3x attention performance over GB300 NVL72 directly addresses the prefill bottleneck for million-token context workloads
128GB GDDR7 provides large model-weight capacity at a fraction of HBM cost per byte
30 PFLOPS NVFP4 sparse compute from 192 SMs on TSMC N3P - the most compute-dense NVIDIA inference GPU announced
Monolithic die design avoids multi-chip-module complexity and inter-chiplet latency
~800-880W TDP is 40% lower than the B300's 1,400W while delivering more NVFP4 compute
Disaggregated prefill/decode architecture in the NVL144 CPX rack improves overall GPU use for inference
PCIe Gen 6 simplifies system design and removes NVLink switch cost

Weaknesses

~2 TB/s GDDR7 bandwidth is 4x lower than HBM3e - the CPX can't efficiently serve decode workloads alone
No NVLink means no tightly-coupled multi-GPU training capability - this is an inference-only product
Only available as part of the Vera Rubin NVL144 CPX rack system - no standalone SKU for smaller deployments
End of 2026 availability means 12+ months of waiting from the announcement date
~370 kW per NVL144 CPX rack requires sizable datacenter power and liquid cooling infrastructure
No independent benchmarks yet - all performance claims are NVIDIA-sourced
GDDR7 memory has higher latency than HBM, which may affect certain memory access patterns during prefill
Rack pricing (estimated $5-10M) limits the buyer pool to hyperscalers and large inference providers

NVIDIA GB300 NVL72 - Blackwell Ultra Rack - The current-generation rack system that Rubin CPX aims to succeed for inference workloads
NVIDIA B200 - Blackwell Flagship GPU - Current-generation Blackwell GPU for comparison
NVIDIA Rubin R200 - The standard Rubin GPU with HBM4 that pairs with CPX in the NVL144 rack
Groq LPU - Deterministic Inference at Scale - Another inference-specialized ASIC taking a radically different architectural approach
Google TPU v7 Ironwood - Google's latest AI accelerator competing at the rack scale

NVIDIA Rubin CPX - Inference GPU With GDDR7

Overview

Key Specifications

Performance Benchmarks

Rack-Scale Comparison

Attention Performance Analysis

Key Capabilities

Pricing and Availability

Who Should Wait for Rubin CPX

Strengths

Weaknesses

Sources

Overview

Key Specifications

Performance Benchmarks

Rack-Scale Comparison

Attention Performance Analysis

Key Capabilities

Pricing and Availability

Who Should Wait for Rubin CPX

Strengths

Weaknesses

Related Coverage

Sources

Google Analytics