NVIDIA Rubin CPX - Inference GPU With GDDR7

Full specs, benchmarks, and analysis of the NVIDIA Rubin CPX - a purpose-built inference GPU with 128GB GDDR7, 30 PFLOPS NVFP4, and 3x faster attention versus Blackwell, targeting million-token context workloads.

TL;DR

  • Purpose-built inference GPU with 128GB GDDR7 memory - a deliberate trade of HBM bandwidth for massive capacity at lower cost per byte
  • 30 PFLOPS NVFP4 sparse compute and around 20 PFLOPS dense from 192 streaming multiprocessors on TSMC's N3P process
  • 3x faster attention versus GB300 NVL72 by NVIDIA's figures (the hardware mechanism is undisclosed) - targeting million-token context windows
  • PCIe Gen 6 and no NVLink - designed for disaggregated inference racks, not tightly-coupled training clusters
  • Ships end of 2026 as part of the Vera Rubin NVL144 CPX platform - 144 CPX GPUs plus 144 Rubin GPUs and 36 Vera CPUs in a single rack

Overview

The NVIDIA Rubin CPX is something new from NVIDIA - a GPU designed exclusively for inference, and specifically for the prefill and context-processing phase of large language model serving. Announced at the AI Infrastructure Summit in September 2025, the CPX breaks several NVIDIA conventions simultaneously. It uses GDDR7 instead of HBM. It connects over PCIe Gen 6 without NVLink. And it ships as part of a heterogeneous rack system with standard Rubin GPUs rather than as a standalone product.

The reasoning behind the architecture is straightforward. The prefill phase of LLM inference - where the model processes the full input context before generating tokens - is compute-bound, not memory-bandwidth-bound. A million-token prompt needs enormous compute to process the attention layers across the full context, but each weight fetched from memory is reused across many tokens, and the access pattern is sequential and predictable. HBM's extreme bandwidth (8 TB/s per GPU in a GB300 NVL72) is overkill for this access pattern. GDDR7, at roughly 2 TB/s and a fraction of the cost per gigabyte, provides enough bandwidth while enabling 128GB of capacity per GPU at a price point that HBM can't match.

The result is a chip that does one thing exceptionally well: it chews through the compute-intensive context-processing phase of inference, then hands off to the standard Rubin GPUs in the same rack for the memory-bandwidth-sensitive decode phase. This prefill/decode disaggregation is the architectural insight behind the entire Vera Rubin NVL144 CPX platform.
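A roofline-style estimate makes the compute-bound versus bandwidth-bound split concrete. The sketch below uses the ~20 PFLOPS dense and ~2 TB/s figures quoted in this article; the function names, the 16,384-token prefill chunk, and the simplification of counting only weight traffic (ignoring activations and the KV-cache) are assumptions made for illustration, not NVIDIA data.

```python
def weight_matmul_intensity(tokens_per_pass: int, bytes_per_weight: float = 0.5) -> float:
    """FLOPs per byte of weight traffic.

    Each weight performs one multiply-add (2 FLOPs) per token in the pass.
    bytes_per_weight=0.5 assumes 4-bit weights; this is an illustrative assumption.
    """
    return 2.0 * tokens_per_pass / bytes_per_weight


def ridge_point(peak_pflops: float, bandwidth_tbs: float) -> float:
    """Arithmetic intensity (FLOP/byte) above which the chip is compute-bound."""
    return peak_pflops * 1000.0 / bandwidth_tbs


def attainable_pflops(peak_pflops: float, bandwidth_tbs: float, flops_per_byte: float) -> float:
    """Roofline model: throughput is capped by compute or by memory traffic."""
    memory_bound_pflops = bandwidth_tbs * flops_per_byte / 1000.0
    return min(peak_pflops, memory_bound_pflops)


CPX_PEAK_PFLOPS = 20.0  # dense NVFP4 estimate from the article
CPX_BANDWIDTH_TBS = 2.0  # GDDR7 estimate from the article

prefill = weight_matmul_intensity(tokens_per_pass=16_384)  # hypothetical prefill chunk
decode = weight_matmul_intensity(tokens_per_pass=1)        # one token, batch size 1

print("ridge point (FLOP/byte):", ridge_point(CPX_PEAK_PFLOPS, CPX_BANDWIDTH_TBS))
print("prefill attainable PFLOPS:", attainable_pflops(CPX_PEAK_PFLOPS, CPX_BANDWIDTH_TBS, prefill))
print("decode attainable PFLOPS:", attainable_pflops(CPX_PEAK_PFLOPS, CPX_BANDWIDTH_TBS, decode))
```

Under these assumptions the prefill chunk sits well above the ridge point and hits the compute ceiling, while single-token decode is limited by memory traffic to a tiny fraction of peak. Real decode serving batches many requests, which raises the intensity, so treat the decode number as a worst case.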

NVIDIA positioned the CPX against the growing trend of million-token context windows in production models. As context lengths scale from 128K to 1M tokens and beyond, the attention compute in prefill grows quadratically with context length. FlashAttention-style kernels cut attention's memory footprint from quadratic to linear, but they do not reduce the FLOP count. The CPX's 3x attention performance improvement over Blackwell and its 128GB of memory capacity are specifically designed to make million-token prefill economically viable at scale.
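A back-of-envelope FLOP count shows how quickly the attention term takes over as context grows. The model shape below (70 billion parameters, 80 layers, hidden size 8192) is a hypothetical 70B-class configuration chosen for illustration, and the formulas are the usual rough approximations: about 2 FLOPs per parameter per token for the linear layers, and two n x n x d matrix multiplies per layer for attention, halved for causal masking.

```python
def linear_flops(params: float, tokens: int) -> float:
    """Approximate FLOPs for all weight matmuls: 2 FLOPs per parameter per token."""
    return 2.0 * params * tokens


def causal_attention_flops(tokens: int, hidden_size: int, layers: int) -> float:
    """Approximate FLOPs for QK^T plus attention-times-V across all layers.

    Each of the two matmuls costs about 2 * n^2 * d per layer; causal masking
    skips roughly half of that work, leaving 2 * n^2 * d per layer in total.
    """
    return 2.0 * tokens**2 * hidden_size * layers


# Hypothetical 70B-class model shape (assumption, not from the article).
PARAMS = 70e9
HIDDEN_SIZE = 8192
LAYERS = 80

for context in (128_000, 1_000_000):
    attn = causal_attention_flops(context, HIDDEN_SIZE, LAYERS)
    lin = linear_flops(PARAMS, context)
    print(f"{context:>9} tokens: attention/linear FLOP ratio = {attn / lin:.2f}")
```

Because the attention term scales with n squared and the linear term with n, the ratio itself grows linearly with context length: under these assumptions it is a little above 1 at 128K tokens and a little above 9 at 1M tokens.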

Key Specifications

Specification | Details
Manufacturer | NVIDIA
Architecture | Rubin CPX
Process Node | TSMC N3P (3nm-class)
Die Design | Monolithic
Streaming Multiprocessors | 192
NVFP4 Compute (sparse) | 30 PFLOPS
NVFP4 Compute (dense) | ~20 PFLOPS (estimated)
FP8 Compute | ~15 PFLOPS (estimated)
Attention Performance | 3x vs GB300 NVL72
Memory | 128 GB GDDR7
Memory Bandwidth | ~2,000 GB/s (~2 TB/s)
Memory Interface | Wide GDDR7 bus
Host Interface | PCIe Gen 6
NVLink | None
TDP | ~800-880W
Cooling | Liquid cooling
Rack Configuration | 144 CPX GPUs per Vera Rubin NVL144 CPX
Release Date | End of 2026

The monolithic die design is notable. While the B200 and B300 use dual-die designs with two chiplets connected via a high-bandwidth bridge, the Rubin CPX is a single die manufactured on TSMC's N3P process. This means simpler manufacturing (no multi-chip module assembly), lower latency between SM clusters, and potentially better package-level yield because there is no second die or die-to-die bridge to assemble - though the exact die size has not been disclosed.

The 192 SMs represent an increase over the B300's 160 SMs, enabled by the N3P process shrink from 4NP. More SMs means more parallel compute, which directly benefits the attention computation that dominates prefill workloads.

Performance Benchmarks

Metric | GB300 NVL72 (per GPU) | Rubin CPX (per GPU) | Comparison
NVFP4 Compute (sparse) | ~21 PFLOPS | 30 PFLOPS | CPX +43%
Attention Performance | 1x (baseline) | 3x | CPX 3x faster
Memory Capacity | 288 GB HBM3e | 128 GB GDDR7 | GB300 2.25x
Memory Bandwidth | 8,000 GB/s | ~2,000 GB/s | GB300 4x
Memory Type | HBM3e | GDDR7 | Different class
Host Interface | NVLink-C2C (to Grace) | PCIe Gen 6 | Different
Inter-GPU Fabric | NVLink 5 (1,800 GB/s) | None | GB300 only
TDP | 1,400W | ~800-880W | CPX ~40% lower
Process Node | TSMC 4NP | TSMC N3P | CPX 1 node ahead
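
The comparison column can be reproduced from the per-GPU figures, and the same numbers give a rough compute-per-watt ratio. This is a sketch only: the 840W value is the midpoint of the ~800-880W range (an assumption), and the dictionary keys are illustrative names.

```python
# Per-GPU figures from the table above; CPX TDP uses the midpoint of ~800-880W (assumption).
gb300 = {"nvfp4_sparse_pflops": 21.0, "memory_gb": 288, "bandwidth_gbs": 8000, "tdp_w": 1400}
cpx = {"nvfp4_sparse_pflops": 30.0, "memory_gb": 128, "bandwidth_gbs": 2000, "tdp_w": 840}

compute_gain = cpx["nvfp4_sparse_pflops"] / gb300["nvfp4_sparse_pflops"] - 1.0
memory_ratio = gb300["memory_gb"] / cpx["memory_gb"]
bandwidth_ratio = gb300["bandwidth_gbs"] / cpx["bandwidth_gbs"]
tdp_reduction = 1.0 - cpx["tdp_w"] / gb300["tdp_w"]


def pflops_per_kw(gpu: dict) -> float:
    """Sparse NVFP4 PFLOPS delivered per kilowatt of TDP."""
    return gpu["nvfp4_sparse_pflops"] / (gpu["tdp_w"] / 1000.0)


efficiency_ratio = pflops_per_kw(cpx) / pflops_per_kw(gb300)

print(f"CPX compute gain: {compute_gain:+.0%}")
print(f"GB300 memory capacity advantage: {memory_ratio:.2f}x")
print(f"GB300 bandwidth advantage: {bandwidth_ratio:.0f}x")
print(f"CPX TDP reduction: {tdp_reduction:.0%}")
print(f"CPX compute per kW vs GB300: {efficiency_ratio:.2f}x")
```

The compute-per-watt figure is derived from estimated TDP and peak sparse throughput, so it is an upper-level indicator rather than a measured efficiency.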

Rack-Scale Comparison

Metric | GB300 NVL72 | Vera Rubin NVL144 CPX (CPX GPUs only) | NVL144 CPX (full rack)
GPUs | 72 (B300) | 144 (Rubin CPX) | 144 CPX + 144 Rubin
NVFP4 Compute | ~1,080 PFLOPS | ~4,320 PFLOPS | 8 EXAFLOPS
Total Memory | 20.7 TB | ~18.4 TB (CPX) | ~100 TB
Memory Bandwidth | 576 TB/s | ~288 TB/s | ~1,700 TB/s
Power | ~120 kW | ~127 kW (CPX) | ~370 kW

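The Vera Rubin NVL144 CPX rack is the full system that puts the CPX in context. It contains 144 Rubin CPX GPUs for prefill processing, 144 standard Rubin GPUs (with HBM4 for high-bandwidth decode), and 36 Vera Arm CPUs. The combined 8 EXAFLOPS of NVFP4 compute and 100TB of total memory represent a generational leap over the GB300 NVL72, though direct comparison is complicated by the heterogeneous architecture. The rack-level NVFP4 figures also do not appear to be like-for-like: ~1,080 PFLOPS for the GB300 NVL72 works out to 15 PFLOPS per GPU, below the ~21 PFLOPS sparse figure in the per-GPU table and therefore most likely a dense number, while ~4,320 PFLOPS for the CPX GPUs is 144 x 30 PFLOPS sparse.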
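The CPX-only and GB300 columns of the rack table follow from multiplying the per-GPU figures by the GPU count. The helper below is a minimal sketch with an illustrative name (`rack_totals`); it uses decimal units (1 TB = 1,000 GB) and the upper end of the CPX TDP range, which is what the ~127 kW entry implies.

```python
def rack_totals(gpu_count: int, memory_gb: float, bandwidth_tbs: float, tdp_w: float) -> dict:
    """Aggregate per-GPU figures across a rack (GPUs only, decimal units)."""
    return {
        "memory_tb": gpu_count * memory_gb / 1000.0,
        "bandwidth_tbs": gpu_count * bandwidth_tbs,
        "gpu_power_kw": gpu_count * tdp_w / 1000.0,
    }


gb300_rack = rack_totals(gpu_count=72, memory_gb=288, bandwidth_tbs=8.0, tdp_w=1400)
cpx_only = rack_totals(gpu_count=144, memory_gb=128, bandwidth_tbs=2.0, tdp_w=880)

print("GB300 NVL72:", gb300_rack)
print("NVL144 CPX (CPX GPUs only):", cpx_only)
```

The GPU-only power sum for the GB300 NVL72 comes out lower than the ~120 kW in the table, which presumably also counts CPUs, switches, and other rack components; the full-rack NVL144 CPX column likewise cannot be derived from CPX figures alone.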

Attention Performance Analysis

The 3x attention improvement over Blackwell is the CPX's defining feature. NVIDIA hasn't disclosed the exact hardware mechanism, but the improvement likely combines:

  1. Dedicated attention hardware blocks beyond the general-purpose Tensor Cores - potentially fixed-function logic for common attention patterns
  2. Larger on-chip SRAM / Tensor Memory to hold attention intermediates (QK^T products, softmax results) without round-tripping to GDDR7
  3. Architectural improvements in the SM from the N3P process node, providing more transistors per SM for dedicated attention logic

For a million-token prefill on a 70B model, the attention compute is roughly 10x the linear projection compute, so attention accounts for about 10/11 of the total work. By Amdahl's law, a 3x improvement in attention throughput translates to approximately 2.5x faster end-to-end prefill for long-context workloads - a substantial improvement that directly reduces time-to-first-token for users waiting on long document processing.
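The 2.5x figure follows from Amdahl's law: only the attention share of the work gets faster, while the linear projections run at the old speed. A minimal sketch, with an illustrative function name and the article's 10:1 ratio as input:

```python
def end_to_end_speedup(attention_to_linear_ratio: float, attention_speedup: float) -> float:
    """Amdahl's law applied to prefill.

    attention_to_linear_ratio: attention compute divided by linear-projection compute.
    attention_speedup: factor by which only the attention portion is accelerated.
    """
    attention_fraction = attention_to_linear_ratio / (attention_to_linear_ratio + 1.0)
    remaining_time = (1.0 - attention_fraction) + attention_fraction / attention_speedup
    return 1.0 / remaining_time


print(f"{end_to_end_speedup(10.0, 3.0):.2f}x")  # 10:1 ratio, 3x faster attention
print(f"{end_to_end_speedup(1.0, 3.0):.2f}x")   # shorter context where attention is half the work
```

The second call shows why the benefit is specific to long contexts: when attention is only half of the compute, the same 3x attention improvement yields a 1.5x end-to-end gain.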

Key Capabilities

GDDR7 Instead of HBM. The CPX departs from NVIDIA's flagship datacenter accelerators, which use HBM, by using GDDR7. The choice is economic rather than technical. GDDR7 provides around 2 TB/s of bandwidth and 128GB of capacity at a fraction of the cost of an equivalent HBM3e configuration. For prefill workloads, where each weight read is amortized over many tokens and compute rather than memory traffic sets the pace, GDDR7 bandwidth is sufficient; HBM's 8 TB/s pays off mainly in bandwidth-bound phases such as decode.

The cost advantage is significant. HBM3e at 288GB (as in the GB300) adds an estimated $5,000-8,000 to the GPU's bill of materials. GDDR7 at 128GB likely costs $500-1,000 for the memory modules. Taken at face value, those estimates imply a 5-16x lower total memory cost, or roughly 2-7x lower cost per gigabyte once the difference in capacity is accounted for. That reduction directly translates to a lower per-GPU price, enabling NVIDIA to pack 144 CPX GPUs into a rack at a price point that would be impossible with HBM.
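The spread implied by those bill-of-materials estimates can be worked out directly. The sketch below takes the article's dollar ranges as given (they are estimates, not disclosed prices) and reports both the total-cost ratio and the per-gigabyte ratio; the function names are illustrative.

```python
def ratio_range(numerator: tuple, denominator: tuple) -> tuple:
    """Smallest and largest possible ratio between two (low, high) ranges."""
    num_low, num_high = numerator
    den_low, den_high = denominator
    return num_low / den_high, num_high / den_low


def per_gb(cost_range: tuple, capacity_gb: float) -> tuple:
    """Convert a (low, high) total cost range into a cost-per-gigabyte range."""
    low, high = cost_range
    return low / capacity_gb, high / capacity_gb


HBM3E_COST = (5000.0, 8000.0)  # estimated BOM cost for 288 GB, from the article
GDDR7_COST = (500.0, 1000.0)   # estimated BOM cost for 128 GB, from the article

total_low, total_high = ratio_range(HBM3E_COST, GDDR7_COST)
gb_low, gb_high = ratio_range(per_gb(HBM3E_COST, 288), per_gb(GDDR7_COST, 128))

print(f"total memory cost ratio: {total_low:.1f}x to {total_high:.1f}x")
print(f"cost per GB ratio: {gb_low:.1f}x to {gb_high:.1f}x")
```

The per-gigabyte ratio is smaller than the total ratio because the HBM3e configuration also carries 2.25x more capacity.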

The trade-off is real: at ~2 TB/s versus ~8 TB/s, the CPX can't serve decode (autoregressive token generation) workloads efficiently. Each generated token requires reading the model weights and the full KV-cache from memory, and the bandwidth-to-compute ratio of GDDR7 is too low for this access pattern. This is why the CPX is paired with standard Rubin GPUs (which have HBM4) for decode.
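A simple bandwidth bound shows why decode suffers. If each decode step has to stream the weights and the KV-cache from memory once, the step rate cannot exceed bandwidth divided by bytes read. The sizes below are hypothetical: 35 GB assumes a 70B-parameter model at 4 bits per weight, and the 20 GB KV-cache is an arbitrary long-context example.

```python
def max_decode_steps_per_second(bandwidth_gbs: float, weights_gb: float, kv_cache_gb: float) -> float:
    """Upper bound on decode steps per second when each step reads weights plus KV-cache once."""
    return bandwidth_gbs / (weights_gb + kv_cache_gb)


WEIGHTS_GB = 35.0   # assumption: 70B parameters at 0.5 bytes each
KV_CACHE_GB = 20.0  # assumption: hypothetical long-context KV-cache

gddr7_bound = max_decode_steps_per_second(2000.0, WEIGHTS_GB, KV_CACHE_GB)
hbm_bound = max_decode_steps_per_second(8000.0, WEIGHTS_GB, KV_CACHE_GB)

print(f"~2 TB/s GDDR7: at most {gddr7_bound:.1f} decode steps/s")
print(f"~8 TB/s HBM:   at most {hbm_bound:.1f} decode steps/s")
```

Batching lets several requests share one pass over the weights, but each request still needs its own KV-cache read, so the 4x bandwidth gap carries through to long-context decode throughput.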

Disaggregated Inference Architecture. The Vera Rubin NVL144 CPX platform implements what NVIDIA calls "disaggregated inference" - splitting the prefill and decode phases across specialized hardware within a single rack. The CPX GPUs handle prefill (compute-heavy, bandwidth-light), and the Rubin GPUs handle decode (bandwidth-heavy, compute-light).

This disaggregation addresses a fundamental inefficiency in current LLM serving. On a standard GPU like the H100 or B200, the same hardware serves both prefill and decode, meaning it's alternately compute-underutilized (during decode) and bandwidth-underutilized (during prefill). By splitting these phases across purpose-built hardware, the NVL144 CPX platform achieves higher overall utilization and better cost efficiency.
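The control flow of a disaggregated serving loop can be sketched in a few lines. This is a toy model, not NVIDIA's scheduler: the `Request` and `DisaggregatedScheduler` names are hypothetical, and a boolean flag stands in for building the KV-cache on the prefill pool and transferring it to the decode pool.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    prompt_tokens: int
    kv_ready: bool = False   # set once the prefill pool has processed the prompt
    generated_tokens: int = 0


class DisaggregatedScheduler:
    """Toy two-pool scheduler: prefill workers feed decode workers."""

    def __init__(self) -> None:
        self.prefill_queue = deque()
        self.decode_pool = []

    def submit(self, request: Request) -> None:
        self.prefill_queue.append(request)

    def step_prefill(self):
        """Run one prompt on the compute-heavy pool, then hand it off for decode."""
        if not self.prefill_queue:
            return None
        request = self.prefill_queue.popleft()
        request.kv_ready = True           # placeholder for KV-cache construction
        self.decode_pool.append(request)  # placeholder for the KV-cache handoff
        return request

    def step_decode(self) -> int:
        """Generate one token for every request on the bandwidth-heavy pool."""
        for request in self.decode_pool:
            request.generated_tokens += 1
        return len(self.decode_pool)


scheduler = DisaggregatedScheduler()
scheduler.submit(Request("a", prompt_tokens=200_000))
scheduler.submit(Request("b", prompt_tokens=1_000))
scheduler.step_prefill()          # request "a" moves to the decode pool
active = scheduler.step_decode()  # "a" decodes while "b" still waits for prefill
print(active, [r.request_id for r in scheduler.decode_pool])
```

The point of the split is visible even in this toy: a long prefill for one request no longer stalls token generation for requests that are already decoding.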

PCIe Gen 6, No NVLink. The absence of NVLink is a deliberate design choice. Prefill workloads on a single model can be parallelized across CPX GPUs via PCIe Gen 6 at ~256 GB/s bidirectional per GPU, which corresponds to a x16 link at about 128 GB/s in each direction and is sufficient for distributing prefill context chunks across GPUs. The tightly-coupled NVLink fabric (at 1,800 GB/s per GPU in the GB300 NVL72) is unnecessary because prefill doesn't require the constant all-reduce communication patterns that training workloads demand.

Removing NVLink also removes the NVLink switches, reduces the die area dedicated to interconnect, and simplifies the rack-level wiring - all of which contribute to fitting 144 GPUs into a single rack at manageable cost and power.
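To get a feel for what PCIe Gen 6 bandwidth means in practice, the sketch below estimates how long it takes to move a KV-cache chunk over one direction of a x16 link. The model shape (80 layers, 8 KV heads, head dimension 128, 1 byte per value) is a hypothetical 70B-class configuration with grouped-query attention, and 128 GB/s is the raw per-direction rate, so real transfers would be somewhat slower.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """KV-cache bytes per token: one key and one value vector per KV head per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def transfer_seconds(total_bytes: float, link_gb_per_s: float) -> float:
    """Time to move total_bytes over a link with the given one-way rate in GB/s."""
    return total_bytes / (link_gb_per_s * 1e9)


# Hypothetical model shape and link rate (assumptions, not from the article).
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_value=1)
chunk_tokens = 131_072
chunk_bytes = per_token * chunk_tokens
seconds = transfer_seconds(chunk_bytes, link_gb_per_s=128.0)

print(f"KV-cache per token: {per_token / 1024:.0f} KiB")
print(f"{chunk_tokens} tokens: {chunk_bytes / 1e9:.1f} GB, about {seconds * 1000:.0f} ms over PCIe")
```

Under these assumptions a 128K-token chunk moves in well under a second, which is small next to the compute time of a long prefill; whether that holds for a given deployment depends on the actual model shape and KV-cache precision.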

Pricing and Availability

NVIDIA hasn't disclosed per-GPU or per-rack pricing for the Vera Rubin NVL144 CPX platform. Based on the GDDR7 memory cost advantage and the overall system configuration, the per-CPX-GPU cost is expected to be notably lower than Rubin GPUs with HBM4. The full NVL144 CPX rack, containing 288 total GPUs (144 CPX + 144 Rubin) plus 36 Vera CPUs, will likely be priced in the $5-10 million range based on the component count and system complexity.

Configuration | Estimated Range
Vera Rubin NVL144 CPX (full rack) | $5,000,000 - $10,000,000 (estimated)
Per Rubin CPX GPU (estimated) | Notably less than HBM-equipped Rubin
Power per rack | ~370 kW
Cooling | Liquid cooling (required)

Availability is expected at the end of 2026, roughly coinciding with the broader Vera Rubin platform launch.

Who Should Wait for Rubin CPX

Scenario | Recommendation
Serving models with 100K+ token contexts | Strong candidate - 3x attention improvement directly reduces prefill latency
High-throughput inference API provider | Strong candidate - disaggregated architecture improves GPU utilization
Training workloads | Not relevant - CPX has no NVLink and isn't designed for training
Small-scale inference (single GPU) | Not relevant - CPX only ships as part of NVL144 rack system
Need compute before end of 2026 | Deploy GB300 NVL72 or existing Blackwell now

Strengths

  • 3x attention performance over GB300 NVL72 directly addresses the prefill bottleneck for million-token context workloads
  • 128GB GDDR7 provides large model-weight capacity at a fraction of HBM cost per byte
  • 30 PFLOPS NVFP4 sparse compute from 192 SMs on TSMC N3P - the most compute-dense NVIDIA inference GPU announced
  • Monolithic die design avoids multi-chip-module complexity and inter-chiplet latency
  • ~800-880W TDP is roughly 40% lower than the B300's 1,400W while delivering more NVFP4 compute
  • Disaggregated prefill/decode architecture in the NVL144 CPX rack improves overall GPU utilization for inference
  • PCIe Gen 6 simplifies system design and removes NVLink switch cost

Weaknesses

  • ~2 TB/s GDDR7 bandwidth is a quarter of the B300's HBM3e bandwidth - the CPX can't efficiently serve decode workloads alone
  • No NVLink means no tightly-coupled multi-GPU training capability - this is an inference-only product
  • Only available as part of the Vera Rubin NVL144 CPX rack system - no standalone SKU for smaller deployments
  • End of 2026 availability means 12+ months of waiting from the announcement date
  • ~370 kW per NVL144 CPX rack requires sizable datacenter power and liquid cooling infrastructure
  • No independent benchmarks yet - all performance claims are NVIDIA-sourced
  • GDDR7 memory has higher latency than HBM, which may affect certain memory access patterns during prefill
  • Rack pricing (estimated $5-10M) limits the buyer pool to hyperscalers and large inference providers

About the author AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.