
NVIDIA GB200 NVL72 - Rack-Scale Blackwell

Complete specs, benchmarks, and analysis of the NVIDIA GB200 NVL72 - the 72-GPU rack-scale Blackwell system delivering 1,440 PFLOPS FP4 for trillion-parameter AI training and inference.


TL;DR

  • 72 Blackwell GPUs and 36 Grace CPUs in a single liquid-cooled rack - 13.5TB of HBM3e memory operating as one unified compute domain
  • 1,440 PFLOPS (1.44 EXAFLOPS) of FP4 Tensor performance per rack - designed for trillion-parameter training and real-time inference
  • 130 TB/s aggregate NVLink bisection bandwidth via fifth-generation NVLink and NVLink Switch - all 72 GPUs can communicate as a single GPU
  • ~120kW power draw, 1.36 metric tons (3,000 lbs), fully liquid-cooled - this is datacenter infrastructure, not a server
  • Priced at $2-3 million per rack, with shipments beginning in early 2025 and demand far exceeding supply

Overview

The NVIDIA GB200 NVL72 is not a GPU. It is not even a server. It is a 72-GPU rack-scale computer that functions as a single, unified AI accelerator. Each rack contains 36 GB200 Superchips - where each Superchip combines one Grace CPU with two B200 GPUs - connected through a fifth-generation NVLink fabric that provides 130 TB/s of aggregate bisection bandwidth. All 72 GPUs share a unified 13.5TB HBM3e memory pool and can communicate with any other GPU in the rack at NVLink speed, not InfiniBand speed. The result is 1,440 PFLOPS of FP4 Tensor performance - 1.44 exaflops - in a single rack.

This is the system that NVIDIA designed for the trillion-parameter era. When you are training a model with trillions of parameters, the communication overhead between GPUs becomes the dominant bottleneck. Traditional GPU clusters use InfiniBand for inter-node communication, which operates at 400-800 Gb/s per port - fast by networking standards, but orders of magnitude slower than NVLink. The GB200 NVL72 eliminates this bottleneck for 72-GPU training runs by keeping all communication on NVLink. Every GPU can access every other GPU's memory at 1,800 GB/s, enabling all-reduce operations that would take milliseconds over InfiniBand to complete in microseconds.
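The scale of this gap can be sketched with the standard ring all-reduce cost model. The bucket size and the ~50 GB/s effective InfiniBand figure below are illustrative assumptions, not measurements:

```python
# Back-of-envelope all-reduce times for a 100 MB gradient bucket across
# 72 GPUs, using the standard ring all-reduce cost model. The ~50 GB/s
# effective InfiniBand figure is an illustrative assumption.

def ring_allreduce_seconds(size_bytes: float, n_gpus: int,
                           bw_bytes_per_s: float) -> float:
    """Ring all-reduce moves ~2*(N-1)/N of the payload through each GPU."""
    return 2 * (n_gpus - 1) / n_gpus * size_bytes / bw_bytes_per_s

MB, GB = 1e6, 1e9
nvlink = ring_allreduce_seconds(100 * MB, 72, 1800 * GB)  # NVLink 5
ib     = ring_allreduce_seconds(100 * MB, 72, 50 * GB)    # ~400 Gb/s NDR

print(f"NVLink:     {nvlink * 1e6:.0f} us")  # ~110 us
print(f"InfiniBand: {ib * 1e3:.1f} ms")      # ~3.9 ms
```

Even this toy model shows the microseconds-versus-milliseconds split the text describes; real collectives add latency terms on top of the bandwidth term.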

The physical engineering is as impressive as the compute specs. The GB200 NVL72 is a fully liquid-cooled system weighing 1.36 metric tons (3,000 pounds) and consuming approximately 120kW of power. It requires dedicated liquid cooling infrastructure with high-capacity chilled water loops - this is not something you drop into an existing air-cooled datacenter. NVIDIA positions it as the building block for "AI factories" - purpose-built datacenters designed from the ground up for AI training at scale. Meta, Microsoft, Google, Oracle, and every major hyperscaler have placed orders for GB200 NVL72 racks.

Key Specifications

| Specification | Details |
|---|---|
| Manufacturer | NVIDIA |
| System Type | Rack-scale AI supercomputer |
| Architecture | Blackwell (GB200 Superchip) |
| Process Node | TSMC 4NP (GPU), TSMC 4N (Grace CPU) |
| GPUs per Rack | 72 (B200) |
| CPUs per Rack | 36 (Grace ARM) |
| Superchips per Rack | 36 (1 Grace + 2 B200 each) |
| GPU Transistors | 208 billion per GPU |
| CUDA Cores per GPU | 18,432 |
| Tensor Cores per GPU | 576 (5th generation) |
| GPU Memory per GPU | 192 GB HBM3e |
| Total GPU Memory | 13,824 GB (13.5 TB) |
| Memory Bandwidth per GPU | 8,000 GB/s (8 TB/s) |
| Total Memory Bandwidth | 576 TB/s |
| FP4 Performance (per GPU) | 10 PFLOPS (dense) |
| FP4 Performance (per rack) | 720 PFLOPS dense, 1,440 PFLOPS sparse |
| FP8 Performance (per rack) | 360 PFLOPS dense, 720 PFLOPS sparse |
| NVLink per GPU | 5th generation, 1,800 GB/s |
| NVLink Domain | 72 GPUs (full rack) |
| NVLink Bisection Bandwidth | 130 TB/s |
| GPU-to-GPU Fabric | NVLink Switch System |
| CPU Memory per CPU | 480 GB LPDDR5X |
| Total CPU Memory | 17,280 GB (17.3 TB) |
| Network I/O | ConnectX-8 SuperNIC, 800 Gb/s per GPU |
| Total Rack Power | ~120 kW |
| Rack Weight | 1.36 metric tons (3,000 lbs) |
| Rack Height | 72U (full rack) |
| Cooling | Liquid cooling (required) |
| Release Date | H1 2025 |

The specifications deserve unpacking because the GB200 NVL72 is a fundamentally different product category from discrete GPUs. When NVIDIA says "72-GPU NVLink domain," they mean that every one of the 72 B200 GPUs can directly access every other GPU's memory at NVLink speeds (1,800 GB/s per GPU). In a traditional GPU cluster, GPUs in the same server node communicate via NVLink (fast), but GPUs in different nodes communicate via InfiniBand (10-20x slower). The GB200 NVL72 eliminates this two-tier hierarchy within the rack - all 72 GPUs are peers on the same high-bandwidth fabric.

The 36 Grace CPUs provide the host processing layer. Each Grace CPU features 72 ARM Neoverse V2 cores and 480 GB of LPDDR5X memory, totaling 17.3 TB of CPU memory across the rack. The Grace CPUs handle data preprocessing, model orchestration, tokenization, and other CPU-side tasks. The NVLink-C2C (chip-to-chip) interconnect between each Grace CPU and its paired B200 GPUs delivers 900 GB/s of coherent bandwidth - approximately 7x faster than PCIe Gen 5. This tight CPU-GPU coupling reduces the data staging latency that plagues traditional x86-based GPU servers.

Performance Benchmarks

| Metric | DGX A100 (8 GPUs) | DGX H100 (8 GPUs) | DGX B200 (8 GPUs) | GB200 NVL72 (72 GPUs) |
|---|---|---|---|---|
| FP8 TFLOPS (total) | N/A | 31,664 | 72,000 | 720,000 |
| FP4 TFLOPS (total) | N/A | N/A | 144,000 | 1,440,000 |
| GPU Memory (total) | 640 GB | 640 GB | 1,536 GB | 13,824 GB |
| Memory Bandwidth (total) | 16.3 TB/s | 26.8 TB/s | 64 TB/s | 576 TB/s |
| Inter-GPU Bandwidth | 600 GB/s NVLink | 900 GB/s NVLink | 1,800 GB/s NVLink | 1,800 GB/s NVLink (72-GPU domain) |
| GPUs in NVLink Domain | 8 | 8 | 8 | 72 |
| Power | ~6.5 kW | ~10.2 kW | ~14.3 kW | ~120 kW |
| Approximate Price | $200K+ | $300K-$400K | $400K-$500K | $2M-$3M |

The comparison against traditional 8-GPU DGX systems highlights what makes the GB200 NVL72 fundamentally different. It is not just 9x more GPUs than a DGX H100 - it is 72 GPUs in a single NVLink domain. In a DGX H100, the 8 GPUs communicate at 900 GB/s via NVLink. If you need more than 8 GPUs, you go to InfiniBand at 400 Gb/s (50 GB/s) - an 18x bandwidth drop. In the GB200 NVL72, all 72 GPUs communicate at 1,800 GB/s via NVLink. There is no InfiniBand cliff within the rack.
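A minimal model makes the "bandwidth cliff" concrete: two GPUs in the same NVLink domain communicate at NVLink speed, otherwise at network speed. The helper function and flat domain numbering are simplifying assumptions for illustration:

```python
# A minimal model of the bandwidth cliff: two GPUs in the same NVLink
# domain talk at NVLink speed, otherwise at network speed. The helper and
# the flat domain numbering are simplifying assumptions.

def pair_bandwidth_gb_s(gpu_a: int, gpu_b: int, domain_size: int,
                        nvlink_gb_s: float, network_gb_s: float) -> float:
    return nvlink_gb_s if gpu_a // domain_size == gpu_b // domain_size else network_gb_s

# DGX H100 cluster: 8-GPU domains, 900 GB/s NVLink, ~50 GB/s InfiniBand.
print(pair_bandwidth_gb_s(0, 7, 8, 900, 50))     # same node  -> 900
print(pair_bandwidth_gb_s(0, 8, 8, 900, 50))     # next node  -> 50 (18x drop)

# GB200 NVL72: one 72-GPU domain, 1,800 GB/s everywhere in the rack.
print(pair_bandwidth_gb_s(0, 71, 72, 1800, 50))  # -> 1800
```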

Scaling Efficiency Analysis

This architectural difference is what enables the GB200 NVL72 to deliver near-linear scaling for training workloads that would otherwise hit communication bottlenecks at the 8-GPU boundary.

Consider training a 70B-parameter model. On a cluster of DGX H100 nodes (8 GPUs each), training with tensor parallelism (TP=8) within each node and pipeline parallelism (PP) across nodes is standard. The TP communication is fast (NVLink at 900 GB/s), but PP communication crosses the InfiniBand boundary. Each pipeline stage boundary incurs at least one round-trip over InfiniBand per micro-batch - typically 200-400 microseconds per step on 400 Gb/s NDR.

On the GB200 NVL72, the same 70B model can use TP=72 (all 72 GPUs) or TP=8 with PP=9, and all communication stays on NVLink at 1,800 GB/s. The pipeline stage latency drops from 200-400 microseconds (InfiniBand) to 5-15 microseconds (NVLink) - a 20-40x reduction. Over thousands of training steps, this communication advantage compounds into a significant wall-clock time improvement.
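The cumulative effect can be estimated from the per-hop figures above; the step count, micro-batch count, and stage-boundary count below are hypothetical values chosen only to show the shape of the arithmetic:

```python
# Cumulative effect of per-hop pipeline latency over a long run, using the
# assumed 300 us (InfiniBand) vs 10 us (NVLink) figures. Step, micro-batch,
# and stage-boundary counts are hypothetical.

def pp_comm_hours(steps: int, micro_batches: int, boundaries: int,
                  per_hop_s: float) -> float:
    """Total hours spent on pipeline-boundary hops alone."""
    return steps * micro_batches * boundaries * per_hop_s / 3600

steps, micro_batches, boundaries = 100_000, 32, 8
ib_hours  = pp_comm_hours(steps, micro_batches, boundaries, 300e-6)
nvl_hours = pp_comm_hours(steps, micro_batches, boundaries, 10e-6)
print(f"InfiniBand: {ib_hours:.1f} h of hop latency, NVLink: {nvl_hours:.1f} h")
```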

NVIDIA claims up to 30x faster real-time inference for trillion-parameter models compared to the prior-generation DGX H100 - a number driven by the combination of Blackwell compute, FP4 precision, and the unified NVLink fabric.

Trillion-Parameter Inference

For inference on trillion-parameter models, the GB200 NVL72 opens a deployment paradigm that was previously impractical. A 1-trillion-parameter model at FP4 precision requires approximately 500GB of memory for weights alone. On the GB200 NVL72, this fits within 3 GPUs' worth of HBM (576GB), with the remaining 69 GPUs available for KV-cache and parallel processing. The NVLink fabric enables all GPUs to access the weight data cooperatively at NVLink speeds, making real-time inference on trillion-parameter models feasible.
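The memory arithmetic is straightforward: FP4 stores each parameter in 4 bits (0.5 bytes). The helper names below are ours, not an NVIDIA API:

```python
import math

# The memory arithmetic made explicit: FP4 is 4 bits (0.5 bytes) per
# parameter; each B200 carries 192 GB of HBM3e.

def weight_gb(params: float, bits: int) -> float:
    return params * bits / 8 / 1e9

def gpus_for_weights(params: float, bits: int, hbm_gb: float = 192) -> int:
    return math.ceil(weight_gb(params, bits) / hbm_gb)

print(weight_gb(1e12, 4))         # 500.0 GB of weights
print(gpus_for_weights(1e12, 4))  # 3 GPUs (576 GB of HBM3e)
```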

On a traditional H100 cluster, the same trillion-parameter model would need to be sharded across 10+ GPUs at FP8, with InfiniBand-limited cross-GPU communication adding tens of milliseconds of latency per token. The GB200 NVL72 reduces this to NVLink latency - sub-millisecond - making interactive (sub-second) token generation viable for even the largest models.

Key Capabilities

Unified 72-GPU NVLink Domain. The GB200 NVL72's defining feature is its 72-GPU NVLink domain - the largest NVLink fabric NVIDIA has ever shipped. Using NVLink Switch chips, all 72 GPUs can communicate directly at 1,800 GB/s per GPU with 130 TB/s of aggregate bisection bandwidth. This creates what is effectively a single GPU with 13.5TB of unified memory and 1.44 exaflops of FP4 compute.

The NVLink Switch System uses dedicated L1 NVLink Switch chips - each supporting 72 NVLink ports - to create a fully connected topology. The switch layer sits between the GPU trays and provides any-to-any connectivity. The switch chips are cooled by the rack's liquid cooling loop and consume approximately 3.5kW of power collectively. The switch fabric is non-blocking, meaning any GPU can communicate with any other GPU at the full 1,800 GB/s bandwidth simultaneously without contention.

For training workloads, this eliminates the traditional all-reduce bottleneck that degrades scaling efficiency when GPU count exceeds the NVLink domain size. For inference workloads, it enables model parallelism across 72 GPUs with NVLink-speed communication - making it practical to serve trillion-parameter models in real time.

GB200 Superchip Architecture. Each of the 36 GB200 Superchips pairs one Grace CPU with two B200 GPUs via NVLink-C2C (chip-to-chip interconnect). The Grace CPU provides 72 ARM Neoverse V2 cores and 480GB of LPDDR5X memory, serving as the host processor for data preprocessing, orchestration, and CPU-side computation.

The Grace CPU is not an afterthought. Its 72 ARM cores deliver meaningful CPU compute for data loading, tokenization, and preprocessing tasks that traditionally bottleneck GPU utilization. The 480GB of LPDDR5X per CPU provides a staging area for training data - large enough to hold multiple batches in memory, reducing the I/O overhead of reading from storage. The NVLink-C2C connection between Grace and B200 delivers 900 GB/s of coherent bandwidth - far faster than PCIe Gen 5's ~128 GB/s. This tight CPU-GPU coupling reduces the data staging latency that plagues traditional x86-based GPU servers.
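A rough sense of what that bandwidth difference means for staging a batch of training data; the 64 GB batch size is an assumption for illustration:

```python
# Rough staging-time comparison for moving one batch from CPU to GPU memory
# over PCIe Gen 5 (~128 GB/s) vs NVLink-C2C (900 GB/s). The 64 GB batch
# size is an assumption for illustration.

def stage_ms(batch_gb: float, bw_gb_s: float) -> float:
    return batch_gb / bw_gb_s * 1000

print(f"PCIe Gen 5:  {stage_ms(64, 128):.0f} ms")  # 500 ms
print(f"NVLink-C2C:  {stage_ms(64, 900):.0f} ms")  # 71 ms
```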

The ARM architecture also matters for power efficiency. The Grace CPU consumes approximately 500W for 72 cores and 480GB of memory - comparable to a high-end x86 server CPU, but with higher memory bandwidth per watt due to the LPDDR5X memory subsystem. In a system where every watt counts (120kW total), the Grace CPU's efficiency is a meaningful contribution to the overall power budget.

Full Liquid Cooling Architecture. The GB200 NVL72 is 100% liquid-cooled - there are no fans in the rack. Every heat-generating component - GPUs, CPUs, NVLink switches, power supplies - uses direct liquid cooling. This is not optional; there is no air-cooled variant. The system requires facility chilled water at 25-45 degrees Celsius with adequate flow capacity to handle 120kW of continuous thermal dissipation.

The all-liquid design enables a level of component density that would be impossible with air cooling. The 72 GPUs and 36 CPUs fit in a standard 72U rack only because the liquid cooling system removes heat far more efficiently than airflow. An air-cooled system with the same compute density would require multiple racks and significantly more floor space. The liquid cooling also delivers more consistent temperatures across all components, reducing thermal throttling and improving sustained performance compared to air-cooled alternatives.

The liquid cooling infrastructure requirements are non-trivial: facility chilled water supply and return piping, rack-level manifolds with quick-disconnect fittings, leak detection systems, and redundant cooling loops. The capital cost of liquid cooling infrastructure varies significantly by datacenter but typically adds $200,000-$500,000 per rack for a new installation, with lower costs for facilities that already have liquid cooling capability.

ConnectX-8 SuperNIC and Multi-Rack Scaling. For multi-rack deployments, the GB200 NVL72 racks connect via ConnectX-8 SuperNICs providing 800 Gb/s (100 GB/s) of network bandwidth per GPU. With 72 GPUs per rack, the total rack-level network bandwidth is 57.6 Tb/s - enough to sustain high-bandwidth inter-rack communication for distributed training across many racks.

Inter-rack communication still uses InfiniBand (via Quantum-2 switches) or Ethernet/RoCE (via Spectrum-X switches), so the NVLink advantage applies only within a single rack. The 800 Gb/s per-GPU network bandwidth is 2x the per-GPU bandwidth of previous-generation systems (400 Gb/s), which partially compensates for the NVLink-to-network bandwidth cliff at the rack boundary.

For training runs that span hundreds or thousands of GPUs, NVIDIA's Quantum-2 InfiniBand platform provides 400 Gb/s NDR per port with support for adaptive routing and congestion control. A typical multi-rack deployment uses a fat-tree or dragonfly network topology to provide full bisection bandwidth between racks.

Pricing and Availability

The GB200 NVL72 is priced at approximately $2-3 million per rack, with the exact price depending on configuration, volume commitments, and OEM partner. Each rack contains 36 GB200 Superchips (72 GPUs + 36 CPUs), NVLink Switch hardware, ConnectX-8 SuperNICs, power distribution, and liquid cooling manifolds.

| Configuration | Estimated Price (2026) |
|---|---|
| GB200 NVL72 (single rack) | $2,000,000 - $3,000,000 |
| GB200 Superchip (1 Grace + 2 B200) | $60,000 - $70,000 |
| Per-GPU equivalent cost | ~$28,000 - $42,000 |
| Liquid cooling infrastructure | $200,000 - $500,000 (new installation) |
| Cloud rental (per GPU hour, estimated) | $5.00 - $8.00 |

Shipments began in early 2025, and NVIDIA has stated that Blackwell production is ramping at full capacity. Microsoft, Meta, Oracle, Amazon, and Google have all confirmed large GB200 NVL72 deployments. Despite the ramp, demand continues to significantly exceed supply, and lead times for new orders remain in the 6-12 month range for most buyers.

Cloud availability is limited but growing. Microsoft Azure was among the first to deploy GB200 NVL72 at scale, and other providers are expected to offer GB200-based instances throughout 2026. For most organizations, cloud access will be the practical path to Blackwell NVL72 compute, as the capital cost, power requirements, and liquid cooling infrastructure make on-premises deployment viable only for the largest operators.

Organizations evaluating the GB200 NVL72 should also consider its successor, the GB300 NVL72, which upgrades from 192GB to 288GB HBM3e per GPU with higher bandwidth and improved FP4 performance. The GB300 was slated for H2 2025 availability and may be worth waiting for if the primary workload is inference on the largest models.

Total Cost of Ownership

The GB200 NVL72's TCO calculation is complex because it involves infrastructure costs that do not apply to traditional GPU deployments:

| Cost Component | Estimated Amount |
|---|---|
| Rack hardware (one-time) | $2,000,000 - $3,000,000 |
| Liquid cooling infrastructure (one-time) | $200,000 - $500,000 |
| Annual power cost (120 kW at $0.10/kWh) | ~$105,000 |
| Annual cooling operational cost | ~$30,000 - $50,000 |
| Annual maintenance and support | ~$200,000 - $400,000 |
| 3-year total cost of ownership | ~$3.2M - $5.2M |

Despite the high absolute cost, the GB200 NVL72 can be cost-effective compared to equivalent H100 clusters. To match the GB200 NVL72's 720 PFLOPS of FP8 compute, you would need approximately 182 H100 SXM GPUs (720,000 / 3,958 TFLOPS each). At $25,000-$30,000 per H100, that is $4.5M-$5.5M in GPU costs alone, plus servers, networking, and facility costs. The GB200 NVL72 delivers the same compute in a denser, more power-efficient package with the added benefit of NVLink-speed communication across all 72 GPUs.
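The equivalence math can be checked directly; the sparse FP8 TFLOPS per H100 and the per-GPU price range are the figures quoted above:

```python
import math

# Reproducing the equivalence math: H100 SXM GPUs needed to match the
# rack's 720 PFLOPS of sparse FP8, and the resulting GPU-only cost range.

H100_FP8_TFLOPS = 3_958        # sparse FP8 per H100 SXM
RACK_FP8_TFLOPS = 720_000      # GB200 NVL72, sparse

gpus_needed = math.ceil(RACK_FP8_TFLOPS / H100_FP8_TFLOPS)
low, high = gpus_needed * 25_000, gpus_needed * 30_000
print(gpus_needed)                                # 182
print(f"${low / 1e6:.2f}M - ${high / 1e6:.2f}M")  # $4.55M - $5.46M
```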

Strengths

  • 1.44 exaflops of FP4 in a single rack - more compute than entire GPU clusters deployed just two years ago
  • 72-GPU NVLink domain with 130 TB/s bisection bandwidth eliminates the InfiniBand bottleneck for large-scale training
  • 13.5TB of unified HBM3e memory enables trillion-parameter models to fit within a single NVLink domain
  • Grace CPU integration via NVLink-C2C delivers 900 GB/s coherent CPU-GPU bandwidth - 7x faster than PCIe Gen 5
  • Full liquid cooling enables higher density and more consistent thermals than air-cooled alternatives
  • ConnectX-8 SuperNIC with 800 Gb/s per GPU enables efficient multi-rack scaling for the largest training runs
  • Supported by the full NVIDIA AI Enterprise software stack - CUDA, cuDNN, TensorRT-LLM, Megatron, NeMo
  • ARM-based Grace CPUs provide power-efficient host processing with 17.3TB of LPDDR5X for data staging
  • Non-blocking NVLink Switch fabric ensures contention-free GPU-to-GPU communication

Weaknesses

  • $2-3 million per rack is an enormous capital expenditure - only hyperscalers and well-funded AI labs can deploy at scale
  • 120kW per rack requires purpose-built datacenter infrastructure with high-density liquid cooling
  • 1.36 metric tons per rack exceeds the floor loading capacity of many existing datacenters
  • Supply is severely constrained with 6-12 month lead times as of early 2026
  • Inter-rack communication still falls back to InfiniBand/RoCE, creating a bandwidth cliff at the rack boundary
  • The all-liquid-cooling requirement eliminates deployment in existing air-cooled facilities without major retrofitting
  • About to be superseded by the GB300 NVL72 with 50% more memory per GPU and higher performance
  • Liquid cooling infrastructure adds $200K-$500K to the deployment cost that is separate from the rack price
  • Spare parts and field service for a $3M liquid-cooled rack system require specialized NVIDIA-certified technicians

Who Should Deploy the GB200 NVL72

The GB200 NVL72 is not a product for most organizations. It is designed for hyperscalers, well-funded AI labs, and sovereign AI initiatives that need the highest compute density available and can invest in purpose-built infrastructure.

Good Fit

Frontier model training labs. If you are training models with hundreds of billions to trillions of parameters, the GB200 NVL72's 72-GPU NVLink domain eliminates the communication bottleneck that limits training efficiency on smaller clusters. The 13.5TB of unified HBM3e enables model parallelism strategies that would require costly InfiniBand communication on traditional GPU clusters.

Hyperscale inference providers. Companies serving AI inference to millions of users need the highest throughput per rack. The GB200 NVL72's 1.44 EXAFLOPS of FP4 per rack, combined with the NVLink fabric for efficient model-parallel inference, enables real-time serving of the largest models at scale.

Sovereign AI programs. National AI infrastructure programs that need to deploy large-scale compute quickly benefit from the GB200 NVL72's integrated design. A single rack provides more compute than most countries' entire GPU fleet deployed just three years ago. The integrated liquid cooling, power distribution, and networking simplify deployment compared to assembling equivalent compute from individual servers.

Poor Fit

Organizations with less than $10M AI compute budget. At $2-3M per rack plus infrastructure costs, the GB200 NVL72 requires a minimum investment of $3-5M per rack. Organizations with smaller budgets should consider DGX B200 (8 GPUs, $400K-$500K) or cloud-based Blackwell instances.

Workloads that fit on 8 GPUs. If your training or inference workload fits within a single 8-GPU DGX node, the 72-GPU NVLink domain provides no benefit. You are paying for 64 GPUs' worth of NVLink connectivity that your workload does not use. A standard DGX B200 is the right choice for 8-GPU workloads.

Facilities without liquid cooling infrastructure. The GB200 NVL72 requires facility chilled water with adequate capacity for 120kW of sustained thermal load. Retrofitting an existing air-cooled datacenter for liquid cooling is a 6-12 month project. If your facility cannot support liquid cooling within your deployment timeline, cloud-based Blackwell compute is the alternative.

Deployment Guide

Datacenter Prerequisites

Deploying a GB200 NVL72 rack requires the following infrastructure:

Power: 120kW per rack at 200-480V AC or 240-400V DC. Three-phase power distribution with adequate circuit breaker capacity. Each rack typically requires 2-3 power whips depending on the voltage and amperage configuration. Uninterruptible power supply (UPS) coverage for 120kW per rack is expensive - many deployments opt for generator backup only, without UPS, for cost reasons.

Cooling: Facility chilled water at 25-45C supply temperature with flow capacity of approximately 20-25 liters per minute per rack. The cooling distribution unit (CDU) converts facility water to the GPU-side coolant loop. Leak detection sensors must be installed throughout the rack and in the under-floor plenum or ceiling return path.

Floor loading: 1.36 metric tons (3,000 lbs) per rack. Many existing datacenters have floor loading limits of 2,000-2,500 lbs per rack, which is insufficient for the GB200 NVL72. Structural reinforcement may be required.

Network: Each rack requires 72 ConnectX-8 network ports (one per GPU) at 800 Gb/s each. For InfiniBand connectivity, this means 72 high-speed ports per rack (400 Gb/s NDR or 800 Gb/s XDR, depending on the switch generation), connected to leaf switches with an appropriate oversubscription ratio. A typical deployment uses a 2:1 or 4:1 oversubscription ratio to manage switch port costs.
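As a rough sizing sketch for one rack's GPU-facing ports: the 64-port leaf switch below is a hypothetical device used only to illustrate the arithmetic, not a specific vendor product:

```python
import math

# Rough leaf-switch sizing for 72 GPU-facing ports under a chosen
# oversubscription ratio. The 64-port leaf switch is a hypothetical device.

def leaf_plan(gpu_ports: int, ports_per_leaf: int, oversub: float):
    """Returns (leaf switches needed, uplink-ports' worth of bandwidth per leaf)."""
    leaves = math.ceil(gpu_ports / ports_per_leaf)
    uplinks_per_leaf = math.ceil(ports_per_leaf / oversub)
    return leaves, uplinks_per_leaf

print(leaf_plan(72, 64, 2.0))  # (2, 32): 2 leaves, 32 uplink-ports each at 2:1
```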

Physical space: The rack occupies a single full-height rack footprint but requires additional clearance for coolant piping at the top and bottom and network cabling at the rear. Plan for approximately 4 feet of clearance at both the front and rear of the rack footprint.

Multi-Rack Network Architecture

For training runs that span multiple GB200 NVL72 racks, the inter-rack network design is critical. NVIDIA recommends a fat-tree or rail-optimized topology using Quantum-2 InfiniBand switches (400 Gb/s NDR per port) or Spectrum-X Ethernet switches (400 GbE per port).

A typical multi-rack deployment connects each GPU's ConnectX-8 SuperNIC (800 Gb/s) to a leaf switch, with spine switches providing full bisection bandwidth between racks. For a 10-rack deployment (720 GPUs), the network requires:

  • 720 leaf switch ports (800 Gb/s each) - typically 20-30 Quantum-2 leaf switches
  • Sufficient spine switch ports for the desired oversubscription ratio
  • Total switch cost of $500K-$1.5M depending on topology and oversubscription

The 800 Gb/s per-GPU network bandwidth (via ConnectX-8 dual-port) is 2x higher than previous-generation systems, which partially compensates for the bandwidth cliff between NVLink (1,800 GB/s) and network (100 GB/s effective per port). For all-reduce operations across racks, the network bandwidth is approximately 18x lower than within-rack NVLink bandwidth - this ratio dictates the optimal training parallelism strategy for multi-rack deployments.
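That ~18x ratio, and why the once-per-step data-parallel sync can tolerate it, can be sketched with the standard ring all-reduce approximation; the gradient payload and rack count below are assumptions:

```python
# The ~18x intra-rack vs inter-rack gap, and why the once-per-step data-
# parallel sync can tolerate it. Payload size and rack count are assumed;
# the cost model is the standard 2*(N-1)/N ring all-reduce approximation.

NVLINK_GB_S, NETWORK_GB_S = 1800, 100   # per-GPU bandwidth in / between racks
print(NVLINK_GB_S / NETWORK_GB_S)       # 18.0

def ring_allreduce_s(size_gb: float, n: int, bw_gb_s: float) -> float:
    return 2 * (n - 1) / n * size_gb / bw_gb_s

# 10 GB of gradients synced across 10 racks, once per training step:
print(f"{ring_allreduce_s(10, 10, NETWORK_GB_S) * 1000:.0f} ms per DP sync")
```

Because this cost is paid once per step rather than once per micro-batch, it amortizes far better than tensor- or pipeline-parallel traffic would on the same link.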

Operational Considerations

Monitoring: NVIDIA provides DCGM (Data Center GPU Manager) for GPU health monitoring, including temperature, power, memory errors, and NVLink link health. The GB200 NVL72 adds rack-level monitoring for coolant temperature, flow rate, pressure, and leak detection. All monitoring data should be integrated into your existing DCIM (Data Center Infrastructure Management) platform.

Maintenance: GPU failures in the GB200 NVL72 require NVIDIA-certified field service technicians. Individual GPUs can be replaced without draining the entire coolant loop, but the liquid-cooled design adds complexity compared to air-cooled GPU swaps. NVIDIA's standard support contract includes 4-hour hardware replacement SLAs for critical components.

Software updates: NVIDIA driver and firmware updates for the GB200 NVL72 require careful coordination. NVLink Switch firmware updates may require draining the NVLink domain (taking all 72 GPUs offline simultaneously), which necessitates workload migration or a maintenance window. Plan for quarterly maintenance windows to accommodate critical firmware updates.

Use Case Deep Dives

Trillion-Parameter Model Training

The GB200 NVL72 was specifically designed for training models with trillions of parameters. Here is how the training parallelism maps to the GB200 NVL72's architecture for a hypothetical 1-trillion-parameter model:

| Parallelism Dimension | Assignment | Communication Fabric |
|---|---|---|
| Tensor Parallelism (TP=8) | 8 GPUs within a compute group | NVLink (1,800 GB/s) |
| Expert Parallelism (EP=8) | 8 GPUs across compute groups | NVLink (1,800 GB/s) |
| Pipeline Parallelism (PP=varies) | Pipeline stages across GPUs | NVLink (1,800 GB/s) |
| Data Parallelism (DP=varies) | Across racks | InfiniBand (800 Gb/s) |

The critical advantage is that TP, EP, and PP all stay on NVLink within the rack. On a traditional H100 cluster, only TP (within a single DGX node of 8 GPUs) uses NVLink. EP and PP must traverse InfiniBand, which is 18-36x slower. The GB200 NVL72 eliminates this bottleneck for training configurations that fit within 72 GPUs.

For a 1T model with TP=8, PP=9, 72 GPUs provide one complete pipeline replica with full NVLink connectivity. Data parallelism across multiple racks provides scaling beyond 72 GPUs, with only the DP all-reduce operations using InfiniBand. Since DP communication is amortized over the full training batch (unlike TP and PP which occur per micro-batch), the InfiniBand overhead is manageable.
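The viable (TP, PP) splits that fill one 72-GPU domain with a single pipeline replica can be enumerated directly; the small helper below is ours:

```python
# Enumerating (TP, PP) factorizations that fill the 72-GPU domain with one
# pipeline replica, as in the TP=8 x PP=9 layout described above.

def replica_layouts(domain: int = 72, max_tp: int = 16):
    """All (tensor-parallel, pipeline-parallel) splits with TP * PP == domain."""
    return [(tp, domain // tp) for tp in range(1, max_tp + 1) if domain % tp == 0]

print(replica_layouts())
# [(1, 72), (2, 36), (3, 24), (4, 18), (6, 12), (8, 9), (9, 8), (12, 6)]
```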

Real-Time Inference for Large Models

The GB200 NVL72 enables real-time inference for models that cannot be served interactively on smaller systems. A 400B+ parameter model at FP4 requires approximately 200GB+ of memory for weights, which exceeds any single GPU's capacity. On the GB200 NVL72, this model can be sharded across 2-3 GPUs within the NVLink domain, with the remaining 69-70 GPUs available for prefill compute and KV-cache storage.

The NVLink-connected sharding eliminates the latency penalty that InfiniBand-connected sharding introduces. For interactive applications targeting <200ms time-to-first-token, the difference between NVLink latency (microseconds) and InfiniBand latency (hundreds of microseconds) per layer boundary can mean the difference between meeting and missing the latency target.
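A toy time-to-first-token budget makes the point; the prefill compute time and the number of cross-GPU boundaries below are hypothetical values chosen only to illustrate the effect:

```python
# Toy time-to-first-token (TTFT) budget: assumed prefill compute time plus
# per-hop latency at each cross-GPU layer boundary. All numbers here are
# hypothetical, chosen to illustrate the NVLink-vs-InfiniBand effect.

def ttft_ms(prefill_compute_ms: float, boundaries: int, hop_us: float) -> float:
    """TTFT = compute + (number of cross-GPU hops * per-hop latency)."""
    return prefill_compute_ms + boundaries * hop_us / 1000

# Assumed: 150 ms of prefill compute, 200 cross-GPU boundaries traversed.
print(f"NVLink hops (10 us):      {ttft_ms(150, 200, 10):.1f} ms")   # meets  <200 ms
print(f"InfiniBand hops (300 us): {ttft_ms(150, 200, 300):.1f} ms")  # misses <200 ms
```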

NVIDIA reports that the GB200 NVL72 can serve real-time inference on trillion-parameter models at 30x the throughput of an equivalent DGX H100 cluster. While the absolute throughput depends on the model architecture and serving configuration, the NVLink fabric's role in enabling low-latency model parallelism across many GPUs is the key enabler.

Comparison with H100 Cluster Economics

A common evaluation question is whether to deploy GB200 NVL72 racks or build a larger H100 cluster for the same budget. Here is a direct comparison at the $3M budget level:

| Metric | GB200 NVL72 (1 rack) | DGX H100 Cluster (10 nodes, 80 GPUs) |
|---|---|---|
| Budget | ~$3M | ~$3.5M (GPUs + networking) |
| FP8 Compute | 720 PFLOPS | 317 PFLOPS |
| Total Memory | 13.5 TB | 6.4 TB |
| NVLink Domain | 72 GPUs | 8 GPUs (per node) |
| Power | ~120 kW | ~102 kW (10 nodes) + ~20 kW (networking) |
| Cooling | Liquid (required) | Air (feasible) |
| Inter-GPU Bandwidth (worst case) | 1,800 GB/s (NVLink) | 50 GB/s (InfiniBand between nodes) |

The GB200 NVL72 delivers 2.3x more compute in less physical space, with a dramatically faster interconnect fabric. The H100 cluster's only advantages are air-cooling compatibility (avoiding liquid cooling infrastructure) and lower per-rack power density, since a similar total draw is spread across many more racks. For organizations with liquid cooling infrastructure, the GB200 NVL72 is the clear winner.

For organizations without liquid cooling, the H100 cluster is deployable immediately while the GB200 NVL72 requires infrastructure investment. This practical constraint often determines the choice regardless of performance comparisons.

Multi-Model Serving

An underappreciated capability of the GB200 NVL72 is multi-model serving. The 13.5TB of unified HBM3e can hold many models simultaneously, with the NVLink fabric enabling dynamic load balancing across GPUs. A single GB200 NVL72 rack could simultaneously serve:

  • 1x 400B model (FP4, ~200GB, across 2 GPUs)
  • 5x 70B models (FP4, ~35GB each, 1 GPU each)
  • 10x 7B models (FP4, ~3.5GB each, sharing remaining GPUs)
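A quick check that this example catalog fits; the sizes are the approximate FP4 figures from the list above, against 192 GB of HBM3e per GPU:

```python
# Checking that the example catalog fits in the rack's HBM. Sizes are the
# approximate FP4 figures from the list; each GPU carries 192 GB.

catalog = [("400B", 200.0, 2)] + [("70B", 35.0, 1)] * 5 + [("7B", 3.5, 0)] * 10

total_weights_gb = sum(size for _, size, _ in catalog)
dedicated_gpus   = sum(gpus for _, _, gpus in catalog)

print(total_weights_gb)              # 410.0 GB of weights
print(dedicated_gpus)                # 7 dedicated GPUs of 72
print(total_weights_gb <= 72 * 192)  # True: fits with ample headroom
```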

This multi-model capability is valuable for inference providers that need to serve diverse model catalogs from shared infrastructure. The NVLink fabric enables flexible partitioning without the overhead of separate network-connected servers for each model.

Cloud Provider Availability

Cloud availability for the GB200 NVL72 is emerging, with Microsoft Azure leading the deployment:

| Cloud Provider | Status | Instance Details |
|---|---|---|
| Microsoft Azure | Available | ND GB200 v6 (72 GPUs per instance) |
| Google Cloud | Expected | Cloud TPU/GPU instances (timing TBD) |
| AWS | Expected | H2 2026 (estimated) |
| Oracle Cloud | Confirmed | OCI Superclusters (shipping) |
| CoreWeave | Expected | 2026 (estimated) |

For most organizations, cloud access is the practical path to GB200 NVL72 compute. The $2-3M per-rack capital cost, liquid cooling requirements, and specialized datacenter infrastructure make on-premises deployment viable only for organizations with AI compute budgets exceeding $10M.

Cloud providers typically sell GB200 NVL72 compute in large blocks - you rent an entire 72-GPU NVLink domain rather than individual GPUs. This makes pricing models different from traditional GPU cloud instances where you can rent a single GPU. Expect per-GPU-hour pricing of $5-8 with minimum commitment requirements.

Export Control Considerations

The GB200 NVL72 is subject to the same US export controls as individual Blackwell GPUs. Given that the system contains 72 B200 GPUs, the export control implications are amplified - a single rack represents significantly more compute than the thresholds targeted by current regulations.

For sovereign AI programs in allied nations, the GB200 NVL72 is available through standard procurement channels. For cloud deployments serving global customers, providers must implement geofencing and customer verification to ensure compliance with export restrictions on the underlying compute.

About the author

James, AI Benchmarks & Tools Analyst, is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.