
NVIDIA GB300 NVL72 - Blackwell Ultra Rack

Complete specs, benchmarks, and analysis of the NVIDIA GB300 NVL72 - the Blackwell Ultra rack-scale system with 288GB HBM3e per GPU, 1.5x more FP4 compute, and 2x attention performance over GB200.


TL;DR

  • 288GB HBM3e per GPU using 12-Hi stacks - a 50% memory increase over the GB200 NVL72's 192GB per GPU, totaling 20.7TB per rack
  • 1.5x more dense FP4 compute and 2x higher attention performance compared to first-generation Blackwell
  • 20,480 CUDA cores and 640 fifth-generation Tensor Cores per B300 GPU with support for FP8, FP6, and NVFP4 precision
  • 1,400W TDP per GPU with the same fully liquid-cooled rack design and 72-GPU NVLink domain as the GB200
  • Expected H2 2025 availability at an estimated $3-4 million per rack - Microsoft Azure confirmed as first large-scale deployer

Overview

The NVIDIA GB300 NVL72 is the Blackwell Ultra generation of NVIDIA's rack-scale AI supercomputer. Announced at GTC 2025, the GB300 upgrades every dimension of the GB200 NVL72: 50% more memory per GPU (288GB vs 192GB), 1.5x more dense FP4 compute, 2x higher attention performance, and more CUDA cores (20,480 vs 18,432). The rack-level architecture stays the same - 72 GPUs and 36 Grace CPUs in a unified NVLink domain - but the per-GPU improvements compound across all 72 GPUs to deliver a significant generational leap.

The B300 GPU at the heart of the GB300 NVL72 pushes the Blackwell architecture to its limits. Each B300 features 160 streaming multiprocessors with 20,480 CUDA cores and 640 fifth-generation Tensor Cores, packed into a dual-die package with 208 billion transistors. The key innovation is the memory: 288GB of HBM3e using 12-Hi stacks (compared to 8-Hi stacks in the B200), delivering the same 8 TB/s bandwidth in a 50% larger capacity envelope. At the rack level, this gives the GB300 NVL72 20.7TB of unified HBM3e memory - enough to hold the weights and KV-cache for the largest models in production today.

The performance improvements are specifically targeted at reasoning and agentic AI workloads. The 2x attention performance improvement - achieved through Tensor Core architecture enhancements and dedicated attention acceleration hardware - directly addresses the compute pattern that dominates transformer inference at long context lengths. As AI models move from simple question-answering to multi-step reasoning chains with extended context, attention compute becomes the bottleneck. The GB300 is NVIDIA's answer to that shift.

Microsoft Azure has confirmed the first large-scale deployment, with a 4,608-GPU cluster (64 racks) capable of 92.1 exaflops of FP4 compute. This cluster will power OpenAI's next-generation workloads, underscoring the GB300's position as the infrastructure of choice for frontier AI development.

Key Specifications

| Specification | Details |
|---|---|
| Manufacturer | NVIDIA |
| System Type | Rack-scale AI supercomputer |
| Architecture | Blackwell Ultra (B300 GPU + Grace CPU) |
| Process Node | TSMC 4NP (GPU), TSMC 4N (Grace CPU) |
| GPUs per Rack | 72 (B300) |
| CPUs per Rack | 36 (Grace ARM) |
| Superchips per Rack | 36 (1 Grace + 2 B300 each) |
| GPU Transistors | 208 billion per GPU (dual-die) |
| CUDA Cores per GPU | 20,480 |
| Tensor Cores per GPU | 640 (5th generation) |
| Streaming Multiprocessors per GPU | 160 |
| GPU Boost Clock | 2.6 GHz |
| GPU Memory per GPU | 288 GB HBM3e (12-Hi stacks) |
| Total GPU Memory | 20,736 GB (20.7 TB) |
| Memory Bandwidth per GPU | 8,000 GB/s (8 TB/s) |
| Memory Controller | 16 x 512-bit |
| Total Memory Bandwidth | 576 TB/s |
| FP4 Performance per GPU | 15 PFLOPS (dense) |
| FP4 Performance per Rack | ~1,080 PFLOPS (dense), ~2,160 PFLOPS (sparse) |
| FP8 Performance per GPU | ~10 PFLOPS (estimated dense) |
| Attention Performance | 2x vs GB200 NVL72 |
| Tensor Memory (TMEM) per SM | 256 KB |
| NVLink per GPU | 5th generation, 1,800 GB/s |
| NVLink Domain | 72 GPUs (full rack) |
| NVLink Bisection Bandwidth | 130 TB/s |
| Network I/O per GPU | ConnectX-8 SuperNIC, 800 Gb/s |
| CPU Memory per CPU | 480 GB LPDDR5X |
| Total CPU Memory | 17,280 GB (17.3 TB) |
| GPU TDP | 1,400W |
| Total Rack Power | ~120 kW |
| Cooling | Liquid cooling (required) |
| PCIe | Gen 6 |
| Release Date | H2 2025 (expected) |

The B300's SM architecture deserves specific attention. With 160 SMs (versus 148 in the B200), each containing 128 CUDA cores and 4 fifth-generation Tensor Cores, the B300 provides 8% more SMs and 11% more CUDA/Tensor cores. But the per-core performance improvements - particularly in Tensor Core throughput and the addition of 256KB of Tensor Memory (TMEM) per SM - deliver a disproportionately larger performance gain. The 50% more FP4 compute (15 PFLOPS vs 10 PFLOPS per GPU) comes from the combination of more SMs, higher per-SM throughput, and the new NVFP4 precision format.

The 1,400W TDP per B300 GPU is 40% higher than the B200's 1,000W. At 72 GPUs, the raw GPU power draw alone would be 100.8kW - the vast majority of the rack's ~120kW total power budget. The remaining ~19kW covers 36 Grace CPUs, NVLink switches, ConnectX-8 SuperNICs, power conversion, and cooling pumps. Despite the higher per-GPU TDP, the total rack power remains approximately the same as the GB200 NVL72 due to improved power management and efficiency optimizations in the system design.
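
As a quick sanity check on those numbers, here is the arithmetic in Python, using only the figures quoted above (the split of the non-GPU remainder is inferred from the rack total, not an official breakdown):

```python
# Back-of-the-envelope rack power budget, using only the figures quoted above.
GPU_TDP_W = 1_400
GPUS_PER_RACK = 72
RACK_POWER_W = 120_000          # ~120 kW quoted rack total

gpu_power_w = GPU_TDP_W * GPUS_PER_RACK        # 100,800 W = 100.8 kW
remainder_w = RACK_POWER_W - gpu_power_w       # ~19 kW for CPUs, switches, NICs, pumps

print(f"GPU power draw:    {gpu_power_w / 1000:.1f} kW")
print(f"Everything else:   {remainder_w / 1000:.1f} kW")
print(f"GPU share of rack: {gpu_power_w / RACK_POWER_W:.0%}")
```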

Performance Benchmarks

| Metric | GB200 NVL72 | GB300 NVL72 | Improvement |
|---|---|---|---|
| GPU Memory per GPU | 192 GB | 288 GB | +50% |
| Total Rack Memory | 13.5 TB | 20.7 TB | +53% |
| FP4 Dense (per GPU) | 10 PFLOPS | 15 PFLOPS | +50% |
| FP4 Dense (per rack) | 720 PFLOPS | ~1,080 PFLOPS | +50% |
| Attention Performance | 1x (baseline) | 2x | +100% |
| CUDA Cores per GPU | 18,432 | 20,480 | +11% |
| Tensor Cores per GPU | 576 | 640 | +11% |
| SMs per GPU | 148 | 160 | +8% |
| GPU TDP | 1,000W | 1,400W | +40% |
| Memory Bandwidth per GPU | 8 TB/s | 8 TB/s | Same |
| NVLink Bandwidth per GPU | 1,800 GB/s | 1,800 GB/s | Same |

Memory Capacity Analysis

The memory increase from 192GB to 288GB per GPU - and from 13.5TB to 20.7TB per rack - crosses several important deployment thresholds.

The 50% more memory per GPU means the GB300 NVL72 can handle larger models, longer context windows, or more concurrent inference requests. For a quantized trillion-parameter model at FP4 (approximately 500GB for weights), the GB200 NVL72 requires at least 3 GPUs for weight storage. The GB300 NVL72 needs only 2, freeing the third GPU's memory for KV-cache and activations. Across the full rack, this improved packing efficiency compounds.

For inference with long context windows, KV-cache memory consumption is the binding constraint. A 70B model serving a 1-million-token context can require on the order of 160GB of KV-cache per request (the exact figure depends on the model's layer count, KV-head configuration, and cache precision). On a 192GB GPU (GB200), a single long-context request would consume most of the available memory after model weights. On a 288GB GPU (GB300), there is substantially more headroom for concurrent requests or even longer contexts.

| Model Size | FP4 Weight Size | KV-Cache (128K ctx) | Fits per GPU (GB200) | Fits per GPU (GB300) |
|---|---|---|---|---|
| 7B | ~3.5 GB | ~16 GB | Yes (many concurrent) | Yes (many concurrent) |
| 70B | ~35 GB | ~160 GB | Yes (limited cache) | Yes (generous cache) |
| 400B | ~200 GB | ~400 GB | Needs 2+ GPUs | Needs 1-2 GPUs |
| 1T | ~500 GB | ~1 TB | Needs 3+ GPUs | Needs 2+ GPUs |
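
These estimates can be reproduced with a couple of simple formulas. The sketch below is a rough sizing helper rather than a serving-framework calculation: the layer count, KV-head count, and head dimension are illustrative assumptions for a Llama-style 70B model, and activation memory and quantization scale-factor overhead are ignored.

```python
import math

def fp4_weight_gb(params_billion: float) -> float:
    """Approximate FP4 weight footprint in GB (0.5 bytes per parameter)."""
    return params_billion * 1e9 * 0.5 / 1e9

def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 1) -> float:
    """KV-cache size in GB: keys + values for every layer and cached token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token_bytes / 1e9

def gpus_for_weights(params_billion: float, gpu_mem_gb: int) -> int:
    """Minimum GPUs needed just to hold the FP4 weights (no cache, no activations)."""
    return math.ceil(fp4_weight_gb(params_billion) / gpu_mem_gb)

# A 1T-parameter model at FP4 is ~500 GB of weights:
print(gpus_for_weights(1_000, 192))   # 3 GPUs on GB200 (192 GB each)
print(gpus_for_weights(1_000, 288))   # 2 GPUs on GB300 (288 GB each)

# Illustrative 70B shape (80 layers, 8 KV heads, head_dim 128) with an 8-bit cache;
# a 16-bit cache doubles this figure.
print(f"{kv_cache_gb(1_000_000, 80, 8, 128):.0f} GB of KV-cache for a 1M-token context")
```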

Attention Performance Deep Dive

The 2x attention performance improvement is the GB300 NVL72's most architecturally significant advancement. Attention compute in transformers grows quadratically with sequence length during prefill, and during decode each newly generated token must attend to every token already in the context, so the per-token cost grows linearly with context length (FlashAttention-style kernels reduce memory traffic but do not change this scaling). For reasoning workloads that generate thousands of tokens of chain-of-thought, attention therefore comes to dominate the compute budget.
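
To make the scaling concrete, the sketch below estimates the attention FLOPs needed to generate a single new token at different context lengths. The model shape (80 layers, 64 query heads, head dimension 128) is an illustrative Llama-style 70B assumption, and the constant factors are rough.

```python
def decode_attention_flops(context_len: int, layers: int, heads: int, head_dim: int) -> float:
    """Approximate attention FLOPs to generate one new token at a given context length.

    Per layer: Q @ K^T over the full context plus the attention-weighted sum over V,
    each roughly 2 * context_len * heads * head_dim multiply-accumulates.
    """
    return layers * 2 * (2 * context_len * heads * head_dim)

# Illustrative Llama-style 70B shape: 80 layers, 64 query heads, head_dim 128.
for ctx in (4_096, 131_072, 1_048_576):
    gflops = decode_attention_flops(ctx, 80, 64, 128) / 1e9
    print(f"{ctx:>9,} tokens of context -> ~{gflops:,.0f} GFLOPs of attention per new token")
```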

The B300 addresses this through two hardware mechanisms:

First, each SM includes 256KB of Tensor Memory (TMEM) - a dedicated memory tier positioned between registers and shared memory in the SM's memory hierarchy. TMEM is optimized for the data access patterns of attention computation, providing low-latency, high-bandwidth access to attention intermediate values (QK^T products, softmax results, attention output accumulation). By keeping these intermediates in TMEM rather than shared memory, the attention pipeline avoids the latency of shared memory bank conflicts and the bandwidth limitations of the register-to-shared-memory path.

Second, the fifth-generation Tensor Cores include dedicated attention instructions that fuse the multi-step attention computation (Q*K^T, scaling, masking, softmax, V multiplication) into fewer hardware operations. This fusion reduces instruction dispatch overhead and improves Tensor Core utilization during attention computation, which historically underutilizes Tensor Cores due to the non-GEMM operations (softmax, masking) interspersed with matrix multiplications.

The combined effect is 2x attention throughput per GPU, which translates to 2x throughput for attention-dominated workloads. For a 70B model generating 4,096 tokens of chain-of-thought reasoning, attention computation accounts for approximately 60-70% of total inference time. A 2x improvement in attention directly translates to a 1.4-1.5x improvement in end-to-end token generation speed for reasoning workloads.
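
The end-to-end figure follows from an Amdahl-style estimate, treating the 60-70% attention share quoted above as given:

```python
def end_to_end_speedup(attention_fraction: float, attention_speedup: float = 2.0) -> float:
    """Only the attention share of runtime gets faster; everything else is unchanged."""
    new_time = (1 - attention_fraction) + attention_fraction / attention_speedup
    return 1 / new_time

for frac in (0.6, 0.7):
    print(f"attention = {frac:.0%} of runtime -> {end_to_end_speedup(frac):.2f}x end-to-end")
# attention = 60% of runtime -> 1.43x end-to-end
# attention = 70% of runtime -> 1.54x end-to-end
```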

Key Capabilities

12-Hi HBM3e Stacks. The B300 GPU achieves its 288GB capacity by using 12-Hi HBM3e stacks - each stack containing 12 layers of DRAM dies bonded vertically, compared to the 8-Hi stacks used in the B200. This is a packaging innovation rather than a silicon change: the same HBM3e DRAM technology is used, but with 50% more layers per stack. The 16x512-bit memory controller design maintains the same 8 TB/s bandwidth at the higher capacity, meaning the B300 does not trade bandwidth for capacity.

The 12-Hi stacking is a significant manufacturing achievement. Each additional DRAM layer increases the mechanical stress on the Through-Silicon Vias (TSVs) that carry signals between layers, the thermal resistance of the stack (making cooling more challenging), and the overall yield risk. SK Hynix, Samsung, and Micron have each invested heavily in developing 12-Hi HBM3e, and the B300 is the first GPU to deploy it at scale. The fact that the bandwidth per stack is maintained at 12-Hi (implying the same or higher per-layer signaling rate) speaks to the maturity of the technology.

At the rack level, the 72 GPUs provide 20.7TB of unified memory at an aggregate 576 TB/s - the largest and fastest memory pool available in any commercially shipping AI system. Combined with the 72-GPU NVLink domain, this memory pool is accessible to any GPU at NVLink speeds, enabling models and datasets that span the full 20.7TB to be processed without InfiniBand-limited cross-rack communication.

NVFP4 Precision Format. While the B200 introduced FP4 support, the B300 adds NVIDIA's proprietary NVFP4 format - a 4-bit floating-point representation optimized for inference accuracy. NVFP4 pairs 4-bit elements with fine-grained block-level scale factors (on top of a per-tensor scale), which preserves accuracy better than plain, coarsely scaled FP4 at the same bitwidth.

The NVFP4 format is specifically designed for the weight distributions observed in large language models. A 4-bit float (E2M1 - 2-bit exponent, 1-bit mantissa) offers only eight representable magnitudes per sign, which makes it hard for a single coarse scale factor to capture distributions that cluster around zero yet include important outlier values. NVFP4's fine-grained scaling lets each small block of weights use the full 4-bit range, reducing quantization error for the most common weight patterns.

Combined with the second-generation Transformer Engine's automatic precision management, NVFP4 on the B300 enables 4-bit inference on a broader range of models with acceptable accuracy degradation. At 15 PFLOPS dense NVFP4 per GPU, the B300 is built for the era where 4-bit quantized inference becomes the production default.
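
As an illustration of why block-level scaling helps, here is a simplified quantizer in the spirit of NVFP4: E2M1-style 4-bit magnitudes with one shared scale per small block. The block size of 16 and the plain float scale are assumptions for illustration; the production format's exact scale encoding is NVIDIA-specific and is handled by NVIDIA's quantization tooling rather than user code.

```python
import numpy as np

# The eight non-negative magnitudes representable by an E2M1 4-bit float
# (sign + 2-bit exponent + 1-bit mantissa).
E2M1_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_fp4(block: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize one block of weights to signed E2M1 values plus one shared scale."""
    max_abs = float(np.abs(block).max())
    scale = max_abs / E2M1_MAGNITUDES[-1] if max_abs > 0 else 1.0
    scaled = block / scale
    # Snap each element to the nearest representable magnitude, keeping its sign.
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_MAGNITUDES).argmin(axis=1)
    return np.sign(scaled) * E2M1_MAGNITUDES[idx], scale

def dequantize(quantized: np.ndarray, scale: float) -> np.ndarray:
    return quantized * scale

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.02, size=16)   # assumed block size of 16 for illustration
q, s = quantize_block_fp4(weights)
print("max abs reconstruction error:", float(np.abs(weights - dequantize(q, s)).max()))
```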

PCIe Gen 6 Interface. The B300 is the first NVIDIA datacenter GPU to support PCIe Gen 6, doubling the host link bandwidth from ~128 GB/s bidirectional (Gen 5 x16) to ~256 GB/s bidirectional (Gen 6 x16). While this matters less in the GB300 NVL72 context - where the Grace CPU connects via NVLink-C2C at 900 GB/s - it is significant for standalone B300 GPUs deployed in x86 server platforms.

PCIe Gen 6 also enables faster NVMe storage access for data loading and checkpointing, which can be a bottleneck in large-scale training runs. At 256 GB/s per GPU, a B300 in a PCIe server can load training data from high-speed NVMe arrays at rates that keep the GPU pipeline fed without data staging delays.

NVLink 5 with Enhanced Collective Operations. The GB300 NVL72 uses the same NVLink 5 interconnect as the GB200 NVL72 (1,800 GB/s per GPU, 130 TB/s aggregate), but with software-level enhancements to collective operations. The NVLink Switch firmware in the GB300 includes optimized all-reduce, all-gather, and reduce-scatter implementations that take advantage of the B300's higher per-GPU compute throughput. These optimizations reduce the software overhead of collective communication operations, improving the ratio of useful compute to communication overhead in distributed training.

Pricing and Availability

The GB300 NVL72 is expected to ship in H2 2025. Pricing is estimated at $3-4 million per rack, representing a 30-50% premium over the GB200 NVL72's $2-3 million price point. The premium reflects the 50% more memory per GPU, higher compute density, and updated components.

| Configuration | Estimated Price (2026) |
|---|---|
| GB300 NVL72 (single rack) | $3,000,000 - $4,000,000 (estimated) |
| Per-GPU equivalent cost | ~$42,000 - $56,000 |
| Liquid cooling infrastructure | $200,000 - $500,000 (new installation) |
| Cloud rental (per GPU hour, estimated) | $6.00 - $10.00 |

Microsoft Azure has confirmed the first large-scale deployment: a 4,608-GPU cluster consisting of 64 GB300 NVL72 racks. This cluster delivers 92.1 exaflops of FP4 compute and will be used for OpenAI workloads. Azure's early deployment suggests that cloud availability for GB300 instances could begin in late 2025 or early 2026, though on-premises availability may lag cloud deployments.

The GB300 NVL72 shares the same physical form factor, liquid cooling requirements, and datacenter infrastructure needs as the GB200 NVL72. Organizations that have already deployed GB200 NVL72 infrastructure should be able to deploy GB300 racks into the same facilities with minimal modifications - the cooling loops, power distribution, and physical rack footprint are compatible.

Buy-Now-or-Wait Analysis

For buyers currently evaluating GB200 NVL72 orders, the decision to wait for GB300 depends on the workload:

| Scenario | Recommendation | Reasoning |
|---|---|---|
| Need compute now for training | Deploy GB200 NVL72 | Shipping today; GB300 is 6+ months out |
| Primary need is large-model inference | Consider waiting for GB300 | 50% more memory per GPU significantly improves serving capacity |
| Reasoning/agentic workloads | Consider waiting for GB300 | 2x attention performance directly benefits chain-of-thought |
| Budget-constrained | Deploy GB200 NVL72 | 30-50% lower cost per rack |
| Building new datacenter (6+ month lead) | Target GB300 NVL72 | Infrastructure will be ready when GB300 ships |

The infrastructure compatibility between GB200 and GB300 is a significant advantage. Organizations that deploy GB200 NVL72 racks now can plan to add GB300 racks later using the same cooling, power, and networking infrastructure. The two generations can coexist in the same datacenter, with workloads allocated to the appropriate hardware generation.

Total Cost of Ownership

| Cost Component | GB200 NVL72 | GB300 NVL72 | Delta |
|---|---|---|---|
| Rack hardware (one-time) | $2M - $3M | $3M - $4M | +$1M |
| Liquid cooling (one-time) | $200K - $500K | $200K - $500K | Same |
| Annual power (~120kW) | ~$105K | ~$105K | Same |
| Annual maintenance | ~$200K - $400K | ~$250K - $500K | +$50K - $100K |
| 3-year TCO | $3.2M - $5.2M | $4.3M - $6.5M | +$1.1M - $1.3M |

The GB300 NVL72's 30-50% cost premium buys 50% more memory and 50% more FP4 compute per rack. On a dollar-per-PFLOPS basis, the GB300 is approximately the same cost as the GB200 - the extra money buys proportionally more capability. The decision between the two comes down to whether the workload benefits from the additional memory and attention performance, or whether raw FP4 compute at the lowest cost is the priority.
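
Using the estimated rack prices and dense FP4 figures quoted in this article, the dollar-per-PFLOPS parity works out as follows:

```python
# Dollar-per-dense-FP4-PFLOPS at the estimated rack prices quoted in this article.
racks = {
    "GB200 NVL72": (2_000_000, 3_000_000, 720),    # (low price, high price, dense FP4 PFLOPS)
    "GB300 NVL72": (3_000_000, 4_000_000, 1_080),
}
for name, (low, high, pflops) in racks.items():
    print(f"{name}: ${low / pflops:,.0f} - ${high / pflops:,.0f} per dense FP4 PFLOPS")
# GB200 NVL72: $2,778 - $4,167 per dense FP4 PFLOPS
# GB300 NVL72: $2,778 - $3,704 per dense FP4 PFLOPS
```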

Cloud Provider Availability

Microsoft Azure has confirmed the first large-scale GB300 NVL72 deployment, and other cloud providers are expected to follow. Cloud availability is expected to ramp in late 2025 and throughout 2026.

| Cloud Provider | Status | Expected Availability |
|---|---|---|
| Microsoft Azure | Confirmed 4,608-GPU deployment | Late 2025 / Early 2026 |
| Google Cloud | Expected | H1 2026 (estimated) |
| AWS | Expected | H1 2026 (estimated) |
| Oracle Cloud | Expected | 2026 (estimated) |

For most organizations, cloud access will be the practical path to GB300 NVL72 compute. The capital cost ($3-4M per rack), liquid cooling infrastructure requirements, and deployment complexity make on-premises deployment viable only for organizations operating at hyperscaler or near-hyperscaler scale.

Cloud pricing for GB300 NVL72 instances is expected to be 30-50% higher than GB200 NVL72 instances per GPU-hour, reflecting the 50% more memory and compute per GPU. Whether this premium is justified depends on the workload: for memory-bound inference with long contexts, the GB300's extra memory delivers proportionally more throughput, making the cost-per-token comparable to or lower than GB200. For compute-bound training workloads, the GB200 NVL72 at its lower price point may offer better cost efficiency.

Export Control Considerations

The GB300 NVL72 is subject to the same US export controls as other Blackwell products. Each rack contains 72 B300 GPUs with cumulative compute far exceeding any export control threshold. Sovereign AI deployments in allied nations can procure GB300 NVL72 through standard channels, but availability may be restricted for certain markets and end-users.

The GB300 NVL72's Strategic Significance

The GB300 NVL72 represents the peak of NVIDIA's Blackwell product generation. It combines the most capable Blackwell GPU (B300 with 288GB HBM3e and enhanced attention performance) with the rack-scale NVLink architecture that defined the GB200 NVL72. For organizations at the frontier of AI development, it is the most capable commercially available AI system.

The GB300's focus on reasoning performance - through the 2x attention improvement, TMEM, and larger memory for KV-caches - reflects NVIDIA's bet on where AI is heading. The shift from "generate plausible text" (the GPT-3 era) to "reason through complex problems" (the current era of o1, DeepSeek R1, and chain-of-thought models) changes the compute profile from pure GEMM throughput to attention-dominated workloads. The GB300 is the first GPU system designed specifically for this shift.

Microsoft's decision to deploy 4,608 B300 GPUs for OpenAI's workloads validates this bet. OpenAI's reasoning models (GPT-5.2 Pro mode, the expected GPT-6 reasoning capabilities) are exactly the workload type that benefits most from the GB300's attention performance improvement. That the leading frontier AI lab's flagship deployment runs on GB300 rather than GB200 is a strong endorsement of the product's positioning.

Strengths

  • 288GB HBM3e per GPU provides the largest per-GPU memory of any shipping NVIDIA accelerator - critical for large KV-caches and model weights
  • 2x attention performance improvement directly addresses the compute bottleneck for reasoning and agentic AI workloads
  • 50% more dense FP4 compute per GPU (15 PFLOPS vs 10 PFLOPS) with improved NVFP4 precision for better accuracy
  • Same rack-level architecture as GB200 NVL72 - compatible infrastructure reduces migration cost for existing Blackwell deployers
  • 20.7TB of unified HBM3e per rack is the largest GPU memory pool in any commercially available system
  • 256KB of Tensor Memory per SM provides dedicated low-latency storage for attention intermediates
  • PCIe Gen 6 support future-proofs the platform for next-generation host and storage interfaces
  • Microsoft Azure's 4,608-GPU deployment validates the platform for frontier AI workloads at hyperscale
  • Infrastructure compatibility with GB200 NVL72 enables mixed-generation datacenter deployments

Weaknesses

  • Estimated $3-4 million per rack is 30-50% more expensive than the GB200 NVL72 for 50% more memory and compute
  • 1,400W TDP per GPU is the highest of any NVIDIA datacenter GPU - total power density pushes cooling infrastructure limits
  • Not yet broadly available as of early 2026 - organizations with immediate compute needs should deploy GB200 NVL72 now
  • Memory bandwidth per GPU remains at 8 TB/s - the same as the B200 despite the 50% memory capacity increase
  • NVLink bandwidth per GPU remains at 1,800 GB/s - no improvement in inter-GPU communication speed
  • Liquid cooling requirement and 120kW+ rack power draw restrict deployment to purpose-built datacenters
  • Limited availability data - pricing, exact ship dates, and allocation quantities are still based on estimates and early partner announcements
  • Inter-rack communication still requires InfiniBand - NVLink advantage is contained within a single rack
  • NVFP4 precision format is NVIDIA-proprietary and may not be supported in all inference frameworks at launch

Who Should Deploy the GB300 NVL72

The GB300 NVL72 targets the same class of buyer as the GB200 NVL72 - hyperscalers, frontier AI labs, and sovereign AI programs - but with specific advantages for reasoning and long-context workloads.

Good Fit

Reasoning and agentic AI workloads. The 2x attention performance improvement directly addresses the compute bottleneck for chain-of-thought reasoning, tree-of-thought exploration, and multi-step agentic planning. If your primary workload generates long sequences of reasoning tokens, the GB300 NVL72 provides a proportionally larger advantage over the GB200 than the raw FP4 numbers suggest.

Inference on the largest models with long contexts. The 288GB per GPU provides 50% more KV-cache headroom than the GB200's 192GB. For applications serving models at 1M+ token context windows, the KV-cache memory requirement is the binding constraint, and the GB300's extra memory directly translates to more concurrent long-context requests per GPU.

Organizations planning H2 2025 or later deployments. If your datacenter infrastructure is still under construction (a common situation for new AI factory builds), the GB300 NVL72's expected H2 2025 availability aligns well with typical 12-18 month datacenter construction timelines. There is no reason to deploy GB200 NVL72 racks into a facility that will not be ready until the GB300 is shipping.

Mixed-generation datacenter expansion. Organizations that already have GB200 NVL72 racks deployed can add GB300 racks into the same facility using the same cooling, power, and networking infrastructure. The two generations interoperate at the multi-rack level via InfiniBand, allowing workloads to be allocated to the appropriate hardware generation based on their compute and memory requirements.

Poor Fit

Organizations that need compute today. GB300 NVL72 racks are not yet broadly available as of early 2026. If you need large-scale compute immediately, deploy GB200 NVL72 now and plan for GB300 expansion later.

Workloads that are compute-bound rather than memory-bound. If your workload fully utilizes the GB200 NVL72's compute and is not constrained by memory capacity, the GB300's 50% more memory does not help. The 50% more FP4 compute is meaningful, but the GB200 NVL72 at its lower price point may deliver better cost efficiency for compute-bound workloads.

Organizations without long-context or reasoning requirements. The GB300's 2x attention improvement is specifically valuable for reasoning and long-context workloads. If your primary workload is short-context inference or training, this advantage is underutilized, and the GB200 NVL72 at 30-50% lower cost per rack is the better choice.

Deployment Considerations

Infrastructure Compatibility with GB200 NVL72

The GB300 NVL72 is designed for infrastructure compatibility with the GB200 NVL72. The key compatibility dimensions:

  • Rack form factor: Same physical footprint and rack design
  • Liquid cooling: Same coolant type, temperature range, and flow requirements
  • Power: Same ~120kW total rack power draw
  • Network: Same ConnectX-8 SuperNIC with 800 Gb/s per GPU
  • Floor loading: Similar weight (~1.36 metric tons)
  • Management: Same DCGM and BMC management interfaces

This compatibility means that organizations deploying GB200 NVL72 racks today can plan for GB300 NVL72 expansion with high confidence that the same infrastructure will support both generations. The cooling loops, power distribution, network fabric, and physical rack positions are interchangeable.

Software Considerations

The B300 GPU introduces new hardware capabilities (NVFP4 precision, Tensor Memory, enhanced attention instructions, PCIe Gen 6) that require updated software to exploit. The CUDA toolkit, TensorRT-LLM, and other NVIDIA software will need Blackwell Ultra-specific code paths to deliver the full 2x attention improvement and NVFP4 performance gains.

At launch, the B300 will support the same CUDA compute capability as the B200 for backward compatibility, meaning existing Blackwell-compiled applications will run without modification. However, the Blackwell Ultra-specific features will only be accessible through updated libraries and frameworks. Organizations planning GB300 NVL72 deployments should budget for software testing and optimization time, particularly for:

  • NVFP4 quantization workflows (model conversion, accuracy validation)
  • Attention kernel optimization using TMEM (may require custom kernel development for proprietary models)
  • TensorRT-LLM Blackwell Ultra profiles (expected in upcoming TensorRT-LLM releases)

Multi-Rack Deployment at Scale

Microsoft Azure's 4,608-GPU cluster (64 GB300 NVL72 racks) provides a reference architecture for large-scale GB300 deployments. This cluster delivers 92.1 exaflops of FP4 compute across 64 racks connected by a high-bandwidth InfiniBand or Ethernet fabric.

At this scale, the deployment considerations are primarily:

Network: 4,608 ConnectX-8 SuperNICs at 800 Gb/s each require approximately 128-192 leaf switches and a multi-tier spine network. The total network fabric cost for a 64-rack deployment is estimated at $5-15M depending on the topology and switch vendor.

Power: 64 racks at ~120kW each require approximately 7.7MW of continuous power delivery. At $0.10/kWh, the annual power cost is approximately $6.7M. This power requirement is comparable to a small industrial facility and typically requires dedicated utility connections and on-site transformer infrastructure.
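
The power and energy-cost figures come straight from the per-rack numbers quoted above:

```python
# Power draw and energy cost for a 64-rack GB300 NVL72 deployment,
# using the per-rack and $/kWh figures quoted above.
RACKS = 64
RACK_KW = 120
PRICE_PER_KWH = 0.10
HOURS_PER_YEAR = 8_760

total_kw = RACKS * RACK_KW                    # 7,680 kW ≈ 7.7 MW continuous
annual_kwh = total_kw * HOURS_PER_YEAR
annual_cost = annual_kwh * PRICE_PER_KWH      # ≈ $6.7M per year

print(f"Continuous draw: {total_kw / 1000:.2f} MW")
print(f"Annual energy:   {annual_kwh / 1e6:.1f} GWh")
print(f"Annual cost:     ${annual_cost / 1e6:.1f}M at ${PRICE_PER_KWH:.2f}/kWh")
```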

Cooling: 7.7MW of thermal load requires industrial-scale liquid cooling infrastructure. Typical implementations use central chiller plants with redundant chillers, pumps, and cooling towers. The cooling infrastructure for a 64-rack deployment costs $10-20M and requires 6-12 months to design and install.

Physical space: 64 racks in a high-density configuration occupy approximately 3,000-5,000 square feet of raised floor space, plus additional space for networking, cooling distribution, and maintenance access. Most purpose-built AI datacenters allocate 8,000-12,000 square feet per 64-rack deployment including all supporting infrastructure.

The Roadmap Beyond GB300

NVIDIA's GPU architecture roadmap extends beyond Blackwell Ultra. The Vera Rubin architecture, expected in 2026, will introduce HBM4 memory and a new GPU architecture. Organizations making long-term infrastructure decisions should consider that the GB300 NVL72 is the peak of the Blackwell generation, and the next major architecture transition will bring new performance levels, power requirements, and potentially new form factors.

For a two-year deployment plan (2025-2027), the GB300 NVL72 is the right choice. For a three-year plan (2025-2028), organizations should plan for Vera Rubin capacity in years two and three while deploying GB300 in year one. The liquid cooling infrastructure and datacenter power/space investments are forward-compatible with future NVIDIA architectures, which are expected to maintain or increase power density.

Use Case Deep Dives

Reasoning AI at Scale

The GB300 NVL72's 2x attention performance improvement makes it the purpose-built platform for reasoning AI workloads. Modern reasoning models generate thousands of tokens of chain-of-thought before producing a final answer. Each generated token must attend to all previous tokens in the context, making attention the dominant compute operation for long reasoning chains.

Consider a reasoning model that generates 8,192 tokens of chain-of-thought for a complex query. On the GB200 NVL72, the attention computation for the last token must process 8,192 key-value pairs. On the GB300 NVL72, the same operation completes in half the time thanks to the 2x attention improvement, reducing the total generation time for the full reasoning chain by approximately 30-40% (since attention accounts for 60-70% of total compute at long sequence lengths).

For inference providers serving reasoning models, this translates to either 30-40% more requests per GPU per hour or 30-40% lower latency per request. At scale, this is the difference between a reasoning model that responds in 10 seconds and one that responds in 6-7 seconds - a meaningful improvement for interactive applications.

Long-Context Document Processing

The GB300 NVL72's 20.7TB of unified memory enables processing of extremely long documents. A 70B model serving requests with 1M-token context windows requires approximately:

  • Model weights (FP4): ~35 GB per GPU (1 GPU)
  • KV-cache per request: ~160 GB (1M-token context; exact size depends on the model's attention configuration and cache precision)
  • Total per request: ~195 GB

On the GB300 NVL72, each GPU has 288GB, so a single GPU can hold the model weights and one 1M-token request's KV-cache. With 72 GPUs, the system can theoretically handle up to 72 concurrent 1M-token requests - though practical limits from compute and bandwidth constraints reduce this to approximately 20-30 concurrent 1M-token requests with acceptable latency.

On the GB200 NVL72, the same workload is more constrained: 192GB per GPU means the model weights plus one 1M-token KV-cache (195GB total) does not quite fit on a single GPU, requiring at least 2 GPUs per request. This halves the maximum concurrent requests and introduces NVLink communication overhead for each request.
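
A minimal per-GPU fit check makes the comparison concrete, using the ~35 GB weight and ~160 GB KV-cache figures from the text; the 10 GB reserve for activations and framework overhead is an illustrative assumption:

```python
import math

def requests_per_gpu(gpu_mem_gb: float, weight_gb: float, kv_per_request_gb: float,
                     reserve_gb: float = 10.0) -> int:
    """How many full-context requests fit alongside the weights on a single GPU.

    `reserve_gb` is an illustrative allowance for activations and framework overhead.
    """
    free_gb = gpu_mem_gb - weight_gb - reserve_gb
    return max(0, math.floor(free_gb / kv_per_request_gb))

# Figures from the text: ~35 GB of FP4 weights, ~160 GB of KV-cache per 1M-token request.
for name, mem_gb in (("GB200 (192 GB)", 192), ("GB300 (288 GB)", 288)):
    n = requests_per_gpu(mem_gb, 35, 160)
    print(f"{name}: {n} concurrent 1M-token request(s) per GPU")
# GB200 (192 GB): 0 concurrent 1M-token request(s) per GPU
# GB300 (288 GB): 1 concurrent 1M-token request(s) per GPU
```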

Frontier Model Training

For organizations training the next generation of frontier models, the GB300 NVL72's improvements compound at scale. A 64-rack GB300 NVL72 deployment (Microsoft Azure's reference configuration) provides:

  • 4,608 B300 GPUs
  • 92.1 exaflops of FP4 compute
  • 1.3 PB of unified HBM3e memory
  • ~7.7 MW power consumption

This is enough compute to train a multi-trillion-parameter model in weeks rather than months. The 50% more memory per GPU (versus GB200) means less aggressive activation checkpointing during training, which can recover 20-30% of compute that would otherwise be spent recomputing activations. Over a multi-month training run, this efficiency gain can save millions of dollars in compute costs.
