
NVIDIA GB300 NVL72 - Blackwell Ultra Rack

Complete specs, benchmarks, and analysis of the NVIDIA GB300 NVL72 - the Blackwell Ultra rack-scale system with 288GB HBM3e per GPU, 1.5x more FP4 compute, and 2x attention performance over GB200.


TL;DR

  • 288GB HBM3e per GPU using 12-Hi stacks - a 50% memory increase over the GB200 NVL72's 192GB per GPU, totaling 20.7TB per rack
  • 1.5x more dense FP4 compute and 2x higher attention performance compared to first-generation Blackwell
  • 20,480 CUDA cores and 640 fifth-generation Tensor Cores per B300 GPU with support for FP8, FP6, and NVFP4 precision
  • 1,400W TDP per GPU with the same fully liquid-cooled rack design and 72-GPU NVLink domain as the GB200
  • Expected H2 2025 availability at an estimated $3-4 million per rack - Microsoft Azure confirmed as first large-scale deployer

Overview

The NVIDIA GB300 NVL72 is the Blackwell Ultra generation of NVIDIA's rack-scale AI supercomputer. Announced at GTC 2025, the GB300 upgrades every dimension of the GB200 NVL72: 50% more memory per GPU (288GB vs 192GB), 1.5x more dense FP4 compute, 2x higher attention performance, and more CUDA cores (20,480 vs 18,432). The rack-level architecture stays the same - 72 GPUs and 36 Grace CPUs in a unified NVLink domain - but the per-GPU improvements compound across all 72 GPUs to deliver a significant generational leap.

The B300 GPU at the heart of the GB300 NVL72 pushes the Blackwell architecture to its limits. Each B300 features 160 streaming multiprocessors with 20,480 CUDA cores and 640 fifth-generation Tensor Cores, packed into a dual-die package with 208 billion transistors. The key innovation is the memory: 288GB of HBM3e using 12-Hi stacks (compared to 8-Hi stacks in the B200), delivering the same 8 TB/s bandwidth in a 50% larger capacity envelope. At the rack level, this gives the GB300 NVL72 20.7TB of unified HBM3e memory - enough to hold the weights and KV-cache for the largest models in production today.

The performance improvements are specifically targeted at reasoning and agentic AI workloads. The 2x attention performance improvement - achieved through Tensor Core architecture enhancements and dedicated attention acceleration hardware - directly addresses the compute pattern that dominates transformer inference at long context lengths. As AI models move from simple question-answering to multi-step reasoning chains with extended context, attention compute becomes the bottleneck. The GB300 is NVIDIA's answer to that shift.

Microsoft Azure has confirmed the first large-scale deployment, with a 4,608-GPU cluster (64 racks) capable of 92.1 exaflops of FP4 compute. This cluster will power OpenAI's next-generation workloads, underscoring the GB300's position as the infrastructure of choice for frontier AI development.

Key Specifications

| Specification | Details |
|---|---|
| Manufacturer | NVIDIA |
| System Type | Rack-scale AI supercomputer |
| Architecture | Blackwell Ultra (B300 GPU + Grace CPU) |
| Process Node | TSMC 4NP (GPU), TSMC 4N (Grace CPU) |
| GPUs per Rack | 72 (B300) |
| CPUs per Rack | 36 (Grace ARM) |
| Superchips per Rack | 36 (1 Grace + 2 B300 each) |
| GPU Transistors | 208 billion per GPU (dual-die) |
| CUDA Cores per GPU | 20,480 |
| Tensor Cores per GPU | 640 (5th generation) |
| Streaming Multiprocessors per GPU | 160 |
| GPU Boost Clock | 2.6 GHz |
| GPU Memory per GPU | 288 GB HBM3e (12-Hi stacks) |
| Total GPU Memory | 20,736 GB (20.7 TB) |
| Memory Bandwidth per GPU | 8,000 GB/s (8 TB/s) |
| Memory Controller | 16 x 512-bit |
| Total Memory Bandwidth | 576 TB/s |
| FP4 Performance per GPU | 15 PFLOPS (dense) |
| FP4 Performance per Rack | ~1,080 PFLOPS (dense), ~2,160 PFLOPS (sparse) |
| FP8 Performance per GPU | ~10 PFLOPS (estimated dense) |
| Attention Performance | 2x vs GB200 NVL72 |
| Tensor Memory (TMEM) per SM | 256 KB |
| NVLink per GPU | 5th generation, 1,800 GB/s |
| NVLink Domain | 72 GPUs (full rack) |
| NVLink Bisection Bandwidth | 130 TB/s |
| Network I/O per GPU | ConnectX-8 SuperNIC, 800 Gb/s |
| CPU Memory per CPU | 480 GB LPDDR5X |
| Total CPU Memory | 17,280 GB (17.3 TB) |
| GPU TDP | 1,400W |
| Total Rack Power | ~120 kW |
| Cooling | Liquid cooling (required) |
| PCIe | Gen 6 |
| Release Date | H2 2025 (expected) |

The B300's SM architecture deserves specific attention. With 160 SMs (versus 148 in the B200), each containing 128 CUDA cores and 4 fifth-generation Tensor Cores, the B300 provides 8% more SMs and 11% more CUDA/Tensor cores. But the per-core performance improvements - particularly in Tensor Core throughput and the addition of 256KB of Tensor Memory (TMEM) per SM - deliver a disproportionately larger performance gain. The 50% more FP4 compute (15 PFLOPS vs 10 PFLOPS per GPU) comes from the combination of more SMs, higher per-SM throughput, and the new NVFP4 precision format.

The 1,400W TDP per B300 GPU is 40% higher than the B200's 1,000W. At 72 GPUs, the raw GPU power draw alone would be 100.8kW - the vast majority of the rack's ~120kW total power budget. The remaining ~19kW covers 36 Grace CPUs, NVLink switches, ConnectX-8 SuperNICs, power conversion, and cooling pumps. Despite the higher per-GPU TDP, the total rack power remains approximately the same as the GB200 NVL72 due to improved power management and efficiency optimizations in the system design.
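
As a quick sanity check on those numbers, here is the arithmetic in Python, using only the figures quoted above (the split of the non-GPU remainder is inferred from the rack total, not an official breakdown):

```python
# Back-of-the-envelope rack power budget, using only the figures quoted above.
GPU_TDP_W = 1_400
GPUS_PER_RACK = 72
RACK_POWER_W = 120_000          # ~120 kW quoted rack total

gpu_power_w = GPU_TDP_W * GPUS_PER_RACK        # 100,800 W = 100.8 kW
remainder_w = RACK_POWER_W - gpu_power_w       # ~19 kW for CPUs, switches, NICs, pumps

print(f"GPU power draw:    {gpu_power_w / 1000:.1f} kW")
print(f"Everything else:   {remainder_w / 1000:.1f} kW")
print(f"GPU share of rack: {gpu_power_w / RACK_POWER_W:.0%}")
```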

Performance Benchmarks

| Metric | GB200 NVL72 | GB300 NVL72 | Improvement |
|---|---|---|---|
| GPU Memory per GPU | 192 GB | 288 GB | +50% |
| Total Rack Memory | 13.5 TB | 20.7 TB | +53% |
| FP4 Dense (per GPU) | 10 PFLOPS | 15 PFLOPS | +50% |
| FP4 Dense (per rack) | 720 PFLOPS | ~1,080 PFLOPS | +50% |
| Attention Performance | 1x (baseline) | 2x | +100% |
| CUDA Cores per GPU | 18,432 | 20,480 | +11% |
| Tensor Cores per GPU | 576 | 640 | +11% |
| SMs per GPU | 148 | 160 | +8% |
| GPU TDP | 1,000W | 1,400W | +40% |
| Memory Bandwidth per GPU | 8 TB/s | 8 TB/s | Same |
| NVLink Bandwidth per GPU | 1,800 GB/s | 1,800 GB/s | Same |

Memory Capacity Analysis

The memory increase from 192GB to 288GB per GPU - and from 13.5TB to 20.7TB per rack - crosses several important deployment thresholds.

The 50% more memory per GPU means the GB300 NVL72 can handle larger models, longer context windows, or more concurrent inference requests. For a quantized trillion-parameter model at FP4 (approximately 500GB for weights), the GB200 NVL72 requires at least 3 GPUs for weight storage. The GB300 NVL72 needs only 2, freeing the third GPU's memory for KV-cache and activations. Across the full rack, this improved packing efficiency compounds.

For inference with long context windows, KV-cache memory consumption is the binding constraint. A 70B model serving a 1-million-token context can require on the order of 160GB of KV-cache per request (the exact figure depends on the model's layer count, KV-head configuration, and cache precision). On a 192GB GPU (GB200), a single long-context request would consume most of the available memory after model weights. On a 288GB GPU (GB300), there is substantially more headroom for concurrent requests or even longer contexts.

| Model Size | FP4 Weight Size | KV-Cache (128K ctx) | Fits per GPU (GB200) | Fits per GPU (GB300) |
|---|---|---|---|---|
| 7B | ~3.5 GB | ~16 GB | Yes (many concurrent) | Yes (many concurrent) |
| 70B | ~35 GB | ~160 GB | Yes (limited cache) | Yes (generous cache) |
| 400B | ~200 GB | ~400 GB | Needs 2+ GPUs | Needs 1-2 GPUs |
| 1T | ~500 GB | ~1 TB | Needs 3+ GPUs | Needs 2+ GPUs |
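
These estimates can be reproduced with a couple of simple formulas. The sketch below is a rough sizing helper rather than a serving-framework calculation: the layer count, KV-head count, and head dimension are illustrative assumptions for a Llama-style 70B model, and activation memory and quantization scale-factor overhead are ignored.

```python
import math

def fp4_weight_gb(params_billion: float) -> float:
    """Approximate FP4 weight footprint in GB (0.5 bytes per parameter)."""
    return params_billion * 1e9 * 0.5 / 1e9

def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 1) -> float:
    """KV-cache size in GB: keys + values for every layer and cached token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token_bytes / 1e9

def gpus_for_weights(params_billion: float, gpu_mem_gb: int) -> int:
    """Minimum GPUs needed just to hold the FP4 weights (no cache, no activations)."""
    return math.ceil(fp4_weight_gb(params_billion) / gpu_mem_gb)

# A 1T-parameter model at FP4 is ~500 GB of weights:
print(gpus_for_weights(1_000, 192))   # 3 GPUs on GB200 (192 GB each)
print(gpus_for_weights(1_000, 288))   # 2 GPUs on GB300 (288 GB each)

# Illustrative 70B shape (80 layers, 8 KV heads, head_dim 128) with an 8-bit cache;
# a 16-bit cache doubles this figure.
print(f"{kv_cache_gb(1_000_000, 80, 8, 128):.0f} GB of KV-cache for a 1M-token context")
```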

Attention Performance Deep Dive

The 2x attention performance improvement is the GB300 NVL72's most architecturally significant advancement. Attention compute in transformers grows quadratically with sequence length during prefill, and during decode each newly generated token must attend to every token already in the context, so the per-token cost grows linearly with context length (FlashAttention-style kernels reduce memory traffic but do not change this scaling). For reasoning workloads that generate thousands of tokens of chain-of-thought, attention therefore comes to dominate the compute budget.
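
To make the scaling concrete, the sketch below estimates the attention FLOPs needed to generate a single new token at different context lengths. The model shape (80 layers, 64 query heads, head dimension 128) is an illustrative Llama-style 70B assumption, and the constant factors are rough.

```python
def decode_attention_flops(context_len: int, layers: int, heads: int, head_dim: int) -> float:
    """Approximate attention FLOPs to generate one new token at a given context length.

    Per layer: Q @ K^T over the full context plus the attention-weighted sum over V,
    each roughly 2 * context_len * heads * head_dim multiply-accumulates.
    """
    return layers * 2 * (2 * context_len * heads * head_dim)

# Illustrative Llama-style 70B shape: 80 layers, 64 query heads, head_dim 128.
for ctx in (4_096, 131_072, 1_048_576):
    gflops = decode_attention_flops(ctx, 80, 64, 128) / 1e9
    print(f"{ctx:>9,} tokens of context -> ~{gflops:,.0f} GFLOPs of attention per new token")
```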

The B300 addresses this through two hardware mechanisms:

First, each SM includes 256KB of Tensor Memory (TMEM) - a dedicated memory tier positioned between registers and shared memory in the SM's memory hierarchy. TMEM is optimized for the data access patterns of attention computation, providing low-latency, high-bandwidth access to attention intermediate values (QK^T products, softmax results, attention output accumulation). By keeping these intermediates in TMEM rather than shared memory, the attention pipeline avoids the latency of shared memory bank conflicts and the bandwidth limitations of the register-to-shared-memory path.

Second, the fifth-generation Tensor Cores include dedicated attention instructions that fuse the multi-step attention computation (Q*K^T, scaling, masking, softmax, V multiplication) into fewer hardware operations. This fusion reduces instruction dispatch overhead and improves Tensor Core utilization during attention computation, which historically underutilizes Tensor Cores due to the non-GEMM operations (softmax, masking) interspersed with matrix multiplications.

The combined effect is 2x attention throughput per GPU, which translates to 2x throughput for attention-dominated workloads. For a 70B model generating 4,096 tokens of chain-of-thought reasoning, attention computation accounts for approximately 60-70% of total inference time. A 2x improvement in attention directly translates to a 1.4-1.5x improvement in end-to-end token generation speed for reasoning workloads.
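
The end-to-end figure follows from an Amdahl-style estimate, treating the 60-70% attention share quoted above as given:

```python
def end_to_end_speedup(attention_fraction: float, attention_speedup: float = 2.0) -> float:
    """Only the attention share of runtime gets faster; everything else is unchanged."""
    new_time = (1 - attention_fraction) + attention_fraction / attention_speedup
    return 1 / new_time

for frac in (0.6, 0.7):
    print(f"attention = {frac:.0%} of runtime -> {end_to_end_speedup(frac):.2f}x end-to-end")
# attention = 60% of runtime -> 1.43x end-to-end
# attention = 70% of runtime -> 1.54x end-to-end
```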

Key Capabilities

12-Hi HBM3e Stacks. The B300 GPU achieves its 288GB capacity by using 12-Hi HBM3e stacks - each stack containing 12 layers of DRAM dies bonded vertically, compared to the 8-Hi stacks used in the B200. This is a packaging innovation rather than a silicon change: the same HBM3e DRAM technology is used, but with 50% more layers per stack. The 16x512-bit memory controller design maintains the same 8 TB/s bandwidth at the higher capacity, meaning the B300 does not trade bandwidth for capacity.

The 12-Hi stacking is a significant manufacturing achievement. Each additional DRAM layer increases the mechanical stress on the Through-Silicon Vias (TSVs) that carry signals between layers, the thermal resistance of the stack (making cooling more challenging), and the overall yield risk. SK Hynix, Samsung, and Micron have each invested heavily in developing 12-Hi HBM3e, and the B300 is the first GPU to deploy it at scale. The fact that the bandwidth per stack is maintained at 12-Hi (implying the same or higher per-layer signaling rate) speaks to the maturity of the technology.

At the rack level, the 72 GPUs provide 20.7TB of unified memory at an aggregate 576 TB/s - the largest and fastest memory pool available in any commercially shipping AI system. Combined with the 72-GPU NVLink domain, this memory pool is accessible to any GPU at NVLink speeds, enabling models and datasets that span the full 20.7TB to be processed without InfiniBand-limited cross-rack communication.

NVFP4 Precision Format. While the B200 introduced FP4 support, the B300 adds NVIDIA's proprietary NVFP4 format - a 4-bit floating-point representation optimized for inference accuracy. NVFP4 pairs 4-bit elements with fine-grained block-level scale factors (on top of a per-tensor scale), which preserves accuracy better than plain, coarsely scaled FP4 at the same bitwidth.

The NVFP4 format is specifically designed for the weight distributions observed in large language models. A 4-bit float (E2M1 - 2-bit exponent, 1-bit mantissa) offers only eight representable magnitudes per sign, which makes it hard for a single coarse scale factor to capture distributions that cluster around zero yet include important outlier values. NVFP4's fine-grained scaling lets each small block of weights use the full 4-bit range, reducing quantization error for the most common weight patterns.

Combined with the second-generation Transformer Engine's automatic precision management, NVFP4 on the B300 enables 4-bit inference on a broader range of models with acceptable accuracy degradation. At 15 PFLOPS dense NVFP4 per GPU, the B300 is built for the era where 4-bit quantized inference becomes the production default.
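
As an illustration of why block-level scaling helps, here is a simplified quantizer in the spirit of NVFP4: E2M1-style 4-bit magnitudes with one shared scale per small block. The block size of 16 and the plain float scale are assumptions for illustration; the production format's exact scale encoding is NVIDIA-specific and is handled by NVIDIA's quantization tooling rather than user code.

```python
import numpy as np

# The eight non-negative magnitudes representable by an E2M1 4-bit float
# (sign + 2-bit exponent + 1-bit mantissa).
E2M1_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_fp4(block: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize one block of weights to signed E2M1 values plus one shared scale."""
    max_abs = float(np.abs(block).max())
    scale = max_abs / E2M1_MAGNITUDES[-1] if max_abs > 0 else 1.0
    scaled = block / scale
    # Snap each element to the nearest representable magnitude, keeping its sign.
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_MAGNITUDES).argmin(axis=1)
    return np.sign(scaled) * E2M1_MAGNITUDES[idx], scale

def dequantize(quantized: np.ndarray, scale: float) -> np.ndarray:
    return quantized * scale

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.02, size=16)   # assumed block size of 16 for illustration
q, s = quantize_block_fp4(weights)
print("max abs reconstruction error:", float(np.abs(weights - dequantize(q, s)).max()))
```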

PCIe Gen 6 Interface. The B300 is the first NVIDIA datacenter GPU to support PCIe Gen 6, doubling the host link bandwidth from ~128 GB/s bidirectional (Gen 5 x16) to ~256 GB/s bidirectional (Gen 6 x16). While this matters less in the GB300 NVL72 context - where the Grace CPU connects via NVLink-C2C at 900 GB/s - it is significant for standalone B300 GPUs deployed in x86 server platforms.

PCIe Gen 6 also enables faster NVMe storage access for data loading and checkpointing, which can be a bottleneck in large-scale training runs. At 256 GB/s per GPU, a B300 in a PCIe server can load training data from high-speed NVMe arrays at rates that keep the GPU pipeline fed without data staging delays.

NVLink 5 with Enhanced Collective Operations. The GB300 NVL72 uses the same NVLink 5 interconnect as the GB200 NVL72 (1,800 GB/s per GPU, 130 TB/s aggregate), but with software-level enhancements to collective operations. The NVLink Switch firmware in the GB300 includes optimized all-reduce, all-gather, and reduce-scatter implementations that take advantage of the B300's higher per-GPU compute throughput. These optimizations reduce the software overhead of collective communication operations, improving the ratio of useful compute to communication overhead in distributed training.

Pricing and Availability

The GB300 NVL72 is expected to ship in H2 2025. Pricing is estimated at $3-4 million per rack, representing a 30-50% premium over the GB200 NVL72's $2-3 million price point. The premium reflects the 50% more memory per GPU, higher compute density, and updated components.

| Configuration | Estimated Price (2026) |
|---|---|
| GB300 NVL72 (single rack) | $3,000,000 - $4,000,000 (estimated) |
| Per-GPU equivalent cost | ~$42,000 - $56,000 |
| Liquid cooling infrastructure | $200,000 - $500,000 (new installation) |
| Cloud rental (per GPU hour, estimated) | $6.00 - $10.00 |

Microsoft Azure has confirmed the first large-scale deployment: a 4,608-GPU cluster consisting of 64 GB300 NVL72 racks. This cluster delivers 92.1 exaflops of FP4 compute and will be used for OpenAI workloads. Azure's early deployment suggests that cloud availability for GB300 instances could begin in late 2025 or early 2026, though on-premises availability may lag cloud deployments.

The GB300 NVL72 shares the same physical form factor, liquid cooling requirements, and datacenter infrastructure needs as the GB200 NVL72. Organizations that have already deployed GB200 NVL72 infrastructure should be able to deploy GB300 racks into the same facilities with minimal modifications - the cooling loops, power distribution, and physical rack footprint are compatible.

Buy-Now-or-Wait Analysis

For buyers currently evaluating GB200 NVL72 orders, the decision to wait for GB300 depends on the workload:

| Scenario | Recommendation | Reasoning |
|---|---|---|
| Need compute now for training | Deploy GB200 NVL72 | Shipping today; GB300 is 6+ months out |
| Primary need is large-model inference | Consider waiting for GB300 | 50% more memory per GPU significantly improves serving capacity |
| Reasoning/agentic workloads | Consider waiting for GB300 | 2x attention performance directly benefits chain-of-thought |
| Budget-constrained | Deploy GB200 NVL72 | 30-50% lower cost per rack |
| Building new datacenter (6+ month lead) | Target GB300 NVL72 | Infrastructure will be ready when GB300 ships |

The infrastructure compatibility between GB200 and GB300 is a significant advantage. Organizations that deploy GB200 NVL72 racks now can plan to add GB300 racks later using the same cooling, power, and networking infrastructure. The two generations can coexist in the same datacenter, with workloads allocated to the appropriate hardware generation.

Total Cost of Ownership

| Cost Component | GB200 NVL72 | GB300 NVL72 | Delta |
|---|---|---|---|
| Rack hardware (one-time) | $2M - $3M | $3M - $4M | +$1M |
| Liquid cooling (one-time) | $200K - $500K | $200K - $500K | Same |
| Annual power (~120kW) | ~$105K | ~$105K | Same |
| Annual maintenance | ~$200K - $400K | ~$250K - $500K | +$50K - $100K |
| 3-year TCO | $3.2M - $5.2M | $4.3M - $6.5M | +$1.1M - $1.3M |

The GB300 NVL72's 30-50% cost premium buys 50% more memory and 50% more FP4 compute per rack. On a dollar-per-PFLOPS basis, the GB300 is approximately the same cost as the GB200 - the extra money buys proportionally more capability. The decision between the two comes down to whether the workload benefits from the additional memory and attention performance, or whether raw FP4 compute at the lowest cost is the priority.
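
Using the estimated rack prices and dense FP4 figures quoted in this article, the dollar-per-PFLOPS parity works out as follows:

```python
# Dollar-per-dense-FP4-PFLOPS at the estimated rack prices quoted in this article.
racks = {
    "GB200 NVL72": (2_000_000, 3_000_000, 720),    # (low price, high price, dense FP4 PFLOPS)
    "GB300 NVL72": (3_000_000, 4_000_000, 1_080),
}
for name, (low, high, pflops) in racks.items():
    print(f"{name}: ${low / pflops:,.0f} - ${high / pflops:,.0f} per dense FP4 PFLOPS")
# GB200 NVL72: $2,778 - $4,167 per dense FP4 PFLOPS
# GB300 NVL72: $2,778 - $3,704 per dense FP4 PFLOPS
```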

Cloud Provider Availability

Microsoft Azure has confirmed the first large-scale GB300 NVL72 deployment, and other cloud providers are expected to follow. Cloud availability is expected to ramp in late 2025 and throughout 2026.

| Cloud Provider | Status | Expected Availability |
|---|---|---|
| Microsoft Azure | Confirmed 4,608-GPU deployment | Late 2025 / Early 2026 |
| Google Cloud | Expected | H1 2026 (estimated) |
| AWS | Expected | H1 2026 (estimated) |
| Oracle Cloud | Expected | 2026 (estimated) |

For most organizations, cloud access will be the practical path to GB300 NVL72 compute. The capital cost ($3-4M per rack), liquid cooling infrastructure requirements, and deployment complexity make on-premises deployment viable only for organizations operating at hyperscaler or near-hyperscaler scale.

Cloud pricing for GB300 NVL72 instances is expected to be 30-50% higher than GB200 NVL72 instances per GPU-hour, reflecting the 50% more memory and compute per GPU. Whether this premium is justified depends on the workload: for memory-bound inference with long contexts, the GB300's extra memory delivers proportionally more throughput, making the cost-per-token comparable to or lower than GB200. For compute-bound training workloads, the GB200 NVL72 at its lower price point may offer better cost efficiency.

Export Control Considerations

The GB300 NVL72 is subject to the same US export controls as other Blackwell products. Each rack contains 72 B300 GPUs with cumulative compute far exceeding any export control threshold. Sovereign AI deployments in allied nations can procure GB300 NVL72 through standard channels, but availability may be restricted for certain markets and end-users.

The GB300 NVL72's Strategic Significance

The GB300 NVL72 represents the peak of NVIDIA's Blackwell product generation. It combines the most capable Blackwell GPU (B300 with 288GB HBM3e and enhanced attention performance) with the rack-scale NVLink architecture that defined the GB200 NVL72. For organizations at the frontier of AI development, it is the most capable commercially available AI system.

The GB300's focus on reasoning performance - through the 2x attention improvement, TMEM, and larger memory for KV-caches - reflects NVIDIA's bet on where AI is heading. The shift from "generate plausible text" (the GPT-3 era) to "reason through complex problems" (the current era of o1, DeepSeek R1, and chain-of-thought models) changes the compute profile from pure GEMM throughput to attention-dominated workloads. The GB300 is the first GPU system designed specifically for this shift.

Microsoft's decision to deploy 4,608 B300 GPUs for OpenAI's workloads validates this bet. OpenAI's reasoning models (GPT-5.2 Pro mode, the expected GPT-6 reasoning capabilities) are exactly the workload type that benefits most from the GB300's attention performance improvement. That the leading frontier AI lab's flagship deployment runs on GB300 rather than GB200 is a strong endorsement of the product's positioning.

Strengths

  • 288GB HBM3e per GPU provides the largest per-GPU memory of any shipping NVIDIA accelerator - critical for large KV-caches and model weights
  • 2x attention performance improvement directly addresses the compute bottleneck for reasoning and agentic AI workloads
  • 50% more dense FP4 compute per GPU (15 PFLOPS vs 10 PFLOPS) with improved NVFP4 precision for better accuracy
  • Same rack-level architecture as GB200 NVL72 - compatible infrastructure reduces migration cost for existing Blackwell deployers
  • 20.7TB of unified HBM3e per rack is the largest GPU memory pool in any commercially available system
  • 256KB of Tensor Memory per SM provides dedicated low-latency storage for attention intermediates
  • PCIe Gen 6 support future-proofs the platform for next-generation host and storage interfaces
  • Microsoft Azure's 4,608-GPU deployment validates the platform for frontier AI workloads at hyperscale
  • Infrastructure compatibility with GB200 NVL72 enables mixed-generation datacenter deployments

Weaknesses

  • Estimated $3-4 million per rack is 30-50% more expensive than the GB200 NVL72 for 50% more memory and compute
  • 1,400W TDP per GPU is the highest of any NVIDIA datacenter GPU - total power density pushes cooling infrastructure limits
  • Not yet broadly available as of early 2026 - organizations with immediate compute needs should deploy GB200 NVL72 now
  • Memory bandwidth per GPU remains at 8 TB/s - the same as the B200 despite the 50% memory capacity increase
  • NVLink bandwidth per GPU remains at 1,800 GB/s - no improvement in inter-GPU communication speed
  • Liquid cooling requirement and 120kW+ rack power draw restrict deployment to purpose-built datacenters
  • Limited availability data - pricing, exact ship dates, and allocation quantities are still based on estimates and early partner announcements
  • Inter-rack communication still requires InfiniBand - NVLink advantage is contained within a single rack
  • NVFP4 precision format is NVIDIA-proprietary and may not be supported in all inference frameworks at launch

Who Should Deploy the GB300 NVL72

The GB300 NVL72 targets the same class of buyer as the GB200 NVL72 - hyperscalers, frontier AI labs, and sovereign AI programs - but with specific advantages for reasoning and long-context workloads.

Good Fit

Reasoning and agentic AI workloads. The 2x attention performance improvement directly addresses the compute bottleneck for chain-of-thought reasoning, tree-of-thought exploration, and multi-step agentic planning. If your primary workload generates long sequences of reasoning tokens, the GB300 NVL72 provides a proportionally larger advantage over the GB200 than the raw FP4 numbers suggest.

Inference on the largest models with long contexts. The 288GB per GPU provides 50% more KV-cache headroom than the GB200's 192GB. For applications serving models at 1M+ token context windows, the KV-cache memory requirement is the binding constraint, and the GB300's extra memory directly translates to more concurrent long-context requests per GPU.

Organizations planning H2 2025 or later deployments. If your datacenter infrastructure is still under construction (a common situation for new AI factory builds), the GB300 NVL72's expected H2 2025 availability aligns well with typical 12-18 month datacenter construction timelines. There is no reason to deploy GB200 NVL72 racks into a facility that will not be ready until the GB300 is shipping.

Mixed-generation datacenter expansion. Organizations that already have GB200 NVL72 racks deployed can add GB300 racks into the same facility using the same cooling, power, and networking infrastructure. The two generations interoperate at the multi-rack level via InfiniBand, allowing workloads to be allocated to the appropriate hardware generation based on their compute and memory requirements.

Poor Fit

Organizations that need compute today. GB300 NVL72 racks are not yet broadly available as of early 2026. If you need large-scale compute immediately, deploy GB200 NVL72 now and plan for GB300 expansion later.

Workloads that are compute-bound rather than memory-bound. If your workload fully utilizes the GB200 NVL72's compute and is not constrained by memory capacity, the GB300's 50% more memory does not help. The 50% more FP4 compute is meaningful, but the GB200 NVL72 at its lower price point may deliver better cost efficiency for compute-bound workloads.

Organizations without long-context or reasoning requirements. The GB300's 2x attention improvement is specifically valuable for reasoning and long-context workloads. If your primary workload is short-context inference or training, this advantage is underutilized, and the GB200 NVL72 at 30-50% lower cost per rack is the better choice.

Deployment Considerations

Infrastructure Compatibility with GB200 NVL72

The GB300 NVL72 is designed for infrastructure compatibility with the GB200 NVL72. The key compatibility dimensions:

  • Rack form factor: Same physical footprint and rack design
  • Liquid cooling: Same coolant type, temperature range, and flow requirements
  • Power: Same ~120kW total rack power draw
  • Network: Same ConnectX-8 SuperNIC with 800 Gb/s per GPU
  • Floor loading: Similar weight (~1.36 metric tons)
  • Management: Same DCGM and BMC management interfaces

This compatibility means that organizations deploying GB200 NVL72 racks today can plan for GB300 NVL72 expansion with high confidence that the same infrastructure will support both generations. The cooling loops, power distribution, network fabric, and physical rack positions are interchangeable.

Software Considerations

The B300 GPU introduces new hardware capabilities (NVFP4 precision, Tensor Memory, enhanced attention instructions, PCIe Gen 6) that require updated software to exploit. The CUDA toolkit, TensorRT-LLM, and other NVIDIA software will need Blackwell Ultra-specific code paths to deliver the full 2x attention improvement and NVFP4 performance gains.

At launch, the B300 will support the same CUDA compute capability as the B200 for backward compatibility, meaning existing Blackwell-compiled applications will run without modification. However, the Blackwell Ultra-specific features will only be accessible through updated libraries and frameworks. Organizations planning GB300 NVL72 deployments should budget for software testing and optimization time, particularly for:

  • NVFP4 quantization workflows (model conversion, accuracy validation)
  • Attention kernel optimization using TMEM (may require custom kernel development for proprietary models)
  • TensorRT-LLM Blackwell Ultra profiles (expected in upcoming TensorRT-LLM releases)

Multi-Rack Deployment at Scale

Microsoft Azure's 4,608-GPU cluster (64 GB300 NVL72 racks) provides a reference architecture for large-scale GB300 deployments. This cluster delivers 92.1 exaflops of FP4 compute across 64 racks connected by a high-bandwidth InfiniBand or Ethernet fabric.

At this scale, the deployment considerations are primarily:

Network: 4,608 ConnectX-8 SuperNICs at 800 Gb/s each require approximately 128-192 leaf switches and a multi-tier spine network. The total network fabric cost for a 64-rack deployment is estimated at $5-15M depending on the topology and switch vendor.

Power: 64 racks at ~120kW each require approximately 7.7MW of continuous power delivery. At $0.10/kWh, the annual power cost is approximately $6.7M. This power requirement is comparable to a small industrial facility and typically requires dedicated utility connections and on-site transformer infrastructure.
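
The power and energy-cost figures come straight from the per-rack numbers quoted above:

```python
# Power draw and energy cost for a 64-rack GB300 NVL72 deployment,
# using the per-rack and $/kWh figures quoted above.
RACKS = 64
RACK_KW = 120
PRICE_PER_KWH = 0.10
HOURS_PER_YEAR = 8_760

total_kw = RACKS * RACK_KW                    # 7,680 kW ≈ 7.7 MW continuous
annual_kwh = total_kw * HOURS_PER_YEAR
annual_cost = annual_kwh * PRICE_PER_KWH      # ≈ $6.7M per year

print(f"Continuous draw: {total_kw / 1000:.2f} MW")
print(f"Annual energy:   {annual_kwh / 1e6:.1f} GWh")
print(f"Annual cost:     ${annual_cost / 1e6:.1f}M at ${PRICE_PER_KWH:.2f}/kWh")
```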

Cooling: 7.7MW of thermal load requires industrial-scale liquid cooling infrastructure. Typical implementations use central chiller plants with redundant chillers, pumps, and cooling towers. The cooling infrastructure for a 64-rack deployment costs $10-20M and requires 6-12 months to design and install.

Physical space: 64 racks in a high-density configuration occupy approximately 3,000-5,000 square feet of raised floor space, plus additional space for networking, cooling distribution, and maintenance access. Most purpose-built AI datacenters allocate 8,000-12,000 square feet per 64-rack deployment including all supporting infrastructure.

The Roadmap Beyond GB300

NVIDIA's GPU architecture roadmap extends beyond Blackwell Ultra. The Vera Rubin architecture, expected in 2026, will introduce HBM4 memory and a new GPU architecture. Organizations making long-term infrastructure decisions should consider that the GB300 NVL72 is the peak of the Blackwell generation, and the next major architecture transition will bring new performance levels, power requirements, and potentially new form factors.

For a two-year deployment plan (2025-2027), the GB300 NVL72 is the right choice. For a three-year plan (2025-2028), organizations should plan for Vera Rubin capacity in years two and three while deploying GB300 in year one. The liquid cooling infrastructure and datacenter power/space investments are forward-compatible with future NVIDIA architectures, which are expected to maintain or increase power density.

Use Case Deep Dives

Reasoning AI at Scale

The GB300 NVL72's 2x attention performance improvement makes it the purpose-built platform for reasoning AI workloads. Modern reasoning models generate thousands of tokens of chain-of-thought before producing a final answer. Each generated token must attend to all previous tokens in the context, making attention the dominant compute operation for long reasoning chains.

Consider a reasoning model that generates 8,192 tokens of chain-of-thought for a complex query. On the GB200 NVL72, the attention computation for the last token must process 8,192 key-value pairs. On the GB300 NVL72, the same operation completes in half the time thanks to the 2x attention improvement, reducing the total generation time for the full reasoning chain by approximately 30-40% (since attention accounts for 60-70% of total compute at long sequence lengths).

For inference providers serving reasoning models, this translates to either 30-40% more requests per GPU per hour or 30-40% lower latency per request. At scale, this is the difference between a reasoning model that responds in 10 seconds and one that responds in 6-7 seconds - a meaningful improvement for interactive applications.

Long-Context Document Processing

The GB300 NVL72's 20.7TB of unified memory enables processing of extremely long documents. A 70B model serving requests with 1M-token context windows requires approximately:

  • Model weights (FP4): ~35 GB per GPU (1 GPU)
  • KV-cache per request: ~160 GB (1M-token context; exact size depends on the model's attention configuration and cache precision)
  • Total per request: ~195 GB

On the GB300 NVL72, each GPU has 288GB, so a single GPU can hold the model weights and one 1M-token request's KV-cache. With 72 GPUs, the system can theoretically handle up to 72 concurrent 1M-token requests - though practical limits from compute and bandwidth constraints reduce this to approximately 20-30 concurrent 1M-token requests with acceptable latency.

On the GB200 NVL72, the same workload is more constrained: 192GB per GPU means the model weights plus one 1M-token KV-cache (195GB total) does not quite fit on a single GPU, requiring at least 2 GPUs per request. This halves the maximum concurrent requests and introduces NVLink communication overhead for each request.
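
A minimal per-GPU fit check makes the comparison concrete, using the ~35 GB weight and ~160 GB KV-cache figures from the text; the 10 GB reserve for activations and framework overhead is an illustrative assumption:

```python
import math

def requests_per_gpu(gpu_mem_gb: float, weight_gb: float, kv_per_request_gb: float,
                     reserve_gb: float = 10.0) -> int:
    """How many full-context requests fit alongside the weights on a single GPU.

    `reserve_gb` is an illustrative allowance for activations and framework overhead.
    """
    free_gb = gpu_mem_gb - weight_gb - reserve_gb
    return max(0, math.floor(free_gb / kv_per_request_gb))

# Figures from the text: ~35 GB of FP4 weights, ~160 GB of KV-cache per 1M-token request.
for name, mem_gb in (("GB200 (192 GB)", 192), ("GB300 (288 GB)", 288)):
    n = requests_per_gpu(mem_gb, 35, 160)
    print(f"{name}: {n} concurrent 1M-token request(s) per GPU")
# GB200 (192 GB): 0 concurrent 1M-token request(s) per GPU
# GB300 (288 GB): 1 concurrent 1M-token request(s) per GPU
```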

Frontier Model Training

For organizations training the next generation of frontier models, the GB300 NVL72's improvements compound at scale. A 64-rack GB300 NVL72 deployment (Microsoft Azure's reference configuration) provides:

  • 4,608 B300 GPUs
  • 92.1 exaflops of FP4 compute
  • 1.3 PB of unified HBM3e memory
  • ~7.7 MW power consumption

This is enough compute to train a multi-trillion-parameter model in weeks rather than months. The 50% more memory per GPU (versus GB200) means less aggressive activation checkpointing during training, which can recover 20-30% of compute that would otherwise be spent recomputing activations. Over a multi-month training run, this efficiency gain can save millions of dollars in compute costs.
