NVIDIA H100 SXM - The AI Training Benchmark
Complete specs, benchmarks, and analysis of the NVIDIA H100 SXM - the Hopper-architecture GPU that defined the standard for AI training and inference performance.

TL;DR
- 80GB HBM3 with 3,350 GB/s bandwidth and 3,958 TFLOPS FP8 Tensor performance (with sparsity) - the GPU that set the benchmark for AI training from 2023 onward
- First datacenter GPU with native FP8 support via the Transformer Engine, enabling dynamic mixed-precision training with minimal accuracy loss
- 16,896 CUDA cores and 528 fourth-generation Tensor Cores on 80 billion transistors at TSMC 4N
- Fourth-generation NVLink at 900 GB/s enables scaling to 256 GPUs via NVLink Switch System
- Priced at $25,000-$30,000 new - still the volume workhorse for large-scale training clusters worldwide
Overview
The NVIDIA H100 is the GPU that the AI scaling era was built on. Announced at GTC in March 2022 and shipping in volume from early 2023, the H100 SXM introduced two technologies that fundamentally changed how AI models are trained: FP8 precision and the Transformer Engine. Together, they delivered up to a 6x improvement in transformer training throughput over the A100 while maintaining model accuracy - a leap that powered the training runs behind most of the major foundation models released between 2023 and 2025, including Claude 3, Gemini 1.5, and Llama 3.
The Hopper architecture, built on TSMC's 4N process with 80 billion transistors, was NVIDIA's answer to the exponentially growing compute demands of transformer-based AI. The H100 SXM packs 16,896 CUDA cores and 528 fourth-generation Tensor Cores into an 814 mm² die, with 80GB of HBM3 memory delivering 3,350 GB/s of bandwidth. Fourth-generation NVLink raised aggregate GPU-to-GPU bandwidth 1.5x to 900 GB/s, and the new NVLink Switch System extended the NVLink fabric across up to 256 GPUs - enabling the massive training clusters that hyperscalers deployed throughout 2023 and 2024.
By early 2026, the H100 is being superseded by the B200 and rack-scale Blackwell systems like the GB200 NVL72, which deliver 2-3x the compute per GPU. But the H100 remains the volume backbone of most production AI infrastructure. Tens of thousands of H100s are deployed across every major cloud provider, and the software ecosystem - CUDA libraries, inference engines, training frameworks - is more mature on Hopper than on any other architecture. For organizations that need proven, battle-tested GPU compute at scale, the H100 is still the rational default.
Key Specifications
| Specification | Details |
|---|---|
| Manufacturer | NVIDIA |
| Architecture | Hopper (GH100) |
| Process Node | TSMC 4N |
| Transistors | 80 billion |
| Die Size | 814 mm² |
| CUDA Cores | 16,896 |
| Tensor Cores | 528 (4th generation) |
| Streaming Multiprocessors | 132 |
| GPU Memory | 80 GB HBM3 |
| Memory Bandwidth | 3,350 GB/s (3.35 TB/s) |
| Memory Bus Width | 5,120-bit |
| L2 Cache | 50 MB |
| FP64 Performance | 33.5 TFLOPS (67 TFLOPS via FP64 Tensor Cores) |
| FP32 Performance | 67 TFLOPS |
| TF32 Performance | 495 TFLOPS (989 TFLOPS with sparsity) |
| FP16 / BF16 Performance | 989 TFLOPS (1,979 TFLOPS with sparsity) |
| FP8 Performance | 1,979 TFLOPS (3,958 TFLOPS with sparsity) |
| INT8 Performance | 1,979 TOPS (3,958 TOPS with sparsity) |
| Transformer Engine | 1st generation (FP8/FP16 dynamic) |
| NVLink | 4th generation, 900 GB/s |
| NVLink Switch System | Up to 256 GPUs |
| PCIe | Gen 5.0 x16 |
| Multi-Instance GPU | Up to 7 instances |
| TDP | 700W (SXM), 350W (PCIe) |
| Form Factor | SXM5, PCIe |
| Cooling | Passive (both variants) |
| Release Date | Announced March 2022; volume shipments from early 2023 |
The GH100 die is fabricated on TSMC's 4N process, a custom variant of the N4 node optimized for NVIDIA's design requirements. With 80 billion transistors in 814 mm², it delivers a 48% increase in transistor count over the A100's 54.2 billion while shrinking from 826 mm² - a direct result of the process node shrink from 7N to 4N. The full GH100 die contains 144 SMs, but the H100 SXM product enables 132 SMs, providing yield headroom while still delivering a 22% increase in SM count over the A100's 108.
Each H100 SM contains 128 FP32 CUDA cores (double the A100's 64 per SM) and 4 fourth-generation Tensor Cores. The doubling of CUDA cores per SM is a significant architectural change from Ampere - it means each SM can process twice as many FP32 operations, which improves performance on non-Tensor-Core workloads like data preprocessing, activation functions, and general-purpose GPU computing.
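The headline core counts follow directly from this per-SM configuration; a quick sanity check using the figures above:

```python
# Derive H100 SXM totals from the per-SM configuration described above.
sms = 132                  # enabled SMs (the full GH100 die has 144)
fp32_cores_per_sm = 128    # double Ampere's 64 per SM
tensor_cores_per_sm = 4    # 4th-generation Tensor Cores

print(sms * fp32_cores_per_sm)    # 16896 CUDA cores
print(sms * tensor_cores_per_sm)  # 528 Tensor Cores
```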
The 50 MB L2 cache is a 25% increase over the A100's 40 MB, providing additional on-chip data reuse for inference workloads. The L2 partitioning has also been redesigned to improve effective bandwidth for the access patterns typical of transformer models.
Performance Benchmarks
| Metric | A100 80GB | H100 SXM | H200 | B200 |
|---|---|---|---|---|
| FP8 Tensor TFLOPS | N/A | 3,958 (sparse) | 3,958 (sparse) | 9,000 (sparse) |
| FP16 Tensor TFLOPS | 624 (sparse) | 1,979 (sparse) | 1,979 (sparse) | 4,500 (sparse) |
| Memory Capacity | 80 GB HBM2e | 80 GB HBM3 | 141 GB HBM3e | 192 GB HBM3e |
| Memory Bandwidth | 2,039 GB/s | 3,350 GB/s | 4,800 GB/s | 8,000 GB/s |
| NVLink Bandwidth | 600 GB/s | 900 GB/s | 900 GB/s | 1,800 GB/s |
| TDP | 400W | 700W | 700W | 1,000W |
| Transistors | 54.2B | 80B | 80B | 208B |
| Process | TSMC 7N | TSMC 4N | TSMC 4N | TSMC 4NP |
The H100's generational leap over the A100 is most dramatic at FP8 precision, where the Transformer Engine delivers 3,958 TFLOPS with sparsity - a capability the A100 simply does not have. At FP16, the improvement is roughly 3.2x (1,979 vs. 624 TFLOPS, both with sparsity). Memory bandwidth improved 1.64x to 3,350 GB/s, and NVLink bandwidth jumped 1.5x to 900 GB/s.
Compared to its successors, the H100 shows the pace of NVIDIA's generational improvements. The H200 delivers identical compute but with 76% more memory (141GB vs 80GB) and 43% more bandwidth (4,800 vs 3,350 GB/s) - a meaningful upgrade for memory-bound inference workloads. The B200 roughly doubles everything: 2.3x the FP8 compute, 2.4x the memory, and 2.4x the bandwidth.
Real-World Training Performance
On large language model training, the H100 delivers approximately 2.5-4x the throughput of the A100 at FP8 precision, depending on model size and training configuration. For a 7B-parameter model using Megatron-LM with tensor parallelism across 8 H100s, typical training throughput ranges from 8,000-12,000 tokens per second per GPU - roughly 3x the A100's performance at BF16.
The training speedup narrows at larger model sizes due to increased communication overhead. For 70B+ parameter models that require pipeline parallelism across multiple nodes, the all-reduce operations over InfiniBand become a larger fraction of total training time. Within a single DGX H100 node (8 GPUs connected via NVSwitch at 900 GB/s each), scaling is near-linear. Across nodes, the InfiniBand interconnect (400 Gb/s NDR per port) becomes the bottleneck, and real-world scaling efficiency drops to 70-80% depending on the parallelism strategy.
This is precisely the problem that the NVLink Switch System addresses. By extending the NVLink domain from 8 GPUs (single DGX) to up to 256 GPUs (across multiple nodes), the NVLink Switch System keeps the high-bandwidth NVLink fabric available for all-reduce and all-to-all operations that would otherwise fall back to InfiniBand. For organizations training at the 32-256 GPU scale, the NVLink Switch System can improve training efficiency by 20-40% compared to InfiniBand-only multi-node setups.
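The interconnect gap is easy to quantify with the standard ring all-reduce cost model, in which each GPU sends and receives 2(n-1)/n of the gradient buffer over its per-GPU link. A rough sketch - the 50 GB/s figure assumes one 400 Gb/s NDR port per GPU, and real frameworks overlap much of this communication with compute:

```python
def ring_allreduce_s(buffer_gb, n_gpus, link_gbs):
    """Ideal ring all-reduce time: each GPU transfers 2*(n-1)/n
    of the buffer across its per-GPU link bandwidth."""
    return 2 * (n_gpus - 1) / n_gpus * buffer_gb / link_gbs

# 140 GB of FP16 gradients for a 70B model, 8-GPU group:
print(round(ring_allreduce_s(140, 8, 900), 2))  # ~0.27 s over NVLink
print(round(ring_allreduce_s(140, 8, 50), 2))   # ~4.9 s over one NDR port
```

The ~18x gap per synchronization step is what the NVLink Switch System removes for collectives that stay inside the NVLink domain.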
Real-World Inference Performance
For LLM inference, the H100's FP8 Transformer Engine is its primary advantage over the A100. Using TensorRT-LLM with FP8 quantization, the H100 can serve a 7B-parameter model at approximately 350-500 tokens per second for batch-1 autoregressive generation - roughly 2.5-3.5x the A100's performance at FP16 or INT8.
At higher batch sizes, the H100's advantage grows. Batch-64 inference on a 7B model can exceed 10,000 tokens per second aggregate throughput, leveraging both the FP8 Tensor Cores and the higher memory bandwidth to keep the compute pipeline fed. For production inference serving, where high throughput at moderate latency is the primary metric, the H100 delivers substantially better cost-efficiency than the A100.
The 80GB memory limitation is more acutely felt for inference than for training. A 70B-parameter model in FP16 requires ~140GB of memory for weights alone, which does not fit on a single H100. FP8 quantization reduces this to ~70GB, which fits but leaves minimal headroom for KV-cache and activations. In practice, serving 70B models on H100 typically requires FP8 quantization (tight fit on 1 GPU) or multi-GPU sharding (2+ GPUs), while the H200 with 141GB of HBM3e provides a much more comfortable single-GPU option.
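The weight-memory arithmetic behind these fit/no-fit calls is simple - parameters times bytes per parameter, with KV-cache and activations on top:

```python
def weight_gb(params_billion, bits_per_param):
    """Approximate weight footprint of a dense model in GB
    (weights only; KV-cache and activations come on top)."""
    return params_billion * bits_per_param / 8

print(weight_gb(70, 16))  # 140.0 GB -> needs 2+ H100s at FP16
print(weight_gb(70, 8))   # 70.0 GB  -> fits in 80 GB with ~10 GB to spare
```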
Key Capabilities
Transformer Engine. The H100's most important innovation is the first-generation Transformer Engine, which automatically manages precision switching between FP8 and FP16 on a per-layer, per-tensor basis during training. The engine monitors the dynamic range of each tensor and chooses the optimal precision format to maximize throughput without sacrificing model accuracy. This is not just a new number format - it is a hardware-software co-designed system that makes FP8 training practical. Before the Transformer Engine, mixed-precision training required careful manual tuning of scaling factors. Hopper automates this entirely. The result: FP8 training runs that match FP16 accuracy at nearly 2x the throughput.
The Transformer Engine implements FP8 through two formats: E4M3 (4-bit exponent, 3-bit mantissa) for forward pass computations and E5M2 (5-bit exponent, 2-bit mantissa) for backward pass gradients. The wider exponent range of E5M2 accommodates the larger dynamic range of gradient values, while E4M3's extra mantissa bit provides better precision for activations and weights. Per-tensor scaling factors are stored in FP32 and maintained automatically by the Transformer Engine, ensuring numerical stability without developer intervention.
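The dynamic-range tradeoff between the two formats falls out of the exponent/mantissa split. The sketch below computes the largest finite value of each format from first principles; note that Hopper's E4M3 is the "FN" variant, which gives up Inf and all but one NaN encoding to extend the finite range:

```python
def fp8_max(exp_bits, man_bits, fn_variant=False):
    """Largest finite value of a small binary float format
    (illustrative sketch; assumes IEEE-style exponent bias)."""
    bias = 2 ** (exp_bits - 1) - 1
    if fn_variant:
        # E4M3FN reclaims the top exponent for finite values; only the
        # all-ones mantissa there encodes NaN, so the largest finite
        # value sits one mantissa step below 2.0 at the top exponent.
        emax = (2 ** exp_bits - 1) - bias
        frac = 2 - 2 ** (1 - man_bits)
    else:
        # IEEE-style (E5M2): the top exponent is reserved for Inf/NaN.
        emax = (2 ** exp_bits - 2) - bias
        frac = 2 - 2 ** (-man_bits)
    return frac * 2.0 ** emax

print(fp8_max(4, 3, fn_variant=True))  # 448.0   (E4M3FN max)
print(fp8_max(5, 2))                   # 57344.0 (E5M2 max)
```

E5M2's roughly 128x wider range is why it suits gradients, while E4M3's extra mantissa bit gives weights and activations finer resolution.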
Fourth-Generation NVLink and NVLink Switch System. The H100 SXM delivers 900 GB/s of GPU-to-GPU NVLink bandwidth - a 1.5x increase over the A100's 600 GB/s. But the bigger story is the NVLink Switch System, which extends the NVLink fabric beyond a single 8-GPU node to connect up to 256 GPUs across multiple nodes. This creates a unified memory space and low-latency interconnect for massive distributed training runs, eliminating the InfiniBand bottleneck for all-reduce operations within the NVLink domain. For trillion-parameter training clusters, this is the feature that makes H100 systems scale efficiently.
The NVLink Switch uses dedicated NVSwitch chips - third-generation in the DGX H100 - with each NVSwitch supporting 64 NVLink 4.0 ports. A DGX H100 node contains 4 NVSwitch chips, creating a fully connected all-to-all topology among its 8 GPUs. The external NVLink Switch System extends this fabric with additional NVSwitch chips in external switch boxes, maintaining the 900 GB/s per-GPU bandwidth across up to 32 DGX nodes (256 GPUs).
DPX Instructions and Confidential Computing. Beyond AI, the H100 introduced DPX instructions for dynamic programming algorithms - delivering up to 7x speedups for genomics, graph analytics, and route optimization compared to the A100. The H100 also introduced confidential computing support at the GPU level, enabling encrypted GPU memory and attestation for secure multi-tenant cloud deployments. While these features get less attention than the AI capabilities, they expanded the H100's addressable market into regulated industries that previously could not use shared GPU infrastructure.
Thread Block Clusters. The H100 introduced a new level of the CUDA programming hierarchy called Thread Block Clusters. Clusters allow groups of thread blocks running on different SMs to synchronize and share data through distributed shared memory. This is particularly useful for attention operations in transformers, where different parts of the attention matrix need to exchange data during computation. Thread Block Clusters enable this exchange without going through the global memory hierarchy, reducing latency and improving SM utilization for attention-bound workloads.
Architecture Deep Dive
The Hopper SM
The H100's SM architecture represents a major redesign from Ampere. Each SM contains 128 FP32 CUDA cores (versus 64 in the A100), organized into 4 processing blocks of 32 cores each. The fourth-generation Tensor Cores support all precisions from FP64 down to FP8, with the FP8 path providing the highest throughput at 3,958 TFLOPS (with sparsity) across the full GPU.
The SM also includes 256 KB of combined shared memory and L1 data cache (up from 192 KB in the A100), which can be configured in various split ratios depending on the workload. For transformer inference, where shared memory is used to hold attention tiles, the larger allocation provides headroom for larger tile sizes and reduces the need for register spilling.
HBM3 Memory
The H100 was the first GPU to ship with HBM3 memory, which raises the per-pin data rate ceiling from HBM2e's 3.6 Gbps to 6.4 Gbps. The H100 runs five active HBM3 stacks on a 5,120-bit bus at roughly 5.2 Gbps per pin, delivering 3,350 GB/s of aggregate bandwidth - a 64% improvement over the A100's 2,039 GB/s. This bandwidth improvement is critical for LLM inference, where token generation throughput is roughly proportional to memory bandwidth for large models.
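The aggregate figure follows from bus width times per-pin rate. A quick check - the ~5.23 Gbps pin rate is inferred here from the published 3,350 GB/s and the 5,120-bit bus, not a number NVIDIA quotes directly:

```python
def hbm_bandwidth_gbs(bus_width_bits, gbps_per_pin):
    """Aggregate HBM bandwidth: pin count * per-pin rate, bits -> bytes."""
    return bus_width_bits * gbps_per_pin / 8

print(round(hbm_bandwidth_gbs(5120, 5.23)))   # ~3347 GB/s (H100 HBM3)
print(round(hbm_bandwidth_gbs(5120, 3.186)))  # ~2039 GB/s (A100 HBM2e)
```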
However, the H100's 80GB capacity remained the same as the A100 80GB, which created a tension between the compute and memory dimensions. The Tensor Cores can process data much faster than the A100, but the memory pool that feeds them did not grow. This mismatch is what motivated the H200 - an identical compute die paired with higher-capacity, higher-bandwidth HBM3e memory.
PCIe vs SXM Variants
The performance gap between the H100 PCIe and H100 SXM is more significant than many buyers realize. The SXM variant operates at 700W TDP with full NVLink connectivity (900 GB/s via NVSwitch), while the PCIe variant is limited to 350W TDP with either no NVLink or limited NVLink bridge connectivity (supporting only 2 GPUs). The compute performance difference is roughly proportional to the power difference - the PCIe variant delivers approximately 50-60% of the SXM variant's sustained throughput for training workloads.
For inference, the gap is smaller because inference workloads are more memory-bandwidth-bound than compute-bound, and both variants use the same HBM3 memory subsystem. But for any workload that requires multi-GPU scaling - which includes serving most models larger than 13B parameters - the SXM variant's NVLink connectivity is essential.
Pricing and Availability
The H100 SXM has been the volume GPU for hyperscale AI deployments since 2023. Pricing has remained relatively stable, with new units in the $25,000-$30,000 range. The DGX H100 system - eight H100 SXM GPUs in a single server node - lists at approximately $300,000-$400,000 depending on configuration and volume.
| Configuration | Estimated Price (2026) |
|---|---|
| H100 SXM (new) | $25,000 - $30,000 |
| H100 PCIe (new) | $25,000 - $30,000 |
| DGX H100 (8x H100 SXM) | $300,000 - $400,000 |
| HGX H100 (8-GPU board) | $200,000 - $300,000 |
| Cloud rental (per GPU hour) | $2.50 - $4.50 |
Cloud availability is excellent. AWS (p5 instances), GCP (a3-highgpu), Azure (ND H100 v5), OCI, Lambda, CoreWeave, and numerous smaller providers all offer H100 instances. Reserved instance pricing can bring costs down significantly from on-demand rates, and spot availability has improved as Blackwell systems come online and some cloud capacity shifts to newer GPUs.
The secondary market for H100s is developing but less mature than the A100 market. As organizations upgrade to Blackwell, expect increasing secondary market supply throughout 2026, which should push used H100 prices below the $25,000 floor.
Total Cost of Ownership
The H100's 700W TDP is a significant operational cost factor. At $0.10/kWh with a 1.3 PUE, a single H100 costs approximately $798 per year in electricity - 75% more than an A100's $456. Over a three-year deployment, power costs add approximately $2,394 per GPU. For an 8-GPU DGX H100, that is $19,152 in electricity alone over three years.
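These figures come from straightforward arithmetic on TDP, PUE, and electricity price, assuming a 100% duty cycle:

```python
def annual_power_cost_usd(tdp_watts, usd_per_kwh=0.10, pue=1.3):
    """Yearly electricity cost for one GPU at full sustained load."""
    kwh_per_year = tdp_watts / 1000 * 8760  # 8,760 hours per year
    return kwh_per_year * pue * usd_per_kwh

h100 = annual_power_cost_usd(700)  # ~$797/yr
a100 = annual_power_cost_usd(400)  # ~$456/yr
print(round(h100), round(a100), round(h100 / a100, 2))  # 797 456 1.75
```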
However, the H100's higher performance typically more than compensates for the higher power cost. If an H100 delivers 3x the training throughput of an A100, it needs one-third the GPU-hours to complete the same training run. The time-to-completion advantage translates directly to lower total compute cost despite higher per-GPU power consumption. For organizations where compute throughput is the primary constraint, the H100 is more cost-efficient than the A100 on a per-FLOP basis even after accounting for higher power costs.
Strengths
- FP8 Transformer Engine delivers 3,958 TFLOPS with sparsity - a 6x leap over A100 for transformer workloads
- The proven workhorse for large-scale training - most major foundation models from 2023-2025 were trained on H100 clusters
- NVLink Switch System enables scaling to 256 GPUs with low-latency, high-bandwidth interconnect
- Mature software ecosystem with deep optimization across CUDA, cuDNN, TensorRT, Megatron, DeepSpeed, and all major frameworks
- Multi-Instance GPU (MIG) supports up to 7 isolated instances for multi-tenant inference
- PCIe Gen 5 and 80GB HBM3 provide a balanced configuration for both training and inference
- Confidential computing support enables deployment in regulated and multi-tenant environments
- Thread Block Clusters improve SM utilization for attention-heavy transformer workloads
- 50 MB L2 cache provides 25% more on-chip data reuse than the A100
Weaknesses
- 80GB HBM3 memory is the most significant limitation - insufficient for serving unquantized 70B+ models on a single GPU
- 3,350 GB/s memory bandwidth is 30% lower than the H200's 4,800 GB/s, limiting inference throughput for memory-bound workloads
- 700W TDP nearly doubles the A100's power consumption, with significant implications for datacenter power and cooling budgets
- At $25,000-$30,000, the price premium over used A100s ($10,000-$15,000) may not be justified for workloads that do not benefit from FP8
- Being superseded by Blackwell - new large-scale deployments are increasingly choosing B200 or GB200 NVL72
- NVLink Switch System requires dedicated switch hardware, adding significant cost to multi-node NVLink deployments
- PCIe variant at 350W has significantly lower performance than the SXM variant, but pricing is similar
- No FP4 support - the precision format that gives Blackwell its 5x inference advantage over Hopper
Who Should Buy the H100 in 2026
The H100 sits in an interesting position in 2026 - it is simultaneously the proven default and the outgoing generation. For many buyers, it remains the right choice. For others, Blackwell is worth the wait.
Good Fit
Medium-scale training (7B-70B models). For organizations training models in the 7B-70B parameter range, the H100 delivers proven performance with a mature software stack. FP8 via the Transformer Engine roughly halves training time compared to BF16, and the NVLink Switch System enables efficient scaling to 256 GPUs. The training frameworks (Megatron-LM, DeepSpeed, PyTorch FSDP) are deeply optimized for Hopper.
Production inference at scale. If you need to deploy a large inference fleet today with proven reliability, the H100 is the safest choice. TensorRT-LLM, vLLM, and other inference engines have been optimized for Hopper over two years, and the FP8 inference path is battle-tested in production at hyperscale. The software maturity advantage over Blackwell is real and meaningful for production deployments where reliability is paramount.
Organizations with existing Hopper infrastructure. If you already have DGX H100 or HGX H100 systems deployed, adding more H100 capacity allows you to expand without changing your software stack, operational procedures, or datacenter infrastructure. The incremental cost of H100 is lower than the migration cost to Blackwell for many organizations.
Poor Fit
Memory-bound inference on 70B+ models. The H100's 80GB memory limit is its biggest practical weakness. If your workload involves serving 70B-class models, the H200 with 141GB is strictly better - same compute, more memory, same power. Unless H200 supply or pricing forces your hand, there is no technical reason to choose an H100 over an H200 for 70B inference.
Greenfield deployments at massive scale. If you are building a new training cluster from scratch with 500+ GPUs, the B200 or GB200 NVL72 will deliver 2-3x more compute for a modest increase in per-GPU cost. The infrastructure investment in a new datacenter is dominated by construction, power, and cooling costs, not GPU costs - so buying the most capable GPU per slot is the economically optimal choice.
Budget-constrained workloads that fit on A100. If your workload runs adequately on an A100 at INT8 and you are primarily cost-sensitive, the H100's premium over a used A100 may not be justified. The A100's lower acquisition and operating costs make it more economical for workloads where FP8 provides marginal benefit.
Model Compatibility Guide
| Model | FP16 Size | FP8 Size | Fits on H100 (FP16)? | Fits on H100 (FP8)? | KV-Cache Headroom (FP8) |
|---|---|---|---|---|---|
| Llama 3 8B | ~16 GB | ~8 GB | Yes | Yes | ~72 GB |
| Mistral 7B | ~14 GB | ~7 GB | Yes | Yes | ~73 GB |
| Llama 3 70B | ~140 GB | ~70 GB | No | Tight (~10 GB left) | ~10 GB |
| Qwen 2.5 72B | ~144 GB | ~72 GB | No | Tight (~8 GB left) | ~8 GB |
| Mixtral 8x7B | ~93 GB | ~47 GB | No | Yes | ~33 GB |
| Llama 2 13B | ~26 GB | ~13 GB | Yes | Yes | ~67 GB |
| Code Llama 34B | ~68 GB | ~34 GB | Yes (tight) | Yes | ~46 GB |
The H100's 80GB is sufficient for most models up to 30B parameters at FP16 and most models up to 70B at FP8. The critical constraint is KV-cache headroom - a 70B model at FP8 leaves only ~10GB for KV-cache, which limits concurrent requests to single-digit counts with standard context windows. For production serving of 70B models, the H200's 141GB provides a much more comfortable deployment option.
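A rough KV-cache sizing sketch makes the concurrency limit concrete. Per token, the cache stores one key and one value vector for each layer; with grouped-query attention, only the KV heads count. The Llama-3-70B-style parameters below (80 layers, 8 KV heads, head dim 128) are illustrative assumptions:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    """KV-cache size for ONE sequence: 2 (K and V) * layers *
    kv_heads * head_dim * tokens * bytes per element."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

per_seq = kv_cache_gb(80, 8, 128, seq_len=8192, bytes_per_elem=1)  # FP8 cache
print(round(per_seq, 2))   # ~1.34 GB per 8K-token sequence
print(int(10 // per_seq))  # ~7 concurrent sequences in 10 GB of headroom
```

This is where the single-digit concurrency figure comes from: ~10 GB of headroom divided by ~1.3 GB per full-context sequence.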
Software Ecosystem
The H100's software ecosystem is the most mature of any current-generation GPU, benefiting from two years of production deployment at hyperscale.
Training Framework Optimization. PyTorch 2.x, JAX, and Megatron-LM include dedicated Hopper optimization paths that exploit the Transformer Engine, Thread Block Clusters, and TMA (Tensor Memory Accelerator) hardware. The FP8 training recipes have been validated on models up to 175B parameters, with peer-reviewed results confirming no accuracy degradation compared to FP16 training. This validation history gives the H100 a trust advantage for training workloads where accuracy is non-negotiable.
Inference Engine Maturity. TensorRT-LLM's H100 code path is the single most optimized inference implementation in the industry. It exploits FP8 Tensor Core execution, in-flight batching, paged attention with Hopper-specific memory management, and speculative decoding - all features that have been tested under production load at companies processing billions of tokens per day. The vLLM project also has deep H100 optimization, including PagedAttention v2 and chunked prefill support.
Distributed Training Libraries. NCCL (NVIDIA Collective Communications Library) includes NVLink Switch System support that is exclusive to Hopper, enabling the 256-GPU NVLink domain. Megatron-LM's tensor parallelism, pipeline parallelism, and context parallelism modes are all optimized for the H100's NVLink topology. DeepSpeed's ZeRO-3 and ZeRO-Infinity are tested and validated on H100 clusters.
Generational Comparison
H100 vs A100: The Architecture Gap
The H100 represents the largest generational improvement in NVIDIA's datacenter GPU line when measured against the A100. The raw numbers - 6x FP8 throughput, 1.64x bandwidth, 1.5x NVLink bandwidth - translate to approximately 2.5-4x real-world improvement for transformer workloads. The Transformer Engine's ability to automatically manage FP8 precision eliminated the manual precision tuning that limited FP8 adoption, making the full compute improvement accessible to any training or inference workload using standard frameworks.
The H100 also brought qualitative architectural improvements that do not show up in FLOPS comparisons: Thread Block Clusters for better SM coordination, DPX instructions for dynamic programming, confidential computing for secure multi-tenant deployment, and the NVLink Switch System for multi-node scaling. These features expanded the H100's addressable market beyond pure AI training into genomics, financial modeling, and regulated industries.
H100 vs H200: Same Silicon, More Memory
The H200 is the easiest comparison because the compute is identical. The H200 provides 76% more memory (141GB vs 80GB) and 43% more bandwidth (4,800 vs 3,350 GB/s) for a 0-25% price premium. For any memory-bound workload - which includes most LLM inference at production batch sizes - the H200 is strictly superior.
The H100's only advantage over the H200 is price and availability. If H200 supply is limited and H100s are available, the H100 remains a capable GPU. But given the choice at similar prices, there is no technical reason to prefer the H100 over the H200 for new deployments.
H100 vs B200: The Generational Leap
The B200 delivers 2.3x the FP8 compute, 2.4x the memory, 2.4x the bandwidth, and 2x the NVLink bandwidth compared to the H100. It also introduces FP4 support, which provides up to 5x the H100's FP8 inference throughput. At roughly 1.2-1.5x the price, the B200 is significantly more cost-efficient per TFLOPS.
The tradeoffs are operational: the B200 draws 1,000W (vs 700W), often requires liquid cooling, and the software ecosystem is less mature. For production-critical workloads where reliability and software stability are paramount, the H100's maturity advantage is meaningful. For performance-optimized workloads where the highest throughput matters most, the B200 is the clear choice.
Deployment Considerations
Datacenter Requirements
The H100 SXM requires an SXM5-compatible baseboard (DGX H100 or HGX H100 from OEM partners) with a power delivery system capable of sustained 700W per GPU. For 8-GPU configurations, that is 5.6kW of sustained GPU power alone; the DGX H100 as a whole is rated for up to 10.2kW including CPUs, memory, storage, and networking.
Cooling requirements for an 8-GPU H100 system in a 4U form factor are significant but manageable with conventional air cooling in well-designed datacenters. Typical inlet temperatures of 20-25°C are recommended, with rear-door heat exchangers or in-row cooling units for high-density deployments. The H100 does not require liquid cooling (unlike the B200 and the rack-scale Blackwell systems), which is a practical advantage for organizations deploying into existing air-cooled facilities.
Network Topology
For multi-node H100 training clusters, the network design is critical to performance. NVIDIA recommends a rail-optimized network topology with 400 Gb/s NDR InfiniBand per GPU for inter-node communication. An 8-GPU DGX H100 requires 8 InfiniBand ports, typically connected to a leaf-spine fat-tree network. For clusters up to 256 GPUs with NVLink Switch System, the NVLink switches handle intra-domain communication while InfiniBand handles inter-domain traffic.
The NVLink Switch System hardware includes external NVSwitch trays that occupy additional rack space - typically 1-2U per 32-GPU NVLink domain. This is additional infrastructure cost and complexity that is not present in standalone DGX deployments, but the 20-40% training efficiency improvement for large-scale workloads typically justifies the investment.
Use Case Deep Dives
Large-Scale Model Training
The H100's primary use case is large-scale model training, and it has proven itself as the backbone of frontier model development. Typical training cluster configurations and their capabilities:
| Cluster Size | Configuration | FP8 Compute | Approximate Training Capability |
|---|---|---|---|
| 8 GPUs | 1x DGX H100 | 31.7 PFLOPS | 7B-13B models (days) |
| 64 GPUs | 8x DGX H100 | 253 PFLOPS | 70B models (weeks) |
| 256 GPUs | 32x DGX H100 + NVLink Switch | 1,013 PFLOPS | 175B models (weeks) |
| 1,024 GPUs | 128x DGX H100 | 4,053 PFLOPS | 500B+ models (weeks-months) |
| 4,096 GPUs | 512x DGX H100 | 16,213 PFLOPS | Trillion-parameter models (months) |
The H100's NVLink Switch System provides its biggest advantage at the 32-256 GPU scale, where it keeps all inter-GPU communication on the high-bandwidth NVLink fabric. Beyond 256 GPUs, training clusters must use InfiniBand for inter-domain communication, which introduces the same bandwidth cliff that exists in A100 clusters at the 8-GPU boundary. The GB200 NVL72 addresses this by extending the NVLink domain to 72 GPUs per rack.
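For rough capacity planning, the widely used ~6·N·D FLOPs rule (about six FLOPs per parameter per training token) converts these cluster sizes into wall-clock estimates. The 40% MFU below is an assumed, fairly optimistic utilization figure, and dense FP8 throughput is used rather than the sparse peak:

```python
def training_days(params_b, tokens_t, n_gpus,
                  dense_fp8_tflops=1979, mfu=0.40):
    """Back-of-envelope training time: total FLOPs ~ 6 * N * D,
    divided by delivered cluster throughput."""
    total_flops = 6 * (params_b * 1e9) * (tokens_t * 1e12)
    delivered = n_gpus * dense_fp8_tflops * 1e12 * mfu
    return total_flops / delivered / 86_400  # seconds -> days

# 70B model, 2T training tokens, 256 H100s:
print(round(training_days(70, 2, 256), 1))  # ~48 days
```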
Production Inference Serving
For production inference, the H100 delivers strong performance with the most mature software stack. A single H100 can serve a quantized 70B model (FP8, ~70GB) with approximately 50-80 tokens per second at batch-1, scaling to 1,000+ tokens per second at batch-32. Using TensorRT-LLM with in-flight batching, a single H100 can handle 20-50 concurrent interactive sessions at standard latency targets (<500ms time-to-first-token).
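The batch-1 figure is consistent with a simple memory-bandwidth roofline: each generated token must stream the full weight set from HBM once, so decode speed is bounded by bandwidth divided by model size. A sketch that ignores KV-cache reads and kernel overheads, so it is an upper bound:

```python
def max_decode_tokens_per_s(bandwidth_gbs, weights_gb):
    """Bandwidth roofline for batch-1 autoregressive decode:
    one full weight read per generated token."""
    return bandwidth_gbs / weights_gb

print(round(max_decode_tokens_per_s(3350, 70)))  # ~48 tok/s, 70B FP8 model
```

Higher batch sizes amortize each weight read across many tokens, which is why aggregate throughput climbs far above this batch-1 ceiling.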
An 8-GPU DGX H100 serving a 70B model across all 8 GPUs (tensor parallel = 8) can achieve over 5,000 tokens per second aggregate throughput at batch-64 - enough to serve hundreds of concurrent users. For comparison, an equivalent A100 setup delivers approximately 1,500-2,000 tokens per second for the same model at INT8, making the H100 roughly 2.5-3x more efficient for this specific workload.
Inference for Mixture-of-Experts Models
Mixture-of-Experts (MoE) models like Mixtral, DeepSeek V3.2, and DBRX present unique demands for GPU hardware. MoE models have large total parameter counts but activate only a fraction of parameters per token (e.g., Mixtral 8x7B has 47B total parameters but activates only 12.9B per token). The H100's 80GB memory can hold the full Mixtral 8x7B model at FP8 (~47GB), while its FP8 Tensor Cores process only the active 12.9B parameters per token.
This creates an interesting performance dynamic: the H100 is memory-capacity-bound (needs to store all experts) but compute-light (processes only active experts). The 80GB limit restricts which MoE models can be served on a single H100 - larger MoE models like DeepSeek V3.2 (671B total) require multi-GPU sharding. The H200 with 141GB and the B200 with 192GB provide more room for large MoE models.
Cloud Provider Availability
The H100 is available from all major cloud providers, with most offering both on-demand and reserved instance pricing.
| Cloud Provider | Instance Type | GPUs per Instance | On-Demand Price (approx.) |
|---|---|---|---|
| AWS | p5.48xlarge | 8x H100 80GB | ~$98/hr |
| Google Cloud | a3-highgpu-8g | 8x H100 80GB | ~$100/hr |
| Microsoft Azure | ND H100 v5 | 8x H100 80GB | ~$98/hr |
| Oracle Cloud | BM.GPU.H100.8 | 8x H100 80GB | ~$65/hr |
| Lambda Cloud | gpu_8x_h100_sxm | 8x H100 80GB | ~$28/hr |
| CoreWeave | H100 80GB | 1-8x H100 80GB | ~$3.85/hr/GPU |
Per-GPU H100 cloud pricing spans a wide range: roughly $2.50-$4.50/hour at specialist providers such as Lambda and CoreWeave, versus $8-$12+/hour on-demand at the hyperscalers (per the table above, e.g. ~$98/hr ÷ 8 GPUs at AWS). In both tiers this is roughly 1.5-2x comparable A100 pricing, reflecting the higher performance. For organizations with predictable workloads, 1-3 year reserved instances can reduce H100 cloud costs by 30-50%.
The H100's cloud availability has improved significantly through 2025-2026 as Blackwell systems come online and some hyperscaler capacity shifts to newer GPUs. Spot H100 instances are increasingly available, particularly during off-peak hours, providing a cost-effective option for batch training and non-latency-sensitive inference workloads.
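A simple breakeven calculation helps frame the rent-versus-buy decision; the numbers below reuse price points from this section and deliberately ignore power, hosting, and financing costs:

```python
def breakeven_hours(purchase_usd, rental_usd_per_hr):
    """GPU-hours at which buying matches renting (hardware cost only)."""
    return purchase_usd / rental_usd_per_hr

hrs = breakeven_hours(27_500, 3.85)  # midpoint purchase price vs. ~$3.85/hr
print(round(hrs), round(hrs / 24))   # ~7143 hours, ~298 days of 24/7 use
```

At high sustained utilization the purchase pays for itself within a year; at low or bursty utilization, cloud rental stays cheaper.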
Export Control Considerations
Like the A100, the H100 is subject to US export controls restricting sales to China, Russia, and other specified countries. NVIDIA initially offered the H800 as an export-compliant variant with reduced NVLink bandwidth (400 GB/s versus 900 GB/s) and reduced interconnect capabilities, designed to fall below the export control performance thresholds in force at the time; the October 2023 rule updates subsequently restricted the H800 as well.
For cloud providers serving global customers, export control compliance requires careful management of which GPU SKUs are deployed in which regions and which customer workloads can access H100 versus H800 instances. The evolving regulatory landscape means compliance requirements may change, and organizations should monitor US Bureau of Industry and Security (BIS) guidance for updates.
The H100's Legacy
The H100 will likely be remembered as the single most important GPU in the history of AI. Not because it was the first AI GPU (that distinction belongs to earlier Tesla/V100 products), and not because it was the most powerful (the B200 and beyond will surpass it), but because it was the GPU that was in the right place at the right time.
The H100 arrived just as the transformer revolution was hitting its stride. Claude 3, Gemini 1.5, Llama 3, Mistral, and most other major foundation models released between 2023 and 2025 were trained on H100 clusters. The FP8 Transformer Engine made it economically feasible to train models that would have been prohibitively expensive on A100s. The NVLink Switch System enabled the 10,000+ GPU clusters that frontier labs now treat as standard infrastructure. And the 80GB of HBM3 - while now considered modest - was sufficient for the model architectures of its era.
As of 2026, the H100 is transitioning from "the GPU to buy" to "the GPU that is already deployed." With tens of thousands of H100s in production across hyperscale clouds and enterprise datacenters, it will remain the dominant AI compute platform for at least another 2-3 years as organizations amortize their investments and Blackwell production catches up with demand. The software ecosystem built on Hopper will continue to receive optimization investment, and the installed base ensures that H100-optimized code paths will be maintained indefinitely.
For buyers making decisions today, the H100's status as a "known quantity" with proven reliability, mature software, and predictable performance is itself a valuable attribute. In an industry where every new GPU generation promises revolutionary improvements, the H100 delivers exactly what its specifications promise - no more, no less. That predictability has value.
Related Coverage
- NVIDIA A100 - The GPU That Built Modern AI - The Ampere predecessor that the H100 replaced
- NVIDIA H200 - Inference-Optimized Hopper - Same compute, more memory and bandwidth
- NVIDIA B200 - Blackwell Flagship GPU - The next-generation flagship with 2x+ the compute
- NVIDIA GB200 NVL72 - Rack-Scale Blackwell - 72-GPU rack system for trillion-parameter workloads
- NVIDIA GB300 NVL72 - Blackwell Ultra - Next-generation rack system with 288GB HBM3e per GPU
Sources
- NVIDIA H100 Tensor Core GPU - Product Page
- NVIDIA H100 Datasheet (PDF)
- NVIDIA Hopper Architecture In-Depth - NVIDIA Technical Blog
- NVIDIA H100 Tensor Core GPU Architecture Whitepaper (PDF)
- NVIDIA H100 Deep Dive: Specs, Pricing, Best Uses - Fluence
- NVIDIA H100: Price, Specs, Benchmarks and Decision Guide - Clarifai
