
NVIDIA GB200 NVL72 for AI Training and Inference

As new AI startups and hyperscalers push the boundaries of what's possible in deep learning, the demand for high-performance AI computing has never been greater.

NVIDIA’s GB200 NVL72 represents the next big leap in AI-focused datacenter technology: the next generation of immensely powerful, highly scalable, and more energy-efficient computing. One thing is certain: you’ve never seen a rack like this before.


GB200 NVL72 datacenter rack. Source: nvidia.com 

What is the NVIDIA GB200? 

The NVIDIA GB200 NVL72 is a customized datacenter rack containing 36 Grace CPUs and 72 Blackwell GPUs connected by a 130 TB/s NVLink Switch System.  

The GB200 NVL72 is designed to work as one coherent and unified GPU capable of handling the most complex AI and high-performance computing (HPC) workloads with exceptional efficiency and speed. 

GB200 NVL72 Architecture 

The GB200 NVL72 is powered by NVIDIA’s latest Blackwell GPU architecture, which offers a significant boost in computational power over the previous Ampere and Hopper architectures powering the A100, H100, and H200.

NVIDIA Blackwell GPU

Within the GB200 NVL72 you’ll find 36 GB200 superchips, each pairing one Grace CPU with two Blackwell GPUs on a single board. According to NVIDIA, each Blackwell GPU contains 208 billion transistors, more than 2.5x the number of transistors in NVIDIA Hopper GPUs.
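
As a quick sanity check on those numbers, here is a back-of-envelope sketch in plain Python. The 80-billion-transistor figure for Hopper is NVIDIA’s published count for the H100 and isn’t quoted above, so treat it as an outside reference point.

```python
# Back-of-envelope arithmetic for one GB200 NVL72 rack, using the
# figures quoted above plus Hopper's published 80B transistor count.
SUPERCHIPS_PER_RACK = 36
GPUS_PER_SUPERCHIP = 2   # two Blackwell GPUs per superchip
CPUS_PER_SUPERCHIP = 1   # one Grace CPU per superchip

gpus = SUPERCHIPS_PER_RACK * GPUS_PER_SUPERCHIP  # 72 Blackwell GPUs
cpus = SUPERCHIPS_PER_RACK * CPUS_PER_SUPERCHIP  # 36 Grace CPUs

blackwell_transistors = 208e9  # per Blackwell GPU, per NVIDIA
hopper_transistors = 80e9      # per Hopper GPU (H100), NVIDIA's published figure
ratio = blackwell_transistors / hopper_transistors

print(f"{gpus} GPUs and {cpus} CPUs per rack; "
      f"{ratio:.1f}x the transistors of a Hopper GPU")
```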

GB200 Networking and Connectivity

We can expect to see four different types of networks in GB200 NVL72 systems. Two stand out:

Quantum-X800 InfiniBand is the foundation of the GB200’s AI compute fabric, capable of scaling beyond 10,000 GPUs, 5x more than the previous NVIDIA Quantum-2 generation. While most AI projects won’t scale to this level, the GB200 is likely to be the benchmark to beat in datacenter GPUs for some time to come.

Blackwell GPUs include 18 fifth-generation NVLink links, providing 1.8 TB/s of total bandwidth: 900 GB/s in each direction.
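
Those per-GPU numbers also explain the rack-level 130 TB/s figure mentioned earlier. A minimal sketch of the arithmetic, using only the figures above:

```python
# NVLink 5 bandwidth arithmetic from the figures above.
LINKS_PER_GPU = 18
TOTAL_BW_PER_GPU_GBS = 1800  # GB/s, bidirectional (900 GB/s each direction)
GPUS_PER_RACK = 72

per_link_gbs = TOTAL_BW_PER_GPU_GBS / LINKS_PER_GPU           # 100 GB/s per link
rack_total_tbs = GPUS_PER_RACK * TOTAL_BW_PER_GPU_GBS / 1000  # 129.6 ~ 130 TB/s

print(f"{per_link_gbs:.0f} GB/s per NVLink link; "
      f"~{rack_total_tbs:.0f} TB/s aggregate across 72 GPUs")
```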


GB200 NVLink Switch System. Source: nvidia.com

Thermal Power Management  

The GB200 NVL72 is a power-hungry machine, likely to require 120 kW per rack, roughly 3x the draw of an air-cooled rack of H100s.

Still, the GB200 architecture incorporates several improvements in thermal power management, reducing overall energy consumption per TFLOPS. It comes with an advanced liquid cooling solution that allows it to maintain peak performance even under heavy loads.


GB200 compute tray with liquid cooling. Source: nvidia.com

Per-GPU power consumption of the GB200 is effectively 1,200 W. On the whole, NVIDIA estimates the GB200 NVL72 delivers 25x better energy efficiency at the same performance for trillion-parameter AI models compared to an air-cooled H100 infrastructure.
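
Putting the rack and per-GPU figures together gives a rough power budget. A small sketch, assuming the non-GPU remainder covers Grace CPUs, NVSwitch trays, and cooling (an assumption; NVIDIA doesn’t break this down above):

```python
# Rough power budget for one GB200 NVL72 rack, from the figures above.
GPU_POWER_W = 1200
GPUS_PER_RACK = 72
RACK_POWER_KW = 120

gpu_kw = GPU_POWER_W * GPUS_PER_RACK / 1000  # 86.4 kW for the GPUs alone
rest_kw = RACK_POWER_KW - gpu_kw             # ~33.6 kW left over
# Assumption: the remainder covers Grace CPUs, NVSwitch trays, and cooling.

print(f"GPUs: {gpu_kw:.1f} kW; everything else: ~{rest_kw:.1f} kW")
```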

GB200 NVL72 Specs Comparison to H100 and H200 

The most natural comparison for the GB200 NVL72 is with the highest-performing NVIDIA GPUs currently on the market: the H100 and H200.

 

| Spec (per GPU) | H100 | H200 | GB200 NVL72 |
|---|---|---|---|
| Power (W) | 700 | 700 | 1,200 |
| NVLink Bandwidth (GB/s, per direction) | 450 | 450 | 900 |
| Memory Capacity (GB) | 80 | 141 | 192 |
| Memory Bandwidth (GB/s) | 3,352 | 4,800 | 8,000 |
| TF32 TFLOPS | 495 | 495 | 1,250 |
| FP16/BF16 TFLOPS | 989 | 989 | 2,500 |
| FP8/FP6/INT8 TFLOPS | 1,979 | 1,979 | 5,000 |
| FP4 TFLOPS | 1,979 | 1,979 | 10,000 |

Source: nvidia.com, SemiAnalysis

Get a more detailed summary of H100 specs and H200 specs.
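
To make the gap concrete, here is a short sketch that computes per-GPU ratios straight from the table above (GB200 vs. H100):

```python
# Per-GPU spec ratios, GB200 NVL72 vs. H100, from the table above.
h100 = {"power_w": 700, "memory_gb": 80, "mem_bw_gbs": 3352,
        "fp16_tflops": 989, "fp8_tflops": 1979}
gb200 = {"power_w": 1200, "memory_gb": 192, "mem_bw_gbs": 8000,
         "fp16_tflops": 2500, "fp8_tflops": 5000}

for key in h100:
    print(f"{key}: {gb200[key] / h100[key]:.1f}x")

# Performance per watt at FP8: the GB200 comes out ahead despite
# drawing 1.7x the power.
print(f"FP8 TFLOPS/W: H100 {h100['fp8_tflops'] / h100['power_w']:.1f}, "
      f"GB200 {gb200['fp8_tflops'] / gb200['power_w']:.1f}")
```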

AI Performance Summary 

The GB200 NVL72 provides up to 30x higher throughput on AI-related tasks than the H100, especially for dense matrix operations. NVIDIA’s optimizations reduce latency and increase efficiency in multi-GPU configurations, yielding up to a 4x speedup over the H100 when training the GPT-MoE-1.8T model.


Comparison of the GB200 NVL72 system with an H100. Source: nvidia.com

On the whole, the GB200 NVL72 system can deliver 1.44 exaFLOPS of low-precision (FP4) floating-point compute, making it the first exascale GPU solution.

For a real-world comparison: OpenAI reportedly trained GPT-4 in about 90 days on 25,000 A100s. With a 100,000-GPU GB200 NVL72 deployment, it should be possible to train a GPT-4-class model in less than two days.
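
That claim is easy to sanity-check in GPU-days. A minimal sketch using only the numbers above:

```python
# GPU-days comparison for the GPT-4 training example above.
a100_gpu_days = 25_000 * 90    # 2,250,000 GPU-days on A100s
gb200_gpu_days = 100_000 * 2   # 200,000 GPU-days on GB200 NVL72

speedup_per_gpu = a100_gpu_days / gb200_gpu_days  # ~11x work per GPU
print(f"A100 run: {a100_gpu_days:,} GPU-days; "
      f"GB200 run: {gb200_gpu_days:,} GPU-days "
      f"(~{speedup_per_gpu:.0f}x more work per GPU)")
```

In other words, the claim implies each GB200 GPU doing roughly 11x the work of an A100 on this workload, which is plausible given the two-generation jump in FLOPS, memory bandwidth, and interconnect described above.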

New Precision Capabilities with Second-Generation Transformer Engine 

The GB200 NVL72 uses NVIDIA's second-generation Transformer Engine to introduce advanced precision support, including community-defined microscaling (MX) formats such as MXFP6, to improve both accuracy and throughput for large language models (LLMs) and Mixture-of-Experts (MoE) models.

Micro-tensor scaling uses dynamic range management and fine-grain scaling techniques to optimize performance and accuracy, enabling the use of FP4 AI. This innovation effectively doubles the performance with Blackwell’s FP4 Tensor Core and also increases parameter bandwidth to HBM memory, allowing for significantly larger next-generation models per GPU. 
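
To make the block-scaling idea concrete, here is a minimal NumPy sketch of microscaling quantization. It is illustrative only: the 32-element block size matches the MX specification, but the real formats use a shared power-of-two scale per block and hardware rounding, which are simplified here to an arbitrary float scale and nearest-value rounding.

```python
import numpy as np

# Positive value grid of FP4 (E2M1): {0, 0.5, 1, 1.5, 2, 3, 4, 6};
# the sign is handled separately via the sign bit.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32  # MX formats share one scale across each 32-element block

def mx_quantize(x: np.ndarray) -> np.ndarray:
    """Fake-quantize a 1-D tensor with per-block scaling (illustrative)."""
    out = np.empty_like(x, dtype=np.float32)
    for start in range(0, len(x), BLOCK):
        block = x[start:start + BLOCK]
        amax = np.max(np.abs(block))
        scale = amax / FP4_GRID[-1] if amax > 0 else 1.0  # one scale per block
        mags = np.abs(block) / scale
        # Snap each magnitude to the nearest representable FP4 value.
        idx = np.abs(mags[:, None] - FP4_GRID[None, :]).argmin(axis=1)
        out[start:start + BLOCK] = np.sign(block) * FP4_GRID[idx] * scale
    return out

x = np.random.randn(128).astype(np.float32)
xq = mx_quantize(x)
print("mean abs quantization error:", float(np.mean(np.abs(x - xq))))
```

Because each small block gets its own scale, a handful of FP4 values can cover a much wider dynamic range than a single tensor-wide scale would, which is the core of why micro-tensor scaling preserves accuracy at 4-bit precision.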


Conceptual framework for microscaling precision. Source: nvidia.com 

The integration of TensorRT-LLM, with quantization to 4-bit precision and custom kernels, enables real-time inference on massive models with reduced hardware, energy consumption, and cost. On the training side, the second-generation Transformer Engine, combined with the NeMo Framework and Megatron-Core PyTorch library, provides unparalleled model performance through multi-GPU parallelism techniques and fifth-generation NVLink support.

Bottom line on the GB200 NVL72 

The GB200 NVL72 is unlike any datacenter rack we’ve ever seen. Its advancements in GPU architecture, memory bandwidth, and energy efficiency make it a powerful tool for tackling the most demanding AI workloads. For AI engineers, the NVL72 offers not only superior performance but also the flexibility and scalability needed to stay competitive in a fast-paced industry.