PCIe and SXM5 Comparison for NVIDIA H100 Tensor Core GPUs

If you plan to use NVIDIA H100 GPUs for AI training or inference, you have a key decision to make about the form factor: do you choose PCIe, or do you need the extra performance of SXM? 

Let’s review how the PCIe and SXM5 form factors of the NVIDIA H100 compare in specifications and performance. 

Why H100 comes in PCIe and SXM5 form factors 

Demand for high-performance GPUs is at an all-time high. Everyone from enterprise companies to hyperscalers to new AI startups is going after the same GPUs to feed their computational needs. In response, NVIDIA has brought both PCIe and SXM form factors of the H100 to market. 

By giving different deployments more flexibility, NVIDIA has made the H100 a top-performing GPU across many use cases, including high-performance computing (HPC) and AI training and inference. 

What is PCIe? 

PCIe (Peripheral Component Interconnect Express) is a high-speed interface standard used to connect hardware components to a computer's motherboard. PCI was originally developed by Intel and introduced in 1992; PCIe (the "e" stands for "express") launched in 2003. 

PCIe operates using lanes, which are pairs of wires that transfer data between the motherboard and the peripheral device. Each PCIe slot can have multiple lanes (x1, x4, x8, x16, etc.), with x16 being the most common for GPUs, providing the highest bandwidth. 
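To put numbers on that, here is a back-of-the-envelope sketch of theoretical PCIe Gen 5 bandwidth by lane count. The 32 GT/s per-lane rate and 128b/130b encoding are standard PCIe Gen 5 parameters; real-world throughput will land below these figures.

```python
# Theoretical PCIe Gen 5 bandwidth by lane count (illustrative arithmetic,
# not a benchmark). Gen 5 signals at 32 GT/s per lane with 128b/130b encoding.
GT_PER_SEC = 32            # giga-transfers per second, per lane
ENCODING = 128 / 130       # usable fraction after 128b/130b line encoding

def gen5_bandwidth_gb_s(lanes: int) -> float:
    """Theoretical one-way bandwidth in GB/s for a PCIe Gen 5 link."""
    return lanes * GT_PER_SEC * ENCODING / 8   # bits -> bytes

for lanes in (1, 4, 8, 16):
    one_way = gen5_bandwidth_gb_s(lanes)
    print(f"x{lanes:>2}: ~{one_way:.1f} GB/s per direction "
          f"(~{2 * one_way:.0f} GB/s bidirectional)")
```

An x16 link works out to roughly 63 GB/s per direction, which is where the rounded 128 GB/s bidirectional figure quoted for the H100 comes from.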

What is SXM? 

SXM is a custom socket form factor designed by NVIDIA specifically for high-performance computing and intensive machine learning workloads. Originally introduced with the P100 GPUs in 2016, it offers higher density and performance by integrating GPUs more closely with the system board. 

SXM stands for Server PCI Express Module. It is designed to work with NVIDIA's NVLink interconnect for direct GPU-to-GPU communication, delivering up to 900 GB/s of total bandwidth per GPU. Up to eight GPUs can be connected on a single SXM server board. 
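If you want to see whether the GPUs in a box can actually talk to each other directly, a minimal sketch using PyTorch (an assumption here, on a CUDA machine with more than one GPU) looks like this; `nvidia-smi topo -m` shows the same topology, including which pairs are linked by NVLink.

```python
# Minimal sketch: list GPU pairs with direct peer-to-peer access. On SXM
# systems these transfers ride NVLink; on PCIe-only systems they cross the
# PCIe fabric, or are unavailable and must stage through host memory.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j and torch.cuda.can_device_access_peer(i, j):
            print(f"GPU {i} -> GPU {j}: peer-to-peer access available")
```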

H100 PCIe vs SXM5 Specifications Comparison 

| Feature | H100 PCIe | H100 SXM5 |
| --- | --- | --- |
| Form Factor | PCIe Gen 5 | SXM5 |
| Memory Bandwidth | 2 TB/s | 3.35 TB/s |
| L2 Cache | 50 MB | 50 MB |
| Transistors | 80 billion | 80 billion |
| GPU Memory | 80 GB | 80 GB |
| Memory Interface | 5120-bit HBM2e | 5120-bit HBM3 |
| Interconnect | NVLink: 600 GB/s; PCIe Gen 5: 128 GB/s | NVLink: 900 GB/s; PCIe Gen 5: 128 GB/s |
| Max Thermal Design Power (TDP) | 300-350 W | Up to 700 W |

H100 PCIe Form Factor 

The PCIe version of the H100 is designed to slot into standard server racks, making it a versatile option for many data centers. It's a full-length, full-height, double-width card that you can easily integrate into existing infrastructure. 

One of the standout features of the H100 PCIe is its support for PCIe Gen 5. This standard offers significantly improved bandwidth over the previous generations used by GPUs like the A100 (PCIe Gen 4) or V100 (PCIe Gen 3), allowing up to 128 GB/s of bidirectional transfer between the GPU and the rest of your system. 
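A rough way to see what your PCIe link actually delivers is to time a large host-to-device copy from pinned memory. This is a minimal PyTorch sketch under the assumption of a CUDA-capable machine; measured numbers depend on the driver and platform and will fall short of the theoretical peak.

```python
# Crude host-to-device bandwidth probe over PCIe (illustrative sketch).
import torch

size = 1 << 30  # 1 GiB buffer
host = torch.empty(size, dtype=torch.uint8, pin_memory=True)  # pinned host memory
dev = torch.empty(size, dtype=torch.uint8, device="cuda")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
torch.cuda.synchronize()
start.record()
dev.copy_(host, non_blocking=True)  # async copy from pinned host memory
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1000  # elapsed_time() returns milliseconds
print(f"H2D: ~{size / seconds / 1e9:.1f} GB/s")
```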

Many elements of the H100 PCIe are identical to the SXM version, including 80 billion transistors, 50 MB of L2 cache, and 80 GB of GPU memory. The PCIe version comes with an HBM2e memory interface, a lower memory bandwidth of 2 TB/s, and a 300-350 W thermal design power. 

H100 SXM5 Form Factor 

The SXM5 module is designed to be directly integrated into specialized server boards. With a 5120-bit HBM3 memory interface, the H100 SXM5 is capable of up to 3.35 TB/s of memory bandwidth. 
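To get a feel for that number, a large on-device copy is a quick, crude proxy for memory bandwidth: every byte is read once and written once, so effective bandwidth is roughly twice the buffer size over the elapsed time. A hedged PyTorch sketch, again assuming a CUDA machine:

```python
# Crude on-device memory bandwidth probe (illustrative, not STREAM-grade).
import torch

n = 1 << 30  # 1 GiB source buffer
src = torch.empty(n, dtype=torch.uint8, device="cuda")
dst = torch.empty_like(src)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
dst.copy_(src)      # one read + one write of every byte
end.record()
torch.cuda.synchronize()

gb_moved = 2 * n / 1e9  # bytes read + bytes written
print(f"~{gb_moved / (start.elapsed_time(end) / 1000):.0f} GB/s effective")
```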

It supports NVIDIA's fourth-generation NVLink technology, allowing for 900 GB/s of GPU-to-GPU interconnect bandwidth. Combined with third-generation NVSwitch, this enables extremely powerful multi-GPU systems; the SXM5 module also runs at a higher maximum TDP of 700 W. 

H100 PCIe vs SXM5 Performance Comparison 

 

| Metric | NVIDIA H100 PCIe | NVIDIA H100 SXM5 |
| --- | --- | --- |
| Peak FP64 | 24 TFLOPS | 30 TFLOPS |
| Peak FP64 Tensor Core | 48 TFLOPS | 60 TFLOPS |
| Peak FP32 | 48 TFLOPS | 60 TFLOPS |
| Peak FP16 | 96 TFLOPS | 120 TFLOPS |
| Peak BF16 | 96 TFLOPS | 120 TFLOPS |
| Peak TF32 Tensor Core | 400 TFLOPS (800 TFLOPS*) | 500 TFLOPS (1000 TFLOPS*) |
| Peak FP16 Tensor Core | 800 TFLOPS (1600 TFLOPS*) | 1000 TFLOPS (2000 TFLOPS*) |
| Peak BF16 Tensor Core | 800 TFLOPS (1600 TFLOPS*) | 1000 TFLOPS (2000 TFLOPS*) |
| Peak FP8 Tensor Core | 1600 TFLOPS (3200 TFLOPS*) | 2000 TFLOPS (4000 TFLOPS*) |
| Peak INT8 Tensor Core | 1600 TOPS (3200 TOPS*) | 2000 TOPS (4000 TOPS*) |

*With sparsity. 

On paper, the SXM5's peak performance figures are roughly 25% higher than the PCIe's. The major difference comes when you leverage NVLink and its 900 GB/s interconnect, which allows up to eight H100 SXM GPUs to function effectively as one massive GPU. 

For large language model training, where more than one GPU is almost always used, the SXM5 version performs considerably better than the PCIe. In memory-intensive workloads like AI model training, the SXM5's 3.35 TB/s memory bandwidth also translates into much faster training. 
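The communication pattern that dominates that multi-GPU traffic is the gradient all-reduce. Here is a minimal sketch with PyTorch's NCCL backend, assuming a single 8-GPU node launched via `torchrun --nproc_per_node=8`; NCCL routes the collective over NVLink on SXM systems and over PCIe otherwise.

```python
# Minimal all-reduce sketch: the collective behind data-parallel training.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")   # NCCL uses NVLink when available
rank = dist.get_rank()
torch.cuda.set_device(rank)

# ~256 MB of stand-in "gradients" per GPU
grads = torch.randn(64 * 1024 * 1024, device="cuda")
dist.all_reduce(grads, op=dist.ReduceOp.SUM)  # sum across all GPUs
torch.cuda.synchronize()
print(f"rank {rank}: all-reduce done")
dist.destroy_process_group()
```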

H100 SXM5 Use Cases 

The SXM5 version of the H100 is ideally suited for high-performance computing (HPC) and large-scale AI training workloads that demand the absolute peak of GPU performance. It thrives in multi-GPU computational scaling and high-bandwidth inter-GPU communications. 

For HPC applications, the SXM5's superior performance in tasks like lattice QCD and 3D FFT makes it the top choice. The ability to leverage NVLink for fast GPU-to-GPU communication can significantly accelerate complex scientific computations that require frequent data exchange between GPUs. 
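As a small illustration of that FFT workload class, here is a single-GPU 3D FFT in PyTorch (an assumed toolchain; `torch.fft.fftn` dispatches to cuFFT on CUDA devices). At this scale the operation is dominated by memory bandwidth, and distributed multi-GPU FFTs add the inter-GPU traffic where NVLink pays off.

```python
# Single-GPU 3D FFT sketch; torch.fft.fftn calls into cuFFT on CUDA.
import torch

# 512^3 complex64 volume = 1 GiB, comfortably within 80 GB of GPU memory
volume = torch.randn(512, 512, 512, dtype=torch.complex64, device="cuda")
spectrum = torch.fft.fftn(volume)   # 3D FFT over all three dimensions
torch.cuda.synchronize()
print(spectrum.shape)               # torch.Size([512, 512, 512])
```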

In AI training and inference use cases, the SXM5 is the stronger choice for training massive AI models, particularly in natural language processing. For instance, when working with very large language models, the SXM5's multi-GPU scaling can dramatically reduce training times.  

The SXM5 is also well-suited for AI applications that require processing enormous datasets or handling multiple AI tasks simultaneously. Its higher memory bandwidth and better scaling make it optimal for these data-intensive, computationally complex workloads. 

DataCrunch exclusively uses SXM technology on the A100 and H100 GPU series. 

H100 PCIe Use Cases 

The PCIe version of the H100, while still extremely powerful, is better suited for more flexible and diverse deployment scenarios. It's an excellent choice for organizations that need to upgrade their existing infrastructure without a complete overhaul of their server architecture. 

For AI inference workloads, particularly those that don't require extensive multi-GPU communication, the PCIe version can offer a balance of performance and cost-effectiveness. It's well-suited for deploying production AI models in scenarios like recommendation systems, natural language understanding, or computer vision applications where single-GPU performance is sufficient. 

The PCIe form factor could also be suitable for smaller-scale AI training tasks or for organizations that need to distribute their compute resources across multiple locations. For instance, it's a good fit for edge computing scenarios where powerful GPU compute is needed closer to the data source, such as in smart manufacturing plants or autonomous vehicle testing facilities. 

Bottom Line on PCIe vs SXM5 

Ultimately, the choice between SXM5 and PCIe depends on your performance needs, infrastructure, scalability requirements, and budget. The SXM5 offers superior performance for the most demanding workloads, while the PCIe provides a more flexible, widely compatible option that still delivers impressive performance across a broad range of AI and HPC applications. 

Let’s end with some good news. Availability of NVIDIA’s Tensor Core GPUs like the A100 and H100 is improving on premium cloud GPU platforms. You don’t need to make a massive infrastructure investment to test the capabilities of one or more H100s for your needs. Spin up an instance with DataCrunch or reach out for a tailored quote.