Data Movement in NVIDIA's Superchip Era: MEMCPY Analysis from Grace Hopper GH200

We recently got our hands on an NVIDIA GH200 Grace Hopper system, courtesy of Supermicro and NVIDIA. The following blog details some of our initial findings. The NVIDIA white paper gives a good overview of the system. The key takeaway is that the GH200 combines two separate processors - a CPU and a GPU - under a single unified memory space, the first time NVIDIA has offered such a product. According to NVIDIA Grace Hopper Superchip Architecture In-Depth, the main innovations behind this system are:

  1. NVIDIA Grace CPU:
    • Up to 72 Arm Neoverse V2 cores with Armv9.0-A ISA and 4×128-bit SIMD units per core.
    • Up to 480 GB of LPDDR5X memory delivering up to 546 GB/s of memory bandwidth.
  2. NVIDIA Hopper GPU:
    • Up to 144 SMs with fourth-generation Tensor Cores.
    • Up to 96 GB of HBM3 memory delivering up to 4000 GB/s.
  3. NVIDIA NVLink-C2C:
    • Hardware-coherent interconnect between the Grace CPU and Hopper GPU.
    • Up to 900 GB/s total bandwidth, 450 GB/s per direction.

Figure 1: An overview of the memory and bandwidth for the GH200.

The two systems are joined by an NVLink-C2C connection, which offers a total bandwidth of 900 GB/s (450 GB/s per direction), roughly 7 times that of a conventional PCIe Gen 5 x16 link. The key selling feature, however, is the integrated 576 GB memory pool (480 GB of LPDDR5X plus 96 GB of HBM3) that both the CPU and the GPU can access directly.

This unified memory changes how we think about data movement in GPU computing. Traditionally, data is constantly shuffled between CPU DRAM and GPU HBM; anyone who has worked with PyTorch knows the familiar tensor.to('cuda') call that moves data to the GPU before processing. With the GH200, that explicit movement becomes optional, since both processors share the same memory space. At the same time, NVIDIA has kept the GH200 compatible with any H200 workflow: it uses the same sm_90 compute capability, so it supports the same instruction set and CUDA features.
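
As a concrete illustration, the minimal CUDA C++ sketch below (our own example, not one of the benchmark scripts discussed later) lets a kernel dereference a plain malloc pointer. On the GH200 this works because NVLink-C2C keeps system memory coherent between the CPU and the GPU; on a conventional PCIe-attached GPU the same data would typically have to be copied to device memory first.

```
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Toy kernel that doubles every element in place.
__global__ void scale(float *data, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const size_t n = 1ull << 24;                    // 16M floats (64 MiB), illustrative
    float *data = (float *)malloc(n * sizeof(float));
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;  // first touch on the CPU, in LPDDR5X

    // On the GH200 the kernel can dereference this plain malloc pointer directly:
    // NVLink-C2C keeps the CPU and GPU views of system memory coherent, so no
    // cudaMemcpy or cudaHostRegister is required before the launch.
    scale<<<(unsigned)((n + 255) / 256), 256>>>(data, n);
    cudaDeviceSynchronize();

    printf("data[0] = %.1f\n", data[0]);            // prints 2.0 if the kernel ran
    free(data);
    return 0;
}
```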

To understand how the system works in practice, we designed three benchmark tests:

  1. The speed of asynchronous memory copies between system memory (LPDDR5X) and GPU memory (HBM3).
  2. The performance of a memory-bound GPU kernel using data from different memory locations and API calls (malloc, cudaMallocManaged, cudaMalloc).
  3. The impact of cache flushing on unified memory performance.

Async memory copies

For the memory transfer tests, we used async memory copies from host to device for one-way transfers, then added a device-to-host copy for round-trip measurements. We compared these results against a standard H200 system with separate CPU and GPU memory to see the difference that unified memory makes.
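
For reference, the sketch below shows how such a measurement can be set up with cudaMemcpyAsync and CUDA events; the pinned host allocation, single stream, and single timed repetition are simplifying assumptions of ours, not a reproduction of the actual benchmark script.

```
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 500ull * 1024 * 1024;     // 500 MiB buffer, as in Table 1

    float *h_buf = nullptr, *d_buf = nullptr;
    cudaMallocHost((void **)&h_buf, bytes);        // pinned host memory for async copies
    cudaMalloc((void **)&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up copy so one-off driver/setup costs are not measured.
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    // Timed host-to-device copy; add a device-to-host copy here for the round trip.
    cudaEventRecord(start, stream);
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D bandwidth: %.1f GiB/s\n",
           (bytes / (1024.0 * 1024.0 * 1024.0)) / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```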

System | Buffer Size | Host->Device (GiB/s) | Round-trip (GiB/s)
H200   | 500 MiB     | 57                   | 41
GH200  | 500 MiB     | 135                  | 65

Table 1: Memory bandwidth of async memory movements. The script can be found here.

The results in Table 1 show that, in practice, the async transfer path on the GH200 is roughly 2.4 times faster for a one-way transfer and roughly 1.6 times faster for a round trip. In principle the NVLink-C2C connection should be the limiting factor at 450 GB/s per direction, yet the measured numbers stay well below that, so there appears to be additional overhead preventing the transfers from reaching the link's maximum bandwidth.

Benchmarking kernel memory access

Getting the best performance from the GH200 comes down to understanding the speed characteristics of the different memory locations. While other researchers have published comprehensive microbenchmarking studies (Understanding Data Movement in Tightly Coupled Heterogeneous Systems: A Case Study with the Grace Hopper Superchip), we wanted to test how malloc, cudaMalloc, and cudaMallocManaged behave on this particular system.

To test the various bandwidths, we used a simple memory-bound kernel that performs an in-place update (a read and a write) on each memory element, with no optimizations for memory access patterns. We also found that one setting is important: for the NVIDIA driver to migrate data allocated with malloc, the default Linux kernel page size must be changed from 4 KiB to 64 KiB.
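
The kernel we have in mind has roughly the following shape (a minimal sketch; the exact kernel in the benchmark script may differ in details such as the element type and launch configuration):

```
// Memory-bound kernel: one read and one write per element, grid-stride loop,
// no tiling or vectorised loads - it stresses bandwidth rather than compute.
__global__ void inplace_update(float *data, size_t n) {
    size_t idx    = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = idx; i < n; i += stride) {
        data[i] += 1.0f;                 // in-place read-modify-write
    }
}
```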

Allocation Function       | Initial Location | Final Location    | Data Moved (GiB) | Avg Time (ms) | Peak Bandwidth (GiB/s)
malloc (4 KiB page size)  | CPU DDR (Node 0) | CPU DDR (Node 0)  | 1.0              | 9.14          | 218.78
malloc (64 KiB page size) | CPU DDR (Node 0) | CPU DDR (Node 1)  | 1.0              | 0.95          | 2112.36
cudaMallocManaged         | CPU DDR (Node 0) | GPU HBM (Node -2) | 1.0              | 0.91          | 2192.39
cudaMalloc                | GPU HBM          | GPU HBM           | 1.0              | 0.84          | 2390.23

Table 2: Benchmark of a kernel with different initial memory allocation functions.

We ran the simple memory-bound kernel multiple times on data allocated with the different API functions, with peak bandwidth recorded in Table 2. Primarily, the experiment shows that the GPU can directly access system memory (Node 0 in our system), though at roughly one tenth of the HBM3 bandwidth. However, when using cudaMallocManaged or malloc with 64 KiB pages, data placed on Node 0 migrates to HBM3 during execution, showing up as node -2 or node 1 in our NUMA topology.
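
For illustration, the three allocation paths can be driven through the same kernel roughly as follows; the 1 GiB buffer size, launch configuration, and repetition count are our own assumptions, and the per-launch timing that produced Table 2 is omitted.

```
#include <cuda_runtime.h>
#include <cstdlib>

// Same in-place update kernel as in the sketch above.
__global__ void inplace_update(float *data, size_t n) {
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        data[i] += 1.0f;
}

int main() {
    const size_t n = 1ull << 28;                  // 256M floats = 1 GiB, as in Table 2
    const size_t bytes = n * sizeof(float);

    // 1) Plain malloc: first touch on the CPU places the pages in LPDDR5X (node 0);
    //    with 64 KiB kernel pages the driver may migrate them once the GPU touches them.
    float *p_malloc = (float *)malloc(bytes);

    // 2) Managed memory: the driver migrates pages between LPDDR5X and HBM3
    //    based on observed access patterns.
    float *p_managed = nullptr;
    cudaMallocManaged((void **)&p_managed, bytes);

    // 3) Device memory: allocated directly in HBM3.
    float *p_device = nullptr;
    cudaMalloc((void **)&p_device, bytes);

    // First touch on the CPU so the first two buffers start out on node 0.
    for (size_t i = 0; i < n; ++i) { p_malloc[i] = 0.0f; p_managed[i] = 0.0f; }

    float *buffers[] = {p_malloc, p_managed, p_device};
    for (float *p : buffers) {
        for (int rep = 0; rep < 10; ++rep)        // per-launch timing omitted here
            inplace_update<<<1024, 256>>>(p, n);
        cudaDeviceSynchronize();
    }

    cudaFree(p_device);
    cudaFree(p_managed);
    free(p_malloc);
    return 0;
}
```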

Testing the impact of cache flushing

Finally, we tested what happens when the GPU accesses a specific block of LPDDR5X-resident memory multiple times and we then attempt to flush the caches, namely by having the GPU process a much larger, different block of memory that is also located in LPDDR5X.
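
The sketch below shows how such an eviction experiment can be structured; the buffer sizes and launch configuration are our own assumptions, and the per-launch timing that produced Table 3 is omitted.

```
#include <cuda_runtime.h>
#include <cstdlib>

// Same in-place update kernel as before.
__global__ void inplace_update(float *data, size_t n) {
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        data[i] += 1.0f;
}

int main() {
    const size_t small_n = 1ull << 25;            // 128 MiB buffer under test
    const size_t big_n   = 1ull << 30;            // 4 GiB buffer used to evict the caches

    float *small_buf = (float *)malloc(small_n * sizeof(float));
    float *big_buf   = (float *)malloc(big_n * sizeof(float));
    for (size_t i = 0; i < small_n; ++i) small_buf[i] = 0.0f;   // keep both in LPDDR5X
    for (size_t i = 0; i < big_n; ++i)   big_buf[i]   = 0.0f;

    // 1) A cold access, then a warm access, to the same host-resident buffer.
    inplace_update<<<1024, 256>>>(small_buf, small_n);          // "cold" timing
    cudaDeviceSynchronize();
    inplace_update<<<1024, 256>>>(small_buf, small_n);          // "warm" timing
    cudaDeviceSynchronize();

    // 2) Stream through a much larger host-resident buffer to push the small
    //    buffer's lines out of the caches.
    inplace_update<<<1024, 256>>>(big_buf, big_n);
    cudaDeviceSynchronize();

    // 3) Access the small buffer once more: the "after eviction" timing.
    inplace_update<<<1024, 256>>>(small_buf, small_n);
    cudaDeviceSynchronize();

    free(big_buf);
    free(small_buf);
    return 0;
}
```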

Access Type                              | Time (ms) | Bandwidth (GiB/s)
Initial 'cold' GPU access to host memory | 1.117     | 131
'Warm' GPU access to host memory         | 0.589     | 248
GPU access to host memory after eviction | 680.806   | 0.22

Table 3: Caching speeds up GPU access to LPDDR5X, and flushing the caches causes a significant slowdown.

The results show that caching clearly helps, and that any workload which frequently switches its access pattern over system memory will suffer significantly slower access speeds.

Conclusions

Our testing reveals that the GH200's unified memory architecture delivers significant practical benefits, particularly for memory-intensive AI workloads. The improvement in memory transfer speeds and the elimination of explicit device transfers translate into performance gains for applications that frequently move data between the CPU and GPU. Furthermore, the NVIDIA driver can now migrate data based on access patterns. However, optimizing for a specific AI workflow is less than straightforward: the various allocation APIs interact in different ways, and it is not always clear what triggers the NVIDIA driver's migration behaviour.

The combined 576 GB memory pool provides a much larger resource directly accessible to the GPU, opening up potential for memory-intensive methods with a degree of idleness, such as KV-cache offloading or expert-parameter offloading in MoE models. Furthermore, the faster C2C link should also benefit current workloads with no code changes needed. The Blackwell GB200 and GB300 designs should increase performance further, specifically for inference of larger models, primarily due to the improved 2:1 GPU-to-CPU ratio, higher GPU interconnect bandwidth, and the new NVL72 rack deployment, which is not available for the GH200.

DataCrunch is among the earliest adopters of the Blackwell platform – making B200 SXM6 180GB servers publicly available via our Cloud Platform. You can experience the next generation of AI supercomputing with our secure and scalable GPU infrastructure – hosted in European locations adhering to GDPR compliance and powered by 100% renewable energy.