Multi Data Center Training: Prime Intellect

Datacenter-as-a-Computer (DaaC) and the Path to Global-DaaC for LLM Training

Modern AI workloads, especially large language model (LLM) training, treat an entire data center as a single computer – a concept popularized by Barroso et al. as "Datacenter as a Computer" (DaaC). In this paradigm, thousands of GPUs or TPUs within one warehouse-scale computer (WSC) act in concert on one task.

To push boundaries further, hyperscalers and researchers are exploring “Global-DaaC”, where multiple data centers, whether within one region or spread around the world, function as one unified computing entity. This evolution could unlock unprecedented model scale, resource pooling, and efficient ML job scheduling, but it also introduces significant technical challenges around networking and communication efficiency.

Emerging training algorithms like Distributed Low-Communication (DiLoCo) explicitly target these challenges by reducing the frequency and volume of synchronization required between nodes, making it feasible to train LLMs on clusters of machines that are poorly connected or geographically separated.

We investigate key aspects of Global-DaaC, focusing on virtual clusters across nearby data centers, advanced networking hardware, hyperscaler case studies, and future trends in scalability, highlighting how communication-efficient strategies (both hardware and algorithmic) impact large-scale LLM training. We highlight PrimeIntellect as the principal open-source software training infrastructure targeting this multi-datacenter paradigm.

Virtual Clusters Across Close-Proximity Data Centers

Connecting multiple nearby data centers in the same region into one virtual cluster can extend the compute and memory resources available to a single training job. In practice, this means linking facilities within roughly 10-1000 km into a unified “compute grid” that appears to the software as one giant cluster.

Low-latency optical interconnects are crucial in this scenario: dedicated fiber links or metropolitan area networks add only microseconds to a few hundred microseconds of latency. For longer distances (hundreds of kilometers), advanced network gear and direct fiber routes keep latency in the low milliseconds.

Even at the speed of light in fiber (~208,000 km/s), a 1000 km distance imposes about 5 ms of one-way latency (~10 ms round-trip), and real-world telecom equipment adds overhead (see SemiAnalysis Multi-Datacenter Training). This overhead becomes significant in synchronous distributed training, where GPUs must frequently synchronize gradients: even a few milliseconds can become a bottleneck.
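
As a quick sanity check on these numbers, the propagation delay can be estimated directly from the speed of light in glass. The helper below is a purely illustrative sketch (the constants and function name are ours, not from any referenced source), and it ignores switching, routing, and transceiver overheads, which only add to the totals.

```python
# Rough fiber propagation-delay estimate. Real links are slower still because
# of routing detours, switching, and transceiver overhead.
SPEED_OF_LIGHT_KM_S = 299_792            # speed of light in vacuum, km/s
FIBER_REFRACTIVE_INDEX = 1.44            # typical silica fiber
FIBER_SPEED_KM_S = SPEED_OF_LIGHT_KM_S / FIBER_REFRACTIVE_INDEX  # ~208,000 km/s


def one_way_latency_ms(distance_km: float) -> float:
    """Propagation-only latency in one direction, in milliseconds."""
    return distance_km / FIBER_SPEED_KM_S * 1000


for d in (10, 100, 1000):
    print(f"{d:>5} km: {one_way_latency_ms(d):.2f} ms one-way, "
          f"{2 * one_way_latency_ms(d):.2f} ms round-trip")
# 1000 km -> ~4.8 ms one-way, ~9.6 ms round-trip
```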

Network optimizations and distributed scheduling strategies are employed to mitigate these effects. One approach is hierarchical or pipelined synchronization: grouping GPUs by location so that most communication stays within each data center, with only occasional or aggregated updates exchanged across sites. For instance, a training job might perform fast intra-cluster all-reduce operations locally, then do a slower inter-cluster synchronization only after combining results within each site (a form of hierarchical all-reduce). This limits the volume of data traversing long-haul links in each iteration. Similarly, scheduling can ensure that tasks with heavy mutual communication are co-located in the same data center, whereas more independent tasks (or those that can tolerate delay) are distributed.
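
A minimal sketch of what such a hierarchical all-reduce could look like with torch.distributed, assuming one process group per site plus a small cross-site group containing one leader rank per site. The function, group layout, and parameter names below are illustrative assumptions, not PrimeIntellect’s or any vendor’s actual implementation.

```python
import torch
import torch.distributed as dist


def hierarchical_all_reduce(grad: torch.Tensor,
                            intra_site_group,        # ranks within this data center
                            inter_site_group,        # one leader rank per data center
                            is_site_leader: bool,
                            site_leader_global_rank: int,
                            site_size: int,
                            num_sites: int) -> torch.Tensor:
    """Average `grad` across all workers while keeping most traffic local."""
    # Step 1: fast all-reduce inside the data center over the local fabric.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=intra_site_group)
    grad /= site_size

    # Step 2: slower cross-site exchange, performed only by one leader per site,
    # so the long-haul link carries a single pre-aggregated tensor per site.
    if is_site_leader:
        dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=inter_site_group)
        grad /= num_sites

    # Step 3: the leader shares the global average with the rest of its site.
    dist.broadcast(grad, src=site_leader_global_rank, group=intra_site_group)
    return grad
```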

Researchers have shown that when communication frequency is reduced dramatically, multiple clusters can effectively act as one: DeepMind’s DiLoCo algorithm is an example of local-SGD, synchronizing “pseudo-gradients” only once every ~500 training steps. In their experiments on the C4 dataset, DiLoCo running on 8 separate workers achieved model convergence comparable to fully synchronous training while exchanging 500× less data in communication.
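
The local-SGD pattern that DiLoCo builds on can be sketched as follows: each worker takes many local optimizer steps, and an outer step then averages the “pseudo-gradient” (the parameter delta since the last synchronization) across workers. This is a simplified sketch assuming an already-initialized default process group; refer to the DiLoCo paper and OpenDiLoCo code for the actual recipe (AdamW as the inner optimizer, Nesterov-momentum SGD as the outer one).

```python
import torch
import torch.distributed as dist

H = 500  # inner steps between global synchronizations, as in the DiLoCo experiments


def train_local_sgd(model, inner_opt, outer_opt, data_loader, loss_fn, world_size):
    # Snapshot of the last globally agreed parameters.
    global_params = [p.detach().clone() for p in model.parameters()]

    for step, (x, y) in enumerate(data_loader, start=1):
        # Inner loop: purely local updates, no network traffic at all.
        inner_opt.zero_grad()
        loss_fn(model(x), y).backward()
        inner_opt.step()

        if step % H == 0:
            # Outer step: communicate only once every H steps.
            for p, g in zip(model.parameters(), global_params):
                # Pseudo-gradient: how far this worker drifted from the global params.
                pseudo_grad = g - p.detach()
                dist.all_reduce(pseudo_grad, op=dist.ReduceOp.SUM)
                pseudo_grad /= world_size
                # Hand the averaged pseudo-gradient to the outer optimizer and
                # reset local parameters to the global snapshot before it steps.
                p.grad = pseudo_grad
                p.data.copy_(g)
            outer_opt.step()  # e.g. SGD with Nesterov momentum
            global_params = [p.detach().clone() for p in model.parameters()]
```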

In practice, close-proximity multi-datacenter clusters often rely on dedicated fiber paths and optimized routing. Hyperscalers tend to build private regional fiber networks to link their nearby facilities, sometimes deploying dense wavelength-division multiplexing (DWDM) to get terabits of aggregate bandwidth between sites (see SemiAnalysis Multi-Datacenter Training: OpenAI’s Ambitious Plan To Beat Google’s Infrastructure).

Distributed workload scheduling systems decide where to launch parts of a job or which data center should host a training run. In cloud environments, users traditionally had to choose a single region for training to keep data and compute co-located. This manual placement could lead to imbalance, with one region overloaded while another sat idle. New global schedulers like Meta’s MAST abstract away regions and intelligently place workloads and data across sites to balance utilization. MAST’s global scheduling eliminated severe regional overloads by moving jobs to less busy data centers. While MAST schedules an entire job into one region at a time (avoiding cross-region training within the same job), it demonstrates the value of treating multiple data centers as a flexible pool of resources.
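
To make the idea concrete, here is a toy placement policy in that spirit: run the whole job in the least-utilized region that still has enough free GPUs. It is a deliberately simplified illustration (the Region class and place_job function are ours); MAST’s real policy also reasons about data placement, hardware types, and many other constraints.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    total_gpus: int
    used_gpus: int

    @property
    def utilization(self) -> float:
        return self.used_gpus / self.total_gpus


def place_job(job_gpus: int, regions: "list[Region]") -> "Region | None":
    """Toy global placement: send the whole job to the least-utilized region
    that can fit it, instead of making the user pick a region manually."""
    candidates = [r for r in regions if r.total_gpus - r.used_gpus >= job_gpus]
    if not candidates:
        return None  # no single region can host the job right now
    best = min(candidates, key=lambda r: r.utilization)
    best.used_gpus += job_gpus
    return best


# A 1,024-GPU job lands in the emptier region instead of the overloaded one.
regions = [Region("eu-north", 8192, 7000), Region("us-east", 8192, 2000)]
print(place_job(1024, regions).name)  # -> "us-east"
```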

In the future, similar schedulers could allocate parts of a single training job across close-proximity sites (when network conditions allow) or dynamically migrate workloads between data centers for efficiency. Overall, building a virtual cluster spanning a few data centers is feasible today at distances up to a few hundred kilometers – but it demands careful network engineering and algorithmic adjustments (like less frequent synchronization) to hide the added latency.

HW-SW Co-Design For Inter-DC Training

To make multiple data centers operate like one computer, networking hardware must minimize the overhead of inter-node communication. Modern AI superclusters already leverage specialized interconnects within a single data center – technologies such as NVIDIA InfiniBand or RDMA over Converged Ethernet (RoCE) (see Meta’s Llama 3 training ROCE cluster) networks provide high bandwidth (200-400 Gbps per link) and low latency (single-digit microseconds) for GPU-to-GPU data exchange.

Extending these capabilities across data centers requires equally advanced wide-area networking hardware.

Several key pieces of infrastructure are needed to extend to the multi-datacenter paradigm.

Ultimately, bridging multiple data centers requires a combination of cutting-edge networking hardware and clever software strategies. High-bandwidth, low-latency links (InfiniBand or advanced Ethernet with RDMA) and smart NICs (DPUs) provide the physical capability to exchange tens of gigabytes per second between distant machines. Meanwhile, distributed training frameworks exploit those capabilities with communication-efficient algorithms and by overlapping network I/O with computation. Together, these innovations ensure that scaling LLM training beyond a single data center remains efficient. In effect, the network becomes as much a part of the “computer” as the GPUs and CPUs themselves.
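
As an illustration of overlapping network I/O with computation, collectives can be launched asynchronously and waited on only when their results are actually needed. The sketch below uses torch.distributed’s async_op handles; the gradient buckets and the compute callback are hypothetical placeholders rather than any specific framework’s API.

```python
import torch.distributed as dist


def overlapped_grad_sync(grad_buckets, do_more_backward_work):
    """Start an async all-reduce per gradient bucket, keep computing while the
    transfers are in flight, and block only right before the optimizer step."""
    handles = []
    for bucket in grad_buckets:
        # async_op=True returns immediately with a work handle instead of blocking.
        handles.append(dist.all_reduce(bucket, op=dist.ReduceOp.SUM, async_op=True))
        # Overlap: continue the backward pass while this bucket is on the wire.
        do_more_backward_work()

    # Synchronize only when the averaged gradients are actually required.
    for handle in handles:
        handle.wait()

    world_size = dist.get_world_size()
    for bucket in grad_buckets:
        bucket /= world_size
```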

Hyperscalers: Multi-DC Case Studies

Large cloud companies and hyperscalers have been at the forefront of pushing multi-datacenter training solutions, out of both necessity and opportunity. Physical constraints like construction lead times, power availability, and cooling capacity mean no single data center can house infinite GPUs. In this section, we examine how some hyperscalers are tackling this and what benchmarks or results have been reported:

In summary, here are the main differences between short-distance and long-distance training:

PrimeIntellect: Towards a Global-DaaC

The concept of a Global Datacenter-as-a-Computer envisions pooling compute across continents to act as one massive, virtual supercomputer. PrimeIntellect builds on this idea, combining the power of several regional GPU clusters to break past the limits of any single facility and to reduce costs based on demand and GPU availability in each region. The potential benefits include faster training for ultra-large models, better utilization of worldwide spare capacity, and fault-tolerant training jobs (jobs that can survive one data center going down by relying on the others). However, achieving this vision requires advanced engineering to solve key challenges in both infrastructure and software.

Scalability across multiple data centers hinges on communication efficiency. As discussed, synchronizing every gradient update globally in lock-step does not scale once latencies reach tens of milliseconds: the math of distributed training and Amdahl’s Law dictates diminishing returns, or even stalls, beyond a certain number of nodes. PrimeIntellect uses asynchronous local-SGD training algorithms that reduce communication volume and overhead by performing many local updates and aggregating models only periodically. That kind of improvement is exactly what global-scale training needs: less frequent communication, overlapped with compute.

Another crucial aspect is software-defined networking (SDN) and intelligent infrastructure. In a Global-DaaC scenario, the network connecting data centers becomes just as important as their local networks. SDN can be used to dynamically route traffic on optimal paths, reserve bandwidth for synchronous all-reduce bursts, and even reconfigure topology on the fly. For example, if two clusters need to exchange a large amount of data during the next hour, the SDN controller can be used to establish a dedicated optical circuit between them (bypassing congested routers) to guarantee low latency and high throughput. We may also see in-network aggregation – network switches or DPUs that can perform reduction operations (summing gradients) as data flows through them. This would cut down on total bytes sent across long links by combining updates en route. Such capabilities are already in prototypes; for instance, some InfiniBand switch designs support all-reduce offloading in hardware. If each regional cluster sends its gradients to a middle point where they get averaged, and only the result is forwarded, it reduces redundant data transmission.
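
In software, the “aggregate at a middle point” idea can be emulated by reducing to a single rank and broadcasting the result back, so the long-haul links carry one combined tensor instead of every site’s copy; true in-network aggregation performs the same reduction on the switch or DPU itself. The function and rank layout below are illustrative assumptions.

```python
import torch.distributed as dist


def aggregate_at_midpoint(grad, aggregator_rank, wan_group):
    """Emulated in-network aggregation: every regional leader sends its gradient
    toward one aggregation point, which averages them and returns only the result."""
    # All members of wan_group contribute; only aggregator_rank ends up with the sum.
    dist.reduce(grad, dst=aggregator_rank, op=dist.ReduceOp.SUM, group=wan_group)
    if dist.get_rank() == aggregator_rank:
        grad /= dist.get_world_size(group=wan_group)
    # Send the averaged result back to every regional leader.
    dist.broadcast(grad, src=aggregator_rank, group=wan_group)
    return grad
```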

From an infrastructure perspective, latency remains the hardest physical barrier. While bandwidth can be increased almost arbitrarily with more fibers or better modulation (we are seeing 800 Gbps and 1.6 Tbps transceivers emerge), latency is limited by physics and routing overhead. A multi-tier optimizer is a feasible approach: fully synchronous updates within a data center, slower asynchronous exchanges between data centers, and periodic model merges across continents. This multi-tier model is akin to how caching works in CPUs (L1, L2, L3 caches): each level has a different speed, and the system is designed to use the faster, smaller levels frequently and the slower, larger levels rarely.
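
A sketch of what such a tiered schedule might look like, with cadences chosen purely for illustration (the group names and periods below are assumptions, and a production system would make the slower tiers asynchronous rather than blocking as they are here):

```python
import torch.distributed as dist

# Illustrative cadences only; real values depend on each tier's latency and bandwidth.
INTER_DC_EVERY = 100           # regional parameter averaging
CROSS_CONTINENT_EVERY = 1000   # rare global model merge


def tiered_sync(step, model, intra_dc_group, inter_dc_group, global_group):
    """Fast tiers synchronize often, slow tiers rarely (like L1/L2/L3 caches).
    ReduceOp.AVG assumes the NCCL backend."""
    # Tier 1: fully synchronous gradient averaging inside the data center, every step.
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.AVG, group=intra_dc_group)

    # Tier 2: average parameters between nearby data centers every INTER_DC_EVERY steps.
    if step % INTER_DC_EVERY == 0:
        for p in model.parameters():
            dist.all_reduce(p.data, op=dist.ReduceOp.AVG, group=inter_dc_group)

    # Tier 3: occasional model merge across continents.
    if step % CROSS_CONTINENT_EVERY == 0:
        for p in model.parameters():
            dist.all_reduce(p.data, op=dist.ReduceOp.AVG, group=global_group)
```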

Another challenge is consistency and accuracy. Asynchronous updates can lead to higher training variance or slower convergence. Because of this, researchers will need to ensure that algorithms like DiLoCo scale to more workers without loss of model quality. PrimeIntellect’s INTELLECT-1 represents the first open-source decentralized LLM training run at a reasonable scale: a 10B-parameter model. Although 10B parameters is small by frontier-LLM standards and the model uses a dense architecture, the implications for scaling training further are clear. DataCrunch participated in the worldwide training, becoming the first EU node in a multi-datacenter, intercontinental training run.

For GPU cloud infrastructure providers such as DataCrunch, the march toward Global-DaaC presents both an opportunity and a technical hurdle. On one hand, being able to offer clients a “single virtual supercomputer” composed of all our distributed GPU hubs could be a game-changer: a customer could, for example, run a job on twice the GPUs by transparently using two of our sites at once. It would enable us to gradually combine multiple smaller data centers to match the scale of a hyperscaler. On the other hand, to make this feasible, we must invest in high-quality networking between our sites and support the right software stack. We would need to ensure that our data centers (perhaps in different cities or countries) are linked with sufficient bandwidth and low enough latency, likely via direct fiber or high-end cloud exchange networks rather than the public internet. We would also need to incorporate distributed training frameworks that implement communication-efficient algorithms. This might involve adopting open-source projects like OpenDiLoCo (which is built on PyTorch and the Hivemind library) in our platform, so that our users can opt into a low-communication training mode when spanning multiple locations.

In conclusion, we wish to increase the visibility of PrimeIntellect’s open-source projects such as OpenDiLoCo, which already enables multi-datacenter LLM training at metro-area (10-100 km), regional (100-1000 km), and transcontinental (thousands of kilometers) scale, avoiding the huge slowdowns that strictly synchronous training would incur and pushing past what was previously possible.


References

  1. Luiz André Barroso, Jimmy Clidaras, & Urs Hölzle (2013). The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
  2. Arthur Douillard, Qixuan Feng, Andrei A. Rusu, Rachita Chhaparia, Yani Donchev, Adhiguna Kuncoro, Marc'Aurelio Ranzato, Arthur Szlam, & Jiajun Shen. (2024). DiLoCo: Distributed Low-Communication Training of Language Models.
  3. PrimeIntellect
  4. SemiAnalysis Multi-Datacenter Training: OpenAI’s Ambitious Plan To Beat Google’s Infrastructure Gigawatt Clusters, Telecom Networking, Long Haul Fiber, Hierarchical & Asynchronous SGD, Distributed Infrastructure Winners
  5. Cowan, M., Maleki, S., Musuvathi, M., Saarikivi, O., & Xiong, Y. (2023). MSCCLang: Microsoft Collective Communication Language. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (pp. 502–514). Association for Computing Machinery
  6. Arnab Choudhury, Yang Wang, Tuomas Pelkonen, Kutta Srinivasan, Abha Jain, Shenghao Lin, Delia David, Siavash Soleimanifard, Michael Chen, Abhishek Yadav, Ritesh Tijoriwala, Denis Samoylov, & Chunqiang Tang (2024). MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24) (pp. 563–580). USENIX Association.
  7. Building Meta’s GenAI Infrastructure
  8. DeepSeek-AI. (2024). DeepSeek-V3 Technical Report
  9. DeepSeek DualPipe
  10. DeepSeek DeepEP
  11. NVIDIA’s NCCL
  12. Sami Jaghouar, Jack Min Ong, & Johannes Hagemann. (2024). OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training.
  13. Paul Barham, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat, Steven Hand, Dan Hurt, Michael Isard, Hyeontaek Lim, Ruoming Pang, Sudip Roy, Brennan Saeta, Parker Schuh, Ryan Sepassi, Laurent El Shafey, Chandramohan A. Thekkath, & Yonghui Wu. (2022). Pathways: Asynchronous Distributed Dataflow for ML.
  14. World’s largest publicly-disclosed LLM training run using TPU v5e chips
  15. Multislice Training
  16. Latency Numbers Every Programmer Should Know
  17. Intellect-1: Technical Report