Datacenter-as-a-Computer (DaaC) and the Path to Global-DaaC for LLM Training
Modern AI workloads, especially large language model (LLM) training, treat an entire data center as a single computer – a concept popularized by Barroso et al. as "Datacenter as a Computer" (DaaC). In this paradigm, thousands of GPUs or TPUs within one warehouse-scale computer (WSC) act in concert on one task.
To push boundaries further, hyperscalers and researchers are exploring “Global-DaaC”, where multiple data centers, within a region or around the world, function as one unified computing entity. This evolution could unlock unprecedented model scale, resource pooling, and efficient ML job scheduling, but it also introduces significant technical challenges around networking and communication efficiency.
Emerging training algorithms like Distributed Low-Communication (DiLoCo) explicitly target these challenges by reducing the frequency and volume of synchronization required between nodes, making it feasible to train LLMs on clusters of machines that are poorly connected or geographically separated.
We investigate key aspects of Global-DaaC, focusing on virtual clusters across nearby data centers, advanced networking hardware, hyperscaler case studies, and future trends in scalability, showing how communication-efficient strategies (both hardware and algorithmic) affect large-scale LLM training. We highlight PrimeIntellect as the principal open-source SW training infrastructure targeting this multi-datacenter paradigm.
Virtual Clusters Across Close-Proximity Data Centers
Connecting multiple nearby data centers in the same region into one virtual cluster extends the compute and memory resources available to a single training job. In practice, this means linking facilities within roughly 10-1000 km into a unified “compute grid” that appears to the software as one giant cluster.
Low-latency optical interconnects are crucial in this scenario: dedicated fiber links or metropolitan area networks add only microseconds to a few hundred microseconds of latency. For longer distances (hundreds of kilometers), advanced network gear and direct fiber routes keep latency in the low milliseconds.
Even at the speed of light in fiber (~208,000 km/s), a 1000 km span imposes about 5 ms of one-way latency (~10 ms round-trip), and real-world telecom equipment adds further overhead (see SemiAnalysis Multi-Datacenter Training). In synchronous distributed training, where GPUs must frequently synchronize gradients, even a few extra milliseconds per round-trip can become a bottleneck.
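As a back-of-envelope check on these numbers, the snippet below computes pure propagation delay from distance, assuming the ~208,000 km/s fiber propagation speed quoted above; real routes add path stretch, transponders, and router hops on top of this.

```python
# Back-of-envelope propagation latency for inter-datacenter links.
# Assumes ~208,000 km/s group velocity in optical fiber; real links add
# equipment overhead and rarely follow a straight path.

SPEED_IN_FIBER_KM_PER_S = 208_000

def one_way_latency_ms(distance_km: float, path_stretch: float = 1.0) -> float:
    """Propagation delay only; path_stretch > 1 models non-direct fiber routes."""
    return distance_km * path_stretch / SPEED_IN_FIBER_KM_PER_S * 1000

for km in (10, 100, 1000):
    print(f"{km:>5} km: {one_way_latency_ms(km):.2f} ms one-way, "
          f"{2 * one_way_latency_ms(km):.2f} ms round-trip")
# e.g. 1000 km -> ~5 ms one-way, ~10 ms round-trip, before equipment overhead
```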
Network optimizations and distributed scheduling strategies are employed to mitigate these effects. One approach is hierarchical or pipelined synchronization: grouping GPUs by location so that most communication stays within each data center, with only occasional or aggregated updates exchanged across sites. For instance, a training job might perform fast intra-cluster all-reduce operations locally, then do a slower inter-cluster synchronization only after combining results within each site (a form of hierarchical all-reduce). This limits the volume of data traversing long-haul links in each iteration. Similarly, scheduling can ensure that tasks with heavy mutual communication are co-located in the same data center, whereas more independent tasks (or those that can tolerate delay) are distributed.
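Below is a minimal sketch of this hierarchical pattern using PyTorch's `torch.distributed`, assuming the process group is already initialized and that ranks are laid out site by site; real systems would use topology-aware NCCL collectives or framework-level support rather than hand-rolled groups.

```python
import torch
import torch.distributed as dist

def hierarchical_all_reduce(tensor: torch.Tensor, site_id: int, num_sites: int):
    """Average `tensor` across all ranks while keeping most traffic intra-site.

    Sketch only: assumes dist.init_process_group() was already called and that
    each site holds world_size // num_sites consecutive ranks.
    """
    world_size = dist.get_world_size()
    ranks_per_site = world_size // num_sites

    # 1. One group per site plus a group of per-site "leaders".
    #    (Every rank must call new_group for every group; in practice these
    #    groups would be created once and cached, not rebuilt per call.)
    site_groups = [dist.new_group(list(range(s * ranks_per_site, (s + 1) * ranks_per_site)))
                   for s in range(num_sites)]
    leader_group = dist.new_group([s * ranks_per_site for s in range(num_sites)])

    my_site_group = site_groups[site_id]
    i_am_leader = dist.get_rank() % ranks_per_site == 0

    # 2. Fast intra-site reduction over the local fabric (NVLink / InfiniBand).
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=my_site_group)

    # 3. Only the site leaders exchange data over the slow inter-site link.
    if i_am_leader:
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=leader_group)

    # 4. Leaders redistribute the global result inside their own site.
    leader_rank = site_id * ranks_per_site
    dist.broadcast(tensor, src=leader_rank, group=my_site_group)

    tensor /= world_size  # turn the global sum into an average
```

The key property is that the gradient payload crosses the long-haul link only once per site per iteration, instead of once per GPU.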
Researchers have shown that when communication frequency is reduced dramatically, multiple clusters can effectively act as one: DeepMind’s DiLoCo algorithm is an example of local SGD, synchronizing “pseudo-gradients” only once every ~500 training steps. In their experiments on the C4 dataset, DiLoCo running on 8 separate workers achieved convergence comparable to fully synchronous training while communicating 500× less data.
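A simplified sketch of one DiLoCo round on a single worker is shown below. It paraphrases the published recipe (an inner optimizer such as AdamW runs ~500 local steps, then an averaged “pseudo-gradient” is applied by an outer Nesterov-momentum SGD); `get_batch` and `loss_fn` are placeholders, and the optimizers are assumed to be constructed once outside this function so their state persists across rounds.

```python
import torch
import torch.distributed as dist

H = 500  # local steps between synchronizations

def diloco_round(model, inner_opt, outer_opt, get_batch, loss_fn):
    """One DiLoCo round on one worker (sketch; dist must already be initialized)."""
    # Snapshot of the globally agreed parameters at the start of the round.
    snapshot = [p.detach().clone() for p in model.parameters()]

    # 1. H purely local steps: no inter-worker traffic during this phase.
    for _ in range(H):
        inner_opt.zero_grad()
        loss_fn(model, get_batch()).backward()
        inner_opt.step()

    # 2. Pseudo-gradient = how far this worker drifted from the shared snapshot,
    #    averaged across workers. This is the only communication in the round.
    world_size = dist.get_world_size()
    with torch.no_grad():
        for p, p0 in zip(model.parameters(), snapshot):
            delta = p0 - p.detach()
            dist.all_reduce(delta, op=dist.ReduceOp.SUM)
            p.grad = delta / world_size   # feed the outer optimizer via .grad
            p.copy_(p0)                   # rewind to the shared snapshot

    # 3. The outer optimizer applies the averaged pseudo-gradient to the snapshot,
    #    producing the next round's shared parameters on every worker.
    outer_opt.step()
```

The only network traffic is the single all-reduce of pseudo-gradients per round, which is where the ~500× reduction in communicated data comes from.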
In practice, close-proximity multi-datacenter clusters often rely on dedicated fiber paths and optimized routing. Hyperscalers tend to build private regional fiber networks to link their nearby facilities, sometimes deploying dense wavelength-division multiplexing (DWDM) to get terabits of aggregate bandwidth between sites (see SemiAnalysis Multi-Datacenter Training: OpenAI’s Ambitious Plan To Beat Google’s Infrastructure).
Distributed workload scheduling systems decide where to launch parts of a job or which data center should host a training run. In cloud environments, users traditionally had to choose a single region for training to keep data and compute co-located. This manual placement could lead to imbalance, with one region overloaded while another sat idle. New global schedulers like Meta’s MAST abstract away regions and intelligently place workloads and data across sites to balance utilization. MAST’s global scheduling eliminated severe regional overloads by moving jobs to less busy data centers. While MAST schedules entire jobs to one region at a time (avoiding cross-region training in the same job), it demonstrates the value of treating multiple data centers as a flexible pool of resources.
In the future, similar schedulers could allocate parts of a single training job across close-proximity sites (when network conditions allow) or dynamically migrate workloads between data centers for efficiency. Overall, building a virtual cluster spanning a few data centers is feasible today at distances up to a few hundred kilometers – but it demands careful network engineering and algorithmic adjustments (like less frequent synchronization) to hide the added latency.
HW-SW Co-Design For Inter-DC Training
To make multiple data centers operate like one computer, networking hardware must minimize the overhead of inter-node communication. Modern AI superclusters already leverage specialized interconnects within a single data center – technologies such as NVIDIA InfiniBand or RDMA over Converged Ethernet (RoCE) networks (see Meta’s Llama 3 training RoCE cluster) provide high bandwidth (200-400 Gbps per link) and low latency (single-digit microseconds) for GPU-to-GPU data exchange.
Extending these capabilities across data centers requires equally advanced wide-area networking hardware.
Key infrastructure for extending to the multi-datacenter paradigm includes:
- InfiniBand and RDMA: InfiniBand (IB) is a high-performance network protocol commonly used in AI clusters. It supports RDMA (Remote Direct Memory Access), which lets one machine directly read/write the memory of another without involving the CPU or operating system, drastically cutting communication latency. However, traditional IB is designed for on-premises clusters; emerging technologies like RDMA over WAN and long-haul IB extensions offer ways to carry this model across site boundaries.
- DPUs (Data Processing Units): Specialized smart NICs like NVIDIA BlueField-3 offload network processing tasks from the CPU and manage RDMA traffic in hardware. A BlueField DPU, often called a “SuperNIC”, can manage GPU-to-GPU communication across nodes with minimal latency penalty. In NVIDIA’s architecture, each server node can include a BlueField-3 NIC that supports 400 Gbps InfiniBand (NDR) and features integrated Arm cores and accelerators to handle packet routing, RDMA operations, and even collective communication logic. This means gradient data can be transferred directly from one GPU to another across the network with near line-rate speed and very little CPU involvement. The BlueField NIC effectively becomes a gateway connecting GPUs to the network fabric, ensuring that as soon as data leaves one GPU, it gets on a low-latency path to the target GPU in another node (or even another data center). By offloading communication to DPUs, the system can overlap computation and communication more efficiently.
- GPU Peer-to-Peer Remote Communication: Technologies like GPUDirect RDMA allow GPUs to communicate across the network as if they were doing device-to-device transfers. GPUDirect RDMA lets the NIC read data directly from GPU memory and write data into GPU memory on the other end, bypassing additional copies. This is crucial for improving performance, as it eliminates staging buffers on the host. In effect, one GPU can send a chunk of model gradients to another GPU in a different server (or potentially a different data center) through a single hop from its VRAM to the NIC and across the wire. Combined with PCIe switch improvements and NVLink/NVSwitch inside each node (for intra-node speed), these features create a unified fabric where both intra-node and inter-node communication are highly optimized. Modern GPU clusters have the following hierarchy of interconnects:
- NVLink within a server
- InfiniBand or RoCE within a rack or a data center
- Extended fiber links between data centers
- Overlapping Computation with Communication: On the software side, deep learning frameworks employ techniques to hide communication latency by overlapping it with useful computation (see pytorch feature... "AutoFSDP: grouping parameters to overlap communication with compute", DeepSeek V3 3.2.2 Efficient Implementation of Cross-Node All-to-All Communication, DeepSeek DualPipe, DeepSeek DeepEP). For example, during backpropagation, gradients for the later layers (which are computed first) can start to all-reduce across nodes while earlier layers are still computing their gradients; a minimal sketch of this pattern follows this list. Libraries like NVIDIA’s NCCL implement non-blocking collective operations so that network transfers proceed in parallel with GPU work. In large clusters, using a well-tuned all-reduce algorithm (e.g. tree-based or ring-based) is vital. In an inter-datacenter context, overlapping and pipelining communications are even more important. Techniques such as gradient accumulation (doing multiple forward/backprop passes before syncing) and asynchronous parameter updates (where nodes do not wait for every single update before continuing) further amplify this effect. The DiLoCo algorithm mentioned earlier is essentially an extreme form of overlap: it accumulates hundreds of local update steps (keeping GPUs busy with computation) before incurring a communication round, thereby keeping GPU utilization high even when network bandwidth is limited. PrimeIntellect built its training SW infrastructure around DiLoCo’s core ideas, leading to OpenDiLoCo.
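As a concrete illustration of the overlap idea, the sketch below launches each layer's gradient all-reduce as a non-blocking collective as soon as that layer's gradients exist, and only waits on them once all backward work has been issued. The helper names are placeholders; production frameworks (e.g. PyTorch DDP's bucketed gradient hooks) implement the same pattern more efficiently.

```python
import torch
import torch.distributed as dist

def overlapped_grad_sync(layers_in_backward_order, compute_backward_for):
    """Illustrative overlap of gradient all-reduce with ongoing backprop.

    `layers_in_backward_order` and `compute_backward_for` are placeholders: the
    point is that each layer's all-reduce is launched asynchronously as soon as
    its gradients are ready, so network transfers run concurrently with the
    remaining backward computation.
    """
    pending = []
    for layer in layers_in_backward_order:
        compute_backward_for(layer)  # produces .grad for this layer's parameters
        for p in layer.parameters():
            if p.grad is not None:
                # Non-blocking collective: returns immediately with a work handle.
                work = dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, async_op=True)
                pending.append((work, p))

    # Wait only after all backward compute has been issued; by then most of the
    # transfers have already completed "for free" behind the computation.
    world_size = dist.get_world_size()
    for work, p in pending:
        work.wait()
        p.grad /= world_size
```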
Ultimately, bridging multiple data centers requires a combination of cutting-edge networking hardware and clever software strategies. High-bandwidth, low-latency links (InfiniBand or advanced Ethernet with RDMA) and smart NICs (DPUs) provide the physical capability to exchange tens of gigabytes per second between distant machines. Meanwhile, distributed training frameworks exploit those capabilities with communication-efficient algorithms and by overlapping network I/O with computation. Together, these innovations ensure that scaling LLM training beyond a single data center remains efficient. In effect, the network becomes as much a part of the “computer” as the GPUs and CPUs themselves.
Hyperscalers: Multi-DC Case Studies
Large cloud companies and hyperscalers have been at the forefront of pushing multi-datacenter training solutions, out of both necessity and opportunity. Physical constraints like construction lead times, power availability, and cooling capacity mean no single data center can house infinite GPUs. In this section, we examine how some hyperscalers are tackling this and what benchmarks or results have been reported:
- Google’s Multi-Cluster Training (Pathways and Multislice): Google has long advocated for treating its infrastructure as a global computer. Its Pathways system is designed as a distributed ML platform that in principle can route parts of a model or dataset to different resources, even across regions, in an asynchronous dataflow manner. In practice, early Pathways usage still ran jobs within a single region synchronously, but it laid the groundwork for more geo-distributed training. In 2023, Google Cloud demonstrated the world’s largest publicly-disclosed LLM training run using TPU v5e chips. It leveraged a technique called Multislice Training that connects multiple TPU pods (slices) into one giant virtual pod, spanning 50,944 TPU v5e chips to train a single 32B-parameter model. This is essentially a multi-datacenter or multi-module cluster (since a single TPU pod typically resides in one data center). The successful run shows that with the right networking (Google’s custom optical interconnects) and software orchestration, tens of thousands of accelerators can be synchronized. While details of latency hiding were not fully public, the achievement underscores that short-distance multi-datacenter training is feasible at scale – presumably, Google kept these pods within a region or continent to manage latency.
- Microsoft’s and OpenAI’s Hierarchical Approach: Azure has a global fiber network interconnecting its regions. OpenAI and Microsoft have been investing in dedicated bandwidth between their data centers. One strategy highlighted in analyses relies on using hierarchical and asynchronous SGD to cope with WAN latencies. In essence, Azure could organize training such that each region (or availability zone) performs local synchronizations frequently and only synchronizes with other regions infrequently or asynchronously.
- Meta’s (Facebook’s) Distributed Training Utilization: So far, Meta’s strategy has relied on building very large single-site clusters, but its recent work provides insight relevant to multi-datacenter setups. Meta’s Research SuperCluster (RSC) has 16,000 GPUs in one location, and Meta recently announced two new clusters with 24,000+ GPUs each for GenAI workloads. Meta managed to train its large models (like Llama 3) on the RoCE-based 24k GPU cluster without network bottlenecks. This is encouraging for multi-datacenter ideas: if standard Ethernet with RDMA can handle thousands of GPUs with no slowdown, then linking two such clusters with similar high-speed links might also be workable. Meta’s global scheduler (MAST, discussed earlier) is another piece of the puzzle – while it currently avoids splitting one job across regions, it shows how a hyperscaler can dynamically allocate resources worldwide for training demand. It is plausible that Meta could in the future run a single training job across multiple smaller clusters using its software stack (e.g. PyTorch Distributed), adopting algorithms for elasticity, latency tolerance, and low-communication training (e.g. DiLoCo).
In summary, here is how short-distance training differs from long-distance training:
- Short distances (e.g. between two data centers in the same metro area) offer high bandwidth and sub-millisecond latency, making it relatively straightforward to extend clusters.
- Long distances (inter-continental) introduce tens of milliseconds latency and a higher risk of packet loss or outages, which currently force a move toward asynchronous updates and the implementation of novel training optimizers.
PrimeIntellect: Towards a Global-DaaC
The concept of a Global Datacenter-as-a-Computer envisions pooling compute across continents to act as one massive, virtual supercomputer. PrimeIntellect builds on this idea, combining the power of several regional GPU clusters to break past the limits of any single facility and to reduce costs based on demand and GPU availability per region. The potential benefits include faster training for ultra-large models, better utilization of worldwide spare capacity, and fault-tolerant training jobs (jobs that can survive one data center going down by relying on the others). However, achieving this vision requires advanced engineering to solve key challenges in both infrastructure and software.
Scalability across multiple data centers hinges on communication efficiency. As discussed, synchronizing every gradient update globally in lock-step does not scale when latencies reach tens of milliseconds – the math of distributed training and Amdahl’s Law dictate diminishing returns or even stalls beyond a certain number of nodes. PrimeIntellect uses asynchronous local-SGD training algorithms to reduce communication volume and overhead by performing many local updates and aggregating models only periodically. That kind of improvement is exactly what global-scale training needs: reducing communication frequency and overlapping it with compute.
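To make the savings concrete, here is a rough cost calculation with assumed figures (a 10B-parameter model in bf16, a ring all-reduce, 8 workers, and 100 Gbps of inter-site bandwidth per worker); these numbers are illustrative, not PrimeIntellect’s actual configuration.

```python
# Rough amortized communication cost per training step, with assumed numbers.
PARAMS = 10e9                  # 10B-parameter model (cf. INTELLECT-1's scale)
BYTES_PER_PARAM = 2            # bf16
LINK_GBPS = 100                # assumed inter-site bandwidth per worker
WORKERS = 8

payload_gb = PARAMS * BYTES_PER_PARAM / 1e9                   # ~20 GB of parameters
# A ring all-reduce sends roughly 2*(N-1)/N times the payload per worker.
sent_gb = 2 * (WORKERS - 1) / WORKERS * payload_gb            # ~35 GB
sync_seconds = sent_gb * 8 / LINK_GBPS                        # ~2.8 s per synchronization

for steps_between_syncs in (1, 500):
    per_step_ms = sync_seconds / steps_between_syncs * 1000
    print(f"sync every {steps_between_syncs:>3} steps -> "
          f"{per_step_ms:.1f} ms of communication amortized per step")
# Syncing every step adds seconds of pure communication per step; syncing every
# 500 steps (DiLoCo-style) amortizes the same transfer to a few milliseconds.
```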
Another crucial aspect is software-defined networking (SDN) and intelligent infrastructure. In a Global-DaaC scenario, the network connecting data centers becomes just as important as their local networks. SDN can be used to dynamically route traffic on optimal paths, reserve bandwidth for synchronous all-reduce bursts, and even reconfigure topology on the fly. For example, if two clusters need to exchange a large amount of data during the next hour, the SDN controller can be used to establish a dedicated optical circuit between them (bypassing congested routers) to guarantee low latency and high throughput. We may also see in-network aggregation – network switches or DPUs that can perform reduction operations (summing gradients) as data flows through them. This would cut down on total bytes sent across long links by combining updates en route. Such capabilities are already in prototypes; for instance, some InfiniBand switch designs support all-reduce offloading in hardware. If each regional cluster sends its gradients to a middle point where they get averaged, and only the result is forwarded, it reduces redundant data transmission.
From an infrastructure perspective, latency remains the hardest physical barrier. While bandwidth can be increased almost arbitrarily with more fibers or better modulation (we are seeing 800 Gbps and 1.6 Tbps transceivers emerge), latency is limited by physics and routing overhead. A multi-tier optimizer is a feasible approach: fully synchronous updates within a data center, slower asynchronous exchanges between data centers, and higher-level periodic model merges across continents. This multi-tier model is akin to how caching works in CPUs (L1, L2, L3 caches) – each level has a different speed, and the system is designed to use the faster, smaller levels frequently and the larger, slower ones sparingly.
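One way to picture this cache-like hierarchy is as a schedule that fires different synchronization tiers at different periods. The sketch below is purely illustrative: the tier names and periods are assumptions, not an existing API.

```python
from dataclasses import dataclass

@dataclass
class SyncTier:
    name: str    # which links this tier uses
    period: int  # synchronize every `period` optimizer steps

# Hypothetical three-tier schedule, analogous to L1/L2/L3 caches:
# the fast, local tier fires constantly; the slow, global tier fires rarely.
TIERS = [
    SyncTier("intra-datacenter (NVLink/InfiniBand all-reduce)", period=1),
    SyncTier("inter-datacenter (regional model average)",       period=100),
    SyncTier("inter-continent (global model merge)",            period=2000),
]

def tiers_due(step: int) -> list:
    """Return the tiers that should synchronize at this optimizer step."""
    return [t for t in TIERS if step % t.period == 0]

# Over 2000 steps the local tier fires 2000 times, the regional tier 20 times,
# and the global tier once -- most bytes never leave the fast local fabric.
counts = {t.name: 0 for t in TIERS}
for step in range(1, 2001):
    for t in tiers_due(step):
        counts[t.name] += 1
print(counts)
```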
Another challenge is consistency and accuracy. Asynchronous updates can lead to larger training variance or slower convergence, so researchers will need to ensure that algorithms like DiLoCo scale to more workers without loss of model quality. PrimeIntellect’s INTELLECT-1 represents the first open-source decentralized LLM training run at a meaningful scale: a 10B-parameter model. Although 10B parameters is still small by frontier standards and the model uses a dense architecture, the implications for further scaling are clear. DataCrunch participated in this worldwide run, becoming the first EU node in a multi-datacenter, intercontinental training job.
For GPU cloud infrastructure providers such as DataCrunch, the march toward Global-DaaC presents both an opportunity and a technical hurdle. On one hand, being able to offer clients a “single virtual supercomputer” composed of all our distributed GPU hubs could be a game-changer – a customer could run a job on 2× the GPUs by transparently using two of our sites at once, for example. It would let us gradually aggregate multiple smaller data centers to approach the scale of a hyperscaler. On the other hand, to make this feasible, we must invest in high-quality networking between our sites and support the right software stack. We would need to ensure that our data centers (perhaps in different cities or countries) are linked with sufficient bandwidth and low enough latency, likely via direct fiber or high-end cloud exchange networks, rather than relying on the public internet. We would also need to incorporate distributed training frameworks that implement communication-efficient algorithms. This might involve adopting open-source projects like OpenDiLoCo (which is built on PyTorch and the Hivemind library) in our platform. That way, our users could opt into a low-communication training mode when spanning multiple locations.
In conclusion, we want to raise the visibility of PrimeIntellect’s open-source projects like OpenDiLoCo, which already enable multi-datacenter LLM training at metro (10-100 km), regional (100-1000 km), and transcontinental (thousands of kilometers) scales without the huge slowdowns that strict synchronous training would impose, pushing past what was previously possible.
References
- Luiz André Barroso, Jimmy Clidaras, & Urs Hölzle (2013). The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
- Arthur Douillard, Qixuan Feng, Andrei A. Rusu, Rachita Chhaparia, Yani Donchev, Adhiguna Kuncoro, Marc'Aurelio Ranzato, Arthur Szlam, & Jiajun Shen. (2024). DiLoCo: Distributed Low-Communication Training of Language Models.
- PrimeIntellect
- SemiAnalysis Multi-Datacenter Training: OpenAI’s Ambitious Plan To Beat Google’s Infrastructure Gigawatt Clusters, Telecom Networking, Long Haul Fiber, Hierarchical & Asynchronous SGD, Distributed Infrastructure Winners
- Cowan, M., Maleki, S., Musuvathi, M., Saarikivi, O., & Xiong, Y. (2023). MSCCLang: Microsoft Collective Communication Language. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (pp. 502–514). Association for Computing Machinery
- Arnab Choudhury, Yang Wang, Tuomas Pelkonen, Kutta Srinivasan, Abha Jain, Shenghao Lin, Delia David, Siavash Soleimanifard, Michael Chen, Abhishek Yadav, Ritesh Tijoriwala, Denis Samoylov, & Chunqiang Tang (2024). MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24) (pp. 563–580). USENIX Association.
- Building Meta’s GenAI Infrastructure
- DeepSeek-AI. (2024). DeepSeek-V3 Technical Report
- DeepSeek DualPipe
- DeepSeek DeepEP
- NVIDIA’s NCCL
- Sami Jaghouar, Jack Min Ong, & Johannes Hagemann. (2024). OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training.
- Paul Barham, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat, Steven Hand, Dan Hurt, Michael Isard, Hyeontaek Lim, Ruoming Pang, Sudip Roy, Brennan Saeta, Parker Schuh, Ryan Sepassi, Laurent El Shafey, Chandramohan A. Thekkath, & Yonghui Wu. (2022). Pathways: Asynchronous Distributed Dataflow for ML.
- World’s largest publicly-disclosed LLM training run using TPU v5e chips
- Multislice Training
- Latency Numbers Every Programmer Should Know
- Intellect-1: Technical Report