Decentralized Training Can Help Solve AI’s Energy Woes
<img src="https://spectrum.ieee.org/media-library/illustration-of-several-data-servers-interconnected-across-long-distances.jpg?id=65477795&width=1245&height=700&coordinates=0%2C156%2C0%2C157"/><br/><br/><p><a href="https://spectrum.ieee.org/topic/artificial-intelligence/" target="_self">Artificial intelligence</a> harbors an enormous <a href="https://spectrum.ieee.org/topic/energy/" target="_self">energy</a> appetite. Such constant cravings are evident in the <a href="https://spectrum.ieee.org/ai-index-2025" target="_self">hefty carbon footprint</a> of the <a href="https://spectrum.ieee.org/tag/data-centers" target="_self">data centers</a> behind the AI boom and the steady increase over time of <a href="https://spectrum.ieee.org/tag/carbon-emissions" target="_self">carbon emissions</a> from training frontier <a href="https://spectrum.ieee.org/tag/ai-models" target="_self">AI models</a>.</p><p>No wonder big tech companies are warming up to <a href="https://spectrum.ieee.org/tag/nuclear-energy" target="_self">nuclear energy</a>, envisioning a future fueled by reliable, carbon-free sources. But while <a href="https://spectrum.ieee.org/nuclear-powered-data-center" target="_self">nuclear-powered data centers</a> might still be years away, some in the research and industry spheres are taking action right now to curb AI’s growing energy demands. They’re tackling training as one of the most energy-intensive phases in a model’s life cycle, focusing their efforts on decentralization.</p><p>Decentralization allocates model training across a network of independent nodes rather than relying on one platform or provider. It allows compute to go where the energy is—be it a dormant server sitting in a research lab or a computer in a <a href="https://spectrum.ieee.org/tag/solar-power" target="_self">solar-powered</a> home. 
Instead of constructing more data centers that require <a href="https://spectrum.ieee.org/tag/power-grid" target="_self">electric grids</a> to scale up their infrastructure and capacity, decentralization harnesses energy from existing sources, avoiding adding more power into the mix.</p><h2>Hardware in harmony</h2><p>Training AI models is a huge data center sport, synchronized across clusters of closely connected <a href="https://spectrum.ieee.org/tag/gpus" target="_self">GPUs</a>. But as <a href="https://spectrum.ieee.org/mlperf-trends" target="_self">hardware improvements struggle to keep up</a> with the swift rise in size of <a href="https://spectrum.ieee.org/tag/large-language-models" target="_self">large language models</a>, even massive single data centers are no longer cutting it.</p><p>Tech firms are turning to the pooled power of multiple data centers—no matter their location. <a href="https://spectrum.ieee.org/tag/nvidia" target="_self">Nvidia</a>, for instance, launched the <a href="https://developer.nvidia.com/blog/how-to-connect-distributed-data-centers-into-large-ai-factories-with-scale-across-networking/" target="_blank">Spectrum-XGS Ethernet for scale-across networking</a>, which “can deliver the performance needed for large-scale single job AI training and inference across geographically separated data centers.” Similarly, <a href="https://spectrum.ieee.org/tag/cisco" target="_self">Cisco</a> introduced its <a href="https://blogs.cisco.com/sp/the-new-benchmark-for-distributed-ai-networking" rel="noopener noreferrer" target="_blank">8223 router</a>, designed to “connect geographically dispersed AI clusters.”</p><p>Other companies are harvesting idle compute in <a href="https://spectrum.ieee.org/tag/servers" target="_self">servers</a>, sparking the emergence of a <a href="https://spectrum.ieee.org/gpu-as-a-service" target="_self">GPU-as-a-Service</a> business model. 
Take <a href="https://akash.network/" rel="noopener noreferrer" target="_blank">Akash Network</a>, a peer-to-peer <a href="https://spectrum.ieee.org/tag/cloud-computing" target="_self">cloud computing</a> marketplace that bills itself as the “Airbnb for data centers.” Those with unused or underused GPUs in offices and smaller data centers register as providers, while those in need of computing power act as tenants, choosing among providers and renting their GPUs.</p><p>“If you look at [AI] training today, it’s very dependent on the latest and greatest GPUs,” says Akash cofounder and CEO <a href="https://www.linkedin.com/in/gosuri" rel="noopener noreferrer" target="_blank">Greg Osuri</a>. “The world is transitioning, fortunately, from only relying on large, high-density GPUs to now considering smaller GPUs.”</p><h2>Software in sync</h2><p>In addition to orchestrating the <a href="https://spectrum.ieee.org/tag/hardware" target="_self">hardware</a>, decentralized AI training also requires algorithmic changes on the <a href="https://spectrum.ieee.org/tag/software" target="_self">software</a> side. This is where <a href="https://cloud.google.com/discover/what-is-federated-learning" rel="noopener noreferrer" target="_blank">federated learning</a>, a form of distributed <a href="https://spectrum.ieee.org/tag/machine-learning" target="_self">machine learning</a>, comes in.</p><p>It starts with an initial version of a global AI model housed in a trusted entity such as a central server. 
The server distributes the model to participating organizations, which train it locally on their data and share only the model weights with the trusted entity, explains <a href="https://www.csail.mit.edu/person/lalana-kagal" rel="noopener noreferrer" target="_blank">Lalana Kagal</a>, a principal research scientist at <a href="https://www.csail.mit.edu/" rel="noopener noreferrer" target="_blank">MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL)</a> who leads the <a href="https://www.csail.mit.edu/research/decentralized-information-group-dig" rel="noopener noreferrer" target="_blank">Decentralized Information Group</a>. The trusted entity then aggregates the weights, often by averaging them, integrates them into the global model, and sends the updated model back to the participants. This collaborative training cycle repeats until the model is considered fully trained.</p><p>But there are drawbacks to distributing both data and computation. The constant back-and-forth exchange of model weights, for instance, results in high communication costs. Fault tolerance is another issue.</p><p>“A big thing about AI is that every training step is not fault-tolerant,” Osuri says. “That means if one node goes down, you have to restore the whole batch again.”</p><p>To overcome these hurdles, researchers at <a href="https://deepmind.google/" rel="noopener noreferrer" target="_blank">Google DeepMind</a> developed <a href="https://arxiv.org/abs/2311.08105" rel="noopener noreferrer" target="_blank">DiLoCo</a>, a distributed low-communication optimization <a href="https://spectrum.ieee.org/tag/algorithms" target="_self">algorithm</a>. 
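The collaborative cycle described above—distribute, train locally, return only weights, average—can be sketched in a few lines. This is a toy illustration of federated averaging with NumPy arrays standing in for model weights; all names and hyperparameters here are illustrative, not from any specific framework.

```python
# Toy sketch of the federated-learning cycle: the server ships the global
# model out, participants train on private data, and only weights return.
import numpy as np

rng = np.random.default_rng(0)

def local_step(weights, data):
    """Stand-in for local training: one gradient-like update on private data."""
    grad = weights - data.mean()              # toy 'gradient' toward local data
    return weights - 0.1 * grad

# The trusted entity holds the initial global model.
global_weights = np.zeros(4)

# Each participant's data stays local; it never leaves the node.
client_data = [rng.normal(loc=i, size=(32, 4)) for i in range(3)]

for _ in range(20):                           # collaborative training rounds
    # 1. Server distributes the current global model to participants.
    # 2. Each participant trains locally and returns only its weights.
    client_weights = [local_step(global_weights.copy(), d) for d in client_data]
    # 3. Server aggregates by averaging and updates the global model.
    global_weights = np.mean(client_weights, axis=0)

print(global_weights)  # drifts toward the mean of all clients' data
```

Note that the server never sees raw data, only weights—which is the privacy argument for federated learning—but every round still moves a full copy of the weights in each direction, which is exactly the communication cost discussed next.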
DiLoCo forms what <a href="https://spectrum.ieee.org/tag/google-deepmind" target="_self">Google DeepMind</a> research scientist <a href="https://arthurdouillard.com/" rel="noopener noreferrer" target="_blank">Arthur Douillard</a> calls “islands of compute,” where each island consists of a group of <a href="https://spectrum.ieee.org/tag/chips" target="_self">chips</a>. Each island can run on a different chip type, but chips within an island must all be of the same type. Islands are decoupled from each other, and knowledge is synchronized between them only occasionally. This decoupling means islands can perform training steps independently without communicating as often, and chips can fail without interrupting the remaining healthy chips. The team’s experiments, however, found diminishing performance beyond eight islands.</p><p>An improved version, dubbed <a href="https://arxiv.org/abs/2501.18512" rel="noopener noreferrer" target="_blank">Streaming DiLoCo</a>, further reduces the bandwidth requirement by synchronizing knowledge “in a streaming fashion across several steps and without stopping for communicating,” says Douillard. The mechanism is akin to watching a video that hasn’t been fully downloaded yet. “In Streaming DiLoCo, as you do computational work, the knowledge is being synchronized gradually in the background,” he adds.</p><p>AI development platform <a href="https://www.primeintellect.ai/" rel="noopener noreferrer" target="_blank">Prime Intellect</a> implemented a variant of the DiLoCo algorithm as a vital component of its 10-billion-parameter <a href="https://www.primeintellect.ai/blog/intellect-1-release" rel="noopener noreferrer" target="_blank">INTELLECT-1</a> model, trained across five countries spanning three continents. 
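The structure of DiLoCo—many independent inner steps per island, punctuated by a rare outer synchronization—can be pictured with a toy sketch. This is not the paper's implementation; the hyperparameters, the toy objective, and the simple averaging of deltas are all illustrative assumptions.

```python
# Toy sketch of DiLoCo's pattern: islands train alone for many inner
# steps, then exchange information in one infrequent outer update.
import numpy as np

rng = np.random.default_rng(1)

INNER_STEPS = 50      # steps each island runs with zero network traffic
OUTER_ROUNDS = 10     # rare synchronizations between islands
NUM_ISLANDS = 4

# Each island trains toward slightly different (private) data, modeled
# here as a per-island target vector.
island_targets = [rng.normal(loc=2.0, scale=0.5, size=8) for _ in range(NUM_ISLANDS)]

global_w = np.zeros(8)

for _ in range(OUTER_ROUNDS):
    deltas = []
    for target in island_targets:
        w = global_w.copy()
        # Inner loop: the island computes independently; no communication.
        for _ in range(INNER_STEPS):
            w -= 0.05 * (w - target)          # toy gradient step
        deltas.append(global_w - w)           # the island's "outer gradient"
    # Outer step: the only moment islands exchange anything. Here the
    # deltas are simply averaged and applied to the global model.
    global_w -= np.mean(deltas, axis=0)

# Communication happens OUTER_ROUNDS times, instead of
# OUTER_ROUNDS * INNER_STEPS times in fully synchronous training.
print(global_w)
```

The fault-tolerance benefit falls out of the same structure: if one island's inner loop dies, the others' loops are untouched, and only that island's delta is missing from the next outer step.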
Upping the ante, <a href="https://0g.ai/" rel="noopener noreferrer" target="_blank">0G Labs</a>, makers of a decentralized AI <a href="https://spectrum.ieee.org/tag/operating-system" target="_self">operating system</a>, <a href="https://0g.ai/blog/worlds-first-distributed-100b-parameter-ai" rel="noopener noreferrer" target="_blank">adapted DiLoCo to train a 107-billion-parameter foundation model</a> across a network of segregated clusters with limited bandwidth. Meanwhile, the popular <a href="https://spectrum.ieee.org/tag/open-source" target="_self">open-source</a> <a href="https://spectrum.ieee.org/tag/deep-learning" target="_self">deep learning</a> framework <a href="https://pytorch.org/projects/pytorch/" rel="noopener noreferrer" target="_blank">PyTorch</a> included DiLoCo in its <a href="https://meta-pytorch.org/torchft/" rel="noopener noreferrer" target="_blank">repository of fault-tolerance techniques</a>.</p><p>“A lot of engineering has been done by the community to take our DiLoCo paper and integrate it in a system learning over consumer-grade internet,” Douillard says. “I’m very excited to see my research being useful.”</p><h2>A more energy-efficient way to train AI</h2><p>With hardware and software enhancements in place, decentralized AI training is primed to help solve AI’s energy problem. The approach offers the option of training models “in a cheaper, more resource-efficient, more energy-efficient way,” says MIT CSAIL’s Kagal.</p><p>Douillard admits that “training methods like DiLoCo are arguably more complex,” but says “they provide an interesting tradeoff of system efficiency.” For instance, you can now use data centers in far-apart locations without needing to build ultrafast bandwidth in between. 
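One way to picture the bandwidth relief of fragment-wise, streaming-style synchronization: instead of shipping the whole model in one burst at each synchronization slot, only one fragment crosses the network per slot, round-robin. The numbers below are made up for illustration; this is back-of-the-envelope arithmetic, not any system's actual traffic.

```python
# Toy arithmetic: splitting synchronization into fragments cuts each
# round's bandwidth burst by the number of fragments, while every
# fragment is still refreshed once every NUM_FRAGMENTS slots.
MODEL_PARAMS = 1_000_000     # illustrative model size
NUM_FRAGMENTS = 4
BYTES_PER_PARAM = 4          # float32

full_burst = MODEL_PARAMS * BYTES_PER_PARAM    # whole model at once
stream_burst = full_burst // NUM_FRAGMENTS     # one fragment at a time

# Which fragment each synchronization slot ships, round-robin.
schedule = [slot % NUM_FRAGMENTS for slot in range(8)]

print(full_burst, stream_burst, schedule)
# prints: 4000000 1000000 [0, 1, 2, 3, 0, 1, 2, 3]
```

The peak link requirement between locations is what drives the need for ultrafast interconnects, so shrinking each burst (and overlapping it with computation, as in the video-download analogy) is what lets far-apart data centers cooperate over ordinary links.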
Douillard adds that fault tolerance is baked in because “the blast radius of a chip failing is limited to its island of compute.”</p><p>Even better, companies can take advantage of existing underutilized processing capacity rather than continuously building new energy-hungry data centers. Betting big on such an opportunity, Akash created its <a href="https://www.youtube.com/watch?v=zAj41xSNPeI" rel="noopener noreferrer" target="_blank">Starcluster program</a>. One of the program’s aims is to tap into solar-powered homes and employ the desktops and laptops within them to train AI models. “We want to convert your home into a fully functional data center,” Osuri says.</p><p>Osuri acknowledges that participating in Starcluster will not be trivial. Beyond solar panels and devices equipped with consumer-grade GPUs, participants would also need to invest in <a href="https://spectrum.ieee.org/tag/batteries" target="_self">batteries</a> for backup power and redundant internet connections to prevent downtime. The Starcluster program is figuring out ways to package all these pieces together and make participation easier for homeowners, including collaborating with industry partners to subsidize battery costs.</p><p>Backend work is already underway to enable <a href="https://akash.network/roadmap/aep-60/" rel="noopener noreferrer" target="_blank">homes to participate as providers in the Akash Network</a>, and the team hopes to reach its target by 2027. The Starcluster program also envisions expanding into other solar-powered locations, such as schools and local community sites.</p><p>Decentralized AI training holds much promise to steer AI toward a more environmentally sustainable future. For Osuri, that potential lies in moving AI “to where the energy is instead of moving the energy to where AI is.”</p>
- Published
- Apr 7, 2026, 2:00 PM
- Source
- IEEE Spectrum AI