Larmedias

Explore Scientific Knowledge. Understand Intelligence, Autonomy, and Decision-Making.

Covering everything from the latest trends in AI, engineering, and space science to the mechanisms of intelligence, autonomy, and decision-making. Search for topics of interest and navigate smoothly to related concepts and resources.

Trace the knowledge behind events

Explore latest news and related concepts

Highlighted Story

Decentralized Training Can Help Solve AI’s Energy Woes

<img src="https://spectrum.ieee.org/media-library/illustration-of-several-data-servers-interconnected-across-long-distances.jpg?id=65477795&width=1245&height=700&coordinates=0%2C156%2C0%2C157"/><br/><br/><p><a href="https://spectrum.ieee.org/topic/artificial-intelligence/" target="_self">Artificial intelligence</a><span> harbors an enormous </span><a href="https://spectrum.ieee.org/topic/energy/" target="_self">energy</a><span> appetite. Such constant cravings are evident in the </span><a href="https://spectrum.ieee.org/ai-index-2025" target="_self">hefty carbon footprint</a><span> of the </span><a href="https://spectrum.ieee.org/tag/data-centers" target="_self">data centers</a><span> behind the AI boom and the steady increase over time of </span><a href="https://spectrum.ieee.org/tag/carbon-emissions" target="_self">carbon emissions</a><span> from training frontier </span><a href="https://spectrum.ieee.org/tag/ai-models" target="_self">AI models</a><span>.</span></p><p>No wonder big tech companies are warming up to <a href="https://spectrum.ieee.org/tag/nuclear-energy" target="_self">nuclear energy</a>, envisioning a future fueled by reliable, carbon-free sources. But while <a href="https://spectrum.ieee.org/nuclear-powered-data-center" target="_self">nuclear-powered data centers</a> might still be years away, some in the research and industry spheres are taking action right now to curb AI’s growing energy demands. They’re tackling training as one of the most energy-intensive phases in a model’s life cycle, focusing their efforts on decentralization.</p><p>Decentralization allocates model training across a network of independent nodes rather than relying on one platform or provider. It allows compute to go where the energy is, be it a dormant server sitting in a research lab or a computer in a <a href="https://spectrum.ieee.org/tag/solar-power" target="_self">solar-powered</a> home.
Instead of constructing more data centers that require <a href="https://spectrum.ieee.org/tag/power-grid" target="_self">electric grids</a> to scale up their infrastructure and capacity, decentralization harnesses energy from existing sources, avoiding adding more power into the mix.</p><h2>Hardware in harmony</h2><p>Training AI models is a huge data center sport, synchronized across clusters of closely connected <a href="https://spectrum.ieee.org/tag/gpus" target="_self">GPUs</a>. But as <a href="https://spectrum.ieee.org/mlperf-trends" target="_self">hardware improvements struggle to keep up</a> with the swift rise in size of <a href="https://spectrum.ieee.org/tag/large-language-models" target="_self">large language models</a>, even massive single data centers are no longer cutting it.</p><p>Tech firms are turning to the pooled power of multiple data centers, no matter their location. <a href="https://spectrum.ieee.org/tag/nvidia" target="_self">Nvidia</a>, for instance, launched the <a href="https://developer.nvidia.com/blog/how-to-connect-distributed-data-centers-into-large-ai-factories-with-scale-across-networking/" target="_blank">Spectrum-XGS Ethernet for scale-across networking</a>, which “can deliver the performance needed for large-scale single job AI training and inference across geographically separated data centers.” Similarly, <a href="https://spectrum.ieee.org/tag/cisco" target="_self">Cisco</a> introduced its <a href="https://blogs.cisco.com/sp/the-new-benchmark-for-distributed-ai-networking" rel="noopener noreferrer" target="_blank">8223 router</a>, designed to “connect geographically dispersed AI clusters.”</p><p>Other companies are harvesting idle compute in <a href="https://spectrum.ieee.org/tag/servers" target="_self">servers</a>, sparking the emergence of a <a href="https://spectrum.ieee.org/gpu-as-a-service" target="_self">GPU-as-a-Service</a> business model.
Take <a href="https://akash.network/" rel="noopener noreferrer" target="_blank">Akash Network</a>, a peer-to-peer <a href="https://spectrum.ieee.org/tag/cloud-computing" target="_self">cloud computing</a> marketplace that bills itself as the “Airbnb for data centers.” Those with unused or underused GPUs in offices and smaller data centers register as providers, while those in need of computing power act as tenants who can choose among providers and rent their GPUs.</p><p>“If you look at [AI] training today, it’s very dependent on the latest and greatest GPUs,” says Akash cofounder and CEO <a href="https://www.linkedin.com/in/gosuri" rel="noopener noreferrer" target="_blank">Greg Osuri</a>. “The world is transitioning, fortunately, from only relying on large, high-density GPUs to now considering smaller GPUs.”</p><h2>Software in sync</h2><p>In addition to orchestrating the <a href="https://spectrum.ieee.org/tag/hardware" target="_self">hardware</a>, decentralized AI training also requires algorithmic changes on the <a href="https://spectrum.ieee.org/tag/software" target="_self">software</a> side. This is where <a href="https://cloud.google.com/discover/what-is-federated-learning" rel="noopener noreferrer" target="_blank">federated learning</a>, a form of distributed <a href="https://spectrum.ieee.org/tag/machine-learning" target="_self">machine learning</a>, comes in.</p><p>It starts with an initial version of a global AI model housed in a trusted entity such as a central server.
The server distributes the model to participating organizations, which train it locally on their data and share only the model weights with the trusted entity, explains <a href="https://www.csail.mit.edu/person/lalana-kagal" rel="noopener noreferrer" target="_blank">Lalana Kagal</a>, a principal research scientist at <a href="https://www.csail.mit.edu/" rel="noopener noreferrer" target="_blank">MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL)</a> who leads the <a href="https://www.csail.mit.edu/research/decentralized-information-group-dig" rel="noopener noreferrer" target="_blank">Decentralized Information Group</a>. The trusted entity then aggregates the weights, often by averaging them, integrates them into the global model, and sends the updated model back to the participants. This collaborative training cycle repeats until the model is considered fully trained.</p><p>But there are drawbacks to distributing both data and computation. The constant back-and-forth exchange of model weights, for instance, results in high communication costs. Fault tolerance is another issue.</p><p>“A big thing about AI is that every training step is not fault-tolerant,” Osuri says. “That means if one node goes down, you have to restore the whole batch again.”</p><p>To overcome these hurdles, researchers at <a href="https://deepmind.google/" rel="noopener noreferrer" target="_blank">Google DeepMind</a> developed <a href="https://arxiv.org/abs/2311.08105" rel="noopener noreferrer" target="_blank">DiLoCo</a>, a distributed low-communication optimization <a href="https://spectrum.ieee.org/tag/algorithms" target="_self">algorithm</a>.
DiLoCo forms what <a href="https://spectrum.ieee.org/tag/google-deepmind" target="_self">Google DeepMind</a> research scientist <a href="https://arthurdouillard.com/" rel="noopener noreferrer" target="_blank">Arthur Douillard</a> calls “islands of compute,” where each island consists of a group of <a href="https://spectrum.ieee.org/tag/chips" target="_self">chips</a>. Each island can hold a different chip type, but chips within an island must all be of the same type. Islands are decoupled from each other, and knowledge is synchronized between them only once in a while. This decoupling means islands can perform training steps independently without communicating as often, and chips can fail without interrupting the remaining healthy chips. However, the team’s experiments found diminishing performance beyond eight islands.</p><p>An improved version dubbed <a href="https://arxiv.org/abs/2501.18512" rel="noopener noreferrer" target="_blank">Streaming DiLoCo</a> further reduces the bandwidth requirement by synchronizing knowledge “in a streaming fashion across several steps and without stopping for communicating,” says Douillard. The mechanism is akin to watching a video before it has fully downloaded. “In Streaming DiLoCo, as you do computational work, the knowledge is being synchronized gradually in the background,” he adds.</p><p>AI development platform <a href="https://www.primeintellect.ai/" rel="noopener noreferrer" target="_blank">Prime Intellect</a> implemented a variant of the DiLoCo algorithm as a vital component of its 10-billion-parameter <a href="https://www.primeintellect.ai/blog/intellect-1-release" rel="noopener noreferrer" target="_blank">INTELLECT-1</a> model, trained across five countries spanning three continents.
Upping the ante, <a href="https://0g.ai/" rel="noopener noreferrer" target="_blank">0G Labs</a>, makers of a decentralized AI <a href="https://spectrum.ieee.org/tag/operating-system" target="_self">operating system</a>, <a href="https://0g.ai/blog/worlds-first-distributed-100b-parameter-ai" rel="noopener noreferrer" target="_blank">adapted DiLoCo to train a 107-billion-parameter foundation model</a> across a network of segregated clusters with limited bandwidth. Meanwhile, the popular <a href="https://spectrum.ieee.org/tag/open-source" target="_self">open-source</a> <a href="https://spectrum.ieee.org/tag/deep-learning" target="_self">deep learning</a> framework <a href="https://pytorch.org/projects/pytorch/" rel="noopener noreferrer" target="_blank">PyTorch</a> included DiLoCo in its <a href="https://meta-pytorch.org/torchft/" rel="noopener noreferrer" target="_blank">repository of fault tolerance techniques</a>.</p><p>“A lot of engineering has been done by the community to take our DiLoCo paper and integrate it in a system learning over consumer-grade internet,” Douillard says. “I’m very excited to see my research being useful.”</p><h2>A more energy-efficient way to train AI</h2><p>With hardware and software enhancements in place, decentralized AI training is primed to help solve AI’s energy problem. This approach offers the option of training models “in a cheaper, more resource-efficient, more energy-efficient way,” says MIT CSAIL’s Kagal.</p><p>Douillard admits that while “training methods like DiLoCo are arguably more complex, they provide an interesting tradeoff of system efficiency.” For instance, you can now use data centers in far-apart locations without needing to build ultrafast bandwidth between them.
Douillard adds that fault tolerance is baked in because “the blast radius of a chip failing is limited to its island of compute.”</p><p>Even better, companies can take advantage of existing underutilized processing capacity rather than continuously building new energy-hungry data centers. Betting big on such an opportunity, Akash created its <a href="https://www.youtube.com/watch?v=zAj41xSNPeI" rel="noopener noreferrer" target="_blank">Starcluster program</a>. One of the program’s aims is to tap into solar-powered homes, employing the desktops and laptops within them to train AI models. “We want to convert your home into a fully functional data center,” Osuri says.</p><p>Osuri acknowledges that participating in Starcluster will not be trivial. Beyond solar panels and devices equipped with consumer-grade GPUs, participants would also need to invest in <a href="https://spectrum.ieee.org/tag/batteries" target="_self">batteries</a> for backup power and redundant internet connections to prevent downtime. The Starcluster program is figuring out ways to package all these pieces together and make adoption easier for homeowners, including by collaborating with industry partners to subsidize battery costs.</p><p>Backend work is already underway to enable <a href="https://akash.network/roadmap/aep-60/" rel="noopener noreferrer" target="_blank">homes to participate as providers in the Akash Network</a>, and the team hopes to reach its target by 2027. The Starcluster program also envisions expanding into other solar-powered locations, such as schools and local community sites.</p><p>Decentralized AI training holds much promise to steer AI toward a more environmentally sustainable future. For Osuri, that potential lies in moving AI “to where the energy is instead of moving the energy to where AI is.”</p>
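The training cycle Kagal and Douillard describe (a global model farmed out for independent local steps, with occasional averaging to synchronize) can be sketched with a toy example. This is a minimal illustration under invented assumptions, not DiLoCo itself: the per-island quadratic losses, step counts, and learning rate are all made up, plain gradient descent stands in for the inner optimizer, and simple averaging stands in for the paper's outer momentum step.

```python
# Toy sketch of low-communication distributed training (DiLoCo-style).
# Assumption: each "island" i minimizes its own local loss
# f_i(w) = (w - target_i)^2 on a single scalar parameter w.

ISLAND_TARGETS = [1.0, 3.0, 5.0]   # hypothetical local data optima
INNER_STEPS = 50                   # local steps between synchronizations
OUTER_ROUNDS = 20                  # number of synchronization rounds
LR = 0.1

def local_train(w, target, steps, lr):
    """Run `steps` of plain gradient descent on f(w) = (w - target)^2."""
    for _ in range(steps):
        grad = 2.0 * (w - target)
        w -= lr * grad
    return w

def train():
    global_w = 0.0
    for _ in range(OUTER_ROUNDS):
        # Each island trains independently from the current global model...
        local_ws = [local_train(global_w, t, INNER_STEPS, LR)
                    for t in ISLAND_TARGETS]
        # ...and only now do the islands communicate: the server averages
        # the local results back into the global model, the same
        # aggregate-and-redistribute cycle used in federated learning.
        global_w = sum(local_ws) / len(local_ws)
    return global_w

# Converges toward the average of the island optima (here, near 3.0).
print(train())
```

Note that communication happens only once per outer round: each island performs INNER_STEPS local updates between synchronizations, which is what cuts the bandwidth requirement relative to exchanging weights at every training step.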

Published
Apr 7, 2026, 2:00 PM
Source
IEEE Spectrum AI
IEEE Spectrum AI
Why AI Systems Fail Quietly


More News

Latest Knowledge

Recently added important knowledge

Higher Education and National Security

The intersection of higher education and national security involves the consideration of how educational institutions manage foreign students and researchers in the context of national interests. This can include concerns about intellectual property, research security, and the potential for espionage.

Conflict of Interest

A conflict of interest occurs when an individual or organization has multiple interests, one of which could potentially corrupt the motivation for an act in another. In research, this can lead to biased results and undermine trust in scientific findings.

Aging Research

Aging research focuses on understanding the biological processes that lead to aging and developing interventions to slow down or reverse these processes. This field encompasses various disciplines, including genetics, cellular biology, and biochemistry.

Research Independence

Research independence is the ability of a PhD student to conduct their research autonomously, making decisions about their project direction and methodology. This skill is essential for developing critical thinking and problem-solving abilities in a scientific context.

Cognitive Computing

Cognitive Computing refers to systems that simulate human thought processes in complex situations. It combines elements of AI, machine learning, and data analytics to create systems that can learn from data, reason, and interact naturally with humans.

International Student Policies

International student policies refer to the regulations and guidelines that govern the admission and enrollment of students from other countries in educational institutions. These policies can be influenced by political, economic, and social factors, and can vary significantly from one country to another.