Larmedias

Explore Scientific Knowledge. Understand Intelligence, Autonomy, and Decision-Making.

Covering everything from the latest trends in AI, engineering, and space science to the mechanisms of intelligence, autonomy, and decision-making. Search for topics of interest and navigate smoothly to related concepts and resources.

Trace the knowledge behind events

Explore latest news and related concepts

Highlighted Story

With Nvidia Groq 3, the Era of AI Inference Is (Probably) Here

<img src="https://spectrum.ieee.org/media-library/a-man-in-all-black-presents-in-front-of-a-large-screen-which-compares-a-large-rectangular-chip-labelled-rubin-gpu-with-a-square.jpg?id=65298681&width=1200&height=400&coordinates=0%2C729%2C0%2C730"/><br/><br/><p>This week, over 30,000 people are descending upon San Jose, Calif., to attend<a href="https://www.nvidia.com/gtc/" rel="noopener noreferrer" target="_blank"> Nvidia GTC</a>, the so-called Super Bowl of AI—a nickname that may or may not have been coined by Nvidia. At the main event, Nvidia CEO Jensen Huang took the stage to announce (among other things) a new line of<a href="https://spectrum.ieee.org/nvidia-rubin-networking" target="_self"> next-generation Vera Rubin</a> chips that represents a first for the GPU giant: a chip designed specifically to handle AI inference. The Nvidia Groq 3 language processing unit (LPU) incorporates intellectual property Nvidia<a href="https://groq.com/newsroom/groq-and-nvidia-enter-non-exclusive-inference-technology-licensing-agreement-to-accelerate-ai-inference-at-global-scale" rel="noopener noreferrer" target="_blank"> licensed</a> from the start-up Groq last Christmas Eve for US $20 billion.</p><p>“Finally, AI is able to do productive work, and therefore the inflection point of inference has arrived,” Huang told the crowd. “AI now has to think. In order to think, it has to inference. AI now has to do; in order to do, it has to inference.”</p><p>Training and inference tasks have distinct computational requirements. Training crunches huge amounts of data in parallel and can take weeks; inference must run on each user query as it arrives. Unlike training, inference doesn’t require running costly<a href="https://spectrum.ieee.org/what-is-deep-learning/backpropagation" target="_self"> backpropagation</a>.
With inference, the most important thing is low latency—users expect the chatbot to answer quickly, and for thinking or reasoning models, inference runs many times before the user even sees an output.</p><p>Over the past few years, inference-specific chip start-ups have undergone a sort of Cambrian explosion, with different companies exploring distinct approaches to speeding up the task. The start-ups include<a href="https://www.d-matrix.ai/" rel="noopener noreferrer" target="_blank"> D-matrix</a> with digital in-memory compute,<a href="https://www.etched.com/" rel="noopener noreferrer" target="_blank"> Etched</a> with an ASIC for transformer inference,<a href="https://rain.ai/" rel="noopener noreferrer" target="_blank"> RainAI</a> with neuromorphic chips,<a href="https://en100.enchargeai.com/" rel="noopener noreferrer" target="_blank"> EnCharge</a> with analog in-memory compute,<a href="https://www.tensordyne.ai/" rel="noopener noreferrer" target="_blank"> Tensordyne</a> with logarithmic math to make AI computations more efficient,<a href="https://furiosa.ai/" rel="noopener noreferrer" target="_blank"> FuriosaAI</a> with hardware optimized for tensor operations rather than vector-matrix multiplication, and others.</p><p>Late last year, it looked like Nvidia had picked a winner among the crop of inference chips when it announced its deal with Groq. The Nvidia Groq 3 LPU reveal came a mere two and a half months later, highlighting the urgency of the growing inference market.</p><h2>Memory bandwidth and data flow</h2><p>Groq’s approach to accelerating inference relies on interleaving processing units with memory units on the chip. Instead of relying on high-bandwidth memory (HBM) situated next to the GPU, the design leans on SRAM integrated within the processor itself.
This design greatly simplifies the flow of data through the chip, allowing it to proceed in a streamlined, linear fashion.</p><p>“The data actually flows directly through the SRAM,”<a href="https://www.linkedin.com/in/markheaps/" rel="noopener noreferrer" target="_blank"> Mark Heaps</a> said at the Supercomputing conference in 2024. Heaps was chief technology evangelist at Groq at the time and is now director of developer marketing at Nvidia. “When you look at a multi-core GPU, a lot of the instruction commands need to be sent off the chip, to get into memory and then come back in. We don’t have that. It all passes through in a linear order.”</p><p>Using SRAM allows that linear data flow to happen exceptionally fast, leading to the low latency required for inference applications. “The LPU is optimized strictly for that extreme low latency token generation,” says<a href="https://www.linkedin.com/in/ian-buck-19201315/" rel="noopener noreferrer" target="_blank"> Ian Buck</a>, VP and general manager of hyperscale and high-performance computing at Nvidia.</p><p>Comparing the Rubin GPU and Groq 3 LPU side by side highlights the difference. The Rubin GPU has access to a whopping 288 gigabytes of HBM and is capable of 50 quadrillion floating-point operations per second (petaFLOPS) of 4-bit computation. The Groq 3 LPU contains a mere 500 megabytes of SRAM and is capable of 1.2 petaFLOPS of 8-bit computation. On the other hand, while the Rubin GPU has a memory bandwidth of 22 terabytes per second, at 150 TB/s the Groq 3 LPU is nearly seven times as fast.
The lean, speed-focused design is what allows the LPU to excel at inference.</p><p>The new inference chip underscores the ongoing trend of AI adoption, which shifts the computational load from just building ever bigger models to actually using those models at scale. “NVIDIA’s announcement validates the importance of SRAM-based architectures for large-scale inference, and no one has pushed SRAM density further than d-Matrix,” says d-Matrix CEO Sid Sheth. He’s betting that data center customers will want a variety of processors for inference. “The winning systems will combine different types of silicon and fit easily into existing data centers alongside GPUs.”</p><p>Inference-only chips may not be the only solution. Late last week, <a href="https://press.aboutamazon.com/aws/2026/3/aws-and-cerebras-collaboration-aims-to-set-a-new-standard-for-ai-inference-speed-and-performance-in-the-cloud" rel="noopener noreferrer" target="_blank">Amazon Web Services</a> said that it will deploy a new kind of inferencing system in its data centers. The system is a combination of AWS’ Trainium <a href="https://spectrum.ieee.org/amazon-ai" target="_self">AI accelerator </a>and <a href="https://spectrum.ieee.org/cerebras-chip-cs3" target="_self">Cerebras Systems’ third-generation computer CS-3</a>, which is built around the <a href="https://spectrum.ieee.org/cerebrass-giant-chip-will-smash-deep-learnings-speed-barrier" target="_self">largest single chip</a> ever made. The two-part system is meant to take advantage of a technique called inference disaggregation. It separates inference into two parts—processing the prompt, called prefill, and generating the output, called decode. Prefill is inherently parallel, computationally intensive, and doesn’t need much memory bandwidth, while decode is a more serial process that needs a lot of memory bandwidth. Cerebras has attacked the memory bandwidth problem by building more than 44 GB of SRAM on its chip, connected by a 21 PB/s network.
</p><p>Nvidia, too, intends to take advantage of inference disaggregation in its new combined compute tray, called the Nvidia Groq 3 LPX. Each tray will house eight Groq 3 LPUs and a Vera Rubin, which pairs Rubin GPUs with a Vera CPU. The prefill and the more computationally intensive parts of the decode are done on Vera Rubin, while the final part is done on the Groq 3 LPU, leveraging the strengths of each chip. “We’re in volume production now,” Huang said.</p>
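Why decode favors a high-bandwidth chip can be sanity-checked with a back-of-the-envelope roofline estimate. The sketch below is not based on any Nvidia or Groq tooling; it uses only the bandwidth and FLOPS figures quoted in the article, plus a hypothetical 70-billion-parameter model served at 8-bit precision. It also simplifies by treating each chip's bandwidth as if the whole model fit in its memory; with only 500 MB of SRAM per LPU, the weights would in practice be sharded across many chips.

```python
def decode_time_per_token(weight_bytes, bandwidth_bytes_s, flops_per_token, flops_s):
    """Per-token decode latency is bounded by whichever is slower:
    streaming the model weights from memory, or doing the arithmetic."""
    memory_time = weight_bytes / bandwidth_bytes_s
    compute_time = flops_per_token / flops_s
    return max(memory_time, compute_time)

# Hypothetical 70B-parameter model at 8-bit precision:
# ~70 GB of weights, roughly 2 FLOPs per parameter per generated token.
PARAMS = 70e9
WEIGHT_BYTES = PARAMS * 1          # 1 byte per parameter at 8 bits
FLOPS_PER_TOKEN = 2 * PARAMS

# Chip figures quoted in the article.
rubin = decode_time_per_token(WEIGHT_BYTES, 22e12, FLOPS_PER_TOKEN, 50e15)
lpu = decode_time_per_token(WEIGHT_BYTES, 150e12, FLOPS_PER_TOKEN, 1.2e15)

print(f"Rubin GPU:  {rubin * 1e3:.2f} ms/token")
print(f"Groq 3 LPU: {lpu * 1e3:.2f} ms/token")
```

Both chips come out memory-bound for decode, so the latency gap tracks the bandwidth ratio (150/22, or nearly 7x in the LPU's favor), which is consistent with the article's framing: prefill and heavy compute go to the GPU, token generation goes to the LPU.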

Published
Mar 16, 2026, 9:04 PM
Source
IEEE Spectrum AI

More News

Latest Knowledge

Recently added important knowledge

Higher Education and National Security

The intersection of higher education and national security involves the consideration of how educational institutions manage foreign students and researchers in the context of national interests. This can include concerns about intellectual property, research security, and the potential for espionage.

Conflict of Interest

A conflict of interest occurs when an individual or organization has multiple interests, one of which could potentially corrupt the motivation for an act in another. In research, this can lead to biased results and undermine trust in scientific findings.

Aging Research

Aging research focuses on understanding the biological processes that lead to aging and developing interventions to slow down or reverse these processes. This field encompasses various disciplines, including genetics, cellular biology, and biochemistry.

Research Independence

Research independence is the ability of a PhD student to conduct their research autonomously, making decisions about their project direction and methodology. This skill is essential for developing critical thinking and problem-solving abilities in a scientific context.

Cognitive Computing

Cognitive Computing refers to systems that simulate human thought processes in complex situations. It combines elements of AI, machine learning, and data analytics to create systems that can learn from data, reason, and interact naturally with humans.

International Student Policies

International student policies refer to the regulations and guidelines that govern the admission and enrollment of students from other countries in educational institutions. These policies can be influenced by political, economic, and social factors, and can vary significantly from one country to another.