Larmedias

Explore Scientific Knowledge. Understand Intelligence, Autonomy, and Decision-Making.

Explore everything from the latest trends in AI, engineering, and space science to the mechanisms of intelligence, autonomy, and decision-making. Search for topics of interest and navigate smoothly to related concepts and resources.

Trace the knowledge behind events

Explore the latest news and related concepts

Highlighted Story

Why Are Large Language Models so Terrible at Video Games?

<img src="https://spectrum.ieee.org/media-library/a-middle-aged-man-smiling-while-holding-a-video-game-controller-behind-him-is-a-bookshelf-filled-with-countless-cooperative-boa.jpg?id=65413486&width=1200&height=800&coordinates=0%2C208%2C0%2C209"/><br/><br/><p>Large language models (LLMs) have improved so quickly <a href="https://spectrum.ieee.org/ai-math-benchmarks" target="_self">that the benchmarks themselves</a> have evolved, adding more complex problems in an effort to challenge the latest models. Yet LLMs haven’t improved across all domains, and one task remains far outside their grasp: They have no idea how to play video games. <br/><br/>While a few models have managed to beat a few games (for example, <a href="https://techcrunch.com/2025/05/03/googles-gemini-has-beaten-pokemon-blue-with-a-little-help/" target="_blank">Gemini 2.5 Pro beat Pokemon Blue</a> in May of 2025), these exceptions prove the rule. The eventually victorious models completed games far more slowly than a typical human player, made bizarre and often repetitive mistakes, and required custom software to guide their interactions with the game.</p><p><a href="http://julian.togelius.com/" rel="noopener noreferrer" target="_blank">Julian Togelius</a>, the director of New York University’s <a href="https://game.engineering.nyu.edu/" target="_blank">Game Innovation Lab</a> and co-founder of AI game testing company Modl.ai, explored the implications of LLMs’ limitations in video games <a href="http://julian.togelius.com/Togelius2026What.pdf" target="_blank">in a recent paper</a>. He spoke with <em>IEEE Spectrum</em> about what this lack of video-game skills can tell us about the broader state of AI in 2026.</p><p><strong>LLMs have improved rapidly in coding, and your paper frames coding as a kind of well-behaved game. What do you mean by that?</strong> </p><p><strong>Julian Togelius: </strong>Coding is extremely well-behaved in the sense that you have tasks. These are like levels. 
You get a specification, you write code, and then you run it. <br/><br/>The reward is immediate and granular. The code has to compile, it has to run without crashing, and then it usually has to pass tests. Often, there’s also an explanation of how and why it failed. <br/><br/>There’s a theory from game designer <a href="https://www.raphkoster.com/" target="_blank">Raph Koster</a> that games are fun because we learn to play them as we play them. From that perspective, writing code is an extremely well-designed game. And in fact, writing code is something many people enjoy doing.<br/><br/><strong>Unlike coding, LLMs struggle with video games. This feels surprising given their <a data-linked-post="2671645555" href="https://spectrum.ieee.org/vibe-coding" target="_blank">success in coding</a>, as well as in games like chess and Go. What is it about video games that’s causing a problem?</strong> </p><p><strong>Togelius</strong>: It’s not just LLMs that are bad at this. We do not have general game AI.</p><p>There’s a widespread perception that because we can build AI that plays particular games well, we should be able to build one that plays any game. I’m not sure we’re going to get there. </p><p>People will mention that Google’s <a href="https://spectrum.ieee.org/deepmind-achieves-holy-grail" target="_self">AlphaZero</a> [which is not an LLM] can play both Go and chess. However, it had to be retrained and re-engineered for each. And those are games that are similar in terms of input and output space. Most games are more different from each other. They have different mechanics and different input representations.</p><p>There’s also a data problem. Some of the games that AI can successfully play, like Minecraft and Pokémon, are among the most well-studied games in the world, with literally millions of hours of guides. For a less well-known game, there’s far less data. 
</p><h2>Video Game Benchmarks for LLM Performance</h2><p><strong>One factor that seems to help LLMs improve in coding is the proliferation of benchmarks. We have many benchmarks LLMs can try to solve, we can score the results, and then modify the LLM to improve performance. Developing a benchmark for playing a video game, though, is less clear-cut. Why is that?</strong></p><p><strong>Togelius: </strong>I’ve built many game-based AI benchmarks over the years. One, <a href="https://cdn.aaai.org/ojs/9869/9869-13-13397-1-2-20201228.pdf" rel="noopener noreferrer" target="_blank">the General Video Game AI competition</a>, ran for seven years. We tested an agent on our publicly available games, and every time we ran the competition, we invented ten new games to test on. <br/><br/>One reason we stopped was that we stopped seeing progress. Agents got better at some games but worse at others. This was before LLMs.<br/><br/>Lately we’ve been updating this framework for LLMs. They fail. They absolutely suck. All of them. They don’t even do as well as a simple search algorithm. <br/><br/>Why? They were never trained on these games, and they’re separately very bad at spatial reasoning, which shouldn’t be surprising, because that’s also not in the training data.</p><p><strong>This brings us to what seems like a contradiction. LLMs are bad at playing games. Yet at the same time, they’re improving rapidly at coding, a skillset that can be used to create a game. How do these facts fit together?</strong> </p><p><strong>Togelius: </strong>It’s super weird. You can go into Cursor or Claude, write one prompt, and get a playable game. The game will be very typical, because an LLM’s code-writing abilities are better the more typical something is. So, if you ask it to give you something like <a data-linked-post="2656808350" href="https://spectrum.ieee.org/commodore-64" target="_blank">Asteroids</a>, it will work. 
That’s impressive.<br/><br/>However, it’s not going to give you a good or novel game. That does seem weird. The reason is that the LLM can’t play it. Game development is an iterative process. You write, you test, you adjust the game feel. An LLM can’t do that. </p><p>And to an extent, I don’t think it’s different when designing other software. Yes, you can ask an LLM to create a GUI with a bunch of buttons. But the LLM doesn’t know much about how to use it. </p><p><strong>Companies like Nvidia and Google have talked about using simulations, including game-like environments, to improve AI performance. If AI can’t master games in general, how optimistic should we be about that approach?</strong></p><p><strong>Togelius: </strong>Games are both easier and harder than the real world. They’re easier because there are fewer levels of abstraction. They’re harder because games are much more diverse. The real world has the same physics everywhere. </p><p>One example is Waymo, which uses world models in its training loop. That makes sense because driving is much the same everywhere. It’s way less diverse than games. </p><p>That’s confusing for people, who see an LLM write an academic essay on quantum physics and wonder, “How can it not play both Halo and Space Invaders?” However, those games are more different from each other, in a sense, than two academic essays. </p>

Published
Mar 29, 2026, 1:00 PM
Source
IEEE Spectrum AI

More News

Latest Knowledge

Recently added important knowledge

Higher Education and National Security

The intersection of higher education and national security concerns how educational institutions manage foreign students and researchers in light of national interests. This can include concerns about intellectual property, research security, and the potential for espionage.

Conflict of Interest

A conflict of interest occurs when an individual or organization has multiple interests, one of which could potentially corrupt the motivation for an act in another. In research, this can lead to biased results and undermine trust in scientific findings.

Aging Research

Aging research focuses on understanding the biological processes that lead to aging and developing interventions to slow down or reverse these processes. This field encompasses various disciplines, including genetics, cellular biology, and biochemistry.

Research Independence

Research independence is the ability of a PhD student to conduct their research autonomously, making decisions about their project direction and methodology. This skill is essential for developing critical thinking and problem-solving abilities in a scientific context.

Cognitive Computing

Cognitive Computing refers to systems that simulate human thought processes in complex situations. It combines elements of AI, machine learning, and data analytics to create systems that can learn from data, reason, and interact naturally with humans.

International Student Policies

International student policies refer to the regulations and guidelines that govern the admission and enrollment of students from other countries in educational institutions. These policies can be influenced by political, economic, and social factors, and can vary significantly from one country to another.