Video games, with their demands on perception, memory, and strategic planning, seem like a natural arena for testing the capabilities of modern Large Language Models (LLMs). However, researchers have found that simply “dropping” LLMs into popular games often fails to provide an effective evaluation. A new benchmark, LMGAME-BENCH, developed by a team from UC San Diego, MBZUAI, and UC Berkeley, aims to change that by creating a more reliable and insightful way to assess how well LLMs can truly play.
Why LLMs often falter in standard game environments
While games have long served as a crucial testbed for reinforcement learning, using them to evaluate the complex agentic skills of today’s LLMs—their ability to see, reason, and plan over many steps—has proven tricky. The researchers behind LMGAME-BENCH identified three primary reasons why direct evaluation often falls short:
- Brittle vision perception: Even advanced vision-language models (VLMs) can struggle with the nuanced visual understanding required to interpret complex game UIs and dynamic scenes accurately.
- Prompt sensitivity: The performance of LLMs can vary wildly based on the specific wording and structure of the prompts used to guide their actions, making comparisons between models unreliable.
- Potential data contamination: Many popular games have extensive online footprints, including walkthroughs, discussions, and visual assets. If an LLM has encountered this data during its training, its performance might reflect memorization rather than genuine problem-solving skills.
These issues often lead to LLMs performing poorly, sometimes no better than random action-taking, making it difficult to discern their true capabilities or distinguish between different models.
To overcome these hurdles, the researchers developed LMGAME-BENCH. This benchmark features a suite of well-known platformer, puzzle, and narrative-driven games, all accessible through a unified Gym-style API. More importantly, it incorporates several key innovations, described in the sections that follow.
A diverse test of skills
LMGAME-BENCH utilizes six popular games, chosen for their familiarity and the broad spectrum of cognitive skills they test:
- Super Mario Bros: Evaluates visual perception, 2D spatial reasoning, and goal-directed planning with partial observability.
- Tetris: Tests pattern recognition, spatial reasoning for piece placement, and long-horizon planning.
- Sokoban: Emphasizes visual perception, spatial reasoning for character and box navigation, and critical long-horizon planning to avoid deadlocks in low fault-tolerance scenarios.
- Candy Crush: Requires visual perception for identifying candies, spatial reasoning for anticipating chain reactions, and long-horizon planning to maximize points with limited moves.
- 2048: Assesses visual perception for tracking tile values, spatial reasoning for managing merges, and goal-directed planning.
- Ace Attorney: Stresses long-context language understanding, causal and deductive reasoning from extensive dialogues and evidence, and long-horizon, low-fault-tolerance decision making in multi-stage trials.
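All six games are exposed through the unified Gym-style API mentioned above. The article does not reproduce the actual interface, so the sketch below is only a guess at what such a wrapper could look like; `GameEnv`, `StepResult`, and `query_llm` are illustrative names, not the benchmark's real API:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class StepResult:
    observation: Any   # screenshot, symbolic grid, or dialogue text
    reward: float      # e.g., score delta or progress signal
    done: bool         # episode termination flag
    info: dict = field(default_factory=dict)

class GameEnv:
    """Hypothetical Gym-style wrapper around one LMGAME-BENCH game."""

    def __init__(self, game: str, use_harness: bool = True):
        self.game = game                # e.g., "sokoban", "tetris", "ace_attorney"
        self.use_harness = use_harness  # toggle perception/memory support

    def reset(self) -> Any:
        """Start a new episode and return the initial observation."""
        raise NotImplementedError       # each game supplies a concrete implementation

    def step(self, action: str) -> StepResult:
        """Apply a text action (e.g., 'move left') and return the outcome."""
        raise NotImplementedError

# Illustrative agent loop: the LLM proposes one action per step.
# query_llm() stands in for whatever model client is being evaluated.
def play_episode(env: GameEnv, query_llm) -> float:
    obs, total_reward, done = env.reset(), 0.0, False
    while not done:
        action = query_llm(obs)          # model picks the next move
        result = env.step(action)
        obs, done = result.observation, result.done
        total_reward += result.reward
    return total_reward
```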
Scaffolding LLMs for meaningful interaction
A core component of LMGAME-BENCH is its “gaming harness,” a set of modular supports designed to address the inherent limitations of current LLMs and enable more meaningful evaluation. These modules can be toggled on or off for experiments:
- Perception modules: These convert game UI inputs (visual layouts, text) into symbolic representations or textual descriptions that LLMs can more easily process. For grid-based games like Sokoban, this means a text-based table of object coordinates (sketched in the code after this list). For text-rich games like Ace Attorney, it involves extracting dialogue and describing visual cues. This helps minimize errors stemming purely from visual misinterpretation.
- Memory modules: To aid in long-horizon planning, especially in games with rapidly expanding decision spaces like Sokoban and Tetris, the harness includes memory support. This consists of a transient memory (recording past game states and actions) and a reflection module (encoding lessons learned to avoid past failures and narrow the action space).
- Reasoning modules: The benchmark is designed to accommodate models that use complex reasoning processes, such as long chain-of-thought (CoT) reasoning, by allowing models to generate detailed reasoning traces before deciding on an action.
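To make the perception and memory ideas concrete, here is a minimal sketch, assuming a Sokoban-style level encoded as a character grid ('#' wall, 'B' box, 'T' target, 'P' player); the function and class names are illustrative assumptions, not the benchmark's actual modules:

```python
from collections import deque

# Perception: turn a raw character grid into a text table of object
# coordinates that an LLM can read without any visual processing.
def grid_to_text(grid: list[str]) -> str:
    symbols = {"#": "wall", "B": "box", "T": "target", "P": "player"}
    lines = ["object | row | col"]
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            if ch in symbols:
                lines.append(f"{symbols[ch]} | {r} | {c}")
    return "\n".join(lines)

# Memory: a transient record of recent states/actions plus a list of
# "reflections" (lessons learned) that is prepended to the next prompt.
class HarnessMemory:
    def __init__(self, max_steps: int = 20):
        self.history = deque(maxlen=max_steps)  # (state_text, action) pairs
        self.reflections: list[str] = []        # e.g., "pushing the box at (2,3) left caused a deadlock"

    def record(self, state_text: str, action: str) -> None:
        self.history.append((state_text, action))

    def add_reflection(self, lesson: str) -> None:
        self.reflections.append(lesson)

    def as_prompt_context(self) -> str:
        recent = "\n".join(f"{i}: {action}" for i, (_, action) in enumerate(self.history))
        lessons = "\n".join(f"- {lesson}" for lesson in self.reflections)
        return f"Recent actions:\n{recent}\n\nLessons learned:\n{lessons}"
```

In this picture, the perception output and the memory context would simply be concatenated into the model's prompt before each move, leaving the model to focus on reasoning and planning.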
The study found that activating this harness significantly boosts scores, with 86.7% of game runs outperforming a random baseline when harnessed, compared to only 40% without. This creates clearer performance gaps between models.
Tackling data contamination and prompt variance
LMGAME-BENCH implements specific strategies to ensure fair and reliable evaluations:
- Data contamination checks & mitigation: For games like Super Mario Bros (vision) and Ace Attorney (text), where assets are widely available online, the team developed contamination checks. For Ace Attorney, they initially found a correlation between how closely model outputs matched fan transcripts and performance. After applying mitigation techniques such as entity masking (illustrated after this list), paraphrasing, and enforced reasoning, that correlation disappeared, and rankings instead aligned with judged reasoning quality. For games with combinatorial state spaces (Tetris, 2048, Candy Crush, Sokoban), contamination risk was deemed negligible.
- Prompt standardization: Recognizing that prompt engineering can drastically affect LLM performance, LMGAME-BENCH employs a two-stage optimization technique. First, an empirical approach based on standardized formats common in agentic workflows is used. Second, DSPy (a framework for algorithmically optimizing LLM prompts and weights) is leveraged to refine prompts further, aiming for the best average performance across models and reducing performance variance. For example, in 2048, this reduced variance by 33.8% to 63.5%.
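As a concrete (if simplified) picture of the entity-masking step, one could substitute known proper nouns with neutral placeholders before the text ever reaches the model; the name map below is a made-up example, not the paper's actual mask list:

```python
import re

# Hypothetical mapping from canonical character names to neutral placeholders.
ENTITY_MAP = {
    "Phoenix Wright": "ATTORNEY_1",
    "Miles Edgeworth": "PROSECUTOR_1",
    "Maya Fey": "ASSISTANT_1",
}

def mask_entities(text: str, entity_map: dict[str, str]) -> str:
    """Replace known proper nouns so the model cannot match memorized transcripts verbatim."""
    # Substitute longer names first to avoid partial overlaps.
    for name, placeholder in sorted(entity_map.items(), key=lambda kv: -len(kv[0])):
        text = re.sub(re.escape(name), placeholder, text)
    return text

print(mask_entities("Phoenix Wright objects to Miles Edgeworth's claim.", ENTITY_MAP))
# -> "ATTORNEY_1 objects to PROSECUTOR_1's claim."
```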
Key findings from LMGAME-BENCH
The researchers evaluated 13 leading models. Without the gaming harness, most models performed poorly, often near random baselines, especially in complex games like Sokoban and Ace Attorney. With the harness, performance improved significantly, and the benchmark effectively differentiated between models. Top performers included models with strong reasoning capabilities like o3 and o1, followed by Gemini-2.5-pro-preview and Claude-3.7-sonnet-20250219 (thinking). Among non-reasoning models, GPT-4.1-2025-04-14 led its category.
A fascinating aspect of the study was identifying which underlying capabilities game performance correlates with. By comparing LMGAME-BENCH results with performance on 20 other established benchmarks (spanning math, coding, language, visual reasoning, etc.), the team found:
- Sokoban performance showed strong correlations with math and coding benchmarks.
- Tetris and 2048 aligned closely with pattern recognition tasks.
- Candy Crush related to coding, suggesting algorithmic reasoning.
- Ace Attorney strongly correlated with language understanding benchmarks.
Using low-rank matrix factorization and linear modeling, the researchers further decomposed game performance into latent abilities. For instance, they identified features corresponding to language/multi-task knowledge, coding, symbolic/puzzle-solving, and physical reasoning. Different games in LMGAME-BENCH were shown to load on unique combinations of these latent abilities, suggesting that games evaluate a richer, more compositional set of skills than many benchmarks that test capabilities in isolation.
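Neither analysis pipeline is spelled out in the article, but both steps, the cross-benchmark correlation and the low-rank decomposition, can be sketched on a hypothetical models × benchmarks score matrix (the random data below is only a placeholder for real scores):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical score matrix: rows = models, columns = benchmarks.
# Column 0 is a game score (e.g., Sokoban); the rest are external benchmarks.
rng = np.random.default_rng(0)
scores = rng.random((13, 21))  # 13 models, 1 game + 20 external benchmarks

# Step 1: rank correlation between the game and each external benchmark.
game = scores[:, 0]
correlations = [spearmanr(game, scores[:, j]).correlation
                for j in range(1, scores.shape[1])]

# Step 2: low-rank factorization of the centered matrix. Each row of V can be
# read as a latent ability; W says how strongly each model expresses it, and a
# game column's loadings show which abilities that game draws on.
k = 4  # number of latent abilities to extract
U, S, Vt = np.linalg.svd(scores - scores.mean(axis=0), full_matrices=False)
W = U[:, :k] * S[:k]   # model-by-ability weights
V = Vt[:k, :]          # ability-by-benchmark loadings
```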
Perhaps one of the most exciting findings was the potential for game-based training to generalize. The team fine-tuned a Qwen2.5-7B-Instruct model using reinforcement learning (RL) on simplified versions of Sokoban and Tetris; a rough sketch of one possible reward signal for such training appears after the results below.
The results were compelling:
- Training on Sokoban led to strong gains in more complex Sokoban scenarios, improved performance on the planning task Blocksworld, and even showed zero-shot improvement on Tetris.
- Similarly, training on Tetris enhanced performance on other planning tasks and cross-game scenarios.
- Interestingly, while these spatial reasoning and planning heuristics transferred effectively, they did not improve performance on math or coding tasks like GSM8K or BIRD. However, game-trained models did show improvement on the agentic WebShop benchmark, suggesting grid-game-derived skills can benefit some real-world decision-making tasks.
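The write-up does not detail the RL reward used for the simplified games, but one can imagine a shaped reward for a cut-down Sokoban along the following lines; this is purely an illustrative assumption, not the paper's actual training signal:

```python
# Hypothetical per-step reward for a simplified Sokoban environment during RL
# fine-tuning: reward progress (boxes newly pushed onto targets), penalize
# wasted or invalid moves, and give a bonus when the puzzle is fully solved.
def sokoban_reward(boxes_on_target_before: int,
                   boxes_on_target_after: int,
                   total_boxes: int,
                   illegal_move: bool) -> float:
    if illegal_move:
        return -1.0                     # discourage invalid or wall-blocked moves
    reward = -0.01                      # small step cost to encourage short solutions
    reward += 1.0 * (boxes_on_target_after - boxes_on_target_before)
    if boxes_on_target_after == total_boxes:
        reward += 10.0                  # solved-puzzle bonus
    return reward
```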