How good are large language models at playing games?

The study found strong correlations between game performance and other cognitive benchmarks; for instance, Sokoban skill aligned with math and coding abilities.

by Emre Çıtak
May 28, 2025
in Research

Video games, with their demands on perception, memory, and strategic planning, seem like a natural arena for testing the capabilities of modern Large Language Models (LLMs). However, researchers have found that simply “dropping” LLMs into popular games often fails to provide an effective evaluation. A new benchmark, LMGAME-BENCH, developed by a team from UC San Diego, MBZUAI, and UC Berkeley, aims to change that by creating a more reliable and insightful way to assess how well LLMs can truly play.

Why LLMs often falter in standard game environments

While games have long served as a crucial testbed for reinforcement learning, using them to evaluate the complex agentic skills of today’s LLMs—their ability to see, reason, and plan over many steps—has proven tricky. The researchers behind LMGAME-BENCH identified three primary reasons why direct evaluation often falls short:

  • Brittle vision perception: Even advanced vision-language models (VLMs) can struggle with the nuanced visual understanding required to interpret complex game UIs and dynamic scenes accurately.
  • Prompt sensitivity: The performance of LLMs can vary wildly based on the specific wording and structure of the prompts used to guide their actions, making comparisons between models unreliable.
  • Potential data contamination: Many popular games have extensive online footprints, including walkthroughs, discussions, and visual assets. If an LLM has encountered this data during its training, its performance might reflect memorization rather than genuine problem-solving skills.

These issues often lead to LLMs performing poorly, sometimes no better than random action-taking, making it difficult to discern their true capabilities or distinguish between different models.

To overcome these hurdles, the researchers developed LMGAME-BENCH. This benchmark features a suite of well-known platformer, puzzle, and narrative-driven games, all accessible through a unified Gym-style API. More importantly, it incorporates several key innovations:

A diverse test of skills

LMGAME-BENCH utilizes six popular games, chosen for their familiarity and the broad spectrum of cognitive skills they test:

  • Super Mario Bros: Evaluates visual perception, 2D spatial reasoning, and goal-directed planning with partial observability.
  • Tetris: Tests pattern recognition, spatial reasoning for tile matching, and long-horizon planning.
  • Sokoban: Emphasizes visual perception, spatial reasoning for character and box navigation, and critical long-horizon planning to avoid deadlocks in low-fault-tolerance scenarios.
  • Candy Crush: Requires visual perception for identifying candies, spatial reasoning for anticipating chain reactions, and long-horizon planning to maximize points with limited moves.
  • 2048: Assesses visual perception for tracking tile values, spatial reasoning for managing merges, and goal-directed planning.
  • Ace Attorney: Stresses long-context language understanding, causal and deductive reasoning from extensive dialogues and evidence, and long-horizon, low-fault-tolerance decision making in multi-stage trials.
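
All six games sit behind the unified Gym-style API mentioned above. The paper's interface is not reproduced in this article, so the following is only a minimal sketch of what such a text-first game environment might look like; the class and method names (LMGameEnv, reset, step, legal_actions) are assumptions, not the benchmark's actual code.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

@dataclass
class StepResult:
    observation: str          # symbolic or textual rendering of the game state
    reward: float             # scalar progress signal (e.g., score delta)
    done: bool                # whether the episode has ended
    info: dict = field(default_factory=dict)

class LMGameEnv(ABC):
    """Hypothetical Gym-style wrapper for an LMGAME-BENCH title."""

    @abstractmethod
    def reset(self) -> str:
        """Start a new episode and return the initial observation."""

    @abstractmethod
    def step(self, action: str) -> StepResult:
        """Apply one text action (e.g., 'left', 'push up') and advance the game."""

    @abstractmethod
    def legal_actions(self) -> list[str]:
        """List the moves currently available to the agent."""

# A driver loop would then query an LLM with the observation each turn:
# obs = env.reset()
# while True:
#     action = llm_choose(obs, env.legal_actions())  # llm_choose is hypothetical
#     result = env.step(action)
#     if result.done:
#         break
#     obs = result.observation
```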

Scaffolding LLMs for meaningful interaction

A core component of LMGAME-BENCH is its “gaming harness,” a set of modular supports designed to address the inherent limitations of current LLMs and enable more meaningful evaluation. These modules can be toggled on or off for experiments:

  • Perception modules: These convert game UI inputs (visual layouts, text) into symbolic representations or textual descriptions that LLMs can more easily process. For grid-based games like Sokoban, this means a text-based table of object coordinates; for text-rich games like Ace Attorney, it involves extracting dialogue and describing visual cues. This helps minimize errors stemming purely from visual misinterpretation. (Toy sketches of this module and the memory module appear after the list.)
  • Memory modules: To aid in long-horizon planning, especially in games with rapidly expanding decision spaces like Sokoban and Tetris, the harness includes memory support. This consists of a transient memory (recording past game states and actions) and a reflection module (encoding lessons learned to avoid past failures and narrow the action space).
  • Reasoning modules: The benchmark is designed to accommodate models that use complex reasoning processes, such as long chain-of-thought (CoT) reasoning, by allowing models to generate detailed reasoning traces before deciding on an action.
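
To make the first two modules concrete, here are toy versions of what a perception converter and a memory store might look like. Everything below (grid symbols, class and method names) is an assumption for illustration, not the benchmark's actual code.

```python
from collections import deque

# Perception: flatten a Sokoban grid into the kind of textual coordinate
# table the harness feeds to the model instead of raw pixels.
def grid_to_text(grid: list[str]) -> str:
    names = {"@": "player", "$": "box", ".": "target", "#": "wall"}
    lines = []
    for y, row in enumerate(grid):
        for x, ch in enumerate(row):
            if ch in names:
                lines.append(f"{names[ch]} at (x={x}, y={y})")
    return "\n".join(lines)

# Memory: a short transient log of past states/actions plus distilled
# "reflections" that steer the model away from repeated failures.
class HarnessMemory:
    def __init__(self, window: int = 10):
        self.transient = deque(maxlen=window)  # recent (state, action) pairs
        self.reflections: list[str] = []       # lessons learned from failures

    def record(self, state: str, action: str) -> None:
        self.transient.append((state, action))

    def reflect(self, lesson: str) -> None:
        self.reflections.append(lesson)

    def to_prompt(self) -> str:
        recent = "\n".join(f"{s} -> {a}" for s, a in self.transient)
        lessons = "\n".join(f"- {r}" for r in self.reflections)
        return f"Recent moves:\n{recent}\n\nLessons learned:\n{lessons}"

# Lists every wall, the player, the box, and the target with coordinates:
print(grid_to_text(["#####", "#@$.#", "#####"]))
```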

The study found that activating this harness significantly boosts scores, with 86.7% of game runs outperforming a random baseline when harnessed, compared to only 40% without. This creates clearer performance gaps between models.

Tackling data contamination and prompt variance

LMGAME-BENCH implements specific strategies to ensure fair and reliable evaluations:

  • Data contamination checks & mitigation: For games like Super Mario Bros (vision) and Ace Attorney (text), where assets are widely available online, the team developed checks. For Ace Attorney, they found an initial correlation between model output similarity to fan transcripts and performance. However, after applying mitigation techniques like entity masking, paraphrasing, and enforced reasoning, this correlation disappeared, with rankings then aligning more with judged reasoning quality. For games with combinatorial state spaces (Tetris, 2048, Candy Crush, Sokoban), contamination risk was deemed negligible.
  • Prompt standardization: Recognizing that prompt engineering can drastically affect LLM performance, LMGAME-BENCH employs a two-stage optimization technique. First, an empirical pass standardizes prompts around formats common in agentic workflows. Second, DSPy (a framework for algorithmically optimizing LLM prompts and weights) refines them further, targeting the best average performance across models and reducing performance variance; in 2048, for example, this cut variance by 33.8% to 63.5%. (A rough sketch of such DSPy wiring follows the list.)
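
The paper's exact DSPy setup is not reproduced in this article, so the following is only a minimal sketch of how a game-move prompt could be expressed and optimized with DSPy. The signature fields, the metric, and the tiny trainset are all assumptions for illustration.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# LM setup varies by DSPy version; one common form is:
# dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class GameMove(dspy.Signature):
    """Given a textual game state, choose the next move."""
    state = dspy.InputField(desc="symbolic description of the board")
    move = dspy.OutputField(desc="one legal action, e.g. 'up'")

agent = dspy.ChainOfThought(GameMove)  # lets the model reason before answering

# Hypothetical metric: credit exact agreement with a known-good move.
def move_match(example, prediction, trace=None):
    return float(prediction.move.strip().lower() == example.move)

# Tiny illustrative trainset; real optimization would use many game states.
trainset = [
    dspy.Example(state="player at (1,1); box at (2,1); target at (3,1)",
                 move="right").with_inputs("state"),
]

optimizer = BootstrapFewShot(metric=move_match)
compiled_agent = optimizer.compile(agent, trainset=trainset)
```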

Key findings from LMGAME-BENCH

The researchers evaluated 13 leading models. Without the gaming harness, most models performed poorly, often near random baselines, especially in complex games like Sokoban and Ace Attorney. With the harness, performance improved significantly, and the benchmark effectively differentiated between models. Top performers included models with strong reasoning capabilities like o3 and o1, followed by Gemini-2.5-pro-preview and Claude-3.7-sonnet-20250219 (thinking). Among non-reasoning models, GPT-4.1-2025-04-14 led its category.

A fascinating aspect of the study involved understanding what underlying capabilities game performance correlates with. By comparing LMGAME-BENCH results with performance on 20 other established benchmarks (spanning math, coding, language, visual reasoning, etc.), the team found:

  • Sokoban performance showed strong correlations with math and coding benchmarks.
  • Tetris and 2048 aligned closely with pattern recognition tasks.
  • Candy Crush related to coding, suggesting algorithmic reasoning.
  • Ace Attorney strongly correlated with language understanding benchmarks.

Using low-rank matrix factorization and linear modeling, the researchers further decomposed game performance into latent abilities. For instance, they identified features corresponding to language/multi-task knowledge, coding, symbolic/puzzle-solving, and physical reasoning. Different games in LMGAME-BENCH were shown to load on unique combinations of these latent abilities, suggesting that games evaluate a richer, more compositional set of skills than many benchmarks that test capabilities in isolation.
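
As a toy illustration of that decomposition (with made-up numbers, not the paper's data), one can factor a models-by-tasks score matrix with a truncated SVD and read each task's column of loadings as its mix of latent abilities:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random((13, 26))       # 13 models x (6 games + 20 benchmarks); fake data

k = 4                               # assumed number of latent abilities
U, S, Vt = np.linalg.svd(scores, full_matrices=False)
model_abilities = U[:, :k] * S[:k]  # each model's strength on each latent ability
task_loadings = Vt[:k, :]           # how strongly each task draws on each ability

# A game whose column mixes several large loadings is testing a composite
# of abilities rather than one skill in isolation.
print(task_loadings[:, 0].round(2))
```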


Perhaps one of the most exciting findings was the potential for game-based training to generalize. The team fine-tuned a Qwen2.5-7B-Instruct model using reinforcement learning (RL) on simplified versions of Sokoban and Tetris.
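
The article does not detail the training setup, but the shaped reward for a simplified Sokoban RL run might look something like this toy function (all constants and the deadlock check are assumptions):

```python
def sokoban_reward(boxes_on_target: int, prev_boxes_on_target: int,
                   solved: bool, deadlocked: bool) -> float:
    """Toy shaped reward for one step of simplified Sokoban (illustrative only)."""
    if solved:
        return 10.0                 # big terminal bonus for solving the level
    if deadlocked:
        return -5.0                 # pushing a box into an unrecoverable spot
    progress = boxes_on_target - prev_boxes_on_target
    return 1.0 * progress - 0.01    # dense progress signal, small step penalty
```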

The results were compelling:

  • Training on Sokoban led to strong gains in more complex Sokoban scenarios, improved performance on the planning task Blocksworld, and even showed zero-shot improvement on Tetris.
  • Similarly, training on Tetris enhanced performance on other planning tasks and cross-game scenarios.
  • Interestingly, while these spatial reasoning and planning heuristics transferred effectively, they did not improve performance on math or coding tasks like GSM8K or BIRD. However, game-trained models did show improvement on the agentic WebShop benchmark, suggesting grid-game-derived skills can benefit some real-world decision-making tasks.

Tags: Gaming, large language model
