Dataconomy

Apparently, LLMs are really bad at playing chess

Except for one dark horse...

by Emre Çıtak
November 18, 2024
in Artificial Intelligence
  • Not all LLMs are equal: GPT-3.5-turbo-instruct stands out as the most capable chess-playing model tested.
  • Fine-tuning is crucial: Instruction tuning and targeted dataset exposure dramatically enhance performance in specific domains.
  • Chess as a benchmark: The experiment highlights chess as a valuable benchmark for evaluating LLM capabilities and refining AI systems.

Can AI language models play chess? That question sparked a recent investigation into how well large language models (LLMs) handle chess tasks, revealing unexpected insights about their strengths, weaknesses, and training methodologies.

While some models floundered against even the simplest chess engines, others, notably OpenAI’s GPT-3.5-turbo-instruct, showed surprising potential, pointing to intriguing implications for AI development.

Testing LLMs against chess engines

Researchers tested various LLMs by asking them to play chess as grandmasters, providing game states in algebraic notation. Initial excitement centered on whether LLMs, trained on vast text corpora, could leverage embedded chess knowledge to predict moves effectively.
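To make the setup concrete, here is a minimal sketch of how a game state might be turned into a prompt. The exact wording the experiment used is not given in this article, so the instruction text and the helper names `format_movetext` and `build_prompt` are assumptions for illustration only.

```python
def format_movetext(moves):
    """Render a flat list of algebraic moves as numbered movetext.

    Example: ["e4", "e5", "Nf3"] becomes "1. e4 e5 2. Nf3".
    """
    parts = []
    for i, move in enumerate(moves):
        if i % 2 == 0:
            # White's move starts a new full-move number.
            parts.append(f"{i // 2 + 1}.")
        parts.append(move)
    return " ".join(parts)


def build_prompt(moves):
    """Build a hypothetical grandmaster-style prompt (wording is assumed)."""
    movetext = format_movetext(moves) if moves else "(none)"
    side = "White" if len(moves) % 2 == 0 else "Black"
    return (
        "You are a chess grandmaster.\n"
        f"Moves so far: {movetext}\n"
        f"{side} to move. Reply with the next move only."
    )


print(build_prompt(["e4", "e5", "Nf3"]))
```

The model's reply would then be parsed as a single move and played against the engine.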

However, results showed that not all LLMs are created equal.

The study began with smaller models like llama-3.2-3b, which has 3 billion parameters. After 50 games against Stockfish’s lowest difficulty setting, the model lost every match, failing to protect its pieces or maintain a favorable board position.
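A series like this is usually summarized by tallying the result of each game. The sketch below assumes results are recorded as standard PGN result strings ("1-0", "0-1", "1/2-1/2"); the function names `tally` and `score` are hypothetical, and the 50-loss list simply mirrors the outcome described above rather than real game logs.

```python
from collections import Counter


def tally(results, llm_color="white"):
    """Count wins, draws, and losses for the LLM from PGN result strings."""
    win, loss = ("1-0", "0-1") if llm_color == "white" else ("0-1", "1-0")
    counts = Counter()
    for result in results:
        if result == win:
            counts["win"] += 1
        elif result == loss:
            counts["loss"] += 1
        elif result == "1/2-1/2":
            counts["draw"] += 1
        else:
            raise ValueError(f"unexpected result string: {result!r}")
    return counts


def score(counts):
    """Fraction of available points earned (win = 1, draw = 0.5)."""
    games = sum(counts.values())
    if games == 0:
        return 0.0
    return (counts["win"] + 0.5 * counts["draw"]) / games


# Illustrative input only: 50 games, all lost with the LLM playing White.
results = ["0-1"] * 50
counts = tally(results, llm_color="white")
print(dict(counts), score(counts))
```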

Testing escalated to larger models, such as llama-3.1-70b and its instruction-tuned variant, but they also struggled, showing only slight improvements. Other models, including Qwen-2.5-72b and command-r-v01, continued the trend, revealing a general inability to grasp even basic chess strategies.

Smaller LLMs, like llama-3.2-3b, struggled with basic chess strategies, losing consistently to even beginner-level engines

GPT-3.5-turbo-instruct was the unexpected winner

The turning point came with GPT-3.5-turbo-instruct, which excelled against Stockfish, even when the engine’s difficulty level was increased. Unlike its chat-oriented counterparts, such as gpt-3.5-turbo and gpt-4o, the instruct-tuned model consistently played winning chess.

Why do some models excel while others fail?

Key findings from the research offered valuable insights:

  • Instruction tuning matters: Models like GPT-3.5-turbo-instruct benefited from human feedback fine-tuning, which improved their ability to process structured tasks like chess.
  • Dataset exposure: There’s speculation that instruct models may have been exposed to a richer dataset of chess games, granting them superior strategic reasoning.
  • Tokenization challenges: Small nuances, like incorrect spaces in prompts, disrupted performance, highlighting the sensitivity of LLMs to input formatting.
  • Competing data influences: Training LLMs on diverse datasets may dilute their ability to excel at specialized tasks, such as chess, unless counterbalanced with targeted fine-tuning.
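The point about input formatting can be illustrated with a small whitespace normalizer. This is only a sketch under the assumption that stray or doubled spaces are the problem; the article does not say which exact spacing the models preferred, and `normalize_prompt` is a hypothetical helper, not part of the experiment.

```python
import re


def normalize_prompt(prompt):
    """Collapse runs of spaces/tabs and trim each line of a prompt."""
    cleaned = []
    for line in prompt.splitlines():
        # Replace any run of spaces or tabs with a single space.
        line = re.sub(r"[ \t]+", " ", line)
        # Drop leading and trailing whitespace on the line.
        cleaned.append(line.strip())
    return "\n".join(cleaned)


raw = "1. e4  e5 2. Nf3 "
print(repr(raw))
print(repr(normalize_prompt(raw)))
```

Two prompts that look identical to a human can tokenize differently, so running every prompt through one normalizer at least keeps the formatting consistent across games.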

As AI continues to advance, these lessons will inform strategies for improving model performance across disciplines. Whether the task is chess, natural language understanding, or another intricate problem, understanding how to train and tune models is essential to unlocking their full potential.


Featured image credit: Piotr Makowski/Unsplash

Tags: AI, Chess
