Dataconomy

ChatGPT’s data does not add up with real-world data

Recent research highlights discrepancies between AI training data and real-world usage patterns of ChatGPT

by Emre Çıtak
August 14, 2024
in Artificial Intelligence

ChatGPT’s data does not add up, according to recent research that sheds light on the discrepancies between AI training data and real-world usage patterns.

The study reveals misalignments between the content that large language models like ChatGPT are trained on and how people actually use these AI assistants in practice. Here are the details of the research and what it could mean for the future of AI development.

The study, conducted by researchers examining web crawl data and ChatGPT usage logs, uncovered several key findings that challenge our assumptions about AI training data.

By comparing the types of web content most commonly crawled for AI training with actual user interactions recorded in ChatGPT conversations, the researchers identified significant gaps between the data used to train these models and their practical applications.

ChatGPT’s foundations are shaky

One of the most striking discoveries was the mismatch between the prevalence of news content in training data and its relative scarcity in real-world ChatGPT queries. While news websites comprised nearly 40% of the tokens in the head distribution of crawled web domains, less than 1% of ChatGPT queries were related to news or current affairs. This raises questions about the efficiency and relevance of using such a large proportion of news content in training data when users appear to have limited interest in news-related queries.

Another surprising finding was the high frequency of creative writing and role-playing requests in ChatGPT conversations, despite the relative lack of such content in the training data. Over 30% of user interactions involved requests for fictional story writing, creative compositions, or role-playing scenarios. This suggests that AI models may be underprepared for these popular use cases, potentially leading to suboptimal performance in these areas.

The data dilemma

A closer look at the research findings reveals a complex web of data sources and usage patterns that don’t quite align. The study examined three major web-crawled datasets commonly used for AI training: C4, RefinedWeb, and Dolma. These datasets, derived from Common Crawl snapshots, represent a significant portion of the “data commons” used to train large language models.

However, the composition of these datasets differs markedly from how people use ChatGPT in practice. For instance, the head distribution of web domains in the training data is dominated by news sites, encyclopedias, and social media platforms.

In contrast, real-world ChatGPT usage shows a preference for creative tasks, general information queries, and even sexual content – areas that are either underrepresented or actively filtered out of training datasets.

This misalignment raises important questions about the effectiveness of current data collection and curation practices for AI training. If the data used to train these models doesn’t reflect their actual use cases, how can we expect them to perform optimally in real-world scenarios?

The consent conundrum

Adding another layer of complexity to the data puzzle is the rapidly changing landscape of web consent for AI training. The research uncovered a significant increase in restrictions placed on web crawlers by website owners, particularly those associated with AI development.

In just one year, from April 2023 to April 2024, the percentage of tokens restricted by robots.txt files in major corpora like C4 and RefinedWeb increased by over 500%.


This trend, if it continues, could severely impact the availability of high-quality training data for future AI models.

Moreover, the study found inconsistencies in how websites communicate their data use preferences. Many sites have contradictory instructions in their robots.txt files and Terms of Service agreements, leading to confusion about what data can be used for AI training. This lack of clarity poses challenges for both AI developers and website owners trying to protect their content.
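To make the robots.txt mechanism concrete, here is a minimal Python sketch of how a crawler operator might check whether a site restricts AI-related user agents. It uses the standard library's `urllib.robotparser`. The sample robots.txt content, the `crawler_permissions` helper, and the example URL are assumptions made for illustration; they are not taken from the study or from any real site. `GPTBot` and `CCBot` are the user-agent tokens used by OpenAI's and Common Crawl's crawlers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only; not copied from a real site.
# It blocks two AI-related crawlers entirely and everyone else from /private/.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def crawler_permissions(robots_txt, agents, url):
    """Return {agent: bool} saying whether each agent may fetch `url`.

    `crawler_permissions` is a hypothetical helper name, not part of any library.
    """
    parser = RobotFileParser()
    # parse() accepts the file content as an iterable of lines,
    # so no network request is needed for this sketch.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    permissions = crawler_permissions(
        SAMPLE_ROBOTS_TXT,
        ["GPTBot", "CCBot", "Googlebot"],
        "https://example.com/articles/1",
    )
    for agent, allowed in permissions.items():
        print(f"{agent}: {'allowed' if allowed else 'restricted'}")
```

A check like this covers only the robots.txt side. It cannot detect a conflict with a site's Terms of Service, which is the kind of inconsistency the study describes.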

The sexual content surprise in ChatGPT

Perhaps one of the most unexpected findings of the study was the prevalence of sexual content requests in ChatGPT interactions. While sensitive or explicit content represents less than 1% of the web domains in the training data, sexual role-play accounted for 12% of all recorded user interactions in the study’s dataset.

This discrepancy highlights a significant gap between the sanitized training data used by AI companies and what users actually ask for. It also raises ethical questions about how AI models should handle such requests, given that most are deliberately trained to avoid generating explicit content.
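The mismatches reported above can be put side by side with a small calculation. The Python sketch below is only a rough illustration and rests on several assumptions. It treats "nearly 40%" as 40 and "less than 1%" as an upper bound of 1. It also compares shares measured in different units (tokens, web domains, and user interactions), so the resulting ratios indicate direction and rough scale rather than exact measurements. The `representation_ratio` helper and the `REPORTED_SHARES` table are hypothetical names introduced for this example.

```python
# Percentages quoted in the article, used as rough bounds (assumptions):
#   news: "nearly 40%" of head-distribution tokens vs "less than 1%" of queries
#   sexual content: "less than 1%" of web domains vs 12% of interactions
REPORTED_SHARES = {
    "news": {"training": 40.0, "usage": 1.0},
    "sexual content": {"training": 1.0, "usage": 12.0},
}


def representation_ratio(training_pct, usage_pct):
    """Training share divided by usage share.

    A value above 1 means the category is over-represented in training data
    relative to usage; a value below 1 means it is under-represented.
    """
    if usage_pct <= 0:
        raise ValueError("usage_pct must be positive")
    return training_pct / usage_pct


if __name__ == "__main__":
    for category, shares in REPORTED_SHARES.items():
        ratio = representation_ratio(shares["training"], shares["usage"])
        label = "over-represented" if ratio > 1 else "under-represented"
        print(f"{category}: ratio {ratio:.2f} ({label} in training data)")
```

Because the "less than 1%" figures are upper bounds, the true imbalance is at least as large as these ratios suggest in both cases.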


Featured image credit: Solen Feyissa/Unsplash
