Google and Harvard drop 1 million books to train AI models

The dataset, announced on December 12, 2024, spans a wide array of genres, languages, and authors, including notable figures like Dickens, Dante, and Shakespeare

by Kerem Gülen
December 13, 2024
in Artificial Intelligence, News

Harvard University, in collaboration with Google, will release a dataset of approximately one million public-domain books for use in training AI models, according to WIRED. The effort is led by Harvard’s Institutional Data Initiative, which has secured funding from both Microsoft and OpenAI. The dataset comprises works that are no longer under copyright protection, drawn from Google’s extensive book-scanning efforts.

Harvard and Google provide one million books for AI training

The announcement came on December 12, 2024. The dataset encompasses a wide array of genres, languages, and authors, including notable figures like Dickens, Dante, and Shakespeare. Greg Leppert, executive director of the initiative, emphasized that the dataset aims to “level the playing field” by giving research labs and AI startups access to material for language model development. It is intended for anyone looking to train large language models (LLMs), although the release date and distribution method have yet to be disclosed.

As AI technologies increasingly rely on vast amounts of text, this dataset serves as a crucial resource. Foundation models such as those behind ChatGPT benefit significantly from high-quality training data. However, the demand for data has created problems for companies like OpenAI, which face legal scrutiny over the unauthorized use of copyrighted materials. Lawsuits from major publishers, including the Wall Street Journal and the New York Times, highlight ongoing tensions over content use and copyright infringement in AI training.

While the forthcoming dataset will be useful, it is unclear whether one million books will be enough to meet the demands of AI model training, especially since these historical texts lack contemporary references and current slang. AI companies will continue to seek additional data sources, particularly exclusive or up-to-date information, to distinguish their models from competitors.

  • Harvard’s Institutional Data Initiative aims to provide accessible data for AI development.
  • Funding from Microsoft and OpenAI underpins the project.
  • The dataset includes literary classics and less familiar texts.
  • AI models require extensive data; current controversies surround data usage rights.

Developers in the AI sector are not limited to historical texts alone. Several platforms, including Reddit and X, have begun restricting access to their data as they recognize its increasing value. Reddit has entered licensing deals with companies like Google, while X maintains exclusive content arrangements for real-time data utilization. This shift in content accessibility reflects the competitive landscape where AI companies struggle to acquire adequate and relevant training data without facing legal repercussions.

The Institutional Data Initiative is a step toward easing these pressures: it provides a legally safe pool of historical texts that allows for responsible model training. However, comprehensive strategies will still be necessary to ensure AI models are competitive and capable of understanding contemporary language and references.

How effectively this resource will fulfill the ongoing demand for comprehensive and diverse data remains a question as investigations into data usage continue.


Featured image credit: Clay Banks/Unsplash

Tags: AI, Featured
