Dataconomy

Anthropic trashed millions of books to train its AI

To avoid legal hurdles, Anthropic bought used books in bulk, ripped them apart, scanned the pages into machine-readable text, and discarded the originals—a process deemed transformative by a judge.

by Kerem Gülen
June 26, 2025
in Artificial Intelligence, News

Anthropic physically scanned millions of print books to train its AI assistant, Claude, and then discarded the originals, according to court documents reported by Ars Technica. The operation, detailed in a legal decision, involved buying the books and digitizing them destructively. The company’s approach to data acquisition reflects a broader industry demand for high-quality textual information.

Anthropic hired Tom Turvey, formerly the head of partnerships for Google Books, in February 2024. His mandate was to procure “all the books in the world” for the company. The hire was intended to replicate Google’s legally validated book digitization strategy, which had survived copyright challenges and established fair use precedents. Destructive scanning is common in smaller-scale operations, but Anthropic applied it on a massive scale. The destructive process was faster and cheaper, advantages that outweighed the need to preserve the physical books.

Judge William Alsup ruled this destructive scanning operation constituted fair use. This determination was contingent on several factors: Anthropic legally purchased the books, destroyed each print copy post-scanning, and maintained the digital files internally without distribution. The judge analogized the process to “conserv[ing] space” through format conversion, deeming it transformative. Had this method been consistently applied from the outset, it might have established the first legally sanctioned instance of AI fair use. However, Anthropic’s earlier use of pirated material undermined its initial legal standing.

The AI industry’s strong demand for high-quality text is the main driver behind these data acquisition strategies. Large language models (LLMs), such as those powering Claude and ChatGPT, are trained by feeding billions of words into neural networks. During training, the AI system processes the text repeatedly, establishing statistical relationships between words and concepts. The quality of the training data directly influences the capabilities of the resulting model. Models trained on well-edited books and articles generally produce more coherent and accurate responses than those trained on lower-quality text sources.

Publishers retain legal control over content that AI companies seek for training purposes. Negotiating licenses for this content can be complex and time-consuming. The first-sale doctrine provided a legal workaround for Anthropic: once a physical book is purchased, the buyer can dispose of that specific copy, including destroying it. This principle allowed for the legal acquisition of physical books, circumventing direct licensing negotiations. Despite the legality, the procurement of physical books represented a substantial financial outlay.

Initially, Anthropic opted to use digitized versions of pirated books to acquire high-quality training data, a strategy chosen to avoid what CEO Dario Amodei termed the “legal/practice/business slog” of complex licensing negotiations. By 2024, however, Anthropic had become “not so gung ho about” utilizing pirated ebooks due to “legal reasons,” necessitating a more secure source of data. Purchasing used physical books offered a method to bypass licensing issues entirely while providing the professionally edited text essential for AI model training. Destructive scanning facilitated the rapid digitization of millions of volumes.

Anthropic invested “many millions of dollars” in this book buying and scanning operation. The company often acquired used books in bulk. The process involved stripping books from their bindings, cutting pages to workable dimensions, and scanning them as stacks of pages into PDFs. These PDFs included machine-readable text and covers. All paper originals were subsequently discarded. Court documents do not indicate that any rare books were destroyed, as Anthropic procured its books in bulk from major retailers. Other methods exist for extracting information from paper while preserving the physical documents; for example, The Internet Archive developed non-destructive book scanning techniques that maintain the integrity of physical volumes while creating digital copies.

In a related development, OpenAI and Microsoft announced a collaboration with Harvard’s libraries to train AI models using nearly 1 million public domain books, some dating back to the 15th century. These books have been fully digitized, and the physical originals are preserved.


Tags: Anthropic, Books, claude, Featured

COPYRIGHT © DATACONOMY MEDIA GMBH, ALL RIGHTS RESERVED.
