Claim: NVIDIA Green-lit Pirated Book Downloads For AI Training

NVIDIA executives authorized using millions of pirated books from Anna’s Archive for AI training, according to an expanded class-action lawsuit. The suit, citing internal NVIDIA documents, alleges the company contacted Anna’s Archive for high-speed access to its data. NVIDIA has benefited from the artificial intelligence boom, with revenue surging on high demand for its AI chips and data center services.

NVIDIA develops its own AI models, including NeMo, Retro-48B, InstructRetro, and Megatron. These models are trained using NVIDIA hardware and large text libraries, similar to practices at other technology companies. The company has faced legal challenges from copyright holders regarding its training methodologies.

Authors first sued NVIDIA in early 2024 for copyright infringement, claiming the company’s AI models were trained on the Books3 dataset, which included copyrighted works taken from Bibliotik without permission. NVIDIA defended its actions as fair use, arguing that, to its AI models, books are statistical correlations. However, new evidence emerged during discovery.

Plaintiffs filed an amended complaint last Friday, expanding the lawsuit’s scope by adding more books, authors, and AI models. The amended complaint includes broader “shadow library” claims. Authors, including Abdi Nazemian, now cite internal NVIDIA emails and documents, alleging the company willingly downloaded millions of copyrighted books. The complaint claims “competitive pressures drove NVIDIA to piracy,” involving collaboration with Anna’s Archive.

According to the amended complaint, a member of NVIDIA’s data strategy team contacted Anna’s Archive to inquire about acquiring its pirated materials for pre-training large language models. The complaint states Anna’s Archive charged tens of thousands of dollars for “high-speed access” to its collections, and NVIDIA sought details on this access.

The complaint alleges Anna’s Archive warned NVIDIA that its library content was illegally acquired and maintained. Anna’s Archive reportedly asked NVIDIA to confirm it had internal permission to proceed, and NVIDIA management granted that permission within a week. Anna’s Archive then provided access to its pirated books, offering NVIDIA approximately 500 terabytes of data, including millions of books typically available through Internet Archive’s digital lending system. The complaint does not specify whether NVIDIA paid Anna’s Archive. NVIDIA also faces accusations of using other pirated sources, including LibGen, Sci-Hub, and Z-Library, in addition to the Books3 dataset.

Authors allege NVIDIA not only downloaded and used pirated books for its AI training but also distributed scripts and tools enabling corporate customers to download “The Pile,” which contains the pirated Books3 dataset. These allegations introduce new claims of vicarious and contributory infringement, asserting NVIDIA generated revenue from customers by facilitating access to these pirated datasets. The authors seek damages for the named authors and potentially hundreds of others who may join the class-action lawsuit.

This revelation marks the first public disclosure of correspondence between a major U.S. tech company and Anna’s Archive. The first consolidated and amended complaint, filed in the U.S. District Court for the Northern District of California, names authors Abdi Nazemian, Brian Keene, Stewart O’Nan, Andre Dubus III, and Susan Orlean.
