They Trained on Everything: How AI Labs Consumed the World's Books and Why They're Coming for the Rest
From pirated libraries to destroyed books to ancient manuscripts, AI companies have consumed millions of copyrighted works and are now approaching the limits of available human text. Here is what they used, what they stole, and what they are looking for next.

Every frontier AI model is, at its core, a compression of human writing. GPT-5.2, Claude 4.5, Gemini 3, Llama, Qwen -- all of them learned language by reading books. Millions of books. And nearly every major AI company acquired those books through channels that range from legally gray to straightforwardly pirated.
Now they are running out of text. The data wall -- the point at which the supply of quality human-generated writing can no longer keep up with the appetite of scaling laws -- is no longer a theoretical concern. Epoch AI estimates the usable stock of quality public text at roughly 300 trillion tokens. Frontier models already train on 10 to 15 trillion tokens per run, and each generation demands more. On Epoch's projections, exhaustion begins somewhere between 2026 and 2028.
This is the story of what AI labs consumed, how they got it, and where they are looking next -- including ancient libraries that haven't been translated in a thousand years.
TL;DR
| What | Scale |
|---|---|
| Books in GPT-3 training | ~67 billion tokens from LibGen (16% of all training data) |
| Books pirated by Meta | 81.7 TB from shadow libraries; CEO approved use |
| Books pirated by Anthropic | 7+ million from LibGen and Pirate Library Mirror |
| Books physically destroyed by Anthropic | ~2 million (spines cut, pages scanned, remains recycled) |
| Largest copyright settlement | $1.5 billion (Anthropic, September 2025) |
| Estimated usable public text | ~300 trillion tokens |
| Current frontier model training | 10-15 trillion tokens per run |
| Projected exhaustion | 2026-2028 (Epoch AI) |
| Books ever published worldwide | ~130 million |
| Books Google has scanned | ~40 million |
| World's manuscripts never digitized | ~85-90% |
Part 1: What they used
The pirate pipeline
The story starts with two datasets OpenAI has never fully explained. In its 2020 GPT-3 paper, the company listed "Books1" and "Books2" as training sources, together accounting for 16% of the model's training mix -- roughly 67 billion tokens. No further detail was given.
Court filings have since revealed that internally, these datasets were called LibGen1 and LibGen2, named for Library Genesis, the world's largest shadow library of pirated books and academic papers. OpenAI claims the datasets were deleted in mid-2022 and that the employees who collected them have left the company. A November 2025 court order forced OpenAI to turn over all communications with its lawyers about the deletion.
OpenAI was not alone. In 2020, independent developer Shawn Presser scraped approximately 197,000 books from Bibliotik, a private torrent tracker, and published them as Books3 -- deliberately named to highlight what OpenAI wouldn't disclose. Books3 was folded into The Pile, an open training dataset from EleutherAI, and subsequently used by Meta, Bloomberg, Nvidia, and dozens of others.
Meta's "stealth mode"
Court documents in the ongoing author lawsuits against Meta paint the most detailed picture of how AI companies handled pirated books internally. Meta engineers torrented at least 81.7 TB from shadow libraries including Z-Library, LibGen, and Anna's Archive. Internal communications show the operation ran in what staff called "stealth mode":
- Downloads were routed through non-Facebook servers to avoid tracing
- BitTorrent seeding was minimized
- Employees were instructed not to "externally cite the use of any training data including LibGen"
- Copyright markings were removed from downloaded files
One engineer wrote: "Torrenting from a corporate laptop doesn't feel right." Another senior AI researcher pushed back more directly: "I don't think we should use pirated material. I really need to draw a line here."
The decision went up the chain. Unredacted messages indicate that CEO Mark Zuckerberg personally approved the use of LibGen for Llama 3 training. Meta's reasoning, as documented internally: legally licensing copyrighted book content would be "too costly, time-consuming, and potentially damaging to any future fair use claims."
Anthropic's Project Panama
Anthropic's approach was, in some ways, the most ambitious. The company downloaded more than 7 million digitized books from pirated sources -- roughly 5 million from LibGen and 2 million from the Pirate Library Mirror. Co-founder Ben Mann personally downloaded books from LibGen over 11 days in June 2021.
But Anthropic also went physical. Under the internal codename "Project Panama" -- described in company documents as "our effort to destructively scan all the books in the world" -- Anthropic purchased books in bulk from Better World Books and World of Books, cut off their spines with a hydraulic cutting machine, ran the pages through production-grade scanners at up to 8,000 books and 2 million pages per day, and sent the remains to recycling.
The operation consumed nearly 2 million physical books and cost tens of millions of dollars. Anthropic hired Tom Turvey, the former head of partnerships for Google Books, to lead the effort.
In June 2025, a federal judge ruled the physical scanning was "quintessentially transformative" fair use -- the books were legally purchased and destroyed after scanning. But downloading pirated copies from LibGen was not protected. In September 2025, Anthropic agreed to pay $1.5 billion to settle the class-action lawsuit -- the largest copyright settlement in U.S. history, covering approximately 500,000 copyrighted works at $3,000 per book.
Nvidia's quiet inquiry
Nvidia's involvement came to light more recently. Court filings include correspondence between an Nvidia data strategy team employee and Anna's Archive, the pirated-book aggregator, in which the employee wrote that Nvidia was "exploring including Anna's Archive in pre-training data for our LLMs." The complaint alleges Nvidia received millions of pirated books totaling roughly 500 terabytes.
The scoreboard
| Company | Pirated books used | Known consequences |
|---|---|---|
| OpenAI | Books1 + Books2 (LibGen), ~67B tokens | Datasets deleted; court-ordered disclosure of deletion communications |
| Meta | 81.7+ TB from shadow libraries | Ongoing lawsuits; Zuckerberg approval documented |
| Anthropic | 7+ million books (LibGen + PiLiMi) | $1.5B settlement (largest in U.S. copyright history) |
| Nvidia | "Millions" via Anna's Archive (~500 TB) | Ongoing class-action lawsuit |
Part 2: The data wall
The fundamental problem driving all of this is simple: there isn't enough text.
Epoch AI estimates the effective stock of quality, repetition-adjusted, human-generated public text at approximately 300 trillion tokens. That sounds enormous, but the high-quality core -- books, academic papers, curated web content, Wikipedia -- amounts to only about 9 to 27 trillion tokens. Meta's Llama 3 was trained on 15 trillion tokens. Qwen's smallest model (0.6B parameters) was trained on 36 trillion tokens -- 60,000 tokens per parameter, roughly 3,000 times the ratio recommended by DeepMind's original Chinchilla scaling laws.
Every generation of models demands more data. And the supply is fixed.
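The arithmetic behind that timeline is worth making concrete. Here is a back-of-envelope sketch in Python using the figures above -- a ~300-trillion-token effective stock and ~15-trillion-token runs today. The yearly growth factor in tokens consumed per run is our illustrative assumption, not Epoch's published model:

```python
# Back-of-envelope data-wall projection, using the article's figures.
# GROWTH is an illustrative assumption, not Epoch AI's actual model.
STOCK = 300e12   # effective stock of quality public text, in tokens
run = 15e12      # tokens consumed by a frontier training run today
GROWTH = 2.5     # assumed yearly growth in tokens per run

year = 2024
while run < STOCK:
    year += 1
    run *= GROWTH
    print(f"{year}: ~{run / 1e12:.0f}T tokens per run")

print(f"A single run would exceed the usable stock around {year}.")
```

Vary the assumed growth factor between roughly 2x and 5x per year and the crossover lands anywhere from 2026 to 2029 -- essentially the window Epoch projects.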
Scaling law math
DeepMind's 2022 Chinchilla paper established that the optimal training ratio is roughly 20 tokens per parameter. But modern practice has blown far past that:
| Model | Parameters | Training tokens | Tokens per parameter |
|---|---|---|---|
| GPT-3 (2020) | 175B | 300B | 1.7x |
| Llama 2 (2023) | 70B | 2T | 29x |
| Llama 3 (2024) | 405B | 15T | 37x |
| Qwen3-0.6B (2025) | 600M | 36T | 60,000x |
The reason is economic: smaller models are cheaper at inference, so companies massively overtrain them during the one-time training phase to maximize quality at deployment. But this strategy consumes data at an accelerating rate.
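The ratios in the table are simple division, and the gap to the Chinchilla prescription is easy to quantify. A minimal sketch, using the parameter and token counts cited above:

```python
# Tokens-per-parameter ratios from the table above, compared with the
# Chinchilla-optimal budget of roughly 20 tokens per parameter.
CHINCHILLA = 20

models = {  # name: (parameters, training tokens)
    "GPT-3 (2020)":      (175e9, 300e9),
    "Llama 2 (2023)":    (70e9,  2e12),
    "Llama 3 (2024)":    (405e9, 15e12),
    "Qwen3-0.6B (2025)": (0.6e9, 36e12),
}

for name, (params, tokens) in models.items():
    ratio = tokens / params
    print(f"{name:18s} {ratio:>9,.1f} tokens/param "
          f"({ratio / CHINCHILLA:,.1f}x Chinchilla-optimal)")
```

The output makes the trend stark: GPT-3 was actually undertrained by Chinchilla's later standard (about 0.1x the optimal data), while Qwen3-0.6B sits at 3,000x -- the overtraining strategy taken to its extreme.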
The synthetic data trap
One response to the data wall is generating synthetic training data -- using AI to write text for training other AI. The web is already awash in it: by April 2025, over 74% of newly created webpages contained AI-generated text, which means future Common Crawl snapshots will be contaminated whether labs opt in or not.
A landmark Nature paper demonstrated the core risk: model collapse. Training LLMs on predecessor-generated text causes a degenerative process where models progressively forget the true underlying data distribution. Lexical, syntactic, and semantic diversity all decline. Even fractions as small as 1 in 1,000 synthetic samples can trigger the effect.
The only known mitigation is anchoring synthetic data to a non-shrinking corpus of real human writing. Which brings us back to the supply problem.
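The collapse mechanism is easy to demonstrate in miniature. The toy below is our own illustration, not the Nature paper's experimental setup: a "model" is just the empirical token-frequency fit of its training corpus, each generation trains on the previous generation's output, and tail tokens that happen to draw zero samples vanish for good. A second run mixes in a fixed share of real data, showing why a non-shrinking human corpus is the mitigation:

```python
# Toy model collapse: an illustrative sketch, not the Nature paper's setup.
import random
from collections import Counter

random.seed(0)
V = 1_000                                  # vocabulary size
N = 10_000                                 # corpus size per generation
zipf = [1 / (i + 1) for i in range(V)]     # long-tailed "human" distribution

def train(corpus):
    # "Training" = maximum-likelihood fit: empirical token frequencies.
    counts = Counter(corpus)
    return [counts.get(i, 0) for i in range(V)]

def generate(weights, n):
    return random.choices(range(V), weights=weights, k=n)

def run(human_share):
    corpus = generate(zipf, N)
    for gen in range(10):
        model = train(corpus)
        n_real = int(N * human_share)
        # Next corpus: model output, plus (optionally) fresh human data.
        corpus = generate(model, N - n_real) + generate(zipf, n_real)
        print(f"  gen {gen}: {sum(w > 0 for w in model)}/{V} token types survive")

print("Pure synthetic loop (the tail collapses):")
run(human_share=0.0)
print("Anchored to 20% real data (the tail persists):")
run(human_share=0.2)
```

In the pure loop, a token type that draws zero samples can never be generated again -- a discrete analog of the paper's disappearing distribution tails. The anchored run keeps resurrecting rare tokens from the fixed human corpus.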
Part 3: The licensing gold rush
Unable to keep pirating and increasingly constrained by lawsuits, AI companies have shifted to licensing deals with publishers and platforms:
| Deal | Estimated value |
|---|---|
| OpenAI + News Corp (WSJ, NY Post) | $250M+ over 5 years |
| Reddit + Google + OpenAI | $203M+ aggregate |
| OpenAI + Axel Springer (Politico, Business Insider) | "Tens of millions of euros" (3 years) |
| John Wiley & Sons + undisclosed AI company | $23M initially, projected $44M |
| Taylor & Francis + Microsoft | $10M |
| OpenAI + Associated Press | Undisclosed |
| OpenAI + Financial Times | Undisclosed |
| Shutterstock + OpenAI | Undisclosed (6-year deal) |
| HarperCollins + AI companies | $5,000 per title (opt-in nonfiction) |
Reddit alone reported a 450% year-over-year increase in non-ad revenue after its IPO, driven almost entirely by data licensing. Stack Overflow signed deals with both OpenAI and Google for access to 15 years of developer Q&A.
OpenAI also used over one million hours of YouTube video transcripts to train GPT-4, with president Greg Brockman personally selecting videos. Employees internally raised concerns that this violated YouTube's Terms of Service.
With few exceptions, publishers have signed these deals without asking their authors for consent.
Part 4: The world's unread libraries
Against this backdrop of exhausted English-language text, there is a different kind of data scarcity. Not a shortage of words, but a shortage of accessible words.
Of the approximately 130 million books ever published, Google has scanned roughly 40 million -- and most of those sit dormant behind legal barriers. Only about 5% of the world's textual knowledge exists in fully searchable electronic form. The vast majority of human manuscript heritage has never been digitized, let alone translated.
The Tibetan Buddhist Canon
The Kangyur and Tengyur -- the core texts of Tibetan Buddhism -- span 332 volumes and roughly 232,000 pages. The 84000 translation project has been at work since 2009; a decade and a half in, it has published approximately half the Kangyur in English. The Tengyur, at 224 volumes, has a 100-year translation timeline. Only a fraction of the broader Tibetan textual tradition has ever been rendered into any Western language.
AI is now entering this space. OpenPecha is building an agentic AI system for translating classical Tibetan. The Buddhist Digital Resource Center released a Tibetan OCR app in March 2025, developed with Monlam AI, that processes images of woodblock-printed texts in about 15 seconds per page. These are legitimate scholarly tools, not data acquisition pipelines -- but the texts they unlock are, inevitably, potential training material.
Timbuktu
The libraries of Timbuktu hold an estimated 700,000 manuscripts covering Islamic theology, astronomy, medicine, law, and mathematics, dating from the 13th to 18th centuries. In 2012, when al-Qaeda-linked militants occupied the city, librarians smuggled the manuscripts out. Thirteen years later, in August 2025, Mali's military government began returning them.
The Hill Museum & Manuscript Library has photographed over 150,000 of these manuscripts. Google Arts & Culture digitized over 40,000 pages. The vast majority remain undigitized and untranslated.
The Herculaneum scrolls
Roughly 300 papyrus scrolls from the only surviving intact library of the ancient Greco-Roman world sit carbonized in Naples, sealed by the eruption of Vesuvius in 79 CE. The Vesuvius Challenge, funded by the Musk Foundation ($2.08M) and run by former GitHub CEO Nat Friedman, uses synchrotron X-ray scanning and AI-based "virtual unwrapping" to read them without physical contact.
In May 2025, researchers recovered the title of a still-sealed scroll for the first time in history: On Vices by the Epicurean philosopher Philodemus. A master plan now aims to scan and read all 300 extant scrolls. This is archaeology, not data acquisition -- but it is also AI recovering text that has not existed in readable form for 2,000 years.
Ancient Chinese texts
Chinese tech companies are the most active in AI-driven ancient text digitization. Alibaba's DAMO Academy built an AI model that recognizes 30,000 ancient Chinese characters with 97.5% accuracy, 30x faster than human reading, in partnership with the National Library of China and UC Berkeley. ByteDance's Shidianguji platform, launched with Peking University, has digitized over 18,000 Chinese classics totaling nearly 200 million characters.
These efforts are framed as cultural preservation. Whether the resulting corpora also flow into training pipelines for Qwen or other models is not publicly disclosed.
India's 10 million manuscripts
India possesses an estimated 10 million manuscripts -- probably the largest collection in the world -- in Sanskrit, Prakrit, Tamil, Hindi, Gujarati, Ge'ez, and dozens of other languages. The National Mission for Manuscripts has catalogued about one million. Jain bhandaras (libraries) in Jaisalmer and Patan house thousands of palm-leaf manuscripts dating back nearly a millennium, covering not just religious texts but mathematics, astronomy, and medicine.
No major AI company has announced an initiative targeting this collection. The primary challenges remain physical access and conservation.
What was already lost
Some libraries were destroyed before anyone could read them:
- Library of Alexandria: Declined over centuries, final destruction debated (48 BCE fire, 270s CE attacks, possible 640s destruction). No original scrolls survive. Many works preserved through chains of copies -- Abbasid scholars translating Greek into Arabic, Renaissance humanists recovering manuscripts from monastery basements.
- House of Wisdom, Baghdad: Destroyed by Mongol forces in 1258. The scholar Nasir al-Din al-Tusi rescued thousands of volumes to Maragheh. Many texts survived through the Islamic Translation Movement's earlier dissemination.
- Nalanda University, India: Destroyed around 1193 CE. The library complex reportedly housed millions of manuscripts and burned for three months. Tibetan scribes preserved over 4,000 Sanskrit treatises through meticulous translation. The Eternity Civilization Foundation has launched an AI initiative to translate these preserved works.
The scale of what remains unread
| Collection | Estimated scale | Digitized | AI involvement |
|---|---|---|---|
| Tibetan Buddhist Canon | 232,000 pages | Largely scanned | OpenPecha, BDRC OCR, Monlam AI |
| Timbuktu manuscripts | ~700,000 | ~150,000 photographed | Google Arts & Culture (limited) |
| Herculaneum scrolls | ~300 scrolls | 2 scanned so far | Vesuvius Challenge |
| Indian manuscripts | ~10 million | ~1 million catalogued | Limited |
| Dunhuang manuscripts | Tens of thousands | 9,900+ volumes | Tencent Cloud AI, Elastic |
| Vatican Apostolic Library | 80,000 manuscripts | ~30,000 online | Neural dating system |
| Ethiopian manuscripts | Tens of thousands | Microfilming completed 2025 | First Ge'ez OCR tool |
| Dead Sea Scrolls | 25,000+ fragments | Fully imaged | "Enoch" AI dating model |
| Arabic/Persian manuscripts | Millions worldwide | Hundreds of thousands imaged | OpenITI OCR |
| Ancient Chinese texts | Millions of documents | 18,000+ classics digitized | Alibaba DAMO, ByteDance |
Part 5: What comes next
The convergence of two forces -- the data wall and the world's undigitized textual heritage -- creates an obvious question: will AI companies fund the digitization and translation of ancient manuscripts to feed their training pipelines?
So far, the answer is mostly no. The Harvard Library Public Domain Corpus (983,004 volumes, 242 billion tokens), announced by OpenAI and Microsoft in June 2025, used already-digitized books from the Google Books scanning era. Anthropic's approach was to buy and destroy modern books, not fund scholarly digitization. No frontier AI lab has been caught directly sponsoring ancient manuscript work for training purposes.
But the incentives are aligning. Meta's NLLB-200 translation model already covers 200 languages. Cohere's Tiny Aya handles 70. The tools to translate ancient Tibetan, Ge'ez, classical Chinese, and medieval Arabic are getting good enough to process at scale. And the companies building those tools are the same companies that need the tokens.
The data wall is not just about running out of English. It is about the realization that most of what humanity has written -- across thousands of years, hundreds of languages, and millions of manuscripts -- has never been fed into a model. Not because it doesn't exist, but because no one has digitized it, translated it, or even catalogued it.
95% of the Tibetan Buddhist canon remains untranslated. 700,000 Timbuktu manuscripts sit largely unread. Nine million of India's 10 million manuscripts have never even been catalogued. 300 ancient Roman scrolls are being read for the first time by AI.
The question is not whether AI companies will eventually come for this material. It is whether the effort will be driven by scholarship, by commerce, or -- as with the pirated libraries that bootstrapped the current generation of models -- by both at once, with the boundaries left deliberately unclear.
Sources:
- Epoch AI - Will We Run Out of Data?
- Authors Guild - Meta's Massive AI Training Book Heist
- NPR - Anthropic to Pay Authors $1.5B
- Washington Post - Anthropic's Book Scanning Operation
- Hollywood Reporter - OpenAI Loses Key Discovery Battle
- TechCrunch - Zuckerberg Approved LibGen for Llama
- Nature - AI Models Collapse When Trained on Recursive Data
- 84000: Translating the Words of the Buddha
- Vesuvius Challenge
- Alibaba Cloud - AI Digitization of Chinese Ancient Books
- HMML - Timbuktu Manuscripts
- Google Books - Wikipedia
- Harvard Library Public Domain Corpus
- TorrentFreak - Nvidia Contacted Anna's Archive
- OpenPecha - Agentic AI Translation
- BDRC - Tibetan OCR App
