They Trained on Everything: How AI Labs Consumed the World's Books and Why They're Coming for the Rest
From pirated libraries to destroyed books to ancient manuscripts, AI companies have consumed millions of copyrighted works and are now approaching the limits of available human text. Here is what they used, what they stole, and what they are looking for next.

Every frontier AI model is, at its core, a compression of human writing. GPT-5.2, Claude 4.5, Gemini 3, Llama, Qwen -- all of them learned language by reading books. Millions of books. And nearly every major AI company acquired those books through channels that range from legally gray to straightforwardly pirated.
Now they are running out of text. The data wall -- the point at which the supply of quality human-generated writing can no longer keep up with the appetite of scaling laws -- is no longer a theoretical concern. Epoch AI estimates the usable stock of quality public text at roughly 300 trillion tokens. Frontier models already train on 10 to 15 trillion tokens per run, and each generation demands more. On Epoch's projections, exhaustion begins somewhere between 2026 and 2028.
This is the story of what AI labs consumed, how they got it, and where they are looking next -- including ancient libraries that haven't been translated in a thousand years.
TL;DR
| What | Scale |
|---|---|
| Books in GPT-3 training | ~67 billion tokens from LibGen (16% of all training data) |
| Books pirated by Meta | 81.7 TB from shadow libraries; CEO approved use |
| Books pirated by Anthropic | 7+ million from LibGen and Pirate Library Mirror |
| Books physically destroyed by Anthropic | ~2 million (spines cut, pages scanned, remains recycled) |
| Largest copyright settlement | $1.5 billion (Anthropic, September 2025) |
| Estimated usable public text | ~300 trillion tokens |
| Current frontier model training | 10-15 trillion tokens per run |
| Projected exhaustion | 2026-2028 (Epoch AI) |
| Books ever published worldwide | ~130 million |
| Books Google has scanned | ~40 million |
| World's manuscripts never digitized | ~85-90% |
Part 1: What they used
The pirate pipeline
The story starts with two datasets OpenAI has never fully explained. In its 2020 GPT-3 paper, the company listed "Books1" and "Books2" as training sources, together accounting for 16% of the model's training mix -- roughly 67 billion tokens. No further detail was given.
Court filings have since revealed that internally, these datasets were called LibGen1 and LibGen2, named for Library Genesis, the world's largest shadow library of pirated books and academic papers. OpenAI claims the datasets were deleted in mid-2022 and that the employees who collected them have left the company. A November 2025 court order forced OpenAI to turn over all communications with its lawyers about the deletion.
OpenAI was not alone. In 2020, independent developer Shawn Presser scraped approximately 197,000 books from Bibliotik, a private torrent tracker, and published them as Books3 -- deliberately named to highlight what OpenAI wouldn't disclose. Books3 was folded into The Pile, an open training dataset from EleutherAI, and subsequently used by Meta, Bloomberg, Nvidia, and dozens of others.
Meta's "stealth mode"
Court documents in the ongoing author lawsuits against Meta paint the most detailed picture of how AI companies handled pirated books internally. Meta engineers torrented at least 81.7 TB from shadow libraries including Z-Library, LibGen, and Anna's Archive. Internal communications show the operation ran in what staff called "stealth mode":
- Downloads were routed through non-Facebook servers to avoid tracing
- BitTorrent seeding was minimized
- Employees were instructed not to "externally cite the use of any training data including LibGen"
- Copyright markings were removed from downloaded files
One engineer wrote: "Torrenting from a corporate laptop doesn't feel right." Another senior AI researcher pushed back more directly: "I don't think we should use pirated material. I really need to draw a line here."
The decision went up the chain. Unredacted messages indicate that CEO Mark Zuckerberg personally approved the use of LibGen for Llama 3 training. Meta's reasoning, as documented internally: legally licensing copyrighted book content would be "too costly, time-consuming, and potentially damaging to any future fair use claims."
Anthropic's Project Panama
Anthropic's approach was, in some ways, the most ambitious. The company downloaded more than 7 million digitized books from pirated sources -- roughly 5 million from LibGen and 2 million from the Pirate Library Mirror. Co-founder Ben Mann personally downloaded books from LibGen over 11 days in June 2021.
But Anthropic also went physical. Under the internal codename "Project Panama" -- described in company documents as "our effort to destructively scan all the books in the world" -- Anthropic purchased books in bulk from Better World Books and World of Books, cut off their spines with a hydraulic cutting machine, ran the pages through production-grade scanners at up to 8,000 books and 2 million pages per day, and sent the remains to recycling.
The operation consumed nearly 2 million physical books and cost tens of millions of dollars. Anthropic hired Tom Turvey, the former head of partnerships for Google Books, to lead the effort.
In June 2025, a federal judge ruled the physical scanning was "quintessentially transformative" fair use -- the books were legally purchased and destroyed after scanning. But downloading pirated copies from LibGen was not protected. In September 2025, Anthropic agreed to pay $1.5 billion to settle the class-action lawsuit -- the largest copyright settlement in U.S. history, covering approximately 500,000 copyrighted works at $3,000 per book.
Nvidia's quiet inquiry
Nvidia's involvement came to light more recently. Court filings include correspondence between an Nvidia data strategy team employee and Anna's Archive, the pirated-book aggregator, in which the employee wrote that Nvidia was "exploring including Anna's Archive in pre-training data for our LLMs." The complaint alleges Nvidia received millions of pirated books totaling roughly 500 terabytes.
The scoreboard
| Company | Pirated books used | Known consequences |
|---|---|---|
| OpenAI | Books1 + Books2 (LibGen), ~67B tokens | Datasets deleted; court-ordered disclosure of deletion communications |
| Meta | 81.7+ TB from shadow libraries | Ongoing lawsuits; Zuckerberg approval documented |
| Anthropic | 7+ million books (LibGen + PiLiMi) | $1.5B settlement (largest in U.S. copyright history) |
| Nvidia | "Millions" via Anna's Archive (~500 TB) | Ongoing class-action lawsuit |
Part 2: The data wall
The fundamental problem driving all of this is simple: there isn't enough text.
Epoch AI estimates the effective stock of quality, repetition-adjusted, human-generated public text at approximately 300 trillion tokens. That sounds enormous, but the high-quality core -- books, academic papers, curated web content, Wikipedia -- amounts to only about 9 to 27 trillion tokens. Meta's Llama 3 was trained on 15 trillion tokens. Qwen's smallest model (0.6B parameters) was trained on 36 trillion tokens -- 60,000 tokens per parameter, roughly 3,000 times the ratio recommended by DeepMind's original Chinchilla scaling laws.
Every generation of models demands more data. And the supply is fixed.
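The arithmetic behind that timeline is worth making concrete. Here is a back-of-envelope sketch in Python using the figures above -- a ~300-trillion-token effective stock and ~15-trillion-token runs today. The yearly growth factor in tokens consumed per run is our illustrative assumption, not Epoch's published model:

```python
# Back-of-envelope data-wall projection, using the article's figures.
# GROWTH is an illustrative assumption, not Epoch AI's actual model.
STOCK = 300e12   # effective stock of quality public text, in tokens
run = 15e12      # tokens consumed by a frontier training run today
GROWTH = 2.5     # assumed yearly growth in tokens per run

year = 2024
while run < STOCK:
    year += 1
    run *= GROWTH
    print(f"{year}: ~{run / 1e12:.0f}T tokens per run")

print(f"A single run would exceed the usable stock around {year}.")
```

Vary the assumed growth factor between roughly 2x and 5x per year and the crossover lands anywhere from 2026 to 2029 -- essentially the window Epoch projects.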
Scaling law math
DeepMind's 2022 Chinchilla paper established that the optimal training ratio is roughly 20 tokens per parameter. But modern practice has blown far past that:
| Model | Parameters | Training tokens | Tokens per parameter |
|---|---|---|---|
| GPT-3 (2020) | 175B | 300B | 1.7x |
| Llama 2 (2023) | 70B | 2T | 29x |
| Llama 3 (2024) | 405B | 15T | 37x |
| Qwen3-0.6B (2025) | 600M | 36T | 60,000x |
The reason is economic: smaller models are cheaper at inference, so companies massively overtrain them during the one-time training phase to maximize quality at deployment. But this strategy consumes data at an accelerating rate.
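The ratios in the table are simple division, and the gap to the Chinchilla prescription is easy to quantify. A minimal sketch, using the parameter and token counts cited above:

```python
# Tokens-per-parameter ratios from the table above, compared with the
# Chinchilla-optimal budget of roughly 20 tokens per parameter.
CHINCHILLA = 20

models = {  # name: (parameters, training tokens)
    "GPT-3 (2020)":      (175e9, 300e9),
    "Llama 2 (2023)":    (70e9,  2e12),
    "Llama 3 (2024)":    (405e9, 15e12),
    "Qwen3-0.6B (2025)": (0.6e9, 36e12),
}

for name, (params, tokens) in models.items():
    ratio = tokens / params
    print(f"{name:18s} {ratio:>9,.1f} tokens/param "
          f"({ratio / CHINCHILLA:,.1f}x Chinchilla-optimal)")
```

The output makes the trend stark: GPT-3 was actually undertrained by Chinchilla's later standard (about 0.1x the optimal data), while Qwen3-0.6B sits at 3,000x -- the overtraining strategy taken to its extreme.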
The synthetic data trap
One response to the data wall is generating synthetic training data -- using AI to write text for training other AI. The web is already awash in it: by April 2025, over 74% of newly created webpages contained AI-generated text, which means future Common Crawl snapshots will be contaminated whether labs opt in or not.
A landmark Nature paper demonstrated the core risk: model collapse. Training LLMs on predecessor-generated text causes a degenerative process where models progressively forget the true underlying data distribution. Lexical, syntactic, and semantic diversity all decline. Even fractions as small as 1 in 1,000 synthetic samples can trigger the effect.
The only known mitigation is anchoring synthetic data to a non-shrinking corpus of real human writing. Which brings us back to the supply problem.
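The collapse mechanism is easy to demonstrate in miniature. The toy below is our own illustration, not the Nature paper's experimental setup: a "model" is just the empirical token-frequency fit of its training corpus, each generation trains on the previous generation's output, and tail tokens that happen to draw zero samples vanish for good. A second run mixes in a fixed share of real data, showing why a non-shrinking human corpus is the mitigation:

```python
# Toy model collapse: an illustrative sketch, not the Nature paper's setup.
import random
from collections import Counter

random.seed(0)
V = 1_000                                  # vocabulary size
N = 10_000                                 # corpus size per generation
zipf = [1 / (i + 1) for i in range(V)]     # long-tailed "human" distribution

def train(corpus):
    # "Training" = maximum-likelihood fit: empirical token frequencies.
    counts = Counter(corpus)
    return [counts.get(i, 0) for i in range(V)]

def generate(weights, n):
    return random.choices(range(V), weights=weights, k=n)

def run(human_share):
    corpus = generate(zipf, N)
    for gen in range(10):
        model = train(corpus)
        n_real = int(N * human_share)
        # Next corpus: model output, plus (optionally) fresh human data.
        corpus = generate(model, N - n_real) + generate(zipf, n_real)
        print(f"  gen {gen}: {sum(w > 0 for w in model)}/{V} token types survive")

print("Pure synthetic loop (the tail collapses):")
run(human_share=0.0)
print("Anchored to 20% real data (the tail persists):")
run(human_share=0.2)
```

In the pure loop, a token type that draws zero samples can never be generated again -- a discrete analog of the paper's disappearing distribution tails. The anchored run keeps resurrecting rare tokens from the fixed human corpus.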
Part 3: The licensing gold rush
Unable to keep pirating and increasingly constrained by lawsuits, AI companies have shifted to licensing deals with publishers and platforms:
| Deal | Estimated value |
|---|---|
| OpenAI + News Corp (WSJ, NY Post) | $250M+ over 5 years |
| Reddit + Google + OpenAI | $203M+ aggregate |
| OpenAI + Axel Springer (Politico, Business Insider) | "Tens of millions of euros" (3 years) |
| John Wiley & Sons + undisclosed AI company | $23M initially, projected $44M |
| Taylor & Francis + Microsoft | $10M |
| OpenAI + Associated Press | Undisclosed |
| OpenAI + Financial Times | Undisclosed |
| Shutterstock + OpenAI | Undisclosed (6-year deal) |
| HarperCollins + AI companies | $5,000 per title (opt-in nonfiction) |
Reddit alone reported a 450% year-over-year increase in non-ad revenue after its IPO, driven almost entirely by data licensing. Stack Overflow signed deals with both OpenAI and Google for access to 15 years of developer Q&A.
OpenAI also used over one million hours of YouTube video transcripts to train GPT-4, with president Greg Brockman personally selecting videos. Employees internally raised concerns that this violated YouTube's Terms of Service.
With few exceptions, publishers have signed these deals without asking their authors for consent.
Part 4: The world's unread libraries
Against this backdrop of exhausted English-language text, there is a different kind of data scarcity. Not a shortage of words, but a shortage of accessible words.
Of the approximately 130 million books ever published, Google has scanned roughly 40 million -- and most of those sit dormant behind legal barriers. Only about 5% of the world's textual knowledge exists in fully searchable electronic form. The vast majority of human manuscript heritage has never been digitized, let alone translated.
The Tibetan Buddhist Canon
The Kangyur and Tengyur -- the core texts of Tibetan Buddhism -- span 332 volumes and roughly 232,000 pages. The 84000 translation project has been at work since 2009; a decade and a half in, it has published approximately half the Kangyur in English. The Tengyur, at 224 volumes, has a 100-year translation timeline. Only a fraction of the broader Tibetan textual tradition has ever been rendered into any Western language.
AI is now entering this space. OpenPecha is building an agentic AI system for translating classical Tibetan. The Buddhist Digital Resource Center released a Tibetan OCR app in March 2025, developed with Monlam AI, that processes images of woodblock-printed texts in about 15 seconds per page. These are legitimate scholarly tools, not data acquisition pipelines -- but the texts they unlock are, inevitably, potential training material.
Timbuktu
The libraries of Timbuktu hold an estimated 700,000 manuscripts covering Islamic theology, astronomy, medicine, law, and mathematics, dating from the 13th to 18th centuries. In 2012, when al-Qaeda-linked militants occupied the city, librarians smuggled the manuscripts out. Thirteen years later, in August 2025, Mali's military government began returning them.
The Hill Museum & Manuscript Library has photographed over 150,000 of these manuscripts. Google Arts & Culture digitized over 40,000 pages. The vast majority remain undigitized and untranslated.
The Herculaneum scrolls
Roughly 300 papyrus scrolls from the only surviving intact library of the ancient Greco-Roman world sit carbonized in Naples, sealed by the eruption of Vesuvius in 79 CE. The Vesuvius Challenge, funded by the Musk Foundation ($2.08M) and run by former GitHub CEO Nat Friedman, uses synchrotron X-ray scanning and AI-based "virtual unwrapping" to read them without physical contact.
In May 2025, researchers recovered the title of a still-sealed scroll for the first time in history: On Vices by the Epicurean philosopher Philodemus. A master plan now aims to scan and read all 300 extant scrolls. This is archaeology, not data acquisition -- but it is also AI recovering text that has not existed in readable form for 2,000 years.
Ancient Chinese texts
Chinese tech companies are the most active in AI-driven ancient text digitization. Alibaba's DAMO Academy built an AI model that recognizes 30,000 ancient Chinese characters with 97.5% accuracy, 30x faster than human reading, in partnership with the National Library of China and UC Berkeley. ByteDance's Shidianguji platform, launched with Peking University, has digitized over 18,000 Chinese classics totaling nearly 200 million characters.
These efforts are framed as cultural preservation. Whether the resulting corpora also flow into training pipelines for Qwen or other models is not publicly disclosed.
India's 10 million manuscripts
India possesses an estimated 10 million manuscripts -- probably the largest collection in the world -- in Sanskrit, Prakrit, Tamil, Hindi, Gujarati, Ge'ez, and dozens of other languages. The National Mission for Manuscripts has catalogued about one million. Jain bhandaras (libraries) in Jaisalmer and Patan house thousands of palm-leaf manuscripts dating back nearly a millennium, covering not just religious texts but mathematics, astronomy, and medicine.
No major AI company has announced an initiative targeting this collection. The primary challenges remain physical access and conservation.
What was already lost
Some libraries were destroyed before anyone could read them:
- Library of Alexandria: Declined over centuries, final destruction debated (48 BCE fire, 270s CE attacks, possible 640s destruction). No original scrolls survive. Many works preserved through chains of copies -- Abbasid scholars translating Greek into Arabic, Renaissance humanists recovering manuscripts from monastery basements.
- House of Wisdom, Baghdad: Destroyed by Mongol forces in 1258. The scholar Nasir al-Din al-Tusi rescued thousands of volumes to Maragheh. Many texts survived through the Islamic Translation Movement's earlier dissemination.
- Nalanda University, India: Destroyed around 1193 CE. The library complex reportedly housed millions of manuscripts and burned for three months. Tibetan scribes preserved over 4,000 Sanskrit treatises through meticulous translation. The Eternity Civilization Foundation has launched an AI initiative to translate these preserved works.
The scale of what remains unread
| Collection | Estimated scale | Digitized | AI involvement |
|---|---|---|---|
| Tibetan Buddhist Canon | 232,000 pages | Largely scanned | OpenPecha, BDRC OCR, Monlam AI |
| Timbuktu manuscripts | ~700,000 | ~150,000 photographed | Google Arts & Culture (limited) |
| Herculaneum scrolls | ~300 scrolls | 2 scanned so far | Vesuvius Challenge |
| Indian manuscripts | ~10 million | ~1 million catalogued | Limited |
| Dunhuang manuscripts | Tens of thousands | 9,900+ volumes | Tencent Cloud AI, Elastic |
| Vatican Apostolic Library | 80,000 manuscripts | ~30,000 online | Neural dating system |
| Ethiopian manuscripts | Tens of thousands | Microfilming completed 2025 | First Ge'ez OCR tool |
| Dead Sea Scrolls | 25,000+ fragments | Fully imaged | "Enoch" AI dating model |
| Arabic/Persian manuscripts | Millions worldwide | Hundreds of thousands imaged | OpenITI OCR |
| Ancient Chinese texts | Millions of documents | 18,000+ classics digitized | Alibaba DAMO, ByteDance |
Part 5: What comes next
The convergence of two forces -- the data wall and the world's undigitized textual heritage -- creates an obvious question: will AI companies fund the digitization and translation of ancient manuscripts to feed their training pipelines?
So far, the answer is mostly no. The Harvard Library Public Domain Corpus (983,004 volumes, 242 billion tokens), announced by OpenAI and Microsoft in June 2025, used already-digitized books from the Google Books scanning era. Anthropic's approach was to buy and destroy modern books, not fund scholarly digitization. No frontier AI lab has been caught directly sponsoring ancient manuscript work for training purposes.
But the incentives are aligning. Meta's NLLB-200 translation model already covers 200 languages. Cohere's Tiny Aya handles 70. The tools to translate ancient Tibetan, Ge'ez, classical Chinese, and medieval Arabic are getting good enough to process at scale. And the companies building those tools are the same companies that need the tokens.
The data wall is not just about running out of English. It is about the realization that most of what humanity has written -- across thousands of years, hundreds of languages, and millions of manuscripts -- has never been fed into a model. Not because it doesn't exist, but because no one has digitized it, translated it, or even catalogued it.
95% of the Tibetan Buddhist canon remains untranslated. 700,000 Timbuktu manuscripts sit largely unread. Nine million of India's 10 million manuscripts have never even been catalogued. 300 ancient Roman scrolls are being read for the first time by AI.
The question is not whether AI companies will eventually come for this material. It is whether the effort will be driven by scholarship, by commerce, or -- as with the pirated libraries that bootstrapped the current generation of models -- by both at once, with the boundaries left deliberately unclear.
Sources:
- Epoch AI - Will We Run Out of Data?
- Authors Guild - Meta's Massive AI Training Book Heist
- NPR - Anthropic to Pay Authors $1.5B
- Washington Post - Anthropic's Book Scanning Operation
- Hollywood Reporter - OpenAI Loses Key Discovery Battle
- TechCrunch - Zuckerberg Approved LibGen for Llama
- Nature - AI Models Collapse When Trained on Recursive Data
- 84000: Translating the Words of the Buddha
- Vesuvius Challenge
- Alibaba Cloud - AI Digitization of Chinese Ancient Books
- HMML - Timbuktu Manuscripts
- Google Books - Wikipedia
- Harvard Library Public Domain Corpus
- TorrentFreak - Nvidia Contacted Anna's Archive
- OpenPecha - Agentic AI Translation
- BDRC - Tibetan OCR App
