What Is Training Data?
Training data is the raw material that shapes everything an AI model knows and can do. Just as a person's education depends on what books they read and experiences they have, an LLM's capabilities are fundamentally determined by the text it was trained on. The quality, diversity, and scale of training data matter more than almost any architectural choice.
Data is the single most important ingredient in modern AI. A mediocre architecture trained on excellent data will outperform a brilliant architecture trained on poor data. This is why training data has become one of the most valuable — and contested — resources in the AI industry.
Legitimate Data Sources
Major LLM training datasets draw from a variety of publicly accessible and licensed sources. The scale is staggering — modern models train on trillions of tokens.
Common Crawl
A nonprofit that crawls the web monthly. Contains petabytes of raw HTML from billions of pages. The backbone of most training datasets, but requires heavy filtering.
Wikipedia
High-quality encyclopedic content in 300+ languages. Universally used in LLM training due to its structured, factual nature.
Books & Literature
Books provide long-form, coherent reasoning and narrative that web text often lacks. Datasets like Books3 (from Bibliotik) contained ~196,000 books.
GitHub / Code
Public code repositories provide programming knowledge. The Stack (by BigCode) contains permissively licensed code from GitHub.
Academic Papers
ArXiv, Semantic Scholar, PubMed — scientific papers provide technical knowledge and formal reasoning.
The Pile (EleutherAI)
An 825GB curated dataset combining 22 diverse sources: Wikipedia, PubMed, ArXiv, GitHub, StackExchange, USPTO patents, and more. Designed for research.
RedPajama
An open reproduction of the LLaMA training dataset. Contains Common Crawl, C4, GitHub, Wikipedia, books, ArXiv, and StackExchange.
FineWeb (HuggingFace)
A 15T token dataset derived from 96 Common Crawl snapshots with aggressive quality filtering. Currently one of the highest-quality open web datasets.
Controversial & Illegal Sources
The demand for training data has led companies into legally and ethically gray areas. Several high-profile lawsuits and controversies have shaped the debate.
The Books3 Controversy
Books3 contained ~196,000 pirated books scraped from shadow library Bibliotik, including copyrighted works by living authors. It was included in The Pile and used to train models by Meta, Bloomberg, and others. Authors filed class-action lawsuits, and the dataset was eventually removed from public access.
NYT vs. OpenAI
The New York Times sued OpenAI and Microsoft in December 2023, alleging that GPT models were trained on millions of NYT articles without permission. The lawsuit's exhibits showed that ChatGPT could reproduce NYT articles nearly verbatim, evidence that the content had been memorized rather than merely "learned from."
Reddit & Social Media Scraping
Reddit's entire corpus was used to train models without compensating users. Reddit later struck a $60M/year deal with Google for AI training access, effectively monetizing user content that was created for free. Twitter/X similarly restricted API access and began charging for data.
GDPR & Privacy Violations
European regulators have investigated whether training on personal data from the web violates GDPR. Italy temporarily banned ChatGPT in 2023. The fundamental tension: web crawls inevitably contain personal information that individuals never consented to be used for AI training.
Art & Creative Works
Image models (Stable Diffusion, Midjourney) were trained on LAION-5B, which contained billions of copyrighted images scraped from the internet. Artists filed lawsuits arguing this constitutes copyright infringement at industrial scale.
License Laundering
Some datasets are released under permissive licenses despite containing copyrighted material. The argument that "it was publicly accessible on the web" does not make it legally licensed for AI training.
The Data Quality Problem
Raw web data is noisy, redundant, and often toxic. The quality of the training data directly determines the quality of the model. Cleaning and curating data is as important as the training process itself.
Garbage In, Garbage Out
Models faithfully learn whatever patterns exist in their training data — including errors, biases, spam, and misinformation. A model trained on low-quality data will produce low-quality outputs, regardless of its architecture.
Deduplication
Web crawls contain massive amounts of duplicated content (boilerplate, scraped mirrors, copy-paste). Training on duplicates causes models to memorize rather than generalize, and can cause training instability. MinHash-based locality-sensitive hashing (LSH) is the standard approach for fuzzy deduplication at scale.
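The core of MinHash can be sketched in a few lines of plain Python. This is a toy illustration, not a production pipeline (real systems band the signatures into LSH buckets to compare millions of documents); the shingle size and hash count here are arbitrary choices:

```python
import hashlib

def shingles(text, n=3):
    """Split text into overlapping word n-grams ("shingles")."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

near_dup_a = "the quick brown fox jumps over the lazy dog near the river bank"
near_dup_b = "the quick brown fox jumps over the lazy dog near the river"
unrelated = "training data quality determines model output quality in practice"

sig_a = minhash_signature(shingles(near_dup_a))
sig_b = minhash_signature(shingles(near_dup_b))
sig_c = minhash_signature(shingles(unrelated))

# Near-duplicates share most signature slots; unrelated text shares almost none.
print(estimated_jaccard(sig_a, sig_b))  # high (close to 1)
print(estimated_jaccard(sig_a, sig_c))  # near 0
```

The point of the signature is that two documents never need to be compared shingle-by-shingle: a fixed-size fingerprint suffices, which is what makes deduplication tractable over billions of pages.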
Toxic Content Filtering
The web contains hate speech, explicit content, and extremist material. Models must either filter this out during data preparation or learn to avoid reproducing it during alignment. Both approaches have trade-offs: over-filtering removes legitimate content, under-filtering teaches harmful patterns.
Language Bias
English dominates most training datasets (~60-90% of tokens). This means models are significantly more capable in English than other languages. Languages with less web presence (most African and Indigenous languages) are severely underrepresented.
Benchmark Contamination
When benchmark test sets accidentally appear in training data, models score artificially high on evaluations. This "data contamination" makes it hard to assess true model capabilities and has led to inflated benchmark results across the industry.
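A basic contamination check compares word n-grams between training text and benchmark items. A minimal sketch with illustrative strings and an arbitrary n-gram length (real pipelines scan the entire corpus, not a single document):

```python
def word_ngrams(text, n):
    """Lowercased word n-grams, a common unit for contamination checks."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(train_doc, test_items, n=6):
    """Fraction of benchmark items sharing at least one n-gram
    with the training document."""
    train_grams = word_ngrams(train_doc, n)
    hits = sum(1 for item in test_items if word_ngrams(item, n) & train_grams)
    return hits / len(test_items)

train_doc = ("miscellaneous crawl text about geography and cities "
             "what is the capital of france and when was it founded "
             "followed by unrelated forum discussion about cooking")

leaked = "what is the capital of france and when was it founded"
clean = "name three prime numbers larger than one hundred in order"

rate = contamination_rate(train_doc, [leaked, clean])
print(rate)  # 0.5: one of the two benchmark items appears in the training text
```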
Synthetic Data: The Deep Dive
With natural data sources becoming exhausted and legally contested, synthetic data — training data generated by AI models themselves — has become the frontier of AI training. This is the most rapidly evolving area in AI data.
What Is Synthetic Data?
Synthetic data is training data generated by AI models rather than collected from human sources. This can range from simple paraphrasing to complex multi-step reasoning chains generated by frontier models. The key insight is that models can generate data at scale with consistent formatting and difficulty, and with careful filtering the result can rival or exceed the quality of typical human-produced web text.
Self-Play & Self-Improvement
The concept originated with games: AlphaGo trained on human games, but AlphaZero learned entirely by playing against itself. This self-play paradigm has been adapted for language models — models improve by generating and evaluating their own outputs.
AlphaZero surpassed all human knowledge of Go, chess, and shogi within hours to days of training, using zero human game data. This demonstrated that in domains with a clear, verifiable success signal, self-generated data can surpass human-sourced data in quality.
Distillation as Data Generation
A powerful teacher model generates training data for a smaller student model. The student learns to mimic the teacher's behavior, effectively compressing the teacher's knowledge. This is one of the most practically important synthetic data techniques.
Microsoft's Phi-3 was trained largely on synthetic data generated by GPT-4. Orca was explicitly trained to mimic GPT-4's reasoning traces. This approach has produced surprisingly capable small models.
Constitutional AI (Anthropic)
The model critiques and revises its own outputs based on a set of written principles (a "constitution"). This generates synthetic preference data without human labelers: the AI produces both the flawed response and the improved version, creating training pairs.
The model generates a response, then evaluates it against principles like "be helpful, harmless, and honest." It then revises the response to better align with these principles. Both versions become training data.
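The critique-and-revise loop can be sketched as follows. Here `call_model` is a hypothetical stand-in for a real LLM API, stubbed with canned strings so the control flow runs end to end; the constitution text is illustrative, not Anthropic's actual wording:

```python
CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "If you must refuse, explain why and offer a safe alternative.",
]

def call_model(prompt: str) -> str:
    """Hypothetical LLM call, stubbed for illustration."""
    if prompt.startswith("Critique"):
        return "The draft is curt, unhelpful, and offers no safe alternative."
    if prompt.startswith("Revise"):
        return ("I can't help with bypassing a lock you don't own, but a "
                "locksmith can open it legally.")
    return "No. Figure it out yourself."  # the flawed first draft

def constitutional_revision(user_prompt: str) -> dict:
    draft = call_model(user_prompt)
    critique = call_model(
        f"Critique this response against the principles {CONSTITUTION}:\n{draft}"
    )
    revision = call_model(
        f"Revise the response to address this critique:\n{critique}\n{draft}"
    )
    # The pair becomes synthetic preference data: the revision is "chosen",
    # the original draft is "rejected". No human labeler is involved.
    return {"prompt": user_prompt, "rejected": draft, "chosen": revision}

pair = constitutional_revision("How do I pick a lock?")
```

The essential property is that one model run produces both sides of a preference pair, so the labeling cost of RLHF-style data drops to the cost of inference.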
Rejection Sampling & Best-of-N
Generate N candidate responses, score them with a reward model or verifier, and keep only the best ones. This creates a dataset of high-quality responses that the model can learn from. Simple but effective.
For math problems, generate 100 solutions, verify which ones reach the correct answer, and train on only the correct solutions. This filters out errors and teaches reliable reasoning.
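That verify-and-filter loop can be sketched as follows, with a toy random "model" standing in for real chain-of-thought sampling; the problem and answer range are illustrative:

```python
import random

random.seed(0)  # for reproducibility only

def sample_solutions(problem, n):
    """Stand-in for sampling n chain-of-thought solutions from a model.
    Each toy 'solution' ends in a random final answer; a real model
    would emit full reasoning traces."""
    return [{"trace": f"... therefore the answer is {a}", "answer": a}
            for a in (random.randint(1, 20) for _ in range(n))]

def rejection_sample(problem, correct_answer, n=100):
    """Generate n candidates, keep only those whose final answer verifies."""
    return [c for c in sample_solutions(problem, n)
            if c["answer"] == correct_answer]

kept = rejection_sample("What is 7 + 6?", correct_answer=13)
# Every surviving trace ends in the verified answer; only these
# are added to the training set.
```

The verifier is the load-bearing component: this technique only works in domains (math, code, games) where correctness can be checked mechanically.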
RLHF/DPO Synthetic Preference Data
Instead of expensive human preference labeling, models can generate their own preference pairs. A strong model judges which of two responses is better, creating synthetic preference data for DPO or RLHF training.
This approach has enabled preference-tuning of open-source models at scales that would be prohibitively expensive with human annotators — millions of preference pairs instead of tens of thousands.
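A minimal sketch of turning two sampled responses into a DPO-style record. The judge here is a toy length heuristic standing in for an LLM judge; real pipelines prompt a strong model to score helpfulness, and the example strings are illustrative:

```python
def judge_score(prompt, response):
    """Toy stand-in for an LLM judge: a more detailed answer scores higher.
    A real judge would be a strong model prompted to rate the response."""
    return len(response.split())

def make_preference_pair(prompt, response_a, response_b):
    """Emit a DPO-style record: the higher-judged response is 'chosen'."""
    if judge_score(prompt, response_a) >= judge_score(prompt, response_b):
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

pair = make_preference_pair(
    "Explain photosynthesis.",
    "Plants make food.",
    "Plants convert sunlight, water, and CO2 into glucose and oxygen.",
)
# pair["chosen"] holds the detailed answer, pair["rejected"] the curt one.
```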
The Model Collapse Problem
When models are trained on data generated by other models (or themselves), quality can degrade over generations. Each generation of training amplifies artifacts and errors while losing diversity and nuance from the original distribution. This is called "model collapse."
Think of it like photocopying a photocopy — each generation loses fidelity. AI-generated text has subtle statistical signatures that, when amplified through recursive training, push the distribution away from natural language. Rare but important knowledge gets lost while common patterns get over-represented.
Research from Rice and Stanford (2023) showed that models trained recursively on their own outputs eventually degenerate. The solution: always mix synthetic data with real human data, and carefully monitor quality.
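The photocopy effect can be demonstrated with a toy simulation: a "model" that memorizes a token distribution and resamples from it. The vocabulary and generation counts are arbitrary; the mechanism is what matters:

```python
import random

random.seed(0)

VOCAB = ["the", "of", "and", "said", "quantum", "zygote", "ephemeral",
         "sesquipedalian", "onomatopoeia", "apocryphal"]

def next_generation(corpus, size=30):
    """A toy 'model': memorize the corpus's empirical token distribution,
    then generate the next generation's corpus by sampling from it."""
    return random.choices(corpus, k=size)

corpus = list(VOCAB) * 3            # generation 0: every token present
diversity = [len(set(corpus))]
for _ in range(500):
    corpus = next_generation(corpus)
    diversity.append(len(set(corpus)))

# A token absent from one generation can never reappear in the next,
# so diversity only ratchets downward.
print(f"distinct tokens: gen 0 = {diversity[0]}, gen 500 = {diversity[-1]}")
```

This is the collapse dynamic in miniature: sampling noise occasionally drops a rare token, and once dropped it is gone from every later generation. Mixing in real data at each step re-injects the lost tail.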
Scaling Laws for Synthetic Data
Synthetic data helps most when it's targeted at specific weaknesses. Randomly generating more data has diminishing returns, but carefully designed synthetic data can dramatically improve performance in specific domains.
When Synthetic Data Helps: domain-specific tasks (math, code, reasoning), when real data is scarce or expensive, for alignment and safety training, when diversity of training scenarios matters.
When It Hurts: recursive self-training without quality control, when the source model has systematic biases, for tasks requiring genuine real-world knowledge, when used as a complete replacement for real data.
Real-World Examples
Synthetic data is already powering some of the most capable models available today.
Phi-3 (Microsoft)
Small models (3.8B parameters) trained on heavily filtered web data plus GPT-4-generated synthetic "textbook quality" data. Achieves performance competitive with models 10x its size.
Orca 2 (Microsoft)
Trained on synthetic reasoning traces from GPT-4. The key innovation: teaching the student model to use different reasoning strategies (step-by-step, direct answer, etc.) depending on task complexity.
WizardLM (Evol-Instruct)
Uses "evolutionary instruction tuning" — starting from simple prompts and iteratively making them more complex using LLM-driven evolution. This generates a diverse set of increasingly challenging instructions.
Nemotron-4 (NVIDIA)
NVIDIA's 340B parameter model used to generate synthetic data for training smaller models. Over 98% of the alignment data for Nemotron-4 340B was synthetically generated.
Cosmopedia (HuggingFace)
One of the largest open synthetic datasets: 25B tokens of textbooks, blog posts, and stories generated by Mixtral-8x7B. Designed to provide diverse, educational content for pretraining.
The Future of Training Data
The landscape of training data is shifting rapidly, driven by legal pressures, data scarcity, and new multimodal demands.
The Data Wall
We may be approaching the limits of available text data on the internet. Estimates suggest the total "stock" of quality text on the web is 50-300 trillion tokens. Frontier models are already training on significant fractions of this. This scarcity is driving the push toward synthetic data and multimodal training.
Multimodal Training Data
Video, audio, and image data represent vastly larger untapped pools. YouTube alone has 800M+ videos. Training on video data could teach models about physical causality, temporal reasoning, and the real world in ways text alone cannot.
Regulatory Landscape
The EU AI Act requires transparency about training data for high-risk AI systems. Copyright lawsuits are establishing legal precedent. The trend is toward more disclosure and potentially licensing requirements for training data.
Data Licensing & Marketplaces
A new industry is emerging around licensed training data. Publishers, content creators, and data brokers are negotiating deals with AI companies. Reddit's $60M Google deal was just the beginning.
Key Takeaways
1. Training data is the single most important factor in determining model capabilities — more important than architecture or training methods.
2. Major datasets (Common Crawl, The Pile, RedPajama, FineWeb) are built from web crawls, books, code, and academic papers at trillion-token scale.
3. The legal landscape is rapidly evolving — lawsuits over copyrighted books, news articles, and creative works are establishing new precedents.
4. Data quality matters enormously: deduplication, toxic content filtering, and benchmark contamination are active challenges.
5. Synthetic data is the frontier — distillation, self-play, constitutional AI, and rejection sampling are producing increasingly capable models.
6. Model collapse is a real risk: training recursively on AI-generated data degrades quality without careful controls.
7. We may be hitting a "data wall" for text, pushing the field toward multimodal data and synthetic generation.
8. Regulation (EU AI Act) and licensing deals are reshaping how training data is sourced and disclosed.