How LLMs Are Trained
Large language models go through multiple training stages, each with different objectives and techniques. Understanding this pipeline is crucial for making sense of model capabilities and limitations.
Why Training Matters
The training process fundamentally shapes what LLMs can and cannot do. Different training approaches produce models with different strengths, weaknesses, and behaviors.
The LLM Training Pipeline
Each stage in the pipeline has its own objective, data, and result, and most alignment work happens in the later stages.
Stage 1: Pretraining
The foundation model is trained on massive text corpora (trillions of tokens) using self-supervised learning. The model learns to predict the next token, developing broad knowledge and language capabilities.
→ Goal: Learn language patterns, facts, and reasoning from raw text.
→ Data: Web pages, books, code, scientific papers—typically 1-10+ trillion tokens.
→ Result: A capable but unaligned "base model" that completes text but doesn't follow instructions.
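To make the objective concrete, here is a minimal PyTorch-style sketch of the next-token prediction loss. The random tensors stand in for real model outputs and token ids; the transformer, data pipeline, and training loop are omitted.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the pretraining objective: predict the next token.
# Random tensors stand in for a real transformer's logits and a real batch.
vocab_size, seq_len, batch = 32_000, 128, 4
logits = torch.randn(batch, seq_len, vocab_size)         # model outputs
tokens = torch.randint(0, vocab_size, (batch, seq_len))  # input token ids

# Position t is trained to predict token t+1, via cross-entropy over the vocabulary.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions at positions 0..T-2
    tokens[:, 1:].reshape(-1),               # targets: the following token
)
# In real training: loss.backward() plus an optimizer step, repeated over trillions of tokens.
```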
Stage 2: Supervised Fine-Tuning (SFT)
The base model is fine-tuned on curated instruction-response pairs created by human annotators. This teaches the model to follow instructions and respond helpfully.
→ Goal: Transform the base model into an instruction-following assistant.
→ Data: ~10K-100K high-quality instruction-response examples.
→ Result: A model that can follow instructions but may still produce harmful or unhelpful outputs.
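A rough sketch of how the SFT loss differs from pretraining, using a hand-made toy tokenization: the cross-entropy is usually computed only on the response tokens, so the model learns to answer instructions rather than to continue prompts.

```python
import torch
import torch.nn.functional as F

# Sketch of the SFT loss on one instruction-response pair (toy token ids).
vocab_size = 32_000
prompt_ids   = torch.tensor([11, 42, 7, 99])   # e.g. "Summarize this article ..."
response_ids = torch.tensor([310, 8, 56, 2])   # e.g. "The article argues ..." + EOS

input_ids = torch.cat([prompt_ids, response_ids]).unsqueeze(0)
labels = input_ids.clone()
labels[:, :len(prompt_ids)] = -100             # mask prompt positions from the loss

logits = torch.randn(1, input_ids.shape[1], vocab_size)  # stand-in for model output
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    labels[:, 1:].reshape(-1),
    ignore_index=-100,                         # masked prompt tokens contribute nothing
)
```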
Stage 3: RLHF / Preference Tuning
Human evaluators rank model outputs by quality. A reward model learns these preferences, then the LLM is optimized to maximize the reward using reinforcement learning (PPO) or direct preference optimization (DPO).
→ Goal: Align the model with human preferences for helpfulness, harmlessness, and honesty.
→ Data: Human preference comparisons (A is better than B).
→ Result: A model that produces outputs humans prefer and avoids harmful behaviors.
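In the PPO variant, the quantity being maximized is commonly written as the reward-model score minus a KL penalty that keeps the policy close to the SFT model:

$$
\max_{\theta}\;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big] \;-\; \beta\,\mathbb{D}_{\mathrm{KL}}\!\big[\, \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \,\big]
$$

Here \(\pi_\theta\) is the model being trained, \(\pi_{\mathrm{ref}}\) is the frozen SFT model, \(r_\phi\) is the learned reward model, and \(\beta\) controls how far the policy may drift from the reference.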
Stage 4: Continued Training & Specialized Alignment
Models may undergo additional training for specific capabilities (coding, math, tool use) or safety refinements (red teaming, constitutional AI). This work often continues after the model is deployed.
The RL Paradigm: Learning Without Human Labels
A revolutionary approach where models learn reasoning through pure reinforcement learning on verifiable tasks, without human demonstrations or preference labels.
What is the RL Paradigm?
Instead of learning from human-written examples (SFT) or human preferences (RLHF), models learn directly from outcome-based rewards. If the answer is correct, the model is rewarded. If wrong, it's penalized. No human labeling required.
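As a concrete illustration, here is a minimal sketch of a verifiable reward for a math task, assuming a hypothetical convention where the model ends its response with a line of the form `Answer: ...`. Real systems use more robust answer extraction and, for code, unit-test execution.

```python
import re

def math_reward(model_output: str, reference_answer: str) -> float:
    """Outcome-based reward: 1.0 if the extracted final answer matches the
    reference, 0.0 otherwise. No human labeling of the response is needed."""
    match = re.search(r"Answer:\s*(.+?)\s*$", model_output.strip())
    if match is None:
        return 0.0  # unparseable output earns no reward
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

print(math_reward("Let x = 6, so 7x = 42.\nAnswer: 42", "42"))  # 1.0
print(math_reward("The result is probably 41.", "42"))          # 0.0
```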
DeepSeek R1-Zero: A Case Study
DeepSeek R1-Zero demonstrated that powerful reasoning can emerge from pure RL, without any supervised fine-tuning. The model developed chain-of-thought reasoning, self-verification, and even "aha moments" entirely through reinforcement learning.
No SFT Required
R1-Zero was trained directly from a base model using only RL, skipping the SFT stage entirely. Reasoning behaviors emerged naturally.
Verifiable Rewards
Training focused on tasks with objectively verifiable answers: math problems, coding challenges, logical puzzles. No subjective human judgment needed.
Emergent Behaviors
The model spontaneously developed extended thinking, self-correction, and reflection—behaviors that previous models only learned from human demonstrations.
Readability Challenges
Pure RL models can develop unusual reasoning patterns that are hard to interpret. DeepSeek added a small amount of human data to improve readability.
RL Paradigm vs. Traditional RLHF
These approaches solve different problems and can be complementary.
RLHF Approach
Learn from human preferences. Requires expensive human labeling. Good for subjective tasks like writing quality and helpfulness.
RL Paradigm Approach
Learn from verifiable outcomes. No human labeling needed. Excellent for reasoning, math, and coding where correctness is objective.
Hybrid Approach
Modern models often combine both: RL for reasoning capabilities, RLHF for alignment and user preferences.
Key Alignment Concepts
Fundamental ideas in AI alignment research.
Outer Alignment
Ensuring the training objective (reward function) correctly captures what we want. Even perfect optimization of a misspecified objective leads to bad outcomes.
Inner Alignment
Ensuring the learned model actually optimizes for the training objective, not some proxy goal that happens to correlate during training.
Specification Problem
The fundamental difficulty of precisely stating what we want in all situations. Human values are complex, contextual, and sometimes contradictory.
Robustness
Maintaining alignment under distribution shift, adversarial pressure, and novel situations the model wasn't trained on.
Deceptive Alignment
A theoretical risk where a model appears aligned during training but pursues different goals when deployed—behaving well only because it's being evaluated.
Goal Misgeneralization
When a model learns a proxy goal that works in training but fails in deployment. Example: learning to get positive feedback rather than being genuinely helpful.
DPO vs RLHF: A Deep Comparison
Direct Preference Optimization (DPO) and Reinforcement Learning from Human Feedback (RLHF) are the two dominant approaches for aligning LLMs with human preferences. Understanding their differences is crucial for choosing the right technique.
RLHF: The Traditional Approach
RLHF uses a separate reward model trained on human preferences, then optimizes the LLM using reinforcement learning (typically PPO) to maximize that reward.
Step 1: Collect Preferences
Humans compare pairs of model outputs and select which they prefer. This creates a dataset of preference rankings.
Step 2: Train Reward Model
A separate neural network learns to predict human preferences, assigning scores to model outputs.
Step 3: RL Optimization
Use PPO (Proximal Policy Optimization) to update the LLM to generate outputs that maximize the reward model's scores.
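For step 2, the reward model is typically trained with a pairwise (Bradley-Terry style) loss over preference pairs. A minimal sketch, with random scores standing in for the reward model's outputs:

```python
import torch
import torch.nn.functional as F

# Sketch of the pairwise loss used to train the reward model. In a real setup,
# r_chosen and r_rejected are scalar scores the reward model assigns to the
# human-preferred and rejected response for the same prompt.
r_chosen = torch.randn(8, requires_grad=True)    # scores for preferred outputs
r_rejected = torch.randn(8, requires_grad=True)  # scores for rejected outputs

# The loss is minimized when the chosen response scores above the rejected one.
reward_model_loss = -F.logsigmoid(r_chosen - r_rejected).mean()
reward_model_loss.backward()  # gradient step widens the margin
```

Step 3 then uses PPO to push the policy toward higher reward-model scores, subject to the KL penalty shown earlier so the model does not drift too far from the SFT checkpoint.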
DPO: The Simplified Alternative
DPO skips the reward model entirely, directly optimizing the LLM on preference data using a clever mathematical reformulation.
Step 1: Collect Preferences
Same as RLHF—humans compare pairs of outputs and indicate which they prefer.
Step 2: Direct Optimization
Instead of training a separate reward model, DPO directly updates the LLM to increase probability of preferred outputs.
Step 3: No RL Required
Uses standard supervised learning techniques, avoiding the instability and complexity of reinforcement learning.
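The reformulation boils down to a single differentiable loss over preference pairs. A minimal sketch, with random log-probabilities standing in for the policy and the frozen reference model:

```python
import torch
import torch.nn.functional as F

# Sketch of the DPO loss for a batch of preference pairs. In practice the
# log-probabilities come from the model being trained (policy) and a frozen
# copy of the SFT model (reference); random values stand in for them here.
beta = 0.1
policy_chosen_logp   = torch.randn(8, requires_grad=True)  # log pi_theta(y_chosen | x)
policy_rejected_logp = torch.randn(8, requires_grad=True)  # log pi_theta(y_rejected | x)
ref_chosen_logp      = torch.randn(8)                      # log pi_ref(y_chosen | x)
ref_rejected_logp    = torch.randn(8)                      # log pi_ref(y_rejected | x)

# Implicit rewards are the beta-scaled log-ratios between policy and reference.
chosen_margin   = policy_chosen_logp - ref_chosen_logp
rejected_margin = policy_rejected_logp - ref_rejected_logp

# Ordinary supervised objective: widen the gap between chosen and rejected.
dpo_loss = -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
dpo_loss.backward()
```

Because this is a standard differentiable loss, training uses the same tooling as SFT, which is where DPO's stability advantage comes from.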
Head-to-Head Comparison
| Aspect | RLHF | DPO |
|---|---|---|
| Complexity | High: requires reward model + RL training | Low: single-stage supervised learning |
| Reward Model | Required (separate neural network) | Not needed (implicit in loss function) |
| Training Stability | Can be unstable, requires careful tuning | Generally more stable and predictable |
| Flexibility | More flexible, reward model reusable | Less flexible, tied to specific preferences |
| Used By | GPT-4, Claude, early Llama models | Llama 3, Zephyr, many open-source models |
GRPO: Group Relative Policy Optimization
GRPO is a policy-optimization technique developed by DeepSeek that uses relative rewards within groups of responses, eliminating the need for a separate value (critic) model while maintaining training stability.
How GRPO Works
Instead of a learned value function supplying the baseline, GRPO samples multiple responses to the same prompt, scores them, and uses each response's reward relative to the group average to compute policy gradients.
Generate Response Group
For each prompt, generate multiple candidate responses (typically 4-16) from the current policy.
Score Within Group
Score each response in the group. The scores can come from verifiable rewards (for math/code) or from a learned preference model.
Relative Gradient Update
Update the policy to increase the probability of responses that score above the group average and decrease the probability of those that score below it.
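A minimal sketch of the group-relative advantage at the heart of GRPO, using made-up binary rewards for a group of eight responses to one prompt:

```python
import torch

# Rewards for G = 8 sampled responses to the same prompt, e.g. 1.0 if a
# verifiable checker accepts the final answer and 0.0 otherwise (made up here).
group_rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])

# The group mean acts as the baseline; standardizing within the group gives
# each response an advantage without any learned value (critic) network.
advantages = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-6)

# Above-average responses get positive advantages (their tokens are made more
# likely in the policy-gradient update); below-average responses get negative ones.
print(advantages)
```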
Advantages
- No separate value (critic) model needed—reduces memory and complexity
- More stable than PPO with a learned critic—group-relative baselines are more robust than absolute value estimates
- Works well with verifiable rewards (math, code) and learned preferences
Key Applications
- DeepSeek R1 reasoning model training
- Mathematical and coding task optimization
- Efficient training without the overhead of a separate critic model
Synthetic Data for Alignment
Using AI models to generate training data is revolutionizing alignment. This approach can scale beyond human annotation capacity while maintaining quality through careful design.
Synthetic Data Generation Methods
Several techniques have emerged for generating high-quality synthetic training data for alignment.
Constitutional AI (Anthropic)
The model critiques and revises its own outputs based on a set of principles. The AI generates both the problematic response and the improved version, creating preference pairs without human labeling.
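A minimal sketch of the critique-and-revise loop, assuming a hypothetical `generate(prompt)` helper that wraps the model. Real constitutions contain many principles, and the loop is run at scale to build preference datasets.

```python
PRINCIPLE = "Choose the response that is least likely to be harmful or misleading."

def constitutional_pair(user_prompt: str, generate) -> tuple[str, str]:
    """Produce a (rejected, preferred) pair with no human labeling.

    `generate` is a hypothetical callable that sends a prompt to the model
    and returns its text response.
    """
    initial = generate(user_prompt)
    critique = generate(
        f"Response:\n{initial}\n\n"
        f"Critique this response according to the principle: {PRINCIPLE}"
    )
    revision = generate(
        f"Response:\n{initial}\n\nCritique:\n{critique}\n\n"
        "Rewrite the response so that it satisfies the principle."
    )
    return initial, revision  # usable as a preference pair: revision preferred over initial
```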
Self-Instruct & Evol-Instruct
Models generate their own instruction-response pairs, which are then filtered for quality. Evol-Instruct (used in WizardLM) iteratively makes instructions more complex.
Model Distillation
A larger, more capable model generates training data for a smaller model. This transfers knowledge and alignment properties, as seen in many open-source models trained on GPT-4 outputs.
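As a rough sketch of how distillation-style data generation might look, again assuming a hypothetical `teacher_generate(prompt)` wrapper around the stronger model; production pipelines add deduplication, scoring models, and much stricter quality filters.

```python
def build_distillation_dataset(instructions, teacher_generate, min_chars=20):
    """Collect (instruction, response) pairs from a stronger teacher model.

    `teacher_generate` is a hypothetical callable wrapping the teacher model;
    the length check is a deliberately crude stand-in for real quality filters.
    """
    dataset = []
    for instruction in instructions:
        response = teacher_generate(instruction)
        if len(response.strip()) >= min_chars:  # drop empty or trivial outputs
            dataset.append({"instruction": instruction, "response": response})
    return dataset  # used as SFT data for the smaller student model
```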
Benefits
- Massive scale—generate millions of examples cheaply
- Consistent quality—no human annotator fatigue or disagreement
- Targeted generation—create data for specific weaknesses
Risks & Limitations
- Model collapse—training on AI-generated data can degrade capabilities
- Bias amplification—AI biases get reinforced in synthetic data
- Quality ceiling—synthetic data quality limited by source model
Alignment Techniques
RLHF (Reinforcement Learning from Human Feedback)
Train a reward model on human preferences, then use RL to optimize the LLM against it. The dominant alignment technique since InstructGPT and ChatGPT popularized it.
Constitutional AI (CAI)
Define principles (a "constitution") and have the model critique and revise its own outputs. Reduces reliance on human labelers and scales better.
Direct Preference Optimization (DPO)
Skip the reward model—directly optimize the LLM on preference data. Simpler and more stable than RLHF.
Red Teaming
Adversarial testing by humans or other AI models to find failure modes, jailbreaks, and harmful outputs before deployment.
Interpretability
Understanding what models are actually learning internally. Crucial for verifying alignment rather than just measuring behavior.
Safety Filters & Guardrails
Additional layers that filter inputs/outputs for harmful content. A defense-in-depth measure, not a replacement for alignment.
Fine-Tuning vs. Alignment
Fine-tuning and alignment are related but distinct concepts.
Fine-Tuning
Adapting a model to new tasks or domains by training on task-specific data. Can be done for any purpose.
Alignment
Specifically making a model's behavior match human values and intentions. Usually implemented through fine-tuning, but defined by its goal rather than its method.
Post-Training
The umbrella term for everything after pretraining: SFT, RLHF, specialized fine-tuning, safety training, etc.
Key Takeaways
1. LLM training has distinct stages: pretraining → SFT → RLHF → specialized alignment
2. The RL paradigm (e.g., DeepSeek R1-Zero) shows reasoning can emerge from pure RL without human demonstrations
3. RLHF aligns models with human preferences; pure RL optimizes for verifiable outcomes
4. Modern models often combine multiple techniques: SFT for instruction following, RLHF for preferences, RL for reasoning
5. Understanding the training pipeline helps you understand model behavior and limitations
6. The field is rapidly evolving—new paradigms like pure RL are changing how we think about training