How LLMs Are Trained
Large language models go through multiple training stages, each with different objectives and techniques. Understanding this pipeline is crucial for making sense of model capabilities and limitations.
Why Training Matters
The training process fundamentally shapes what LLMs can and cannot do. Different training approaches produce models with different strengths, weaknesses, and behaviors.
The LLM Training Pipeline
Each stage in the pipeline has its own objective, data, and result, and most alignment work happens in the later stages.
Stage 1: Pretraining
The foundation model is trained on massive text corpora (trillions of tokens) using self-supervised learning. The model learns to predict the next token, developing broad knowledge and language capabilities.
→ Goal: Learn language patterns, facts, and reasoning from raw text.
→ Data: Web pages, books, code, scientific papers—typically 1-10+ trillion tokens.
→ Result: A capable but unaligned "base model" that completes text but doesn't follow instructions.
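To make the objective concrete, here is a minimal PyTorch-style sketch of the next-token prediction loss. The random tensors stand in for real model outputs and token ids; the transformer, data pipeline, and training loop are omitted.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the pretraining objective: predict the next token.
# Random tensors stand in for a real transformer's logits and a real batch.
vocab_size, seq_len, batch = 32_000, 128, 4
logits = torch.randn(batch, seq_len, vocab_size)         # model outputs
tokens = torch.randint(0, vocab_size, (batch, seq_len))  # input token ids

# Position t is trained to predict token t+1, via cross-entropy over the vocabulary.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions at positions 0..T-2
    tokens[:, 1:].reshape(-1),               # targets: the following token
)
# In real training: loss.backward() plus an optimizer step, repeated over trillions of tokens.
```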
Stage 2: Supervised Fine-Tuning (SFT)
The base model is fine-tuned on curated instruction-response pairs created by human annotators. This teaches the model to follow instructions and respond helpfully.
→ Goal: Transform the base model into an instruction-following assistant.
→ Data: ~10K-100K high-quality instruction-response examples.
→ Result: A model that can follow instructions but may still produce harmful or unhelpful outputs.
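A rough sketch of how the SFT loss differs from pretraining, using a hand-made toy tokenization: the cross-entropy is usually computed only on the response tokens, so the model learns to answer instructions rather than to continue prompts.

```python
import torch
import torch.nn.functional as F

# Sketch of the SFT loss on one instruction-response pair (toy token ids).
vocab_size = 32_000
prompt_ids   = torch.tensor([11, 42, 7, 99])   # e.g. "Summarize this article ..."
response_ids = torch.tensor([310, 8, 56, 2])   # e.g. "The article argues ..." + EOS

input_ids = torch.cat([prompt_ids, response_ids]).unsqueeze(0)
labels = input_ids.clone()
labels[:, :len(prompt_ids)] = -100             # mask prompt positions from the loss

logits = torch.randn(1, input_ids.shape[1], vocab_size)  # stand-in for model output
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    labels[:, 1:].reshape(-1),
    ignore_index=-100,                         # masked prompt tokens contribute nothing
)
```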
Stage 3: RLHF / Preference Tuning
Human evaluators rank model outputs by quality. A reward model learns these preferences, then the LLM is optimized to maximize the reward using reinforcement learning (PPO) or direct preference optimization (DPO).
→ Goal: Align the model with human preferences for helpfulness, harmlessness, and honesty.
→ Data: Human preference comparisons (A is better than B).
→ Result: A model that produces outputs humans prefer and avoids harmful behaviors.
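In the PPO variant, the quantity being maximized is commonly written as the reward-model score minus a KL penalty that keeps the policy close to the SFT model:

$$
\max_{\theta}\;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big] \;-\; \beta\,\mathbb{D}_{\mathrm{KL}}\!\big[\, \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \,\big]
$$

Here \(\pi_\theta\) is the model being trained, \(\pi_{\mathrm{ref}}\) is the frozen SFT model, \(r_\phi\) is the learned reward model, and \(\beta\) controls how far the policy may drift from the reference.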
Stage 4: Continued Training & Specialized Alignment
Models may undergo additional training for specific capabilities (coding, math, tool use) or safety refinements (red teaming, constitutional AI). This work often continues after the model is deployed.
The RL Paradigm: Learning Without Human Labels
A revolutionary approach where models learn reasoning through pure reinforcement learning on verifiable tasks, without human demonstrations or preference labels.
What is the RL Paradigm?
Instead of learning from human-written examples (SFT) or human preferences (RLHF), models learn directly from outcome-based rewards. If the answer is correct, the model is rewarded. If wrong, it's penalized. No human labeling required.
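As a concrete illustration, here is a minimal sketch of a verifiable reward for a math task, assuming a hypothetical convention where the model ends its response with a line of the form `Answer: ...`. Real systems use more robust answer extraction and, for code, unit-test execution.

```python
import re

def math_reward(model_output: str, reference_answer: str) -> float:
    """Outcome-based reward: 1.0 if the extracted final answer matches the
    reference, 0.0 otherwise. No human labeling of the response is needed."""
    match = re.search(r"Answer:\s*(.+?)\s*$", model_output.strip())
    if match is None:
        return 0.0  # unparseable output earns no reward
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

print(math_reward("Let x = 6, so 7x = 42.\nAnswer: 42", "42"))  # 1.0
print(math_reward("The result is probably 41.", "42"))          # 0.0
```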
DeepSeek R1-Zero: A Case Study
DeepSeek R1-Zero demonstrated that powerful reasoning can emerge from pure RL, without any supervised fine-tuning. The model developed chain-of-thought reasoning, self-verification, and even "aha moments" entirely through reinforcement learning.
No SFT Required
R1-Zero was trained directly from a base model using only RL, skipping the SFT stage entirely. Reasoning behaviors emerged naturally.
Verifiable Rewards
Training focused on tasks with objectively verifiable answers: math problems, coding challenges, logical puzzles. No subjective human judgment needed.
Emergent Behaviors
The model spontaneously developed extended thinking, self-correction, and reflection—behaviors that previous models only learned from human demonstrations.
Readability Challenges
Pure RL models can develop unusual reasoning patterns that are hard to interpret. DeepSeek added a small amount of human data to improve readability.
RL Paradigm vs. Traditional RLHF
These approaches solve different problems and can be complementary.
RLHF Approach
Learn from human preferences. Requires expensive human labeling. Good for subjective tasks like writing quality and helpfulness.
RL Paradigm Approach
Learn from verifiable outcomes. No human labeling needed. Excellent for reasoning, math, and coding where correctness is objective.
Hybrid Approach
Modern models often combine both: RL for reasoning capabilities, RLHF for alignment and user preferences.
Key Alignment Concepts
Fundamental ideas in AI alignment research.
Outer Alignment
Ensuring the training objective (reward function) correctly captures what we want. Even perfect optimization of a misspecified objective leads to bad outcomes.
Inner Alignment
Ensuring the learned model actually optimizes for the training objective, not some proxy goal that happens to correlate during training.
Specification Problem
The fundamental difficulty of precisely stating what we want in all situations. Human values are complex, contextual, and sometimes contradictory.
Robustness
Maintaining alignment under distribution shift, adversarial pressure, and novel situations the model wasn't trained on.
Deceptive Alignment
A theoretical risk where a model appears aligned during training but pursues different goals when deployed—behaving well only because it's being evaluated.
Goal Misgeneralization
When a model learns a proxy goal that works in training but fails in deployment. Example: learning to get positive feedback rather than being genuinely helpful.
DPO vs RLHF: A Deep Comparison
Direct Preference Optimization (DPO) and Reinforcement Learning from Human Feedback (RLHF) are the two dominant approaches for aligning LLMs with human preferences. Understanding their differences is crucial for choosing the right technique.
RLHF: The Traditional Approach
RLHF uses a separate reward model trained on human preferences, then optimizes the LLM using reinforcement learning (typically PPO) to maximize that reward.
Step 1: Collect Preferences
Humans compare pairs of model outputs and select which they prefer. This creates a dataset of preference rankings.
Step 2: Train Reward Model
A separate neural network learns to predict human preferences, assigning scores to model outputs.
Step 3: RL Optimization
Use PPO (Proximal Policy Optimization) to update the LLM to generate outputs that maximize the reward model's scores.
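For step 2, the reward model is typically trained with a pairwise (Bradley-Terry style) loss over preference pairs. A minimal sketch, with random scores standing in for the reward model's outputs:

```python
import torch
import torch.nn.functional as F

# Sketch of the pairwise loss used to train the reward model. In a real setup,
# r_chosen and r_rejected are scalar scores the reward model assigns to the
# human-preferred and rejected response for the same prompt.
r_chosen = torch.randn(8, requires_grad=True)    # scores for preferred outputs
r_rejected = torch.randn(8, requires_grad=True)  # scores for rejected outputs

# The loss is minimized when the chosen response scores above the rejected one.
reward_model_loss = -F.logsigmoid(r_chosen - r_rejected).mean()
reward_model_loss.backward()  # gradient step widens the margin
```

Step 3 then uses PPO to push the policy toward higher reward-model scores, subject to the KL penalty shown earlier so the model does not drift too far from the SFT checkpoint.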
DPO: The Simplified Alternative
DPO skips the reward model entirely, directly optimizing the LLM on preference data using a clever mathematical reformulation.
Step 1: Collect Preferences
Same as RLHF—humans compare pairs of outputs and indicate which they prefer.
Step 2: Direct Optimization
Instead of training a separate reward model, DPO directly updates the LLM to increase probability of preferred outputs.
Step 3: No RL Required
Uses standard supervised learning techniques, avoiding the instability and complexity of reinforcement learning.
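The reformulation boils down to a single differentiable loss over preference pairs. A minimal sketch, with random log-probabilities standing in for the policy and the frozen reference model:

```python
import torch
import torch.nn.functional as F

# Sketch of the DPO loss for a batch of preference pairs. In practice the
# log-probabilities come from the model being trained (policy) and a frozen
# copy of the SFT model (reference); random values stand in for them here.
beta = 0.1
policy_chosen_logp   = torch.randn(8, requires_grad=True)  # log pi_theta(y_chosen | x)
policy_rejected_logp = torch.randn(8, requires_grad=True)  # log pi_theta(y_rejected | x)
ref_chosen_logp      = torch.randn(8)                      # log pi_ref(y_chosen | x)
ref_rejected_logp    = torch.randn(8)                      # log pi_ref(y_rejected | x)

# Implicit rewards are the beta-scaled log-ratios between policy and reference.
chosen_margin   = policy_chosen_logp - ref_chosen_logp
rejected_margin = policy_rejected_logp - ref_rejected_logp

# Ordinary supervised objective: widen the gap between chosen and rejected.
dpo_loss = -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
dpo_loss.backward()
```

Because this is a standard differentiable loss, training uses the same tooling as SFT, which is where DPO's stability advantage comes from.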
Head-to-Head Comparison
| Aspect | RLHF | DPO |
|---|---|---|
| Complexity | High: requires reward model + RL training | Low: single-stage supervised learning |
| Reward Model | Required (separate neural network) | Not needed (implicit in loss function) |
| Training Stability | Can be unstable, requires careful tuning | Generally more stable and predictable |
| Flexibility | More flexible, reward model reusable | Less flexible, tied to specific preferences |
| Used By | GPT-4, Claude, early Llama models | Llama 3, Zephyr, many open-source models |
GRPO: Group Relative Policy Optimization
GRPO is a policy-optimization technique developed by DeepSeek that uses relative rewards within groups of responses, eliminating the need for a separate value (critic) model while maintaining training stability.
How GRPO Works
Instead of a learned value function supplying the baseline, GRPO samples multiple responses to the same prompt, scores them, and uses each response's reward relative to the group average to compute policy gradients.
Generate Response Group
For each prompt, generate multiple candidate responses (typically 4-16) from the current policy.
Score Within Group
Score each response in the group. The scores can come from verifiable rewards (for math/code) or from a learned preference model.
Relative Gradient Update
Update the policy to increase the probability of responses that score above the group average and decrease the probability of those that score below it.
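A minimal sketch of the group-relative advantage at the heart of GRPO, using made-up binary rewards for a group of eight responses to one prompt:

```python
import torch

# Rewards for G = 8 sampled responses to the same prompt, e.g. 1.0 if a
# verifiable checker accepts the final answer and 0.0 otherwise (made up here).
group_rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])

# The group mean acts as the baseline; standardizing within the group gives
# each response an advantage without any learned value (critic) network.
advantages = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-6)

# Above-average responses get positive advantages (their tokens are made more
# likely in the policy-gradient update); below-average responses get negative ones.
print(advantages)
```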
Advantages
- No separate value (critic) model needed—reduces memory and complexity
- More stable than PPO with a learned critic—group-relative baselines are more robust than absolute value estimates
- Works well with verifiable rewards (math, code) and learned preferences
Key Applications
- DeepSeek R1 reasoning model training
- Mathematical and coding task optimization
- Efficient training without the overhead of a separate critic model
Synthetic Data for Alignment
Using AI models to generate training data is revolutionizing alignment. This approach can scale beyond human annotation capacity while maintaining quality through careful design.
Synthetic Data Generation Methods
Several techniques have emerged for generating high-quality synthetic training data for alignment.
Constitutional AI (Anthropic)
The model critiques and revises its own outputs based on a set of principles. The AI generates both the problematic response and the improved version, creating preference pairs without human labeling.
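A minimal sketch of the critique-and-revise loop, assuming a hypothetical `generate(prompt)` helper that wraps the model. Real constitutions contain many principles, and the loop is run at scale to build preference datasets.

```python
PRINCIPLE = "Choose the response that is least likely to be harmful or misleading."

def constitutional_pair(user_prompt: str, generate) -> tuple[str, str]:
    """Produce a (rejected, preferred) pair with no human labeling.

    `generate` is a hypothetical callable that sends a prompt to the model
    and returns its text response.
    """
    initial = generate(user_prompt)
    critique = generate(
        f"Response:\n{initial}\n\n"
        f"Critique this response according to the principle: {PRINCIPLE}"
    )
    revision = generate(
        f"Response:\n{initial}\n\nCritique:\n{critique}\n\n"
        "Rewrite the response so that it satisfies the principle."
    )
    return initial, revision  # usable as a preference pair: revision preferred over initial
```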
Self-Instruct & Evol-Instruct
Models generate their own instruction-response pairs, which are then filtered for quality. Evol-Instruct (used in WizardLM) iteratively makes instructions more complex.
Model Distillation
A larger, more capable model generates training data for a smaller model. This transfers knowledge and alignment properties, as seen in many open-source models trained on GPT-4 outputs.
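As a rough sketch of how distillation-style data generation might look, again assuming a hypothetical `teacher_generate(prompt)` wrapper around the stronger model; production pipelines add deduplication, scoring models, and much stricter quality filters.

```python
def build_distillation_dataset(instructions, teacher_generate, min_chars=20):
    """Collect (instruction, response) pairs from a stronger teacher model.

    `teacher_generate` is a hypothetical callable wrapping the teacher model;
    the length check is a deliberately crude stand-in for real quality filters.
    """
    dataset = []
    for instruction in instructions:
        response = teacher_generate(instruction)
        if len(response.strip()) >= min_chars:  # drop empty or trivial outputs
            dataset.append({"instruction": instruction, "response": response})
    return dataset  # used as SFT data for the smaller student model
```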
Benefits
- Massive scale—generate millions of examples cheaply
- Consistent quality—no human annotator fatigue or disagreement
- Targeted generation—create data for specific weaknesses
Risks & Limitations
- Model collapse—training on AI-generated data can degrade capabilities
- Bias amplification—AI biases get reinforced in synthetic data
- Quality ceiling—synthetic data quality limited by source model
Alignment Techniques
RLHF (Reinforcement Learning from Human Feedback)
Train a reward model on human preferences, then use RL to optimize the LLM against it. The dominant alignment technique since InstructGPT and ChatGPT popularized it.
Constitutional AI (CAI)
Define principles (a "constitution") and have the model critique and revise its own outputs. Reduces reliance on human labelers and scales better.
Direct Preference Optimization (DPO)
Skip the reward model—directly optimize the LLM on preference data. Simpler and more stable than RLHF.
Red Teaming
Adversarial testing by humans or other AI models to find failure modes, jailbreaks, and harmful outputs before deployment.
Interpretability
Understanding what models are actually learning internally. Crucial for verifying alignment rather than just measuring behavior.
Safety Filters & Guardrails
Additional layers that filter inputs/outputs for harmful content. A defense-in-depth measure, not a replacement for alignment.
Fine-Tuning vs. Alignment
Fine-tuning and alignment are related but distinct concepts.
Fine-Tuning
Adapting a model to new tasks or domains by training on task-specific data. Can be done for any purpose.
Alignment
Specifically making a model's behavior match human values and intentions. Usually implemented through fine-tuning, but defined by its goal rather than its method.
Post-Training
The umbrella term for everything after pretraining: SFT, RLHF, specialized fine-tuning, safety training, etc.
Key Takeaways
1. LLM training has distinct stages: pretraining → SFT → RLHF → specialized alignment
2. The RL paradigm (e.g., DeepSeek R1-Zero) shows reasoning can emerge from pure RL without human demonstrations
3. RLHF aligns models with human preferences; pure RL optimizes for verifiable outcomes
4. Modern models often combine multiple techniques: SFT for instruction following, RLHF for preferences, RL for reasoning
5. Understanding the training pipeline helps you understand model behavior and limitations
6. The field is rapidly evolving—new paradigms like pure RL are changing how we think about training