What is Knowledge Distillation?
Knowledge distillation is a model compression technique in which a smaller "student" model is trained to replicate the behavior of a larger, more capable "teacher" model. Instead of being trained from scratch on raw data alone, the student learns from the teacher's output probability distributions, capturing not just what the teacher predicts but how confident it is across all possible predictions.
"Imagine a master chef teaching an apprentice—not just the recipes, but all the subtle intuitions: why this spice almost works, why that technique is close but not quite right."
Distillation transfers these nuanced judgments by sharing the full probability distribution, not just the final answer.
The Teacher-Student Paradigm
Distillation follows a straightforward two-phase process: first train a large, powerful teacher model, then use its outputs to train a smaller, efficient student.
Teacher Model
A large, high-capacity model (e.g., GPT-4, Claude Opus) trained on massive datasets. It has learned rich representations and nuanced decision boundaries. Its role is to generate soft probability distributions that encode its knowledge.
Student Model
A smaller, more efficient model designed for deployment. It learns by matching the teacher's probability distributions rather than just the ground truth labels. This allows it to capture the teacher's "dark knowledge"—the relationships between classes that hard labels discard.
The Key Insight: Distributions, Not Tokens
Why Distributions Make Distillation So Effective
The fundamental reason distillation works so well is that we train on full probability distributions, not single tokens or hard labels. When a teacher model processes "The capital of France is ___", it doesn't just output "Paris"—it produces a probability distribution over its entire vocabulary.
This distribution contains rich information: "Paris" gets 92%, but "Lyon" gets 3%, "Marseille" gets 1.5%, and "Berlin" gets 0.8%. These "wrong" answers encode the teacher's understanding of geography, similarity between cities, and conceptual relationships. A hard label of just "Paris" throws all of this knowledge away.
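As a rough sketch of what this looks like in practice, here is how one might inspect a teacher's next-token distribution with Hugging Face transformers. GPT-2 stands in as the teacher, so the actual probabilities will differ from the illustrative numbers above.

```python
# Minimal sketch: inspect a teacher's next-token distribution.
# GPT-2 is a stand-in teacher; any causal LM would work the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
teacher = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = teacher(**inputs).logits[0, -1]  # logits for the next token

probs = torch.softmax(logits, dim=-1)  # full distribution over the vocabulary
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(idx)])!r}: {p.item():.3f}")
```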
Hard Labels (Traditional Training)
Binary: either right or wrong. No nuance. The model learns nothing about the relationships between outputs.
Soft Labels (Distillation)
Rich signal: every probability encodes a relationship. The student learns that Lyon is more similar to Paris than Berlin is.
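In vector form, using the illustrative numbers above (classes ordered Paris, Lyon, Marseille, Berlin), the two training signals look like this:

```python
# Hard one-hot target vs. the teacher's soft target for the same example.
hard_target = [1.0, 0.0, 0.0, 0.0]        # says nothing about Lyon vs. Berlin
soft_target = [0.92, 0.03, 0.015, 0.008]  # Lyon is "closer" to right than Berlin
```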
Temperature & Distribution Softening
Hard Distribution (T=1)
At T=1, the dominant token overwhelms others. Little information in the tail.
Soft Distribution (T=3)
Higher temperature reveals relationships between tokens that hard labels hide.
At a moderate temperature such as T=3, the distribution is smoothed enough to reveal meaningful relationships between tokens without flattening toward uniform. This is typically the sweet spot for distillation.
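A tiny sketch makes the effect concrete. The logits below are invented to mirror the Paris example; dividing them by T before the softmax is all that temperature scaling does.

```python
# How temperature reshapes a distribution (logits are illustrative).
import torch

logits = torch.tensor([6.0, 2.6, 1.9, 1.3])  # Paris, Lyon, Marseille, Berlin
for T in (1.0, 3.0, 10.0):
    probs = torch.softmax(logits / T, dim=-1)
    print(f"T={T}: {[round(p, 3) for p in probs.tolist()]}")
# T=1 is sharply peaked, T=3 reveals the tail, T=10 is nearly uniform.
```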
Why Distillation Works
Distillation is remarkably effective because soft labels provide a much richer training signal than hard labels:
Richer Gradient Signal
Each training example provides information about all output classes simultaneously, not just the correct one. This means each example effectively teaches the student about thousands of relationships at once.
Dark Knowledge Transfer
The teacher's "mistakes" are informative. When the teacher assigns 3% probability to "Lyon" for a question about France's capital, it tells the student that Lyon is relevant to France—knowledge that hard labels completely discard.
Better Generalization
Students trained via distillation often generalize better than models trained on hard labels alone, even when the student has far fewer parameters. The soft labels act as a powerful regularizer.
Sample Efficiency
Because each training example carries more information (a full distribution vs. a single label), the student needs fewer examples to learn effectively. This reduces training time and data requirements.
The Distillation Loss
The training objective combines two losses: the standard cross-entropy with ground truth labels, and the KL divergence between teacher and student distributions:
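In one common formulation (following Hinton et al., 2015), with student logits $z_s$, teacher logits $z_t$, softmax $\sigma$, ground-truth labels $y$, and temperature $T$:

$$\mathcal{L} = \alpha \, T^2 \, \mathrm{KL}\big(\sigma(z_t/T) \,\|\, \sigma(z_s/T)\big) + (1 - \alpha) \, \mathrm{CE}\big(y, \sigma(z_s)\big)$$

The components: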
- CE (cross-entropy with ground truth): ensures the student still learns from real labels
- KL (KL divergence): measures how different the student's distribution is from the teacher's. The student is penalized for deviating from the teacher's soft probabilities.
- T (temperature): controls how soft/smooth the distributions are. Higher T reveals more inter-class relationships.
- α (alpha): balances the two loss terms. Typical values range from 0.1 to 0.9, with higher values placing more weight on matching the teacher.
The T² factor compensates for the scaling effect of temperature on gradients, ensuring the distillation loss and cross-entropy loss remain balanced regardless of temperature choice.
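A minimal PyTorch sketch of this combined objective follows; the function and argument names are ours, not a standard API.

```python
# Sketch of the combined distillation loss in PyTorch.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.5):
    # Soft half: KL divergence between temperature-softened distributions.
    # F.kl_div expects log-probabilities for the input and probabilities for
    # the target; the T**2 factor rescales gradients as described above.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    # Hard half: standard cross-entropy with the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: random logits for a batch of 4 examples over 10 classes.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
labels = torch.tensor([1, 3, 5, 7])
print(distillation_loss(student, teacher, labels))
```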
Types of Distillation
The approaches differ in what kind of knowledge is transferred from teacher to student:
Response-Based
The student mimics the teacher's final output distribution. This is the original and most common form, introduced by Hinton et al. (2015). Simple to implement and effective for classification and language modeling.
Feature-Based
The student learns to match intermediate representations (hidden states) of the teacher, not just the output. Captures deeper structural knowledge. Used in models like DistilBERT and TinyBERT.
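As a rough sketch of what feature matching looks like, the snippet below computes an MSE loss between student and teacher hidden states; the dimensions and the learned linear projection (needed when the two widths differ, as in TinyBERT-style setups) are illustrative.

```python
# Sketch: feature-based distillation via MSE on hidden states.
import torch
import torch.nn as nn

teacher_hidden = torch.randn(8, 128, 768)  # (batch, seq_len, teacher_dim)
student_hidden = torch.randn(8, 128, 312)  # (batch, seq_len, student_dim)

project = nn.Linear(312, 768)  # map the student's space into the teacher's
feature_loss = nn.functional.mse_loss(project(student_hidden), teacher_hidden)
print(feature_loss)
```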
Relation-Based
Transfers the relationships between different examples or layers, rather than individual outputs. Preserves how the teacher structures its internal representations and how it relates different inputs to each other.
Online Distillation
Teacher and student train simultaneously, learning from each other. No pre-trained teacher required. Useful when you cannot afford to train a massive teacher model first.
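One concrete instantiation is mutual learning, where two peer models each treat the other's softened predictions as a moving target. The sketch below is illustrative; names and shapes are ours.

```python
# Sketch of online (mutual) distillation: two peers train simultaneously,
# each matching the other's softened predictions.
import torch
import torch.nn.functional as F

def mutual_loss(logits_self, logits_peer, labels, T=3.0):
    # The peer's distribution is detached: each model treats the other
    # as a fixed teacher for the current step.
    kl = F.kl_div(
        F.log_softmax(logits_self / T, dim=-1),
        F.softmax(logits_peer.detach() / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    return F.cross_entropy(logits_self, labels) + kl

logits_a, logits_b = torch.randn(4, 10), torch.randn(4, 10)
labels = torch.tensor([0, 2, 4, 6])
loss_a = mutual_loss(logits_a, logits_b, labels)  # drives model A's update
loss_b = mutual_loss(logits_b, logits_a, labels)  # drives model B's update
```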
Real-World Examples
Distillation is used extensively in production AI systems:
DistilBERT (Hugging Face)
A distilled version of BERT that is 40% smaller, 60% faster, and retains 97% of BERT's language understanding (Sanh et al., 2019). Trained using a combination of response-based and feature-based distillation. One of the most widely deployed distilled models.
OpenAI GPT-4 to GPT-4o-mini
GPT-4o-mini is widely believed to be distilled from larger GPT-4 class models. It offers substantially lower latency and cost while maintaining competitive performance on most tasks. This pattern—a large frontier model distilled into a smaller, faster variant—has become standard practice.
DeepSeek R1 Distillation
DeepSeek released versions of its R1 reasoning model distilled into Qwen and Llama base models. These distilled variants bring advanced reasoning capabilities to much smaller, more deployable models, demonstrating that even complex chain-of-thought reasoning can be effectively distilled.
Key Takeaways
1. Knowledge distillation trains smaller models to replicate larger ones by learning from full probability distributions, not just final answers
2. The critical insight is that we train on distributions, not single tokens—soft labels encode rich relational knowledge ("dark knowledge") that hard labels discard entirely
3. Temperature softening reveals inter-class relationships hidden in the teacher's distribution, making distillation far more effective than simple label matching
4. Distilled models can retain 95-99% of teacher performance at a fraction of the size, making frontier AI capabilities accessible for real-world deployment
5. Distillation has become standard practice in the industry—most small, fast models you use daily (GPT-4o-mini, DistilBERT, Gemini Flash) are likely distilled from larger teachers