What is Fine-Tuning?
You have a pre-trained language model with billions of parameters that knows a lot about the world. But you want it to be great at a specific task — writing legal briefs, coding in Rust, or speaking like your brand. Fine-tuning adapts the model by continuing training on your specialized data.
"The problem: full fine-tuning means updating ALL parameters."
For a 70B parameter model, that means storing and updating 70 billion weights. You need a full copy of the model in memory, plus optimizer states (2-3x the model size). That's hundreds of gigabytes of VRAM — expensive, slow, and impractical for most teams.
The LoRA Insight
LoRA (Low-Rank Adaptation) is based on a key observation: the weight updates learned during fine-tuning tend to have low intrinsic rank. Instead of updating a huge d×d weight matrix W directly, you decompose the update as ΔW = A × B, where A is d×r and B is r×d, with r much smaller than d. The base weight W stays frozen; only A and B are trained, which is 2dr parameters instead of d².
LoRA Matrix Decomposition
Adjust the rank r to see how LoRA decomposes a large weight update into two small matrices.
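The decomposition above can be sketched in a few lines of NumPy (a toy illustration, not a training loop; the function name `lora_forward` and the sizes are made up for this example):

```python
import numpy as np

d, r = 512, 8                        # hidden size and LoRA rank (r << d)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen pre-trained weight, never updated
A = rng.normal(size=(d, r)) * 0.01   # trainable, d x r
B = np.zeros((r, d))                 # trainable, r x d (zero init so dW starts at 0)

def lora_forward(x):
    # Equivalent to x @ (W + A @ B), but never materializes the d x d update
    return x @ W + (x @ A) @ B

x = rng.normal(size=(1, d))
y = lora_forward(x)
```

Note the zero initialization of B: at the start of training ΔW = A × B is exactly zero, so the adapted model begins as the unchanged base model.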
Why LoRA is Easy to Train
By only training the small A and B matrices while keeping the base model frozen, LoRA dramatically reduces memory, compute, and storage requirements.
VRAM & Storage Comparison
Select a model size to compare GPU VRAM needed for full fine-tuning vs LoRA.
Higher rank → more trainable parameters → more VRAM and larger adapter files
~3,338× smaller — you can store hundreds of adapters for different tasks!
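Back-of-the-envelope: for a single d×d weight adapted at rank r, the trainable-parameter ratio is d² / 2dr = d / 2r. A quick check (d=4096 and r=8 are just example values; the ~3,338× figure above also depends on storage format and which layers get adapters):

```python
d, r = 4096, 8
full_params = d * d            # one matrix, full fine-tuning
lora_params = d * r + r * d    # the A and B matrices together
print(full_params, lora_params, full_params // lora_params)  # 16777216 65536 256
```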
Less Memory
Only the small A and B matrices need gradients and optimizer states.
Faster Training
Far fewer parameters to update means faster iterations.
Hot-Swappable
Keep one base model, swap tiny adapters at inference time for different tasks.
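Hot-swapping can be sketched like this (the adapter names and sizes are hypothetical; real serving stacks do the equivalent with batched multi-adapter kernels):

```python
import numpy as np

d, r = 64, 4
rng = np.random.default_rng(1)
W = rng.normal(size=(d, d))   # one frozen base weight, shared by every task

# Each task ships only a tiny (A, B) pair
adapters = {
    "legal":  (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
    "coding": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
}

def forward(x, task):
    A, B = adapters[task]     # swap the adapter; the base model is untouched
    return x @ W + (x @ A) @ B

x = rng.normal(size=(1, d))
y_legal = forward(x, "legal")
y_code = forward(x, "coding")
```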
No Catastrophic Forgetting
Because the base model weights stay completely frozen, LoRA can't destroy the model's existing knowledge: remove the adapter and you recover the original model exactly. This is a major advantage over full fine-tuning, where aggressive training can cause the model to forget its general capabilities.
Use Cases
LoRA adapters are used everywhere to specialize foundation models:
Task-Specific Adaptation
Train adapters for coding, medical diagnosis, legal analysis, or customer support. Each domain gets its own small adapter.
Style & Tone Adaptation
Match a specific brand voice, switch between formal and casual, or adapt writing style without retraining the whole model.
Language Adaptation
Improve performance in underrepresented languages by training a LoRA on language-specific data.
Instruction Following
Make a base model follow instructions better by training an adapter on instruction-response pairs.
When NOT to Use LoRA
LoRA is powerful, but it's not the right tool for every job:
Prompt Engineering Would Suffice
If you can get the behavior you want with a good system prompt or few-shot examples, don't train an adapter. It's cheaper, faster, and easier to iterate on.
You Need Broad New Knowledge
LoRA is great for style and behavior, but struggles to inject large amounts of factual knowledge. Use RAG (retrieval) instead for knowledge-heavy tasks.
Your Dataset Is Tiny or Noisy
With fewer than ~100 quality examples, LoRA will overfit or barely learn anything. Clean, curated data is essential — garbage in, garbage out.
You Need Real-Time Adaptation
LoRA requires a training step. If your use case needs the model to adapt on-the-fly to new information, use in-context learning or RAG instead.
Why LoRA Isn't Used for Pre-Training
LoRA is fantastic for adaptation, but it's fundamentally limited for learning brand-new knowledge from scratch. Here's why:
Rank vs Approximation Quality
See how increasing rank improves task-specific adaptation but struggles with general knowledge.
Adapting to a specific domain — saturates quickly at moderate rank
Learning fundamentally new knowledge — needs full-rank updates
How well the low-rank approximation captures arbitrary weight updates
✅ Sweet spot: ranks 8-64 typically offer the best tradeoff, giving excellent task adaptation with minimal parameters. Most practitioners use r=8 or r=16.
Low-Rank Constraint
LoRA constrains updates to a low-rank subspace. Fine-tuning changes are empirically low-rank (small adaptations), but pre-training needs to learn fundamental representations that are full-rank.
Limited Expressiveness
A rank-8 update to a 4096×4096 matrix can only capture a tiny fraction of possible changes. Pre-training needs the full expressiveness of unconstrained weight updates.
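The constraint is easy to verify numerically: whatever values A and B take, rank(A × B) can never exceed r, and the update has only 2dr degrees of freedom instead of d² (the sizes below are illustrative):

```python
import numpy as np

d, r = 256, 8
rng = np.random.default_rng(2)
A = rng.normal(size=(d, r))
B = rng.normal(size=(r, d))
delta = A @ B                 # the LoRA weight update

# The update is capped at rank r no matter how A and B are trained
print(np.linalg.matrix_rank(delta))   # 8

# Degrees of freedom: 2*d*r for LoRA vs d*d for an unconstrained update
print(2 * d * r, d * d)               # 4096 65536
```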
Diminishing Returns
As you increase the rank to capture more complex changes, you approach the cost of full fine-tuning anyway — at that point, LoRA offers no advantage.
LoRA Variants & Evolution
The original LoRA paper spawned a family of improvements. Click each card to learn more.
QLoRA
Quantized base model + LoRA adapters = fine-tuning on consumer GPUs.
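QLoRA actually uses 4-bit NF4 quantization with paged optimizers; a crude int8 absmax stand-in conveys the core idea of a compressed frozen base plus full-precision adapters (all names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8
W = rng.normal(size=(d, d)).astype(np.float32)

# Crude absmax int8 quantization of the frozen base weight
scale = np.abs(W).max() / 127.0
W_q = np.round(W / scale).astype(np.int8)    # 1 byte/param instead of 4

A = rng.normal(size=(d, r)).astype(np.float32) * 0.01
B = np.zeros((r, d), dtype=np.float32)       # adapters stay full precision

def forward(x):
    W_deq = W_q.astype(np.float32) * scale   # dequantize on the fly
    return x @ W_deq + (x @ A) @ B

x = rng.normal(size=(1, d)).astype(np.float32)
y = forward(x)
```

Gradients flow only into A and B, so the memory-hungry optimizer states exist only for the tiny adapter, while the base weights sit in memory at a quarter of their original size.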
DoRA (Weight-Decomposed LoRA)
Separates weight magnitude from direction for better training dynamics.
LoRA+
Different learning rates for A and B matrices = faster convergence.
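The LoRA+ trick amounts to two step sizes in the same update; a minimal SGD sketch (the 16× ratio, the sizes, and the stand-in gradients are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
d, r = 32, 4
A = rng.normal(size=(d, r)) * 0.01   # trainable
B = np.zeros((r, d))                 # trainable, zero init

lr_A = 1e-4
lr_B = 16 * lr_A                     # LoRA+: give B a larger learning rate

# Stand-in gradients; in practice these come from backprop
grad_A = rng.normal(size=A.shape)
grad_B = rng.normal(size=B.shape)

A -= lr_A * grad_A
B -= lr_B * grad_B
```

In a framework like PyTorch, the same idea is expressed by putting A and B in separate optimizer parameter groups with different `lr` values.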
Key Takeaways
1. LoRA decomposes weight updates into two small matrices (A×B), reducing trainable parameters by 99%+ while maintaining quality
2. The base model stays frozen — no catastrophic forgetting, and you can swap tiny adapters for different tasks at inference time
3. LoRA works because fine-tuning changes are empirically low-rank: you don't need full-rank updates for task adaptation
4. QLoRA extends this further by quantizing the base model, enabling fine-tuning of 70B+ models on consumer hardware
5. LoRA is not suitable for pre-training — learning fundamental knowledge requires full-rank, unconstrained weight updates