What is Fine-Tuning?
You have a pre-trained language model with billions of parameters that knows a lot about the world. But you want it to be great at a specific task — writing legal briefs, coding in Rust, or speaking like your brand. Fine-tuning adapts the model by continuing training on your specialized data.
"The problem: full fine-tuning means updating ALL parameters."
For a 70B parameter model, that means storing and updating 70 billion weights. You need a full copy of the model in memory, plus optimizer states (2-3x the model size). That's hundreds of gigabytes of VRAM — expensive, slow, and impractical for most teams.
The LoRA Insight
LoRA (Low-Rank Adaptation) is based on a key observation: the weight updates learned during fine-tuning tend to have low intrinsic rank. Instead of updating a huge d×d weight matrix W directly, you decompose the update as ΔW = A × B, where A is d×r and B is r×d, with r much smaller than d. The base weight W stays frozen; only A and B are trained, which is 2dr parameters instead of d².
LoRA Matrix Decomposition
Adjust the rank r to see how LoRA decomposes a large weight update into two small matrices.
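The decomposition above can be sketched in a few lines of NumPy (a toy illustration, not a training loop; the function name `lora_forward` and the sizes are made up for this example):

```python
import numpy as np

d, r = 512, 8                        # hidden size and LoRA rank (r << d)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen pre-trained weight, never updated
A = rng.normal(size=(d, r)) * 0.01   # trainable, d x r
B = np.zeros((r, d))                 # trainable, r x d (zero init so dW starts at 0)

def lora_forward(x):
    # Equivalent to x @ (W + A @ B), but never materializes the d x d update
    return x @ W + (x @ A) @ B

x = rng.normal(size=(1, d))
y = lora_forward(x)
```

Note the zero initialization of B: at the start of training ΔW = A × B is exactly zero, so the adapted model begins as the unchanged base model.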
Why LoRA is Easy to Train
By only training the small A and B matrices while keeping the base model frozen, LoRA dramatically reduces memory, compute, and storage requirements.
VRAM & Storage Comparison
Select a model size to compare GPU VRAM needed for full fine-tuning vs LoRA.
Higher rank → more trainable parameters → more VRAM and larger adapter files
~3,338× smaller — you can store hundreds of adapters for different tasks!
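Back-of-the-envelope: for a single d×d weight adapted at rank r, the trainable-parameter ratio is d² / 2dr = d / 2r. A quick check (d=4096 and r=8 are just example values; the ~3,338× figure above also depends on storage format and which layers get adapters):

```python
d, r = 4096, 8
full_params = d * d            # one matrix, full fine-tuning
lora_params = d * r + r * d    # the A and B matrices together
print(full_params, lora_params, full_params // lora_params)  # 16777216 65536 256
```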
Less Memory
Only the small A and B matrices need gradients and optimizer states.
Faster Training
Far fewer parameters to update means faster iterations.
Hot-Swappable
Keep one base model, swap tiny adapters at inference time for different tasks.
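Hot-swapping can be sketched like this (the adapter names and sizes are hypothetical; real serving stacks do the equivalent with batched multi-adapter kernels):

```python
import numpy as np

d, r = 64, 4
rng = np.random.default_rng(1)
W = rng.normal(size=(d, d))   # one frozen base weight, shared by every task

# Each task ships only a tiny (A, B) pair
adapters = {
    "legal":  (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
    "coding": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
}

def forward(x, task):
    A, B = adapters[task]     # swap the adapter; the base model is untouched
    return x @ W + (x @ A) @ B

x = rng.normal(size=(1, d))
y_legal = forward(x, "legal")
y_code = forward(x, "coding")
```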
No Catastrophic Forgetting
Because the base model weights stay completely frozen, LoRA can't destroy the model's existing knowledge: remove the adapter and you recover the original model exactly. This is a major advantage over full fine-tuning, where aggressive training can cause the model to forget its general capabilities.
Use Cases
LoRA adapters are used everywhere to specialize foundation models:
Task-Specific Adaptation
Train adapters for coding, medical diagnosis, legal analysis, or customer support. Each domain gets its own small adapter.
Style & Tone Adaptation
Match a specific brand voice, switch between formal and casual, or adapt writing style without retraining the whole model.
Language Adaptation
Improve performance in underrepresented languages by training a LoRA on language-specific data.
Instruction Following
Make a base model follow instructions better by training an adapter on instruction-response pairs.
When NOT to Use LoRA
LoRA is powerful, but it's not the right tool for every job:
Prompt Engineering Would Suffice
If you can get the behavior you want with a good system prompt or few-shot examples, don't train an adapter. It's cheaper, faster, and easier to iterate on.
You Need Broad New Knowledge
LoRA is great for style and behavior, but struggles to inject large amounts of factual knowledge. Use RAG (retrieval) instead for knowledge-heavy tasks.
Your Dataset Is Tiny or Noisy
With fewer than ~100 quality examples, LoRA will overfit or barely learn anything. Clean, curated data is essential — garbage in, garbage out.
You Need Real-Time Adaptation
LoRA requires a training step. If your use case needs the model to adapt on-the-fly to new information, use in-context learning or RAG instead.
Why LoRA Isn't Used for Pre-Training
LoRA is fantastic for adaptation, but it's fundamentally limited for learning brand-new knowledge from scratch. Here's why:
Rank vs Approximation Quality
See how increasing rank improves task-specific adaptation but struggles with general knowledge.
Adapting to a specific domain — saturates quickly at moderate rank
Learning fundamentally new knowledge — needs full-rank updates
How well the low-rank approximation captures arbitrary weight updates
✅ Sweet spot: ranks 8-64 typically offer the best tradeoff, giving excellent task adaptation with minimal parameters. Most practitioners use r=8 or r=16.
Low-Rank Constraint
LoRA constrains updates to a low-rank subspace. Fine-tuning changes are empirically low-rank (small adaptations), but pre-training needs to learn fundamental representations that are full-rank.
Limited Expressiveness
A rank-8 update to a 4096×4096 matrix can only capture a tiny fraction of possible changes. Pre-training needs the full expressiveness of unconstrained weight updates.
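The constraint is easy to verify numerically: whatever values A and B take, rank(A × B) can never exceed r, and the update has only 2dr degrees of freedom instead of d² (the sizes below are illustrative):

```python
import numpy as np

d, r = 256, 8
rng = np.random.default_rng(2)
A = rng.normal(size=(d, r))
B = rng.normal(size=(r, d))
delta = A @ B                 # the LoRA weight update

# The update is capped at rank r no matter how A and B are trained
print(np.linalg.matrix_rank(delta))   # 8

# Degrees of freedom: 2*d*r for LoRA vs d*d for an unconstrained update
print(2 * d * r, d * d)               # 4096 65536
```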
Diminishing Returns
As you increase the rank to capture more complex changes, you approach the cost of full fine-tuning anyway — at that point, LoRA offers no advantage.
LoRA Variants & Evolution
The original LoRA paper spawned a family of improvements. Click each card to learn more.
QLoRA
Quantized base model + LoRA adapters = fine-tuning on consumer GPUs.
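QLoRA actually uses 4-bit NF4 quantization with paged optimizers; a crude int8 absmax stand-in conveys the core idea of a compressed frozen base plus full-precision adapters (all names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8
W = rng.normal(size=(d, d)).astype(np.float32)

# Crude absmax int8 quantization of the frozen base weight
scale = np.abs(W).max() / 127.0
W_q = np.round(W / scale).astype(np.int8)    # 1 byte/param instead of 4

A = rng.normal(size=(d, r)).astype(np.float32) * 0.01
B = np.zeros((r, d), dtype=np.float32)       # adapters stay full precision

def forward(x):
    W_deq = W_q.astype(np.float32) * scale   # dequantize on the fly
    return x @ W_deq + (x @ A) @ B

x = rng.normal(size=(1, d)).astype(np.float32)
y = forward(x)
```

Gradients flow only into A and B, so the memory-hungry optimizer states exist only for the tiny adapter, while the base weights sit in memory at a quarter of their original size.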
DoRA (Weight-Decomposed LoRA)
Separates weight magnitude from direction for better training dynamics.
LoRA+
Different learning rates for A and B matrices = faster convergence.
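The LoRA+ trick amounts to two step sizes in the same update; a minimal SGD sketch (the 16× ratio, the sizes, and the stand-in gradients are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
d, r = 32, 4
A = rng.normal(size=(d, r)) * 0.01   # trainable
B = np.zeros((r, d))                 # trainable, zero init

lr_A = 1e-4
lr_B = 16 * lr_A                     # LoRA+: give B a larger learning rate

# Stand-in gradients; in practice these come from backprop
grad_A = rng.normal(size=A.shape)
grad_B = rng.normal(size=B.shape)

A -= lr_A * grad_A
B -= lr_B * grad_B
```

In a framework like PyTorch, the same idea is expressed by putting A and B in separate optimizer parameter groups with different `lr` values.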
Key Takeaways
1. LoRA decomposes weight updates into two small matrices (A×B), reducing trainable parameters by 99%+ while maintaining quality
2. The base model stays frozen — no catastrophic forgetting, and you can swap tiny adapters for different tasks at inference time
3. LoRA works because fine-tuning changes are empirically low-rank: you don't need full-rank updates for task adaptation
4. QLoRA extends this further by quantizing the base model, enabling fine-tuning of 70B+ models on consumer hardware
5. LoRA is not suitable for pre-training — learning fundamental knowledge requires full-rank, unconstrained weight updates