What is Quantization?
Quantization is the process of reducing the numerical precision of model weights from 32-bit floating point (FP32) to lower bit representations like FP16, INT8, or INT4. This dramatically reduces memory requirements and speeds up inference.
"Like compressing a high-resolution photo to fit on your phone—you lose some detail, but the image remains recognizable and useful."
The key insight is that neural networks are surprisingly robust to precision loss. Most weights can be stored with far fewer bits without catastrophic quality degradation.
Why Quantize?
Quantization unlocks the ability to run large models on consumer hardware and reduces inference costs in production.
Memory Reduction
A 70B parameter model at FP16 requires ~140GB VRAM. At INT4, it fits in ~35GB—runnable on high-end consumer GPUs.
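As a rough sanity check on these figures, here is a minimal back-of-envelope sketch (weights only; KV cache, activations, and runtime overhead are extra):

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

params_70b = 70e9
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: ~{weight_memory_gb(params_70b, bits):.0f} GB")
# FP32: ~280 GB, FP16: ~140 GB, INT8: ~70 GB, INT4: ~35 GB
```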
Faster Inference
Lower precision arithmetic is faster. INT8 operations are 2-4x faster than FP32 on modern hardware.
Lower Costs
Smaller models mean fewer GPUs, lower cloud costs, and feasibility for edge deployment.
Democratization
Enables researchers and hobbyists to run frontier-class models locally without enterprise hardware.
Quantization Visualizer
See how precision affects model size and quality. At 4 bits a model is highly compressed yet most users notice no quality difference, making it the sweet spot for consumer hardware.
Quantization Levels Explained
Each precision level represents a different tradeoff between model size and output quality.
| Level | Bits | Size | Accuracy | Use Case |
|---|---|---|---|---|
| FP32 (Full) | 32 | 100% | 100% | Training, reference inference |
| FP16 (Half) | 16 | 50% | ~99% | Standard inference |
| INT8 | 8 | 25% | ~97% | Production deployment |
| INT4 | 4 | 12.5% | ~90-95% | Consumer GPUs, edge |
| INT2 | 2 | 6.25% | ~70-80% | Extreme edge cases |
Recommendation: Q4 is the Sweet Spot
For most users running large models (70B+ parameters) locally:
- Q4 (INT4) provides an excellent quality-to-memory ratio
- Most users cannot distinguish Q4 output from FP16 in blind tests
- Enables running 70B models on 24GB consumer GPUs (with partial offload to CPU)
- Recommended formats: Q4_K_M or Q4_K_S for GGUF models
For critical applications requiring maximum accuracy, use FP16 or INT8. For casual use and experimentation, Q4 is ideal.
Quantization Techniques
Different methods for converting models to lower precision.
PTQ (Post-Training Quantization)
Apply quantization to an already-trained model. Fast and simple, but may incur slightly higher accuracy loss than quantization-aware training. Works by calibrating quantization parameters (scales and zero points) on a small dataset.
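A minimal sketch of the idea, assuming weight-only symmetric INT8 quantization with absmax calibration (real PTQ pipelines typically also calibrate activation ranges on a small dataset):

```python
import numpy as np

def calibrate_scale(weights: np.ndarray, num_bits: int = 8) -> float:
    """Symmetric absmax calibration: map the largest |w| onto the int grid."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for INT8
    return float(np.abs(weights).max()) / qmax

def quantize(weights: np.ndarray, scale: float, num_bits: int = 8) -> np.ndarray:
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Example: quantize one weight matrix and measure the round-trip error.
w = (np.random.randn(1024, 1024) * 0.02).astype(np.float32)
scale = calibrate_scale(w)
w_q = quantize(w, scale)
print(f"mean abs error: {np.abs(w - dequantize(w_q, scale)).mean():.6f}")
```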
QAT (Quantization-Aware Training)
Include quantization in the training process. The model learns to be robust to precision loss, yielding better accuracy but requiring full retraining.
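A minimal PyTorch sketch of the standard "fake quantization" trick with a straight-through estimator, assuming a single linear layer for illustration:

```python
import torch

def fake_quantize(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Quantize-dequantize in the forward pass; identity in the backward pass."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.detach().abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: forward sees w_q, gradients flow as if w.
    return w + (w_q - w).detach()

# The layer trains against its own quantized weights, so it learns parameters
# that remain accurate after rounding at inference time.
layer = torch.nn.Linear(512, 512)
x = torch.randn(8, 512)
y = torch.nn.functional.linear(x, fake_quantize(layer.weight), layer.bias)
y.sum().backward()          # gradients reach layer.weight through the STE
```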
GPTQ
One-shot quantization method designed for LLMs. Uses second-order information to minimize quantization error layer by layer. Popular for its speed and quality.
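A heavily simplified NumPy sketch of the core loop, assuming one linear layer, a single global 4-bit scale, and none of the paper's blocking, grouping, or ordering tricks; it only illustrates the "quantize a column, then spread its rounding error over the remaining columns via the inverse Hessian" idea:

```python
import numpy as np

def quantize_sym(x: np.ndarray, scale: float, qmax: int = 7) -> np.ndarray:
    """Round-to-nearest symmetric quantize-dequantize on a 4-bit grid."""
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def gptq_like(W: np.ndarray, X: np.ndarray, percdamp: float = 0.01) -> np.ndarray:
    """Quantize W ([out, in]) column by column, compensating each column's
    rounding error on the not-yet-quantized columns using the inverse
    Hessian H = X^T X built from calibration inputs X ([n, in])."""
    W = W.astype(np.float64).copy()
    Q = np.zeros_like(W)
    H = X.T.astype(np.float64) @ X.astype(np.float64)
    H += percdamp * np.mean(np.diag(H)) * np.eye(H.shape[0])   # damping
    U = np.linalg.cholesky(np.linalg.inv(H)).T   # upper Cholesky of H^-1
    scale = np.abs(W).max() / 7                  # one global 4-bit scale (toy)
    for i in range(W.shape[1]):
        q = quantize_sym(W[:, i], scale)
        err = (W[:, i] - q) / U[i, i]
        W[:, i:] -= np.outer(err, U[i, i:])      # error compensation
        Q[:, i] = q
    return Q

W = np.random.randn(256, 256) * 0.02
X = np.random.randn(512, 256)                    # calibration activations
print("mean |W - Q|:", np.abs(W - gptq_like(W, X)).mean())
```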
AWQ (Activation-aware Weight Quantization)
Identifies and preserves "salient" weights that matter most for accuracy. Achieves better quality than naive quantization by protecting important parameters.
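A toy NumPy sketch of the core idea under simplifying assumptions (a fixed scaling exponent, one global 4-bit scale, no group-wise quantization); the real method searches per layer for the best scaling and folds the compensating factor into the neighboring operation:

```python
import numpy as np

def awq_style_scales(acts: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Per-input-channel scales from average activation magnitude: channels
    with large activations are 'salient' and get finer effective precision."""
    return (np.abs(acts).mean(axis=0) + 1e-8) ** alpha

def quantize_int4(w: np.ndarray) -> np.ndarray:
    scale = np.abs(w).max() / 7
    return np.clip(np.round(w / scale), -8, 7) * scale   # quantize-dequantize

W = np.random.randn(1024, 1024).astype(np.float32) * 0.02   # [out, in]
X = np.random.randn(256, 1024).astype(np.float32)           # calibration acts

s = awq_style_scales(X)               # one scale per input channel
W_q = quantize_int4(W * s) / s        # protect salient channels, then undo
# In a real kernel the 1/s factor is folded into the activations (or the
# preceding op) so that (X / s) @ (W * s).T == X @ W.T mathematically.
```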
GGUF Format
File format used by llama.cpp for quantized models. Supports various quantization levels (Q2-Q8) and is the standard for local LLM deployment.
GGUF K-Quant Methods
Understanding the naming convention for GGUF quantized models.
| Method | Quality | Size | Use Case |
|---|---|---|---|
| Q2_K | Poor | Smallest | Extreme compression only |
| Q3_K_S | Low | Very Small | Memory-constrained systems |
| Q3_K_M | Low-Medium | Small | Budget hardware |
| Q3_K_L | Medium | Moderate | Better Q3 quality |
| Q4_K_S | Good | Small | Recommended balance |
| Q4_K_M | Very Good | Moderate | Best overall choice |
| Q5_K_S | Excellent | Large | Quality-focused |
| Q5_K_M | Excellent | Large | Near-FP16 quality |
| Q6_K | Near-perfect | Larger | Minimal loss |
| Q8_0 | Near-lossless | Largest | Reference quality |
K-Quant Naming Explained
- K = "K-quant"; uses importance-based quantization that varies precision by layer
- S (Small) = More aggressive quantization on attention layers, smaller files
- M (Medium) = Balanced quantization across all layers, best quality/size ratio
- L (Large) = Less quantization on important layers, better quality
Key Insight: K-quants are "mixed precision"—they quantize different layers differently based on their importance to model quality. Attention layers typically use higher precision than feed-forward layers.
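To make the mixed-precision idea concrete, here is a purely illustrative sketch; the tensor groups, parameter counts, and bit widths are hypothetical stand-ins, not the actual GGUF tensor layout or k-quant bit allocations:

```python
from typing import Callable

def total_gb(tensor_params: dict[str, float],
             bits_for: Callable[[str], float]) -> float:
    """Total weight size when each tensor group gets its own bit width."""
    return sum(n * bits_for(name) / 8 / 1e9 for name, n in tensor_params.items())

tensors = {                       # made-up split for a ~70B model
    "attn.q/k/v/o": 18e9,
    "ffn.up/gate/down": 48e9,
    "embeddings+head": 4e9,
}

def mixed_bits(name: str) -> float:
    # higher precision for the tensors treated as more important
    return 6.5 if ("attn" in name or "embed" in name) else 4.5

uniform = total_gb(tensors, lambda name: 4.5)   # everything at ~Q4-like bits
mixed = total_gb(tensors, mixed_bits)
print(f"uniform: ~{uniform:.0f} GB, mixed: ~{mixed:.0f} GB")
```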
Real-World Impact
Concrete examples of what quantization enables.
Llama 3.1 70B at Different Quants
A 70B parameter model requires ~140GB at FP16. With quantization:
- Q8: ~70GB — Fits on 2x A100 40GB or 1x H100
- Q4_K_M: ~40GB — Fits on 2x RTX 4090 or 1x A100 80GB
- Q3_K_M: ~30GB — Fits on a single RTX 4090 (24GB + some offload)
Quality Comparison
In blind tests comparing Q4_K_M to FP16 outputs:
- 85% of users could not identify which was quantized
- Perplexity increase of only 0.1-0.5 points on common benchmarks
- Code completion and reasoning tasks show minimal degradation
Cost Savings
Running a 70B model at Q4 instead of FP16 cuts its memory footprint by roughly 4x, which means fewer GPUs per deployment and correspondingly lower cloud inference costs.
Key Takeaways
1. Quantization reduces model memory by 2-16x with surprisingly small accuracy loss
2. Q4 (INT4) is the sweet spot for most local LLM use cases—excellent quality at 1/8th the memory
3. K-quant methods (Q4_K_M, Q5_K_S) are "mixed precision" and outperform uniform quantization
4. GPTQ and AWQ are the leading techniques for LLM quantization, with GGUF as the standard format
5. Quantization democratizes AI by enabling frontier models on consumer hardware
6. For critical applications, prefer higher precision (INT8/FP16); for experimentation, Q4 is ideal