Quantization

How reducing numerical precision enables running large models on consumer hardware with minimal quality loss.

What is Quantization?

Quantization is the process of reducing the numerical precision of model weights from 32-bit floating point (FP32) to lower-precision representations such as FP16, INT8, or INT4. This dramatically reduces memory requirements and speeds up inference.

"Like compressing a high-resolution photo to fit on your phone—you lose some detail, but the image remains recognizable and useful."

The key insight is that neural networks are surprisingly robust to precision loss: most weights can be stored with far fewer bits without catastrophic quality degradation.
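
To make that concrete, here is a minimal NumPy sketch (function names are illustrative, not from any particular library) that quantizes an FP32 weight tensor to INT8 with a single symmetric scale and dequantizes it back, showing that the round-trip error stays small:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric INT8 quantization: map floats to integers in [-127, 127]."""
    scale = np.abs(weights).max() / 127.0          # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map INT8 values back to approximate FP32 weights."""
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32) * 0.1   # toy FP32 weight block
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Storage drops from 4 bytes to 1 byte per weight; error is bounded by ~scale/2.
print("max round-trip error:", np.abs(weights - restored).max())
```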

Why Quantize?

Quantization unlocks the ability to run large models on consumer hardware and reduces inference costs in production.

1. Memory Reduction: A 70B parameter model at FP16 requires ~140GB of VRAM. At INT4, it fits in ~35GB, runnable on high-end consumer GPUs (see the sketch after this list).

2. Faster Inference: Lower-precision arithmetic is faster. INT8 operations are 2-4x faster than FP32 on modern hardware.

3. Lower Costs: Smaller models mean fewer GPUs, lower cloud costs, and feasibility for edge deployment.

4. Democratization: Enables researchers and hobbyists to run frontier-class models locally without enterprise hardware.
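
The memory figures in item 1 are just bytes-per-weight arithmetic; a minimal sketch of that calculation (parameter count approximate, decimal gigabytes, weights only) is below:

```python
# Approximate VRAM needed for the weights alone (ignores KV cache and activations).
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4, "INT2": 2}

def weight_memory_gb(n_params: float, precision: str) -> float:
    bytes_total = n_params * BITS_PER_WEIGHT[precision] / 8
    return bytes_total / 1e9  # decimal GB

for precision in ("FP16", "INT8", "INT4"):
    print(f"70B @ {precision}: ~{weight_memory_gb(70e9, precision):.0f} GB")
# 70B @ FP16: ~140 GB
# 70B @ INT8: ~70 GB
# 70B @ INT4: ~35 GB
```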

Quantization Visualizer

See how precision affects model size and quality.

[Interactive widget: select a precision level (FP32, FP16, INT8, INT4, INT2) to see the resulting model size, accuracy retained, perplexity increase, and weight distribution.]

At INT4, for example, weights are quantized to 16 discrete levels (4 bits per weight): roughly 12.5% of the original model size, about 92% accuracy retained, and a perplexity increase of only ~0.3. Highly compressed, yet the sweet spot for consumer hardware; most users notice no quality difference.

Quantization Levels Explained

Each precision level represents a different tradeoff between model size and output quality.

| Level       | Bits | Size  | Accuracy | Use Case                      |
|-------------|------|-------|----------|-------------------------------|
| FP32 (Full) | 32   | 100%  | 100%     | Training, reference inference |
| FP16 (Half) | 16   | 50%   | ~99%     | Standard inference            |
| INT8        | 8    | 25%   | ~97%     | Production deployment         |
| INT4        | 4    | 12.5% | ~90-95%  | Consumer GPUs, edge           |
| INT2        | 2    | 6.25% | ~70-80%  | Extreme edge cases            |
💡 Recommendation: Q4 is the Sweet Spot

For most users running large models (70B+ parameters) locally:

  • Q4 (INT4) provides excellent quality-to-memory ratio
  • Most users cannot distinguish Q4 output from FP16 in blind tests
  • Enables running 70B models on 24GB consumer GPUs
  • Recommended formats: Q4_K_M or Q4_K_S for GGUF models

For critical applications requiring maximum accuracy, use FP16 or INT8. For casual use and experimentation, Q4 is ideal.
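
As a usage sketch, a Q4_K_M GGUF file can be loaded with the llama-cpp-python bindings roughly as follows (the model path is a placeholder, and option names may vary across library versions):

```python
from llama_cpp import Llama

# Load a 4-bit K-quant GGUF model; path and settings are illustrative.
llm = Llama(
    model_path="./models/llama-3.1-70b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU if VRAM allows
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```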

Quantization Techniques

Different methods for converting models to lower precision.

PTQ (Post-Training Quantization)

Apply quantization to an already-trained model. Fast and simple, but may have slightly higher accuracy loss. Works by calibrating quantization parameters on a small dataset.
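
A toy sketch of that calibration step, assuming a simple asymmetric min/max scheme (real toolkits offer more sophisticated calibrators):

```python
import numpy as np

def calibrate_uint8(calibration_activations: np.ndarray):
    """Derive asymmetric quantization parameters from observed activations."""
    lo, hi = calibration_activations.min(), calibration_activations.max()
    scale = (hi - lo) / 255.0
    zero_point = int(round(-lo / scale))
    return scale, zero_point

def quantize_uint8(x: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)

# "Calibration set": a handful of activations captured from real inputs.
acts = np.random.randn(1000).astype(np.float32)
scale, zp = calibrate_uint8(acts)
q = quantize_uint8(acts, scale, zp)
print(f"scale={scale:.4f}, zero_point={zp}, stored as 1 byte per value")
```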

QAT (Quantization-Aware Training)

Include quantization in the training process. The model learns to be robust to precision loss, yielding better accuracy but requiring full retraining.
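
The core mechanism is "fake quantization" in the forward pass plus a straight-through estimator so gradients still reach the full-precision weights; a minimal PyTorch sketch of that idea (not a full training loop):

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Simulate low-precision weights in the forward pass.

    Rounding is non-differentiable, so the straight-through estimator
    passes gradients through as if quantization were the identity.
    """
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for symmetric 4-bit
    scale = w.abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    return w + (w_q - w).detach()                   # forward: w_q, backward: grad of w

w = torch.randn(8, 8, requires_grad=True)
loss = fake_quantize(w, bits=4).pow(2).sum()
loss.backward()                                      # gradients reach the FP32 weights
print(w.grad is not None)                            # True
```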

GPTQ

One-shot quantization method designed for LLMs. Uses second-order information to minimize quantization error layer by layer. Popular for its speed and quality.
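
In practice GPTQ is usually applied through a library; the sketch below follows the pattern from the AutoGPTQ project's documentation, though class and argument names can differ across versions and the model name is just a small example:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "facebook/opt-125m"                       # small example model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A handful of calibration examples drives the layer-by-layer error minimization.
examples = [tokenizer("Quantization reduces model precision to save memory.")]

config = BaseQuantizeConfig(bits=4, group_size=128)  # 4-bit weights, 128-weight groups
model = AutoGPTQForCausalLM.from_pretrained(model_id, config)
model.quantize(examples)
model.save_quantized("opt-125m-gptq-4bit")
```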

AWQ (Activation-aware Weight Quantization)

Identifies and preserves "salient" weights that matter most for accuracy. Achieves better quality than naive quantization by protecting important parameters.
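
A toy sketch of the observation that motivates AWQ: find the input channels with the largest calibration activations and protect the corresponding weight columns (here by simply leaving them unquantized; real AWQ instead rescales them so every weight can stay in INT4):

```python
import numpy as np

def quantize_int4(w: np.ndarray) -> np.ndarray:
    """Round-trip a tensor through symmetric 4-bit quantization (per output row)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    return np.clip(np.round(w / scale), -7, 7) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)   # toy weight matrix
X = rng.standard_normal((256, 64)).astype(np.float32)  # calibration activations
X[:, :4] *= 20.0                                        # a few channels dominate

# Salient input channels are those with large average activation magnitude.
salient = np.argsort(np.abs(X).mean(axis=0))[-4:]

W_naive = quantize_int4(W)
W_protected = W_naive.copy()
W_protected[:, salient] = W[:, salient]                 # keep salient columns full precision

y_ref = X @ W.T
print("naive error:    ", np.abs(X @ W_naive.T - y_ref).mean())
print("protected error:", np.abs(X @ W_protected.T - y_ref).mean())
```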

GGUF Format

File format used by llama.cpp for quantized models. Supports various quantization levels (Q2-Q8) and is the standard for local LLM deployment.

GGUF K-Quant Methods

Understanding the naming convention for GGUF quantized models.

| Method | Quality      | Size       | Use Case                   |
|--------|--------------|------------|----------------------------|
| Q2_K   | Poor         | Smallest   | Extreme compression only   |
| Q3_K_S | Low          | Very Small | Memory-constrained systems |
| Q3_K_M | Low-Medium   | Small      | Budget hardware            |
| Q3_K_L | Medium       | Moderate   | Better Q3 quality          |
| Q4_K_S | Good         | Small      | Recommended balance        |
| Q4_K_M | Very Good    | Moderate   | Best overall choice        |
| Q5_K_S | Excellent    | Larger     | Quality-focused            |
| Q5_K_M | Excellent    | Larger     | Near-FP16 quality          |
| Q6_K   | Near-perfect | Large      | Minimal loss               |
| Q8_0   | Excellent    | Large      | Reference quality          |

K-Quant Naming Explained

  • K = "K-quant": uses importance-based quantization that varies precision by layer
  • S (Small): more aggressive quantization on attention layers, smaller files
  • M (Medium): balanced quantization across all layers, best quality/size ratio
  • L (Large): less quantization on important layers, better quality

Key Insight: K-quants are "mixed precision"—they quantize different layers differently based on their importance to model quality. Attention layers typically use higher precision than feed-forward layers.

Real-World Impact

Concrete examples of what quantization enables.

Llama 3.1 70B at Different Quants

A 70B parameter model requires ~140GB at FP16. With quantization:

  • Q8: ~70GB, fits on 2x A100 40GB or 1x H100
  • Q4_K_M: ~40GB, fits on 2x RTX 4090 or 1x A100 80GB
  • Q3_K_M: ~30GB, fits on a single RTX 4090 (24GB VRAM plus some CPU offload)

Quality Comparison

In blind tests comparing Q4_K_M to FP16 outputs:

  • 85% of users could not identify which was quantized
  • Perplexity increase of only 0.1-0.5 points on common benchmarks
  • Code completion and reasoning tasks show minimal degradation

Cost Savings

Running a 70B model for inference:

  • FP16: ~$4-8/hour on cloud (2x A100)
  • Q4: ~$1-2/hour (single A100 or high-end consumer GPU)
  • Local: one-time cost of a consumer GPU vs. ongoing cloud fees

Key Takeaways

  1. Quantization reduces model memory by 2-16x with surprisingly small accuracy loss
  2. Q4 (INT4) is the sweet spot for most local LLM use cases: excellent quality at 1/8th the memory
  3. K-quant methods (Q4_K_M, Q5_K_S) are "mixed precision" and outperform uniform quantization
  4. GPTQ and AWQ are the leading techniques for LLM quantization, with GGUF as the standard format
  5. Quantization democratizes AI by enabling frontier models on consumer hardware
  6. For critical applications, prefer higher precision (INT8/FP16); for experimentation, Q4 is ideal