Quantization

How reducing numerical precision enables running large models on consumer hardware with minimal quality loss.

What is Quantization?

Quantization is the process of reducing the numerical precision of model weights from 32-bit floating point (FP32) to lower-precision representations such as FP16, INT8, or INT4. This dramatically reduces memory requirements and speeds up inference.

"Like compressing a high-resolution photo to fit on your phone—you lose some detail, but the image remains recognizable and useful."

The key insight is that neural networks are surprisingly robust to precision loss: most weights can be stored with far fewer bits without catastrophic quality degradation.
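
To make that concrete, here is a minimal NumPy sketch (function names are illustrative, not from any particular library) that quantizes an FP32 weight tensor to INT8 with a single symmetric scale and dequantizes it back, showing that the round-trip error stays small:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric INT8 quantization: map floats to integers in [-127, 127]."""
    scale = np.abs(weights).max() / 127.0          # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map INT8 values back to approximate FP32 weights."""
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32) * 0.1   # toy FP32 weight block
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Storage drops from 4 bytes to 1 byte per weight; error is bounded by ~scale/2.
print("max round-trip error:", np.abs(weights - restored).max())
```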

Why Quantize?

Quantization unlocks the ability to run large models on consumer hardware and reduces inference costs in production.

1. Memory Reduction: A 70B parameter model at FP16 requires ~140GB of VRAM. At INT4, it fits in ~35GB, runnable on high-end consumer GPUs (see the sketch after this list).

2. Faster Inference: Lower-precision arithmetic is faster. INT8 operations are 2-4x faster than FP32 on modern hardware.

3. Lower Costs: Smaller models mean fewer GPUs, lower cloud costs, and feasibility for edge deployment.

4. Democratization: Enables researchers and hobbyists to run frontier-class models locally without enterprise hardware.
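
The memory figures in item 1 are just bytes-per-weight arithmetic; a minimal sketch of that calculation (parameter count approximate, decimal gigabytes, weights only) is below:

```python
# Approximate VRAM needed for the weights alone (ignores KV cache and activations).
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4, "INT2": 2}

def weight_memory_gb(n_params: float, precision: str) -> float:
    bytes_total = n_params * BITS_PER_WEIGHT[precision] / 8
    return bytes_total / 1e9  # decimal GB

for precision in ("FP16", "INT8", "INT4"):
    print(f"70B @ {precision}: ~{weight_memory_gb(70e9, precision):.0f} GB")
# 70B @ FP16: ~140 GB
# 70B @ INT8: ~70 GB
# 70B @ INT4: ~35 GB
```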

Quantization Visualizer

See how precision affects model size and quality.

[Interactive widget: select a precision level (FP32, FP16, INT8, INT4, INT2) to see the resulting model size, accuracy retained, perplexity increase, and weight distribution.]

At INT4, for example, weights are quantized to 16 discrete levels (4 bits per weight): roughly 12.5% of the original model size, about 92% accuracy retained, and a perplexity increase of only ~0.3. Highly compressed, yet the sweet spot for consumer hardware; most users notice no quality difference.

Quantization Levels Explained

Each precision level represents a different tradeoff between model size and output quality.

| Level       | Bits | Size  | Accuracy | Use Case                      |
|-------------|------|-------|----------|-------------------------------|
| FP32 (Full) | 32   | 100%  | 100%     | Training, reference inference |
| FP16 (Half) | 16   | 50%   | ~99%     | Standard inference            |
| INT8        | 8    | 25%   | ~97%     | Production deployment         |
| INT4        | 4    | 12.5% | ~90-95%  | Consumer GPUs, edge           |
| INT2        | 2    | 6.25% | ~70-80%  | Extreme edge cases            |
💡 Recommendation: Q4 is the Sweet Spot

For most users running large models (70B+ parameters) locally:

  • Q4 (INT4) provides excellent quality-to-memory ratio
  • Most users cannot distinguish Q4 output from FP16 in blind tests
  • Enables running 70B models on 24GB consumer GPUs
  • Recommended formats: Q4_K_M or Q4_K_S for GGUF models

For critical applications requiring maximum accuracy, use FP16 or INT8. For casual use and experimentation, Q4 is ideal.
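
As a usage sketch, a Q4_K_M GGUF file can be loaded with the llama-cpp-python bindings roughly as follows (the model path is a placeholder, and option names may vary across library versions):

```python
from llama_cpp import Llama

# Load a 4-bit K-quant GGUF model; path and settings are illustrative.
llm = Llama(
    model_path="./models/llama-3.1-70b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU if VRAM allows
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```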

Quantization Techniques

Different methods for converting models to lower precision.

PTQ (Post-Training Quantization)

Apply quantization to an already-trained model. Fast and simple, but may have slightly higher accuracy loss. Works by calibrating quantization parameters on a small dataset.
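
A toy sketch of that calibration step, assuming a simple asymmetric min/max scheme (real toolkits offer more sophisticated calibrators):

```python
import numpy as np

def calibrate_uint8(calibration_activations: np.ndarray):
    """Derive asymmetric quantization parameters from observed activations."""
    lo, hi = calibration_activations.min(), calibration_activations.max()
    scale = (hi - lo) / 255.0
    zero_point = int(round(-lo / scale))
    return scale, zero_point

def quantize_uint8(x: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)

# "Calibration set": a handful of activations captured from real inputs.
acts = np.random.randn(1000).astype(np.float32)
scale, zp = calibrate_uint8(acts)
q = quantize_uint8(acts, scale, zp)
print(f"scale={scale:.4f}, zero_point={zp}, stored as 1 byte per value")
```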

QAT (Quantization-Aware Training)

Include quantization in the training process. The model learns to be robust to precision loss, yielding better accuracy but requiring full retraining.
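
The core mechanism is "fake quantization" in the forward pass plus a straight-through estimator so gradients still reach the full-precision weights; a minimal PyTorch sketch of that idea (not a full training loop):

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Simulate low-precision weights in the forward pass.

    Rounding is non-differentiable, so the straight-through estimator
    passes gradients through as if quantization were the identity.
    """
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for symmetric 4-bit
    scale = w.abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    return w + (w_q - w).detach()                   # forward: w_q, backward: grad of w

w = torch.randn(8, 8, requires_grad=True)
loss = fake_quantize(w, bits=4).pow(2).sum()
loss.backward()                                      # gradients reach the FP32 weights
print(w.grad is not None)                            # True
```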

GPTQ

One-shot quantization method designed for LLMs. Uses second-order information to minimize quantization error layer by layer. Popular for its speed and quality.
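
In practice GPTQ is usually applied through a library; the sketch below follows the pattern from the AutoGPTQ project's documentation, though class and argument names can differ across versions and the model name is just a small example:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "facebook/opt-125m"                       # small example model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A handful of calibration examples drives the layer-by-layer error minimization.
examples = [tokenizer("Quantization reduces model precision to save memory.")]

config = BaseQuantizeConfig(bits=4, group_size=128)  # 4-bit weights, 128-weight groups
model = AutoGPTQForCausalLM.from_pretrained(model_id, config)
model.quantize(examples)
model.save_quantized("opt-125m-gptq-4bit")
```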

AWQ (Activation-aware Weight Quantization)

Identifies and preserves "salient" weights that matter most for accuracy. Achieves better quality than naive quantization by protecting important parameters.
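
A toy sketch of the observation that motivates AWQ: find the input channels with the largest calibration activations and protect the corresponding weight columns (here by simply leaving them unquantized; real AWQ instead rescales them so every weight can stay in INT4):

```python
import numpy as np

def quantize_int4(w: np.ndarray) -> np.ndarray:
    """Round-trip a tensor through symmetric 4-bit quantization (per output row)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    return np.clip(np.round(w / scale), -7, 7) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)   # toy weight matrix
X = rng.standard_normal((256, 64)).astype(np.float32)  # calibration activations
X[:, :4] *= 20.0                                        # a few channels dominate

# Salient input channels are those with large average activation magnitude.
salient = np.argsort(np.abs(X).mean(axis=0))[-4:]

W_naive = quantize_int4(W)
W_protected = W_naive.copy()
W_protected[:, salient] = W[:, salient]                 # keep salient columns full precision

y_ref = X @ W.T
print("naive error:    ", np.abs(X @ W_naive.T - y_ref).mean())
print("protected error:", np.abs(X @ W_protected.T - y_ref).mean())
```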

GGUF Format

File format used by llama.cpp for quantized models. Supports various quantization levels (Q2-Q8) and is the standard for local LLM deployment.

GGUF K-Quant Methods

Understanding the naming convention for GGUF quantized models.

| Method | Quality      | Size       | Use Case                   |
|--------|--------------|------------|----------------------------|
| Q2_K   | Poor         | Smallest   | Extreme compression only   |
| Q3_K_S | Low          | Very Small | Memory-constrained systems |
| Q3_K_M | Low-Medium   | Small      | Budget hardware            |
| Q3_K_L | Medium       | Moderate   | Better Q3 quality          |
| Q4_K_S | Good         | Small      | Recommended balance        |
| Q4_K_M | Very Good    | Moderate   | Best overall choice        |
| Q5_K_S | Excellent    | Larger     | Quality-focused            |
| Q5_K_M | Excellent    | Larger     | Near-FP16 quality          |
| Q6_K   | Near-perfect | Large      | Minimal loss               |
| Q8_0   | Excellent    | Large      | Reference quality          |

K-Quant Naming Explained

  • K = "K-quant": uses importance-based quantization that varies precision by layer
  • S (Small): more aggressive quantization on attention layers, smaller files
  • M (Medium): balanced quantization across all layers, best quality/size ratio
  • L (Large): less quantization on important layers, better quality

Key Insight: K-quants are "mixed precision"—they quantize different layers differently based on their importance to model quality. Attention layers typically use higher precision than feed-forward layers.

Real-World Impact

Concrete examples of what quantization enables.

Llama 3.1 70B at Different Quants

A 70B parameter model requires ~140GB at FP16. With quantization:

  • Q8: ~70GB, fits on 2x A100 40GB or 1x H100
  • Q4_K_M: ~40GB, fits on 2x RTX 4090 or 1x A100 80GB
  • Q3_K_M: ~30GB, fits on a single RTX 4090 (24GB VRAM plus some CPU offload)

Quality Comparison

In blind tests comparing Q4_K_M to FP16 outputs:

  • 85% of users could not identify which was quantized
  • Perplexity increase of only 0.1-0.5 points on common benchmarks
  • Code completion and reasoning tasks show minimal degradation

Cost Savings

Running a 70B model for inference:

  • FP16: ~$4-8/hour on cloud (2x A100)
  • Q4: ~$1-2/hour (single A100 or high-end consumer GPU)
  • Local: one-time cost of a consumer GPU vs. ongoing cloud fees

Key Takeaways

  1. Quantization reduces model memory by 2-16x with surprisingly small accuracy loss
  2. Q4 (INT4) is the sweet spot for most local LLM use cases: excellent quality at 1/8th the memory
  3. K-quant methods (Q4_K_M, Q5_K_S) are "mixed precision" and outperform uniform quantization
  4. GPTQ and AWQ are the leading techniques for LLM quantization, with GGUF as the standard format
  5. Quantization democratizes AI by enabling frontier models on consumer hardware
  6. For critical applications, prefer higher precision (INT8/FP16); for experimentation, Q4 is ideal