Quick Model Presets
Click a model to auto-fill parameters. Adjust quantization and context length below.
VRAM Estimator
Speed Estimator
1008 GB/s × 0.55 ÷ 4.2 GB ≈ 132 tok/s
The Offloading Speed Cliff
When a model doesn't fit in VRAM, layers can be offloaded to CPU RAM or even disk. But the speed penalty is brutal:
Estimated with offloading: 132 tok/s (100% VRAM)
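The cliff falls out of a simple extension of the roofline model: per token, the GPU-resident bytes stream at GPU bandwidth while the offloaded bytes stream at far slower CPU RAM bandwidth, and the two times add. A minimal sketch, with illustrative bandwidth assumptions (not benchmarks):

```python
def offload_speed(model_gb, frac_in_vram, gpu_bw=1008 * 0.55, cpu_bw=60):
    """Rough tokens/s when part of the weights is read from CPU RAM.

    Per token, GPU-resident bytes stream at effective GPU bandwidth and
    the offloaded remainder at CPU RAM bandwidth; per-token time is the
    sum. gpu_bw and cpu_bw (GB/s) are illustrative assumptions.
    """
    t_gpu = model_gb * frac_in_vram / gpu_bw
    t_cpu = model_gb * (1 - frac_in_vram) / cpu_bw
    return 1.0 / (t_gpu + t_cpu)

for frac in (1.0, 0.9, 0.5):
    print(f"{frac:.0%} in VRAM -> ~{offload_speed(4.2, frac):.0f} tok/s")
```

Even offloading only 10% of the weights roughly halves throughput, because the slow CPU-RAM portion dominates the per-token time.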
Smart Offloading for MoE Models
Mixture-of-Experts models are uniquely suited for offloading because only a fraction of experts activate per token. Here's how to maximize performance:
Keep the hot path in VRAM
Attention layers, the router/gate network, and shared layers are used for every token, so they must stay in VRAM. In llama.cpp, -ngl (--n-gpu-layers) controls how many layers sit on the GPU.
Offload inactive experts to CPU RAM
Most MoE models activate only a handful of experts per token — typically 2 to 8, out of anywhere from 8 to 256. Expert weights can live in CPU RAM with a modest speed penalty: each token reads only the few active experts, a small enough slice that even CPU memory bandwidth can keep up.
Use expert prefetching
Some experimental runtimes can predict which experts the next token will need and prefetch them from CPU→GPU while the current token is processing. In llama.cpp, expert placement is controlled with --override-tensor (-ot); the similarly named --override-kv flag edits GGUF metadata and is unrelated.
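Putting the placement rules together, a llama.cpp invocation might look like the sketch below. The model path is hypothetical, and the expert-tensor name pattern varies by architecture — inspect the GGUF (or the model card) to confirm it before relying on this:

```shell
# Keep all layers on GPU (-ngl 99) but pin expert FFN tensors to CPU RAM.
# "ffn_.*_exps" matches the expert tensors in many MoE GGUFs; verify the
# pattern for your model before use.
llama-cli -m qwen-moe-q4_k_m.gguf \
  -ngl 99 \
  --override-tensor "ffn_.*_exps=CPU" \
  -p "Hello"
```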
Example: Qwen3.5-35B-A3B at Q4
Total size: ~18 GB, but only ~3B params are active per token. With smart offloading you can run this on a 12 GB GPU: keep attention, the router, and shared weights in VRAM (~6 GB) and offload the expert tensors to CPU RAM. Speed stays close to full-VRAM speed, because each token reads only the small active slice rather than all 18 GB.
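The split above can be sketched numerically. All sizes here are hypothetical round numbers chosen to mirror the example, not measured values:

```python
def moe_vram_plan(total_gb, active_gb, hot_path_gb, vram_budget_gb):
    """Plan a VRAM/RAM split for an MoE model (all sizes in GB, hypothetical).

    The hot path (attention, router, shared layers) must fit in VRAM;
    expert tensors fill the remaining VRAM budget and spill to CPU RAM.
    """
    expert_gb = total_gb - hot_path_gb                      # all expert tensors
    experts_in_vram = max(0.0, min(expert_gb, vram_budget_gb - hot_path_gb))
    return {
        "vram_used_gb": hot_path_gb + experts_in_vram,
        "ram_used_gb": expert_gb - experts_in_vram,
        "read_per_token_gb": active_gb,                     # only active params are read
    }

plan = moe_vram_plan(total_gb=18, active_gb=2.0, hot_path_gb=6.0, vram_budget_gb=12)
print(plan)
```

With these numbers, 6 GB of expert weights land in CPU RAM, yet each token still only reads ~2 GB of active weights.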
💡 Key Insight
For MoE models, VRAM determines which models you CAN run. For dense models, VRAM determines how FAST you can run them. A 35B MoE model with 3B active params on a 12GB GPU can be faster than a 14B dense model on the same GPU.
How the Formulas Work
VRAM Formula
Each parameter is stored using the number of bits determined by quantization. FP16 uses 16 bits (2 bytes) per parameter, Q4_K_M uses roughly 4.8 bits. Divide by 8 to convert bits to bytes.
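The formula is a one-liner (GB here means 10⁹ bytes; 4.8 bits is a typical Q4_K_M average, per the caveats below):

```python
def weights_gb(n_params, bits_per_param):
    """Weight size = params x bits/param / 8 bits-per-byte, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

print(weights_gb(7e9, 16))   # FP16 7B  -> 14.0 GB
print(weights_gb(7e9, 4.8))  # Q4_K_M 7B -> ~4.2 GB
```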
KV Cache Formula
During generation, each layer stores a Key and Value vector for every token in the context. With longer contexts, KV cache can use several GB — this is why 32K context costs much more VRAM than 4K.
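The per-token accounting can be sketched as below, using an assumed Llama-3-8B-like shape (32 layers, 8 KV heads under grouped-query attention, head dim 128, FP16 cache) — check your model's config for the real values:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """KV cache = 2 (K and V) x layers x kv_heads x head_dim x bytes x tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx_len / 1e9

print(f"{kv_cache_gb(32, 8, 128, 4096):.2f} GB at 4K context")
print(f"{kv_cache_gb(32, 8, 128, 32768):.2f} GB at 32K context")
```

The cache grows linearly with context length, so 32K costs 8× what 4K does — here roughly 4.3 GB versus 0.5 GB.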
Speed Formula (Roofline Model)
LLM inference is memory-bandwidth-bound: each token requires reading the entire model from VRAM. We apply a 55% efficiency factor based on real llama.cpp benchmarks — real-world speeds are lower than theoretical bandwidth due to memory controller overhead, kernel latency, and compute bottlenecks. For MoE models, only active parameters are read per token.
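In code, the roofline estimate is a single division; the numbers below reuse the worked example from earlier on this page (1008 GB/s is an RTX 4090-class spec used as an assumption):

```python
def tokens_per_sec(bandwidth_gbs, bytes_read_gb, efficiency=0.55):
    """Roofline estimate: each token reads the (active) weights once from VRAM."""
    return bandwidth_gbs * efficiency / bytes_read_gb

# ~4.2 GB of weights read per token (7B dense at Q4_K_M)
print(round(tokens_per_sec(1008, 4.2)))  # -> 132
```

For an MoE model, pass the size of the *active* parameters as `bytes_read_gb` instead of the full model size.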
Important Caveats
- These are estimates. Real VRAM usage depends on the inference engine (llama.cpp, vLLM, Ollama), batch size, and implementation details.
- Flash Attention and paged KV cache can significantly reduce memory usage in practice.
- CPU offloading lets you run larger models than your GPU VRAM allows, at the cost of much slower speed.
- Actual speed depends on compute utilization, not just bandwidth. Batched inference, speculative decoding, and flash attention all change the picture.
- K-quant sizes (Q4_K_M, Q5_K_M, etc.) vary slightly by model architecture. The bits-per-param values here are typical averages.
Quantization →
Learn how quantization reduces model size with minimal quality loss.
Running Models Locally →
Complete guide to running models on your own hardware.