Quick Model Presets
Click a model to auto-fill parameters. Adjust quantization and context length below.
VRAM Estimator
Speed Estimator
1008 GB/s × 0.55 ÷ 4.2 GB ≈ 132 tok/s
The Offloading Speed Cliff
When a model doesn't fit in VRAM, layers can be offloaded to CPU RAM or even disk. But the speed penalty is brutal:
Estimated with offloading: 132 tok/s (100% VRAM)
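The cliff falls out of a simple extension of the roofline model: per token, the GPU-resident bytes stream at GPU bandwidth while the offloaded bytes stream at far slower CPU RAM bandwidth, and the two times add. A minimal sketch, with illustrative bandwidth assumptions (not benchmarks):

```python
def offload_speed(model_gb, frac_in_vram, gpu_bw=1008 * 0.55, cpu_bw=60):
    """Rough tokens/s when part of the weights is read from CPU RAM.

    Per token, GPU-resident bytes stream at effective GPU bandwidth and
    the offloaded remainder at CPU RAM bandwidth; per-token time is the
    sum. gpu_bw and cpu_bw (GB/s) are illustrative assumptions.
    """
    t_gpu = model_gb * frac_in_vram / gpu_bw
    t_cpu = model_gb * (1 - frac_in_vram) / cpu_bw
    return 1.0 / (t_gpu + t_cpu)

for frac in (1.0, 0.9, 0.5):
    print(f"{frac:.0%} in VRAM -> ~{offload_speed(4.2, frac):.0f} tok/s")
```

Even offloading only 10% of the weights roughly halves throughput, because the slow CPU-RAM portion dominates the per-token time.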
Smart Offloading for MoE Models
Mixture-of-Experts models are uniquely suited for offloading because only a fraction of experts activate per token. Here's how to maximize performance:
Keep the hot path in VRAM
Attention layers, the router/gate network, and shared layers are used for every token, so they must stay in VRAM. In llama.cpp, -ngl (--n-gpu-layers) controls how many layers sit on the GPU.
Offload inactive experts to CPU RAM
Most MoE models activate only a handful of experts per token — typically 2 to 8, out of anywhere from 8 to 256. Expert weights can live in CPU RAM with a modest speed penalty: each token reads only the few active experts, a small enough slice that even CPU memory bandwidth can keep up.
Use expert prefetching
Some experimental runtimes can predict which experts the next token will need and prefetch them from CPU→GPU while the current token is processing. In llama.cpp, expert placement is controlled with --override-tensor (-ot); the similarly named --override-kv flag edits GGUF metadata and is unrelated.
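Putting the placement rules together, a llama.cpp invocation might look like the sketch below. The model path is hypothetical, and the expert-tensor name pattern varies by architecture — inspect the GGUF (or the model card) to confirm it before relying on this:

```shell
# Keep all layers on GPU (-ngl 99) but pin expert FFN tensors to CPU RAM.
# "ffn_.*_exps" matches the expert tensors in many MoE GGUFs; verify the
# pattern for your model before use.
llama-cli -m qwen-moe-q4_k_m.gguf \
  -ngl 99 \
  --override-tensor "ffn_.*_exps=CPU" \
  -p "Hello"
```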
Example: Qwen3.5-35B-A3B at Q4
Total size: ~18 GB, but only ~3B params are active per token. With smart offloading you can run this on a 12 GB GPU: keep attention, the router, and shared weights in VRAM (~6 GB) and offload the expert tensors to CPU RAM. Speed stays close to full-VRAM speed, because each token reads only the small active slice rather than all 18 GB.
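The split above can be sketched numerically. All sizes here are hypothetical round numbers chosen to mirror the example, not measured values:

```python
def moe_vram_plan(total_gb, active_gb, hot_path_gb, vram_budget_gb):
    """Plan a VRAM/RAM split for an MoE model (all sizes in GB, hypothetical).

    The hot path (attention, router, shared layers) must fit in VRAM;
    expert tensors fill the remaining VRAM budget and spill to CPU RAM.
    """
    expert_gb = total_gb - hot_path_gb                      # all expert tensors
    experts_in_vram = max(0.0, min(expert_gb, vram_budget_gb - hot_path_gb))
    return {
        "vram_used_gb": hot_path_gb + experts_in_vram,
        "ram_used_gb": expert_gb - experts_in_vram,
        "read_per_token_gb": active_gb,                     # only active params are read
    }

plan = moe_vram_plan(total_gb=18, active_gb=2.0, hot_path_gb=6.0, vram_budget_gb=12)
print(plan)
```

With these numbers, 6 GB of expert weights land in CPU RAM, yet each token still only reads ~2 GB of active weights.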
💡 Key Insight
For MoE models, VRAM determines which models you CAN run. For dense models, VRAM determines how FAST you can run them. A 35B MoE model with 3B active params on a 12GB GPU can be faster than a 14B dense model on the same GPU.
How the Formulas Work
VRAM Formula
Each parameter is stored using the number of bits determined by quantization. FP16 uses 16 bits (2 bytes) per parameter, Q4_K_M uses roughly 4.8 bits. Divide by 8 to convert bits to bytes.
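The formula is a one-liner (GB here means 10⁹ bytes; 4.8 bits is a typical Q4_K_M average, per the caveats below):

```python
def weights_gb(n_params, bits_per_param):
    """Weight size = params x bits/param / 8 bits-per-byte, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

print(weights_gb(7e9, 16))   # FP16 7B  -> 14.0 GB
print(weights_gb(7e9, 4.8))  # Q4_K_M 7B -> ~4.2 GB
```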
KV Cache Formula
During generation, each layer stores a Key and Value vector for every token in the context. With longer contexts, KV cache can use several GB — this is why 32K context costs much more VRAM than 4K.
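The per-token accounting can be sketched as below, using an assumed Llama-3-8B-like shape (32 layers, 8 KV heads under grouped-query attention, head dim 128, FP16 cache) — check your model's config for the real values:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """KV cache = 2 (K and V) x layers x kv_heads x head_dim x bytes x tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx_len / 1e9

print(f"{kv_cache_gb(32, 8, 128, 4096):.2f} GB at 4K context")
print(f"{kv_cache_gb(32, 8, 128, 32768):.2f} GB at 32K context")
```

The cache grows linearly with context length, so 32K costs 8× what 4K does — here roughly 4.3 GB versus 0.5 GB.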
Speed Formula (Roofline Model)
LLM inference is memory-bandwidth-bound: each token requires reading the entire model from VRAM. We apply a 55% efficiency factor based on real llama.cpp benchmarks — real-world speeds are lower than theoretical bandwidth due to memory controller overhead, kernel latency, and compute bottlenecks. For MoE models, only active parameters are read per token.
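In code, the roofline estimate is a single division; the numbers below reuse the worked example from earlier on this page (1008 GB/s is an RTX 4090-class spec used as an assumption):

```python
def tokens_per_sec(bandwidth_gbs, bytes_read_gb, efficiency=0.55):
    """Roofline estimate: each token reads the (active) weights once from VRAM."""
    return bandwidth_gbs * efficiency / bytes_read_gb

# ~4.2 GB of weights read per token (7B dense at Q4_K_M)
print(round(tokens_per_sec(1008, 4.2)))  # -> 132
```

For an MoE model, pass the size of the *active* parameters as `bytes_read_gb` instead of the full model size.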
Important Caveats
- These are estimates. Real VRAM usage depends on the inference engine (llama.cpp, vLLM, Ollama), batch size, and implementation details.
- Flash Attention and paged KV cache can significantly reduce memory usage in practice.
- CPU offloading lets you run larger models than your GPU VRAM allows, at the cost of much slower speed.
- Actual speed depends on compute utilization, not just bandwidth. Batched inference, speculative decoding, and flash attention all change the picture.
- K-quant sizes (Q4_K_M, Q5_K_M, etc.) vary slightly by model architecture. The bits-per-param values here are typical averages.
Quantization →
Learn how quantization reduces model size with minimal quality loss.
Running Models Locally →
Complete guide to running models on your own hardware.