VRAM Calculator

Beginner

Estimate VRAM requirements and inference speed for running LLMs locally.

Last updated: Feb 27, 2026

Quick Model Presets

Click a model to auto-fill parameters. Adjust quantization and context length below.

VRAM Estimator

Estimated Total VRAM
9.0 GB
Model weights: 4.2 GB
KV cache: 4.29 GB
Runtime overhead: 0.50 GB
GPU Compatibility
8 GB — Won't fit
12 GB — Fits
16 GB — Fits
24 GB — Fits
32 GB — Fits
48 GB — Fits
80 GB — Fits
96 GB — Fits
128 GB — Fits
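The compatibility labels are just a comparison of the estimated total against each card's capacity. A minimal sketch in Python (the 9.0 GB total is the example estimate from above):

```python
def fits(total_vram_gb: float, gpu_vram_gb: float) -> str:
    """Return the compatibility label for a given card size."""
    return "Fits" if total_vram_gb <= gpu_vram_gb else "Won't fit"

# Example estimate from above: weights + KV cache + runtime overhead
estimate = 4.2 + 4.29 + 0.50  # ≈ 9.0 GB
for gpu in (8, 12, 16, 24, 32, 48, 80, 96, 128):
    print(f"{gpu} GB: {fits(estimate, gpu)}")
```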

Speed Estimator

Hardware presets: NVIDIA Consumer · NVIDIA Pro · Apple Silicon · AMD APU · Datacenter
Estimated Generation Speed
Fast
~131 tok/s

1008 GB/s × 0.55 ÷ 4.2 GB = ~131 tok/s

<5 tok/s — Slow
5–15 tok/s — Usable
15–30 tok/s — Good
>30 tok/s — Fast

The Offloading Speed Cliff

When a model doesn't fit in VRAM, layers can be offloaded to CPU RAM or even disk. But the speed penalty is brutal:

GPU VRAM: 1,008 GB/s → ~131 tok/s
CPU RAM (DDR5): 70 GB/s → ~9 tok/s
CPU RAM (DDR4): 40 GB/s → ~5 tok/s
NVMe SSD: 6 GB/s → ~47 tok/min
SATA SSD: 500 MB/s → ~4 tok/min

Estimated with offloading: 131 tok/s (100% VRAM)

⚠️ Even offloading 10% of layers to CPU RAM can cut your speed by 50%+. The bottleneck is the slowest link in the chain.
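The cliff falls out of simple arithmetic: a token can't finish until every layer has been read, so time per token is the sum of the fast VRAM read and the much slower RAM read. A sketch using the 4.2 GB model and the DDR5 bandwidth from the table above (raw roofline numbers, before the 55% efficiency factor):

```python
def offload_speed(model_gb, frac_gpu, gpu_bw=1008.0, cpu_bw=70.0):
    """tok/s when frac_gpu of the weights sit in VRAM, the rest in DDR5 RAM.
    Time per token = bytes read from each pool / that pool's bandwidth."""
    t = (model_gb * frac_gpu) / gpu_bw + (model_gb * (1 - frac_gpu)) / cpu_bw
    return 1.0 / t

print(offload_speed(4.2, 1.0))  # all weights in VRAM: 240 tok/s ceiling
print(offload_speed(4.2, 0.9))  # only 10% offloaded: ~103 tok/s, a 57% drop
```

The slow pool dominates even at small fractions, which is exactly the "slowest link in the chain" effect described above.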

Smart Offloading for MoE Models

Mixture-of-Experts models are uniquely suited for offloading because only a fraction of experts activate per token. Here's how to maximize performance:

🎯 Keep the hot path in VRAM

Attention layers, the router/gate network, and shared layers are used for every token. These must stay in VRAM. In llama.cpp, use --ngl to control how many layers are on GPU.

💤 Offload inactive experts to CPU RAM

Most MoE models activate 2-4 experts out of 64+. The inactive experts can live in CPU RAM with minimal impact — they're not read during inference anyway.

Use expert prefetching

Some advanced runtimes can predict which experts the next token will need and prefetch them from CPU→GPU while the current token is processing. In llama.cpp, expert tensor placement is controlled with --override-tensor (-ot) — for example, a pattern like -ot "exps=CPU" keeps expert weights in system RAM while the rest stays on GPU.

🧮 Example: Qwen3.5-35B-A3B at Q4

Total size: ~18 GB. But only ~3B params are active per token. With smart offloading, you can run this on a 12 GB GPU: keep attention + active experts in VRAM (~6 GB), offload the rest to RAM. Speed: nearly the same as full VRAM, because inactive experts aren't read.

💡 Key Insight

For dense models, VRAM determines which models you can run at usable speed — spill into RAM and you fall off the offloading cliff. For MoE models, total system memory determines which models you can run, and VRAM mostly determines how fast. A 35B MoE model with 3B active params on a 12 GB GPU can be faster than a 14B dense model on the same GPU.
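That comparison can be sanity-checked with the roofline formula from the next section: per token, only the active parameters are read, so a 3B-active MoE model touches far fewer bytes than a 14B dense model. A sketch, assuming Q4_K_M (~4.8 bits/param) and the 1,008 GB/s GPU from the speed example:

```python
BW, EFF, BITS = 1008.0, 0.55, 4.8  # GB/s, efficiency factor, bits per param

def decode_speed(active_params_b):
    bytes_read_gb = active_params_b * 1e9 * BITS / 8 / 1e9  # GB read per token
    return BW * EFF / bytes_read_gb

print(decode_speed(3))   # 3B active (35B MoE): 308 tok/s
print(decode_speed(14))  # 14B dense: 66 tok/s
```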

How the Formulas Work

VRAM Formula

VRAM ≈ (params × bits_per_param / 8) + KV_cache + 0.5 GB

Each parameter is stored using the number of bits determined by quantization. FP16 uses 16 bits (2 bytes) per parameter, Q4_K_M uses roughly 4.8 bits. Divide by 8 to convert bits to bytes.
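As a sketch, the weight-size term of the formula in code — note that a 7B model at ~4.8 bits works out to exactly the 4.2 GB shown in the estimator above:

```python
def weights_gb(params_billion, bits_per_param):
    # bits -> bytes (/8), then bytes -> GB (/1e9)
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

print(weights_gb(7, 16))   # 7B at FP16: 14.0 GB
print(weights_gb(7, 4.8))  # 7B at Q4_K_M: 4.2 GB
```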

KV Cache Formula

KV_cache ≈ 2 × n_layers × d_model × ctx_len × 2 bytes

During generation, each layer stores a Key and Value vector for every token in the context. With longer contexts, the KV cache can reach several GB — this is why 32K context costs much more VRAM than 4K. Note that this formula assumes standard multi-head attention; models using grouped-query attention (GQA) store fewer KV heads, so their actual cache is several times smaller.
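The same formula as code. The layer count and hidden size below are typical of a 7B-class model (an assumption, not a value stated on this page); with n_layers=32, d_model=4096 and an 8K context it reproduces the ~4.29 GB KV cache shown in the estimator:

```python
def kv_cache_gb(n_layers, d_model, ctx_len, bytes_per_val=2):
    # 2 = one Key + one Value per layer per token; 2 bytes = FP16
    return 2 * n_layers * d_model * ctx_len * bytes_per_val / 1e9

print(kv_cache_gb(32, 4096, 8192))   # ~4.29 GB at 8K context
print(kv_cache_gb(32, 4096, 32768))  # ~17.2 GB at 32K context
```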

Speed Formula (Roofline Model)

tok/s ≈ (memory_bandwidth × efficiency) / model_size_in_ram

LLM inference is memory-bandwidth-bound: each token requires reading the entire model from VRAM. We apply a 55% efficiency factor based on real llama.cpp benchmarks — real-world speeds are lower than theoretical bandwidth due to memory controller overhead, kernel latency, and compute bottlenecks. For MoE models, only active parameters are read per token.
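Plugging the page's example numbers into the formula:

```python
def tok_per_s(bandwidth_gbs, model_gb, efficiency=0.55):
    # memory-bound roofline: one full read of the (active) weights per token
    return bandwidth_gbs * efficiency / model_gb

print(tok_per_s(1008, 4.2))  # ~132 tok/s (the estimator above rounds to ~131)
```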

Important Caveats

1. These are estimates. Real VRAM usage depends on the inference engine (llama.cpp, vLLM, Ollama), batch size, and implementation details.
2. Flash Attention and paged KV cache can significantly reduce memory usage in practice.
3. CPU offloading lets you run larger models than your GPU VRAM allows, at the cost of much slower speed.
4. Actual speed depends on compute utilization, not just bandwidth. Batched inference, speculative decoding, and flash attention all change the picture.
5. K-quant sizes (Q4_K_M, Q5_K_M, etc.) vary slightly by model architecture. The bits-per-param values here are typical averages.

Quantization

Learn how quantization reduces model size with minimal quality loss.

Running Models Locally

Complete guide to running models on your own hardware.