Why Run Locally?
Running models on your own machine gives you capabilities that cloud APIs cannot match.
Complete Privacy
Your data never leaves your machine. No logging, no third-party access, no compliance worries.
Zero API Costs
After the one-time hardware investment, every token is free. Run as many queries as you want.
Offline Access
Works without internet. Use AI on planes, in secure environments, or anywhere connectivity is limited.
Full Customization
Choose any model, any quantization, any parameters. Fine-tune for your specific use case.
Deep Learning
Nothing teaches you how LLMs work like running and experimenting with them directly.
Total Control
No rate limits, no content filters you did not choose, no surprise API changes or deprecations.
Hardware Requirements
Select a model size and quantization level to see how much VRAM you need and which GPUs can handle it.
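The calculator's core arithmetic can be sketched in a few lines. This is an illustrative estimate, not the page's actual implementation: the bytes-per-parameter values are nominal, and the flat 1.2x overhead factor for KV cache and runtime buffers is an assumption.

```python
# Rough VRAM estimate: weight memory at the quantized precision,
# times a flat overhead factor for KV cache, activations, and buffers.
# Bytes-per-param and the 1.2x overhead are illustrative assumptions.
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5, "q2": 0.25}

def estimate_vram_gb(params_billions: float, quant: str, overhead: float = 1.2) -> float:
    weights_gb = params_billions * BYTES_PER_PARAM[quant]
    return round(weights_gb * overhead, 1)

# Quick fit check against a couple of common consumer GPUs
for gpu, vram in [("RTX 3060 (12 GB)", 12), ("RTX 4090 (24 GB)", 24)]:
    need = estimate_vram_gb(7, "q4")
    verdict = "fits" if need <= vram else "does not fit"
    print(f"7B @ Q4 needs ~{need} GB -> {verdict} on {gpu}")
```

With overhead set to 1.0, the same function reproduces the 140 GB figure for a 70B model at FP16 quoted later in this guide.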
The MoE Advantage for Local Inference
Mixture of Experts (MoE) models route each token through only a few of many parallel "expert" sub-networks in each layer. The key advantage is speed: fewer active parameters means faster generation. But all parameters still live in VRAM -- MoE does not save memory.
Faster Generation Speed
Only a subset of experts compute per token. Mixtral 8x7B activates 12.9B of its 46.7B parameters, so it generates tokens at roughly the speed of a 13B dense model while competing with far larger dense models in quality.
Large-Model Intelligence
All 46.7B parameters store knowledge across all experts. You get reasoning quality far above what a 13B dense model could achieve, because the full capacity is available.
VRAM Is Still Based on Total Params
All expert weights must be loaded into memory. Mixtral 8x7B at Q4 needs ~26 GB VRAM — similar to a dense 30B model, not a 13B. MoE saves compute, not memory.
The insight: Mixtral 8x7B activates only 12.9B of its 46.7B parameters per token -- delivering quality competitive with 70B-class dense models at a fraction of their per-token compute. But it still needs ~26 GB VRAM at Q4 because all expert weights must be loaded. MoE trades VRAM for speed, not the other way around.
MoE is a fundamental architecture shift, not just an optimization trick. Understanding how expert routing works helps you pick the right model for your hardware.
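The memory-versus-compute split can be made concrete with Mixtral's published figures. A minimal sketch, assuming ~0.5 bytes per parameter at Q4 and a 10% runtime overhead (the overhead factor is an assumption):

```python
# MoE saves compute, not memory: VRAM scales with TOTAL parameters,
# per-token compute with ACTIVE parameters. Figures from Mixtral 8x7B.
TOTAL_B, ACTIVE_B = 46.7, 12.9

def q4_vram_gb(params_b: float, overhead: float = 1.1) -> float:
    # ~0.5 bytes/param at Q4; the 1.1x overhead factor is an assumption
    return params_b * 0.5 * overhead

print(f"VRAM at Q4:     ~{q4_vram_gb(TOTAL_B):.0f} GB (all experts loaded)")
print(f"Compute/token:  like a {ACTIVE_B}B dense model (only routed experts run)")
```

Running this reproduces the ~26 GB figure above: memory is priced by all 46.7B parameters, while each token only pays the compute of 12.9B.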
Deep dive into Mixture of Experts →
Popular Tools
The local inference ecosystem has matured rapidly. Here are the tools that matter, from beginner-friendly to production-grade.
Ollama
llama.cpp
LM Studio
vLLM
text-generation-webui
The Quantization Tradeoff
Quantization is the key technology that makes local inference practical. By reducing the precision of model weights, you can fit much larger models into limited VRAM.
A 70B parameter model at FP16 needs 140 GB of memory -- far beyond any consumer GPU. But at Q4 quantization, it fits into 40 GB, making it runnable on high-end consumer hardware with only a modest quality loss. The lower you quantize, the more speed and memory you gain, but quality degrades.
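The numbers above follow directly from bytes per parameter. A rough sketch of weight memory alone -- real GGUF quant formats vary slightly, and runtime overhead adds a few GB on top, which is why the text quotes ~40 GB for 70B at Q4 rather than the raw 35 GB:

```python
# Weight memory for a 70B model at different precisions (weights only,
# no KV cache or runtime overhead). Bytes-per-param values are nominal.
QUANTS = [("FP16", 2.0), ("Q8", 1.0), ("Q4", 0.5), ("Q2", 0.25)]

for name, bytes_per_param in QUANTS:
    gb = 70 * bytes_per_param
    print(f"70B @ {name:>4}: {gb:6.1f} GB")
```

Each halving of precision halves weight memory, which is the whole tradeoff: Q4 costs a quarter of FP16's memory for only a modest quality loss, while Q2 pushes further at a visible cost in quality.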
Deep dive into Quantization →
VRAM Calculator →
Not sure if a model fits your GPU? Calculate VRAM requirements and estimated speed for any model and quantization level.
Getting Started
Follow these five steps to go from zero to running your first local model.
Pick a Tool
Start with Ollama or LM Studio -- they handle everything for you. Move to llama.cpp or vLLM when you need more control.
Check Your VRAM
Run nvidia-smi (NVIDIA) to see your VRAM. On Apple Silicon, total unified memory (About This Mac) is the figure that matters. This determines what models you can run.
Choose a Model Size
Start with 7B models. They are fast, capable, and fit on most GPUs. Move to 13B or 70B as you need more capability.
Pick a Quantization Level
Q4 is the sweet spot for most users: good quality with reasonable VRAM use. Go Q8 if you have the memory, Q2 if you are tight.
Run It
Download the model and start chatting. With Ollama: ollama pull llama3.2 then ollama run llama3.2. That is it.
Quickstart Demo
Here is what it looks like to install Ollama and run your first model -- three commands and you are chatting.
Tips and Tricks
1. Context length directly impacts VRAM usage. A 7B model with a 128K context needs significantly more memory than the same model with a 4K context. Start small and increase as needed.
2. GPU offloading lets you split a model between GPU and CPU. You get GPU speed for the layers that fit, with the CPU handling the rest. Slower than full GPU, but it runs larger models.
3. CPU-only inference works but is 5-10x slower than GPU. Great for testing, less ideal for interactive use. Apple Silicon is the exception -- unified memory makes CPU inference fast.
4. For 8 GB VRAM: stick to 7B Q4 models. For 12 GB: 7B Q8 or 13B Q4. For 24 GB: 13B Q8 or 30B Q4. A 70B at Q4 needs ~40 GB, so plan on 48 GB or use partial CPU offloading.
5. Llama 3.2, Mistral, Phi-3, and Qwen 2.5 are excellent choices for local inference. Each excels at different tasks -- experiment to find your best fit.
6. Run models as an API server (Ollama and LM Studio both support this) to integrate local models into your own applications, scripts, and workflows.
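Tip 6 in action: Ollama's local server listens on port 11434 by default and exposes a /api/generate endpoint. A minimal standard-library sketch (assumes a running Ollama server and a model you have already pulled; the model name here is just an example):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model: str, prompt: str) -> urllib.request.Request:
    # stream=False asks for one complete JSON response instead of chunks
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )

def generate(model: str, prompt: str) -> str:
    # Requires a running Ollama server with the model pulled
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]

# Example (needs `ollama pull llama3.2` and a running server):
# print(generate("llama3.2", "Why is the sky blue?"))
```

LM Studio's server speaks the OpenAI-compatible chat format instead, so the payload shape differs there, but the same pattern applies: local HTTP endpoint in, generated text out.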