Why Run Locally?
Running models on your own machine gives you capabilities that cloud APIs cannot match.
Complete Privacy
Your data never leaves your machine. No logging, no third-party access, no compliance worries.
Zero API Costs
After the one-time hardware investment, every token is free. Run as many queries as you want.
Offline Access
Works without internet. Use AI on planes, in secure environments, or anywhere connectivity is limited.
Full Customization
Choose any model, any quantization, any parameters. Fine-tune for your specific use case.
Deep Learning
Nothing teaches you how LLMs work like running and experimenting with them directly.
Total Control
No rate limits, no content filters you did not choose, no surprise API changes or deprecations.
Hardware Requirements
Select a model size and quantization level to see how much VRAM you need and which GPUs can handle it.
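The calculator's core arithmetic can be sketched in a few lines. This is an illustrative estimate, not the page's actual implementation: the bytes-per-parameter values are nominal, and the flat 1.2x overhead factor for KV cache and runtime buffers is an assumption.

```python
# Rough VRAM estimate: weight memory at the quantized precision,
# times a flat overhead factor for KV cache, activations, and buffers.
# Bytes-per-param and the 1.2x overhead are illustrative assumptions.
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5, "q2": 0.25}

def estimate_vram_gb(params_billions: float, quant: str, overhead: float = 1.2) -> float:
    weights_gb = params_billions * BYTES_PER_PARAM[quant]
    return round(weights_gb * overhead, 1)

# Quick fit check against a couple of common consumer GPUs
for gpu, vram in [("RTX 3060 (12 GB)", 12), ("RTX 4090 (24 GB)", 24)]:
    need = estimate_vram_gb(7, "q4")
    verdict = "fits" if need <= vram else "does not fit"
    print(f"7B @ Q4 needs ~{need} GB -> {verdict} on {gpu}")
```

With overhead set to 1.0, the same function reproduces the 140 GB figure for a 70B model at FP16 quoted later in this guide.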
The MoE Advantage for Local Inference
Mixture of Experts (MoE) models route each token through only a few of many parallel "expert" sub-networks in each layer. The key advantage is speed: fewer active parameters means faster generation. But all parameters still live in VRAM -- MoE does not save memory.
Faster Generation Speed
Only a subset of experts compute per token. Mixtral 8x7B activates 12.9B of its 46.7B parameters, so it generates tokens at roughly the speed of a 13B dense model while competing with far larger dense models in quality.
Large-Model Intelligence
All 46.7B parameters store knowledge across all experts. You get reasoning quality far above what a 13B dense model could achieve, because the full capacity is available.
VRAM Is Still Based on Total Params
All expert weights must be loaded into memory. Mixtral 8x7B at Q4 needs ~26 GB VRAM — similar to a dense 30B model, not a 13B. MoE saves compute, not memory.
The insight: Mixtral 8x7B activates only 12.9B of its 46.7B parameters per token -- delivering quality competitive with 70B-class dense models at a fraction of their per-token compute. But it still needs ~26 GB VRAM at Q4 because all expert weights must be loaded. MoE trades VRAM for speed, not the other way around.
MoE is a fundamental architecture shift, not just an optimization trick. Understanding how expert routing works helps you pick the right model for your hardware.
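The memory-versus-compute split can be made concrete with Mixtral's published figures. A minimal sketch, assuming ~0.5 bytes per parameter at Q4 and a 10% runtime overhead (the overhead factor is an assumption):

```python
# MoE saves compute, not memory: VRAM scales with TOTAL parameters,
# per-token compute with ACTIVE parameters. Figures from Mixtral 8x7B.
TOTAL_B, ACTIVE_B = 46.7, 12.9

def q4_vram_gb(params_b: float, overhead: float = 1.1) -> float:
    # ~0.5 bytes/param at Q4; the 1.1x overhead factor is an assumption
    return params_b * 0.5 * overhead

print(f"VRAM at Q4:     ~{q4_vram_gb(TOTAL_B):.0f} GB (all experts loaded)")
print(f"Compute/token:  like a {ACTIVE_B}B dense model (only routed experts run)")
```

Running this reproduces the ~26 GB figure above: memory is priced by all 46.7B parameters, while each token only pays the compute of 12.9B.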
Deep dive into Mixture of Experts →
Popular Tools
The local inference ecosystem has matured rapidly. Here are the tools that matter, from beginner-friendly to production-grade.
Ollama
llama.cpp
LM Studio
vLLM
text-generation-webui
The Quantization Tradeoff
Quantization is the key technology that makes local inference practical. By reducing the precision of model weights, you can fit much larger models into limited VRAM.
A 70B parameter model at FP16 needs 140 GB of memory -- far beyond any consumer GPU. But at Q4 quantization, it fits into 40 GB, making it runnable on high-end consumer hardware with only a modest quality loss. The lower you quantize, the more speed and memory you gain, but quality degrades.
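The numbers above follow directly from bytes per parameter. A rough sketch of weight memory alone -- real GGUF quant formats vary slightly, and runtime overhead adds a few GB on top, which is why the text quotes ~40 GB for 70B at Q4 rather than the raw 35 GB:

```python
# Weight memory for a 70B model at different precisions (weights only,
# no KV cache or runtime overhead). Bytes-per-param values are nominal.
QUANTS = [("FP16", 2.0), ("Q8", 1.0), ("Q4", 0.5), ("Q2", 0.25)]

for name, bytes_per_param in QUANTS:
    gb = 70 * bytes_per_param
    print(f"70B @ {name:>4}: {gb:6.1f} GB")
```

Each halving of precision halves weight memory, which is the whole tradeoff: Q4 costs a quarter of FP16's memory for only a modest quality loss, while Q2 pushes further at a visible cost in quality.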
Deep dive into Quantization →
VRAM Calculator →
Not sure if a model fits your GPU? Calculate VRAM requirements and estimated speed for any model and quantization level.
Getting Started
Follow these five steps to go from zero to running your first local model.
Pick a Tool
Start with Ollama or LM Studio -- they handle everything for you. Move to llama.cpp or vLLM when you need more control.
Check Your VRAM
Run nvidia-smi (NVIDIA) to see your VRAM. On Apple Silicon, total unified memory (About This Mac) is the figure that matters. This determines what models you can run.
Choose a Model Size
Start with 7B models. They are fast, capable, and fit on most GPUs. Move to 13B or 70B as you need more capability.
Pick a Quantization Level
Q4 is the sweet spot for most users: good quality with reasonable VRAM use. Go Q8 if you have the memory, Q2 if you are tight.
Run It
Download the model and start chatting. With Ollama: ollama pull llama3.2 then ollama run llama3.2. That is it.
Quickstart Demo
Here is what it looks like to install Ollama and run your first model -- three commands and you are chatting.
Tips and Tricks
1. Context length directly impacts VRAM usage. A 7B model with a 128K context needs significantly more memory than the same model with a 4K context. Start small and increase as needed.
2. GPU offloading lets you split a model between GPU and CPU. You get GPU speed for the layers that fit, with the CPU handling the rest. Slower than full GPU, but it runs larger models.
3. CPU-only inference works but is 5-10x slower than GPU. Great for testing, less ideal for interactive use. Apple Silicon is the exception -- unified memory makes CPU inference fast.
4. For 8 GB VRAM: stick to 7B Q4 models. For 12 GB: 7B Q8 or 13B Q4. For 24 GB: 13B Q8 or 30B Q4. A 70B at Q4 needs ~40 GB, so plan on 48 GB or use partial CPU offloading.
5. Llama 3.2, Mistral, Phi-3, and Qwen 2.5 are excellent choices for local inference. Each excels at different tasks -- experiment to find your best fit.
6. Run models as an API server (Ollama and LM Studio both support this) to integrate local models into your own applications, scripts, and workflows.
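Tip 6 in action: Ollama's local server listens on port 11434 by default and exposes a /api/generate endpoint. A minimal standard-library sketch (assumes a running Ollama server and a model you have already pulled; the model name here is just an example):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model: str, prompt: str) -> urllib.request.Request:
    # stream=False asks for one complete JSON response instead of chunks
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )

def generate(model: str, prompt: str) -> str:
    # Requires a running Ollama server with the model pulled
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]

# Example (needs `ollama pull llama3.2` and a running server):
# print(generate("llama3.2", "Why is the sky blue?"))
```

LM Studio's server speaks the OpenAI-compatible chat format instead, so the payload shape differs there, but the same pattern applies: local HTTP endpoint in, generated text out.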