5 Topics
LLM Inference
Understand how large language models generate text efficiently — from KV caching to batching strategies and serving infrastructure.
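The core idea behind KV caching can be sketched in a few lines: at each decode step, only the new token's key and value are computed and appended to a cache, and attention is taken over all cached positions instead of recomputing past keys and values. This is a toy illustration with made-up projections, not code from the guide:

```python
import numpy as np

def attention(q, K, V):
    # q: (d,), K/V: (t, d) -> softmax-weighted sum over all cached positions
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

d = 4
rng = np.random.default_rng(0)
k_cache, v_cache = [], []
for step in range(3):
    x = rng.normal(size=d)       # hidden state for the newly generated token
    k_cache.append(x * 0.5)      # stand-in "key" projection (illustrative)
    v_cache.append(x * 2.0)      # stand-in "value" projection (illustrative)
    # only one new key/value per step; earlier ones are reused from the cache
    out = attention(x, np.array(k_cache), np.array(v_cache))
```

Without the cache, every step would recompute keys and values for the whole prefix, making generation quadratic in sequence length rather than linear.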
Help Make This Better
This guide is open source. Got an idea for a new topic? Found a bug? Want to improve an explanation? Every contribution helps.
01. KV Cache: Store computed keys and values to avoid redundant work (Feb 9, 2026) [E]
02. Prompt Caching: Reuse computed KV caches across API requests to save cost and latency (Feb 19, 2026) [I]
03. Batching & Throughput: Process multiple requests simultaneously for higher throughput (Feb 9, 2026) [I]
04. Running Models Locally: Run LLMs on your own hardware for privacy, speed, and zero API costs (Feb 9, 2026) [B]
05. VRAM Calculator: Estimate VRAM requirements and inference speed for running local LLMs (Feb 27, 2026) [B]
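A back-of-the-envelope version of a VRAM estimate sums the model weights with the KV cache for the chosen context length. This is an illustrative formula under assumed model dimensions (layer count, KV heads, head size are placeholders), not the guide's actual calculator:

```python
def vram_estimate_gb(n_params_billion, bytes_per_param=2,
                     n_layers=32, n_kv_heads=8, head_dim=128,
                     context_len=4096, kv_bytes=2):
    """Rough VRAM estimate in GB: weights + KV cache.

    Assumed shapes are hypothetical; real models vary, and this
    ignores activation memory and framework overhead.
    """
    weights = n_params_billion * 1e9 * bytes_per_param
    # KV cache: 2 tensors (K and V) per layer, per KV head, per position
    kv = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * context_len
    return (weights + kv) / 1e9

# e.g. an 8B-parameter model at fp16 with a 4k context
print(round(vram_estimate_gb(8), 1))  # → 16.5
```

The weights term dominates at short contexts; the KV cache term grows linearly with context length and batch size, which is why long-context serving is memory-bound.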