What is Batching in LLM Inference?
When a single user sends a prompt to an LLM, the GPU processes just one request, using only a fraction of its compute capacity. Batching combines multiple requests and processes them simultaneously on the same GPU, dramatically increasing throughput.
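To make the idea concrete, here's a minimal numpy sketch (a toy stand-in, not a real serving stack): one weight matrix serves a single request or a whole batch with the same weight traffic, so the batched matmul extracts far more useful work from every byte loaded.

```python
import numpy as np

# Toy model layer: one weight matrix shared by every request.
d_model = 1024
W = np.random.randn(d_model, d_model).astype(np.float32)

# One request -> one activation row; a batch stacks 32 of them.
single = np.random.randn(1, d_model).astype(np.float32)
batch = np.random.randn(32, d_model).astype(np.float32)

# The same weights W must be read from memory either way, but the
# batched matmul does 32x the useful work per weight load.
out_single = single @ W  # shape (1, d_model)
out_batch = batch @ W    # shape (32, d_model)
```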
Static Batching
All requests in the batch start and finish together. The GPU waits for the slowest request to complete before any new requests can join. Simple but wasteful: slots freed by shorter requests sit idle until the whole batch finishes.
Dynamic / Continuous Batching
Requests can join and leave the batch at any time. When one request finishes, a new one immediately takes its slot. This "Orca-style" approach maximizes GPU utilization.
Why Single-Request Inference Wastes Resources
A modern GPU like the A100 has 312 TFLOPS of FP16 tensor compute and about 2 TB/s of memory bandwidth. A single decode step for one request barely scratches the surface: the GPU spends most of its time waiting on memory, not computing. Batching fills this gap by sharing the cost of each memory transfer across many requests.
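A quick back-of-the-envelope check, using only the A100 figures quoted above:

```python
# Roofline-style sanity check using the A100 figures quoted above.
peak_flops = 312e12  # FP16 tensor-core FLOP/s
mem_bw = 2e12        # HBM bandwidth in bytes/s

# Machine balance: FLOPs the GPU can perform per byte it loads.
# Workloads below this ratio are memory-bound.
ridge = peak_flops / mem_bw
print(f"balance point: {ridge:.0f} FLOPs per byte")  # ~156

# Single-token decode through a d x d fp16 weight matrix does ~2*d*d
# FLOPs while loading ~2*d*d bytes: about 1 FLOP per byte, far below 156.
```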
Throughput vs Batch Size
Drag the slider to see how batching affects performance
As batch size increases, total throughput rises steeply at first, then plateaus, then collapses. The rise comes from amortizing weight-loading costs across more requests. The collapse happens when memory is exhausted: KV caches overflow VRAM, the system starts swapping, and everything falls apart.
The plateau happens because of DRAM bandwidth saturation, not compute saturation. Even at batch 256, GPU compute units are still underutilized. The bottleneck is how fast weights can be loaded from memory.
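The shape of that curve can be reproduced with a deliberately simple model. Everything below is illustrative, not measured: step time is a shared weight-load cost plus a per-request KV-cache read, and a hard VRAM ceiling triggers the collapse.

```python
# Toy decode-step model; all constants are illustrative, not measured.
WEIGHT_LOAD_S = 0.020  # time to stream model weights once per step (shared)
KV_READ_S = 0.0002     # per-request KV-cache read time per step (bandwidth)
KV_PER_REQ_GB = 0.5    # KV-cache footprint per request
FREE_VRAM_GB = 60.0    # VRAM left over after the weights

def decode_throughput(batch_size: int) -> float:
    """Tokens/s produced per decode step at a given batch size."""
    if batch_size * KV_PER_REQ_GB > FREE_VRAM_GB:
        return 0.0  # memory exhausted: swapping/OOM, throughput collapses
    # Weight loading is amortized across the batch; KV reads scale with it.
    step_time = WEIGHT_LOAD_S + batch_size * KV_READ_S
    return batch_size / step_time

for b in (1, 8, 32, 64, 120, 128):
    print(f"batch {b:3d}: {decode_throughput(b):6.0f} tok/s")
```

The printed curve rises steeply, bends toward a plateau as KV reads saturate the modeled bandwidth, and drops to zero once the KV caches exceed free VRAM.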
Prefill vs Decode Phase
Two fundamentally different workloads in every LLM request
Every LLM request has two phases. Prefill processes all input tokens in parallel; it's blazingly fast because it saturates the GPU's compute units. Decode generates tokens one at a time; it's painfully slow because it's bottlenecked by memory bandwidth, not compute. Prefill can process ~1000 tokens in the time decode generates ~20.
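As a sketch of the two phases (a toy numpy model with made-up dimensions, standing in for a real transformer):

```python
import numpy as np

d = 512
W = np.random.randn(d, d).astype(np.float32)  # stand-in for the model weights

def prefill(prompt_embeds: np.ndarray) -> np.ndarray:
    """All T prompt tokens in one pass: a (T, d) @ (d, d) matmul.
    Weights are loaded once and reused for every token (compute-bound)."""
    return prompt_embeds @ W

def decode_step(last_hidden: np.ndarray) -> np.ndarray:
    """One generated token: a (1, d) @ (d, d) matmul.
    The same weights are re-read for a single token (memory-bound)."""
    return last_hidden @ W

prompt = np.random.randn(1000, d).astype(np.float32)
hidden = prefill(prompt)   # 1000 tokens, one shot
tok = hidden[-1:]
for _ in range(20):        # 20 sequential decode steps, one token each
    tok = decode_step(tok)
```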
Speed Race: Prefill vs Decode
Watch prefill process 1000 tokens while decode barely gets started. Prefill processes ~50 tokens per tick; decode processes 1 token per tick.
Why the Speed Difference?
Arithmetic intensity = FLOPs performed per byte loaded from memory. Prefill has high arithmetic intensity: many operations per weight load, since it processes many tokens at once. Decode has extremely low arithmetic intensity: it loads the same weights but computes for only one token. The GPU is a supercomputer held hostage by a straw-sized memory pipe.
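That intuition falls straight out of the matmul shapes. Ignoring activations and the KV cache (an assumption to keep the arithmetic clean), the intensity of a (T, d) @ (d, d) layer is roughly T, compared against the ~156 FLOPs/byte balance point computed earlier:

```python
def arithmetic_intensity(num_tokens: int, d: int = 4096) -> float:
    """FLOPs per byte for a (T, d) @ (d, d) matmul with fp16 weights."""
    flops = 2 * num_tokens * d * d  # one multiply-add per weight per token
    bytes_loaded = 2 * d * d        # fp16 weights, streamed once
    return flops / bytes_loaded     # simplifies to num_tokens

BALANCE = 156  # A100 FLOPs-per-byte balance point from above

for name, tokens in (("prefill", 1000), ("decode", 1)):
    ai = arithmetic_intensity(tokens)
    bound = "compute-bound" if ai > BALANCE else "memory-bound"
    print(f"{name}: {ai:.0f} FLOPs/byte -> {bound}")
```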
Prefill Phase
Like a conveyor belt processing all items at once
Decode Phase
Like a single worker hand-crafting items one by one
💡 Prefill is orders of magnitude faster per token than decode. In prefill, the GPU tensor cores are the bottleneck (compute-bound). In decode, DRAM bandwidth is the bottleneck (memory-bound): the GPU sits idle more than 80% of the time waiting for weight data. This is exactly why batching helps: it fills those idle cycles with useful work from other requests.
The Throughput-Latency Tradeoff
Batching isn't a free lunch. More batching means more total tokens per second, but each individual user waits longer for their response. Operators must balance server efficiency against user experience.
Low Batch Size
Low total throughput, but each user gets fast responses (45 tok/s per user)
High Batch Size
High total throughput, but each user gets much slower responses (9 tok/s per user)
💡 The sweet spot depends on your use case: real-time chat needs low latency (small batches), while batch processing jobs can maximize throughput (large batches). Most production systems target batch sizes of 32-128.
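The tradeoff is easy to see numerically. Reusing the toy decode model from above (illustrative constants again), system throughput grows with batch size while each user's share shrinks:

```python
def system_throughput(batch: int) -> float:
    # Same toy decode model as before: shared weight load + per-request KV reads.
    return batch / (0.020 + batch * 0.0002)

for batch in (1, 4, 32, 128, 512):
    total = system_throughput(batch)
    print(f"batch={batch:4d}  system={total:6.0f} tok/s  "
          f"per-user={total / batch:5.1f} tok/s")
```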
Continuous Batching
How modern serving engines eliminate idle GPU time
Static batching wastes GPU time because all requests must wait for the longest one to finish. Continuous batching (aka "Orca-style") lets requests join and leave independently. When a request finishes, its slot is immediately filled with a new one. Try the interactive timeline below and toggle between static and continuous to see the difference in GPU utilization.
💡 With continuous batching, finished slots are immediately refilled. GPU utilization stays near 100% as long as there are queued requests. This is how vLLM, TGI, and other modern engines work.
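Here's a minimal scheduler simulation of the difference (request lengths and slot count are made up): static batching drains each batch in lockstep, while continuous batching refills a slot the moment it frees up.

```python
import random

random.seed(0)
SLOTS = 4
lengths = [random.randint(5, 50) for _ in range(32)]  # decode steps per request

def static_batching(jobs: list[int]) -> int:
    """All slots wait for the longest request in each batch of SLOTS."""
    return sum(max(jobs[i:i + SLOTS]) for i in range(0, len(jobs), SLOTS))

def continuous_batching(jobs: list[int]) -> int:
    """A freed slot is refilled immediately from the queue."""
    queue = list(jobs)
    slots = [queue.pop() for _ in range(SLOTS)]  # remaining steps per slot
    steps = 0
    while slots:
        steps += 1
        slots = [s - 1 for s in slots if s > 1]  # advance one decode step
        while queue and len(slots) < SLOTS:      # refill freed slots
            slots.append(queue.pop())
    return steps

work = sum(lengths)  # total useful token-steps
for name, fn in (("static", static_batching), ("continuous", continuous_batching)):
    steps = fn(lengths)
    print(f"{name:10s} steps={steps:4d}  utilization={work / (steps * SLOTS):.0%}")
```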
Per-User vs System Throughput
What happens to each user as the system scales
As you add more concurrent users, the system serves more total tokens per second, but each individual user gets a smaller slice of the bandwidth. Beyond a critical point, the system collapses entirely: memory exhaustion, swapping, and cascading failures destroy throughput for everyone.
Key Takeaways
1. Batching amortizes the cost of loading model weights across multiple requests, dramatically increasing total throughput
2. But there's a limit: too many batched requests cause memory exhaustion and throughput collapse (OOM)
3. Prefill is compute-bound and orders of magnitude faster per token than decode, which is memory-bandwidth-bound
4. Continuous batching eliminates idle GPU time by dynamically filling slots as requests complete
5. Per-user speed always decreases with more concurrent users: the system trades individual speed for total capacity