Batching & Throughput

Intermediate

How processing multiple requests simultaneously transforms GPU utilization and inference economics.

Last updated: Feb 9, 2026

What is Batching in LLM Inference?

When a single user sends a prompt to an LLM, the GPU processes just one request β€” using only a fraction of its compute capacity. Batching combines multiple requests and processes them simultaneously on the same GPU, dramatically increasing throughput.

Static Batching

All requests in the batch start and finish together. The GPU waits for the slowest request to complete before any new requests can join. Simple but wasteful — slots freed by shorter requests sit idle until the whole batch finishes.

Dynamic / Continuous Batching

Requests can join and leave the batch at any time. When one request finishes, a new one immediately takes its slot. This "Orca-style" approach maximizes GPU utilization.

Why Single-Request Inference Wastes Resources

A modern GPU like the A100 has 312 TFLOPS of compute and 2 TB/s of memory bandwidth. A single decode step for one request barely scratches the surface β€” the GPU spends most of its time waiting for memory, not computing. Batching fills this gap by sharing the memory transfer cost across many requests.
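A back-of-envelope sketch makes this concrete. Assuming a hypothetical 7B-parameter model in FP16 (the model size is an illustrative assumption, not from the text), streaming every weight from HBM once per generated token already caps single-request decode speed, regardless of compute:

```python
# Sketch: decode speed for ONE request is bounded by how fast the
# model weights stream from HBM, not by compute.
MODEL_PARAMS = 7e9           # hypothetical 7B-parameter model (assumption)
BYTES_PER_PARAM = 2          # FP16 weights
HBM_BANDWIDTH = 2e12         # A100: ~2 TB/s

weight_bytes = MODEL_PARAMS * BYTES_PER_PARAM   # ~14 GB streamed per decode step
step_time = weight_bytes / HBM_BANDWIDTH        # time to stream weights once
tokens_per_s = 1 / step_time

print(f"weights streamed per token: {weight_bytes / 1e9:.0f} GB")
print(f"upper bound, one request:   {tokens_per_s:.0f} tok/s")
```

Even this optimistic bound (~143 tok/s here) leaves the 312 TFLOPS of compute almost entirely idle; real single-request decode lands lower still.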

πŸ“Š

Throughput vs Batch Size

How batch size affects total and per-user performance

As batch size increases, total throughput rises steeply at first β€” then plateaus β€” then COLLAPSES. The rise comes from amortizing weight-loading costs across more requests. The collapse happens when memory is exhausted: KV caches overflow VRAM, the system starts swapping, and everything falls apart.

[Interactive chart: slider sweeping batch size from 1 to 1024, plotting Total System Throughput (total tokens/s across ALL users combined) against Per-User Speed (how fast each individual user receives tokens); at batch size 1, both read ~45 tok/s.]

⚠️ More requests → more total throughput, but diminishing returns

The plateau happens because of DRAM bandwidth saturation β€” not compute saturation. Even at batch 256, GPU compute units are still underutilized. The bottleneck is how fast weights can be loaded from memory.
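A toy roofline model reproduces this shape. Per decode step, the GPU must stream the weights once (shared by the whole batch) plus each request's KV cache, and do the per-token compute; step time is the max of the two. The 7B FP16 model and the ~0.5 GB of KV traffic per request are illustrative assumptions:

```python
# Toy roofline model of the plateau (illustrative assumptions:
# 7B FP16 model, ~0.5 GB of KV cache streamed per request per step).
WEIGHT_BYTES = 14e9   # 2 bytes x 7e9 params
KV_BYTES = 0.5e9      # assumed KV-cache traffic per request per step
BW = 2e12             # A100 HBM bandwidth
FLOPS = 312e12        # A100 peak FP16 tensor throughput
PARAMS = 7e9

def step(batch):
    t_mem = (WEIGHT_BYTES + batch * KV_BYTES) / BW   # grows with batch
    t_compute = batch * 2 * PARAMS / FLOPS           # grows with batch
    return max(t_mem, t_compute), t_compute

for b in (1, 16, 256, 1024):
    t, t_c = step(b)
    print(f"batch {b:5d}: {b / t:7.0f} tok/s total, compute busy {t_c / t:5.1%}")
```

With these numbers, total throughput plateaus near the bandwidth ceiling while the compute-busy fraction never clears ~20% — matching the claim that the bottleneck is memory, not FLOPS.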

⚑

Prefill vs Decode Phase

Two fundamentally different workloads in every LLM request

Every LLM request has two phases. Prefill processes all input tokens in parallel — it's blazingly fast because it saturates the GPU's compute units. Decode generates tokens one at a time — it's painfully slow because it's bottlenecked by memory bandwidth, not compute. Prefill can process ~1,000 tokens in roughly the time decode takes to generate a single one.
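Plugging the A100 numbers into simple per-phase ceilings shows the asymmetry (assuming a hypothetical 7B FP16 model at ~2 FLOPs per parameter per token; real decode runs well below its bound because of KV-cache reads and kernel overheads, which widens the gap further):

```python
# Per-phase upper bounds from the A100 numbers quoted in the text,
# assuming a hypothetical 7B FP16 model (~2 FLOPs/param/token).
PARAMS, FLOPS, BW = 7e9, 312e12, 2e12

# Prefill: the whole prompt moves through the model in one pass,
# so the ceiling is compute.
prefill_bound = FLOPS / (2 * PARAMS)   # tok/s, compute-bound

# Decode: every token re-streams all weights from HBM,
# so the ceiling is bandwidth.
decode_bound = BW / (2 * PARAMS)       # tok/s, memory-bound

print(f"prefill ceiling: ~{prefill_bound:,.0f} tok/s")
print(f"decode ceiling:  ~{decode_bound:,.0f} tok/s")
```

The two ceilings differ by the ratio of compute to bandwidth; measured per-token gaps are larger still, since decode pays overheads that prefill amortizes away.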

[Interactive demo: the prompt "What is the capital of France?" is tokenized and animated through the Prefill and Decode phases.]

Speed Race: Prefill vs Decode

[Interactive race to 1000 tokens: prefill advances ~50 tokens per tick while decode advances 1 token per tick — decode barely gets started before prefill finishes.]

Why the Speed Difference?

🔥 Prefill
Limited by: Compute (FLOPS)
312 TFLOPS — the tensor cores are the bottleneck

🐌 Decode
Limited by: Memory Bandwidth (GB/s)
2 TB/s — loading weights from DRAM is the bottleneck

Arithmetic intensity = FLOPS per byte loaded from memory. Prefill has high arithmetic intensity (many operations per weight load, since it processes many tokens). Decode has extremely low arithmetic intensity (loads the same weights but only computes for 1 token). The GPU is a supercomputer held hostage by a straw-sized memory pipe.
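One way to make this concrete is the GPU's "machine balance" — the FLOP-per-byte ratio the hardware can sustain. A step that computes n tokens per weight load has an arithmetic intensity of roughly n FLOP/byte (2 FLOPs per FP16 weight against 2 bytes per weight — an assumed simplification); below the balance point the step is memory-bound:

```python
# Machine balance vs arithmetic intensity (A100 numbers from the text;
# the 2-FLOPs-per-weight FP16 accounting is an illustrative assumption).
FLOPS, BW = 312e12, 2e12
machine_balance = FLOPS / BW     # 156 FLOP per byte

def intensity(n_tokens):
    # ~2 FLOPs per weight per token / 2 bytes per weight
    return n_tokens * (2 / 2)

for n in (1, 32, 156, 1024):
    bound = "compute" if intensity(n) >= machine_balance else "memory"
    print(f"{n:5d} tokens/step -> {intensity(n):6.0f} FLOP/B  ({bound}-bound)")
```

Decode at batch 1 sits at ~1 FLOP/byte against a balance of 156 — the "straw-sized memory pipe" in numbers; only a step feeding ~156+ tokens through each weight load turns compute into the bottleneck.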

Prefill Phase

● Compute-bound (FLOPS limited)
● High GPU utilization (~80-95%)
● All tokens in parallel
● TTFT (Time to First Token)
● ~50,000 tok/s on A100

Like a conveyor belt processing all items at once

Decode Phase

● Memory-bound (bandwidth limited)
● Low GPU utilization (<20%)
● One token at a time
● TBOT (Time Between Output Tokens)
● ~50 tok/s per request on A100

Like a single worker hand-crafting items one by one

πŸ’‘ Prefill is 1000Γ— faster per token than decode. In prefill, the GPU tensor cores are the bottleneck (compute-bound). In decode, DRAM bandwidth is the bottleneck (memory-bound) β€” the GPU sits idle >80% of the time waiting for weight data. This is exactly why batching helps: it fills those idle cycles with useful work from other requests.

The Throughput-Latency Tradeoff

Batching isn't a free lunch. More batching means more total tokens per second, but each individual user waits longer for their response. Operators must balance server efficiency against user experience.
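The tradeoff falls out of a short memory-bound model (illustrative assumptions: 14 GB of FP16 weights for a 7B model, ~0.5 GB of KV traffic per request per decode step): total throughput climbs with batch size while each user's share shrinks.

```python
# Memory-bound decode: step time = bytes moved / bandwidth.
# All sizes are illustrative assumptions, not measurements.
WEIGHTS, KV, BW = 14e9, 0.5e9, 2e12

def tok_per_s(batch):
    step = (WEIGHTS + batch * KV) / BW   # weights + every request's KV cache
    return batch / step, 1 / step        # (total tok/s, per-user tok/s)

for b in (1, 64, 256):
    total, per_user = tok_per_s(b)
    print(f"batch {b:3d}: {total:6.0f} tok/s total | {per_user:5.1f} tok/s per user")
```

With these numbers, per-user speed drops roughly tenfold between batch 1 and batch 256 even as total throughput grows ~26×, mirroring the 🐇/🏭 contrast below.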

πŸ‡

Low Batch Size

~45 tok/s total
~45 tok/s per user

Low total throughput, but each user gets fast responses (45 tok/s per user)

🏭

High Batch Size

~2,304 tok/s total
~9 tok/s per user

High total throughput, but each user gets much slower responses (9 tok/s per user)

πŸ’‘ The sweet spot depends on your use case: real-time chat needs low latency (small batches), while batch processing jobs can maximize throughput (large batches). Most production systems target batch sizes of 32-128.

πŸ”„

Continuous Batching

How modern serving engines eliminate idle GPU time

Static batching wastes GPU time because all requests must wait for the longest one to finish. Continuous batching (aka "Orca-style") lets requests join and leave independently: when a request finishes, its slot is immediately filled with a new one, so the GPU rarely runs a partially empty batch.
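The scheduling difference can be sketched in a few lines — a toy simulation with made-up request lengths, not how any real engine is implemented:

```python
# Toy simulation: 4 slots, requests needing varying decode steps.
# Utilization = fraction of slot-time spent doing useful work.
LENGTHS = [3, 12, 5, 7, 9, 4, 11, 6]   # decode steps per request (made up)
SLOTS = 4

def static_batching(lengths, slots):
    """Batch `slots` requests; all wait for the longest before the next batch."""
    steps = busy = 0
    for i in range(0, len(lengths), slots):
        batch = lengths[i:i + slots]
        steps += max(batch)      # whole batch occupies the GPU this long
        busy += sum(batch)       # slot-steps actually doing work
    return busy / (steps * slots)

def continuous_batching(lengths, slots):
    """A finished slot immediately pulls the next queued request."""
    queue = list(lengths)
    active = [queue.pop(0) for _ in range(slots)]
    steps = busy = 0
    while active:
        steps += 1
        busy += len(active)
        active = [r - 1 for r in active if r > 1]   # drop finished requests
        while queue and len(active) < slots:        # refill freed slots now
            active.append(queue.pop(0))
    return busy / (steps * slots)

print(f"static:     {static_batching(LENGTHS, SLOTS):.0%} utilization")
print(f"continuous: {continuous_batching(LENGTHS, SLOTS):.0%} utilization")
```

For these illustrative lengths, static batching lands around 62% utilization while continuous batching reaches about 79%; with a steady queue of incoming work the continuous scheduler stays near 100%.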

[Interactive timeline: four request slots stepped over 20 time steps, showing active vs idle slot-time, incoming requests, and the resulting GPU utilization under static vs continuous scheduling.]

πŸ’‘ With continuous batching, finished slots are immediately refilled. GPU utilization stays near 100% as long as there are queued requests. This is how vLLM, TGI, and other modern engines work.

πŸ‘₯

Per-User vs System Throughput

What happens to each user as the system scales

As you add more concurrent users, the system serves more total tokens per second β€” but each individual user gets a smaller slice of the bandwidth. Beyond a critical point, the system collapses entirely: memory exhaustion, swapping, and cascading failures destroy throughput for everyone.
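The collapse point itself can be estimated from memory alone. This sketch assumes 80 GB of VRAM, 14 GB of FP16 weights for a hypothetical 7B model, and ~2 GB of KV cache per user at long context — all illustrative numbers:

```python
# Assumed sizes -- illustrative, not measured.
VRAM_BYTES = 80e9      # A100 80GB
WEIGHT_BYTES = 14e9    # 7B model in FP16 (assumption)
KV_PER_USER = 2e9      # assumed KV cache per concurrent user

def fits_in_vram(users: int) -> bool:
    """True while weights plus every user's KV cache fit in VRAM."""
    return WEIGHT_BYTES + users * KV_PER_USER <= VRAM_BYTES

# Past this count, caches spill out of VRAM and throughput
# collapses for everyone.
max_users = max(u for u in range(1, 129) if fits_in_vram(u))
print(f"max concurrent users before OOM: {max_users}")
```

Production schedulers do this accounting continuously, admitting new requests only while their projected KV footprint still fits.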

[Interactive chart: slider sweeping concurrent users from 1 to 128, showing each user's share of GPU bandwidth — Per-User Speed (~44 tok/s at 1 user) alongside Total System Throughput.]

Key Takeaways

  1. Batching amortizes the cost of loading model weights across multiple requests, dramatically increasing total throughput
  2. But there's a limit: too many batched requests cause memory exhaustion and throughput collapse (OOM)
  3. Prefill is compute-bound and ~1000× faster per token than decode, which is memory-bandwidth bound
  4. Continuous batching eliminates idle GPU time by dynamically filling slots as requests complete
  5. Per-user speed always decreases with more concurrent users — the system trades individual speed for total capacity