What is Prompt Caching?
When you send a request to an LLM API, the model must process every input token through its transformer layers, computing key-value (KV) pairs for the attention mechanism. Prompt Caching stores these computed KV pairs on the server side so that subsequent requests sharing the same prompt prefix can skip this expensive computation entirely. Instead of reprocessing thousands of tokens, the model loads the cached KV state and only processes the new, unique tokens.
Connection to KV Cache
Prompt Caching is the API-level application of the KV Cache concept you learned about in the previous article. While KV Cache operates within a single generation (caching tokens as they are produced), Prompt Caching persists the KV cache across separate API requests that share the same prefix.
Cache Hit vs. Cache Miss
When a request arrives, the provider hashes the prompt prefix and checks if a matching KV cache exists. The difference in processing is dramatic:
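The lookup can be sketched as a hash over the prompt prefix. The store, function names, and character-level granularity below are illustrative assumptions, not any provider's actual implementation (real systems match token prefixes, typically at block granularity):

```python
import hashlib

# Toy server-side store: prefix hash -> "precomputed KV state"
kv_cache: dict[str, str] = {}

def process_request(prompt: str, prefix_len: int) -> str:
    """Simulate a cache lookup on the first `prefix_len` characters."""
    prefix = prompt[:prefix_len]
    key = hashlib.sha256(prefix.encode()).hexdigest()
    if key in kv_cache:
        return "hit"    # load cached KV state, process only the new suffix
    kv_cache[key] = f"KV state for {len(prefix)} chars"
    return "miss"       # full prefill: compute KV pairs for every token

system_prompt = "You are a legal assistant. [long contract text ...] "
assert process_request(system_prompt + "Summarize it.", len(system_prompt)) == "miss"
assert process_request(system_prompt + "List the parties.", len(system_prompt)) == "hit"
```

Only the first request pays the full prefill cost; every later request sharing the prefix skips straight to its unique suffix.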
Cache Miss (First Request)
Cache Hit (Subsequent)
Cache matching is strict: the provider compares prompts character by character from the start, so any edit to the prefix invalidates the cache for everything after it.
How Providers Implement It
Each major LLM provider has its own approach to prompt caching, with different trade-offs in control, pricing, and minimum token requirements.
Anthropic (Claude)
Explicit opt-in via the cache_control parameter: mark specific content blocks as cacheable. 90% cost reduction on cached tokens, ~3× latency improvement. Minimum 1,024 tokens per cacheable block. Cache TTL is 5 minutes, refreshed on each hit.
OpenAI (GPT-4o)
Automatic: no code changes needed. The API automatically caches matching prefixes of 1,024+ tokens. 50% cost discount on cached input tokens. Caching happens transparently in the background.
Google (Gemini)
Explicit via Context Caching API. Create named cache objects with configurable TTL. 75% discount on cached tokens. Best for very large contexts (32k+ tokens) reused across many requests.
When to Use Prompt Caching
Ideal For
- Long system prompts reused across conversations (e.g., AI assistants with detailed instructions)
- Few-shot examples that stay constant while user queries change
- Large documents (contracts, codebases) analyzed with multiple different questions
- Agentic workflows where the same tool definitions and context are sent repeatedly
Not Ideal For
- Unique, one-off prompts that are never repeated
- Very short prompts (under 1,024 tokens), which fall below the caching threshold
- Prompts where the prefix changes frequently between requests
Cost & Performance Impact
The savings from prompt caching are substantial, especially for applications with long, repeated prompt prefixes. Here are the numbers from major providers:
- 90% cost reduction on cached tokens (Anthropic)
- ~3× faster time-to-first-token (Anthropic)
- 5-minute default cache lifetime (refreshed on each hit)
Pricing Comparison (Input Tokens)
| Provider | Base Price | Cached Price | Savings |
|---|---|---|---|
| Anthropic (Claude) | $3.00 / MTok | $0.30 / MTok | 90% |
| OpenAI (GPT-4o) | $2.50 / MTok | $1.25 / MTok | 50% |
| Google (Gemini) | $1.25 / MTok | $0.3125 / MTok | 75% |
Prices shown for flagship models as of early 2025. Cache write tokens may cost 25% more than base price (Anthropic). Always check current pricing docs.
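The savings compound with repetition. A sketch of the arithmetic using the table's Anthropic figures ($3.00/MTok base, $0.30/MTok cached reads, 25% surcharge on cache writes); the helper names and the 15,000-token scenario are illustrative:

```python
MTOK = 1_000_000

def cached_cost(prefix_tokens: int, requests: int,
                base=3.00, cached=0.30, write_surcharge=1.25) -> float:
    """Dollar cost of sending the same prefix `requests` times with caching.
    The first request writes the cache (base price + 25%); the rest read it."""
    write = prefix_tokens * base * write_surcharge / MTOK
    reads = (requests - 1) * prefix_tokens * cached / MTOK
    return write + reads

def uncached_cost(prefix_tokens: int, requests: int, base=3.00) -> float:
    """Dollar cost without caching: full base price on every request."""
    return requests * prefix_tokens * base / MTOK

# 15,000-token contract prefix, queried 20 times:
with_cache = cached_cost(15_000, 20)      # 0.05625 write + 0.0855 reads = $0.14175
without    = uncached_cost(15_000, 20)    # $0.90
savings    = 1 - with_cache / without     # ~84% saved overall
```

Even with the write surcharge, the break-even point is the second request; everything after that is nearly free.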
The 5-minute TTL is refreshed on each cache hit, so steady traffic keeps the cache alive indefinitely. A gap longer than the TTL lets the cache expire, and the next request pays the cost of creating a new cache.
Code Example: Anthropic API
Anthropic's implementation gives you explicit control over what gets cached. Add cache_control to any content block in your system prompt or messages:
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert legal assistant. Here is the "
                    "complete contract document you must analyze...\n\n"
                    "[... 15,000 tokens of contract text ...]",
            "cache_control": {"type": "ephemeral"},  # <-- Cache this block
        }
    ],
    messages=[
        {"role": "user", "content": "Summarize the key obligations."}
    ],
)

# Check cache performance in the response
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache creation: {usage.cache_creation_input_tokens}")
print(f"Cache read (hits): {usage.cache_read_input_tokens}")

# First request:  cache_creation ~ 15000, cache_read = 0
# Second request: cache_creation = 0, cache_read ~ 15000 (90% cheaper)
```

Implementation Tips
Put static content first
Cache matching works on prefixes. Place your system prompt and few-shot examples before any dynamic content so the prefix stays stable across requests.
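One way to see why ordering matters is to compare the shared prefix of two consecutive requests under each layout. The helpers below are a hypothetical sketch, with token counts approximated by characters:

```python
def build_prompt(static: str, dynamic: str, static_first: bool) -> str:
    """Assemble a prompt with static content first or last."""
    return static + dynamic if static_first else dynamic + static

def common_prefix_len(a: str, b: str) -> int:
    """Length of the shared prefix — a proxy for what the cache can reuse."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

system = "SYSTEM: detailed instructions and few-shot examples... " * 50

good_a = build_prompt(system, "USER: question one", static_first=True)
good_b = build_prompt(system, "USER: question two", static_first=True)
bad_a  = build_prompt(system, "USER: question one", static_first=False)
bad_b  = build_prompt(system, "USER: question two", static_first=False)

# Static-first: the entire system prompt is a shared, cacheable prefix.
assert common_prefix_len(good_a, good_b) >= len(system)
# Dynamic-first: the prefixes diverge almost immediately, so nothing caches.
assert common_prefix_len(bad_a, bad_b) < 20
```

The same content in a different order turns a near-total cache hit into a near-total miss.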
Mind the minimum token count
Anthropic requires at least 1,024 tokens in a cacheable block (2,048 for Claude Haiku). Content below this threshold won't be cached.
Understand the TTL
Anthropic's cache lives for 5 minutes, refreshed on each hit. For infrequent requests, the cache may expire between calls. OpenAI's automatic caching has similar time-based expiry.
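The refresh-on-hit behavior can be sketched as follows; the class and its eviction logic are illustrative assumptions, since real expiry is provider-internal:

```python
TTL = 5 * 60  # seconds; Anthropic's default ephemeral cache lifetime

class EphemeralCache:
    """Toy model of a TTL cache where every hit resets the expiry clock."""

    def __init__(self) -> None:
        self.expires_at: dict[str, float] = {}

    def request(self, key: str, now: float) -> str:
        if self.expires_at.get(key, 0.0) > now:
            self.expires_at[key] = now + TTL   # hit refreshes the TTL
            return "hit"
        self.expires_at[key] = now + TTL       # miss writes a fresh cache
        return "miss"

cache = EphemeralCache()
assert cache.request("prefix", now=0) == "miss"    # first request writes the cache
assert cache.request("prefix", now=240) == "hit"   # 4 min later: still alive, TTL resets
assert cache.request("prefix", now=500) == "hit"   # refreshed at 240, so valid until 540
assert cache.request("prefix", now=900) == "miss"  # gap > 5 min: expired, write again
```

The practical implication: requests arriving at least once every 5 minutes keep paying the cheap cached-read rate indefinitely.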
Monitor cache hit rates
Check the usage fields in API responses (cache_creation_input_tokens vs cache_read_input_tokens) to verify caching is working. Low hit rates mean your prefix is changing too often.
Key Takeaways
Prompt Caching reuses computed KV pairs across API requests, skipping redundant computation for repeated prompt prefixes.
Anthropic offers the best savings (90% cost, ~3× speed) with explicit cache_control. OpenAI does it automatically at 50% savings.
Best for long system prompts, few-shot examples, and large documents queried multiple times.
Structure prompts with static content first (prefix) and dynamic content last to maximize cache hit rates.