What is Prompt Caching?
When you send a request to an LLM API, the model must process every input token through its transformer layers, computing key-value (KV) pairs for the attention mechanism. Prompt Caching stores these computed KV pairs on the server side so that subsequent requests sharing the same prompt prefix can skip this expensive computation entirely. Instead of reprocessing thousands of tokens, the model loads the cached KV state and only processes the new, unique tokens.
Connection to KV Cache
Prompt Caching is the API-level application of the KV Cache concept you learned about in the previous article. While KV Cache operates within a single generation (caching tokens as they are produced), Prompt Caching persists the KV cache across separate API requests that share the same prefix.
Cache Hit vs. Cache Miss
When a request arrives, the provider hashes the prompt prefix and checks if a matching KV cache exists. The difference in processing is dramatic:
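The lookup can be sketched as a hash over the prompt prefix. The store, function names, and character-level granularity below are illustrative assumptions, not any provider's actual implementation (real systems match token prefixes, typically at block granularity):

```python
import hashlib

# Toy server-side store: prefix hash -> "precomputed KV state"
kv_cache: dict[str, str] = {}

def process_request(prompt: str, prefix_len: int) -> str:
    """Simulate a cache lookup on the first `prefix_len` characters."""
    prefix = prompt[:prefix_len]
    key = hashlib.sha256(prefix.encode()).hexdigest()
    if key in kv_cache:
        return "hit"    # load cached KV state, process only the new suffix
    kv_cache[key] = f"KV state for {len(prefix)} chars"
    return "miss"       # full prefill: compute KV pairs for every token

system_prompt = "You are a legal assistant. [long contract text ...] "
assert process_request(system_prompt + "Summarize it.", len(system_prompt)) == "miss"
assert process_request(system_prompt + "List the parties.", len(system_prompt)) == "hit"
```

Only the first request pays the full prefill cost; every later request sharing the prefix skips straight to its unique suffix.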
Cache Miss (First Request)
Cache Hit (Subsequent)
Cache matching is strict: the provider compares prompts character by character from the start, so any edit to the prefix invalidates the cache for everything after it.
How Providers Implement It
Each major LLM provider has its own approach to prompt caching, with different trade-offs in control, pricing, and minimum token requirements.
Anthropic (Claude)
Explicit opt-in via the cache_control parameter: mark specific content blocks as cacheable. 90% cost reduction on cached tokens, ~3× latency improvement. Minimum 1,024 tokens per cacheable block. Cache TTL is 5 minutes, refreshed on each hit.
OpenAI (GPT-4o)
Automatic: no code changes needed. The API automatically caches matching prefixes of 1,024+ tokens. 50% cost discount on cached input tokens. Caching happens transparently in the background.
Google (Gemini)
Explicit via Context Caching API. Create named cache objects with configurable TTL. 75% discount on cached tokens. Best for very large contexts (32k+ tokens) reused across many requests.
When to Use Prompt Caching
Ideal For
- Long system prompts reused across conversations (e.g., AI assistants with detailed instructions)
- Few-shot examples that stay constant while user queries change
- Large documents (contracts, codebases) analyzed with multiple different questions
- Agentic workflows where the same tool definitions and context are sent repeatedly
Not Ideal For
- Unique, one-off prompts that are never repeated
- Very short prompts (under 1,024 tokens), which fall below the caching threshold
- Prompts where the prefix changes frequently between requests
Cost & Performance Impact
The savings from prompt caching are substantial, especially for applications with long, repeated prompt prefixes. Here are the numbers from major providers:
- 90% cost reduction on cached tokens (Anthropic)
- ~3× faster time-to-first-token (Anthropic)
- 5-minute default cache lifetime (refreshed on each hit)
Pricing Comparison (Input Tokens)
| Provider | Base Price | Cached Price | Savings |
|---|---|---|---|
| Anthropic (Claude) | $3.00 / MTok | $0.30 / MTok | 90% |
| OpenAI (GPT-4o) | $2.50 / MTok | $1.25 / MTok | 50% |
| Google (Gemini) | $1.25 / MTok | $0.3125 / MTok | 75% |
Prices shown for flagship models as of early 2025. Cache write tokens may cost 25% more than base price (Anthropic). Always check current pricing docs.
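The savings compound with repetition. A sketch of the arithmetic using the table's Anthropic figures ($3.00/MTok base, $0.30/MTok cached reads, 25% surcharge on cache writes); the helper names and the 15,000-token scenario are illustrative:

```python
MTOK = 1_000_000

def cached_cost(prefix_tokens: int, requests: int,
                base=3.00, cached=0.30, write_surcharge=1.25) -> float:
    """Dollar cost of sending the same prefix `requests` times with caching.
    The first request writes the cache (base price + 25%); the rest read it."""
    write = prefix_tokens * base * write_surcharge / MTOK
    reads = (requests - 1) * prefix_tokens * cached / MTOK
    return write + reads

def uncached_cost(prefix_tokens: int, requests: int, base=3.00) -> float:
    """Dollar cost without caching: full base price on every request."""
    return requests * prefix_tokens * base / MTOK

# 15,000-token contract prefix, queried 20 times:
with_cache = cached_cost(15_000, 20)      # 0.05625 write + 0.0855 reads = $0.14175
without    = uncached_cost(15_000, 20)    # $0.90
savings    = 1 - with_cache / without     # ~84% saved overall
```

Even with the write surcharge, the break-even point is the second request; everything after that is nearly free.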
The 5-minute TTL is refreshed on each cache hit, so steady traffic keeps the cache alive indefinitely. A gap longer than the TTL lets the cache expire, and the next request pays the cost of creating a new cache.
Code Example: Anthropic API
Anthropic's implementation gives you explicit control over what gets cached. Add cache_control to any content block in your system prompt or messages:
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert legal assistant. Here is the "
                    "complete contract document you must analyze...\n\n"
                    "[... 15,000 tokens of contract text ...]",
            "cache_control": {"type": "ephemeral"},  # <-- Cache this block
        }
    ],
    messages=[
        {"role": "user", "content": "Summarize the key obligations."}
    ],
)

# Check cache performance in the response
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache creation: {usage.cache_creation_input_tokens}")
print(f"Cache read (hits): {usage.cache_read_input_tokens}")

# First request:  cache_creation ~ 15000, cache_read = 0
# Second request: cache_creation = 0, cache_read ~ 15000 (90% cheaper)
```

Implementation Tips
Put static content first
Cache matching works on prefixes. Place your system prompt and few-shot examples before any dynamic content so the prefix stays stable across requests.
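One way to see why ordering matters is to compare the shared prefix of two consecutive requests under each layout. The helpers below are a hypothetical sketch, with token counts approximated by characters:

```python
def build_prompt(static: str, dynamic: str, static_first: bool) -> str:
    """Assemble a prompt with static content first or last."""
    return static + dynamic if static_first else dynamic + static

def common_prefix_len(a: str, b: str) -> int:
    """Length of the shared prefix — a proxy for what the cache can reuse."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

system = "SYSTEM: detailed instructions and few-shot examples... " * 50

good_a = build_prompt(system, "USER: question one", static_first=True)
good_b = build_prompt(system, "USER: question two", static_first=True)
bad_a  = build_prompt(system, "USER: question one", static_first=False)
bad_b  = build_prompt(system, "USER: question two", static_first=False)

# Static-first: the entire system prompt is a shared, cacheable prefix.
assert common_prefix_len(good_a, good_b) >= len(system)
# Dynamic-first: the prefixes diverge almost immediately, so nothing caches.
assert common_prefix_len(bad_a, bad_b) < 20
```

The same content in a different order turns a near-total cache hit into a near-total miss.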
Mind the minimum token count
Anthropic requires at least 1,024 tokens in a cacheable block (2,048 for Claude Haiku). Content below this threshold won't be cached.
Understand the TTL
Anthropic's cache lives for 5 minutes, refreshed on each hit. For infrequent requests, the cache may expire between calls. OpenAI's automatic caching has similar time-based expiry.
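The refresh-on-hit behavior can be sketched as follows; the class and its eviction logic are illustrative assumptions, since real expiry is provider-internal:

```python
TTL = 5 * 60  # seconds; Anthropic's default ephemeral cache lifetime

class EphemeralCache:
    """Toy model of a TTL cache where every hit resets the expiry clock."""

    def __init__(self) -> None:
        self.expires_at: dict[str, float] = {}

    def request(self, key: str, now: float) -> str:
        if self.expires_at.get(key, 0.0) > now:
            self.expires_at[key] = now + TTL   # hit refreshes the TTL
            return "hit"
        self.expires_at[key] = now + TTL       # miss writes a fresh cache
        return "miss"

cache = EphemeralCache()
assert cache.request("prefix", now=0) == "miss"    # first request writes the cache
assert cache.request("prefix", now=240) == "hit"   # 4 min later: still alive, TTL resets
assert cache.request("prefix", now=500) == "hit"   # refreshed at 240, so valid until 540
assert cache.request("prefix", now=900) == "miss"  # gap > 5 min: expired, write again
```

The practical implication: requests arriving at least once every 5 minutes keep paying the cheap cached-read rate indefinitely.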
Monitor cache hit rates
Check the usage fields in API responses (cache_creation_input_tokens vs cache_read_input_tokens) to verify caching is working. Low hit rates mean your prefix is changing too often.
Key Takeaways
Prompt Caching reuses computed KV pairs across API requests, skipping redundant computation for repeated prompt prefixes.
Anthropic offers the best savings (90% cost, ~3× speed) with explicit cache_control. OpenAI does it automatically at 50% savings.
Best for long system prompts, few-shot examples, and large documents queried multiple times.
Structure prompts with static content first (prefix) and dynamic content last to maximize cache hit rates.