Prompt caching

📖 5 min readUpdated 2026-04-18

Here's the trick that makes long-running agents economically viable: you can tell Claude's API "this part of the prompt won't change - don't recompute it every time." The API caches the precomputed version and serves it back at about 10% of the normal cost. On any agent with a stable system prompt, this cuts your bill by 90%. It's the single highest-leverage performance tweak you can make that doesn't change what the agent does at all.

The intuition.

A typical agent prompt has a lot of stuff that's the same call after call. Your system prompt explaining what the agent is. Your list of tool definitions. Maybe a long document you pasted in for context. Only the user's latest message and the conversation history change turn-to-turn.

Without caching, you're paying full price to process all that stable stuff every single time. With caching, you pay once upfront to cache it, then pay a tiny fraction to reference the cached version. Over a long session or a scheduled agent that runs many times, the savings compound fast.

What a typical agent prompt looks like.

~ prompt layout, cache-friendly on left ~

Stable pieces at the BOTTOM (the start of the prompt, which is the "prefix" from the model's perspective). Dynamic pieces at the TOP (the end of the prompt). This ordering isn't cosmetic; it's what makes caching possible.

How the cache actually works.

You mark specific points in your prompt as "cache breakpoints." The first request computes the content up to each breakpoint and caches it. On subsequent requests, if your prompt matches the cached prefix EXACTLY (byte-for-byte), the cached version gets reused instead of being recomputed.

Cache hits pay about 10% of the normal input cost. Cache writes (the first time) pay a small premium (~25% more than normal input). Cache lasts 5 minutes by default - but every hit refreshes the TTL, so an agent making frequent calls keeps its cache warm indefinitely.

You get up to 4 breakpoints per request. Use them for different stability tiers:

Breakpoint 1: system prompt (most stable - changes rarely)
Breakpoint 2: tool definitions (changes with code releases)
Breakpoint 3: long reference docs (stable per-session)
Breakpoint 4: accumulated conversation (changes each turn)

The cost math, concretely.

~ 100 calls, 10k-token stable prefix ~

Numbers work out like this at Sonnet pricing:

Regular input: $3 per million tokens
Cached read: $0.30 per million tokens (one-tenth)
Cache write: $3.75 per million tokens (25% premium, one time)

An agent doing 100 calls with a 10,000-token stable prefix:

No caching: 100 × 10,000 × $3/M = $3.00
With caching: 1 × 10,000 × $3.75/M + 99 × 10,000 × $0.30/M = $0.34

~90% savings on the stable portion. On agents with big system prompts or long retrieved context, the absolute dollar savings get real fast.

Latency wins, not just cost.

Cache hits are also faster. The model skips re-processing the cached content, so time-to-first-token on a long cached prompt drops roughly 40%. On interactive agents where user-perceived latency matters (chat, voice), this is a meaningful UX improvement. Not just a cost play.

When caching isn't worth bothering with.

~ skip caching vs worth caching ~

Gotchas that will trip you.

Exact-match, byte for byte. A stray space, a changed newline, a different Unicode variant - any of those misses the cache. If you're templating dynamically, be religious about keeping the stable sections byte-identical across calls.
Cache writes cost a little extra. Only mark things as cacheable if they'll be reused. Caching something that appears once actively costs you money.
Ordering matters. Stable content MUST come before dynamic content. The cache matches by prefix - if the user's message appears before your stable tool definitions, nothing downstream of that message can be cached.
Per-API-key, per-model. Cache entries don't transfer between different API keys or different models. Switching Sonnet → Opus means re-caching.
5-minute TTL. If your agent sleeps 10 minutes between calls, you'll pay the write premium twice for no benefit. Batch your work into bursts or keep a heartbeat request alive.

The workflow.

Identify the stable parts of your prompt. System prompt, tool definitions, long reference docs.
Restructure so stable content comes first. If it doesn't already - fix the order. This is usually a small refactor.
Add cache breakpoints. 1-2 at minimum (after system prompt, after tool definitions). More if you have multiple stability tiers.
Measure. Check your actual cost before and after. Cache-hit rate should be 95%+ on a warm agent.
Keep the stable sections stable. Audit your templating to make sure nothing mutates them accidentally.

Bottom line: if you have a stable prompt prefix and an agent making more than a handful of calls, prompt caching isn't optional - it's operational hygiene. Set it up once, and every future agent you build inherits the savings automatically.

Prompt caching

The intuition.

What a typical agent prompt looks like.

How the cache actually works.

The cost math, concretely.

Latency wins, not just cost.

When caching isn't worth bothering with.

Gotchas that will trip you.

The workflow.

Further reading

Watch