Prompt caching

Here's the trick that makes long-running agents economically viable: you can tell Claude's API "this part of the prompt won't change - don't recompute it every time." The API caches the precomputed version and serves it back at about 10% of the normal cost. On any agent with a stable system prompt, this cuts your bill by 90%. It's the single highest-leverage performance tweak you can make that doesn't change what the agent does at all.

The intuition.

A typical agent prompt has a lot of stuff that's the same call after call. Your system prompt explaining what the agent is. Your list of tool definitions. Maybe a long document you pasted in for context. Only the user's latest message and the conversation history change turn-to-turn.

Without caching, you're paying full price to process all that stable stuff every single time. With caching, you pay once upfront to cache it, then pay a tiny fraction to reference the cached version. Over a long session or a scheduled agent that runs many times, the savings compound fast.

What a typical agent prompt looks like.

~ prompt layout, cache-friendly on left ~

Stable pieces at the BOTTOM (the start of the prompt, which is the "prefix" from the model's perspective). Dynamic pieces at the TOP (the end of the prompt). This ordering isn't cosmetic; it's what makes caching possible.

How the cache actually works.

You mark specific points in your prompt as "cache breakpoints." The first request computes the content up to each breakpoint and caches it. On subsequent requests, if your prompt matches the cached prefix EXACTLY (byte-for-byte), the cached version gets reused instead of being recomputed.

Cache hits pay about 10% of the normal input cost. Cache writes (the first time) pay a small premium (~25% more than normal input). Cache lasts 5 minutes by default - but every hit refreshes the TTL, so an agent making frequent calls keeps its cache warm indefinitely.

You get up to 4 breakpoints per request. Use them for different stability tiers:

The cost math, concretely.

~ 100 calls, 10k-token stable prefix ~

Numbers work out like this at Sonnet pricing:

An agent doing 100 calls with a 10,000-token stable prefix:

~90% savings on the stable portion. On agents with big system prompts or long retrieved context, the absolute dollar savings get real fast.

Latency wins, not just cost.

Cache hits are also faster. The model skips re-processing the cached content, so time-to-first-token on a long cached prompt drops roughly 40%. On interactive agents where user-perceived latency matters (chat, voice), this is a meaningful UX improvement. Not just a cost play.

When caching isn't worth bothering with.

~ skip caching vs worth caching ~

Gotchas that will trip you.

The workflow.

  1. Identify the stable parts of your prompt. System prompt, tool definitions, long reference docs.
  2. Restructure so stable content comes first. If it doesn't already - fix the order. This is usually a small refactor.
  3. Add cache breakpoints. 1-2 at minimum (after system prompt, after tool definitions). More if you have multiple stability tiers.
  4. Measure. Check your actual cost before and after. Cache-hit rate should be 95%+ on a warm agent.
  5. Keep the stable sections stable. Audit your templating to make sure nothing mutates them accidentally.
Bottom line: if you have a stable prompt prefix and an agent making more than a handful of calls, prompt caching isn't optional - it's operational hygiene. Set it up once, and every future agent you build inherits the savings automatically.