Prompt caching lets the API reuse pre-computed attention state across requests that share a prompt prefix. For agents with stable system prompts, it's a 5–10× cost cut and material latency reduction.
You mark a portion of the prompt as cacheable. The first request computes and caches it. Subsequent requests that send the exact same prefix hit the cache and pay ~10% of the normal input cost for that prefix.
Claude's prompt cache has a 5-minute TTL by default. Each request that hits the cache refreshes it. So an agent making frequent calls keeps the cache warm indefinitely.
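In request terms, the marking is a `cache_control` field on the last block of the stable prefix. A minimal sketch, assuming the Anthropic Messages API request shape (the model id and prompt text are placeholders):

```python
# A minimal sketch of a cacheable-prefix request body.
SYSTEM_PROMPT = "You are a coding agent. <5k tokens of instructions...>"

def build_request(history: list[dict], new_message: str) -> dict:
    """Build a request whose stable prefix is marked cacheable."""
    return {
        "model": "claude-sonnet-4-5",  # placeholder model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # Everything up to and including this block becomes the
                # cacheable prefix: the first call writes it, later
                # identical calls read it at ~10% of normal input cost.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": history + [{"role": "user", "content": new_message}],
    }
```

Any call within the 5-minute TTL that sends the exact same system block hits the cache and refreshes the timer.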
Agents repeat prompts. A typical agent call looks like:
[ system prompt 5k tokens ] ← same every call
[ tool definitions 2k tokens ] ← same every call
[ conversation so far 10k tokens ] ← grows each turn
[ new message ]
Without caching, every call pays full price for the 7k-token stable prefix. With caching, each warm call pays ~10% of that: about 700 tokens' worth.
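The per-call arithmetic for the 7k-stable / 10k-dynamic call above, as a toy cost model (the one-time write premium on the first call is ignored here):

```python
def effective_input_tokens(stable: int, dynamic: int, warm: bool,
                           read_ratio: float = 0.10) -> float:
    """Per-call input cost in full-price-token equivalents.
    Ignores the one-time cache-write premium on the first call."""
    return stable * (read_ratio if warm else 1.0) + dynamic

cold = effective_input_tokens(7_000, 10_000, warm=False)  # 17000.0
warm = effective_input_tokens(7_000, 10_000, warm=True)   # 10700.0
```

The dynamic part always bills at full price; the win is entirely on the stable prefix, which is why ordering matters.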
Put the stable content first. Claude's cache matches by exact prefix. Anything that changes invalidates the cache from that point forward.
GOOD order:
1. System prompt [cache-ok, stable]
2. Tool definitions [cache-ok, stable]
3. Long context docs [cache-ok, stable-ish]
4. Conversation history [changes each turn]
5. New user message [changes each turn]
BAD order:
1. System prompt
2. User message ← changes every call; invalidates everything after it
3. Tool definitions
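The invalidation rule can be modeled as longest-exact-prefix matching over prompt blocks. A toy sketch (block names are illustrative; the real cache compares actual prefix content):

```python
def cached_prefix_len(previous: list[str], current: list[str]) -> int:
    """Number of leading blocks that match exactly, i.e. the part the
    cache can reuse. The first difference invalidates everything after it."""
    n = 0
    for prev, cur in zip(previous, current):
        if prev != cur:
            break
        n += 1
    return n

# GOOD order: the new turn only appends, so the whole prior prompt is reused.
good      = ["system", "tools", "docs", "turn-1", "msg-1"]
good_next = ["system", "tools", "docs", "turn-1", "msg-1", "reply-1", "msg-2"]

# BAD order: the user message sits before the tool definitions, so changing
# it throws away everything after the system prompt.
bad       = ["system", "msg-1", "tools"]
bad_next  = ["system", "msg-2", "tools"]

cached_prefix_len(good, good_next)  # 5: entire previous prompt reused
cached_prefix_len(bad, bad_next)    # 1: only the system prompt survives
```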
You can mark specific points as cache breakpoints (up to 4 per request). Each breakpoint caches the prefix ending at that point, so a change later in the prompt can still hit the cache at an earlier breakpoint.
Use breakpoints to cache different stability tiers:
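A sketch of tiered breakpoints, again assuming the Messages API request shape; all text is placeholder:

```python
# One breakpoint per stability tier (up to 4 per request).
request = {
    "model": "claude-sonnet-4-5",  # placeholder model id
    "max_tokens": 1024,
    "system": [
        # Tier 1: system prompt, never changes.
        {"type": "text", "text": "<system prompt>",
         "cache_control": {"type": "ephemeral"}},
        # Tier 2: large retrieved docs, change occasionally. Swapping a
        # doc invalidates this block but still hits the tier-1 cache.
        {"type": "text", "text": "<long context docs>",
         "cache_control": {"type": "ephemeral"}},
    ],
    "messages": [
        {"role": "user", "content": "<turn 1>"},
        {"role": "assistant", "content": [
            # Tier 3: conversation so far. Re-marking the latest complete
            # turn each call caches the growing history incrementally.
            {"type": "text", "text": "<reply 1>",
             "cache_control": {"type": "ephemeral"}},
        ]},
        {"role": "user", "content": "<the new message, never cached>"},
    ],
}
```

Each tier only pays the write cost when its own content changes; more stable tiers keep hitting their cached prefix.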
At Claude Sonnet pricing (directional): regular input ~$3/MTok, cache writes ~$3.75/MTok (a 25% premium), cache reads ~$0.30/MTok.
For an agent making 100 calls with a 10k-token stable prefix: uncached, that's 1M input tokens, about $3.00. Cached, it's one write (~$0.04) plus 99 reads (~$0.30), about $0.34.
~90% savings. On agents with long system prompts or large retrieved context, it's substantial.
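The savings arithmetic, using directional Sonnet-style rates (assumed: ~$3/MTok input, ~$3.75/MTok cache write, ~$0.30/MTok cache read):

```python
# Directional per-million-token rates (assumed, not authoritative).
BASE, WRITE, READ = 3.00, 3.75, 0.30

def run_cost(calls: int, prefix_tokens: int) -> tuple[float, float]:
    """(uncached, cached) dollar cost of the stable prefix across a run."""
    mtok = prefix_tokens / 1e6
    uncached = calls * mtok * BASE
    cached = mtok * WRITE + (calls - 1) * mtok * READ  # 1 write + N-1 reads
    return uncached, cached

uncached, cached = run_cost(100, 10_000)  # ~3.00 vs ~0.33
savings = 1 - cached / uncached           # ~0.89
```

The write premium is a rounding error here; the reads dominate, so savings converge toward the read discount as call count grows.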
Cached reads are faster than fresh computation. Typical TTFT (time to first token) on a long cached prompt is ~40% lower.