Caching is the highest-impact performance optimization for most production RAG systems. The savings compound: embedding cache hits save time and money, retrieval cache hits save more, response cache hits save the most. Here are the layers that actually hit in practice.
Cache the embedding of each query. Keyed by exact query text + embedding model version.
Cache the top-k chunks returned for each query.
Cache the reranked ordering.
Cache the complete generated answer for exact-match queries.
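The four layers above can be sketched as a single lookup path that checks the cheapest-to-serve cache first. The `embed`, `search`, and `generate` stubs below are illustrative stand-ins for real embedding, vector-search, and LLM calls; the in-memory dicts stand in for real stores like Redis.

```python
# Layered cache lookup, cheapest-to-serve first. All names here are
# illustrative; the stubs count calls so the savings are visible.
calls = {"embed": 0, "search": 0, "generate": 0}

def embed(q):             # stand-in for an embedding API call
    calls["embed"] += 1
    return [float(len(q))]

def search(vec):          # stand-in for a vector-store query
    calls["search"] += 1
    return ["chunk-a", "chunk-b"]

def generate(q, chunks):  # stand-in for an LLM call
    calls["generate"] += 1
    return f"answer({q})"

embedding_cache, retrieval_cache, response_cache = {}, {}, {}

def answer(query):
    if query in response_cache:             # response hit: skip everything
        return response_cache[query]
    chunks = retrieval_cache.get(query)     # retrieval hit: skip embed+search
    if chunks is None:
        if query not in embedding_cache:    # embedding hit: skip embed only
            embedding_cache[query] = embed(query)
        chunks = retrieval_cache[query] = search(embedding_cache[query])
    response_cache[query] = generate(query, chunks)
    return response_cache[query]
```

A repeated query returns from the response cache without touching any of the three stubs, which is the compounding the opening paragraph describes.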
Claude, Gemini, and OpenAI (in limited beta) support caching parts of the prompt on the provider side. Cache the system prompt with retrieved context. Next call with the same prompt prefix reuses the cached tokens.
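One shape this takes with Anthropic's API is a `cache_control` marker on the stable prompt prefix; the payload below is an illustrative sketch (model id and prompt contents are examples — check the provider docs for current syntax and minimum cacheable lengths).

```python
# Illustrative Anthropic-style request payload with provider-side prompt
# caching. The long, stable prefix (system prompt + retrieved context) is
# marked cacheable; only the short user question varies between calls.
def build_request(system_prompt, context_chunks, question):
    return {
        "model": "claude-sonnet-4-20250514",  # example model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prompt + "\n\n" + "\n".join(context_chunks),
                "cache_control": {"type": "ephemeral"},  # cache this prefix
            }
        ],
        "messages": [{"role": "user", "content": question}],
    }
```

The key design point: put everything that repeats across calls (system prompt, retrieved context) before the part that varies (the question), so the cached prefix is as long as possible.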
Hash of: normalized_query_text + embedding_model_version
Hash of: normalized_query_text + index_version + filter_signature
Filter signature captures any metadata filters (tenant, permissions, etc.).
Hash of: normalized_query_text + context_signature + model_version + user_context
User context matters: a response that references user-specific data shouldn't be cached across users.
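The three key recipes above can be implemented with a single hashing helper. Component values here (model names, index version, signatures) are illustrative; the `|` separator is an arbitrary choice.

```python
import hashlib

def cache_key(*parts: str) -> str:
    """Stable key from ordered components; order matters by design."""
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

# The three recipes, with made-up example components:
emb_key = cache_key("what's our refund policy", "text-embedding-3-small")
ret_key = cache_key("what's our refund policy", "index-v42", "tenant=acme")
resp_key = cache_key("what's our refund policy", "ctx-sig-abc",
                     "gpt-4o", "user-123")
```

Because the model/index versions are part of the key, shipping a new embedding model or rebuilding the index naturally misses the old entries rather than serving stale ones.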
Cache hit rates depend on normalization. Apply consistently:
"What's our refund policy?" and "what's our refund policy" should hit the same cache entry.
TTL on every entry. Simple, acceptable for many use cases.
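As a sketch of the TTL approach, here is a minimal in-memory version; a real deployment would typically use Redis with `SETEX` or `EXPIRE` instead.

```python
import time

class TTLCache:
    """Minimal in-memory TTL cache; expired entries are dropped lazily on read."""
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None or entry[0] < time.monotonic():
            self._store.pop(key, None)  # expired or absent: treat as miss
            return default
        return entry[1]
```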
Invalidate when underlying data changes:
Event-based invalidation keeps cache fresh but adds complexity. Often implemented as cache versioning: bump a version number, treat old cache entries as invalid.
Check the cached entry's validity only when it's requested. If stale, regenerate.
Simplest implementation: cache entries store the index version they were generated against. On read, compare to current version; if mismatch, regenerate.
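That check-on-read scheme can be sketched in a few lines; `state` and `retrieve_cached` are illustrative names.

```python
# Version-stamped entries with check-on-read invalidation. Bumping
# index_version implicitly invalidates every older entry without
# touching the cache itself.
state = {"index_version": 1}   # bumped whenever the underlying index changes
retrieval_cache = {}           # query -> (version, chunks)

def retrieve_cached(query, retrieve_fn):
    entry = retrieval_cache.get(query)
    if entry is not None and entry[0] == state["index_version"]:
        return entry[1]                    # fresh hit
    chunks = retrieve_fn(query)            # missing or stale: regenerate
    retrieval_cache[query] = (state["index_version"], chunks)
    return chunks
```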
Standard choice. Fast, simple, supports TTL and eviction policies. Good for query embedding cache and retrieval cache.
Similar to Redis, with slightly fewer features but very high performance.
For small-scale or single-instance deployments. No network hop. Doesn't survive restarts.
If responses can be cached publicly (rare for RAG), CDN edge caching gives global low-latency serving.
Most RAG systems have per-user context (permissions, preferences, history). Cache keys must include user context:
Tradeoff: higher hit rate with user-agnostic caching, better correctness with user-scoped.
Same principle applies across tenants:
If the answer depends on user history or profile, caching it without a user-scoped key serves one user's answer to another.
"How many orders have we processed today?" should never hit a stale cache.
Queries that intentionally have non-deterministic output (creative generation, summarization styles).
Dashboard hit rates per layer. A cache layer with a <5% hit rate is probably not worth the complexity.
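A minimal per-layer counter is enough to compute those hit rates; in production you would export the same numbers to your metrics system rather than keep them in process.

```python
from collections import defaultdict

class CacheStats:
    """Per-layer hit/miss counters; hit_rate feeds the dashboard."""
    def __init__(self):
        self.hits = defaultdict(int)
        self.misses = defaultdict(int)

    def record(self, layer: str, hit: bool):
        (self.hits if hit else self.misses)[layer] += 1

    def hit_rate(self, layer: str) -> float:
        total = self.hits[layer] + self.misses[layer]
        return self.hits[layer] / total if total else 0.0
```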
Each cache layer's savings compound:
At scale, effective caching can reduce total cost and latency by 40-70%.
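A back-of-envelope model makes the compounding concrete. The per-stage costs and hit rates below are made-up illustrative numbers, not benchmarks.

```python
# Expected per-query cost under the layered cache. A response hit costs
# ~nothing; on a miss you pay generation, plus retrieval unless the
# retrieval cache hits, plus embedding unless that cache hits.
def expected_cost(c_embed, c_retrieve, c_generate, h_resp, h_ret, h_emb):
    miss_cost = c_generate + (1 - h_ret) * (c_retrieve + (1 - h_emb) * c_embed)
    return (1 - h_resp) * miss_cost

uncached = expected_cost(1.0, 2.0, 7.0, 0.0, 0.0, 0.0)   # 10.0 units
cached   = expected_cost(1.0, 2.0, 7.0, 0.3, 0.4, 0.6)   # ~5.9 units
```

With these illustrative hit rates the cost drops by roughly 40%, consistent with the 40-70% range quoted above; the response-cache hit rate dominates because a hit there skips every downstream stage.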
Next: Observability and tracing.