Caching strategies

Caching is the highest-impact performance optimization for most production RAG systems. The savings compound: embedding cache hits save time and money, retrieval cache hits save more, response cache hits save the most. Here are the layers that actually hit in practice.

The cache hierarchy

Layer 1: Query embedding cache

Cache the embedding of each query. Keyed by exact query text + embedding model version.

Layer 2: Retrieval result cache

Cache the top-k chunks returned for each query.

Layer 3: Reranked result cache

Cache the reranked ordering.

Layer 4: Full response cache

Cache the complete generated answer for exact-match queries.

Layer 5: Prompt cache (LLM provider feature)

Claude, Gemini, and OpenAI (in limited beta) support caching parts of the prompt on the provider side. Cache the system prompt with retrieved context. Next call with the same prompt prefix reuses the cached tokens.
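As a sketch of the Anthropic-style version, the system prompt plus retrieved context is marked cacheable with a `cache_control` block; everything up to that marker becomes a reusable prefix. Payload construction only, no API call; the model name is illustrative:

```python
def build_request(system_prompt: str, context: str, question: str) -> dict:
    """Build a Messages API payload whose prefix (system + context) is cacheable."""
    return {
        "model": "claude-sonnet-4-5",  # assumed model name
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": system_prompt},
            {
                "type": "text",
                "text": context,
                # Marks everything up to and including this block as a
                # cacheable prompt prefix on the provider side.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        "messages": [{"role": "user", "content": question}],
    }
```

Only the user question varies per call, so repeated questions against the same retrieved context reuse the cached prefix tokens.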

Cache key design

For query embedding

Hash of: normalized_query_text + embedding_model_version

For retrieval results

Hash of: normalized_query_text + index_version + filter_signature

Filter signature captures any metadata filters (tenant, permissions, etc.).

For responses

Hash of: normalized_query_text + context_signature + model_version + user_context

User context matters: a response that references user-specific data shouldn't be cached across users.
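The three key recipes above can be sketched as follows (function names and the hashing scheme are illustrative; queries are assumed already normalized):

```python
import hashlib

def _h(*parts: str) -> str:
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

def embedding_key(query: str, model_version: str) -> str:
    return _h(query, model_version)

def retrieval_key(query: str, index_version: str, filters: dict) -> str:
    # Sort filters so {"tenant": "a", "role": "r"} always hashes the same
    # regardless of insertion order.
    filter_sig = ",".join(f"{k}={v}" for k, v in sorted(filters.items()))
    return _h(query, index_version, filter_sig)

def response_key(query: str, context_sig: str,
                 model_version: str, user_id: str) -> str:
    return _h(query, context_sig, model_version, user_id)
```

Including the user ID in the response key is what prevents a user-specific answer from being served to someone else.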

Normalization

Cache hit rates depend on normalization. Apply it consistently at every layer: lowercase, trim and collapse whitespace, and strip trailing punctuation before hashing.

"What's our refund policy?" and "what's our refund policy" should hit the same cache entry.

Cache invalidation

Time-based

TTL on every entry. Simple, acceptable for many use cases.

Event-based

Invalidate when underlying data changes: documents added, updated, or deleted; the index rebuilt; the embedding or generation model upgraded.

Event-based invalidation keeps cache fresh but adds complexity. Often implemented as cache versioning: bump a version number, treat old cache entries as invalid.

Lazy invalidation

Check the cached entry's validity only when it's requested. If stale, regenerate.

Simplest implementation: cache entries store the index version they were generated against. On read, compare to current version; if mismatch, regenerate.
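That implementation can be sketched as a small versioned cache (names illustrative):

```python
from typing import Any, Callable

class VersionedCache:
    """Each entry stores the index version it was built against.
    Validity is checked only on read (lazy invalidation)."""

    def __init__(self) -> None:
        self._store: dict[str, tuple[str, Any]] = {}

    def get_or_compute(self, key: str, current_version: str,
                       compute: Callable[[], Any]) -> Any:
        entry = self._store.get(key)
        if entry is not None and entry[0] == current_version:
            return entry[1]                        # fresh hit
        value = compute()                          # stale or missing: regenerate
        self._store[key] = (current_version, value)
        return value
```

Bumping the index version is then the entire invalidation story: no scan over existing entries, stale ones simply stop matching.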

Cache stores

Redis

Standard choice. Fast, simple, supports TTL and eviction policies. Good for query embedding cache and retrieval cache.

Memcached

Similar to Redis with fewer features (no persistence, simpler data model) but very high performance.

In-process (LRU)

For small-scale or single-instance deployments. No network hop. Doesn't survive restarts.
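For the in-process case, Python's `functools.lru_cache` is often all you need; the embedding body here is a stand-in:

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def embed_query(query: str) -> tuple[float, ...]:
    # Stand-in for a real embedding call. Arguments must be hashable
    # (strings are), and returning a tuple keeps the cached value immutable.
    return (float(len(query)),)
```

The cache lives in the process: no network hop, but it is lost on restart and not shared across replicas.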

CDN (for response cache)

If responses can be cached publicly (rare for RAG), CDN edge caching gives global low-latency serving.

User-scoped caching

Most RAG systems have per-user context (permissions, preferences, history). Cache keys must include that user context, or cached results can leak across permission boundaries.

Tradeoff: higher hit rate with user-agnostic caching, better correctness with user-scoped.

Multi-tenant caching

The same principle applies across tenants: include the tenant ID in every cache key so one tenant's cached results are never served to another.

What NOT to cache

Generation for user-specific contexts

If the answer depends on user history or profile, caching produces wrong answers for different users.

Real-time data queries

"How many orders have we processed today?" should never hit a stale cache.

Dynamic results

Queries that intentionally have non-deterministic output (creative generation, summarization styles).

Measuring cache effectiveness

Track per-layer hit rate, latency saved per hit, and cost saved per hit. Dashboard these. A cache layer with a <5% hit rate is probably not worth the complexity.
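A minimal per-layer hit/miss counter to feed such a dashboard (a sketch; in production you would export these to your metrics system):

```python
from collections import defaultdict

class CacheMetrics:
    """Per-layer hit/miss counters."""

    def __init__(self) -> None:
        self.hits: dict[str, int] = defaultdict(int)
        self.misses: dict[str, int] = defaultdict(int)

    def record(self, layer: str, hit: bool) -> None:
        (self.hits if hit else self.misses)[layer] += 1

    def hit_rate(self, layer: str) -> float:
        total = self.hits[layer] + self.misses[layer]
        return self.hits[layer] / total if total else 0.0
```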

The compound effect

Each cache layer's savings compound: an embedding-cache hit skips the embedding call, a retrieval-cache hit also skips the vector search, and a response-cache hit skips the entire pipeline, including generation.

At scale, effective caching can reduce total cost and latency by 40-70%.
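A rough back-of-envelope of the compounding, with illustrative per-stage latencies and hit rates (all numbers assumed, not measured):

```python
# Illustrative per-request latency (ms) for each pipeline stage.
costs = {"embed": 50, "retrieve": 100, "rerank": 150, "generate": 2000}

# Assumed hit rates; a response-cache hit skips every stage.
response_hit = 0.20
embed_hit, retrieve_hit, rerank_hit = 0.60, 0.40, 0.30

full = sum(costs.values())  # cost with no caching at all
expected = (1 - response_hit) * (
    (1 - embed_hit) * costs["embed"]
    + (1 - retrieve_hit) * costs["retrieve"]
    + (1 - rerank_hit) * costs["rerank"]
    + costs["generate"]
)
savings = 1 - expected / full
```

Even with modest hit rates the layers multiply; since generation dominates the cost here, the response-cache hit rate has by far the largest effect on the total.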

Next: Observability and tracing.