Caching is the highest-impact performance optimization for most production RAG systems. The savings compound: embedding cache hits save time and money, retrieval cache hits save more, response cache hits save the most. Here are the layers that actually hit in practice.
Cache the embedding of each query. Keyed by exact query text + embedding model version.
Cache the top-k chunks returned for each query.
Cache the reranked ordering.
Cache the complete generated answer for exact-match queries.
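The four layers above can be sketched as a single lookup path that checks the cheapest-to-serve cache first. The `embed`, `search`, and `generate` stubs below are illustrative stand-ins for real embedding, vector-search, and LLM calls; the in-memory dicts stand in for real stores like Redis.

```python
# Layered cache lookup, cheapest-to-serve first. All names here are
# illustrative; the stubs count calls so the savings are visible.
calls = {"embed": 0, "search": 0, "generate": 0}

def embed(q):             # stand-in for an embedding API call
    calls["embed"] += 1
    return [float(len(q))]

def search(vec):          # stand-in for a vector-store query
    calls["search"] += 1
    return ["chunk-a", "chunk-b"]

def generate(q, chunks):  # stand-in for an LLM call
    calls["generate"] += 1
    return f"answer({q})"

embedding_cache, retrieval_cache, response_cache = {}, {}, {}

def answer(query):
    if query in response_cache:             # response hit: skip everything
        return response_cache[query]
    chunks = retrieval_cache.get(query)     # retrieval hit: skip embed+search
    if chunks is None:
        if query not in embedding_cache:    # embedding hit: skip embed only
            embedding_cache[query] = embed(query)
        chunks = retrieval_cache[query] = search(embedding_cache[query])
    response_cache[query] = generate(query, chunks)
    return response_cache[query]
```

A repeated query returns from the response cache without touching any of the three stubs, which is the compounding the opening paragraph describes.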
Claude, Gemini, and OpenAI (in limited beta) support caching parts of the prompt on the provider side. Cache the system prompt with retrieved context. Next call with the same prompt prefix reuses the cached tokens.
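One shape this takes with Anthropic's API is a `cache_control` marker on the stable prompt prefix; the payload below is an illustrative sketch (model id and prompt contents are examples — check the provider docs for current syntax and minimum cacheable lengths).

```python
# Illustrative Anthropic-style request payload with provider-side prompt
# caching. The long, stable prefix (system prompt + retrieved context) is
# marked cacheable; only the short user question varies between calls.
def build_request(system_prompt, context_chunks, question):
    return {
        "model": "claude-sonnet-4-20250514",  # example model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prompt + "\n\n" + "\n".join(context_chunks),
                "cache_control": {"type": "ephemeral"},  # cache this prefix
            }
        ],
        "messages": [{"role": "user", "content": question}],
    }
```

The key design point: put everything that repeats across calls (system prompt, retrieved context) before the part that varies (the question), so the cached prefix is as long as possible.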
Hash of: normalized_query_text + embedding_model_version
Hash of: normalized_query_text + index_version + filter_signature
Filter signature captures any metadata filters (tenant, permissions, etc.).
Hash of: normalized_query_text + context_signature + model_version + user_context
User context matters: a response that references user-specific data shouldn't be cached across users.
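The three key recipes above can be implemented with a single hashing helper. Component values here (model names, index version, signatures) are illustrative; the `|` separator is an arbitrary choice.

```python
import hashlib

def cache_key(*parts: str) -> str:
    """Stable key from ordered components; order matters by design."""
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

# The three recipes, with made-up example components:
emb_key = cache_key("what's our refund policy", "text-embedding-3-small")
ret_key = cache_key("what's our refund policy", "index-v42", "tenant=acme")
resp_key = cache_key("what's our refund policy", "ctx-sig-abc",
                     "gpt-4o", "user-123")
```

Because the model/index versions are part of the key, shipping a new embedding model or rebuilding the index naturally misses the old entries rather than serving stale ones.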
Cache hit rates depend on normalization. Apply consistently:
"What's our refund policy?" and "what's our refund policy" should hit the same cache entry.
TTL on every entry. Simple, acceptable for many use cases.
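As a sketch of the TTL approach, here is a minimal in-memory version; a real deployment would typically use Redis with `SETEX` or `EXPIRE` instead.

```python
import time

class TTLCache:
    """Minimal in-memory TTL cache; expired entries are dropped lazily on read."""
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None or entry[0] < time.monotonic():
            self._store.pop(key, None)  # expired or absent: treat as miss
            return default
        return entry[1]
```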
Invalidate when underlying data changes:
Event-based invalidation keeps cache fresh but adds complexity. Often implemented as cache versioning: bump a version number, treat old cache entries as invalid.
Check the cached entry's validity only when it's requested. If stale, regenerate.
Simplest implementation: cache entries store the index version they were generated against. On read, compare to current version; if mismatch, regenerate.
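That check-on-read scheme can be sketched in a few lines; `state` and `retrieve_cached` are illustrative names.

```python
# Version-stamped entries with check-on-read invalidation. Bumping
# index_version implicitly invalidates every older entry without
# touching the cache itself.
state = {"index_version": 1}   # bumped whenever the underlying index changes
retrieval_cache = {}           # query -> (version, chunks)

def retrieve_cached(query, retrieve_fn):
    entry = retrieval_cache.get(query)
    if entry is not None and entry[0] == state["index_version"]:
        return entry[1]                    # fresh hit
    chunks = retrieve_fn(query)            # missing or stale: regenerate
    retrieval_cache[query] = (state["index_version"], chunks)
    return chunks
```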
Standard choice. Fast, simple, supports TTL and eviction policies. Good for query embedding cache and retrieval cache.
Similar to Redis, with slightly fewer features but very high performance.
For small-scale or single-instance deployments. No network hop. Doesn't survive restarts.
If responses can be cached publicly (rare for RAG), CDN edge caching gives global low-latency serving.
Most RAG systems have per-user context (permissions, preferences, history). Cache keys must include user context:
Tradeoff: higher hit rate with user-agnostic caching, better correctness with user-scoped.
Same principle applies across tenants:
If the answer depends on user history or profile, caching it without a user-scoped key serves one user's answer to another.
"How many orders have we processed today?" should never hit a stale cache.
Queries that intentionally have non-deterministic output (creative generation, summarization styles).
Dashboard hit rates per layer. A cache layer with a <5% hit rate is probably not worth the complexity.
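A minimal per-layer counter is enough to compute those hit rates; in production you would export the same numbers to your metrics system rather than keep them in process.

```python
from collections import defaultdict

class CacheStats:
    """Per-layer hit/miss counters; hit_rate feeds the dashboard."""
    def __init__(self):
        self.hits = defaultdict(int)
        self.misses = defaultdict(int)

    def record(self, layer: str, hit: bool):
        (self.hits if hit else self.misses)[layer] += 1

    def hit_rate(self, layer: str) -> float:
        total = self.hits[layer] + self.misses[layer]
        return self.hits[layer] / total if total else 0.0
```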
Each cache layer's savings compound:
At scale, effective caching can reduce total cost and latency by 40-70%.
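A back-of-envelope model makes the compounding concrete. The per-stage costs and hit rates below are made-up illustrative numbers, not benchmarks.

```python
# Expected per-query cost under the layered cache. A response hit costs
# ~nothing; on a miss you pay generation, plus retrieval unless the
# retrieval cache hits, plus embedding unless that cache hits.
def expected_cost(c_embed, c_retrieve, c_generate, h_resp, h_ret, h_emb):
    miss_cost = c_generate + (1 - h_ret) * (c_retrieve + (1 - h_emb) * c_embed)
    return (1 - h_resp) * miss_cost

uncached = expected_cost(1.0, 2.0, 7.0, 0.0, 0.0, 0.0)   # 10.0 units
cached   = expected_cost(1.0, 2.0, 7.0, 0.3, 0.4, 0.6)   # ~5.9 units
```

With these illustrative hit rates the cost drops by roughly 40%, consistent with the 40-70% range quoted above; the response-cache hit rate dominates because a hit there skips every downstream stage.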
Next: Observability and tracing.