A notebook RAG can take 10 seconds per query. A production RAG has a budget: 500ms, 2 seconds, maybe 5. Here's where latency goes in a typical system and the levers I pull to cut it.
For a standard RAG pipeline (hybrid retrieval + reranking + generation):
- Query embedding: 30-150ms
- Vector search: 20-100ms
- BM25 search: 10-50ms
- Fusion: <5ms
- Rerank (cross-encoder): 100-400ms
- LLM generation (first token): 300-1500ms
- LLM generation (total): 1000-5000ms

Totals:

- To first token: 500-2000ms
- End-to-end: 1500-5500ms
Generation dominates: everything else combined usually takes less time than the LLM call alone.
A 3-second total response feels fast if the first token arrives in 500ms. The same total feels slow if nothing appears until the 3-second mark.
Stream the LLM output. Perceived latency drops dramatically. Users tolerate 3-5 second total response if they see progress from 500ms in.
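A minimal sketch of the pattern, using a stub async generator in place of a real streaming LLM client (the token list and timings are placeholders):

```python
import asyncio
import time

async def fake_llm_stream(prompt: str):
    # Stand-in for a streaming LLM call (OpenAI, Anthropic, etc.);
    # yields tokens as they are generated.
    for token in ["Retrieval-", "augmented ", "generation ", "answer."]:
        await asyncio.sleep(0.05)  # simulated inter-token latency
        yield token

async def answer(prompt: str) -> tuple[float, str]:
    """Consume the stream; record time-to-first-token along the way."""
    start = time.perf_counter()
    first_token_at = None
    chunks = []
    async for token in fake_llm_stream(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        chunks.append(token)  # in a real app: flush this chunk to the client
    return first_token_at, "".join(chunks)

ttft, text = asyncio.run(answer("why is the sky blue?"))
print(f"first token after {ttft * 1000:.0f}ms: {text}")
```

The key move is flushing each chunk to the client as it arrives instead of buffering the full response.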
Embedding the query doesn't require the retrieval infrastructure. Start both in parallel:
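One way to sketch this with `asyncio.create_task`: the embedding call starts immediately, and the BM25 search (which doesn't need the embedding) runs while it's in flight. All stub functions and timings here are placeholders:

```python
import asyncio

async def embed(query: str) -> list[float]:
    await asyncio.sleep(0.10)  # stand-in for an embedding API call (~100ms)
    return [0.1, 0.2, 0.3]

async def bm25_search(query: str) -> list[str]:
    await asyncio.sleep(0.04)  # stand-in for a keyword-index lookup (~40ms)
    return ["doc-7", "doc-2"]

async def vector_search(vector: list[float]) -> list[str]:
    await asyncio.sleep(0.03)  # stand-in for a vector-DB query (~30ms)
    return ["doc-1", "doc-7"]

async def retrieve(query: str):
    # Start the embedding call first; BM25 runs while it's in flight,
    # so the embedding time overlaps the sparse search instead of adding to it.
    embed_task = asyncio.create_task(embed(query))
    sparse_hits = await bm25_search(query)
    dense_hits = await vector_search(await embed_task)
    return sparse_hits, dense_hits

sparse, dense = asyncio.run(retrieve("what is hybrid retrieval?"))
```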
Saves 50-150ms if done right.
If using hybrid (dense + sparse) or multi-query, run all retrievals in parallel. Total latency is max of all, not sum.
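The max-not-sum effect is easy to demonstrate with `asyncio.gather` (the two searches here are stubs with made-up timings):

```python
import asyncio
import time

async def dense_search(query: str) -> list[str]:
    await asyncio.sleep(0.08)  # stand-in: vector search (~80ms)
    return ["d1", "shared"]

async def sparse_search(query: str) -> list[str]:
    await asyncio.sleep(0.04)  # stand-in: BM25 (~40ms)
    return ["s1", "shared"]

async def hybrid(query: str):
    # Both searches run concurrently; the await completes when the
    # slowest one finishes.
    return await asyncio.gather(dense_search(query), sparse_search(query))

start = time.perf_counter()
dense, sparse = asyncio.run(hybrid("q"))
elapsed_ms = (time.perf_counter() - start) * 1000
# elapsed_ms lands near the 80ms max, well under the 120ms sum
```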
Build the pipeline with async/await from day one. Retrofitting async is painful. Every I/O call (embedding API, vector DB, LLM) should be awaitable.
Track first-token and total latency as separate metrics. First-token latency is the UX-critical number; total latency matters for backend cost and throughput.
Hash the query, cache the embedding for 24 hours. Hit rate can be 20-40% on common query patterns. See caching.
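A minimal in-process sketch of the hash-and-TTL pattern; the stub `call_embedding_api` stands in for a real embedding client and counts calls so the cache's effect is visible:

```python
import hashlib
import time

api_calls = 0

def call_embedding_api(query: str) -> list[float]:
    # Stand-in for a real embedding API call.
    global api_calls
    api_calls += 1
    return [0.1] * 4

EMBED_TTL_S = 24 * 3600  # 24-hour TTL, per the text
_embed_cache: dict[str, tuple[float, list[float]]] = {}

def embed_query(query: str) -> list[float]:
    # Normalize before hashing so trivial variants share a cache entry.
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    cached = _embed_cache.get(key)
    if cached and time.time() - cached[0] < EMBED_TTL_S:
        return cached[1]  # hit: skip the API round-trip
    vector = call_embedding_api(query)
    _embed_cache[key] = (time.time(), vector)
    return vector

embed_query("what is rag?")
embed_query("What is RAG?")  # normalizes to the same key: cache hit
```

In production the dict would be Redis or similar, but the keying and TTL logic is the same.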
Cache full top-k results per query. Invalidate on index updates. Hit rate depends on query repetition.
For exact-match queries, cache the full generated response. Rarely applicable to RAG (queries vary too much) but useful for FAQ-style systems.
If using Claude or Gemini, use prompt caching on the shared prefix (system prompt plus retrieved context) so it isn't reprocessed on every turn. Saves 30-50% of generation time for multi-turn conversations.
Simple queries to fast models, complex queries to slow models:
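A sketch of a heuristic router; the marker words, length threshold, and model names are all illustrative, not tuned values:

```python
def route(query: str) -> str:
    # Cheap lexical heuristics only, so routing itself adds ~0ms.
    words = query.lower().split()
    complex_markers = {"compare", "why", "explain", "difference", "versus"}
    if len(words) > 20 or complex_markers.intersection(words):
        return "slow-accurate-model"  # hypothetical model name
    return "fast-cheap-model"         # hypothetical model name
```

A small classifier model can replace the heuristics once you have labeled traffic, at the cost of a few extra milliseconds per query.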
70% of queries can often use the fast model with no quality loss. Average latency drops significantly.
p50 latency might be 1.5s but p99 is 15s. Tail latency kills UX:
Set timeouts at every layer. Fall back gracefully (smaller model, cached response, error message). The goal: p99 within 2-3x p50, not 10x.
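The timeout-plus-fallback shape, sketched with `asyncio.wait_for` (the stalled call and the fallback are stubs; a real fallback might be a smaller model or a cached response):

```python
import asyncio

async def slow_generate(prompt: str) -> str:
    await asyncio.sleep(5)  # stand-in for a stalled LLM call
    return "full answer"

async def cached_fallback(prompt: str) -> str:
    return "cached/partial answer"  # e.g. smaller model or cached response

async def generate_with_timeout(prompt: str, timeout: float = 0.1) -> str:
    try:
        # wait_for cancels the slow task when the deadline passes
        return await asyncio.wait_for(slow_generate(prompt), timeout=timeout)
    except asyncio.TimeoutError:
        return await cached_fallback(prompt)

result = asyncio.run(generate_with_timeout("q"))
```

Stacking a timeout like this at each I/O layer (embedding, retrieval, rerank, generation) is what keeps p99 bounded.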
Instrument every stage. You can't optimize what you can't see.
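A minimal per-stage timer using a context manager; the stage names and sleeps are placeholders for real pipeline calls:

```python
import time
from contextlib import contextmanager

stage_timings: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    # Records wall-clock milliseconds per pipeline stage.
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_timings[stage] = (time.perf_counter() - start) * 1000

with timed("embed"):
    time.sleep(0.01)  # stand-in for the embedding call
with timed("retrieve"):
    time.sleep(0.02)  # stand-in for vector + BM25 search

# stage_timings now holds per-stage latency, ready to ship to a
# metrics backend (Prometheus, OpenTelemetry, etc.)
```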
See observability.
For a chat-style RAG system, target roughly 500ms to first token and 3-5 seconds end-to-end. Much faster and users won't notice; much slower and they'll leave.
Next: Caching strategies.