Latency optimization
📖 5 min read · Updated 2026-04-18
A notebook RAG can take 10 seconds per query. A production RAG has a budget: 500ms, 2 seconds, maybe 5. Here's where latency goes in a typical system and the levers I pull to cut it.
The typical latency breakdown
For a standard RAG pipeline (hybrid retrieval + reranking + generation):
Query embedding: 30-150ms
Vector search: 20-100ms
BM25 search: 10-50ms
Fusion: <5ms
Rerank (cross-encoder): 100-400ms
LLM generation (first token): 300-1500ms
LLM generation (total): 1000-5000ms
--------
Total to first token: 500-2000ms
Total end-to-end: 1500-5500ms
Generation dominates. Everything else combined usually takes less time than the LLM call alone.
Optimization priorities
1. Generation (biggest lever)
- Use a faster model (Haiku over Sonnet, 4o-mini over 4o) when quality permits
- Stream the response so the user sees output within 500ms of the first token
- Reduce context size (smaller reranked top-k, tighter prompts)
- Use provider's low-latency modes where available
- Prompt caching for common system prompts (Claude, Gemini support this)
2. Embedding
- Cache query embeddings (same query → same embedding; set the TTL based on how often the embedding model changes)
- Self-host for lower latency than API calls (saves 50-150ms of network round trip)
- Smaller models (text-embedding-3-small over large)
- Batch when possible (doesn't help per-query latency, helps throughput)
3. Retrieval
- Co-locate the vector DB with the application (cuts network latency)
- Tune HNSW ef_search lower for faster queries (at some cost in recall)
- Pre-warm hot content in memory
- Cache top-k results for popular queries
4. Reranking
- Fewer candidates to rerank (top-20 vs top-100)
- Smaller reranker model (MiniLM-L-6 vs electra-base)
- Skip reranking for high-confidence retrievals
- Run reranker on GPU for batch efficiency
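The "skip reranking for high-confidence retrievals" idea can be sketched as a simple gate: if the top fused score is high and clearly separated from the runner-up, trust the retrieval order and skip the cross-encoder. The threshold and margin values here are illustrative, not recommendations; tune them against your own eval set.

```python
def maybe_rerank(candidates, rerank_fn, score_threshold=0.9, margin=0.2):
    """Skip the cross-encoder when retrieval is already confident.

    candidates: list of (doc_id, retrieval_score), sorted descending.
    rerank_fn: callable that reorders candidates (the expensive step).
    """
    if len(candidates) >= 2:
        top, runner_up = candidates[0][1], candidates[1][1]
        # High absolute score plus a clear gap over the runner-up:
        # trust the retrieval ordering and save 100-400ms.
        if top >= score_threshold and (top - runner_up) >= margin:
            return candidates
    return rerank_fn(candidates)
```

The gate only helps if your retrieval scores are calibrated enough that "high and well-separated" actually correlates with correct ordering; verify that on held-out queries before shipping it.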
Streaming is essential
A 3-second total response time feels fast if the first token arrives in 500ms. Same total feels slow if nothing appears until 3 seconds.
Stream the LLM output. Perceived latency drops dramatically. Users tolerate 3-5 second total response if they see progress from 500ms in.
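The streaming pattern looks like this in an async pipeline. `generate_tokens` below is a stand-in for a provider's streaming API, not any real SDK call; the point is that the consumer handles each token as it arrives rather than waiting for the full completion.

```python
import asyncio

async def generate_tokens(answer: str, delay: float = 0.01):
    # Stand-in for a streaming LLM API: yields tokens as they are produced.
    for token in answer.split():
        await asyncio.sleep(delay)
        yield token + " "

async def stream_response(prompt: str) -> str:
    # Handle each token immediately instead of waiting for the full answer.
    parts = []
    async for token in generate_tokens("Retrieved answer for: " + prompt):
        parts.append(token)  # in a real app: flush this chunk to the client
    return "".join(parts)
```

The first token reaches the user after one `delay`, not after the whole generation; that gap is the entire perceived-latency win.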
Parallelize retrieval and embedding
Embedding the query doesn't require the retrieval infrastructure. Start both in parallel:
- Kick off query embedding
- While embedding, do any non-vector-dependent work (logging, auth, query parsing)
- When embedding is done, start vector search
Saves 50-150ms if done right.
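The steps above can be sketched with `asyncio.create_task`: start the embedding call first, do the non-vector-dependent work while it's in flight, then await the result. The sleep durations are placeholders for real network and parsing latency.

```python
import asyncio

async def embed_query(query: str) -> list[float]:
    # Stand-in for an embedding API call (~50-150ms over the network).
    await asyncio.sleep(0.05)
    return [float(len(query))]

async def parse_and_log(query: str) -> str:
    # Non-vector-dependent work: auth, logging, query normalization.
    await asyncio.sleep(0.02)
    return query.strip().lower()

async def prepare(query: str):
    # Kick off the embedding first; it runs while we do the other work.
    embedding_task = asyncio.create_task(embed_query(query))
    parsed = await parse_and_log(query)
    embedding = await embedding_task
    return parsed, embedding
```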
Parallelize retrieval sources
If using hybrid (dense + sparse) or multi-query, run all retrievals in parallel. Total latency is max of all, not sum.
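A minimal sketch of that fan-out with `asyncio.gather` (the two search functions are stand-ins for real dense and BM25 backends):

```python
import asyncio

async def dense_search(query: str) -> list[str]:
    await asyncio.sleep(0.05)  # stand-in for vector search latency
    return ["dense-1", "dense-2"]

async def sparse_search(query: str) -> list[str]:
    await asyncio.sleep(0.03)  # stand-in for BM25 latency
    return ["sparse-1"]

async def hybrid_retrieve(query: str) -> list[str]:
    # Both searches run concurrently: wall time ≈ max(50ms, 30ms), not 80ms.
    dense, sparse = await asyncio.gather(
        dense_search(query), sparse_search(query)
    )
    return dense + sparse
```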
Async by default
Build the pipeline with async/await from day one. Retrofitting async is painful. Every I/O call (embedding API, vector DB, LLM) should be awaitable.
The "first meaningful response" metric
Track this specifically:
- Time to first token (from user request to first LLM token arriving)
- Time to complete response
- Time to critical information (when the actual answer appears, not just preamble)
First-token latency is the UX-critical number. Total latency matters for backend cost and throughput.
Caching layers
Query embedding cache
Hash the query, cache the embedding for 24 hours. Hit rate can be 20-40% on common query patterns. See caching.
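A minimal in-process version of that cache, assuming a normalize-then-hash key and a monotonic-clock TTL; a production deployment would typically back this with Redis or similar so the cache survives restarts and is shared across replicas.

```python
import hashlib
import time

class EmbeddingCache:
    """TTL cache keyed on a hash of the normalized query (illustrative sketch)."""

    def __init__(self, ttl_seconds: float = 24 * 3600):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, list[float]]] = {}

    def _key(self, query: str) -> str:
        # Normalize before hashing so trivial variants share an entry.
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def get(self, query: str):
        entry = self._store.get(self._key(query))
        if entry is None:
            return None
        stored_at, embedding = entry
        if time.monotonic() - stored_at > self.ttl:
            return None  # expired: caller re-embeds and calls put()
        return embedding

    def put(self, query: str, embedding: list[float]) -> None:
        self._store[self._key(query)] = (time.monotonic(), embedding)
```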
Retrieval cache
Cache full top-k results per query. Invalidate on index updates. Hit rate depends on query repetition.
Full response cache
For exact-match queries, cache the full generated response. Rarely applicable to RAG (queries vary too much) but useful for FAQ-style systems.
Prompt caching
If using Claude or Gemini, cache the system prompt (which contains retrieved context) so the provider reuses the already-processed prefix instead of recomputing it on every turn. Saves 30-50% of generation time for multi-turn conversations.
Model routing
Simple queries to fast models, complex queries to slow models:
- Classify query complexity
- Easy → gpt-4o-mini / haiku / flash
- Hard → gpt-4o / sonnet / gemini pro
70% of queries can often use the fast model with no quality loss. Average latency drops significantly.
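A toy router, assuming a keyword-and-length heuristic for complexity; real systems often use a small classifier model instead, and the cue words and length cutoff here are made up for illustration.

```python
FAST_MODEL = "gpt-4o-mini"
SLOW_MODEL = "gpt-4o"

# Hypothetical cue words suggesting the query needs multi-step reasoning.
REASONING_CUES = {"compare", "why", "explain", "tradeoffs", "versus"}

def route(query: str) -> str:
    words = query.lower().split()
    needs_reasoning = any(w in REASONING_CUES for w in words)
    # Short, lookup-style queries go to the fast model.
    if len(words) <= 12 and not needs_reasoning:
        return FAST_MODEL
    return SLOW_MODEL
```

Whatever heuristic you use, measure its quality impact on the routed traffic: the latency win is only free if the fast model's answers hold up on the queries it receives.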
Content-level optimization
- Shorter chunks → faster embedding, less context per query
- Pre-summarized content in metadata → generator can use summaries for first-pass reasoning
- Fewer retrieved chunks passed to generator (top-5 vs top-20)
The tail latency problem
p50 latency might be 1.5s but p99 is 15s. Tail latency kills UX:
- Slow embedding API calls (timeout and fall back to self-hosted)
- Slow LLM calls (provider-side slow responses)
- Vector DB slow queries (ef_search too high, huge candidate sets)
- Cold indexes being loaded
Set timeouts at every layer. Fall back gracefully (smaller model, cached response, error message). The goal: p99 within 2-3x p50, not 10x.
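The timeout-plus-fallback pattern can be sketched with `asyncio.wait_for`; the two call functions below simulate a stalled primary provider and a fast fallback model.

```python
import asyncio

async def call_primary() -> str:
    await asyncio.sleep(5)  # simulate a provider-side stall
    return "primary answer"

async def call_fallback() -> str:
    await asyncio.sleep(0.01)  # smaller, faster model
    return "fallback answer"

async def answer_with_timeout(timeout: float = 0.1) -> str:
    # Cap the stage; on timeout, degrade to the cheaper path instead of
    # letting one slow call drag the whole request into the p99 tail.
    try:
        return await asyncio.wait_for(call_primary(), timeout=timeout)
    except asyncio.TimeoutError:
        return await call_fallback()
```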
Measurement
Instrument every stage. You can't optimize what you can't see.
- Histogram of latency per pipeline stage
- p50, p95, p99 per stage
- Per-query traces for debugging
- Alerts on p99 regression
See observability.
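If you collect raw per-stage timings, the percentile summary is a few lines. This uses a simple nearest-rank percentile, which is good enough for dashboards; real deployments usually lean on their metrics backend (Prometheus histograms, etc.) instead of computing this by hand.

```python
def percentile(samples: list[float], pct: float) -> float:
    # Nearest-rank percentile over raw samples.
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[rank]

def stage_summary(timings: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    # timings: stage name -> list of observed latencies in ms.
    return {
        stage: {name: percentile(samples, q)
                for name, q in (("p50", 50), ("p95", 95), ("p99", 99))}
        for stage, samples in timings.items()
    }
```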
The rule of thumb
For a chat-style RAG system, target:
- First token: under 1 second
- Total response: 2-4 seconds
- p99 total: under 8 seconds
Much faster than that and users won't notice; much slower and they'll leave.
What to do with this
- Turn on streaming first. Biggest perceived-latency win for zero real cost.
- Instrument per-stage latency so optimization isn't guessing.
- Cap tail latency with per-stage timeouts and graceful fallbacks.
Next: Caching strategies.