Latency optimization

A notebook RAG can take 10 seconds per query. A production RAG has a budget: 500ms, 2 seconds, maybe 5. Here's where latency goes in a typical system and the levers I pull to cut it.

The typical latency breakdown

For a standard RAG pipeline (hybrid retrieval + reranking + generation):

Query embedding:          30-150ms
Vector search:            20-100ms
BM25 search:              10-50ms
Fusion:                   <5ms
Rerank (cross-encoder):   100-400ms
LLM generation (first token): 300-1500ms
LLM generation (total):   1000-5000ms
--------
Total to first token:     500-2000ms
Total end-to-end:         1500-5500ms

Generation dominates. Everything else combined is usually less than the LLM call.

Optimization priorities

1. Generation (biggest lever)

2. Embedding

3. Retrieval

4. Reranking

Streaming is essential

A 3-second total response time feels fast if the first token arrives in 500ms. The same total feels slow if nothing appears until the 3-second mark.

Stream the LLM output. Perceived latency drops dramatically. Users tolerate a 3-5 second total response if they see progress from 500ms in.
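A minimal sketch of the pattern, with a stand-in async generator in place of a real streaming LLM call (real SDKs yield deltas the same way), and time-to-first-token measured on the server side:

```python
import asyncio
import time
from typing import AsyncIterator

async def fake_llm_stream(prompt: str) -> AsyncIterator[str]:
    # Stand-in for a streaming LLM call; replace with your provider's SDK.
    for token in ["Retrieval", "-augmented ", "answer."]:
        await asyncio.sleep(0.01)  # simulated network/decode delay per token
        yield token

async def stream_answer(prompt: str) -> tuple[str, float]:
    """Forward tokens to the client as they arrive; report time to first token."""
    start = time.perf_counter()
    first_token_at = 0.0
    parts: list[str] = []
    async for token in fake_llm_stream(prompt):
        if not parts:
            first_token_at = time.perf_counter() - start
        parts.append(token)  # in a real server: write the token to the response
    return "".join(parts), first_token_at
```

The client sees output at `first_token_at`, not at the end of the full generation, which is the whole point.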

Parallelize retrieval and embedding

Embedding the query doesn't depend on the rest of the request setup. Overlap it with work that doesn't need the vector:

  1. Kick off query embedding
  2. While embedding, do any non-vector-dependent work (logging, auth, query parsing)
  3. When embedding is done, start vector search

Saves 50-150ms if done right.
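The steps above can be sketched with `asyncio.gather`; the function bodies here are stand-ins for real embedding, auth, and vector DB calls:

```python
import asyncio

async def embed_query(query: str) -> list[float]:
    await asyncio.sleep(0.05)  # stand-in for an embedding API call
    return [0.1, 0.2, 0.3]

async def parse_and_authorize(query: str) -> dict:
    await asyncio.sleep(0.03)  # stand-in for logging, auth, query parsing
    return {"query": query, "authorized": True}

async def vector_search(embedding: list[float]) -> list[str]:
    await asyncio.sleep(0.04)  # stand-in for the vector DB call
    return ["doc1", "doc2"]

async def handle(query: str) -> list[str]:
    # Steps 1-2: embedding and non-vector-dependent work run concurrently.
    embedding, ctx = await asyncio.gather(
        embed_query(query), parse_and_authorize(query)
    )
    # Step 3: vector search starts as soon as the embedding is ready.
    return await vector_search(embedding)
```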

Parallelize retrieval sources

If using hybrid (dense + sparse) or multi-query, run all retrievals in parallel. Total latency is max of all, not sum.
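A sketch of hybrid retrieval run concurrently, with sleeps standing in for the dense and sparse backends; the elapsed time lands near the slower of the two, not their sum. The dedup step is a placeholder where real fusion (e.g. RRF) would rescore:

```python
import asyncio
import time

async def dense_search(query: str) -> list[str]:
    await asyncio.sleep(0.2)   # stand-in for vector search
    return ["doc_a", "doc_b"]

async def sparse_search(query: str) -> list[str]:
    await asyncio.sleep(0.15)  # stand-in for BM25 search
    return ["doc_b", "doc_c"]

async def hybrid_retrieve(query: str) -> tuple[list[str], float]:
    start = time.perf_counter()
    # Both sources run concurrently: latency is max(0.2, 0.15), not 0.35.
    dense, sparse = await asyncio.gather(dense_search(query), sparse_search(query))
    elapsed = time.perf_counter() - start
    # Keep order, drop duplicates; real fusion (e.g. RRF) would rescore here.
    merged = list(dict.fromkeys(dense + sparse))
    return merged, elapsed
```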

Async by default

Build the pipeline with async/await from day one. Retrofitting async is painful. Every I/O call (embedding API, vector DB, LLM) should be awaitable.

The "first meaningful response" metric

Track first-token latency specifically: it is the UX-critical number. Total latency matters for backend cost and throughput.

Caching layers

Query embedding cache

Hash the query, cache the embedding for 24 hours. Hit rate can be 20-40% on common query patterns. See caching.
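A minimal in-process sketch of the idea (a production system would typically put this in Redis or similar); query normalization before hashing is an assumption that lifts the hit rate on trivially different phrasings:

```python
import hashlib
import time

class EmbeddingCache:
    """Query-hash -> embedding, with a 24h TTL."""

    def __init__(self, ttl_seconds: float = 24 * 3600):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[list[float], float]] = {}

    def _key(self, query: str) -> str:
        # Normalize before hashing so "Hello" and " hello " share an entry.
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def get(self, query: str):
        entry = self._store.get(self._key(query))
        if entry is None:
            return None
        embedding, stored_at = entry
        if time.time() - stored_at > self.ttl:
            del self._store[self._key(query)]
            return None
        return embedding

    def put(self, query: str, embedding: list[float]) -> None:
        self._store[self._key(query)] = (embedding, time.time())
```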

Retrieval cache

Cache full top-k results per query. Invalidate on index updates. Hit rate depends on query repetition.

Full response cache

For exact-match queries, cache the full generated response. Rarely applicable to RAG (queries vary too much) but useful for FAQ-style systems.

Prompt caching

If using a provider with prompt caching (e.g. Claude or Gemini), mark the stable prompt prefix (system instructions plus retrieved context) as cacheable so it isn't reprocessed on every turn. Can save 30-50% of generation latency for multi-turn conversations over the same context.
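As one concrete shape, a request body following Anthropic's documented prompt-caching convention (a `cache_control` marker on the last stable system block); the model name is a placeholder and the exact field names should be verified against current provider docs:

```python
def build_cached_request(context: str, question: str) -> dict:
    # Mark the large, stable prefix (instructions + retrieved context) as
    # cacheable so the provider can reuse its processed form across turns.
    return {
        "model": "claude-model-name",  # placeholder, not a real model ID
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": "Answer using only the provided context."},
            {
                "type": "text",
                "text": context,
                # Per Anthropic's prompt-caching docs; verify before relying on it.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        "messages": [{"role": "user", "content": question}],
    }
```

Only the per-turn user message varies; everything above the marker is a cache hit on subsequent turns.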

Model routing

Simple queries to fast models, complex queries to slow models:

  1. Classify query complexity
  2. Easy → gpt-4o-mini / haiku / flash
  3. Hard → gpt-4o / sonnet / gemini pro

70% of queries can often use the fast model with no quality loss. Average latency drops significantly.
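A toy routing sketch; the keyword heuristic and model names are illustrative placeholders, and production systems often use a small classifier model instead:

```python
def classify_complexity(query: str) -> str:
    """Toy heuristic: long or analytical queries go to the strong model."""
    hard_markers = ("compare", "why", "explain", "trade-off", "versus")
    if len(query.split()) > 20 or any(m in query.lower() for m in hard_markers):
        return "hard"
    return "easy"

# Placeholder model names; substitute your actual fast/strong models.
ROUTES = {"easy": "fast-model", "hard": "strong-model"}

def route(query: str) -> str:
    return ROUTES[classify_complexity(query)]
```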


The tail latency problem

p50 latency might be 1.5s while p99 is 15s. Tail latency kills UX.

Set timeouts at every layer and fall back gracefully (smaller model, cached response, clear error message). The goal: p99 within 2-3x of p50, not 10x.
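A sketch of the timeout-plus-fallback pattern with `asyncio.wait_for`; the sleeps stand in for a stalled primary model and a fast fallback:

```python
import asyncio

async def strong_model(query: str) -> str:
    await asyncio.sleep(5.0)   # simulating a tail-latency stall
    return "slow answer"

async def fallback_model(query: str) -> str:
    await asyncio.sleep(0.01)  # fast, cheaper model
    return "fast fallback answer"

async def answer_with_timeout(query: str, timeout: float = 0.1) -> str:
    try:
        return await asyncio.wait_for(strong_model(query), timeout=timeout)
    except asyncio.TimeoutError:
        # Degrade gracefully instead of letting p99 blow up.
        return await fallback_model(query)
```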

Measurement

Instrument every stage. You can't optimize what you can't see.
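A minimal sketch of per-stage instrumentation with a context manager; the sleeps stand in for real pipeline stages, and a production version would emit these timings to your metrics backend rather than a dict:

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def stage(name: str):
    """Record the wall-clock duration of one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

# Wrap each stage so every request produces a per-stage breakdown.
with stage("embed"):
    time.sleep(0.01)  # stand-in for the embedding call
with stage("retrieve"):
    time.sleep(0.01)  # stand-in for vector + BM25 search
```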

See observability.

The rule of thumb

For a chat-style RAG system, target first token within roughly 500ms-1s and full response within 3-5 seconds.

Much faster and users won't notice; much slower and they'll leave.

Next: Caching strategies.