Latency optimization
📖 5 min read · Updated 2026-04-18
A notebook RAG can take 10 seconds per query. A production RAG has a budget: 500ms, 2 seconds, maybe 5. Here's where latency goes in a typical system and the levers I pull to cut it.
The typical latency breakdown
For a standard RAG pipeline (hybrid retrieval + reranking + generation):
Query embedding: 30-150ms
Vector search: 20-100ms
BM25 search: 10-50ms
Fusion: <5ms
Rerank (cross-encoder): 100-400ms
LLM generation (first token): 300-1500ms
LLM generation (total): 1000-5000ms
--------
Total to first token: 500-2000ms
Total end-to-end: 1500-5500ms
Generation dominates. Everything else combined usually takes less time than the LLM call alone.
Optimization priorities
1. Generation (biggest lever)
- Use a faster model (Haiku over Sonnet, 4o-mini over 4o) when quality permits
- Stream the response so the user sees output within 500ms of the first token
- Reduce context size (smaller reranked top-k, tighter prompts)
- Use provider's low-latency modes where available
- Prompt caching for common system prompts (Claude, Gemini support this)
2. Embedding
- Cache query embeddings (same query → same embedding; set the TTL based on how often the embedding model changes)
- Self-host for lower latency than API calls (saves 50-150ms of network round trip)
- Smaller models (text-embedding-3-small over large)
- Batch when possible (doesn't help per-query latency, helps throughput)
3. Retrieval
- Co-locate the vector DB with the application (cuts network latency)
- Tune HNSW ef_search lower for faster queries (at some cost in recall)
- Pre-warm hot content in memory
- Cache top-k results for popular queries
4. Reranking
- Fewer candidates to rerank (top-20 vs top-100)
- Smaller reranker model (MiniLM-L-6 vs electra-base)
- Skip reranking for high-confidence retrievals
- Run reranker on GPU for batch efficiency
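The "skip reranking for high-confidence retrievals" idea can be sketched as a simple gate: if the top fused score is high and clearly separated from the runner-up, trust the retrieval order and skip the cross-encoder. The threshold and margin values here are illustrative, not recommendations; tune them against your own eval set.

```python
def maybe_rerank(candidates, rerank_fn, score_threshold=0.9, margin=0.2):
    """Skip the cross-encoder when retrieval is already confident.

    candidates: list of (doc_id, retrieval_score), sorted descending.
    rerank_fn: callable that reorders candidates (the expensive step).
    """
    if len(candidates) >= 2:
        top, runner_up = candidates[0][1], candidates[1][1]
        # High absolute score plus a clear gap over the runner-up:
        # trust the retrieval ordering and save 100-400ms.
        if top >= score_threshold and (top - runner_up) >= margin:
            return candidates
    return rerank_fn(candidates)
```

The gate only helps if your retrieval scores are calibrated enough that "high and well-separated" actually correlates with correct ordering; verify that on held-out queries before shipping it.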
Streaming is essential
A 3-second total response time feels fast if the first token arrives in 500ms. Same total feels slow if nothing appears until 3 seconds.
Stream the LLM output. Perceived latency drops dramatically. Users tolerate 3-5 second total response if they see progress from 500ms in.
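The streaming pattern looks like this in an async pipeline. `generate_tokens` below is a stand-in for a provider's streaming API, not any real SDK call; the point is that the consumer handles each token as it arrives rather than waiting for the full completion.

```python
import asyncio

async def generate_tokens(answer: str, delay: float = 0.01):
    # Stand-in for a streaming LLM API: yields tokens as they are produced.
    for token in answer.split():
        await asyncio.sleep(delay)
        yield token + " "

async def stream_response(prompt: str) -> str:
    # Handle each token immediately instead of waiting for the full answer.
    parts = []
    async for token in generate_tokens("Retrieved answer for: " + prompt):
        parts.append(token)  # in a real app: flush this chunk to the client
    return "".join(parts)
```

The first token reaches the user after one `delay`, not after the whole generation; that gap is the entire perceived-latency win.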
Parallelize retrieval and embedding
Embedding the query doesn't require the retrieval infrastructure. Start both in parallel:
- Kick off query embedding
- While embedding, do any non-vector-dependent work (logging, auth, query parsing)
- When embedding is done, start vector search
Saves 50-150ms if done right.
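The steps above can be sketched with `asyncio.create_task`: start the embedding call first, do the non-vector-dependent work while it's in flight, then await the result. The sleep durations are placeholders for real network and parsing latency.

```python
import asyncio

async def embed_query(query: str) -> list[float]:
    # Stand-in for an embedding API call (~50-150ms over the network).
    await asyncio.sleep(0.05)
    return [float(len(query))]

async def parse_and_log(query: str) -> str:
    # Non-vector-dependent work: auth, logging, query normalization.
    await asyncio.sleep(0.02)
    return query.strip().lower()

async def prepare(query: str):
    # Kick off the embedding first; it runs while we do the other work.
    embedding_task = asyncio.create_task(embed_query(query))
    parsed = await parse_and_log(query)
    embedding = await embedding_task
    return parsed, embedding
```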
Parallelize retrieval sources
If using hybrid (dense + sparse) or multi-query, run all retrievals in parallel. Total latency is max of all, not sum.
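A minimal sketch of that fan-out with `asyncio.gather` (the two search functions are stand-ins for real dense and BM25 backends):

```python
import asyncio

async def dense_search(query: str) -> list[str]:
    await asyncio.sleep(0.05)  # stand-in for vector search latency
    return ["dense-1", "dense-2"]

async def sparse_search(query: str) -> list[str]:
    await asyncio.sleep(0.03)  # stand-in for BM25 latency
    return ["sparse-1"]

async def hybrid_retrieve(query: str) -> list[str]:
    # Both searches run concurrently: wall time ≈ max(50ms, 30ms), not 80ms.
    dense, sparse = await asyncio.gather(
        dense_search(query), sparse_search(query)
    )
    return dense + sparse
```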
Async by default
Build the pipeline with async/await from day one. Retrofitting async is painful. Every I/O call (embedding API, vector DB, LLM) should be awaitable.
The "first meaningful response" metric
Track this specifically:
- Time to first token (from user request to first LLM token arriving)
- Time to complete response
- Time to critical information (when the actual answer appears, not just preamble)
First-token latency is the UX-critical number. Total latency matters for backend cost and throughput.
Caching layers
Query embedding cache
Hash the query, cache the embedding for 24 hours. Hit rate can be 20-40% on common query patterns. See caching.
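A minimal in-process version of that cache, assuming a normalize-then-hash key and a monotonic-clock TTL; a production deployment would typically back this with Redis or similar so the cache survives restarts and is shared across replicas.

```python
import hashlib
import time

class EmbeddingCache:
    """TTL cache keyed on a hash of the normalized query (illustrative sketch)."""

    def __init__(self, ttl_seconds: float = 24 * 3600):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, list[float]]] = {}

    def _key(self, query: str) -> str:
        # Normalize before hashing so trivial variants share an entry.
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def get(self, query: str):
        entry = self._store.get(self._key(query))
        if entry is None:
            return None
        stored_at, embedding = entry
        if time.monotonic() - stored_at > self.ttl:
            return None  # expired: caller re-embeds and calls put()
        return embedding

    def put(self, query: str, embedding: list[float]) -> None:
        self._store[self._key(query)] = (time.monotonic(), embedding)
```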
Retrieval cache
Cache full top-k results per query. Invalidate on index updates. Hit rate depends on query repetition.
Full response cache
For exact-match queries, cache the full generated response. Rarely applicable to RAG (queries vary too much) but useful for FAQ-style systems.
Prompt caching
If using Claude or Gemini, cache the system prompt (which contains retrieved context) so the provider reuses the already-processed prefix instead of recomputing it on every turn. Saves 30-50% of generation time for multi-turn conversations.
Model routing
Simple queries to fast models, complex queries to slow models:
- Classify query complexity
- Easy → gpt-4o-mini / haiku / flash
- Hard → gpt-4o / sonnet / gemini pro
70% of queries can often use the fast model with no quality loss. Average latency drops significantly.
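A toy router, assuming a keyword-and-length heuristic for complexity; real systems often use a small classifier model instead, and the cue words and length cutoff here are made up for illustration.

```python
FAST_MODEL = "gpt-4o-mini"
SLOW_MODEL = "gpt-4o"

# Hypothetical cue words suggesting the query needs multi-step reasoning.
REASONING_CUES = {"compare", "why", "explain", "tradeoffs", "versus"}

def route(query: str) -> str:
    words = query.lower().split()
    needs_reasoning = any(w in REASONING_CUES for w in words)
    # Short, lookup-style queries go to the fast model.
    if len(words) <= 12 and not needs_reasoning:
        return FAST_MODEL
    return SLOW_MODEL
```

Whatever heuristic you use, measure its quality impact on the routed traffic: the latency win is only free if the fast model's answers hold up on the queries it receives.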
Content-level optimization
- Shorter chunks → faster embedding, less context per query
- Pre-summarized content in metadata → generator can use summaries for first-pass reasoning
- Fewer retrieved chunks passed to generator (top-5 vs top-20)
The tail latency problem
p50 latency might be 1.5s but p99 is 15s. Tail latency kills UX:
- Slow embedding API calls (timeout and fall back to self-hosted)
- Slow LLM calls (provider-side slow responses)
- Vector DB slow queries (ef_search too high, huge candidate sets)
- Cold indexes being loaded
Set timeouts at every layer. Fall back gracefully (smaller model, cached response, error message). The goal: p99 within 2-3x p50, not 10x.
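The timeout-plus-fallback pattern can be sketched with `asyncio.wait_for`; the two call functions below simulate a stalled primary provider and a fast fallback model.

```python
import asyncio

async def call_primary() -> str:
    await asyncio.sleep(5)  # simulate a provider-side stall
    return "primary answer"

async def call_fallback() -> str:
    await asyncio.sleep(0.01)  # smaller, faster model
    return "fallback answer"

async def answer_with_timeout(timeout: float = 0.1) -> str:
    # Cap the stage; on timeout, degrade to the cheaper path instead of
    # letting one slow call drag the whole request into the p99 tail.
    try:
        return await asyncio.wait_for(call_primary(), timeout=timeout)
    except asyncio.TimeoutError:
        return await call_fallback()
```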
Measurement
Instrument every stage. You can't optimize what you can't see.
- Histogram of latency per pipeline stage
- p50, p95, p99 per stage
- Per-query traces for debugging
- Alerts on p99 regression
See observability.
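If you collect raw per-stage timings, the percentile summary is a few lines. This uses a simple nearest-rank percentile, which is good enough for dashboards; real deployments usually lean on their metrics backend (Prometheus histograms, etc.) instead of computing this by hand.

```python
def percentile(samples: list[float], pct: float) -> float:
    # Nearest-rank percentile over raw samples.
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[rank]

def stage_summary(timings: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    # timings: stage name -> list of observed latencies in ms.
    return {
        stage: {name: percentile(samples, q)
                for name, q in (("p50", 50), ("p95", 95), ("p99", 99))}
        for stage, samples in timings.items()
    }
```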
The rule of thumb
For a chat-style RAG system, target:
- First token: under 1 second
- Total response: 2-4 seconds
- p99 total: under 8 seconds
Much faster than that and users won't notice; much slower and they'll leave.
What to do with this
- Turn on streaming first. Biggest perceived-latency win for zero real cost.
- Instrument per-stage latency so optimization isn't guessing.
- Cap tail latency with per-stage timeouts and graceful fallbacks.
Next: Caching strategies.