Multi-query + fusion
4 min read · Updated 2026-04-18
A single query embedding represents one angle on what the user wants. Multi-query retrieval generates several variations, retrieves for each, and fuses the results. It's a robust way to improve recall on tricky queries at the cost of more compute.
The pattern
- Take the user's original query
- Use an LLM to generate N variations (paraphrases, sub-questions, step-back questions, HyDE passages)
- Run retrieval for each variation independently
- Fuse the result lists (RRF or similar)
- Take the top-K merged results
- Optionally rerank
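The steps above can be sketched as one function. `generate_variations` and `retrieve` are placeholders for your own LLM prompt and search backend (both assumptions, not a specific library's API):

```python
from collections import defaultdict

def multi_query_retrieve(query, generate_variations, retrieve,
                         n=4, rrf_k=60, top_k=10):
    """Sketch of the pattern: expand the query, retrieve per variation,
    RRF-fuse, take top-K. Reranking (the optional last step) is omitted."""
    queries = [query] + generate_variations(query, n)
    scores = defaultdict(float)
    for q in queries:                                   # retrieve per variation
        for rank, doc_id in enumerate(retrieve(q), start=1):
            scores[doc_id] += 1.0 / (rrf_k + rank)      # RRF fusion
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_k]
```

Documents surfaced by several variations accumulate score across the lists, which is what pushes them to the top of the merged ranking.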
Kinds of variations
Paraphrases
Same meaning, different words.
- "How do I reset my password?"
- "What's the process for changing login credentials?"
- "Steps to recover account access when locked out"
Sub-questions
Decompose into parts.
- "Why is our API slow?" → "What's the current API latency?" + "What are common causes of API latency?" + "How do we measure API performance?"
Step-back
More general framing.
- "Why did OAuth token expire?" → "How does OAuth token expiration work?"
Step-forward
More specific framing.
- "How do I integrate your API?" → "How do I integrate your API in Node.js?" + "How do I integrate your API in Python?"
HyDE-style answers
Hypothetical answer passages. See HyDE.
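All of these variation types can come from one LLM call. A minimal prompt sketch (the template wording is an assumption; adapt it to your model and the variation mix you want):

```python
# Hypothetical prompt template for generating query variations.
# Swap in your own LLM client to actually run it.
VARIATION_PROMPT = """\
Generate {n} alternative search queries for the user query below.
Mix paraphrases, sub-questions, and one more general (step-back) framing.
Return one query per line, with no numbering.

User query: {query}
"""

def build_variation_prompt(query: str, n: int = 4) -> str:
    return VARIATION_PROMPT.format(n=n, query=query)
```

Asking for one query per line keeps parsing trivial: split the completion on newlines and drop empties.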
Fusion
Same as in hybrid retrieval fusion; RRF is the default.
For each document d:
rrf_score(d) = sum over all queries q: 1 / (k + rank_q(d))
Sort documents by rrf_score. Take top-K.
Documents that appear in multiple query variations' results get boosted. Documents that only appear in one get retained at lower ranks.
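The formula translates directly to code. A minimal sketch, taking one ranked list of document IDs per query variation (k=60 is the conventional RRF constant):

```python
from collections import defaultdict

def rrf_fuse(result_lists, k=60):
    """rrf_score(d) = sum over all queries q of 1 / (k + rank_q(d)),
    where rank_q(d) is d's 1-based rank in query q's result list."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A document absent from a list simply contributes no term for that query, so single-list documents survive with small scores rather than being dropped.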
The RAG-Fusion technique
A specific multi-query pattern popularized around 2023:
- Generate 4-5 paraphrases of the original query
- Retrieve top-k for each paraphrase
- RRF fusion
It gives a robust improvement over single-query retrieval on queries with vocabulary mismatch.
Parallel vs sequential retrieval
All query variations can run in parallel. With async retrieval, total latency is (LLM variation generation) + (longest single retrieval), not the sum.
With 4 variations at ~100ms per retrieval, parallel execution adds roughly 100ms of total latency rather than 400ms.
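The fan-out is a natural fit for `asyncio.gather`. A sketch, with `search_backend` standing in for a real async search client (here it just simulates ~100ms of latency):

```python
import asyncio

async def search_backend(query: str) -> list[str]:
    # Placeholder for one real retrieval call
    await asyncio.sleep(0.1)
    return [f"doc-for:{query}"]

async def retrieve_all(queries: list[str]) -> list[list[str]]:
    # gather() runs every retrieval concurrently, so total latency is
    # roughly the slowest single call, not the sum of all calls
    return await asyncio.gather(*(search_backend(q) for q in queries))
```

With a synchronous client, a thread pool (`concurrent.futures.ThreadPoolExecutor`) gets the same effect for I/O-bound retrieval calls.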
Cost tradeoffs
- N variations = N retrieval calls. At high QPS this adds up.
- LLM call to generate variations: 100-400ms, modest cost with cheap models.
- Reranking cost increases too (more candidates to rerank).
In return: typically 5-15% recall improvement, higher for short or ambiguous queries.
When multi-query is overkill
- Queries that are already specific and verbose
- When hybrid retrieval already covers the recall gap
- Cost-sensitive applications where extra LLM calls aren't justified
- Latency-sensitive applications where the extra 100ms matters
The pragmatic recipe
For a production RAG system that wants best-in-class retrieval:
- Generate 3 paraphrases of the original query (using a fast model)
- For each of the 4 queries (original + 3 paraphrases), run hybrid retrieval for top-50
- RRF-fuse across all 4 result lists
- Rerank top-50 merged → top-10
- Pass to generator
This adds about 150-300ms of latency and roughly 4x the retrieval cost. For queries that benefit, quality is noticeably better; for queries that don't, you've paid the cost without improvement.
Routing: when to use multi-query
Not every query benefits. A lightweight classifier or prompt can decide:
- Short queries (< 5 words): use multi-query
- Long detailed queries: single-query is sufficient
- Vague queries with ambiguous intent: use multi-query
- Queries with specific terminology and clear intent: single-query
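The heuristics above can be sketched as a simple router. The length thresholds and marker words are illustrative assumptions; tune them against your own traffic:

```python
def should_multi_query(query: str) -> bool:
    """Heuristic router: expand short or vague queries, leave long
    specific ones alone. Thresholds are illustrative, not tuned."""
    words = query.split()
    if len(words) < 5:
        return True          # short queries: underspecified, expand
    vague_markers = {"how", "why", "what", "best", "issue", "problem"}
    if len(words) < 10 and vague_markers & {w.lower() for w in words}:
        return True          # shortish and vague-sounding: expand
    return False             # long, detailed queries: single-query
```

A small classifier trained on labeled queries, or a cheap LLM prompt, can replace the hand-written rules once you have evaluation data.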
Skip multi-query when it doesn't help. Use it when it does. Measurement tells you which is which.
Next: Agentic RAG.