Multi-hop questions require answers that combine information from multiple independent documents. "Who manages the team that ships the feature the CEO mentioned last quarter?" needs three retrievals, not one. Vanilla RAG handles single-hop. Multi-hop needs orchestration.
Single-hop: "What's our refund policy?" → retrieve the refund policy doc → answer.
Multi-hop: "What's the refund policy for the product Alice launched last quarter?"
Needs two steps: first find which product Alice launched last quarter, then find that product's refund policy. One-shot retrieval on the original query returns either Alice-related docs or refund-related docs, not the specific intersection.
Two approaches work. Query decomposition: break the query into sub-questions, retrieve for each, combine the results. Iterative retrieval: start with partial information, retrieve more based on what you found, repeat.
Decomposition prompt:

SYSTEM: Break the following question into simpler sub-questions that can each be answered with a single document lookup. List each sub-question on a new line.
USER: [multi-hop question]
A:
1. [sub-question 1]
2. [sub-question 2]
3. [sub-question 3]
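A minimal sketch of the decomposition step. It assumes a hypothetical `llm(system, user) -> str` callable standing in for whatever model client you use; the parsing just strips numbering from the reply.

```python
# Sketch of query decomposition, assuming a hypothetical llm(system, user) -> str callable.
DECOMPOSE_SYSTEM = (
    "Break the following question into simpler sub-questions that can each "
    "be answered with a single document lookup. "
    "List each sub-question on a new line."
)

def decompose(question, llm):
    """Ask the LLM for sub-questions and parse them out of its reply."""
    reply = llm(DECOMPOSE_SYSTEM, question)
    subs = []
    for line in reply.splitlines():
        # Drop blank lines and leading numbering like "1." or "2)".
        text = line.strip().lstrip("0123456789.) ").strip()
        if text:
            subs.append(text)
    return subs
```

Each sub-question then goes through your normal single-hop retrieval path, and the combined hits feed one synthesis call.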
Iterative prompt, after round 1 results are in:

SYSTEM: Based on the retrieved documents, do you have enough information to answer the user's question? If yes, answer. If no, what additional information do you need? Output either a final answer or a follow-up search query.
USER: [original question]
Retrieved: [docs from round 1]
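The loop around that prompt can be sketched as follows, assuming hypothetical `retrieve(query) -> list[str]` and `llm(system, user) -> str` callables. The check prompt here is constrained so the reply starts with either `ANSWER:` or `SEARCH:`, which keeps the loop's control flow trivial to parse.

```python
# Iterative retrieval loop. retrieve() and llm() are hypothetical callables.
CHECK_SYSTEM = (
    "Based on the retrieved documents, do you have enough information to "
    "answer the user's question? If yes, reply 'ANSWER: <final answer>'. "
    "If no, reply 'SEARCH: <follow-up search query>'."
)

def iterative_answer(question, retrieve, llm, max_rounds=3):
    docs, query = [], question
    for _ in range(max_rounds):
        docs.extend(retrieve(query))
        reply = llm(CHECK_SYSTEM, f"{question}\nRetrieved: {docs}")
        if reply.startswith("ANSWER:"):
            return reply[len("ANSWER:"):].strip()
        query = reply[len("SEARCH:"):].strip()  # next round uses the follow-up query
    return None  # did not converge within the iteration cap
```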
A middle-ground pattern: ask the LLM to plan retrievals before executing them.
Plans are easier to debug than free-form iterative loops and cheaper than full agent-based orchestration.
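A sketch of the plan-then-execute pattern, again assuming hypothetical `llm(system, user) -> str` and `retrieve(query) -> list[str]` callables. The key property is that the plan exists as data before any retrieval spend happens, so it can be logged and inspected.

```python
# Plan-then-execute: the LLM writes the whole retrieval plan up front, then we run it.
PLAN_SYSTEM = (
    "Write a numbered retrieval plan for answering the question: "
    "one search query per line, in the order they should run."
)

def plan_then_execute(question, llm, retrieve):
    plan = []
    for line in llm(PLAN_SYSTEM, question).splitlines():
        step = line.strip().lstrip("0123456789.) ").strip()
        if step:
            plan.append(step)
    # The plan itself is inspectable before spending on retrieval.
    return plan, {step: retrieve(step) for step in plan}
```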
Failure mode: bad decomposition. The LLM produces sub-questions that don't actually decompose the problem, so the retrievals don't answer what's needed.
Mitigation: include few-shot examples of good decompositions in the prompt, and validate decompositions for obvious problems (e.g., more than five sub-questions is suspicious).
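The validation step can be a handful of cheap heuristics run before any retrieval spend; a sketch (the specific checks are illustrative, not exhaustive):

```python
def validate_decomposition(question, subs, max_subs=5):
    """Cheap sanity checks on a decomposition. Heuristics only, not guarantees."""
    if not subs:
        return False, "no sub-questions produced"
    if len(subs) > max_subs:
        return False, f"suspiciously many sub-questions ({len(subs)})"
    norm = lambda s: s.strip().rstrip("?").lower()
    if any(norm(s) == norm(question) for s in subs):
        return False, "a sub-question just restates the original"
    return True, "ok"
```

On failure, either re-prompt with the reason included or fall back to single-shot retrieval.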
Failure mode: context loss between rounds. Round 2 retrieval doesn't use what round 1 learned, so the follow-up query is too general.
Mitigation: explicitly include round 1 findings in the round 2 query construction. "Given that Alice launched Product X, retrieve the refund policy for Product X."
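Mechanically, that's just string construction over the facts resolved so far. A sketch, where `findings` is a hypothetical dict mapping each resolved entity to its value:

```python
def grounded_followup(findings, gap):
    """Fold round-1 findings into the round-2 query so retrieval is specific."""
    context = "; ".join(f"{k} is {v}" for k, v in findings.items())
    return f"Given that {context}, {gap}"
```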
Failure mode: non-termination. The agent keeps retrieving without converging, often because the information truly isn't in the corpus.
Mitigation: cap iterations (3-5), set a wall-clock timeout, and enforce a cost budget per query.
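Timeout and cost cap can share one stop-condition check that the loop calls each round. A sketch; the default limits are illustrative, not recommendations:

```python
import time

def within_budget(started_at, cost_so_far, max_seconds=30.0, max_cost_usd=0.10):
    """True while the query is still inside both its time and cost budget."""
    return (time.monotonic() - started_at) < max_seconds and cost_so_far < max_cost_usd
```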
Academic benchmarks exist specifically for multi-hop QA (e.g., HotpotQA, 2WikiMultiHopQA, MuSiQue). These are useful for comparing strategies but don't necessarily reflect your production query distribution. Build a domain-specific eval set if multi-hop is important.
Multi-hop RAG is expensive: each query pays for a decomposition or planning call, multiple retrieval rounds, and a synthesis call over a larger context. Total: 3-10x single-shot RAG cost. Only use it when the question actually needs it.
The adaptive pattern (classify first, route to multi-hop only when needed) keeps cost reasonable across a mixed query distribution.
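The routing itself is a one-branch function; the classifier can be a cheap LLM call or even a heuristic. A sketch, where `classify` is a hypothetical callable returning `'single'` or `'multi'`:

```python
def route(question, classify, single_hop, multi_hop):
    """Pay the multi-hop cost only when the classifier asks for it."""
    return multi_hop(question) if classify(question) == "multi" else single_hop(question)
```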
Next: Why evaluation is critical.