Users don't phrase queries the way documents do. They use pronouns, they're brief, they ask follow-ups that depend on context. Query rewriting uses an LLM to transform the user's query into a form better suited for retrieval. It's one of the lowest-effort quality wins in modern RAG.
Consider a mid-conversation query:
User: "how does it handle rate limiting?"
The embedding of "how does it handle rate limiting" carries no information about what "it" refers to. Retrieved chunks may cover rate limiting in general while missing the specific product the user meant.
Turn 1: "Tell me about Pinecone's pricing."
Turn 2: "What about performance?"
Turn 2 alone embeds as a generic performance question. With rewriting: "How does Pinecone perform?"
"refund policy" vs "what is our customer refund policy including eligibility, timeframe, and process"
The expanded version retrieves better because it provides more signal to match against.
User says "login not working"; docs say "authentication failures." Rewriting to include synonyms bridges the gap.
Rewrite the query to resolve pronouns and implicit context from conversation history.
Given the conversation history:
- User: Tell me about Pinecone pricing
- Assistant: [response about Pinecone]
- User: What about performance?
Rewrite the last user query as a standalone search query. → "How does Pinecone perform?"
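A minimal sketch of this contextualization step. The `llm` callable is a stand-in for whatever chat-completion client you use (OpenAI, Anthropic, etc.), and the system prompt wording is illustrative:

```python
# Hypothetical rewriter: turns conversation history plus the latest query
# into a standalone search query via an LLM. `llm` is an assumed callable
# that takes system/user strings and returns the model's text.

REWRITE_SYSTEM = (
    "Rewrite the last user message as a standalone search query. "
    "Resolve pronouns and implicit references using the conversation history. "
    "Return only the rewritten query."
)

def contextualize(history: list[tuple[str, str]], query: str, llm) -> str:
    """history is a list of (role, text) pairs, e.g. ("user", "...")."""
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    prompt = f"Conversation so far:\n{transcript}\n\nLast user query: {query}"
    return llm(system=REWRITE_SYSTEM, user=prompt).strip()
```

The `.strip()` matters in practice: cheap models often pad the rewrite with whitespace or quotes, and the raw output goes straight into retrieval.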
Add synonyms, related terms, or rephrasings to improve lexical matching.
"login issues" → "login authentication sign-in access issues problems failures"
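The merge logic behind expansion is simple; in practice an LLM generates the synonyms, but a sketch with a hand-maintained map (illustrative, not exhaustive) shows the shape:

```python
# Sketch of lexical query expansion. The SYNONYMS table is a toy stand-in
# for LLM-generated expansions; the interleaving logic is the same either way.

SYNONYMS = {
    "login": ["authentication", "sign-in"],
    "issues": ["problems", "failures"],
}

def expand(query: str) -> str:
    terms = []
    for word in query.split():
        terms.append(word)                          # keep the original term
        terms.extend(SYNONYMS.get(word.lower(), []))  # add known synonyms
    return " ".join(terms)

expand("login issues")
# → "login authentication sign-in issues problems failures"
```

Note this helps lexical (BM25-style) matching most; dense embeddings already capture some synonymy on their own.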
Break a compound query into sub-queries. Retrieve for each separately.
"What's our refund policy and how do I escalate a billing dispute?" → ["What is the refund policy?", "How to escalate a billing dispute?"]
Generate a more general version of the query to retrieve broader context.
"Why did my OAuth token expire after 2 hours?" → step-back: "How does OAuth token expiration work?"
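A step-back rewrite is usually retrieved alongside the original query, not instead of it. A sketch, where `retrieve` is a stand-in for your vector search returning ranked chunk IDs and the step-back query comes from an LLM upstream:

```python
# Sketch: search with both the original query and its step-back version,
# so specific chunks and background chunks both surface. `retrieve` is an
# assumed function (query, k) -> ranked list of chunk IDs.

def step_back_retrieve(query: str, step_back_query: str, retrieve, k: int = 5) -> list[str]:
    specific = retrieve(query, k)
    general = retrieve(step_back_query, k)
    # keep specific results first, then append any new background chunks
    return specific + [c for c in general if c not in specific]
```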
Generate a hypothetical answer, embed it, and retrieve based on that embedding. See HyDE.
SYSTEM: You are a search query optimizer. Given a user query, rewrite it as a clear, self-contained search query that would retrieve the most relevant documents. Expand abbreviations, resolve pronouns, and include relevant synonyms. Return only the rewritten query.
USER: [raw query]
Query rewriting doesn't need your best reasoning model. A fast, cheap model (GPT-4o-mini, Claude Haiku, Gemini Flash) does this well, typically adding 50-200 ms of latency per rewrite at a fraction of the generation step's cost.
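Because the rewrite sits on the critical path, it should degrade gracefully: a failed or empty rewrite falls back to the raw query rather than blocking retrieval. A sketch, where `rewrite_fn` is whatever LLM call you wrap:

```python
# Sketch: never let a broken rewrite break retrieval. `rewrite_fn` is an
# assumed callable (query) -> rewritten query that may raise or return junk.

def rewrite_with_fallback(query: str, rewrite_fn) -> str:
    try:
        rewritten = rewrite_fn(query)
        # empty or whitespace-only rewrites are treated as failures
        return rewritten if rewritten.strip() else query
    except Exception:
        # model error, timeout, rate limit: retrieve with the raw query
        return query
```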
Generate multiple rewrites and run retrieval for each. Union and dedupe. More coverage, more cost. See multi-query + fusion.
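The union-and-dedupe step can be sketched directly; `retrieve` again stands in for your vector search returning ranked chunk IDs:

```python
# Sketch: run retrieval once per rewrite, then union the results,
# deduping while preserving first-seen order.

def multi_query_retrieve(rewrites: list[str], retrieve, k: int = 5) -> list[str]:
    seen: set[str] = set()
    fused: list[str] = []
    for q in rewrites:
        for chunk_id in retrieve(q, k):
            if chunk_id not in seen:   # dedupe across rewrites
                seen.add(chunk_id)
                fused.append(chunk_id)
    return fused
```

First-seen order is the crudest fusion policy; reciprocal rank fusion is the usual upgrade when you care about the merged ranking.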
Compare retrieval quality with and without rewriting on the same eval set. Rewriting should improve hit rate, especially on follow-up queries that depend on conversation context, terse or underspecified queries, and queries whose vocabulary doesn't match the documents.
If rewriting doesn't help, either your initial retrieval already handles query variety well or your rewriting prompt is wrong. Test both hypotheses.
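The comparison itself is a few lines. A sketch, assuming an eval set of (raw query, expected chunk ID) pairs and stand-in `retrieve` and `rewrite` functions:

```python
# Sketch: hit rate @ k, with or without a rewrite step, on the same eval set.
# eval_set: list of (query, expected_chunk_id); retrieve: (q, k) -> chunk IDs.

def hit_rate(eval_set, retrieve, rewrite=None, k: int = 5) -> float:
    hits = 0
    for query, expected_id in eval_set:
        q = rewrite(query) if rewrite else query
        if expected_id in retrieve(q, k):
            hits += 1
    return hits / len(eval_set)
```

Run it twice, once with `rewrite=None` and once with your rewriter, and compare; the delta on follow-up and terse queries is where rewriting should show up.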
The failure mode of query rewriting: the rewrite subtly changes what the user asked. Guard against this by logging both the raw and rewritten queries, spot-checking rewrites during evals, instructing the model to preserve intent rather than embellish, and falling back to the raw query when the rewrite fails or drifts.
Next: HyDE.