Reranking takes a retrieved candidate set and reorders it with a more accurate (but slower) model. It's the single most impactful addition you can make to a naive RAG system. Skipping it is the most common reason production RAG systems underperform their potential.
Initial retrieval (dense or hybrid) uses independent embeddings: one for the query, one for each document. The similarity score is a useful approximation, but it cannot capture fine-grained interactions between query and document terms. A cross-encoder reranker considers query and document together, producing a much more accurate relevance score, at much higher cost per pair.
The pattern: retrieve cheaply, rerank precisely.
1. Initial retrieval returns top-50 or top-100 candidates
2. Reranker scores each (query, candidate) pair
3. Re-sort by reranker score
4. Take top-5 or top-10 to send to generator
You retrieve wide (to catch the right answer somewhere in the set), then narrow precisely (so the generator sees clean context).
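The four steps above can be sketched end to end. The two scoring functions here are toy stand-ins (term overlap), not real models; the pipeline shape is the point.

```python
# Sketch of the retrieve-wide, rerank-narrow pipeline. Both scorers are
# toy stand-ins: a real system would use a bi-encoder for retrieval and
# a cross-encoder for reranking.

def cheap_retrieval_score(query, doc):
    # Stand-in for bi-encoder similarity: fraction of query terms present.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def rerank_score(query, doc):
    # Stand-in for a cross-encoder: Jaccard overlap of the pair. A real
    # reranker runs a transformer jointly over query and document.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q | d), 1)

def retrieve_then_rerank(query, corpus, retrieve_k=50, final_k=5):
    # 1. Retrieve wide with the cheap scorer.
    candidates = sorted(
        corpus,
        key=lambda doc: cheap_retrieval_score(query, doc),
        reverse=True,
    )[:retrieve_k]
    # 2-3. Score each (query, candidate) pair precisely and re-sort.
    reranked = sorted(
        candidates,
        key=lambda doc: rerank_score(query, doc),
        reverse=True,
    )
    # 4. Narrow to the final context for the generator.
    return reranked[:final_k]
```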
Bi-encoder: embed the query and documents separately and compare via cosine similarity. Fast at retrieval time because document embeddings are precomputed.
Cross-encoder: concatenate query + document, pass the pair through a transformer, and output a relevance score. Much more accurate because the model can attend jointly to both texts. Scores cannot be precomputed; each (query, candidate) pair must be scored at query time.
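The "cannot be precomputed" point is structural, and a minimal toy makes it concrete: a bi-encoder score factors as a dot product of two independent encodings, so the document side can be cached offline; a pair function like token overlap (standing in here for a real cross-encoder) has no such factorization and must be evaluated per pair at query time.

```python
def bi_encoder_score(query_vec, doc_vec):
    # doc_vec was computed once at index time; only the query is encoded
    # at query time, then scoring is a dot product.
    return sum(q * d for q, d in zip(query_vec, doc_vec))

def cross_encoder_score(query, doc):
    # Toy joint scorer: Jaccard overlap of the two token sets. Like a
    # real cross-encoder, it is a function of the pair and cannot be
    # split into one cached value per document.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q | d), 1)
```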
LLM-as-reranker: pass the top-20 candidates to a small LLM with a prompt asking it to re-rank them. Highest quality in some cases, but slower and more expensive per query. Worth trying when nothing else works.
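A sketch of the LLM-as-reranker pattern: build a numbered prompt, ask for an ordering, and parse it defensively. `call_llm` is a hypothetical stand-in for your actual model client, and the prompt wording is illustrative, not a tested template.

```python
# LLM-as-reranker sketch. `call_llm` is a hypothetical callable
# (prompt: str) -> str; swap in a real model client.

def build_rerank_prompt(query, candidates):
    lines = [
        f"Query: {query}",
        "Rank the passages below from most to least relevant.",
        "Answer with the passage numbers only, comma-separated.",
        "",
    ]
    for i, text in enumerate(candidates, 1):
        lines.append(f"[{i}] {text}")
    return "\n".join(lines)

def parse_ranking(response, n):
    # Extract passage numbers, skip junk tokens, and fall back to the
    # original order for anything the model omitted.
    seen = []
    for tok in response.replace(",", " ").split():
        if tok.strip("[]").isdigit():
            i = int(tok.strip("[]"))
            if 1 <= i <= n and i not in seen:
                seen.append(i)
    seen += [i for i in range(1, n + 1) if i not in seen]
    return seen

def llm_rerank(query, candidates, call_llm):
    response = call_llm(build_rerank_prompt(query, candidates))
    order = parse_ranking(response, len(candidates))
    return [candidates[i - 1] for i in order]
```

The fallback in `parse_ranking` matters in practice: LLM output is not guaranteed to be a clean permutation, so missing or duplicated numbers should degrade gracefully rather than drop candidates.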
Cross-encoder reranking adds latency: every candidate needs its own forward pass at query time, so cost grows linearly with the number of candidates you rerank.
For real-time RAG, choose the reranker that fits your latency budget. For batch or high-stakes queries, prefer higher quality.
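A back-of-envelope budget helps here. With batched inference, reranking K candidates costs roughly ceil(K / batch size) forward passes; the numbers below are illustrative, not benchmarks.

```python
import math

def rerank_latency_ms(n_candidates, batch_size, ms_per_batch):
    # Batched cross-encoder inference: ceil(K / B) batches, each taking
    # roughly ms_per_batch on the serving hardware.
    return math.ceil(n_candidates / batch_size) * ms_per_batch

# e.g. 50 candidates, batch of 16, ~20 ms per batch:
# rerank_latency_ms(50, 16, 20) -> 80 ms added to the query path.
```

This is why the initial retrieval depth (50 vs. 500) is a latency knob, not just a recall knob.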
On standard benchmarks, adding a reranker typically improves top-10 relevance by 15-30%. In domain-specific applications, gains can be higher.
On a real RAG system, the user-visible effect is usually: fewer irrelevant chunks in the context, so the generator has cleaner input and produces fewer hallucinations or off-topic answers.
Tradeoffs: better precision in exchange for added per-query latency, compute cost, and one more model to deploy and monitor.
For very large initial candidate sets, chain rerankers of increasing precision: for example, a lexical scorer over the full corpus, a bi-encoder to cut to hundreds, then a cross-encoder over the final dozens.
Each stage is cheap in aggregate because the candidate set shrinks rapidly. Used at Google-scale search for decades.
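The cascade reduces to a small loop: each stage sorts by a more expensive scorer and keeps fewer candidates. The scorers are hypothetical callables of `(query, doc) -> float`; plug in whatever models your stages use.

```python
def cascade_rerank(query, candidates, stages):
    # stages: list of (score_fn, keep_k) pairs, ordered cheap -> expensive.
    # Each stage re-sorts the survivors and truncates, so the expensive
    # scorers only ever see a small set.
    for score_fn, keep_k in stages:
        candidates = sorted(
            candidates,
            key=lambda doc: score_fn(query, doc),
            reverse=True,
        )[:keep_k]
    return candidates

# Hypothetical usage:
# cascade_rerank(q, docs, [(bm25_score, 1000), (bi_score, 100), (cross_score, 10)])
```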
Rerankers can take more than just text similarity into account: recency, source authority, and other document metadata, for example.
Typical approach: combine the reranker score with metadata-derived boosts in a weighted sum. This is where reranking starts looking like a classical learning-to-rank system.
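A minimal sketch of that weighted sum. The feature names, decay constant, and weights here are illustrative assumptions; a learning-to-rank system would fit the weights from click or relevance-label data instead of hand-tuning them.

```python
import math

def combined_score(rerank_score, doc_meta, w_rerank=1.0, w_recency=0.2, w_source=0.1):
    # Recency boost: exponential decay with document age in days
    # (90-day half-scale chosen arbitrarily for illustration).
    age_days = doc_meta.get("age_days", 365)
    recency = math.exp(-age_days / 90)
    # Source boost: e.g. curated internal docs over scraped pages.
    source = 1.0 if doc_meta.get("source") == "curated" else 0.0
    return w_rerank * rerank_score + w_recency * recency + w_source * source
```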
Rare but real cases:
For production RAG: hybrid retrieval to a top-50 candidate set, a cross-encoder reranker, and the top-5 results passed to the generator.
This stack handles the 80% case well. It's also what I'd A/B test against any naive "just vector search" baseline.
Next: Query rewriting.