Reranking

Reranking takes a retrieved candidate set and reorders it with a more accurate (but slower) model. It's the single most impactful addition you can make to a naive RAG system. Skipping it is the most common reason production RAG systems underperform their potential.

Why reranking exists

Initial retrieval (dense or hybrid) uses independent embeddings: one for the query, one for each document. The similarity score is a useful approximation, but the model never sees query and document together. A cross-encoder reranker considers query and document jointly, producing a much more accurate relevance score at much higher cost per pair.

The pattern: retrieve cheaply, rerank precisely.

The flow

1. Initial retrieval returns top-50 or top-100 candidates
2. Reranker scores each (query, candidate) pair
3. Re-sort by reranker score
4. Take top-5 or top-10 to send to generator

You retrieve wide (to catch the right answer somewhere in the set), then narrow precisely (so the generator sees clean context).
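
A minimal sketch of this flow using the sentence-transformers library. The model names and in-memory corpus are illustrative placeholders, not recommendations:

```python
# Retrieve wide with a bi-encoder, then narrow with a cross-encoder.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = ["chunk one ...", "chunk two ...", "chunk three ..."]  # your chunks
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)

def search(query, retrieve_k=50, final_k=5):
    # Stage 1: cheap, wide retrieval against precomputed embeddings
    q_emb = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, corpus_emb, top_k=retrieve_k)[0]
    candidates = [corpus[h["corpus_id"]] for h in hits]
    # Stage 2: precise rerank, one forward pass per (query, candidate) pair
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:final_k]]
```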

Bi-encoders vs cross-encoders

Bi-encoder (initial retrieval)

Embed query and documents separately. Compare via cosine similarity. Fast at retrieval time because document embeddings are precomputed.

Cross-encoder (reranker)

Concatenate query + document, pass through a transformer, output a score. Much more accurate because the model can attend jointly to both. Scores cannot be precomputed; each (query, candidate) pair must be scored at query time.
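
To make the contrast concrete, here is the same pair scored both ways (models again illustrative):

```python
# Same (query, document) pair, scored by each architecture.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "how do I rotate an API key?"
doc = "To rotate a key, generate a replacement, update clients, then revoke the old one."

# Bi-encoder: two independent embeddings compared by cosine similarity.
bi = SentenceTransformer("all-MiniLM-L6-v2")
cos_score = util.cos_sim(bi.encode(query), bi.encode(doc)).item()

# Cross-encoder: one joint forward pass over the concatenated pair.
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
joint_score = ce.predict([(query, doc)])[0]
```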

Reranker options

Open-source cross-encoders

Self-hosted models such as the ms-marco MiniLM cross-encoders (small and fast) and the BAAI bge-reranker family (larger, higher quality), both runnable via sentence-transformers. Predictable cost, and no data leaves your infrastructure.

Commercial rerankers

Hosted APIs such as Cohere Rerank: strong out-of-the-box quality with no model serving to maintain, priced per query.

LLM-as-reranker

Pass the top-20 candidates to a small LLM with a prompt asking it to rerank them. Highest quality in some cases, but slower and more expensive per query. Worth trying when nothing else works.
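
A hedged sketch of the idea using the OpenAI chat API; the model name and prompt format are illustrative, and a real implementation needs a fallback for when the model returns something other than clean JSON:

```python
import json
from openai import OpenAI

client = OpenAI()

def llm_rerank(query, candidates, top_k=5):
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    prompt = (
        f"Query: {query}\n\nPassages:\n{numbered}\n\n"
        "Reply with only a JSON list of passage indices, most relevant first."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    # Brittle in practice: guard this parse and fall back to original order.
    order = json.loads(resp.choices[0].message.content)
    return [candidates[i] for i in order[:top_k]]
```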

Latency

Cross-encoder reranking adds latency: every candidate requires its own forward pass through the model, so scoring 50-100 pairs typically takes tens to a few hundred milliseconds on a GPU, and noticeably more on CPU or over a network API.

For real-time RAG, choose the reranker that fits your latency budget. For batch or high-stakes queries, prefer higher quality.
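
A back-of-envelope way to see where the time goes, assuming batched GPU inference; the per-batch figure is an assumption to replace with your own measurement:

```python
import math

# ms_per_batch is an assumed figure, not a benchmark.
def rerank_latency_ms(n_candidates, batch_size=32, ms_per_batch=25.0):
    return math.ceil(n_candidates / batch_size) * ms_per_batch

rerank_latency_ms(50)   # 2 batches -> ~50 ms under these assumptions
rerank_latency_ms(500)  # 16 batches -> ~400 ms; why cascades exist
```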

The quality gain

On standard retrieval benchmarks, adding a reranker typically improves top-10 relevance (e.g., nDCG@10) by 15-30%. In domain-specific applications, gains can be higher.

On a real RAG system, the user-visible effect is usually: fewer irrelevant chunks in the context, so the generator has cleaner input and produces fewer hallucinations or off-topic answers.

How many candidates to rerank

Tradeoffs:

- More candidates (100+) raise the ceiling on recall, since the right chunk is more likely to be somewhere in the set, but latency and cost grow linearly with the count.
- Fewer candidates (20-30) are faster and cheaper, but if initial retrieval didn't surface the right chunk in that window, the reranker cannot recover it.
- Around 50 is a common middle ground: wide enough for most corpora, cheap enough for real-time use.

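One way to choose the count empirically: measure how often the correct chunk already appears in the top-k candidate set, and stop widening once recall plateaus. A sketch, assuming a small labeled eval set and a `retrieve(query, k)` function that returns ranked chunk IDs:

```python
# eval_pairs: list of (query, gold_chunk_id); retrieve is assumed to
# return the top-k chunk IDs for a query, ranked by initial retrieval.
def recall_at_k(eval_pairs, retrieve, k):
    hits = sum(gold in retrieve(query, k) for query, gold in eval_pairs)
    return hits / len(eval_pairs)

# for k in (10, 25, 50, 100):
#     print(k, recall_at_k(eval_set, retrieve, k))
```
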
Multi-stage reranking

For very large initial candidate sets, chain rerankers of increasing precision:

  1. Retrieve top-500 with fast sparse+dense hybrid
  2. Rerank with fast cross-encoder (MiniLM) → top-50
  3. Rerank with high-quality cross-encoder (bge-reranker-large or Cohere) → top-10
  4. Pass to generator

Each stage is cheap in aggregate because the candidate set shrinks rapidly. Web-scale search engines have used this cascade pattern for decades.
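
A sketch of the two cross-encoder stages, using the same models named in the list above (swap in your own):

```python
from sentence_transformers import CrossEncoder

fast = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # cheap first pass
strong = CrossEncoder("BAAI/bge-reranker-large")             # precise second pass

def cascade(query, candidates, mid_k=50, final_k=10):
    # Score docs with the given model, keep the k best.
    def top_k(model, docs, k):
        scores = model.predict([(query, d) for d in docs])
        return [d for d, _ in sorted(zip(docs, scores), key=lambda x: -x[1])[:k]]
    return top_k(strong, top_k(fast, candidates, mid_k), final_k)
```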

Reranking with additional signal

Rerankers can take more than just text similarity into account. Common additional signals include:

- Recency, when the corpus changes over time
- Source authority or trust level
- Popularity or click-through feedback, where available
- Structural metadata such as document type or section

Typical approach: combine the reranker score with metadata-derived boosts in a weighted sum. This is where reranking starts looking like a classical learning-to-rank system.
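
A minimal sketch of such a weighted sum, combining a normalized reranker score with a recency boost; the weights and half-life here are assumptions to tune against your own relevance data:

```python
import time

# Assumes rerank_score is already normalized to [0, 1]
# (e.g., a sigmoid over the cross-encoder's raw logit).
def blended_score(rerank_score, doc_timestamp,
                  w_rel=0.8, w_recency=0.2, half_life_days=90):
    age_days = (time.time() - doc_timestamp) / 86400
    recency = 0.5 ** (age_days / half_life_days)  # 1.0 now, 0.5 at half-life
    return w_rel * rerank_score + w_recency * recency
```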

When reranking doesn't help

Rare but real cases:

- Initial retrieval recall is the bottleneck: if the right chunk never enters the candidate set, no reranker can surface it. Fix retrieval first.
- The candidates are near-duplicates of each other, so reordering changes little.
- Queries are exact-match lookups (IDs, error codes) that BM25 already ranks correctly.
- The latency budget simply cannot absorb the extra model passes.

My default setup

For production RAG:

  1. Hybrid retrieval (dense + BM25) → top-50
  2. RRF fusion of the two result sets
  3. Cross-encoder rerank (bge-reranker or Cohere) → top-10
  4. Pass top-10 to generator

This stack handles the 80% case well. It's also what I'd A/B test against any naive "just vector search" baseline.
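
For reference, the RRF fusion in step 2 is a few lines; each input is a ranked list of document IDs, and k=60 is the conventional constant:

```python
from collections import defaultdict

def rrf(dense_ids, sparse_ids, k=60, top_n=50):
    # Each doc scores 1/(k + rank) per list it appears in; ranks start at 1.
    scores = defaultdict(float)
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```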

Next: Query rewriting.