Reranking takes a retrieved candidate set and reorders it with a more accurate (but slower) model. It's the single most impactful addition you can make to a naive RAG system. Skipping it is the most common reason production RAG systems underperform their potential.
Initial retrieval (dense or hybrid) uses independent embeddings: one for the query, one for each document. The similarity score is a useful approximation, but it cannot capture fine-grained interactions between query and document terms. A cross-encoder reranker considers query and document together, producing a much more accurate relevance score, at much higher cost per pair.
The pattern: retrieve cheaply, rerank precisely.
1. Initial retrieval returns top-50 or top-100 candidates
2. Reranker scores each (query, candidate) pair
3. Re-sort by reranker score
4. Take top-5 or top-10 to send to generator
You retrieve wide (to catch the right answer somewhere in the set), then narrow precisely (so the generator sees clean context).
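The four steps above can be sketched end to end. The two scoring functions here are toy stand-ins (term overlap), not real models; the pipeline shape is the point.

```python
# Sketch of the retrieve-wide, rerank-narrow pipeline. Both scorers are
# toy stand-ins: a real system would use a bi-encoder for retrieval and
# a cross-encoder for reranking.

def cheap_retrieval_score(query, doc):
    # Stand-in for bi-encoder similarity: fraction of query terms present.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def rerank_score(query, doc):
    # Stand-in for a cross-encoder: Jaccard overlap of the pair. A real
    # reranker runs a transformer jointly over query and document.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q | d), 1)

def retrieve_then_rerank(query, corpus, retrieve_k=50, final_k=5):
    # 1. Retrieve wide with the cheap scorer.
    candidates = sorted(
        corpus,
        key=lambda doc: cheap_retrieval_score(query, doc),
        reverse=True,
    )[:retrieve_k]
    # 2-3. Score each (query, candidate) pair precisely and re-sort.
    reranked = sorted(
        candidates,
        key=lambda doc: rerank_score(query, doc),
        reverse=True,
    )
    # 4. Narrow to the final context for the generator.
    return reranked[:final_k]
```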
Bi-encoder: embed the query and documents separately and compare via cosine similarity. Fast at retrieval time because document embeddings are precomputed.
Cross-encoder: concatenate query + document, pass the pair through a transformer, and output a relevance score. Much more accurate because the model can attend jointly to both texts. Scores cannot be precomputed; each (query, candidate) pair must be scored at query time.
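The "cannot be precomputed" point is structural, and a minimal toy makes it concrete: a bi-encoder score factors as a dot product of two independent encodings, so the document side can be cached offline; a pair function like token overlap (standing in here for a real cross-encoder) has no such factorization and must be evaluated per pair at query time.

```python
def bi_encoder_score(query_vec, doc_vec):
    # doc_vec was computed once at index time; only the query is encoded
    # at query time, then scoring is a dot product.
    return sum(q * d for q, d in zip(query_vec, doc_vec))

def cross_encoder_score(query, doc):
    # Toy joint scorer: Jaccard overlap of the two token sets. Like a
    # real cross-encoder, it is a function of the pair and cannot be
    # split into one cached value per document.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q | d), 1)
```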
LLM-as-reranker: pass the top-20 candidates to a small LLM with a prompt asking it to re-rank them. Highest quality in some cases, but slower and more expensive per query. Worth trying when nothing else works.
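A sketch of the LLM-as-reranker pattern: build a numbered prompt, ask for an ordering, and parse it defensively. `call_llm` is a hypothetical stand-in for your actual model client, and the prompt wording is illustrative, not a tested template.

```python
# LLM-as-reranker sketch. `call_llm` is a hypothetical callable
# (prompt: str) -> str; swap in a real model client.

def build_rerank_prompt(query, candidates):
    lines = [
        f"Query: {query}",
        "Rank the passages below from most to least relevant.",
        "Answer with the passage numbers only, comma-separated.",
        "",
    ]
    for i, text in enumerate(candidates, 1):
        lines.append(f"[{i}] {text}")
    return "\n".join(lines)

def parse_ranking(response, n):
    # Extract passage numbers, skip junk tokens, and fall back to the
    # original order for anything the model omitted.
    seen = []
    for tok in response.replace(",", " ").split():
        if tok.strip("[]").isdigit():
            i = int(tok.strip("[]"))
            if 1 <= i <= n and i not in seen:
                seen.append(i)
    seen += [i for i in range(1, n + 1) if i not in seen]
    return seen

def llm_rerank(query, candidates, call_llm):
    response = call_llm(build_rerank_prompt(query, candidates))
    order = parse_ranking(response, len(candidates))
    return [candidates[i - 1] for i in order]
```

The fallback in `parse_ranking` matters in practice: LLM output is not guaranteed to be a clean permutation, so missing or duplicated numbers should degrade gracefully rather than drop candidates.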
Cross-encoder reranking adds latency: every candidate needs its own forward pass at query time, so cost grows linearly with the number of candidates you rerank.
For real-time RAG, choose the reranker that fits your latency budget. For batch or high-stakes queries, prefer higher quality.
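A back-of-envelope budget helps here. With batched inference, reranking K candidates costs roughly ceil(K / batch size) forward passes; the numbers below are illustrative, not benchmarks.

```python
import math

def rerank_latency_ms(n_candidates, batch_size, ms_per_batch):
    # Batched cross-encoder inference: ceil(K / B) batches, each taking
    # roughly ms_per_batch on the serving hardware.
    return math.ceil(n_candidates / batch_size) * ms_per_batch

# e.g. 50 candidates, batch of 16, ~20 ms per batch:
# rerank_latency_ms(50, 16, 20) -> 80 ms added to the query path.
```

This is why the initial retrieval depth (50 vs. 500) is a latency knob, not just a recall knob.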
On standard benchmarks, adding a reranker typically improves top-10 relevance by 15-30%. In domain-specific applications, gains can be higher.
On a real RAG system, the user-visible effect is usually: fewer irrelevant chunks in the context, so the generator has cleaner input and produces fewer hallucinations or off-topic answers.
Tradeoffs: better precision in exchange for added per-query latency, compute cost, and one more model to deploy and monitor.
For very large initial candidate sets, chain rerankers of increasing precision: for example, a lexical scorer over the full corpus, a bi-encoder to cut to hundreds, then a cross-encoder over the final dozens.
Each stage is cheap in aggregate because the candidate set shrinks rapidly. Used at Google-scale search for decades.
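The cascade reduces to a small loop: each stage sorts by a more expensive scorer and keeps fewer candidates. The scorers are hypothetical callables of `(query, doc) -> float`; plug in whatever models your stages use.

```python
def cascade_rerank(query, candidates, stages):
    # stages: list of (score_fn, keep_k) pairs, ordered cheap -> expensive.
    # Each stage re-sorts the survivors and truncates, so the expensive
    # scorers only ever see a small set.
    for score_fn, keep_k in stages:
        candidates = sorted(
            candidates,
            key=lambda doc: score_fn(query, doc),
            reverse=True,
        )[:keep_k]
    return candidates

# Hypothetical usage:
# cascade_rerank(q, docs, [(bm25_score, 1000), (bi_score, 100), (cross_score, 10)])
```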
Rerankers can take more than just text similarity into account: recency, source authority, and other document metadata, for example.
Typical approach: combine the reranker score with metadata-derived boosts in a weighted sum. This is where reranking starts looking like a classical learning-to-rank system.
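A minimal sketch of that weighted sum. The feature names, decay constant, and weights here are illustrative assumptions; a learning-to-rank system would fit the weights from click or relevance-label data instead of hand-tuning them.

```python
import math

def combined_score(rerank_score, doc_meta, w_rerank=1.0, w_recency=0.2, w_source=0.1):
    # Recency boost: exponential decay with document age in days
    # (90-day half-scale chosen arbitrarily for illustration).
    age_days = doc_meta.get("age_days", 365)
    recency = math.exp(-age_days / 90)
    # Source boost: e.g. curated internal docs over scraped pages.
    source = 1.0 if doc_meta.get("source") == "curated" else 0.0
    return w_rerank * rerank_score + w_recency * recency + w_source * source
```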
Rare but real cases:
For production RAG: hybrid retrieval to a top-50 candidate set, a cross-encoder reranker, and the top-5 results passed to the generator.
This stack handles the 80% case well. It's also what I'd A/B test against any naive "just vector search" baseline.
Next: Query rewriting.