Dense embeddings are great at synonyms and paraphrases but weak at exact matches, rare entities, and specific codes. Keyword search (BM25) is the opposite. Hybrid search runs both and combines the results. For production RAG, hybrid retrieval consistently outperforms dense-only. Most teams ship dense-only first and then regret it.
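Consider a user query: "error E-47 in the invoice processor". Dense retrieval tends to return passages about invoice-processor errors in general, because the embedding captures the topic but not the exact code. BM25 ranks passages that literally contain "E-47" highly, but it can miss relevant text that is phrased differently. The hybrid result combines both and is almost always more useful.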
Dense retrieval: embed the query, then find its nearest neighbors among the document embeddings. Covered throughout this section.
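Here is a minimal sketch of dense retrieval. It assumes the embeddings already exist: the 3-dimensional toy vectors and document IDs below are invented, and real models produce hundreds of dimensions. The code scores every document by cosine similarity and keeps the top k. Production systems replace this brute-force scan with an approximate nearest-neighbor index.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def dense_search(
    query_vec: list[float], doc_vecs: dict[str, list[float]], k: int
) -> list[tuple[str, float]]:
    """Brute-force nearest-neighbor search: score every document, return the top k."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings, invented for illustration only.
docs = {
    "invoice-errors": [0.9, 0.1, 0.0],
    "payment-retry": [0.7, 0.3, 0.1],
    "holiday-policy": [0.0, 0.2, 0.9],
}
print(dense_search([1.0, 0.0, 0.0], docs, k=2))
```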
Sparse retrieval (BM25): classical information retrieval. Score documents by term frequency, inverse document frequency, and document length. Fast, exact, no training required. See BM25 and sparse retrieval.
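The sketch below is a minimal in-memory Okapi BM25 that makes the three ingredients concrete. Several choices in it are assumptions: the tokenizer, the IDF variant (the non-negative form Lucene uses), the defaults k1 = 1.2 and b = 0.75, and the toy corpus. Real engines expose their own analyzers and parameters.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and keep runs of word characters and hyphens,
    so identifiers like "E-47" survive as a single token."""
    return re.findall(r"[\w-]+", text.lower())


class BM25:
    """Minimal Okapi BM25 over an in-memory corpus."""

    def __init__(self, docs: dict[str, str], k1: float = 1.2, b: float = 0.75):
        self.k1, self.b = k1, b
        self.tf = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
        self.doc_len = {doc_id: sum(c.values()) for doc_id, c in self.tf.items()}
        self.avgdl = sum(self.doc_len.values()) / len(docs)
        # Document frequency: iterating a Counter yields distinct terms per document.
        self.df = Counter(term for c in self.tf.values() for term in c)
        self.n_docs = len(docs)

    def idf(self, term: str) -> float:
        # Non-negative IDF variant: rare terms get larger weights.
        n = self.df.get(term, 0)
        return math.log(1 + (self.n_docs - n + 0.5) / (n + 0.5))

    def score(self, query: str, doc_id: str) -> float:
        tf, length = self.tf[doc_id], self.doc_len[doc_id]
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            # k1 saturates repeated terms; b normalizes for document length.
            denom = f + self.k1 * (1 - self.b + self.b * length / self.avgdl)
            total += self.idf(term) * f * (self.k1 + 1) / denom
        return total

    def search(self, query: str, k: int) -> list[tuple[str, float]]:
        scored = [(d, self.score(query, d)) for d in self.tf]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


# Invented knowledge-base snippets for illustration.
corpus = {
    "kb-1": "Error E-47 means the invoice processor lost its database connection",
    "kb-2": "The invoice processor retries failed payments three times",
    "kb-3": "Error E-12 is raised when a PDF cannot be parsed",
}
index = BM25(corpus)
print(index.search("error E-47 in the invoice processor", k=3))
```

The exact-match token "e-47" appears in only one document, so its IDF is the highest in the query. That is the behavior dense retrieval lacks.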
Reciprocal Rank Fusion (RRF) is the default and usually best-performing fusion method. For each document d, compute its score as:
$$\text{RRF\_score}(d) = \sum_{i} \frac{1}{k + \text{rank}_i(d)}$$
where rank_i(d) is the document's (1-based) rank in the i-th retrieval method's result list, and k is a small constant (typically 60). The sum runs over all retrievers; a retriever that did not return d contributes nothing.
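Because RRF uses only ranks, it sidesteps the problem of combining retrievers whose raw scores live on different scales: BM25 scores are unbounded (for example 0-30), while cosine similarity typically falls in 0-1. No score normalization or weight tuning is needed.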
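Below is a short sketch of RRF as defined above, using k = 60 and 1-based ranks. The function name and document IDs are illustrative. Only ranks enter the computation, so raw BM25 and cosine scores are never compared.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    ranked_lists: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several best-first lists of document IDs with RRF.

    Ranks are 1-based. A document missing from a list gets no contribution from it.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_ranking = ["kb-1", "kb-3", "kb-7"]
dense_ranking = ["kb-2", "kb-1", "kb-5"]
for doc_id, score in reciprocal_rank_fusion([bm25_ranking, dense_ranking]):
    print(doc_id, round(score, 5))
```

kb-1 comes out on top because it sits near the top of both lists. Scores accumulate per document ID, so a chunk returned by both retrievers appears once in the output.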
Weighted score fusion: min-max or z-score normalize each retriever's scores, then take a weighted sum:
final_score = α * dense_score + (1 - α) * sparse_score
Requires tuning α. A typical sweet spot is α = 0.5 to 0.7, ranging from equal weight to slightly favoring dense.
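A minimal sketch of weighted fusion with min-max normalization follows. The helper names, the raw scores, and the α = 0.6 value are invented for illustration, and z-score normalization would be a drop-in alternative. The sketch also has to decide what a document missing from one retriever's list receives. Here it assumes 0.0, which is one policy among several.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale one retriever's scores to [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:  # all scores equal: avoid dividing by zero
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(
    dense: dict[str, float], sparse: dict[str, float], alpha: float = 0.6
) -> list[tuple[str, float]]:
    """final = alpha * dense + (1 - alpha) * sparse, on min-max-normalized scores.

    Results are merged by document ID so a chunk found by both retrievers appears
    once; a document missing from one list gets 0.0 from it (an assumption).
    """
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    fused = {
        doc_id: alpha * dense_n.get(doc_id, 0.0) + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in dense_n.keys() | sparse_n.keys()
    }
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Invented raw scores: cosine similarities for dense, BM25 scores for sparse.
dense_scores = {"kb-2": 0.82, "kb-1": 0.80, "kb-5": 0.41}
sparse_scores = {"kb-1": 17.3, "kb-3": 9.1, "kb-7": 2.4}
for doc_id, score in weighted_fusion(dense_scores, sparse_scores, alpha=0.6):
    print(doc_id, round(score, 3))
```

Min-max maps each list's weakest hit to 0, the same value a missing document gets. That is one reason α and the normalization choice need tuning on your own queries.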
Learned fusion: train a small model to combine scores from multiple retrievers. This can give the best quality, but it brings the most complexity and needs labeled relevance data to train on.
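One simple choice of "small model" is logistic regression over the two retrievers' normalized scores, sketched below with plain batch gradient descent. The model choice, the two features, the hyperparameters, and the toy labels are all assumptions for illustration, not a prescribed method.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights: list[float], x: list[float]) -> float:
    """Fused relevance score: probability that the (query, document) pair is relevant."""
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))


def train_fusion_model(
    features: list[list[float]], labels: list[int], lr: float = 0.5, epochs: int = 2000
) -> list[float]:
    """Fit logistic-regression weights [bias, w_dense, w_sparse] by batch gradient descent."""
    weights = [0.0] * (len(features[0]) + 1)
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for x, y in zip(features, labels):
            error = predict(weights, x) - y  # gradient of log-loss w.r.t. the logit
            grads[0] += error
            for j, xi in enumerate(x, start=1):
                grads[j] += error * xi
        weights = [w - lr * g / len(features) for w, g in zip(weights, grads)]
    return weights


# Invented training pairs: [normalized dense score, normalized BM25 score] -> relevant?
X = [[0.9, 0.8], [0.2, 0.9], [0.8, 0.1], [0.1, 0.2], [0.3, 0.1], [0.7, 0.7]]
y = [1, 1, 0, 0, 0, 1]
w = train_fusion_model(X, y)
print("weights:", [round(v, 2) for v in w])
print("score for a new candidate:", round(predict(w, [0.5, 0.9]), 3))
```

In this made-up data only the BM25 feature separates relevant from irrelevant pairs, so the model ends up weighting it heavily. With real labels it learns whatever weighting your corpus rewards.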
Late interaction (ColBERT-style) takes a different approach: embed every token in queries and documents and match at the token level. It is stronger for precise retrieval but more expensive to store and serve.
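Here is a minimal sketch of the token-level scoring step, often called MaxSim: each query token takes its best-matching document token, and those maxima are summed. The 2-dimensional token embeddings are invented and assumed to be unit-normalized, so the dot product equals cosine similarity.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: list[list[float]], doc_tokens: list[list[float]]) -> float:
    """Late-interaction relevance: for every query token embedding, take its
    best-matching document token embedding, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy unit-length token embeddings, invented for illustration.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # has a close match for each query token
doc_b = [[0.6, 0.8], [0.8, 0.6]]              # only partial matches
print(maxsim_score(query, doc_a), maxsim_score(query, doc_b))
```

The extra cost comes from storing one vector per token rather than one per chunk, and from comparing every query token against every document token.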
Native hybrid: Weaviate, Qdrant, OpenSearch, Elasticsearch, Vespa, and Pinecone (with sparse vectors) all support hybrid search natively in a single engine. This is the cleanest architecture.
Separate systems: run BM25 in Elasticsearch/OpenSearch and dense search in a vector DB. At query time, call both and fuse the results with RRF. This needs more infrastructure but gives maximum flexibility.
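The sketch below shows the query-time side of this pattern. It assumes each backend is wrapped in a function that takes (query, k) and returns document IDs best-first. The two stub retrievers stand in for real Elasticsearch/OpenSearch and vector-DB clients, and they are hypothetical, as is the `hybrid_search` helper. Running both searches concurrently keeps latency close to the slower of the two rather than their sum.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Retriever = Callable[[str, int], list[str]]  # (query, k) -> doc IDs, best first


def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion over best-first ID lists (1-based ranks)."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_search(query: str, sparse: Retriever, dense: Retriever, top_k: int = 20) -> list[str]:
    """Query both backends concurrently with the same depth, then fuse with RRF."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        sparse_future = pool.submit(sparse, query, top_k)
        dense_future = pool.submit(dense, query, top_k)
        return rrf([sparse_future.result(), dense_future.result()])[:top_k]


# Hypothetical stubs standing in for a BM25 query and a vector-DB query.
def fake_bm25(query: str, k: int) -> list[str]:
    return ["kb-1", "kb-3", "kb-7"][:k]


def fake_vector_search(query: str, k: int) -> list[str]:
    return ["kb-2", "kb-1", "kb-5"][:k]


print(hybrid_search("error E-47 in the invoice processor", fake_bm25, fake_vector_search))
```

Passing the same top_k to both retrievers also keeps the retrieval depth matched across them.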
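Single database with full-text search: store dense vectors in a database that also offers full-text search, and use that for the lexical side (pgvector plus Postgres full-text search, for example). This works but is less optimized than dedicated hybrid solutions. Postgres's built-in ranking (ts_rank) is not BM25 proper.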
Mismatched retrieval depth: if BM25 returns its top 20 and dense returns its top 100, the fused candidate set is skewed toward dense results. Request the same top-k from every retriever.
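Tokenization: BM25 tokenization (stopwords, stemming, lowercasing) must behave the way you expect, and queries must go through the same analyzer as the indexed documents. With stopword removal, the query "reset my password" becomes "reset password" and matches differently than it would with stopwords preserved.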
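This toy analyzer reproduces the example. It assumes a tiny hand-picked stopword list, whereas real engines ship configurable analyzers with their own lists and stemmers. The point is that a single analyze function should serve both indexing and querying.

```python
import re

# Tiny illustrative stopword list (an assumption, not any engine's default).
STOPWORDS = {"a", "an", "the", "my", "in", "of", "to"}


def analyze(text: str, remove_stopwords: bool = True) -> list[str]:
    """Hypothetical analyzer: lowercase, split into word tokens, optionally drop
    stopwords. Apply the same function to documents at index time and to queries."""
    tokens = re.findall(r"\w+", text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


print(analyze("Reset my password"))                          # ['reset', 'password']
print(analyze("Reset my password", remove_stopwords=False))  # ['reset', 'my', 'password']
```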
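Duplicates: a chunk that appears in both retrievers' results should be merged into one entry, not listed twice. RRF handles this naturally because it accumulates scores per document. A weighted sum needs explicit merge logic, including a rule for documents missing from one retriever's list.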
Even hybrid retrieval benefits from a cross-encoder rerank on top. See reranking.
On standard benchmarks, hybrid RRF typically outperforms dense-only by 5-20% on retrieval metrics. On domain-specific corpora with lots of proper nouns, technical terms, or rare identifiers, the gain can be 30%+.
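Cost: up to roughly 2x retrieval latency if the two searches run sequentially (close to the slower of the two if they run in parallel), and little extra storage (sparse indexes are small relative to dense vectors).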
Any production RAG system should use hybrid retrieval; dense-only is a prototype-stage choice. The engineering cost is modest, the quality gain is real, and the failure modes are asymmetric: hybrid handles the edge cases (rare terms, IDs, codes) that pure dense retrieval can't.