BM25 and sparse retrieval

BM25 is a classical keyword-based retrieval algorithm, older than most of your AI infrastructure. It's also essential in modern RAG. The reason: dense embeddings have blind spots that BM25 fills in cleanly. A production RAG system that isn't using BM25 somewhere is usually leaving retrieval quality on the floor, often on the order of 10-20% on keyword-heavy queries.

How BM25 works

BM25 scores how well a query matches a document based on:

- Term frequency (TF): how often each query term appears in the document, with diminishing returns controlled by the k1 parameter.
- Inverse document frequency (IDF): rare terms count for more than common ones.
- Length normalization: longer documents are penalized so they can't win on raw term counts alone, controlled by the b parameter.

The score rewards documents that have the query's rare terms at reasonable frequency, normalized by length.
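
The scoring can be made concrete with a minimal in-memory implementation (a sketch of Okapi BM25 with common defaults k1=1.5, b=0.75; the corpus and function name are illustrative, not from any library):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency: how many docs contain each term
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            # Rare terms get high IDF; common terms get low IDF
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            # TF with diminishing returns, normalized by doc length
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores

docs = [
    "sparse retrieval with bm25".split(),
    "dense embeddings for semantic search".split(),
    "bm25 is a keyword ranking function".split(),
]
print(bm25_scores("bm25 ranking".split(), docs))
```

Note that the second document scores exactly zero: no shared tokens means no score, which is precisely the blind spot dense retrieval covers.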

When BM25 wins over dense

- Exact matches on rare tokens: IDs, SKUs, error codes, function names, version strings.
- Out-of-vocabulary and domain terms the embedding model never saw during training.
- Precise keyword queries where the user typed exactly the words they want back.

When dense wins over BM25

- Paraphrases and synonyms: "car" and "automobile" share no tokens but mean the same thing.
- Semantic or conceptual queries with little lexical overlap with the answer text.
- Cross-lingual retrieval, if the embedding model is multilingual.

BM25 in production

Elasticsearch / OpenSearch

Gold standard for BM25. Full-text search with tokenization, stemming, analyzers, boosting. If you're serious about search, this is a natural fit.

Postgres full-text

Reasonable keyword search via tsvector and ts_rank_cd, though note this is not true BM25: ts_rank_cd uses cover-density ranking, with no IDF weighting or tunable length normalization. Good when you already have Postgres and don't need Elasticsearch-level search features.

Native in vector DBs

Weaviate, Qdrant (via sparse vectors), Vespa, Pinecone (sparse vectors), Milvus all support BM25 or similar sparse search natively. Removes the need for a separate search system.

Library: rank-bm25 (Python)

In-memory BM25 for small corpora or prototyping. Not production-scale.

Tokenization matters

BM25 quality depends heavily on tokenization:

- Lowercasing and punctuation handling decide whether "BM25," matches "bm25".
- Stemming or lemmatization ("retrieval" vs "retrieve") trades recall against precision.
- Stopword removal shrinks the index but can hurt phrase-like queries.
- Hyphenated terms, code identifiers, and compound words need deliberate treatment.

Your BM25 quality is capped by tokenization choices. Default English tokenizers work for most cases. Specialized domains (law, medicine, chemistry) often need custom tokenizers.
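
As a small illustration, even a minimal normalization step changes what BM25 can match (a stdlib-only sketch; a real analyzer would add stemming, e.g. via Snowball):

```python
import re

def tokenize(text):
    """Lowercase, replace punctuation (except hyphens) with spaces, split."""
    text = text.lower()
    text = re.sub(r"[^\w\s-]", " ", text)  # keep word chars, spaces, hyphens
    return text.split()

# Without this, "BM25," and "bm25" would index as distinct terms
# and never match each other.
print(tokenize("BM25, Okapi-style ranking!"))
```

The hyphen decision alone matters: keeping "okapi-style" as one token means the query "okapi" won't match it, while splitting it loses the compound.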

BM25F and field-weighted search

BM25F extends BM25 to weight different fields differently. A match in the title might be worth 3x a match in the body. For documents with structure (title, abstract, body), this is valuable.

Elasticsearch supports this via multi_match queries. Many vector DBs don't, which is another reason to consider ES/OS for structured text search.
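
A field-weighted query in the Elasticsearch Query DSL looks roughly like this (a sketch as a Python dict; the index fields and boost values are illustrative, and the `^3` suffix boosts title matches 3x):

```python
# Field-weighted multi_match query body (illustrative fields/boosts).
# "most_fields" sums per-field scores; "cross_fields" blends term
# statistics across fields, which is closer in spirit to BM25F.
query = {
    "query": {
        "multi_match": {
            "query": "sparse retrieval",
            "fields": ["title^3", "abstract^2", "body"],
            "type": "most_fields",
        }
    }
}
print(query["query"]["multi_match"]["fields"])
```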

Learned sparse: SPLADE

A newer approach: use a transformer to learn sparse vector representations. Each token contributes to a high-dimensional sparse vector, with the model learning which tokens matter and expanding queries with learned synonyms.
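
At query time, scoring with learned sparse vectors is still just a sparse dot product. A toy sketch, where the "model outputs" are hand-made token→weight dicts (the weights and the expansion term are invented for illustration):

```python
def sparse_dot(query_vec, doc_vec):
    """Dot product of two sparse token->weight vectors."""
    # Iterate the smaller dict for efficiency
    if len(query_vec) > len(doc_vec):
        query_vec, doc_vec = doc_vec, query_vec
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)

# Hypothetical SPLADE-style output: the encoder expanded the query
# "car" with the related term "vehicle" at a lower learned weight.
query_vec = {"car": 1.8, "vehicle": 0.6}
doc_vec = {"vehicle": 1.2, "sale": 0.9, "used": 0.7}
print(sparse_dot(query_vec, doc_vec))
```

Plain BM25 would score this pair zero (no shared tokens); the learned expansion is what lets the sparse representation bridge the vocabulary gap.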

Benefits:

- Keeps exact-match behavior while adding learned term expansion (synonyms, morphology).
- Sparse vectors stay interpretable: you can inspect which tokens carry the score.
- Typically outperforms plain BM25 on retrieval benchmarks.

Drawbacks:

- Requires a transformer forward pass at indexing time (and usually query time), so GPU cost returns.
- Larger indexes than plain BM25, since documents are expanded with extra terms.
- Another model to version, deploy, and keep in sync with your index.

Models: SPLADE v3, naver/splade. Supported by recent versions of Qdrant, Vespa, and OpenSearch.

Hybrid is the answer

In almost every production RAG system, BM25 + dense hybrid outperforms either alone. The sparse and dense vectors capture complementary signal. The fusion (RRF or similar) combines them cleanly. See hybrid retrieval.
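
Reciprocal rank fusion (RRF) is only a few lines (a sketch with the conventional k=60; inputs are ranked lists of doc ids, one list per retriever):

```python
def rrf(rankings, k=60):
    """Fuse ranked lists: each doc scores the sum of 1/(k + rank) per list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d3", "d1", "d7"]   # BM25's ranking
dense_top = ["d1", "d9", "d3"]  # dense retriever's ranking
print(rrf([bm25_top, dense_top]))
```

Because RRF only uses ranks, not raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.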

The old-school lesson

BM25 is fast, predictable, debuggable, and free of GPU dependencies. When your vector search is broken, your metrics are confusing, or your embeddings go stale, BM25 still works. Keep it in the stack as a fallback if nothing else.

Next: Hybrid retrieval.