Vector similarity search

Vector similarity search is the baseline retrieval operation in RAG: embed the query, find the nearest neighbors in your index, return the top-k. It's deceptively simple. The nuances are in top-k choice, similarity metrics, and what you do with the results.

The basic flow

  1. User submits query
  2. Embed query with the same model used for the corpus
  3. Search vector index for top-k nearest neighbors
  4. Return chunks, with scores, for downstream use (rerank, generation)
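The four steps above can be sketched end-to-end. This is a minimal toy, not a production implementation: the hand-made 3-dimensional vectors stand in for real embeddings, and the plain Python list stands in for a vector index (a real system would call your embedding model and a vector DB).

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, index, top_k=3):
    # index: list of (chunk_text, vector) pairs
    scored = [(chunk, cosine(query_vec, vec)) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# toy corpus vectors stand in for real embeddings
index = [
    ("chunk about dogs", [0.9, 0.1, 0.0]),
    ("chunk about cats", [0.8, 0.3, 0.1]),
    ("chunk about tax law", [0.0, 0.1, 0.9]),
]
results = search([1.0, 0.0, 0.0], index, top_k=2)
# results: [("chunk about dogs", ...), ("chunk about cats", ...)]
```

Note the brute-force linear scan: fine for a toy, but real indexes use approximate nearest-neighbor structures (HNSW, IVF) to make step 3 sublinear.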

Top-k choice

How many chunks to retrieve? Tradeoffs:

  - Higher k: better recall, but more noise in the context, more tokens, more latency
  - Lower k: tighter, cheaper context, but relevant chunks may miss the cutoff

Typical production values:

  - top-5 to top-10 when results go straight to the generator
  - top-50 to top-100 when a reranker filters the candidates first

The reranker lets you cast a wider net at retrieval, then filter precisely. Without one, you're stuck relying on the embedding model's own ranking, which is noisier.
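The wide-net-then-filter pattern looks something like this sketch. The `overlap` reranker here is a toy word-overlap stand-in for a real cross-encoder, and the numbers are illustrative:

```python
def retrieve_then_rerank(query, scored_chunks, rerank_score, wide_k=50, final_k=5):
    # scored_chunks: (chunk, vector_score) pairs from the vector index.
    # Cast a wide net with the cheap vector score...
    candidates = sorted(scored_chunks, key=lambda c: c[1], reverse=True)[:wide_k]
    # ...then rescore the candidates with the slower, more accurate reranker.
    reranked = sorted(candidates, key=lambda c: rerank_score(query, c[0]), reverse=True)
    return [chunk for chunk, _ in reranked[:final_k]]

# toy reranker: count words shared between query and chunk
def overlap(query, chunk):
    return len(set(query.split()) & set(chunk.split()))

chunks = [
    ("refund policy details", 0.82),
    ("shipping times", 0.91),
    ("refund request form", 0.88),
]
top = retrieve_then_rerank("refund policy", chunks, overlap, wide_k=3, final_k=2)
```

The vector index would have ranked "shipping times" first; the reranker demotes it below the two refund chunks.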

Similarity metrics

Cosine similarity

Measures angle between vectors, ignores magnitude. Most common. Ranges from -1 (opposite) to 1 (identical).

Dot product

When vectors are normalized (length = 1), dot product equals cosine. Many modern embedding models output pre-normalized vectors, so dot product is equivalent and slightly faster.

Euclidean distance

Straight-line distance. Rarely used for text because it's sensitive to magnitude, which doesn't carry semantic meaning in text embeddings.
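The three metrics, and the normalization equivalence claimed above, fit in a few lines of plain Python:

```python
from math import sqrt, isclose

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

a, b = [3.0, 4.0], [4.0, 3.0]

# scaling a vector changes dot product and euclidean distance,
# but cosine similarity ignores magnitude entirely
assert isclose(cosine(a, b), cosine([6.0, 8.0], b))
assert dot([6.0, 8.0], b) != dot(a, b)

# for unit-length vectors, dot product equals cosine similarity
ua, ub = normalize(a), normalize(b)
assert isclose(dot(ua, ub), cosine(a, b))
```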

Match your similarity metric to what the embedding model was trained with. Mismatches degrade retrieval silently.

Score thresholds

Beyond top-k, you can set a minimum score threshold. Chunks below the threshold aren't returned even if they're in top-k.

Useful for:

  - Refusing gracefully when the corpus has nothing relevant (empty results beat confidently wrong context)
  - Keeping marginal chunks out of the prompt when top-k would otherwise pad it with noise
Setting the threshold: run your eval set and compare score distributions for known-good versus known-bad pairs. A workable cutoff usually lands somewhere around 0.6-0.75 cosine similarity, but it varies by embedding model, so derive it from your own distributions rather than copying a number.
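The filter itself is trivial; the value is in picking the cutoff. A minimal sketch (0.65 here is purely illustrative, not a recommendation):

```python
def apply_threshold(results, min_score=0.65):
    # results: (chunk, score) pairs, already sorted by score descending
    return [(chunk, score) for chunk, score in results if score >= min_score]

hits = [("A", 0.84), ("B", 0.71), ("C", 0.58)]
kept = apply_threshold(hits, min_score=0.65)
# "C" is dropped even though it made top-k

if not kept:
    # nothing cleared the bar: surface "no relevant context found"
    # rather than forcing weak chunks into the prompt
    pass
```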

The "lost in the middle" problem

LLMs given long contexts pay more attention to the beginning and end than the middle. If you pass 20 retrieved chunks, the chunks in positions 8-13 often get under-weighted during generation.

Mitigations:

  - Retrieve fewer, better chunks (rerank before generation)
  - Reorder so the strongest chunks sit at the start and end of the context
  - Compress or summarize mid-ranked chunks instead of passing them verbatim
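One common reordering places the strongest chunks at the positions the model attends to most, the start and end, pushing weaker chunks into the under-attended middle. A minimal sketch, assuming `ranked_chunks` is already sorted best-first:

```python
def edge_order(ranked_chunks):
    # alternate chunks between the front and the back of the context,
    # so rank 1 opens the prompt, rank 2 closes it, and the weakest
    # chunks end up in the middle
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ordered = edge_order(["c1", "c2", "c3", "c4", "c5"])
# → ["c1", "c3", "c5", "c4", "c2"]: best chunk first, second-best last
```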

Diversity and MMR

Top-k pure similarity can return 5 chunks that are near-duplicates of each other. The user wanted 5 different perspectives; they got 1 perspective repeated.

Maximum Marginal Relevance (MMR): trade some similarity for diversity. After each chunk is selected, penalize chunks similar to ones already selected.

Most vector DBs support MMR or similar diversification strategies. Worth using when you retrieve from corpora with near-duplicate content.
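The greedy MMR loop is short enough to sketch directly. This toy version takes precomputed similarities as dicts; `lam` is the relevance/diversity tradeoff (1.0 recovers pure similarity ranking):

```python
def mmr_select(query_sim, pairwise_sim, candidates, k=5, lam=0.7):
    # query_sim[c]: similarity of candidate c to the query
    # pairwise_sim[frozenset((a, b))]: similarity between two candidates
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def score(c):
            # penalize candidates similar to anything already selected
            redundancy = max(
                (pairwise_sim[frozenset((c, s))] for s in selected),
                default=0.0,
            )
            return lam * query_sim[c] - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

# A and B are near-duplicates; C is less relevant but adds a new angle
query_sim = {"A": 0.9, "B": 0.85, "C": 0.5}
pairwise_sim = {
    frozenset(("A", "B")): 0.95,
    frozenset(("A", "C")): 0.1,
    frozenset(("B", "C")): 0.1,
}
picked = mmr_select(query_sim, pairwise_sim, ["A", "B", "C"], k=2)
# pure similarity would pick A then B; MMR picks A then C
```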

Multi-vector queries

Some systems retrieve with multiple query variations (the original query plus paraphrases, query decompositions, HyDE outputs) and union the results. See multi-query + fusion.
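The union step needs a way to merge the ranked lists from different query variations. Reciprocal Rank Fusion (RRF) is one common choice, sketched here with toy chunk IDs (k=60 is the conventional constant from the RRF literature):

```python
def rrf_fuse(result_lists, k=60):
    # score each chunk by summing 1 / (k + rank) across every
    # query variation's result list, then sort by fused score
    scores = {}
    for results in result_lists:
        for rank, chunk in enumerate(results, start=1):
            scores[chunk] = scores.get(chunk, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse([
    ["A", "B", "C"],   # original query
    ["B", "D", "A"],   # paraphrase
    ["B", "C", "E"],   # HyDE output
])
# "B" wins: it appears in all three lists, so agreement across
# variations outweighs any single list's ranking
```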

The common first mistake

Teams ship RAG v1 with top-5 vector search, no reranking, and wonder why quality is mediocre. The fix, top-50 retrieval with a reranker, is usually a 10-20% quality improvement for minimal additional latency.

Vector search is the foundation. Everything after it is where the quality wins come from.

Next: BM25 and sparse retrieval.