Vector similarity search is the baseline retrieval operation in RAG: embed the query, find the nearest neighbors in your index, return the top-k. It's deceptively simple. The nuances are in top-k choice, similarity metrics, and what you do with the results.
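The whole baseline fits in a few lines. A minimal sketch in plain Python, assuming a brute-force in-memory index of `(chunk_id, vector)` pairs and cosine similarity (a real system would use a vector DB and an ANN index, but the logic is the same):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product over the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, index, k=5):
    # index: list of (chunk_id, vector) pairs; return the k best-scoring chunks.
    scored = [(cosine(query_vec, vec), cid) for cid, vec in index]
    scored.sort(reverse=True)
    return scored[:k]

index = [("a", [1.0, 0.0]), ("b", [0.9, 0.1]), ("c", [0.0, 1.0])]
print(top_k([1.0, 0.0], index, k=2))  # "a" and "b" beat the orthogonal "c"
```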
How many chunks to retrieve? The tradeoff: a larger k raises the odds the answer is somewhere in the retrieved set, but adds noise, latency, and token cost; a smaller k is cheap and focused but risks missing the relevant chunk entirely. Typical production values: top-5 to top-10 when results go straight to the LLM, and top-50 or more when a reranker filters the candidates afterward.
The reranker lets you cast a wider net at retrieval, then filter precisely. Without one, you're stuck relying on the embedding model's own ranking, which is noisier.
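The wide-net pattern is a thin wrapper around the two stages. A sketch, where `vector_search` and `rerank_score` are hypothetical stand-ins for your vector DB query and a cross-encoder scoring call:

```python
def retrieve_then_rerank(query, vector_search, rerank_score, wide_k=50, final_k=5):
    # Stage 1: cheap, wide retrieval from the vector index.
    candidates = vector_search(query, k=wide_k)  # [(chunk_id, text), ...]
    # Stage 2: precise (but slower) reranking pass over the candidates.
    ranked = sorted(candidates, key=lambda c: rerank_score(query, c), reverse=True)
    return ranked[:final_k]
```

The reranker only ever sees `wide_k` chunks, so its per-query cost stays bounded no matter how large the corpus is.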
Cosine similarity measures the angle between vectors, ignoring magnitude. It's the most common metric for text. Scores range from -1 (opposite) to 1 (identical).
Dot product: when vectors are normalized (length = 1), the dot product equals cosine similarity. Many modern embedding models output pre-normalized vectors, so dot product is equivalent and slightly faster.
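The equivalence is easy to verify by hand (a toy check in plain Python; in practice you'd just set the metric on your vector DB):

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a, b = normalize([3.0, 4.0]), normalize([4.0, 3.0])
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
# For unit-length vectors the denominator is 1, so cosine == dot product.
print(abs(cosine - dot(a, b)) < 1e-12)  # True
```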
Euclidean distance: straight-line distance between vectors. Rarely used for text because it's sensitive to magnitude, which doesn't carry semantic meaning in text embeddings.
Match your similarity metric to what the embedding model was trained with. Mismatches degrade retrieval silently.
Beyond top-k, you can set a minimum score threshold. Chunks below the threshold aren't returned even if they're in top-k.
Useful for detecting queries with no good match in the corpus (so the system can abstain instead of passing weak chunks to the LLM) and for trimming low-quality tails from the retrieved set.
Setting the threshold: run your eval set, look at score distributions for known-good vs known-bad pairs. Threshold is usually somewhere around 0.6-0.75 cosine for decent matches.
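A minimal sketch of threshold filtering (the 0.65 cutoff here is a hypothetical value inside that 0.6-0.75 band; tune it on your own eval set):

```python
def filter_by_threshold(results, min_score=0.65):
    # results: [(score, chunk_id)] from vector search, sorted best-first.
    # Drop weak matches; if nothing clears the bar, the caller can abstain.
    return [(s, c) for s, c in results if s >= min_score]

hits = [(0.82, "a"), (0.71, "b"), (0.48, "c")]
print(filter_by_threshold(hits))  # the 0.48 chunk is dropped
```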
LLMs given long contexts pay more attention to the beginning and end than the middle. If you pass 20 retrieved chunks, the chunks in positions 8-13 often get under-weighted during generation.
Mitigations: reorder so the highest-scoring chunks sit at the beginning and end of the context, pass fewer but better chunks (rerank before generation), or both.
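The reordering mitigation can be sketched as a simple interleave, assuming chunks arrive sorted best-first (this is one common scheme, not any particular library's implementation):

```python
def reorder_for_middle_loss(chunks):
    # chunks sorted best-first; alternate them between the front and the back,
    # so the strongest chunks land at the context's edges and the weakest
    # end up in the under-attended middle.
    front, back = [], []
    for i, chunk in enumerate(chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

print(reorder_for_middle_loss(["c1", "c2", "c3", "c4", "c5"]))
# → ['c1', 'c3', 'c5', 'c4', 'c2']: best chunk first, second-best last
```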
Pure top-k similarity can return 5 chunks that are near-duplicates of each other. The user wanted 5 different perspectives; they got 1 perspective repeated.
Maximum Marginal Relevance (MMR): trade some similarity for diversity. After each chunk is selected, penalize chunks similar to ones already selected.
Most vector DBs support MMR or similar diversification strategies. Worth using when you retrieve from corpora with near-duplicate content.
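A minimal MMR sketch in plain Python, assuming cosine similarity and a `lam` weight where 1.0 means pure relevance and 0.0 means pure diversity:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mmr(query_vec, candidates, k=3, lam=0.5):
    # candidates: list of (chunk_id, vector) pairs.
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def score(item):
            _, vec = item
            relevance = cosine(query_vec, vec)
            # Redundancy: similarity to the closest already-selected chunk.
            redundancy = max((cosine(vec, sv) for _, sv in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [cid for cid, _ in selected]

query = [1.0, 0.0]
cands = [("a", [1.0, 0.0]), ("a_dup", [0.98, 0.2]), ("b", [0.8, 0.6])]
print(mmr(query, cands, k=2, lam=0.3))  # → ['a', 'b']: the near-duplicate loses
```

With pure similarity (`lam=1.0`) the same call returns the near-duplicate pair; the diversity penalty is what swaps in the distinct chunk.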
Some systems retrieve with multiple query variations (the original query plus paraphrases, query decompositions, HyDE outputs) and union the results. See multi-query + fusion.
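One common way to union those per-variation result lists is reciprocal rank fusion (RRF); a sketch, where `k=60` is the conventional damping constant:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    # result_lists: one ranked list of chunk ids per query variation.
    # Each appearance contributes 1 / (k + rank); chunks ranked well by
    # several variations accumulate the highest fused scores.
    scores = {}
    for results in result_lists:
        for rank, cid in enumerate(results, start=1):
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

runs = [["a", "b", "c"], ["b", "a", "d"], ["b", "c", "e"]]
print(reciprocal_rank_fusion(runs))  # "b" wins: ranked by all three variations
```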
Teams ship RAG v1 with top-5 vector search, no reranking, and wonder why quality is mediocre. The fix, top-50 retrieval with a reranker, is usually a 10-20% quality improvement for minimal additional latency.
Vector search is the foundation. Everything after it is where the quality wins come from.
Next: BM25 and sparse retrieval.