Vector similarity search

Vector similarity search is the baseline retrieval operation in RAG: embed the query, find the nearest neighbors in your index, return the top-k. It's deceptively simple. The nuances are in top-k choice, similarity metrics, and what you do with the results.

The basic flow

  1. User submits query
  2. Embed query with the same model used for the corpus
  3. Search vector index for top-k nearest neighbors
  4. Return chunks, with scores, for downstream use (rerank, generation)
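The four steps above can be sketched end-to-end. This is a minimal toy, not a production implementation: the hand-made 3-dimensional vectors stand in for real embeddings, and the plain Python list stands in for a vector index (a real system would call your embedding model and a vector DB).

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, index, top_k=3):
    # index: list of (chunk_text, vector) pairs
    scored = [(chunk, cosine(query_vec, vec)) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# toy corpus vectors stand in for real embeddings
index = [
    ("chunk about dogs", [0.9, 0.1, 0.0]),
    ("chunk about cats", [0.8, 0.3, 0.1]),
    ("chunk about tax law", [0.0, 0.1, 0.9]),
]
results = search([1.0, 0.0, 0.0], index, top_k=2)
# results: [("chunk about dogs", ...), ("chunk about cats", ...)]
```

Note the brute-force linear scan: fine for a toy, but real indexes use approximate nearest-neighbor structures (HNSW, IVF) to make step 3 sublinear.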

Top-k choice

How many chunks to retrieve? Tradeoffs:

  - Higher k: better recall, but more noise in the context, more tokens, more latency
  - Lower k: tighter, cheaper context, but relevant chunks may miss the cutoff

Typical production values:

  - top-5 to top-10 when results go straight to the generator
  - top-50 to top-100 when a reranker filters the candidates first

The reranker lets you cast a wider net at retrieval, then filter precisely. Without one, you're stuck relying on the embedding model's own ranking, which is noisier.
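The wide-net-then-filter pattern looks something like this sketch. The `overlap` reranker here is a toy word-overlap stand-in for a real cross-encoder, and the numbers are illustrative:

```python
def retrieve_then_rerank(query, scored_chunks, rerank_score, wide_k=50, final_k=5):
    # scored_chunks: (chunk, vector_score) pairs from the vector index.
    # Cast a wide net with the cheap vector score...
    candidates = sorted(scored_chunks, key=lambda c: c[1], reverse=True)[:wide_k]
    # ...then rescore the candidates with the slower, more accurate reranker.
    reranked = sorted(candidates, key=lambda c: rerank_score(query, c[0]), reverse=True)
    return [chunk for chunk, _ in reranked[:final_k]]

# toy reranker: count words shared between query and chunk
def overlap(query, chunk):
    return len(set(query.split()) & set(chunk.split()))

chunks = [
    ("refund policy details", 0.82),
    ("shipping times", 0.91),
    ("refund request form", 0.88),
]
top = retrieve_then_rerank("refund policy", chunks, overlap, wide_k=3, final_k=2)
```

The vector index would have ranked "shipping times" first; the reranker demotes it below the two refund chunks.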

Similarity metrics

Cosine similarity

Measures angle between vectors, ignores magnitude. Most common. Ranges from -1 (opposite) to 1 (identical).

Dot product

When vectors are normalized (length = 1), dot product equals cosine. Many modern embedding models output pre-normalized vectors, so dot product is equivalent and slightly faster.

Euclidean distance

Straight-line distance. Rarely used for text because it's sensitive to magnitude, which doesn't carry semantic meaning in text embeddings.
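The three metrics, and the normalization equivalence claimed above, fit in a few lines of plain Python:

```python
from math import sqrt, isclose

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

a, b = [3.0, 4.0], [4.0, 3.0]

# scaling a vector changes dot product and euclidean distance,
# but cosine similarity ignores magnitude entirely
assert isclose(cosine(a, b), cosine([6.0, 8.0], b))
assert dot([6.0, 8.0], b) != dot(a, b)

# for unit-length vectors, dot product equals cosine similarity
ua, ub = normalize(a), normalize(b)
assert isclose(dot(ua, ub), cosine(a, b))
```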

Match your similarity metric to what the embedding model was trained with. Mismatches degrade retrieval silently.

Score thresholds

Beyond top-k, you can set a minimum score threshold. Chunks below the threshold aren't returned even if they're in top-k.

Useful for:

  - Refusing gracefully when the corpus has nothing relevant (empty results beat confidently wrong context)
  - Keeping marginal chunks out of the prompt when top-k would otherwise pad it with noise
Setting the threshold: run your eval set and compare score distributions for known-good versus known-bad pairs. A workable cutoff usually lands somewhere around 0.6-0.75 cosine similarity, but it varies by embedding model, so derive it from your own distributions rather than copying a number.
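The filter itself is trivial; the value is in picking the cutoff. A minimal sketch (0.65 here is purely illustrative, not a recommendation):

```python
def apply_threshold(results, min_score=0.65):
    # results: (chunk, score) pairs, already sorted by score descending
    return [(chunk, score) for chunk, score in results if score >= min_score]

hits = [("A", 0.84), ("B", 0.71), ("C", 0.58)]
kept = apply_threshold(hits, min_score=0.65)
# "C" is dropped even though it made top-k

if not kept:
    # nothing cleared the bar: surface "no relevant context found"
    # rather than forcing weak chunks into the prompt
    pass
```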

The "lost in the middle" problem

LLMs given long contexts pay more attention to the beginning and end than the middle. If you pass 20 retrieved chunks, the chunks in positions 8-13 often get under-weighted during generation.

Mitigations:

  - Retrieve fewer, better chunks (rerank before generation)
  - Reorder so the strongest chunks sit at the start and end of the context
  - Compress or summarize mid-ranked chunks instead of passing them verbatim
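One common reordering places the strongest chunks at the positions the model attends to most, the start and end, pushing weaker chunks into the under-attended middle. A minimal sketch, assuming `ranked_chunks` is already sorted best-first:

```python
def edge_order(ranked_chunks):
    # alternate chunks between the front and the back of the context,
    # so rank 1 opens the prompt, rank 2 closes it, and the weakest
    # chunks end up in the middle
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ordered = edge_order(["c1", "c2", "c3", "c4", "c5"])
# → ["c1", "c3", "c5", "c4", "c2"]: best chunk first, second-best last
```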

Diversity and MMR

Top-k pure similarity can return 5 chunks that are near-duplicates of each other. The user wanted 5 different perspectives; they got 1 perspective repeated.

Maximum Marginal Relevance (MMR): trade some similarity for diversity. After each chunk is selected, penalize chunks similar to ones already selected.

Most vector DBs support MMR or similar diversification strategies. Worth using when you retrieve from corpora with near-duplicate content.
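The greedy MMR loop is short enough to sketch directly. This toy version takes precomputed similarities as dicts; `lam` is the relevance/diversity tradeoff (1.0 recovers pure similarity ranking):

```python
def mmr_select(query_sim, pairwise_sim, candidates, k=5, lam=0.7):
    # query_sim[c]: similarity of candidate c to the query
    # pairwise_sim[frozenset((a, b))]: similarity between two candidates
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def score(c):
            # penalize candidates similar to anything already selected
            redundancy = max(
                (pairwise_sim[frozenset((c, s))] for s in selected),
                default=0.0,
            )
            return lam * query_sim[c] - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

# A and B are near-duplicates; C is less relevant but adds a new angle
query_sim = {"A": 0.9, "B": 0.85, "C": 0.5}
pairwise_sim = {
    frozenset(("A", "B")): 0.95,
    frozenset(("A", "C")): 0.1,
    frozenset(("B", "C")): 0.1,
}
picked = mmr_select(query_sim, pairwise_sim, ["A", "B", "C"], k=2)
# pure similarity would pick A then B; MMR picks A then C
```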

Multi-vector queries

Some systems retrieve with multiple query variations (the original query plus paraphrases, query decompositions, HyDE outputs) and union the results. See multi-query + fusion.
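The union step needs a way to merge the ranked lists from different query variations. Reciprocal Rank Fusion (RRF) is one common choice, sketched here with toy chunk IDs (k=60 is the conventional constant from the RRF literature):

```python
def rrf_fuse(result_lists, k=60):
    # score each chunk by summing 1 / (k + rank) across every
    # query variation's result list, then sort by fused score
    scores = {}
    for results in result_lists:
        for rank, chunk in enumerate(results, start=1):
            scores[chunk] = scores.get(chunk, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse([
    ["A", "B", "C"],   # original query
    ["B", "D", "A"],   # paraphrase
    ["B", "C", "E"],   # HyDE output
])
# "B" wins: it appears in all three lists, so agreement across
# variations outweighs any single list's ranking
```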

The common first mistake

Teams ship RAG v1 with top-5 vector search, no reranking, and wonder why quality is mediocre. The fix, top-50 retrieval with a reranker, is usually a 10-20% quality improvement for minimal additional latency.

Vector search is the foundation. Everything after it is where the quality wins come from.

Next: BM25 and sparse retrieval.