Exact nearest-neighbor search over millions of vectors is too slow for real-time use. Vector databases instead use approximate nearest-neighbor (ANN) algorithms that trade a few percent of recall for 10-100x speedups. The three dominant techniques are HNSW, IVF, and PQ. You don't need to implement them, but you do need to understand the tradeoffs.
HNSW (Hierarchical Navigable Small World) is the dominant algorithm in most production vector databases.
HNSW builds a multi-layer graph where each vector is a node connected to its neighbors. The top layer has few nodes with long-range connections; lower layers have more nodes with shorter connections. Search starts at the top, greedily follows edges toward the query, and descends layer by layer.
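The per-layer step can be sketched as a best-first graph search. This is a single-layer toy, not a full HNSW implementation: real HNSW runs this on each layer, feeding the result of one layer in as the entry point of the next. The function and variable names here are illustrative, not from any library.

```python
import heapq
import numpy as np

def greedy_search(vectors, neighbors, query, entry, ef):
    """Best-first search over a proximity graph (single-layer sketch).

    `neighbors[i]` lists the graph edges of node i. `ef` caps how many
    candidate results we keep, which is exactly the ef_search knob.
    """
    dist = lambda i: float(np.linalg.norm(vectors[i] - query))
    visited = {entry}
    frontier = [(dist(entry), entry)]      # min-heap: closest unexplored first
    best = [(-dist(entry), entry)]         # max-heap (negated): worst of top-ef on top
    while frontier:
        d, node = heapq.heappop(frontier)
        # Stop once the closest frontier node is worse than our worst result.
        if len(best) >= ef and d > -best[0][0]:
            break
        for nb in neighbors[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(nb)
            if len(best) < ef or d_nb < -best[0][0]:
                heapq.heappush(frontier, (d_nb, nb))
                heapq.heappush(best, (-d_nb, nb))
                if len(best) > ef:
                    heapq.heappop(best)    # drop the current worst result
    return sorted((-d, i) for d, i in best)  # [(distance, node_id), ...] ascending
```

Even on a bad graph (here, a simple chain), the search walks edge by edge toward the query; the layered structure in real HNSW exists precisely so the walk starts with long hops instead of crawling the chain.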
- M: max connections per node. Higher = better recall, more memory. Typical: 16-32.
- ef_construction: how many candidates to explore during index build. Higher = slower build, better index. Typical: 100-200.
- ef_search: how many candidates to explore during query. Higher = better recall, slower queries. Typical: 50-200.

IVF (Inverted File Index) is an older algorithm, still useful at enormous scale or when memory-constrained.
Cluster all vectors into K centroids (via k-means). For each query, find the nearest centroid(s), then exhaustively search only the vectors assigned to those clusters.
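That two-phase structure fits in a few lines. This is a toy sketch with a naive k-means, not production code (faiss implements this as IndexIVFFlat); all names here are made up for illustration.

```python
import numpy as np

def build_ivf(vectors, nlist, iters=10, seed=0):
    """Toy IVF index: k-means centroids plus inverted lists."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), nlist, replace=False)].copy()
    for _ in range(iters):
        # Assign every vector to its nearest centroid, then recompute centroids.
        assign = np.argmin(((vectors[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for k in range(nlist):
            if (assign == k).any():
                centroids[k] = vectors[assign == k].mean(axis=0)
    lists = {k: np.flatnonzero(assign == k) for k in range(nlist)}
    return centroids, lists

def ivf_search(vectors, centroids, lists, query, nprobe, k=1):
    """Exhaustively scan only the nprobe clusters nearest the query."""
    order = np.argsort(((centroids - query) ** 2).sum(-1))[:nprobe]
    cand = np.concatenate([lists[c] for c in order])
    d = ((vectors[cand] - query) ** 2).sum(-1)
    return cand[np.argsort(d)[:k]]
```

Note the failure mode this makes visible: with nprobe=1, a query near a cluster boundary can miss its true nearest neighbor because that neighbor sits in an unprobed cluster. Raising nprobe recovers recall at the cost of scanning more vectors.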
- nlist: number of clusters. Rule of thumb: sqrt(N) for N vectors.
- nprobe: how many clusters to search per query. Higher = better recall, slower.

Product quantization (PQ) is a compression technique, usually combined with IVF or HNSW.
Split each vector into M sub-vectors. Train a separate codebook per sub-vector position (typically 256 codes per position, so each code fits in one byte). Store each sub-vector as a single byte index into its codebook. A 768-dim float32 vector (3,072 bytes) becomes 48 bytes with M=48: a 64x storage reduction.
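The encode/decode path can be sketched directly. This is a toy (naive k-means, no optimized distance tables; faiss's IndexPQ does the real work), and all function names are illustrative.

```python
import numpy as np

def pq_train(vectors, M, ncodes=256, iters=10, seed=0):
    """Train one codebook per sub-vector position. Assumes D divisible by M."""
    N, D = vectors.shape
    d = D // M
    rng = np.random.default_rng(seed)
    books = []
    for m in range(M):
        sub = vectors[:, m * d:(m + 1) * d]
        cb = sub[rng.choice(N, min(ncodes, N), replace=False)].copy()
        for _ in range(iters):
            assign = np.argmin(((sub[:, None] - cb[None]) ** 2).sum(-1), axis=1)
            for k in range(len(cb)):
                if (assign == k).any():
                    cb[k] = sub[assign == k].mean(axis=0)
        books.append(cb)
    return books

def pq_encode(vectors, books):
    """Each vector becomes M one-byte codebook indices."""
    d = books[0].shape[1]
    codes = [np.argmin(((vectors[:, m * d:(m + 1) * d][:, None] - cb[None]) ** 2).sum(-1), axis=1)
             for m, cb in enumerate(books)]
    return np.stack(codes, axis=1).astype(np.uint8)

def pq_decode(codes, books):
    """Lossy reconstruction: concatenate the chosen codebook entries."""
    return np.hstack([cb[codes[:, m]] for m, cb in enumerate(books)])
```

The decoded vector is an approximation; real implementations never decode at query time, instead precomputing a query-to-codebook distance table per sub-space and summing M table lookups per candidate.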
HNSW alone: the default for most production systems up to ~10M vectors. Highest quality.
IVF + PQ: cluster, then compress. Scales to billions of vectors. Lower recall than HNSW but much cheaper storage. Used at Meta and Google scale.
HNSW + PQ: graph structure for navigation, PQ-compressed vectors for distance calculation. A good balance; Qdrant's default at scale.
Scalar and binary quantization: simpler than PQ. Each float is reduced to an int8 (scalar) or a single bit (binary). Combined with HNSW in many modern DBs; Qdrant and Milvus both support binary quantization natively.
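A minimal int8 sketch shows why scalar quantization is so cheap: one scale factor and a round. Real systems typically keep per-dimension or per-vector scales and compute distances directly on the int8 values; this global-scale version is just for illustration.

```python
import numpy as np

def sq_encode(vectors):
    """Symmetric int8 scalar quantization with a single global scale."""
    scale = np.abs(vectors).max() / 127.0
    q = np.clip(np.round(vectors / scale), -127, 127).astype(np.int8)
    return q, scale

def sq_decode(q, scale):
    """Lossy reconstruction; error per component is at most scale / 2."""
    return q.astype(np.float32) * scale
```

This gives 4x compression versus float32 (binary quantization pushes that to 32x by keeping only the sign bit and comparing with Hamming distance), at the cost of a bounded rounding error per dimension.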
You get to pick two of three: recall, speed, memory. Tuning parameters move you along that frontier.
For systems below ~10M vectors, accept the defaults of your vector DB and move on. Index strategy rarely bottlenecks RAG quality at this scale.
Above 10M vectors, budget time to tune. Run your eval set at different parameter settings. Look at recall@10 vs query latency. Find the parameter sweet spot for your workload.
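The evaluation loop itself is simple: compute exact ground truth once by brute force, then score each parameter setting against it. The helpers below are hypothetical names, and the sweep over your index's parameters is left as a comment since it depends on your DB's API.

```python
import numpy as np

def exact_topk(vectors, queries, k=10):
    """Brute-force ground truth: true top-k ids per query."""
    d = ((queries[:, None] - vectors[None]) ** 2).sum(-1)  # (Q, N)
    return np.argsort(d, axis=1)[:, :k]

def recall_at_k(approx_ids, exact_ids, k=10):
    """Mean fraction of the true top-k that the ANN search returned."""
    hits = [len(set(a[:k]) & set(e[:k])) for a, e in zip(approx_ids, exact_ids)]
    return float(np.mean(hits)) / k

# Sweep sketch (pseudo; substitute your DB's client and knob):
# gt = exact_topk(corpus, eval_queries, k=10)
# for ef in (32, 64, 128, 256):
#     ids, latency = timed_search(index, eval_queries, ef_search=ef, k=10)
#     print(ef, recall_at_k(ids, gt), latency)
```

Plot recall@10 against latency for each setting and pick the knee of the curve; past it, extra latency buys almost no recall.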
Above 100M vectors, index strategy is a serious engineering effort. Expect to iterate on quantization and sharding over months.
Indexes don't update gracefully forever: deletions are often handled as tombstones and incremental inserts slowly degrade graph quality, so plan for periodic rebuilds.
Next: Hybrid search.