HNSW, IVF, PQ

Exact nearest-neighbor search over millions of vectors is too slow for real-time use. Vector databases therefore use approximate nearest-neighbor (ANN) algorithms that trade a few percent of recall for 10-100x speedups. The three dominant techniques are HNSW, IVF, and PQ. You don't need to implement them, but you do need to understand the tradeoffs.

HNSW (Hierarchical Navigable Small World)

The dominant algorithm for most production vector databases.

How it works

HNSW builds a multi-layer graph where each vector is a node connected to its neighbors. The top layer has few nodes with long-range connections; lower layers have more nodes with shorter connections. Search starts at the top, greedily follows edges toward the query, and descends layer by layer.
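The search primitive underneath all of this, a greedy graph walk, fits in a few lines. Below is a toy single-layer sketch in NumPy: the graph is built by brute force (which real implementations avoid), there is only one layer, and a single walker stands in for HNSW's ef-sized candidate beam. All sizes are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, M = 2000, 32, 8
data = rng.normal(size=(n, dim)).astype(np.float32)

# Build a single-layer neighbor graph: each node keeps edges to its M
# nearest neighbors (brute force here; HNSW builds this incrementally).
sq = (data ** 2).sum(axis=1)
d2 = sq[:, None] + sq[None, :] - 2.0 * (data @ data.T)
neighbors = np.argsort(d2, axis=1)[:, 1:M + 1]  # column 0 is the node itself

def greedy_search(query, entry=0):
    """Walk the graph: hop to whichever neighbor is closest to the query;
    stop when no neighbor improves on the current node (a local minimum)."""
    cur = entry
    cur_d = np.linalg.norm(data[cur] - query)
    while True:
        cand = neighbors[cur]
        cand_d = np.linalg.norm(data[cand] - query, axis=1)
        best = int(cand_d.argmin())
        if cand_d[best] >= cur_d:
            return cur, cur_d
        cur, cur_d = int(cand[best]), cand_d[best]

query = rng.normal(size=dim).astype(np.float32)
approx_id, approx_d = greedy_search(query)
exact_id = int(np.linalg.norm(data - query, axis=1).argmin())
```

The greedy walk can stall in a local minimum; that failure mode is exactly what HNSW's upper layers (long-range entry points) and ef-sized candidate lists mitigate.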

Strengths

- High recall (often 95-99%) at low query latency
- Supports incremental inserts without retraining
- Robust across data distributions

Weaknesses

- Memory-hungry: the graph and (usually) the raw vectors must live in RAM
- Deletes are awkward; most implementations tombstone deleted nodes, which degrades the graph over time
- Index builds are slow compared to IVF

Key tuning parameters

- M: edges per node. Higher = better recall, more memory.
- efConstruction: candidate-list size during build. Higher = better graph quality, slower build.
- ef (efSearch): candidate-list size during queries. Higher = better recall, slower queries.

IVF (Inverted File)

An older algorithm that is still useful, especially at enormous scale or under tight memory budgets.

How it works

Cluster all vectors into K centroids (via k-means). For each query, find the nearest centroid(s), then exhaustively search only the vectors assigned to those clusters.
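Both phases can be sketched in NumPy. The hand-rolled Lloyd's k-means below stands in for a real library's training routine, and the nlist/nprobe values are arbitrary toy settings:

```python
import numpy as np

rng = np.random.default_rng(1)
n, dim, nlist, nprobe = 5000, 16, 32, 4
data = rng.normal(size=(n, dim)).astype(np.float32)

# Train coarse centroids with a few Lloyd's iterations.
centroids = data[rng.choice(n, nlist, replace=False)].copy()
for _ in range(10):
    assign = np.linalg.norm(data[:, None] - centroids[None], axis=2).argmin(axis=1)
    for c in range(nlist):
        members = data[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)

# Inverted lists: vector ids grouped under their nearest centroid.
assign = np.linalg.norm(data[:, None] - centroids[None], axis=2).argmin(axis=1)
inverted = [np.flatnonzero(assign == c) for c in range(nlist)]

def ivf_search(query, k=5):
    # Probe only the nprobe nearest clusters, then scan their members exactly.
    probe = np.linalg.norm(centroids - query, axis=1).argsort()[:nprobe]
    cand = np.concatenate([inverted[c] for c in probe])
    d = np.linalg.norm(data[cand] - query, axis=1)
    top = d.argsort()[:k]
    return cand[top], d[top]
```

The recall/latency dial is visible directly: raising nprobe scans more inverted lists per query, recovering neighbors that fell just across a cluster boundary at the cost of more distance computations.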

Strengths

- Fast, cheap index build: one k-means run plus assignment
- Little memory overhead beyond the vectors themselves
- Pairs naturally with compression (IVF-PQ) at billion scale

Weaknesses

- Recall suffers near cluster boundaries: a query's true neighbors may sit in clusters you didn't probe
- Requires a training pass, and centroids go stale as the data distribution drifts

Key parameters

- nlist: number of clusters. A common starting point is ~sqrt(N).
- nprobe: clusters searched per query. Higher = better recall, slower queries.

PQ (Product Quantization)

A compression technique, usually combined with IVF or HNSW.

How it works

Split each vector into M sub-vectors. Train a separate codebook per sub-vector position (typically 256 codes per position, so each code fits in one byte). Store each sub-vector as a single byte index into its codebook. A 768-dim float32 vector (3072 bytes) becomes 48 bytes with M = 48: a 64x storage reduction.
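The encode-then-lookup cycle can be sketched in NumPy. The sizes here (M = 4 sub-vectors, 16 codes each) are toy values far below the typical M = 48 with 256 codes, and the per-sub-space k-means is a bare-bones stand-in for real training:

```python
import numpy as np

rng = np.random.default_rng(2)
n, dim, M, ksub = 4000, 32, 4, 16   # toy sizes; production PQ uses e.g. M=48, ksub=256
sub = dim // M
data = rng.normal(size=(n, dim)).astype(np.float32)

# One codebook per sub-vector position, trained with a few Lloyd's steps.
codebooks = np.empty((M, ksub, sub), dtype=np.float32)
codes = np.empty((n, M), dtype=np.uint8)   # 1 byte per sub-vector
for m in range(M):
    chunk = data[:, m * sub:(m + 1) * sub]
    cb = chunk[rng.choice(n, ksub, replace=False)].copy()
    for _ in range(10):
        a = np.linalg.norm(chunk[:, None] - cb[None], axis=2).argmin(axis=1)
        for k in range(ksub):
            if (a == k).any():
                cb[k] = chunk[a == k].mean(axis=0)
    codebooks[m] = cb
    codes[:, m] = np.linalg.norm(chunk[:, None] - cb[None], axis=2).argmin(axis=1)

def adc_distances(query):
    """Asymmetric distance computation: the query stays in float; stored
    vectors are touched only through their byte codes."""
    # Per-sub-space table of squared distances from the query to every codeword.
    table = np.stack([((query[m * sub:(m + 1) * sub] - codebooks[m]) ** 2).sum(axis=1)
                      for m in range(M)])
    # Distance to each stored vector = M table lookups + a sum.
    return table[np.arange(M), codes].sum(axis=1)
```

The lookup-table trick is why PQ is fast as well as small: per query you compute M x ksub codeword distances once, after which every stored vector costs only M table reads.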

Strengths

- Dramatic memory reduction (often 10-100x), letting billion-scale indexes fit in RAM
- Distances come from small precomputed lookup tables, which is fast

Weaknesses

- Lossy: recall drops, and drops faster at higher compression ratios
- Requires a training pass on representative data
- Top results usually need reranking against exact vectors

The common combinations

HNSW alone

Default for most production systems up to ~10M vectors. Highest quality.

IVF-PQ

Cluster + compress. Scales to billions of vectors. Lower recall than HNSW but far cheaper storage. The approach used at Meta and Google scale.

HNSW + PQ

Graph structure for navigation, PQ-compressed vectors for distance calculation. Good balance. Qdrant's default at scale.

Scalar / binary quantization

Simpler than PQ. Each float32 is reduced to an int8 (4x smaller) or a single bit (32x smaller). Combined with HNSW in many modern DBs. Qdrant and Milvus both support binary quantization natively.
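Both variants are simple enough to sketch directly. The global min/max scaling below is one common choice (per-dimension scaling is another); real databases keep the quantized codes alongside, or instead of, the floats:

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(size=(1000, 64)).astype(np.float32)

# Scalar quantization: map each float32 to one uint8 (4x smaller).
lo, hi = float(data.min()), float(data.max())
scale = (hi - lo) / 255.0
codes = np.round((data - lo) / scale).astype(np.uint8)
reconstructed = codes.astype(np.float32) * scale + lo   # error <= scale / 2

# Binary quantization: one bit per dimension (32x smaller than float32);
# 64 dims pack into 8 bytes.
bits = np.packbits((data > 0).astype(np.uint8), axis=1)

def hamming(a, b):
    """Distance between packed binary codes: XOR, then count set bits."""
    return int(np.unpackbits(a ^ b).sum())
```

Binary codes trade the most accuracy for the most speed: Hamming distance over packed bytes is a few XOR/popcount instructions, which is why binary quantization is usually paired with a reranking pass over higher-precision vectors.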

The recall-speed-memory triangle

You get to pick two of three: recall, speed, memory. Tuning parameters move you along the frontier:

- Raising ef (HNSW) or nprobe (IVF) buys recall with query latency.
- Raising M or efConstruction (HNSW) buys recall with memory and build time.
- More aggressive quantization (PQ, scalar, binary) buys memory with recall.

The pragmatic approach

For systems below ~10M vectors, accept the defaults of your vector DB and move on. Index strategy rarely bottlenecks RAG quality at this scale.

Above 10M vectors, budget time to tune. Run your eval set at different parameter settings. Look at recall@10 vs query latency. Find the parameter sweet spot for your workload.
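Measuring that frontier needs only a brute-force ground truth and a recall metric. A sketch, with a random-subset "index" standing in for whichever ANN configuration you are actually sweeping:

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(size=(2000, 16)).astype(np.float32)
queries = rng.normal(size=(50, 16)).astype(np.float32)

def exact_topk(q, k=10):
    """Brute-force ground truth: fine offline, too slow for serving."""
    return np.linalg.norm(data - q, axis=1).argsort()[:k]

def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the true top-k that the approximate search returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

# Stand-in "ANN index": searches a random half of the data. Swap in your
# real index at each parameter setting, and record latency alongside recall.
subset = rng.choice(len(data), len(data) // 2, replace=False)

def ann_topk(q, k=10):
    d = np.linalg.norm(data[subset] - q, axis=1)
    return subset[d.argsort()[:k]]

mean_recall = float(np.mean([recall_at_k(ann_topk(q), exact_topk(q))
                             for q in queries]))
```

Compute ground truth once, cache it, and reuse it across every parameter sweep; only the approximate side needs rerunning.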

Above 100M vectors, index strategy is a serious engineering effort. Expect to iterate on quantization and sharding over months.

When to rebuild

Indexes don't handle updates gracefully forever:

- HNSW: deletes are typically tombstoned; after enough churn, graph quality and recall degrade. Rebuild when the deleted fraction grows large or measured recall drops.
- IVF: centroids are trained once; as the data distribution drifts, cluster assignments get worse. Retrain and rebuild periodically.
- PQ: codebooks trained on old data mis-quantize new data after a distribution shift, so they need retraining too.

Next: Hybrid search.