Exact nearest-neighbor search over millions of vectors is too slow for real-time use. Vector databases instead use approximate nearest-neighbor (ANN) algorithms that trade a few percent of recall for 10-100x speedups. The three dominant techniques are HNSW, IVF, and PQ. You don't need to implement them, but you do need to understand the tradeoffs.
HNSW (Hierarchical Navigable Small World) is the dominant algorithm in most production vector databases.
HNSW builds a multi-layer graph where each vector is a node connected to its neighbors. The top layer has few nodes with long-range connections; lower layers have more nodes with shorter connections. Search starts at the top, greedily follows edges toward the query, and descends layer by layer.
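The per-layer step can be sketched as a best-first graph search. This is a single-layer toy, not a full HNSW implementation: real HNSW runs this on each layer, feeding the result of one layer in as the entry point of the next. The function and variable names here are illustrative, not from any library.

```python
import heapq
import numpy as np

def greedy_search(vectors, neighbors, query, entry, ef):
    """Best-first search over a proximity graph (single-layer sketch).

    `neighbors[i]` lists the graph edges of node i. `ef` caps how many
    candidate results we keep, which is exactly the ef_search knob.
    """
    dist = lambda i: float(np.linalg.norm(vectors[i] - query))
    visited = {entry}
    frontier = [(dist(entry), entry)]      # min-heap: closest unexplored first
    best = [(-dist(entry), entry)]         # max-heap (negated): worst of top-ef on top
    while frontier:
        d, node = heapq.heappop(frontier)
        # Stop once the closest frontier node is worse than our worst result.
        if len(best) >= ef and d > -best[0][0]:
            break
        for nb in neighbors[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(nb)
            if len(best) < ef or d_nb < -best[0][0]:
                heapq.heappush(frontier, (d_nb, nb))
                heapq.heappush(best, (-d_nb, nb))
                if len(best) > ef:
                    heapq.heappop(best)    # drop the current worst result
    return sorted((-d, i) for d, i in best)  # [(distance, node_id), ...] ascending
```

Even on a bad graph (here, a simple chain), the search walks edge by edge toward the query; the layered structure in real HNSW exists precisely so the walk starts with long hops instead of crawling the chain.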
- M: max connections per node. Higher = better recall, more memory. Typical: 16-32.
- ef_construction: how many candidates to explore during index build. Higher = slower build, better index. Typical: 100-200.
- ef_search: how many candidates to explore during query. Higher = better recall, slower queries. Typical: 50-200.

IVF (Inverted File Index) is an older algorithm, still useful at enormous scale or when memory-constrained.
Cluster all vectors into K centroids (via k-means). For each query, find the nearest centroid(s), then exhaustively search only the vectors assigned to those clusters.
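That two-phase structure fits in a few lines. This is a toy sketch with a naive k-means, not production code (faiss implements this as IndexIVFFlat); all names here are made up for illustration.

```python
import numpy as np

def build_ivf(vectors, nlist, iters=10, seed=0):
    """Toy IVF index: k-means centroids plus inverted lists."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), nlist, replace=False)].copy()
    for _ in range(iters):
        # Assign every vector to its nearest centroid, then recompute centroids.
        assign = np.argmin(((vectors[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for k in range(nlist):
            if (assign == k).any():
                centroids[k] = vectors[assign == k].mean(axis=0)
    lists = {k: np.flatnonzero(assign == k) for k in range(nlist)}
    return centroids, lists

def ivf_search(vectors, centroids, lists, query, nprobe, k=1):
    """Exhaustively scan only the nprobe clusters nearest the query."""
    order = np.argsort(((centroids - query) ** 2).sum(-1))[:nprobe]
    cand = np.concatenate([lists[c] for c in order])
    d = ((vectors[cand] - query) ** 2).sum(-1)
    return cand[np.argsort(d)[:k]]
```

Note the failure mode this makes visible: with nprobe=1, a query near a cluster boundary can miss its true nearest neighbor because that neighbor sits in an unprobed cluster. Raising nprobe recovers recall at the cost of scanning more vectors.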
- nlist: number of clusters. Rule of thumb: sqrt(N) for N vectors.
- nprobe: how many clusters to search per query. Higher = better recall, slower.

Product quantization (PQ) is a compression technique, usually combined with IVF or HNSW.
Split each vector into M sub-vectors. Train a separate codebook per sub-vector position (typically 256 codes per position, so each code fits in one byte). Store each sub-vector as a single byte index into its codebook. A 768-dim float32 vector (3,072 bytes) becomes 48 bytes with M=48: a 64x storage reduction.
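The encode/decode path can be sketched directly. This is a toy (naive k-means, no optimized distance tables; faiss's IndexPQ does the real work), and all function names are illustrative.

```python
import numpy as np

def pq_train(vectors, M, ncodes=256, iters=10, seed=0):
    """Train one codebook per sub-vector position. Assumes D divisible by M."""
    N, D = vectors.shape
    d = D // M
    rng = np.random.default_rng(seed)
    books = []
    for m in range(M):
        sub = vectors[:, m * d:(m + 1) * d]
        cb = sub[rng.choice(N, min(ncodes, N), replace=False)].copy()
        for _ in range(iters):
            assign = np.argmin(((sub[:, None] - cb[None]) ** 2).sum(-1), axis=1)
            for k in range(len(cb)):
                if (assign == k).any():
                    cb[k] = sub[assign == k].mean(axis=0)
        books.append(cb)
    return books

def pq_encode(vectors, books):
    """Each vector becomes M one-byte codebook indices."""
    d = books[0].shape[1]
    codes = [np.argmin(((vectors[:, m * d:(m + 1) * d][:, None] - cb[None]) ** 2).sum(-1), axis=1)
             for m, cb in enumerate(books)]
    return np.stack(codes, axis=1).astype(np.uint8)

def pq_decode(codes, books):
    """Lossy reconstruction: concatenate the chosen codebook entries."""
    return np.hstack([cb[codes[:, m]] for m, cb in enumerate(books)])
```

The decoded vector is an approximation; real implementations never decode at query time, instead precomputing a query-to-codebook distance table per sub-space and summing M table lookups per candidate.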
HNSW alone: the default for most production systems up to ~10M vectors. Highest quality.
IVF + PQ: cluster, then compress. Scales to billions of vectors. Lower recall than HNSW but much cheaper storage. Used at Meta and Google scale.
HNSW + PQ: graph structure for navigation, PQ-compressed vectors for distance calculation. A good balance; Qdrant's default at scale.
Scalar and binary quantization: simpler than PQ. Each float is reduced to an int8 (scalar) or a single bit (binary). Combined with HNSW in many modern DBs; Qdrant and Milvus both support binary quantization natively.
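A minimal int8 sketch shows why scalar quantization is so cheap: one scale factor and a round. Real systems typically keep per-dimension or per-vector scales and compute distances directly on the int8 values; this global-scale version is just for illustration.

```python
import numpy as np

def sq_encode(vectors):
    """Symmetric int8 scalar quantization with a single global scale."""
    scale = np.abs(vectors).max() / 127.0
    q = np.clip(np.round(vectors / scale), -127, 127).astype(np.int8)
    return q, scale

def sq_decode(q, scale):
    """Lossy reconstruction; error per component is at most scale / 2."""
    return q.astype(np.float32) * scale
```

This gives 4x compression versus float32 (binary quantization pushes that to 32x by keeping only the sign bit and comparing with Hamming distance), at the cost of a bounded rounding error per dimension.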
You get to pick two of three: recall, speed, memory. Tuning parameters move you along that frontier.
For systems below ~10M vectors, accept the defaults of your vector DB and move on. Index strategy rarely bottlenecks RAG quality at this scale.
Above 10M vectors, budget time to tune. Run your eval set at different parameter settings. Look at recall@10 vs query latency. Find the parameter sweet spot for your workload.
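The evaluation loop itself is simple: compute exact ground truth once by brute force, then score each parameter setting against it. The helpers below are hypothetical names, and the sweep over your index's parameters is left as a comment since it depends on your DB's API.

```python
import numpy as np

def exact_topk(vectors, queries, k=10):
    """Brute-force ground truth: true top-k ids per query."""
    d = ((queries[:, None] - vectors[None]) ** 2).sum(-1)  # (Q, N)
    return np.argsort(d, axis=1)[:, :k]

def recall_at_k(approx_ids, exact_ids, k=10):
    """Mean fraction of the true top-k that the ANN search returned."""
    hits = [len(set(a[:k]) & set(e[:k])) for a, e in zip(approx_ids, exact_ids)]
    return float(np.mean(hits)) / k

# Sweep sketch (pseudo; substitute your DB's client and knob):
# gt = exact_topk(corpus, eval_queries, k=10)
# for ef in (32, 64, 128, 256):
#     ids, latency = timed_search(index, eval_queries, ef_search=ef, k=10)
#     print(ef, recall_at_k(ids, gt), latency)
```

Plot recall@10 against latency for each setting and pick the knee of the curve; past it, extra latency buys almost no recall.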
Above 100M vectors, index strategy is a serious engineering effort. Expect to iterate on quantization and sharding over months.
Indexes don't update gracefully forever: deletions are often handled as tombstones and incremental inserts slowly degrade graph quality, so plan for periodic rebuilds.
Next: Hybrid search.