Dimensions, cost, and MRL

Every dimension in your embedding vector costs storage, search time, and network bandwidth. For small corpora it doesn't matter. For 100M+ vectors, the difference between 768-dim and 3072-dim vectors is the difference between an affordable vector store and a cost-prohibitive one. Matryoshka embeddings are a recent innovation that gives you both options from one model.

The cost math

Every vector is stored as 32-bit floats (or 16-bit, or quantized). Storage per vector is dimensions × bytes per value:

  768 dims × 4 bytes ≈ 3 KB
  1536 dims × 4 bytes ≈ 6 KB
  3072 dims × 4 bytes ≈ 12 KB

For 100 million vectors, before any index overhead:

  768 dims ≈ 307 GB
  1536 dims ≈ 614 GB
  3072 dims ≈ 1.2 TB

Managed vector databases effectively charge per stored dimension, and search latency also scales roughly linearly with dimension count. The difference between picking a 768-dim and a 3072-dim model is material.
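The arithmetic is trivial, but worth scripting when comparing models and providers. A minimal sketch (raw vector storage only; real indexes add overhead on top):

```python
def storage_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB: count x dims x bytes per value.
    Ignores index structures, which add more on top."""
    return num_vectors * dims * bytes_per_value / 1e9

# 100M vectors at float32 precision:
for dims in (768, 1024, 1536, 3072):
    print(f"{dims:>4} dims: {storage_gb(100_000_000, dims):,.1f} GB")
```

Swap `bytes_per_value` to 2 for float16 or 1 for int8 to see how precision interacts with dimension count.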

The quality-vs-dim tradeoff

Higher dimensions generally capture more information, which means better retrieval quality. But the relationship is sub-linear: going from 768 to 1536 dims often buys only a few percent of quality improvement, and going from 1536 to 3072 buys even less.

For cost-conscious deployments, the sweet spot is often 1024-1536 dimensions. Higher is usually not worth the cost.

Matryoshka Representation Learning (MRL)

MRL trains models so that the first N dimensions of the vector are themselves a usable embedding. You can truncate a 3072-dim vector to 768 dim and still have a functional (if slightly lower quality) embedding.
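Truncation itself is a one-liner: slice off the first N values and re-normalize. A sketch with NumPy (the 3072/768 sizes match the example above; the random vector stands in for a real model output):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` values of an MRL-trained embedding and
    re-normalize, so cosine similarity still behaves on the short vector."""
    v = vec[:dims]
    return v / np.linalg.norm(v)

# One full-dimension embedding serves every index size you might want.
full = np.random.default_rng(0).standard_normal(3072)
full /= np.linalg.norm(full)
short = truncate_embedding(full, 768)  # same semantics, a quarter the cost
```

Note that this only works as advertised for models trained with MRL; slicing an ordinary embedding this way loses information unpredictably.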

This is huge for RAG: you embed once with the full model, store truncated vectors in the index to cut storage and search cost, and keep the full-length vectors around for reranking, all without maintaining two models or re-embedding the corpus.

Models that support MRL

OpenAI's text-embedding-3-small and text-embedding-3-large expose this through the dimensions API parameter, and open models such as nomic-embed-text-v1.5 are trained with MRL as well. Check the model card before truncating: cutting dimensions from a model that was not MRL-trained degrades quality sharply.
Quantization

Beyond dimension reduction, you can reduce precision:

  float32 → float16: 2x smaller, near-zero quality loss
  float32 → int8 (scalar quantization): 4x smaller, small quality loss
  float32 → 1 bit (binary quantization): 32x smaller, needs a rerank pass to recover quality

Some vector databases support native quantization (Qdrant's scalar and binary quantization, for example). At scale, the storage and speed wins are dramatic.
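Binary quantization is simple enough to sketch directly: keep one sign bit per dimension and compare codes with Hamming distance. A minimal NumPy version (real databases do this natively and much faster):

```python
import numpy as np

def binary_quantize(vecs: np.ndarray) -> np.ndarray:
    """Sign-based binary quantization: 1 bit per dimension, packed
    into uint8, i.e. 32x smaller than float32."""
    return np.packbits(vecs > 0, axis=-1)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Popcount of XOR: the distance function for binary codes."""
    return int(np.unpackbits(a ^ b).sum())

rng = np.random.default_rng(1)
vecs = rng.standard_normal((2, 1024)).astype(np.float32)
codes = binary_quantize(vecs)   # shape (2, 128): 128 bytes vs 4096 bytes
d = hamming(codes[0], codes[1])
```

XOR-plus-popcount compiles to a handful of CPU instructions per word, which is where the dramatic speed wins come from.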

The combined cost strategy

For a large-scale RAG system:

  1. Start with a high-quality high-dim model (e.g., text-embedding-3-large at 3072 dim)
  2. Use MRL to store truncated 1024-dim vectors in the index
  3. Use binary quantization for coarse first-pass retrieval
  4. Rerank top-100 candidates with full-precision, full-dimension vectors (or with a cross-encoder)

This gives most of the quality of the full model with 1/30th the storage and significantly faster search.
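The two retrieval stages of that strategy can be sketched end to end. This is a toy in-memory version, assuming sign-bit binary codes and cosine reranking (a cross-encoder would slot in at stage 2 the same way):

```python
import numpy as np

def two_stage_search(query: np.ndarray, full_vecs: np.ndarray,
                     codes: np.ndarray, shortlist: int = 100) -> np.ndarray:
    """Coarse binary pass over the whole index, then exact rerank
    of the shortlist. `codes` are sign-bit-packed `full_vecs`."""
    q_code = np.packbits(query > 0)
    # Stage 1: Hamming distance against every code (cheap, 1 bit/dim).
    dists = np.unpackbits(codes ^ q_code, axis=-1).sum(axis=-1)
    candidates = np.argsort(dists)[:shortlist]
    # Stage 2: exact cosine scores on the survivors only.
    scores = full_vecs[candidates] @ query
    return candidates[np.argsort(-scores)]

rng = np.random.default_rng(2)
db = rng.standard_normal((10_000, 1024)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)
codes = np.packbits(db > 0, axis=-1)

q = db[42]  # querying with a stored vector: it should come back first
results = two_stage_search(q, db, codes)
```

In production the full-precision vectors can live on cheaper storage, since stage 2 only ever touches the shortlist.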

When this matters

Below 1M vectors, ignore this entire page. Just pick a reasonable model and move on.

Between 1M and 100M, dimensions start to matter for cost and latency.

Above 100M, dimensions and quantization strategy dominate your infrastructure bill.

The operational cost

Re-embedding 100M vectors is not free. If you pick a 3072-dim model and later want to change, budget days of compute time. Build your pipeline so you can swap models without a full reindex where possible: store raw chunks separately from their vectors, so you can re-embed without re-parsing.
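The chunks-separate-from-vectors idea reduces to keying the two stores independently. A minimal sketch with hypothetical names (`ingest`, `reembed_all`, and the stand-in embedder are illustrative, not any particular library's API):

```python
# Chunks and vectors live in separate stores, keyed by a stable chunk_id.
# Swapping embedding models then re-runs only the embed step, not parsing.
chunks: dict[str, str] = {}                       # chunk_id -> raw text
vectors: dict[tuple[str, str], list[float]] = {}  # (chunk_id, model) -> vec

def ingest(chunk_id: str, text: str) -> None:
    """Parse once, keep forever."""
    chunks[chunk_id] = text

def reembed_all(model_name: str, embed_fn) -> None:
    """Re-embed every stored chunk without touching the parsing stage.
    `embed_fn` is whatever client call turns a string into a vector."""
    for chunk_id, text in chunks.items():
        vectors[(chunk_id, model_name)] = embed_fn(text)

ingest("doc1#0", "Matryoshka embeddings nest usable prefixes.")
reembed_all("toy-model", lambda t: [float(len(t))])  # stand-in embedder
```

Keying vectors by (chunk_id, model) also lets old and new indexes coexist during a migration, so you can cut over without downtime.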

Next: Fine-tuning embeddings.