Dimensions, cost, and MRL

Every dimension in your embedding vector costs storage, search time, and network bandwidth. For small corpora it doesn't matter. For 100M+ vectors, the difference between 768-dim and 3072-dim vectors is the difference between an affordable vector store and a cost-prohibitive one. Matryoshka embeddings are a recent innovation that gives you both options from one model.

The cost math

Every vector is stored as 32-bit floats (or 16-bit, or quantized). Storage per vector is dimensions × bytes per value:

  768 dims × 4 bytes ≈ 3 KB
  1536 dims × 4 bytes ≈ 6 KB
  3072 dims × 4 bytes ≈ 12 KB

For 100 million vectors, before any index overhead:

  768 dims ≈ 307 GB
  1536 dims ≈ 614 GB
  3072 dims ≈ 1.2 TB

Managed vector databases effectively charge per stored dimension, and search latency also scales roughly linearly with dimension count. The difference between picking a 768-dim and a 3072-dim model is material.
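The arithmetic is trivial, but worth scripting when comparing models and providers. A minimal sketch (raw vector storage only; real indexes add overhead on top):

```python
def storage_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB: count x dims x bytes per value.
    Ignores index structures, which add more on top."""
    return num_vectors * dims * bytes_per_value / 1e9

# 100M vectors at float32 precision:
for dims in (768, 1024, 1536, 3072):
    print(f"{dims:>4} dims: {storage_gb(100_000_000, dims):,.1f} GB")
```

Swap `bytes_per_value` to 2 for float16 or 1 for int8 to see how precision interacts with dimension count.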

The quality-vs-dim tradeoff

Higher dimensions generally capture more information, which means better retrieval quality. But the relationship is sub-linear: going from 768 to 1536 dims often buys only a few percent of quality improvement, and going from 1536 to 3072 buys even less.

For cost-conscious deployments, the sweet spot is often 1024-1536 dimensions. Higher is usually not worth the cost.

Matryoshka Representation Learning (MRL)

MRL trains models so that the first N dimensions of the vector are themselves a usable embedding. You can truncate a 3072-dim vector to 768 dim and still have a functional (if slightly lower quality) embedding.
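Truncation itself is a one-liner: slice off the first N values and re-normalize. A sketch with NumPy (the 3072/768 sizes match the example above; the random vector stands in for a real model output):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` values of an MRL-trained embedding and
    re-normalize, so cosine similarity still behaves on the short vector."""
    v = vec[:dims]
    return v / np.linalg.norm(v)

# One full-dimension embedding serves every index size you might want.
full = np.random.default_rng(0).standard_normal(3072)
full /= np.linalg.norm(full)
short = truncate_embedding(full, 768)  # same semantics, a quarter the cost
```

Note that this only works as advertised for models trained with MRL; slicing an ordinary embedding this way loses information unpredictably.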

This is huge for RAG: you embed once with the full model, store truncated vectors in the index to cut storage and search cost, and keep the full-length vectors around for reranking, all without maintaining two models or re-embedding the corpus.

Models that support MRL

OpenAI's text-embedding-3-small and text-embedding-3-large expose this through the dimensions API parameter, and open models such as nomic-embed-text-v1.5 are trained with MRL as well. Check the model card before truncating: cutting dimensions from a model that was not MRL-trained degrades quality sharply.
Quantization

Beyond dimension reduction, you can reduce precision:

  float32 → float16: 2x smaller, near-zero quality loss
  float32 → int8 (scalar quantization): 4x smaller, small quality loss
  float32 → 1 bit (binary quantization): 32x smaller, needs a rerank pass to recover quality

Some vector databases support native quantization (Qdrant's scalar and binary quantization, for example). At scale, the storage and speed wins are dramatic.
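Binary quantization is simple enough to sketch directly: keep one sign bit per dimension and compare codes with Hamming distance. A minimal NumPy version (real databases do this natively and much faster):

```python
import numpy as np

def binary_quantize(vecs: np.ndarray) -> np.ndarray:
    """Sign-based binary quantization: 1 bit per dimension, packed
    into uint8, i.e. 32x smaller than float32."""
    return np.packbits(vecs > 0, axis=-1)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Popcount of XOR: the distance function for binary codes."""
    return int(np.unpackbits(a ^ b).sum())

rng = np.random.default_rng(1)
vecs = rng.standard_normal((2, 1024)).astype(np.float32)
codes = binary_quantize(vecs)   # shape (2, 128): 128 bytes vs 4096 bytes
d = hamming(codes[0], codes[1])
```

XOR-plus-popcount compiles to a handful of CPU instructions per word, which is where the dramatic speed wins come from.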

The combined cost strategy

For a large-scale RAG system:

  1. Start with a high-quality high-dim model (e.g., text-embedding-3-large at 3072 dim)
  2. Use MRL to store truncated 1024-dim vectors in the index
  3. Use binary quantization for coarse first-pass retrieval
  4. Rerank top-100 candidates with full-precision, full-dimension vectors (or with a cross-encoder)

This gives most of the quality of the full model with 1/30th the storage and significantly faster search.
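The two retrieval stages of that strategy can be sketched end to end. This is a toy in-memory version, assuming sign-bit binary codes and cosine reranking (a cross-encoder would slot in at stage 2 the same way):

```python
import numpy as np

def two_stage_search(query: np.ndarray, full_vecs: np.ndarray,
                     codes: np.ndarray, shortlist: int = 100) -> np.ndarray:
    """Coarse binary pass over the whole index, then exact rerank
    of the shortlist. `codes` are sign-bit-packed `full_vecs`."""
    q_code = np.packbits(query > 0)
    # Stage 1: Hamming distance against every code (cheap, 1 bit/dim).
    dists = np.unpackbits(codes ^ q_code, axis=-1).sum(axis=-1)
    candidates = np.argsort(dists)[:shortlist]
    # Stage 2: exact cosine scores on the survivors only.
    scores = full_vecs[candidates] @ query
    return candidates[np.argsort(-scores)]

rng = np.random.default_rng(2)
db = rng.standard_normal((10_000, 1024)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)
codes = np.packbits(db > 0, axis=-1)

q = db[42]  # querying with a stored vector: it should come back first
results = two_stage_search(q, db, codes)
```

In production the full-precision vectors can live on cheaper storage, since stage 2 only ever touches the shortlist.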

When this matters

Below 1M vectors, ignore this entire page. Just pick a reasonable model and move on.

Between 1M and 100M, dimensions start to matter for cost and latency.

Above 100M, dimensions and quantization strategy dominate your infrastructure bill.

The operational cost

Re-embedding 100M vectors is not free. If you pick a 3072-dim model and later want to change, budget days of compute time. Build your pipeline so you can swap models without a full reindex where possible: store raw chunks separately from their vectors, so you can re-embed without re-parsing.
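The chunks-separate-from-vectors idea reduces to keying the two stores independently. A minimal sketch with hypothetical names (`ingest`, `reembed_all`, and the stand-in embedder are illustrative, not any particular library's API):

```python
# Chunks and vectors live in separate stores, keyed by a stable chunk_id.
# Swapping embedding models then re-runs only the embed step, not parsing.
chunks: dict[str, str] = {}                       # chunk_id -> raw text
vectors: dict[tuple[str, str], list[float]] = {}  # (chunk_id, model) -> vec

def ingest(chunk_id: str, text: str) -> None:
    """Parse once, keep forever."""
    chunks[chunk_id] = text

def reembed_all(model_name: str, embed_fn) -> None:
    """Re-embed every stored chunk without touching the parsing stage.
    `embed_fn` is whatever client call turns a string into a vector."""
    for chunk_id, text in chunks.items():
        vectors[(chunk_id, model_name)] = embed_fn(text)

ingest("doc1#0", "Matryoshka embeddings nest usable prefixes.")
reembed_all("toy-model", lambda t: [float(len(t))])  # stand-in embedder
```

Keying vectors by (chunk_id, model) also lets old and new indexes coexist during a migration, so you can cut over without downtime.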

Next: Fine-tuning embeddings.