A naive production RAG system scales cost linearly with corpus size and query volume. At 100M vectors and millions of monthly queries, that cost becomes real. Here are the cost levers, ranked by impact.
Use Matryoshka truncation to store 1024-dim vectors from a 3072-dim model. 3x storage savings, typically <5% quality loss. See dimensions and cost.
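A minimal sketch of the truncation step, assuming NumPy and embeddings from a Matryoshka-trained model (only those models concentrate information in the leading dimensions — truncating an ordinary embedding this way loses much more quality):

```python
import numpy as np

def truncate_matryoshka(embedding: np.ndarray, dims: int = 1024) -> np.ndarray:
    """Keep the first `dims` components and re-normalize to unit length.

    Valid only for Matryoshka-trained models, where early dimensions
    carry most of the signal; cosine similarity needs the re-normalization.
    """
    truncated = embedding[:dims]
    return truncated / np.linalg.norm(truncated)

full = np.random.randn(3072)          # e.g. output of a 3072-dim model
small = truncate_matryoshka(full, 1024)
assert small.shape == (1024,)
```

Truncate before indexing; queries must be truncated to the same dimension at search time.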
int8 quantization: 4x savings, minimal quality loss. Binary quantization: 32x savings, larger quality loss but recoverable with reranking.
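The binary case can be sketched in a few lines of NumPy: 1 bit per dimension (32x smaller than float32), with Hamming distance as the cheap first-pass score and full-precision rerank on the shortlist:

```python
import numpy as np

def binarize(vecs: np.ndarray) -> np.ndarray:
    """1 bit per dimension: sign of each component, packed into bytes."""
    return np.packbits(vecs > 0, axis=-1)

def hamming_scores(query_bits: np.ndarray, db_bits: np.ndarray) -> np.ndarray:
    """Lower Hamming distance = more similar. Use to shortlist candidates,
    then rescore the shortlist with full-precision vectors."""
    return np.unpackbits(query_bits ^ db_bits, axis=-1).sum(axis=-1)

db = np.random.randn(1000, 1024).astype(np.float32)
q = np.random.randn(1024).astype(np.float32)
db_bits, q_bits = binarize(db), binarize(q)
candidates = np.argsort(hamming_scores(q_bits, db_bits))[:50]  # shortlist for rerank
```

Production vector DBs implement this internally; the sketch just shows why the recovery step works — the binary pass only has to keep the right answers in the shortlist, not rank them perfectly.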
Many corpora have duplicate or near-duplicate content. Deduplicate at ingestion time. For news, blog archives, or documentation with multiple versions, this often removes 20-40% of the index.
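A sketch of the exact-duplicate case, using a normalized content hash at ingestion time (near-duplicates need something like MinHash or SimHash, which this does not cover):

```python
import hashlib

def dedup_chunks(chunks: list[str]) -> list[str]:
    """Drop exact duplicates by hash of whitespace/case-normalized content."""
    seen, unique = set(), []
    for chunk in chunks:
        normalized = " ".join(chunk.split()).lower()
        key = hashlib.sha256(normalized.encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique

docs = ["Breaking news: X happened.", "Breaking  news: X happened.", "Unrelated post."]
assert len(dedup_chunks(docs)) == 2
```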
Separate recent/hot data from cold historical data. Put cold data on cheaper tiered storage or take it out of the active index. Serverless vector DBs (Turbopuffer, Pinecone serverless) price partly on access frequency.
Common queries repeat. Cache the embedding of each query for 24 hours or until you re-embed the corpus. See caching.
For deterministic queries against an unchanged index, cache top-k results. Especially valuable for popular questions in customer-facing RAG.
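Both caches have the same shape; a minimal in-memory sketch (swap for Redis or similar in production — the key detail is including the index version in the result-cache key so re-embedding invalidates it):

```python
import hashlib
import time

class TTLCache:
    """Tiny TTL cache for illustration; not thread-safe, unbounded."""
    def __init__(self, ttl_seconds: float = 24 * 3600):
        self.ttl = ttl_seconds
        self.store = {}

    def _key(self, *parts: str) -> str:
        return hashlib.sha256("|".join(parts).encode()).hexdigest()

    def get(self, *parts: str):
        entry = self.store.get(self._key(*parts))
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]
        return None

    def put(self, value, *parts: str) -> None:
        self.store[self._key(*parts)] = (value, time.time())

# embedding cache: keyed on normalized query text only
embed_cache = TTLCache()
embed_cache.put([0.1, 0.2], "what is rag?")
# result cache: key includes the index version, so re-embedding busts it
result_cache = TTLCache()
result_cache.put(["doc_42"], "what is rag?", "index-v7")
```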
Use a lightweight classifier or the LLM itself to decide whether the query needs RAG. Trivial queries ("hi", "thanks", "can you help") don't require retrieval; serve them directly.
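The cheapest version of the gate is a heuristic pre-filter; a small trained classifier (or an LLM yes/no call) generalizes better, but this shows the shape. The phrase list is illustrative, not exhaustive:

```python
SMALL_TALK = {"hi", "hello", "thanks", "thank you", "ok", "can you help"}

def needs_retrieval(query: str) -> bool:
    """Skip RAG for greetings and very short queries; retrieve otherwise."""
    q = query.strip().lower().rstrip("?!.")
    return q not in SMALL_TALK and len(q.split()) >= 3

assert not needs_retrieval("Thanks!")
assert needs_retrieval("how do I rotate my API keys?")
```

Route gated queries straight to a small model; the retrieval, rerank, and long-context generation costs all disappear for them.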
Passing top-5 instead of top-20 cuts reranker and generation costs. Only increase top-k when your evals show top-5 misses relevant content.
Don't re-embed unchanged chunks. Hash each chunk's content; only re-embed when hash changes.
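A sketch of the incremental-embedding loop, assuming a persisted chunk_id → hash map and an `embed_fn` standing in for your batch embedding call:

```python
import hashlib

def embed_incremental(chunks: dict, stored_hashes: dict, embed_fn):
    """Re-embed only chunks whose content hash changed since the last run.

    `chunks` maps chunk_id -> text; `stored_hashes` maps chunk_id -> sha256
    and is mutated in place; `embed_fn` takes [(chunk_id, text), ...].
    """
    to_embed = []
    for chunk_id, text in chunks.items():
        h = hashlib.sha256(text.encode()).hexdigest()
        if stored_hashes.get(chunk_id) != h:
            to_embed.append((chunk_id, text))
            stored_hashes[chunk_id] = h
    return embed_fn(to_embed) if to_embed else []

hashes = {}
embed_incremental({"a": "hello"}, hashes, lambda batch: [cid for cid, _ in batch])
# second run with unchanged content embeds nothing
assert embed_incremental({"a": "hello"}, hashes, lambda batch: batch) == []
```

Persist the hash map alongside the index so the savings survive restarts.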
Old archives, long-tail content, or low-volume segments can use cheaper embedding models. Hot content can use the premium model.
At high volume, self-hosted embedding models beat API pricing. Even if you use an API for queries, batch ingestion can run on self-hosted infrastructure.
Reranking top-50 is cheaper than top-200. If quality is adequate at smaller candidate sets, use them.
cross-encoder/ms-marco-MiniLM vs cross-encoder/ms-marco-electra-base: meaningful cost difference, smaller quality difference.
If dense retrieval returns a very high-scoring result (cosine > 0.85, say), you may not need reranking. Test the threshold against your eval set.
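A sketch of the conditional rerank, with hypothetical `index` and `reranker` objects standing in for your retrieval stack; the 0.85 default is illustrative and must be tuned on your eval set:

```python
def retrieve(query_vec, index, reranker, skip_threshold: float = 0.85, k: int = 20):
    """Skip the cross-encoder when dense retrieval is already confident.

    `index.search` returns [(doc_id, cosine), ...] best-first (assumed API);
    `reranker.rerank` is the expensive cross-encoder pass (assumed API).
    """
    hits = index.search(query_vec, k)
    if hits and hits[0][1] >= skip_threshold:
        return hits[:5]                       # confident: serve dense top-5 directly
    return reranker.rerank(query_vec, hits)   # otherwise pay for the rerank
```

Log how often the skip fires in production; if it's rare, the threshold is saving nothing, and if it's constant, verify quality hasn't slipped.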
Use GPT-4o-mini, Claude Haiku, Gemini Flash for simple queries. Reserve large models for complex reasoning.
If retrieved chunks contain boilerplate, trim it before sending to the generator. Every 100 tokens saved is real money at scale.
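A minimal trimming pass; the patterns below are illustrative placeholders — build the real list from boilerplate you actually observe in your corpus:

```python
import re

BOILERPLATE = [
    re.compile(r"(?im)^\s*(cookie policy|subscribe to our newsletter|all rights reserved).*$"),
    re.compile(r"(?im)^\s*copyright ©.*$"),
]

def trim_chunk(text: str) -> str:
    """Strip known boilerplate lines before the chunk reaches the generator."""
    for pattern in BOILERPLATE:
        text = pattern.sub("", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()

chunk = "Useful content here.\n\nSubscribe to our newsletter!\nMore useful content."
assert trim_chunk(chunk) == "Useful content here.\n\nMore useful content."
```

Running this at ingestion (rather than per query) means you pay the trimming cost once per chunk instead of once per retrieval.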
When retrieved context is very long, a cheap summarization pass can produce a more focused input for the final answer generation.
Many teams over-provision. Measure actual query QPS and scale down. Serverless options scale to zero.
Multiple small indexes have overhead. One index with metadata filters can be cheaper than ten tiny indexes.
The crossover point for Pinecone vs Qdrant self-hosted is usually around $500-1500/month in usage. Do the math.
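The math is a two-line cost model. Every number below is a placeholder, not a real vendor price — pull current figures from the pricing pages before deciding, and don't forget to cost your own ops time on the self-hosted side:

```python
# PLACEHOLDER prices for illustration only; substitute real quotes.
MANAGED_PER_M_QUERIES = 8.0    # $ per million queries
MANAGED_STORAGE = 0.33         # $ per GB-month
SELF_HOSTED_FIXED = 600.0      # $ per month: instances + amortized ops time

def managed_monthly(gb_stored: float, queries_m: float) -> float:
    return gb_stored * MANAGED_STORAGE + queries_m * MANAGED_PER_M_QUERIES

def self_hosted_monthly(gb_stored: float, queries_m: float) -> float:
    return SELF_HOSTED_FIXED   # roughly flat until you outgrow the hardware

# managed wins at small scale; self-hosted wins past the fixed-cost line
small = (managed_monthly(50, 2), self_hosted_monthly(50, 2))
large = (managed_monthly(500, 80), self_hosted_monthly(500, 80))
```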
For a typical production RAG system at moderate scale, generation usually dominates the cost breakdown, with the retrieval side (vector DB, embedding, reranking) a much smaller share.
Optimize from biggest to smallest. No point shaving 10% off vector DB costs if generation is 10x bigger.
Every cost optimization has a quality cost. Maintain an eval set. Measure quality before and after each optimization. Ship only changes where the quality loss is acceptable.
Without an eval set, cost optimization is gambling: you're reducing bills while also reducing quality invisibly.
Next: Choosing a vector DB.