A naive production RAG system scales cost linearly with corpus size and query volume. At 100M vectors and millions of monthly queries, that cost becomes real. Here are the cost levers, ranked by impact.
Use Matryoshka truncation to store 1024-dim vectors from a 3072-dim model. 3x storage savings, typically <5% quality loss. See dimensions and cost.
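A minimal sketch of the truncation step, assuming NumPy and embeddings from a Matryoshka-trained model (only those models concentrate information in the leading dimensions — truncating an ordinary embedding this way loses much more quality):

```python
import numpy as np

def truncate_matryoshka(embedding: np.ndarray, dims: int = 1024) -> np.ndarray:
    """Keep the first `dims` components and re-normalize to unit length.

    Valid only for Matryoshka-trained models, where early dimensions
    carry most of the signal; cosine similarity needs the re-normalization.
    """
    truncated = embedding[:dims]
    return truncated / np.linalg.norm(truncated)

full = np.random.randn(3072)          # e.g. output of a 3072-dim model
small = truncate_matryoshka(full, 1024)
assert small.shape == (1024,)
```

Truncate before indexing; queries must be truncated to the same dimension at search time.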
int8 quantization: 4x savings, minimal quality loss. Binary quantization: 32x savings, larger quality loss but recoverable with reranking.
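The binary case can be sketched in a few lines of NumPy: 1 bit per dimension (32x smaller than float32), with Hamming distance as the cheap first-pass score and full-precision rerank on the shortlist:

```python
import numpy as np

def binarize(vecs: np.ndarray) -> np.ndarray:
    """1 bit per dimension: sign of each component, packed into bytes."""
    return np.packbits(vecs > 0, axis=-1)

def hamming_scores(query_bits: np.ndarray, db_bits: np.ndarray) -> np.ndarray:
    """Lower Hamming distance = more similar. Use to shortlist candidates,
    then rescore the shortlist with full-precision vectors."""
    return np.unpackbits(query_bits ^ db_bits, axis=-1).sum(axis=-1)

db = np.random.randn(1000, 1024).astype(np.float32)
q = np.random.randn(1024).astype(np.float32)
db_bits, q_bits = binarize(db), binarize(q)
candidates = np.argsort(hamming_scores(q_bits, db_bits))[:50]  # shortlist for rerank
```

Production vector DBs implement this internally; the sketch just shows why the recovery step works — the binary pass only has to keep the right answers in the shortlist, not rank them perfectly.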
Many corpora have duplicate or near-duplicate content. Deduplicate at ingestion time. For news, blog archives, or documentation with multiple versions, this often removes 20-40% of the index.
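A sketch of the exact-duplicate case, using a normalized content hash at ingestion time (near-duplicates need something like MinHash or SimHash, which this does not cover):

```python
import hashlib

def dedup_chunks(chunks: list[str]) -> list[str]:
    """Drop exact duplicates by hash of whitespace/case-normalized content."""
    seen, unique = set(), []
    for chunk in chunks:
        normalized = " ".join(chunk.split()).lower()
        key = hashlib.sha256(normalized.encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique

docs = ["Breaking news: X happened.", "Breaking  news: X happened.", "Unrelated post."]
assert len(dedup_chunks(docs)) == 2
```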
Separate recent/hot data from cold historical data. Put cold data on cheaper tiered storage or take it out of the active index. Serverless vector DBs (Turbopuffer, Pinecone serverless) price partly on access frequency.
Common queries repeat. Cache the embedding of each query for 24 hours or until you re-embed the corpus. See caching.
For deterministic queries against an unchanged index, cache top-k results. Especially valuable for popular questions in customer-facing RAG.
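Both caches have the same shape; a minimal in-memory sketch (swap for Redis or similar in production — the key detail is including the index version in the result-cache key so re-embedding invalidates it):

```python
import hashlib
import time

class TTLCache:
    """Tiny TTL cache for illustration; not thread-safe, unbounded."""
    def __init__(self, ttl_seconds: float = 24 * 3600):
        self.ttl = ttl_seconds
        self.store = {}

    def _key(self, *parts: str) -> str:
        return hashlib.sha256("|".join(parts).encode()).hexdigest()

    def get(self, *parts: str):
        entry = self.store.get(self._key(*parts))
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]
        return None

    def put(self, value, *parts: str) -> None:
        self.store[self._key(*parts)] = (value, time.time())

# embedding cache: keyed on normalized query text only
embed_cache = TTLCache()
embed_cache.put([0.1, 0.2], "what is rag?")
# result cache: key includes the index version, so re-embedding busts it
result_cache = TTLCache()
result_cache.put(["doc_42"], "what is rag?", "index-v7")
```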
Use a lightweight classifier or the LLM itself to decide whether the query needs RAG. Trivial queries ("hi", "thanks", "can you help") don't require retrieval; serve them directly.
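The cheapest version of the gate is a heuristic pre-filter; a small trained classifier (or an LLM yes/no call) generalizes better, but this shows the shape. The phrase list is illustrative, not exhaustive:

```python
SMALL_TALK = {"hi", "hello", "thanks", "thank you", "ok", "can you help"}

def needs_retrieval(query: str) -> bool:
    """Skip RAG for greetings and very short queries; retrieve otherwise."""
    q = query.strip().lower().rstrip("?!.")
    return q not in SMALL_TALK and len(q.split()) >= 3

assert not needs_retrieval("Thanks!")
assert needs_retrieval("how do I rotate my API keys?")
```

Route gated queries straight to a small model; the retrieval, rerank, and long-context generation costs all disappear for them.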
Passing top-5 instead of top-20 cuts reranker and generation costs. Only increase top-k when your evals show top-5 misses relevant content.
Don't re-embed unchanged chunks. Hash each chunk's content; only re-embed when hash changes.
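A sketch of the incremental-embedding loop, assuming a persisted chunk_id → hash map and an `embed_fn` standing in for your batch embedding call:

```python
import hashlib

def embed_incremental(chunks: dict, stored_hashes: dict, embed_fn):
    """Re-embed only chunks whose content hash changed since the last run.

    `chunks` maps chunk_id -> text; `stored_hashes` maps chunk_id -> sha256
    and is mutated in place; `embed_fn` takes [(chunk_id, text), ...].
    """
    to_embed = []
    for chunk_id, text in chunks.items():
        h = hashlib.sha256(text.encode()).hexdigest()
        if stored_hashes.get(chunk_id) != h:
            to_embed.append((chunk_id, text))
            stored_hashes[chunk_id] = h
    return embed_fn(to_embed) if to_embed else []

hashes = {}
embed_incremental({"a": "hello"}, hashes, lambda batch: [cid for cid, _ in batch])
# second run with unchanged content embeds nothing
assert embed_incremental({"a": "hello"}, hashes, lambda batch: batch) == []
```

Persist the hash map alongside the index so the savings survive restarts.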
Old archives, long-tail content, or low-volume segments can use cheaper embedding models. Hot content can use the premium model.
At high volume, self-hosted embedding models beat API pricing. Even if you use an API for queries, batch ingestion can run on self-hosted infrastructure.
Reranking top-50 is cheaper than top-200. If quality is adequate at smaller candidate sets, use them.
cross-encoder/ms-marco-MiniLM vs cross-encoder/ms-marco-electra-base: meaningful cost difference, smaller quality difference.
If dense retrieval returns a very high-scoring result (cosine > 0.85, say), you may not need reranking. Test the threshold against your eval set.
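A sketch of the conditional rerank, with hypothetical `index` and `reranker` objects standing in for your retrieval stack; the 0.85 default is illustrative and must be tuned on your eval set:

```python
def retrieve(query_vec, index, reranker, skip_threshold: float = 0.85, k: int = 20):
    """Skip the cross-encoder when dense retrieval is already confident.

    `index.search` returns [(doc_id, cosine), ...] best-first (assumed API);
    `reranker.rerank` is the expensive cross-encoder pass (assumed API).
    """
    hits = index.search(query_vec, k)
    if hits and hits[0][1] >= skip_threshold:
        return hits[:5]                       # confident: serve dense top-5 directly
    return reranker.rerank(query_vec, hits)   # otherwise pay for the rerank
```

Log how often the skip fires in production; if it's rare, the threshold is saving nothing, and if it's constant, verify quality hasn't slipped.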
Use GPT-4o-mini, Claude Haiku, Gemini Flash for simple queries. Reserve large models for complex reasoning.
If retrieved chunks contain boilerplate, trim it before sending to the generator. Every 100 tokens saved is real money at scale.
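A minimal trimming pass; the patterns below are illustrative placeholders — build the real list from boilerplate you actually observe in your corpus:

```python
import re

BOILERPLATE = [
    re.compile(r"(?im)^\s*(cookie policy|subscribe to our newsletter|all rights reserved).*$"),
    re.compile(r"(?im)^\s*copyright ©.*$"),
]

def trim_chunk(text: str) -> str:
    """Strip known boilerplate lines before the chunk reaches the generator."""
    for pattern in BOILERPLATE:
        text = pattern.sub("", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()

chunk = "Useful content here.\n\nSubscribe to our newsletter!\nMore useful content."
assert trim_chunk(chunk) == "Useful content here.\n\nMore useful content."
```

Running this at ingestion (rather than per query) means you pay the trimming cost once per chunk instead of once per retrieval.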
When retrieved context is very long, a cheap summarization pass can produce a more focused input for the final answer generation.
Many teams over-provision. Measure actual query QPS and scale down. Serverless options scale to zero.
Multiple small indexes have overhead. One index with metadata filters can be cheaper than ten tiny indexes.
The crossover point for Pinecone vs Qdrant self-hosted is usually around $500-1500/month in usage. Do the math.
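The math is a two-line cost model. Every number below is a placeholder, not a real vendor price — pull current figures from the pricing pages before deciding, and don't forget to cost your own ops time on the self-hosted side:

```python
# PLACEHOLDER prices for illustration only; substitute real quotes.
MANAGED_PER_M_QUERIES = 8.0    # $ per million queries
MANAGED_STORAGE = 0.33         # $ per GB-month
SELF_HOSTED_FIXED = 600.0      # $ per month: instances + amortized ops time

def managed_monthly(gb_stored: float, queries_m: float) -> float:
    return gb_stored * MANAGED_STORAGE + queries_m * MANAGED_PER_M_QUERIES

def self_hosted_monthly(gb_stored: float, queries_m: float) -> float:
    return SELF_HOSTED_FIXED   # roughly flat until you outgrow the hardware

# managed wins at small scale; self-hosted wins past the fixed-cost line
small = (managed_monthly(50, 2), self_hosted_monthly(50, 2))
large = (managed_monthly(500, 80), self_hosted_monthly(500, 80))
```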
For a typical production RAG system at moderate scale, generation usually dominates the cost breakdown, with the retrieval side (vector DB, embedding, reranking) a much smaller share.
Optimize from biggest to smallest. No point shaving 10% off vector DB costs if generation is 10x bigger.
Every cost optimization has a quality cost. Maintain an eval set. Measure quality before and after each optimization. Ship only changes where the quality loss is acceptable.
Without an eval set, cost optimization is gambling: you're reducing bills while also reducing quality invisibly.
Next: Choosing a vector DB.