RAG has a cost structure that surprises most teams. At 100 queries per day it's invisible. At 100,000 queries per day, it's a meaningful monthly bill. Planning for this from the start saves expensive rework when the system actually scales.
LLM generation is usually the biggest line item. It's priced per input token and per output token, with output tokens typically 3-5x more expensive.
Rough 2026 list prices, per million tokens: GPT-4o runs about $5 for input and $15 for output; embedding models cost a few cents.
Embedding is priced per input token, and the cost compounds with reindexes: every model change or full reindex re-embeds the corpus.
Infrastructure (application servers, ingestion workers, monitoring, logging) is usually 5-15% of the total cost.
A typical production query, no optimization:
- Query embedding: 512 tokens × $0.02/M = $0.00001
- Vector search: (included in DB cost)
- Reranker (Cohere): 50 candidates = $0.00005
- Generation (GPT-4o):
  - input: 2,000 tokens × $5/M = $0.01
  - output: 300 tokens × $15/M = $0.0045
- Total: ~$0.015 per query
At 100K queries/day: $1,500/day = ~$45,000/month.
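The arithmetic above can be sketched as a small calculator. The price constants are the example figures from this section, not guaranteed-current list prices:

```python
# Per-query RAG cost estimate; prices are this section's 2026 example figures.
PRICES_PER_M = {             # $ per million tokens
    "embed_input": 0.02,     # embedding model
    "gen_input": 5.00,       # GPT-4o input
    "gen_output": 15.00,     # GPT-4o output
}
RERANK_COST_PER_QUERY = 0.00005  # Cohere, ~50 candidates (flat per query)

def query_cost(embed_tokens: int, input_tokens: int, output_tokens: int) -> float:
    """Sum the metered components of a single query."""
    return (
        embed_tokens / 1e6 * PRICES_PER_M["embed_input"]
        + RERANK_COST_PER_QUERY
        + input_tokens / 1e6 * PRICES_PER_M["gen_input"]
        + output_tokens / 1e6 * PRICES_PER_M["gen_output"]
    )

per_query = query_cost(512, 2000, 300)   # ~$0.0146
monthly = per_query * 100_000 * 30       # ~$44K at 100K queries/day
```

Rerunning this whenever a price or token count changes keeps the budget model honest.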
Use cheap models where possible. Classify incoming queries: often around 70% can be handled by gpt-4o-mini or Haiku, saving 80-90% on those queries.
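A minimal sketch of this routing, with a toy heuristic standing in for the classifier (in production this would be a trained classifier or a cheap LLM call; the model names and threshold are illustrative assumptions):

```python
SIMPLE_MODEL = "gpt-4o-mini"   # cheap tier
COMPLEX_MODEL = "gpt-4o"       # expensive tier

def classify(query: str) -> str:
    """Toy heuristic: short lookup-style queries count as simple."""
    if len(query.split()) < 20 and "compare" not in query.lower():
        return "simple"
    return "complex"

def pick_model(query: str) -> str:
    """Route simple queries to the cheap model, the rest to the expensive one."""
    return SIMPLE_MODEL if classify(query) == "simple" else COMPLEX_MODEL
```

The win comes from the split: if 70% of traffic lands on the cheap tier, those queries cost a fraction of the flagship rate.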
Every 100 input tokens saved is money: trim retrieved chunks before sending them, reduce top-k, and use shorter prompts.
Every cache hit is money. See caching.
Use prompt caching for the system prompt on providers that support it, such as Claude or Gemini: 30-90% savings on input tokens for common prompts.
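A sketch of what marking the static system prompt as cacheable looks like. The payload shape follows the Anthropic Messages API's `cache_control` field, and the model name is illustrative; verify field names against current provider docs before relying on this:

```python
# Static instructions go in a cacheable system block; only the short,
# changing user message pays full input price on repeat requests.
STATIC_SYSTEM_PROMPT = (
    "You are a support assistant. Answer only from the provided context. "
    "(...long instructions and policies...)"
)

def build_request(user_query: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",   # illustrative model name
        "max_tokens": 512,
        "system": [
            {
                "type": "text",
                "text": STATIC_SYSTEM_PROMPT,
                # cache_control marks this block for prompt caching
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_query}],
    }
```

The savings scale with how much of the prompt is static: the longer the shared prefix, the bigger the discount on repeat traffic.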
For async workloads (document enrichment, bulk processing), batch API calls. Most providers give 50% discount on batch APIs.
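A sketch of preparing such a batch. The JSONL record shape follows the OpenAI Batch API format (`custom_id`, `method`, `url`, `body`); confirm against current docs, and the model and prompt are placeholders:

```python
import json

def make_batch_lines(docs: list[str], model: str = "gpt-4o-mini") -> list[str]:
    """Build one JSONL line per document for an OpenAI-style batch endpoint."""
    lines = []
    for i, doc in enumerate(docs):
        lines.append(json.dumps({
            "custom_id": f"doc-{i}",          # lets you match results to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": f"Summarize: {doc}"}],
            },
        }))
    return lines
```

The file is uploaded once and results arrive asynchronously, which is exactly the shape enrichment and bulk-processing jobs already have.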
At scale (millions of queries/month), self-hosting becomes cheaper. The crossover is often 10-50M tokens/day.
Use Matryoshka Representation Learning (MRL) to reduce vector dimensions: roughly 3x savings on vector DB storage.
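The mechanics are simple when the embedding model was trained with MRL: keep a prefix of the dimensions and re-normalize. A minimal sketch (only valid for MRL-trained models; truncating an ordinary embedding this way degrades quality badly):

```python
import math

def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` dimensions of an MRL-trained embedding
    and re-normalize to unit length for cosine similarity."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]
```

Going from 1536 to 512 dimensions is the 3x storage cut, and it also shrinks index memory and speeds up search.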
Track cost per query, per tenant, per query type, and per component (embedding, rerank, generation).
This surfaces cost anomalies (one tenant is 10x expensive; one query type is dominating) and informs pricing decisions.
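A minimal sketch of that attribution: tag every LLM call with its dimensions and aggregate. The dimension names and figures are illustrative:

```python
from collections import defaultdict

costs: dict[tuple[str, str], float] = defaultdict(float)

def record_cost(tenant: str, query_type: str, usd: float) -> None:
    """Attribute one call's cost along each tracked dimension."""
    costs[("tenant", tenant)] += usd
    costs[("query_type", query_type)] += usd

record_cost("acme", "simple", 0.002)
record_cost("acme", "complex", 0.015)
record_cost("globex", "simple", 0.002)
```

In production the same tags would go on metrics emitted per request; the point is that every call carries enough context to answer "who is spending this money, and on what?"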
Protect against runaway costs: hard per-tenant rate limits, a maximum token budget per request, and alerts when daily spend crosses a threshold.
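The simplest of these guards, a daily spend circuit breaker, can be sketched in a few lines (the cap and failure behavior are illustrative choices; many teams prefer degrading to a cheap model over failing closed):

```python
class BudgetGuard:
    """Fail closed once the day's LLM spend hits a hard cap."""

    def __init__(self, daily_cap_usd: float):
        self.daily_cap_usd = daily_cap_usd
        self.spent_today = 0.0   # reset by a daily job in real use

    def charge(self, usd: float) -> None:
        if self.spent_today + usd > self.daily_cap_usd:
            raise RuntimeError("daily LLM budget exceeded")
        self.spent_today += usd
```

A retry loop or a bug in an ingestion worker can burn through a month's budget overnight; this is the backstop that turns that into a loud error instead of a loud invoice.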
Every optimization has a quality cost. Track both the cost delta and the quality delta:
Ship optimizations only when quality impact is acceptable. Measure against your eval set.
Embedding 10M new documents costs roughly $1,000-2,000 on commercial APIs. Self-hosted, it's compute cost only, but can take days.
Changing embedding models means re-embedding everything. Budget accordingly.
Running evals, A/B tests, and debugging production issues all involve LLM calls; together they can be 10-20% of operational spend.
Development usage (engineers testing changes) is often more expensive than production if left unmonitored.
Essential metrics: cost per query, daily spend by component, cache hit rate, and tokens per query.
Premature cost optimization kills RAG projects: get quality right first, then optimize cost against a measured baseline.
Before adding a feature that increases cost (agentic RAG, GraphRAG, multi-query), ask: what's the quality gain, and what's the cost delta? If quality improves 5% and cost doubles, that's probably the wrong tradeoff. If quality improves 30% and cost increases 50%, it may be worth it.
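That rule of thumb can be captured in a toy decision helper. The threshold is an assumption, not a standard; pick one that matches your margins:

```python
def worth_shipping(quality_gain_pct: float, cost_increase_pct: float,
                   min_gain_per_cost: float = 0.5) -> bool:
    """Require at least `min_gain_per_cost` points of quality gain
    per point of cost increase before shipping a feature."""
    if cost_increase_pct <= 0:
        return quality_gain_pct >= 0   # cheaper and no worse: ship it
    return quality_gain_pct / cost_increase_pct >= min_gain_per_cost
```

With the default threshold, the two cases from the text come out as expected: +5% quality for +100% cost is rejected, +30% for +50% passes.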
Measurement first. Cost decisions second.