Cost management

RAG has a cost structure that surprises most teams. At 100 queries per day it's invisible. At 100,000 queries per day, it's a meaningful monthly bill. Planning for this from the start saves expensive rework when the system actually scales.

The cost components

1. LLM generation

Usually the biggest line item. Priced per input token and output token, with output tokens typically 3-5x more expensive.

Rough costs (per million tokens, 2026 pricing):

2. Embedding generation

Priced per input token. The cost compounds with every reindex, since changing models or chunking means re-embedding the corpus.

3. Vector database

Managed services typically price by stored vectors and query volume; self-hosted deployments pay in compute and memory instead.

4. Reranker

Typically priced per search request, so cost scales with query volume and with how many candidates you rerank.

5. Infrastructure

Application servers, ingestion workers, monitoring, logging. Usually 5-15% of total cost.

Per-query cost example

A typical production query, no optimization:

Query embedding:      512 tokens × $0.02/M    = $0.00001
Vector search:        (included in DB cost)
Reranker (Cohere):    50 candidates           = $0.00005
Generation (GPT-4o):
  input: 2000 tokens × $5/M                   = $0.01
  output: 300 tokens × $15/M                  = $0.0045
Total:                                          ~$0.015 per query

At 100K queries/day: $1,500/day = ~$45,000/month.
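The arithmetic above can be checked with a few lines. Prices here are the example prices from the breakdown, not current list prices:

```python
# Reproduces the per-query cost breakdown above.
def per_query_cost(
    embed_tokens=512, embed_price_per_m=0.02,
    rerank_flat=0.00005,          # flat reranker fee per query (example figure)
    input_tokens=2000, input_price_per_m=5.0,
    output_tokens=300, output_price_per_m=15.0,
):
    embed = embed_tokens * embed_price_per_m / 1e6
    gen_in = input_tokens * input_price_per_m / 1e6
    gen_out = output_tokens * output_price_per_m / 1e6
    return embed + rerank_flat + gen_in + gen_out

cost = per_query_cost()
print(f"per query: ${cost:.4f}")                   # ~$0.0146, i.e. ~$0.015
print(f"per month: ${cost * 100_000 * 30:,.0f}")   # at 100K queries/day
```

Note that generation dominates: the embedding and reranking lines are rounding error next to the LLM call.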

The optimization frontier

Model routing

Use cheap models where possible. Classify queries first: often around 70% can be served by gpt-4o-mini or Haiku, saving 80-90% on those queries.
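A minimal routing sketch. The keyword heuristic and model names are placeholders; a production router would use a trained classifier:

```python
# Route simple queries to a cheap model, complex ones to a frontier model.
CHEAP_MODEL = "gpt-4o-mini"
EXPENSIVE_MODEL = "gpt-4o"

# Illustrative markers of analytical queries; tune against real traffic.
COMPLEX_MARKERS = ("compare", "analyze", "why", "tradeoff")

def route(query: str) -> str:
    q = query.lower()
    if any(marker in q for marker in COMPLEX_MARKERS) or len(q.split()) > 40:
        return EXPENSIVE_MODEL
    return CHEAP_MODEL
```

Even a crude router like this captures most of the savings, because lookup-style questions dominate most query logs.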

Context trimming

Every input token you don't send is money saved. Trim retrieved chunks before sending them to the LLM, reduce top-k, and use shorter prompts.
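One simple form of trimming is a hard token budget over the (already relevance-ranked) chunks. The tokens-per-character ratio here is a rough heuristic; a real tokenizer is more accurate:

```python
def trim_context(chunks: list[str], max_tokens: int,
                 tokens_per_char: float = 0.25) -> list[str]:
    # Keep highest-ranked chunks (assumed pre-sorted by relevance)
    # until the estimated token budget is exhausted.
    kept, used = [], 0
    for chunk in chunks:
        est = int(len(chunk) * tokens_per_char) + 1  # rough token estimate
        if used + est > max_tokens:
            break
        kept.append(chunk)
        used += est
    return kept
```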

Caching

Every cache hit is a query you don't pay to regenerate. See the chapter on caching.

Prompt caching (provider feature)

Cache the repeated prompt prefix (the system prompt, few-shot examples) on providers like Claude or Gemini, for 30-90% savings on input tokens for common prompts.
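As a sketch, here is an Anthropic-style request body with the system prompt marked cacheable. Field names follow Anthropic's Messages API as of this writing, and the model name is a placeholder; check your provider's docs before relying on this:

```python
SYSTEM_PROMPT = "You are a support assistant. Answer only from the provided context."

def build_request(question: str, context: str) -> dict:
    return {
        "model": "claude-sonnet-4",  # placeholder model name
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # Marks this block for provider-side prompt caching;
                # repeat requests reuse the cached prefix at a discount.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ],
    }
```

The retrieved context goes in the user message, outside the cached prefix, since it changes per query.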

Batching

For async workloads (document enrichment, bulk processing), batch API calls. Most providers give 50% discount on batch APIs.
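Batch submission usually means building a request file rather than calling the API in a loop. A sketch in the OpenAI Batch API's JSONL format (one request per line; verify the format against current docs before use):

```python
import json

def batch_lines(prompts: list[str], model: str = "gpt-4o-mini") -> list[str]:
    """Build JSONL lines for a batch-API input file, one request per prompt."""
    lines = []
    for i, prompt in enumerate(prompts):
        lines.append(json.dumps({
            "custom_id": f"req-{i}",   # used to match results back to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }))
    return lines
```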

Self-hosted inference

At scale (millions of queries/month), self-hosting becomes cheaper. The crossover is often 10-50M tokens/day.
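The crossover is a back-of-envelope calculation. The GPU price and the example figures below are illustrative assumptions, not benchmarks:

```python
def api_cost_per_day(tokens_per_day: float, price_per_m: float) -> float:
    """Daily API spend at a given per-million-token price."""
    return tokens_per_day / 1e6 * price_per_m

def self_host_cost_per_day(gpu_hourly: float, gpus: int) -> float:
    """Daily cost of an always-on GPU fleet."""
    return gpu_hourly * gpus * 24

# Example: 20M tokens/day at $5/M vs. two GPUs at $2/hour.
api = api_cost_per_day(20e6, 5.0)        # $100/day
hosted = self_host_cost_per_day(2.0, 2)  # $96/day -- near the crossover
```

Remember that self-hosting also buys you ops burden: the comparison should include engineering time, not just the GPU bill.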

Smaller embeddings

Use Matryoshka Representation Learning (MRL) embeddings to truncate vector dimensions, for roughly 3x savings on vector DB storage.
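The truncation itself is trivial: keep the first k dimensions and re-normalize. This is only valid for embedding models trained with MRL, where the leading dimensions carry most of the signal:

```python
import math

def truncate_embedding(vec: list[float], k: int) -> list[float]:
    """Keep the first k dimensions of an MRL embedding and L2-renormalize."""
    head = vec[:k]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0  # guard against zero vector
    return [x / norm for x in head]
```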

Cost allocation

Track cost per:

- tenant or customer
- query type
- pipeline stage (embedding, retrieval, reranking, generation)
- model

This surfaces cost anomalies (one tenant is 10x expensive; one query type is dominating) and informs pricing decisions.
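A minimal allocation sketch: tag every query's cost with its dimensions and aggregate. The dimension names are illustrative:

```python
from collections import defaultdict

class CostLedger:
    """Accumulate per-query costs along arbitrary dimensions."""

    def __init__(self):
        self.totals = defaultdict(float)

    def record(self, cost: float, **dims):
        # Attribute the same cost to each dimension it was tagged with.
        for key, value in dims.items():
            self.totals[(key, value)] += cost

    def by(self, key: str) -> dict:
        # Roll up totals for one dimension, e.g. by("tenant").
        return {v: c for (k, v), c in self.totals.items() if k == key}

ledger = CostLedger()
ledger.record(0.015, tenant="acme", query_type="qa", model="gpt-4o")
ledger.record(0.002, tenant="acme", query_type="qa", model="gpt-4o-mini")
ledger.record(0.015, tenant="globex", query_type="summarize", model="gpt-4o")
```

In production the same tags would go on spans or log lines and feed a dashboard, but the aggregation logic is this simple.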

Budget guardrails

Protect against runaway costs:

- alerts when daily spend crosses a threshold
- per-tenant rate limits and spend caps
- a hard daily cap that degrades to a cheaper model rather than failing outright

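A guardrail can be as simple as a spend counter checked before each call. A sketch with illustrative caps; a production version would persist counters outside process memory (e.g. Redis) and reset them daily:

```python
class BudgetGuard:
    """Reject requests once a daily or per-tenant spend cap is reached."""

    def __init__(self, daily_cap: float, per_tenant_cap: float):
        self.daily_cap = daily_cap
        self.per_tenant_cap = per_tenant_cap
        self.spent = 0.0
        self.tenant_spent = {}

    def allow(self, tenant: str, est_cost: float) -> bool:
        t = self.tenant_spent.get(tenant, 0.0)
        if self.spent + est_cost > self.daily_cap:
            return False
        if t + est_cost > self.per_tenant_cap:
            return False
        self.spent += est_cost
        self.tenant_spent[tenant] = t + est_cost
        return True
```

On rejection you can fall back to a cheaper model or a cached answer instead of returning an error.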
Cost-quality tradeoffs

Every optimization has a quality cost. Track both:

- cost per query, before and after the change
- quality on your eval set, before and after the change

Ship optimizations only when quality impact is acceptable. Measure against your eval set.

The hidden costs

Ingestion

Embedding 10M new documents costs roughly $1,000-2,000 on commercial APIs. Self-hosting avoids the API bill, but the job can take days of compute.
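Where that range comes from is simple arithmetic. The tokens-per-document figure and price below are assumptions; substitute your corpus stats and provider pricing:

```python
def ingestion_cost(docs: int, avg_tokens: int, price_per_m: float) -> float:
    """Estimated embedding cost for a corpus, in dollars."""
    return docs * avg_tokens / 1e6 * price_per_m

# 10M documents at ~1000 tokens each, $0.13/M tokens:
cost = ingestion_cost(10_000_000, 1000, 0.13)  # ~$1,300
```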

Reindexing

Changing embedding models means re-embedding everything. Budget accordingly.

Experimentation

Running evals, A/B tests, and debugging production issues all involve LLM calls, and together they can be 10-20% of operational spend.

Development

Your engineers testing changes can spend more than production traffic does if nobody is monitoring it.

Cost monitoring dashboard

Essential metrics:

- cost per query (average and p95)
- daily spend by component (generation, embedding, reranking, vector DB)
- cost per tenant and per query type
- cache hit rate

When to optimize vs when to ship

Premature cost optimization kills RAG projects. At low query volume the bill is trivial; get retrieval and generation quality right first. Optimize once real traffic shows where the money actually goes, and measure quality impact as you do.

The ROI check

Before adding a feature that increases cost (agentic RAG, GraphRAG, multi-query), ask: what's the quality gain, and what's the cost delta? If quality improves 5% and cost doubles, that's probably the wrong tradeoff. If quality improves 30% and cost increases 50%, it may be worth it.
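The heuristic above can be made explicit. The threshold is a judgment call, not a rule:

```python
def roi_acceptable(quality_gain: float, cost_increase: float,
                   min_ratio: float = 0.5) -> bool:
    """quality_gain and cost_increase as fractions, e.g. 0.05 = 5%.

    Accept a feature when the quality gain per unit of extra cost
    clears a minimum ratio.
    """
    if cost_increase <= 0:
        return True  # cheaper and better: always ship
    return quality_gain / cost_increase >= min_ratio

roi_acceptable(0.05, 1.0)   # +5% quality for 2x cost  -> False
roi_acceptable(0.30, 0.5)   # +30% quality for +50% cost -> True
```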

Measurement first. Cost decisions second.

Next: Security and prompt injection.