Cost management

RAG has a cost structure that surprises most teams. At 100 queries per day it's invisible. At 100,000 queries per day, it's a meaningful monthly bill. Planning for this from the start saves expensive rework when the system actually scales.

The cost components

1. LLM generation

Usually the biggest line item. Priced per input token and output token, with output tokens typically 3-5x more expensive.

Rough costs (per million tokens, 2026 pricing):

2. Embedding generation

Priced per input token. The cost compounds with every reindex, since changing models or chunking means re-embedding the corpus.

3. Vector database

Managed services typically price by stored vectors and query volume; self-hosted deployments pay in compute and memory instead.

4. Reranker

Typically priced per search request, so cost scales with query volume and with how many candidates you rerank.

5. Infrastructure

Application servers, ingestion workers, monitoring, logging. Usually 5-15% of total cost.

Per-query cost example

A typical production query, no optimization:

Query embedding:      512 tokens × $0.02/M    = $0.00001
Vector search:        (included in DB cost)
Reranker (Cohere):    50 candidates           = $0.00005
Generation (GPT-4o):
  input: 2000 tokens × $5/M                   = $0.01
  output: 300 tokens × $15/M                  = $0.0045
Total:                                          ~$0.015 per query

At 100K queries/day: $1,500/day = ~$45,000/month.
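The arithmetic above can be checked with a few lines. Prices here are the example prices from the breakdown, not current list prices:

```python
# Reproduces the per-query cost breakdown above.
def per_query_cost(
    embed_tokens=512, embed_price_per_m=0.02,
    rerank_flat=0.00005,          # flat reranker fee per query (example figure)
    input_tokens=2000, input_price_per_m=5.0,
    output_tokens=300, output_price_per_m=15.0,
):
    embed = embed_tokens * embed_price_per_m / 1e6
    gen_in = input_tokens * input_price_per_m / 1e6
    gen_out = output_tokens * output_price_per_m / 1e6
    return embed + rerank_flat + gen_in + gen_out

cost = per_query_cost()
print(f"per query: ${cost:.4f}")                   # ~$0.0146, i.e. ~$0.015
print(f"per month: ${cost * 100_000 * 30:,.0f}")   # at 100K queries/day
```

Note that generation dominates: the embedding and reranking lines are rounding error next to the LLM call.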

The optimization frontier

Model routing

Use cheap models where possible. Classify queries first: often around 70% can be served by gpt-4o-mini or Haiku, saving 80-90% on those queries.
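A minimal routing sketch. The keyword heuristic and model names are placeholders; a production router would use a trained classifier:

```python
# Route simple queries to a cheap model, complex ones to a frontier model.
CHEAP_MODEL = "gpt-4o-mini"
EXPENSIVE_MODEL = "gpt-4o"

# Illustrative markers of analytical queries; tune against real traffic.
COMPLEX_MARKERS = ("compare", "analyze", "why", "tradeoff")

def route(query: str) -> str:
    q = query.lower()
    if any(marker in q for marker in COMPLEX_MARKERS) or len(q.split()) > 40:
        return EXPENSIVE_MODEL
    return CHEAP_MODEL
```

Even a crude router like this captures most of the savings, because lookup-style questions dominate most query logs.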

Context trimming

Every input token you don't send is money saved. Trim retrieved chunks before sending them to the LLM, reduce top-k, and use shorter prompts.
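One simple form of trimming is a hard token budget over the (already relevance-ranked) chunks. The tokens-per-character ratio here is a rough heuristic; a real tokenizer is more accurate:

```python
def trim_context(chunks: list[str], max_tokens: int,
                 tokens_per_char: float = 0.25) -> list[str]:
    # Keep highest-ranked chunks (assumed pre-sorted by relevance)
    # until the estimated token budget is exhausted.
    kept, used = [], 0
    for chunk in chunks:
        est = int(len(chunk) * tokens_per_char) + 1  # rough token estimate
        if used + est > max_tokens:
            break
        kept.append(chunk)
        used += est
    return kept
```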

Caching

Every cache hit is a query you don't pay to regenerate. See the chapter on caching.

Prompt caching (provider feature)

Cache the repeated prompt prefix (the system prompt, few-shot examples) on providers like Claude or Gemini, for 30-90% savings on input tokens for common prompts.
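As a sketch, here is an Anthropic-style request body with the system prompt marked cacheable. Field names follow Anthropic's Messages API as of this writing, and the model name is a placeholder; check your provider's docs before relying on this:

```python
SYSTEM_PROMPT = "You are a support assistant. Answer only from the provided context."

def build_request(question: str, context: str) -> dict:
    return {
        "model": "claude-sonnet-4",  # placeholder model name
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # Marks this block for provider-side prompt caching;
                # repeat requests reuse the cached prefix at a discount.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ],
    }
```

The retrieved context goes in the user message, outside the cached prefix, since it changes per query.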

Batching

For async workloads (document enrichment, bulk processing), batch API calls. Most providers give 50% discount on batch APIs.
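Batch submission usually means building a request file rather than calling the API in a loop. A sketch in the OpenAI Batch API's JSONL format (one request per line; verify the format against current docs before use):

```python
import json

def batch_lines(prompts: list[str], model: str = "gpt-4o-mini") -> list[str]:
    """Build JSONL lines for a batch-API input file, one request per prompt."""
    lines = []
    for i, prompt in enumerate(prompts):
        lines.append(json.dumps({
            "custom_id": f"req-{i}",   # used to match results back to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }))
    return lines
```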

Self-hosted inference

At scale (millions of queries/month), self-hosting becomes cheaper. The crossover is often 10-50M tokens/day.
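The crossover is a back-of-envelope calculation. The GPU price and the example figures below are illustrative assumptions, not benchmarks:

```python
def api_cost_per_day(tokens_per_day: float, price_per_m: float) -> float:
    """Daily API spend at a given per-million-token price."""
    return tokens_per_day / 1e6 * price_per_m

def self_host_cost_per_day(gpu_hourly: float, gpus: int) -> float:
    """Daily cost of an always-on GPU fleet."""
    return gpu_hourly * gpus * 24

# Example: 20M tokens/day at $5/M vs. two GPUs at $2/hour.
api = api_cost_per_day(20e6, 5.0)        # $100/day
hosted = self_host_cost_per_day(2.0, 2)  # $96/day -- near the crossover
```

Remember that self-hosting also buys you ops burden: the comparison should include engineering time, not just the GPU bill.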

Smaller embeddings

Use Matryoshka Representation Learning (MRL) embeddings to truncate vector dimensions, for roughly 3x savings on vector DB storage.
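The truncation itself is trivial: keep the first k dimensions and re-normalize. This is only valid for embedding models trained with MRL, where the leading dimensions carry most of the signal:

```python
import math

def truncate_embedding(vec: list[float], k: int) -> list[float]:
    """Keep the first k dimensions of an MRL embedding and L2-renormalize."""
    head = vec[:k]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0  # guard against zero vector
    return [x / norm for x in head]
```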

Cost allocation

Track cost per:

- tenant or customer
- query type
- pipeline stage (embedding, retrieval, reranking, generation)
- model

This surfaces cost anomalies (one tenant is 10x expensive; one query type is dominating) and informs pricing decisions.
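A minimal allocation sketch: tag every query's cost with its dimensions and aggregate. The dimension names are illustrative:

```python
from collections import defaultdict

class CostLedger:
    """Accumulate per-query costs along arbitrary dimensions."""

    def __init__(self):
        self.totals = defaultdict(float)

    def record(self, cost: float, **dims):
        # Attribute the same cost to each dimension it was tagged with.
        for key, value in dims.items():
            self.totals[(key, value)] += cost

    def by(self, key: str) -> dict:
        # Roll up totals for one dimension, e.g. by("tenant").
        return {v: c for (k, v), c in self.totals.items() if k == key}

ledger = CostLedger()
ledger.record(0.015, tenant="acme", query_type="qa", model="gpt-4o")
ledger.record(0.002, tenant="acme", query_type="qa", model="gpt-4o-mini")
ledger.record(0.015, tenant="globex", query_type="summarize", model="gpt-4o")
```

In production the same tags would go on spans or log lines and feed a dashboard, but the aggregation logic is this simple.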

Budget guardrails

Protect against runaway costs:

- alerts when daily spend crosses a threshold
- per-tenant rate limits and spend caps
- a hard daily cap that degrades to a cheaper model rather than failing outright

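A guardrail can be as simple as a spend counter checked before each call. A sketch with illustrative caps; a production version would persist counters outside process memory (e.g. Redis) and reset them daily:

```python
class BudgetGuard:
    """Reject requests once a daily or per-tenant spend cap is reached."""

    def __init__(self, daily_cap: float, per_tenant_cap: float):
        self.daily_cap = daily_cap
        self.per_tenant_cap = per_tenant_cap
        self.spent = 0.0
        self.tenant_spent = {}

    def allow(self, tenant: str, est_cost: float) -> bool:
        t = self.tenant_spent.get(tenant, 0.0)
        if self.spent + est_cost > self.daily_cap:
            return False
        if t + est_cost > self.per_tenant_cap:
            return False
        self.spent += est_cost
        self.tenant_spent[tenant] = t + est_cost
        return True
```

On rejection you can fall back to a cheaper model or a cached answer instead of returning an error.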
Cost-quality tradeoffs

Every optimization has a quality cost. Track both:

- cost per query, before and after the change
- quality on your eval set, before and after the change

Ship optimizations only when quality impact is acceptable. Measure against your eval set.

The hidden costs

Ingestion

Embedding 10M new documents costs roughly $1,000-2,000 on commercial APIs. Self-hosting avoids the API bill, but the job can take days of compute.
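Where that range comes from is simple arithmetic. The tokens-per-document figure and price below are assumptions; substitute your corpus stats and provider pricing:

```python
def ingestion_cost(docs: int, avg_tokens: int, price_per_m: float) -> float:
    """Estimated embedding cost for a corpus, in dollars."""
    return docs * avg_tokens / 1e6 * price_per_m

# 10M documents at ~1000 tokens each, $0.13/M tokens:
cost = ingestion_cost(10_000_000, 1000, 0.13)  # ~$1,300
```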

Reindexing

Changing embedding models means re-embedding everything. Budget accordingly.

Experimentation

Running evals, A/B tests, and debugging production issues all involve LLM calls, and together they can be 10-20% of operational spend.

Development

Your engineers testing changes can spend more than production traffic does if nobody is monitoring it.

Cost monitoring dashboard

Essential metrics:

- cost per query (average and p95)
- daily spend by component (generation, embedding, reranking, vector DB)
- cost per tenant and per query type
- cache hit rate

When to optimize vs when to ship

Premature cost optimization kills RAG projects. At low query volume the bill is trivial; get retrieval and generation quality right first. Optimize once real traffic shows where the money actually goes, and measure quality impact as you do.

The ROI check

Before adding a feature that increases cost (agentic RAG, GraphRAG, multi-query), ask: what's the quality gain, and what's the cost delta? If quality improves 5% and cost doubles, that's probably the wrong tradeoff. If quality improves 30% and cost increases 50%, it may be worth it.
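The heuristic above can be made explicit. The threshold is a judgment call, not a rule:

```python
def roi_acceptable(quality_gain: float, cost_increase: float,
                   min_ratio: float = 0.5) -> bool:
    """quality_gain and cost_increase as fractions, e.g. 0.05 = 5%.

    Accept a feature when the quality gain per unit of extra cost
    clears a minimum ratio.
    """
    if cost_increase <= 0:
        return True  # cheaper and better: always ship
    return quality_gain / cost_increase >= min_ratio

roi_acceptable(0.05, 1.0)   # +5% quality for 2x cost  -> False
roi_acceptable(0.30, 0.5)   # +30% quality for +50% cost -> True
```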

Measurement first. Cost decisions second.

Next: Security and prompt injection.