Cost control

Agent costs can sneak up on you in a way single-call costs don't. A stuck session can cost $10 instead of $0.10. One bug multiplied across a thousand users can burn $10K in an afternoon. Cost control isn't a nice-to-have; it's what separates agents-in-production from agents-that-got-their-team-yelled-at. This page is the playbook.

Where the cost actually goes

LLM tokens dominate almost every agent's bill. Input tokens grow with context length; output tokens grow with how much reasoning + response the model produces. Optimize those first.

Six cost-reduction techniques (in order of ROI)

Prompt caching: do this first

Anthropic's Claude, Google's Gemini, and OpenAI's models all support caching the stable parts of a prompt. Your system prompt is the same every turn; cache it. On Anthropic, cached input tokens are billed at roughly 10% of the normal input rate (cache writes carry a small premium), and cache hits are the norm in multi-turn sessions. For an agent with a 2K-token system prompt and 15-turn sessions, this alone can cut the bill in half.
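The arithmetic behind that claim can be sketched directly. This is a back-of-envelope model, not a billing calculator: the 10% read rate and 25% write premium are assumptions modeled on Anthropic's published pricing, and `price_per_mtok` is a placeholder.

```python
# Hedged sketch: estimate prompt-caching savings for a multi-turn session.
# Assumed pricing semantics: cache reads cost ~10% of a normal input token,
# the first cache write costs ~125%.

def session_system_prompt_cost(system_tokens: int, turns: int,
                               price_per_mtok: float = 3.00,
                               cached: bool = False,
                               read_ratio: float = 0.10,
                               write_ratio: float = 1.25) -> float:
    """Cost of re-sending the system prompt across `turns` LLM calls."""
    per_tok = price_per_mtok / 1_000_000
    if not cached:
        return system_tokens * turns * per_tok
    # First turn writes the cache (small premium), later turns read it cheaply.
    write = system_tokens * per_tok * write_ratio
    reads = system_tokens * (turns - 1) * per_tok * read_ratio
    return write + reads

uncached = session_system_prompt_cost(2_000, 15)
with_cache = session_system_prompt_cost(2_000, 15, cached=True)
print(f"uncached ${uncached:.4f} vs cached ${with_cache:.4f}")
```

On the system-prompt portion alone the saving is well over half; the whole-bill figure depends on how much of each request the system prompt represents.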

Model routing: cheap for easy, premium for hard

Not every task needs the flagship model. A lightweight classifier (or even a simple rule) decides: "This looks simple → cheap model. This looks complex → premium." This can cut costs 70-80% on query-heavy agents where most requests are easy. Build in escalation too: the cheap model tries first and escalates if it gets stuck.
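A minimal routing-plus-escalation sketch. The model names, the complexity heuristic, and the `"STUCK"` sentinel are all illustrative placeholders, not real API identifiers.

```python
# Rule-based model routing with one-shot escalation (names are placeholders).

CHEAP, PREMIUM = "small-model", "flagship-model"

def looks_complex(query: str) -> bool:
    # Toy heuristic: long queries, or ones asking for multi-step work,
    # go to the premium model. Real routers often use a tiny classifier.
    return len(query) > 400 or any(
        kw in query.lower() for kw in ("refactor", "analyze", "multi-step"))

def route(query: str) -> str:
    return PREMIUM if looks_complex(query) else CHEAP

def answer(query: str, call_model) -> str:
    # Escalation: try the routed model first; if it signals it is stuck,
    # retry once on the premium model.
    model = route(query)
    result = call_model(model, query)
    if result == "STUCK" and model != PREMIUM:
        result = call_model(PREMIUM, query)
    return result
```

The escalation path matters: without it, a misrouted hard query just fails instead of costing slightly more.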

Context trimming

A 20-turn agent session that sends 80K tokens every turn is mostly paying to resend old context. When the context passes a threshold, summarize old turns into a few hundred tokens. This can drop per-turn cost 40-70% on long sessions.
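A sketch of threshold-based trimming. `summarize` is a stub standing in for a cheap-model summarization call, and the 4-characters-per-token estimate is a crude rule of thumb, not a real tokenizer.

```python
# Threshold-based context trimming: keep recent turns verbatim,
# collapse everything older into one summary entry.

def rough_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # ~4 chars/token, a common approximation

def summarize(turns: list[str]) -> str:
    # Stub: a real agent would ask a cheap model to compress these turns.
    return f"[summary of {len(turns)} earlier turns]"

def trim_context(history: list[str], budget_tokens: int = 8_000,
                 keep_recent: int = 4) -> list[str]:
    total = sum(rough_tokens(t) for t in history)
    if total <= budget_tokens or len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent
```

Keeping the last few turns verbatim preserves short-range coherence; only the stale middle of the conversation gets compressed.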

Tool output limits

A tool that returns 50K tokens of text blows up the next LLM call. Cap every tool's output at a reasonable size (e.g., 4K tokens). If the agent needs more, it can call a follow-up tool with a narrower scope.
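Capping is most reliable when it's applied uniformly at the tool boundary rather than inside each tool. A sketch, again using the rough 4-chars-per-token assumption:

```python
# Hard-cap every tool's output before it reaches the next LLM call.

MAX_TOOL_TOKENS = 4_000

def cap_output(text: str, max_tokens: int = MAX_TOOL_TOKENS) -> str:
    max_chars = max_tokens * 4  # crude chars-per-token estimate
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "\n[output truncated; call a narrower tool for more]"

def capped(tool):
    # Decorator applying the cap to any tool function.
    def wrapper(*args, **kwargs):
        return cap_output(str(tool(*args, **kwargs)))
    return wrapper
```

The truncation notice doubles as a hint to the model that a narrower follow-up call is available.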

Per-session budget (the safety net)

A hard cap at $0.50, or whatever fits your product. When it's hit, terminate gracefully. This doesn't reduce average cost, but it prevents the $10 runaway from happening at all. This is the one thing you absolutely cannot ship without.
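The mechanism is small. A minimal sketch (class and exception names are illustrative):

```python
# Hard per-session spending cap. The agent loop catches BudgetExceeded
# and ends the session with a graceful hand-off message.

class BudgetExceeded(Exception):
    pass

class SessionBudget:
    def __init__(self, cap_usd: float = 0.50):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        # Called after every LLM/tool call with its actual cost.
        self.spent_usd += cost_usd
        if self.spent_usd >= self.cap_usd:
            raise BudgetExceeded(f"session spent ${self.spent_usd:.2f}")
```

The key design choice is that `charge` raises rather than returning a flag: a forgotten check can't silently let the session keep spending.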

Caching expensive tool results

Same search query within the session → cached result. Same LLM prompt with deterministic settings → cached. Especially valuable when your agent naturally repeats queries (research, debugging, customer lookup).
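A per-session memoization sketch. It assumes the tool is deterministic within a session, which is exactly why the cache is scoped to the session rather than shared globally:

```python
# Per-session cache for expensive, repeatable tool calls,
# keyed on (tool name, arguments).

class SessionCache:
    def __init__(self):
        self._store = {}
        self.hits = 0

    def call(self, name: str, fn, *args):
        key = (name, args)
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = fn(*args)
        return self._store[key]
```

Tracking `hits` is worth the two lines: a low hit rate tells you the cache isn't earning its complexity for this agent.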

The four-layer budget stack

The budgets are layered so that a failure at any single layer doesn't cascade into unbounded spend.
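The four layers aren't enumerated on this page, so the sketch below assumes a common stack: per-request, per-session, per-user-per-day, and global-per-day. Every spend is charged against all layers, and any one of them can refuse it.

```python
# Layered budget stack (layer names and caps are assumed, not from this page).

class Layer:
    def __init__(self, name: str, cap_usd: float):
        self.name, self.cap_usd, self.spent = name, cap_usd, 0.0

    def allows(self, cost: float) -> bool:
        return self.spent + cost <= self.cap_usd

def charge(layers: list, cost: float) -> list:
    # Returns the names of layers that refused the spend (empty = accepted).
    blocked = [l.name for l in layers if not l.allows(cost)]
    if blocked:
        return blocked  # caller terminates or degrades gracefully
    for l in layers:
        l.spent += cost
    return []

stack = [Layer("request", 0.05), Layer("session", 0.50),
         Layer("user/day", 5.00), Layer("global/day", 2_000.00)]
```

The check-then-commit shape is deliberate: a spend blocked at any layer is charged to none of them, so the layers stay consistent with each other.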

Measure cost per outcome, not per request

A task that takes 3 sessions to complete costs 3× what a single session shows. Track cost-per-resolved-task (or cost-per-successful-outcome). That's the number that matters for unit economics, not cost-per-request.
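The aggregation is a one-liner once session records carry a task id. A sketch with illustrative field names:

```python
# Cost per resolved task: total spend across all sessions,
# divided by the number of distinct tasks actually resolved.

def cost_per_resolved_task(sessions: list[dict]) -> float:
    # Each session record: {"task_id": ..., "cost": ..., "resolved": bool}
    total_cost = sum(s["cost"] for s in sessions)
    resolved = {s["task_id"] for s in sessions if s["resolved"]}
    if not resolved:
        return float("inf")
    return total_cost / len(resolved)
```

Note the numerator includes the failed sessions too; that's the point of the metric. Three $0.10 sessions that finally resolve one task cost $0.30 per outcome, not $0.10.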

A worked example

A team's support agent averages $0.28 per ticket; at 50K tickets, that's $14K/month. Actions:

Result: $10K/month recovered with a week of engineering, and no quality drop (verified against the eval set).
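Checking the unit economics of that example: $0.28/ticket at 50K tickets/month is $14K; clawing back $10K/month implies a new average of $0.08/ticket.

```python
# Back-of-envelope check on the worked example's numbers.

tickets = 50_000
before = 0.28 * tickets        # monthly spend before: $14,000
after = before - 10_000        # monthly spend after: $4,000
per_ticket_after = after / tickets
print(per_ticket_after)
```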

Cost alerts

Set alerts on:
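A minimal alerting sketch, assuming two common signals (the signal choices and thresholds are illustrative, not prescribed by this playbook): daily spend crossing a budget line, and a single session far above the recent average.

```python
# Two illustrative cost alerts: daily overspend and per-session outliers.

def spend_alerts(daily_spend: float, daily_budget: float,
                 session_cost: float, avg_session_cost: float,
                 outlier_multiple: float = 10.0) -> list[str]:
    alerts = []
    if daily_spend > daily_budget:
        alerts.append(
            f"daily spend ${daily_spend:.2f} over budget ${daily_budget:.2f}")
    if session_cost > outlier_multiple * avg_session_cost:
        alerts.append(
            f"session ${session_cost:.2f} is >{outlier_multiple:.0f}x the average")
    return alerts
```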

Pitfalls

What to do with this