Cost control

Agent costs can sneak up on you in a way single-call costs don't. A stuck session can cost $10 instead of $0.10. One bug multiplied across a thousand users can burn $10K in an afternoon. Cost control isn't a nice-to-have; it's what separates agents-in-production from agents-that-got-their-team-yelled-at. This page is the playbook.

Where the cost actually goes

LLM tokens dominate almost every agent's bill. Input tokens grow with context length; output tokens grow with how much reasoning + response the model produces. Optimize those first.

Six cost-reduction techniques (in order of ROI)

Prompt caching: do this first

Anthropic's Claude, Google's Gemini, and OpenAI's models all support caching the stable parts of a prompt. Your system prompt is the same every turn; cache it. On Anthropic, cached input tokens are billed at roughly 10% of the normal input rate (cache writes carry a small premium), and cache hits are the norm in multi-turn sessions. For an agent with a 2K-token system prompt and 15-turn sessions, this alone can cut the bill in half.
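The arithmetic behind that claim can be sketched directly. This is a back-of-envelope model, not a billing calculator: the 10% read rate and 25% write premium are assumptions modeled on Anthropic's published pricing, and `price_per_mtok` is a placeholder.

```python
# Hedged sketch: estimate prompt-caching savings for a multi-turn session.
# Assumed pricing semantics: cache reads cost ~10% of a normal input token,
# the first cache write costs ~125%.

def session_system_prompt_cost(system_tokens: int, turns: int,
                               price_per_mtok: float = 3.00,
                               cached: bool = False,
                               read_ratio: float = 0.10,
                               write_ratio: float = 1.25) -> float:
    """Cost of re-sending the system prompt across `turns` LLM calls."""
    per_tok = price_per_mtok / 1_000_000
    if not cached:
        return system_tokens * turns * per_tok
    # First turn writes the cache (small premium), later turns read it cheaply.
    write = system_tokens * per_tok * write_ratio
    reads = system_tokens * (turns - 1) * per_tok * read_ratio
    return write + reads

uncached = session_system_prompt_cost(2_000, 15)
with_cache = session_system_prompt_cost(2_000, 15, cached=True)
print(f"uncached ${uncached:.4f} vs cached ${with_cache:.4f}")
```

On the system-prompt portion alone the saving is well over half; the whole-bill figure depends on how much of each request the system prompt represents.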

Model routing: cheap for easy, premium for hard

Not every task needs the flagship model. A lightweight classifier (or even a simple rule) decides: "This looks simple → cheap model. This looks complex → premium." This can cut costs 70-80% on query-heavy agents where most requests are easy. Build in escalation too: the cheap model tries first and escalates if it gets stuck.
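A minimal routing-plus-escalation sketch. The model names, the complexity heuristic, and the `"STUCK"` sentinel are all illustrative placeholders, not real API identifiers.

```python
# Rule-based model routing with one-shot escalation (names are placeholders).

CHEAP, PREMIUM = "small-model", "flagship-model"

def looks_complex(query: str) -> bool:
    # Toy heuristic: long queries, or ones asking for multi-step work,
    # go to the premium model. Real routers often use a tiny classifier.
    return len(query) > 400 or any(
        kw in query.lower() for kw in ("refactor", "analyze", "multi-step"))

def route(query: str) -> str:
    return PREMIUM if looks_complex(query) else CHEAP

def answer(query: str, call_model) -> str:
    # Escalation: try the routed model first; if it signals it is stuck,
    # retry once on the premium model.
    model = route(query)
    result = call_model(model, query)
    if result == "STUCK" and model != PREMIUM:
        result = call_model(PREMIUM, query)
    return result
```

The escalation path matters: without it, a misrouted hard query just fails instead of costing slightly more.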

Context trimming

A 20-turn agent session that sends 80K tokens every turn is mostly paying to resend old context. When the context passes a threshold, summarize old turns into a few hundred tokens. This can drop per-turn cost 40-70% on long sessions.
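A sketch of threshold-based trimming. `summarize` is a stub standing in for a cheap-model summarization call, and the 4-characters-per-token estimate is a crude rule of thumb, not a real tokenizer.

```python
# Threshold-based context trimming: keep recent turns verbatim,
# collapse everything older into one summary entry.

def rough_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # ~4 chars/token, a common approximation

def summarize(turns: list[str]) -> str:
    # Stub: a real agent would ask a cheap model to compress these turns.
    return f"[summary of {len(turns)} earlier turns]"

def trim_context(history: list[str], budget_tokens: int = 8_000,
                 keep_recent: int = 4) -> list[str]:
    total = sum(rough_tokens(t) for t in history)
    if total <= budget_tokens or len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent
```

Keeping the last few turns verbatim preserves short-range coherence; only the stale middle of the conversation gets compressed.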

Tool output limits

A tool that returns 50K tokens of text blows up the next LLM call. Cap every tool's output at a reasonable size (e.g., 4K tokens). If the agent needs more, it can call a follow-up tool with a narrower scope.
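Capping is most reliable when it's applied uniformly at the tool boundary rather than inside each tool. A sketch, again using the rough 4-chars-per-token assumption:

```python
# Hard-cap every tool's output before it reaches the next LLM call.

MAX_TOOL_TOKENS = 4_000

def cap_output(text: str, max_tokens: int = MAX_TOOL_TOKENS) -> str:
    max_chars = max_tokens * 4  # crude chars-per-token estimate
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "\n[output truncated; call a narrower tool for more]"

def capped(tool):
    # Decorator applying the cap to any tool function.
    def wrapper(*args, **kwargs):
        return cap_output(str(tool(*args, **kwargs)))
    return wrapper
```

The truncation notice doubles as a hint to the model that a narrower follow-up call is available.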

Per-session budget (the safety net)

A hard cap at $0.50, or whatever fits your product. When it's hit, terminate gracefully. This doesn't reduce average cost, but it prevents the $10 runaway from happening at all. This is the one thing you absolutely cannot ship without.
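The mechanism is small. A minimal sketch (class and exception names are illustrative):

```python
# Hard per-session spending cap. The agent loop catches BudgetExceeded
# and ends the session with a graceful hand-off message.

class BudgetExceeded(Exception):
    pass

class SessionBudget:
    def __init__(self, cap_usd: float = 0.50):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        # Called after every LLM/tool call with its actual cost.
        self.spent_usd += cost_usd
        if self.spent_usd >= self.cap_usd:
            raise BudgetExceeded(f"session spent ${self.spent_usd:.2f}")
```

The key design choice is that `charge` raises rather than returning a flag: a forgotten check can't silently let the session keep spending.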

Caching expensive tool results

Same search query within the session → cached result. Same LLM prompt with deterministic settings → cached. Especially valuable when your agent naturally repeats queries (research, debugging, customer lookup).
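A per-session memoization sketch. It assumes the tool is deterministic within a session, which is exactly why the cache is scoped to the session rather than shared globally:

```python
# Per-session cache for expensive, repeatable tool calls,
# keyed on (tool name, arguments).

class SessionCache:
    def __init__(self):
        self._store = {}
        self.hits = 0

    def call(self, name: str, fn, *args):
        key = (name, args)
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = fn(*args)
        return self._store[key]
```

Tracking `hits` is worth the two lines: a low hit rate tells you the cache isn't earning its complexity for this agent.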

The four-layer budget stack

The budgets are layered so that a failure at any single layer doesn't cascade into unbounded spend.
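The four layers aren't enumerated on this page, so the sketch below assumes a common stack: per-request, per-session, per-user-per-day, and global-per-day. Every spend is charged against all layers, and any one of them can refuse it.

```python
# Layered budget stack (layer names and caps are assumed, not from this page).

class Layer:
    def __init__(self, name: str, cap_usd: float):
        self.name, self.cap_usd, self.spent = name, cap_usd, 0.0

    def allows(self, cost: float) -> bool:
        return self.spent + cost <= self.cap_usd

def charge(layers: list, cost: float) -> list:
    # Returns the names of layers that refused the spend (empty = accepted).
    blocked = [l.name for l in layers if not l.allows(cost)]
    if blocked:
        return blocked  # caller terminates or degrades gracefully
    for l in layers:
        l.spent += cost
    return []

stack = [Layer("request", 0.05), Layer("session", 0.50),
         Layer("user/day", 5.00), Layer("global/day", 2_000.00)]
```

The check-then-commit shape is deliberate: a spend blocked at any layer is charged to none of them, so the layers stay consistent with each other.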

Measure cost per outcome, not per request

A task that takes 3 sessions to complete costs 3× what a single session shows. Track cost-per-resolved-task (or cost-per-successful-outcome). That's the number that matters for unit economics, not cost-per-request.
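The aggregation is a one-liner once session records carry a task id. A sketch with illustrative field names:

```python
# Cost per resolved task: total spend across all sessions,
# divided by the number of distinct tasks actually resolved.

def cost_per_resolved_task(sessions: list[dict]) -> float:
    # Each session record: {"task_id": ..., "cost": ..., "resolved": bool}
    total_cost = sum(s["cost"] for s in sessions)
    resolved = {s["task_id"] for s in sessions if s["resolved"]}
    if not resolved:
        return float("inf")
    return total_cost / len(resolved)
```

Note the numerator includes the failed sessions too; that's the point of the metric. Three $0.10 sessions that finally resolve one task cost $0.30 per outcome, not $0.10.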

A worked example

A team's support agent averages $0.28 per ticket; at 50K tickets, that's $14K/month. Actions:

Result: $10K/month recovered with a week of engineering, and no quality drop (verified against the eval set).
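Checking the unit economics of that example: $0.28/ticket at 50K tickets/month is $14K; clawing back $10K/month implies a new average of $0.08/ticket.

```python
# Back-of-envelope check on the worked example's numbers.

tickets = 50_000
before = 0.28 * tickets        # monthly spend before: $14,000
after = before - 10_000        # monthly spend after: $4,000
per_ticket_after = after / tickets
print(per_ticket_after)
```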

Cost alerts

Set alerts on:
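A minimal alerting sketch, assuming two common signals (the signal choices and thresholds are illustrative, not prescribed by this playbook): daily spend crossing a budget line, and a single session far above the recent average.

```python
# Two illustrative cost alerts: daily overspend and per-session outliers.

def spend_alerts(daily_spend: float, daily_budget: float,
                 session_cost: float, avg_session_cost: float,
                 outlier_multiple: float = 10.0) -> list[str]:
    alerts = []
    if daily_spend > daily_budget:
        alerts.append(
            f"daily spend ${daily_spend:.2f} over budget ${daily_budget:.2f}")
    if session_cost > outlier_multiple * avg_session_cost:
        alerts.append(
            f"session ${session_cost:.2f} is >{outlier_multiple:.0f}x the average")
    return alerts
```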

Pitfalls

What to do with this