Agent evaluation
📖 5 min read · Updated 2026-04-18
"It works" is not an evaluation. An agent without evals is a prototype. The #1 reason agents fail in production is nobody measured whether they worked before shipping.
Why evaluating agents is hard
- Nondeterminism. Same input → different output. Single-run correctness is meaningless.
- No ground truth for many tasks. "Write a good email" has no single right answer.
- Multi-step. Output quality depends on the whole trajectory, not just the final answer.
- Environment coupling. Agents act on systems. Those systems have state. Repeatability is hard.
The eval taxonomy
- Unit evals, specific input → expected output (or property). Deterministic parts of the agent (parsing, formatting).
- End-to-end evals, full-agent runs on labeled test cases, scored against expected behavior.
- Regression evals, a fixed suite you run on every change. Catches regressions when you change the prompt.
- Adversarial evals, intentionally tricky inputs to surface failure modes.
- Production shadow evals, run the agent on live traffic in shadow mode, compare to incumbent.
- Human eval, humans rate agent output on defined rubrics.
Building an eval harness
Minimum viable eval harness:
- A JSON/YAML file of test cases. Each case has input, expected behavior (free text or structured), optional expected output.
- A runner that executes the agent on each case, captures outputs.
- A grader that scores outputs. For structured outputs, exact match. For free text, use another LLM call (LLM-as-judge) with a rubric.
- A report showing pass rates, broken cases, and drift from last run.
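The pieces above fit in a few dozen lines. A minimal sketch for the structured-output case, assuming the agent is exposed as a plain callable (`input str -> output str`) and grading is exact match; the case file format and `run_suite` name are illustrative, not a standard:

```python
import json
from pathlib import Path

def run_suite(agent, cases_path: str) -> dict:
    """Run `agent` over a JSON file of test cases and score by exact match.
    Case format (assumed): {"input": ..., "expected_output": ...}."""
    cases = json.loads(Path(cases_path).read_text())
    results = []
    for case in cases:
        output = agent(case["input"])  # capture the agent's output
        results.append({
            "input": case["input"],
            "output": output,
            "passed": output == case.get("expected_output"),
        })
    return {
        "pass_rate": sum(r["passed"] for r in results) / len(results),
        "failures": [r for r in results if not r["passed"]],
    }
```

Free-text cases would swap the exact-match line for an LLM-as-judge call; the report's `failures` list is what you triage after each run.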
LLM-as-judge
Use a strong model (Opus-class) to grade another model's outputs against a rubric. Cheaper than human eval, often good enough.
Rules:
- The judge should be a different (and ideally stronger) model than the one being graded
- Rubric must be explicit. "rate 1-5 on accuracy" is not enough; define each score level
- Run the judge multiple times and average the scores; this reduces variance
- Spot-check with human eval monthly to catch judge drift
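The rubric and averaging rules can be sketched as follows. To stay model-agnostic, the judge is passed in as a callable (`prompt str -> reply str`) wrapping whatever strong-model API you use; the rubric wording and JSON reply format are illustrative:

```python
import json
import statistics

# Explicit rubric: every score level is defined, not just "rate 1-5".
RUBRIC = """Score the answer 1-5 for factual accuracy:
5: every claim correct. 4: one minor error. 3: mostly correct, one
notable error. 2: several errors. 1: mostly wrong.
Reply with JSON: {"score": <int>, "reason": "<one sentence>"}"""

def judge(judge_call, question: str, answer: str, n: int = 3) -> float:
    """Grade `answer` n times with `judge_call` (a callable that sends
    a prompt to the judge model and returns its text reply), then
    average the scores to reduce run-to-run variance."""
    prompt = f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}"
    scores = [json.loads(judge_call(prompt))["score"] for _ in range(n)]
    return statistics.mean(scores)
```

Logging each judge reply's `reason` field alongside the score makes the monthly human spot-check much faster.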
Metrics that matter
- Task success rate. % of cases completed correctly.
- Tool-call count. How many steps to finish? Fewer is usually better.
- Cost per task. Tokens × rate. Track it, especially as you change prompts.
- Latency. P50, P95. User-perceived speed.
- Error rate. Tools that errored; model calls that returned malformed output.
- Safety violations. How often did the agent take a deny-listed action that slipped past guardrails?
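Most of these metrics fall out of per-run records. A minimal aggregation sketch, assuming each run is logged as a flat dict (the field names and the blended per-token rate are assumptions, not a standard schema):

```python
import statistics

RATE_PER_TOKEN = 3e-06  # assumed blended $/token rate; substitute your own

def summarize(runs: list[dict]) -> dict:
    """Aggregate per-run records into headline metrics.
    Assumed record shape: {"success": bool, "tool_calls": int,
    "tokens": int, "latency_s": float, "errors": int, "deny_hits": int}."""
    lat = sorted(r["latency_s"] for r in runs)
    pct = lambda p: lat[min(len(lat) - 1, int(p * len(lat)))]  # nearest-rank
    return {
        "task_success_rate": sum(r["success"] for r in runs) / len(runs),
        "avg_tool_calls": statistics.mean(r["tool_calls"] for r in runs),
        "cost_usd": sum(r["tokens"] for r in runs) * RATE_PER_TOKEN,
        "latency_p50_s": pct(0.50),
        "latency_p95_s": pct(0.95),
        "error_rate": sum(r["errors"] > 0 for r in runs) / len(runs),
        "deny_list_hits": sum(r["deny_hits"] for r in runs),
    }
```

Tracking `cost_usd` per suite run is what surfaces the silent cost regressions that prompt changes cause.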
Process
- Start with 10 hand-written test cases. Run them before every prompt change.
- Grow the set. Every bug found in prod → add a test case.
- Run regularly. On every meaningful change. At least weekly in prod.
- Track over time. A single number means less than the delta.
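Tracking the delta rather than the single number takes very little machinery: append each run's headline metric to a history file and diff against the previous entry. A sketch, with the JSONL history format and function name as assumptions:

```python
import json
from pathlib import Path

def compare_to_last(report: dict, history_path: str = "eval_history.jsonl") -> dict:
    """Append this run's pass rate to a JSONL history file and report
    the delta vs. the previous run, flagging regressions."""
    path = Path(history_path)
    lines = path.read_text().splitlines() if path.exists() else []
    last = json.loads(lines[-1])["pass_rate"] if lines else None
    with path.open("a") as f:
        f.write(json.dumps({"pass_rate": report["pass_rate"]}) + "\n")
    delta = None if last is None else report["pass_rate"] - last
    return {
        "pass_rate": report["pass_rate"],
        "delta": delta,
        "regressed": delta is not None and delta < 0,
    }
```

Wiring `regressed` into CI (fail the build on a drop) is the cheapest way to make the regression suite actually gate changes.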
The eval is the product. If your agent is hard to eval, it's too complex: simplify it. A better agent is one whose behavior you can measure.