Agent evaluation
📖 5 min read · Updated 2026-04-18
"It works" is not an evaluation. An agent without evals is a prototype. The #1 reason agents fail in production is nobody measured whether they worked before shipping.
Why evaluating agents is hard
- Nondeterminism. Same input → different output. Single-run correctness is meaningless.
- No ground truth for many tasks. "Write a good email" has no single right answer.
- Multi-step. Output quality depends on the whole trajectory, not just the final answer.
- Environment coupling. Agents act on systems. Those systems have state. Repeatability is hard.
The eval taxonomy
- Unit evals, specific input → expected output (or property). Deterministic parts of the agent (parsing, formatting).
- End-to-end evals, full-agent runs on labeled test cases, scored against expected behavior.
- Regression evals, a fixed suite you run on every change. Catches regressions when you change the prompt.
- Adversarial evals, intentionally tricky inputs to surface failure modes.
- Production shadow evals, run the agent on live traffic in shadow mode, compare to incumbent.
- Human eval, humans rate agent output on defined rubrics.
Building an eval harness
Minimum viable eval harness:
- A JSON/YAML file of test cases. Each case has input, expected behavior (free text or structured), optional expected output.
- A runner that executes the agent on each case, captures outputs.
- A grader that scores outputs. For structured outputs, exact match. For free text, use another LLM call (LLM-as-judge) with a rubric.
- A report showing pass rates, broken cases, and drift from last run.
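The pieces above fit in a few dozen lines. A minimal sketch for the structured-output case, assuming the agent is exposed as a plain callable (`input str -> output str`) and grading is exact match; the case file format and `run_suite` name are illustrative, not a standard:

```python
import json
from pathlib import Path

def run_suite(agent, cases_path: str) -> dict:
    """Run `agent` over a JSON file of test cases and score by exact match.
    Case format (assumed): {"input": ..., "expected_output": ...}."""
    cases = json.loads(Path(cases_path).read_text())
    results = []
    for case in cases:
        output = agent(case["input"])  # capture the agent's output
        results.append({
            "input": case["input"],
            "output": output,
            "passed": output == case.get("expected_output"),
        })
    return {
        "pass_rate": sum(r["passed"] for r in results) / len(results),
        "failures": [r for r in results if not r["passed"]],
    }
```

Free-text cases would swap the exact-match line for an LLM-as-judge call; the report's `failures` list is what you triage after each run.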
LLM-as-judge
Use a strong model (Opus-class) to grade another model's outputs against a rubric. Cheaper than human eval, often good enough.
Rules:
- The judge should be a different (and ideally stronger) model than the one being graded
- Rubric must be explicit. "rate 1-5 on accuracy" is not enough; define each score level
- Run the judge multiple times and average the scores; this reduces variance
- Spot-check with human eval monthly to catch judge drift
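The rubric and averaging rules can be sketched as follows. To stay model-agnostic, the judge is passed in as a callable (`prompt str -> reply str`) wrapping whatever strong-model API you use; the rubric wording and JSON reply format are illustrative:

```python
import json
import statistics

# Explicit rubric: every score level is defined, not just "rate 1-5".
RUBRIC = """Score the answer 1-5 for factual accuracy:
5: every claim correct. 4: one minor error. 3: mostly correct, one
notable error. 2: several errors. 1: mostly wrong.
Reply with JSON: {"score": <int>, "reason": "<one sentence>"}"""

def judge(judge_call, question: str, answer: str, n: int = 3) -> float:
    """Grade `answer` n times with `judge_call` (a callable that sends
    a prompt to the judge model and returns its text reply), then
    average the scores to reduce run-to-run variance."""
    prompt = f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}"
    scores = [json.loads(judge_call(prompt))["score"] for _ in range(n)]
    return statistics.mean(scores)
```

Logging each judge reply's `reason` field alongside the score makes the monthly human spot-check much faster.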
Metrics that matter
- Task success rate. % of cases completed correctly.
- Tool-call count. How many steps to finish? Fewer is usually better.
- Cost per task. Tokens × rate. Track it, especially as you change prompts.
- Latency. P50, P95. User-perceived speed.
- Error rate. Tools that errored; model calls that returned malformed output.
- Safety violations. How often did the agent take a deny-listed action that slipped past guardrails?
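Most of these metrics fall out of per-run records. A minimal aggregation sketch, assuming each run is logged as a flat dict (the field names and the blended per-token rate are assumptions, not a standard schema):

```python
import statistics

RATE_PER_TOKEN = 3e-06  # assumed blended $/token rate; substitute your own

def summarize(runs: list[dict]) -> dict:
    """Aggregate per-run records into headline metrics.
    Assumed record shape: {"success": bool, "tool_calls": int,
    "tokens": int, "latency_s": float, "errors": int, "deny_hits": int}."""
    lat = sorted(r["latency_s"] for r in runs)
    pct = lambda p: lat[min(len(lat) - 1, int(p * len(lat)))]  # nearest-rank
    return {
        "task_success_rate": sum(r["success"] for r in runs) / len(runs),
        "avg_tool_calls": statistics.mean(r["tool_calls"] for r in runs),
        "cost_usd": sum(r["tokens"] for r in runs) * RATE_PER_TOKEN,
        "latency_p50_s": pct(0.50),
        "latency_p95_s": pct(0.95),
        "error_rate": sum(r["errors"] > 0 for r in runs) / len(runs),
        "deny_list_hits": sum(r["deny_hits"] for r in runs),
    }
```

Tracking `cost_usd` per suite run is what surfaces the silent cost regressions that prompt changes cause.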
Process
- Start with 10 hand-written test cases. Run them before every prompt change.
- Grow the set. Every bug found in prod → add a test case.
- Run regularly. On every meaningful change. At least weekly in prod.
- Track over time. A single number means less than the delta.
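Tracking the delta rather than the single number takes very little machinery: append each run's headline metric to a history file and diff against the previous entry. A sketch, with the JSONL history format and function name as assumptions:

```python
import json
from pathlib import Path

def compare_to_last(report: dict, history_path: str = "eval_history.jsonl") -> dict:
    """Append this run's pass rate to a JSONL history file and report
    the delta vs. the previous run, flagging regressions."""
    path = Path(history_path)
    lines = path.read_text().splitlines() if path.exists() else []
    last = json.loads(lines[-1])["pass_rate"] if lines else None
    with path.open("a") as f:
        f.write(json.dumps({"pass_rate": report["pass_rate"]}) + "\n")
    delta = None if last is None else report["pass_rate"] - last
    return {
        "pass_rate": report["pass_rate"],
        "delta": delta,
        "regressed": delta is not None and delta < 0,
    }
```

Wiring `regressed` into CI (fail the build on a drop) is the cheapest way to make the regression suite actually gate changes.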
The eval is the product. If your agent is hard to eval, it's too complex: simplify it. A better agent is one whose behavior you can measure.