Why eval agents
📖 3 min readUpdated 2026-04-19
Evaluating agents is fundamentally harder than evaluating a single LLM call. A call has one output to score. An agent has a whole trajectory: reasoning steps, tool calls, errors, retries. Any of those can go wrong while the final answer still looks fine. Without agent-specific eval, your quality will erode, and you won't know until users complain. Having an eval set is the #1 thing that separates hobby agents from production agents.
Why single-call eval isn't enough
- The agent produced a correct final answer, but took 25 tool calls when it should have taken 4.
- The agent is correct 60% of the time on the same input. Non-determinism is invisible if you only look at one run.
- The agent sometimes hits budget limits or loops forever on specific inputs.
- The agent has occasional unsafe behaviors (tool misuse, data leaks) that rarely surface in ad-hoc testing.
- The model changed under you. Your "working" agent silently started costing 3× and making different choices.
Each of these fails in a way a single-call eval can't catch. Agent eval is about the whole trace, not just the answer.
Five dimensions to measure
Task completion is the headline. The other four are how you catch silent regressions.
What to put in the eval set
A good agent eval set has five kinds of cases:
- Happy path. The common cases your agent will see every day.
- Edge cases. Unusual inputs, empty fields, very long inputs, very short inputs.
- Adversarial. Prompt-injection attempts, contradictory requests, attempts to exfiltrate data.
- Error paths. Cases where tools deliberately fail. Does the agent recover?
- Long-tail. Rare but important cases. Low volume, high stakes.
Start with 30 cases. Add cases every time you see a real production bug. The set gets more useful the longer it exists.
A worked example: before and after eval
Team ships an agent. "Works great" on hand-testing. Three weeks later, users complain about bad answers. Team reads traces: the agent is hitting its step cap on complex queries and returning garbage. They add eval cases for complex queries. The cases fail. They fix the loop. Cases pass. Regression caught before the next similar bug.
Without the eval, that cycle takes months and the whole team gets defensive. With the eval, it takes a morning and everyone's calmer.
Automated vs human eval
- Automated: checks that can be expressed as code or a judge LLM. "Final answer matches expected." "Agent didn't call forbidden tool." "Trajectory under N steps." Scales. Run every change.
- LLM-as-judge: a stronger model grades open-ended outputs. Works for "is this a good response?" type questions. Introduces judge bias; validate against human rating periodically.
- Human review: a person reads a sample. Catches the weird stuff automation misses. Slow but essential for high-stakes agents.
Production agent programs run all three, at different frequencies.
The regression problem
Every change can regress something: prompt edits, model upgrades, tool description tweaks, new tools, a library change. Without an eval suite you find out when users complain. With an eval suite you find out in CI, before the change ships.
Run the eval on every meaningful change. Ship only if pass-rate holds steady or improves. That one rule prevents most of the "agent suddenly got worse" mysteries.
Pitfalls
- No eval set. You're flying blind. Every change is a coin flip.
- Eval set too small. 5 cases don't catch long-tail problems.
- Eval set never updated. Your cases from a year ago don't reflect current production.
- Only measuring final answer. You miss trajectory problems, cost blowouts, and slow regressions.
- Running eval once a month. Bugs ship before the next eval runs.
What to do with this