Why eval agents

Evaluating agents is fundamentally harder than evaluating a single LLM call. A call has one output to score. An agent has a whole trajectory: reasoning steps, tool calls, errors, retries. Any of those can go wrong while the final answer still looks fine. Without agent-specific eval, your quality will erode, and you won't know until users complain. Having an eval set is the #1 thing that separates hobby agents from production agents.

Why single-call eval isn't enough

An agent can call the wrong tool, swallow an error, loop on retries without progress, or build a confident answer on top of a failed lookup. Each of these fails in a way a single-call eval can't catch. Agent eval is about the whole trace, not just the answer.
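To make the difference concrete, here is a minimal sketch (the trace format and field names are assumptions, not a real framework): the final answer reads fine, but the trace shows a tool call that errored and was papered over. An answer-only check passes; a trace-level check fails.

```python
# Hypothetical trace: the answer "looks fine", but a tool call timed out
# and the agent answered from memory instead of retrying.
trace = {
    "final_answer": "The refund window is 30 days.",
    "steps": [
        {"type": "tool_call", "tool": "search_docs", "error": "timeout"},
        {"type": "reasoning", "text": "Search failed; I'll answer from memory."},
    ],
}

def answer_only_check(trace):
    # Single-call style: score the output string alone.
    return bool(trace["final_answer"].strip())

def trace_check(trace):
    # Agent style: fail if any step errored, even when the answer reads fine.
    return all(step.get("error") is None for step in trace["steps"])

print(answer_only_check(trace))  # True  -- the answer looks plausible
print(trace_check(trace))        # False -- the trace shows a swallowed failure
```

The point is not this particular check; it is that the trace carries signal the final string does not.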

Five dimensions to measure

Task completion is the headline. The other four — tool-use correctness, error recovery, efficiency (steps and retries), and trajectory quality — are how you catch silent regressions.
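One way to make the dimensions measurable is to compute them as plain functions over the trace. A sketch, assuming a simple step-list trace format (all names here are illustrative):

```python
# Hypothetical trace metrics: task completion plus trajectory-level signals
# (tool errors, step count, retries) that catch silent regressions.
def score_trace(trace, expected_answer, step_budget=10):
    steps = trace["steps"]
    tool_calls = [s for s in steps if s["type"] == "tool_call"]
    return {
        "task_completed": expected_answer in trace["final_answer"],
        "tool_errors": sum(1 for s in tool_calls if s.get("error")),
        "steps_used": len(steps),
        "within_budget": len(steps) <= step_budget,
        "retries": sum(1 for s in steps if s.get("is_retry")),
    }

trace = {
    "final_answer": "42",
    "steps": [
        {"type": "tool_call", "tool": "calculator", "error": None},
        {"type": "tool_call", "tool": "calculator", "error": "timeout"},
        {"type": "tool_call", "tool": "calculator", "is_retry": True, "error": None},
    ],
}
print(score_trace(trace, "42"))
```

Here the task completed, but the metrics also surface one tool error and one retry — exactly the kind of signal that erodes quietly if only the headline number is tracked.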

What to put in the eval set

A good agent eval set has five kinds of cases: happy paths, edge cases, known failure modes, tool-error scenarios, and regressions from real production bugs.

Start with 30 cases. Add cases every time you see a real production bug. The set gets more useful the longer it exists.
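A workable case format pairs an input with checks on both the answer and the trace. This is a sketch — the schema and field names are assumptions, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    # Hypothetical schema: one eval case = input + answer check + trace limits.
    name: str
    user_input: str
    must_contain: str             # substring the final answer must include
    max_steps: int = 10           # fail if the agent loops past this
    allowed_tools: list = field(default_factory=list)

    def check(self, trace):
        failures = []
        if self.must_contain not in trace["final_answer"]:
            failures.append("answer missing expected content")
        if len(trace["steps"]) > self.max_steps:
            failures.append("exceeded step budget")
        for s in trace["steps"]:
            if s["type"] == "tool_call" and self.allowed_tools \
                    and s["tool"] not in self.allowed_tools:
                failures.append(f"unexpected tool: {s['tool']}")
        return failures  # empty list = pass

case = EvalCase("refund lookup", "What's the refund policy?", "30 days",
                max_steps=5, allowed_tools=["search_docs"])
trace = {"final_answer": "Refunds are accepted within 30 days.",
         "steps": [{"type": "tool_call", "tool": "search_docs"}]}
print(case.check(trace))  # []  -- the case passes
```

When a production bug appears, it becomes one more `EvalCase` with the failing behavior encoded in its checks.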

A worked example: before and after eval

A team ships an agent. It "works great" in hand-testing. Three weeks later, users complain about bad answers. The team reads traces: the agent is hitting its step cap on complex queries and returning garbage. They add eval cases for complex queries. The cases fail. They fix the loop. The cases pass. That class of regression is now caught before the next similar bug ships.

Without the eval, that cycle takes months and the whole team gets defensive. With the eval, it takes a morning and everyone's calmer.

Automated vs human eval

There are three tiers: hard-coded assertions (cheap and exact, runnable on every change), LLM-as-judge scoring (scales to fuzzy criteria, but needs calibration against human labels), and human review (the gold standard, far too slow for CI). Production agent programs run all three, at different frequencies.
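A sketch of how the automated tiers compose, assuming cheap assertions gate first and a judge only sees traces that pass them. `call_judge` here is a placeholder, not a real API — a real implementation would call your model provider:

```python
def assertion_check(trace, case):
    # Tier 1: deterministic checks, cheap enough to run on every commit.
    return case["must_contain"] in trace["final_answer"]

def call_judge(prompt):
    # Placeholder for an LLM-as-judge call (tier 2, run less often).
    # Stubbed to always pass so the sketch is self-contained.
    return "PASS"

def judge_check(trace, case):
    verdict = call_judge(
        f"Question: {case['user_input']}\n"
        f"Answer: {trace['final_answer']}\n"
        "Is this answer correct and grounded in the tool results? PASS or FAIL."
    )
    return verdict == "PASS"

def evaluate(trace, case):
    # Assertions gate first; the judge only sees traces that pass them.
    if not assertion_check(trace, case):
        return "fail: assertion"
    if not judge_check(trace, case):
        return "fail: judge"
    return "pass"

case = {"user_input": "What's the refund policy?", "must_contain": "30 days"}
trace = {"final_answer": "Refunds are accepted within 30 days."}
print(evaluate(trace, case))  # pass
```

The gating order is deliberate: it keeps judge cost proportional to the cases that actually need fuzzy evaluation.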

The regression problem

Every change can regress something: prompt edits, model upgrades, tool description tweaks, new tools, a library change. Without an eval suite you find out when users complain. With an eval suite you find out in CI, before the change ships.

Run the eval on every meaningful change. Ship only if the pass rate holds steady or improves. That one rule prevents most of the "agent suddenly got worse" mysteries.
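The ship rule can be a few lines in CI. A sketch — the baseline number and tolerance are assumptions you would tune for your own suite:

```python
def regression_gate(results, baseline_pass_rate, tolerance=0.0):
    # results: one boolean per eval case. Fail the build if the pass rate
    # drops below the stored baseline (minus an optional tolerance).
    pass_rate = sum(results) / len(results)
    if pass_rate + tolerance < baseline_pass_rate:
        raise SystemExit(
            f"Eval regression: {pass_rate:.0%} < baseline {baseline_pass_rate:.0%}"
        )
    return pass_rate  # candidate new baseline if it improved

# Example: 28 of 30 cases pass against a 90% baseline -> ~93%, gate passes.
rate = regression_gate([True] * 28 + [False] * 2, baseline_pass_rate=0.90)
print(f"{rate:.0%}")
```

A nonzero `tolerance` absorbs flakiness from nondeterministic agents; zero is the strict version of "holds steady or improves".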

Pitfalls

What to do with this