Why eval agents

Agents are much harder to eval than single LLM calls. A single call has one output to score. An agent has a trajectory: a sequence of reasoning steps and tool calls, any of which can go wrong without changing the final answer. Without agent-specific eval, quality erodes invisibly.
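To make "trajectory" concrete, here is a minimal sketch of one as a data structure. The `Step`/`Trajectory` names and the reasoning/tool-call taxonomy are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    kind: str   # "reasoning" or "tool_call" (assumed taxonomy)
    name: str   # tool name, or "" for reasoning steps
    ok: bool    # did this step succeed?

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)
    final_answer: str = ""

    def failed_steps(self):
        return [s for s in self.steps if not s.ok]

# A correct final answer can hide a broken intermediate step:
t = Trajectory(
    steps=[
        Step("tool_call", "search", ok=False),  # failed; the agent retried
        Step("tool_call", "search", ok=True),
        Step("reasoning", "", ok=True),
    ],
    final_answer="42",
)
```

Scoring only `final_answer` here would report success and miss the failed first call — exactly the invisible erosion described above.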

Why single-call eval isn't enough

A single-call eval scores the final output in isolation. An agent can reach the right answer through a wasteful or unsafe path, or fail on a task variant a one-shot test never exercises. You need to score the trajectory, not just the answer.

What to measure

Task completion

Did the agent accomplish what was asked? Score it binary (pass/fail) or graded (partial credit).
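A sketch of both scoring styles. The exact-match and fact-coverage criteria are illustrative; real tasks need task-specific checks:

```python
def score_completion(expected: str, actual: str) -> float:
    """Binary: exact match after normalization (an assumed criterion)."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0

def graded_completion(required_facts: list[str], actual: str) -> float:
    """Graded: fraction of required facts present in the answer."""
    if not required_facts:
        return 0.0
    hits = sum(1 for fact in required_facts if fact.lower() in actual.lower())
    return hits / len(required_facts)
```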

Trajectory quality

Was the path efficient? Right tools? Right order?
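One way to operationalize "right tools, right order" is an ordered-subsequence check with a penalty for extra calls. A simple sketch; production evals often need to accept alternate valid orderings:

```python
def trajectory_score(tool_calls: list[str], expected: list[str]) -> float:
    """1.0 for exactly the expected calls in order; a fractional score for the
    right tools in order with extra calls; 0.0 for wrong tools or wrong order."""
    if tool_calls == expected:
        return 1.0
    it = iter(tool_calls)
    in_order = all(t in it for t in expected)  # subsequence check
    if not in_order:
        return 0.0
    return len(expected) / len(tool_calls)  # penalize wasted calls
```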

Cost

Tokens and tool-call count per task.
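Aggregating cost per task is straightforward once the trajectory records token counts. The step schema and price are placeholder assumptions:

```python
def task_cost(steps: list[dict], price_per_1k_tokens: float = 0.01) -> dict:
    """Sum tokens and tool calls for one task; convert tokens to dollars
    using an assumed flat rate."""
    tokens = sum(s["tokens"] for s in steps)
    tool_calls = sum(1 for s in steps if s["kind"] == "tool_call")
    return {
        "tokens": tokens,
        "tool_calls": tool_calls,
        "usd": tokens / 1000 * price_per_1k_tokens,
    }
```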

Latency

Total time from request to completion.
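Wall-clock latency can be captured with a thin wrapper around the agent call:

```python
import time

def timed(fn, *args):
    """Return (result, seconds) for one request-to-completion run."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```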

Safety

Did the agent do anything unsafe (harmful tool calls, leaked data)?
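A rule-based safety pass can catch the obvious cases. The denylist and secret pattern below are hypothetical; real safety evals combine rules like these with human review:

```python
import re

UNSAFE_TOOLS = {"delete_database", "send_email"}     # hypothetical denylist
SECRET_PATTERN = re.compile(r"sk-[A-Za-z0-9]{20,}")  # API-key-shaped strings

def safety_violations(trajectory: list[dict]) -> list[str]:
    """Flag harmful tool calls and leaked-secret-shaped output in a trajectory."""
    issues = []
    for step in trajectory:
        if step.get("tool") in UNSAFE_TOOLS:
            issues.append(f"unsafe tool: {step['tool']}")
        if SECRET_PATTERN.search(step.get("output", "")):
            issues.append("possible secret in output")
    return issues
```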

The eval set

Build a diverse set of representative tasks: common workflows, known edge cases, past failures turned into regression tests, and adversarial inputs. Each task needs a success criterion the eval can check.
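An eval set can be plain data: each task pairs an input with checkable expectations. The tasks and field names below are hypothetical examples:

```python
EVAL_SET = [
    {"id": "refund-simple", "prompt": "Refund order #123",
     "expect_tools": ["lookup_order", "issue_refund"], "expect_answer": "refunded"},
    {"id": "refund-missing-order", "prompt": "Refund order #999",
     "expect_tools": ["lookup_order"], "expect_answer": "not found"},
]

def run_eval(agent, eval_set):
    """Run every task; score completion only, for brevity."""
    results = {}
    for task in eval_set:
        answer = agent(task["prompt"])
        results[task["id"]] = float(task["expect_answer"] in answer.lower())
    return results
```

The same loop extends naturally to trajectory, cost, latency, and safety scores per task.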

Automated vs human eval

Automated evals scale. LLM-as-judge works for many dimensions. Human review catches what automation misses. Real agent programs run both.
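One way to run both: let an automated judge score everything, and escalate only its low-confidence cases to humans. `llm_judge` below is a stand-in for your model call, and the confidence floor is an illustrative threshold:

```python
def judge_with_escalation(task, answer, llm_judge, confidence_floor=0.7):
    """LLM-as-judge for scale; route low-confidence verdicts to human review.
    `llm_judge(task, answer)` is assumed to return (score, confidence)."""
    score, confidence = llm_judge(task, answer)
    if confidence < confidence_floor:
        return {"score": None, "needs_human_review": True}
    return {"score": score, "needs_human_review": False}
```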

The regression problem

Changing your system prompt, switching models, or updating a tool description can all regress agent quality. Without an eval suite, you won't know until users complain.

Run the eval on every meaningful change. Ship only if quality holds.
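"Ship only if quality holds" can be a mechanical gate comparing per-task scores against a baseline. The tolerance value is an illustrative choice:

```python
def quality_holds(baseline: dict, candidate: dict, tolerance: float = 0.02) -> bool:
    """Gate a change: fail if any task regresses more than `tolerance`,
    or if the mean score drops."""
    for task_id, base_score in baseline.items():
        if candidate.get(task_id, 0.0) < base_score - tolerance:
            return False
    mean = lambda d: sum(d.values()) / len(d)
    return mean(candidate) >= mean(baseline) - tolerance
```

Run this in CI on every prompt, model, or tool change, and block the merge when it returns False.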