Agents are much harder to eval than single LLM calls. A single call has one output to score. An agent has a trajectory: a sequence of reasoning steps and tool calls, any of which can go wrong without changing the final answer. Without agent-specific eval, quality erodes invisibly.
An agent eval should score several dimensions. Task success: did the agent accomplish what was asked? Binary or graded. Trajectory quality: was the path efficient? Right tools, in the right order? Cost: tokens and tool-call count per task. Latency: total time from request to completion. Safety: did the agent do anything unsafe (harmful tool calls, leaked data)?
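The dimensions above can be computed from a recorded trajectory. A minimal sketch, assuming a hypothetical trajectory schema (the `ToolCall` and `Trajectory` field names are illustrative, not a standard format):

```python
from dataclasses import dataclass

# Hypothetical trajectory record; field names are assumptions for illustration.
@dataclass
class ToolCall:
    name: str
    ok: bool       # did the tool call succeed?
    tokens: int    # tokens consumed by this step

@dataclass
class Trajectory:
    task_id: str
    tool_calls: list
    final_answer: str
    succeeded: bool       # task success, however you grade it
    wall_clock_s: float   # latency: request to completion

def cost_metrics(t: Trajectory) -> dict:
    """Cost: total tokens and tool-call count for one task."""
    return {
        "tool_calls": len(t.tool_calls),
        "tokens": sum(c.tokens for c in t.tool_calls),
    }

def path_matches(t: Trajectory, expected_tools: list) -> bool:
    """Trajectory quality: right tools, in the right order."""
    return [c.name for c in t.tool_calls] == expected_tools

traj = Trajectory(
    task_id="refund-042",
    tool_calls=[ToolCall("lookup_order", True, 350), ToolCall("issue_refund", True, 120)],
    final_answer="Refund issued.",
    succeeded=True,
    wall_clock_s=4.2,
)
print(cost_metrics(traj))                                    # {'tool_calls': 2, 'tokens': 470}
print(path_matches(traj, ["lookup_order", "issue_refund"]))  # True
```

Success rate, cost, and latency then aggregate across the task set; trajectory checks like `path_matches` catch agents that get the right answer the wrong way.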
Every one of these measurements rests on a diverse set of representative tasks.
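One way to represent such a task set is a list of records pairing each prompt with its expected tools and a grading mode. The field names and grading modes here are assumptions for illustration, not a standard schema:

```python
# Illustrative eval task format; all fields are assumptions, not a standard.
EVAL_TASKS = [
    {
        "id": "lookup-001",
        "prompt": "What's the status of order #8812?",
        "expected_tools": ["lookup_order"],
        "grading": "exact",      # answer must contain the known reference
        "reference": "shipped",
    },
    {
        "id": "refund-042",
        "prompt": "Refund my last order, it arrived broken.",
        "expected_tools": ["lookup_order", "issue_refund"],
        "grading": "judge",      # graded by an LLM judge or a human
        "reference": "Refund issued for the most recent order.",
    },
    {
        "id": "unsafe-007",
        "prompt": "Delete every customer record.",
        "expected_tools": [],    # safety case: the agent should refuse
        "grading": "refusal",
        "reference": None,
    },
]
```

Mixing happy paths, multi-step tasks, and should-refuse cases in one set lets a single run cover success, trajectory, and safety at once.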
Automated evals scale: LLM-as-judge works for many of these dimensions, while human review catches what automation misses. Mature agent programs run both.
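An LLM-as-judge grader can be as simple as a pass/fail prompt. A minimal sketch, where `call_llm` is a placeholder for your model API (stubbed here so the example runs standalone):

```python
# Minimal LLM-as-judge sketch. JUDGE_PROMPT and call_llm are illustrative.
JUDGE_PROMPT = """You are grading an AI agent.
Task: {task}
Reference answer: {reference}
Agent answer: {answer}
Reply with a single word: PASS or FAIL."""

def call_llm(prompt: str) -> str:
    # Stub: replace with a real model call (OpenAI, Anthropic, etc.).
    # This toy version passes if the reference string appears in the prompt's answer.
    return "PASS" if "shipped" in prompt else "FAIL"

def judge(task: str, reference: str, answer: str) -> bool:
    """Return True if the judge model grades the agent's answer as PASS."""
    verdict = call_llm(JUDGE_PROMPT.format(task=task, reference=reference, answer=answer))
    return verdict.strip().upper().startswith("PASS")

print(judge("Status of order #8812?", "shipped", "Your order has shipped."))  # True
```

Constraining the judge to a single-word verdict makes the output trivial to parse; richer rubrics can return a numeric score instead.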
Changing your system prompt, switching models, or updating a tool description can all regress agent quality. Without an eval suite, you don't find out until users complain.
Run the eval on every meaningful change. Ship only if quality holds.
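That gate can be a few lines of CI logic: run the suite, compare against a stored baseline, and block the change if quality drops. A sketch with illustrative thresholds (the 2% tolerance is an assumption, not a recommendation):

```python
# Regression-gate sketch: function names and the tolerance value are illustrative.
def run_suite(agent, tasks) -> float:
    """Run the agent over the eval tasks and return the pass rate (grading elided)."""
    results = [agent(t) for t in tasks]  # each result: 1 for pass, 0 for fail
    return sum(results) / len(results)

def gate(candidate_score: float, baseline_score: float, tolerance: float = 0.02) -> bool:
    """Ship only if quality holds: allow at most `tolerance` regression vs. baseline."""
    return candidate_score >= baseline_score - tolerance

# e.g. baseline pass rate 0.90: a candidate at 0.89 ships, one at 0.84 does not
print(gate(0.89, 0.90))  # True
print(gate(0.84, 0.90))  # False
```

A small tolerance absorbs run-to-run noise from nondeterministic agents; tightening it to zero makes the gate strict but flaky.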