Trajectory evaluation

Trajectory evaluation looks at how the agent got to its answer, not just whether it got there. Two agents with the same final answer can differ by 10x in cost, or produce the answer via completely different reasoning paths.

What to measure in trajectories

Efficiency: how many steps, tool calls, and tokens the agent spends to reach the answer.

Tool selection: whether the agent picks an appropriate tool for each subtask and avoids unnecessary calls.

Reasoning quality: whether the intermediate reasoning is coherent and actually supports the final answer.

Error recovery: how the agent responds to failed tool calls or dead ends (retrying, backing off, or changing approach).
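The four properties above can be computed from a structured trajectory record. A minimal sketch, assuming a trajectory is a list of steps where each step records the tool used, token cost, and success flag (the `Step`/`Trajectory` names and fields are illustrative, not a standard schema):

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    # One step in an agent trajectory: a tool call and its outcome.
    tool: str
    tokens: int
    ok: bool  # did the tool call succeed?

@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)

    def num_steps(self) -> int:
        # Efficiency: step count.
        return len(self.steps)

    def total_tokens(self) -> int:
        # Efficiency: token cost across all steps.
        return sum(s.tokens for s in self.steps)

    def recovery_rate(self) -> float:
        # Error recovery: fraction of failed steps that are
        # eventually followed by a successful step.
        failures = [i for i, s in enumerate(self.steps) if not s.ok]
        if not failures:
            return 1.0
        recovered = sum(
            1 for i in failures if any(t.ok for t in self.steps[i + 1:])
        )
        return recovered / len(failures)
```

Tool selection and reasoning quality are harder to score mechanically and usually fall to a judge model, as described below.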

Automated checks

Some trajectory properties can be checked automatically: total step count against a budget, repeated identical tool calls (a common loop signature), tool calls with missing or invalid arguments, and whether failed calls are followed by a retry.
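A sketch of such a rule-based checker, assuming each step is a (tool name, arguments) pair; the two rules shown (step budget and immediate-repeat detection) are illustrative, not a standard rule set:

```python
def check_trajectory(steps, max_steps=20):
    """Return a list of rule violations found in a trajectory.

    `steps` is a list of (tool_name, args_dict) pairs.
    """
    violations = []
    # Rule 1: the trajectory must stay within a step budget.
    if len(steps) > max_steps:
        violations.append(f"too many steps: {len(steps)} > {max_steps}")
    # Rule 2: flag an immediately repeated identical tool call,
    # which often signals the agent is stuck in a loop.
    for prev, curr in zip(steps, steps[1:]):
        if prev == curr:
            violations.append(f"repeated call: {curr[0]}")
            break
    return violations
```

Checks like these are cheap enough to run on every trajectory in a test suite, so they make a good first filter before any LLM-based judging.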

LLM-as-judge on trajectories

Give a judge LLM the full trajectory plus the expected answer. Ask: Was each step necessary? Did any step introduce an error, and did the agent recover? Does the reasoning actually support the final answer? Have the judge score the trajectory itself, not only the final answer.
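One way to assemble the judge's input, as a sketch: the function below formats a trajectory and the expected answer into a single prompt string (the template wording and the 1-5 rubric are assumptions to adapt, not a fixed standard):

```python
def build_judge_prompt(trajectory: list[str], expected_answer: str) -> str:
    # Number each step so the judge can reference specific steps.
    transcript = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(trajectory))
    return (
        "You are evaluating an agent's trajectory, not just its answer.\n\n"
        f"Trajectory:\n{transcript}\n\n"
        f"Expected answer: {expected_answer}\n\n"
        "Questions:\n"
        "1. Was each step necessary to reach the answer?\n"
        "2. Did any step introduce an error, and did the agent recover?\n"
        "3. Does the reasoning actually support the final answer?\n"
        "Respond with a 1-5 score and a one-sentence justification."
    )
```

The returned string would then be sent to whatever judge model you use; asking for a numeric score plus justification makes the output easy to aggregate while keeping it auditable.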

Comparative trajectory eval

Compare trajectories across model versions or prompt changes. Even when the task-completion rate is unchanged, shorter or cheaper trajectories on the same tasks indicate an improvement.
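A sketch of that comparison, assuming each version's results map task IDs to a (completed, cost) pair (the shape and the `compare_trajectories` name are hypothetical; "cost" could be tokens, dollars, or step count):

```python
def compare_trajectories(baseline, candidate):
    """Compare two versions' results on the same task set.

    Each argument maps task_id -> (completed: bool, cost: float).
    Returns per-task cost deltas for tasks both versions completed.
    """
    deltas = {}
    for task, (ok_base, cost_base) in baseline.items():
        ok_cand, cost_cand = candidate.get(task, (False, float("inf")))
        if ok_base and ok_cand:
            # Negative delta = candidate reached the answer more cheaply.
            deltas[task] = cost_cand - cost_base
    return deltas
```

Restricting the comparison to tasks both versions completed keeps the cost deltas from being confounded by completion-rate differences, which should be reported separately.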