Trajectory evaluation
📖 3 min read · Updated 2026-04-19
Trajectory evaluation looks at how the agent got to its answer, not just whether it got there. Two agents with the same final answer can differ by 10x in cost, or produce the answer via completely different reasoning paths.
What to measure in trajectories
Efficiency
- Number of tool calls
- Number of reasoning steps
- Total tokens used
- Time elapsed
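These efficiency metrics can be computed directly from a logged trajectory. A minimal sketch, assuming a hypothetical trajectory format where each step is a dict with a `"type"` (`"tool_call"` or `"reasoning"`), a token count, and a timestamp `"t"`:

```python
def efficiency_metrics(trajectory):
    """Summarize the cost of a trajectory (the step schema is an assumption)."""
    tool_calls = sum(1 for s in trajectory if s["type"] == "tool_call")
    reasoning_steps = sum(1 for s in trajectory if s["type"] == "reasoning")
    total_tokens = sum(s.get("tokens", 0) for s in trajectory)
    elapsed = trajectory[-1]["t"] - trajectory[0]["t"] if trajectory else 0.0
    return {
        "tool_calls": tool_calls,
        "reasoning_steps": reasoning_steps,
        "total_tokens": total_tokens,
        "seconds_elapsed": elapsed,
    }
```

Tracking these per eval run makes cost regressions visible even when answer quality is flat.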
Tool selection
- Did it use the right tools?
- Did it use tools it shouldn't have (safety)?
- Did it use tools redundantly?
Reasoning quality
- Did the reasoning follow logically from observations?
- Were any steps circular or contradictory?
- Did the agent reason about its own uncertainty?
Error recovery
- When tools failed, did the agent respond sensibly?
- Did it try alternatives?
- Did it loop forever?
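One error-recovery failure mode is easy to detect mechanically: a tool call fails and the agent immediately retries the same tool with identical arguments instead of adapting. A sketch, reusing the same assumed step schema (fields `"tool"`, `"args"`, and an `"error"` key on failed calls are assumptions):

```python
def naive_retry_flags(trajectory):
    """Flag failed tool calls that were immediately repeated verbatim."""
    flags = []
    for prev, curr in zip(trajectory, trajectory[1:]):
        if (
            prev.get("type") == "tool_call"
            and prev.get("error")
            and curr.get("type") == "tool_call"
            and curr.get("tool") == prev.get("tool")
            and curr.get("args") == prev.get("args")
        ):
            flags.append(curr["tool"])
    return flags
```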
Automated checks
Some trajectory properties can be checked automatically:
- Step count (should be below threshold)
- No forbidden tools called
- No tool called more than N times with same args (loop detection)
- Final answer in expected shape
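The first three checks above can be bundled into one pass over the trajectory. A sketch under the same assumed step schema; the thresholds and the returned failure strings are illustrative, not a fixed API:

```python
from collections import Counter

def automated_checks(trajectory, max_steps=25, forbidden=frozenset(), max_repeats=3):
    """Run cheap structural checks; return a list of failure descriptions."""
    failures = []
    if len(trajectory) > max_steps:
        failures.append(f"too many steps: {len(trajectory)}")
    calls = [s for s in trajectory if s.get("type") == "tool_call"]
    for s in calls:
        if s["tool"] in forbidden:
            failures.append(f"forbidden tool called: {s['tool']}")
    # Loop detection: same tool called with identical args more than max_repeats times.
    counts = Counter((s["tool"], repr(s.get("args"))) for s in calls)
    for (tool, _args), n in counts.items():
        if n > max_repeats:
            failures.append(f"possible loop: {tool} called {n}x with same args")
    return failures
```

Shape-checking the final answer is usually task-specific, so it is left to a separate per-task validator.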
LLM-as-judge on trajectories
Give a judge LLM the full trajectory + expected answer. Ask:
- Did the agent take a reasonable path?
- Were there obvious inefficiencies?
- Any warning signs in reasoning?
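The three questions above can be packed into a single judge prompt. A minimal sketch; the prompt wording and helper name are illustrative, and the call to the judge model itself is omitted:

```python
JUDGE_PROMPT = """\
You are evaluating an AI agent's trajectory, not just its final answer.

Trajectory:
{trajectory}

Expected answer:
{expected}

Answer each question with yes/no and a one-line justification:
1. Did the agent take a reasonable path to the answer?
2. Were there obvious inefficiencies (redundant calls, detours)?
3. Are there warning signs in the reasoning (circular logic, ignored errors)?
"""

def build_judge_prompt(trajectory_text, expected_answer):
    """Render the judge prompt for one trajectory."""
    return JUDGE_PROMPT.format(trajectory=trajectory_text, expected=expected_answer)
```

Keeping the questions yes/no with a forced justification makes the judge's output easy to parse and audit.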
Comparative trajectory eval
Compare trajectories across model versions or prompt changes. Even if task-completion rate is the same, shorter/cheaper trajectories indicate improvement.
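A comparison can be as simple as relative deltas between the per-run metric summaries of a baseline and a candidate. A sketch, assuming both runs produce metric dicts like the efficiency summary described earlier:

```python
def compare_runs(baseline, candidate):
    """Relative change per shared metric; negative means the candidate
    is cheaper or shorter on that cost metric."""
    deltas = {}
    for key in baseline:
        if key in candidate and baseline[key]:
            deltas[key] = (candidate[key] - baseline[key]) / baseline[key]
    return deltas
```

A candidate that matches the baseline's completion rate but shows a negative delta on tokens or tool calls is a net win.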