Trajectory evaluation looks at how the agent got to its answer, not just whether it got there. Two agents can give the exact same final answer while differing by 10× in cost, using entirely different tools, and making different safety decisions along the way. Task completion tells you if it worked. Trajectory eval tells you how well it worked. Both matter. The second is how you catch silent cost spikes and weird model regressions.
The most common silent regression is efficiency. Completion rate stays at 95%. Nobody notices anything. But the average task now takes 12 tool calls instead of 5, and your bill doubled. Efficiency metrics catch this immediately:
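A minimal sketch of the per-trajectory numbers worth logging. The `Trajectory` dataclass, its field names, and the token prices are all hypothetical placeholders for whatever your logging already captures:

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    # Hypothetical record of one agent run.
    tool_calls: list[str] = field(default_factory=list)
    input_tokens: int = 0
    output_tokens: int = 0
    wall_clock_seconds: float = 0.0

def efficiency_metrics(t: Trajectory,
                       price_per_1k_in: float = 0.003,
                       price_per_1k_out: float = 0.015) -> dict:
    """Per-trajectory metrics to track as trend lines across eval runs."""
    return {
        "step_count": len(t.tool_calls),
        "total_tokens": t.input_tokens + t.output_tokens,
        "est_cost_usd": (t.input_tokens / 1000) * price_per_1k_in
                        + (t.output_tokens / 1000) * price_per_1k_out,
        "latency_s": t.wall_clock_seconds,
    }

t = Trajectory(tool_calls=["search", "fetch", "search", "summarize"],
               input_tokens=12_000, output_tokens=1_500, wall_clock_seconds=9.2)
print(efficiency_metrics(t))
```

Aggregating these per eval run (mean and p95) is what turns "the bill doubled" from a billing-cycle surprise into a same-day alert.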
Track these as trend lines. A sudden jump after a prompt change is your signal.
For each eval case, note which tools the "ideal" trajectory would use. Compare to what the agent actually did:
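One way to score that comparison, assuming you store the ideal tool set per eval case (the function name and return shape here are illustrative):

```python
def tool_selection_score(expected: set[str], actual: list[str]) -> dict:
    """Compare the tools an 'ideal' trajectory would use to what the agent did."""
    used = set(actual)
    missing = expected - used   # tools the ideal path needs but the agent skipped
    extra = used - expected     # tools the agent called that it shouldn't need
    union = expected | used
    jaccard = len(expected & used) / len(union) if union else 1.0
    return {"missing": missing, "extra": extra, "overlap": jaccard}

print(tool_selection_score({"search", "calculator"},
                           ["search", "fetch_url", "search"]))
```

Trend the `overlap` score per tool; a drop scoped to one tool usually points straight at the prompt or tool-description edit that caused it.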
Tool-selection regressions are often caused by prompt edits or tool-description changes. They're easy to catch if you look for them.
Harder to measure, but important. Use an LLM-as-judge pass on the trajectory:
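A sketch of such a judge pass. `call_llm` stands in for whatever client wrapper you already have (takes a prompt string, returns the model's text), and the scoring dimensions are one reasonable choice, not a standard:

```python
import json

JUDGE_PROMPT = """You are reviewing an AI agent's reasoning trace.

Trajectory:
{trajectory}

Score 1-5 on each dimension and reply with JSON only:
- "coherence": do the steps follow logically from one another?
- "necessity": was each tool call needed, or was work redundant?
- "grounding": are claims supported by tool outputs rather than guessed?
"""

def judge_trajectory(trajectory_text: str, call_llm) -> dict:
    """Run one LLM-as-judge pass over a serialized trajectory."""
    raw = call_llm(JUDGE_PROMPT.format(trajectory=trajectory_text))
    return json.loads(raw)
```

Asking for JSON with fixed keys keeps the judge's output machine-readable, so low scores can be routed to a review queue automatically.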
Flag trajectories with low reasoning scores for human review. You'll learn a lot about where your agent struggles.
Almost every production agent trajectory has at least one tool error. The question is how the agent handled it. Measure:
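These error-handling behaviors can be counted mechanically. The step shape (`{"tool": str, "ok": bool}`) is an assumed simplification of whatever your trace format records:

```python
def recovery_metrics(steps: list[dict]) -> dict:
    """Classify what the agent did immediately after each failed tool call:
    retried the same tool, switched to a different one, or stopped entirely."""
    errors = [i for i, s in enumerate(steps) if not s["ok"]]
    retried_same = sum(1 for i in errors
                       if i + 1 < len(steps)
                       and steps[i + 1]["tool"] == steps[i]["tool"])
    switched = sum(1 for i in errors
                   if i + 1 < len(steps)
                   and steps[i + 1]["tool"] != steps[i]["tool"])
    ended_on_error = sum(1 for i in errors if i + 1 == len(steps))
    return {"errors": len(errors), "retried_same_tool": retried_same,
            "switched_strategy": switched, "ended_on_error": ended_on_error}
```

None of these categories is inherently good or bad; what matters is whether their distribution shifts after a change.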
Not every trajectory property needs an LLM judge. Many can be checked with simple rules:
```python
assert trajectory.step_count <= 15
assert "delete_production_data" not in trajectory.tools_called
assert no_duplicate_calls(trajectory, threshold=3)
assert final_answer.matches_schema(expected_schema)
```
These run in microseconds and catch the worst regressions cheaply.
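One possible implementation of a duplicate-call check like the `no_duplicate_calls` rule above, assuming calls are recorded as (tool, arguments) pairs:

```python
from collections import Counter

def no_duplicate_calls(tool_calls: list[tuple[str, str]], threshold: int = 3) -> bool:
    """True if no identical (tool, arguments) pair repeats `threshold` or more
    times -- a cheap proxy for detecting an agent stuck in a loop."""
    counts = Counter(tool_calls)
    return all(n < threshold for n in counts.values())

assert no_duplicate_calls([("search", "cats"), ("search", "dogs")])
assert not no_duplicate_calls([("search", "cats")] * 3)
```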
Without trajectory eval, a cost regression like the one above ships and nobody notices until the next billing cycle.
Give a judge LLM the full trajectory (or a summary of it) plus the expected outcome. Ask:
Works well for nuanced judgment. Expensive, so sample rather than running on every case.
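A small harness for running that judge on a deterministic sample rather than every case. `call_llm`, the case shape, and the questions are illustrative assumptions:

```python
import random

QUESTIONS = [
    "Did the agent reach the expected outcome?",
    "Were any steps unnecessary or unsafe?",
    "Where did the trajectory first go wrong, if it did?",
]

def sampled_judge(cases: list[dict], call_llm,
                  sample_rate: float = 0.1, seed: int = 0) -> list[str]:
    """Run the expensive judge on a seeded random sample of eval cases.
    Each case is assumed to look like {"trajectory": ..., "expected": ...}."""
    rng = random.Random(seed)  # fixed seed: the same cases are sampled each run
    picked = [c for c in cases if rng.random() < sample_rate]
    prompt = ("Trajectory:\n{trajectory}\n\nExpected outcome:\n{expected}\n\n"
              + "\n".join(QUESTIONS))
    return [call_llm(prompt.format(**c)) for c in picked]
```

Seeding the sampler means two eval runs judge the same subset, so score changes reflect the agent, not sampling noise.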
When you change anything (prompt, model, tool), compare trajectories across versions. Same completion rate + shorter trajectory = improvement. Same completion rate + longer trajectory = silent regression you should fix before shipping.
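The comparison above reduces to two deltas. A sketch, with a hypothetical per-run shape of `{"completed": bool, "steps": int}`:

```python
def compare_versions(baseline: list[dict], candidate: list[dict]) -> dict:
    """Flag the 'same completion rate, longer trajectory' silent regression."""
    def rate(runs):
        return sum(r["completed"] for r in runs) / len(runs)
    def mean_steps(runs):
        return sum(r["steps"] for r in runs) / len(runs)
    return {
        "completion_delta": rate(candidate) - rate(baseline),
        "step_delta": mean_steps(candidate) - mean_steps(baseline),
    }

old = [{"completed": True, "steps": 5}] * 19 + [{"completed": False, "steps": 8}]
new = [{"completed": True, "steps": 12}] * 19 + [{"completed": False, "steps": 15}]
print(compare_versions(old, new))  # completion unchanged, steps way up: regression
```

A gate as simple as "block the release if `completion_delta <= 0` and `step_delta > 0`" catches exactly the case this section warns about.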