Trajectory evaluation

Trajectory evaluation looks at how the agent got to its answer, not just whether it got there. Two agents can give the exact same final answer while differing by 10× in cost, using completely different tools, and making different safety decisions along the way. Task completion tells you if it worked. Trajectory eval tells you how well it worked. Both matter. The second is how you catch silent cost spikes and subtle model regressions.

Four categories to measure

Efficiency: usually the first regression

The most common silent regression is efficiency. Completion rate stays at 95%. Nobody notices anything. But the average task now takes 12 tool calls instead of 5, and your bill doubled. Efficiency metrics catch this immediately.

Track these as trend lines. A sudden jump after a prompt change is your signal.
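As a concrete sketch: assuming a hypothetical `Trajectory` record with `tool_calls`, `total_tokens`, and `cost_usd` fields (adapt the names to your own trace format), per-run averages and jump detection might look like this:

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    tool_calls: list      # names of tools invoked, in order
    total_tokens: int     # prompt + completion tokens across all steps
    cost_usd: float       # per-task cost from your provider's billing metadata

def efficiency_metrics(trajectories):
    """Average efficiency numbers for one eval run; log these over time."""
    n = len(trajectories)
    return {
        "avg_steps": sum(len(t.tool_calls) for t in trajectories) / n,
        "avg_tokens": sum(t.total_tokens for t in trajectories) / n,
        "avg_cost_usd": sum(t.cost_usd for t in trajectories) / n,
    }

def flag_regression(baseline, current, tolerance=0.25):
    """Return the metrics that jumped more than `tolerance` vs. the baseline run."""
    return [k for k in baseline if current[k] > baseline[k] * (1 + tolerance)]
```

The tolerance is a knob to tune: too tight and normal run-to-run noise pages you, too loose and a doubled bill slips through.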

Tool selection: did it pick right?

For each eval case, note which tools the "ideal" trajectory would use. Compare them to what the agent actually did.

Tool-selection regressions are often caused by prompt edits or tool-description changes. They're easy to catch if you look for them.
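One simple way to score the comparison is set-based precision and recall over tool names; a minimal sketch (the function name and return shape are illustrative, not a standard API):

```python
def tool_selection_score(expected, actual):
    """Compare the tools the agent used against the expected ('ideal') set.

    Precision: did it avoid tools the task didn't need?
    Recall: did it use every tool the task did need?
    Set-based and order-agnostic, so it won't catch sequencing errors.
    """
    expected, actual = set(expected), set(actual)
    if not actual:
        return {"precision": 0.0, "recall": 0.0}
    overlap = expected & actual
    return {
        "precision": len(overlap) / len(actual),
        "recall": len(overlap) / len(expected) if expected else 1.0,
    }
```

A drop in precision after a tool-description change usually means the agent is now reaching for tools it shouldn't; a drop in recall means it stopped using one it should.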

Reasoning quality

Harder to measure, but important. Use an LLM-as-judge pass on the trajectory.

Flag trajectories with low reasoning scores for human review. You'll learn a lot about where your agent struggles.
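The flagging step can stay mechanical; a minimal sketch, assuming each trajectory already has a judge score in [0, 1] (threshold and sample rate are illustrative knobs):

```python
import random

def sample_for_review(scored, threshold=0.6, extra_rate=0.05, seed=0):
    """scored: list of (trajectory_id, reasoning_score) pairs.

    Queue everything below the threshold for human review, plus a small
    random sample of passing trajectories so reviewers also see what
    'normal' behavior looks like and can calibrate.
    """
    rng = random.Random(seed)
    flagged = [tid for tid, s in scored if s < threshold]
    passing = [tid for tid, s in scored if s >= threshold]
    flagged += [tid for tid in passing if rng.random() < extra_rate]
    return flagged
```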

Error recovery: the hidden quality signal

Almost every production agent trajectory has at least one tool error. The question is how the agent handled it, so measure recovery behavior directly.
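A minimal sketch of such a measurement, assuming each step is recorded as a `(tool_name, succeeded)` pair (adapt to your own trace format):

```python
def error_recovery_stats(steps):
    """steps: list of (tool_name, succeeded) tuples in call order.

    An error counts as 'recovered' if any later call succeeds. Also count
    'blind retries': the same tool called again immediately after it
    failed, and failing again -- a common sign of a retry loop.
    """
    errors = recovered = blind_retries = 0
    for i, (tool, ok) in enumerate(steps):
        if ok:
            continue
        errors += 1
        later = steps[i + 1:]
        if any(ok2 for _, ok2 in later):
            recovered += 1
        if later and later[0][0] == tool and not later[0][1]:
            blind_retries += 1
    return {"errors": errors, "recovered": recovered, "blind_retries": blind_retries}
```

A high recovered/errors ratio with few blind retries is the healthy pattern; lots of blind retries is usually a prompt or tool-schema problem, not a model problem.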

Automatable checks

Not every trajectory property needs an LLM judge. Many can be checked with simple rules:

assert trajectory.step_count <= 15                               # cap runaway loops
assert "delete_production_data" not in trajectory.tools_called   # hard safety rail
assert no_duplicate_calls(trajectory, threshold=3)               # no tight retry loops
assert final_answer.matches_schema(expected_schema)              # output shape intact

These run in microseconds and catch the worst regressions cheaply.
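The same checks can be packaged as one reusable function that collects violations instead of raising on the first one; a sketch assuming the trajectory is a dict with a `tools_called` list of `(tool, args)` pairs (rename fields to match your traces):

```python
from collections import Counter

def rule_check(trajectory, max_steps=15,
               forbidden=("delete_production_data",), dup_threshold=3):
    """Cheap rule-based trajectory checks; returns a list of violations."""
    calls = trajectory["tools_called"]
    violations = []
    if len(calls) > max_steps:
        violations.append(f"too many steps: {len(calls)} > {max_steps}")
    for tool, _ in calls:
        if tool in forbidden:
            violations.append(f"forbidden tool called: {tool}")
    # Identical (tool, args) pairs repeated past the threshold suggest a loop.
    for (tool, args), count in Counter(calls).items():
        if count >= dup_threshold:
            violations.append(f"duplicate call x{count}: {tool}")
    return violations
```

Returning all violations at once makes the CI failure message actionable in one pass, rather than fix-rerun-fix.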

A worked example: catching a silent cost regression

  1. Team updates the system prompt to be more detailed.
  2. Completion rate on eval stays at 95%. "Ship it."
  3. Trajectory metrics: average steps jumped from 5.2 to 8.7. Cost per task up 40%.
  4. CI blocks the change. Team investigates.
  5. Turns out the new prompt encourages over-exploration. They trim it. Steps drop back to 5.3. Ship.

Without trajectory eval, the cost regression ships and nobody notices until the next billing cycle.
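The gate in step 4 can be a few lines; a sketch with an illustrative 20% growth budget, using the numbers from the example above:

```python
def ci_gate(baseline_avg_steps, candidate_avg_steps, max_increase=0.20):
    """Block the change if average steps grew more than `max_increase`."""
    growth = candidate_avg_steps / baseline_avg_steps - 1
    return {"growth": round(growth, 2), "blocked": growth > max_increase}

# The jump from 5.2 to 8.7 average steps is ~67% growth, well past the budget.
result = ci_gate(5.2, 8.7)
```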

LLM-as-judge on trajectories

Give a judge LLM the full trajectory (or a summary of it) plus the expected outcome, and ask it to score whether each step plausibly advanced the task.

Works well for nuanced judgment. Expensive, so sample rather than running on every case.
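A sketch of the sampled-judge pattern; `call_llm` is a stand-in for whatever model client you use (any prompt-to-string function), and the prompt wording and 1–5 scale are illustrative:

```python
import json
import random

JUDGE_PROMPT = """You are reviewing an AI agent's trajectory.
Expected outcome: {expected}
Trajectory (tool calls and observations):
{trajectory}
Did each step plausibly advance the task? Reply as JSON:
{{"score": <1-5>, "rationale": "<one sentence>"}}"""

def judge_sample(trajectories, expected_outcomes, call_llm, rate=0.2, seed=0):
    """Run the judge on a random fraction of cases to keep costs down."""
    rng = random.Random(seed)
    results = {}
    for i, (traj, exp) in enumerate(zip(trajectories, expected_outcomes)):
        if rng.random() >= rate:
            continue  # skip most cases; sampling, not exhaustive judging
        raw = call_llm(JUDGE_PROMPT.format(expected=exp, trajectory=traj))
        results[i] = json.loads(raw)
    return results
```

Fixing the seed makes the sample reproducible across CI runs, so score trends are comparable run to run.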

Comparative trajectory eval

When you change anything (prompt, model, tool), compare trajectories across versions. Same completion rate + shorter trajectory = improvement. Same completion rate + longer trajectory = silent regression you should fix before shipping.
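The rule above can be encoded directly; a minimal sketch assuming per-case results keyed by case id, with a hypothetical 2-point tolerance on completion rate:

```python
def compare_versions(cases_a, cases_b):
    """cases_*: {case_id: {'completed': bool, 'steps': int}} for two versions.

    Same completion rate + shorter trajectories = improvement.
    Same completion rate + longer trajectories = silent regression.
    """
    shared = cases_a.keys() & cases_b.keys()
    rate_a = sum(cases_a[c]["completed"] for c in shared) / len(shared)
    rate_b = sum(cases_b[c]["completed"] for c in shared) / len(shared)
    steps_a = sum(cases_a[c]["steps"] for c in shared) / len(shared)
    steps_b = sum(cases_b[c]["steps"] for c in shared) / len(shared)
    if abs(rate_a - rate_b) >= 0.02:
        verdict = "completion rate changed; investigate"
    elif steps_b < steps_a:
        verdict = "improvement"
    elif steps_b > steps_a:
        verdict = "silent regression"
    else:
        verdict = "neutral"
    return {"rate_delta": rate_b - rate_a, "steps_delta": steps_b - steps_a,
            "verdict": verdict}
```

Comparing only the shared case ids keeps the diff apples-to-apples when the eval set itself changes between runs.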

Pitfalls

What to do with this