Trajectory evaluation

Trajectory evaluation looks at how the agent got to its answer, not just whether it got there. Two agents can give the exact same final answer while differing by 10× in cost, using completely different tools, and making different safety decisions along the way. Task completion tells you if it worked. Trajectory eval tells you how well it worked. Both matter. The second is how you catch silent cost spikes and subtle model regressions.

Four categories to measure

Efficiency: usually the first regression

The most common silent regression is efficiency. Completion rate stays at 95%. Nobody notices anything. But the average task now takes 12 tool calls instead of 5, and your bill doubled. Efficiency metrics catch this immediately.

Track these as trend lines. A sudden jump after a prompt change is your signal.
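As a concrete sketch: assuming a hypothetical `Trajectory` record with `tool_calls`, `total_tokens`, and `cost_usd` fields (adapt the names to your own trace format), per-run averages and jump detection might look like this:

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    tool_calls: list      # names of tools invoked, in order
    total_tokens: int     # prompt + completion tokens across all steps
    cost_usd: float       # per-task cost from your provider's billing metadata

def efficiency_metrics(trajectories):
    """Average efficiency numbers for one eval run; log these over time."""
    n = len(trajectories)
    return {
        "avg_steps": sum(len(t.tool_calls) for t in trajectories) / n,
        "avg_tokens": sum(t.total_tokens for t in trajectories) / n,
        "avg_cost_usd": sum(t.cost_usd for t in trajectories) / n,
    }

def flag_regression(baseline, current, tolerance=0.25):
    """Return the metrics that jumped more than `tolerance` vs. the baseline run."""
    return [k for k in baseline if current[k] > baseline[k] * (1 + tolerance)]
```

The tolerance is a knob to tune: too tight and normal run-to-run noise pages you, too loose and a doubled bill slips through.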

Tool selection: did it pick right?

For each eval case, note which tools the "ideal" trajectory would use. Compare them to what the agent actually did.

Tool-selection regressions are often caused by prompt edits or tool-description changes. They're easy to catch if you look for them.
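One simple way to score the comparison is set-based precision and recall over tool names; a minimal sketch (the function name and return shape are illustrative, not a standard API):

```python
def tool_selection_score(expected, actual):
    """Compare the tools the agent used against the expected ('ideal') set.

    Precision: did it avoid tools the task didn't need?
    Recall: did it use every tool the task did need?
    Set-based and order-agnostic, so it won't catch sequencing errors.
    """
    expected, actual = set(expected), set(actual)
    if not actual:
        return {"precision": 0.0, "recall": 0.0}
    overlap = expected & actual
    return {
        "precision": len(overlap) / len(actual),
        "recall": len(overlap) / len(expected) if expected else 1.0,
    }
```

A drop in precision after a tool-description change usually means the agent is now reaching for tools it shouldn't; a drop in recall means it stopped using one it should.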

Reasoning quality

Harder to measure, but important. Use an LLM-as-judge pass on the trajectory.

Flag trajectories with low reasoning scores for human review. You'll learn a lot about where your agent struggles.
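The flagging step can stay mechanical; a minimal sketch, assuming each trajectory already has a judge score in [0, 1] (threshold and sample rate are illustrative knobs):

```python
import random

def sample_for_review(scored, threshold=0.6, extra_rate=0.05, seed=0):
    """scored: list of (trajectory_id, reasoning_score) pairs.

    Queue everything below the threshold for human review, plus a small
    random sample of passing trajectories so reviewers also see what
    'normal' behavior looks like and can calibrate.
    """
    rng = random.Random(seed)
    flagged = [tid for tid, s in scored if s < threshold]
    passing = [tid for tid, s in scored if s >= threshold]
    flagged += [tid for tid in passing if rng.random() < extra_rate]
    return flagged
```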

Error recovery: the hidden quality signal

Almost every production agent trajectory has at least one tool error. The question is how the agent handled it, so measure recovery behavior directly.
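A minimal sketch of such a measurement, assuming each step is recorded as a `(tool_name, succeeded)` pair (adapt to your own trace format):

```python
def error_recovery_stats(steps):
    """steps: list of (tool_name, succeeded) tuples in call order.

    An error counts as 'recovered' if any later call succeeds. Also count
    'blind retries': the same tool called again immediately after it
    failed, and failing again -- a common sign of a retry loop.
    """
    errors = recovered = blind_retries = 0
    for i, (tool, ok) in enumerate(steps):
        if ok:
            continue
        errors += 1
        later = steps[i + 1:]
        if any(ok2 for _, ok2 in later):
            recovered += 1
        if later and later[0][0] == tool and not later[0][1]:
            blind_retries += 1
    return {"errors": errors, "recovered": recovered, "blind_retries": blind_retries}
```

A high recovered/errors ratio with few blind retries is the healthy pattern; lots of blind retries is usually a prompt or tool-schema problem, not a model problem.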

Automatable checks

Not every trajectory property needs an LLM judge. Many can be checked with simple rules:

assert trajectory.step_count <= 15                               # cap runaway loops
assert "delete_production_data" not in trajectory.tools_called   # hard safety rail
assert no_duplicate_calls(trajectory, threshold=3)               # no tight retry loops
assert final_answer.matches_schema(expected_schema)              # output shape intact

These run in microseconds and catch the worst regressions cheaply.
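The same checks can be packaged as one reusable function that collects violations instead of raising on the first one; a sketch assuming the trajectory is a dict with a `tools_called` list of `(tool, args)` pairs (rename fields to match your traces):

```python
from collections import Counter

def rule_check(trajectory, max_steps=15,
               forbidden=("delete_production_data",), dup_threshold=3):
    """Cheap rule-based trajectory checks; returns a list of violations."""
    calls = trajectory["tools_called"]
    violations = []
    if len(calls) > max_steps:
        violations.append(f"too many steps: {len(calls)} > {max_steps}")
    for tool, _ in calls:
        if tool in forbidden:
            violations.append(f"forbidden tool called: {tool}")
    # Identical (tool, args) pairs repeated past the threshold suggest a loop.
    for (tool, args), count in Counter(calls).items():
        if count >= dup_threshold:
            violations.append(f"duplicate call x{count}: {tool}")
    return violations
```

Returning all violations at once makes the CI failure message actionable in one pass, rather than fix-rerun-fix.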

A worked example: catching a silent cost regression

  1. Team updates the system prompt to be more detailed.
  2. Completion rate on eval stays at 95%. "Ship it."
  3. Trajectory metrics: average steps jumped from 5.2 to 8.7. Cost per task up 40%.
  4. CI blocks the change. Team investigates.
  5. Turns out the new prompt encourages over-exploration. They trim it. Steps drop back to 5.3. Ship.

Without trajectory eval, the cost regression ships and nobody notices until the next billing cycle.
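The gate in step 4 can be a few lines; a sketch with an illustrative 20% growth budget, using the numbers from the example above:

```python
def ci_gate(baseline_avg_steps, candidate_avg_steps, max_increase=0.20):
    """Block the change if average steps grew more than `max_increase`."""
    growth = candidate_avg_steps / baseline_avg_steps - 1
    return {"growth": round(growth, 2), "blocked": growth > max_increase}

# The jump from 5.2 to 8.7 average steps is ~67% growth, well past the budget.
result = ci_gate(5.2, 8.7)
```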

LLM-as-judge on trajectories

Give a judge LLM the full trajectory (or a summary of it) plus the expected outcome, and ask it to score whether each step plausibly advanced the task.

Works well for nuanced judgment. Expensive, so sample rather than running on every case.
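A sketch of the sampled-judge pattern; `call_llm` is a stand-in for whatever model client you use (any prompt-to-string function), and the prompt wording and 1–5 scale are illustrative:

```python
import json
import random

JUDGE_PROMPT = """You are reviewing an AI agent's trajectory.
Expected outcome: {expected}
Trajectory (tool calls and observations):
{trajectory}
Did each step plausibly advance the task? Reply as JSON:
{{"score": <1-5>, "rationale": "<one sentence>"}}"""

def judge_sample(trajectories, expected_outcomes, call_llm, rate=0.2, seed=0):
    """Run the judge on a random fraction of cases to keep costs down."""
    rng = random.Random(seed)
    results = {}
    for i, (traj, exp) in enumerate(zip(trajectories, expected_outcomes)):
        if rng.random() >= rate:
            continue  # skip most cases; sampling, not exhaustive judging
        raw = call_llm(JUDGE_PROMPT.format(expected=exp, trajectory=traj))
        results[i] = json.loads(raw)
    return results
```

Fixing the seed makes the sample reproducible across CI runs, so score trends are comparable run to run.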

Comparative trajectory eval

When you change anything (prompt, model, tool), compare trajectories across versions. Same completion rate + shorter trajectory = improvement. Same completion rate + longer trajectory = silent regression you should fix before shipping.
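The rule above can be encoded directly; a minimal sketch assuming per-case results keyed by case id, with a hypothetical 2-point tolerance on completion rate:

```python
def compare_versions(cases_a, cases_b):
    """cases_*: {case_id: {'completed': bool, 'steps': int}} for two versions.

    Same completion rate + shorter trajectories = improvement.
    Same completion rate + longer trajectories = silent regression.
    """
    shared = cases_a.keys() & cases_b.keys()
    rate_a = sum(cases_a[c]["completed"] for c in shared) / len(shared)
    rate_b = sum(cases_b[c]["completed"] for c in shared) / len(shared)
    steps_a = sum(cases_a[c]["steps"] for c in shared) / len(shared)
    steps_b = sum(cases_b[c]["steps"] for c in shared) / len(shared)
    if abs(rate_a - rate_b) >= 0.02:
        verdict = "completion rate changed; investigate"
    elif steps_b < steps_a:
        verdict = "improvement"
    elif steps_b > steps_a:
        verdict = "silent regression"
    else:
        verdict = "neutral"
    return {"rate_delta": rate_b - rate_a, "steps_delta": steps_b - steps_a,
            "verdict": verdict}
```

Comparing only the shared case ids keeps the diff apples-to-apples when the eval set itself changes between runs.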

Pitfalls

What to do with this