Task completion

Task completion is the primary agent metric. Set aside trajectory and cost: did the agent achieve the user's goal? If not, nothing else matters. Measuring task completion reliably is harder than it sounds.

Grading approaches

Binary

Did it work? Yes/no. Works when the task has a clear success criterion (a test passes, a file is created with the expected contents, a question gets the correct answer).
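A binary verifier for the file-creation case can be a few lines. This is a minimal sketch; the function name and signature are illustrative, not from any particular framework:

```python
from pathlib import Path

def verify_file_created(path: str, expected: str) -> bool:
    """Binary pass/fail: the file exists and has exactly the expected contents."""
    p = Path(path)
    return p.is_file() and p.read_text() == expected
```

The same shape works for the test-suite case: run the suite in the agent's workspace and return whether the exit code is zero.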

Graded

Scale 1-5. Useful for tasks where "correct" has degrees (writing quality, helpfulness, thoroughness).
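One practical wrinkle with graded scores is mixing them with binary results in the same suite. A common approach (sketched here as an assumption, not a standard) is to normalize the 1-5 scale onto [0, 1] before averaging:

```python
def normalize(score: int, low: int = 1, high: int = 5) -> float:
    """Map a graded score onto [0, 1] so it can be averaged with binary pass/fail."""
    return (score - low) / (high - low)
```

With this, a binary pass counts as 1.0, a fail as 0.0, and a graded 3/5 as 0.5.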

Ground-truth comparison

Compare the agent's output to a known-correct answer, via direct match, semantic similarity, or structured equivalence.
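Structured equivalence matters because two textually different outputs can encode the same answer. A minimal sketch for JSON outputs (the string-match fallback is an assumption about how you want non-JSON handled):

```python
import json

def structurally_equal(agent_output: str, ground_truth: str) -> bool:
    """Parse both sides as JSON and compare values, ignoring key order
    and whitespace; fall back to a stripped string match otherwise."""
    try:
        return json.loads(agent_output) == json.loads(ground_truth)
    except json.JSONDecodeError:
        return agent_output.strip() == ground_truth.strip()
```

This treats `{"a": 1, "b": 2}` and `{ "b": 2, "a": 1 }` as equal, which a direct string match would not.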

LLM-as-judge

Use a strong LLM to grade the agent's output. Scales well, but introduces judge bias.
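The judge is just a prompt plus a parser. A hedged sketch, where `call_llm` is a stand-in for whatever client you use (the prompt wording and the single-integer reply format are assumptions):

```python
# `call_llm` is hypothetical: any callable that takes a prompt string
# and returns the model's reply as a string.
JUDGE_PROMPT = """You are grading an AI agent's answer.
Task: {task}
Agent output: {output}
Reply with a single integer from 1 to 5, where 5 means fully correct and complete."""

def judge(task: str, output: str, call_llm) -> int:
    """Ask a strong model for a 1-5 grade and parse the integer reply."""
    reply = call_llm(JUDGE_PROMPT.format(task=task, output=output))
    return int(reply.strip())
```

In practice you would also validate that the reply is in range and retry on parse failures, since judges do not always follow the output format.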

What to watch out for

Verifier design

For each eval case, the verifier should check the actual outcome, not just the agent's claim. If the agent says "I deleted the file", the verifier checks that the file is actually gone.
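For the deletion example, the verifier never reads the transcript at all; it inspects the workspace. A minimal sketch (names are illustrative):

```python
from pathlib import Path

def verify_deletion(workspace: str, target: str) -> bool:
    """Check the outcome, not the claim: the target file must actually be gone."""
    return not (Path(workspace) / target).exists()
```

The agent's "I deleted the file" message is deliberately not an input to this function.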

Run multiple times

Agents are non-deterministic, so run each eval case 3-5 times. Passing 5/5 gives high confidence; 3/5 signals a reliability problem.
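Repeated runs reduce to a pass rate per case. A sketch, where `run_fn` is a hypothetical runner that executes the agent on one case and returns True on success:

```python
def pass_rate(run_fn, case, n: int = 5) -> float:
    """Run one eval case n times and return the fraction of passes.
    `run_fn(case)` is assumed to return True on success, False otherwise."""
    passes = sum(1 for _ in range(n) if run_fn(case))
    return passes / n
```

A suite-level report is then a pass rate per case rather than a single bit, which makes flaky cases (the 3/5s) visible instead of averaging them away.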