Task completion is the primary agent metric. Forget trajectory, forget cost: did the agent achieve the user's goal? If not, nothing else matters. But measuring task completion reliably is harder than it sounds. There are four common ways to grade it.
Binary. Did it work, yes or no? Works when the task has a clear success criterion (a test passed, a file was created with the expected contents, a question got the correct answer).
Scale of 1-5. Useful for tasks where "correct" has degrees (writing quality, helpfulness, thoroughness).
Reference comparison. Compare the agent's output to a known-correct answer: direct match, semantic similarity, or structured equivalence.
LLM-as-judge. Use a strong LLM to grade the agent's output. This scales, but introduces judge bias.
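The first and third styles are cheap to automate. A minimal sketch, assuming a hypothetical harness where each task writes to an `output.txt` in its working directory and references may be JSON (the function names and file layout are illustrative, not from any specific framework):

```python
import json


def binary_check(task_dir: str, expected: str) -> bool:
    """Pass/fail: the file the agent was asked to create exists
    and has the expected contents (whitespace-insensitive)."""
    try:
        with open(f"{task_dir}/output.txt") as f:
            return f.read().strip() == expected.strip()
    except FileNotFoundError:
        return False  # no file at all is an automatic fail


def reference_match(agent_output: str, reference: str) -> bool:
    """Structured equivalence: parse both sides as JSON and compare,
    so key order and whitespace differences don't cause false failures."""
    try:
        return json.loads(agent_output) == json.loads(reference)
    except json.JSONDecodeError:
        # not JSON: fall back to a direct string match
        return agent_output.strip() == reference.strip()
```

Structured equivalence matters more than it looks: an agent that emits `{"b": 2, "a": 1}` against a reference of `{"a": 1, "b": 2}` is correct, and a naive string diff would fail it.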
For each eval case, the verifier should check the actual outcome, not just the agent's claim. If the agent says "I deleted the file," the verifier checks that the file is actually gone.
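A sketch of that principle for the deletion example, assuming the harness has the target path and the agent's transcript (both the claim-detection heuristic and the function name are illustrative):

```python
import os


def verify_deletion(path: str, transcript: str) -> bool:
    """Grade on reality, not on the transcript: the file must be gone."""
    claimed = "deleted" in transcript.lower()
    gone = not os.path.exists(path)
    if claimed and not gone:
        # The agent reported success but the file still exists --
        # a hallucinated completion, the most important failure to catch.
        return False
    return gone
```

The key design choice is that `claimed` never flips a failure into a pass; the filesystem state is the only thing that can.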
Agents are non-deterministic, so run each eval case 3-5 times. Passing 5/5 gives you high confidence; 3/5 signals a reliability problem.
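The repeated-run loop can be sketched as follows. Here `run_case` is a stand-in for actually executing the agent plus its verifier on one case; in this sketch it simulates a flaky agent with a seeded RNG so the harness logic itself is deterministic and testable:

```python
import random


def run_case(case_id: str, rng: random.Random) -> bool:
    # Placeholder: a real harness would launch the agent on `case_id`
    # and return the verifier's verdict. This simulates ~60% reliability.
    return rng.random() < 0.6


def pass_rate(case_id: str, trials: int = 5, seed: int = 0) -> float:
    """Run one eval case `trials` times and report the fraction that passed."""
    rng = random.Random(seed)
    passes = sum(run_case(case_id, rng) for _ in range(trials))
    return passes / trials
```

Reporting the rate rather than a single pass/fail is the point: a case that scores 0.6 across five trials is telling you something a lucky single run would hide.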