Task completion

Task completion is the headline metric for any agent. Forget cost, forget trajectory, forget the fancy dashboards: did the agent actually do what the user wanted? If not, nothing else matters. Measuring it reliably is harder than it looks. The agent will cheerfully say "done!" when it hasn't finished, or produce a correct-looking answer that is subtly wrong. A good completion eval looks at the outcome, not the agent's self-report.

Four grading approaches

Prefer binary when you can

Binary grading is cheap, reproducible, and non-debatable. "Did the PR get merged? Yes/No." "Did the SQL query return the right row count? Yes/No." If you can express success as a boolean, do. Graded scales introduce judge variance and make regressions harder to spot.
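A binary grader is just a predicate over the observable outcome. A minimal sketch for the row-count case, assuming the harness captures the agent's query result as a list of rows (`grade_sql_task` and its arguments are illustrative, not a real API):

```python
def grade_sql_task(returned_rows: list, expected_rows: list) -> bool:
    """Binary grade: pass only if the query result matches the golden
    result exactly. No partial credit, no judge in the loop."""
    return returned_rows == expected_rows
```

A flipped boolean across two eval runs is an unambiguous regression signal; a graded score drifting from 7.2 to 6.8 is not.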

The four failure modes you must catch

Verifier design: check the outcome, not the claim

This is the single biggest mistake in agent eval. Don't trust what the agent says happened; check what actually happened. The worked example below shows the difference.

A worked example: a "fix the failing test" agent

Naive eval: "Did the agent say it fixed the test?" This catches almost nothing. Agents say they fixed things they didn't.

Good eval:

  1. Verifier runs the test suite before the agent runs. Record failures.
  2. Agent runs.
  3. Verifier runs the test suite after. Record failures.
  4. Pass if the specific test that was failing is now passing and no other test regressed.

That catches false successes ("I fixed it" but the test still fails), partial completion (fixed one test, broke another), and silent failures (agent never actually changed any code).
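The four steps above can be sketched as a small verifier. This is a sketch under assumptions: the project uses pytest, and the exact output parsing (`-q` format, the `FAILED` prefix, the token position of the test id) would need adapting to the real test runner.

```python
import subprocess

def failing_tests() -> set:
    """Run the suite and return the set of failing test ids.
    The pytest flags and output parsing here are assumptions."""
    proc = subprocess.run(
        ["pytest", "-q", "--tb=no"], capture_output=True, text=True
    )
    return {
        line.split()[1]                      # e.g. tests/test_x.py::test_y
        for line in proc.stdout.splitlines()
        if line.startswith("FAILED")
    }

def grade(before: set, after: set, target: str) -> bool:
    """Pass iff the target test was failing, now passes, and no
    previously passing test started failing."""
    was_failing = target in before          # sanity-check the setup
    now_passes = target not in after        # catches false successes
    no_regressions = after <= before        # catches partial completion
    return was_failing and now_passes and no_regressions
```

The silent-failure case falls out for free: if the agent never changed any code, `before == after`, the target is still failing, and `grade` returns False.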

Non-determinism: always run multiple times

Agents are stochastic: the same eval case can pass once, fail twice, then pass again. Single-run pass/fail is noisy. Run each case 3-5 times and grade on the pass rate.
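A minimal sketch of multi-run grading, assuming each case is wrapped in a zero-argument callable (`run_case` is hypothetical) that returns a boolean from a verifier like the ones above:

```python
def pass_rate(run_case, attempts: int = 5) -> float:
    """Run one eval case several times; return the fraction of passes."""
    passes = sum(1 for _ in range(attempts) if run_case())
    return passes / attempts

def classify(rate: float) -> str:
    """Separate stable results from flaky ones worth investigating."""
    if rate == 1.0:
        return "stable pass"
    if rate == 0.0:
        return "stable fail"
    return "flaky"
```

A case that flips between pass and fail is its own signal: "flaky" usually means the task is under-specified or the verifier is too strict, and both are worth distinguishing from a clean regression.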

LLM-as-judge: useful, biased

When outputs are free-form (writing, summaries, conversations), use a strong LLM to grade them: "Does this output accomplish the task?" The judge outputs a verdict and a rationale. This scales, but the judge has its own biases; validate it every few weeks against human grading on a sample.
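A judge can be sketched as a prompt plus a parser. Here `call_llm` is a hypothetical injected function (prompt string in, response string out) standing in for whatever model client you use; asking for a binary verdict plus a rationale keeps the headline metric boolean while preserving a debugging trail.

```python
import json

JUDGE_PROMPT = """You are grading an AI agent's output.
Task: {task}
Output: {output}
Does the output accomplish the task? Reply with only JSON:
{{"pass": true or false, "rationale": "one sentence"}}"""

def judge(call_llm, task: str, output: str) -> dict:
    """Grade a free-form output with an LLM judge.
    call_llm: hypothetical callable, prompt str -> response str."""
    raw = call_llm(JUDGE_PROMPT.format(task=task, output=output))
    verdict = json.loads(raw)
    return {"pass": bool(verdict["pass"]),
            "rationale": verdict["rationale"]}
```

Injecting `call_llm` also makes the judge itself testable: stub it with canned responses, and periodically replay a human-graded sample through it to measure judge/human agreement.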

Pitfalls

What to do with this