Task completion

Task completion is the headline metric for any agent. Forget cost, forget trajectory, forget the fancy dashboards: did the agent actually do what the user wanted? If not, nothing else matters. Measuring it reliably is harder than it looks. The agent will cheerfully say "done!" when it hasn't finished, or produce a correct-looking answer that is subtly wrong. A good completion eval looks at the outcome, not the agent's self-report.

Four grading approaches

Prefer binary when you can

Binary grading is cheap, reproducible, and non-debatable. "Did the PR get merged? Yes/No." "Did the SQL query return the right row count? Yes/No." If you can express success as a boolean, do. Graded scales introduce judge variance and make regressions harder to spot.
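A binary grader is just a predicate over the observable outcome. A minimal sketch for the row-count case, assuming the harness captures the agent's query result as a list of rows (`grade_sql_task` and its arguments are illustrative, not a real API):

```python
def grade_sql_task(returned_rows: list, expected_rows: list) -> bool:
    """Binary grade: pass only if the query result matches the golden
    result exactly. No partial credit, no judge in the loop."""
    return returned_rows == expected_rows
```

A flipped boolean across two eval runs is an unambiguous regression signal; a graded score drifting from 7.2 to 6.8 is not.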

The four failure modes you must catch

Verifier design: check the outcome, not the claim

This is the single biggest mistake in agent eval. Don't trust what the agent says happened; check what actually happened. The worked example below shows the difference.

A worked example: a "fix the failing test" agent

Naive eval: "Did the agent say it fixed the test?" This catches almost nothing. Agents say they fixed things they didn't.

Good eval:

  1. Verifier runs the test suite before the agent runs. Record failures.
  2. Agent runs.
  3. Verifier runs the test suite after. Record failures.
  4. Pass if the specific test that was failing is now passing and no other test regressed.

That catches false successes ("I fixed it" but the test still fails), partial completion (fixed one test, broke another), and silent failures (agent never actually changed any code).
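The four steps above can be sketched as a small verifier. This is a sketch under assumptions: the project uses pytest, and the exact output parsing (`-q` format, the `FAILED` prefix, the token position of the test id) would need adapting to the real test runner.

```python
import subprocess

def failing_tests() -> set:
    """Run the suite and return the set of failing test ids.
    The pytest flags and output parsing here are assumptions."""
    proc = subprocess.run(
        ["pytest", "-q", "--tb=no"], capture_output=True, text=True
    )
    return {
        line.split()[1]                      # e.g. tests/test_x.py::test_y
        for line in proc.stdout.splitlines()
        if line.startswith("FAILED")
    }

def grade(before: set, after: set, target: str) -> bool:
    """Pass iff the target test was failing, now passes, and no
    previously passing test started failing."""
    was_failing = target in before          # sanity-check the setup
    now_passes = target not in after        # catches false successes
    no_regressions = after <= before        # catches partial completion
    return was_failing and now_passes and no_regressions
```

The silent-failure case falls out for free: if the agent never changed any code, `before == after`, the target is still failing, and `grade` returns False.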

Non-determinism: always run multiple times

Agents are stochastic: the same eval case can pass once, fail twice, then pass again. Single-run pass/fail is noisy. Run each case 3-5 times and grade on the pass rate.
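A minimal sketch of multi-run grading, assuming each case is wrapped in a zero-argument callable (`run_case` is hypothetical) that returns a boolean from a verifier like the ones above:

```python
def pass_rate(run_case, attempts: int = 5) -> float:
    """Run one eval case several times; return the fraction of passes."""
    passes = sum(1 for _ in range(attempts) if run_case())
    return passes / attempts

def classify(rate: float) -> str:
    """Separate stable results from flaky ones worth investigating."""
    if rate == 1.0:
        return "stable pass"
    if rate == 0.0:
        return "stable fail"
    return "flaky"
```

A case that flips between pass and fail is its own signal: "flaky" usually means the task is under-specified or the verifier is too strict, and both are worth distinguishing from a clean regression.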

LLM-as-judge: useful, biased

When outputs are free-form (writing, summaries, conversations), use a strong LLM to grade them: "Does this output accomplish the task?" The judge outputs a verdict and a rationale. This scales, but the judge has its own biases; validate it every few weeks against human grading on a sample.
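A judge can be sketched as a prompt plus a parser. Here `call_llm` is a hypothetical injected function (prompt string in, response string out) standing in for whatever model client you use; asking for a binary verdict plus a rationale keeps the headline metric boolean while preserving a debugging trail.

```python
import json

JUDGE_PROMPT = """You are grading an AI agent's output.
Task: {task}
Output: {output}
Does the output accomplish the task? Reply with only JSON:
{{"pass": true or false, "rationale": "one sentence"}}"""

def judge(call_llm, task: str, output: str) -> dict:
    """Grade a free-form output with an LLM judge.
    call_llm: hypothetical callable, prompt str -> response str."""
    raw = call_llm(JUDGE_PROMPT.format(task=task, output=output))
    verdict = json.loads(raw)
    return {"pass": bool(verdict["pass"]),
            "rationale": verdict["rationale"]}
```

Injecting `call_llm` also makes the judge itself testable: stub it with canned responses, and periodically replay a human-graded sample through it to measure judge/human agreement.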

Pitfalls

What to do with this