Agent evaluation

"It works" is not an evaluation. An agent without evals is a prototype. The #1 reason agents fail in production is nobody measured whether they worked before shipping.

Why evaluating agents is hard

Agents are non-deterministic, their outputs are open-ended, and a single task spans many steps, so one wrong tool call can sink an otherwise correct trajectory. Traditional unit tests assume a deterministic input → output mapping; agents need graded, statistical evaluation on top of that.

The eval taxonomy

  1. Unit evals: specific input → expected output (or property). For the deterministic parts of the agent (parsing, formatting).
  2. End-to-end evals: full-agent runs on labeled test cases, scored against expected behavior.
  3. Regression evals: a fixed suite you run on every change. Catches behavior drift when you change the prompt, model, or tools.
  4. Adversarial evals: intentionally tricky inputs to surface failure modes.
  5. Production shadow evals: run the agent on live traffic in shadow mode and compare to the incumbent system.
  6. Human evals: humans rate agent output against defined rubrics.
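As a sketch of the first category: a unit eval over a deterministic sub-step. `parse_iso_date` here is a hypothetical parsing helper standing in for whatever deterministic code your agent actually contains.

```python
from datetime import date

def parse_iso_date(text: str) -> date:
    """Deterministic agent sub-step: parse 'YYYY-MM-DD' into a date."""
    year, month, day = map(int, text.strip().split("-"))
    return date(year, month, day)

# Unit evals: exact input -> expected output pairs for the deterministic part.
UNIT_CASES = [
    ("2024-01-31", date(2024, 1, 31)),
    (" 2023-12-01 ", date(2023, 12, 1)),  # tolerate surrounding whitespace
]

def run_unit_evals() -> int:
    """Run every case; return the number of failures."""
    failures = 0
    for raw, expected in UNIT_CASES:
        got = parse_iso_date(raw)
        if got != expected:
            failures += 1
            print(f"FAIL: {raw!r} -> {got}, expected {expected}")
    return failures
```

These run in milliseconds, so there is no excuse not to run them on every change.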

Building an eval harness

Minimum viable eval harness:

  1. A JSON/YAML file of test cases. Each case has an input, an expected behavior (free text or structured), and an optional expected output.
  2. A runner that executes the agent on each case and captures its outputs.
  3. A grader that scores outputs: exact match for structured outputs; for free text, another LLM call (LLM-as-judge) with a rubric.
  4. A report showing pass rates, failing cases, and drift from the last run.
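The four pieces above fit in a few dozen lines. A minimal sketch, with a hypothetical `agent` stub and exact-match grading standing in for the real pieces:

```python
import json

# 1. Test cases (would normally live in a JSON/YAML file on disk).
CASES = json.loads("""
[
  {"input": "2 + 2", "expected": "4"},
  {"input": "capital of France", "expected": "Paris"}
]
""")

def agent(prompt: str) -> str:
    # Hypothetical stand-in; swap in your real agent entry point.
    return {"2 + 2": "4", "capital of France": "Paris"}.get(prompt, "")

def grade(output: str, expected: str) -> bool:
    # 3. Grader: exact match for structured outputs.
    #    For free text, replace with an LLM-as-judge call.
    return output.strip() == expected.strip()

def run_suite(cases):
    # 2. Runner: execute the agent on each case and capture outputs.
    results = [
        {"input": c["input"], "passed": grade(agent(c["input"]), c["expected"])}
        for c in cases
    ]
    # 4. Report: pass rate plus the list of failing inputs.
    passed = sum(r["passed"] for r in results)
    return {
        "pass_rate": passed / len(results),
        "failures": [r["input"] for r in results if not r["passed"]],
    }
```

Persisting each run's report (e.g. one JSON file per run) is what later lets you measure drift from the last run.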

LLM-as-judge

Use a strong model (Opus-class) to grade another model's outputs against a rubric. It is far cheaper than human eval and often good enough, provided the rubric is specific.
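A sketch of the judge plumbing, assuming a PASS/FAIL rubric. `build_judge_prompt` and `parse_verdict` are illustrative names, and the actual LLM API call is deliberately left out:

```python
RUBRIC = """Score the RESPONSE against the EXPECTED behavior.
Answer with exactly one word, PASS or FAIL, then one sentence of justification."""

def build_judge_prompt(task: str, expected: str, response: str) -> str:
    """Assemble the grading prompt sent to the judge model."""
    return (f"{RUBRIC}\n\nTASK: {task}\nEXPECTED: {expected}\n"
            f"RESPONSE: {response}")

def parse_verdict(judge_reply: str) -> bool:
    """Parse the judge's reply; first token is the verdict."""
    first = judge_reply.strip().split(None, 1)[0].upper().rstrip(".:,")
    if first not in ("PASS", "FAIL"):
        # Refuse to guess: an unparseable verdict should fail loudly,
        # not silently count as a pass or a fail.
        raise ValueError(f"unparseable verdict: {judge_reply!r}")
    return first == "PASS"
```

Forcing a one-word verdict before the justification keeps parsing trivial and makes judge failures (off-rubric replies) detectable.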

Rules:

  1. Give the judge a written rubric, not just "is this good?".
  2. Prefer binary or small ordinal scales; fine-grained numeric scores are noisy.
  3. Use a different model (or at least a different context) as the judge than as the agent.
  4. Spot-check judge verdicts against human grades; recalibrate the rubric when they diverge.

Metrics that matter

  1. Task success rate: the fraction of cases graded as passing.
  2. Tool-call error rate: failed or malformed tool invocations per call.
  3. Latency and cost per task.
  4. Escalation rate: how often the agent gives up or hands off to a human.
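One way to compute headline metrics from per-run records. The record fields here are assumptions for illustration, not a standard schema:

```python
from statistics import mean

def summarize(runs):
    """Aggregate per-run records into headline metrics.
    Each run (assumed schema): {"passed": bool, "tool_errors": int,
    "tool_calls": int, "latency_s": float, "cost_usd": float,
    "escalated": bool}."""
    n = len(runs)
    total_calls = sum(r["tool_calls"] for r in runs)
    return {
        "task_success_rate": sum(r["passed"] for r in runs) / n,
        "tool_error_rate": (sum(r["tool_errors"] for r in runs) / total_calls
                            if total_calls else 0.0),
        "mean_latency_s": mean(r["latency_s"] for r in runs),
        "mean_cost_usd": mean(r["cost_usd"] for r in runs),
        "escalation_rate": sum(r["escalated"] for r in runs) / n,
    }
```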

Process

  1. Start with 10 hand-written test cases. Run them before every prompt change.
  2. Grow the set. Every bug found in prod → add a test case.
  3. Run regularly. On every meaningful change. At least weekly in prod.
  4. Track over time. A single number means less than the delta.
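Step 4 in code: a minimal sketch of tracking the delta between runs, assuming a history of (run_id, pass_rate) pairs:

```python
def pass_rate_delta(history):
    """History: list of (run_id, pass_rate) tuples, oldest first.
    Returns the change since the previous run, or None for the first run.
    The delta matters more than any single number."""
    if len(history) < 2:
        return None
    return round(history[-1][1] - history[-2][1], 4)
```

A negative delta on a prompt change is a regression signal even if the absolute pass rate still looks acceptable.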

The eval is the product. If your agent is hard to eval, it's too complex; simplify it. A better agent is one whose behavior you can measure.