Agent evaluation

"It works on my laptop" is not an evaluation. "The demo was great" is not an evaluation. One of the most common reasons agents fail in production is that nobody measured whether they worked before shipping. This page covers how to actually evaluate an agent, why that is harder than it sounds, and what a real eval harness looks like.

Why evaluating agents is hard.

If agents could be evaluated like normal software (give input, check output matches expected), this would be a short page. They can't. Four things make agent evaluation genuinely tricky:

[Figure: why evaluating agents is hard]

The eval taxonomy. Six kinds, used together.

There isn't one eval. A mature agent has a stack of them, each catching different failure modes.

  1. Unit evals. The deterministic parts of your agent (parsing, formatting, schema validation) tested like normal code. Fast, reliable, should run on every commit.
  2. End-to-end evals. Full-agent runs on a labeled set of test cases, scored against expected behavior. The core of your agent testing.
  3. Regression evals. A frozen suite of cases that must always pass. Run before every prompt or model change to catch regressions.
  4. Adversarial evals. Intentionally tricky inputs designed to expose failure modes: prompt injection, ambiguous requests, edge cases, confusing tool results.
  5. Production shadow evals. Run the agent on live traffic in parallel with the current production system, without letting it act, and compare the results. The closest thing to reality without the risk.
  6. Human evals. Real people rate a sample of outputs on a rubric. Slow, expensive, indispensable for anything touching quality judgment.
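
As a rough illustration of the first kind, a unit eval for a deterministic piece of an agent might look like the sketch below. The function `parse_tool_call` and the JSON shape it expects (`name`, `arguments`) are hypothetical, invented here for illustration; your agent's tool-call format will almost certainly differ.

```python
import json


def parse_tool_call(raw: str) -> dict:
    """Hypothetical parser: validate a model-emitted tool call (assumed JSON shape)."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("tool call must be a JSON object")
    if not isinstance(data.get("name"), str) or not data["name"]:
        raise ValueError("tool call needs a non-empty string 'name'")
    if not isinstance(data.get("arguments"), dict):
        raise ValueError("tool call needs an object 'arguments'")
    return {"name": data["name"], "arguments": data["arguments"]}


def is_rejected(raw: str) -> bool:
    """True if the parser refuses the input; unit evals assert this for bad cases."""
    try:
        parse_tool_call(raw)
    except ValueError:
        return True
    return False


# Deterministic checks, cheap enough to run on every commit.
assert parse_tool_call('{"name": "search", "arguments": {"q": "refunds"}}') == {
    "name": "search",
    "arguments": {"q": "refunds"},
}
assert is_rejected('{"name": "search"}')  # missing arguments
assert is_rejected("not json at all")     # malformed model output
```

Nothing here calls a model, which is the point: this layer should be as fast and boring as any other unit test.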

The minimum viable eval harness.

You don't need a fancy platform. You need four parts.

[Figure: minimum viable eval harness]

That's it. A test cases file (even 20 cases to start). A small runner script. A grader (either exact-match for structured outputs, or an LLM-as-judge for free text). A report at the end that tells you what passed, what failed, and how today compares to yesterday.

Build this in an afternoon. Start running it. The insights come fast.
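
To make the four parts concrete, here is a minimal sketch of one way they could fit together. Everything in it is an assumption for illustration: the case format (`id`, `input`, `expected`), the function names, and `toy_agent`, which stands in for a real agent call.

```python
import json
from typing import Callable, Optional


def exact_match(expected: str, actual: str) -> bool:
    """Grader for structured outputs: compare after trimming whitespace."""
    return expected.strip() == actual.strip()


def run_suite(cases: list[dict], agent: Callable[[str], str],
              grader: Callable[[str, str], bool] = exact_match) -> dict[str, bool]:
    """Runner: call the agent on every case and record pass/fail by case id."""
    results = {}
    for case in cases:
        try:
            output = agent(case["input"])
            results[case["id"]] = grader(case["expected"], output)
        except Exception:
            results[case["id"]] = False  # a crash counts as a failure
    return results


def report(today: dict[str, bool],
           yesterday: Optional[dict[str, bool]] = None) -> dict:
    """Report: pass rate, failing cases, and cases that passed yesterday but fail today."""
    failed = sorted(cid for cid, ok in today.items() if not ok)
    summary = {
        "pass_rate": sum(today.values()) / len(today) if today else 0.0,
        "failed": failed,
        "newly_failing": [],
    }
    if yesterday is not None:
        summary["newly_failing"] = [cid for cid in failed if yesterday.get(cid)]
    return summary


# Test cases would normally live in a file; they are inlined here to keep the sketch runnable.
cases = [
    {"id": "add-1", "input": "2+2", "expected": "4"},
    {"id": "add-2", "input": "10+5", "expected": "15"},
    {"id": "mul-1", "input": "3*3", "expected": "9"},
]


def toy_agent(prompt: str) -> str:
    """Stand-in for a real agent: only handles addition, crashes on anything else."""
    left, _, right = prompt.partition("+")
    return str(int(left) + int(right))


results = run_suite(cases, toy_agent)
summary = report(results, yesterday={"add-1": True, "add-2": True, "mul-1": True})
print(json.dumps(summary, indent=2))
```

Swapping `exact_match` for a different grader, or `toy_agent` for a function that calls your real agent, should not require touching the runner or the report.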

LLM-as-judge. The pragmatic middle ground.

For outputs without a single right answer ("did the agent write a good email?"), you can't use exact-match grading. Human eval works but is slow and expensive. The pragmatic middle: use a strong LLM as the judge. Give it the rubric and the candidate output; get back a score.
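
A judge of this kind can be sketched in a few lines. The code below is an illustration under stated assumptions, not any particular provider's API: `call_model` is a hypothetical function you would supply that sends a prompt to a strong model and returns its text reply, and the 1 to 5 scale and `SCORE:` reply format are choices made for this example.

```python
import re
from typing import Callable, Optional

JUDGE_TEMPLATE = """You are grading an AI agent's output against a rubric.

Rubric:
{rubric}

Candidate output:
{candidate}

Reply with a line of the form "SCORE: <integer 1-5>", then a one-sentence reason."""


def judge(candidate: str, rubric: str,
          call_model: Callable[[str], str]) -> Optional[int]:
    """Ask the judge model for a 1-5 score; return None if the reply can't be parsed."""
    prompt = JUDGE_TEMPLATE.format(rubric=rubric, candidate=candidate)
    reply = call_model(prompt)
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def fake_model(prompt: str) -> str:
    """Canned reply standing in for a real model call (assumption for this demo)."""
    return "SCORE: 4\nClear and polite, but slightly long."


score = judge(
    candidate="Hi Sam, following up on the invoice from last week...",
    rubric="Polite, concise, states the ask in the first two sentences.",
    call_model=fake_model,
)
print(score)
```

Returning `None` for an unparseable reply, rather than guessing a score, keeps judge failures visible in the report instead of silently skewing the average.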

Rules that keep LLM-as-judge honest:

The metrics that actually matter.

Once you have a harness running, track these. The individual numbers matter less than the trends.

[Figure: metrics, ranked by importance]

Don't just look at the averages. Look at the distribution. An agent that succeeds 98% of the time but fails badly on the remaining 2% can be worse than an agent that succeeds 92% of the time and fails gracefully. Tails matter.
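
The arithmetic behind that comparison can be sketched as follows. The per-run "failure cost" values are invented for illustration (arbitrary units, 0 meaning success); the point is only that success rate alone hides how expensive the failures are.

```python
from statistics import mean


def summarize(costs: list[float]) -> dict:
    """Summarize per-run failure costs, where 0 means the run succeeded."""
    failures = [c for c in costs if c > 0]
    return {
        "success_rate": 1 - len(failures) / len(costs),
        "mean_cost": mean(costs),
        "worst_case": max(costs),
    }


# Hypothetical numbers: agent A rarely fails but fails badly,
# agent B fails more often but each failure is cheap to recover from.
agent_a = [0.0] * 98 + [100.0] * 2
agent_b = [0.0] * 92 + [1.0] * 8

a, b = summarize(agent_a), summarize(agent_b)
print("A:", a)
print("B:", b)
```

With these made-up costs, A has the higher success rate but also the higher mean cost and the far worse worst case, which is the pattern an average-only dashboard would miss.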

The process that actually works.

  1. Write 10 hand-crafted test cases on day 1. Real cases from your target task. You already have them in your head; write them down.
  2. Run them before every prompt change. Even small edits. Regression is real. A 15-second eval run is cheap insurance.
  3. Every bug you find in production → add a test case. This is how your suite grows. Ten cases become 50, then 200, each one earned.
  4. Run the full suite weekly as models, tools, and context drift. Scheduled. Automated. No exceptions.
  5. Track over time. A single run is just a number. The trend line is where insight lives. Is pass rate going up or down? Are tails getting worse? Is cost per task drifting?

The eval is the product. If your agent is hard to evaluate, it's probably too complex; simplify it. A better agent is one whose behavior you can measure. The teams that win ship simpler agents with tighter evals, not fancier agents with vibes-based testing.
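
For step 5, one simple way to turn a pile of runs into a trend is to compare the average of the most recent runs with the average of the runs just before them. The sketch below assumes each run is stored as a small record; the field names (`pass_rate`, `cost_per_task`) and the weekly numbers are hypothetical.

```python
from statistics import mean


def trend(history: list[dict], key: str, window: int = 3) -> float:
    """Change in `key` between the latest `window` runs and the `window` runs before them."""
    if len(history) < 2 * window:
        raise ValueError("not enough runs to compare two windows")
    recent = mean(run[key] for run in history[-window:])
    previous = mean(run[key] for run in history[-2 * window:-window])
    return recent - previous


# Made-up weekly records; in practice you would append one per scheduled eval run.
history = [
    {"week": 1, "pass_rate": 0.90, "cost_per_task": 0.040},
    {"week": 2, "pass_rate": 0.92, "cost_per_task": 0.041},
    {"week": 3, "pass_rate": 0.91, "cost_per_task": 0.042},
    {"week": 4, "pass_rate": 0.90, "cost_per_task": 0.045},
    {"week": 5, "pass_rate": 0.88, "cost_per_task": 0.050},
    {"week": 6, "pass_rate": 0.86, "cost_per_task": 0.055},
]

print("pass rate trend:", round(trend(history, "pass_rate"), 3))
print("cost per task trend:", round(trend(history, "cost_per_task"), 3))
```

A negative pass-rate trend alongside a positive cost trend, as in this invented data, is the kind of slow drift that no single run would flag.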

One more thing: evals vs. eval-ops.

Building the harness is the easy part. Keeping it useful over months is the hard part. Plan for:

The best agent teams I've seen treat eval as a first-class engineering practice, with its own owner, its own sprint time, its own retros. It's unglamorous, and it's the difference between agents that ship and agents that keep almost-shipping.