Regression testing

Every prompt edit, every model upgrade, every new tool can break something that used to work. Regression testing is the CI loop that catches those breaks before they ship. For traditional software, unit tests catch regressions. For agents, a well-curated eval set run on every change does the same job. The gap between "we have evals" and "evals gate our merges" is the gap between research-quality agents and production-quality agents.

Your eval set is your regression suite

If you've built out the eval set from the previous pages, you already have a regression suite. Run it before any meaningful agent change:

If pass rate drops, don't ship. Investigate.
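The gate can be as small as one comparison. A minimal sketch, assuming you already have per-case pass/fail results from an eval run (`should_ship` and the baseline figure are illustrative names, not a real API):

```python
def should_ship(results: list[bool], baseline_pass_rate: float) -> bool:
    """Return True only if the new config's pass rate holds against the baseline."""
    pass_rate = sum(results) / len(results)
    return pass_rate >= baseline_pass_rate

# Example: the last release passed 92/100; the new config passes 89/100.
new_results = [True] * 89 + [False] * 11
print(should_ship(new_results, baseline_pass_rate=0.92))  # drop detected: don't ship
```

The point is that "investigate" is the default on any drop; the threshold is the baseline itself, not an arbitrary floor.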

What belongs in the regression set
At minimum: every case in your eval set, plus a case for each production bug you've fixed. The suite should encode every failure you've already been burned by, so none of them can silently return.

The CI loop

  1. Developer proposes change. Commit, PR, or ready-for-review.
  2. CI runs regression suite. Fast subset on PRs for quick feedback; full suite before the merge to main.
  3. Compare to baseline. Pass rate on the new config vs last release.
  4. Ship only if pass rate holds. Any drop: investigate before merging.
  5. Post-deploy: run full suite on production config weekly. Catches model-side drift you didn't deploy.
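The loop above reduces to two small decisions: which suite to run, and whether the result clears the baseline. A sketch of that gate logic, with assumed event names and suite labels (any real CI system would wire these to its own triggers):

```python
def select_suite(event: str) -> str:
    # Step 2: fast smoke subset on PRs, full suite before merging to main.
    return "full" if event == "merge_to_main" else "smoke"

def gate(baseline_pass_rate: float, new_pass_rate: float) -> str:
    # Steps 3-4: compare to the last release's baseline; any drop blocks the merge.
    return "ship" if new_pass_rate >= baseline_pass_rate else "investigate"
```

A weekly scheduled job (step 5) calls the same `gate` with the production config's fresh results against the release baseline; no new code path is needed for drift checks.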

The flaky-test problem

Agents are non-deterministic. A case that passes 4/5 times and fails 1/5 isn't a regression; it's noise. You have to classify cases by their historical behavior: stable-pass (must pass every run, any failure is a regression), stable-fail (a known limitation, tracked but not gating), and flaky (passes some fraction of runs; judge it by pass rate, never by a single run).

Track the pass rate for each flaky case over time. A flaky case that drops from 4/5 to 2/5 is a regression even if individual runs look inconsistent.
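A minimal sketch of that classification and the rate-based regression check. The category names and the `min_drop` threshold are assumptions; tune the threshold to your suite's noise level:

```python
def classify(passes: int, runs: int) -> str:
    # Stable cases pass (or fail) every run; anything in between is flaky.
    rate = passes / runs
    if rate == 1.0:
        return "stable-pass"
    if rate == 0.0:
        return "stable-fail"
    return "flaky"

def flaky_regression(old_rate: float, passes: int, runs: int,
                     min_drop: float = 0.3) -> bool:
    # A flaky case whose pass rate drops sharply is a regression,
    # even though individual runs look inconsistent.
    return (old_rate - passes / runs) >= min_drop
```

With these definitions, a case that historically passed 4/5 and now passes 2/5 trips the check (a 0.4 drop), while a 4/5 to 3/5 wobble does not.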

The cost problem

A 100-case regression suite running 3-5 times per case per change can cost real money. Strategies: run the fast smoke subset on PRs and save the full suite for merges; give stable cases a single run and reserve repeat runs for flaky ones; schedule the most expensive runs overnight.
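One concrete lever is per-case run budgeting: spend repeat runs only where they buy signal. A sketch, where the run counts and the cost-per-run figure are illustrative assumptions:

```python
def runs_needed(historical_rate: float) -> int:
    # Deterministic cases need one run; flaky cases need several
    # to estimate a pass rate at all.
    return 1 if historical_rate in (0.0, 1.0) else 5

def suite_cost(historical_rates: list[float], cost_per_run: float) -> float:
    # Total cost of one regression pass under this budget.
    return sum(runs_needed(r) for r in historical_rates) * cost_per_run
```

A suite that is mostly stable-pass cases with a handful of flaky ones costs a fraction of what running everything 5x would.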

A worked example: catching a prompt-edit regression

  1. Team edits the system prompt to add a new safety instruction.
  2. PR triggers fast smoke suite: 20 cases, ~$2, runs in 3 minutes. All pass.
  3. PR triggers full regression: 150 cases, ~$15, runs overnight. 4 cases drop from pass to fail.
  4. Team reads the traces. New instruction caused the agent to over-refuse on some legitimate requests.
  5. Revise prompt. Re-run. 150/150 pass. Ship.

Without the regression suite, 4 bug reports would have hit the queue the next week, each requiring its own investigation.
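The diagnostic step in the example, finding which cases flipped from pass to fail, is a set difference over two result maps. A minimal sketch with assumed case IDs:

```python
def newly_failing(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    # Cases that passed on the baseline config but fail on the candidate:
    # exactly the ones whose traces the team reads in step 4.
    return sorted(case for case, passed in baseline.items()
                  if passed and not candidate.get(case, False))

baseline = {"refund-policy": True, "multi-tool": True, "edge-dates": False}
candidate = {"refund-policy": False, "multi-tool": True, "edge-dates": False}
print(newly_failing(baseline, candidate))  # ['refund-policy']
```

Pre-existing failures like `edge-dates` stay out of the list, so the investigation starts from only the cases the change actually broke.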

Maintenance

A regression set rots. Maintain it deliberately: add a case for every production bug you fix, retire cases that test behavior you've removed, and re-baseline flaky pass rates after intentional behavior changes.

Drift: the silent killer

Even without a code change, the model can change under you: the provider pushes an update, and your agent starts behaving differently. Without a scheduled regression run, you find out when users complain. Run the full suite weekly on the production config regardless of whether you shipped anything. A mysterious pass-rate drop is usually drift.

Pitfalls

What to do with this