Regression testing

Every tweak to a system prompt, every model upgrade, every tool change can break something that used to work. Regression testing catches these breaks before they hit production.

The eval set as regression suite

Your eval set doubles as your regression suite. Run it before shipping any agent change. If the pass rate drops, investigate.

What to include

Running regression

Set it up as a CI-like pipeline:

  1. Developer proposes change (prompt edit, new tool, model upgrade)
  2. Run full regression suite against the new config
  3. Compare pass rate to baseline
  4. Ship only if no meaningful regression
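
Steps 3 and 4 can be sketched as a small gate function. This is a minimal illustration, not a prescribed implementation: the names `SuiteResult` and `should_ship` and the 2-point `tolerance` are hypothetical, and the right threshold for a "meaningful" regression depends on your suite.

```python
from dataclasses import dataclass


@dataclass
class SuiteResult:
    """Outcome of one full regression run (hypothetical structure)."""
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        # Guard against an empty suite to avoid division by zero.
        return self.passed / self.total if self.total else 0.0


def should_ship(baseline: SuiteResult, candidate: SuiteResult,
                tolerance: float = 0.02) -> bool:
    """Return True if the pass rate dropped by no more than `tolerance`.

    The 0.02 default is an assumed placeholder, not a recommended value.
    """
    drop = baseline.pass_rate - candidate.pass_rate
    return drop <= tolerance


baseline = SuiteResult(passed=92, total=100)
candidate = SuiteResult(passed=88, total=100)
print(should_ship(baseline, candidate))  # False: a 4-point drop exceeds the tolerance
```

A gate like this only compares aggregate pass rates; it would not notice one case starting to fail while another starts to pass, so a per-case diff may be worth adding alongside it.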

The flaky-test problem

Agents are non-deterministic. A test might pass 4/5 runs and fail 1/5. Classify:
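
One possible way to do this is to run each case several times and label it by how often it passes. The sketch below is an assumption-laden illustration: `classify_case`, the three labels, and the default of 5 runs are hypothetical choices, and `run_case` stands in for whatever executes a single eval case and returns pass or fail.

```python
from typing import Callable, Tuple


def classify_case(run_case: Callable[[], bool], runs: int = 5) -> Tuple[str, float]:
    """Run one eval case `runs` times and label it by its observed pass rate.

    Labels (hypothetical): "stable_pass" if every run passes,
    "stable_fail" if none pass, "flaky" otherwise.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    passes = sum(1 for _ in range(runs) if run_case())
    rate = passes / runs
    if passes == runs:
        label = "stable_pass"
    elif passes == 0:
        label = "stable_fail"
    else:
        label = "flaky"
    return label, rate


# Deterministic stand-in for a non-deterministic agent: passes 4 of 5 runs.
outcomes = iter([True, True, False, True, True])
print(classify_case(lambda: next(outcomes)))  # ('flaky', 0.8)
```

With only a handful of runs the observed rate is a rough estimate, so a case labeled "stable_pass" after 5 runs may still fail occasionally.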

Cost of regression runs

The cost of running a 100-case eval suite on every change adds up. Strategies:

Eval set maintenance