Regression testing
📖 2 min readUpdated 2026-04-19
Every tweak to a system prompt, every model upgrade, every tool change can break something that used to work. Regression testing catches these breaks before they hit production.
The eval set as regression suite
Your eval set doubles as your regression tests. Run it before shipping any agent change. If pass rate drops, investigate.
What to include
- All known-good behaviors (the "this used to work" cases)
- Past bugs (so they don't come back)
- Adversarial inputs (prompt injection, edge cases)
- Safety-critical cases (never-do-this boundaries)
Running regression
Set up as CI-like pipeline:
- Developer proposes change (prompt edit, new tool, model upgrade)
- Run full regression suite against the new config
- Compare pass rate to baseline
- Ship only if no meaningful regression
The flaky-test problem
Agents are non-deterministic. A test might pass 4/5 runs and fail 1/5. Classify:
- Consistent pass → fine
- Consistent fail → definite regression
- Flaky → run more times, calculate real pass rate, may need to accept statistical noise
Cost of regression runs
Running a 100-case eval suite on every change can add up. Strategies:
- Tiered: fast subset on every change, full suite before release
- Smart selection: run tests most likely to be affected by the change
- Monthly full runs with smaller per-change runs
Eval set maintenance
- Add cases for every production bug
- Remove cases that no longer reflect reality
- Quarterly audit to make sure it still represents real use