Regression testing
📖 2 min read · Updated 2026-04-19
Every prompt edit, every model upgrade, every new tool can break something that used to work. Regression testing is the CI loop that catches those breaks before they ship. For traditional software, unit tests catch regressions. For agents, a well-curated eval set run on every change does the same job. The gap between "we have evals" and "evals gate our merges" is the gap between research-quality agents and production-quality agents.
Your eval set is your regression suite
If you've built out the eval set from the previous pages, you already have a regression suite. Run it before any meaningful agent change:
- System prompt edits.
- New or removed tools.
- Tool-description changes.
- Model upgrades.
- Orchestration changes (loop limits, retries, etc.).
- Dependencies (a library update can change behavior).
If pass rate drops, don't ship. Investigate.
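That gate can be a few lines of code. A minimal sketch, assuming your harness produces a pass/fail result per case (the `gate` function and case names here are hypothetical placeholders, not a real API):

```python
# Hypothetical merge gate: compare this run's pass rate to a stored baseline.
# results maps case id -> whether the case passed on the new config.

def gate(results: dict[str, bool], baseline_pass_rate: float,
         tolerance: float = 0.0) -> bool:
    """Return True only if the pass rate holds vs the baseline."""
    pass_rate = sum(results.values()) / len(results)
    return pass_rate >= baseline_pass_rate - tolerance

results = {"case_billing_01": True, "case_refund_02": True, "case_auth_03": False}
print(gate(results, baseline_pass_rate=0.90))  # 2/3 < 0.90 → False: don't ship
```

A nonzero `tolerance` lets you absorb a little run-to-run noise; keep it small, and investigate any drop that exceeds it.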
What belongs in the regression set
Cases that cover your core production workflows, plus one case for every bug you've fixed, so that a behavior change in any major workflow shows up as a failed case. The Maintenance section below covers how to keep that set honest over time.
The CI loop
- Developer proposes a change: a commit, a PR, or a branch marked ready for review.
- CI runs regression suite. Full suite on main-branch merges, fast subset on PRs.
- Compare to baseline. Pass rate on the new config vs last release.
- Ship only if pass rate holds. Any drop: investigate before merging.
- Post-deploy: run full suite on production config weekly. Catches model-side drift you didn't deploy.
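The suite-selection step of that loop can be sketched as a small dispatcher. This assumes a CI system that exposes the event type and branch name (the function and suite names are illustrative, not a real CI API):

```python
# Hypothetical CI entry point: pick which suite to run for a given trigger.

def select_suite(event: str, branch: str) -> str:
    if event == "pull_request":
        return "smoke"   # fast subset, runs in minutes
    if event == "push" and branch == "main":
        return "full"    # whole suite, compared against the release baseline
    if event == "schedule":
        return "full"    # weekly drift check on the production config
    return "smoke"

print(select_suite("pull_request", "feature/safety-prompt"))  # smoke
print(select_suite("push", "main"))                           # full
```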
The flaky-test problem
Agents are non-deterministic. A case that passes 4/5 times and fails 1/5 isn't a regression; it's noise. You have to classify cases:
- Consistent pass (5/5): everything's fine.
- Consistent fail (0/5): clean regression, block merge.
- Flaky (say 3/5): investigate, don't block automatically. Run more times to get a real pass rate and compare it to baseline pass rate for that same case.
Track the pass rate for each flaky case over time. A flaky case that drops from 4/5 to 2/5 is a regression even if individual runs look inconsistent.
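The classification above is mechanical enough to automate. A sketch, with illustrative thresholds (the `margin` value is an assumption you should tune, not a standard):

```python
# Classify each case from N repeated runs into the buckets described above.

def classify(passes: int, runs: int = 5) -> str:
    if passes == runs:
        return "consistent_pass"
    if passes == 0:
        return "consistent_fail"  # clean regression: block the merge
    return "flaky"                # rerun more; compare rate to baseline

def flaky_regressed(pass_rate: float, baseline_rate: float,
                    margin: float = 0.15) -> bool:
    """A flaky case whose rate drops well below its own baseline regressed."""
    return baseline_rate - pass_rate > margin

print(classify(5))                 # consistent_pass
print(classify(3))                 # flaky
print(flaky_regressed(0.4, 0.8))   # True: 4/5 → 2/5 is a real drop
```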
The cost problem
A 100-case regression suite running 3-5 times per case per change can cost real money. Strategies:
- Tiered runs. Fast 20-case smoke suite on every PR. Full 200-case suite on pre-release. Scheduled weekly run for drift detection.
- Smart selection. If the change only affects the billing agent, only run billing-related cases at PR time.
- Cheap surrogate. For some cases, a cheap model evaluator gives 80% of the signal at 10% of the cost.
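Smart selection needs each case tagged with the component it exercises. A minimal sketch, assuming you maintain the tag map by hand or derive it from your repo layout (case names and tags here are hypothetical):

```python
# Run only the cases whose tags overlap the components touched by the change.

CASE_TAGS = {
    "case_billing_01": {"billing"},
    "case_billing_02": {"billing", "auth"},
    "case_search_01": {"search"},
}

def select_cases(changed_components: set[str]) -> list[str]:
    return sorted(case for case, tags in CASE_TAGS.items()
                  if tags & changed_components)

print(select_cases({"billing"}))  # ['case_billing_01', 'case_billing_02']
```

Detecting which components a change touches (e.g. from the diff's file paths) is the part you'd wire up to your own repo structure.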
A worked example: catching a prompt-edit regression
- Team edits the system prompt to add a new safety instruction.
- PR triggers fast smoke suite: 20 cases, ~$2, runs in 3 minutes. All pass.
- Merge to main triggers the full regression suite: 150 cases, ~$15, runs overnight. 4 cases drop from pass to fail.
- Team reads the traces. New instruction caused the agent to over-refuse on some legitimate requests.
- Revise prompt. Re-run. 150/150 pass. Ship.
Without the regression suite, 4 bug reports would have hit the queue the next week, each requiring its own investigation.
Maintenance
A regression set rots. Maintain it deliberately:
- Add a case for every production bug. So it doesn't come back.
- Remove cases that no longer reflect reality. Your product changed; some cases are obsolete.
- Audit quarterly. Are the cases representative of current production? Add underrepresented categories.
- Version the set. Tag cases with the date they were added; that lets you compare "regression on stable cases" vs "regression on recent additions."
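The versioning idea above can be sketched with a dated case record (the schema is an illustration, not a prescribed format):

```python
# Split pass rates into "stable" vs "recently added" cases by added date.
from datetime import date

CASES = {
    "case_old_01": {"added": date(2025, 1, 10), "passed": True},
    "case_old_02": {"added": date(2025, 2, 3), "passed": False},
    "case_new_01": {"added": date(2026, 4, 1), "passed": False},
}

def pass_rate_by_age(cases: dict, cutoff: date) -> tuple[float, float]:
    stable = [c["passed"] for c in cases.values() if c["added"] < cutoff]
    recent = [c["passed"] for c in cases.values() if c["added"] >= cutoff]
    def rate(xs):  # empty bucket counts as 1.0 so it never masks a drop
        return sum(xs) / len(xs) if xs else 1.0
    return rate(stable), rate(recent)

print(pass_rate_by_age(CASES, date(2026, 1, 1)))  # (0.5, 0.0)
```

A drop concentrated in recent additions usually means the new cases are miscalibrated; a drop on stable cases means the agent regressed.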
Drift: the silent killer
Even without a code change, the model can change under you. The provider pushes an update, and your agent starts behaving differently. Without a scheduled regression run, you find out when users complain. Run the full suite weekly on the production config regardless of whether you shipped anything. A mysterious pass-rate drop is usually drift.
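To decide whether a weekly pass-rate drop is drift or noise, a standard two-proportion z-test works with just the stdlib. A sketch (the 1.96 threshold is the usual 95% cutoff; the counts are illustrative):

```python
# Is this week's pass rate significantly below the baseline, or just noise?
import math

def drift_z(baseline_pass: int, baseline_n: int,
            current_pass: int, current_n: int) -> float:
    p1 = baseline_pass / baseline_n
    p2 = current_pass / current_n
    pooled = (baseline_pass + current_pass) / (baseline_n + current_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / baseline_n + 1 / current_n))
    return (p1 - p2) / se if se else 0.0

# 135/150 last week vs 118/150 this week
z = drift_z(135, 150, 118, 150)
print(z > 1.96)  # True → the drop is unlikely to be noise; investigate drift
```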
Pitfalls
- No baseline. "Pass rate dropped" means nothing without a reference number.
- Hand-running before ship. Humans skip the eval when they're in a rush. Automate it.
- Treating flakiness as failure. You'll reject legitimate changes because of noise.
- Never adding new cases. Your regression set becomes irrelevant to current production.
- Skipping eval on "small" changes. A one-word prompt edit can shift behavior substantially.
What to do with this
- Wire your eval set into CI this week. Start with a fast smoke subset blocking merges.
- Run full regression weekly regardless of changes to catch drift.
- Read observability + tracing for the production-side counterpart.
- Revisit task completion and trajectory eval for what the suite should measure.