Every agent change can quietly break something. Regression tests catch regressions before production.
The most important agent metric. Did it do what was asked? Here's how to measure it.
Beyond whether the agent got the right answer, did it get there the right way?
Agents fail in subtle ways. Without evaluation, you ship a system that works in demos and silently fails in production.