Observability + tracing

Agents fail in ways traditional apps don't. Non-deterministic reasoning, cascading tool errors, quality drift you only spot statistically. Without deep tracing, these failures are invisible until a customer complains, and by then you're debugging blind. Observability is the difference between "we have an agent in production" and "we operate an agent in production." Don't ship without it.

What to log, per session

Traces are replayable debugging artifacts

A good trace is not just logs. It's structured, timestamped, correlated with a session ID, and complete enough that you can reconstruct exactly what the agent saw, decided, and did, then replay the session after the fact.

If your trace is just free-text logs, you can't replay. Invest in structured tracing from day one.
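As a minimal sketch, a structured trace event might look like the following. The schema and field names here are illustrative assumptions, not the format of any particular tool:

```python
import io
import json
import time
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class TraceEvent:
    """One step of an agent session: an LLM call, a tool call, or a result."""
    session_id: str
    kind: str      # e.g. "llm_call", "tool_call", "tool_result" (illustrative)
    name: str      # model or tool name
    payload: dict  # inputs/outputs, redacted as needed
    ts: float = field(default_factory=time.time)
    event_id: str = field(default_factory=lambda: uuid.uuid4().hex)

def emit(event: TraceEvent, sink) -> None:
    """Append the event as one JSON line. JSON lines are easy to grep,
    filter by session_id, and replay in timestamp order."""
    sink.write(json.dumps(asdict(event)) + "\n")

# Usage: every event in a session carries the same session_id,
# so the whole session can be pulled back out with one filter.
sink = io.StringIO()
sid = uuid.uuid4().hex
emit(TraceEvent(sid, "tool_call", "search", {"query": "timeout docs"}), sink)
emit(TraceEvent(sid, "tool_result", "search", {"hits": 3, "ms": 412}), sink)
```

The point of the shared `session_id` and timestamps is the replay property: sort a session's events and you can step through exactly what the agent did.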

Tooling options

Pick one that fits your stack and move on. All of them are better than rolling your own.

Sampling strategy

Full tracing at high throughput is expensive, so sample: keep the sessions that matter most (errors, outliers) at full fidelity and a fixed fraction of the rest.
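One common shape for this is sketched below: keep every error session, sample healthy traffic at a fixed rate, and key the decision on a hash of the session ID rather than a random draw, so every event in a session gets the same verdict. (In practice the error check means buffering events and deciding at session end; function and field names are illustrative.)

```python
import hashlib

def should_trace(session_id: str, had_error: bool, rate: float = 0.05) -> bool:
    """Hybrid sampling: all error sessions, plus a stable `rate` fraction
    of healthy ones. Hashing the session ID makes the decision
    deterministic, so a session is either fully traced or not at all."""
    if had_error:
        return True  # error traces are the ones you actually need
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000
```

Deterministic sampling also means a session's verdict never flips mid-flight: `should_trace("abc", False)` returns the same answer every time it is called.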

The alerts you need

Alerts without dashboards are annoying. Dashboards without alerts get ignored. You need both.
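A sketch of the alerting side as data, so thresholds live next to the code instead of in someone's head. The metric names and threshold values here are illustrative assumptions; tune them to your own baselines:

```python
import math

# Illustrative thresholds, not recommendations.
ALERTS = {
    "p99_latency_s": 15.0,      # fire if p99 session latency exceeds this
    "error_rate": 0.02,         # fire if >2% of sessions end in error
    "tool_timeout_rate": 0.01,  # fire if >1% of tool calls time out
}

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile over a window of observations."""
    ranked = sorted(values)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

def firing(metrics: dict[str, float]) -> list[str]:
    """Return the names of alerts whose current value crosses its threshold."""
    return [name for name, limit in ALERTS.items()
            if metrics.get(name, 0.0) > limit]
```

The same `ALERTS` table can drive the dashboard panels, which keeps the two in sync: every alert has a chart you can look at when it fires.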

The feedback loop: production → eval

  1. A production trace reveals a failure (from alerts or support tickets).
  2. You capture the failing input and expected outcome.
  3. Add it to the eval set as a new regression case.
  4. Fix the root cause in your prompt/tools/loop.
  5. Re-run eval: this case now passes.
  6. Deploy. Bug stays fixed, protected by the new regression case.

This loop is how agent systems get better over time. Without it, you chase the same bugs repeatedly.
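Steps 2, 3, and 5 can be as simple as appending a case to a list of regression cases and re-running a check over all of them. A sketch, with a hypothetical case format and a toy agent:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RegressionCase:
    """A failing production input plus a check the agent's output must pass."""
    case_id: str
    input: str
    expect: Callable[[str], bool]  # predicate over the agent's output

def run_eval(agent: Callable[[str], str],
             cases: list[RegressionCase]) -> dict[str, bool]:
    """Run every case; a fix isn't done until its case passes and stays passing."""
    return {c.case_id: c.expect(agent(c.input)) for c in cases}

# Steps 2-3: capture the failing input from the trace as a new case.
cases = [
    RegressionCase("refund-policy-1",
                   "Can I return an opened item?",
                   lambda out: "30 days" in out),
]

# Before the fix, the buggy agent omits the policy window and the case fails:
assert run_eval(lambda q: "Yes, returns accepted.", cases) == {"refund-policy-1": False}
# After the fix (step 5), the same case passes:
assert run_eval(lambda q: "Yes, within 30 days of purchase.", cases) == {"refund-policy-1": True}
```

Because the case stays in the set forever, step 6 holds: any future change that reintroduces the bug fails the eval before it ships.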

A worked example: finding a silent bug

Alert fires: p99 latency is up from 8s to 23s. Traces show that 90% of slow sessions share the same pattern: a specific tool times out, then the agent retries three times before giving up. Root cause: a provider changed their API, and the old timeout was now too short.
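The "traces show" step is a query, not a guess. With structured events, finding the shared pattern in slow sessions is a few lines of aggregation. A sketch, assuming hypothetical `session_id`, `tool`, `duration_s`, and `status` fields on each event:

```python
from collections import Counter

def slowest_tool(events: list[dict], slow_threshold_s: float = 15.0) -> str:
    """Among sessions over the latency threshold, count which tool timed
    out most often; the winner is the prime suspect."""
    # Group events by session.
    sessions: dict[str, list[dict]] = {}
    for e in events:
        sessions.setdefault(e["session_id"], []).append(e)
    # Count timeouts, but only inside the slow sessions.
    timeouts: Counter = Counter()
    for evs in sessions.values():
        if sum(e.get("duration_s", 0.0) for e in evs) < slow_threshold_s:
            continue
        for e in evs:
            if e.get("status") == "timeout":
                timeouts[e["tool"]] += 1
    suspect, _ = timeouts.most_common(1)[0]  # assumes at least one timeout
    return suspect

# Toy data shaped like the incident above: both slow sessions share a
# timeout on the same tool, which points straight at the root cause.
events = [
    {"session_id": "s1", "tool": "search", "duration_s": 20.0, "status": "timeout"},
    {"session_id": "s1", "tool": "summarize", "duration_s": 3.0, "status": "ok"},
    {"session_id": "s2", "tool": "search", "duration_s": 19.0, "status": "timeout"},
]
assert slowest_tool(events) == "search"
```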

Without tracing: users complain about slowness, team spends a week guessing. With tracing: diagnosed in an afternoon, fixed, regression case added.

Pitfalls

What to do with this