Observability + tracing
Updated 2026-04-19
Agents fail in ways traditional apps don't. Non-deterministic reasoning, cascading tool errors, quality drift you only spot statistically. Without deep tracing, these failures are invisible until a customer complains, and by then you're debugging blind. Observability is the difference between "we have an agent in production" and "we operate an agent in production." Don't ship without it.
What to log, per session
Traces are replayable debugging artifacts
A good trace is not just logs. It's structured, timestamped, correlated with a session ID, and complete enough that you can:
- See the exact input and output of every LLM and tool call.
- Step through the reasoning turn-by-turn.
- Spot where the agent went off the rails.
- Reproduce the failure locally by replaying the trace.
If your trace is just free-text logs, you can't do the last step. Invest in structured tracing from day one.
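The event structure described above might be sketched as follows. This is a hypothetical schema for illustration; the field names are assumptions, not taken from any particular tracing library.

```python
import time
import uuid

def make_trace_event(session_id, step, kind, name, input_data, output_data, error=None):
    """One structured, timestamped event: an LLM call or a tool call."""
    return {
        "session_id": session_id,  # correlates every event in the session
        "step": step,              # turn index, for stepping through reasoning
        "timestamp": time.time(),
        "kind": kind,              # e.g. "llm_call" or "tool_call"
        "name": name,              # model name or tool name
        "input": input_data,       # exact input, which is what enables local replay
        "output": output_data,     # exact output
        "error": error,            # populated when the call failed
    }

session_id = str(uuid.uuid4())
event = make_trace_event(session_id, 0, "tool_call", "search",
                         {"query": "order status"}, {"results": 3})
```

Because every event carries the same `session_id` and a step index, the whole session can be reassembled in order and replayed, which free-text log lines don't give you.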
Tooling options
- Langfuse: open-source LLM observability, strong tracing UI.
- LangSmith: hosted; works best with LangChain/LangGraph stacks.
- Phoenix (Arize): open-source, integrates with eval workflows.
- Weave (W&B): good for teams already using W&B for ML.
- Helicone: focused on cost and latency; proxy-based setup is fast.
- OpenTelemetry: general distributed tracing; pairs with any of the above for full-stack traces.
Pick one that fits your stack and move on. All of them are better than rolling your own.
Sampling strategy
Full tracing at high throughput is expensive; sample rather than tracing every session.
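One common sampling policy, sketched below, is to keep every failure and every latency outlier while sampling only a fraction of normal traffic. The rates and the latency threshold here are illustrative assumptions to tune per product, not recommendations from this article.

```python
import random

def should_trace(is_error: bool, latency_s: float, sample_rate: float = 0.05) -> bool:
    """Decide whether to keep the full trace for a finished session."""
    if is_error:
        return True   # keep 100% of failed sessions; they are the debugging gold
    if latency_s > 15.0:
        return True   # keep latency outliers above an assumed threshold
    return random.random() < sample_rate  # sample a slice of normal traffic
```

Errors and outliers are rare but carry most of the diagnostic value, so keeping all of them costs little while the sampled slice of healthy traffic preserves your baseline metrics.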
The alerts you need
- Error rate crossing a threshold (1% or 5%, depending on the product).
- Cost per session climbing 20%+ week-over-week.
- p99 latency regressing.
- Task completion rate dropping on production traffic.
- Specific tool failing above its baseline rate.
- Budget hits spiking (signal of runaway agents).
Alerts without dashboards are annoying. Dashboards without alerts get ignored. You need both.
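Two of the alert conditions above can be expressed as plain threshold checks over aggregated trace data. The function names and default thresholds beyond those stated in the list are assumptions; a minimal sketch:

```python
def cost_alert(cost_this_week: float, cost_last_week: float, threshold: float = 0.20) -> bool:
    """Fire when cost per session climbs more than `threshold` week-over-week."""
    if cost_last_week <= 0:
        return False  # no baseline yet; nothing meaningful to compare against
    growth = (cost_this_week - cost_last_week) / cost_last_week
    return growth > threshold

def error_rate_alert(errors: int, sessions: int, threshold: float = 0.01) -> bool:
    """Fire when the session error rate crosses the configured threshold."""
    return sessions > 0 and errors / sessions > threshold
```

Checks like these only work if the underlying traces are structured: you can't compute an error rate or a cost-per-session trend from free-text logs.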
The feedback loop: production → eval
- A production trace reveals a failure (from alerts or support tickets).
- You capture the failing input and expected outcome.
- Add it to the eval set as a new regression case.
- Fix the root cause in your prompt/tools/loop.
- Re-run eval: this case now passes.
- Deploy. Bug stays fixed, protected by the new regression case.
This loop is how agent systems get better over time. Without it, you chase the same bugs repeatedly.
A worked example: finding a silent bug
Alert fires: p99 latency up from 8s to 23s. Traces show that 90% of slow sessions share the same pattern: a specific tool times out, then the agent retries three times before giving up. Root cause: a provider changed their API, and the old timeout was too short.
Without tracing: users complain about slowness, team spends a week guessing. With tracing: diagnosed in an afternoon, fixed, regression case added.
Pitfalls
- Logs without session correlation. You can't reconstruct a whole trace from scattered log lines.
- No structured data. Free-text logs don't let you aggregate or alert meaningfully.
- 100% sampling at scale. Costs blow up. Sample thoughtfully.
- Alerts without runbooks. Alert fires, nobody knows what to do. Document the response.
- Not redacting PII. Traces contain user input; don't write sensitive fields verbatim.
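For the PII pitfall, a minimal redaction pass before writing user input to a trace might look like this. The two patterns are illustrative examples only, not a complete PII filter; extend them for the sensitive fields your product actually handles.

```python
import re

# Example patterns (assumptions): email addresses and 13-16 digit card numbers.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def redact(text: str) -> str:
    """Replace obvious PII with placeholders before the text reaches a trace."""
    text = EMAIL.sub("[EMAIL]", text)
    text = CARD.sub("[CARD]", text)
    return text

print(redact("contact jane@example.com"))  # contact [EMAIL]
```

Redact at the point where trace events are written, not in the agent loop itself, so the model still sees the real input while the stored trace does not.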
What to do with this
- Add per-session tracing to your agent this week if you don't have it. Start with Langfuse or Helicone, whichever takes the least config.
- Define 3-5 alerts on top of the trace data. Wire them to PagerDuty or Slack.
- Read cost control and latency optimization for the two dashboards observability enables.