Observability + tracing
Updated 2026-04-19
Agents fail in ways that traditional apps don't: non-deterministic reasoning, cascading tool errors, quality drift. Without comprehensive tracing, these failures are invisible until a customer complains. Observability isn't optional.
What to log per session
- User input (with identifiers)
- System prompt version
- Model version
- Every LLM call: full input, full output, token counts, latency
- Every tool call: name, args, result, latency, error (if any)
- Memory reads/writes
- Session cost accumulator
- Final output
- Termination reason (completed, budget hit, error, timeout)
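The list above can be sketched as one structured record per session. This is a minimal sketch; the dataclass shape and field names are illustrative, not any particular tool's schema:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class SessionLog:
    """One record per agent session; fields mirror the list above."""
    session_id: str
    user_input: str
    system_prompt_version: str
    model_version: str
    llm_calls: list = field(default_factory=list)   # input, output, tokens, latency
    tool_calls: list = field(default_factory=list)  # name, args, result, latency, error
    memory_events: list = field(default_factory=list)
    total_cost_usd: float = 0.0                      # session cost accumulator
    final_output: str = ""
    termination_reason: str = ""                     # completed | budget_hit | error | timeout

    def log_llm_call(self, prompt, completion, tokens, latency_s, cost_usd):
        self.llm_calls.append({"input": prompt, "output": completion,
                               "tokens": tokens, "latency_s": latency_s})
        self.total_cost_usd += cost_usd

    def log_tool_call(self, name, args, result, latency_s, error=None):
        self.tool_calls.append({"name": name, "args": args, "result": result,
                                "latency_s": latency_s, "error": error})

    def to_json(self):
        return json.dumps(asdict(self))

# Usage: one SessionLog per agent run, serialized at termination.
log = SessionLog("s-123", "refund order 42", "prompt-v7", "model-2026-01")
log.log_llm_call("...", "call refund tool", tokens=812, latency_s=1.4, cost_usd=0.003)
log.log_tool_call("refund", {"order_id": 42}, "ok", latency_s=0.2)
log.final_output = "Refund issued."
log.termination_reason = "completed"
```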
Traces as debugging artifacts
A trace is the replayable record of an agent session. Given a trace, you should be able to:
- See the exact input/output of every LLM and tool call
- Step through reasoning turn-by-turn
- Spot where things went wrong
- Reproduce the failure locally
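One way to make that local reproduction possible, sketched under the assumption that a trace is an ordered list of events: record every LLM output during the session, then substitute the recorded outputs for live model calls during replay. The event shape and class name here are hypothetical:

```python
class TraceReplayer:
    """Replay an agent session deterministically by returning recorded
    LLM outputs in order instead of calling the live model."""

    def __init__(self, trace_events):
        # trace_events: ordered list of {"type": "llm" | "tool", "output": ...}
        self._llm_outputs = [e["output"] for e in trace_events if e["type"] == "llm"]
        self._cursor = 0

    def llm(self, prompt):
        # Ignore the live prompt; return what the model said in production.
        out = self._llm_outputs[self._cursor]
        self._cursor += 1
        return out

# Usage: drop the replayer in where the agent's model client would go.
trace = [
    {"type": "llm", "output": "call search('agent tracing')"},
    {"type": "tool", "output": "3 results"},
    {"type": "llm", "output": "final answer: use structured traces"},
]
replayer = TraceReplayer(trace)
first = replayer.llm("user question")
```

Tool calls can then be re-executed (or stubbed the same way) while stepping through the session turn by turn.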
Tools
- Langfuse: open-source LLM observability with tracing
- LangSmith: LangChain's hosted tracing
- Phoenix (Arize): open-source, strong eval integration
- Weights & Biases Weave: W&B's tracing
- Helicone: LLM observability focused on cost and latency
- OpenTelemetry: general distributed tracing; integrates with the above
Sampling
Full tracing at high QPS gets expensive. Sample:
- Trace 1-5% of normal requests
- Trace 100% of errors
- Trace 100% of requests that hit cost budget
- Trace 100% of user-reported issues
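These rules fit in a small decision function. A sketch; the base rate and argument names are illustrative:

```python
import random

def should_trace(is_error, hit_cost_budget, user_reported, base_rate=0.02):
    """Sampling policy from the rules above: always trace errors,
    budget hits, and user-reported issues; sample everything else."""
    if is_error or hit_cost_budget or user_reported:
        return True
    return random.random() < base_rate

# Errors are always traced, regardless of the base rate.
error_traced = should_trace(is_error=True, hit_cost_budget=False, user_reported=False)
```

Note the check happens per request, so the base rate can be tuned live without touching the always-trace rules.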
Alerts
Alert on:
- Error rate exceeding threshold
- Average cost per session climbing
- p99 latency regressing
- Task completion rate dropping
- Specific tools failing repeatedly
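A minimal sketch of these checks, assuming metrics have already been aggregated into a dict upstream; every metric name and threshold below is illustrative:

```python
def check_alerts(metrics, thresholds):
    """Return the names of alerts whose metric crosses its threshold."""
    alerts = []
    if metrics["error_rate"] > thresholds["max_error_rate"]:
        alerts.append("error_rate")
    if metrics["avg_cost_per_session"] > thresholds["max_avg_cost"]:
        alerts.append("cost")
    if metrics["p99_latency_s"] > thresholds["max_p99_latency_s"]:
        alerts.append("latency")
    if metrics["completion_rate"] < thresholds["min_completion_rate"]:
        alerts.append("completion_rate")
    return alerts

# Usage: error rate is over threshold, everything else is healthy.
fired = check_alerts(
    {"error_rate": 0.08, "avg_cost_per_session": 0.10,
     "p99_latency_s": 4.0, "completion_rate": 0.95},
    {"max_error_rate": 0.05, "max_avg_cost": 0.25,
     "max_p99_latency_s": 10.0, "min_completion_rate": 0.90},
)
```

Per-tool failure alerts would follow the same pattern, keyed by tool name rather than a single global metric.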
The feedback loop
Observability feeds eval:
1. A production trace reveals a failure
2. Turn the failing case into an eval case
3. Fix the issue
4. Verify the fix in eval; add it to the regression suite
Without this loop, the same bugs recur.
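The conversion step, from failing trace to eval case, can be as simple as extracting the fields the eval harness needs. A sketch; the trace fields are assumed and match no particular tool:

```python
def trace_to_eval_case(trace):
    """Turn a failing production trace into a regression eval case."""
    return {
        "input": trace["user_input"],
        "system_prompt_version": trace["system_prompt_version"],
        "expected_behavior": None,  # filled in by a human reviewer
        "failure_mode": trace["termination_reason"],
        "source_trace_id": trace["session_id"],  # link back for debugging
    }

# Usage: harvest the failing session captured earlier.
case = trace_to_eval_case({
    "session_id": "s-123",
    "user_input": "refund order 42",
    "system_prompt_version": "prompt-v7",
    "termination_reason": "error",
})
```

Keeping the source trace ID on the eval case makes it cheap to jump back to the original session when the case regresses later.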