Observability and tracing
📖 5 min read · Updated 2026-04-18
RAG pipelines have many stages, each of which can fail independently or in subtle ways. Without observability, you find out quality is bad because users complain, and you can't tell which stage is responsible. Observability is what turns RAG from a black box into a debuggable system.
What to log per query
Inputs
- Original user query
- User ID, tenant ID, session ID
- User context (permissions, preferences)
- Request timestamp
Pipeline
- Rewritten query (if applicable)
- Query embedding (or embedding model version)
- Retrieved chunks with IDs, scores, metadata
- Reranked order with scores
- Final context sent to generator
- Tool calls (for agentic RAG)
Outputs
- Generated answer
- Token counts (input and output)
- Model version used
Performance
- Latency per stage
- Total latency
- Cache hits/misses per layer
User feedback
- Thumbs up/down if provided
- Follow-up query (often signals dissatisfaction with the first answer)
- Session outcome
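The fields above can be assembled into one structured record per query. A minimal sketch, with an illustrative schema and made-up field names (nothing here is a standard format):

```python
import time
import uuid

def build_query_log(query, user_id, chunks, answer, stage_latencies_ms):
    """Assemble one structured log record per query (illustrative schema)."""
    return {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "input": {"query": query, "user_id": user_id},
        "pipeline": {
            # Retrieved chunk IDs and scores are the key debugging signal.
            "retrieved_chunks": [
                {"id": c["id"], "score": c["score"]} for c in chunks
            ],
        },
        "output": {"answer": answer, "answer_tokens": len(answer.split())},
        "performance": {
            "stage_latency_ms": stage_latencies_ms,
            "total_latency_ms": sum(stage_latencies_ms.values()),
        },
    }

record = build_query_log(
    "reset my password",
    user_id="u_42",
    chunks=[{"id": "doc7#3", "score": 0.82}],
    answer="Go to Settings > Security and choose Reset.",
    stage_latencies_ms={"retrieval": 45, "generation": 1200},
)
```

In practice you would emit this as one JSON line per query, so every downstream question ("what did retrieval return for trace X?") is a single lookup.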
Trace structure
Use distributed tracing. Each query gets a trace; each stage is a span within that trace.
trace: query_001
├── span: query_preprocessing (15ms)
├── span: embedding (80ms) [cache miss]
├── span: retrieval (45ms)
│ ├── span: vector_search (30ms)
│ └── span: bm25_search (15ms)
├── span: fusion (3ms)
├── span: reranking (150ms)
├── span: generation (1200ms)
└── span: post_processing (20ms)
Total: 1513ms
Traces let you see exactly where time goes per query.
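The span structure above is what tracing libraries such as OpenTelemetry provide. A hand-rolled stand-in, just to show the mechanics of timing each stage as a named span (the sleeps are placeholders for real pipeline work):

```python
import time
from contextlib import contextmanager

spans = []  # collected (name, duration_ms) pairs for one trace

@contextmanager
def span(name):
    """Record the wall-clock duration of one pipeline stage. A minimal
    stand-in for a real tracer's start_as_current_span context manager."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, (time.perf_counter() - start) * 1000))

with span("retrieval"):
    time.sleep(0.01)  # stand-in for vector + BM25 search
with span("generation"):
    time.sleep(0.02)  # stand-in for the LLM call

total_ms = sum(duration for _, duration in spans)
```

A real tracer adds trace IDs, parent/child span nesting, and export to a backend; the core idea is the same.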
Metrics to aggregate
Performance
- Latency distributions per stage (p50, p95, p99)
- Throughput (queries per second)
- Error rate per stage
- Timeout rate
Quality proxies
- Retrieval score distributions (a drop over time suggests distribution drift)
- Top-1 confidence scores
- Generation token counts (extremely long or short answers often indicate a problem)
- Empty retrieval rate (queries with no relevant results)
User signals
- Thumbs up/down rate
- Follow-up query rate
- Session abandonment rate
- Clicks on citations
Cost
- Tokens in / out per query
- API spend by provider
- Vector DB queries
- Embedding API calls
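Per-stage latency percentiles are the workhorse aggregate. A sketch of a nearest-rank percentile over a window of latencies (observability backends compute this for you; this just shows what p50/p95/p99 mean):

```python
def percentile(values, p):
    """Nearest-rank percentile (p in [0, 100]) over a list of latencies."""
    ordered = sorted(values)
    # Clamp the rank into valid index range for small windows.
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

latencies_ms = [120, 95, 100, 110, 400, 105, 98, 102, 97, 1500]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

Note how one 1500 ms outlier leaves p50 untouched but dominates p95; this is why averages hide tail pain and percentiles don't.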
The tooling landscape
Generic tracing
- OpenTelemetry: standard for distributed tracing. Integrates with most backends.
- Datadog, Honeycomb, Jaeger: standard observability platforms.
LLM-specific observability
- Langfuse: open-source LLM observability with tracing and evaluation.
- LangSmith: LangChain's managed tracing platform.
- Phoenix (Arize): open-source LLM observability + eval.
- Helicone: observability for LLM API calls.
- W&B Weave: Weights & Biases' LLM tracing.
- Traceloop / OpenLLMetry: OpenTelemetry-based LLM observability.
Debugging workflows
"This query gave a bad answer"
- Find the trace for the query
- Look at retrieved chunks: were they relevant?
- If not: retrieval problem. Check embedding, BM25 scores, chunk boundaries.
- If yes: generation problem. Check the prompt, the model output, the context formatting.
"Latency spiked today"
- Look at per-stage latency distributions
- Identify the stage that regressed
- Correlate with deployments, traffic patterns, external API status
"Retrieval quality seems to be dropping"
- Sample recent queries; run them against your eval metrics
- Compare score distributions over time
- Check index freshness: is content being ingested?
- Check for embedding model drift
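Comparing score distributions over time can be as simple as tracking the mean top-k retrieval score per window. A minimal drift heuristic (the threshold is a made-up example; a two-sample test such as Kolmogorov-Smirnov is a stronger alternative):

```python
def mean_score_shift(baseline, recent):
    """Difference between baseline and recent mean retrieval scores.
    A positive value means recent scores have dropped."""
    base_mean = sum(baseline) / len(baseline)
    recent_mean = sum(recent) / len(recent)
    return base_mean - recent_mean

baseline_scores = [0.82, 0.79, 0.85, 0.80]  # last month's window
recent_scores = [0.61, 0.58, 0.65, 0.60]    # this week's window
drift = mean_score_shift(baseline_scores, recent_scores)
alert = drift > 0.1  # hypothetical threshold
```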
PII and compliance
User queries often contain PII. Logging choices matter:
- Hash user identifiers
- Consider redacting query content for regulated data
- Retention policies: how long do traces stay?
- Access controls on observability dashboards
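Hashing user identifiers before they reach the observability store is straightforward. A sketch using a salted SHA-256 digest (the salt value here is illustrative):

```python
import hashlib

def pseudonymize(user_id, salt):
    """Replace a raw user ID with a salted SHA-256 digest before logging.
    Keep the salt out of the observability store, or the hash can be
    reversed by brute force over the space of known IDs."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()

h1 = pseudonymize("alice@example.com", salt="s3cret")
h2 = pseudonymize("alice@example.com", salt="s3cret")
```

The same user always maps to the same digest, so you can still group traces per user without storing who the user is.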
Sampling
Full tracing at high QPS is expensive. Options:
- Sample 1-10% of queries for full tracing
- Always trace errors and slow queries (100%)
- Always trace queries with user feedback (100%)
- Aggregate metrics for all queries
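The sampling rules above reduce to one head-based decision per query. A sketch, with illustrative defaults for the sample rate and slow-query threshold:

```python
import random

def should_trace(is_error, latency_ms, has_feedback,
                 rate=0.05, slow_threshold_ms=2000):
    """Always keep errors, slow queries, and queries with user feedback;
    otherwise keep a small random fraction for full tracing."""
    if is_error or has_feedback or latency_ms >= slow_threshold_ms:
        return True
    return random.random() < rate
```

Aggregate metrics are still recorded for every query; sampling only governs which queries keep their full trace.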
Alerting
Configure alerts on:
- Latency p99 exceeding threshold
- Error rate above 1%
- Empty retrieval rate above threshold
- Cost per day exceeding budget
- Negative feedback rate climbing
Alerts should be actionable: pair each one (e.g. "retrieval is slow") with a specific runbook for investigation.
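The threshold checks above can be sketched as a single evaluation over current metrics. Metric names and limits here are illustrative:

```python
def evaluate_alerts(metrics, thresholds):
    """Return the names of metrics that crossed their alert thresholds."""
    return [name for name, limit in thresholds.items()
            if metrics.get(name, 0) > limit]

fired = evaluate_alerts(
    {"latency_p99_ms": 3200, "error_rate": 0.004, "empty_retrieval_rate": 0.09},
    {"latency_p99_ms": 2500, "error_rate": 0.01, "empty_retrieval_rate": 0.05},
)
```

Here p99 latency and empty-retrieval rate fire while error rate stays under its limit; a real alerting system would also debounce and route each firing to its runbook.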
The production vs offline gap
Offline eval tests a fixed query set. Production has distribution drift, new user patterns, and edge cases your eval doesn't cover. Observability closes the gap: real user data flowing back into the eval set keeps tests grounded.
The loop: production → observability → sample queries for labeling → eval set expansion → next iteration.
Next: Cost management.