Observability and tracing

RAG pipelines have many stages, each of which can fail independently or in subtle ways. Without observability, you find out quality is bad because users complain, and you can't tell which stage is responsible. Observability is what turns RAG from a black box into a debuggable system.

What to log per query

Inputs: the raw user query, any session or user context, and a timestamp.

Pipeline: which retrieval strategy ran, retrieved chunk IDs and scores, the rerank order, and the final prompt sent to the model.

Outputs: the generated answer, citations, and token counts.

Performance: per-stage latency and cache hit/miss status.

User feedback: explicit signals (thumbs up/down) and implicit ones (regenerations, follow-up rephrasings).
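Putting those fields together, a per-query log record might look like the following sketch. All field and function names here are illustrative, not a fixed schema; align them with whatever your logging backend indexes on.

```python
import json
import time

def build_query_log(query, retrieved, answer, stage_latencies_ms, feedback=None):
    """Assemble one structured log record per query (illustrative schema)."""
    return {
        "timestamp": time.time(),
        "input": {"query": query},
        "pipeline": {
            "retrieved_chunk_ids": [c["id"] for c in retrieved],
            "retrieval_scores": [c["score"] for c in retrieved],
        },
        "output": {"answer": answer},
        "performance": {
            "stage_latencies_ms": stage_latencies_ms,
            "total_ms": sum(stage_latencies_ms.values()),
        },
        "user_feedback": feedback,  # filled in later, e.g. thumbs up/down
    }

record = build_query_log(
    query="What is our refund policy?",
    retrieved=[{"id": "doc_12#3", "score": 0.81}, {"id": "doc_7#1", "score": 0.74}],
    answer="Refunds are available within 30 days...",
    stage_latencies_ms={"retrieval": 45, "generation": 1200},
)
print(json.dumps(record, indent=2))
```

One record per query, serialized as JSON, is enough for both per-query debugging and the aggregate metrics discussed below.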

Trace structure

Use distributed tracing. Each query gets a trace; each stage is a span within that trace.

trace: query_001
├── span: query_preprocessing (15ms)
├── span: embedding (80ms) [cache miss]
├── span: retrieval (45ms)
│   ├── span: vector_search (30ms)
│   └── span: bm25_search (15ms)
├── span: fusion (3ms)
├── span: reranking (150ms)
├── span: generation (1200ms)
└── span: post_processing (20ms)
Total: 1513ms

Traces let you see exactly where time goes per query.
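In production you would typically instrument this with OpenTelemetry rather than roll your own, but a minimal hand-rolled sketch shows the nesting structure the diagram above implies (the `Trace` class and its `span` context manager are hypothetical names):

```python
import time
from contextlib import contextmanager

class Trace:
    """Minimal nested-span recorder; spans are appended on exit,
    so children appear in the list before their parent."""
    def __init__(self, trace_id):
        self.trace_id = trace_id
        self.spans = []      # (depth, name, duration_ms)
        self._depth = 0

    @contextmanager
    def span(self, name):
        self._depth += 1
        start = time.perf_counter()
        try:
            yield
        finally:
            ms = (time.perf_counter() - start) * 1000
            self.spans.append((self._depth, name, ms))
            self._depth -= 1

trace = Trace("query_001")
with trace.span("retrieval"):
    with trace.span("vector_search"):
        time.sleep(0.01)   # stand-in for the actual search call
    with trace.span("bm25_search"):
        time.sleep(0.005)
with trace.span("generation"):
    time.sleep(0.02)

for depth, name, ms in trace.spans:
    print("  " * (depth - 1) + f"{name}: {ms:.0f}ms")
```

A real tracer also propagates trace IDs across service boundaries and exports spans to a backend; the point here is only that each stage wraps its work in a named, timed span.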

Metrics to aggregate

Performance: latency percentiles (p50/p95/p99) per stage, error rates, cache hit rates.

Quality proxies: retrieval score distributions, context lengths, answer lengths, citation counts.

User signals: feedback rates, regeneration rates, session abandonment.

Cost: tokens and API spend per query, broken down by stage.
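As a sketch of how the latency percentiles might be computed from logged per-stage timings, here is the nearest-rank method (the `percentile` helper is illustrative; a metrics backend would do this for you):

```python
def percentile(values, p):
    """Nearest-rank percentile; good enough for a dashboard."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

# Hypothetical retrieval-stage latencies from recent query logs
latencies_ms = [120, 135, 150, 160, 180, 210, 400, 950]
print("p50:", percentile(latencies_ms, 50))  # 160
print("p95:", percentile(latencies_ms, 95))  # 950
```

Note how the one slow outlier dominates p95 while barely moving p50, which is why per-stage percentiles, not averages, are the right thing to aggregate.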

The tooling landscape

Generic tracing: OpenTelemetry as the instrumentation standard, with backends such as Jaeger, Datadog, or Honeycomb.

LLM-specific observability: tools such as LangSmith, Langfuse, and Arize Phoenix layer prompt/completion capture, token accounting, and eval integration on top of basic tracing.

Debugging workflows

"This query gave a bad answer"

  1. Find the trace for the query
  2. Look at retrieved chunks: were they relevant?
  3. If not: retrieval problem. Check embedding, BM25 scores, chunk boundaries.
  4. If yes: generation problem. Check the prompt, the model output, the context formatting.
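Steps 2-4 of this workflow reduce to a single branch once you have relevance judgments for the retrieved chunks (human labels or an LLM judge). A minimal sketch, with illustrative field names matching a hypothetical trace record:

```python
def triage(trace_record, relevance_judgments):
    """Decide which stage to blame for a bad answer.

    relevance_judgments maps chunk_id -> bool (human- or LLM-judged).
    """
    chunk_ids = trace_record["pipeline"]["retrieved_chunk_ids"]
    any_relevant = any(relevance_judgments.get(cid, False) for cid in chunk_ids)
    if not any_relevant:
        return "retrieval"   # check embeddings, BM25 scores, chunk boundaries
    return "generation"      # check prompt, model output, context formatting

record = {"pipeline": {"retrieved_chunk_ids": ["doc_3#1", "doc_9#2"]}}
print(triage(record, {"doc_3#1": False, "doc_9#2": False}))  # retrieval
print(triage(record, {"doc_3#1": True}))                     # generation
```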

"Latency spiked today"

  1. Look at per-stage latency distributions
  2. Identify the stage that regressed
  3. Correlate with deployments, traffic patterns, external API status
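Step 2 can be automated by comparing today's per-stage p95 against a baseline window and flagging outsized growth. A sketch with hypothetical numbers and a multiplicative threshold as the assumption:

```python
def find_regressed_stages(baseline_p95_ms, today_p95_ms, threshold=1.5):
    """Flag stages whose p95 latency grew by more than `threshold`x."""
    return [
        stage for stage, today in today_p95_ms.items()
        if today > threshold * baseline_p95_ms.get(stage, float("inf"))
    ]

baseline = {"retrieval": 60, "reranking": 180, "generation": 1400}
today    = {"retrieval": 65, "reranking": 600, "generation": 1450}
print(find_regressed_stages(baseline, today))  # ['reranking']
```

Once the stage is isolated, step 3 (correlating with deployments and external API status) is usually enough to find the cause.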

"Retrieval quality seems to be dropping"

  1. Sample recent queries; run them against your eval metrics
  2. Compare score distributions over time
  3. Check index freshness: is content being ingested?
  4. Check for embedding model drift
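Step 2 of this workflow, comparing score distributions over time, can start as crudely as a mean-shift check on logged top-1 retrieval scores (the threshold and scores below are illustrative):

```python
import statistics

def score_drift(last_week, this_week, drop_threshold=0.05):
    """Crude drift check: has the mean top-1 retrieval score dropped?"""
    drop = statistics.mean(last_week) - statistics.mean(this_week)
    return drop > drop_threshold, drop

last_week = [0.82, 0.79, 0.85, 0.81, 0.78]
this_week = [0.71, 0.69, 0.74, 0.70, 0.72]
drifted, drop = score_drift(last_week, this_week)
print(drifted, round(drop, 3))
```

A proper check would compare full distributions (e.g. a two-sample test) rather than means, but even this catches the large shifts worth alerting on.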

PII and compliance

User queries often contain PII. Logging choices matter: decide whether to store raw query text or a redacted/hashed form, set retention windows so traces expire rather than accumulate indefinitely, and restrict who can read traces, since debugging tools expose raw user input to anyone with access.
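A redaction pass before logging might look like the sketch below. The two regexes are illustrative only; production redaction needs a vetted PII detection library, not a pair of patterns.

```python
import re

# Illustrative patterns only -- real PII detection is much harder.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def redact(text):
    """Replace obvious emails and phone numbers before the text is logged."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

print(redact("Contact jane@example.com or 555-123-4567 about my order"))
```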

Sampling

Full tracing at high QPS is expensive. Options: sample a fixed fraction of traffic uniformly; always keep traces for errors and slow requests (tail-based sampling); or trace everything for a small canary slice of users.
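A tail-based sampling decision, made after the request completes but before the trace is exported, might look like this sketch (function name, rate, and slow threshold are all assumptions):

```python
import random

def should_keep_trace(had_error, latency_ms, rate=0.01, slow_ms=3000):
    """Keep every error and slow request; sample the rest at `rate`."""
    if had_error or latency_ms > slow_ms:
        return True
    return random.random() < rate
```

Usage: call it once per completed request and drop the trace when it returns False; errors and slow requests are always retained, so the cases you most need to debug survive sampling.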

Alerting

Configure alerts on per-stage p95 latency regressions, elevated error rates, drops in retrieval score distributions, index staleness, and cost anomalies.

Alerts should be actionable: "retrieval is slow" should link directly to a specific runbook for investigation.
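A minimal threshold evaluator that pairs each firing alert with its runbook might look like this (rule names, metric keys, and runbook paths are all hypothetical):

```python
def check_alerts(metrics, rules):
    """Evaluate simple threshold rules; each firing alert names its runbook."""
    firing = []
    for rule in rules:
        if metrics.get(rule["metric"], 0) > rule["threshold"]:
            firing.append({"alert": rule["name"], "runbook": rule["runbook"]})
    return firing

rules = [
    {"name": "retrieval_p95_high", "metric": "retrieval_p95_ms",
     "threshold": 200, "runbook": "runbooks/retrieval-latency.md"},
    {"name": "error_rate_high", "metric": "error_rate",
     "threshold": 0.01, "runbook": "runbooks/errors.md"},
]
print(check_alerts({"retrieval_p95_ms": 450, "error_rate": 0.002}, rules))
```

Attaching the runbook to the rule, rather than leaving it in tribal knowledge, is what makes the alert actionable.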

The production vs offline gap

Offline eval tests a fixed query set. Production has distribution drift, new user patterns, and edge cases your eval doesn't cover. Observability closes the gap: real user data flowing back into the eval set keeps tests grounded.

The loop: production → observability → sample queries for labeling → eval set expansion → next iteration.
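The "sample queries for labeling" step of that loop usually prioritizes queries with negative feedback, since those are the most informative additions to the eval set. A sketch, with illustrative log fields:

```python
def sample_for_labeling(logs, n=2):
    """Put negatively-rated queries at the front of the labeling queue."""
    negative = [l for l in logs if l.get("feedback") == "down"]
    rest = [l for l in logs if l.get("feedback") != "down"]
    return (negative + rest)[:n]

logs = [
    {"query": "q1", "feedback": "up"},
    {"query": "q2", "feedback": "down"},
    {"query": "q3", "feedback": None},
]
picked = sample_for_labeling(logs)
print([l["query"] for l in picked])  # ['q2', 'q1']
```

Once labeled, these queries join the offline eval set, and the next iteration's tests reflect what users actually ask.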

What to do with this