Observability and tracing
📖 5 min read · Updated 2026-04-18
RAG pipelines have many stages, each of which can fail independently or in subtle ways. Without observability, you find out quality is bad because users complain, and you can't tell which stage is responsible. Observability is what turns RAG from a black box into a debuggable system.
What to log per query
Inputs
- Original user query
- User ID, tenant ID, session ID
- User context (permissions, preferences)
- Request timestamp
Pipeline
- Rewritten query (if applicable)
- Query embedding (or embedding model version)
- Retrieved chunks with IDs, scores, metadata
- Reranked order with scores
- Final context sent to generator
- Tool calls (for agentic RAG)
Outputs
- Generated answer
- Token counts (input and output)
- Model version used
Performance
- Latency per stage
- Total latency
- Cache hits/misses per layer
User feedback
- Thumbs up/down if provided
- Follow-up query (often signals dissatisfaction with the first answer)
- Session outcome
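The fields above can be assembled into one structured record per query. A minimal sketch, with an illustrative schema and made-up field names (nothing here is a standard format):

```python
import time
import uuid

def build_query_log(query, user_id, chunks, answer, stage_latencies_ms):
    """Assemble one structured log record per query (illustrative schema)."""
    return {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "input": {"query": query, "user_id": user_id},
        "pipeline": {
            # Retrieved chunk IDs and scores are the key debugging signal.
            "retrieved_chunks": [
                {"id": c["id"], "score": c["score"]} for c in chunks
            ],
        },
        "output": {"answer": answer, "answer_tokens": len(answer.split())},
        "performance": {
            "stage_latency_ms": stage_latencies_ms,
            "total_latency_ms": sum(stage_latencies_ms.values()),
        },
    }

record = build_query_log(
    "reset my password",
    user_id="u_42",
    chunks=[{"id": "doc7#3", "score": 0.82}],
    answer="Go to Settings > Security and choose Reset.",
    stage_latencies_ms={"retrieval": 45, "generation": 1200},
)
```

In practice you would emit this as one JSON line per query, so every downstream question ("what did retrieval return for trace X?") is a single lookup.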
Trace structure
Use distributed tracing. Each query gets a trace; each stage is a span within that trace.
trace: query_001
├── span: query_preprocessing (15ms)
├── span: embedding (80ms) [cache miss]
├── span: retrieval (45ms)
│ ├── span: vector_search (30ms)
│ └── span: bm25_search (15ms)
├── span: fusion (3ms)
├── span: reranking (150ms)
├── span: generation (1200ms)
└── span: post_processing (20ms)
Total: 1513ms
Traces let you see exactly where time goes per query.
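The span structure above is what tracing libraries such as OpenTelemetry provide. A hand-rolled stand-in, just to show the mechanics of timing each stage as a named span (the sleeps are placeholders for real pipeline work):

```python
import time
from contextlib import contextmanager

spans = []  # collected (name, duration_ms) pairs for one trace

@contextmanager
def span(name):
    """Record the wall-clock duration of one pipeline stage. A minimal
    stand-in for a real tracer's start_as_current_span context manager."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, (time.perf_counter() - start) * 1000))

with span("retrieval"):
    time.sleep(0.01)  # stand-in for vector + BM25 search
with span("generation"):
    time.sleep(0.02)  # stand-in for the LLM call

total_ms = sum(duration for _, duration in spans)
```

A real tracer adds trace IDs, parent/child span nesting, and export to a backend; the core idea is the same.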
Metrics to aggregate
Performance
- Latency distributions per stage (p50, p95, p99)
- Throughput (queries per second)
- Error rate per stage
- Timeout rate
Quality proxies
- Retrieval score distributions (a drop over time suggests distribution drift)
- Top-1 confidence scores
- Generation token counts (extremely long or short answers often indicate a problem)
- Empty retrieval rate (queries with no relevant results)
User signals
- Thumbs up/down rate
- Follow-up query rate
- Session abandonment rate
- Clicks on citations
Cost
- Tokens in / out per query
- API spend by provider
- Vector DB queries
- Embedding API calls
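Per-stage latency percentiles are the workhorse aggregate. A sketch of a nearest-rank percentile over a window of latencies (observability backends compute this for you; this just shows what p50/p95/p99 mean):

```python
def percentile(values, p):
    """Nearest-rank percentile (p in [0, 100]) over a list of latencies."""
    ordered = sorted(values)
    # Clamp the rank into valid index range for small windows.
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

latencies_ms = [120, 95, 100, 110, 400, 105, 98, 102, 97, 1500]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

Note how one 1500 ms outlier leaves p50 untouched but dominates p95; this is why averages hide tail pain and percentiles don't.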
The tooling landscape
Generic tracing
- OpenTelemetry: standard for distributed tracing. Integrates with most backends.
- Datadog, Honeycomb, Jaeger: standard observability platforms.
LLM-specific observability
- Langfuse: open-source LLM observability with tracing and evaluation.
- LangSmith: LangChain's managed tracing platform.
- Phoenix (Arize): open-source LLM observability + eval.
- Helicone: observability for LLM API calls.
- W&B Weave: Weights & Biases' LLM tracing.
- Traceloop / OpenLLMetry: OpenTelemetry-based LLM observability.
Debugging workflows
"This query gave a bad answer"
- Find the trace for the query
- Look at retrieved chunks: were they relevant?
- If not: retrieval problem. Check embedding, BM25 scores, chunk boundaries.
- If yes: generation problem. Check the prompt, the model output, the context formatting.
"Latency spiked today"
- Look at per-stage latency distributions
- Identify the stage that regressed
- Correlate with deployments, traffic patterns, external API status
"Retrieval quality seems to be dropping"
- Sample recent queries; run them against your eval metrics
- Compare score distributions over time
- Check index freshness: is content being ingested?
- Check for embedding model drift
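Comparing score distributions over time can be as simple as tracking the mean top-k retrieval score per window. A minimal drift heuristic (the threshold is a made-up example; a two-sample test such as Kolmogorov-Smirnov is a stronger alternative):

```python
def mean_score_shift(baseline, recent):
    """Difference between baseline and recent mean retrieval scores.
    A positive value means recent scores have dropped."""
    base_mean = sum(baseline) / len(baseline)
    recent_mean = sum(recent) / len(recent)
    return base_mean - recent_mean

baseline_scores = [0.82, 0.79, 0.85, 0.80]  # last month's window
recent_scores = [0.61, 0.58, 0.65, 0.60]    # this week's window
drift = mean_score_shift(baseline_scores, recent_scores)
alert = drift > 0.1  # hypothetical threshold
```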
PII and compliance
User queries often contain PII. Logging choices matter:
- Hash user identifiers
- Consider redacting query content for regulated data
- Retention policies: how long do traces stay?
- Access controls on observability dashboards
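Hashing user identifiers before they reach the observability store is straightforward. A sketch using a salted SHA-256 digest (the salt value here is illustrative):

```python
import hashlib

def pseudonymize(user_id, salt):
    """Replace a raw user ID with a salted SHA-256 digest before logging.
    Keep the salt out of the observability store, or the hash can be
    reversed by brute force over the space of known IDs."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()

h1 = pseudonymize("alice@example.com", salt="s3cret")
h2 = pseudonymize("alice@example.com", salt="s3cret")
```

The same user always maps to the same digest, so you can still group traces per user without storing who the user is.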
Sampling
Full tracing at high QPS is expensive. Options:
- Sample 1-10% of queries for full tracing
- Always trace errors and slow queries (100%)
- Always trace queries with user feedback (100%)
- Aggregate metrics for all queries
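The sampling rules above reduce to one head-based decision per query. A sketch, with illustrative defaults for the sample rate and slow-query threshold:

```python
import random

def should_trace(is_error, latency_ms, has_feedback,
                 rate=0.05, slow_threshold_ms=2000):
    """Always keep errors, slow queries, and queries with user feedback;
    otherwise keep a small random fraction for full tracing."""
    if is_error or has_feedback or latency_ms >= slow_threshold_ms:
        return True
    return random.random() < rate
```

Aggregate metrics are still recorded for every query; sampling only governs which queries keep their full trace.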
Alerting
Configure alerts on:
- Latency p99 exceeding threshold
- Error rate above 1%
- Empty retrieval rate above threshold
- Cost per day exceeding budget
- Negative feedback rate climbing
Alerts should be actionable: pair each one (e.g. "retrieval is slow") with a specific runbook for investigation.
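The threshold checks above can be sketched as a single evaluation over current metrics. Metric names and limits here are illustrative:

```python
def evaluate_alerts(metrics, thresholds):
    """Return the names of metrics that crossed their alert thresholds."""
    return [name for name, limit in thresholds.items()
            if metrics.get(name, 0) > limit]

fired = evaluate_alerts(
    {"latency_p99_ms": 3200, "error_rate": 0.004, "empty_retrieval_rate": 0.09},
    {"latency_p99_ms": 2500, "error_rate": 0.01, "empty_retrieval_rate": 0.05},
)
```

Here p99 latency and empty-retrieval rate fire while error rate stays under its limit; a real alerting system would also debounce and route each firing to its runbook.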
The production vs offline gap
Offline eval tests a fixed query set. Production has distribution drift, new user patterns, and edge cases your eval doesn't cover. Observability closes the gap: real user data flowing back into the eval set keeps tests grounded.
The loop: production → observability → sample queries for labeling → eval set expansion → next iteration.
Next: Cost management.