Why evaluation is critical

A RAG system without evaluation is a system that silently degrades. Models change. Corpora grow. Users' queries evolve. Without measurement, "our RAG is pretty good" becomes "our RAG used to be pretty good", and nobody knows when it happened. Evaluation is how you turn RAG from a project into an engineering discipline.

What breaks without evaluation

Silent regressions

You change chunk size, swap embedding models, update the reranker, or upgrade the LLM. Quality may improve or silently degrade, and without tests you have no way to tell which.

Invisible distribution drift

User query patterns change. New content gets added. Old content becomes stale. The queries that worked at launch break six months in.

Unjustified complexity

You add reranking, multi-query, agentic retrieval. Does each layer actually help? Without eval, you're shipping complexity on faith.

Stale mental models

Team members have mental models of what works. These models drift from reality. Eval data grounds decisions in current truth.

The evaluation stack

A production RAG evaluation stack has three layers:

1. Component eval

Measure each stage independently: retrieval, reranking, and generation each get their own metrics, so a regression can be traced to the stage that caused it.

2. Regression eval

A stable test set that runs on every change. Catches regressions before they ship.
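A regression eval boils down to a gate: compare the current run's metrics against a stored baseline and fail the change if anything drops beyond a noise tolerance. A minimal sketch (metric names, baseline values, and the tolerance are illustrative assumptions, not from this text):

```python
# Hypothetical regression gate. BASELINE would normally be loaded from a
# checked-in file produced by the last known-good eval run.
BASELINE = {"recall_at_5": 0.82, "answer_correctness": 0.74}
TOLERANCE = 0.02  # allow small run-to-run noise

def regression_check(current: dict[str, float],
                     baseline: dict[str, float] = BASELINE,
                     tolerance: float = TOLERANCE) -> list[str]:
    """Return the names of metrics that regressed beyond the tolerance."""
    return [
        name for name, base in baseline.items()
        if current.get(name, 0.0) < base - tolerance
    ]

failures = regression_check({"recall_at_5": 0.78, "answer_correctness": 0.75})
# recall_at_5 fell 0.04, which exceeds the 0.02 tolerance, so it is flagged
```

Wiring this into CI so a non-empty `failures` list blocks the merge is what turns the test set into an actual safety net.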

3. Production monitoring

Real-time metrics on live traffic. Catches drift, quality problems, and operational issues that offline eval misses.
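One cheap quality proxy for live traffic is a rolling mean of the top retrieval similarity score per query, with an alert when it sags below a floor. A toy sketch, where the window size and floor are illustrative assumptions:

```python
from collections import deque

class RetrievalScoreMonitor:
    """Tracks a rolling mean of per-query top retrieval scores."""

    def __init__(self, window: int = 1000, floor: float = 0.5):
        self.scores = deque(maxlen=window)  # oldest scores fall off the end
        self.floor = floor

    def record(self, top_score: float) -> bool:
        """Record one query's top score; return True if the mean is alerting."""
        self.scores.append(top_score)
        mean = sum(self.scores) / len(self.scores)
        return mean < self.floor
```

A sustained drop in this mean often signals corpus or query drift before users complain.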

The golden eval set

The foundation of everything: a curated set of representative queries with known-good answers and known-relevant chunks.

How to build one:

  1. Collect real user queries from logs
  2. For each, identify the chunks that should be retrieved
  3. For each, write an acceptable answer
  4. Review with subject matter experts

Size: 50-500 queries is usually enough for meaningful signal. More is better. Invest time in this.

See building eval datasets.
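The four steps above produce one record per query. A minimal sketch of what that record might look like, assuming chunks are addressed by string IDs (all field names and the sample data are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class GoldenExample:
    query: str                    # real user query pulled from logs
    relevant_chunk_ids: set[str]  # chunks that should be retrieved
    reference_answer: str         # SME-reviewed acceptable answer
    tags: list[str] = field(default_factory=list)  # e.g. topic, difficulty

golden_set = [
    GoldenExample(
        query="How do I rotate an API key?",
        relevant_chunk_ids={"docs/security#key-rotation"},
        reference_answer="Issue a new key, migrate clients, revoke the old key.",
        tags=["security"],
    ),
]
```

Keeping the set in version control alongside the code means every change to queries or answers gets the same review as a code change.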

What to measure

Retrieval

Generation

End-to-end
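On the retrieval side, the standard metrics can be computed directly from the ranked chunk IDs a retriever returns and the golden set's relevant IDs. A sketch of two common ones, recall@k and mean reciprocal rank (function names are illustrative):

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant chunks that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """1/rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# recall_at_k(["a", "b", "c"], {"b", "d"}, k=2) -> 0.5 (one of two found)
```

Averaging these over the golden set gives the per-component numbers the regression gate compares against.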

The measurement rhythm

On every change

Run the regression eval set. If metrics drop, investigate before merging.

Weekly

Review production metrics: latency, retrieval quality proxies, user feedback trends.

Monthly

Full eval pass. Update the golden set with new representative queries.

Quarterly

Audit: are the metrics still measuring what matters? Has the system evolved in ways that need new metrics?

The cost of not evaluating

I've seen production RAG systems hit every failure mode above: silent regressions after a model swap, drift that hollowed out launch-era queries, and layers of complexity nobody could justify. In each case, eval was the difference between catching it in a day and catching it in months.

The evaluator-model question

Many RAG evaluation frameworks use an LLM to judge output (LLM-as-judge). This works, but it introduces noise and bias, so judge scores should not gate releases until they have been calibrated against human labels.

The one-week eval investment

Teams that spend one focused week building an eval harness ship dramatically better RAG. The investment pays back in fewer regressions, faster iteration, and clearer decision-making. It's the single highest-leverage engineering investment in a RAG project.

Next: Retrieval metrics.