Home›Expertise›RAGS to Riches›Why evaluation is critical

Why evaluation is critical

📖 4 min readUpdated 2026-04-18

A RAG system without evaluation is a system that silently degrades. Models change. Corpora grow. Users' queries evolve. Without measurement, "our RAG is pretty good" becomes "our RAG used to be pretty good", and nobody knows when it happened. Evaluation is how you turn RAG from a project into an engineering discipline.

What breaks without evaluation

Silent regressions

You change chunk size, swap embedding models, update the reranker, or upgrade the LLM. Quality could be better or worse and you'd have no way to tell without running tests.

Invisible distribution drift

User query patterns change. New content gets added. Old content becomes stale. The queries that worked at launch break six months in.

Unjustified complexity

You add reranking, multi-query, agentic retrieval. Does each layer actually help? Without eval, you're shipping complexity on faith.

Stale mental models

Team members have mental models of what works. These models drift from reality. Eval data grounds decisions in current truth.

The evaluation stack

A production RAG evaluation stack has three layers:

1. Component eval

Measure each stage independently:

Retrieval quality (did we find the right chunks?)
Generation quality (did we answer correctly given the chunks?)
End-to-end quality (did the user get a good answer?)

2. Regression eval

A stable test set that runs on every change. Catches regressions before they ship.

3. Production monitoring

Real-time metrics on live traffic. Catches drift, quality problems, and operational issues that offline eval misses.

The golden eval set

The foundation of everything: a curated set of representative queries with known-good answers and known-relevant chunks.

How to build one:

Collect real user queries from logs
For each, identify the chunks that should be retrieved
For each, write an acceptable answer
Review with subject matter experts

Size: 50-500 queries is usually enough for meaningful signal. More is better. Invest time in this.

See building eval datasets.

What to measure

Retrieval

Hit rate@k: did the relevant chunk appear in the top-k retrieved?
MRR (Mean Reciprocal Rank): how high in the ranking did it appear?
Recall@k: how many relevant chunks made it into top-k?
NDCG: ranking quality, weighted by position

Generation

Faithfulness: does the answer actually reflect the retrieved context?
Answer relevance: does the answer address the question?
Correctness: is the answer factually right?

End-to-end

User feedback (thumbs up/down): the ultimate signal, noisy
Explicit scoring: ask users to rate answers occasionally
Task completion: did the user get what they needed?

The measurement rhythm

On every change

Run the regression eval set. If metrics drop, investigate before merging.

Weekly

Review production metrics: latency, retrieval quality proxies, user feedback trends.

Monthly

Full eval pass. Update the golden set with new representative queries.

Quarterly

Audit: are the metrics still measuring what matters? Has the system evolved in ways that need new metrics?

The cost of not evaluating

I've seen production RAG systems where:

A new embedding model dropped retrieval quality 15%, noticed 3 months later
A chunking change broke PDFs, users complained for weeks before anyone connected it
An LLM upgrade improved latency but hurt grounding, metrics showed it immediately, but no one was watching

In each case, eval was the difference between catching it in a day and catching it in months.

The evaluator-model question

Many RAG evaluation frameworks use an LLM to judge output (LLM-as-judge). This works but introduces noise and bias. Best practice:

Use a strong model (GPT-4, Claude) as the judge
Calibrate against human judgments periodically
Track inter-rater agreement between LLM judges over time
Don't trust absolute scores; trust relative comparisons

The one-week eval investment

Teams that spend one focused week building an eval harness ship dramatically better RAG. The investment pays back in fewer regressions, faster iteration, and clearer decision-making. It's the single highest-leverage engineering investment in a RAG project.

What to do with this

Block a full week for the eval harness before shipping prod changes.
Measure all three layers (retrieval, generation, end-to-end). One alone misleads.
Calibrate LLM-as-judge against human judgment periodically.