A RAG system without evaluation is a system that silently degrades. Models change. Corpora grow. Users' queries evolve. Without measurement, "our RAG is pretty good" becomes "our RAG used to be pretty good", and nobody knows when it happened. Evaluation is how you turn RAG from a project into an engineering discipline.
You change chunk size, swap embedding models, update the reranker, or upgrade the LLM. Each change can make quality better or worse, and without tests you have no way to tell which.
User query patterns change. New content gets added. Old content becomes stale. The queries that worked at launch break six months in.
You add reranking, multi-query, agentic retrieval. Does each layer actually help? Without eval, you're shipping complexity on faith.
Team members have mental models of what works. These models drift from reality. Eval data grounds decisions in current truth.
A production RAG evaluation stack has three layers:
Measure each stage independently:
A stable test set that runs on every change. Catches regressions before they ship.
Real-time metrics on live traffic. Catches drift, quality problems, and operational issues that offline eval misses.
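The stage-level layer can start very simply. As a sketch, here are two standard retrieval metrics, recall@k and MRR, computed against a query's labeled relevant chunks (function and argument names are mine, not from any particular framework):

```python
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of known-relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank of the first relevant chunk (0.0 if none retrieved)."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0
```

Averaging these over a labeled query set gives a per-stage number you can track over time, independent of whatever the LLM does with the retrieved chunks.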
The foundation of everything: a curated set of representative queries with known-good answers and known-relevant chunks.
How to build one:
Size: 50-500 queries is usually enough for meaningful signal. More is better. Invest time in this.
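One practical shape for golden-set entries is a small record per query, stored as JSONL so additions show up cleanly in code review. A minimal sketch (the schema and field names here are illustrative, not a standard):

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class GoldenExample:
    query: str
    relevant_chunk_ids: list       # chunks a correct retrieval must surface
    reference_answer: str          # known-good answer, for answer-quality scoring
    tags: list = field(default_factory=list)  # e.g. ["multi-hop", "stale-risk"]

def save_golden_set(examples, path):
    # JSONL: one example per line, easy to diff and review in PRs
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(asdict(ex)) + "\n")
```

Tags let you slice results later ("did the reranker change only hurt multi-hop queries?") without rebuilding the set.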
Run the regression eval set. If metrics drop, investigate before merging.
Review production metrics: latency, retrieval quality proxies, user feedback trends.
Full eval pass. Update the golden set with new representative queries.
Audit: are the metrics still measuring what matters? Has the system evolved in ways that need new metrics?
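The "investigate before merging" step is easy to automate as a CI gate. A minimal sketch, assuming you persist the metric summary from the last green build as a baseline (the threshold and metric names are placeholders):

```python
def check_regression(current: dict, baseline: dict, tolerance=0.02):
    """Return the metrics that dropped more than `tolerance` below baseline.

    An empty result means the change is safe to merge on these metrics.
    """
    failures = []
    for name, base_value in baseline.items():
        cur = current.get(name, 0.0)
        if cur < base_value - tolerance:
            failures.append((name, base_value, cur))
    return failures

# Example: baseline from the last green main build vs. this branch's run.
baseline = {"recall@10": 0.86, "mrr": 0.71, "answer_f1": 0.63}
current = {"recall@10": 0.87, "mrr": 0.64, "answer_f1": 0.63}
print(check_regression(current, baseline))  # mrr dropped 0.07 -> flagged
```

The tolerance matters: eval metrics are noisy, and a zero-tolerance gate trains the team to ignore failures.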
I've seen each of these failure modes in production RAG systems. In every case, eval was the difference between catching the problem in a day and catching it in months.
Many RAG evaluation frameworks use an LLM to judge output (LLM-as-judge). This works, but it introduces noise and bias. Best practice: pin the judge model, prompt, and temperature so scores are comparable across runs; score against an explicit rubric rather than asking for a bare 1-10 rating; and periodically spot-check judge verdicts against human labels.
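A rubric-based judge can be as simple as binary criteria returned as JSON. A hedged sketch; `call_llm` is a placeholder for whatever client you use, and the criteria are examples rather than a canonical rubric:

```python
import json

JUDGE_PROMPT = """You are grading a RAG answer. Score each criterion 0 or 1:
- faithful: every claim is supported by the provided context
- complete: the answer addresses the whole question
- concise: no padding or irrelevant content

Question: {question}
Context: {context}
Answer: {answer}

Reply with JSON only: {{"faithful": 0|1, "complete": 0|1, "concise": 0|1}}"""

def judge(question, context, answer, call_llm):
    # call_llm stands in for your LLM client; pin the model and use
    # temperature 0 so judge scores are stable across eval runs.
    raw = call_llm(JUDGE_PROMPT.format(
        question=question, context=context, answer=answer))
    return json.loads(raw)
```

Binary criteria aggregate cleanly (a pass rate per criterion) and are much easier to spot-check against human labels than a free-floating numeric score.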
Teams that spend one focused week building an eval harness ship dramatically better RAG. The investment pays back in fewer regressions, faster iteration, and clearer decision-making. It's the single highest-leverage engineering investment in a RAG project.
Next: Retrieval metrics.