Retrieval metrics tell you if the right chunks were found. Generation metrics tell you if the LLM used them to produce a good answer. Both matter. A system with great retrieval and bad generation is still bad. A system with bad retrieval and great generation hallucinates confidently.
Faithfulness: does the answer only say things supported by the retrieved context, or does it hallucinate?
Measurement: an LLM judge checks whether each claim in the answer is supported by the retrieved context (the judge prompt and a claim-level variant appear below).
Low faithfulness = hallucinations. The single most important metric for trustworthy RAG.
Answer relevance: does the answer address what the user asked, regardless of correctness?
Measurement: an LLM judge scores the answer's relevance to the question on a 1-5 scale, or the answer and question are embedded and their similarity compared.
Low answer relevance = the model is answering a different question or dodging.
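The embedding variant of answer relevance can be sketched in a few lines. This is a toy sketch: the `embed` function below is a bag-of-words stand-in so the example runs without a model; in practice you would call a real embedding model and compute cosine similarity over its vectors.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" so the sketch runs without a model.
    # A real system would call an embedding model here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def answer_relevance(question: str, answer: str) -> float:
    # High score = the answer is at least on-topic.
    # Says nothing about correctness or faithfulness.
    return cosine(embed(question), embed(answer))
```

An on-topic answer should score higher than an off-topic one, even though neither is checked for truth.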
Answer correctness: is the answer factually right? Requires ground-truth answers.
Measurement: compare the answer to a reference answer via exact match, semantic similarity, or an LLM judge.
This is harder to scale but the ultimate test of quality.
Most generation evaluation uses an LLM to judge outputs:
SYSTEM: You are evaluating the faithfulness of an AI-generated answer.
Given a question, retrieved context, and an answer, judge whether every
claim in the answer is supported by the context.

Score:
- 1: completely faithful
- 0.5: mostly faithful, minor unsupported claims
- 0: many unsupported claims or fabrications

USER:
Question: [q]
Context: [retrieved chunks]
Answer: [generated answer]

Your score and reasoning:
Caveats: judge scores are only as reliable as the judge model, they drift when you swap the judge or edit the prompt, and judges tend to favor longer, more confident answers. Pin the judge model and prompt version, and spot-check scores against human labels.
Rather than rating the whole answer, break it into claims and check each:
Answer: "Our refund policy allows 30 days from purchase, with free returns on orders over $50, and no returns on final-sale items."

Claims:
1. Refund policy allows 30 days from purchase. → Supported? Yes/No
2. Free returns on orders over $50. → Supported? Yes/No
3. No returns on final-sale items. → Supported? Yes/No

Faithfulness = fraction supported
More granular, more reliable than whole-answer scoring.
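Once you have a per-claim supported/unsupported verdict, claim-level faithfulness reduces to a fraction. A minimal sketch, with a naive word-overlap check standing in for the per-claim LLM judge call:

```python
def is_supported(claim: str, context: str) -> bool:
    # Stand-in for an LLM judge call: naively count a claim as supported
    # if most of its words appear in the context. A real system would
    # ask the judge model about each claim.
    ctx = set(context.lower().split())
    words = claim.lower().rstrip(".").split()
    hits = sum(w in ctx for w in words)
    return hits / len(words) >= 0.8

def faithfulness(claims: list[str], context: str) -> float:
    # Faithfulness = fraction of claims supported by the context.
    if not claims:
        return 1.0
    return sum(is_supported(c, context) for c in claims) / len(claims)
```

The structure is the point here, not the overlap heuristic: swap `is_supported` for a judge call and the fraction stays the same.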
A related metric: context relevance. Of the retrieved chunks, which ones actually contributed to the answer?
High context relevance = generator used context well.
Low context relevance = generator ignored context (relying on pretraining) or retrieval was noisy.
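One cheap proxy for context relevance, assuming no judge model is available: count a chunk as "used" if enough of its words show up in the answer. This overlap heuristic is illustrative only; a real system would use an LLM judge or an attribution method.

```python
def context_relevance(answer: str, chunks: list[str], threshold: float = 0.2) -> float:
    # Fraction of retrieved chunks that plausibly contributed to the
    # answer, using word overlap as a rough proxy for "contributed".
    ans = set(answer.lower().split())
    used = 0
    for chunk in chunks:
        words = chunk.lower().split()
        if not words:
            continue
        overlap = sum(w in ans for w in words) / len(words)
        if overlap >= threshold:
            used += 1
    return used / len(chunks) if chunks else 0.0
```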
Exact match: the answer string matches the reference exactly. Only works for short factual answers.
Semantic similarity: embed both answers and compare cosine similarity. Handles paraphrasing.
LLM judge: ask a judge model whether the answer is consistent with the reference. Most flexible.
Structured comparison: for answers with structured parts (lists, JSON), compare per-field.
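The exact-match and per-field approaches are easy to sketch directly. The function names and the normalization choices (case folding, whitespace stripping) here are illustrative, not prescribed by the text:

```python
def exact_match(answer: str, reference: str) -> bool:
    # Exact match after light normalization; only sensible for short
    # factual answers ("30 days", "1998", ...).
    return answer.strip().lower() == reference.strip().lower()

def per_field_match(answer: dict, reference: dict) -> float:
    # For structured answers, compare field by field and return the
    # fraction of reference fields the answer got right.
    if not reference:
        return 1.0
    correct = sum(answer.get(k) == v for k, v in reference.items())
    return correct / len(reference)
```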
A composite RAG score:
composite = α × faithfulness + β × relevance + γ × correctness
With typical weights α=0.5, β=0.25, γ=0.25 (faithfulness matters most because hallucinations are the biggest risk).
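As a one-function sketch, with the weights from the text as defaults:

```python
def composite_score(faithfulness: float, relevance: float, correctness: float,
                    alpha: float = 0.5, beta: float = 0.25, gamma: float = 0.25) -> float:
    # Weighted blend: faithfulness weighted highest because
    # hallucinations are the biggest risk.
    return alpha * faithfulness + beta * relevance + gamma * correctness
```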
Composite scores are useful for tracking a single "overall quality" number. For debugging, look at individual metrics.
High faithfulness, low answer relevance: the model is citing retrieved context accurately but answering the wrong question. Usually means the retrieved context didn't contain what was needed, and the generator faithfully summarized it.
High answer relevance, low faithfulness: the model is answering the question well but making up facts. Classic hallucination. The generator is relying on pretraining over retrieved context.
Low faithfulness, low answer relevance: both parts of the pipeline are broken. Debug retrieval first (is it returning usable chunks?), then generation.
High faithfulness and relevance, low correctness: the answer is faithful to retrieved context and relevant to the question, but still wrong. That usually means the retrieved context itself is wrong: you have bad data in your corpus.
Offline eval sets are limited. In production, sample a percentage of real queries for offline evaluation.
This catches distribution drift that your static eval set misses.
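A minimal sampling sketch, assuming you want the in/out decision to be deterministic per query (so retries of the same query land in the same bucket). The 5% rate is an illustrative default, not a recommendation from the text:

```python
import random

def sample_for_eval(query_id: str, rate: float = 0.05, seed: int = 0) -> bool:
    # Seed a PRNG with the query id so the same query is consistently
    # in or out of the eval sample across retries and replicas.
    rnd = random.Random(f"{seed}:{query_id}")
    return rnd.random() < rate
```

Queries where this returns True get logged (question, retrieved chunks, answer) for later offline scoring.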
The best generation metric: did the user find the answer helpful?
User feedback is noisy per query but powerful in aggregate. Build it into your product.
Next: RAGAS, TruLens, ARES.