Generation metrics

Retrieval metrics tell you if the right chunks were found. Generation metrics tell you if the LLM used them to produce a good answer. Both matter. A system with great retrieval and bad generation is still bad. A system with bad retrieval and great generation hallucinates confidently.

The three metrics

Faithfulness

Does the answer only say things supported by the retrieved context? Or does it hallucinate?

Measurement:

  1. Break the answer into individual claims
  2. For each claim, check if the retrieved context supports it
  3. Faithfulness score = supported claims / total claims

Low faithfulness = hallucinations. The single most important metric for trustworthy RAG.

Answer relevance

Does the answer address what the user asked, regardless of correctness?

Measurement: LLM-as-judge scores the answer for relevance to the question on a 1-5 scale, or the answer is embedded and compared to the question.

Low answer relevance = the model is answering a different question or dodging.
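The embedding approach above reduces to a cosine similarity between the question vector and the answer vector. A minimal sketch, assuming you already have embedding vectors from whatever model you use (the vectors here are hypothetical placeholders):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def answer_relevance(question_vec: list[float], answer_vec: list[float]) -> float:
    # Both vectors must come from the same embedding model for the
    # similarity to be meaningful.
    return cosine_similarity(question_vec, answer_vec)
```

Embedding-based relevance is cheap and deterministic, but blind to correctness: a fluent wrong answer on-topic scores high.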

Correctness

Is the answer factually right? Requires ground-truth answers.

Measurement: compare the answer to the reference answer via LLM judge, exact match, or semantic similarity.

This is harder to scale but the ultimate test of quality.

LLM-as-judge

Most generation evaluation uses an LLM to judge outputs:

SYSTEM: You are evaluating the faithfulness of an AI-generated answer.
Given a question, retrieved context, and an answer, judge whether
every claim in the answer is supported by the context.

Score:
- 1: completely faithful
- 0.5: mostly faithful, minor unsupported claims
- 0: many unsupported claims or fabrications

USER:
Question: [q]
Context: [retrieved chunks]
Answer: [generated answer]

Your score and reasoning:

Caveats:

- Judge models show verbosity, position, and self-preference biases.
- Scores are not deterministic; pin the judge model version and average repeated runs if variance matters.
- Judging adds cost and latency. Calibrate judge scores against human labels on a sample before trusting them at scale.

Claim-level faithfulness

Rather than rating the whole answer, break it into claims and check each:

Answer: "Our refund policy allows 30 days from purchase, with free
returns on orders over $50, and no returns on final-sale items."

Claims:
1. Refund policy allows 30 days from purchase. → Supported? Yes/No
2. Free returns on orders over $50. → Supported? Yes/No
3. No returns on final-sale items. → Supported? Yes/No

Faithfulness = fraction supported

More granular, more reliable than whole-answer scoring.
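The claim-checking loop can be sketched as follows. In a real pipeline `claim_supported` would be an LLM judge call; here it is a crude lexical-overlap stand-in (an assumption for illustration), so only the aggregation logic should be taken literally:

```python
def claim_supported(claim: str, context: str) -> bool:
    """Naive stand-in for an LLM judge: treat a claim as supported when
    most of its content words appear in the retrieved context."""
    words = {w.strip(".,").lower() for w in claim.split() if len(w) > 3}
    if not words:
        return False
    hits = sum(1 for w in words if w in context.lower())
    return hits / len(words) >= 0.8

def faithfulness(claims: list[str], context: str) -> float:
    """Faithfulness = supported claims / total claims."""
    if not claims:
        return 0.0
    return sum(claim_supported(c, context) for c in claims) / len(claims)
```

Claim extraction itself (splitting the answer into atomic claims) is usually another LLM call, and its quality bounds the quality of the final score.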

Retrieval-aware faithfulness

A related metric: context relevance. Of the retrieved chunks, which ones actually contributed to the answer?

High context relevance = generator used context well.

Low context relevance = generator ignored context (relying on pretraining) or retrieval was noisy.
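A rough sketch of context relevance, again using lexical overlap as a hypothetical proxy for the judge's "did this chunk contribute?" decision:

```python
def context_relevance(chunks: list[str], answer: str) -> float:
    """Fraction of retrieved chunks that plausibly contributed to the
    answer, judged here by crude word overlap (a placeholder for an
    LLM judge or attribution method)."""
    answer_words = {w.strip(".,").lower() for w in answer.split()}
    used = 0
    for chunk in chunks:
        chunk_words = {w.strip(".,").lower() for w in chunk.split() if len(w) > 3}
        if chunk_words and len(chunk_words & answer_words) / len(chunk_words) >= 0.3:
            used += 1
    return used / len(chunks) if chunks else 0.0
```

A score well below 1.0 is a signal to tighten retrieval (fewer, better chunks) before blaming the generator.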

Exact vs approximate correctness

Exact match

The answer string matches the reference exactly. Only works for short factual answers.
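Even "exact" match usually normalizes first, since "30 days." and "30 Days" should count as the same answer. A minimal sketch:

```python
import string

def normalize(s: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace."""
    s = s.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(s.split())

def exact_match(answer: str, reference: str) -> bool:
    return normalize(answer) == normalize(reference)
```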

Semantic similarity

Embed both answers, compare cosine similarity. Handles paraphrasing.

LLM-as-judge correctness

Ask a judge model whether the answer is consistent with the reference. Most flexible.

Structured output match

For answers with structured parts (lists, JSON), compare per-field.
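Per-field comparison can be sketched as the fraction of reference fields the answer gets right (the field names here are illustrative):

```python
def field_match_score(answer: dict, reference: dict) -> float:
    """Fraction of reference fields the answer matches exactly.
    Extra fields in the answer are ignored; missing fields count as wrong."""
    if not reference:
        return 0.0
    correct = sum(1 for k, v in reference.items() if answer.get(k) == v)
    return correct / len(reference)
```

This gives partial credit, which is more informative than a binary pass/fail on multi-field answers.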

Combining metrics

A composite RAG score:

composite = α × faithfulness + β × relevance + γ × correctness

With typical weights α=0.5, β=0.25, γ=0.25 (faithfulness matters most because hallucinations are the biggest risk).

Composite scores are useful for tracking a single "overall quality" number. For debugging, look at individual metrics.
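The composite formula above is a one-liner; the defaults here mirror the α=0.5, β=0.25, γ=0.25 weights suggested:

```python
def composite_score(faithfulness: float, relevance: float, correctness: float,
                    alpha: float = 0.5, beta: float = 0.25, gamma: float = 0.25) -> float:
    """Weighted composite of the three generation metrics.
    Weights should sum to 1 so the composite stays on the same 0-1 scale."""
    return alpha * faithfulness + beta * relevance + gamma * correctness
```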

Common failure patterns

High faithfulness, low relevance

The model is citing retrieved context accurately but answering the wrong question. Usually means the retrieved context didn't contain what was needed, and the generator faithfully summarized it.

High relevance, low faithfulness

The model is answering the question well but making up facts. Classic hallucination. The generator is relying on pretraining over retrieved context.

Low faithfulness and low relevance

Both parts of the pipeline are broken. Debug retrieval first (is it returning usable chunks?), then generation.

High on both, low correctness

The answer is faithful to retrieved context and relevant to the question, but still wrong. This usually means the retrieved context itself is wrong: you have bad data in your corpus.

Measuring in production

Offline eval sets are limited. In production, sample a percentage of real queries for offline evaluation:

  1. Sample 1-5% of production queries
  2. Run the generation eval pipeline on them
  3. Track metrics over time
  4. Investigate queries that score poorly

This catches distribution drift that your static eval set misses.

User feedback as a metric

The best generation metric: did the user find the answer helpful?

User feedback is noisy per query but powerful in aggregate. Build it into your product.

Next: RAGAS, TruLens, ARES.