Retrieval metrics

Retrieval metrics measure whether your system found the right chunks. They don't tell you whether the generator used them well (that's a separate measurement), but they are the upstream quality gate. If retrieval metrics are bad, nothing downstream can save you.

Hit rate @ k

Simplest metric: did the relevant chunk appear anywhere in the top-k results?

hit_rate@k = (queries where relevant chunk in top-k) / (total queries)

Usually measured at k = 5, 10, 50.

Good for: quick overall health check. If hit_rate@10 is 60%, 40% of your queries can't possibly be answered correctly.

Bad at: distinguishing "found at rank 1" from "found at rank 10." They count the same.
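A minimal sketch of the formula above, assuming each query has a single labeled relevant chunk ID and the retriever returns a ranked list of chunk IDs (the function and variable names here are illustrative, not from any particular library):

```python
def hit_rate_at_k(ranked_ids, relevant_id, k):
    """1 if the relevant chunk appears in the top-k results, else 0."""
    return int(relevant_id in ranked_ids[:k])

def mean_hit_rate(queries, k):
    """queries: list of (ranked_ids, relevant_id) pairs."""
    hits = [hit_rate_at_k(ranked, rel, k) for ranked, rel in queries]
    return sum(hits) / len(hits)

# One query hits in the top-2, one misses entirely: hit_rate@2 = 0.5.
print(mean_hit_rate([(["a", "b"], "a"), (["x", "y"], "q")], k=2))
```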

Mean Reciprocal Rank (MRR)

Average reciprocal of the rank of the first relevant chunk.

MRR = mean(1 / rank_of_first_relevant_chunk)

If relevant is at rank 1: contributes 1.0
If at rank 2: contributes 0.5
If at rank 5: contributes 0.2
If not found: contributes 0

Good for: rewarding getting the right answer at high ranks. Sensitive to position.

When to use: when you only pass top-5 or top-10 to the generator, so position within that window matters.
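The same per-query reciprocal-rank contributions listed above can be sketched directly (again assuming single-label queries as ranked-ID lists; names are illustrative):

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """1/rank of the first relevant chunk; 0 if it was not retrieved."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id == relevant_id:
            return 1.0 / rank
    return 0.0

def mrr(queries):
    """queries: list of (ranked_ids, relevant_id) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)

# Rank 2 contributes 0.5, rank 1 contributes 1.0: MRR = 0.75.
print(mrr([(["a", "b"], "b"), (["a"], "a")]))
```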

Recall @ k

What fraction of all relevant chunks made it into top-k?

recall@k = (relevant chunks in top-k) / (total relevant chunks)

Requires labeling all relevant chunks for each query, not just one.

Good for: multi-chunk questions where you want to find several relevant pieces. For single-chunk questions, hit rate@k is equivalent.
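With multiple labeled chunks per query, the formula becomes a set intersection. A sketch, assuming `relevant_ids` is the full labeled set for one query:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all labeled-relevant chunks that appear in the top-k."""
    if not relevant_ids:
        return 0.0
    found = len(set(ranked_ids[:k]) & set(relevant_ids))
    return found / len(relevant_ids)

# Only "a" of the three labeled chunks made it into the top-2: recall@2 = 1/3.
print(recall_at_k(["a", "b", "c"], {"a", "c", "d"}, k=2))
```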

Precision @ k

What fraction of top-k results are relevant?

precision@k = (relevant chunks in top-k) / k

High precision = clean context for the generator. High recall = broad coverage but noisier context. Raising k trades precision for recall.
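Precision@k only changes the denominator relative to recall@k. A sketch with the same assumed inputs as above:

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are labeled relevant."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / k

# Two of the three returned chunks are relevant: precision@3 = 2/3.
print(precision_at_k(["a", "b", "c"], {"a", "c"}, k=3))
```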

NDCG (Normalized Discounted Cumulative Gain)

Accounts for graded relevance (some chunks are more relevant than others) and rank position. More complex, more informative.

DCG@k = sum over top-k: relevance_score_i / log2(rank_i + 1)
NDCG@k = DCG@k / ideal_DCG@k

Ranges 0-1. Rewards both high relevance and high rank. Standard for information retrieval research.

When to use: when relevance is graded (not binary) and you care about ranking quality.
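The two formulas above can be sketched as follows, assuming `relevances` holds the graded score of each retrieved chunk in rank order (this uses the document's linear-gain form; the ideal DCG here is computed from the same retrieved list, which is a simplification):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k graded relevance scores."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """DCG normalized by the best achievable ordering of the same scores."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# A perfectly ordered ranking scores 1.0.
print(ndcg_at_k([3, 2, 1], k=3))
```

For production use, a tested implementation such as scikit-learn's ndcg_score is usually a better choice than hand-rolling this.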

What metrics to start with

For most RAG projects:

Start with hit_rate@k and MRR, measured at the k your pipeline actually passes to the generator.

NDCG, recall, and precision are useful additions but overkill for most early-stage projects.

Ground truth labeling

All of these metrics require knowing what's relevant for each query. You need labels.

Sources:

Human annotation: a person reads each query and marks which chunks answer it. Slow but reliable.
Synthetic generation: generate a question from a chunk; that chunk becomes the label. Fast, but the queries can drift from what real users ask.

For an initial eval set, 50-200 human-labeled queries is usually enough to see meaningful signal.

The query-chunk pairing problem

What counts as "the relevant chunk"? Two types of queries:

Single-answer queries

One chunk has the answer. Label that chunk as relevant. Use hit rate, MRR.

Multi-answer queries

Multiple chunks contribute. Label several as relevant. Use recall, NDCG.

Ambiguous queries

Multiple legitimate answers exist. Label each chunk's relevance on a scale (0-3). Use NDCG.

Know which type your queries are before deciding which metrics to use.

Common interpretation pitfalls

Hit rate @ 50 = 95% is not great

If you're passing top-5 to the generator, hit rate @ 5 matters more than @ 50. Always measure at the k your production pipeline uses.

Small improvements aren't always real

On 100-query eval sets, a 2% change in hit rate is ~2 queries. Could be noise. Use larger eval sets or statistical tests to confirm real improvements.
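One way to check whether a hit-rate difference is noise is a paired bootstrap over per-query outcomes. A sketch, assuming `hits_a` and `hits_b` are 0/1 hit lists for two systems evaluated on the same queries (the function name and defaults are illustrative):

```python
import random

def bootstrap_diff_ci(hits_a, hits_b, n_boot=10000, seed=0):
    """Approximate 95% CI for the hit-rate difference (A minus B),
    resampling queries with replacement."""
    rng = random.Random(seed)
    n = len(hits_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rate_a = sum(hits_a[i] for i in idx) / n
        rate_b = sum(hits_b[i] for i in idx) / n
        diffs.append(rate_a - rate_b)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]
```

If the interval contains 0, the observed improvement may be noise.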

Average masks variance

Average hit rate of 70% could be "every query has a 70% chance" or "70% of queries always succeed, 30% always fail." The latter is worse: you have systematic blind spots. Look at the distribution, not just the mean.

Per-segment metrics

Aggregate metrics hide problems. Split by:

query type (factual, multi-hop, ambiguous)
source document or collection
query length

Segmented metrics surface specific problems that aggregates hide.
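Computing a metric per segment is a small grouping step on top of the per-query scores. A sketch, assuming each query record carries a segment tag along with its ranked results and label (names are illustrative):

```python
from collections import defaultdict

def per_segment_hit_rate(queries, k):
    """queries: list of (segment, ranked_ids, relevant_id) triples.
    Returns hit_rate@k for each segment."""
    hits = defaultdict(list)
    for segment, ranked_ids, relevant_id in queries:
        hits[segment].append(int(relevant_id in ranked_ids[:k]))
    return {seg: sum(h) / len(h) for seg, h in hits.items()}

queries = [
    ("factual", ["a", "b"], "a"),
    ("factual", ["b", "c"], "z"),
    ("multi-hop", ["x"], "x"),
]
print(per_segment_hit_rate(queries, k=2))
```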

Next: Generation metrics.