Retrieval metrics measure whether your system found the right chunks. They don't tell you whether the generator used them well (that's a separate measurement), but they are the upstream quality gate: if retrieval metrics are bad, nothing downstream can save you.
Simplest metric: did the relevant chunk appear anywhere in the top-k results?
hit_rate@k = (queries where relevant chunk in top-k) / (total queries)
Usually measured at k = 5, 10, 50.
Good for: quick overall health check. If hit_rate@10 is 60%, 40% of your queries can't possibly be answered correctly.
Bad at: distinguishing "found at rank 1" from "found at rank 10." They count the same.
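The hit_rate@k formula is simple enough to sketch in a few lines. The function names and the `(ranked_ids, relevant_id)` input shape here are illustrative choices, not from the original:

```python
def hit_rate_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant chunk appears anywhere in the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def mean_hit_rate(results, k):
    # results: list of (ranked_chunk_ids, relevant_chunk_id) pairs, one per query
    return sum(hit_rate_at_k(ranked, rel, k) for ranked, rel in results) / len(results)
```

For example, `mean_hit_rate([(["a", "b", "c"], "b"), (["x", "y"], "z")], k=2)` gives 0.5: the first query is a hit at rank 2, the second is a miss.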
MRR (Mean Reciprocal Rank): the average reciprocal of the rank of the first relevant chunk.
MRR = mean(1 / rank_of_first_relevant_chunk)

If the relevant chunk is at rank 1: contributes 1.0
If at rank 2: contributes 0.5
If at rank 5: contributes 0.2
If not found: contributes 0
Good for: rewarding getting the right answer at high ranks. Sensitive to position.
When to use: when you only pass top-5 or top-10 to the generator, and high ranks matter.
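The per-query contributions above can be sketched as follows; `reciprocal_rank`, `mrr`, and the input shape are illustrative names, and relevant chunks are passed as a set so multi-label queries work too:

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant chunk in the ranking; 0.0 if none retrieved."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mrr(results):
    # results: list of (ranked_chunk_ids, relevant_id_set) pairs, one per query
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in results) / len(results)
```

`mrr([(["a"], {"a"}), (["x", "y"], {"y"})])` averages 1.0 and 0.5 to give 0.75.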
What fraction of all relevant chunks made it into top-k?
recall@k = (relevant chunks in top-k) / (total relevant chunks)
Requires labeling all relevant chunks for each query, not just one.
Good for: multi-chunk questions where you want to find several relevant pieces. For single-chunk questions, recall@k reduces to hit_rate@k.
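A minimal sketch of recall@k, assuming labels are provided as a set of relevant chunk IDs per query (the names here are illustrative):

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all labeled-relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0  # no labeled relevant chunks: nothing to recall
    retrieved = set(ranked_ids[:k])
    return len(retrieved & relevant_ids) / len(relevant_ids)
```

With three relevant chunks `{"a", "c", "d"}` and top-3 results `["a", "b", "c"]`, two of three are found, so recall@3 is 2/3.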
What fraction of top-k results are relevant?
precision@k = (relevant chunks in top-k) / k
High precision = clean context for the generator. High recall = broad coverage but noisier context. Trade-off.
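Precision@k uses the same inputs as recall@k but divides by k instead of the number of relevant chunks, which is what makes the trade-off visible. A sketch with illustrative names:

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are labeled relevant."""
    return len(set(ranked_ids[:k]) & relevant_ids) / k
```

For `["a", "b", "c", "d"]` with relevant set `{"a", "c"}`, precision@4 is 0.5: half the context you hand the generator is noise.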
Accounts for graded relevance (some chunks are more relevant than others) and rank position. More complex, more informative.
DCG@k = sum over top-k: relevance_score_i / log2(rank_i + 1)
NDCG@k = DCG@k / ideal_DCG@k
Ranges 0-1. Rewards both high relevance and high rank. Standard for information retrieval research.
When to use: when relevance is graded (not binary) and you care about ranking quality.
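The two formulas translate directly. One simplification in this sketch: the ideal DCG is computed by re-sorting the retrieved list's own scores, rather than over all judged chunks for the query, which is a common shortcut when every judged chunk is in the candidate list (names are illustrative):

```python
import math

def dcg_at_k(relevance_scores, k):
    # relevance_scores: graded relevance (e.g. 0-3) of retrieved chunks, in rank order
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevance_scores[:k], start=1))

def ndcg_at_k(relevance_scores, k):
    # Normalize by the DCG of the best possible ordering of these scores.
    ideal = dcg_at_k(sorted(relevance_scores, reverse=True), k)
    return dcg_at_k(relevance_scores, k) / ideal if ideal > 0 else 0.0
```

A perfectly ordered list like `[3, 2, 1]` scores 1.0; pushing the only relevant chunk to rank 3, as in `[0, 0, 3]`, halves it, because `log2(4)` doubles the discount relative to rank 1.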
For most RAG projects, hit_rate@k and MRR at your production k are enough to start. NDCG, recall, and the rest are useful additions but overkill for most early-stage projects.
All of these metrics require knowing what's relevant for each query. You need labels.
Wherever your labels come from, an initial eval set of 50-200 human-labeled queries is usually enough to see meaningful signal.
What counts as "the relevant chunk"? Three types of queries:
One chunk has the answer. Label that chunk as relevant. Use hit rate, MRR.
Multiple chunks contribute. Label several as relevant. Use recall, NDCG.
Multiple legitimate answers exist. Label each chunk's relevance on a scale (0-3). Use NDCG.
Know which type your queries are before deciding which metrics to use.
If you're passing top-5 to the generator, hit_rate@5 matters more than hit_rate@50. Always measure at the k your production pipeline uses.
On 100-query eval sets, a 2% change in hit rate is ~2 queries. Could be noise. Use larger eval sets or statistical tests to confirm real improvements.
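One lightweight way to check whether a change is real is a percentile bootstrap over the per-query hit values. This is a sketch under the assumption that you've stored each query's 0/1 outcome; `bootstrap_ci` and its parameters are illustrative:

```python
import random

def bootstrap_ci(per_query_hits, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 hit values."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(per_query_hits)
    means = sorted(
        sum(rng.choices(per_query_hits, k=n)) / n  # resample with replacement
        for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high
```

On a 100-query set with 70 hits, the 95% interval spans roughly ±9 points around 0.70, so a 2% "improvement" is well inside the noise band; if two systems' intervals overlap heavily, don't ship the change on that evidence alone.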
An average hit rate of 70% could mean "every query succeeds 70% of the time" or "70% of queries always succeed and 30% always fail." The latter is worse: you have systematic blind spots. Look at the distribution, not just the mean.
Aggregate metrics hide problems. Split your queries into segments and compute metrics per segment; segmented breakdowns surface specific failures that the aggregate hides.
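A per-segment breakdown is a small grouping step on top of hit rate. Here the segment label is assumed to be attached to each query; `segmented_hit_rate` and the tuple input shape are illustrative:

```python
from collections import defaultdict

def segmented_hit_rate(results, k):
    # results: list of (segment_label, ranked_chunk_ids, relevant_chunk_id) per query
    buckets = defaultdict(list)
    for segment, ranked, relevant in results:
        buckets[segment].append(1.0 if relevant in ranked[:k] else 0.0)
    # mean hit rate per segment
    return {seg: sum(hits) / len(hits) for seg, hits in buckets.items()}
```

A report like `{"faq": 0.95, "policy": 0.40}` tells you exactly where retrieval is failing, where the blended 0.70 would not.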
Next: Generation metrics.