Retrieval metrics measure whether your system found the right chunks. They don't tell you whether the generator used them well (that's a separate measurement), but they are the upstream quality gate: if retrieval metrics are bad, nothing downstream can save you.
Simplest metric: did the relevant chunk appear anywhere in the top-k results?
hit_rate@k = (queries where relevant chunk in top-k) / (total queries)
Usually measured at k = 5, 10, 50.
Good for: quick overall health check. If hit_rate@10 is 60%, 40% of your queries can't possibly be answered correctly.
Bad at: distinguishing "found at rank 1" from "found at rank 10." They count the same.
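The hit_rate@k formula is simple enough to sketch in a few lines. The function names and the `(ranked_ids, relevant_id)` input shape here are illustrative choices, not from the original:

```python
def hit_rate_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant chunk appears anywhere in the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def mean_hit_rate(results, k):
    # results: list of (ranked_chunk_ids, relevant_chunk_id) pairs, one per query
    return sum(hit_rate_at_k(ranked, rel, k) for ranked, rel in results) / len(results)
```

For example, `mean_hit_rate([(["a", "b", "c"], "b"), (["x", "y"], "z")], k=2)` gives 0.5: the first query is a hit at rank 2, the second is a miss.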
MRR (Mean Reciprocal Rank): the average reciprocal of the rank of the first relevant chunk.
MRR = mean(1 / rank_of_first_relevant_chunk)

If the relevant chunk is at rank 1: contributes 1.0
If at rank 2: contributes 0.5
If at rank 5: contributes 0.2
If not found: contributes 0
Good for: rewarding getting the right answer at high ranks. Sensitive to position.
When to use: when you only pass top-5 or top-10 to the generator, and high ranks matter.
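The per-query contributions above can be sketched as follows; `reciprocal_rank`, `mrr`, and the input shape are illustrative names, and relevant chunks are passed as a set so multi-label queries work too:

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant chunk in the ranking; 0.0 if none retrieved."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mrr(results):
    # results: list of (ranked_chunk_ids, relevant_id_set) pairs, one per query
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in results) / len(results)
```

`mrr([(["a"], {"a"}), (["x", "y"], {"y"})])` averages 1.0 and 0.5 to give 0.75.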
What fraction of all relevant chunks made it into top-k?
recall@k = (relevant chunks in top-k) / (total relevant chunks)
Requires labeling all relevant chunks for each query, not just one.
Good for: multi-chunk questions where you want to find several relevant pieces. For single-chunk questions, recall@k reduces to hit_rate@k.
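A minimal sketch of recall@k, assuming labels are provided as a set of relevant chunk IDs per query (the names here are illustrative):

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all labeled-relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0  # no labeled relevant chunks: nothing to recall
    retrieved = set(ranked_ids[:k])
    return len(retrieved & relevant_ids) / len(relevant_ids)
```

With three relevant chunks `{"a", "c", "d"}` and top-3 results `["a", "b", "c"]`, two of three are found, so recall@3 is 2/3.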
What fraction of top-k results are relevant?
precision@k = (relevant chunks in top-k) / k
High precision = clean context for the generator. High recall = broad coverage but noisier context. Trade-off.
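Precision@k uses the same inputs as recall@k but divides by k instead of the number of relevant chunks, which is what makes the trade-off visible. A sketch with illustrative names:

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are labeled relevant."""
    return len(set(ranked_ids[:k]) & relevant_ids) / k
```

For `["a", "b", "c", "d"]` with relevant set `{"a", "c"}`, precision@4 is 0.5: half the context you hand the generator is noise.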
Accounts for graded relevance (some chunks are more relevant than others) and rank position. More complex, more informative.
DCG@k = sum over top-k: relevance_score_i / log2(rank_i + 1)
NDCG@k = DCG@k / ideal_DCG@k
Ranges 0-1. Rewards both high relevance and high rank. Standard for information retrieval research.
When to use: when relevance is graded (not binary) and you care about ranking quality.
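The two formulas translate directly. One simplification in this sketch: the ideal DCG is computed by re-sorting the retrieved list's own scores, rather than over all judged chunks for the query, which is a common shortcut when every judged chunk is in the candidate list (names are illustrative):

```python
import math

def dcg_at_k(relevance_scores, k):
    # relevance_scores: graded relevance (e.g. 0-3) of retrieved chunks, in rank order
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevance_scores[:k], start=1))

def ndcg_at_k(relevance_scores, k):
    # Normalize by the DCG of the best possible ordering of these scores.
    ideal = dcg_at_k(sorted(relevance_scores, reverse=True), k)
    return dcg_at_k(relevance_scores, k) / ideal if ideal > 0 else 0.0
```

A perfectly ordered list like `[3, 2, 1]` scores 1.0; pushing the only relevant chunk to rank 3, as in `[0, 0, 3]`, halves it, because `log2(4)` doubles the discount relative to rank 1.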
For most RAG projects, hit_rate@k and MRR at your production k are enough to start. NDCG, recall, and the rest are useful additions but overkill for most early-stage projects.
All of these metrics require knowing what's relevant for each query. You need labels.
Wherever your labels come from, an initial eval set of 50-200 human-labeled queries is usually enough to see meaningful signal.
What counts as "the relevant chunk"? Three types of queries:
One chunk has the answer. Label that chunk as relevant. Use hit rate, MRR.
Multiple chunks contribute. Label several as relevant. Use recall, NDCG.
Multiple legitimate answers exist. Label each chunk's relevance on a scale (0-3). Use NDCG.
Know which type your queries are before deciding which metrics to use.
If you're passing top-5 to the generator, hit_rate@5 matters more than hit_rate@50. Always measure at the k your production pipeline uses.
On 100-query eval sets, a 2% change in hit rate is ~2 queries. Could be noise. Use larger eval sets or statistical tests to confirm real improvements.
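One lightweight way to check whether a change is real is a percentile bootstrap over the per-query hit values. This is a sketch under the assumption that you've stored each query's 0/1 outcome; `bootstrap_ci` and its parameters are illustrative:

```python
import random

def bootstrap_ci(per_query_hits, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 hit values."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(per_query_hits)
    means = sorted(
        sum(rng.choices(per_query_hits, k=n)) / n  # resample with replacement
        for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high
```

On a 100-query set with 70 hits, the 95% interval spans roughly ±9 points around 0.70, so a 2% "improvement" is well inside the noise band; if two systems' intervals overlap heavily, don't ship the change on that evidence alone.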
An average hit rate of 70% could mean "every query succeeds 70% of the time" or "70% of queries always succeed and 30% always fail." The latter is worse: you have systematic blind spots. Look at the distribution, not just the mean.
Aggregate metrics hide problems. Split your queries into segments and compute metrics per segment; segmented breakdowns surface specific failures that the aggregate hides.
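A per-segment breakdown is a small grouping step on top of hit rate. Here the segment label is assumed to be attached to each query; `segmented_hit_rate` and the tuple input shape are illustrative:

```python
from collections import defaultdict

def segmented_hit_rate(results, k):
    # results: list of (segment_label, ranked_chunk_ids, relevant_chunk_id) per query
    buckets = defaultdict(list)
    for segment, ranked, relevant in results:
        buckets[segment].append(1.0 if relevant in ranked[:k] else 0.0)
    # mean hit rate per segment
    return {seg: sum(hits) / len(hits) for seg, hits in buckets.items()}
```

A report like `{"faq": 0.95, "policy": 0.40}` tells you exactly where retrieval is failing, where the blended 0.70 would not.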
Next: Generation metrics.