Building eval datasets
📖 5 min read · Updated 2026-04-18
An evaluation set is the foundation of disciplined RAG development. It's also the part teams most often cut corners on. Here's how I actually build eval datasets that produce usable signal, including the shortcuts that work and the ones that don't.
The essential properties
- Representative: reflects the queries users actually ask
- Diverse: covers different query types, difficulties, and content
- Labeled: known-good answers or known-relevant chunks
- Stable: same queries over time, so you can measure trends
- Expandable: grows as the system evolves
Where queries come from
Production logs (best)
If you have any production traffic, real user queries are gold. They show actual phrasing, actual intent, actual edge cases.
Process:
- Sample random queries (or stratified by query type)
- Strip PII
- Label each with expected relevant chunks / correct answers
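The sampling and PII-stripping steps above can be sketched in a few lines. This is a minimal illustration, not a real PII pipeline: the regexes below only catch obvious emails and phone numbers, and the function names (`sample_queries`, `strip_pii`) are mine, not from any library.

```python
import random
import re

def sample_queries(logs, n=100, seed=42):
    """Uniformly sample query strings from production log records.

    Fixing the seed keeps the sample reproducible across runs."""
    rng = random.Random(seed)
    return rng.sample(logs, min(n, len(logs)))

# Crude patterns for the most common PII; a real pipeline needs a
# proper detector (names, addresses, account IDs, etc.).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def strip_pii(query):
    """Replace obvious emails and phone numbers with placeholders."""
    query = EMAIL.sub("[EMAIL]", query)
    query = PHONE.sub("[PHONE]", query)
    return query
```

For stratified sampling, group the logs by query type first and call `sample_queries` per group.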
Synthetic from documents (fast start)
Before you have traffic, generate queries from documents.
Process:
- Sample chunks from the corpus
- Prompt an LLM: "Generate 3 questions a user might ask that this chunk answers"
- Each chunk becomes a labeled query-chunk pair
Quality of synthetic queries depends on prompt quality. Iterate on the prompt until the queries feel like real user questions (not just rephrasings of the chunk).
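The chunk-to-query loop is simple enough to sketch. `complete` below is a stand-in for whatever LLM client you use (it should return a list of question strings); the prompt wording is one starting point to iterate on, not a recommendation.

```python
def synthetic_pairs(chunks, complete, n_questions=3):
    """Turn corpus chunks into labeled (query, chunk_id) pairs.

    chunks: iterable of (chunk_id, text) tuples.
    complete: callable that takes a prompt and returns a list of questions.
    """
    prompt_tmpl = (
        "Generate {n} questions a user might ask that this chunk answers. "
        "Phrase them the way a real user would, not as rephrasings of the "
        "text.\n\nChunk:\n{chunk}"
    )
    pairs = []
    for chunk_id, text in chunks:
        questions = complete(prompt_tmpl.format(n=n_questions, chunk=text))
        for q in questions:
            # The source chunk is the relevance label "for free".
            pairs.append({"query": q, "relevant_chunk": chunk_id})
    return pairs
```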
SME (Subject Matter Expert) interviews
Ask experts: "What are the hardest/most common questions in this domain?" This produces high-value queries that real users might ask.
Support ticket analysis
If you have customer support data, extract common question patterns. Often the highest-quality source because these are real problems.
Competitor or benchmark queries
Public datasets for specific domains (BEIR, MS MARCO, HotpotQA) can provide a baseline. Limited applicability to your specific corpus but useful for sanity checking.
How many queries
- Minimum viable: 30 queries. Enough to catch obvious regressions.
- Reasonable baseline: 100-200 queries. Enough to detect meaningful improvements.
- Serious eval: 500-2000 queries. Statistical power to detect subtle differences.
- Research-grade: 10000+ queries. Academic benchmarks and very large production systems.
Start at 50-100, grow over time.
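A rough way to see why these sizes matter: treat a binary metric like recall@k as a proportion and compute its margin of error. This back-of-envelope sketch assumes a binomial approximation and independent queries, which real eval sets only loosely satisfy.

```python
import math

def detectable_delta(n, p=0.7, z=1.96):
    """Approximate 95% margin of error for a binary metric (e.g. recall@k)
    measured on n queries, assuming a true rate around p."""
    return z * math.sqrt(p * (1 - p) / n)

# With 30 queries, only ~16-point swings are clearly real;
# with 1000 queries, ~3-point changes become detectable.
```

This is why a 2-point improvement measured on 30 queries is noise, but the same improvement on 1000 queries is signal.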
Diversity
Cover different query shapes:
- Short (1-3 words) vs long (20+ words)
- Specific (exact terms) vs general (conceptual)
- Single-hop vs multi-hop
- Different document types in your corpus
- Different topics / sections
- Different user personas (if you have multiple)
- Edge cases (ambiguous queries, queries with no good answer)
Track query distribution across these dimensions. Imbalanced eval sets produce misleading metrics.
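One lightweight way to track this: tag each eval query along the dimensions above and count. The tag schema here (`length`, `topic`, stored under a `tags` key) is just an illustrative convention.

```python
from collections import Counter

def distribution(eval_set, dimension):
    """Count queries per tag value along one dimension, e.g. 'length'."""
    return Counter(q["tags"].get(dimension, "untagged") for q in eval_set)

eval_set = [
    {"query": "reset password",
     "tags": {"length": "short", "topic": "auth"}},
    {"query": "how do I configure SSO for my whole organization?",
     "tags": {"length": "long", "topic": "auth"}},
    {"query": "billing cycle",
     "tags": {"length": "short", "topic": "billing"}},
]
```

Printing `distribution(eval_set, "topic")` per release makes skew visible before it skews your metrics.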
Labeling
For retrieval
Each query is paired with one or more chunks that should be retrieved.
Labeling options:
- Manual: human reviews each query and flags relevant chunks. Time-consuming but highest quality.
- Semi-automatic: start with retrieval results, have human confirm/correct labels.
- Graded relevance: label on a scale (0-3) instead of binary. Used for NDCG.
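Graded labels pay off because NDCG is easy to compute from them. A minimal sketch, using the standard exponential-gain formulation:

```python
import math

def ndcg_at_k(ranked_grades, all_grades, k=10):
    """NDCG@k from graded relevance labels (0-3).

    ranked_grades: grades of the chunks the system returned, in rank order.
    all_grades: grades of every labeled chunk for the query, any order.
    """
    def dcg(grades):
        # Gain 2^g - 1, discounted by log2 of the (1-indexed) rank + 1.
        return sum((2 ** g - 1) / math.log2(i + 2)
                   for i, g in enumerate(grades[:k]))

    ideal = dcg(sorted(all_grades, reverse=True))
    return dcg(ranked_grades) / ideal if ideal > 0 else 0.0
```

A perfect ranking scores 1.0; putting a grade-3 chunk below a grade-0 chunk costs you, which binary labels can't express.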
For generation
Each query is paired with a reference answer.
Options:
- Human-written reference answers
- Accepted answers from past user feedback
- Expert-generated gold standard
Negative examples
Don't just label what should be retrieved. Also label what should NOT be retrieved, or what the system should refuse to answer:
- Out-of-scope queries ("what's the weather?")
- Queries with no good answer in corpus
- Queries requiring information behind access controls the user doesn't have
- Queries where the correct answer is "I don't know"
These queries should return empty results or refusal, not hallucinated answers.
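Scoring negatives is simpler than scoring positives: count how often the system correctly comes up empty. A sketch, where `retrieve` is a stand-in for your retriever returning `{"score": ...}` hits and the threshold is whatever cutoff your system uses to decide "no good answer":

```python
def score_negatives(negative_queries, retrieve, threshold=0.5):
    """Fraction of out-of-scope queries for which no hit clears the
    relevance threshold (i.e. the system correctly returns nothing)."""
    correct = 0
    for q in negative_queries:
        hits = [h for h in retrieve(q) if h["score"] >= threshold]
        if not hits:
            correct += 1
    return correct / len(negative_queries)
```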
The labeling tool
For more than a few dozen queries, manual labeling gets tedious. Options:
- Spreadsheet (simplest, works for 100-500 queries)
- Argilla (open-source labeling platform for LLM data)
- Label Studio (general-purpose)
- Scale / Surge (outsourced human labelers for large sets)
Simple pattern: export eval set to CSV, reviewer fills in labels, import back to your eval harness.
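The CSV round-trip is a few lines of stdlib. The column names and the `;`-separated chunk-ID convention here are my own choices, not a standard:

```python
import csv

FIELDS = ["query", "relevant_chunks", "notes"]

def export_for_labeling(queries, path):
    """Write unlabeled queries to CSV; the reviewer fills in
    relevant_chunks as a ;-separated list of chunk IDs."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for q in queries:
            writer.writerow({"query": q, "relevant_chunks": "", "notes": ""})

def import_labels(path):
    """Read the reviewed CSV back into eval-harness records."""
    with open(path, newline="") as f:
        return [
            {"query": row["query"],
             "relevant_chunks": [c for c in row["relevant_chunks"].split(";") if c]}
            for row in csv.DictReader(f)
        ]
```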
Versioning
Eval sets evolve. Version them:
- v1: initial 50 queries
- v2: added 100 queries covering new features
- v3: updated labels based on corpus changes
Report metrics against a specific version. This lets you track real improvements without eval-set drift.
Avoiding eval contamination
Don't use your eval queries as training examples for your system (fine-tuning, prompt examples, etc.). Eval contamination inflates metrics and hides real failures.
Hold out eval data strictly. Use separate query sets for training, development, and evaluation.
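One way to keep the hold-out stable: route each query by a hash of its text rather than by random assignment, so the split survives re-sampling and re-labeling. A sketch (the bucket arithmetic is approximate at the boundary):

```python
import hashlib

def assign_split(query, dev_fraction=0.2):
    """Deterministically route a query to 'dev' or 'eval' by hashing
    its text. The same query always lands in the same split, so dev
    examples can never silently leak into the eval set later."""
    bucket = int(hashlib.sha256(query.encode()).hexdigest(), 16) % 100
    return "dev" if bucket < dev_fraction * 100 else "eval"
```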
When to add queries
- After every user-facing feature change (add queries for the new feature)
- When you find a production bug (add the bug as an eval case so regressions get caught)
- When your corpus changes materially (ensure eval coverage of new content)
- Every quarter: sample fresh queries from production logs
The eval set should grow by 10-50 queries per month for an actively developed system.
The fast-start recipe
If you have nothing today:
- Sample 100 chunks from your corpus
- Prompt GPT-4 to generate 3 queries per chunk
- Sanity-check and deduplicate
- You now have 300 labeled query-chunk pairs
- Run baseline retrieval metrics
- Start improving against this baseline
Replace with real production queries as they become available. The synthetic set is a bootstrap, not a final answer.
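The sanity-check/dedupe step from the recipe deserves a note: LLMs generating 3 queries per chunk produce many near-identical phrasings. A minimal sketch that drops exact duplicates after normalization (near-duplicate detection via embeddings is a further step this doesn't cover):

```python
import re

def normalize(q):
    """Lowercase and collapse punctuation/whitespace for comparison."""
    return re.sub(r"\W+", " ", q.lower()).strip()

def dedupe(queries):
    """Keep the first occurrence of each normalized query."""
    seen, out = set(), []
    for q in queries:
        key = normalize(q)
        if key not in seen:
            seen.add(key)
            out.append(q)
    return out
```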
Next: Latency optimization.