Building eval datasets
📖 5 min read · Updated 2026-04-18
An evaluation set is the foundation of disciplined RAG development. It's also the part teams most often cut corners on. Here's how I actually build eval datasets that produce usable signal, including the shortcuts that work and the ones that don't.
The essential properties
- Representative: reflects the queries users actually ask
- Diverse: covers different query types, difficulties, and content
- Labeled: known-good answers or known-relevant chunks
- Stable: same queries over time, so you can measure trends
- Expandable: grows as the system evolves
Where queries come from
Production logs (best)
If you have any production traffic, real user queries are gold. They show actual phrasing, actual intent, actual edge cases.
Process:
- Sample random queries (or stratified by query type)
- Strip PII
- Label each with expected relevant chunks / correct answers
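The sampling and PII-stripping steps above can be sketched in a few lines. This is a minimal illustration, not a real PII pipeline: the regexes below only catch obvious emails and phone numbers, and the function names (`sample_queries`, `strip_pii`) are mine, not from any library.

```python
import random
import re

def sample_queries(logs, n=100, seed=42):
    """Uniformly sample query strings from production log records.

    Fixing the seed keeps the sample reproducible across runs."""
    rng = random.Random(seed)
    return rng.sample(logs, min(n, len(logs)))

# Crude patterns for the most common PII; a real pipeline needs a
# proper detector (names, addresses, account IDs, etc.).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def strip_pii(query):
    """Replace obvious emails and phone numbers with placeholders."""
    query = EMAIL.sub("[EMAIL]", query)
    query = PHONE.sub("[PHONE]", query)
    return query
```

For stratified sampling, group the logs by query type first and call `sample_queries` per group.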
Synthetic from documents (fast start)
Before you have traffic, generate queries from documents.
Process:
- Sample chunks from the corpus
- Prompt an LLM: "Generate 3 questions a user might ask that this chunk answers"
- Each chunk becomes a labeled query-chunk pair
Quality of synthetic queries depends on prompt quality. Iterate on the prompt until the queries feel like real user questions (not just rephrasings of the chunk).
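The chunk-to-query loop is simple enough to sketch. `complete` below is a stand-in for whatever LLM client you use (it should return a list of question strings); the prompt wording is one starting point to iterate on, not a recommendation.

```python
def synthetic_pairs(chunks, complete, n_questions=3):
    """Turn corpus chunks into labeled (query, chunk_id) pairs.

    chunks: iterable of (chunk_id, text) tuples.
    complete: callable that takes a prompt and returns a list of questions.
    """
    prompt_tmpl = (
        "Generate {n} questions a user might ask that this chunk answers. "
        "Phrase them the way a real user would, not as rephrasings of the "
        "text.\n\nChunk:\n{chunk}"
    )
    pairs = []
    for chunk_id, text in chunks:
        questions = complete(prompt_tmpl.format(n=n_questions, chunk=text))
        for q in questions:
            # The source chunk is the relevance label "for free".
            pairs.append({"query": q, "relevant_chunk": chunk_id})
    return pairs
```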
SME (Subject Matter Expert) interviews
Ask experts: "What are the hardest/most common questions in this domain?" This produces high-value queries that real users might ask.
Support ticket analysis
If you have customer support data, extract common question patterns. Often the highest-quality source because these are real problems.
Competitor or benchmark queries
Public datasets for specific domains (BEIR, MS MARCO, HotpotQA) can provide a baseline. Limited applicability to your specific corpus but useful for sanity checking.
How many queries
- Minimum viable: 30 queries. Enough to catch obvious regressions.
- Reasonable baseline: 100-200 queries. Enough to detect meaningful improvements.
- Serious eval: 500-2000 queries. Statistical power to detect subtle differences.
- Research-grade: 10000+ queries. Academic benchmarks and very large production systems.
Start at 50-100, grow over time.
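A rough way to see why these sizes matter: treat a binary metric like recall@k as a proportion and compute its margin of error. This back-of-envelope sketch assumes a binomial approximation and independent queries, which real eval sets only loosely satisfy.

```python
import math

def detectable_delta(n, p=0.7, z=1.96):
    """Approximate 95% margin of error for a binary metric (e.g. recall@k)
    measured on n queries, assuming a true rate around p."""
    return z * math.sqrt(p * (1 - p) / n)

# With 30 queries, only ~16-point swings are clearly real;
# with 1000 queries, ~3-point changes become detectable.
```

This is why a 2-point improvement measured on 30 queries is noise, but the same improvement on 1000 queries is signal.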
Diversity
Cover different query shapes:
- Short (1-3 words) vs long (20+ words)
- Specific (exact terms) vs general (conceptual)
- Single-hop vs multi-hop
- Different document types in your corpus
- Different topics / sections
- Different user personas (if you have multiple)
- Edge cases (ambiguous queries, queries with no good answer)
Track query distribution across these dimensions. Imbalanced eval sets produce misleading metrics.
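One lightweight way to track this: tag each eval query along the dimensions above and count. The tag schema here (`length`, `topic`, stored under a `tags` key) is just an illustrative convention.

```python
from collections import Counter

def distribution(eval_set, dimension):
    """Count queries per tag value along one dimension, e.g. 'length'."""
    return Counter(q["tags"].get(dimension, "untagged") for q in eval_set)

eval_set = [
    {"query": "reset password",
     "tags": {"length": "short", "topic": "auth"}},
    {"query": "how do I configure SSO for my whole organization?",
     "tags": {"length": "long", "topic": "auth"}},
    {"query": "billing cycle",
     "tags": {"length": "short", "topic": "billing"}},
]
```

Printing `distribution(eval_set, "topic")` per release makes skew visible before it skews your metrics.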
Labeling
For retrieval
Each query is paired with one or more chunks that should be retrieved.
Labeling options:
- Manual: human reviews each query and flags relevant chunks. Time-consuming but highest quality.
- Semi-automatic: start with retrieval results, have human confirm/correct labels.
- Graded relevance: label on a scale (0-3) instead of binary. Used for NDCG.
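Graded labels pay off because NDCG is easy to compute from them. A minimal sketch, using the standard exponential-gain formulation:

```python
import math

def ndcg_at_k(ranked_grades, all_grades, k=10):
    """NDCG@k from graded relevance labels (0-3).

    ranked_grades: grades of the chunks the system returned, in rank order.
    all_grades: grades of every labeled chunk for the query, any order.
    """
    def dcg(grades):
        # Gain 2^g - 1, discounted by log2 of the (1-indexed) rank + 1.
        return sum((2 ** g - 1) / math.log2(i + 2)
                   for i, g in enumerate(grades[:k]))

    ideal = dcg(sorted(all_grades, reverse=True))
    return dcg(ranked_grades) / ideal if ideal > 0 else 0.0
```

A perfect ranking scores 1.0; putting a grade-3 chunk below a grade-0 chunk costs you, which binary labels can't express.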
For generation
Each query is paired with a reference answer.
Options:
- Human-written reference answers
- Accepted answers from past user feedback
- Expert-generated gold standard
Negative examples
Don't just label what should be retrieved. Also label what should NOT be retrieved, or what the system should refuse to answer:
- Out-of-scope queries ("what's the weather?")
- Queries with no good answer in corpus
- Queries requiring information behind access controls the user doesn't have
- Queries where the correct answer is "I don't know"
These queries should return empty results or refusal, not hallucinated answers.
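Scoring negatives is simpler than scoring positives: count how often the system correctly comes up empty. A sketch, where `retrieve` is a stand-in for your retriever returning `{"score": ...}` hits and the threshold is whatever cutoff your system uses to decide "no good answer":

```python
def score_negatives(negative_queries, retrieve, threshold=0.5):
    """Fraction of out-of-scope queries for which no hit clears the
    relevance threshold (i.e. the system correctly returns nothing)."""
    correct = 0
    for q in negative_queries:
        hits = [h for h in retrieve(q) if h["score"] >= threshold]
        if not hits:
            correct += 1
    return correct / len(negative_queries)
```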
The labeling tool
For more than a few dozen queries, manual labeling gets tedious. Options:
- Spreadsheet (simplest, works for 100-500 queries)
- Argilla (open-source labeling platform for LLM data)
- Label Studio (general-purpose)
- Scale / Surge (outsourced human labelers for large sets)
Simple pattern: export eval set to CSV, reviewer fills in labels, import back to your eval harness.
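The CSV round-trip is a few lines of stdlib. The column names and the `;`-separated chunk-ID convention here are my own choices, not a standard:

```python
import csv

FIELDS = ["query", "relevant_chunks", "notes"]

def export_for_labeling(queries, path):
    """Write unlabeled queries to CSV; the reviewer fills in
    relevant_chunks as a ;-separated list of chunk IDs."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for q in queries:
            writer.writerow({"query": q, "relevant_chunks": "", "notes": ""})

def import_labels(path):
    """Read the reviewed CSV back into eval-harness records."""
    with open(path, newline="") as f:
        return [
            {"query": row["query"],
             "relevant_chunks": [c for c in row["relevant_chunks"].split(";") if c]}
            for row in csv.DictReader(f)
        ]
```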
Versioning
Eval sets evolve. Version them:
- v1: initial 50 queries
- v2: added 100 queries covering new features
- v3: updated labels based on corpus changes
Report metrics against a specific version. This lets you track real improvements without eval-set drift.
Avoiding eval contamination
Don't use your eval queries as training examples for your system (fine-tuning, prompt examples, etc.). Eval contamination inflates metrics and hides real failures.
Hold out eval data strictly. Use separate query sets for training, development, and evaluation.
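One way to keep the hold-out stable: route each query by a hash of its text rather than by random assignment, so the split survives re-sampling and re-labeling. A sketch (the bucket arithmetic is approximate at the boundary):

```python
import hashlib

def assign_split(query, dev_fraction=0.2):
    """Deterministically route a query to 'dev' or 'eval' by hashing
    its text. The same query always lands in the same split, so dev
    examples can never silently leak into the eval set later."""
    bucket = int(hashlib.sha256(query.encode()).hexdigest(), 16) % 100
    return "dev" if bucket < dev_fraction * 100 else "eval"
```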
When to add queries
- After every user-facing feature change (add queries for the new feature)
- When you find a production bug (add the bug as an eval case so regressions get caught)
- When your corpus changes materially (ensure eval coverage of new content)
- Every quarter: sample fresh queries from production logs
The eval set should grow by 10-50 queries per month for an actively developed system.
The fast-start recipe
If you have nothing today:
- Sample 100 chunks from your corpus
- Prompt GPT-4 to generate 3 queries per chunk
- Sanity-check and deduplicate
- You now have 300 labeled query-chunk pairs
- Run baseline retrieval metrics
- Start improving against this baseline
Replace with real production queries as they become available. The synthetic set is a bootstrap, not a final answer.
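The sanity-check/dedupe step from the recipe deserves a note: LLMs generating 3 queries per chunk produce many near-identical phrasings. A minimal sketch that drops exact duplicates after normalization (near-duplicate detection via embeddings is a further step this doesn't cover):

```python
import re

def normalize(q):
    """Lowercase and collapse punctuation/whitespace for comparison."""
    return re.sub(r"\W+", " ", q.lower()).strip()

def dedupe(queries):
    """Keep the first occurrence of each normalized query."""
    seen, out = set(), []
    for q in queries:
        key = normalize(q)
        if key not in seen:
            seen.add(key)
            out.append(q)
    return out
```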
Next: Latency optimization.