Building eval datasets

An evaluation set is the foundation of disciplined RAG development. It's also the part teams most often cut corners on. Here's how I actually build eval datasets that produce usable signal, including the shortcuts that work and the ones that don't.

The essential properties

  1. Representative: reflects the queries users actually ask
  2. Diverse: covers different query types, difficulties, and content
  3. Labeled: known-good answers or known-relevant chunks
  4. Stable: same queries over time, so you can measure trends
  5. Expandable: grows as the system evolves

Where queries come from

Production logs (best)

If you have any production traffic, real user queries are gold. They show actual phrasing, actual intent, actual edge cases.

Process:

  1. Sample random queries (or stratified by query type)
  2. Strip PII
  3. Label each with expected relevant chunks / correct answers
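The sampling and scrubbing steps can be sketched as follows. The regex-based PII scrubber is a deliberately naive placeholder (a real pipeline should use a dedicated PII detector), and the `query_type` field is a hypothetical attribute of your log entries:

```python
import random
import re
from collections import defaultdict

def scrub_pii(text: str) -> str:
    """Naive PII scrub: masks emails and long digit runs.
    A real pipeline should use a proper PII-detection library."""
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "<EMAIL>", text)
    text = re.sub(r"\b\d{6,}\b", "<NUMBER>", text)
    return text

def sample_queries(logs, n_per_type=20, seed=0):
    """Stratified sample: up to n_per_type queries per query_type,
    each scrubbed before it enters the eval set."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for entry in logs:
        by_type[entry["query_type"]].append(entry["query"])
    sample = []
    for qtype, queries in by_type.items():
        for q in rng.sample(queries, min(n_per_type, len(queries))):
            sample.append({"query_type": qtype, "query": scrub_pii(q)})
    return sample
```

A fixed seed makes the sample reproducible, which matters once the set is versioned.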

Synthetic from documents (fast start)

Before you have traffic, generate queries from documents.

Process:

  1. Sample chunks from the corpus
  2. Prompt an LLM: "Generate 3 questions a user might ask that this chunk answers"
  3. Each chunk becomes a labeled query-chunk pair

Quality of synthetic queries depends on prompt quality. Iterate on the prompt until the queries feel like real user questions (not just rephrasings of the chunk).
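The generation loop is simple enough to sketch end to end. Here `generate` is any text-in/text-out LLM call you supply (the prompt wording and JSON output format are assumptions to iterate on, per the advice above):

```python
import json
import random

PROMPT = """Generate 3 questions a user might ask that the chunk below
answers. Phrase them like real user queries, not rephrasings of the
chunk. Return a JSON list of strings.

Chunk:
{chunk}"""

def synthesize_eval_set(chunks, generate, n_chunks=100, seed=0):
    """Build labeled query -> chunk pairs from sampled corpus chunks.
    `generate` is any LLM completion function you provide."""
    rng = random.Random(seed)
    sample = rng.sample(chunks, min(n_chunks, len(chunks)))
    pairs = []
    for chunk in sample:
        questions = json.loads(generate(PROMPT.format(chunk=chunk["text"])))
        for q in questions:
            pairs.append({"query": q, "relevant_chunk_id": chunk["id"]})
    return pairs
```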

SME (Subject Matter Expert) interviews

Ask experts: "What are the hardest and most common questions in this domain?" This produces high-value queries that real users are likely to ask.

Support ticket analysis

If you have customer support data, extract common question patterns. This is often the highest-quality source, because the questions come from real problems.

Competitor or benchmark queries

Public datasets for specific domains (BEIR, MS MARCO, HotpotQA) can provide a baseline. Limited applicability to your specific corpus but useful for sanity checking.

How many queries

Start with 50-100 queries and grow over time; much smaller sets make metric differences too noisy to trust.

Diversity

Cover different query shapes: factoid lookups, multi-hop questions that span several chunks, procedural how-tos, comparisons, and out-of-scope questions the system should decline.

Track query distribution across these dimensions. Imbalanced eval sets produce misleading metrics.
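Tracking the distribution is a one-liner over your labeled set (the `query_type` dimension name is illustrative):

```python
from collections import Counter

def distribution_report(eval_set, dimension="query_type"):
    """Share of eval queries per value of a labeled dimension,
    so imbalances are visible before they skew metrics."""
    counts = Counter(q[dimension] for q in eval_set)
    total = sum(counts.values())
    return {value: round(n / total, 2) for value, n in counts.most_common()}
```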

Labeling

For retrieval

Each query is paired with one or more chunks that should be retrieved.
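With query-to-chunk labels in place, recall@k falls out in a few lines; `retrieve` stands in for whatever retriever you are evaluating:

```python
def recall_at_k(eval_set, retrieve, k=5):
    """Fraction of labeled relevant chunks appearing in the top-k
    results, averaged over queries. `retrieve(query, k)` must return
    a ranked list of chunk ids."""
    scores = []
    for item in eval_set:
        top = set(retrieve(item["query"], k))
        relevant = set(item["relevant_chunk_ids"])
        scores.append(len(top & relevant) / len(relevant))
    return sum(scores) / len(scores)
```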

Labeling options: manual review by someone who knows the corpus, LLM-assisted labeling with human spot checks, or implicit relevance signals (clicks, thumbs-up) from production.

For generation

Each query is paired with a reference answer.

Options: hand-written reference answers, LLM-drafted answers edited by a reviewer, or answers lifted from resolved support tickets.

Negative examples

Don't just label what should retrieve. Label what should NOT retrieve, or what the system should refuse to answer: out-of-scope questions, questions about topics the corpus doesn't cover, and adversarial or nonsense queries.

These queries should return empty results or a refusal, not hallucinated answers.
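Negative labels can be checked mechanically. In this sketch, `retrieve` and `answer` stand in for your retriever and end-to-end pipeline, and the refusal-phrase match is a simplistic placeholder for real refusal detection:

```python
REFUSAL_MARKERS = ("i don't know", "not covered", "no relevant information")

def check_negatives(negative_queries, retrieve, answer):
    """Flag negative queries where the system retrieved chunks AND
    produced a confident answer instead of refusing."""
    failures = []
    for q in negative_queries:
        chunks = retrieve(q, 5)
        reply = answer(q).lower()
        refused = any(marker in reply for marker in REFUSAL_MARKERS)
        if chunks and not refused:
            failures.append(q)
    return failures
```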

The labeling tool

For more than a few dozen queries, manual labeling gets tedious. Options: a shared spreadsheet, an open-source annotation tool, or a small custom review UI.

Simple pattern: export eval set to CSV, reviewer fills in labels, import back to your eval harness.
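The CSV round trip from that pattern, sketched with the standard library (the column names are illustrative, and multiple chunk ids are packed with a `|` separator):

```python
import csv

FIELDS = ["query", "relevant_chunk_ids", "reference_answer"]

def export_for_labeling(eval_set, path):
    """Write eval queries to CSV; the reviewer fills in the label columns."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for item in eval_set:
            writer.writerow({
                "query": item["query"],
                "relevant_chunk_ids": "|".join(item.get("relevant_chunk_ids", [])),
                "reference_answer": item.get("reference_answer", ""),
            })

def import_labels(path):
    """Read the filled-in CSV back into the eval-harness format."""
    with open(path, newline="") as f:
        return [{"query": row["query"],
                 "relevant_chunk_ids": [c for c in row["relevant_chunk_ids"].split("|") if c],
                 "reference_answer": row["reference_answer"]}
                for row in csv.DictReader(f)]
```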

Versioning

Eval sets evolve. Version them like code: keep the set in version control, tag each release, and record what changed and why.

Report metrics against a specific version. This lets you track real improvements without eval-set drift.
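One lightweight convention, assuming the eval set is serializable to JSON: combine a human-readable tag with a content hash, and embed that identifier in every metrics report so results always pin the exact set they were run against.

```python
import hashlib
import json

def eval_set_version(eval_set, tag="v1"):
    """Stable identifier for an eval set: a human tag plus a short
    content hash, so reports pin exactly what they were run against."""
    payload = json.dumps(eval_set, sort_keys=True).encode()
    return f"{tag}-{hashlib.sha256(payload).hexdigest()[:8]}"

def report(metrics, eval_set, tag="v1"):
    """Attach the eval-set version to a dict of metric results."""
    return {"eval_set_version": eval_set_version(eval_set, tag), **metrics}
```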

Avoiding eval contamination

Don't use your eval queries as training examples for your system (fine-tuning, prompt examples, etc.). Eval contamination inflates metrics and hides real failures.

Hold out eval data strictly. Use separate query sets for training, development, and evaluation.
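A deterministic split keeps each query in the same bucket across runs, and hashing the query text (rather than shuffling) means new queries can be added without reshuffling old ones. The split ratios here are illustrative:

```python
import hashlib

def split_bucket(query: str) -> str:
    """Assign a query to train/dev/eval deterministically by hashing
    its text, so held-out eval queries never leak into development."""
    h = int(hashlib.sha256(query.encode()).hexdigest(), 16) % 10
    if h < 6:
        return "train"  # 60%: prompt examples, fine-tuning
    if h < 8:
        return "dev"    # 20%: day-to-day iteration
    return "eval"       # 20%: held out, reported metrics only
```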

When to add queries

The eval set should grow by 10-50 queries per month for an actively developed system.

The fast-start recipe

If you have nothing today:

  1. Sample 100 chunks from your corpus
  2. Prompt GPT-4 to generate 3 queries per chunk
  3. Sanity-check and deduplicate
  4. You now have roughly 300 labeled query-chunk pairs
  5. Run baseline retrieval metrics
  6. Start improving against this baseline

Replace with real production queries as they become available. The synthetic set is a bootstrap, not a final answer.

Next: Latency optimization.