Golden Dataset Curation

Building high-quality evaluation datasets that anchor AI system quality - because an eval is only as good as its test cases.

A Golden Dataset is a curated collection of inputs and expected outputs used to evaluate AI system quality. Unlike unit tests with a few examples, golden datasets contain hundreds or thousands of cases representing real-world usage patterns, edge cases, and failure modes.

Why Golden Datasets Matter: An LLM-as-Judge eval is only as good as its test cases. If your dataset contains only easy examples, your eval will pass systems that fail on hard cases. If your dataset lacks edge cases, you'll miss exactly the failures that hurt users.

Source 1: Mine Git History

Your git history is a goldmine of real-world test cases. Bug fixes reveal edge cases that broke production. Complex PRs show challenging code patterns.

# Find bug fixes (likely edge cases)
git log -i --grep='fix' --grep='bug' --oneline

# Find commits touching files with 100+ changed lines (challenging patterns)
git log --diff-filter=M --stat | grep -E '\|\s+[0-9]{3,}'

# Extract a before/after pair from a bug-fix commit
git show COMMIT_SHA^:path/to/file   # file content before the fix
git show COMMIT_SHA:path/to/file    # file content after the fix
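
A minimal sketch of turning one bug-fix commit into a candidate dataset entry, assuming Node.js; gitShow and exampleFromFix are illustrative helpers, not a library API.

// Sketch: build a dataset entry from a bug-fix commit
import { execSync } from 'node:child_process';

// Read a file's content at a given git revision spec (e.g. 'SHA^:path')
function gitShow(spec) {
  return execSync(`git show ${spec}`, { encoding: 'utf8' });
}

// The record shape mirrors the structure described later in this article
function exampleFromFix(sha, filePath) {
  return {
    input: gitShow(`${sha}^:${filePath}`),         // buggy version (before)
    expectedOutput: gitShow(`${sha}:${filePath}`), // fixed version (after)
    metadata: { source: 'historical', category: 'bug-fix', commit: sha },
  };
}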

Source 2: Production Failures

Every production failure is a future test case. LangSmith and similar tools let you save bad outputs directly to datasets.

// LangSmith pattern: save failures to a dataset
// (createExample(inputs, outputs, options) per the langsmith JS SDK;
// check the current docs, as the SDK's signatures have evolved)
import { Client } from 'langsmith';

const client = new Client();

if (!evaluation.passed) {
  await client.createExample(
    { prompt: userPrompt },      // inputs
    { badResponse: response },   // outputs
    {
      datasetId: 'golden-failures',
      metadata: { failure_reason: evaluation.feedback },
    }
  );
}

Source 3: Synthetic Generation

For edge cases that rarely occur naturally, generate them synthetically. Use a strong model to create challenging variations.

// Generate edge cases from known patterns
const edgeCasePrompt = `Generate 10 variations of this input that would be
challenging for an AI:

Original: ${originalInput}

Include:
- Adversarial rewording
- Missing/malformed fields
- Boundary values
- Unusual but valid formats`;
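
A minimal sketch of wiring that prompt to a model, assuming the openai JS SDK; the model name, the JSON-array convention, and the metadata fields are assumptions, not requirements.

// Sketch: collect variations as candidate dataset entries
// (assumes the openai JS SDK and an ESM context with top-level await)
import OpenAI from 'openai';

const openai = new OpenAI();

const completion = await openai.chat.completions.create({
  model: 'gpt-4o', // any strong model works here
  messages: [
    { role: 'user', content: `${edgeCasePrompt}\n\nReturn a JSON array of strings.` },
  ],
});

// Assumes the model returned a bare JSON array
const candidates = JSON.parse(completion.choices[0].message.content).map((input) => ({
  input,
  metadata: { source: 'synthetic', seed: originalInput },
}));

Treat generated cases as candidates: have a human confirm each one is valid and actually hard before it counts toward your eval.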

Dataset Structure

A well-structured dataset includes the following fields; a concrete record sketch follows the list:

  • Input: The prompt or context
  • Expected Output: What a correct response looks like (or properties it should have)
  • Metadata: Difficulty level, category, source (production/synthetic/historical)
  • Version: Track changes to expected outputs over time
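
For concreteness, here's one way to shape a record in JavaScript; the field names mirror the list above and are illustrative, not a standard schema.

// One possible record shape for a golden dataset entry
// (illustrative field names, not a standard schema)
const example = {
  input: { prompt: 'Summarize this support ticket: ...' },
  expectedOutput: {
    // A reference answer, or properties a correct answer must satisfy
    mustMention: ['refund policy'],
    maxLength: 120,
  },
  metadata: {
    difficulty: 'hard',    // easy | medium | hard
    category: 'summarization',
    source: 'production',  // production | synthetic | historical
  },
  version: 2,              // bump when the expected output changes
};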

Dataset Hygiene

  • Version your datasets - git or dedicated versioning. When expected outputs change, track why.
  • Balance difficulty - 60% easy (baseline), 30% medium, 10% hard/adversarial
  • Detect drift - If pass rates drop over time, is the model worse or is the dataset outdated?
  • Remove duplicates - Near-identical examples inflate pass rates artificially (see the sketch after this list)
  • Refresh regularly - Production patterns change; datasets should too
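
A minimal sketch of two of these checks, exact-duplicate removal over normalized inputs and a difficulty-mix report, assuming records shaped like the example above:

// Drop entries whose normalized input already appeared
function dedupe(dataset) {
  const seen = new Set();
  return dataset.filter((ex) => {
    const key = JSON.stringify(ex.input).toLowerCase().replace(/\s+/g, ' ');
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// Report the difficulty mix, to compare against the 60/30/10 target
function difficultyMix(dataset) {
  const counts = { easy: 0, medium: 0, hard: 0 };
  for (const ex of dataset) {
    const d = ex.metadata?.difficulty;
    if (d in counts) counts[d] += 1;
  }
  return Object.fromEntries(
    Object.entries(counts).map(([k, n]) => [k, `${((100 * n) / dataset.length).toFixed(1)}%`])
  );
}

Note that this only catches exact matches after normalization; true near-duplicates need fuzzier matching, such as embedding similarity.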

Key Takeaways

  • Golden datasets are curated collections of inputs/outputs for evaluating AI systems
  • An eval is only as good as its test cases - weak datasets produce meaningless metrics
  • Source 1: Mine git history for bug fixes (edge cases) and complex PRs (hard patterns)
  • Source 2: Save production failures directly to datasets (every failure = future test case)
  • Source 3: Generate synthetic edge cases for rare scenarios
  • Structure includes input, expected output, metadata (source, difficulty, category), and version
  • Balance difficulty: 60% easy, 30% medium, 10% hard/adversarial
  • Version and refresh datasets regularly to avoid drift

In This Platform

This platform uses golden datasets for validation: the 'sources' collection is a dataset of claims with verified references, 'dimensions' contain question/answer pairs tested against the survey, and the build process validates all content against schemas (the schema IS the golden expectation).

Relevant Files:
  • sources/*.json
  • dimensions/*.json
  • build.js
