Golden Dataset Curation
Building high-quality evaluation datasets that anchor AI system quality - because an eval is only as good as its test cases.
A Golden Dataset is a curated collection of inputs and expected outputs used to evaluate AI system quality. Unlike unit tests with a few examples, golden datasets contain hundreds or thousands of cases representing real-world usage patterns, edge cases, and failure modes.
Why Golden Datasets Matter: An LLM-as-Judge eval is only as good as its test cases. If your dataset contains only easy examples, your eval will pass systems that fail on hard cases. If your dataset lacks edge cases, you'll miss exactly the failures that hurt users.
Source 1: Mine Git History
Your git history is a goldmine of real-world test cases. Bug fixes reveal edge cases that broke production. Complex PRs show challenging code patterns.
# Find bug fixes (likely edge cases); -i matches Fix/BUG/etc.
git log -i --grep='fix' --grep='bug' --oneline
# Find complex changes (files with 100+ changed lines)
git log --diff-filter=M --stat | grep -E '\|\s+[0-9]{3,}'
# Show the diff a commit made to one file
git show COMMIT_SHA --format='' -- path/to/file
# Extract the before/after file versions as a test-case pair
git show COMMIT_SHA^:path/to/file > before.txt
git show COMMIT_SHA:path/to/file > after.txt
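If you want to automate this mining step, the same queries can be scripted. Below is a minimal sketch using only Node built-ins; the CandidateCase shape and the 50-commit limit are illustrative choices, not part of any tool.
// Sketch: collect bug-fix commits as dataset candidates for human review
// (Node built-ins only; CandidateCase is an illustrative shape).
import { execSync } from 'node:child_process';

interface CandidateCase {
  sha: string;
  subject: string;
  diff: string; // unified diff a reviewer can turn into input/expected-output
}

function mineBugFixes(limit = 50): CandidateCase[] {
  // %H = full sha, %x09 = tab, %s = commit subject
  const log = execSync(
    `git log -i --grep=fix --grep=bug --format=%H%x09%s -n ${limit}`,
    { encoding: 'utf8' }
  );

  return log.trim().split('\n').filter(Boolean).map((line) => {
    const [sha, subject] = line.split('\t');
    const diff = execSync(`git show ${sha} --format=''`, { encoding: 'utf8' });
    return { sha, subject, diff };
  });
}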
Source 2: Production Failures
Every production failure is a future test case. LangSmith and similar tools let you save bad outputs directly to datasets.
// LangSmith pattern: save failures to a dataset
// (assumes the `langsmith` npm package; the exact createExample options
// can differ between SDK versions)
import { Client } from 'langsmith';

const langsmith = new Client(); // reads LANGSMITH_API_KEY from the environment

if (!evaluation.passed) {
  await langsmith.createExample({
    datasetId: 'golden-failures',
    inputs: { prompt: userPrompt },
    outputs: { badResponse: response },
    metadata: { failure_reason: evaluation.feedback }
  });
}
Source 3: Synthetic Generation
For edge cases that rarely occur naturally, generate them synthetically. Use a strong model to create challenging variations.
// Generate edge cases from known patterns
const edgeCasePrompt = `Generate 10 variations of this input that would be
challenging for an AI:
Original: ${originalInput}
Include:
- Adversarial rewording
- Missing/malformed fields
- Boundary values
- Unusual but valid formats`;
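A minimal sketch of running such a prompt, assuming the official OpenAI Node SDK and the edgeCasePrompt string defined above; the model name and the one-variation-per-line parsing are illustrative assumptions, not a fixed API.
// Sketch: send the edge-case prompt to a strong model and collect the results
// (assumes the `openai` npm package; model and output format are illustrative).
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const completion = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: edgeCasePrompt }],
});

// Each non-empty line becomes a candidate test case for human review.
const edgeCases = (completion.choices[0].message.content ?? '')
  .split('\n')
  .map((line) => line.trim())
  .filter(Boolean);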
Dataset Structure
A well-structured dataset includes:
- Input: The prompt or context
- Expected Output: What a correct response looks like (or properties it should have)
- Metadata: Difficulty level, category, source (production/synthetic/historical)
- Version: Track changes to expected outputs over time
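As a concrete illustration, here is one possible TypeScript shape for a record with these fields; the field names and example values are assumptions to adapt, not a standard schema.
// Sketch: one possible shape for a golden dataset record (field names are illustrative).
interface GoldenExample {
  input: string;                 // the prompt or context
  expectedOutput: string;        // a correct response, or the properties it should have
  metadata: {
    difficulty: 'easy' | 'medium' | 'hard';
    category: string;
    source: 'production' | 'synthetic' | 'historical';
  };
  version: number;               // bumped whenever the expected output changes
}

const example: GoldenExample = {
  input: 'Summarize this refund policy for a customer: ...',
  expectedOutput: 'Mentions the 30-day window and that sale items are excluded.',
  metadata: { difficulty: 'medium', category: 'summarization', source: 'production' },
  version: 1,
};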
Dataset Hygiene
- Version your datasets - git or dedicated versioning. When expected outputs change, track why.
- Balance difficulty - 60% easy (baseline), 30% medium, 10% hard/adversarial
- Detect drift - If pass rates drop over time, is the model worse or is the dataset outdated?
- Remove duplicates - Near-identical examples inflate pass rates artificially
- Refresh regularly - Production patterns change; datasets should too
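Two of these checks are easy to automate. The sketch below reuses the GoldenExample shape from the structure example; the 60/30/10 targets come from the list above, and the whitespace/case normalization is a deliberately crude stand-in for real near-duplicate detection.
// Sketch: basic hygiene checks (difficulty balance and duplicate inputs);
// the normalization here is intentionally simple.
function checkHygiene(dataset: GoldenExample[]): void {
  // 1. Difficulty mix vs. the 60/30/10 target.
  const counts = { easy: 0, medium: 0, hard: 0 };
  for (const ex of dataset) counts[ex.metadata.difficulty]++;
  const pct = (n: number) => ((n / dataset.length) * 100).toFixed(0);
  console.log(
    `easy ${pct(counts.easy)}% (target 60), ` +
      `medium ${pct(counts.medium)}% (target 30), ` +
      `hard ${pct(counts.hard)}% (target 10)`
  );

  // 2. Flag examples whose normalized input collides with an earlier one.
  const seen = new Map<string, number>();
  dataset.forEach((ex, i) => {
    const key = ex.input.toLowerCase().replace(/\s+/g, ' ').trim();
    if (seen.has(key)) {
      console.warn(`Possible duplicate: example ${i} matches example ${seen.get(key)}`);
    } else {
      seen.set(key, i);
    }
  });
}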
Key Takeaways
- Golden datasets are curated collections of inputs/outputs for evaluating AI systems
- An eval is only as good as its test cases - weak datasets produce meaningless metrics
- Source 1: Mine git history for bug fixes (edge cases) and complex PRs (hard patterns)
- Source 2: Save production failures directly to datasets (every failure = future test case)
- Source 3: Generate synthetic edge cases for rare scenarios
- Structure includes input, expected output, metadata (source, difficulty, category), and version
- Balance difficulty: 60% easy, 30% medium, 10% hard/adversarial
- Version and refresh datasets regularly to avoid drift
In This Platform
This platform uses golden datasets for validation: the 'sources' collection is a dataset of claims with verified references, 'dimensions' contain question/answer pairs tested against the survey, and the build process validates all content against schemas (the schema IS the golden expectation).
- sources/*.json
- dimensions/*.json
- build.js
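To illustrate the "schema is the golden expectation" idea, a build step could look roughly like the sketch below, which validates each file in sources/ against a JSON schema with Ajv; the schema path and error handling are assumptions, not the platform's actual build.js.
// Sketch: validate collection files against a JSON schema
// (assumes Ajv; the schema path and directory layout are illustrative).
import Ajv from 'ajv';
import { readFileSync, readdirSync } from 'node:fs';
import { join } from 'node:path';

const ajv = new Ajv();
const validate = ajv.compile(
  JSON.parse(readFileSync('schemas/source.schema.json', 'utf8'))
);

for (const file of readdirSync('sources').filter((f) => f.endsWith('.json'))) {
  const data = JSON.parse(readFileSync(join('sources', file), 'utf8'));
  if (!validate(data)) {
    // A schema violation is an eval failure: the file no longer matches the golden expectation.
    console.error(`${file} failed schema validation:`, validate.errors);
    process.exitCode = 1;
  }
}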