Build a Golden Dataset from Git History
Learn to mine git commit history for evaluation test cases, creating a robust dataset that captures real-world code patterns and edge cases.
1. Understand the Scenario
You're building an AI code review tool. You need test cases to evaluate whether it catches real bugs. Your git history contains years of bug fixes, and each one is a potential test case where the 'before' is the buggy code and the 'after' is the fix.
Learning Objectives
- Extract before/after code pairs from bug-fix commits
- Structure test cases with metadata (difficulty, category, source)
- Generate synthetic edge cases for rare scenarios
- Balance dataset difficulty for meaningful evaluation
- Version and manage dataset evolution
2. Follow the Instructions
The Challenge: Finding Good Test Cases
Your AI code review tool needs test cases that represent real bugs. Contrived, hand-written examples rarely resemble the subtle bugs that actually reach production.
The Solution: Mine your git history. Every bug fix commit contains:
- Before: The buggy code (input)
- After: The fixed code (expected output)
- Message: Description of the bug (metadata)
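Of these three signals, the commit message is the noisiest: a plain substring match on 'fix' also matches 'prefix'. A minimal sketch of a stricter message filter follows. It assumes the `<sha> <subject>` line format produced by `git log --oneline`; the names `BUG_FIX_PATTERN`, `looksLikeBugFix`, and `filterBugFixLines` are hypothetical and not part of the tutorial's code.

```typescript
// Hypothetical keyword pattern; \b word boundaries keep "prefix" or "patchwork" from matching.
const BUG_FIX_PATTERN =
  /\b((bug|hot)?fix(es|ed|ing)?|bugs?|issues?|patch(es|ed|ing)?|resolv(e[sd]?|ing))\b/i;

// True when a commit subject contains one of the bug-fix keywords as a whole word.
function looksLikeBugFix(subject: string): boolean {
  return BUG_FIX_PATTERN.test(subject);
}

// Parse `git log --oneline` output ("<sha> <subject>" per line) and keep likely fixes.
function filterBugFixLines(oneline: string): { sha: string; subject: string }[] {
  const result: { sha: string; subject: string }[] = [];
  for (const rawLine of oneline.split('\n')) {
    const line = rawLine.trim();
    if (line.length === 0) continue;
    const space = line.indexOf(' ');
    const sha = space === -1 ? line : line.slice(0, space);
    const subject = space === -1 ? '' : line.slice(space + 1);
    if (looksLikeBugFix(subject)) result.push({ sha, subject });
  }
  return result;
}
```

Keyword filters still produce false positives, so a manual spot check of a sample remains worthwhile.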
Step 1: Find Bug-Fix Commits
Search for commits that likely contain bug fixes.
#!/bin/bash
# find_bug_fixes.sh - Extract bug-fix commits
# Search commit messages for common bug-fix keywords.
# Multiple --grep patterns are ORed; -i makes matching case-insensitive.
git log --oneline --all -i \
--grep='fix' \
--grep='bug' \
--grep='issue' \
--grep='patch' \
--grep='resolve' \
--since='2023-01-01' \
> /tmp/bug_commits.txt
echo "Found $(wc -l < /tmp/bug_commits.txt) potential bug-fix commits"
# Show sample
head -20 /tmp/bug_commits.txt
Step 2: Extract Before/After Pairs
For each bug-fix commit, extract the code before and after the change.
import { execSync } from 'child_process';
import * as fs from 'fs';
interface CodePair {
commitSha: string;
message: string;
filePath: string;
before: string;
after: string;
linesChanged: number;
}
function extractCodePairs(commitSha: string): CodePair[] {
const pairs: CodePair[] = [];
// Get commit message
const message = execSync(
`git log -1 --format='%s' ${commitSha}`,
{ encoding: 'utf-8' }
).trim();
// Get list of modified files
const files = execSync(
`git diff-tree --no-commit-id --name-only -r ${commitSha}`,
{ encoding: 'utf-8' }
).trim().split('\n').filter(f => /\.(ts|tsx|js|jsx)$/.test(f));
for (const filePath of files) {
try {
// Get file content BEFORE commit (parent)
const before = execSync(
`git show ${commitSha}^:${filePath}`,
{ encoding: 'utf-8' }
);
// Get file content AFTER commit
const after = execSync(
`git show ${commitSha}:${filePath}`,
{ encoding: 'utf-8' }
);
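// Count lines changed (added + deleted).
// --numstat prints "<added>\t<deleted>\t<path>", which is safer to parse than
// --stat, where the first number found may be a digit inside the file path.
const numstat = execSync(
`git diff --numstat ${commitSha}^ ${commitSha} -- ${filePath}`,
{ encoding: 'utf-8' }
);
const [added, deleted] = numstat.trim().split('\t');
// Binary files report "-" for both counts; treat those as 0
const linesChanged = (parseInt(added, 10) || 0) + (parseInt(deleted, 10) || 0);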
pairs.push({
commitSha,
message,
filePath,
before,
after,
linesChanged
});
} catch (e) {
// File might not exist in parent (new file) - skip
}
}
return pairs;
}
Step 3: Structure as Golden Dataset
Convert raw pairs into structured test cases with metadata.
interface GoldenExample {
id: string;
input: {
code: string;
context: string;
};
expected: {
hasIssue: boolean;
issueType?: string;
fixedCode?: string;
};
metadata: {
source: 'git_history' | 'synthetic' | 'production';
difficulty: 'easy' | 'medium' | 'hard';
category: string;
commitSha?: string;
addedDate: string;
};
}
function pairToGoldenExample(pair: CodePair): GoldenExample {
// Classify difficulty based on lines changed
const difficulty = pair.linesChanged < 10 ? 'easy'
: pair.linesChanged < 50 ? 'medium'
: 'hard';
// Extract issue type from commit message
const issueType = extractIssueType(pair.message);
return {
id: `git_${pair.commitSha.slice(0, 8)}_${pair.filePath.replace(/\//g, '_')}`,
input: {
code: pair.before,
context: `File: ${pair.filePath}\nReview this code for potential issues.`
},
expected: {
hasIssue: true, // Assumes the commit is a real bug fix; keyword matches can be false positives
issueType,
fixedCode: pair.after
},
metadata: {
source: 'git_history',
difficulty,
category: issueType,
commitSha: pair.commitSha,
addedDate: new Date().toISOString().split('T')[0]
}
};
}
function extractIssueType(message: string): string {
const lower = message.toLowerCase();
if (lower.includes('null') || lower.includes('undefined')) return 'null_safety';
if (lower.includes('type') || lower.includes('typescript')) return 'type_error';
if (lower.includes('security') || lower.includes('xss') || lower.includes('injection')) return 'security';
if (lower.includes('performance') || lower.includes('memory')) return 'performance';
if (lower.includes('race') || lower.includes('async')) return 'concurrency';
return 'general';
}
Step 4: Generate Synthetic Edge Cases
For rare scenarios, generate synthetic test cases using a strong model.
import OpenAI from 'openai';
const openai = new OpenAI();
async function generateSyntheticEdgeCases(
existingExamples: GoldenExample[],
category: string,
count: number
): Promise<GoldenExample[]> {
// Find examples in this category for context
const categoryExamples = existingExamples
.filter(e => e.metadata.category === category)
.slice(0, 3);
const prompt = `You are generating test cases for an AI code review tool.
Category: ${category}
Here are some real examples from git history:
${categoryExamples.map(e => `Input:\n${e.input.code.slice(0, 500)}\n\nIssue: ${e.expected.issueType}`).join('\n\n---\n\n')}
Generate ${count} NEW synthetic examples that are:
1. Challenging but realistic
2. Different from the examples above
3. Cover edge cases not represented
For each example, provide:
- buggy_code: The code with the bug
- fixed_code: The corrected version
- issue_description: What's wrong
- difficulty: easy/medium/hard
Respond as a JSON object of the form {"examples": [...]}.`;
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [{ role: 'user', content: prompt }],
response_format: { type: 'json_object' }
});
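// JSON mode returns a single object; fall back to an empty list if the shape is unexpected
const generated = JSON.parse(response.choices[0].message.content ?? '{}');
return (generated.examples ?? []).map((ex: any, i: number) => ({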
id: `synthetic_${category}_${Date.now()}_${i}`,
input: {
code: ex.buggy_code,
context: 'Review this code for potential issues.'
},
expected: {
hasIssue: true,
issueType: category,
fixedCode: ex.fixed_code
},
metadata: {
source: 'synthetic' as const,
difficulty: ex.difficulty,
category,
addedDate: new Date().toISOString().split('T')[0]
}
}));
}
Step 5: Balance and Validate Dataset
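The step title mentions validation as well as balance. A minimal validation sketch follows, assuming three checks that seem reasonable for this dataset: unique ids, non-empty input code, and a fix that actually differs from the input. `ExampleLike` and `validateExamples` are hypothetical names, not part of the tutorial's code.

```typescript
// Minimal shape needed for validation; the tutorial's GoldenExample has more fields.
interface ExampleLike {
  id: string;
  input: { code: string };
  expected: { hasIssue: boolean; fixedCode?: string };
}

// Returns human-readable problems; an empty array means every check passed.
function validateExamples(examples: ExampleLike[]): string[] {
  const problems: string[] = [];
  const seenIds = new Set<string>();
  for (const ex of examples) {
    if (seenIds.has(ex.id)) problems.push(`${ex.id}: duplicate id`);
    seenIds.add(ex.id);
    if (ex.input.code.trim().length === 0) {
      problems.push(`${ex.id}: empty input code`);
    }
    // A "bug fix" whose fixed code equals the input cannot be a useful test case
    if (ex.expected.hasIssue && ex.expected.fixedCode === ex.input.code) {
      problems.push(`${ex.id}: fixed code is identical to input`);
    }
  }
  return problems;
}
```

Running a check like this before the balance analysis keeps malformed examples out of the statistics.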
function analyzeDataset(examples: GoldenExample[]): void {
const stats = {
total: examples.length,
bySource: {} as Record<string, number>,
byDifficulty: {} as Record<string, number>,
byCategory: {} as Record<string, number>
};
for (const ex of examples) {
stats.bySource[ex.metadata.source] = (stats.bySource[ex.metadata.source] || 0) + 1;
stats.byDifficulty[ex.metadata.difficulty] = (stats.byDifficulty[ex.metadata.difficulty] || 0) + 1;
stats.byCategory[ex.metadata.category] = (stats.byCategory[ex.metadata.category] || 0) + 1;
}
console.log('Dataset Analysis:');
console.log(` Total examples: ${stats.total}`);
console.log('\n By Source:');
for (const [source, count] of Object.entries(stats.bySource)) {
console.log(` ${source}: ${count} (${(count/stats.total*100).toFixed(1)}%)`);
}
console.log('\n By Difficulty:');
for (const [diff, count] of Object.entries(stats.byDifficulty)) {
console.log(` ${diff}: ${count} (${(count/stats.total*100).toFixed(1)}%)`);
}
// Check balance recommendations
const easyPct = (stats.byDifficulty['easy'] || 0) / stats.total;
const hardPct = (stats.byDifficulty['hard'] || 0) / stats.total;
if (easyPct > 0.7) {
console.log('\n WARNING: Dataset too easy (>70% easy). Add harder examples.');
}
if (hardPct < 0.1) {
console.log('\n WARNING: Not enough hard examples (<10%). Generate adversarial cases.');
}
}
// Save dataset with versioning
function saveDataset(examples: GoldenExample[], version: string): void {
const dataset = {
version,
createdAt: new Date().toISOString(),
stats: {
total: examples.length,
sources: [...new Set(examples.map(e => e.metadata.source))],
categories: [...new Set(examples.map(e => e.metadata.category))]
},
examples
};
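// writeFileSync does not create missing directories, so create the folder first
fs.mkdirSync('datasets', { recursive: true });
fs.writeFileSync(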
`datasets/golden_v${version}.json`,
JSON.stringify(dataset, null, 2)
);
console.log(`Saved dataset v${version} with ${examples.length} examples`);
}
Your Task
Build a complete golden dataset pipeline:
- Find bug-fix commits in a real repository
- Extract before/after pairs for modified files
- Structure as golden examples with proper metadata
- Generate synthetic edge cases for underrepresented categories
- Analyze and balance the final dataset
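The fourth item requires deciding which categories count as underrepresented. A minimal sketch, assuming a simple minimum-count threshold: the helper name `underrepresentedCategories` and its `knownCategories` and `minCount` parameters are hypothetical, and the threshold itself is a choice you would tune for your repository.

```typescript
// Hypothetical helper: list categories with fewer than `minCount` examples.
// `knownCategories` lets categories with zero examples appear in the result.
function underrepresentedCategories(
  examples: { metadata: { category: string } }[],
  knownCategories: string[],
  minCount: number
): string[] {
  const counts: Record<string, number> = {};
  for (const category of knownCategories) counts[category] = 0;
  for (const ex of examples) {
    const category = ex.metadata.category;
    counts[category] = (counts[category] ?? 0) + 1;
  }
  return Object.keys(counts).filter(c => (counts[c] ?? 0) < minCount);
}
```

Each returned category is a candidate for the synthetic generation step.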
3. Try It Yourself
import { execSync } from 'child_process';
import * as fs from 'fs';
interface GoldenExample {
id: string;
input: { code: string; context: string };
expected: { hasIssue: boolean; issueType?: string };
metadata: {
source: 'git_history' | 'synthetic';
difficulty: 'easy' | 'medium' | 'hard';
category: string;
addedDate: string;
};
}
// TODO: Implement these functions
function findBugFixCommits(): string[] {
// Search git log for bug-fix commits
throw new Error('Not implemented');
}
function extractCodePair(commitSha: string): { before: string; after: string } | null {
// Get file content before and after commit
throw new Error('Not implemented');
}
function classifyDifficulty(linesChanged: number): 'easy' | 'medium' | 'hard' {
// Classify based on complexity
throw new Error('Not implemented');
}
function analyzeDataset(examples: GoldenExample[]): void {
// Print stats and balance warnings
throw new Error('Not implemented');
}
// Main
const commits = findBugFixCommits();
console.log(`Found ${commits.length} bug-fix commits`);
This TypeScript exercise requires local setup. Copy the code to your IDE to run.
4. Get Help (If Needed)
5. Check the Solution
import { execSync } from 'child_process';
import * as fs from 'fs';
interface GoldenExample {
id: string;
input: { code: string; context: string };
expected: { hasIssue: boolean; issueType?: string; fixedCode?: string };
metadata: {
source: 'git_history' | 'synthetic';
difficulty: 'easy' | 'medium' | 'hard';
category: string;
commitSha?: string;
addedDate: string;
};
}
function findBugFixCommits(): string[] {
const output = execSync(
`git log --oneline --all -i --grep='fix' --grep='bug' --since='2023-01-01' | head -100`,
{ encoding: 'utf-8' }
);
return output.trim().split('\n').map(line => line.split(' ')[0]).filter(Boolean);
}
function extractCodePair(commitSha: string): {
before: string;
after: string;
filePath: string;
message: string;
linesChanged: number;
} | null {
try {
const message = execSync(`git log -1 --format='%s' ${commitSha}`, { encoding: 'utf-8' }).trim();
const files = execSync(
`git diff-tree --no-commit-id --name-only -r ${commitSha}`,
{ encoding: 'utf-8' }
).trim().split('\n').filter(f => /\.(ts|tsx|js|jsx)$/.test(f));
if (files.length === 0) return null;
const filePath = files[0]; // Take first TypeScript/JavaScript file
const before = execSync(`git show ${commitSha}^:${filePath}`, { encoding: 'utf-8' });
const after = execSync(`git show ${commitSha}:${filePath}`, { encoding: 'utf-8' });
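// --numstat prints "<added>\t<deleted>\t<path>"; --stat output can start with digits from the path
const numstat = execSync(`git diff --numstat ${commitSha}^ ${commitSha} -- ${filePath}`, { encoding: 'utf-8' });
const [added, deleted] = numstat.trim().split('\t');
const linesChanged = (parseInt(added, 10) || 0) + (parseInt(deleted, 10) || 0);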
return { before, after, filePath, message, linesChanged };
} catch {
return null;
}
}
function classifyDifficulty(linesChanged: number): 'easy' | 'medium' | 'hard' {
if (linesChanged < 10) return 'easy';
if (linesChanged < 50) return 'medium';
return 'hard';
}
function extractIssueType(message: string): string {
const lower = message.toLowerCase();
if (lower.includes('null') || lower.includes('undefined')) return 'null_safety';
if (lower.includes('type')) return 'type_error';
if (lower.includes('security') || lower.includes('xss')) return 'security';
return 'general';
}
function analyzeDataset(examples: GoldenExample[]): void {
const byDiff: Record<string, number> = { easy: 0, medium: 0, hard: 0 };
const byCat: Record<string, number> = {};
for (const ex of examples) {
byDiff[ex.metadata.difficulty]++;
byCat[ex.metadata.category] = (byCat[ex.metadata.category] || 0) + 1;
}
console.log('\nDataset Analysis:');
console.log(` Total: ${examples.length}`);
console.log('\n By Difficulty:');
for (const [d, c] of Object.entries(byDiff)) {
console.log(` ${d}: ${c} (${(c/examples.length*100).toFixed(0)}%)`);
}
console.log('\n By Category:');
for (const [cat, c] of Object.entries(byCat)) {
console.log(` ${cat}: ${c}`);
}
// Balance warnings
if (byDiff.easy / examples.length > 0.7) {
console.log('\n ⚠️ Too easy - add harder examples');
}
if (byDiff.hard / examples.length < 0.1) {
console.log('\n ⚠️ Not enough hard examples');
}
}
// Main execution
const commits = findBugFixCommits();
console.log(`Found ${commits.length} bug-fix commits`);
const examples: GoldenExample[] = [];
for (const sha of commits.slice(0, 50)) { // Process first 50
const pair = extractCodePair(sha);
if (!pair) continue;
examples.push({
id: `git_${sha.slice(0, 8)}`,
input: {
code: pair.before,
context: `File: ${pair.filePath}\nReview for bugs.`
},
expected: {
hasIssue: true,
issueType: extractIssueType(pair.message),
fixedCode: pair.after
},
metadata: {
source: 'git_history',
difficulty: classifyDifficulty(pair.linesChanged),
category: extractIssueType(pair.message),
commitSha: sha,
addedDate: new Date().toISOString().split('T')[0]
}
});
}
analyzeDataset(examples);
// Save
fs.mkdirSync('datasets', { recursive: true });
fs.writeFileSync('datasets/golden_v1.json', JSON.stringify({ examples }, null, 2));
console.log(`\nSaved ${examples.length} examples to datasets/golden_v1.json`);
Common Mistakes
Not handling commits where file doesn't exist in parent
Why it's wrong: New files have no parent version, so git show fails for them.
How to fix: Wrap the call in try/catch and skip files that the commit adds.
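The failure can also be avoided up front. `git diff-tree --no-commit-id --name-status -r <sha>` prints a status letter and a path per line, separated by a tab (A for added, M for modified, D for deleted). A minimal sketch of a parser that keeps only modified files follows; the function name `modifiedPaths` is hypothetical.

```typescript
// Keep only paths with status "M" (modified) from `git diff-tree --name-status` output.
// Added (A) and deleted (D) files are dropped because one side of the pair is missing.
function modifiedPaths(nameStatus: string): string[] {
  const paths: string[] = [];
  for (const line of nameStatus.split('\n')) {
    const [status, path] = line.split('\t');
    if (status === 'M' && path) paths.push(path);
  }
  return paths;
}
```

Passing `--diff-filter=M` to `git diff-tree` should give the same result without any parsing; the try/catch remains useful as a safety net for root commits and other edge cases.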
Including all file types
Why it's wrong: A dataset for code review should contain only code files, not configs or docs.
How to fix: Filter for .ts, .js, .tsx, .jsx extensions only.
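A minimal sketch of such a filter, assuming exactly the four extensions listed above; `CODE_EXTENSIONS` and `isCodeFile` are hypothetical names.

```typescript
// Extensions treated as reviewable code, taken from the list above
const CODE_EXTENSIONS = ['.ts', '.tsx', '.js', '.jsx'];

// True when the path ends with one of the code extensions (case-insensitive).
function isCodeFile(filePath: string): boolean {
  const lower = filePath.toLowerCase();
  return CODE_EXTENSIONS.some(ext => lower.endsWith(ext));
}
```

Because the check uses `endsWith`, a file such as `package.json` is rejected even though its name contains `.js`.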
No difficulty balancing
Why it's wrong: A dataset with 90% easy examples gives false confidence in eval results.
How to fix: Analyze distribution and generate synthetic hard cases to balance.
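To turn the analysis into a concrete target, you can compute how many hard examples to add. If the dataset has `total` examples, `hard` of them hard, and the minimum share is `s`, then adding `x` hard examples requires `(hard + x) / (total + x) >= s`, which gives `x >= (s * total - hard) / (1 - s)`. A minimal sketch under that assumption follows; `hardExamplesNeeded` is a hypothetical helper.

```typescript
// How many hard examples must be added so that hard examples make up
// at least `minHardShare` of the dataset (e.g. 0.1 for the 10% warning threshold).
function hardExamplesNeeded(
  counts: { easy: number; medium: number; hard: number },
  minHardShare: number
): number {
  if (minHardShare <= 0) return 0;
  if (minHardShare >= 1) throw new RangeError('minHardShare must be below 1');
  const total = counts.easy + counts.medium + counts.hard;
  // Solve (hard + x) / (total + x) >= minHardShare for x
  const needed = (minHardShare * total - counts.hard) / (1 - minHardShare);
  return Math.max(0, Math.ceil(needed));
}
```

Adding hard examples also lowers the easy share, but not necessarily below the 70% threshold, so re-run the analysis after generating them.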
Test Cases
Finds bug-fix commits
Should return array of commit SHAs from git history
Input: Repository with bug-fix commits
Expected: Array of SHA strings, length > 0
Extracts before/after pairs
Should extract code from commit and its parent
Input: Valid commit SHA with TypeScript file changes
Expected: Object with before, after, filePath, linesChanged
Classifies difficulty correctly
Should classify based on lines changed
Input: linesChanged: 5
Expected: easy
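The third test case needs no repository, so it can be written as plain assertions. A minimal sketch follows, using the thresholds from the solution's `classifyDifficulty`; the `check` helper is hypothetical and stands in for whatever test runner you use, and the boundary values beyond `linesChanged: 5` are additions implied by those thresholds.

```typescript
// Same thresholds as the solution: fewer than 10 lines is easy, fewer than 50 is medium
function classifyDifficulty(linesChanged: number): 'easy' | 'medium' | 'hard' {
  if (linesChanged < 10) return 'easy';
  if (linesChanged < 50) return 'medium';
  return 'hard';
}

// Hypothetical assertion helper: throws with a readable message on mismatch
function check(actual: string, expected: string, label: string): void {
  if (actual !== expected) {
    throw new Error(`${label}: expected ${expected}, got ${actual}`);
  }
}

check(classifyDifficulty(5), 'easy', 'linesChanged: 5');
check(classifyDifficulty(9), 'easy', 'linesChanged: 9');
check(classifyDifficulty(10), 'medium', 'linesChanged: 10');
check(classifyDifficulty(49), 'medium', 'linesChanged: 49');
check(classifyDifficulty(50), 'hard', 'linesChanged: 50');
```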