Build an LLM-as-Judge Evaluation Pipeline
Learn to evaluate AI outputs using model-graded evaluation (LLM-as-Judge), the pattern where a stronger model grades outputs from weaker models.
1. Understand the Scenario
You're building a healthcare summarization system that condenses discharge notes. Traditional unit tests won't work because valid summaries can vary. You'll build an LLM-as-Judge pipeline to evaluate summary quality at scale.
Learning Objectives
- Understand why traditional tests fail for LLM outputs
- Implement the LLM-as-Judge pattern using model-graded evaluation
- Design evaluation rubrics with specific, measurable criteria
- Calculate Pass@k metrics for production readiness
- Use property-based assertions instead of exact-match tests
2. Follow the Instructions
The Problem: Non-Deterministic Testing
In traditional software, func(A) always returns B. In AI software, func(A) returns B most of the time, but sometimes B' or C. In healthcare, C might be a patient safety risk.
Why Traditional Tests Fail:
// ❌ BAD: Exact match (fails if AI says 'T2D' instead of 'Type 2 Diabetes')
assert(summary === 'Patient has Type 2 Diabetes.');
// ✅ GOOD: Property-based assertions (always true for valid summaries)
assert(summary.includes('Diabetes')); // Contains diagnosis
assert(summary.length < originalText.length); // Actually compressed
assert(!detectPHI(summary)); // No PHI leaked
assert(isValidJSON(output)); // Structure is valid
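The detectPHI and isValidJSON helpers above are assumed rather than provided. A minimal sketch of what they might look like (the regexes are illustrative only, not production-grade PHI detection):
// Illustrative helpers for the property checks above (hypothetical, not a real library).
// Real PHI detection needs a dedicated tool; these regexes only sketch the idea.
function detectPHI(text: string): boolean {
  const patterns = [
    /\b\d{3}-\d{2}-\d{4}\b/, // SSN-shaped number
    /\bDOB[:\s]/i,           // explicit date-of-birth label
    /\bMRN[:\s]*\d+/i        // medical record number
  ];
  return patterns.some(p => p.test(text));
}

function isValidJSON(text: string): boolean {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}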
The Solution: Evals vs Tests
| Feature | Unit Test | AI Eval |
|---|---|---|
| Assertion | assert result === 'Diabetes' | assert similarity(result, 'Diabetes') > 0.85 |
| Logic | Binary (Pass/Fail) | Statistical (Pass Rate over N runs) |
| Tooling | Jest, PyTest | OpenAI Evals, Promptfoo, RAGAS |
| Cost | Milliseconds | Seconds (requires LLM call) |
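The similarity() call in the table is shorthand. One way to implement it is cosine similarity over embeddings; this sketch assumes the openai SDK and the text-embedding-3-small model, and the 0.85 threshold is illustrative:
import OpenAI from 'openai';

// Sketch: semantic similarity as cosine similarity between embeddings.
async function similarity(a: string, b: string): Promise<number> {
  const openai = new OpenAI();
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: [a, b]
  });
  const [va, vb] = [res.data[0].embedding, res.data[1].embedding];
  const dot = va.reduce((sum, x, i) => sum + x * vb[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return dot / (norm(va) * norm(vb));
}

// Usage: assert(await similarity(result, 'Diabetes') > 0.85);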
The Key Metric: Pass@k
Don't measure whether a prompt works once. Measure Pass@k: 'If we run this prompt 100 times, how many runs produce valid output?'
- Mature system: 99/100 runs pass (99% pass rate)
- Immature system: 85/100 runs pass (15 out of 100 users get bad data)
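In code, the metric is just a pass rate over repeated runs. A minimal sketch (generate and isValid are placeholders for your generator and property checks):
// Minimal sketch: run the same prompt k times, report the fraction of valid outputs.
async function passAtK(
  generate: () => Promise<string>,
  isValid: (output: string) => boolean,
  k: number = 100
): Promise<number> {
  let passes = 0;
  for (let i = 0; i < k; i++) {
    if (isValid(await generate())) passes++;
  }
  return passes / k; // 0.99+ for a mature system
}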
Step 1: Define Your Evaluation Rubric
The judge model needs clear criteria. Each criterion should be:
- Specific — Not 'good summary' but 'contains primary diagnosis'
- Binary — Either meets criterion or doesn't
- Independent — Can evaluate one criterion without others
// Define the evaluation rubric
const RUBRIC = {
criteria: [
{
id: 'contains_diagnosis',
description: 'Summary mentions the primary diagnosis from the original note',
weight: 3 // Critical - must have
},
{
id: 'contains_followup',
description: 'Summary mentions follow-up care instructions or appointments',
weight: 2
},
{
id: 'no_phi_leaked',
description: 'Summary does not contain patient names, DOB, or identifiers',
weight: 3 // Critical - safety
},
{
id: 'compression',
description: 'Summary is shorter than original while retaining key information',
weight: 1
},
{
id: 'clinical_accuracy',
description: 'No hallucinated medical facts or contraindicated information',
weight: 3 // Critical - safety
}
],
passing_threshold: 0.8, // 80% weighted score to pass
max_score: 12 // Sum of all weights
};
Step 2: Build the Judge Prompt
The judge model evaluates outputs against the rubric. Use chain-of-thought prompting (the cot_classify approach from OpenAI Evals) so the judge reasons before it grades, which improves accuracy.
const judgeSystemPrompt = `You are an expert medical documentation reviewer.
You will evaluate summaries of discharge notes against specific criteria.
For each criterion, you must:
1. Quote the relevant part of the summary (or note if absent)
2. Explain your reasoning
3. Mark as PASS or FAIL
Be strict but fair. When in doubt, FAIL.`;
const buildJudgePrompt = (
originalNote: string,
summary: string,
rubric: typeof RUBRIC
) => {
const criteriaList = rubric.criteria
.map((c, i) => `${i + 1}. ${c.id}: ${c.description} (weight: ${c.weight})`)
.join('\n');
return `## Original Discharge Note
${originalNote}
## Generated Summary
${summary}
## Evaluation Criteria
${criteriaList}
Evaluate the summary against each criterion. For each:
1. Quote relevant evidence
2. Explain reasoning
3. Mark PASS or FAIL`;
};
Step 3: Define Structured Output for Judge
Use structured outputs to guarantee the judge's response format.
import OpenAI from 'openai';
import { z } from 'zod';
const EvaluationResultSchema = z.object({
evaluations: z.array(z.object({
criterion_id: z.string(),
evidence: z.string().describe('Quote from the summary, or the string "not found" if absent'),
reasoning: z.string(),
passed: z.boolean()
})),
total_score: z.number(),
max_score: z.number(),
passed: z.boolean(),
summary_feedback: z.string()
});
type EvaluationResult = z.infer<typeof EvaluationResultSchema>;
// JSON Schema for OpenAI structured outputs
const evaluationJsonSchema = {
type: 'object',
properties: {
evaluations: {
type: 'array',
items: {
type: 'object',
properties: {
criterion_id: { type: 'string' },
evidence: { type: 'string' },
reasoning: { type: 'string' },
passed: { type: 'boolean' }
},
required: ['criterion_id', 'evidence', 'reasoning', 'passed'],
additionalProperties: false
}
},
total_score: { type: 'number' },
max_score: { type: 'number' },
passed: { type: 'boolean' },
summary_feedback: { type: 'string' }
},
required: ['evaluations', 'total_score', 'max_score', 'passed', 'summary_feedback'],
additionalProperties: false
};
Step 4: Run the Judge
Use a stronger model (here, GPT-4o) to judge outputs from the weaker generator model.
async function evaluateSummary(
originalNote: string,
summary: string,
rubric: typeof RUBRIC
): Promise<EvaluationResult> {
const openai = new OpenAI();
const response = await openai.chat.completions.create({
model: 'gpt-4o', // Use stronger model as judge
messages: [
{ role: 'system', content: judgeSystemPrompt },
{ role: 'user', content: buildJudgePrompt(originalNote, summary, rubric) }
],
response_format: {
type: 'json_schema',
json_schema: {
name: 'evaluation_result',
strict: true,
schema: evaluationJsonSchema
}
}
});
const rawResult = JSON.parse(response.choices[0].message.content!);
// Calculate score from evaluations
let totalScore = 0;
for (const evaluation of rawResult.evaluations) { // note: 'eval' is a reserved word in strict mode
  const criterion = rubric.criteria.find(c => c.id === evaluation.criterion_id);
  if (criterion && evaluation.passed) {
    totalScore += criterion.weight;
  }
}
rawResult.total_score = totalScore;
rawResult.max_score = rubric.max_score;
rawResult.passed = (totalScore / rubric.max_score) >= rubric.passing_threshold;
// Validate with Zod
return EvaluationResultSchema.parse(rawResult);
}
Step 5: Calculate Pass@k
Run the evaluation multiple times to get statistical confidence.
async function calculatePassAtK(
generateSummary: (note: string) => Promise<string>,
testCases: { note: string; expected_diagnosis: string }[],
k: number = 100
): Promise<{ pass_rate: number; failures: EvaluationResult[] }> {
const results: EvaluationResult[] = [];
const failures: EvaluationResult[] = [];
for (const testCase of testCases) {
for (let i = 0; i < k; i++) {
const summary = await generateSummary(testCase.note);
const evaluation = await evaluateSummary(testCase.note, summary, RUBRIC);
results.push(evaluation);
if (!evaluation.passed) {
failures.push(evaluation);
}
}
}
const passRate = results.filter(r => r.passed).length / results.length;
console.log(`\nPass@${k} Results:`);
console.log(` Pass Rate: ${(passRate * 100).toFixed(1)}%`);
console.log(` Total Runs: ${results.length}`);
console.log(` Failures: ${failures.length}`);
// For healthcare: require 99%+ pass rate
if (passRate < 0.99) {
console.log(' ⚠️ WARNING: Pass rate below 99% - not production ready');
} else {
console.log(' ✅ Pass rate meets production threshold');
}
return { pass_rate: passRate, failures };
}
Your Task
Build the complete evaluation pipeline:
- Define the rubric with weighted criteria
- Implement the judge function with structured outputs
- Calculate Pass@k metrics
- Add property-based assertions for critical safety checks
3. Try It Yourself
import OpenAI from 'openai';
import { z } from 'zod';
const openai = new OpenAI();
// TODO: Define the evaluation rubric
const RUBRIC = {
criteria: [
// Add criteria here
],
passing_threshold: 0.8,
max_score: 0
};
// TODO: Define Zod schema for evaluation result
const EvaluationResultSchema = z.object({
// Add fields here
});
type EvaluationResult = z.infer<typeof EvaluationResultSchema>;
// TODO: Build judge prompt
function buildJudgePrompt(originalNote: string, summary: string): string {
throw new Error('Not implemented');
}
// TODO: Run evaluation
async function evaluateSummary(
originalNote: string,
summary: string
): Promise<EvaluationResult> {
throw new Error('Not implemented');
}
// TODO: Calculate Pass@k
async function calculatePassAtK(
testCases: { note: string }[],
k: number
): Promise<{ pass_rate: number }> {
throw new Error('Not implemented');
}
// Test data
const testNote = `
DISCHARGE SUMMARY
Patient: [REDACTED]
Admission Date: 2024-01-15
Discharge Date: 2024-01-18
Diagnosis: Type 2 Diabetes Mellitus with hyperglycemia
Hospital Course:
Patient presented with blood glucose of 450 mg/dL. Started on insulin
drip, transitioned to subcutaneous insulin. Glucose stabilized.
Discharge Medications:
- Metformin 1000mg twice daily
- Lantus 20 units at bedtime
Follow-up:
- Endocrinology appointment in 2 weeks
- Primary care in 1 week
`;
const testSummary = `
Patient admitted for Type 2 Diabetes with elevated blood sugar (450).
Treated with insulin, now stable. Discharged on Metformin and Lantus.
Follow-up with endocrinology in 2 weeks and primary care in 1 week.
`;
// Run test
evaluateSummary(testNote, testSummary).then(result => {
console.log('Evaluation Result:', JSON.stringify(result, null, 2));
});
This TypeScript exercise requires local setup. Copy the code to your IDE to run.
4. Get Help (If Needed)
5. Check the Solution
import OpenAI from 'openai';
import { z } from 'zod';
const openai = new OpenAI();
// Evaluation rubric with weighted criteria
const RUBRIC = {
criteria: [
{
id: 'contains_diagnosis',
description: 'Summary mentions the primary diagnosis from the original note',
weight: 3
},
{
id: 'contains_followup',
description: 'Summary mentions follow-up care instructions or appointments',
weight: 2
},
{
id: 'no_phi_leaked',
description: 'Summary does not contain patient names, DOB, SSN, or other identifiers',
weight: 3
},
{
id: 'compression',
description: 'Summary is shorter than original while retaining key information',
weight: 1
},
{
id: 'clinical_accuracy',
description: 'No hallucinated medical facts or contraindicated information',
weight: 3
}
],
passing_threshold: 0.8,
max_score: 12
};
// Zod schema for type-safe validation
const EvaluationResultSchema = z.object({
evaluations: z.array(z.object({
criterion_id: z.string(),
evidence: z.string(),
reasoning: z.string(),
passed: z.boolean()
})),
total_score: z.number(),
max_score: z.number(),
passed: z.boolean(),
summary_feedback: z.string()
});
type EvaluationResult = z.infer<typeof EvaluationResultSchema>;
// JSON Schema for OpenAI structured outputs
const evaluationJsonSchema = {
type: 'object',
properties: {
evaluations: {
type: 'array',
items: {
type: 'object',
properties: {
criterion_id: { type: 'string' },
evidence: { type: 'string' },
reasoning: { type: 'string' },
passed: { type: 'boolean' }
},
required: ['criterion_id', 'evidence', 'reasoning', 'passed'],
additionalProperties: false
}
},
total_score: { type: 'number' },
max_score: { type: 'number' },
passed: { type: 'boolean' },
summary_feedback: { type: 'string' }
},
required: ['evaluations', 'total_score', 'max_score', 'passed', 'summary_feedback'],
additionalProperties: false
};
const judgeSystemPrompt = `You are an expert medical documentation reviewer.
You evaluate discharge note summaries against specific quality criteria.
For each criterion:
1. Quote relevant evidence from the summary (or note "not found")
2. Explain your reasoning clearly
3. Mark as PASS or FAIL
Be strict but fair. Patient safety is paramount. When in doubt, FAIL.`;
function buildJudgePrompt(originalNote: string, summary: string): string {
const criteriaList = RUBRIC.criteria
.map((c, i) => `${i + 1}. ${c.id}: ${c.description} (weight: ${c.weight})`)
.join('\n');
return `## Original Discharge Note\n${originalNote}\n\n## Generated Summary\n${summary}\n\n## Evaluation Criteria\n${criteriaList}\n\nEvaluate the summary against EACH criterion listed above.`;
}
async function evaluateSummary(
originalNote: string,
summary: string
): Promise<EvaluationResult> {
// Property-based assertions (deterministic pre-checks)
// Allow 20% buffer for abbreviation expansion (e.g., "Pt w/ DM2" -> "Patient with Type 2 Diabetes")
// Dense clinical notes often use abbreviations that get expanded in summaries
const COMPRESSION_BUFFER = 1.2;
if (summary.length > originalNote.length * COMPRESSION_BUFFER) {
return {
evaluations: [{
criterion_id: 'compression',
evidence: `Summary: ${summary.length} chars, Original: ${originalNote.length} chars (max allowed: ${Math.floor(originalNote.length * COMPRESSION_BUFFER)})`,
reasoning: 'Summary exceeds 120% of original length - abbreviation expansion alone cannot explain this',
passed: false
}],
total_score: 0,
max_score: RUBRIC.max_score,
passed: false,
summary_feedback: 'Summary failed compression check before LLM evaluation'
};
}
// LLM-as-Judge evaluation
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [
{ role: 'system', content: judgeSystemPrompt },
{ role: 'user', content: buildJudgePrompt(originalNote, summary) }
],
response_format: {
type: 'json_schema',
json_schema: {
name: 'evaluation_result',
strict: true,
schema: evaluationJsonSchema
}
}
});
const rawResult = JSON.parse(response.choices[0].message.content!);
// Calculate score from evaluations (deterministic post-processing)
let totalScore = 0;
for (const evaluation of rawResult.evaluations) {
const criterion = RUBRIC.criteria.find(c => c.id === evaluation.criterion_id);
if (criterion && evaluation.passed) {
totalScore += criterion.weight;
}
}
rawResult.total_score = totalScore;
rawResult.max_score = RUBRIC.max_score;
rawResult.passed = (totalScore / RUBRIC.max_score) >= RUBRIC.passing_threshold;
// Validate with Zod
return EvaluationResultSchema.parse(rawResult);
}
async function calculatePassAtK(
generateSummary: (note: string) => Promise<string>,
testCases: { note: string }[],
k: number = 10
): Promise<{ pass_rate: number; failures: EvaluationResult[] }> {
const results: EvaluationResult[] = [];
const failures: EvaluationResult[] = [];
for (const testCase of testCases) {
for (let i = 0; i < k; i++) {
try {
const summary = await generateSummary(testCase.note);
const evaluation = await evaluateSummary(testCase.note, summary);
results.push(evaluation);
if (!evaluation.passed) {
failures.push(evaluation);
}
} catch (error) {
  // Count errors as failed runs; push to results too so the pass-rate denominator stays correct
  const errorResult: EvaluationResult = {
    evaluations: [],
    total_score: 0,
    max_score: RUBRIC.max_score,
    passed: false,
    summary_feedback: `Error: ${error}`
  };
  results.push(errorResult);
  failures.push(errorResult);
}
}
}
const passRate = (results.length - failures.length) / results.length;
console.log(`\n📊 Pass@${k} Results:`);
console.log(` Pass Rate: ${(passRate * 100).toFixed(1)}%`);
console.log(` Total Runs: ${results.length}`);
console.log(` Failures: ${failures.length}`);
if (passRate < 0.99) {
console.log(' ⚠️ Below 99% - not production ready for healthcare');
} else {
console.log(' ✅ Meets production threshold');
}
return { pass_rate: passRate, failures };
}
// Test data
const testNote = `
DISCHARGE SUMMARY
Patient: [REDACTED]
Admission Date: 2024-01-15
Discharge Date: 2024-01-18
Diagnosis: Type 2 Diabetes Mellitus with hyperglycemia
Hospital Course:
Patient presented with blood glucose of 450 mg/dL. Started on insulin
drip, transitioned to subcutaneous insulin. Glucose stabilized.
Discharge Medications:
- Metformin 1000mg twice daily
- Lantus 20 units at bedtime
Follow-up:
- Endocrinology appointment in 2 weeks
- Primary care in 1 week
`;
const testSummary = `
Patient admitted for Type 2 Diabetes with elevated blood sugar (450).
Treated with insulin, now stable. Discharged on Metformin and Lantus.
Follow-up with endocrinology in 2 weeks and primary care in 1 week.
`;
// Run single evaluation
evaluateSummary(testNote, testSummary).then(result => {
console.log('\n📋 Single Evaluation:');
console.log(` Passed: ${result.passed}`);
console.log(` Score: ${result.total_score}/${result.max_score}`);
console.log('\n Criteria Results:');
for (const e of result.evaluations) {
console.log(` ${e.passed ? '✅' : '❌'} ${e.criterion_id}`);
console.log(` Evidence: ${e.evidence.substring(0, 80)}...`);
}
});
/* Expected output:
📋 Single Evaluation:
Passed: true
Score: 12/12
Criteria Results:
✅ contains_diagnosis
Evidence: "Type 2 Diabetes with elevated blood sugar (450)"...
✅ contains_followup
Evidence: "Follow-up with endocrinology in 2 weeks"...
✅ no_phi_leaked
Evidence: No patient identifiers found...
✅ compression
Evidence: Summary is 203 chars, original is 512 chars...
✅ clinical_accuracy
Evidence: Medications and diagnoses match original...
*/
Common Mistakes
Using exact-match assertions for LLM outputs
Why it's wrong: LLM outputs are non-deterministic. 'Type 2 Diabetes' vs 'T2D' vs 'Diabetes Mellitus Type II' are all valid.
How to fix: Use property-based assertions (contains, length, format) or semantic similarity instead of exact matches.
Using the same model to judge itself
Why it's wrong: A model is biased toward its own outputs, so the generator's errors and blind spots propagate into the grading.
How to fix: Use a stronger model (GPT-4) to judge outputs from weaker models (GPT-3.5), or aggregate verdicts from multiple diverse models, as sketched below.
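A sketch of the multi-judge option (the Judge type and the strict-majority rule are illustrative choices, not a fixed recipe):
// Sketch: aggregate verdicts from several diverse judge models by majority vote.
type Judge = (note: string, summary: string) => Promise<boolean>;

async function majorityVote(
  judges: Judge[],
  note: string,
  summary: string
): Promise<boolean> {
  const votes = await Promise.all(judges.map(judge => judge(note, summary)));
  return votes.filter(Boolean).length > judges.length / 2; // strict majority to pass
}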
Running evaluation only once
Why it's wrong: Single runs don't capture the variance in LLM outputs. You might get lucky or unlucky.
How to fix: Calculate Pass@k over many runs (k=100 for production readiness) to get statistical confidence.
No deterministic pre/post processing
Why it's wrong: Wrapping probabilistic calls without deterministic layers makes results unpredictable.
How to fix: Use the Sandwich Pattern: deterministic pre-processing (schema) → LLM call → deterministic validation (Zod).
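The shape of the pattern, as a sketch (callModel is a placeholder for whatever LLM client you use; the schema is illustrative):
import { z } from 'zod';

// Sketch of the Sandwich Pattern: deterministic layers wrap the probabilistic call.
const OutputSchema = z.object({ summary: z.string() });

async function sandwich(
  note: string,
  callModel: (prompt: string) => Promise<string> // placeholder LLM client
): Promise<z.infer<typeof OutputSchema>> {
  if (note.trim().length === 0) throw new Error('Empty input'); // deterministic pre-check
  const raw = await callModel(`Summarize: ${note}`);            // probabilistic core
  return OutputSchema.parse(JSON.parse(raw));                   // deterministic validation
}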
Strict character count for compression checks
Why it's wrong: Dense clinical notes use abbreviations (Pt, DM2, w/) that expand in summaries. A summary can be longer if it expands 'Pt w/ DM2' to 'Patient with Type 2 Diabetes'.
How to fix: Allow a buffer (e.g., 120% of original) for abbreviation expansion, or check 'information density' rather than strict length.
Test Cases
| Test | Description | Input | Expected |
|---|---|---|---|
| Passes valid summary | A good summary should pass all criteria | Test note + valid summary with diagnosis, follow-up, no PHI | passed: true, total_score >= 10 |
| Fails PHI leak | Summary containing patient name should fail | Test note + summary with 'John Doe' included | passed: false, no_phi_leaked: false |
| Fails compression | Summary longer than original should fail | Short note + very long summary | passed: false, compression: false |