
Build an LLM-as-Judge Evaluation Pipeline

Build · Intermediate · 45 min · TypeScript

Learn to evaluate AI outputs using model-graded evaluation (LLM-as-Judge), the pattern where a stronger model grades outputs from weaker models.

1. Understand the Scenario

You're building a healthcare summarization system that condenses discharge notes. Traditional unit tests won't work because valid summaries can vary. You'll build an LLM-as-Judge pipeline to evaluate summary quality at scale.

Learning Objectives

  • Understand why traditional tests fail for LLM outputs
  • Implement the LLM-as-Judge pattern using model-graded evaluation
  • Design evaluation rubrics with specific, measurable criteria
  • Calculate Pass@k metrics for production readiness
  • Use property-based assertions instead of exact-match tests

2. Follow the Instructions

The Problem: Non-Deterministic Testing

In traditional software, func(A) always returns B. In AI software, func(A) usually returns B, but sometimes B' or C. In healthcare, C might be a patient safety risk.

Why Traditional Tests Fail:

// ❌ BAD: Exact match (fails if AI says 'T2D' instead of 'Type 2 Diabetes')
assert(summary === 'Patient has Type 2 Diabetes.');

// ✅ GOOD: Property-based assertions (always true for valid summaries)
assert(summary.includes('Diabetes'));           // Contains diagnosis
assert(summary.length < originalText.length);  // Actually compressed
assert(!detectPHI(summary));                    // No PHI leaked
assert(isValidJSON(output));                    // Structure is valid
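
The detectPHI and isValidJSON helpers above are assumptions, not library functions. A minimal sketch might look like the following; a real PHI detector should use a dedicated de-identification service, since a regex pass only catches obvious patterns.

helpers_sketch.ts
// Hypothetical helpers assumed by the assertions above (a minimal sketch).
function detectPHI(text: string): boolean {
  const phiPatterns = [
    /\b\d{3}-\d{2}-\d{4}\b/,               // SSN-like pattern
    /\b(19|20)\d{2}-\d{2}-\d{2}\b/,        // ISO dates (possible DOB)
    /\b(Mr|Mrs|Ms|Dr)\.\s+[A-Z][a-z]+/     // Honorific followed by a name
  ];
  return phiPatterns.some(p => p.test(text));
}

function isValidJSON(text: string): boolean {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}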

The Solution: Evals vs Tests

| Feature   | Unit Test                    | AI Eval                                      |
|-----------|------------------------------|----------------------------------------------|
| Assertion | assert result === 'Diabetes' | assert similarity(result, 'Diabetes') > 0.85 |
| Logic     | Binary (Pass/Fail)           | Statistical (pass rate over N runs)          |
| Tooling   | Jest, PyTest                 | OpenAI Evals, Promptfoo, RAGAS               |
| Cost      | Milliseconds                 | Seconds (requires an LLM call)               |

The Key Metric: Pass@k

Don't measure whether the prompt works once. Measure Pass@k: 'If we run this prompt 100 times, how many runs produce valid output?'

  • Mature system: 99/100 (99% Pass@100)
  • Immature system: 85/100 (15 out of 100 users get bad data)
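
In code, Pass@k as used in this lesson is simply the pass rate over k repeated runs (classic Pass@k in code-generation papers instead measures the probability that at least one of k samples passes). A one-line sketch:

pass_rate_sketch.ts
// Pass rate over k repeated runs -- the metric as used in this lesson.
const runs: boolean[] = [true, true, false, true];           // 4 sample runs
const passRate = runs.filter(Boolean).length / runs.length;  // 0.75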

Step 1: Define Your Evaluation Rubric

The judge model needs clear criteria. Each criterion should be:

  • Specific — Not 'good summary' but 'contains primary diagnosis'
  • Binary — Either meets criterion or doesn't
  • Independent — Can evaluate one criterion without others

step1_rubric.ts
// Define the evaluation rubric
const RUBRIC = {
  criteria: [
    {
      id: 'contains_diagnosis',
      description: 'Summary mentions the primary diagnosis from the original note',
      weight: 3  // Critical - must have
    },
    {
      id: 'contains_followup',
      description: 'Summary mentions follow-up care instructions or appointments',
      weight: 2
    },
    {
      id: 'no_phi_leaked',
      description: 'Summary does not contain patient names, DOB, or identifiers',
      weight: 3  // Critical - safety
    },
    {
      id: 'compression',
      description: 'Summary is shorter than original while retaining key information',
      weight: 1
    },
    {
      id: 'clinical_accuracy',
      description: 'No hallucinated medical facts or contraindicated information',
      weight: 3  // Critical - safety
    }
  ],
  passing_threshold: 0.8,  // 80% weighted score to pass
  max_score: 12            // Sum of all weights
};
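
To see how the threshold plays out: with max_score 12 and passing_threshold 0.8, a summary needs at least 9.6 points, i.e. 10 in practice. Failing only the weight-1 compression criterion leaves 11/12 ≈ 0.92 (pass), while failing any single weight-3 safety criterion leaves 9/12 = 0.75 (fail), which is exactly the intended behavior.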

Step 2: Build the Judge Prompt

The judge model evaluates outputs against the rubric. Use chain-of-thought grading (the cot_classify pattern from OpenAI Evals) for better accuracy.

step2_judge_prompt.ts
const judgeSystemPrompt = `You are an expert medical documentation reviewer.
You will evaluate summaries of discharge notes against specific criteria.

For each criterion, you must:
1. Quote the relevant part of the summary (or note if absent)
2. Explain your reasoning
3. Mark as PASS or FAIL

Be strict but fair. When in doubt, FAIL.`;

const buildJudgePrompt = (
  originalNote: string,
  summary: string,
  rubric: typeof RUBRIC
) => {
  const criteriaList = rubric.criteria
    .map((c, i) => `${i + 1}. ${c.id}: ${c.description} (weight: ${c.weight})`)
    .join('\n');

  return `## Original Discharge Note
${originalNote}

## Generated Summary
${summary}

## Evaluation Criteria
${criteriaList}

Evaluate the summary against each criterion. For each:
1. Quote relevant evidence
2. Explain reasoning
3. Mark PASS or FAIL`;
};
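
A quick way to sanity-check the prompt builder (the literals here are placeholders, not real data):

// Preview the rendered judge prompt with placeholder inputs
console.log(buildJudgePrompt(
  'Diagnosis: Type 2 Diabetes Mellitus...',
  'Patient admitted for Type 2 Diabetes...',
  RUBRIC
));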

Step 3: Define Structured Output for Judge

Use structured outputs to guarantee the judge's response format.

step3_schema.ts
import OpenAI from 'openai';
import { z } from 'zod';

const EvaluationResultSchema = z.object({
  evaluations: z.array(z.object({
    criterion_id: z.string(),
    evidence: z.string().describe('Exact quote from the summary, or "not found" if absent'),
    reasoning: z.string(),
    passed: z.boolean()
  })),
  total_score: z.number(),
  max_score: z.number(),
  passed: z.boolean(),
  summary_feedback: z.string()
});

type EvaluationResult = z.infer<typeof EvaluationResultSchema>;

// JSON Schema for OpenAI structured outputs
const evaluationJsonSchema = {
  type: 'object',
  properties: {
    evaluations: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          criterion_id: { type: 'string' },
          evidence: { type: 'string' },
          reasoning: { type: 'string' },
          passed: { type: 'boolean' }
        },
        required: ['criterion_id', 'evidence', 'reasoning', 'passed'],
        additionalProperties: false
      }
    },
    total_score: { type: 'number' },
    max_score: { type: 'number' },
    passed: { type: 'boolean' },
    summary_feedback: { type: 'string' }
  },
  required: ['evaluations', 'total_score', 'max_score', 'passed', 'summary_feedback'],
  additionalProperties: false
};
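
Maintaining the Zod schema and the JSON Schema by hand invites drift between the two. Recent versions of the openai SDK ship a zodResponseFormat helper that derives the response_format from the Zod schema directly; if your SDK version has it (roughly openai >= 4.55), the hand-written JSON Schema above becomes unnecessary. A sketch:

import { zodResponseFormat } from 'openai/helpers/zod';

// Derive the structured-output format from the Zod schema instead of
// maintaining a parallel JSON Schema by hand.
const responseFormat = zodResponseFormat(EvaluationResultSchema, 'evaluation_result');
// ...then pass response_format: responseFormat in the API call.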

Step 4: Run the Judge

Use a stronger model (e.g., GPT-4o) as the judge for outputs from the weaker model under test.

step4_run_judge.ts
async function evaluateSummary(
  originalNote: string,
  summary: string,
  rubric: typeof RUBRIC
): Promise<EvaluationResult> {
  const openai = new OpenAI();
  
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',  // Use stronger model as judge
    messages: [
      { role: 'system', content: judgeSystemPrompt },
      { role: 'user', content: buildJudgePrompt(originalNote, summary, rubric) }
    ],
    response_format: {
      type: 'json_schema',
      json_schema: {
        name: 'evaluation_result',
        strict: true,
        schema: evaluationJsonSchema
      }
    }
  });

  const rawResult = JSON.parse(response.choices[0].message.content!);
  
  // Calculate score from evaluations
  let totalScore = 0;
  // ('eval' is a reserved word in strict mode, so name the loop variable 'evaluation')
  for (const evaluation of rawResult.evaluations) {
    const criterion = rubric.criteria.find(c => c.id === evaluation.criterion_id);
    if (criterion && evaluation.passed) {
      totalScore += criterion.weight;
    }
  }
  
  rawResult.total_score = totalScore;
  rawResult.max_score = rubric.max_score;
  rawResult.passed = (totalScore / rubric.max_score) >= rubric.passing_threshold;

  // Validate with Zod
  return EvaluationResultSchema.parse(rawResult);
}

Step 5: Calculate Pass@k

Run the evaluation multiple times to get statistical confidence.

step5_pass_at_k.ts
async function calculatePassAtK(
  generateSummary: (note: string) => Promise<string>,
  testCases: { note: string; expected_diagnosis: string }[],
  k: number = 100
): Promise<{ pass_rate: number; failures: EvaluationResult[] }> {
  const results: EvaluationResult[] = [];
  const failures: EvaluationResult[] = [];
  
  for (const testCase of testCases) {
    for (let i = 0; i < k; i++) {
      const summary = await generateSummary(testCase.note);
      const evaluation = await evaluateSummary(testCase.note, summary, RUBRIC);
      results.push(evaluation);
      
      if (!evaluation.passed) {
        failures.push(evaluation);
      }
    }
  }
  
  const passRate = results.filter(r => r.passed).length / results.length;
  
  console.log(`\nPass@${k} Results:`);
  console.log(`  Pass Rate: ${(passRate * 100).toFixed(1)}%`);
  console.log(`  Total Runs: ${results.length}`);
  console.log(`  Failures: ${failures.length}`);
  
  // For healthcare: require 99%+ pass rate
  if (passRate < 0.99) {
    console.log('  ⚠️  WARNING: Pass rate below 99% - not production ready');
  } else {
    console.log('  ✅ Pass rate meets production threshold');
  }
  
  return { pass_rate: passRate, failures };
}
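
Running k × testCases sequential LLM calls is slow and expensive. A simple concurrency limiter (a sketch, no external library assumed) parallelizes runs without hammering rate limits; for production, a library such as p-limit is a more battle-tested choice.

concurrency_sketch.ts
// Run async jobs with a fixed concurrency limit (a minimal sketch).
async function runWithConcurrency<T>(
  jobs: (() => Promise<T>)[],
  limit: number
): Promise<T[]> {
  const results: T[] = new Array(jobs.length);
  let next = 0;
  // Each worker pulls the next unclaimed job; JS's single-threaded
  // event loop makes the `next++` claim race-free.
  async function worker() {
    while (next < jobs.length) {
      const i = next++;
      results[i] = await jobs[i]();
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, jobs.length) }, worker));
  return results;
}

Each generateSummary + evaluateSummary run becomes one job; a limit of 5 cuts wall-clock time roughly fivefold while staying under typical rate limits.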

Your Task

Build the complete evaluation pipeline:

  1. Define the rubric with weighted criteria
  2. Implement the judge function with structured outputs
  3. Calculate Pass@k metrics
  4. Add property-based assertions for critical safety checks

3. Try It Yourself

exercise_starter.ts
import OpenAI from 'openai';
import { z } from 'zod';

const openai = new OpenAI();

// TODO: Define the evaluation rubric
const RUBRIC = {
  criteria: [
    // Add criteria here
  ],
  passing_threshold: 0.8,
  max_score: 0
};

// TODO: Define Zod schema for evaluation result
const EvaluationResultSchema = z.object({
  // Add fields here
});

type EvaluationResult = z.infer<typeof EvaluationResultSchema>;

// TODO: Build judge prompt
function buildJudgePrompt(originalNote: string, summary: string): string {
  throw new Error('Not implemented');
}

// TODO: Run evaluation
async function evaluateSummary(
  originalNote: string,
  summary: string
): Promise<EvaluationResult> {
  throw new Error('Not implemented');
}

// TODO: Calculate Pass@k
async function calculatePassAtK(
  testCases: { note: string }[],
  k: number
): Promise<{ pass_rate: number }> {
  throw new Error('Not implemented');
}

// Test data
const testNote = `
DISCHARGE SUMMARY
Patient: [REDACTED]
Admission Date: 2024-01-15
Discharge Date: 2024-01-18

Diagnosis: Type 2 Diabetes Mellitus with hyperglycemia

Hospital Course:
Patient presented with blood glucose of 450 mg/dL. Started on insulin
drip, transitioned to subcutaneous insulin. Glucose stabilized.

Discharge Medications:
- Metformin 1000mg twice daily
- Lantus 20 units at bedtime

Follow-up:
- Endocrinology appointment in 2 weeks
- Primary care in 1 week
`;

const testSummary = `
Patient admitted for Type 2 Diabetes with elevated blood sugar (450).
Treated with insulin, now stable. Discharged on Metformin and Lantus.
Follow-up with endocrinology in 2 weeks and primary care in 1 week.
`;

// Run test
evaluateSummary(testNote, testSummary).then(result => {
  console.log('Evaluation Result:', JSON.stringify(result, null, 2));
});

This TypeScript exercise requires local setup. Copy the code into your IDE to run it.

4. Get Help (If Needed)

Hint 1: The rubric should have weighted criteria. Critical items (diagnosis, safety) get weight 3, nice-to-haves get weight 1.
Hint 2: Use property-based assertions for deterministic checks (e.g., compression) before calling the LLM judge.
Hint 3: For structured outputs, all properties must be in 'required' and use additionalProperties: false in strict mode.

5. Check the Solution

exercise_solution.ts
import OpenAI from 'openai';
import { z } from 'zod';

const openai = new OpenAI();

// Evaluation rubric with weighted criteria
const RUBRIC = {
  criteria: [
    {
      id: 'contains_diagnosis',
      description: 'Summary mentions the primary diagnosis from the original note',
      weight: 3
    },
    {
      id: 'contains_followup',
      description: 'Summary mentions follow-up care instructions or appointments',
      weight: 2
    },
    {
      id: 'no_phi_leaked',
      description: 'Summary does not contain patient names, DOB, SSN, or other identifiers',
      weight: 3
    },
    {
      id: 'compression',
      description: 'Summary is shorter than original while retaining key information',
      weight: 1
    },
    {
      id: 'clinical_accuracy',
      description: 'No hallucinated medical facts or contraindicated information',
      weight: 3
    }
  ],
  passing_threshold: 0.8,
  max_score: 12
};

// Zod schema for type-safe validation
const EvaluationResultSchema = z.object({
  evaluations: z.array(z.object({
    criterion_id: z.string(),
    evidence: z.string(),
    reasoning: z.string(),
    passed: z.boolean()
  })),
  total_score: z.number(),
  max_score: z.number(),
  passed: z.boolean(),
  summary_feedback: z.string()
});

type EvaluationResult = z.infer<typeof EvaluationResultSchema>;

// JSON Schema for OpenAI structured outputs
const evaluationJsonSchema = {
  type: 'object',
  properties: {
    evaluations: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          criterion_id: { type: 'string' },
          evidence: { type: 'string' },
          reasoning: { type: 'string' },
          passed: { type: 'boolean' }
        },
        required: ['criterion_id', 'evidence', 'reasoning', 'passed'],
        additionalProperties: false
      }
    },
    total_score: { type: 'number' },
    max_score: { type: 'number' },
    passed: { type: 'boolean' },
    summary_feedback: { type: 'string' }
  },
  required: ['evaluations', 'total_score', 'max_score', 'passed', 'summary_feedback'],
  additionalProperties: false
};

const judgeSystemPrompt = `You are an expert medical documentation reviewer.
You evaluate discharge note summaries against specific quality criteria.

For each criterion:
1. Quote relevant evidence from the summary (or note "not found")
2. Explain your reasoning clearly
3. Mark as PASS or FAIL

Be strict but fair. Patient safety is paramount. When in doubt, FAIL.`;

function buildJudgePrompt(originalNote: string, summary: string): string {
  const criteriaList = RUBRIC.criteria
    .map((c, i) => `${i + 1}. ${c.id}: ${c.description} (weight: ${c.weight})`)
    .join('\n');

  return `## Original Discharge Note\n${originalNote}\n\n## Generated Summary\n${summary}\n\n## Evaluation Criteria\n${criteriaList}\n\nEvaluate the summary against EACH criterion listed above.`;
}

async function evaluateSummary(
  originalNote: string,
  summary: string
): Promise<EvaluationResult> {
  // Property-based assertions (deterministic pre-checks)
  // Allow 20% buffer for abbreviation expansion (e.g., "Pt w/ DM2" -> "Patient with Type 2 Diabetes")
  // Dense clinical notes often use abbreviations that get expanded in summaries
  const COMPRESSION_BUFFER = 1.2;
  if (summary.length > originalNote.length * COMPRESSION_BUFFER) {
    return {
      evaluations: [{
        criterion_id: 'compression',
        evidence: `Summary: ${summary.length} chars, Original: ${originalNote.length} chars (max allowed: ${Math.floor(originalNote.length * COMPRESSION_BUFFER)})`,
        reasoning: 'Summary exceeds 120% of original length - abbreviation expansion alone cannot explain this',
        passed: false
      }],
      total_score: 0,
      max_score: RUBRIC.max_score,
      passed: false,
      summary_feedback: 'Summary failed compression check before LLM evaluation'
    };
  }

  // LLM-as-Judge evaluation
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: judgeSystemPrompt },
      { role: 'user', content: buildJudgePrompt(originalNote, summary) }
    ],
    response_format: {
      type: 'json_schema',
      json_schema: {
        name: 'evaluation_result',
        strict: true,
        schema: evaluationJsonSchema
      }
    }
  });

  const rawResult = JSON.parse(response.choices[0].message.content!);
  
  // Calculate score from evaluations (deterministic post-processing)
  let totalScore = 0;
  for (const evaluation of rawResult.evaluations) {
    const criterion = RUBRIC.criteria.find(c => c.id === evaluation.criterion_id);
    if (criterion && evaluation.passed) {
      totalScore += criterion.weight;
    }
  }
  
  rawResult.total_score = totalScore;
  rawResult.max_score = RUBRIC.max_score;
  rawResult.passed = (totalScore / RUBRIC.max_score) >= RUBRIC.passing_threshold;

  // Validate with Zod
  return EvaluationResultSchema.parse(rawResult);
}

async function calculatePassAtK(
  generateSummary: (note: string) => Promise<string>,
  testCases: { note: string }[],
  k: number = 10
): Promise<{ pass_rate: number; failures: EvaluationResult[] }> {
  const results: EvaluationResult[] = [];
  const failures: EvaluationResult[] = [];
  
  for (const testCase of testCases) {
    for (let i = 0; i < k; i++) {
      try {
        const summary = await generateSummary(testCase.note);
        const evaluation = await evaluateSummary(testCase.note, summary);
        results.push(evaluation);
        
        if (!evaluation.passed) {
          failures.push(evaluation);
        }
      } catch (error) {
        // Count generation/evaluation errors as failed runs, recording them
        // in BOTH arrays so the pass-rate denominator stays correct
        const errorResult: EvaluationResult = {
          evaluations: [],
          total_score: 0,
          max_score: RUBRIC.max_score,
          passed: false,
          summary_feedback: `Error: ${error}`
        };
        results.push(errorResult);
        failures.push(errorResult);
      }
    }
  }
  
  const passRate = results.filter(r => r.passed).length / results.length;
  
  console.log(`\n📊 Pass@${k} Results:`);
  console.log(`   Pass Rate: ${(passRate * 100).toFixed(1)}%`);
  console.log(`   Total Runs: ${results.length}`);
  console.log(`   Failures: ${failures.length}`);
  
  if (passRate < 0.99) {
    console.log('   ⚠️  Below 99% - not production ready for healthcare');
  } else {
    console.log('   ✅ Meets production threshold');
  }
  
  return { pass_rate: passRate, failures };
}

// Test data
const testNote = `
DISCHARGE SUMMARY
Patient: [REDACTED]
Admission Date: 2024-01-15
Discharge Date: 2024-01-18

Diagnosis: Type 2 Diabetes Mellitus with hyperglycemia

Hospital Course:
Patient presented with blood glucose of 450 mg/dL. Started on insulin
drip, transitioned to subcutaneous insulin. Glucose stabilized.

Discharge Medications:
- Metformin 1000mg twice daily
- Lantus 20 units at bedtime

Follow-up:
- Endocrinology appointment in 2 weeks
- Primary care in 1 week
`;

const testSummary = `
Patient admitted for Type 2 Diabetes with elevated blood sugar (450).
Treated with insulin, now stable. Discharged on Metformin and Lantus.
Follow-up with endocrinology in 2 weeks and primary care in 1 week.
`;

// Run single evaluation
evaluateSummary(testNote, testSummary).then(result => {
  console.log('\n📋 Single Evaluation:');
  console.log(`   Passed: ${result.passed}`);
  console.log(`   Score: ${result.total_score}/${result.max_score}`);
  console.log('\n   Criteria Results:');
  for (const e of result.evaluations) {
    console.log(`   ${e.passed ? '✅' : '❌'} ${e.criterion_id}`);
    console.log(`      Evidence: ${e.evidence.substring(0, 80)}...`);
  }
});

/* Expected output:
📋 Single Evaluation:
   Passed: true
   Score: 12/12

   Criteria Results:
   ✅ contains_diagnosis
      Evidence: "Type 2 Diabetes with elevated blood sugar (450)"...
   ✅ contains_followup
      Evidence: "Follow-up with endocrinology in 2 weeks"...
   ✅ no_phi_leaked
      Evidence: No patient identifiers found...
   ✅ compression
      Evidence: Summary is 203 chars, original is 512 chars...
   ✅ clinical_accuracy
      Evidence: Medications and diagnoses match original...
*/
Key points in the solution:

  • Weight critical criteria (safety, accuracy) higher than nice-to-haves
  • Run the deterministic compression check (a property-based assertion) before spending an LLM call
  • Allow a 20% buffer for abbreviation expansion (e.g., 'Pt w/ DM2' -> 'Patient with Type 2 Diabetes')
  • Recompute the score deterministically from the judge's per-criterion verdicts
  • Validate the final result with Zod to guarantee type safety at runtime

Common Mistakes

Using exact-match assertions for LLM outputs

Why it's wrong: LLM outputs are non-deterministic. 'Type 2 Diabetes' vs 'T2D' vs 'Diabetes Mellitus Type II' are all valid.

How to fix: Use property-based assertions (contains, length, format) or semantic similarity instead of exact matches.

Using the same model to judge itself

Why it's wrong: A model will be biased toward its own outputs. Same errors propagate.

How to fix: Use a stronger model (GPT-4) to judge outputs from weaker models (GPT-3.5), or use multiple diverse models.

Running evaluation only once

Why it's wrong: Single runs don't capture the variance in LLM outputs. You might get lucky or unlucky.

How to fix: Calculate Pass@k over many runs (k=100 for production readiness) to get statistical confidence.

No deterministic pre/post processing

Why it's wrong: Wrapping probabilistic calls without deterministic layers makes results unpredictable.

How to fix: Use the Sandwich Pattern: deterministic pre-processing (schema) → LLM call → deterministic validation (Zod).

Strict character count for compression checks

Why it's wrong: Dense clinical notes use abbreviations (Pt, DM2, w/) that expand in summaries. A summary can be longer if it expands 'Pt w/ DM2' to 'Patient with Type 2 Diabetes'.

How to fix: Allow a buffer (e.g., 120% of original) for abbreviation expansion, or check 'information density' rather than strict length.
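
One way to approximate 'information density' is comparing word counts rather than characters, since abbreviation expansion inflates characters far more than words. A sketch (isCompressed is a hypothetical helper, not part of the solution):

// 'Pt w/ DM2' -> 'Patient with Type 2 Diabetes' grows from 9 to 28
// characters but only from 3 to 5 words, so word counts are a more
// expansion-tolerant compression signal.
function isCompressed(original: string, summary: string): boolean {
  const words = (s: string) => s.trim().split(/\s+/).length;
  return words(summary) < words(original);
}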

Test Cases

Passes valid summary

A good summary should pass all criteria

Input: Test note + valid summary with diagnosis, follow-up, no PHI
Expected: passed: true, total_score >= 10

Fails PHI leak

Summary containing patient name should fail

Input: Test note + summary with 'John Doe' included
Expected: passed: false, no_phi_leaked: false

Fails compression

Summary longer than original should fail

Input: Short note + very long summary
Expected: passed: false, compression: false
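
A minimal harness for these cases, assuming the evaluateSummary, testNote, and testSummary definitions from the solution (console.assert keeps it dependency-free; swap in your test runner of choice). Keep in mind that LLM-graded cases are themselves probabilistic, so treat assertion failures as signals rather than proof.

test_cases_sketch.ts
async function runTestCases() {
  // Case 1: valid summary should pass with a high score
  const good = await evaluateSummary(testNote, testSummary);
  console.assert(good.passed && good.total_score >= 10, 'valid summary should pass');

  // Case 2: injected name should fail the no_phi_leaked criterion
  const phiSummary = testSummary.replace('Patient', 'John Doe');
  const phi = await evaluateSummary(testNote, phiSummary);
  const phiResult = phi.evaluations.find(e => e.criterion_id === 'no_phi_leaked');
  console.assert(!phi.passed && phiResult?.passed === false, 'PHI leak should fail');

  // Case 3: summary far longer than the note should fail the compression pre-check
  const bloated = await evaluateSummary('Short note.', testNote.repeat(3));
  console.assert(!bloated.passed, 'bloated summary should fail compression');
}

runTestCases();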

