Skip to content

Design a Type-Safe AI Schema

Build beginner 30 min typescript
Sources not yet verified

Design and implement a Zod schema for structured AI outputs. Learn to constrain LLM responses for reliable data extraction.

1. Understand the Scenario

You're building an invoice processing system that uses AI to extract structured data from invoice images/PDFs. The extracted data needs to be type-safe and validated before entering your database.

Learning Objectives

  • Design a Zod schema for invoice data
  • Convert Zod schema to JSON Schema for OpenAI
  • Extract structured data from unstructured text
  • Handle edge cases and optional fields

2. Follow the Instructions

What You'll Build

A schema-driven invoice extractor that:

  1. Defines the invoice structure with Zod
  2. Converts to JSON Schema for OpenAI's structured outputs
  3. Extracts invoice data with guaranteed format
  4. Validates the extracted data at runtime

The Challenge

Given invoice text like:

INVOICE #2024-0892
Date: December 15, 2024
Due: January 15, 2025

From: Acme Corp
123 Business St, Suite 100
New York, NY 10001

To: Widget Inc
456 Commerce Ave
San Francisco, CA 94102

Items:
- Web Development Services (40 hrs @ $150/hr) - $6,000.00
- Cloud Hosting (Monthly) - $299.00
- Domain Renewal - $15.00

Subtotal: $6,314.00
Tax (8.5%): $536.69
Total: $6,850.69

Payment Terms: Net 30
Status: Unpaid

Extract structured, validated data.

Step 1: Define the Schema with Zod

Start by defining what an invoice looks like:

import { z } from 'zod';

// Define individual components first
const AddressSchema = z.object({
  company: z.string().describe('Company or person name'),
  street: z.string().describe('Street address'),
  city: z.string(),
  state: z.string().describe('State or province code'),
  zip: z.string().describe('Postal/ZIP code')
});

const LineItemSchema = z.object({
  description: z.string().describe('Item or service description'),
  quantity: z.number().optional().describe('Quantity if applicable'),
  unit_price: z.number().optional().describe('Price per unit'),
  amount: z.number().describe('Line item total')
});

// Compose into the full invoice schema
const InvoiceSchema = z.object({
  invoice_number: z.string().describe('Invoice ID/number'),
  date: z.string().describe('Invoice date in YYYY-MM-DD format'),
  due_date: z.string().optional().describe('Due date in YYYY-MM-DD format'),
  from: AddressSchema.describe('Seller/vendor address'),
  to: AddressSchema.describe('Buyer/customer address'),
  line_items: z.array(LineItemSchema).describe('List of items/services'),
  subtotal: z.number().describe('Sum before tax'),
  tax_rate: z.number().optional().describe('Tax rate as decimal, e.g., 0.085 for 8.5%'),
  tax_amount: z.number().optional().describe('Tax amount'),
  total: z.number().describe('Final total'),
  status: z.enum(['paid', 'unpaid', 'partial', 'overdue']).describe('Payment status'),
  payment_terms: z.string().optional().describe('e.g., Net 30')
});

// Get TypeScript type for free!
type Invoice = z.infer<typeof InvoiceSchema>;

Step 2: Convert to JSON Schema

OpenAI's structured outputs use JSON Schema. Use zod-to-json-schema to convert:

import { zodToJsonSchema } from 'zod-to-json-schema';

const jsonSchema = zodToJsonSchema(InvoiceSchema, {
  name: 'invoice_extraction',
  $refStrategy: 'none' // Inline all definitions
});

console.log(JSON.stringify(jsonSchema, null, 2));

Your Task: Complete the extraction function in the starter code. Use OpenAI's structured outputs to extract invoice data, then validate with Zod.

3. Try It Yourself

starter_code.ts
import OpenAI from 'openai';
import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';

const openai = new OpenAI();

// TODO: Define your InvoiceSchema here
const InvoiceSchema = z.object({
  // Your schema definition
});

type Invoice = z.infer<typeof InvoiceSchema>;

// TODO: Implement the extraction function
async function extractInvoice(invoiceText: string): Promise<Invoice> {
  // 1. Convert Zod schema to JSON Schema
  // 2. Call OpenAI with structured output
  // 3. Parse and validate the response
  // 4. Return the validated invoice
  
  throw new Error('Not implemented');
}

// Test with sample invoice
const sampleInvoice = `
INVOICE #2024-0892
Date: December 15, 2024
Due: January 15, 2025

From: Acme Corp, 123 Business St, Suite 100, New York, NY 10001
To: Widget Inc, 456 Commerce Ave, San Francisco, CA 94102

Items:
- Web Development Services (40 hrs @ $150/hr) - $6,000.00
- Cloud Hosting (Monthly) - $299.00

Subtotal: $6,299.00
Tax (8.5%): $535.42
Total: $6,834.42

Status: Unpaid
`;

extractInvoice(sampleInvoice)
  .then(invoice => console.log(JSON.stringify(invoice, null, 2)))
  .catch(console.error);

This typescript exercise requires local setup. Copy the code to your IDE to run.

4. Get Help (If Needed)

Reveal progressive hints
Hint 1: Use .describe() on every field to give the LLM context about what to extract.
Hint 2: The system prompt should include instructions for formatting: dates as YYYY-MM-DD, percentages as decimals, amounts without currency symbols.
Hint 3: After getting the response, parse with JSON.parse() then validate with InvoiceSchema.parse() to catch any edge cases the LLM might miss.

5. Check the Solution

Reveal the complete solution
solution.ts
import OpenAI from 'openai';
import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';

const openai = new OpenAI();

// Define component schemas
const AddressSchema = z.object({
  company: z.string().describe('Company or person name'),
  street: z.string().describe('Street address including suite/unit'),
  city: z.string(),
  state: z.string().describe('State or province code (e.g., NY, CA)'),
  zip: z.string().describe('Postal/ZIP code')
});

const LineItemSchema = z.object({
  description: z.string().describe('Item or service description'),
  quantity: z.number().optional().describe('Quantity if applicable'),
  unit_price: z.number().optional().describe('Price per unit if applicable'),
  amount: z.number().describe('Line item total amount')
});

const InvoiceSchema = z.object({
  invoice_number: z.string().describe('Invoice ID/number including any prefix'),
  date: z.string().describe('Invoice date in YYYY-MM-DD format'),
  due_date: z.string().optional().describe('Due date in YYYY-MM-DD format'),
  from: AddressSchema.describe('Seller/vendor address'),
  to: AddressSchema.describe('Buyer/customer address'),
  line_items: z.array(LineItemSchema).describe('List of items/services billed'),
  subtotal: z.number().describe('Sum of line items before tax'),
  tax_rate: z.number().optional().describe('Tax rate as decimal (0.085 for 8.5%)'),
  tax_amount: z.number().optional().describe('Calculated tax amount'),
  total: z.number().describe('Final total including tax'),
  status: z.enum(['paid', 'unpaid', 'partial', 'overdue']).describe('Payment status'),
  payment_terms: z.string().optional().describe('Payment terms like Net 30')
});

type Invoice = z.infer<typeof InvoiceSchema>;

async function extractInvoice(invoiceText: string): Promise<Invoice> {
  // 1. Convert Zod schema to JSON Schema
  const jsonSchema = zodToJsonSchema(InvoiceSchema, {
    name: 'invoice_extraction',
    $refStrategy: 'none'
  });

  // 2. Call OpenAI with structured output
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: 'You are an invoice data extraction assistant. Extract structured data from invoice text. Convert dates to YYYY-MM-DD format. Convert percentages to decimals (8.5% = 0.085). Parse amounts as numbers without currency symbols.'
      },
      {
        role: 'user',
        content: `Extract the invoice data from this text:\n\n${invoiceText}`
      }
    ],
    response_format: {
      type: 'json_schema',
      json_schema: {
        name: 'invoice_extraction',
        schema: jsonSchema as any
      }
    }
  });

  const content = response.choices[0].message.content;
  if (!content) {
    throw new Error('No content in response');
  }

  // 3. Parse the JSON response
  const parsed = JSON.parse(content);

  // 4. Validate with Zod (catches edge cases)
  const validated = InvoiceSchema.parse(parsed);

  return validated;
}

// Test with sample invoice
const sampleInvoice = `
INVOICE #2024-0892
Date: December 15, 2024
Due: January 15, 2025

From: Acme Corp, 123 Business St, Suite 100, New York, NY 10001
To: Widget Inc, 456 Commerce Ave, San Francisco, CA 94102

Items:
- Web Development Services (40 hrs @ $150/hr) - $6,000.00
- Cloud Hosting (Monthly) - $299.00

Subtotal: $6,299.00
Tax (8.5%): $535.42
Total: $6,834.42

Status: Unpaid
`;

extractInvoice(sampleInvoice)
  .then(invoice => {
    console.log('Extracted Invoice:');
    console.log(JSON.stringify(invoice, null, 2));
  })
  .catch(console.error);

/* Expected output:
{
  "invoice_number": "2024-0892",
  "date": "2024-12-15",
  "due_date": "2025-01-15",
  "from": {
    "company": "Acme Corp",
    "street": "123 Business St, Suite 100",
    "city": "New York",
    "state": "NY",
    "zip": "10001"
  },
  "to": {
    "company": "Widget Inc",
    "street": "456 Commerce Ave",
    "city": "San Francisco",
    "state": "CA",
    "zip": "94102"
  },
  "line_items": [
    {
      "description": "Web Development Services",
      "quantity": 40,
      "unit_price": 150,
      "amount": 6000
    },
    {
      "description": "Cloud Hosting (Monthly)",
      "amount": 299
    }
  ],
  "subtotal": 6299,
  "tax_rate": 0.085,
  "tax_amount": 535.42,
  "total": 6834.42,
  "status": "unpaid"
}
*/

Common Mistakes

Not using .describe() on schema fields

Why it's wrong: Without descriptions, the LLM has no context about what data to extract for each field

How to fix: Add .describe() to every field - these become JSON Schema descriptions that guide extraction

Forgetting to handle optional fields

Why it's wrong: Not all invoices have every field (e.g., due_date, tax_rate) - requiring them causes validation failures

How to fix: Use .optional() in Zod for fields that may not be present in all invoices

Trusting LLM output without validation

Why it's wrong: Even with structured outputs, edge cases can produce invalid data (wrong date format, missing fields)

How to fix: Always validate with Zod after parsing - it catches errors the API might miss

Using $refs in JSON Schema with OpenAI

Why it's wrong: OpenAI's structured outputs don't support JSON Schema $ref - the request will fail

How to fix: Set $refStrategy: 'none' when converting to inline all definitions

Test Cases

Schema infers correct TypeScript type

Input: typeof InvoiceSchema._type
Expected: TypeScript type with all expected fields

Extracts invoice number correctly

Input: extractInvoice(sampleInvoice)
Expected: invoice_number should be '2024-0892'

Converts dates to ISO format

Input: extractInvoice(sampleInvoice)
Expected: date should be '2024-12-15', due_date should be '2025-01-15'

Parses line items as array

Input: extractInvoice(sampleInvoice)
Expected: line_items should be array with 2 items

Sources

Tempered AI Forged Through Practice, Not Hype

Keyboard Shortcuts

j
Next page
k
Previous page
h
Section home
/
Search
?
Show shortcuts
m
Toggle sidebar
Esc
Close modal
Shift+R
Reset all progress
? Keyboard shortcuts