Design a Type-Safe AI Schema
Design and implement a Zod schema for structured AI outputs. Learn to constrain LLM responses for reliable data extraction.
1. Understand the Scenario
You're building an invoice processing system that uses AI to extract structured data from invoice images/PDFs. The extracted data needs to be type-safe and validated before entering your database.
Learning Objectives
- Design a Zod schema for invoice data
- Convert Zod schema to JSON Schema for OpenAI
- Extract structured data from unstructured text
- Handle edge cases and optional fields
Concepts You'll Practice
2. Follow the Instructions
What You'll Build
A schema-driven invoice extractor that:
- Defines the invoice structure with Zod
- Converts to JSON Schema for OpenAI's structured outputs
- Extracts invoice data with guaranteed format
- Validates the extracted data at runtime
The Challenge
Given invoice text like:
INVOICE #2024-0892
Date: December 15, 2024
Due: January 15, 2025
From: Acme Corp
123 Business St, Suite 100
New York, NY 10001
To: Widget Inc
456 Commerce Ave
San Francisco, CA 94102
Items:
- Web Development Services (40 hrs @ $150/hr) - $6,000.00
- Cloud Hosting (Monthly) - $299.00
- Domain Renewal - $15.00
Subtotal: $6,314.00
Tax (8.5%): $536.69
Total: $6,850.69
Payment Terms: Net 30
Status: Unpaid
Extract structured, validated data.
Step 1: Define the Schema with Zod
Start by defining what an invoice looks like:
import { z } from 'zod';
// Define individual components first
const AddressSchema = z.object({
company: z.string().describe('Company or person name'),
street: z.string().describe('Street address'),
city: z.string(),
state: z.string().describe('State or province code'),
zip: z.string().describe('Postal/ZIP code')
});
const LineItemSchema = z.object({
description: z.string().describe('Item or service description'),
quantity: z.number().optional().describe('Quantity if applicable'),
unit_price: z.number().optional().describe('Price per unit'),
amount: z.number().describe('Line item total')
});
// Compose into the full invoice schema
const InvoiceSchema = z.object({
invoice_number: z.string().describe('Invoice ID/number'),
date: z.string().describe('Invoice date in YYYY-MM-DD format'),
due_date: z.string().optional().describe('Due date in YYYY-MM-DD format'),
from: AddressSchema.describe('Seller/vendor address'),
to: AddressSchema.describe('Buyer/customer address'),
line_items: z.array(LineItemSchema).describe('List of items/services'),
subtotal: z.number().describe('Sum before tax'),
tax_rate: z.number().optional().describe('Tax rate as decimal, e.g., 0.085 for 8.5%'),
tax_amount: z.number().optional().describe('Tax amount'),
total: z.number().describe('Final total'),
status: z.enum(['paid', 'unpaid', 'partial', 'overdue']).describe('Payment status'),
payment_terms: z.string().optional().describe('e.g., Net 30')
});
// Get TypeScript type for free!
type Invoice = z.infer<typeof InvoiceSchema>;
Step 2: Convert to JSON Schema
OpenAI's structured outputs use JSON Schema. Use zod-to-json-schema to convert:
import { zodToJsonSchema } from 'zod-to-json-schema';
const jsonSchema = zodToJsonSchema(InvoiceSchema, {
name: 'invoice_extraction',
$refStrategy: 'none' // Inline all definitions
});
console.log(JSON.stringify(jsonSchema, null, 2));
Your Task: Complete the extraction function in the starter code. Use OpenAI's structured outputs to extract invoice data, then validate with Zod.
3. Try It Yourself
import OpenAI from 'openai';
import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';
const openai = new OpenAI();
// TODO: Define your InvoiceSchema here
const InvoiceSchema = z.object({
// Your schema definition
});
type Invoice = z.infer<typeof InvoiceSchema>;
// TODO: Implement the extraction function
async function extractInvoice(invoiceText: string): Promise<Invoice> {
// 1. Convert Zod schema to JSON Schema
// 2. Call OpenAI with structured output
// 3. Parse and validate the response
// 4. Return the validated invoice
throw new Error('Not implemented');
}
// Test with sample invoice
const sampleInvoice = `
INVOICE #2024-0892
Date: December 15, 2024
Due: January 15, 2025
From: Acme Corp, 123 Business St, Suite 100, New York, NY 10001
To: Widget Inc, 456 Commerce Ave, San Francisco, CA 94102
Items:
- Web Development Services (40 hrs @ $150/hr) - $6,000.00
- Cloud Hosting (Monthly) - $299.00
Subtotal: $6,299.00
Tax (8.5%): $535.42
Total: $6,834.42
Status: Unpaid
`;
extractInvoice(sampleInvoice)
.then(invoice => console.log(JSON.stringify(invoice, null, 2)))
.catch(console.error); This typescript exercise requires local setup. Copy the code to your IDE to run.
4. Get Help (If Needed)
Reveal progressive hints
5. Check the Solution
Reveal the complete solution
import OpenAI from 'openai';
import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';
const openai = new OpenAI();
// Define component schemas
const AddressSchema = z.object({
company: z.string().describe('Company or person name'),
street: z.string().describe('Street address including suite/unit'),
city: z.string(),
state: z.string().describe('State or province code (e.g., NY, CA)'),
zip: z.string().describe('Postal/ZIP code')
});
const LineItemSchema = z.object({
description: z.string().describe('Item or service description'),
quantity: z.number().optional().describe('Quantity if applicable'),
unit_price: z.number().optional().describe('Price per unit if applicable'),
amount: z.number().describe('Line item total amount')
});
const InvoiceSchema = z.object({
invoice_number: z.string().describe('Invoice ID/number including any prefix'),
date: z.string().describe('Invoice date in YYYY-MM-DD format'),
due_date: z.string().optional().describe('Due date in YYYY-MM-DD format'),
from: AddressSchema.describe('Seller/vendor address'),
to: AddressSchema.describe('Buyer/customer address'),
line_items: z.array(LineItemSchema).describe('List of items/services billed'),
subtotal: z.number().describe('Sum of line items before tax'),
tax_rate: z.number().optional().describe('Tax rate as decimal (0.085 for 8.5%)'),
tax_amount: z.number().optional().describe('Calculated tax amount'),
total: z.number().describe('Final total including tax'),
status: z.enum(['paid', 'unpaid', 'partial', 'overdue']).describe('Payment status'),
payment_terms: z.string().optional().describe('Payment terms like Net 30')
});
type Invoice = z.infer<typeof InvoiceSchema>;
async function extractInvoice(invoiceText: string): Promise<Invoice> {
// 1. Convert Zod schema to JSON Schema
const jsonSchema = zodToJsonSchema(InvoiceSchema, {
name: 'invoice_extraction',
$refStrategy: 'none'
});
// 2. Call OpenAI with structured output
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [
{
role: 'system',
content: 'You are an invoice data extraction assistant. Extract structured data from invoice text. Convert dates to YYYY-MM-DD format. Convert percentages to decimals (8.5% = 0.085). Parse amounts as numbers without currency symbols.'
},
{
role: 'user',
content: `Extract the invoice data from this text:\n\n${invoiceText}`
}
],
response_format: {
type: 'json_schema',
json_schema: {
name: 'invoice_extraction',
schema: jsonSchema as any
}
}
});
const content = response.choices[0].message.content;
if (!content) {
throw new Error('No content in response');
}
// 3. Parse the JSON response
const parsed = JSON.parse(content);
// 4. Validate with Zod (catches edge cases)
const validated = InvoiceSchema.parse(parsed);
return validated;
}
// Test with sample invoice
const sampleInvoice = `
INVOICE #2024-0892
Date: December 15, 2024
Due: January 15, 2025
From: Acme Corp, 123 Business St, Suite 100, New York, NY 10001
To: Widget Inc, 456 Commerce Ave, San Francisco, CA 94102
Items:
- Web Development Services (40 hrs @ $150/hr) - $6,000.00
- Cloud Hosting (Monthly) - $299.00
Subtotal: $6,299.00
Tax (8.5%): $535.42
Total: $6,834.42
Status: Unpaid
`;
extractInvoice(sampleInvoice)
.then(invoice => {
console.log('Extracted Invoice:');
console.log(JSON.stringify(invoice, null, 2));
})
.catch(console.error);
/* Expected output:
{
"invoice_number": "2024-0892",
"date": "2024-12-15",
"due_date": "2025-01-15",
"from": {
"company": "Acme Corp",
"street": "123 Business St, Suite 100",
"city": "New York",
"state": "NY",
"zip": "10001"
},
"to": {
"company": "Widget Inc",
"street": "456 Commerce Ave",
"city": "San Francisco",
"state": "CA",
"zip": "94102"
},
"line_items": [
{
"description": "Web Development Services",
"quantity": 40,
"unit_price": 150,
"amount": 6000
},
{
"description": "Cloud Hosting (Monthly)",
"amount": 299
}
],
"subtotal": 6299,
"tax_rate": 0.085,
"tax_amount": 535.42,
"total": 6834.42,
"status": "unpaid"
}
*/ Common Mistakes
Not using .describe() on schema fields
Why it's wrong: Without descriptions, the LLM has no context about what data to extract for each field
How to fix: Add .describe() to every field - these become JSON Schema descriptions that guide extraction
Forgetting to handle optional fields
Why it's wrong: Not all invoices have every field (e.g., due_date, tax_rate) - requiring them causes validation failures
How to fix: Use .optional() in Zod for fields that may not be present in all invoices
Trusting LLM output without validation
Why it's wrong: Even with structured outputs, edge cases can produce invalid data (wrong date format, missing fields)
How to fix: Always validate with Zod after parsing - it catches errors the API might miss
Using $refs in JSON Schema with OpenAI
Why it's wrong: OpenAI's structured outputs don't support JSON Schema $ref - the request will fail
How to fix: Set $refStrategy: 'none' when converting to inline all definitions
Test Cases
Schema infers correct TypeScript type
typeof InvoiceSchema._typeTypeScript type with all expected fieldsExtracts invoice number correctly
extractInvoice(sampleInvoice)invoice_number should be '2024-0892'Converts dates to ISO format
extractInvoice(sampleInvoice)date should be '2024-12-15', due_date should be '2025-01-15'Parses line items as array
extractInvoice(sampleInvoice)line_items should be array with 2 items