Structured Output Extraction
Learn to extract structured data from unstructured text using JSON Schema constraints, ensuring type-safe outputs from LLMs.
1. Understand the Scenario
You're building a system that extracts contact information from business emails. Instead of parsing free-form text, you'll use structured outputs to guarantee the data format.
Learning Objectives
- Understand how JSON Schema constrains LLM outputs
- Design schemas for real-world extraction tasks
- Handle optional vs required fields
- Validate extracted data with proper types
2. Follow the Instructions
What You'll Build
A contact extractor that takes messy email signatures and extracts structured contact data:
Input:
Best regards,
Sarah Chen | Senior Engineer
Acme Corp - sarah.chen@acme.io
Mobile: +1 (555) 123-4567
Output:
{
  "name": "Sarah Chen",
  "title": "Senior Engineer",
  "company": "Acme Corp",
  "email": "sarah.chen@acme.io",
  "phone": "+1 (555) 123-4567"
}
Step 1: Define Your Schema
The schema tells the model exactly what structure to produce. Use JSON Schema format with clear descriptions.
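// Define the expected output structure.
// Strict mode (used in Step 2) requires every property to appear in `required`,
// so fields that may be absent are typed as string-or-null rather than left out.
const contactSchema = {
  type: 'object',
  properties: {
    name: {
      type: 'string',
      description: 'Full name of the person'
    },
    title: {
      type: ['string', 'null'],
      description: 'Job title or role, or null if not found'
    },
    company: {
      type: ['string', 'null'],
      description: 'Company or organization name, or null if not found'
    },
    email: {
      type: ['string', 'null'],
      description: 'Email address, or null if not found'
    },
    phone: {
      type: ['string', 'null'],
      description: 'Phone number in any format, or null if not found'
    }
  },
  required: ['name', 'title', 'company', 'email', 'phone'],
  additionalProperties: false
};
Step 2: Call the API with Response Format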
With OpenAI, pass the schema through response_format. With strict: true, the model's output is constrained to JSON that matches your schema. The API still returns that JSON as a string, so you parse it yourself.
import OpenAI from 'openai';

const openai = new OpenAI();

async function extractContact(emailText: string) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-2024-08-06',
    messages: [
      {
        role: 'system',
        content: 'Extract contact information from email signatures. Return null for fields not found.'
      },
      {
        role: 'user',
        content: emailText
      }
    ],
    response_format: {
      type: 'json_schema',
      json_schema: {
        name: 'contact_info',
        strict: true,
        schema: contactSchema
      }
    }
  });
  return JSON.parse(response.choices[0].message.content!);
}
Step 3: Add Runtime Validation
Even with structured outputs, add runtime validation for defense in depth. Use Zod or similar for TypeScript-native validation.
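To make the idea of runtime validation concrete, here is a minimal, dependency-free sketch of the kind of check a library such as Zod performs for you. The `Contact` interface and the `isContact` and `parseContact` helpers are hypothetical names introduced for illustration, and the nullable fields assume the model returns null for anything it cannot find.

```typescript
// Hypothetical shape for illustration: name is required, the rest may be null.
interface Contact {
  name: string;
  title: string | null;
  company: string | null;
  email: string | null;
  phone: string | null;
}

const NULLABLE_FIELDS = ['title', 'company', 'email', 'phone'] as const;

// Type guard: returns true only if `value` has every field Contact requires,
// each with an acceptable type. Extra keys are not rejected in this sketch.
function isContact(value: unknown): value is Contact {
  if (typeof value !== 'object' || value === null || Array.isArray(value)) {
    return false;
  }
  const record = value as Record<string, unknown>;
  if (typeof record['name'] !== 'string') {
    return false;
  }
  // Each nullable field must be present and be either a string or null.
  return NULLABLE_FIELDS.every(
    (field) => typeof record[field] === 'string' || record[field] === null
  );
}

// Parse untrusted JSON text and narrow it to Contact, or throw.
function parseContact(jsonText: string): Contact {
  const data: unknown = JSON.parse(jsonText);
  if (!isContact(data)) {
    throw new Error('Response does not match the Contact shape');
  }
  return data;
}

const sample = parseContact(
  '{"name":"Sarah Chen","title":"Senior Engineer","company":"Acme Corp","email":"sarah.chen@acme.io","phone":null}'
);
console.log(sample.name); // Sarah Chen
```

A schema library replaces the hand-written guard with a declarative definition and gives more detailed error messages, which is why the exercise uses Zod rather than code like this.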
Your Task: Complete the extraction function with proper schema definition and Zod validation.
3. Try It Yourself
import OpenAI from 'openai';
import { z } from 'zod';

const openai = new OpenAI();

// TODO: Define Zod schema for runtime validation
const ContactSchema = z.object({
  // Add your fields here
});

type Contact = z.infer<typeof ContactSchema>;

// TODO: Define JSON Schema for the API
const jsonSchema = {
  // Add your schema here
};

async function extractContact(emailText: string): Promise<Contact> {
  // TODO: Implement extraction with structured output
  // 1. Call the API with response_format
  // 2. Parse and validate the response
  // 3. Return typed data
  throw new Error('Not implemented');
}

// Test cases
const testEmails = [
  `Thanks,\nJohn Smith\njohn@example.com`,
  `Best,\nMaria Garcia | CTO\nTechStart Inc.\nmaria@techstart.io | +1-555-999-0000`,
  `Regards\n\nNo contact info here`
];

for (const email of testEmails) {
  extractContact(email).then(console.log);
}
This TypeScript exercise requires local setup. Copy the code to your IDE to run.
4. Get Help (If Needed)
5. Check the Solution
import OpenAI from 'openai';
import { z } from 'zod';

const openai = new OpenAI();

// Zod schema for runtime validation
const ContactSchema = z.object({
  name: z.string(),
  title: z.string().nullable(),
  company: z.string().nullable(),
  email: z.string().nullable(),
  phone: z.string().nullable()
});

type Contact = z.infer<typeof ContactSchema>;

// JSON Schema for the API
const jsonSchema = {
  type: 'object',
  properties: {
    name: { type: 'string', description: 'Full name of the person' },
    title: { type: ['string', 'null'], description: 'Job title or role' },
    company: { type: ['string', 'null'], description: 'Company name' },
    email: { type: ['string', 'null'], description: 'Email address' },
    phone: { type: ['string', 'null'], description: 'Phone number' }
  },
  required: ['name', 'title', 'company', 'email', 'phone'],
  additionalProperties: false
};

async function extractContact(emailText: string): Promise<Contact> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-2024-08-06',
    messages: [
      {
        role: 'system',
        content: 'Extract contact information from the email signature. Return null for any fields that cannot be found.'
      },
      {
        role: 'user',
        content: emailText
      }
    ],
    response_format: {
      type: 'json_schema',
      json_schema: {
        name: 'contact_info',
        strict: true,
        schema: jsonSchema
      }
    }
  });
  const rawData = JSON.parse(response.choices[0].message.content!);
  // Validate with Zod for type safety
  return ContactSchema.parse(rawData);
}
// Test cases
const testEmails = [
  `Thanks,\nJohn Smith\njohn@example.com`,
  `Best,\nMaria Garcia | CTO\nTechStart Inc.\nmaria@techstart.io | +1-555-999-0000`,
  `Regards\n\nNo contact info here`
];

for (const email of testEmails) {
  extractContact(email)
    .then(contact => console.log('Extracted:', contact))
    // Catches API and JSON parsing errors as well as Zod validation errors
    .catch(err => console.error('Extraction failed:', err.message));
}

/* Example outputs (model output can vary between runs):
Extracted: { name: 'John Smith', title: null, company: null, email: 'john@example.com', phone: null }
Extracted: { name: 'Maria Garcia', title: 'CTO', company: 'TechStart Inc.', email: 'maria@techstart.io', phone: '+1-555-999-0000' }
Extracted: { name: 'Unknown', title: null, company: null, email: null, phone: null }
The schema requires name to be a string, so for the third input the model has to
supply some placeholder; the exact value is not guaranteed.
*/
Common Mistakes
Not including all properties in 'required' when using strict mode
Why it's wrong: In strict mode, the API requires all properties to be listed in 'required'. Optional fields should use `type: ['string', 'null']`.
How to fix: List all properties in 'required' and use union types with 'null' for optional fields.
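One way to catch this mistake before the API rejects the request is a small check over the schema object. The sketch below is illustrative only: `ObjectSchema` and `findStrictModeProblems` are hypothetical names, the check covers just the top-level object, and it also assumes strict mode expects `additionalProperties: false`, as the schemas in this lesson set it.

```typescript
// Minimal subset of JSON Schema needed for this check (hypothetical type).
interface ObjectSchema {
  type: 'object';
  properties: Record<string, { type: string | string[]; description?: string }>;
  required?: string[];
  additionalProperties?: boolean;
}

// Returns a list of human-readable problems. An empty list means every
// property is listed in `required` and additionalProperties is false.
function findStrictModeProblems(schema: ObjectSchema): string[] {
  const problems: string[] = [];
  const required = new Set(schema.required ?? []);
  for (const key of Object.keys(schema.properties)) {
    if (!required.has(key)) {
      problems.push(`"${key}" is missing from required`);
    }
  }
  if (schema.additionalProperties !== false) {
    problems.push('additionalProperties must be false');
  }
  return problems;
}

// A schema that treats phone as optional by leaving it out of `required`.
const looseSchema: ObjectSchema = {
  type: 'object',
  properties: {
    name: { type: 'string' },
    phone: { type: 'string' },
  },
  required: ['name'],
  additionalProperties: false,
};

// Reports one problem: "phone" is missing from required
console.log(findStrictModeProblems(looseSchema));
```

Running a check like this in a unit test keeps the schema and the strict-mode rule from drifting apart as fields are added.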
Mismatch between JSON Schema and Zod schema
Why it's wrong: If schemas don't match, validation may fail or allow invalid data through.
How to fix: Ensure both schemas expect the same structure. Use `.nullable()` in Zod for fields that can be null.
Forgetting to parse JSON from the response
Why it's wrong: The API returns a string, even with structured outputs. You must parse it to get an object.
How to fix: Always use `JSON.parse(response.choices[0].message.content!)` before validation.
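The `!` in that expression tells TypeScript to ignore the possibility that the content is null, and `JSON.parse` throws if the text is cut off. A slightly more defensive variant is sketched below. `ParseResult` and `parseModelContent` are hypothetical names, and the `string | null` parameter is an assumption about how the SDK types `message.content`.

```typescript
// Result type that forces callers to handle the failure case explicitly.
type ParseResult =
  | { ok: true; data: unknown }
  | { ok: false; reason: string };

// `content` is assumed to mirror message.content, which may be null.
function parseModelContent(content: string | null): ParseResult {
  if (content === null) {
    return { ok: false, reason: 'Response contained no content' };
  }
  try {
    return { ok: true, data: JSON.parse(content) };
  } catch (err) {
    // JSON.parse throws a SyntaxError on malformed or truncated input.
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, reason: `Invalid JSON: ${message}` };
  }
}

console.log(parseModelContent('{"name":"John Smith","email":null}')); // ok: true
console.log(parseModelContent(null)); // ok: false, no content
console.log(parseModelContent('{"name": "John')); // ok: false, invalid JSON
```

The `data` field is typed as `unknown` on purpose, so the result still has to pass through schema validation before it can be used as a typed contact.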
Test Cases

Extracts complete contact
All fields present in signature
Input:
Best,
Maria Garcia | CTO
TechStart Inc.
maria@techstart.io
Expected:
{ name: 'Maria Garcia', title: 'CTO', company: 'TechStart Inc.', email: 'maria@techstart.io' }

Handles minimal info
Only name and email available
Input:
Thanks,
John
john@example.com
Expected:
{ name: 'John', email: 'john@example.com', title: null, company: null }

Handles missing data gracefully
No clear contact info
Input:
Regards
Expected:
Returns object with null values for missing fields
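
The expected objects above list only some of the fields (none of them mentions phone), which suggests comparing just the listed fields. Under that assumption, a partial-match check could look like the sketch below. `ExtractedContact` and `matchesExpected` are hypothetical names, and the `actual` value is a hard-coded stand-in for a real extraction result so the example runs offline.

```typescript
interface ExtractedContact {
  name: string;
  title: string | null;
  company: string | null;
  email: string | null;
  phone: string | null;
}

// True when every field listed in `expected` has the same value in `actual`.
// Fields absent from `expected` are not checked.
function matchesExpected(
  actual: ExtractedContact,
  expected: Partial<ExtractedContact>
): boolean {
  return (Object.keys(expected) as (keyof ExtractedContact)[]).every(
    (key) => actual[key] === expected[key]
  );
}

// Stand-in for the result of extracting the first test signature.
const actual: ExtractedContact = {
  name: 'Maria Garcia',
  title: 'CTO',
  company: 'TechStart Inc.',
  email: 'maria@techstart.io',
  phone: null,
};

console.log(
  matchesExpected(actual, {
    name: 'Maria Garcia',
    title: 'CTO',
    company: 'TechStart Inc.',
    email: 'maria@techstart.io',
  })
); // true
console.log(matchesExpected(actual, { name: 'Maria Garcia', title: 'CEO' })); // false
```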