Structured Output Extraction
Learn to extract structured data from unstructured text using JSON Schema constraints, ensuring type-safe outputs from LLMs.
1. Understand the Scenario
You're building a system that extracts contact information from business emails. Instead of parsing free-form text, you'll use structured outputs to guarantee the data format.
Learning Objectives
- Understand how JSON Schema constrains LLM outputs
- Design schemas for real-world extraction tasks
- Handle optional vs required fields
- Validate extracted data with proper types
2. Follow the Instructions
What You'll Build
A contact extractor that takes messy email signatures and extracts structured contact data:
Input:
Best regards,
Sarah Chen | Senior Engineer
Acme Corp - sarah.chen@acme.io
Mobile: +1 (555) 123-4567
Output:
{
  "name": "Sarah Chen",
  "title": "Senior Engineer",
  "company": "Acme Corp",
  "email": "sarah.chen@acme.io",
  "phone": "+1 (555) 123-4567"
}
Step 1: Define Your Schema
The schema tells the model exactly what structure to produce. Use JSON Schema format with clear descriptions.
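// Define the expected output structure.
// Strict mode (used in Step 2) requires every property to be listed in `required`,
// so optional fields are expressed as nullable types instead of being left out.
const contactSchema = {
  type: 'object',
  properties: {
    name: {
      type: 'string',
      description: 'Full name of the person'
    },
    title: {
      type: ['string', 'null'],
      description: 'Job title or role'
    },
    company: {
      type: ['string', 'null'],
      description: 'Company or organization name'
    },
    email: {
      type: ['string', 'null'],
      description: 'Email address'
    },
    phone: {
      type: ['string', 'null'],
      description: 'Phone number in any format'
    }
  },
  required: ['name', 'title', 'company', 'email', 'phone'], // Strict mode: list every property
  additionalProperties: false
};
💡 TIP: Schema Design Tips
- Use additionalProperties: false to prevent extra fields
- Only name must always have a value; because strict mode needs every property in required, mark the other fields as nullable with type: ['string', 'null']
- Add descriptions to help the model understand context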
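The solution notes later in this exercise point out that strict mode expects every property to appear in required. A small pre-flight check can catch that mistake before any API call is made. The sketch below is only an illustration: `ObjectSchema` and `findStrictModeIssues` are hypothetical names introduced here, and the check covers just the two rules discussed in this exercise, not the full set of strict-mode constraints.

```typescript
// Minimal shape of the flat object schemas used in this exercise
// (an assumption for illustration, not the full JSON Schema spec).
interface ObjectSchema {
  type: 'object';
  properties: Record<string, { type: string | string[]; description?: string }>;
  required: string[];
  additionalProperties: boolean;
}

// Hypothetical helper: lists the ways a schema conflicts with the two
// strict-mode rules covered in this exercise.
function findStrictModeIssues(schema: ObjectSchema): string[] {
  const issues: string[] = [];
  if (schema.additionalProperties !== false) {
    issues.push('additionalProperties must be false');
  }
  for (const key of Object.keys(schema.properties)) {
    if (schema.required.indexOf(key) === -1) {
      issues.push(`property "${key}" is missing from required`);
    }
  }
  return issues;
}

// "phone" is optional here by omission from required, which strict mode rejects.
const looseSchema: ObjectSchema = {
  type: 'object',
  properties: {
    name: { type: 'string' },
    phone: { type: 'string' }
  },
  required: ['name'],
  additionalProperties: false
};

// Same fields, but "phone" is optional by being nullable instead.
const strictSchema: ObjectSchema = {
  type: 'object',
  properties: {
    name: { type: 'string' },
    phone: { type: ['string', 'null'] }
  },
  required: ['name', 'phone'],
  additionalProperties: false
};

const looseIssues = findStrictModeIssues(looseSchema);
const strictIssues = findStrictModeIssues(strictSchema);
```

Here `looseIssues` contains a single entry about phone, while `strictIssues` is empty.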
Step 2: Call the API with Response Format
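With OpenAI, set response_format to type 'json_schema' with strict: true to enforce JSON Schema compliance. The model's output is then constrained to valid JSON matching your schema, unless it refuses the request or the response is cut off by a token limit.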
import OpenAI from 'openai';

const openai = new OpenAI();

async function extractContact(emailText: string) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-2024-08-06',
    messages: [
      {
        role: 'system',
        content: 'Extract contact information from email signatures. Return null for fields not found.'
      },
      {
        role: 'user',
        content: emailText
      }
    ],
    response_format: {
      type: 'json_schema',
      json_schema: {
        name: 'contact_info',
        strict: true,
        schema: contactSchema
      }
    }
  });

  return JSON.parse(response.choices[0].message.content!);
}
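The call above reads message.content with a non-null assertion. As a hedged sketch of a more defensive version, the helper below checks for a refusal and for a truncated response before parsing. `ChoiceLike` and `parseStructuredContent` are hypothetical names, and the interface is a hand-written minimal subset of a chat completion choice so the example runs without the SDK; check the SDK types for your version before relying on the field names.

```typescript
// Minimal subset of a chat completion choice (an assumption made so this
// sketch runs standalone; the real SDK types carry more fields).
interface ChoiceLike {
  finish_reason: string;
  message: { content: string | null; refusal?: string | null };
}

// Hypothetical helper: returns the parsed JSON payload, or throws a
// descriptive error instead of relying on a non-null assertion.
function parseStructuredContent(choice: ChoiceLike): unknown {
  if (choice.message.refusal) {
    throw new Error(`Model refused: ${choice.message.refusal}`);
  }
  if (choice.finish_reason === 'length') {
    throw new Error('Response was truncated before the JSON was complete');
  }
  if (choice.message.content === null) {
    throw new Error('Response contained no content');
  }
  return JSON.parse(choice.message.content);
}

// Usage with a hand-written stand-in for an API response.
const sample = parseStructuredContent({
  finish_reason: 'stop',
  message: { content: '{"name":"Sarah Chen","phone":null}', refusal: null }
}) as { name: string; phone: string | null };
```

Inside extractContact, the last line could then become `return parseStructuredContent(response.choices[0]);`, assuming the SDK's choice object is compatible with this minimal shape.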
Step 3: Add Runtime Validation
Even with structured outputs, add runtime validation for defense in depth. Use Zod or similar for TypeScript-native validation.
Your Task: Complete the extraction function with proper schema definition and Zod validation.
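To make concrete what runtime validation checks, here is a dependency-free sketch of the "or similar" option: a hand-written type guard. The exercise itself uses Zod, which does the same job with less code. `ExtractedContact`, `isStringOrNull`, and `isExtractedContact` are hypothetical names introduced for this illustration only.

```typescript
// Shape the extractor is expected to return: name is always a string,
// every other field is a string or null.
interface ExtractedContact {
  name: string;
  title: string | null;
  company: string | null;
  email: string | null;
  phone: string | null;
}

// True for a string or an explicit null; false for undefined (missing key).
const isStringOrNull = (value: unknown): value is string | null =>
  value === null || typeof value === 'string';

// Hypothetical type guard: narrows unknown parsed JSON to ExtractedContact.
function isExtractedContact(value: unknown): value is ExtractedContact {
  if (typeof value !== 'object' || value === null) return false;
  const record = value as Record<string, unknown>;
  return (
    typeof record['name'] === 'string' &&
    isStringOrNull(record['title']) &&
    isStringOrNull(record['company']) &&
    isStringOrNull(record['email']) &&
    isStringOrNull(record['phone'])
  );
}

// Usage: parse first, then narrow before touching any fields.
const parsed: unknown = JSON.parse(
  '{"name":"John Smith","title":null,"company":null,"email":"john@example.com","phone":null}'
);
const validEmail: string | null = isExtractedContact(parsed) ? parsed.email : null;
```

A guard like this rejects objects with missing keys, which is the desired behavior here because the API schema lists every property as required.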
3. Try It Yourself
import OpenAI from 'openai';
import { z } from 'zod';

const openai = new OpenAI();

// TODO: Define Zod schema for runtime validation
const ContactSchema = z.object({
  // Add your fields here
});

type Contact = z.infer<typeof ContactSchema>;

// TODO: Define JSON Schema for the API
const jsonSchema = {
  // Add your schema here
};

async function extractContact(emailText: string): Promise<Contact> {
  // TODO: Implement extraction with structured output
  // 1. Call the API with response_format
  // 2. Parse and validate the response
  // 3. Return typed data
  throw new Error('Not implemented');
}

// Test cases
const testEmails = [
  `Thanks,\nJohn Smith\njohn@example.com`,
  `Best,\nMaria Garcia | CTO\nTechStart Inc.\nmaria@techstart.io | +1-555-999-0000`,
  `Regards\n\nNo contact info here`
];

for (const email of testEmails) {
  extractContact(email).then(console.log);
}
This TypeScript exercise requires local setup. Copy the code to your IDE to run.
4. Get Help (If Needed)
5. Check the Solution
/**
 * Key Points:
 * - ContactSchema: use .nullable() for optional fields that may be null
 * - jsonSchema.required: in strict mode, all properties must be listed in 'required'
 * - jsonSchema.properties: use ['string', 'null'] to allow null values in JSON Schema
 * - extractContact: always validate API responses at runtime
 */
import OpenAI from 'openai';
import { z } from 'zod';

const openai = new OpenAI();

// Zod schema for runtime validation
const ContactSchema = z.object({
  name: z.string(),
  title: z.string().nullable(),
  company: z.string().nullable(),
  email: z.string().nullable(),
  phone: z.string().nullable()
});

type Contact = z.infer<typeof ContactSchema>;

// JSON Schema for the API
const jsonSchema = {
  type: 'object',
  properties: {
    name: { type: 'string', description: 'Full name of the person' },
    title: { type: ['string', 'null'], description: 'Job title or role' },
    company: { type: ['string', 'null'], description: 'Company name' },
    email: { type: ['string', 'null'], description: 'Email address' },
    phone: { type: ['string', 'null'], description: 'Phone number' }
  },
  required: ['name', 'title', 'company', 'email', 'phone'],
  additionalProperties: false
};

async function extractContact(emailText: string): Promise<Contact> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-2024-08-06',
    messages: [
      {
        role: 'system',
        content: 'Extract contact information from the email signature. Return null for any fields that cannot be found.'
      },
      {
        role: 'user',
        content: emailText
      }
    ],
    response_format: {
      type: 'json_schema',
      json_schema: {
        name: 'contact_info',
        strict: true,
        schema: jsonSchema
      }
    }
  });

  const rawData = JSON.parse(response.choices[0].message.content!);

  // Validate with Zod for type safety
  return ContactSchema.parse(rawData);
}

// Test cases
const testEmails = [
  `Thanks,\nJohn Smith\njohn@example.com`,
  `Best,\nMaria Garcia | CTO\nTechStart Inc.\nmaria@techstart.io | +1-555-999-0000`,
  `Regards\n\nNo contact info here`
];

for (const email of testEmails) {
  extractContact(email)
    .then(contact => console.log('Extracted:', contact))
    .catch(err => console.error('Validation failed:', err.message));
}
/* Example outputs (exact values can vary between runs; because name is a required
   string, the model has to supply a placeholder such as 'Unknown' when no name is found):
Extracted: { name: 'John Smith', title: null, company: null, email: 'john@example.com', phone: null }
Extracted: { name: 'Maria Garcia', title: 'CTO', company: 'TechStart Inc.', email: 'maria@techstart.io', phone: '+1-555-999-0000' }
Extracted: { name: 'Unknown', title: null, company: null, email: null, phone: null }
*/
Common Mistakes
Not including all properties in 'required' when using strict mode
Why it's wrong: In strict mode, the API requires all properties to be listed in 'required'. Optional fields should use `type: ['string', 'null']`.
How to fix: List all properties in 'required' and use union types with 'null' for optional fields.
Mismatch between JSON Schema and Zod schema
Why it's wrong: If schemas don't match, validation may fail or allow invalid data through.
How to fix: Ensure both schemas expect the same structure. Use `.nullable()` in Zod for fields that can be null.
Forgetting to parse JSON from the response
Why it's wrong: The API returns a string, even with structured outputs. You must parse it to get an object.
How to fix: Always use `JSON.parse(response.choices[0].message.content!)` before validation.
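For the schema-mismatch mistake above, one possible approach is to declare the fields once and derive both the JSON Schema and the runtime check from that single list, so the two cannot drift apart. The sketch below is dependency-free and only illustrates the idea for flat string fields; `FieldSpec`, `contactFields`, `buildJsonSchema`, and `matchesFields` are hypothetical names, not part of any library.

```typescript
// One declaration per field: the single source of truth.
interface FieldSpec {
  key: string;
  nullable: boolean;
  description: string;
}

const contactFields: FieldSpec[] = [
  { key: 'name', nullable: false, description: 'Full name of the person' },
  { key: 'title', nullable: true, description: 'Job title or role' },
  { key: 'company', nullable: true, description: 'Company name' },
  { key: 'email', nullable: true, description: 'Email address' },
  { key: 'phone', nullable: true, description: 'Phone number' }
];

// Derives a strict-mode-friendly JSON Schema: every key is required,
// and nullable fields get the ['string', 'null'] union type.
function buildJsonSchema(fields: FieldSpec[]) {
  const properties: Record<string, { type: string | string[]; description: string }> = {};
  for (const field of fields) {
    properties[field.key] = {
      type: field.nullable ? ['string', 'null'] : 'string',
      description: field.description
    };
  }
  return {
    type: 'object',
    properties,
    required: fields.map((field) => field.key),
    additionalProperties: false
  };
}

// Derives the runtime check from the same list: exactly the declared keys,
// each a string, or null where the field is nullable.
function matchesFields(fields: FieldSpec[], value: unknown): boolean {
  if (typeof value !== 'object' || value === null) return false;
  const record = value as Record<string, unknown>;
  if (Object.keys(record).length !== fields.length) return false;
  return fields.every((field) => {
    const fieldValue = record[field.key];
    return typeof fieldValue === 'string' || (field.nullable && fieldValue === null);
  });
}

const generatedSchema = buildJsonSchema(contactFields);
```

If you would rather keep Zod as the source of truth, the OpenAI Node SDK also offers a zodResponseFormat helper (imported from openai/helpers/zod) that builds the response_format from a Zod schema; consult the SDK documentation for your version before using it.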
Test Cases
Extracts complete contact
All fields present in signature
Input:
Best,
Maria Garcia | CTO
TechStart Inc.
maria@techstart.io
Expected:
{ name: 'Maria Garcia', title: 'CTO', company: 'TechStart Inc.', email: 'maria@techstart.io' }
Handles minimal info
Only name and email available
Input:
Thanks,
John
john@example.com
Expected:
{ name: 'John', email: 'john@example.com', title: null, company: null }
Handles missing data gracefully
No clear contact info
Input:
Regards
Expected:
Returns object with null values for missing fields
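The expected objects in these test cases list only some of the fields (phone is never mentioned), so one plausible reading is that each test compares just the listed keys. The sketch below shows such a partial comparison. `matchesExpected` is a hypothetical helper, and `actualContact` is a hand-written stand-in for extractor output, not a recorded model response.

```typescript
// Hypothetical helper: true when every key listed in `expected` has the
// same value in `actual`; keys absent from `expected` are ignored.
function matchesExpected(
  actual: Record<string, unknown>,
  expected: Record<string, unknown>
): boolean {
  return Object.keys(expected).every((key) => actual[key] === expected[key]);
}

// Stand-in for what the extractor might return for the first test case.
const actualContact = {
  name: 'Maria Garcia',
  title: 'CTO',
  company: 'TechStart Inc.',
  email: 'maria@techstart.io',
  phone: null
};

// Expected object copied from the "Extracts complete contact" test case.
const expectedContact = {
  name: 'Maria Garcia',
  title: 'CTO',
  company: 'TechStart Inc.',
  email: 'maria@techstart.io'
};

const firstCasePasses = matchesExpected(actualContact, expectedContact);
```

Strict equality is enough here because every field is a string or null; nested objects would need a deep comparison instead.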