Structured Output Extraction

Build | Beginner | 25 min | TypeScript
Sources not yet verified

Learn to extract structured data from unstructured text using JSON Schema constraints, ensuring type-safe outputs from LLMs.

1. Understand the Scenario

You're building a system that extracts contact information from business emails. Instead of parsing free-form text, you'll use structured outputs to guarantee the data format.

Learning Objectives

  • Understand how JSON Schema constrains LLM outputs
  • Design schemas for real-world extraction tasks
  • Handle optional vs required fields
  • Validate extracted data with proper types

2. Follow the Instructions

What You'll Build

A contact extractor that takes messy email signatures and extracts structured contact data:

Input:

Best regards,
Sarah Chen | Senior Engineer
Acme Corp - sarah.chen@acme.io
Mobile: +1 (555) 123-4567

Output:

{
  "name": "Sarah Chen",
  "title": "Senior Engineer",
  "company": "Acme Corp",
  "email": "sarah.chen@acme.io",
  "phone": "+1 (555) 123-4567"
}

Step 1: Define Your Schema

The schema tells the model exactly what structure to produce. Use JSON Schema format with clear descriptions.

// Define the expected output structure
const contactSchema = {
  type: 'object',
  properties: {
    name: {
      type: 'string',
      description: 'Full name of the person'
    },
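    title: {
      type: ['string', 'null'],
      description: 'Job title or role'
    },
    company: {
      type: ['string', 'null'],
      description: 'Company or organization name'
    },
    email: {
      type: ['string', 'null'],
      description: 'Email address'
    },
    phone: {
      type: ['string', 'null'],
      description: 'Phone number in any format'
    }
  },
  // Strict mode (used in Step 2) requires every property to be listed here;
  // fields that may be missing are nullable instead of optional.
  required: ['name', 'title', 'company', 'email', 'phone'],
  additionalProperties: false
};

💡 TIP: Schema Design Tips

  1. Use additionalProperties: false to prevent extra fields (strict mode requires it)
  2. With strict: true, list every property in required and mark data that may be absent as nullable with type: ['string', 'null']
  3. Add descriptions to help the model understand context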

Step 2: Call the API with Response Format

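With OpenAI's Chat Completions API, set response_format to type 'json_schema' with strict: true to enforce your schema. A completed response then parses as JSON that matches the schema. A refusal, or a response cut off by the token limit, will not, so check for those cases before you parse.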

import OpenAI from 'openai';

const openai = new OpenAI();

async function extractContact(emailText: string) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-2024-08-06',
    messages: [
      {
        role: 'system',
        content: 'Extract contact information from email signatures. Return null for fields not found.'
      },
      {
        role: 'user',
        content: emailText
      }
    ],
    response_format: {
      type: 'json_schema',
      json_schema: {
        name: 'contact_info',
        strict: true,
        schema: contactSchema
      }
    }
  });

  return JSON.parse(response.choices[0].message.content!);
}
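
Before calling JSON.parse, it can help to rule out the two cases in which the content is not schema-conforming JSON. The sketch below assumes the choice object exposes finish_reason, message.content, and message.refusal, as Chat Completions responses do; ChoiceLike and parseStructuredContent are hypothetical names introduced here for illustration, not SDK exports.

```typescript
// Minimal shape of the fields this sketch reads from a chat completion choice.
// ChoiceLike and parseStructuredContent are illustrative names, not SDK exports.
interface ChoiceLike {
  finish_reason: string | null;
  message: { content: string | null; refusal?: string | null };
}

function parseStructuredContent(choice: ChoiceLike): unknown {
  // A refusal arrives in `refusal` instead of schema-conforming content.
  if (choice.message.refusal) {
    throw new Error(`Model refused: ${choice.message.refusal}`);
  }
  // 'length' means the output hit the token limit, so the JSON may be cut off.
  if (choice.finish_reason === 'length') {
    throw new Error('Response truncated before the JSON was complete');
  }
  if (choice.message.content === null) {
    throw new Error('Response contained no content');
  }
  // Returned as `unknown` so the caller still has to validate the shape.
  return JSON.parse(choice.message.content);
}
```

In extractContact, this would replace the direct JSON.parse call: pass response.choices[0] and hand the result to your runtime validator.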

Step 3: Add Runtime Validation

Even with structured outputs, add runtime validation for defense in depth. Use Zod or similar for TypeScript-native validation.
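
To make concrete what a runtime validator checks, here is a dependency-free sketch of the "or similar" route: a hand-written check that returns a result object instead of throwing. The Contact interface mirrors this tutorial's output shape; validateContact and ValidationResult are hypothetical names, and a schema library such as Zod would replace all of this in the exercise itself.

```typescript
// Contact mirrors the tutorial's output shape; validateContact is a
// hypothetical, dependency-free stand-in for a schema library such as Zod.
interface Contact {
  name: string;
  title: string | null;
  company: string | null;
  email: string | null;
  phone: string | null;
}

type ValidationResult =
  | { ok: true; data: Contact }
  | { ok: false; errors: string[] };

const NULLABLE_FIELDS = ['title', 'company', 'email', 'phone'] as const;

function validateContact(raw: unknown): ValidationResult {
  if (typeof raw !== 'object' || raw === null || Array.isArray(raw)) {
    return { ok: false, errors: ['expected a JSON object'] };
  }
  const record = raw as Record<string, unknown>;
  const errors: string[] = [];

  if (typeof record.name !== 'string') {
    errors.push('name: expected string');
  }
  // Nullable is not the same as optional: a missing key is also an error.
  for (const field of NULLABLE_FIELDS) {
    const value = record[field];
    if (value !== null && typeof value !== 'string') {
      errors.push(`${field}: expected string or null`);
    }
  }
  if (errors.length > 0) {
    return { ok: false, errors };
  }

  // The checks above guarantee these casts; unknown extra keys are dropped.
  return {
    ok: true,
    data: {
      name: record.name as string,
      title: record.title as string | null,
      company: record.company as string | null,
      email: record.email as string | null,
      phone: record.phone as string | null
    }
  };
}
```

With Zod, the comparable non-throwing call is safeParse on your schema, which returns an object with a success flag rather than raising on invalid data.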

Your Task: Complete the extraction function with proper schema definition and Zod validation.

3. Try It Yourself

starter_code.ts
import OpenAI from 'openai';
import { z } from 'zod';

const openai = new OpenAI();

// TODO: Define Zod schema for runtime validation
const ContactSchema = z.object({
  // Add your fields here
});

type Contact = z.infer<typeof ContactSchema>;

// TODO: Define JSON Schema for the API
const jsonSchema = {
  // Add your schema here
};

async function extractContact(emailText: string): Promise<Contact> {
  // TODO: Implement extraction with structured output
  // 1. Call the API with response_format
  // 2. Parse and validate the response
  // 3. Return typed data
  throw new Error('Not implemented');
}

// Test cases
const testEmails = [
  `Thanks,\nJohn Smith\njohn@example.com`,
  `Best,\nMaria Garcia | CTO\nTechStart Inc.\nmaria@techstart.io | +1-555-999-0000`,
  `Regards\n\nNo contact info here`
];

for (const email of testEmails) {
  extractContact(email).then(console.log);
}

This TypeScript exercise requires local setup. Copy the code into your IDE to run it.

4. Get Help (If Needed)

Reveal progressive hints
Hint 1: For strict mode, all properties must be in the 'required' array. Use `type: ['string', 'null']` to allow null values.
Hint 2: The Zod schema should use `.nullable()` for optional fields, matching the JSON Schema's null allowance.
Hint 3: Use `response_format: { type: 'json_schema', json_schema: { name: 'contact_info', strict: true, schema: ... } }` in the API call.

5. Check the Solution

Reveal the complete solution
solution.ts
/**
 * Key Points:
 * - Line ~8: Use .nullable() for optional fields that may be null
 * - Line ~20: In strict mode, all properties must be listed in 'required'
 * - Line ~21: Use ['string', 'null'] to allow null values in JSON Schema
 * - Line ~48: Always validate API responses at runtime
 */
import OpenAI from 'openai';
import { z } from 'zod';

const openai = new OpenAI();

// Zod schema for runtime validation
const ContactSchema = z.object({
  name: z.string(),
  title: z.string().nullable(),
  company: z.string().nullable(),
  email: z.string().nullable(),
  phone: z.string().nullable()
});

type Contact = z.infer<typeof ContactSchema>;

// JSON Schema for the API
const jsonSchema = {
  type: 'object',
  properties: {
    name: { type: 'string', description: 'Full name of the person' },
    title: { type: ['string', 'null'], description: 'Job title or role' },
    company: { type: ['string', 'null'], description: 'Company name' },
    email: { type: ['string', 'null'], description: 'Email address' },
    phone: { type: ['string', 'null'], description: 'Phone number' }
  },
  required: ['name', 'title', 'company', 'email', 'phone'],
  additionalProperties: false
};

async function extractContact(emailText: string): Promise<Contact> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-2024-08-06',
    messages: [
      {
        role: 'system',
        content: 'Extract contact information from the email signature. Return null for any fields that cannot be found.'
      },
      {
        role: 'user',
        content: emailText
      }
    ],
    response_format: {
      type: 'json_schema',
      json_schema: {
        name: 'contact_info',
        strict: true,
        schema: jsonSchema
      }
    }
  });

  const rawData = JSON.parse(response.choices[0].message.content!);
  
  // Validate with Zod for type safety
  return ContactSchema.parse(rawData);
}

// Test cases
const testEmails = [
  `Thanks,\nJohn Smith\njohn@example.com`,
  `Best,\nMaria Garcia | CTO\nTechStart Inc.\nmaria@techstart.io | +1-555-999-0000`,
  `Regards\n\nNo contact info here`
];

for (const email of testEmails) {
  extractContact(email)
    .then(contact => console.log('Extracted:', contact))
    // API and JSON parsing errors land here too, not only Zod validation errors
    .catch(err => console.error('Extraction failed:', err.message));
}

/* Typical outputs (exact values can vary between runs):
Extracted: { name: 'John Smith', title: null, company: null, email: 'john@example.com', phone: null }
Extracted: { name: 'Maria Garcia', title: 'CTO', company: 'TechStart Inc.', email: 'maria@techstart.io', phone: '+1-555-999-0000' }
Extracted: { name: 'Unknown', title: null, company: null, email: null, phone: null }
For the third input, name is a required string, so the model has to emit some
placeholder; 'Unknown' is one possibility.
*/

Common Mistakes

Not including all properties in 'required' when using strict mode

Why it's wrong: In strict mode, the API requires all properties to be listed in 'required'. Optional fields should use `type: ['string', 'null']`.

How to fix: List all properties in 'required' and use union types with 'null' for optional fields.
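
The fix can also be applied mechanically. The following is a minimal sketch, assuming a flat object schema like the one in this tutorial; toStrictSchema and the accompanying types are hypothetical helpers written for illustration, not part of the OpenAI SDK.

```typescript
// Illustrative types for a flat object schema; not part of any SDK.
type JsonType = 'string' | 'number' | 'integer' | 'boolean' | 'null';

interface PropertySchema {
  type: JsonType | JsonType[];
  description?: string;
}

interface ObjectSchema {
  type: 'object';
  properties: Record<string, PropertySchema>;
  required?: string[];
  additionalProperties?: boolean;
}

// Hypothetical helper: list every property in `required` and make the
// previously optional ones nullable, as the fix above describes.
function toStrictSchema(schema: ObjectSchema): ObjectSchema {
  const alreadyRequired = schema.required ?? [];
  const properties: Record<string, PropertySchema> = {};

  for (const [key, prop] of Object.entries(schema.properties)) {
    const types: JsonType[] = Array.isArray(prop.type) ? prop.type : [prop.type];
    const isOptional = !alreadyRequired.includes(key);
    if (isOptional && !types.includes('null')) {
      // Optional field: keep its types and add 'null' to the union.
      const withNull: JsonType[] = [...types, 'null'];
      properties[key] = { ...prop, type: withNull };
    } else {
      // Already required (or already nullable): copy unchanged.
      properties[key] = { ...prop };
    }
  }

  return {
    type: 'object',
    properties,
    required: Object.keys(schema.properties),
    additionalProperties: false
  };
}
```

Fields that were already required keep their original type, so name stays a plain string while title, company, email, and phone become string-or-null.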

Mismatch between JSON Schema and Zod schema

Why it's wrong: If schemas don't match, validation may fail or allow invalid data through.

How to fix: Ensure both schemas expect the same structure. Use `.nullable()` in Zod for fields that can be null.
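
One cheap guard against this kind of drift is to compare the property names of the two definitions at startup or in a unit test. The sketch below works on plain key lists, so it does not depend on any validation library; findKeyDrift and KeyDrift are hypothetical names. It only catches missing or extra keys, not type differences.

```typescript
// Hypothetical helper: report property names that appear in only one of the
// two schema definitions. Inputs are plain key lists, so it works with any
// validation library.
interface KeyDrift {
  missingFromValidator: string[];
  missingFromJsonSchema: string[];
}

function findKeyDrift(jsonSchemaKeys: string[], validatorKeys: string[]): KeyDrift {
  return {
    missingFromValidator: jsonSchemaKeys.filter(key => !validatorKeys.includes(key)),
    missingFromJsonSchema: validatorKeys.filter(key => !jsonSchemaKeys.includes(key))
  };
}

// Example: the runtime validator forgot the phone field.
const drift = findKeyDrift(
  ['name', 'title', 'company', 'email', 'phone'],
  ['name', 'title', 'company', 'email']
);
console.log(drift);
```

For a JSON Schema object and a Zod object schema, the two lists could come from Object.keys of the JSON Schema's properties and Object.keys of the Zod schema's shape.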

Forgetting to parse JSON from the response

Why it's wrong: The API returns a string, even with structured outputs. You must parse it to get an object.

How to fix: Always use `JSON.parse(response.choices[0].message.content!)` before validation.

Test Cases

Extracts complete contact

All fields present in signature

Input: Best, Maria Garcia | CTO TechStart Inc. maria@techstart.io | +1-555-999-0000
Expected: { name: 'Maria Garcia', title: 'CTO', company: 'TechStart Inc.', email: 'maria@techstart.io', phone: '+1-555-999-0000' }

Handles minimal info

Only name and email available

Input: Thanks, John john@example.com
Expected: { name: 'John', email: 'john@example.com', title: null, company: null, phone: null }

Handles missing data gracefully

No clear contact info

Input: Regards
Expected: Returns an object with null for title, company, email, and phone. Because name is a required string, the model still has to supply some value for it.
