
Pydantic for Type-Safe AI


Pydantic brings runtime validation and type safety to Python AI applications, automatically converting JSON Schema to validated Python objects with IDE autocomplete.

Pydantic is a Python library for data validation using type annotations. In AI development, it bridges the gap between LLM outputs (JSON) and your Python code (typed objects), providing automatic validation, serialization, and IDE support.

The key insight: Define your data model once as a Python class, and Pydantic handles JSON Schema generation, validation, parsing, and type safety automatically.

Why Pydantic for AI?

Without Pydantic:

# Manual parsing, no validation, no types
response = client.chat.completions.create(...)
data = json.loads(response.choices[0].message.content)
name = data['name']  # Hope this exists
email = data['email']  # Hope it's a valid email

With Pydantic:

from pydantic import BaseModel, EmailStr

class Person(BaseModel):
    name: str
    email: EmailStr  # Validates email format (requires the email-validator extra: pip install "pydantic[email]")
    age: int | None = None

# Automatic validation + IDE autocomplete
person = Person.model_validate_json(response.choices[0].message.content)
print(person.name)  # Type-safe, guaranteed to exist
print(person.email)  # Guaranteed valid email
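
When validation fails, Pydantic raises a ValidationError that pinpoints each offending field, so bad LLM output fails loudly instead of propagating. A minimal sketch using the Person model above:

from pydantic import ValidationError

try:
    person = Person.model_validate_json('{"name": "Ada", "email": "not-an-email"}')
except ValidationError as e:
    for err in e.errors():  # one entry per failing field
        print(err["loc"], err["msg"])  # ('email',) value is not a valid email address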
pydantic_openai.py
from pydantic import BaseModel, Field, EmailStr, field_validator
from openai import OpenAI
import json

# Define your output structure as a Pydantic model
class Skill(BaseModel):
    name: str
    years_experience: int = Field(ge=0, description="Years of experience")
    proficiency: str = Field(pattern="^(beginner|intermediate|expert)$")

class Resume(BaseModel):
    name: str = Field(min_length=1)
    email: EmailStr
    skills: list[Skill] = Field(min_length=1)
    summary: str | None = None
    
    @field_validator('skills')  # Pydantic v2 API; @validator is the deprecated v1 form
    @classmethod
    def validate_skills(cls, v: list[Skill]) -> list[Skill]:
        if len(v) > 20:
            raise ValueError('Maximum 20 skills allowed')
        return v

# Pydantic automatically generates JSON Schema
schema = Resume.model_json_schema()
print(json.dumps(schema, indent=2))
# {
#   "type": "object",
#   "required": ["name", "email", "skills"],
#   "properties": {
#     "name": {"type": "string", "minLength": 1},
#     "email": {"type": "string", "format": "email"},
#     "skills": {
#       "type": "array",
#       "items": {...},
#       "minItems": 1
#     }
#   }
# }

# Use the schema with OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Extract resume info from: John Doe, john@example.com, 5 years Python..."}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "resume",
            "schema": schema,
            "strict": True
        }
    }
)

# Parse and validate in one step
resume = Resume.model_validate_json(response.choices[0].message.content)

# Type-safe access with IDE autocomplete
print(resume.name)  # str
print(resume.email)  # EmailStr (validated)
for skill in resume.skills:  # list[Skill]
    print(f"{skill.name}: {skill.proficiency}")  # IDE knows these fields exist
  • Field() adds validation rules and descriptions
  • Custom validators capture business logic
  • JSON Schema is auto-generated from the model
  • Parsing and validation happen in one call
  • Full type safety and autocomplete throughout
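
Newer versions of the OpenAI Python SDK (roughly 1.40+) can also take the Pydantic model directly through the beta parse helper, which builds a strict-compatible schema for you and hands back an already-validated object:

completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Extract resume info from: John Doe, john@example.com, 5 years Python..."}
    ],
    response_format=Resume,  # pass the Pydantic model itself
)
resume = completion.choices[0].message.parsed  # a validated Resume instance (or None on refusal)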

Pydantic with Instructor Library

The Instructor library makes Pydantic + LLMs even easier:

instructor_example.py
import instructor
from pydantic import BaseModel
from openai import OpenAI

# Patch OpenAI client to use Instructor
client = instructor.from_openai(OpenAI())

class Person(BaseModel):
    name: str
    age: int
    email: str

# Instructor handles schema generation + validation automatically
person = client.chat.completions.create(
    model="gpt-4o",
    response_model=Person,  # Just pass the Pydantic model
    messages=[
        {"role": "user", "content": "Extract: John is 30 years old, email john@example.com"}
    ]
)

print(person.name)  # "John"
print(person.age)   # 30
print(type(person)) # <class '__main__.Person'>
  • response_model handles schema generation and validation automatically
  • Returns a typed Pydantic object, not a JSON string

Pydantic Features for AI

1. Validation

  • Type checking: str, int, float, bool, list, dict
  • Format validation: EmailStr, HttpUrl, UUID, datetime
  • Constraints: Field(ge=0, le=100, min_length=1, pattern=r'^[A-Z]')
  • Custom validators: @field_validator decorator for business logic (see the sketch after this list)
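
A minimal sketch combining these, using standard Pydantic v2 APIs (the Article model and its fields are illustrative):

from datetime import datetime
from pydantic import BaseModel, Field, HttpUrl, field_validator

class Article(BaseModel):
    title: str = Field(min_length=1, max_length=200)
    url: HttpUrl                       # must parse as a valid URL
    score: int = Field(ge=0, le=100)   # numeric range constraint
    published: datetime                # accepts ISO 8601 strings

    @field_validator("title")
    @classmethod
    def no_trailing_bang(cls, v: str) -> str:
        # custom business-logic check beyond type constraints
        if v.endswith("!!!"):
            raise ValueError("title looks like clickbait")
        return v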

2. Nested Models

class Address(BaseModel):
    street: str
    city: str
    country: str

class Company(BaseModel):
    name: str
    address: Address  # Nested validation
    employees: list[Person]  # List of validated objects
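
Validation recurses through the whole tree, and errors report the full path to the failing field. A small sketch reusing the models above:

from pydantic import ValidationError

try:
    Company.model_validate({
        "name": "Acme",
        "address": {"street": "1 Main St", "city": "Springfield"},  # country missing
        "employees": [],
    })
except ValidationError as e:
    print(e.errors()[0]["loc"])  # ('address', 'country'): the path into the nested model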

3. Optional Fields & Defaults

class Config(BaseModel):
    required_field: str
    optional_field: str | None = None
    with_default: int = 10
    from_function: str = Field(default_factory=lambda: "generated")

4. Discriminated Unions

from typing import Annotated, Literal, Union
from pydantic import BaseModel, Field

class ToolCall(BaseModel):
    type: Literal["function"]
    function_name: str

class TextResponse(BaseModel):
    type: Literal["text"]
    content: str

# Tagging the union with a discriminator makes Pydantic dispatch on the
# 'type' field directly, with faster validation and clearer error messages
Response = Annotated[Union[ToolCall, TextResponse], Field(discriminator="type")]
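
Because Response is an annotated union rather than a BaseModel subclass, raw data is validated against it through a TypeAdapter (standard Pydantic v2):

from pydantic import TypeAdapter

adapter = TypeAdapter(Response)
msg = adapter.validate_python({"type": "text", "content": "Hello"})
print(type(msg).__name__)  # TextResponse, selected via the 'type' discriminator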

Pydantic vs Manual JSON Parsing

Aspect            | Manual JSON          | Pydantic
------------------|----------------------|---------------------------------
Type safety       | None                 | Full typing + IDE support
Validation        | Manual checks        | Automatic
Schema generation | Written by hand      | Auto-generated from the model
Error messages    | Generic              | Detailed field-level errors
Refactoring       | Find/replace strings | Type checker catches issues
Documentation     | Separate docs        | Self-documenting types

When to Use Pydantic

Use Pydantic when                     | Consider alternatives when
--------------------------------------|----------------------------------------
Building production Python apps       | Quick scripting/prototypes
You need type safety + validation     | Output is simple key-value data
Working with OpenAI/Anthropic APIs    | Using JavaScript/TypeScript (use Zod)
Data flows through multiple functions | One-off parsing
You want IDE autocomplete             | Performance is absolutely critical

Common Patterns

Streaming with Partial Models

import instructor
from openai import OpenAI
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())

class Analysis(BaseModel):
    summary: str
    entities: list[str]
    sentiment: str

# Stream partial results as they arrive; create_partial handles streaming
# internally, so no stream=True is needed
for partial in client.chat.completions.create_partial(
    model="gpt-4o",
    response_model=Analysis,
    messages=[...],
):
    print(partial.summary)  # fills in as tokens arrive; may be None at first

Retry with Validation

import instructor
from tenacity import retry, stop_after_attempt

# Reuses the instructor-patched client and Person model defined earlier
@retry(stop=stop_after_attempt(3))
def extract_with_retry(text: str) -> Person:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=Person,
        messages=[{"role": "user", "content": text}]
    )

# Automatically retries if validation fails
person = extract_with_retry("Extract: invalid data...")
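
Instructor also ships its own retry loop: the patched create accepts a max_retries argument and re-sends the validation error to the model on each attempt, so the tenacity wrapper above is one option rather than a requirement:

person = client.chat.completions.create(
    model="gpt-4o",
    response_model=Person,
    max_retries=3,  # on validation failure, re-ask the model with the error message
    messages=[{"role": "user", "content": "Extract: John is 30, john@example.com"}],
)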

Key Takeaways

  • Pydantic converts type annotations into runtime validation
  • Auto-generates JSON Schema from Python classes
  • Integrates seamlessly with OpenAI structured outputs
  • Provides IDE autocomplete and type checking
  • Use Instructor library for simplified LLM integration
  • Pydantic v2 is 4-50x faster than v1, thanks to its Rust validation core (pydantic-core)

In This Platform

While this platform uses JSON Schema directly (JavaScript/Node.js), the same validation principles apply. If we were building in Python, every content type (Dimension, Question, Source, Concept) would be a Pydantic model, giving us type safety and validation at runtime.

Relevant Files:
  • schema/survey.schema.json
  • schema/concept.schema.json
platform_pydantic.py (hypothetical)
# Hypothetical: Python version of this platform
import json
from typing import Literal

from pydantic import BaseModel, Field

class SourceReference(BaseModel):
    id: str
    claim: str
    quote: str | None = None
    page: str | None = None

class QuestionOption(BaseModel):
    text: str
    score: int = Field(ge=0)
    sources: list[SourceReference] = []

class Question(BaseModel):
    id: str
    text: str
    type: Literal["single_choice", "multi_select", "likert"]
    options: list[QuestionOption] = Field(min_length=1)
    max_score: int
    sources: list[SourceReference] = []

# Type-safe loading
with open('dimensions/adoption.json') as f:
    dimension_data = json.load(f)
    questions = [Question(**q) for q in dimension_data['questions']]

# IDE knows the structure
for q in questions:
    print(f"{q.id}: {q.type}")  # Autocomplete works
    print(q.options[0].score)   # Type-checked
