AI Cost Optimization
Strategies to reduce AI infrastructure costs by 50-90% through prompt caching, batch APIs, model tiering, and context pruning.
AI costs explode in production. A simple agentic workflow can cost $5+ per run. Without optimization, teams burn through budgets or get caught by surprise bills. The good news: strategic optimization can cut costs by 50-90% without sacrificing quality.
The 100x Rule: Claude Opus 4.5 costs ~100x more than Haiku for the same tokens. GPT-4o costs ~50x more than GPT-4o-mini. Most tasks don't need the expensive model. Route wisely.
Strategy 1: Prompt Caching
Place static context (system prompts, documentation, examples) at the START of prompts. Both Anthropic and OpenAI cache prompt prefixes (OpenAI automatically for prefixes over 1,024 tokens, Anthropic via explicit cache_control breakpoints), so you only pay full price for the tokens that change.
✅ GOOD (cacheable):
[SYSTEM PROMPT - 1000 tokens - CACHED]
[EXAMPLES - 500 tokens - CACHED]
[User's specific question - 50 tokens - NEW]
❌ BAD (not cacheable):
[User's specific question - 50 tokens]
[SYSTEM PROMPT - 1000 tokens]
[EXAMPLES - 500 tokens]
Anthropic bills cache reads at roughly 10% of the base input price (cache writes cost 25% more than a normal input token), so the cached portion of a prompt can get up to 90% cheaper.
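A minimal sketch using Anthropic's TypeScript SDK. The SYSTEM_PROMPT_AND_EXAMPLES constant and userQuestion variable are assumed inputs; the cache_control breakpoint marks the end of the reusable prefix.

import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await anthropic.messages.create({
  model: 'claude-3-5-sonnet-latest',
  max_tokens: 1024,
  // Static prefix: identical on every call, so after the first request
  // writes it to the cache, later requests read it at the discounted rate.
  system: [
    {
      type: 'text',
      text: SYSTEM_PROMPT_AND_EXAMPLES, // assumed ~1,500-token constant
      cache_control: { type: 'ephemeral' },
    },
  ],
  // Only the user's question changes, so only these tokens bill at full price.
  messages: [{ role: 'user', content: userQuestion }],
});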
Strategy 2: Batch APIs
For non-urgent tasks (nightly reports, bulk processing, async evaluation), use batch endpoints. OpenAI and Anthropic offer 50% discounts for batch requests with 24-hour SLAs.
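As a sketch, here is what a nightly bulk job might look like with Anthropic's Message Batches API. The documents array is an assumed input; OpenAI's batch endpoint works similarly via an uploaded JSONL file.

import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

// Submit every document as one batch; each request gets the 50% batch rate.
const batch = await anthropic.messages.batches.create({
  requests: documents.map((doc, i) => ({
    custom_id: `summary-${i}`, // used to match results back to inputs
    params: {
      model: 'claude-3-5-haiku-latest',
      max_tokens: 512,
      messages: [{ role: 'user', content: `Summarize:\n\n${doc}` }],
    },
  })),
});

// Poll later; results arrive within the 24-hour window.
const updated = await anthropic.messages.batches.retrieve(batch.id);
console.log(updated.processing_status); // e.g. 'in_progress' or 'ended'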
Strategy 3: Model Tiering
Use expensive models (Opus, GPT-4o) only for hard tasks. Use cheap models (Haiku, GPT-4o-mini) for everything else.
// Model identifiers here are illustrative tier labels, not exact API IDs.
type Model = 'claude-3-5-sonnet' | 'gpt-4o-mini' | 'claude-3-haiku';

interface Task {
  complexity: 'low' | 'medium' | 'high';
  requiresReasoning: boolean;
}

function selectModel(task: Task): Model {
  // Tier 1: hard reasoning, complex code
  if (task.complexity === 'high' || task.requiresReasoning) {
    return 'claude-3-5-sonnet'; // or Opus for the hardest tasks
  }
  // Tier 2: standard code, simple analysis
  if (task.complexity === 'medium') {
    return 'gpt-4o-mini';
  }
  // Tier 3: classification, simple extraction
  return 'claude-3-haiku'; // ~100x cheaper than Opus
}
Strategy 4: Context Pruning
Input tokens are billed on every request, so large contexts get expensive fast. Don't send entire files when excerpts suffice.
- Summarize history: Compress conversation history to key points
- Chunk retrieval: For RAG, only retrieve top-k most relevant chunks
- Trim examples: Use 2-3 examples instead of 10
- Remove redundancy: Don't repeat information in system prompt AND user message
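For the first item, a sketch of history compression. The summarize() helper is hypothetical, e.g. a single cheap Haiku call that condenses older turns into a few bullet points.

interface ChatMessage {
  role: 'user' | 'assistant';
  content: string;
}

// Hypothetical helper: compresses older turns into a short summary.
declare function summarize(messages: ChatMessage[]): Promise<string>;

// Keep the most recent turns verbatim; replace everything older with a summary.
async function pruneHistory(
  history: ChatMessage[],
  keepRecent = 6,
): Promise<ChatMessage[]> {
  if (history.length <= keepRecent) return history;
  const summary = await summarize(history.slice(0, -keepRecent));
  return [
    { role: 'assistant', content: `Summary of earlier conversation: ${summary}` },
    ...history.slice(-keepRecent),
  ];
}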
Strategy 5: Cost Attribution
Tag requests to understand WHERE costs come from. Without attribution, you can't optimize.
// Anthropic's metadata object currently only supports user_id, so record
// richer tags (feature, team) in your own usage log alongside each call.
const response = await anthropic.messages.create({
  model: 'claude-3-5-sonnet-latest',
  max_tokens: 1024,
  metadata: { user_id: user.id },
  // ...
});

// logUsage is your own telemetry hook, not part of the SDK.
logUsage({
  user_id: user.id,
  feature: 'code_review',
  team: 'platform',
  inputTokens: response.usage.input_tokens,
  outputTokens: response.usage.output_tokens,
});
Then you can answer questions like 'Which features cost the most?' and 'Which users are the heaviest spenders?'
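A sketch of the aggregation, assuming usage rows logged by the hypothetical logUsage() above and an assumed Sonnet-class price of $3/$15 per million input/output tokens.

interface UsageRow {
  feature: string;
  inputTokens: number;
  outputTokens: number;
}

const PRICE_PER_MTOK = { input: 3.0, output: 15.0 }; // assumed USD per million tokens

function costByFeature(rows: UsageRow[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const row of rows) {
    const cost =
      (row.inputTokens / 1e6) * PRICE_PER_MTOK.input +
      (row.outputTokens / 1e6) * PRICE_PER_MTOK.output;
    totals.set(row.feature, (totals.get(row.feature) ?? 0) + cost);
  }
  return totals;
}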
Key Takeaways
- AI costs explode in production - a single agentic workflow can cost $5+ per run
- The 100x Rule: Opus costs ~100x more than Haiku for the same tokens
- Prompt caching: put static content FIRST, save up to 90% on cached portion
- Batch APIs: 50% discount for 24-hour SLA (use for async/nightly tasks)
- Model tiering: expensive models for hard tasks, cheap models for everything else
- Context pruning: summarize history, limit retrieval, trim examples
- Cost attribution: tag requests by feature/user/team to find waste
- Monitor continuously - costs sneak up without visibility
In This Platform
This platform demonstrates cost-awareness: we use build-time validation (zero runtime AI cost), precomputed content (no per-user generation), and static site deployment (no inference at request time). The architecture choice itself is the ultimate cost optimization.
- build.js
- frontend/src/lib/recommendations-core.ts