
AI Cost Optimization

Patterns · Intermediate · 10 min
Sources verified Dec 23

Strategies to reduce AI infrastructure costs by 50-90% through prompt caching, batch APIs, model tiering, and context pruning.

AI costs explode in production. A simple agentic workflow can cost $5+ per run. Without optimization, teams burn through budgets or get caught by surprise bills. The good news: strategic optimization can cut costs by 50-90% without sacrificing quality.

The 100x Rule: a provider's top model (Opus, GPT-4o) can cost one to two orders of magnitude more per token than its smallest tier (Haiku, GPT-4o-mini); exact ratios shift with every pricing update, so check the current pricing page. Most tasks don't need the expensive model. Route wisely.

Strategy 1: Prompt Caching

Place static context (system prompts, documentation, examples) at the START of prompts. Both Anthropic and OpenAI cache prompt prefixes, so you only pay full price for the tokens that change: OpenAI applies prefix caching automatically to long prompts, while Anthropic requires explicit cache_control breakpoints.

✅ GOOD (cacheable):
[SYSTEM PROMPT - 1000 tokens - CACHED]
[EXAMPLES - 500 tokens - CACHED]
[User's specific question - 50 tokens - NEW]

❌ BAD (not cacheable):
[User's specific question - 50 tokens]
[SYSTEM PROMPT - 1000 tokens]
[EXAMPLES - 500 tokens]

Anthropic's prompt caching cuts the price of cached tokens by 90% on cache reads (cache writes cost somewhat more than base input, so the savings kick in on reuse).
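
A minimal sketch of explicit caching with Anthropic's TypeScript SDK (the model ID and prompt contents are placeholders; check the current docs for cache TTLs and minimum cacheable prompt lengths):

import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const STATIC_PREFIX = '...'; // 1000+ tokens of stable instructions and examples

async function answer(userQuestion: string) {
  return anthropic.messages.create({
    model: 'claude-3-5-sonnet-latest', // placeholder model ID
    max_tokens: 1024,
    system: [
      {
        type: 'text',
        text: STATIC_PREFIX,
        // Everything up to this breakpoint is cached; subsequent calls with
        // the same prefix pay the discounted cache-read rate.
        cache_control: { type: 'ephemeral' },
      },
    ],
    // Only the changing user question is billed at the full input rate.
    messages: [{ role: 'user', content: userQuestion }],
  });
}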

Strategy 2: Batch APIs

For non-urgent tasks (nightly reports, bulk processing, async evaluation), use batch endpoints. OpenAI and Anthropic offer 50% discounts for batch requests with 24-hour SLAs.
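
A hedged sketch using Anthropic's Message Batches endpoint via the TypeScript SDK (model ID and prompts are placeholders; polling for completion and fetching results happen in a separate step):

import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();
const documents = ['first document text...', 'second document text...'];

// Queue independent requests in one batch: results arrive within 24 hours
// at roughly half the per-token price.
const batch = await anthropic.messages.batches.create({
  requests: documents.map((doc, i) => ({
    custom_id: `doc-${i}`, // your key for matching results back to inputs
    params: {
      model: 'claude-3-haiku-20240307', // placeholder model ID
      max_tokens: 512,
      messages: [{ role: 'user', content: `Summarize:\n\n${doc}` }],
    },
  })),
});

console.log(batch.id, batch.processing_status); // poll until processing ends, then fetch results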

Strategy 3: Model Tiering

Use expensive models (Opus, GPT-4o) only for hard tasks. Use cheap models (Haiku, GPT-4o-mini) for everything else.

type Task = {
  complexity: 'low' | 'medium' | 'high';
  requiresReasoning: boolean;
};

type Model = 'claude-3-5-sonnet' | 'gpt-4o-mini' | 'claude-3-haiku';

function selectModel(task: Task): Model {
  // Tier 1: hard reasoning, complex code
  if (task.complexity === 'high' || task.requiresReasoning) {
    return 'claude-3-5-sonnet'; // or Opus for the hardest tasks
  }

  // Tier 2: standard code, simple analysis
  if (task.complexity === 'medium') {
    return 'gpt-4o-mini';
  }

  // Tier 3: classification, simple extraction
  return 'claude-3-haiku'; // cheapest tier; an order of magnitude+ below Opus
}
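
Example routing calls (the Task shape above is illustrative; in practice the complexity signal might come from a heuristic or a cheap classifier):

selectModel({ complexity: 'high', requiresReasoning: true });    // 'claude-3-5-sonnet'
selectModel({ complexity: 'medium', requiresReasoning: false }); // 'gpt-4o-mini'
selectModel({ complexity: 'low', requiresReasoning: false });    // 'claude-3-haiku'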

Strategy 4: Context Pruning

Large context windows are expensive. Don't send entire files when excerpts suffice.

  • Summarize history: Compress conversation history to key points (sketched after this list)
  • Chunk retrieval: For RAG, only retrieve top-k most relevant chunks
  • Trim examples: Use 2-3 examples instead of 10
  • Remove redundancy: Don't repeat information in system prompt AND user message
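
A minimal sketch of the history-summarization tactic, assuming chat-style messages and a hypothetical summarize() helper backed by a cheap model:

type ChatMessage = { role: 'user' | 'assistant'; content: string };

// Hypothetical helper: a Haiku/GPT-4o-mini call that condenses text.
declare function summarize(text: string): Promise<string>;

// Keep the most recent turns verbatim; collapse everything older into one
// short summary so long conversations stop growing linearly in tokens.
async function pruneHistory(
  history: ChatMessage[],
  keepRecent = 6,
): Promise<ChatMessage[]> {
  if (history.length <= keepRecent) return history;

  const older = history.slice(0, history.length - keepRecent);
  const recent = history.slice(-keepRecent);

  const summary = await summarize(
    older.map((m) => `${m.role}: ${m.content}`).join('\n'),
  );

  return [
    { role: 'assistant', content: `Summary of earlier conversation: ${summary}` },
    ...recent,
  ];
}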

Strategy 5: Cost Attribution

Tag requests to understand WHERE costs come from. Without attribution, you can't optimize.

const response = await anthropic.messages.create({
  model: 'claude-3-5-sonnet-latest', // placeholder model ID
  // Anthropic's metadata field currently documents only user_id;
  // carry richer tags (feature, team) in your own telemetry instead.
  metadata: {
    user_id: user.id
  },
  // ...
});

Then query your telemetry: which features cost the most? Which users are the heaviest spenders?
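
A sketch of that telemetry side: record the token usage the API returns, price it yourself, and aggregate by tag (the per-million-token prices below are illustrative, not current list prices):

type UsageRecord = {
  feature: string;
  team: string;
  usage: { input_tokens: number; output_tokens: number };
};

// Illustrative prices per million tokens; keep these in config and update
// them from your provider's pricing page.
const PRICE = { inputPerM: 3.0, outputPerM: 15.0 };

function costUSD(r: UsageRecord): number {
  return (
    (r.usage.input_tokens / 1e6) * PRICE.inputPerM +
    (r.usage.output_tokens / 1e6) * PRICE.outputPerM
  );
}

// Aggregate spend by feature to see where the money actually goes.
function spendByFeature(records: UsageRecord[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const r of records) {
    totals.set(r.feature, (totals.get(r.feature) ?? 0) + costUSD(r));
  }
  return totals;
}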

Key Takeaways

  • AI costs explode in production - a single agentic workflow can cost $5+ per run
  • The 100x Rule: top-tier models can cost one to two orders of magnitude more per token than the small tier
  • Prompt caching: put static content FIRST, save up to 90% on cached portion
  • Batch APIs: 50% discount for 24-hour SLA (use for async/nightly tasks)
  • Model tiering: expensive models for hard tasks, cheap models for everything else
  • Context pruning: summarize history, limit retrieval, trim examples
  • Cost attribution: tag requests by feature/user/team to find waste
  • Monitor continuously - costs sneak up without visibility

In This Platform

This platform demonstrates cost-awareness: we use build-time validation (zero runtime AI cost), precomputed content (no per-user generation), and static site deployment (no inference at request time). The architecture choice itself is the ultimate cost optimization.

Relevant Files:
  • build.js
  • frontend/src/lib/recommendations-core.ts
