AI Cost Optimization
Strategies to reduce AI infrastructure costs by 50-90% through prompt caching, batch APIs, model tiering, and context pruning.
AI costs explode in production. A simple agentic workflow can cost $5+ per run. Without optimization, teams burn through budgets or get caught by surprise bills. The good news: strategic optimization can cut costs by 50-90% without sacrificing quality.
The 100x Rule: Claude Opus 4.5 costs ~100x more than Haiku for the same tokens. GPT-4o costs ~50x more than GPT-4o-mini. Most tasks don't need the expensive model. Route wisely.
Strategy 1: Prompt Caching
Place static context (system prompts, documentation, examples) at the START of prompts. Both Anthropic and OpenAI cache prompt prefixes (OpenAI automatically for prefixes over 1,024 tokens, Anthropic via explicit cache_control breakpoints), so you only pay full price for the tokens that change.
✅ GOOD (cacheable):
[SYSTEM PROMPT - 1000 tokens - CACHED]
[EXAMPLES - 500 tokens - CACHED]
[User's specific question - 50 tokens - NEW]
❌ BAD (not cacheable):
[User's specific question - 50 tokens]
[SYSTEM PROMPT - 1000 tokens]
[EXAMPLES - 500 tokens]
Anthropic bills cache reads at roughly 10% of the base input price (cache writes cost 25% more than a normal input token), so the cached portion of a prompt can get up to 90% cheaper.
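A minimal sketch using Anthropic's TypeScript SDK. The SYSTEM_PROMPT_AND_EXAMPLES constant and userQuestion variable are assumed inputs; the cache_control breakpoint marks the end of the reusable prefix.

import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await anthropic.messages.create({
  model: 'claude-3-5-sonnet-latest',
  max_tokens: 1024,
  // Static prefix: identical on every call, so after the first request
  // writes it to the cache, later requests read it at the discounted rate.
  system: [
    {
      type: 'text',
      text: SYSTEM_PROMPT_AND_EXAMPLES, // assumed ~1,500-token constant
      cache_control: { type: 'ephemeral' },
    },
  ],
  // Only the user's question changes, so only these tokens bill at full price.
  messages: [{ role: 'user', content: userQuestion }],
});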
Strategy 2: Batch APIs
For non-urgent tasks (nightly reports, bulk processing, async evaluation), use batch endpoints. OpenAI and Anthropic offer 50% discounts for batch requests with 24-hour SLAs.
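As a sketch, here is what a nightly bulk job might look like with Anthropic's Message Batches API. The documents array is an assumed input; OpenAI's batch endpoint works similarly via an uploaded JSONL file.

import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

// Submit every document as one batch; each request gets the 50% batch rate.
const batch = await anthropic.messages.batches.create({
  requests: documents.map((doc, i) => ({
    custom_id: `summary-${i}`, // used to match results back to inputs
    params: {
      model: 'claude-3-5-haiku-latest',
      max_tokens: 512,
      messages: [{ role: 'user', content: `Summarize:\n\n${doc}` }],
    },
  })),
});

// Poll later; results arrive within the 24-hour window.
const updated = await anthropic.messages.batches.retrieve(batch.id);
console.log(updated.processing_status); // e.g. 'in_progress' or 'ended'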
Strategy 3: Model Tiering
Use expensive models (Opus, GPT-4o) only for hard tasks. Use cheap models (Haiku, GPT-4o-mini) for everything else.
// Model identifiers here are illustrative tier labels, not exact API IDs.
type Model = 'claude-3-5-sonnet' | 'gpt-4o-mini' | 'claude-3-haiku';

interface Task {
  complexity: 'low' | 'medium' | 'high';
  requiresReasoning: boolean;
}

function selectModel(task: Task): Model {
  // Tier 1: hard reasoning, complex code
  if (task.complexity === 'high' || task.requiresReasoning) {
    return 'claude-3-5-sonnet'; // or Opus for the hardest tasks
  }
  // Tier 2: standard code, simple analysis
  if (task.complexity === 'medium') {
    return 'gpt-4o-mini';
  }
  // Tier 3: classification, simple extraction
  return 'claude-3-haiku'; // ~100x cheaper than Opus
}
Strategy 4: Context Pruning
Input tokens are billed on every request, so large contexts get expensive fast. Don't send entire files when excerpts suffice.
- Summarize history: Compress conversation history to key points
- Chunk retrieval: For RAG, only retrieve top-k most relevant chunks
- Trim examples: Use 2-3 examples instead of 10
- Remove redundancy: Don't repeat information in system prompt AND user message
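For the first item, a sketch of history compression. The summarize() helper is hypothetical, e.g. a single cheap Haiku call that condenses older turns into a few bullet points.

interface ChatMessage {
  role: 'user' | 'assistant';
  content: string;
}

// Hypothetical helper: compresses older turns into a short summary.
declare function summarize(messages: ChatMessage[]): Promise<string>;

// Keep the most recent turns verbatim; replace everything older with a summary.
async function pruneHistory(
  history: ChatMessage[],
  keepRecent = 6,
): Promise<ChatMessage[]> {
  if (history.length <= keepRecent) return history;
  const summary = await summarize(history.slice(0, -keepRecent));
  return [
    { role: 'assistant', content: `Summary of earlier conversation: ${summary}` },
    ...history.slice(-keepRecent),
  ];
}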
Strategy 5: Cost Attribution
Tag requests to understand WHERE costs come from. Without attribution, you can't optimize.
// Anthropic's metadata object currently only supports user_id, so record
// richer tags (feature, team) in your own usage log alongside each call.
const response = await anthropic.messages.create({
  model: 'claude-3-5-sonnet-latest',
  max_tokens: 1024,
  metadata: { user_id: user.id },
  // ...
});

// logUsage is your own telemetry hook, not part of the SDK.
logUsage({
  user_id: user.id,
  feature: 'code_review',
  team: 'platform',
  inputTokens: response.usage.input_tokens,
  outputTokens: response.usage.output_tokens,
});
Then you can answer questions like 'Which features cost the most?' and 'Which users are the heaviest spenders?'
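A sketch of the aggregation, assuming usage rows logged by the hypothetical logUsage() above and an assumed Sonnet-class price of $3/$15 per million input/output tokens.

interface UsageRow {
  feature: string;
  inputTokens: number;
  outputTokens: number;
}

const PRICE_PER_MTOK = { input: 3.0, output: 15.0 }; // assumed USD per million tokens

function costByFeature(rows: UsageRow[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const row of rows) {
    const cost =
      (row.inputTokens / 1e6) * PRICE_PER_MTOK.input +
      (row.outputTokens / 1e6) * PRICE_PER_MTOK.output;
    totals.set(row.feature, (totals.get(row.feature) ?? 0) + cost);
  }
  return totals;
}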
Key Takeaways
- AI costs explode in production - a single agentic workflow can cost $5+ per run
- The 100x Rule: Opus costs ~100x more than Haiku for the same tokens
- Prompt caching: put static content FIRST, save up to 90% on cached portion
- Batch APIs: 50% discount for 24-hour SLA (use for async/nightly tasks)
- Model tiering: expensive models for hard tasks, cheap models for everything else
- Context pruning: summarize history, limit retrieval, trim examples
- Cost attribution: tag requests by feature/user/team to find waste
- Monitor continuously - costs sneak up without visibility
In This Platform
This platform demonstrates cost-awareness: we use build-time validation (zero runtime AI cost), precomputed content (no per-user generation), and static site deployment (no inference at request time). The architecture choice itself is the ultimate cost optimization.
- build.js
- frontend/src/lib/recommendations-core.ts