Context Windows
The context window is the maximum amount of text (in tokens) the model can "see" at once. This includes:
- Your system prompt
- Conversation history
- Any documents/context you inject
- The model's response so far
If your total exceeds the limit, you'll get an error or truncation.
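As a rough illustration, you can estimate the total before sending a request and trim or summarize when it would overflow. The 4-characters-per-token heuristic and the default limits below are assumptions for the sketch, not any provider's API; a real tokenizer (e.g. tiktoken) gives exact counts.

```ts
// Rough preflight check: does everything fit, with room left for the response?
// Heuristic: ~4 characters per token of English text (an assumption, not exact).
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function fitsInContext(
  parts: string[],            // system prompt, history, injected documents
  contextLimit = 128_000,     // the model's context window
  responseReserve = 4_000     // space kept free for the model's reply
): boolean {
  const total = parts.reduce((sum, p) => sum + estimateTokens(p), 0);
  return total + responseReserve <= contextLimit;
}
```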
Common Context Limits (2025)
| Model | Context Window |
|---|---|
| GPT-4o | 128K tokens |
| Claude 3.5 Sonnet | 200K tokens |
| Gemini 1.5 Pro | 2M tokens |
| Llama 3.3 70B | 128K tokens |
A 128K-token context is roughly a 300-page book: at about 0.75 words per token, 128K tokens is ~96,000 words, or around 300 pages at ~300 words per page.
The "Lost in the Middle" Problem
Research ("Lost in the Middle," Liu et al., 2023) shows LLMs struggle with information in the middle of long contexts. They perform best on:
- Content at the beginning of the context
- Content at the end of the context
Information buried in the middle may be "forgotten" or overlooked. This has implications for prompt design.
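One way to act on this, sketched below under the assumption that your prompt splits into distinct instructions, reference material, and task parts, is to put critical instructions first and restate the task at the end, leaving bulk material in the middle:

```ts
// Place critical content where recall is strongest: the start and the end.
function layoutPrompt(instructions: string, referenceDocs: string[], task: string): string {
  return [
    instructions,                 // beginning: high recall
    ...referenceDocs,             // middle: bulk reference material
    `Task (restated): ${task}`,   // end: high recall
  ].join("\n\n");
}
```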
context_management.ts

A sketch of budget-aware context assembly. The `Message` and `ContextParams` types, the token heuristic, and the trivial `rankByRelevance` stub are assumptions added to make the example self-contained.

```ts
interface Message { role: "system" | "user" | "assistant"; content: string; }
interface ContextParams {
  systemPrompt: Message;
  documents: Message[];
  conversationHistory: Message[];
  maxTokens?: number;
}

// Rough token estimate: ~4 characters per token of English text
const estimateTokens = (msgs: Message[]): number =>
  msgs.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);

// Placeholder ranking: assume documents arrive ordered most-relevant first
// (a real system would rank by retrieval score or embedding similarity)
const rankByRelevance = (documents: Message[]): Message[] => documents;

function buildContext({
  systemPrompt,
  documents,
  conversationHistory,
  maxTokens = 128000
}: ContextParams): Message[] {
  // Always reserve space for the model's response
  const responseReserve = 4000;
  const available = maxTokens - responseReserve;
  // Priority: system prompt > recent history > documents
  let used = estimateTokens([systemPrompt]);
  // Keep recent conversation (recent messages matter most for coherence)
  const recentHistory = conversationHistory.slice(-10);
  used += estimateTokens(recentHistory);
  // Fill the remaining budget with the most relevant documents
  const relevantDocs: Message[] = [];
  for (const doc of rankByRelevance(documents)) {
    const cost = estimateTokens([doc]);
    if (used + cost > available) break;
    used += cost;
    relevantDocs.push(doc);
  }
  return [systemPrompt, ...relevantDocs, ...recentHistory];
}
```
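A hypothetical call might look like this; the inputs are illustrative placeholders:

```ts
const messages = buildContext({
  systemPrompt: { role: "system", content: "You are a helpful assistant." },
  documents: [{ role: "user", content: "Retrieved chunk: ..." }],        // e.g. RAG output
  conversationHistory: [{ role: "user", content: "Summarize the report." }],
  maxTokens: 128_000,
});
```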
Context Window Strategies
- Summarization: Compress old conversation into summaries
- RAG: Retrieve only relevant chunks instead of full documents
- Sliding window: Keep only the N most recent messages (combined with summarization in the sketch after this list)
- Priority ordering: Put critical info at start and end
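A minimal sketch combining the sliding-window and summarization strategies is shown below. The `summarize` parameter stands in for another LLM call and is an assumption of the sketch, not a real API:

```ts
interface ChatMessage { role: "system" | "user" | "assistant"; content: string; }

// Keep the last `keep` messages verbatim; fold everything older into one summary message.
async function slidingWindowWithSummary(
  history: ChatMessage[],
  summarize: (msgs: ChatMessage[]) => Promise<string>,  // placeholder for an LLM call
  keep = 10
): Promise<ChatMessage[]> {
  if (history.length <= keep) return history;
  const older = history.slice(0, -keep);   // messages to compress
  const recent = history.slice(-keep);     // messages to keep verbatim
  const summary = await summarize(older);
  return [
    { role: "system", content: `Summary of earlier conversation: ${summary}` },
    ...recent,
  ];
}
```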
Key Takeaways
- Context window = total tokens the model can process at once
- Includes prompt + history + injected content + response
- Larger contexts = higher cost and latency
- Information in the middle of long contexts may be overlooked
- RAG often beats stuffing everything into context
In This Platform
This platform's prompts are designed for token efficiency. Each prompt template has a focused purpose to avoid context bloat when multiple modules are combined.
Relevant Files:
- prompts/analysis.json
- config/scoring.json