Tokenization
Glossary · Beginner · 3 min read
Sources verified Dec 27, 2025
How AI models break text into smaller pieces (tokens) for processing.
Simple Definition
Tokenization is how AI models break text into smaller pieces called "tokens" before processing. A token is typically three to four characters, roughly ¾ of a word. Understanding tokens helps you estimate costs and manage context limits.
Technical Definition
AI models don't read text character by character or word by word. They use subword tokenization (for example, byte-pair encoding), which splits text into common fragments:
| Text | Token Count | Notes |
|---|---|---|
| "Hello" | 1 token | Common word |
| "tokenization" | 3 tokens | "token" + "iz" + "ation" |
| "GPT" | 1 token | Common abbreviation |
| "supercalifragilistic" | 7 tokens | Rare word, many subwords |
Rules of thumb:
- 1 token ≈ 4 characters ≈ ¾ word
- 100 tokens ≈ 75 words
- 1,000 tokens ≈ 750 words ≈ 1-2 pages
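If you want exact counts rather than estimates, you can count tokens directly. Below is a minimal sketch assuming OpenAI's open-source tiktoken library and its cl100k_base encoding; other models use different encodings, so the exact splits and counts will vary.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by several OpenAI chat models (assumption);
# other models and vendors use different encodings, so counts differ.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["Hello", "tokenization", "supercalifragilistic"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]  # decode each token id back to its text fragment
    print(f"{text!r}: {len(token_ids)} tokens -> {pieces}")
```

Because splits are tokenizer-specific, treat the counts in the table above as illustrative rather than exact.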
Why Tokens Matter
Cost: AI pricing is per token, typically with separate rates for input and output tokens. More tokens means higher cost (see the budgeting sketch at the end of this section).
Context limits: Models have token limits (e.g., 128K tokens). Your prompt + response must fit within this limit.
Code is token-expensive: Whitespace, brackets, and boilerplate add up. A 100-line file might be 500-1000 tokens.
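To make the cost and context math concrete, here is a minimal budgeting sketch. The per-token prices are placeholder assumptions for illustration, not any provider's actual rates; the 128K limit is the example figure from the text above.

```python
# Rough budgeting sketch. The prices below are placeholder assumptions,
# not real vendor pricing; check your provider's current rates.
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000    # assumed: $3 per million input tokens
PRICE_PER_OUTPUT_TOKEN = 15.00 / 1_000_000  # assumed: $15 per million output tokens
CONTEXT_LIMIT = 128_000                     # example context window from the text above

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of one request at the assumed rates."""
    return input_tokens * PRICE_PER_INPUT_TOKEN + output_tokens * PRICE_PER_OUTPUT_TOKEN

def fits_context(prompt_tokens: int, max_output_tokens: int) -> bool:
    """Prompt plus reserved response space must fit in the context window."""
    return prompt_tokens + max_output_tokens <= CONTEXT_LIMIT

# Example: a ~750-word prompt (~1,000 tokens) and a 500-token reply.
print(f"Estimated cost: ${estimate_cost(1_000, 500):.4f}")
print("Fits in context:", fits_context(1_000, 500))
```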
Key Takeaways
- Tokens are the units AI models use to process text
- 1 token ≈ 4 characters ≈ ¾ word
- Tokens affect cost and context limits
- Code is more token-expensive than prose