
Attention Mechanism

Fundamentals · Intermediate · 15 min
Sources verified Dec 22

The core mechanism that allows language models to understand how words relate to each other by dynamically focusing on relevant parts of the input.

The attention mechanism is the breakthrough that made modern LLMs possible. Before attention, models processed text sequentially, compressing everything seen so far into a fixed-size state, so relationships with distant words were easily lost. Attention allows the model to examine all words simultaneously and weigh their relationships directly.

When processing the sentence 'The cat sat on the mat because it was tired,' the model uses attention to determine that 'it' refers to 'cat' rather than 'mat.' It does this by computing relevance scores between every pair of words in the input.

How Self-Attention Works

The mechanism uses three components for each token:

  • Query: What am I looking for? (What this word wants to know about other words)
  • Key: What do I offer? (What information this word can provide)
  • Value: What information do I contain? (The actual content to use)

For every token, the model computes how much attention to pay to every other token by comparing queries and keys, then combines the values weighted by those attention scores.
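The query/key/value computation above can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product attention, not any particular model's implementation; the matrix shapes and random inputs are chosen for the example.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (seq_len, d_k).

    Returns the attended output and the attention weight matrix.
    """
    d_k = Q.shape[-1]
    # Relevance score between every pair of tokens: (seq_len, seq_len)
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns scores into weights that sum to 1 for each token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output token is a weighted mix of all value vectors
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 tokens, 8-dimensional vectors
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)              # (4, 8): one output vector per token
print(weights.sum(axis=-1))   # each row of weights sums to 1
```

In a real transformer, Q, K, and V are produced by learned linear projections of the token embeddings; here they are random stand-ins.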

Why Context Length Matters

Attention computes relationships between every pair of tokens, making it O(n²) in complexity. This is why:

  • Longer context windows require quadratically more computation—doubling the context roughly quadruples the attention work
  • Models have maximum context lengths (32k, 100k, 200k tokens)
  • Cost increases dramatically with longer inputs
  • Some models use optimizations like 'sparse attention' to reduce this cost
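The quadratic growth is easy to see by counting entries in the attention score matrix, which has one entry per pair of tokens. The token counts below are arbitrary examples.

```python
# One attention score per token pair: n tokens -> n * n entries.
for n in [1_000, 10_000, 100_000]:
    pairs = n * n
    print(f"{n:>7} tokens -> {pairs:>18,} score entries")

# A 10x longer context means 100x more pairs to score.
assert 10_000 * 10_000 == 100 * (1_000 * 1_000)
```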

Practical Implications

Understanding attention helps explain LLM behavior:

  1. Position matters: Attention scores can be influenced by where information appears in the prompt
  2. Recency bias: Later tokens often get more attention than earlier ones
  3. Lost in the middle: Information buried in long contexts may be overlooked
  4. Prompt engineering: structuring prompts with clear sections helps the model attend to the right information
  5. Multi-head attention: Models use multiple attention mechanisms in parallel to capture different types of relationships
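Point 5 above can be sketched concretely: multi-head attention splits the model dimension into several smaller heads, each of which computes its own attention pattern, and then concatenates the results. This is a minimal NumPy sketch with made-up dimensions and random projection matrices, not a production implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, num_heads):
    """X: (seq_len, d_model); W_q/W_k/W_v: (d_model, d_model) projections."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v

    # Split the model dimension into independent heads:
    # (seq_len, d_model) -> (num_heads, seq_len, d_head)
    def split(M):
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    # Each head computes its own (seq_len, seq_len) attention pattern
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    out = softmax(scores) @ Vh            # (num_heads, seq_len, d_head)
    # Concatenate head outputs back to the full model dimension
    return out.transpose(1, 0, 2).reshape(seq_len, d_model)

rng = np.random.default_rng(0)
d_model, seq_len = 16, 5
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
y = multi_head_self_attention(X, W_q, W_k, W_v, num_heads=4)
print(y.shape)  # (5, 16): same shape as the input
```

Because each head attends independently, one head can specialize in, say, syntactic relationships while another tracks coreference—real transformers also apply a final output projection after concatenation, omitted here for brevity.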

Key Takeaways

  • Attention allows models to understand relationships between all words in the input simultaneously
  • It computes relevance scores between every pair of tokens using query, key, and value vectors
  • Attention is O(n²), which is why longer contexts cost more and have limits
  • Understanding attention helps with prompt engineering—put important information near where it's needed
  • Multi-head attention captures different types of relationships (syntax, semantics, etc.)

In This Platform

Understanding attention helps explain why careful prompt structure matters in our assessment questions. We structure prompts with clear sections and relevant context close to the question because attention mechanisms work better when related information is nearby. This same principle applies when designing AI prompts in production systems.

Relevant Files:
  • prompts/analysis_prompts.json
  • prompts/recommendation_prompts.json

