- Youtube: Attention in transformers, step-by-step | Deep Learning Chapter 6
- Official lesson: 3blue1brown.com/lessons/attention
Key Takeaway
- Query = “what am I looking for?”, Key = “what do I contain?”, Value = “what do I send if selected?”
- Dot products between Q and K → attention weights → how much each token borrows from others
- Causal mask prevents attending to future tokens; softmax normalizes weights to sum to 1
- Multiple heads run in parallel, each learning a different type of relationship (syntax, coreference, semantics…)
This chapter answers:
How exactly does the attention mechanism work — mathematically and step by step?
Why Attention Is Needed
The Polysemy Problem
After tokenization and embedding (DL5), every occurrence of the word “mole” gets the same initial vector — because the embedding matrix is a static lookup table with no knowledge of context. But “mole” can mean:
- A small burrowing animal
- A unit of measurement (chemistry)
- A skin lesion (medicine)
The attention block is the mechanism that updates these generic embeddings based on surrounding context, moving the vector for “mole” toward the correct meaning for each sentence.
The Information Aggregation Problem
The final prediction at the end of the network uses only the last token’s vector. This means all information relevant to predicting the next token — from across the entire 2,048-token context window — must eventually flow into that one vector. Attention is the mechanism that enables this cross-token communication.
A Concrete Example
Consider: "The fluffy blue creature..."
- When processing the token “creature”, the model needs to know it is both “fluffy” and “blue”
- The query produced by “creature” aligns closely with the keys produced by “fluffy” and “blue”
- This causes those adjective embeddings to influence the “creature” vector
A Single Attention Head
Each attention head is parameterized by three learned matrices:
- W_Q — the Query matrix
- W_K — the Key matrix
- W_V — the Value matrix (factored into a value-down matrix W_V↓ and a value-up matrix W_V↑)
Step 1: Compute Queries
For each token embedding E_i, compute its query vector Q_i = W_Q E_i:
- W_Q shape (GPT-3): 128 × 12,288
- Query vector dimensionality: 128 (much smaller than the 12,288-dimensional embedding)
- Conceptual meaning: “What kind of information am I looking for?”
Step 2: Compute Keys
For each token embedding E_i, compute its key vector K_i = W_K E_i:
- W_K shape (GPT-3): 128 × 12,288
- Key vector dimensionality: 128 (same space as queries, enabling comparison)
- Conceptual meaning: “What kind of information do I contain?”
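Steps 1–2 can be sketched in NumPy with toy dimensions (8 and 4 standing in for GPT-3’s 12,288 and 128; all names and sizes here are illustrative, not GPT-3’s actual weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, n_tokens = 8, 4, 5              # toy sizes; GPT-3 uses 12,288 and 128

E = rng.standard_normal((d_model, n_tokens))  # one embedding column per token
W_Q = rng.standard_normal((d_k, d_model))     # Query matrix (128 x 12,288 in GPT-3)
W_K = rng.standard_normal((d_k, d_model))     # Key matrix, same shape

Q = W_Q @ E   # query vectors, one 4-dim column per token
K = W_K @ E   # key vectors, in the same 4-dim space as queries
```

Because queries and keys land in the same low-dimensional space, they can be compared directly with dot products in the next step.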
Step 3: Compute Attention Scores (Dot Products)
For every pair of tokens (i, j), compute the alignment score K_i · Q_j:
- If token j’s query and token i’s key are aligned → large positive score → token i is relevant to token j
- All pairwise scores form an n × n matrix (where n = number of tokens in context)
The compact matrix form for all scores at once: K^T Q,
where Q and K are matrices whose columns are all query/key vectors.
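With queries and keys stored as columns, the full score matrix is one matrix product; a toy NumPy sketch (sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d_k, n = 4, 5
Q = rng.standard_normal((d_k, n))   # columns are query vectors
K = rng.standard_normal((d_k, n))   # columns are key vectors

scores = K.T @ Q   # scores[i, j] = K_i . Q_j: how key i aligns with query j
```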
Step 4: Scale for Numerical Stability
Divide all scores by √d_k, where d_k is the key/query dimensionality (128 in GPT-3): K^T Q / √d_k
This prevents very large dot products from causing the softmax to saturate (become too “peaked”), which would make gradients vanish during training.
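A quick toy demonstration of the effect: the same relative scores, with and without the √d_k scaling (numbers chosen for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # shift by max for numerical stability
    return e / e.sum()

d_k = 128
raw = np.array([10.0, 12.0, 11.0]) * np.sqrt(d_k)   # large unscaled dot products

peaked = softmax(raw)                    # nearly one-hot: gradients vanish
smoother = softmax(raw / np.sqrt(d_k))   # scaled: a usable, spread-out distribution
```

The unscaled version puts essentially all its mass on one entry, while the scaled version still distinguishes the tokens without saturating.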
Step 5: Masking (Causal / Autoregressive Mask)
During training, the model learns to predict the next token at every position in a sequence simultaneously (not just the last one). To prevent the model from “cheating” by looking at future tokens:
- For every token at position i, set the scores attending to all positions j > i to −∞
- After softmax, e^(−∞) = 0, so future tokens contribute zero to the update
This enforces that each position can only attend to itself and earlier positions.
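A minimal NumPy sketch of the mask (toy sizes; the convention here is rows = key positions, columns = query positions, matching the column-wise softmax used below):

```python
import numpy as np

n = 4
rng = np.random.default_rng(2)
scores = rng.standard_normal((n, n))   # scores[i, j]: key position i vs. query position j

# Entries where the key position i comes AFTER the query position j are "the future"
future = np.tril(np.ones((n, n), dtype=bool), k=-1)
scores[future] = -np.inf               # softmax will turn these into exactly 0
```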
Step 6: Softmax — Normalize to Attention Weights
Apply softmax column-wise to the masked scores: A = softmax(K^T Q / √d_k)
- Each column sums to 1 → valid probability distribution
- The resulting matrix A is called the attention pattern
- Entry A_ij = how much token j attends to (draws information from) token i
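A toy sketch of the column-wise softmax and its normalization property (sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
scores = rng.standard_normal((5, 5))   # scaled (and, in a causal model, masked) scores

def softmax_cols(x):
    e = np.exp(x - x.max(axis=0, keepdims=True))  # shift per column for stability
    return e / e.sum(axis=0, keepdims=True)

A = softmax_cols(scores)   # the attention pattern: each column is a distribution
```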
Step 7: Compute Value Vectors and Updates
The Value Matrix (Factored Form)
Rather than one large value matrix, GPT-3 uses a low-rank factored form W_V = W_V↑ W_V↓:
- W_V↓: 128 × 12,288 — projects the embedding down to 128 dimensions
- W_V↑: 12,288 × 128 — projects back up to embedding space
- This matches the parameter budget of W_Q and W_K, keeping the design symmetric
For each token i, the value vector is: V_i = W_V E_i = W_V↑ W_V↓ E_i
Computing the Update
The update to add to token j’s embedding is the weighted sum of value vectors: ΔE_j = Σ_i A_ij V_i
This is added to the original embedding: E_j′ = E_j + ΔE_j
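A toy NumPy sketch of the factored value computation and the weighted-sum update; a uniform attention pattern stands in for a real one, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
d_model, d_k, n = 8, 4, 5   # toy sizes

E = rng.standard_normal((d_model, n))
W_V_down = rng.standard_normal((d_k, d_model))   # 128 x 12,288 in GPT-3
W_V_up = rng.standard_normal((d_model, d_k))     # 12,288 x 128

A = np.full((n, n), 1.0 / n)    # stand-in attention pattern (columns sum to 1)

V = (W_V_up @ W_V_down) @ E     # value vectors, one column per token
delta_E = V @ A                 # delta_E[:, j] = sum_i A[i, j] * V[:, i]
E_new = E + delta_E             # residual update to the embeddings
```

With the uniform pattern, every token's update is simply the mean of all value vectors; a trained pattern would weight them unevenly.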
The Full Compact Formula
The entire single-head attention operation can be written as: ΔE = V · softmax(K^T Q / √d_k)
Where:
- Q = W_Q E — matrix of all query vectors (one column per token)
- K = W_K E — matrix of all key vectors
- V = W_V E — matrix of all value vectors
- Output = weighted combination of value vectors for each token
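Putting the seven steps together, here is a minimal single-head causal attention in NumPy, using the column-vector convention from this chapter (a toy sketch, not an actual GPT-3 implementation):

```python
import numpy as np

def attention_head(E, W_Q, W_K, W_V, causal=True):
    """Single-head attention; embeddings E are columns, one per token."""
    d_k = W_Q.shape[0]
    Q, K, V = W_Q @ E, W_K @ E, W_V @ E              # steps 1, 2, 7a
    scores = K.T @ Q / np.sqrt(d_k)                  # steps 3-4
    if causal:
        n = E.shape[1]
        scores[np.tril(np.ones((n, n), dtype=bool), k=-1)] = -np.inf  # step 5
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    A = e / e.sum(axis=0, keepdims=True)             # step 6: attention pattern
    return V @ A                                     # step 7b: one update per token

rng = np.random.default_rng(5)
d_model, d_k, n = 8, 4, 5
E = rng.standard_normal((d_model, n))
make = lambda r, c: rng.standard_normal((r, c))
delta_E = attention_head(E, make(d_k, d_model), make(d_k, d_model), make(d_model, d_model))
```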
Multi-Head Attention
A single head captures one type of contextual relationship. A full attention block runs many heads in parallel, each with its own learned W_Q, W_K, and W_V matrices.
GPT-3 Configuration
- 96 attention heads per attention block
- Each head independently:
- Computes its own query/key/value projections
- Produces its own attention pattern
- Produces its own update for each token
- All heads’ updates are summed and added to the embeddings
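A toy sketch of the parallel heads, each with its own projections, summing their updates into one residual update (an unfactored W_V is used for brevity; all sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
d_model, d_k, n, n_heads = 8, 4, 5, 3   # toy; GPT-3: 12,288 / 128 / 96 heads

E = rng.standard_normal((d_model, n))
delta = np.zeros_like(E)
for _ in range(n_heads):                # each head has its own W_Q, W_K, W_V
    W_Q = rng.standard_normal((d_k, d_model))
    W_K = rng.standard_normal((d_k, d_model))
    W_V = rng.standard_normal((d_model, d_model))   # factored in GPT-3
    Q, K, V = W_Q @ E, W_K @ E, W_V @ E
    s = K.T @ Q / np.sqrt(d_k)
    e = np.exp(s - s.max(axis=0, keepdims=True))
    delta += V @ (e / e.sum(axis=0, keepdims=True))  # sum the heads' updates
E_new = E + delta                        # all heads write into the same embeddings
```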
Why Multiple Heads?
Different heads can learn to attend to different types of relationships:
- One head might track syntactic dependencies (subject → verb)
- Another might track coreference (pronoun → noun)
- Another might track semantic similarity
- Another might track positional proximity
The Output Matrix
After all head outputs are concatenated, they pass through an output matrix that combines them back into the embedding space.
Cross-Attention (Encoder–Decoder Models)
The attention described above is self-attention: both queries and keys/values come from the same sequence.
Cross-attention is used in encoder-decoder models (e.g., translation):
- Queries come from the decoder (the output sequence being generated)
- Keys and values come from the encoder (the input sequence, e.g., a sentence in another language)
- This lets the decoder “look at” relevant parts of the input when generating each output token
- Cross-attention does not require causal masking (the decoder can attend to the full input)
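A toy sketch showing where queries versus keys/values come from in cross-attention (all names and sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
d_model, d_k = 8, 4
enc = rng.standard_normal((d_model, 6))   # encoder output: 6 input tokens
dec = rng.standard_normal((d_model, 3))   # decoder state: 3 output tokens so far

W_Q = rng.standard_normal((d_k, d_model))
W_K = rng.standard_normal((d_k, d_model))
W_V = rng.standard_normal((d_model, d_model))

Q = W_Q @ dec                   # queries come from the decoder
K, V = W_K @ enc, W_V @ enc     # keys and values come from the encoder
s = K.T @ Q / np.sqrt(d_k)      # 6 x 3 score matrix: no causal mask needed
e = np.exp(s - s.max(axis=0, keepdims=True))
update = V @ (e / e.sum(axis=0, keepdims=True))   # one update per decoder token
```

Note the score matrix is rectangular here: each decoder token gets a distribution over all six input tokens.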
Parameter Count
For a Single Attention Head (GPT-3)
| Matrix | Shape | Parameters |
|---|---|---|
| W_Q | 128 × 12,288 | 1,572,864 |
| W_K | 128 × 12,288 | 1,572,864 |
| W_V↓ | 128 × 12,288 | 1,572,864 |
| W_V↑ | 12,288 × 128 | 1,572,864 |
| Total per head | | ~6.3 million |
For the Full Multi-Head Attention Block
- 96 heads × ~6.3M = ~600 million parameters per block
Across All 96 Layers
- 96 blocks × ~600M = ~58 billion parameters total in attention layers
- This is roughly one-third of GPT-3’s total 175 billion parameters
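The parameter arithmetic above can be checked directly (head and layer counts from GPT-3 as stated in this section):

```python
d_model, d_k = 12_288, 128

per_head = 3 * d_k * d_model + d_model * d_k   # W_Q, W_K, W_V-down, plus W_V-up
per_block = 96 * per_head                      # 96 heads per attention block
total = 96 * per_block                         # 96 layers of attention blocks

print(per_head, per_block, total)              # ~6.3M, ~600M, ~58B
```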
The Residual Stream Picture
It helps to think of each token as having a residual stream — a 12,288-dimensional vector that flows through all layers:
- Each attention block and MLP block reads from and writes to this stream (via addition)
- Early layers might encode simple features (part of speech, position)
- Later layers encode increasingly abstract information (tone, coreference, topic)
- The 12,288 dimensions act like a high-bandwidth working memory that different blocks can use for different sub-tasks
As Grant Sanderson puts it: “After nouns absorb adjective meanings, those adjective embeddings become effectively free and available for extra processing.”
Key Intuitions Summary
| Concept | Intuition |
|---|---|
| Query | “What am I looking for?” |
| Key | “What do I contain?” |
| Value | “What should I send if you attend to me?” |
| Attention weight | “How much should token j borrow from token i?” |
| Masking | Prevents looking into the future during training |
| Multi-head | Multiple parallel “types” of attention simultaneously |
| Cross-attention | Decoder queries, encoder keys/values (for seq-to-seq) |
Connections and Further Reading
- Original paper: “Attention is All You Need” (Vaswani et al., 2017) — introduced the transformer architecture
- Andrej Karpathy: “Build GPT from Scratch” — hands-on coding implementation
- Anthropic: Transformer Circuits research — mechanistic interpretability of attention heads