- Youtube: Attention in transformers, step-by-step | Deep Learning Chapter 6
- Official lesson: 3blue1brown.com/lessons/attention
Key Takeaway
- Query = “what am I looking for?”, Key = “what do I contain?”, Value = “what do I send if selected?”
- Dot products between Q and K → attention weights → how much each token borrows from others
- Causal mask prevents attending to future tokens; softmax normalizes weights to sum to 1
- Multiple heads run in parallel, each learning a different type of relationship (syntax, coreference, semantics…)
This chapter answers:
How exactly does the attention mechanism work — mathematically and step by step?
Why Attention Is Needed
The Polysemy Problem
After tokenization and embedding (DL5), every occurrence of the word “mole” gets the same initial vector — because the embedding matrix is a static lookup table with no knowledge of context. But “mole” can mean:
- A small burrowing animal
- A unit of measurement (chemistry)
- A skin lesion (medicine)
The attention block is the mechanism that updates these generic embeddings based on surrounding context, moving the vector for “mole” toward the correct meaning for each sentence.
The Information Aggregation Problem
The final prediction at the end of the network uses only the last token’s vector. This means all information relevant to predicting the next token — from across the entire 2,048-token context window — must eventually flow into that one vector. Attention is the mechanism that enables this cross-token communication.
A Concrete Example
Consider: "The fluffy blue creature..."
- When processing the token “creature”, the model needs to know it is both “fluffy” and “blue”
- The query produced by “creature” aligns closely with the keys produced by “fluffy” and “blue”
- This causes those adjective embeddings to influence the “creature” vector
A Single Attention Head
Each attention head is parameterized by three learned matrices:
- W_Q — the Query matrix
- W_K — the Key matrix
- W_V — the Value matrix (factored into a value-down matrix W_V↓ and a value-up matrix W_V↑)
Step 1: Compute Queries
For each token embedding E_i, compute its query vector Q_i = W_Q E_i:
- W_Q shape (GPT-3): 128 × 12,288
- Query vector dimensionality: 128 (much smaller than the 12,288-dimensional embedding)
- Conceptual meaning: “What kind of information am I looking for?”
Step 2: Compute Keys
For each token embedding E_i, compute its key vector K_i = W_K E_i:
- W_K shape (GPT-3): 128 × 12,288
- Key vector dimensionality: 128 (same space as queries, enabling comparison)
- Conceptual meaning: “What kind of information do I contain?”
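Steps 1–2 can be sketched in NumPy with toy dimensions (8 and 4 standing in for GPT-3’s 12,288 and 128; all names and sizes here are illustrative, not GPT-3’s actual weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, n_tokens = 8, 4, 5              # toy sizes; GPT-3 uses 12,288 and 128

E = rng.standard_normal((d_model, n_tokens))  # one embedding column per token
W_Q = rng.standard_normal((d_k, d_model))     # Query matrix (128 x 12,288 in GPT-3)
W_K = rng.standard_normal((d_k, d_model))     # Key matrix, same shape

Q = W_Q @ E   # query vectors, one 4-dim column per token
K = W_K @ E   # key vectors, in the same 4-dim space as queries
```

Because queries and keys land in the same low-dimensional space, they can be compared directly with dot products in the next step.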
Step 3: Compute Attention Scores (Dot Products)
For every pair of tokens (i, j), compute the alignment score K_i · Q_j:
- If token j’s query and token i’s key are aligned → large positive score → token i is relevant to token j
- All pairwise scores form an n × n matrix (where n = number of tokens in context)
The compact matrix form for all scores at once: K^T Q,
where Q and K are matrices whose columns are all query/key vectors.
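With queries and keys stored as columns, the full score matrix is one matrix product; a toy NumPy sketch (sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d_k, n = 4, 5
Q = rng.standard_normal((d_k, n))   # columns are query vectors
K = rng.standard_normal((d_k, n))   # columns are key vectors

scores = K.T @ Q   # scores[i, j] = K_i . Q_j: how key i aligns with query j
```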
Step 4: Scale for Numerical Stability
Divide all scores by √d_k, where d_k is the key/query dimensionality (128 in GPT-3): K^T Q / √d_k
This prevents very large dot products from causing the softmax to saturate (become too “peaked”), which would make gradients vanish during training.
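A quick toy demonstration of the effect: the same relative scores, with and without the √d_k scaling (numbers chosen for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # shift by max for numerical stability
    return e / e.sum()

d_k = 128
raw = np.array([10.0, 12.0, 11.0]) * np.sqrt(d_k)   # large unscaled dot products

peaked = softmax(raw)                    # nearly one-hot: gradients vanish
smoother = softmax(raw / np.sqrt(d_k))   # scaled: a usable, spread-out distribution
```

The unscaled version puts essentially all its mass on one entry, while the scaled version still distinguishes the tokens without saturating.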
Step 5: Masking (Causal / Autoregressive Mask)
During training, the model learns to predict the next token at every position in a sequence simultaneously (not just the last one). To prevent the model from “cheating” by looking at future tokens:
- For every token at position i, set the scores attending to all positions j > i to −∞
- After softmax, e^(−∞) = 0, so future tokens contribute zero to the update
This enforces that each position can only attend to itself and earlier positions.
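A minimal NumPy sketch of the mask (toy sizes; the convention here is rows = key positions, columns = query positions, matching the column-wise softmax used below):

```python
import numpy as np

n = 4
rng = np.random.default_rng(2)
scores = rng.standard_normal((n, n))   # scores[i, j]: key position i vs. query position j

# Entries where the key position i comes AFTER the query position j are "the future"
future = np.tril(np.ones((n, n), dtype=bool), k=-1)
scores[future] = -np.inf               # softmax will turn these into exactly 0
```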
Step 6: Softmax — Normalize to Attention Weights
Apply softmax column-wise to the masked scores: A = softmax(K^T Q / √d_k)
- Each column sums to 1 → valid probability distribution
- The resulting matrix A is called the attention pattern
- Entry A_ij = how much token j attends to (draws information from) token i
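A toy sketch of the column-wise softmax and its normalization property (sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
scores = rng.standard_normal((5, 5))   # scaled (and, in a causal model, masked) scores

def softmax_cols(x):
    e = np.exp(x - x.max(axis=0, keepdims=True))  # shift per column for stability
    return e / e.sum(axis=0, keepdims=True)

A = softmax_cols(scores)   # the attention pattern: each column is a distribution
```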
Step 7: Compute Value Vectors and Updates
The Value Matrix (Factored Form)
Rather than one large value matrix, GPT-3 uses a low-rank factored form W_V = W_V↑ W_V↓:
- W_V↓: 128 × 12,288 — projects the embedding down to 128 dimensions
- W_V↑: 12,288 × 128 — projects back up to embedding space
- This matches the parameter budget of W_Q and W_K, keeping the design symmetric
For each token i, the value vector is: V_i = W_V E_i = W_V↑ W_V↓ E_i
Computing the Update
The update to add to token j’s embedding is the weighted sum of value vectors: ΔE_j = Σ_i A_ij V_i
This is added to the original embedding: E_j′ = E_j + ΔE_j
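A toy NumPy sketch of the factored value computation and the weighted-sum update; a uniform attention pattern stands in for a real one, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
d_model, d_k, n = 8, 4, 5   # toy sizes

E = rng.standard_normal((d_model, n))
W_V_down = rng.standard_normal((d_k, d_model))   # 128 x 12,288 in GPT-3
W_V_up = rng.standard_normal((d_model, d_k))     # 12,288 x 128

A = np.full((n, n), 1.0 / n)    # stand-in attention pattern (columns sum to 1)

V = (W_V_up @ W_V_down) @ E     # value vectors, one column per token
delta_E = V @ A                 # delta_E[:, j] = sum_i A[i, j] * V[:, i]
E_new = E + delta_E             # residual update to the embeddings
```

With the uniform pattern, every token's update is simply the mean of all value vectors; a trained pattern would weight them unevenly.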
The Full Compact Formula
The entire single-head attention operation can be written as: ΔE = V · softmax(K^T Q / √d_k)
Where:
- Q = W_Q E — matrix of all query vectors (one column per token)
- K = W_K E — matrix of all key vectors
- V = W_V E — matrix of all value vectors
- Output = weighted combination of value vectors for each token
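Putting the seven steps together, here is a minimal single-head causal attention in NumPy, using the column-vector convention from this chapter (a toy sketch, not an actual GPT-3 implementation):

```python
import numpy as np

def attention_head(E, W_Q, W_K, W_V, causal=True):
    """Single-head attention; embeddings E are columns, one per token."""
    d_k = W_Q.shape[0]
    Q, K, V = W_Q @ E, W_K @ E, W_V @ E              # steps 1, 2, 7a
    scores = K.T @ Q / np.sqrt(d_k)                  # steps 3-4
    if causal:
        n = E.shape[1]
        scores[np.tril(np.ones((n, n), dtype=bool), k=-1)] = -np.inf  # step 5
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    A = e / e.sum(axis=0, keepdims=True)             # step 6: attention pattern
    return V @ A                                     # step 7b: one update per token

rng = np.random.default_rng(5)
d_model, d_k, n = 8, 4, 5
E = rng.standard_normal((d_model, n))
make = lambda r, c: rng.standard_normal((r, c))
delta_E = attention_head(E, make(d_k, d_model), make(d_k, d_model), make(d_model, d_model))
```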
Multi-Head Attention
A single head captures one type of contextual relationship. A full attention block runs many heads in parallel, each with its own learned W_Q, W_K, and W_V matrices.
GPT-3 Configuration
- 96 attention heads per attention block
- Each head independently:
- Computes its own query/key/value projections
- Produces its own attention pattern
- Produces its own update for each token
- All heads’ updates are summed and added to the embeddings
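A toy sketch of the parallel heads, each with its own projections, summing their updates into one residual update (an unfactored W_V is used for brevity; all sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
d_model, d_k, n, n_heads = 8, 4, 5, 3   # toy; GPT-3: 12,288 / 128 / 96 heads

E = rng.standard_normal((d_model, n))
delta = np.zeros_like(E)
for _ in range(n_heads):                # each head has its own W_Q, W_K, W_V
    W_Q = rng.standard_normal((d_k, d_model))
    W_K = rng.standard_normal((d_k, d_model))
    W_V = rng.standard_normal((d_model, d_model))   # factored in GPT-3
    Q, K, V = W_Q @ E, W_K @ E, W_V @ E
    s = K.T @ Q / np.sqrt(d_k)
    e = np.exp(s - s.max(axis=0, keepdims=True))
    delta += V @ (e / e.sum(axis=0, keepdims=True))  # sum the heads' updates
E_new = E + delta                        # all heads write into the same embeddings
```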
Why Multiple Heads?
Different heads can learn to attend to different types of relationships:
- One head might track syntactic dependencies (subject → verb)
- Another might track coreference (pronoun → noun)
- Another might track semantic similarity
- Another might track positional proximity
The Output Matrix
After all head outputs are concatenated, they pass through an output matrix that combines them back into the embedding space.
Cross-Attention (Encoder–Decoder Models)
The attention described above is self-attention: both queries and keys/values come from the same sequence.
Cross-attention is used in encoder-decoder models (e.g., translation):
- Queries come from the decoder (the output sequence being generated)
- Keys and values come from the encoder (the input sequence, e.g., a sentence in another language)
- This lets the decoder “look at” relevant parts of the input when generating each output token
- Cross-attention does not require causal masking (the decoder can attend to the full input)
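A toy sketch showing where queries versus keys/values come from in cross-attention (all names and sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
d_model, d_k = 8, 4
enc = rng.standard_normal((d_model, 6))   # encoder output: 6 input tokens
dec = rng.standard_normal((d_model, 3))   # decoder state: 3 output tokens so far

W_Q = rng.standard_normal((d_k, d_model))
W_K = rng.standard_normal((d_k, d_model))
W_V = rng.standard_normal((d_model, d_model))

Q = W_Q @ dec                   # queries come from the decoder
K, V = W_K @ enc, W_V @ enc     # keys and values come from the encoder
s = K.T @ Q / np.sqrt(d_k)      # 6 x 3 score matrix: no causal mask needed
e = np.exp(s - s.max(axis=0, keepdims=True))
update = V @ (e / e.sum(axis=0, keepdims=True))   # one update per decoder token
```

Note the score matrix is rectangular here: each decoder token gets a distribution over all six input tokens.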
Parameter Count
For a Single Attention Head (GPT-3)
| Matrix | Shape | Parameters |
|---|---|---|
| W_Q | 128 × 12,288 | 1,572,864 |
| W_K | 128 × 12,288 | 1,572,864 |
| W_V↓ | 128 × 12,288 | 1,572,864 |
| W_V↑ | 12,288 × 128 | 1,572,864 |
| Total per head | | ~6.3 million |
For the Full Multi-Head Attention Block
- 96 heads × ~6.3M = ~600 million parameters per block
Across All 96 Layers
- 96 blocks × ~600M = ~58 billion parameters total in attention layers
- This is roughly one-third of GPT-3’s total 175 billion parameters
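The parameter arithmetic above can be checked directly (head and layer counts from GPT-3 as stated in this section):

```python
d_model, d_k = 12_288, 128

per_head = 3 * d_k * d_model + d_model * d_k   # W_Q, W_K, W_V-down, plus W_V-up
per_block = 96 * per_head                      # 96 heads per attention block
total = 96 * per_block                         # 96 layers of attention blocks

print(per_head, per_block, total)              # ~6.3M, ~600M, ~58B
```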
The Residual Stream Picture
It helps to think of each token as having a residual stream — a 12,288-dimensional vector that flows through all layers:
- Each attention block and MLP block reads from and writes to this stream (via addition)
- Early layers might encode simple features (part of speech, position)
- Later layers encode increasingly abstract information (tone, coreference, topic)
- The 12,288 dimensions act like a high-bandwidth working memory that different blocks can use for different sub-tasks
As Grant Sanderson puts it: “After nouns absorb adjective meanings, those adjective embeddings become effectively free and available for extra processing.”
Key Intuitions Summary
| Concept | Intuition |
|---|---|
| Query | “What am I looking for?” |
| Key | “What do I contain?” |
| Value | “What should I send if you attend to me?” |
| Attention weight | “How much should token j borrow from token i?” |
| Masking | Prevents looking into the future during training |
| Multi-head | Multiple parallel “types” of attention simultaneously |
| Cross-attention | Decoder queries, encoder keys/values (for seq-to-seq) |
Connections and Further Reading
- Original paper: “Attention is All You Need” (Vaswani et al., 2017) — introduced the transformer architecture
- Andrej Karpathy: “Build GPT from Scratch” — hands-on coding implementation
- Anthropic: Transformer Circuits research — mechanistic interpretability of attention heads