- YouTube: How might LLMs store facts | Deep Learning Chapter 7
- Official lesson: 3blue1brown.com/lessons/mlp
Key Takeaway
- MLP = key-value memory: rows of W_↑ are keys (pattern matchers), columns of W_↓ are values (facts to inject)
- ReLU gates the lookup — a neuron fires only when its pattern matches the input
- High-dimensional spaces can hold exponentially many quasi-orthogonal directions (superposition)
- A single MLP block encodes far more facts than it has neurons
This chapter answers:
How do the MLP (feed-forward) blocks in a transformer store and retrieve factual knowledge?
Background: Where Do Facts Live?
Researchers at Google DeepMind reported that factual knowledge in transformers tends to be stored in the multi-layer perceptron (MLP) blocks, not primarily in the attention layers. A full mechanistic understanding of exactly how this works remains an active research area, but the video gives a compelling conceptual model.
MLP vs. Attention
| | Attention Block | MLP Block |
|---|---|---|
| Purpose | Cross-token communication | Per-token processing |
| Tokens interact? | Yes — tokens exchange information | No — each token processed independently |
| Stores facts? | Less so | Primarily |
| Share of GPT-3 params | ~1/3 (~58B) | ~2/3 (~116B) |
How Directions Encode Meaning (Recap)
The MLP operates on the residual stream — the high-dimensional vector flowing through each token. Directions in this space encode semantic meaning:
- A vector that strongly aligns with the “Michael” direction has a large dot product with that direction
- A vector that aligns with both “Michael” and “Jordan” simultaneously encodes “Michael Jordan” as a concept
- The dot product of any vector with a direction measures alignment: positive if aligned, zero if perpendicular, negative if opposed
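The three cases above can be sketched with toy vectors (assumed 4-dimensional here, standing in for real 12,288-dimensional embeddings; the “Michael” and “Jordan” directions are made up for illustration):

```python
import numpy as np

# hypothetical concept directions in a toy 4-d embedding space
michael = np.array([1.0, 0.0, 0.0, 0.0])
jordan  = np.array([0.0, 1.0, 0.0, 0.0])
other   = np.array([0.0, 0.0, 1.0, 0.0])

token = michael + jordan     # encodes both concepts simultaneously

print(token @ michael)       # 1.0  — positive: aligned
print(token @ other)         # 0.0  — zero: perpendicular
print(token @ (-michael))    # -1.0 — negative: opposed
```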
The MLP Architecture: Three Steps
A single MLP block performs three sequential operations on each token’s vector.
Step 1: Up Projection
- W_↑ shape (GPT-3): 49,152 × 12,288 (projects up by a factor of 4)
- b_↑: bias vector of length 49,152
- Output z = W_↑ · ē + b_↑: a vector of 49,152 values, one per “neuron” in the hidden layer
What Each Row of W_↑ Does
Each row of W_↑ is a learned “question” vector in the embedding space. If the row is aligned with the “Michael Jordan” concept (roughly the sum of the “Michael” and “Jordan” directions), then:
- When ē encodes “Michael Jordan”: row · ē ≈ 2 (dot product is large — roughly 1 from each name)
- When ē does not encode “Michael Jordan”: row · ē ≤ 1
With a bias of −1 (learned during training):
- “Michael Jordan” present: z ≈ 2 − 1 = 1 > 0 → neuron activates
- Other inputs: z ≤ 0 → neuron silenced
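A minimal sketch of this pattern-matching step, using made-up toy directions and the video’s example bias of −1 (not real GPT-3 weights):

```python
import numpy as np

michael = np.array([1.0, 0.0, 0.0, 0.0])   # hypothetical directions
jordan  = np.array([0.0, 1.0, 0.0, 0.0])
phelps  = np.array([0.0, 0.0, 1.0, 0.0])

row  = michael + jordan    # one row of W_up: asks "Michael AND Jordan?"
bias = -1.0

def pre_activation(e):
    """Dot the row against the token vector and add the bias."""
    return float(row @ e + bias)

z_both = pre_activation(michael + jordan)  # 2 - 1 = 1 > 0: neuron will fire
z_one  = pre_activation(michael + phelps)  # 1 - 1 = 0: silenced after ReLU
print(z_both, z_one)  # 1.0 0.0
```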
Step 2: ReLU Activation
Applied element-wise to every value in z:
- Negative values → zero (neuron is inactive / “off”)
- Positive values → unchanged (neuron is active / “on”)
Why ReLU Is Powerful Here
ReLU creates sharp gating behavior:
- A neuron fires only when the specific pattern its row in W_↑ is looking for is present
- This mimics an AND gate: the “Michael Jordan” neuron fires only when both “Michael” and “Jordan” features are present simultaneously
- The resulting activation vector â = ReLU(z) is typically sparse — most neurons are zero
Step 3: Down Projection
- W_↓ shape (GPT-3): 12,288 × 49,152 (projects back down to embedding space)
- The output W_↓ · â is added to the original embedding: ē ← ē + W_↓ · â (the residual connection)
What Each Column of W_↓ Does
Think of W_↓ column by column:
- When neuron i activates (fires with value â_i), column i of W_↓ is scaled by â_i and added to the output
- Column i is a direction in embedding space — it points toward some semantic concept
Example: the “Michael Jordan” neuron (neuron i) firing might cause:
- Column i to point toward “basketball”
- Also contributing directions toward “Chicago Bulls” and “number 23”
The final output is a superposition of all active neurons’ associated directions.
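The “superposition of columns” claim is just how matrix-vector multiplication works: W_↓ · â equals the sum of the columns of W_↓ weighted by the activations. A quick check with random toy shapes (8-dimensional embedding, 5 neurons — assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W_down = rng.standard_normal((8, 5))       # toy: 8-d embedding, 5 neurons
a = np.array([0.0, 2.0, 0.0, 0.0, 0.5])   # sparse: only neurons 1 and 4 fire

out = W_down @ a                           # the down projection
# identical to scaling and summing the active neurons' columns:
out_as_columns = 2.0 * W_down[:, 1] + 0.5 * W_down[:, 4]
print(np.allclose(out, out_as_columns))    # True
```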
The Full MLP as a Fact-Retrieval System
Putting it together, the MLP implements a pattern like:
IF the current token vector matches the pattern “Michael Jordan” (via W_↑ rows + ReLU),
THEN add the directions “basketball”, “Chicago Bulls”, “number 23” to its embedding (via W_↓ columns)
This is analogous to a key-value memory store:
- The rows of W_↑ are the keys — patterns to match against
- The columns of W_↓ are the values — information to retrieve when a key matches
- ReLU implements the lookup — gating whether a key is matched or not
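The whole key-value pattern can be sketched end to end with a single toy neuron (all directions, shapes, and the −1 bias are illustrative assumptions; a real block has 49,152 neurons with learned weights):

```python
import numpy as np

d = 6
# hypothetical concept directions (basis vectors, for clarity)
michael, jordan, basketball, bulls = np.eye(d)[:4]

W_up   = np.stack([michael + jordan])        # key: matches "Michael AND Jordan"
b_up   = np.array([-1.0])
W_down = (basketball + bulls).reshape(d, 1)  # value: directions injected when the key fires

def mlp(e):
    a = np.maximum(W_up @ e + b_up, 0.0)     # gated lookup (up projection + ReLU)
    return e + W_down @ a                    # inject facts, keep the residual

print(mlp(michael + jordan) @ basketball)    # 1.0 — fact retrieved
print(mlp(michael) @ basketball)             # 0.0 — key not matched, nothing added
```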
The Superposition Problem
A Naïve View
If we expected one neuron per concept, a 49,152-neuron hidden layer could only store 49,152 distinct facts. But language models appear to store millions of facts. How?
The Superposition Hypothesis
Neurons rarely represent single, clean concepts. Instead, the network encodes many more features than it has neurons by storing them as overlapping, quasi-orthogonal directions in the activation space.
This is called superposition: multiple features are encoded simultaneously, each as a different direction in the high-dimensional space. Each direction can be “decoded” with a dot product, but they partially interfere with each other.
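A toy sketch of superposition: store more feature directions than dimensions as random unit vectors, activate a sparse pair, and decode by dot product. The decoded score of an active feature is exactly 1 plus an interference term from the other active feature (feature count, dimension, and indices are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d = 100, 64               # more features than dimensions
# random unit vectors serve as quasi-orthogonal feature directions
F = rng.standard_normal((n_features, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)

active = [3, 17]                      # a sparse pair of features switched on
x = F[active].sum(axis=0)             # superposed representation in R^64

scores = F @ x                        # decode every feature with a dot product
# an active feature's score is exactly 1 plus interference from the other one
print(scores[3], 1 + F[3] @ F[17])
```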
The Johnson-Lindenstrauss Lemma
The mathematical foundation comes from the Johnson-Lindenstrauss lemma:
For any set of n points in high-dimensional space, they can be projected into k dimensions while approximately preserving pairwise distances (up to a factor of 1 ± ε).
Key formula: k = O(log n / ε²)
But more importantly, the lemma implies:
The number of nearly-perpendicular (quasi-orthogonal) vectors that can fit in a d-dimensional space grows exponentially with d.
Practical Implication for GPT-3 (d = 12,288):
Relaxing the requirement from exactly 90° to a minimum pairwise angle of roughly 85–89° makes the number of directions that fit explode. At ~85–88° separation, GPT-3’s 12,288-dimensional space can theoretically hold billions to trillions of distinct feature directions — far more than the number of real-world concepts the model needs to represent.
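The concentration of angles near 90° is easy to check empirically: random vector pairs in high dimension are almost exactly perpendicular, while in 3 dimensions they can point almost anywhere relative to each other. A quick simulation (sample size is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def max_deviation_from_90(d, trials=1000):
    """Largest deviation (degrees) from perpendicular among random vector pairs in R^d."""
    u = rng.standard_normal((trials, d))
    v = rng.standard_normal((trials, d))
    cos = np.sum(u * v, axis=1) / (np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
    angles = np.degrees(np.arccos(cos))
    return np.abs(angles - 90.0).max()

dev_low  = max_deviation_from_90(3)       # large: low-dim pairs stray far from 90°
dev_high = max_deviation_from_90(12_288)  # tiny: high-dim pairs are nearly perpendicular
print(dev_low, dev_high)
```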
The Tradeoff
Using quasi-orthogonal (rather than perfectly orthogonal) vectors means:
- Activating one feature causes slight spurious activation of nearby features
- For sparse features (ones that are rarely active), this interference is low-cost
- For dense features, interference becomes a problem
The network learns to tolerate this interference because the benefit (massive capacity) outweighs the cost for sparse, real-world knowledge.
Parameter Count
For a Single MLP Block (GPT-3)
| Matrix | Shape | Parameters |
|---|---|---|
| W_↑ (up projection) | 49,152 × 12,288 | ~604 million |
| W_↓ (down projection) | 12,288 × 49,152 | ~604 million |
| Total per MLP block | | ~1.2 billion |
Across All 96 Layers (GPT-3)
- 96 blocks × ~1.2B = ~116 billion parameters
- This is approximately two-thirds of GPT-3’s total 175 billion parameters
- The remaining third is mostly attention parameters (~58B), covered in DL6
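These counts can be checked with a few lines of arithmetic (bias vectors add a comparatively negligible 49,152 parameters per block and are omitted here):

```python
d_model, d_hidden, n_layers = 12_288, 49_152, 96

per_matrix = d_model * d_hidden        # one projection matrix
per_block  = 2 * per_matrix            # up projection + down projection
total_mlp  = n_layers * per_block      # across all 96 layers

print(f"{per_matrix:,}")               # 603,979,776  (~604 million)
print(f"{total_mlp:,}")                # 115,964,116,992  (~116 billion)
```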
Why This Matters for Interpretability
The Challenge
If individual neurons don’t cleanly correspond to individual concepts (due to superposition), then:
- Reading out “what a model knows” by examining individual neuron activations is unreliable
- Circuits that implement factual retrieval may be distributed across many neurons
- Mechanistic interpretability becomes very hard
What We Can Say
- The MLP layers collectively are the primary site of factual knowledge storage
- Editing facts in LLMs (e.g., changing “Michael Jordan plays basketball” to a different sport) has been shown to work best by targeting specific MLP layers — supporting this theory
- The superposition hypothesis is an active area of research (see Anthropic’s “Toy Models of Superposition” paper)
Summary Diagram
Token vector ē (12,288-dimensional)
↓
W_↑ · ē + b_↑ ← "Ask 49,152 questions" (up projection)
↓
ReLU(z) ← Gate: only neurons whose pattern matches fire
↓ (sparse activation vector â, mostly zeros)
W_↓ · â ← "Read out facts" (down projection)
↓
Δē ← Change to add to the embedding
↓
ē ← ē + Δē ← Updated embedding (residual connection)
Key Concepts Summary
| Concept | Meaning |
|---|---|
| Up projection | Each row is a pattern (key) to match |
| ReLU | Threshold gate — neuron fires only if pattern matches |
| Down projection | Each column is information (value) to inject |
| Superposition | Storing more features than dimensions by using quasi-orthogonal vectors |
| Johnson-Lindenstrauss | Guarantees exponential capacity in high-dimensional spaces |
| Sparse activation | Most neurons inactive at any given time — crucial for superposition to work |
Connections
- Previous: DL6 (Attention) — the other major building block of a transformer
- Research: Anthropic’s “Toy Models of Superposition” (2022) — formal treatment of the superposition hypothesis
- Research: Geva et al. (Google) — “Transformer Feed-Forward Layers Are Key-Value Memories” (2021) — empirical evidence that MLPs store facts