Key Takeaway

  • MLP = key-value memory: rows of W_↑ are keys (pattern matchers), columns of W_↓ are values (facts to inject)
  • ReLU gates the lookup — a neuron fires only when its pattern matches the input
  • High-dimensional spaces can hold exponentially many quasi-orthogonal directions (superposition)
  • A single MLP block encodes far more facts than it has neurons

This chapter answers:

How do the MLP (feed-forward) blocks in a transformer store and retrieve factual knowledge?

Background: Where Do Facts Live?

Interpretability researchers have reported that factual knowledge in transformers tends to be stored in the multi-layer perceptron (MLP) blocks, not primarily in the attention layers. A full mechanistic understanding of exactly how this works remains an active research area, but the video gives a compelling conceptual model.

MLP vs. Attention

                          Attention Block                    MLP Block
  Purpose                 Cross-token communication          Per-token processing
  Tokens interact?        Yes — tokens exchange information  No — each token processed independently
  Stores facts?           Less so                            Primarily
  Share of GPT-3 params   ~1/3 (~58B)                        ~2/3 (~116B)

How Directions Encode Meaning (Recap)

The MLP operates on the residual stream — the high-dimensional vector flowing through each token. Directions in this space encode semantic meaning:

  • A vector that strongly aligns with the “Michael” direction has a large dot product with that direction
  • A vector that aligns with both the “Michael” and “Jordan” directions simultaneously encodes “Michael Jordan” as a concept
  • The dot product of any vector with a direction measures alignment: positive if aligned, near zero if perpendicular, negative if opposed
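
A quick numerical sketch of dot products as alignment meters. The concept directions here are hypothetical axis-aligned unit vectors chosen for readability; learned directions in a real model are not axis-aligned:

```python
import numpy as np

d = 8  # tiny embedding space for illustration (GPT-3 uses 12,288)

# Hypothetical concept directions, chosen axis-aligned for clarity.
michael = np.eye(d)[0]
jordan  = np.eye(d)[1]
phone   = np.eye(d)[2]

# A token vector encoding both "Michael" and "Jordan" simultaneously:
token = michael + jordan

print(np.dot(token, michael))   # 1.0  -> aligned: large dot product
print(np.dot(token, jordan))    # 1.0  -> aligned with both at once
print(np.dot(token, phone))     # 0.0  -> perpendicular: no alignment
print(np.dot(token, -michael))  # -1.0 -> opposed: negative
```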

The MLP Architecture: Three Steps

A single MLP block performs three sequential operations on each token’s vector.

Step 1: Up Projection

  • W_↑ shape (GPT-3): 49,152 × 12,288 (projects the 12,288-dimensional vector up by a factor of 4)
  • b_↑: bias vector of length 49,152
  • Output z = W_↑ · ē + b_↑: a vector of 49,152 values, one per “neuron” in the hidden layer

What Each Row of W_↑ Does

Each row of W_↑ is a learned “question” vector in the embedding space. If the row encodes the “Michael Jordan” concept (roughly, the “Michael” direction plus the “Jordan” direction), then:

  • When ē encodes “Michael Jordan”: row · ē ≈ 2 (the dot product is large: about one unit of alignment from each name)
  • When ē does not encode “Michael Jordan”: row · ē ≤ 1

With a bias of −1 (learned during training):

  • “Michael Jordan” present: 2 − 1 = 1 > 0 → neuron activates
  • Other inputs: ≤ 1 − 1 = 0 → neuron silenced

Step 2: ReLU Activation

Applied element-wise to every value in z:

  • Negative values → zero (neuron is inactive / “off”)
  • Positive values → unchanged (neuron is active / “on”)

Why ReLU Is Powerful Here

ReLU creates sharp gating behavior:

  • A neuron fires only when the specific pattern its row of W_↑ encodes is present in the input
  • This mimics an AND gate: the “Michael Jordan” neuron fires only when both “Michael” and “Jordan” features are present simultaneously
  • The resulting vector is typically sparse — most neurons are zero
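
The AND-gate behavior can be sketched with a single hypothetical hidden neuron. The “Michael”/“Jordan” directions and the bias of −1 are illustrative choices, not values from a real model:

```python
import numpy as np

d = 8
michael = np.eye(d)[0]   # hypothetical "Michael" direction
jordan  = np.eye(d)[1]   # hypothetical "Jordan" direction

row  = michael + jordan  # one row of W_up: "is Michael AND Jordan present?"
bias = -1.0              # threshold, learned during training in a real model

def neuron(e):
    """One hidden unit: ReLU(row . e + bias)."""
    return max(0.0, float(np.dot(row, e)) + bias)

print(neuron(michael + jordan))  # 1.0 -> both present: fires
print(neuron(michael))           # 0.0 -> only one present: silenced
print(neuron(jordan))            # 0.0
print(neuron(np.zeros(d)))       # 0.0 -> negative pre-activation clipped
```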

Step 3: Down Projection

  • W_↓ shape (GPT-3): 12,288 × 49,152 (projects back down to embedding space)
  • Output Δē = W_↓ · â is added to the original embedding: ē ← ē + Δē

What Each Column of W_↓ Does

Think of W_↓ column by column:

  • When neuron n activates (fires with value a_n), column n of W_↓ is scaled by a_n and added to the output
  • Each column is a direction in embedding space — it points toward some semantic concept

Example: the “Michael Jordan” neuron firing might cause:

  • Its column of W_↓ to point toward “basketball”
  • Also contributing directions toward “Chicago Bulls” and “number 23”

The final output is a superposition of all active neurons’ associated directions.
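
The three steps compose into a short forward pass. Here is a minimal sketch with scaled-down dimensions (48 → 192 instead of GPT-3’s 12,288 → 49,152) and random weights, just to show the shapes and the sparsity ReLU induces:

```python
import numpy as np

# Scaled-down dimensions (GPT-3: d_model = 12,288, d_hidden = 49,152 = 4x).
d_model, d_hidden = 48, 192
rng = np.random.default_rng(1)

W_up   = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_model)
b_up   = rng.normal(size=d_hidden)
W_down = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_hidden)

e = rng.normal(size=d_model)        # one token's residual-stream vector

z     = W_up @ e + b_up             # Step 1: ask d_hidden "questions"
a     = np.maximum(0.0, z)          # Step 2: ReLU gate -> sparse activations
delta = W_down @ a                  # Step 3: sum the active neurons' columns
e_new = e + delta                   # residual connection

print(e_new.shape)                  # (48,): same space we started in
print((a == 0).mean())              # roughly half the neurons are silent
```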


The Full MLP as a Fact-Retrieval System

Putting it together, the MLP implements a pattern like:

IF the current token vector matches the pattern “Michael Jordan” (via W_↑ rows + ReLU),
THEN add the directions “basketball”, “Chicago Bulls”, “number 23” to its embedding (via W_↓ columns)

This is analogous to a key-value memory store:

  • The rows of W_↑ are the keys — patterns to match against
  • The columns of W_↓ are the values — information to retrieve when a key matches
  • ReLU implements the lookup — gating whether a key is matched or not
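
A minimal toy of this key-value reading, with two hand-built keys and values in a 4-dimensional space (all patterns and facts here are invented for illustration):

```python
import numpy as np

# Keys (rows of W_up): patterns to match. Values (cols of W_down): facts.
W_up = np.array([[1., 1., 0., 0.],    # key 0: "Michael Jordan" pattern
                 [0., 0., 1., 1.]])   # key 1: "Eiffel Tower" pattern
W_down = np.array([[0., 1.],
                   [0., 0.],
                   [0., 0.],
                   [1., 0.]])         # col 0: "basketball", col 1: "Paris"
b = -1.0                              # shared threshold for both neurons

def mlp(e):
    a = np.maximum(0.0, W_up @ e + b)  # lookup: which keys match?
    return W_down @ a                  # retrieve: sum matched value columns

print(mlp(np.array([1., 1., 0., 0.])))  # [0. 0. 0. 1.]: the "basketball" fact
print(mlp(np.array([0., 0., 1., 1.])))  # [1. 0. 0. 0.]: the "Paris" fact
```

The bias is what makes the lookup selective: a token matching only half a pattern scores 1 − 1 = 0 and retrieves nothing.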

The Superposition Problem

A Naïve View

If we expected one neuron per concept, a 49,152-neuron hidden layer could only store 49,152 distinct facts. But language models appear to store millions of facts. How?

The Superposition Hypothesis

Neurons rarely represent single, clean concepts. Instead, the network encodes many more features than it has neurons by storing them as overlapping, quasi-orthogonal directions in the activation space.

This is called superposition: multiple features are encoded simultaneously, each as a different direction in the high-dimensional space. Each direction can be “decoded” with a dot product, but they partially interfere with each other.
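
A sketch of superposition with random directions: 2,048 features packed into 256 dimensions, with a sparse active set recovered by dot-product decoding (the dimensions and feature indices are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 256, 2048              # 8x more features than dimensions

# Random unit vectors in R^d are quasi-orthogonal to one another.
F = rng.normal(size=(n_features, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)

active = [3, 100, 400]                 # a sparse set of "present" features
x = F[active].sum(axis=0)              # superpose their directions

scores = F @ x                         # decode each feature via dot product
top3 = sorted(np.argsort(scores)[-3:].tolist())
print(top3)                            # [3, 100, 400]: active set recovered
print(abs(scores[7]))                  # small interference on an inactive one
```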

The Johnson-Lindenstrauss Lemma

The mathematical foundation comes from the Johnson-Lindenstrauss lemma:

For any set of N points in high-dimensional space, there is a projection into k = O(log(N) / ε²) dimensions that preserves every pairwise distance to within a factor of 1 ± ε.

But more importantly, the lemma implies:

The number of nearly-perpendicular (quasi-orthogonal) vectors that can fit in a d-dimensional space grows exponentially with d.

Practical Implications for GPT-3 (d = 12,288):

Demanding exactly 90° separation caps the space at 12,288 mutually orthogonal directions. Relax the requirement to roughly 85–89° between vectors, however, and the count grows exponentially: GPT-3’s 12,288-dimensional space can theoretically hold billions to trillions of distinct feature directions, far more than the number of real-world concepts the model needs to represent. The looser the angle tolerance, the more directions fit.
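
This capacity claim is easy to sanity-check numerically: random unit vectors in a high-dimensional space land close to perpendicular to one another (dimensions here are scaled down from 12,288 to keep memory modest):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2048, 2000                       # n random directions in d dimensions

V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

cos = V @ V.T                           # pairwise cosine similarities
np.fill_diagonal(cos, 0.0)              # ignore each vector vs. itself
angles = np.degrees(np.arccos(cos))

print(angles.min(), angles.max())       # every pair is within ~10 deg of 90
```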

The Tradeoff

Using quasi-orthogonal (rather than perfectly orthogonal) vectors means:

  • Activating one feature causes slight spurious activation of nearby features
  • For sparse features (ones that are rarely active), this interference is low-cost
  • For dense features, interference becomes a problem

The network learns to tolerate this interference because the benefit (massive capacity) outweighs the cost for sparse, real-world knowledge.


Parameter Count

For a Single MLP Block (GPT-3)

  Matrix                  Shape             Parameters
  W_↑ (up projection)     49,152 × 12,288   ~604 million
  W_↓ (down projection)   12,288 × 49,152   ~604 million
  Total per MLP block                       ~1.2 billion

Across All 96 Layers (GPT-3)

  • 96 blocks × ~1.2B = ~116 billion parameters
  • This is approximately two-thirds of GPT-3’s total 175 billion parameters
  • The remaining third is mostly attention parameters (~58B; see DL6)
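
The arithmetic above can be verified directly:

```python
d_model, d_hidden, n_layers = 12_288, 49_152, 96

per_matrix = d_model * d_hidden           # weights in W_up (same for W_down)
per_block  = 2 * per_matrix               # up + down; biases add only ~0.03%
total_mlp  = n_layers * per_block

print(f"{per_matrix:,}")                  # 603,979,776  (~604M per matrix)
print(f"{total_mlp/1e9:.1f}B")            # 116.0B across all 96 layers
print(f"{total_mlp/175e9:.0%} of 175B")   # 66% -> roughly two-thirds
```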

Why This Matters for Interpretability

The Challenge

If individual neurons don’t cleanly correspond to individual concepts (due to superposition), then:

  • Reading out “what a model knows” by examining individual neuron activations is unreliable
  • Circuits that implement factual retrieval may be distributed across many neurons
  • Mechanistic interpretability becomes very hard

What We Can Say

  • The MLP layers collectively are the primary site of factual knowledge storage
  • Editing facts in LLMs (e.g., changing “Michael Jordan plays basketball” to a different sport) has been shown to work best by targeting specific MLP layers — supporting this theory
  • The superposition hypothesis is an active area of research (see Anthropic’s “Toy Models of Superposition” paper)

Summary Diagram

Token vector ē  (12,288-dimensional)
        ↓
W_↑ · ē + b_↑       ← "Ask 49,152 questions" (up projection)
        ↓
ReLU(z)              ← Gate: only neurons whose pattern matches fire
        ↓  (sparse activation vector â, mostly zeros)
W_↓ · â              ← "Read out facts" (down projection)
        ↓
Δē                   ← Change to add to the embedding
        ↓
ē ← ē + Δē           ← Updated embedding (residual connection)

Key Concepts Summary

  Concept                  Meaning
  Up projection (W_↑)      Each row is a pattern (key) to match
  ReLU                     Threshold gate — neuron fires only if its pattern matches
  Down projection (W_↓)    Each column is information (value) to inject
  Superposition            Storing more features than dimensions via quasi-orthogonal vectors
  Johnson-Lindenstrauss    Implies exponential direction capacity in high-dimensional spaces
  Sparse activation        Most neurons inactive at any given time — crucial for superposition to work

Connections

  • Previous: DL6 (Attention) — the other major building block of a transformer
  • Research: Anthropic’s “Toy Models of Superposition” (2022) — formal treatment of the superposition hypothesis
  • Research: Geva et al. — “Transformer Feed-Forward Layers Are Key-Value Memories” (2021) — empirical evidence that MLPs store facts