Key Takeaway

  • MLP = key-value memory: rows of W_↑ are keys (pattern matchers), columns of W_↓ are values (facts to inject)
  • ReLU gates the lookup — a neuron fires only when its pattern matches the input
  • High-dimensional spaces can hold exponentially many quasi-orthogonal directions (superposition)
  • A single MLP block encodes far more facts than it has neurons

This chapter answers:

How do the MLP (feed-forward) blocks in a transformer store and retrieve factual knowledge?

Background: Where Do Facts Live?

Interpretability researchers have reported that factual knowledge in transformers tends to be stored in the multi-layer perceptron (MLP) blocks, not primarily in the attention layers. A full mechanistic understanding of exactly how this works remains an active research area, but the video gives a compelling conceptual model.

MLP vs. Attention

                          Attention Block                    MLP Block
  Purpose                 Cross-token communication          Per-token processing
  Tokens interact?        Yes — tokens exchange information  No — each token processed independently
  Stores facts?           Less so                            Primarily
  Share of GPT-3 params   ~1/3 (~58B)                        ~2/3 (~116B)

How Directions Encode Meaning (Recap)

The MLP operates on the residual stream — the high-dimensional vector flowing through each token. Directions in this space encode semantic meaning:

  • A vector that strongly aligns with the “Michael” direction has a large dot product with that direction
  • A vector that aligns with both the “Michael” and “Jordan” directions simultaneously encodes “Michael Jordan” as a concept
  • The dot product of any vector with a direction measures alignment: positive if aligned, near zero if perpendicular, negative if opposed
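
A quick numerical sketch of dot products as alignment meters. The concept directions here are hypothetical axis-aligned unit vectors chosen for readability; learned directions in a real model are not axis-aligned:

```python
import numpy as np

d = 8  # tiny embedding space for illustration (GPT-3 uses 12,288)

# Hypothetical concept directions, chosen axis-aligned for clarity.
michael = np.eye(d)[0]
jordan  = np.eye(d)[1]
phone   = np.eye(d)[2]

# A token vector encoding both "Michael" and "Jordan" simultaneously:
token = michael + jordan

print(np.dot(token, michael))   # 1.0  -> aligned: large dot product
print(np.dot(token, jordan))    # 1.0  -> aligned with both at once
print(np.dot(token, phone))     # 0.0  -> perpendicular: no alignment
print(np.dot(token, -michael))  # -1.0 -> opposed: negative
```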

The MLP Architecture: Three Steps

A single MLP block performs three sequential operations on each token’s vector.

Step 1: Up Projection

  • W_↑ shape (GPT-3): 49,152 × 12,288 (projects the 12,288-dimensional vector up by a factor of 4)
  • b_↑: bias vector of length 49,152
  • Output z = W_↑ · ē + b_↑: a vector of 49,152 values, one per “neuron” in the hidden layer

What Each Row of W_↑ Does

Each row of W_↑ is a learned “question” vector in the embedding space. If the row encodes the “Michael Jordan” concept (roughly, the “Michael” direction plus the “Jordan” direction), then:

  • When ē encodes “Michael Jordan”: row · ē ≈ 2 (the dot product is large: about one unit of alignment from each name)
  • When ē does not encode “Michael Jordan”: row · ē ≤ 1

With a bias of −1 (learned during training):

  • “Michael Jordan” present: 2 − 1 = 1 > 0 → neuron activates
  • Other inputs: ≤ 1 − 1 = 0 → neuron silenced

Step 2: ReLU Activation

Applied element-wise to every value in z:

  • Negative values → zero (neuron is inactive / “off”)
  • Positive values → unchanged (neuron is active / “on”)

Why ReLU Is Powerful Here

ReLU creates sharp gating behavior:

  • A neuron fires only when the specific pattern its row of W_↑ encodes is present in the input
  • This mimics an AND gate: the “Michael Jordan” neuron fires only when both “Michael” and “Jordan” features are present simultaneously
  • The resulting vector is typically sparse — most neurons are zero
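
The AND-gate behavior can be sketched with a single hypothetical hidden neuron. The “Michael”/“Jordan” directions and the bias of −1 are illustrative choices, not values from a real model:

```python
import numpy as np

d = 8
michael = np.eye(d)[0]   # hypothetical "Michael" direction
jordan  = np.eye(d)[1]   # hypothetical "Jordan" direction

row  = michael + jordan  # one row of W_up: "is Michael AND Jordan present?"
bias = -1.0              # threshold, learned during training in a real model

def neuron(e):
    """One hidden unit: ReLU(row . e + bias)."""
    return max(0.0, float(np.dot(row, e)) + bias)

print(neuron(michael + jordan))  # 1.0 -> both present: fires
print(neuron(michael))           # 0.0 -> only one present: silenced
print(neuron(jordan))            # 0.0
print(neuron(np.zeros(d)))       # 0.0 -> negative pre-activation clipped
```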

Step 3: Down Projection

  • W_↓ shape (GPT-3): 12,288 × 49,152 (projects back down to embedding space)
  • Output Δē = W_↓ · â is added to the original embedding: ē ← ē + Δē

What Each Column of W_↓ Does

Think of W_↓ column by column:

  • When neuron n activates (fires with value a_n), column n of W_↓ is scaled by a_n and added to the output
  • Each column is a direction in embedding space — it points toward some semantic concept

Example: the “Michael Jordan” neuron firing might cause:

  • Its column of W_↓ to point toward “basketball”
  • Also contributing directions toward “Chicago Bulls” and “number 23”

The final output is a superposition of all active neurons’ associated directions.
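
The three steps compose into a short forward pass. Here is a minimal sketch with scaled-down dimensions (48 → 192 instead of GPT-3’s 12,288 → 49,152) and random weights, just to show the shapes and the sparsity ReLU induces:

```python
import numpy as np

# Scaled-down dimensions (GPT-3: d_model = 12,288, d_hidden = 49,152 = 4x).
d_model, d_hidden = 48, 192
rng = np.random.default_rng(1)

W_up   = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_model)
b_up   = rng.normal(size=d_hidden)
W_down = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_hidden)

e = rng.normal(size=d_model)        # one token's residual-stream vector

z     = W_up @ e + b_up             # Step 1: ask d_hidden "questions"
a     = np.maximum(0.0, z)          # Step 2: ReLU gate -> sparse activations
delta = W_down @ a                  # Step 3: sum the active neurons' columns
e_new = e + delta                   # residual connection

print(e_new.shape)                  # (48,): same space we started in
print((a == 0).mean())              # roughly half the neurons are silent
```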


The Full MLP as a Fact-Retrieval System

Putting it together, the MLP implements a pattern like:

IF the current token vector matches the pattern “Michael Jordan” (via W_↑ rows + ReLU),
THEN add the directions “basketball”, “Chicago Bulls”, “number 23” to its embedding (via W_↓ columns)

This is analogous to a key-value memory store:

  • The rows of W_↑ are the keys — patterns to match against
  • The columns of W_↓ are the values — information to retrieve when a key matches
  • ReLU implements the lookup — gating whether a key is matched or not
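
A minimal toy of this key-value reading, with two hand-built keys and values in a 4-dimensional space (all patterns and facts here are invented for illustration):

```python
import numpy as np

# Keys (rows of W_up): patterns to match. Values (cols of W_down): facts.
W_up = np.array([[1., 1., 0., 0.],    # key 0: "Michael Jordan" pattern
                 [0., 0., 1., 1.]])   # key 1: "Eiffel Tower" pattern
W_down = np.array([[0., 1.],
                   [0., 0.],
                   [0., 0.],
                   [1., 0.]])         # col 0: "basketball", col 1: "Paris"
b = -1.0                              # shared threshold for both neurons

def mlp(e):
    a = np.maximum(0.0, W_up @ e + b)  # lookup: which keys match?
    return W_down @ a                  # retrieve: sum matched value columns

print(mlp(np.array([1., 1., 0., 0.])))  # [0. 0. 0. 1.]: the "basketball" fact
print(mlp(np.array([0., 0., 1., 1.])))  # [1. 0. 0. 0.]: the "Paris" fact
```

The bias is what makes the lookup selective: a token matching only half a pattern scores 1 − 1 = 0 and retrieves nothing.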

The Superposition Problem

A Naïve View

If we expected one neuron per concept, a 49,152-neuron hidden layer could only store 49,152 distinct facts. But language models appear to store millions of facts. How?

The Superposition Hypothesis

Neurons rarely represent single, clean concepts. Instead, the network encodes many more features than it has neurons by storing them as overlapping, quasi-orthogonal directions in the activation space.

This is called superposition: multiple features are encoded simultaneously, each as a different direction in the high-dimensional space. Each direction can be “decoded” with a dot product, but they partially interfere with each other.
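
A sketch of superposition with random directions: 2,048 features packed into 256 dimensions, with a sparse active set recovered by dot-product decoding (the dimensions and feature indices are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 256, 2048              # 8x more features than dimensions

# Random unit vectors in R^d are quasi-orthogonal to one another.
F = rng.normal(size=(n_features, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)

active = [3, 100, 400]                 # a sparse set of "present" features
x = F[active].sum(axis=0)              # superpose their directions

scores = F @ x                         # decode each feature via dot product
top3 = sorted(np.argsort(scores)[-3:].tolist())
print(top3)                            # [3, 100, 400]: active set recovered
print(abs(scores[7]))                  # small interference on an inactive one
```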

The Johnson-Lindenstrauss Lemma

The mathematical foundation comes from the Johnson-Lindenstrauss lemma:

For any set of N points in high-dimensional space, there is a projection into k = O(log(N) / ε²) dimensions that preserves every pairwise distance to within a factor of 1 ± ε.

But more importantly, the lemma implies:

The number of nearly-perpendicular (quasi-orthogonal) vectors that can fit in a d-dimensional space grows exponentially with d.

Practical Implications for GPT-3 (d = 12,288):

Demanding exactly 90° separation caps the space at 12,288 mutually orthogonal directions. Relax the requirement to roughly 85–89° between vectors, however, and the count grows exponentially: GPT-3’s 12,288-dimensional space can theoretically hold billions to trillions of distinct feature directions, far more than the number of real-world concepts the model needs to represent. The looser the angle tolerance, the more directions fit.
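
This capacity claim is easy to sanity-check numerically: random unit vectors in a high-dimensional space land close to perpendicular to one another (dimensions here are scaled down from 12,288 to keep memory modest):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2048, 2000                       # n random directions in d dimensions

V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

cos = V @ V.T                           # pairwise cosine similarities
np.fill_diagonal(cos, 0.0)              # ignore each vector vs. itself
angles = np.degrees(np.arccos(cos))

print(angles.min(), angles.max())       # every pair is within ~10 deg of 90
```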

The Tradeoff

Using quasi-orthogonal (rather than perfectly orthogonal) vectors means:

  • Activating one feature causes slight spurious activation of nearby features
  • For sparse features (ones that are rarely active), this interference is low-cost
  • For dense features, interference becomes a problem

The network learns to tolerate this interference because the benefit (massive capacity) outweighs the cost for sparse, real-world knowledge.


Parameter Count

For a Single MLP Block (GPT-3)

  Matrix                  Shape             Parameters
  W_↑ (up projection)     49,152 × 12,288   ~604 million
  W_↓ (down projection)   12,288 × 49,152   ~604 million
  Total per MLP block                       ~1.2 billion

Across All 96 Layers (GPT-3)

  • 96 blocks × ~1.2B = ~116 billion parameters
  • This is approximately two-thirds of GPT-3’s total 175 billion parameters
  • The remaining third is mostly attention parameters (~58B; see DL6)
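
The arithmetic above can be verified directly:

```python
d_model, d_hidden, n_layers = 12_288, 49_152, 96

per_matrix = d_model * d_hidden           # weights in W_up (same for W_down)
per_block  = 2 * per_matrix               # up + down; biases add only ~0.03%
total_mlp  = n_layers * per_block

print(f"{per_matrix:,}")                  # 603,979,776  (~604M per matrix)
print(f"{total_mlp/1e9:.1f}B")            # 116.0B across all 96 layers
print(f"{total_mlp/175e9:.0%} of 175B")   # 66% -> roughly two-thirds
```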

Why This Matters for Interpretability

The Challenge

If individual neurons don’t cleanly correspond to individual concepts (due to superposition), then:

  • Reading out “what a model knows” by examining individual neuron activations is unreliable
  • Circuits that implement factual retrieval may be distributed across many neurons
  • Mechanistic interpretability becomes very hard

What We Can Say

  • The MLP layers collectively are the primary site of factual knowledge storage
  • Editing facts in LLMs (e.g., changing “Michael Jordan plays basketball” to a different sport) has been shown to work best by targeting specific MLP layers — supporting this theory
  • The superposition hypothesis is an active area of research (see Anthropic’s “Toy Models of Superposition” paper)

Summary Diagram

Token vector ē  (12,288-dimensional)
        ↓
W_↑ · ē + b_↑       ← "Ask 49,152 questions" (up projection)
        ↓
ReLU(z)              ← Gate: only neurons whose pattern matches fire
        ↓  (sparse activation vector â, mostly zeros)
W_↓ · â              ← "Read out facts" (down projection)
        ↓
Δē                   ← Change to add to the embedding
        ↓
ē ← ē + Δē           ← Updated embedding (residual connection)

Key Concepts Summary

  Concept                  Meaning
  Up projection (W_↑)      Each row is a pattern (key) to match
  ReLU                     Threshold gate — neuron fires only if its pattern matches
  Down projection (W_↓)    Each column is information (value) to inject
  Superposition            Storing more features than dimensions via quasi-orthogonal vectors
  Johnson-Lindenstrauss    Implies exponential direction capacity in high-dimensional spaces
  Sparse activation        Most neurons inactive at any given time — crucial for superposition to work

Connections

  • Previous: DL6 (Attention) — the other major building block of a transformer
  • Research: Anthropic’s “Toy Models of Superposition” (2022) — formal treatment of the superposition hypothesis
  • Research: Geva et al. — “Transformer Feed-Forward Layers Are Key-Value Memories” (2021) — empirical evidence that MLPs store facts