Key Takeaway

  • Text → tokens → embeddings (high-dimensional vectors)
  • Vectors pass through alternating Attention blocks (cross-token communication) and MLP blocks (per-token fact storage)
  • After many layers, the last token’s vector is projected to 50,257 logits → softmax → next-token probabilities

This chapter answers:

What is a transformer, and how does it turn text into predictions at a high level?

  • What is a Transformer?

    • A Transformer is a neural network architecture that processes sequences by letting every element attend to every other element simultaneously, rather than reading them one by one.
  • Introduced in the paper “Attention Is All You Need” (Vaswani et al., 2017).

  • The key innovation: replace recurrence (RNNs) with self-attention, enabling full parallelisation during training.

  • A Transformer takes a sequence of vectors as input, passes them through alternating Attention and MLP blocks, and outputs a transformed sequence of vectors of the same length.

  • Each attention block allows tokens to communicate with each other (context-aware updates).

  • Each MLP block processes each token independently (stores patterns and facts).

  • Information accumulates via residual connections: each block adds to the existing vector rather than replacing it.

In the context of language models:

  • Input = a sequence of token embeddings
  • Output = a richer sequence of context-aware embeddings
  • The final vector is projected to a probability distribution over the next token

What is a GPT?

GPT stands for Generative Pre-trained Transformer:

  • Generative — it generates new text
  • Pre-trained — trained on a massive dataset before being fine-tuned for specific tasks
  • Transformer — based on the transformer neural network architecture

A GPT takes text as input and outputs a probability distribution over all possible next tokens. Text is generated by repeatedly sampling from this distribution and appending the chosen token, then running the model again.
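This generate-by-repeated-sampling loop can be sketched with a toy stand-in for the model. Everything here is illustrative: the 8-token vocabulary, the random "model", and the seed are all made up; a real GPT would run the full transformer at each step.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 8  # toy vocabulary; GPT-3 uses 50,257 tokens


def toy_model(tokens):
    """Stand-in for a transformer: returns one raw score (logit) per
    vocabulary token. A real model would run attention/MLP layers here."""
    return rng.normal(size=VOCAB_SIZE)


def generate(tokens, n_steps):
    for _ in range(n_steps):
        logits = toy_model(tokens)
        probs = np.exp(logits) / np.exp(logits).sum()  # softmax
        next_token = rng.choice(VOCAB_SIZE, p=probs)   # sample from distribution
        tokens = tokens + [int(next_token)]            # append, then run again
    return tokens


out = generate([3, 1], n_steps=5)  # starts from two prompt tokens
```

The key point is the feedback loop: each sampled token is appended to the input before the next forward pass.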

Deep Learning Foundations

What Machine Learning Really Is

  • Traditional programming: humans write explicit rules
  • Machine learning: the model learns rules from data
  • The tunable knobs in a model are called weights (or parameters)
  • GPT-3 has 175 billion weights organized into ~28,000 matrices across 8 matrix categories

The Core Constraint

All operations in a transformer must be expressible as:

  1. Tensor operations — data flows as multi-dimensional arrays of real numbers
  2. Matrix multiplications — the primary computation
  3. Nonlinear functions — sprinkled in between (e.g., softmax, ReLU) to prevent the whole network from collapsing to a single linear transformation

Training uses backpropagation to adjust all weights based on prediction errors.
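The core constraint above can be shown in a few lines: data as arrays, a matrix multiplication, and a nonlinearity in between. The sizes here are arbitrary toy numbers.

```python
import numpy as np

x = np.random.default_rng(1).normal(size=(4, 16))   # 4 tokens, 16-dim vectors
W = np.random.default_rng(2).normal(size=(16, 16))  # a learnable weight matrix

h = x @ W               # matrix multiplication: the primary computation
h = np.maximum(h, 0.0)  # ReLU nonlinearity, applied elementwise
```

Without the nonlinear step, stacking any number of such layers would collapse into a single equivalent matrix multiplication.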


Step 1: Tokenization

  • Input text is broken into small chunks called tokens
  • Tokens may be whole words, sub-word pieces, or punctuation (e.g., "To| date|,| the| cle|ve|rest|...")
  • GPT-3 has a vocabulary of 50,257 tokens
  • The model never works with raw text — only with these token IDs
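A minimal sketch of the idea, using a hand-made seven-entry vocabulary and greedy longest-match splitting. Real tokenizers (e.g. byte-pair encoding) learn tens of thousands of entries and use more sophisticated merge rules; this only illustrates text → token IDs.

```python
# Hypothetical miniature vocabulary; the IDs are arbitrary.
vocab = {"To": 0, " date": 1, ",": 2, " the": 3, " cle": 4, "ve": 5, "rest": 6}


def tokenize(text):
    """Greedy longest-match tokenization: repeatedly take the longest
    vocabulary entry that prefixes the remaining text."""
    ids = []
    while text:
        match = max((t for t in vocab if text.startswith(t)), key=len)
        ids.append(vocab[match])
        text = text[len(match):]
    return ids


ids = tokenize("To date, the cleverest")
# Reproduces the split "To| date|,| the| cle|ve|rest" from the text above.
```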

Step 2: Embedding — Turning Tokens into Vectors

The Embedding Matrix

  • Each token is mapped to a high-dimensional vector via the embedding matrix W_E
  • W_E has one column per vocabulary token
  • GPT-3 dimensions: 50,257 vocabulary items × 12,288 embedding dimensions
  • Parameter count: 50,257 × 12,288 ≈ 617M parameters, just for embeddings
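Embedding lookup is just column selection. The sketch below uses tiny toy dimensions (the GPT-3 sizes appear only in comments and a parameter-count check):

```python
import numpy as np

# Toy dimensions; GPT-3 uses vocab size 50,257 and embedding dimension 12,288.
VOCAB_SIZE, D_MODEL = 10, 4
W_E = np.random.default_rng(0).normal(size=(D_MODEL, VOCAB_SIZE))  # one column per token

token_ids = [7, 2, 2]          # a repeated token gets the same column each time
embeddings = W_E[:, token_ids]  # lookup = selecting columns, shape (D_MODEL, 3)

gpt3_embedding_params = 50_257 * 12_288  # ≈ 617M parameters, embeddings alone
```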

What Embeddings Encode

  • Embeddings are learned during training — not hand-designed
  • Directions in this high-dimensional space correspond to semantic meaning
  • Words with similar meanings cluster nearby in this space

Directional Semantics: Analogy Arithmetic

A famous property of trained embeddings:

The direction “woman minus man” is a learned vector encoding gender: adding it to the embedding of “king” lands near the embedding of “queen”. This kind of relational structure emerges automatically from training on large text corpora.

Another example: subtracting the embedding of a singular word from its plural (e.g., “cats” minus “cat”) yields a plurality direction. This direction correlates with other plurals and with increasing numerical quantities.
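Analogy arithmetic can be demonstrated with hand-made toy vectors (these four 3-dimensional embeddings are invented purely for illustration; real learned embeddings live in thousands of dimensions):

```python
import numpy as np

# Hand-made toy embeddings: dimension 1 plays the role of a "gender" axis.
emb = {
    "man":   np.array([1.0, 0.0, 1.0]),
    "woman": np.array([1.0, 1.0, 1.0]),
    "king":  np.array([5.0, 0.0, 2.0]),
    "queen": np.array([5.0, 1.0, 2.0]),
}

gender_dir = emb["woman"] - emb["man"]  # the "woman minus man" direction
guess = emb["king"] + gender_dir        # king + (woman - man)


def cos(a, b):
    """Cosine similarity: dot product of normalized vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))


nearest = max(emb, key=lambda w: cos(emb[w], guess))  # nearest vocabulary word
```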

Dot Products Measure Similarity

  • The dot product of two vectors measures how aligned they are
  • It is large and positive when the vectors point in the same direction, near zero when they are perpendicular, and negative when they point in opposite directions
  • This is the primary tool for comparing embeddings
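The three cases can be checked directly with small example vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 0.5])
b = np.array([1.1, 1.9, 0.4])   # points in nearly the same direction as a
c = -a                          # points in the opposite direction
d = np.array([2.0, -1.0, 0.0])  # perpendicular to a

aligned = a @ b   # large and positive
opposed = a @ c   # negative
perp = a @ d      # (exactly) zero here
```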

Positional Information

  • The initial embedding encodes what a token is, not where it appears in the sequence
  • Position is encoded separately and added to the embedding vector (positional encoding)
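One concrete scheme is the sinusoidal encoding from “Attention Is All You Need”; GPT models instead learn their positional embedding matrix, but the principle, adding a position-dependent vector to each token embedding, is the same. A sketch:

```python
import numpy as np


def sinusoidal_positions(n_positions, d_model):
    """Sinusoidal positional encodings (Vaswani et al., 2017):
    sin/cos waves of geometrically increasing wavelength per dimension pair."""
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    enc = np.zeros((n_positions, d_model))
    enc[:, 0::2] = np.sin(angles)  # even dimensions
    enc[:, 1::2] = np.cos(angles)  # odd dimensions
    return enc


tok = np.ones((3, 8))                 # 3 identical token embeddings
x = tok + sinusoidal_positions(3, 8)  # now each copy is distinguishable by position
```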

Step 3: The Transformer Layers

The context window of GPT-3 holds 2048 tokens at once. Each token’s embedding vector (12,288-dimensional) flows through the network in parallel.

What “Deep” Means

A large number of stacked layers is what puts the “deep” in deep learning. Each layer refines the embeddings further.

Two Types of Operations Alternate

1. Attention Block

  • Allows all token vectors to communicate with each other
  • Each vector can absorb contextual information from surrounding vectors
  • A vector that started as the embedding of an isolated word gets tugged and refined as it absorbs context
  • Example: the word “mole” has very different meanings in different sentences; only after attention can the model know which meaning is active

2. Feed-Forward Block (MLP — Multi-Layer Perceptron)

  • Each token vector is processed independently (no cross-token communication)
  • Functions like a bank of “questions” asked about each vector
  • Stores patterns and facts learned during training
  • The MLP gives the model extra representational capacity beyond attention alone

These two blocks alternate many times. The data flows through many layers of [Attention → MLP → Attention → MLP → ...].

The Residual Stream View

Each layer adds its output to the existing vector rather than replacing it. This means:

  • Early layers can encode simple features
  • Later layers add progressively more abstract, context-rich information
  • Information from the original embedding persists and is accessible at every layer
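The additive update can be sketched with a toy residual stream (random vectors and a random "block" output; the 0.1 scale and layer count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=12)  # a token's vector entering the residual stream
x0 = x.copy()            # remember the original embedding


def block(v):
    """Stand-in for an attention or MLP block's output: a small update."""
    return 0.1 * rng.normal(size=v.shape)


for _ in range(4):   # each layer ADDS its output to the stream...
    x = x + block(x)  # ...rather than replacing the vector

# The original embedding's direction still dominates after the additions:
overlap = (x @ x0) / (np.linalg.norm(x) * np.linalg.norm(x0))
```

Because each block contributes an increment, the original embedding remains recoverable deep into the network.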

Step 4: The Unembedding Matrix

After many layers, the final vector of the last token in the context encodes everything the model has inferred about what should come next.

This final vector is multiplied by the unembedding matrix W_U:

  • Shape: 50,257 rows × 12,288 columns (the transpose of the embedding matrix’s shape)
  • Output: a list of 50,257 raw scores, one for each vocabulary token

These raw scores are called logits.
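The projection is a single matrix-vector product. Toy sizes below; the real GPT-3 shapes are noted in comments.

```python
import numpy as np

# Toy sizes; GPT-3: vocab 50,257, embedding dimension 12,288.
VOCAB_SIZE, D_MODEL = 10, 4
rng = np.random.default_rng(0)

W_U = rng.normal(size=(VOCAB_SIZE, D_MODEL))  # unembedding: one row per token
final_vec = rng.normal(size=D_MODEL)          # last token's enriched vector

logits = W_U @ final_vec  # one raw score per vocabulary token
```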


Step 5: Softmax — Converting Logits to Probabilities

Raw logits can be any real number (positive, negative, large, small). We need a valid probability distribution (all values between 0 and 1, summing to 1).
Softmax performs this conversion in two steps:

  1. Exponentiate: raise e to the power of each logit
  2. Normalize: divide each result by the sum of all the exponentials

Properties:

  • Larger logits → values closer to 1
  • Smaller logits → values closer to 0
  • All outputs sum to 1
  • It’s a “soft” maximum: all tokens still receive some nonzero probability
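The two steps in code (subtracting the max before exponentiating is a standard numerical-stability trick; it does not change the result):

```python
import numpy as np


def softmax(logits):
    exps = np.exp(logits - logits.max())  # exponentiate (stabilized)
    return exps / exps.sum()              # normalize to sum to 1


probs = softmax(np.array([2.0, 1.0, -1.0, 0.5]))
```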

Temperature

An optional temperature parameter T can be inserted into softmax: each logit is divided by T before exponentiation.

Temperature    Effect
T high         Distribution becomes more uniform — more random/creative outputs
T low          Distribution becomes more peaked — more deterministic/conservative outputs
T → 0          All weight goes to the single highest-scoring token

The name “temperature” is borrowed from thermodynamics, where temperature governs the randomness of particle behavior.
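The effect is easy to see numerically (the logits and temperature values here are arbitrary examples):

```python
import numpy as np


def softmax_with_temperature(logits, T):
    scaled = np.asarray(logits) / T        # divide logits by temperature T
    exps = np.exp(scaled - scaled.max())   # stabilized exponentiation
    return exps / exps.sum()


logits = np.array([3.0, 1.0, 0.0])
hot = softmax_with_temperature(logits, T=5.0)   # flatter: more random sampling
cold = softmax_with_temperature(logits, T=0.2)  # peaked: near-deterministic
```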


Key Terminology Summary

Term            Meaning
Token           A small chunk of text (word, sub-word, punctuation)
Embedding       A high-dimensional vector representing a token’s meaning
Logits          Raw, unnormalized output scores before softmax
Weights         Learnable parameters (the “knobs” of the model)
Context size    Maximum number of tokens the model processes at once
Attention       Mechanism for tokens to share information with each other
MLP             Feed-forward sub-network that processes each token independently

GPT-3 Architecture at a Glance

Property                       Value
Total parameters               ~175 billion
Embedding dimension            12,288
Vocabulary size                50,257 tokens
Context window                 2,048 tokens
Number of layers               96
Attention heads per layer      96
Embedding matrix parameters    ~617M

The Big Picture

Input text
    ↓  tokenize
[token₁, token₂, ..., tokenₙ]
    ↓  embed (W_E)
[vec₁, vec₂, ..., vecₙ]   ← each is 12,288-dimensional
    ↓  [Attention block] × many layers
    ↓  [MLP block]      × many layers
    ↓  ...
[enriched_vec₁, ..., enriched_vecₙ]
    ↓  take last vector, multiply by W_U
[logit₁, logit₂, ..., logit₅₀₂₅₇]
    ↓  softmax
[prob₁, prob₂, ..., prob₅₀₂₅₇]   ← probability distribution over next token
    ↓  sample
next token

What Comes Next

  • DL6: Deep dive into the attention mechanism — how exactly tokens communicate and update each other
  • DL7: Deep dive into MLP blocks — how facts and patterns are stored in the feed-forward layers