3Blue1Brown Deep Learning — Knowledge Map

Series arc: Structure (DL1) → Learning (DL2–4) → Architecture (DL5–7) → Generation (DL8)

Core pattern in every chapter: intuition first · math second · example third

DL1 · Neural Network Structure
DL2–4 · How Networks Learn
DL5–7 · Transformer Architecture
DL8 · Image & Video Generation
Recurring tools — appear across all sections

Weights & Biases

The only things that change during training.

Weight = how important is this input connection?
Bias = shifts the neuron's threshold up or down.

Neuron

Holds one number called an activation.

Computes: σ(w₁a₁ + w₂a₂ + ... + b)
· weighted sum of inputs
· + bias
· → activation function
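A minimal Python sketch of that computation, using sigmoid as the activation (function names are my own):

```python
import math

def sigmoid(z):
    # Squash any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    # Weighted sum of inputs, plus bias, through the activation function.
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    return sigmoid(z)

# A neuron with two input connections:
a = neuron([1.0, 0.5], [0.4, -0.2], 0.1)  # sigmoid(0.4*1.0 - 0.2*0.5 + 0.1)
```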

Layers

Input → Hidden(s) → Output

Each layer learns a deeper abstraction:
pixels → edges → shapes → digit

Deep = many hidden layers.

Cost Function

One number: how wrong is the model right now?

For one example: L = Σ(aⱼ − yⱼ)²
Over all training data: C = average of all L

Goal: make C as small as possible.
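The two formulas above, sketched directly (illustrative helper names, not from the series):

```python
def example_loss(output, target):
    # L for one training example: sum of squared differences.
    return sum((a - y) ** 2 for a, y in zip(output, target))

def cost(outputs, targets):
    # C: the average of L over the whole training set.
    losses = [example_loss(a, y) for a, y in zip(outputs, targets)]
    return sum(losses) / len(losses)
```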

Gradient Descent

Imagine C as a landscape of hills and valleys.
We want to roll downhill.

θ ← θ − η · ∇C(θ)

∇C = direction of steepest increase → go opposite.
η (learning rate) = how big each step is.
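The update rule as a loop, on a toy one-parameter landscape (the landscape and names are mine, for illustration):

```python
def grad_descent(theta, grad, lr=0.1, steps=100):
    # Repeatedly step against the gradient: theta <- theta - lr * grad(theta).
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Toy landscape C(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3).
# Rolling downhill from theta = 0 should settle near the valley at 3.
theta = grad_descent(0.0, lambda t: 2 * (t - 3))
```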

Backpropagation (Intuition)

Start at the output error.
Pass blame backward layer by layer.

Three factors determine each weight's update:
· how wrong the output was
· how active the sending neuron was
· how responsive the receiving neuron's activation is

A weight that feeds many downstream paths sums the blame from all of them.

Forward Pass

a^(l+1) = σ(W · a^(l) + b)

Data flows forward layer by layer.
Final output = prediction.
Compare prediction vs target → produce cost.
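The layer-by-layer flow, sketched with plain lists (a minimal version; real code would use matrix libraries):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(a, layers):
    # layers is a list of (W, b) pairs; each step computes sigma(W·a + b).
    for W, b in layers:
        a = [sigmoid(sum(w * x for w, x in zip(row, a)) + bi)
             for row, bi in zip(W, b)]
    return a  # final activations = the prediction

# One input, one layer with a zero weight and zero bias: sigmoid(0) = 0.5.
prediction = forward([1.0], [([[0.0]], [0.0])])
```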

Attention Block

Query = 'what am I looking for?'
Key = 'what do I contain?'
Value = 'what do I send if selected?'

Score = Q·K / √d → softmax → weights → Σ(weight × V)

96 heads in parallel, each learns a different
type of relationship (syntax, coreference, meaning…)
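A single head of that Score → softmax → weighted-sum pipeline, sketched over lists of vectors (names and the tiny example are mine):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    # Scaled dot-product attention: Q·K / sqrt(d) -> softmax -> weighted sum of V.
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# One query aligned with the first key pulls mostly the first value.
result = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0], [0.0]])
```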

Activation Functions

Sigmoid: squash to (0, 1) → 'firing probability'
ReLU: max(0, z) → keep positive, zero out negative

Purpose: add nonlinearity so the network
can learn complex patterns (not just straight lines).
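Both functions in one line each:

```python
import math

def sigmoid(z):
    # Squash to (0, 1) — the 'firing probability' view.
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    # Keep positive values, zero out negatives.
    return max(0.0, z)
```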

Chain Rule (Calculus)

∂C/∂w = ∂C/∂a · ∂a/∂z · ∂z/∂w
· ∂C/∂a — how wrong the output was
· ∂a/∂z — how responsive the activation is
· ∂z/∂w — how active the input was

One neuron = one chain of 3 factors.
Many neurons = sum all paths to that weight.
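The three-factor chain can be checked numerically against a finite difference, for a single neuron with made-up values (everything here is a toy example of my own):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One neuron: z = w*a_in + b, a = sigmoid(z), C = (a - y)^2.
w, b, a_in, y = 0.5, 0.0, 1.0, 1.0
z = w * a_in + b
a = sigmoid(z)

dC_da = 2 * (a - y)                     # how wrong the output was
da_dz = sigmoid(z) * (1 - sigmoid(z))   # how responsive the activation is
dz_dw = a_in                            # how active the input was
dC_dw = dC_da * da_dz * dz_dw           # the full chain
```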

SGD — Mini-batch

Computing gradient over all data = too slow.
Instead: random sample ~100 examples per step.
Approximate gradient. Noisier path, but 100× faster.

This is how every modern neural network trains.
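One SGD step on a toy problem (the data and helper names are illustrative, not from the series):

```python
import random

def sgd_step(theta, data, grad_one, lr=0.1, batch_size=100):
    # Estimate the gradient from a random mini-batch instead of the full set.
    batch = random.sample(data, min(batch_size, len(data)))
    g = sum(grad_one(theta, x) for x in batch) / len(batch)
    return theta - lr * g

# Toy problem: minimize average (theta - x)^2; the optimum is the data mean, 2.0.
data = [1.0, 3.0] * 50
theta = 0.0
for _ in range(100):
    theta = sgd_step(theta, data, grad_one=lambda t, x: 2 * (t - x))
```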

Residual Stream

Each block ADDS to the vector — never replaces it.

Attention block writes → MLP block writes → ...
Info from early layers is preserved at every layer.
12,288 dims = high-bandwidth working memory.
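The add-never-replace pattern, sketched for one block (the stand-in sub-blocks are mine):

```python
def transformer_block(x, attn, mlp):
    # Each sub-block ADDS its output to the residual stream; nothing is overwritten.
    x = [xi + ai for xi, ai in zip(x, attn(x))]  # attention block writes
    x = [xi + mi for xi, mi in zip(x, mlp(x))]   # MLP block writes
    return x

# If both sub-blocks write nothing, the input survives untouched —
# this is how early-layer information is preserved at every depth.
zero = lambda v: [0.0] * len(v)
out = transformer_block([1.0, 2.0], zero, zero)
```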

Token → Embedding

Text splits into tokens (sub-words, ~50K vocab).
Each token → a 12,288-dim vector via matrix W_E.

Directions in this space = meaning:
queen − king ≈ woman − man
Words with similar meaning cluster nearby.

MLP Block (Fact Storage)

Up-project → ReLU gate → Down-project

W↑ rows = pattern detectors (keys)
ReLU = fires ONLY when pattern matches
W↓ columns = facts to inject (values)

IF context = 'Michael Jordan' THEN add 'basketball'
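The up-project → ReLU → down-project pipeline in miniature (a toy sketch; real blocks use huge matrices):

```python
def mlp_block(x, W_up, b_up, W_down):
    # Up-project: each row of W_up is a pattern detector (a "key").
    # ReLU gates: a unit fires only when its pattern matches.
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W_up, b_up)]
    # Down-project: each firing unit injects its column of W_down (a "value").
    return [sum(W_down[j][i] * h[j] for j in range(len(h)))
            for i in range(len(W_down[0]))]
```

With a matching input the detector fires and the fact is injected; with the opposite input, ReLU silences it and nothing is added.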

Output: Logits → Softmax → Token

Last token's vector × W_U → 50,257 raw scores (logits)
Softmax converts → probability distribution.
Sample next token → append → repeat.

Temperature controls creativity (high T = more random).
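Logits → probabilities with a temperature knob, sketched (function name is mine):

```python
import math

def softmax_with_temperature(logits, T=1.0):
    # Divide logits by T before softmax: high T flattens, low T sharpens.
    scaled = [l / T for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]
```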

Vectors & Dot Products

The universal currency.
Dot product = how aligned two vectors are.
Used in: every layer (Wx), attention (Q·K), CLIP similarity.

Softmax

Raw scores → valid probability distribution.
All values 0–1, sum to 1.
Used in: attention weights · output token prediction · CLIP.

Gradient / ∇

Direction of steepest increase.
Negative gradient = direction to reduce cost.
Used in: gradient descent (DL2) · CFG score matching (DL8).

Residual Connections

Output = input + f(input) — add, don't replace.
Preserves information from earlier layers.
Used in: Transformer residual stream (DL5) · U-Net skip connections (DL8).

U-Net Denoising

Neural network that predicts the noise at each step.
Encoder → bottleneck → decoder + skip connections.

Text prompt guides via cross-attention:
(same Q·K·V mechanism as DL6)
Queries from image, Keys/Values from CLIP text embedding.

Forward Diffusion

Add Gaussian noise step by step (T = 1000 steps).
Eventually: pure random noise, structure destroyed.

Shortcut formula — jump to any noise level directly:
x_t = √ᾱ_t · x₀ + √(1−ᾱ_t) · ε
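The shortcut formula as code, over a flat list of pixel values (a minimal sketch; `alpha_bar_t` is the cumulative noise-schedule product ᾱ_t):

```python
import math, random

def diffuse(x0, alpha_bar_t, eps=None):
    # Jump straight to noise level t: x_t = sqrt(abar)*x0 + sqrt(1-abar)*eps.
    if eps is None:
        eps = [random.gauss(0.0, 1.0) for _ in x0]  # fresh Gaussian noise
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]
```

At ᾱ_t = 1 the image is untouched; at ᾱ_t = 0 only noise remains, structure fully destroyed.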

CLIP (Text + Image Bridge)

Two encoders trained together:
· Text encoder (Transformer) → vector
· Image encoder (ViT) → same vector space

Matching pairs → close vectors.
Mismatched pairs → far apart.
(Contrastive training on 400M image-caption pairs)

Latent Diffusion (Speed)

Running diffusion on 512×512 pixels × 3 channels ≈ 786K values. Slow.

VAE compresses image → 64×64×4 latent = 16K values.
48× smaller. Run all denoising steps in latent space.
VAE decoder reconstructs the final image at the end.

This is why Stable Diffusion fits on consumer GPUs.

Classifier-Free Guidance

Run U-Net twice per step: with prompt + without prompt.
Extrapolate toward conditional, away from unconditional:

ε̂ = (w+1)·ε_cond − w·ε_uncond

w = 7–15 is typical. Higher w = more prompt-faithful,
but less diverse and often less realistic.
</gr-replace>
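The extrapolation formula as a one-liner over noise predictions (a sketch; in practice these are tensors from the two U-Net passes):

```python
def cfg_noise(eps_cond, eps_uncond, w=7.5):
    # Push past the conditional prediction, away from the unconditional one:
    # eps_hat = (w+1)*eps_cond - w*eps_uncond.
    return [(w + 1.0) * c - w * u for c, u in zip(eps_cond, eps_uncond)]
```

When the two predictions agree, guidance changes nothing; the bigger their gap, the harder the result is pushed toward the prompt.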