Weights & Biases
The only things that change during training.
Weight = how important is this input connection?
Bias = shifts the neuron's threshold up or down.
3Blue1Brown Deep Learning — Knowledge Map
Series arc: Structure (DL1) → Learning (DL2–4) → Architecture (DL5–7) → Generation (DL8)
Core pattern in every chapter: intuition first · math second · example third
Neuron
Holds one number called an activation.
Computes: σ(w₁a₁ + w₂a₂ + ... + wₙaₙ + b)
· weighted sum of inputs
· + bias
· → activation function
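That computation, sketched in plain Python (the weights, inputs, and bias here are invented for illustration):

```python
import math

def sigmoid(z):
    # Squash the weighted sum into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    # z = w1*a1 + w2*a2 + ... + b, then the activation function.
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    return sigmoid(z)

a = neuron([0.5, 0.9], [2.0, -1.0], 0.1)   # one neuron, two input connections
```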
Layers
Input → Hidden(s) → Output
Each layer learns a deeper abstraction:
pixels → edges → shapes → digit
Deep = many hidden layers.
Cost Function
One number: how wrong is the model right now?
For one example: L = Σ(aⱼ − yⱼ)²
Over all training data: C = average of all L
Goal: make C as small as possible.
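The two-level definition as a sketch, with toy outputs and targets:

```python
def loss(output, target):
    # L for one example: sum of squared errors over output neurons.
    return sum((a - y) ** 2 for a, y in zip(output, target))

def cost(outputs, targets):
    # C: average of L over the whole training set.
    return sum(loss(o, t) for o, t in zip(outputs, targets)) / len(outputs)

C = cost([[0.9, 0.1], [0.2, 0.8]],   # predictions for two examples
         [[1.0, 0.0], [0.0, 1.0]])   # their targets
```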
Gradient Descent
Imagine C as a landscape of hills and valleys.
We want to roll downhill.
θ ← θ − η · ∇C(θ)
∇C = direction of steepest increase → go opposite.
η (learning rate) = how big each step is.
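The update rule on a one-parameter landscape, C(θ) = (θ − 3)², whose minimum sits at θ = 3 by construction:

```python
def grad(theta):
    # ∇C for C(θ) = (θ - 3)²: points toward steepest increase.
    return 2 * (theta - 3)

theta, eta = 0.0, 0.1
for _ in range(100):
    theta = theta - eta * grad(theta)   # θ ← θ - η · ∇C(θ)
# theta has rolled downhill to (very near) the minimum at 3.
```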
Backpropagation (Intuition)
Start at the output error.
Pass blame backward layer by layer.
For each weight, three factors determine the update:
· how wrong the output was
· how active the sending neuron was
· how many paths flow through it
Forward Pass
a^(l+1) = σ(W · a^(l) + b)
Data flows forward layer by layer.
Final output = prediction.
Compare prediction vs target → produce cost.
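The forward pass with NumPy, on a toy 3 → 4 → 2 network with random weights:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(a, layers):
    for W, b in layers:
        a = sigmoid(W @ a + b)   # a^(l+1) = σ(W · a^(l) + b)
    return a

rng = np.random.default_rng(0)
layers = [(rng.standard_normal((4, 3)), rng.standard_normal(4)),   # 3 -> 4
          (rng.standard_normal((2, 4)), rng.standard_normal(2))]   # 4 -> 2
pred = forward(np.array([0.2, 0.5, 0.8]), layers)   # the prediction
```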
Attention Block
Query = 'what am I looking for?'
Key = 'what do I contain?'
Value = 'what do I send if selected?'
Score = Q·K / √d → softmax → weights → Σ(weight × V)
96 heads in parallel (GPT-3's count), each learning a different
type of relationship (syntax, coreference, meaning…)
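One head's score → softmax → weighted-sum pipeline, sketched with random toy tensors:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))   # score = Q·K / √d, row-softmaxed
    return weights @ V                        # Σ(weight × V)

rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))   # 5 tokens, dim 8
out = attention(Q, K, V)
```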
Activation Functions
Sigmoid: squash to (0, 1) → 'firing probability'
ReLU: max(0, z) → keep positive, zero out negative
Purpose: add nonlinearity so the network
can learn complex patterns (not just straight lines).
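Both functions in two lines each:

```python
import math

def sigmoid(z):
    # Squash any z into (0, 1).
    return 1 / (1 + math.exp(-z))

def relu(z):
    # Keep positives, zero out negatives.
    return max(0.0, z)

vals = [sigmoid(0), relu(-2.0), relu(3.5)]   # [0.5, 0.0, 3.5]
```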
Chain Rule (Calculus)
∂C/∂w = ∂C/∂a · ∂a/∂z · ∂z/∂w
· ∂C/∂a = how wrong
· ∂a/∂z = how responsive
· ∂z/∂w = how active
One neuron = one chain of 3 factors.
Many neurons = sum all paths to that weight.
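The three factors for a single neuron, checked numerically against a finite difference (all values invented):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# One neuron: z = w·a_prev + b, a = σ(z), C = (a - y)²
a_prev, w, b, y = 0.8, 1.5, -0.3, 1.0
a = sigmoid(w * a_prev + b)

dC_da = 2 * (a - y)        # how wrong
da_dz = a * (1 - a)        # how responsive (sigmoid's slope)
dz_dw = a_prev             # how active
dC_dw = dC_da * da_dz * dz_dw

# Sanity check: central finite difference on C(w) gives the same number.
eps = 1e-6
C = lambda w_: (sigmoid(w_ * a_prev + b) - y) ** 2
numeric = (C(w + eps) - C(w - eps)) / (2 * eps)
```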
SGD — Mini-batch
Computing gradient over all data = too slow.
Instead: randomly sample ~100 examples per step.
Approximate gradient. Noisier path, but 100× faster.
This is how every modern neural network trains.
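A sketch fitting y = 2x with noisy mini-batch gradients (dataset size, batch size, and learning rate invented):

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, 10_000)   # full training set: too big per step
ys = 2 * xs                       # true slope is 2

w, eta, batch = 0.0, 0.1, 100
for _ in range(200):
    idx = rng.integers(0, len(xs), batch)    # random ~100-example sample
    x, y = xs[idx], ys[idx]
    g = np.mean(2 * (w * x - y) * x)         # approximate gradient of MSE
    w -= eta * g                             # noisy but cheap downhill step
# w has wandered down to (about) the true slope 2.
```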
Residual Stream
Each block ADDS to the vector — never replaces it.
Attention block writes → MLP block writes → ...
Info from early layers is preserved at every layer.
12,288 dims = high-bandwidth working memory.
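The add-never-replace update in miniature (stream width 16 instead of 12,288; the block outputs are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
stream = rng.standard_normal(16)             # residual stream for one token

attn_write = 0.1 * rng.standard_normal(16)   # stand-in for attention's output
mlp_write  = 0.1 * rng.standard_normal(16)   # stand-in for the MLP's output

stream = stream + attn_write   # the block ADDS its write...
stream = stream + mlp_write    # ...earlier information is never overwritten
```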
Token → Embedding
Text splits into tokens (sub-words, ~50K vocab).
Each token → a 12,288-dim vector via matrix W_E.
Directions in this space = meaning:
queen − king ≈ woman − man
Words with similar meaning cluster nearby.
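The analogy in 4 toy dimensions, hand-built so it holds exactly (real 12,288-dim embeddings are learned, and it holds only approximately):

```python
import numpy as np

E = {
    "king":  np.array([1.0, 1.0, 0.0, 0.0]),   # royal + male
    "queen": np.array([1.0, 0.0, 1.0, 0.0]),   # royal + female
    "man":   np.array([0.0, 1.0, 0.0, 1.0]),   # male + person
    "woman": np.array([0.0, 0.0, 1.0, 1.0]),   # female + person
}
royal_diff = E["queen"] - E["king"]   # the male -> female direction
plain_diff = E["woman"] - E["man"]    # same direction: it "means" gender
```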
MLP Block (Fact Storage)
Up-project → ReLU gate → Down-project
W↑ rows = pattern detectors (keys)
ReLU = fires ONLY when pattern matches
W↓ columns = facts to inject (values)
IF context = 'Michael Jordan' THEN add 'basketball'
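A toy key-value lookup in that shape, with one detector row and one fact column (all directions invented for illustration):

```python
import numpy as np

def mlp_block(x, W_up, b_up, W_down):
    h = np.maximum(0.0, W_up @ x + b_up)   # pattern detectors + ReLU gate
    return W_down @ h                      # inject the matching facts

jordan     = np.array([1.0, 0.0, 0.0])     # "context looks like Michael Jordan"
basketball = np.array([0.0, 0.0, 5.0])     # the fact to inject

W_up   = jordan[None, :]       # key row: does the input match this pattern?
b_up   = np.array([-0.5])      # threshold: fire only on a strong match
W_down = basketball[:, None]   # value column: what to add when it fires

on  = mlp_block(jordan, W_up, b_up, W_down)                      # gate fires
off = mlp_block(np.array([0.0, 1.0, 0.0]), W_up, b_up, W_down)   # gate shut
```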
Output: Logits → Softmax → Token
Last token's vector × W_U → 50,257 raw scores (logits)
Softmax converts → probability distribution.
Sample next token → append → repeat.
Temperature controls creativity (high T = more random).
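Sampling with temperature, sketched on a 3-token toy vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next(logits, temperature=1.0):
    z = logits / temperature          # high T flattens, low T sharpens
    p = np.exp(z - z.max())           # softmax (stable form)
    p = p / p.sum()
    return rng.choice(len(p), p=p)    # sample the next token

logits = np.array([2.0, 1.0, 0.1])
token = sample_next(logits, temperature=0.7)
# As T -> 0 this collapses to argmax; large T approaches uniform.
```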
Vectors & Dot Products
The universal currency.
Dot product = how aligned two vectors are.
Used in: every layer (Wx), attention (Q·K), CLIP similarity.
Softmax
Raw scores → valid probability distribution.
All values 0–1, sum to 1.
Used in: attention weights · output token prediction · CLIP.
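A numerically stable version (subtracting the max changes nothing mathematically but prevents overflow):

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))   # shift for stability
    return e / e.sum()

p = softmax(np.array([3.0, 1.0, -2.0, 1000.0]))   # huge score, no overflow
```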
Gradient / ∇
Direction of steepest increase.
Negative gradient = direction to reduce cost.
Used in: gradient descent (DL2) · CFG score matching (DL8).
Residual Connections
Output = input + f(input) — add, don't replace.
Preserves information from earlier layers.
Used in: Transformer residual stream (DL5) · U-Net skip connections (DL8).
U-Net Denoising
Neural network that predicts the noise at each step.
Encoder → bottleneck → decoder + skip connections.
Text prompt guides via cross-attention:
(same Q·K·V mechanism as DL6)
Queries from image, Keys/Values from CLIP text embedding.
Forward Diffusion
Add Gaussian noise step by step (T = 1000 steps).
Eventually: pure random noise, structure destroyed.
Shortcut formula — jump to any noise level directly:
x_t = √ᾱ_t · x₀ + √(1−ᾱ_t) · ε
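The shortcut in NumPy, assuming the common linear β schedule from the DDPM paper (1e-4 to 0.02 over T = 1000):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)    # per-step noise amounts
alpha_bar = np.cumprod(1 - betas)     # ᾱ_t: signal fraction left at step t

def noisy_at(x0, t):
    # x_t = √ᾱ_t · x₀ + √(1-ᾱ_t) · ε   (jump straight to step t)
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

x0 = rng.standard_normal(64)          # stand-in for a flattened image
x_late = noisy_at(x0, T - 1)          # near-pure noise, structure destroyed
```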
CLIP (Text + Image Bridge)
Two encoders trained together:
· Text encoder (Transformer) → vector
· Image encoder (ViT) → same vector space
Matching pairs → close vectors.
Mismatched pairs → far apart.
(Contrastive training on 400M image-caption pairs)
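What "close vs far" means, with stand-in embedding vectors (real ones come out of the two encoders):

```python
import numpy as np

def cosine(u, v):
    # Alignment of two vectors, independent of their lengths.
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

text_dog  = np.array([0.9, 0.1, 0.2])   # stand-in text embedding
image_dog = np.array([0.8, 0.2, 0.1])   # matching image: nearby vector
image_car = np.array([0.1, 0.9, 0.0])   # mismatched image: far vector

match    = cosine(text_dog, image_dog)
mismatch = cosine(text_dog, image_car)
```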
Latent Diffusion (Speed)
Running diffusion on 512×512 pixels = 786K values. Slow.
VAE compresses image → 64×64×4 latent = 16K values.
48× smaller. Run all denoising steps in latent space.
VAE decoder reconstructs the final image at the end.
This is why Stable Diffusion fits on consumer GPUs.
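The compression arithmetic:

```python
pixels = 512 * 512 * 3    # RGB image: 786,432 values
latent = 64 * 64 * 4      # VAE latent: 16,384 values
ratio = pixels // latent  # 48x fewer values to denoise
```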
Classifier-Free Guidance
Run U-Net twice per step: with prompt + without prompt.
Extrapolate toward conditional, away from unconditional:
ε̂ = (w+1)·ε_cond − w·ε_uncond
w = 7–15 is typical. Higher = more prompt-faithful,
but less diverse and less realistic.
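The extrapolation step, sketched on toy noise predictions:

```python
import numpy as np

def cfg(eps_cond, eps_uncond, w=7.5):
    # ε̂ = (w+1)·ε_cond - w·ε_uncond: past conditional, away from unconditional
    return (w + 1) * eps_cond - w * eps_uncond

eps_uncond = np.array([0.0, 0.0])   # U-Net run without the prompt
eps_cond   = np.array([1.0, 0.0])   # U-Net run with the prompt
guided = cfg(eps_cond, eps_uncond, w=7.5)
```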