Key Takeaway

  • Image generation = Brownian motion in reverse: start from pure noise, remove it step by step
  • A U-Net predicts the noise at each step; CLIP text embeddings guide it via cross-attention
  • Latent diffusion runs in a VAE-compressed space (~48× smaller) — this is why Stable Diffusion is fast
  • Classifier-free guidance amplifies the prompt by extrapolating toward conditional and away from unconditional predictions

This chapter answers:

How do modern AI systems generate high-quality images and videos from a text prompt? What are diffusion models, what role does CLIP play, and what is the actual mathematics of turning noise into an image?


The Big Question

When you type “a photograph of an astronaut riding a horse on the moon” into Stable Diffusion or DALL-E, a photorealistic image appears in seconds. This chapter builds a complete, mathematically grounded explanation of how that is possible — going well beyond the common oversimplification that “the model just learns to remove noise.”

The key insight Welch frames early: AI image generation is Brownian motion run backwards. Ordinary diffusion describes how a drop of ink spreads into water until it is undetectable. AI diffusion models learn to reverse exactly that process — starting from pure random noise and recovering coherent structure, step by step.


Part 1: CLIP — Connecting Language and Images

Before diffusion models can turn text into images, there must be a bridge between the two modalities. That bridge is CLIP (Contrastive Language-Image Pretraining), released by OpenAI in February 2021.

The Architecture: Two Encoders, One Shared Space

CLIP consists of two neural networks trained simultaneously:

Encoder | Input | Output
Text encoder (Transformer) | A sentence or phrase | A high-dimensional embedding vector
Image encoder (ViT or ResNet) | An image | A high-dimensional embedding vector in the same space

The central design goal: matching images and their captions should produce nearby vectors in the shared embedding space. A photo of a dog and the caption “a photo of a dog” should map to very similar vectors.

Contrastive Training Objective

CLIP is trained on hundreds of millions of (image, caption) pairs scraped from the internet. During each training step, a batch of image-text pairs is processed:

  • The text encoder produces text embedding vectors T_1, …, T_N
  • The image encoder produces image embedding vectors I_1, …, I_N

This creates an N × N similarity matrix where entry (i, j) is:

    s_ij = (I_i · T_j) / (||I_i|| ||T_j||)

This is the cosine similarity — the cosine of the angle between the two vectors.

The training objective (contrastive loss) simultaneously:

  1. Maximizes cosine similarity along the diagonal (matched pairs: image i with caption i)
  2. Minimizes cosine similarity off the diagonal (mismatched pairs: image i with caption j, for i ≠ j)

After training, the model has learned a rich shared semantic space. Images of dogs and the phrase “a dog” land in the same neighborhood, regardless of breed, style, or framing.
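The similarity matrix and the symmetric contrastive loss can be sketched in NumPy. This is a toy illustration, not the actual CLIP implementation: the batch size, embedding dimension, and the function name `clip_similarity_and_loss` are invented, and the temperature of 0.07 is just a common default.

```python
import numpy as np

def clip_similarity_and_loss(img_emb, txt_emb, temperature=0.07):
    """Cosine-similarity matrix and symmetric contrastive loss for a batch
    of matched (image, caption) embedding pairs."""
    # L2-normalize so dot products equal cosine similarities
    I = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    T = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = I @ T.T                  # (N, N): entry (i, j) = cos sim of image i, caption j
    logits = sim / temperature

    def xent(l):                   # cross-entropy with the diagonal as the correct class
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average the image→text and text→image directions
    loss = 0.5 * (xent(logits) + xent(logits.T))
    return sim, loss

# Toy batch: each "image" embedding is a slightly perturbed copy of its caption's
rng = np.random.default_rng(0)
txt = rng.normal(size=(4, 8))
img = txt + 0.01 * rng.normal(size=(4, 8))
sim, loss = clip_similarity_and_loss(img, txt)
```

With matched pairs nearly identical, the diagonal dominates each row and the loss is small; training pushes real embeddings toward exactly this configuration.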

Emergent Properties of the CLIP Embedding Space

The embedding space encodes rich relational structure. Vector arithmetic reveals semantic relationships, much like the famous king − man + woman ≈ queen analogy for word embeddings in language models.

This arithmetic works because the encoder has learned to disentangle visual concepts into independent directions in the shared space.

CLIP also enables zero-shot image classification: given an image, compute its cosine similarity against text prompts like “a photo of a cat”, “a photo of a dog”, “a photo of a car”, and pick the highest-scoring label. No fine-tuning on a labeled dataset is required.


Part 2: Diffusion Models — The Forward Process

Intuition: Systematic Destruction

Take any training image. Now add a tiny amount of Gaussian random noise to every pixel. Repeat this many times. After enough steps, the image is indistinguishable from pure noise — its structure has been completely destroyed.

This is the forward diffusion process: a fixed, predefined recipe (with no learned parameters) for progressively corrupting an image over T timesteps (typically T = 1000).

Mathematical Formulation

At each step t, a small amount of Gaussian noise is added:

    x_t = √(1 − β_t) · x_{t−1} + √β_t · ε_t,    ε_t ~ N(0, I)

Where:

  • β_t is the noise schedule — a small value controlling how much noise is added at step t
  • √(1 − β_t) slightly shrinks the previous image (to keep variance bounded)
  • ε_t ~ N(0, I) is the added Gaussian noise

The Noise Schedule

The sequence β_1, …, β_T determines how quickly information is destroyed.

Linear schedule (Ho et al. 2020): β_t grows linearly from β_1 = 10⁻⁴ to β_T = 0.02.

Cosine schedule (Nichol & Dhariwal 2021): designed so that the signal-to-noise ratio decays smoothly, avoiding excessive noise being added in the very first steps:

    ᾱ_t = f(t) / f(0),    f(t) = cos²( ((t/T + s) / (1 + s)) · (π/2) )

where s is a small offset (0.008 in the paper).

The Key Shortcut: Sampling at Any Timestep Directly

The most important mathematical fact about the forward process is that you can jump directly from to without simulating all intermediate steps.

Define:

    α_t = 1 − β_t,    ᾱ_t = ∏_{s=1}^{t} α_s

Then the closed-form expression for x_t given the original image x_0 is:

    x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε,    ε ~ N(0, I)

This is derived by repeatedly applying the Gaussian convolution formula and collapsing the telescoping product of α values.

Physical interpretation:

  • When t is small: ᾱ_t ≈ 1, so x_t ≈ x_0 — mostly original image
  • When t is large: ᾱ_t ≈ 0, so x_t ≈ ε — mostly noise
  • The original image and noise are blended by coefficients √ᾱ_t and √(1 − ᾱ_t) that lie on the unit circle (they satisfy (√ᾱ_t)² + (√(1 − ᾱ_t))² = 1)

This shortcut is critical for training efficiency: given any image, you can instantly generate a noisy version at any corruption level without running 1000 sequential steps.
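A minimal NumPy sketch of this shortcut, using the linear schedule from Ho et al. (2020) and a random array as a stand-in "image":

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear schedule (Ho et al. 2020)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)          # ᾱ_t = product of α_s up to t

def q_sample(x0, t, eps):
    """Jump directly from x_0 to x_t using the closed-form expression."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.normal(size=(64, 64))          # stand-in "image"
eps = rng.normal(size=(64, 64))

early = q_sample(x0, 10, eps)    # ᾱ_10 ≈ 1: mostly the original image
late = q_sample(x0, 999, eps)    # ᾱ_999 ≈ 0: mostly noise
```

`early` is strongly correlated with `x0` and `late` with `eps`, matching the physical interpretation above; no sequential simulation of the 1000 intermediate steps is needed.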


Part 3: Diffusion Models — The Reverse Process (Denoising)

The Core Idea

The forward process destroys structure in a known, mathematical way. The reverse process must undo this destruction — but it cannot simply be inverted, because the noise injection is stochastic (random). Instead, a neural network is trained to estimate what noise was added, then remove it.

What the Network Learns

The network takes two inputs and predicts one output:

  • Input 1: the noisy image x_t at some timestep t
  • Input 2: the timestep t itself (so the network knows how noisy the image is)
  • Output: a prediction ε_θ(x_t, t) of the noise ε that was added

Rather than directly predicting the denoised image, the network predicts the noise ε blended into x_t by the closed-form forward expression. This seemingly indirect objective turns out to be more effective for training stability.

The Training Algorithm

For each training step:

  1. Sample a clean image x_0 from the training dataset
  2. Sample a random timestep t ~ Uniform{1, …, T}
  3. Sample Gaussian noise ε ~ N(0, I)
  4. Compute the noisy image: x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε
  5. Run the network forward: ε̂ = ε_θ(x_t, t)
  6. Compute the loss: L = ||ε − ε̂||²
  7. Backpropagate and update weights

This is the simplified DDPM objective from Ho et al. (2020): the mean squared error between the true noise and the predicted noise. The full variational lower bound (ELBO) derivation prescribes a particular weighting for each timestep, but the simplified uniform weighting performs better in practice.

The Sampling Algorithm (Generation)

Once the network is trained, generating an image works as follows:

  1. Start with pure Gaussian noise: x_T ~ N(0, I)
  2. For t = T, T−1, …, 1:
    a. Predict the noise: ε̂ = ε_θ(x_t, t)
    b. Compute the estimated clean image: x̂_0 = (x_t − √(1 − ᾱ_t) · ε̂) / √ᾱ_t
    c. Compute the mean for the previous step: μ_{t−1} = (1/√α_t) · (x_t − (β_t / √(1 − ᾱ_t)) · ε̂)
    d. Sample: x_{t−1} = μ_{t−1} + σ_t · z, where z ~ N(0, I) (except at t = 1, where z = 0)
  3. Return x_0
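The loop can be sketched in NumPy. Since no trained network is available here, an invented "oracle" predictor `eps_theta` stands in for the U-Net: it returns the exact noise separating x_t from a single known `target` vector, so the sampler should recover that target.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

rng = np.random.default_rng(0)
target = rng.normal(size=16)             # the one "image" our oracle knows

def eps_theta(x_t, t):
    """Oracle stand-in for the trained U-Net: the exact noise that maps
    `target` to x_t at level t."""
    return (x_t - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1 - alpha_bar[t])

# Ancestral sampling: start from pure noise, denoise step by step,
# re-injecting fresh noise at every step except the last (step 2d).
x = rng.normal(size=16)                        # 1. x_T ~ N(0, I)
for t in range(T - 1, -1, -1):                 # 2. t = T, ..., 1 (0-indexed here)
    eps_hat = eps_theta(x, t)                  # 2a. predict the noise
    mean = (x - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])  # 2c.
    z = rng.normal(size=16) if t > 0 else 0.0  # no fresh noise at the final step
    x = mean + np.sqrt(betas[t]) * z           # 2d. sample x_{t-1}
```

With the oracle, the chain contracts toward `target` at every step and lands on it exactly; with a real network the same loop produces a novel sample.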

The Crucial Quirk: Noise Is Added Back Each Step

This is a key point Welch emphasizes, and one that surprises many people. During generation, after computing the less-noisy x_{t−1}, a small amount of fresh noise is re-injected (step 2d above). This is not a bug — it is essential.

Why? The denoising network produces a mean estimate. Adding small noise at each step:

  • Prevents the generation from collapsing to a blurry average
  • Allows the model to explore different modes of the distribution
  • Is mathematically justified by the posterior q(x_{t−1} | x_t, x_0), which has nonzero variance

This is directly analogous to how MCMC (Markov Chain Monte Carlo) methods use random perturbations to sample from complex distributions rather than greedily descending to a single point.


Part 4: The U-Net — The Neural Network Architecture

Why U-Net?

The noise predictor must take in an image (at varying noise levels) and output an image of the same spatial dimensions. The architecture used for this is a U-Net, originally developed for biomedical image segmentation.

U-Net Structure

Input (noisy image x_t + timestep t)
        ↓
[Encoder: Conv + GroupNorm + ResBlock]  — 64×64
        ↓ downsample
[Encoder: Conv + GroupNorm + ResBlock]  — 32×32
        ↓ downsample
[Encoder: Conv + GroupNorm + ResBlock]  — 16×16
        ↓ downsample
[Bottleneck: ResBlock + Attention]      —  8×8
        ↑ upsample + skip connection
[Decoder: ResBlock + Attention]         — 16×16
        ↑ upsample + skip connection
[Decoder: ResBlock + Attention]         — 32×32
        ↑ upsample + skip connection
[Decoder: ResBlock + Attention]         — 64×64
        ↓
Output (predicted noise, same shape as input)

The defining feature is the skip connections: intermediate encoder feature maps are concatenated with the corresponding decoder feature maps. This gives the decoder access to both global context (from the bottleneck) and fine-grained local detail (from the skip connections), producing a sharper output.

Timestep Conditioning

The timestep is encoded as a sinusoidal positional embedding (the same mechanism used for sequence position in transformers), then passed through a small MLP to produce a timestep vector. This vector is injected into every ResNet block via a learned affine transformation (scale-and-shift), conditioning the network’s behavior on its current noise level.
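The sinusoidal embedding itself is easy to sketch. The dimension of 128 and the max period of 10,000 below follow common implementations, and the function name is invented; the MLP and scale-and-shift injection are omitted.

```python
import numpy as np

def timestep_embedding(t, dim=128, max_period=10000):
    """Sinusoidal embedding of a timestep, as used for transformer positions:
    half the channels are sines, half cosines, at geometrically spaced frequencies."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

emb_early = timestep_embedding(5)     # a low-noise timestep
emb_late = timestep_embedding(900)    # a high-noise timestep
```

Distinct timesteps map to distinct, bounded vectors, giving the ResNet blocks a smooth signal for "how noisy is my input right now."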


Part 5: Text Conditioning — Making the Model Listen to Prompts

An unconditional diffusion model generates random images. To generate images matching a text prompt, the U-Net must be conditioned on text information at every denoising step.

CLIP as the Text Encoder

The text prompt is first encoded using the CLIP text encoder (a Transformer):

  • Input: text prompt (up to 77 tokens)
  • Output: a sequence of 77 embedding vectors, each of 768 dimensions

These 77 vectors serve as the language signal that guides denoising.

Cross-Attention in the U-Net

Text conditioning is integrated into the U-Net through cross-attention layers inserted between the ResNet blocks:

    Attention(Q, K, V) = softmax(Q · Kᵀ / √d_k) · V

Where:

  • Q = W_Q · φ(x_t) — queries come from the current spatial feature map (image information)
  • K = W_K · τ(y) — keys come from the text embeddings
  • V = W_V · τ(y) — values also come from the text embeddings

At each spatial position in the feature map, the model attends over all 77 text token embeddings, weighing how relevant each token is to that image region. This allows the network to spatially associate text concepts with image regions — “sky” tokens attend to the top of the image, “grass” tokens to the bottom, and so on.
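A single-head NumPy sketch of this mechanism (the feature dimensions, weight scales, and function name are invented toy values; real U-Nets use multi-head attention with learned projections):

```python
import numpy as np

def cross_attention(x, text, Wq, Wk, Wv):
    """Single-head cross-attention: image features query text tokens.
    x: (P, d) flattened spatial positions; text: (77, d_txt) token embeddings."""
    Q = x @ Wq                                      # queries from image features
    K = text @ Wk                                   # keys from text embeddings
    V = text @ Wv                                   # values from text embeddings
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (P, 77): token relevance per position
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over the 77 text tokens
    return weights @ V                              # text-informed features per position

rng = np.random.default_rng(0)
P, d, d_txt, d_k = 64, 32, 768, 32                  # 8×8 spatial positions, toy dims
x = rng.normal(size=(P, d))
text = rng.normal(size=(77, d_txt))                 # CLIP: 77 tokens × 768 dims
out = cross_attention(x, text,
                      0.1 * rng.normal(size=(d, d_k)),
                      0.1 * rng.normal(size=(d_txt, d_k)),
                      0.1 * rng.normal(size=(d_txt, d_k)))
```

Each of the 64 spatial positions produces its own softmax over the 77 tokens, which is exactly the per-region "which words matter here" weighting described above.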


Part 6: Classifier-Free Guidance — Amplifying the Prompt

The Problem

Even with text conditioning, early diffusion models tended to loosely follow prompts. The generated images were plausible but often only vaguely related to the text. The text conditioning was too weak relative to the model’s prior over natural images.

Classifier Guidance (Predecessor)

An early approach trained a separate noise-level-aware classifier p_φ(y | x_t) to score how well a noisy intermediate image matched a class label y. The denoising direction was then perturbed toward higher classifier probability:

    ε̃ = ε_θ(x_t, t) − s · √(1 − ᾱ_t) · ∇_{x_t} log p_φ(y | x_t)

This worked but required training and running a separate classifier at every denoising step — expensive and cumbersome.

Classifier-Free Guidance (CFG)

Ho & Salimans (2022) proposed a far more elegant solution requiring only a single model. The trick: during training, randomly drop the text conditioning with some probability p (typically 10–20%), replacing the text embedding with a null/empty embedding. This trains the model to operate both with and without text.

At inference, two noise predictions are computed:

  • ε_θ(x_t, t, c) — conditioned on the text prompt c
  • ε_θ(x_t, t, ∅) — unconditioned (empty prompt)

These are combined with a guidance scale s:

    ε̃ = ε_θ(x_t, t, ∅) + s · (ε_θ(x_t, t, c) − ε_θ(x_t, t, ∅))

For s > 1, this extrapolates the noise prediction away from the unconditional prediction and toward the conditional prediction, amplifying the influence of the text prompt.
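The combination rule is one line of code. A sketch with invented toy vectors (in a real sampler the two predictions come from the same U-Net, run with and without the text embedding):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: extrapolate past the conditional prediction,
    away from the unconditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_u = rng.normal(size=4)   # stand-in unconditional noise prediction
eps_c = rng.normal(size=4)   # stand-in conditional noise prediction
```

At scale 0 the prompt is ignored, at scale 1 the conditional prediction is used unchanged, and larger scales push further along the (conditional − unconditional) direction.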

Effect of Guidance Scale

Guidance scale s | Effect
s = 0 | Fully unconditional — ignores the text prompt
s ≈ 1–3 | Moderate prompt adherence
s ≈ 7.5 | Strong prompt adherence (typical default in Stable Diffusion)
s ≳ 15 | Oversaturation, loss of naturalism — model overreaches

There is a fundamental tension: higher guidance increases prompt fidelity but decreases diversity and realism. The model is being pushed beyond the manifold of natural images toward an extreme that matches the text as forcefully as possible.


Part 7: Latent Diffusion — Why Stable Diffusion Is Fast

The Pixel Space Problem

Running diffusion directly on 512×512 pixel images means:

  • Each image has 512 × 512 × 3 = 786,432 values
  • The U-Net must process this at every one of ~1000 denoising steps
  • This is computationally intractable on consumer hardware

The Latent Space Solution

Latent Diffusion Models (LDM), introduced by Rombach et al. (2022), move the entire diffusion process into a compressed latent space. The system trains a Variational Autoencoder (VAE) separately, then runs diffusion on the compressed codes.

1. Variational Autoencoder (VAE)

The VAE has two components:

  • Encoder E: compresses a 512×512×3 pixel image x into a 64×64×4 latent code z = E(x)
  • Decoder D: reconstructs the image from the latent code: x̂ = D(z)

For Stable Diffusion v1:

Space | Dimensions | Values
Pixel space | 512 × 512 × 3 | 786,432
Latent space | 64 × 64 × 4 | 16,384
Compression factor | 786,432 / 16,384 | 48×

The VAE is pre-trained with a combination of reconstruction loss, perceptual loss (using a VGG feature network), and an adversarial (GAN-style) loss to encourage sharp, realistic reconstructions.

2. Diffusion in Latent Space

All diffusion (both forward noising and reverse denoising) happens on the latent codes z, not on pixel images. The U-Net operates on 64×64×4 tensors instead of 512×512×3 — roughly 48 times fewer values per step.
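The compression factor follows directly from the two shapes:

```python
# Values per image in pixel space vs. Stable Diffusion v1's latent space
pixel_values = 512 * 512 * 3      # 786,432
latent_values = 64 * 64 * 4       # 16,384
compression = pixel_values / latent_values   # 48× fewer values per U-Net step
```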

3. Full Generation Pipeline

Text prompt
    ↓  CLIP text encoder
Text embeddings (77 × 768)
    ↓  condition cross-attention at each step
Pure latent noise z_T ~ N(0, I)   [64×64×4]
    ↓  U-Net denoising  ×T steps
    ↓  (classifier-free guidance applied each step)
Denoised latent z_0   [64×64×4]
    ↓  VAE decoder D
Final image x_0   [512×512×3]

Why This Works: Perceptual Compression

The key insight is that most of the information in a natural image is perceptually redundant. The VAE encoder learns to preserve semantically meaningful structure (object shapes, colors, textures, composition) while discarding high-frequency noise and imperceptible detail. Diffusion in this compact space is not only faster but may actually produce better results, because the model operates on semantic features rather than raw pixel values.


Part 8: Accelerated Sampling — From 1000 Steps to 20

The DDIM Shortcut

1000 denoising steps is slow even in latent space. DDIM (Denoising Diffusion Implicit Models), by Song et al. (2020), derives a non-Markovian generalization of the diffusion process that enables deterministic sampling in far fewer steps (20–50 is typical).

The DDIM update formula:

    x_{t−1} = √ᾱ_{t−1} · x̂_0 + √(1 − ᾱ_{t−1} − σ_t²) · ε̂ + σ_t · z,    x̂_0 = (x_t − √(1 − ᾱ_t) · ε̂) / √ᾱ_t

where η controls stochasticity via σ_t = η · √((1 − ᾱ_{t−1}) / (1 − ᾱ_t)) · √(1 − ᾱ_t / ᾱ_{t−1}):

  • η = 0: fully deterministic — the same starting noise produces the same image every time (DDIM mode)
  • η = 1: recovers standard stochastic DDPM

The key property: DDIM's deterministic trajectory allows step skipping. Instead of visiting every timestep t = T, T−1, …, 1, you can jump along a sparse subsequence (e.g. t = 1000 → 980 → 960 → …) with only minor quality loss.
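A NumPy sketch of deterministic DDIM (η = 0) with 50 evenly spaced steps instead of 1000. As in the DDPM sampling sketch, an invented oracle predictor for a single known `target` stands in for a trained network, so the skipped-step trajectory should still land on the target.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
target = rng.normal(size=16)

def eps_theta(x_t, t):
    """Oracle noise predictor for one known image (stand-in for a network)."""
    return (x_t - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1 - alpha_bar[t])

def ddim_step(x, t, t_prev):
    """Deterministic DDIM update (eta = 0), jumping from level t to t_prev."""
    eps_hat = eps_theta(x, t)
    x0_hat = (x - np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
    ab_prev = alpha_bar[t_prev] if t_prev >= 0 else 1.0   # ᾱ = 1 means fully clean
    return np.sqrt(ab_prev) * x0_hat + np.sqrt(1 - ab_prev) * eps_hat

steps = np.linspace(T - 1, 0, 50).astype(int)   # 50 steps instead of 1000
x = rng.normal(size=16)                         # start from pure noise
for t, t_prev in zip(steps[:-1], steps[1:]):
    x = ddim_step(x, t, t_prev)
x = ddim_step(x, steps[-1], -1)                 # final jump to the clean image
```

Because each update first estimates x̂_0 and then re-noises it to the next (possibly distant) level, skipping intermediate timesteps is well-defined — which is exactly what DDPM's stochastic chain does not allow.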

Practical Tradeoffs

Sampler | Steps | Speed | Diversity | Notes
DDPM | 1000 | Slow | High | Original method
DDIM | 50 | Fast | Low (deterministic) | Good for iterative edits
DPM-Solver | 20–25 | Very fast | Medium | Adaptive step solver
LCM | 4–8 | Extremely fast | Medium | Distilled consistency model

Part 9: Connection to Score Matching

The Score Function

Score-based generative models provide an alternative but deeply equivalent perspective on diffusion.

The score function of a distribution p(x) is its gradient-of-log-density:

    s(x) = ∇_x log p(x)

Geometrically, the score points in the direction of steepest increase in probability density — toward more likely data points. If you know the score function, you can sample from the distribution using Langevin dynamics:

    x_{k+1} = x_k + (δ/2) · ∇_x log p(x_k) + √δ · z_k,    z_k ~ N(0, I)

Starting from any random point and following this noisy gradient ascent on log p(x) produces samples that converge to the true distribution as the step size δ → 0 and the number of iterations → ∞.
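For a distribution whose score is known in closed form, Langevin dynamics is a few lines. This sketch samples from a 1-D Gaussian N(3, 0.5²), whose score is (μ − x)/σ² (the target distribution and all constants are chosen here purely for illustration):

```python
import numpy as np

def score(x, mu=3.0, sigma=0.5):
    """Score of N(mu, sigma^2): the gradient of log p(x) is (mu - x) / sigma^2."""
    return (mu - x) / sigma**2

rng = np.random.default_rng(0)
delta = 0.01                       # step size
x = rng.normal(size=5000)          # 5000 chains, started far from the target
for _ in range(2000):
    z = rng.normal(size=5000)
    # Langevin update: half-step of gradient ascent plus injected noise
    x = x + 0.5 * delta * score(x) + np.sqrt(delta) * z
```

After mixing, the empirical mean and standard deviation of the chains match the target's μ = 3 and σ = 0.5 (up to a small discretization bias) — the injected noise is what keeps the chains sampling the distribution instead of collapsing onto its mode.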

Connection to DDPM

The noise prediction network ε_θ and the score network s_θ are equivalent up to a scaling factor:

    s_θ(x_t, t) = − ε_θ(x_t, t) / √(1 − ᾱ_t)

Predicting the noise added to x_t is mathematically identical to estimating the score of the noised distribution at level t. This unification, formalized by Song et al. (2021) as a stochastic differential equation (SDE) framework, shows that DDPM, DDIM, and score-based models are all instances of the same underlying process.

SDE Formulation

The forward diffusion process can be written as a continuous-time SDE:

    dx = f(x, t) dt + g(t) dw

where f(x, t) is a drift term, g(t) controls the noise magnitude, and w is a standard Wiener process (Brownian motion). The corresponding reverse-time SDE that undoes the diffusion is:

    dx = [ f(x, t) − g(t)² · ∇_x log p_t(x) ] dt + g(t) dw̄

where w̄ is a reverse-time Wiener process.

This opens up probability flow ODEs for likelihood computation, and principled design of noise schedules and samplers.


Part 10: Extending to Video Generation

The same diffusion framework used for images extends naturally to video by treating a short video clip as a 4D tensor: spatial dimensions H × W, channels C, and a temporal dimension F (frames).

Key Challenges

Challenge | Description
Temporal consistency | Adjacent frames must be coherent — objects cannot teleport or flicker
Computational cost | A clip of F frames carries roughly F times the data of a single image
Motion realism | Physics, camera movement, and motion blur must be plausible

Architectural Approaches

Spatial-Temporal Attention

The U-Net is augmented with temporal attention layers that operate along the frame axis:

  • Spatial attention (existing): pixels within the same frame attend to each other
  • Temporal attention (new): each spatial position attends to the same position across all other frames

This allows the model to maintain consistent appearance of objects while learning coherent motion patterns.
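The factorization is essentially a reshape: the same attention routine runs once over the spatial axis and once over the frame axis. A toy NumPy sketch (identity projections instead of learned Q/K/V weights, and invented toy dimensions):

```python
import numpy as np

def attend(x):
    """Self-attention over the second-to-last axis, with identity projections."""
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)          # softmax over the attended axis
    return w @ x

rng = np.random.default_rng(0)
F, P, d = 8, 16, 32                             # frames, spatial positions, channels
video = rng.normal(size=(F, P, d))

# Spatial attention: within each frame, positions attend to each other
spatial_out = attend(video)                     # (F, P, d)

# Temporal attention: regroup by position, attend along the frame axis
temporal_in = video.swapaxes(0, 1)              # (P, F, d)
temporal_out = attend(temporal_in).swapaxes(0, 1)   # back to (F, P, d)
```

Splitting attention this way costs O(F·P²) + O(P·F²) instead of O((F·P)²) for full spatio-temporal attention, which is what makes it affordable inside a video U-Net.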

Latent Video Diffusion

Extend the VAE to encode entire video clips into a latent tensor, then run diffusion in that compressed space. The spatial compression factor remains the same (~48×), but now frames are also compressed temporally, making the computation tractable.

Conditioning Signals for Video

Modern video generation models condition on:

  • Text prompt (via CLIP)
  • A reference image (for image-to-video generation)
  • Camera motion parameters
  • Optical flow or depth maps
  • Motion magnitude signals

The “Brownian Motion Backwards” Connection

Welch’s framing crystallizes elegantly for video: Brownian motion describes the random walk of a particle being buffeted by thermal noise — a path that looks like a video of nothing but jitter. AI video generation is precisely the reverse: starting from a spatiotemporal field of pure random noise and recovering coherent motion — physics, narrative, causality — step by step, guided by a text description.


Summary: The Full Stack

Component | Role | Key Technique
CLIP text encoder | Convert text to embeddings | Contrastive pretraining
VAE encoder | Compress image to latent | Reconstruction + perceptual loss
Forward process | Add noise for training | Closed-form x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε
U-Net | Predict noise at each step | ResBlocks + cross-attention
Classifier-free guidance | Amplify prompt adherence | Extrapolate conditioned vs. unconditioned predictions
Reverse process | Denoise latent iteratively | DDPM/DDIM step
VAE decoder | Reconstruct image from latent | Transposed convolutions

Key Equations Reference

Equation | Meaning
x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε | Sample any noise level directly from the clean image
L = ||ε − ε_θ(x_t, t)||² | DDPM training objective (simplified MSE)
ε̃ = ε_θ(x_t, ∅) + s · (ε_θ(x_t, c) − ε_θ(x_t, ∅)) | Classifier-free guidance
s_θ(x_t, t) = −ε_θ(x_t, t) / √(1 − ᾱ_t) | Noise prediction = score function
s_ij = (I_i · T_j) / (||I_i|| ||T_j||) | CLIP cosine similarity

Key Concepts Summary

Concept | One-line meaning
CLIP | Shared embedding space for images and text via contrastive training
Contrastive loss | Push matched pairs together, mismatched pairs apart in embedding space
Forward diffusion | Progressively corrupt images with Gaussian noise over T steps
Noise schedule | Controls the rate of corruption; cosine schedule preferred over linear
DDPM objective | Train network to predict noise added at a random timestep — simple MSE
Noise re-injection | Fresh noise added back each generation step to prevent blurry averages
U-Net | Encoder-decoder with skip connections; backbone of the denoising network
Cross-attention | Text tokens attend to image spatial features; how prompts guide generation
Classifier-free guidance | Extrapolate toward conditional, away from unconditional noise prediction
Guidance scale s | Dial controlling prompt fidelity vs. diversity tradeoff
Latent diffusion | Compress images ~48× with VAE; run diffusion in compact latent space
VAE | Variational autoencoder; encodes/decodes between pixel and latent space
DDIM | Deterministic sampling; quality generation in 20–50 steps instead of 1000
Score function | ∇_x log p(x) — the gradient of the log-probability density
Langevin dynamics | Noisy gradient ascent on log-probability to sample from a distribution

Connections

  • This series: builds directly on DL5 (Transformers), DL6 (Attention), and DL7 (MLPs) — the CLIP text encoder and the cross-attention layers in the U-Net are attention mechanisms of exactly the same type covered in those chapters
  • Key papers:
    • Ho et al. (2020) — “Denoising Diffusion Probabilistic Models” (DDPM)
    • Song et al. (2020) — “Denoising Diffusion Implicit Models” (DDIM)
    • Ho & Salimans (2022) — “Classifier-Free Diffusion Guidance”
    • Rombach et al. (2022) — “High-Resolution Image Synthesis with Latent Diffusion Models” (Stable Diffusion)
    • Song et al. (2021) — “Score-Based Generative Modeling through Stochastic Differential Equations”
    • Radford et al. (2021) — “Learning Transferable Visual Models From Natural Language Supervision” (CLIP)
  • Guest creator: Welch Labs — see also The Welch Labs Illustrated Guide to AI (2025), Chapter 9: “Video and Image Generation — AI videos are Brownian motion backwards”