Key Takeaway

  • Image generation = Brownian motion in reverse: start from pure noise, remove it step by step
  • A U-Net predicts the noise at each step; CLIP text embeddings guide it via cross-attention
  • Latent diffusion runs in a VAE-compressed space (~48× smaller) — this is why Stable Diffusion is fast
  • Classifier-free guidance amplifies the prompt by extrapolating toward conditional and away from unconditional predictions

This chapter answers:

How do modern AI systems generate high-quality images and videos from a text prompt? What are diffusion models, what role does CLIP play, and what is the actual mathematics of turning noise into an image?


The Big Question

When you type “a photograph of an astronaut riding a horse on the moon” into Stable Diffusion or DALL-E, a photorealistic image appears in seconds. This chapter builds a complete, mathematically grounded explanation of how that is possible — going well beyond the common oversimplification that “the model just learns to remove noise.”

The key insight Welch frames early: AI image generation is Brownian motion run backwards. Ordinary diffusion describes how a drop of ink spreads into water until it is undetectable. AI diffusion models learn to reverse exactly that process — starting from pure random noise and recovering coherent structure, step by step.


Part 1: CLIP — Connecting Language and Images

Before diffusion models can turn text into images, there must be a bridge between the two modalities. That bridge is CLIP (Contrastive Language-Image Pretraining), released by OpenAI in February 2021.

The Architecture: Two Encoders, One Shared Space

CLIP consists of two neural networks trained simultaneously:

Encoder | Input | Output
Text encoder (Transformer) | A sentence or phrase | A high-dimensional embedding vector
Image encoder (ViT or ResNet) | An image | A high-dimensional embedding vector in the same space

The central design goal: matching images and their captions should produce nearby vectors in the shared embedding space. A photo of a dog and the caption “a photo of a dog” should map to very similar vectors.

Contrastive Training Objective

CLIP is trained on hundreds of millions of (image, caption) pairs scraped from the internet. During each training step, a batch of image-text pairs is processed:

  • The text encoder produces text embedding vectors T_1, …, T_N
  • The image encoder produces image embedding vectors I_1, …, I_N

This creates an N × N similarity matrix where entry (i, j) is:

    s_ij = (I_i · T_j) / (||I_i|| ||T_j||)

This is the cosine similarity — the cosine of the angle between the two vectors.

The training objective (contrastive loss) simultaneously:

  1. Maximizes cosine similarity along the diagonal (matched pairs: image i with caption i)
  2. Minimizes cosine similarity off the diagonal (mismatched pairs: image i with caption j, for i ≠ j)

After training, the model has learned a rich shared semantic space. Images of dogs and the phrase “a dog” land in the same neighborhood, regardless of breed, style, or framing.
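The similarity matrix and the symmetric contrastive loss can be sketched in NumPy. This is a toy illustration, not the actual CLIP implementation: the batch size, embedding dimension, and the function name `clip_similarity_and_loss` are invented, and the temperature of 0.07 is just a common default.

```python
import numpy as np

def clip_similarity_and_loss(img_emb, txt_emb, temperature=0.07):
    """Cosine-similarity matrix and symmetric contrastive loss for a batch
    of matched (image, caption) embedding pairs."""
    # L2-normalize so dot products equal cosine similarities
    I = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    T = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = I @ T.T                  # (N, N): entry (i, j) = cos sim of image i, caption j
    logits = sim / temperature

    def xent(l):                   # cross-entropy with the diagonal as the correct class
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average the image→text and text→image directions
    loss = 0.5 * (xent(logits) + xent(logits.T))
    return sim, loss

# Toy batch: each "image" embedding is a slightly perturbed copy of its caption's
rng = np.random.default_rng(0)
txt = rng.normal(size=(4, 8))
img = txt + 0.01 * rng.normal(size=(4, 8))
sim, loss = clip_similarity_and_loss(img, txt)
```

With matched pairs nearly identical, the diagonal dominates each row and the loss is small; training pushes real embeddings toward exactly this configuration.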

Emergent Properties of the CLIP Embedding Space

The embedding space encodes rich relational structure. Vector arithmetic reveals semantic relationships, much like the famous king − man + woman ≈ queen analogy for word embeddings in language models.

This arithmetic works because the encoder has learned to disentangle visual concepts into independent directions in the shared space.

CLIP also enables zero-shot image classification: given an image, compute its cosine similarity against text prompts like “a photo of a cat”, “a photo of a dog”, “a photo of a car”, and pick the highest-scoring label. No fine-tuning on a labeled dataset is required.


Part 2: Diffusion Models — The Forward Process

Intuition: Systematic Destruction

Take any training image. Now add a tiny amount of Gaussian random noise to every pixel. Repeat this many times. After enough steps, the image is indistinguishable from pure noise — its structure has been completely destroyed.

This is the forward diffusion process: a fixed, predefined recipe (with no learned parameters) for progressively corrupting an image over T timesteps (typically T = 1000).

Mathematical Formulation

At each step t, a small amount of Gaussian noise is added:

    x_t = √(1 − β_t) · x_{t−1} + √β_t · ε_t,    ε_t ~ N(0, I)

Where:

  • β_t is the noise schedule — a small value controlling how much noise is added at step t
  • √(1 − β_t) slightly shrinks the previous image (to keep variance bounded)
  • ε_t ~ N(0, I) is the added Gaussian noise

The Noise Schedule

The sequence β_1, …, β_T determines how quickly information is destroyed.

Linear schedule (Ho et al. 2020): β_t grows linearly from β_1 = 10⁻⁴ to β_T = 0.02.

Cosine schedule (Nichol & Dhariwal 2021): designed so that the signal-to-noise ratio decays smoothly, avoiding excessive noise being added in the very first steps:

    ᾱ_t = f(t) / f(0),    f(t) = cos²( ((t/T + s) / (1 + s)) · (π/2) )

where s is a small offset (0.008 in the paper).

The Key Shortcut: Sampling at Any Timestep Directly

The most important mathematical fact about the forward process is that you can jump directly from to without simulating all intermediate steps.

Define:

    α_t = 1 − β_t,    ᾱ_t = ∏_{s=1}^{t} α_s

Then the closed-form expression for x_t given the original image x_0 is:

    x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε,    ε ~ N(0, I)

This is derived by repeatedly applying the Gaussian convolution formula and collapsing the telescoping product of α values.

Physical interpretation:

  • When t is small: ᾱ_t ≈ 1, so x_t ≈ x_0 — mostly original image
  • When t is large: ᾱ_t ≈ 0, so x_t ≈ ε — mostly noise
  • The original image and noise are blended by coefficients √ᾱ_t and √(1 − ᾱ_t) that lie on the unit circle (they satisfy (√ᾱ_t)² + (√(1 − ᾱ_t))² = 1)

This shortcut is critical for training efficiency: given any image, you can instantly generate a noisy version at any corruption level without running 1000 sequential steps.
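A minimal NumPy sketch of this shortcut, using the linear schedule from Ho et al. (2020) and a random array as a stand-in "image":

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear schedule (Ho et al. 2020)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)          # ᾱ_t = product of α_s up to t

def q_sample(x0, t, eps):
    """Jump directly from x_0 to x_t using the closed-form expression."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.normal(size=(64, 64))          # stand-in "image"
eps = rng.normal(size=(64, 64))

early = q_sample(x0, 10, eps)    # ᾱ_10 ≈ 1: mostly the original image
late = q_sample(x0, 999, eps)    # ᾱ_999 ≈ 0: mostly noise
```

`early` is strongly correlated with `x0` and `late` with `eps`, matching the physical interpretation above; no sequential simulation of the 1000 intermediate steps is needed.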


Part 3: Diffusion Models — The Reverse Process (Denoising)

The Core Idea

The forward process destroys structure in a known, mathematical way. The reverse process must undo this destruction — but it cannot simply be inverted, because the noise injection is stochastic (random). Instead, a neural network is trained to estimate what noise was added, then remove it.

What the Network Learns

The network takes two inputs and predicts one output:

  • Input 1: the noisy image x_t at some timestep t
  • Input 2: the timestep t itself (so the network knows how noisy the image is)
  • Output: a prediction ε_θ(x_t, t) of the noise ε that was added

Rather than directly predicting the denoised image, the network predicts the noise ε blended into x_t by the closed-form forward expression. This seemingly indirect objective turns out to be more effective for training stability.

The Training Algorithm

For each training step:

  1. Sample a clean image x_0 from the training dataset
  2. Sample a random timestep t ~ Uniform{1, …, T}
  3. Sample Gaussian noise ε ~ N(0, I)
  4. Compute the noisy image: x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε
  5. Run the network forward: ε̂ = ε_θ(x_t, t)
  6. Compute the loss: L = ||ε − ε̂||²
  7. Backpropagate and update weights

This is the simplified DDPM objective from Ho et al. (2020): the mean squared error between the true noise and the predicted noise. The full variational lower bound (ELBO) derivation prescribes a particular weighting for each timestep, but the simplified uniform weighting performs better in practice.

The Sampling Algorithm (Generation)

Once the network is trained, generating an image works as follows:

  1. Start with pure Gaussian noise: x_T ~ N(0, I)
  2. For t = T, T−1, …, 1:
    a. Predict the noise: ε̂ = ε_θ(x_t, t)
    b. Compute the estimated clean image: x̂_0 = (x_t − √(1 − ᾱ_t) · ε̂) / √ᾱ_t
    c. Compute the mean for the previous step: μ_{t−1} = (1/√α_t) · (x_t − (β_t / √(1 − ᾱ_t)) · ε̂)
    d. Sample: x_{t−1} = μ_{t−1} + σ_t · z, where z ~ N(0, I) (except at t = 1, where z = 0)
  3. Return x_0
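The loop can be sketched in NumPy. Since no trained network is available here, an invented "oracle" predictor `eps_theta` stands in for the U-Net: it returns the exact noise separating x_t from a single known `target` vector, so the sampler should recover that target.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

rng = np.random.default_rng(0)
target = rng.normal(size=16)             # the one "image" our oracle knows

def eps_theta(x_t, t):
    """Oracle stand-in for the trained U-Net: the exact noise that maps
    `target` to x_t at level t."""
    return (x_t - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1 - alpha_bar[t])

# Ancestral sampling: start from pure noise, denoise step by step,
# re-injecting fresh noise at every step except the last (step 2d).
x = rng.normal(size=16)                        # 1. x_T ~ N(0, I)
for t in range(T - 1, -1, -1):                 # 2. t = T, ..., 1 (0-indexed here)
    eps_hat = eps_theta(x, t)                  # 2a. predict the noise
    mean = (x - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])  # 2c.
    z = rng.normal(size=16) if t > 0 else 0.0  # no fresh noise at the final step
    x = mean + np.sqrt(betas[t]) * z           # 2d. sample x_{t-1}
```

With the oracle, the chain contracts toward `target` at every step and lands on it exactly; with a real network the same loop produces a novel sample.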

The Crucial Quirk: Noise Is Added Back Each Step

This is a key point Welch emphasizes, and one that surprises many people. During generation, after computing the less-noisy x_{t−1}, a small amount of fresh noise is re-injected (step 2d above). This is not a bug — it is essential.

Why? The denoising network produces a mean estimate. Adding small noise at each step:

  • Prevents the generation from collapsing to a blurry average
  • Allows the model to explore different modes of the distribution
  • Is mathematically justified by the posterior q(x_{t−1} | x_t, x_0), which has nonzero variance

This is directly analogous to how MCMC (Markov Chain Monte Carlo) methods use random perturbations to sample from complex distributions rather than greedily descending to a single point.


Part 4: The U-Net — The Neural Network Architecture

Why U-Net?

The noise predictor must take in an image (at varying noise levels) and output an image of the same spatial dimensions. The architecture used for this is a U-Net, originally developed for biomedical image segmentation.

U-Net Structure

Input (noisy image x_t + timestep t)
        ↓
[Encoder: Conv + GroupNorm + ResBlock]  — 64×64
        ↓ downsample
[Encoder: Conv + GroupNorm + ResBlock]  — 32×32
        ↓ downsample
[Encoder: Conv + GroupNorm + ResBlock]  — 16×16
        ↓ downsample
[Bottleneck: ResBlock + Attention]      —  8×8
        ↑ upsample + skip connection
[Decoder: ResBlock + Attention]         — 16×16
        ↑ upsample + skip connection
[Decoder: ResBlock + Attention]         — 32×32
        ↑ upsample + skip connection
[Decoder: ResBlock + Attention]         — 64×64
        ↓
Output (predicted noise, same shape as input)

The defining feature is the skip connections: intermediate encoder feature maps are concatenated with the corresponding decoder feature maps. This gives the decoder access to both global context (from the bottleneck) and fine-grained local detail (from the skip connections), producing a sharper output.

Timestep Conditioning

The timestep is encoded as a sinusoidal positional embedding (the same mechanism used for sequence position in transformers), then passed through a small MLP to produce a timestep vector. This vector is injected into every ResNet block via a learned affine transformation (scale-and-shift), conditioning the network’s behavior on its current noise level.
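The sinusoidal embedding itself is easy to sketch. The dimension of 128 and the max period of 10,000 below follow common implementations, and the function name is invented; the MLP and scale-and-shift injection are omitted.

```python
import numpy as np

def timestep_embedding(t, dim=128, max_period=10000):
    """Sinusoidal embedding of a timestep, as used for transformer positions:
    half the channels are sines, half cosines, at geometrically spaced frequencies."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

emb_early = timestep_embedding(5)     # a low-noise timestep
emb_late = timestep_embedding(900)    # a high-noise timestep
```

Distinct timesteps map to distinct, bounded vectors, giving the ResNet blocks a smooth signal for "how noisy is my input right now."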


Part 5: Text Conditioning — Making the Model Listen to Prompts

An unconditional diffusion model generates random images. To generate images matching a text prompt, the U-Net must be conditioned on text information at every denoising step.

CLIP as the Text Encoder

The text prompt is first encoded using the CLIP text encoder (a Transformer):

  • Input: text prompt (up to 77 tokens)
  • Output: a sequence of 77 embedding vectors, each of 768 dimensions

These 77 vectors serve as the language signal that guides denoising.

Cross-Attention in the U-Net

Text conditioning is integrated into the U-Net through cross-attention layers inserted between the ResNet blocks:

    Attention(Q, K, V) = softmax(Q · Kᵀ / √d_k) · V

Where:

  • Q = W_Q · φ(x_t) — queries come from the current spatial feature map (image information)
  • K = W_K · τ(y) — keys come from the text embeddings
  • V = W_V · τ(y) — values also come from the text embeddings

At each spatial position in the feature map, the model attends over all 77 text token embeddings, weighing how relevant each token is to that image region. This allows the network to spatially associate text concepts with image regions — “sky” tokens attend to the top of the image, “grass” tokens to the bottom, and so on.
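A single-head NumPy sketch of this mechanism (the feature dimensions, weight scales, and function name are invented toy values; real U-Nets use multi-head attention with learned projections):

```python
import numpy as np

def cross_attention(x, text, Wq, Wk, Wv):
    """Single-head cross-attention: image features query text tokens.
    x: (P, d) flattened spatial positions; text: (77, d_txt) token embeddings."""
    Q = x @ Wq                                      # queries from image features
    K = text @ Wk                                   # keys from text embeddings
    V = text @ Wv                                   # values from text embeddings
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (P, 77): token relevance per position
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over the 77 text tokens
    return weights @ V                              # text-informed features per position

rng = np.random.default_rng(0)
P, d, d_txt, d_k = 64, 32, 768, 32                  # 8×8 spatial positions, toy dims
x = rng.normal(size=(P, d))
text = rng.normal(size=(77, d_txt))                 # CLIP: 77 tokens × 768 dims
out = cross_attention(x, text,
                      0.1 * rng.normal(size=(d, d_k)),
                      0.1 * rng.normal(size=(d_txt, d_k)),
                      0.1 * rng.normal(size=(d_txt, d_k)))
```

Each of the 64 spatial positions produces its own softmax over the 77 tokens, which is exactly the per-region "which words matter here" weighting described above.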


Part 6: Classifier-Free Guidance — Amplifying the Prompt

The Problem

Even with text conditioning, early diffusion models tended to loosely follow prompts. The generated images were plausible but often only vaguely related to the text. The text conditioning was too weak relative to the model’s prior over natural images.

Classifier Guidance (Predecessor)

An early approach trained a separate noise-level-aware classifier p_φ(y | x_t) to score how well a noisy intermediate image matched a class label y. The denoising direction was then perturbed toward higher classifier probability:

    ε̃ = ε_θ(x_t, t) − s · √(1 − ᾱ_t) · ∇_{x_t} log p_φ(y | x_t)

This worked but required training and running a separate classifier at every denoising step — expensive and cumbersome.

Classifier-Free Guidance (CFG)

Ho & Salimans (2022) proposed a far more elegant solution requiring only a single model. The trick: during training, randomly drop the text conditioning with some probability p (typically 10–20%), replacing the text embedding with a null/empty embedding. This trains the model to operate both with and without text.

At inference, two noise predictions are computed:

  • ε_θ(x_t, t, c) — conditioned on the text prompt c
  • ε_θ(x_t, t, ∅) — unconditioned (empty prompt)

These are combined with a guidance scale s:

    ε̃ = ε_θ(x_t, t, ∅) + s · (ε_θ(x_t, t, c) − ε_θ(x_t, t, ∅))

For s > 1, this extrapolates the noise prediction away from the unconditional prediction and toward the conditional prediction, amplifying the influence of the text prompt.
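The combination rule is one line of code. A sketch with invented toy vectors (in a real sampler the two predictions come from the same U-Net, run with and without the text embedding):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: extrapolate past the conditional prediction,
    away from the unconditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_u = rng.normal(size=4)   # stand-in unconditional noise prediction
eps_c = rng.normal(size=4)   # stand-in conditional noise prediction
```

At scale 0 the prompt is ignored, at scale 1 the conditional prediction is used unchanged, and larger scales push further along the (conditional − unconditional) direction.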

Effect of Guidance Scale

Guidance scale s | Effect
s = 0 | Fully unconditional — ignores the text prompt
s ≈ 1–3 | Moderate prompt adherence
s ≈ 7.5 | Strong prompt adherence (typical default in Stable Diffusion)
s ≳ 15 | Oversaturation, loss of naturalism — model overreaches

There is a fundamental tension: higher guidance increases prompt fidelity but decreases diversity and realism. The model is being pushed beyond the manifold of natural images toward an extreme that matches the text as forcefully as possible.


Part 7: Latent Diffusion — Why Stable Diffusion Is Fast

The Pixel Space Problem

Running diffusion directly on 512×512 pixel images means:

  • Each image has 512 × 512 × 3 = 786,432 values
  • The U-Net must process this at every one of ~1000 denoising steps
  • This is computationally intractable on consumer hardware

The Latent Space Solution

Latent Diffusion Models (LDM), introduced by Rombach et al. (2022), move the entire diffusion process into a compressed latent space. The system trains a Variational Autoencoder (VAE) separately, then runs diffusion on the compressed codes.

1. Variational Autoencoder (VAE)

The VAE has two components:

  • Encoder E: compresses a 512×512×3 pixel image x into a 64×64×4 latent code z = E(x)
  • Decoder D: reconstructs the image from the latent code: x̂ = D(z)

For Stable Diffusion v1:

Space | Dimensions | Values
Pixel space | 512 × 512 × 3 | 786,432
Latent space | 64 × 64 × 4 | 16,384
Compression factor | 786,432 / 16,384 | 48×

The VAE is pre-trained with a combination of reconstruction loss, perceptual loss (using a VGG feature network), and an adversarial (GAN-style) loss to encourage sharp, realistic reconstructions.

2. Diffusion in Latent Space

All diffusion (both forward noising and reverse denoising) happens on the latent codes z, not on pixel images. The U-Net operates on 64×64×4 tensors instead of 512×512×3 — roughly 48 times fewer values per step.
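The compression factor follows directly from the two shapes:

```python
# Values per image in pixel space vs. Stable Diffusion v1's latent space
pixel_values = 512 * 512 * 3      # 786,432
latent_values = 64 * 64 * 4       # 16,384
compression = pixel_values / latent_values   # 48× fewer values per U-Net step
```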

3. Full Generation Pipeline

Text prompt
    ↓  CLIP text encoder
Text embeddings (77 × 768)
    ↓  condition cross-attention at each step
Pure latent noise z_T ~ N(0, I)   [64×64×4]
    ↓  U-Net denoising  ×T steps
    ↓  (classifier-free guidance applied each step)
Denoised latent z_0   [64×64×4]
    ↓  VAE decoder D
Final image x_0   [512×512×3]

Why This Works: Perceptual Compression

The key insight is that most of the information in a natural image is perceptually redundant. The VAE encoder learns to preserve semantically meaningful structure (object shapes, colors, textures, composition) while discarding high-frequency noise and imperceptible detail. Diffusion in this compact space is not only faster but may actually produce better results, because the model operates on semantic features rather than raw pixel values.


Part 8: Accelerated Sampling — From 1000 Steps to 20

The DDIM Shortcut

1000 denoising steps is slow even in latent space. DDIM (Denoising Diffusion Implicit Models), by Song et al. (2020), derives a non-Markovian generalization of the diffusion process that enables deterministic sampling in far fewer steps (20–50 is typical).

The DDIM update formula:

    x_{t−1} = √ᾱ_{t−1} · x̂_0 + √(1 − ᾱ_{t−1} − σ_t²) · ε̂ + σ_t · z,    x̂_0 = (x_t − √(1 − ᾱ_t) · ε̂) / √ᾱ_t

where η controls stochasticity via σ_t = η · √((1 − ᾱ_{t−1}) / (1 − ᾱ_t)) · √(1 − ᾱ_t / ᾱ_{t−1}):

  • η = 0: fully deterministic — the same starting noise produces the same image every time (DDIM mode)
  • η = 1: recovers standard stochastic DDPM

The key property: DDIM's deterministic trajectory allows step skipping. Instead of visiting every timestep t = T, T−1, …, 1, you can jump along a sparse subsequence (e.g. t = 1000 → 980 → 960 → …) with only minor quality loss.
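A NumPy sketch of deterministic DDIM (η = 0) with 50 evenly spaced steps instead of 1000. As in the DDPM sampling sketch, an invented oracle predictor for a single known `target` stands in for a trained network, so the skipped-step trajectory should still land on the target.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
target = rng.normal(size=16)

def eps_theta(x_t, t):
    """Oracle noise predictor for one known image (stand-in for a network)."""
    return (x_t - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1 - alpha_bar[t])

def ddim_step(x, t, t_prev):
    """Deterministic DDIM update (eta = 0), jumping from level t to t_prev."""
    eps_hat = eps_theta(x, t)
    x0_hat = (x - np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
    ab_prev = alpha_bar[t_prev] if t_prev >= 0 else 1.0   # ᾱ = 1 means fully clean
    return np.sqrt(ab_prev) * x0_hat + np.sqrt(1 - ab_prev) * eps_hat

steps = np.linspace(T - 1, 0, 50).astype(int)   # 50 steps instead of 1000
x = rng.normal(size=16)                         # start from pure noise
for t, t_prev in zip(steps[:-1], steps[1:]):
    x = ddim_step(x, t, t_prev)
x = ddim_step(x, steps[-1], -1)                 # final jump to the clean image
```

Because each update first estimates x̂_0 and then re-noises it to the next (possibly distant) level, skipping intermediate timesteps is well-defined — which is exactly what DDPM's stochastic chain does not allow.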

Practical Tradeoffs

Sampler | Steps | Speed | Diversity | Notes
DDPM | 1000 | Slow | High | Original method
DDIM | 50 | Fast | Low (deterministic) | Good for iterative edits
DPM-Solver | 20–25 | Very fast | Medium | Adaptive step solver
LCM | 4–8 | Extremely fast | Medium | Distilled consistency model

Part 9: Connection to Score Matching

The Score Function

Score-based generative models provide an alternative but deeply equivalent perspective on diffusion.

The score function of a distribution p(x) is its gradient-of-log-density:

    s(x) = ∇_x log p(x)

Geometrically, the score points in the direction of steepest increase in probability density — toward more likely data points. If you know the score function, you can sample from the distribution using Langevin dynamics:

    x_{k+1} = x_k + (δ/2) · ∇_x log p(x_k) + √δ · z_k,    z_k ~ N(0, I)

Starting from any random point and following this noisy gradient ascent on log p(x) produces samples that converge to the true distribution as the step size δ → 0 and the number of iterations → ∞.
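For a distribution whose score is known in closed form, Langevin dynamics is a few lines. This sketch samples from a 1-D Gaussian N(3, 0.5²), whose score is (μ − x)/σ² (the target distribution and all constants are chosen here purely for illustration):

```python
import numpy as np

def score(x, mu=3.0, sigma=0.5):
    """Score of N(mu, sigma^2): the gradient of log p(x) is (mu - x) / sigma^2."""
    return (mu - x) / sigma**2

rng = np.random.default_rng(0)
delta = 0.01                       # step size
x = rng.normal(size=5000)          # 5000 chains, started far from the target
for _ in range(2000):
    z = rng.normal(size=5000)
    # Langevin update: half-step of gradient ascent plus injected noise
    x = x + 0.5 * delta * score(x) + np.sqrt(delta) * z
```

After mixing, the empirical mean and standard deviation of the chains match the target's μ = 3 and σ = 0.5 (up to a small discretization bias) — the injected noise is what keeps the chains sampling the distribution instead of collapsing onto its mode.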

Connection to DDPM

The noise prediction network ε_θ and the score network s_θ are equivalent up to a scaling factor:

    s_θ(x_t, t) = − ε_θ(x_t, t) / √(1 − ᾱ_t)

Predicting the noise added to x_t is mathematically identical to estimating the score of the noised distribution at level t. This unification, formalized by Song et al. (2021) as a stochastic differential equation (SDE) framework, shows that DDPM, DDIM, and score-based models are all instances of the same underlying process.

SDE Formulation

The forward diffusion process can be written as a continuous-time SDE:

    dx = f(x, t) dt + g(t) dw

where f(x, t) is a drift term, g(t) controls the noise magnitude, and w is a standard Wiener process (Brownian motion). The corresponding reverse-time SDE that undoes the diffusion is:

    dx = [ f(x, t) − g(t)² · ∇_x log p_t(x) ] dt + g(t) dw̄

where w̄ is a reverse-time Wiener process.

This opens up probability flow ODEs for likelihood computation, and principled design of noise schedules and samplers.


Part 10: Extending to Video Generation

The same diffusion framework used for images extends naturally to video by treating a short video clip as a 4D tensor: spatial dimensions H × W, channels C, and a temporal dimension F (frames).

Key Challenges

Challenge | Description
Temporal consistency | Adjacent frames must be coherent — objects cannot teleport or flicker
Computational cost | A clip of F frames carries roughly F times the data of a single image
Motion realism | Physics, camera movement, and motion blur must be plausible

Architectural Approaches

Spatial-Temporal Attention

The U-Net is augmented with temporal attention layers that operate along the frame axis:

  • Spatial attention (existing): pixels within the same frame attend to each other
  • Temporal attention (new): each spatial position attends to the same position across all other frames

This allows the model to maintain consistent appearance of objects while learning coherent motion patterns.
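The factorization is essentially a reshape: the same attention routine runs once over the spatial axis and once over the frame axis. A toy NumPy sketch (identity projections instead of learned Q/K/V weights, and invented toy dimensions):

```python
import numpy as np

def attend(x):
    """Self-attention over the second-to-last axis, with identity projections."""
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)          # softmax over the attended axis
    return w @ x

rng = np.random.default_rng(0)
F, P, d = 8, 16, 32                             # frames, spatial positions, channels
video = rng.normal(size=(F, P, d))

# Spatial attention: within each frame, positions attend to each other
spatial_out = attend(video)                     # (F, P, d)

# Temporal attention: regroup by position, attend along the frame axis
temporal_in = video.swapaxes(0, 1)              # (P, F, d)
temporal_out = attend(temporal_in).swapaxes(0, 1)   # back to (F, P, d)
```

Splitting attention this way costs O(F·P²) + O(P·F²) instead of O((F·P)²) for full spatio-temporal attention, which is what makes it affordable inside a video U-Net.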

Latent Video Diffusion

Extend the VAE to encode entire video clips into a latent tensor, then run diffusion in that compressed space. The spatial compression factor remains the same (~48×), but now frames are also compressed temporally, making the computation tractable.

Conditioning Signals for Video

Modern video generation models condition on:

  • Text prompt (via CLIP)
  • A reference image (for image-to-video generation)
  • Camera motion parameters
  • Optical flow or depth maps
  • Motion magnitude signals

The “Brownian Motion Backwards” Connection

Welch’s framing crystallizes elegantly for video: Brownian motion describes the random walk of a particle being buffeted by thermal noise — a path that looks like a video of nothing but jitter. AI video generation is precisely the reverse: starting from a spatiotemporal field of pure random noise and recovering coherent motion — physics, narrative, causality — step by step, guided by a text description.


Summary: The Full Stack

Component | Role | Key Technique
CLIP text encoder | Convert text to embeddings | Contrastive pretraining
VAE encoder | Compress image to latent | Reconstruction + perceptual loss
Forward process | Add noise for training | Closed-form x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε
U-Net | Predict noise at each step | ResBlocks + cross-attention
Classifier-free guidance | Amplify prompt adherence | Extrapolate conditioned vs. unconditioned predictions
Reverse process | Denoise latent iteratively | DDPM/DDIM step
VAE decoder | Reconstruct image from latent | Transposed convolutions

Key Equations Reference

Equation | Meaning
x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε | Sample any noise level directly from the clean image
L = ||ε − ε_θ(x_t, t)||² | DDPM training objective (simplified MSE)
ε̃ = ε_θ(x_t, ∅) + s · (ε_θ(x_t, c) − ε_θ(x_t, ∅)) | Classifier-free guidance
s_θ(x_t, t) = −ε_θ(x_t, t) / √(1 − ᾱ_t) | Noise prediction = score function
s_ij = (I_i · T_j) / (||I_i|| ||T_j||) | CLIP cosine similarity

Key Concepts Summary

Concept | One-line meaning
CLIP | Shared embedding space for images and text via contrastive training
Contrastive loss | Push matched pairs together, mismatched pairs apart in embedding space
Forward diffusion | Progressively corrupt images with Gaussian noise over T steps
Noise schedule | Controls the rate of corruption; cosine schedule preferred over linear
DDPM objective | Train network to predict noise added at a random timestep — simple MSE
Noise re-injection | Fresh noise added back each generation step to prevent blurry averages
U-Net | Encoder-decoder with skip connections; backbone of the denoising network
Cross-attention | Text tokens attend to image spatial features; how prompts guide generation
Classifier-free guidance | Extrapolate toward conditional, away from unconditional noise prediction
Guidance scale s | Dial controlling prompt fidelity vs. diversity tradeoff
Latent diffusion | Compress images ~48× with VAE; run diffusion in compact latent space
VAE | Variational autoencoder; encodes/decodes between pixel and latent space
DDIM | Deterministic sampling; quality generation in 20–50 steps instead of 1000
Score function | ∇_x log p(x) — the gradient of the log-probability density
Langevin dynamics | Noisy gradient ascent on log-probability to sample from a distribution

Connections

  • This series: builds directly on DL5 (Transformers), DL6 (Attention), and DL7 (MLPs) — the CLIP text encoder and the cross-attention layers in the U-Net are attention mechanisms of exactly the same type covered in those chapters
  • Key papers:
    • Ho et al. (2020) — “Denoising Diffusion Probabilistic Models” (DDPM)
    • Song et al. (2020) — “Denoising Diffusion Implicit Models” (DDIM)
    • Ho & Salimans (2022) — “Classifier-Free Diffusion Guidance”
    • Rombach et al. (2022) — “High-Resolution Image Synthesis with Latent Diffusion Models” (Stable Diffusion)
    • Song et al. (2021) — “Score-Based Generative Modeling through Stochastic Differential Equations”
    • Radford et al. (2021) — “Learning Transferable Visual Models From Natural Language Supervision” (CLIP)
  • Guest creator: Welch Labs — see also The Welch Labs Illustrated Guide to AI (2025), Chapter 9: “Video and Image Generation — AI videos are Brownian motion backwards”