Latent Diffusion Pipeline
Modern systems typically diffuse in latent space for efficiency: encode images into compact latents with a VAE, denoise those latents with a conditional backbone, then decode the result back to pixels.
VAE encode
Compress high-resolution pixels into a lower-dimensional latent tensor.
Latent denoise
Run iterative denoising with a conditional backbone over latent noise.
VAE decode
Project final latents back into RGB image space.
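The three stages above can be sketched end to end. This is a minimal illustration with toy stand-ins for the real networks (the `vae_encode`, `denoise_step`, and `vae_decode` functions here are hypothetical placeholders, not a real library API); the shapes mirror a common 512×512 RGB to 64×64×4 latent layout.

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_encode(pixels):
    # Toy encoder stand-in: average-pool 8x8 blocks, pad to 4 channels.
    # A real VAE encoder is a learned convolutional network.
    h, w, _ = pixels.shape
    lat = pixels.reshape(h // 8, 8, w // 8, 8, 3).mean(axis=(1, 3))
    return np.concatenate([lat, lat[..., :1]], axis=-1)  # (h/8, w/8, 4)

def denoise_step(latents, t):
    # Toy denoiser stand-in: shrink toward zero.
    # A real backbone (U-Net or DiT) predicts the noise residual at step t.
    return latents * 0.9

def vae_decode(latents):
    # Toy decoder stand-in: nearest-neighbor upsample 8x, drop extra channel.
    return np.repeat(np.repeat(latents[..., :3], 8, axis=0), 8, axis=1)

pixels = rng.random((512, 512, 3))
latents = vae_encode(pixels)           # (64, 64, 4): ~48x fewer elements
for t in reversed(range(20)):          # iterative latent-space denoising
    latents = denoise_step(latents, t)
image = vae_decode(latents)            # back to (512, 512, 3) pixel space
```

The efficiency gain comes from the middle loop: every denoising step runs on the small latent tensor rather than the full-resolution image.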
U-Net vs DiT Backbones
U-Nets dominated early diffusion models, with strong inductive biases for spatial detail. DiTs replace convolutions with transformer blocks and scale efficiently with data and compute.
U-Net
Convolutional encoder-decoder with skip connections for multi-scale spatial reconstruction.
DiT (Diffusion Transformer)
Transformer backbone over patchified latents, often strong at scale with large training budgets.
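The "patchified latents" idea can be shown concretely. This sketch flattens a latent grid into a token sequence the way a ViT-style patch embedding does; the patch size and shapes are illustrative assumptions, not values from the text.

```python
import numpy as np

def patchify(latents, p=2):
    # Split the (H, W, C) latent grid into non-overlapping p x p patches,
    # then flatten each patch into one token vector.
    h, w, c = latents.shape
    x = latents.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4).reshape((h // p) * (w // p), p * p * c)
    return x  # (num_tokens, token_dim): the transformer's input sequence

latents = np.zeros((64, 64, 4))       # typical latent grid for 512px images
tokens = patchify(latents)            # (1024, 16): 32x32 tokens of dim 2*2*4
```

Once the latents are a token sequence, the backbone is an ordinary transformer, which is what lets DiTs reuse transformer scaling recipes.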
Text Conditioning (CLIP/T5 + Cross-Attn)
Prompts are encoded (for example by CLIP or T5) and injected into denoising layers via cross-attention, aligning generated content with text semantics.
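The injection mechanism is cross-attention: image tokens form the queries, text-encoder outputs form the keys and values. A minimal single-head sketch, with random weights and illustrative dimensions standing in for a trained layer:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

img_tokens = rng.standard_normal((16, d))  # queries: latent patch features
txt_tokens = rng.standard_normal((5, d))   # keys/values: e.g. CLIP/T5 output

Q, K, V = img_tokens @ Wq, txt_tokens @ Wk, txt_tokens @ Wv
attn = softmax(Q @ K.T / np.sqrt(d))       # (16, 5): each image token
out = attn @ V                             # attends over the text tokens
```

Each row of `attn` is a distribution over prompt tokens, so every spatial location can pull in different parts of the text semantics.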
Classifier-Free Guidance (CFG)
CFG blends conditional and unconditional predictions. Higher guidance strengthens prompt fidelity, but too high can reduce diversity and introduce artifacts.
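The blend is a simple extrapolation of the two noise predictions, eps = eps_uncond + g * (eps_cond - eps_uncond); the toy tensors below just demonstrate the arithmetic.

```python
import numpy as np

def cfg(eps_uncond, eps_cond, g):
    # g = 1 recovers the conditional prediction;
    # g > 1 pushes further in the "conditional minus unconditional" direction.
    return eps_uncond + g * (eps_cond - eps_uncond)

eps_u = np.zeros(4)   # toy unconditional prediction
eps_c = np.ones(4)    # toy conditional prediction
print(cfg(eps_u, eps_c, 1.0))   # → [1. 1. 1. 1.]
print(cfg(eps_u, eps_c, 7.5))   # → [7.5 7.5 7.5 7.5]
```

The second line shows why high guidance can cause artifacts: the blended prediction leaves the range spanned by the two raw predictions.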
Sampling Steps vs Quality Tradeoff
More denoising steps often improve fidelity but increase latency. Practical deployments tune step count, scheduler, and guidance for target quality-per-second.
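One common knob is how the step budget maps onto the trained timestep range: DDIM-style samplers subsample the full training schedule down to the deployed step count. A sketch, assuming the common 1000-timestep training setup:

```python
import numpy as np

def sample_timesteps(num_train=1000, num_steps=20):
    # Spread num_steps sampler steps evenly across the trained timestep
    # range; fewer steps means a coarser sweep of the same schedule.
    return np.linspace(num_train - 1, 0, num_steps).round().astype(int)

print(sample_timesteps(num_steps=4))    # → [999 666 333   0]
print(len(sample_timesteps()))          # → 20
```

Latency scales roughly linearly with the number of entries in this schedule, which is why step count is the first lever tuned for quality-per-second.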
Forward & Reverse Diffusion Process
A clean image is progressively corrupted by noise, then recovered; this is the core idea behind diffusion models.
Diffusion models learn to reverse the noise process. During training, the model sees images at every noise level and learns to predict the clean version. During generation, it starts from pure noise and iteratively denoises — producing a new image from nothing.
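The forward corruption has a closed form: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, where abar_t is the cumulative product of (1 - beta). A sketch using a linear beta schedule (typical default values, assumed rather than taken from the text):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # linear noise schedule
abar = np.cumprod(1.0 - betas)         # cumulative signal-retention factor

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)            # toy "clean image"
eps = rng.standard_normal(8)           # fixed noise sample

def noise_to(t):
    # Jump directly to noise level t without simulating every step.
    return np.sqrt(abar[t]) * x0 + np.sqrt(1 - abar[t]) * eps

early, late = noise_to(10), noise_to(999)
# Early timesteps stay close to x0; by t = 999 the sample is almost pure eps.
print(np.corrcoef(early, x0)[0, 1], np.corrcoef(late, eps)[0, 1])
```

Training samples a random t per image and asks the model to undo exactly this corruption, which is why the model ends up seeing every noise level.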
Key Takeaways
- Latent diffusion improves compute efficiency while preserving output quality.
- U-Net and DiT represent different inductive bias and scaling tradeoffs.
- Text conditioning and CFG control prompt alignment strength.
- Image quality depends on the joint tuning of steps, scheduler, and guidance.