Latent Diffusion Pipeline
Modern systems typically diffuse in latent space for efficiency: encode images into compact latents with a VAE, denoise those latents with a conditional backbone, then decode the result back to pixels.
VAE encode
Compress high-resolution pixels into a lower-dimensional latent tensor.
Latent denoise
Run iterative denoising with a conditional backbone over latent noise.
VAE decode
Project final latents back into RGB image space.
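The three stages above can be sketched end to end. This is a minimal illustration with toy stand-ins for the real networks (the `vae_encode`, `denoise_step`, and `vae_decode` functions here are hypothetical placeholders, not a real library API); the shapes mirror a common 512×512 RGB to 64×64×4 latent layout.

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_encode(pixels):
    # Toy encoder stand-in: average-pool 8x8 blocks, pad to 4 channels.
    # A real VAE encoder is a learned convolutional network.
    h, w, _ = pixels.shape
    lat = pixels.reshape(h // 8, 8, w // 8, 8, 3).mean(axis=(1, 3))
    return np.concatenate([lat, lat[..., :1]], axis=-1)  # (h/8, w/8, 4)

def denoise_step(latents, t):
    # Toy denoiser stand-in: shrink toward zero.
    # A real backbone (U-Net or DiT) predicts the noise residual at step t.
    return latents * 0.9

def vae_decode(latents):
    # Toy decoder stand-in: nearest-neighbor upsample 8x, drop extra channel.
    return np.repeat(np.repeat(latents[..., :3], 8, axis=0), 8, axis=1)

pixels = rng.random((512, 512, 3))
latents = vae_encode(pixels)           # (64, 64, 4): ~48x fewer elements
for t in reversed(range(20)):          # iterative latent-space denoising
    latents = denoise_step(latents, t)
image = vae_decode(latents)            # back to (512, 512, 3) pixel space
```

The efficiency gain comes from the middle loop: every denoising step runs on the small latent tensor rather than the full-resolution image.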
U-Net vs DiT Backbones
U-Nets dominated early diffusion models, with strong inductive biases for spatial detail. DiTs replace convolutions with transformer blocks and scale efficiently with data and compute.
U-Net
Convolutional encoder-decoder with skip connections for multi-scale spatial reconstruction.
DiT (Diffusion Transformer)
Transformer backbone over patchified latents, often strong at scale with large training budgets.
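The "patchified latents" idea can be shown concretely. This sketch flattens a latent grid into a token sequence the way a ViT-style patch embedding does; the patch size and shapes are illustrative assumptions, not values from the text.

```python
import numpy as np

def patchify(latents, p=2):
    # Split the (H, W, C) latent grid into non-overlapping p x p patches,
    # then flatten each patch into one token vector.
    h, w, c = latents.shape
    x = latents.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4).reshape((h // p) * (w // p), p * p * c)
    return x  # (num_tokens, token_dim): the transformer's input sequence

latents = np.zeros((64, 64, 4))       # typical latent grid for 512px images
tokens = patchify(latents)            # (1024, 16): 32x32 tokens of dim 2*2*4
```

Once the latents are a token sequence, the backbone is an ordinary transformer, which is what lets DiTs reuse transformer scaling recipes.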
Text Conditioning (CLIP/T5 + Cross-Attn)
Prompts are encoded (for example by CLIP or T5) and injected into denoising layers via cross-attention, aligning generated content with text semantics.
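The injection mechanism is cross-attention: image tokens form the queries, text-encoder outputs form the keys and values. A minimal single-head sketch, with random weights and illustrative dimensions standing in for a trained layer:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

img_tokens = rng.standard_normal((16, d))  # queries: latent patch features
txt_tokens = rng.standard_normal((5, d))   # keys/values: e.g. CLIP/T5 output

Q, K, V = img_tokens @ Wq, txt_tokens @ Wk, txt_tokens @ Wv
attn = softmax(Q @ K.T / np.sqrt(d))       # (16, 5): each image token
out = attn @ V                             # attends over the text tokens
```

Each row of `attn` is a distribution over prompt tokens, so every spatial location can pull in different parts of the text semantics.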
Classifier-Free Guidance (CFG)
CFG blends conditional and unconditional predictions. Higher guidance strengthens prompt fidelity, but too high can reduce diversity and introduce artifacts.
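The blend is a simple extrapolation of the two noise predictions, eps = eps_uncond + g * (eps_cond - eps_uncond); the toy tensors below just demonstrate the arithmetic.

```python
import numpy as np

def cfg(eps_uncond, eps_cond, g):
    # g = 1 recovers the conditional prediction;
    # g > 1 pushes further in the "conditional minus unconditional" direction.
    return eps_uncond + g * (eps_cond - eps_uncond)

eps_u = np.zeros(4)   # toy unconditional prediction
eps_c = np.ones(4)    # toy conditional prediction
print(cfg(eps_u, eps_c, 1.0))   # → [1. 1. 1. 1.]
print(cfg(eps_u, eps_c, 7.5))   # → [7.5 7.5 7.5 7.5]
```

The second line shows why high guidance can cause artifacts: the blended prediction leaves the range spanned by the two raw predictions.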
Sampling Steps vs Quality Tradeoff
More denoising steps often improve fidelity but increase latency. Practical deployments tune step count, scheduler, and guidance for target quality-per-second.
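One common knob is how the step budget maps onto the trained timestep range: DDIM-style samplers subsample the full training schedule down to the deployed step count. A sketch, assuming the common 1000-timestep training setup:

```python
import numpy as np

def sample_timesteps(num_train=1000, num_steps=20):
    # Spread num_steps sampler steps evenly across the trained timestep
    # range; fewer steps means a coarser sweep of the same schedule.
    return np.linspace(num_train - 1, 0, num_steps).round().astype(int)

print(sample_timesteps(num_steps=4))    # → [999 666 333   0]
print(len(sample_timesteps()))          # → 20
```

Latency scales roughly linearly with the number of entries in this schedule, which is why step count is the first lever tuned for quality-per-second.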
Forward & Reverse Diffusion Process
A clean image is progressively corrupted by noise, then recovered; this is the core idea behind diffusion models.
Diffusion models learn to reverse the noise process. During training, the model sees images at every noise level and learns to predict the clean version. During generation, it starts from pure noise and iteratively denoises — producing a new image from nothing.
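The forward corruption has a closed form: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, where abar_t is the cumulative product of (1 - beta). A sketch using a linear beta schedule (typical default values, assumed rather than taken from the text):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # linear noise schedule
abar = np.cumprod(1.0 - betas)         # cumulative signal-retention factor

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)            # toy "clean image"
eps = rng.standard_normal(8)           # fixed noise sample

def noise_to(t):
    # Jump directly to noise level t without simulating every step.
    return np.sqrt(abar[t]) * x0 + np.sqrt(1 - abar[t]) * eps

early, late = noise_to(10), noise_to(999)
# Early timesteps stay close to x0; by t = 999 the sample is almost pure eps.
print(np.corrcoef(early, x0)[0, 1], np.corrcoef(late, eps)[0, 1])
```

Training samples a random t per image and asks the model to undo exactly this corruption, which is why the model ends up seeing every noise level.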
Key Takeaways
- Latent diffusion improves compute efficiency while preserving output quality.
- U-Net and DiT represent different inductive bias and scaling tradeoffs.
- Text conditioning and CFG control prompt alignment strength.
- Image quality depends on the joint tuning of steps, scheduler, and guidance.