Text Diffusion

Intermediate

Understand how diffusion ideas adapt to discrete language tokens and iterative mask refinement.

Last updated: Feb 25, 2026

Discrete vs Continuous Formulations

Language is fundamentally discrete, so text diffusion often uses token masking/replacement processes. Some approaches embed tokens into continuous spaces and diffuse there before projecting back.
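The forward "noising" process for a masking-based formulation can be sketched in a few lines. This is a minimal illustration, assuming a simple linear masking schedule (real models use learned or cosine-style schedules); `mask_tokens` is a hypothetical helper, not an API from any library.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, t, T, seed=0):
    """Discrete forward process: independently replace each token with
    [MASK] with probability t / T (linear schedule, for illustration).
    At t = 0 nothing is masked; at t = T everything is."""
    rng = random.Random(seed)
    return [MASK if rng.random() < t / T else tok for tok in tokens]

tokens = "the cat sat on the mat".split()
print(mask_tokens(tokens, t=4, T=8))  # roughly half the tokens masked
```

Training then teaches the model to invert this process: given a partially masked sequence and the noise level, predict the original tokens.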

Mask-and-Predict Paradigm

Text diffusion commonly begins with heavily masked sequences and repeatedly predicts missing tokens. Confidence-based remasking/refinement can improve coherence over multiple passes.
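The refinement loop above can be sketched as follows. `predict` is a stand-in for a model call (an assumption for illustration) that returns a `(token, confidence)` guess for every position; at each pass, only the most confident still-masked predictions are committed.

```python
def denoise(tokens, predict, steps, mask="[MASK]"):
    """Iterative mask-and-predict decoding sketch.

    `predict(seq)` is a hypothetical model call returning a list of
    (token, confidence) pairs, one per position. Each pass commits
    roughly n_masked / steps of the highest-confidence guesses."""
    seq = list(tokens)
    n_masked = sum(tok == mask for tok in seq)
    per_step = max(1, n_masked // steps)
    for _ in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == mask]
        if not masked:
            break
        guesses = predict(seq)
        # reveal the highest-confidence masked positions this pass
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:per_step]:
            seq[i] = guesses[i][0]
    return seq
```

Confidence-based remasking extends this loop: already-revealed tokens whose confidence drops below a threshold are set back to `[MASK]` and re-predicted on a later pass.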

MDLM-style models

Masked diffusion language models denoise token sequences through iterative unmasking rather than left-to-right decoding.

SEDD-style models

Score-entropy variants adapt score-based ideas to discrete vocabularies with principled probabilistic objectives.

Padding Tokens and Fixed Length

Batched diffusion often uses fixed sequence length. [PAD] tokens fill unused positions and attention masks prevent them from influencing content tokens.
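Inside attention, the usual mechanism is an additive bias: zero at content positions and negative infinity at [PAD] positions, so padding receives exactly zero weight after the softmax. A minimal sketch in plain Python (the token ids and scores are illustrative values, not from any real model):

```python
import math

PAD_ID = 0
ids = [5, 9, 3, PAD_ID, PAD_ID]  # one fixed-length, padded sequence

# Additive attention bias: 0 where attending is allowed, -inf at padding.
key_bias = [0.0 if t != PAD_ID else -math.inf for t in ids]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # exp(-inf) == 0.0
    s = sum(exps)
    return [e / s for e in exps]

# Raw attention scores from one query position (illustrative).
scores = [0.2, 1.1, -0.3, 0.9, 0.4]
weights = softmax([s + b for s, b in zip(scores, key_bias)])
# weights[3] and weights[4] are exactly 0: padding never influences content
```

The same mask is broadcast across all query positions, so no content token ever attends to padding.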

How It Differs from Autoregressive LMs

Autoregressive models predict the next token conditioned on previous tokens. Diffusion-style text models refine many positions in parallel over multiple denoising iterations.

Diffusion-style decoding

Parallel token refinement, repeated denoising steps, and optional remasking for error correction.

Autoregressive decoding

Strict left-to-right generation with causal dependency and one-pass token commitment.
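The two decoding regimes above can be contrasted as loop skeletons. `next_token` and `fill_all` are hypothetical model calls (assumptions for illustration): the first returns one token conditioned on the prefix, the second re-predicts all positions at once.

```python
def ar_decode(next_token, length):
    """Autoregressive: one model call per token, each conditioned on
    the prefix, committed in one pass with no later revision."""
    seq = []
    for _ in range(length):
        seq.append(next_token(seq))
    return seq

def diffusion_decode(fill_all, length, steps):
    """Diffusion-style: a fixed number of denoising passes, each
    predicting every position in parallel; masked positions can be
    revealed in any order across passes."""
    seq = ["[MASK]"] * length
    for _ in range(steps):
        seq = fill_all(seq)
    return seq
```

Note the trade-off: autoregressive decoding makes `length` sequential model calls, while diffusion makes `steps` calls regardless of length, at the cost of predicting all positions on every pass.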

Mask Refinement Demo

Watch [MASK] positions reveal and refine token by token, with [PAD] positions held out via masking.

Text Diffusion: Parallel Token Denoising

Unlike autoregressive models, which decode strictly left to right, diffusion reveals tokens in parallel and in an order guided by per-token confidence scores.

[Interactive demo: an 8-position sequence starts fully masked (step 0 of 8) and tokens are revealed over successive denoising steps in a model-chosen order. Legend: [MASK] = unknown; high confidence (≥80%); medium confidence; low confidence (may be re-masked).]

Key Takeaways

  • Text diffusion adapts denoising to discrete token spaces.
  • Mask-and-predict enables parallel refinement instead of strict left-to-right decoding.
  • [PAD] plus attention masks are essential for fixed-length batching.
  • Compared to autoregressive LMs, text diffusion trades extra steps for iterative correction.