Discrete vs Continuous Formulations
Language is fundamentally discrete, so text diffusion often uses token masking/replacement processes. Some approaches embed tokens into continuous spaces and diffuse there before projecting back.
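The discrete masking/replacement forward process can be pictured as a toy sketch: each token is independently replaced by [MASK] with probability t, where t is the noise level (names like `corrupt` here are illustrative, not from any particular library).

```python
import random

MASK = "[MASK]"

def corrupt(tokens, t, rng=random.Random(0)):
    """Discrete forward process: independently replace each token
    with [MASK] with probability t (the noise level)."""
    return [MASK if rng.random() < t else tok for tok in tokens]

tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(corrupt(tokens, 0.0))  # t=0: no noise, original sequence survives
print(corrupt(tokens, 1.0))  # t=1: full noise, every position is [MASK]
```

At intermediate t, training pairs of (corrupted, clean) sequences teach the model to invert this process.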
Mask-and-Predict Paradigm
Text diffusion commonly begins with heavily masked sequences and repeatedly predicts missing tokens. Confidence-based remasking/refinement can improve coherence over multiple passes.
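A minimal sketch of confidence-based unmasking, assuming a stand-in predictor that returns a (token, confidence) pair per masked position (the `denoise` helper and the oracle lookup are hypothetical, for illustration only):

```python
def denoise(seq, predict, steps=3):
    """Iteratively fill [MASK] slots: each step, query the model at
    every masked position and commit only the most confident guesses."""
    seq = list(seq)
    for _ in range(steps):
        masked = [i for i, t in enumerate(seq) if t == "[MASK]"]
        if not masked:
            break
        # predict(seq, i) -> (token, confidence) for masked index i
        guesses = {i: predict(seq, i) for i in masked}
        # commit roughly the top half of positions by confidence
        k = max(1, len(masked) // 2)
        for i in sorted(masked, key=lambda i: -guesses[i][1])[:k]:
            seq[i] = guesses[i][0]
    return seq

# Stand-in "model": a fixed lookup with made-up confidences.
ORACLE = {1: ("cat", 0.9), 3: ("on", 0.6)}
out = denoise(["the", "[MASK]", "sat", "[MASK]", "the", "mat"],
              lambda s, i: ORACLE[i])
```

Real systems re-score remaining masks after each commit, so later predictions condition on earlier high-confidence choices.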
MDLM-style models
Masked diffusion language models denoise token sequences through iterative unmasking rather than left-to-right decoding.
SEDD-style models
Score-entropy variants adapt score-based ideas to discrete vocabularies with principled probabilistic objectives.
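As a rough sketch of what these objectives share, here is a plain cross-entropy loss computed only at masked positions. This is a simplification of MDLM-style training; SEDD's actual score-entropy objective over discrete transitions is different and not reproduced here.

```python
import math

def masked_denoising_loss(probs, targets, mask):
    """Cross-entropy averaged over masked positions only: unmasked
    positions are already known, so they contribute no loss."""
    terms = [-math.log(probs[i][targets[i]])
             for i in range(len(targets)) if mask[i]]
    return sum(terms) / len(terms)

# Toy example: vocabulary {0, 1, 2}, positions 0 and 2 masked.
probs = [
    {0: 0.7, 1: 0.2, 2: 0.1},    # position 0 (masked)
    {0: 0.1, 1: 0.8, 2: 0.1},    # position 1 (not masked, ignored)
    {0: 0.25, 1: 0.25, 2: 0.5},  # position 2 (masked)
]
loss = masked_denoising_loss(probs, targets=[0, 1, 2],
                             mask=[True, False, True])
```

Averaging over masked positions only is what lets training randomize the mask pattern per example.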
Padding Tokens and Fixed Length
Batched diffusion often uses fixed sequence length. [PAD] tokens fill unused positions and attention masks prevent them from influencing content tokens.
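A minimal sketch of fixed-length padding and the matching attention mask (1 = real token, 0 = [PAD]); the `pad_to` helper is illustrative, not a library function:

```python
def pad_to(tokens, length, pad="[PAD]"):
    """Right-pad to a fixed length and build an attention mask so
    [PAD] positions are excluded from attention over content tokens."""
    n_pad = length - len(tokens)
    padded = tokens + [pad] * n_pad
    attn = [1] * len(tokens) + [0] * n_pad
    return padded, attn

seq, mask = pad_to(["hello", "world"], 5)
```

In a real model the 0 entries are typically turned into large negative attention biases before the softmax, which drives padding weights to zero.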
How It Differs from Autoregressive LMs
Autoregressive models predict the next token conditioned on previous tokens. Diffusion-style text models refine many positions in parallel over multiple denoising iterations.
Diffusion-style decoding
Parallel token refinement, repeated denoising steps, and optional remasking for error correction.
Autoregressive decoding
Strict left-to-right generation with causal dependency and one-pass token commitment.
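The two decoding loops above can be contrasted with stub "models" (everything here is a toy: the stubs just return a fixed target, and the call counter is only there to show that diffusion uses `steps` model calls where autoregression uses `length` calls):

```python
TARGET = ["the", "cat", "sat", "on"]
CALLS = {"ar": 0, "diff": 0}

def ar_decode(predict_next, length):
    """Autoregressive: one model call per token, strictly left to right."""
    seq = []
    for _ in range(length):
        seq.append(predict_next(seq))
    return seq

def diffusion_decode(predict_all, length, steps):
    """Diffusion-style: every position is (re)proposed at each step,
    so a few denoising steps replace length-many sequential calls."""
    seq = ["[MASK]"] * length
    for _ in range(steps):
        seq = predict_all(seq)
    return seq

def next_tok(prefix):          # stub next-token model
    CALLS["ar"] += 1
    return TARGET[len(prefix)]

def all_toks(seq):             # stub full-sequence denoiser
    CALLS["diff"] += 1
    return list(TARGET)
```

The diffusion loop also revisits already-filled positions each step, which is where remasking-based error correction comes from.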
Mask Refinement Demo
Watch [MASK] positions reveal and refine over successive denoising steps, with [PAD] positions held out via attention masking.
Text Diffusion: Parallel Token Denoising
Unlike autoregressive models (left-to-right), diffusion reveals tokens in parallel and in arbitrary order, guided by per-position confidence scores.
Key Takeaways
- Text diffusion adapts denoising to discrete token spaces.
- Mask-and-predict enables parallel refinement instead of strict left-to-right decoding.
- [PAD] plus attention masks are essential for fixed-length batching.
- Compared to autoregressive LMs, text diffusion trades extra steps for iterative correction.