Text Diffusion

Intermediate

Understand how diffusion ideas adapt to discrete language tokens and iterative mask refinement.

Last updated: Feb 25, 2026

Discrete vs Continuous Formulations

Language is fundamentally discrete, so text diffusion often uses token masking/replacement processes. Some approaches embed tokens into continuous spaces and diffuse there before projecting back.
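The forward "noising" process for a masking-based formulation can be sketched in a few lines. This is a minimal illustration, assuming a simple linear masking schedule (real models use learned or cosine-style schedules); `mask_tokens` is a hypothetical helper, not an API from any library.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, t, T, seed=0):
    """Discrete forward process: independently replace each token with
    [MASK] with probability t / T (linear schedule, for illustration).
    At t = 0 nothing is masked; at t = T everything is."""
    rng = random.Random(seed)
    return [MASK if rng.random() < t / T else tok for tok in tokens]

tokens = "the cat sat on the mat".split()
print(mask_tokens(tokens, t=4, T=8))  # roughly half the tokens masked
```

Training then teaches the model to invert this process: given a partially masked sequence and the noise level, predict the original tokens.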

Mask-and-Predict Paradigm

Text diffusion commonly begins with heavily masked sequences and repeatedly predicts missing tokens. Confidence-based remasking/refinement can improve coherence over multiple passes.
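The refinement loop above can be sketched as follows. `predict` is a stand-in for a model call (an assumption for illustration) that returns a `(token, confidence)` guess for every position; at each pass, only the most confident still-masked predictions are committed.

```python
def denoise(tokens, predict, steps, mask="[MASK]"):
    """Iterative mask-and-predict decoding sketch.

    `predict(seq)` is a hypothetical model call returning a list of
    (token, confidence) pairs, one per position. Each pass commits
    roughly n_masked / steps of the highest-confidence guesses."""
    seq = list(tokens)
    n_masked = sum(tok == mask for tok in seq)
    per_step = max(1, n_masked // steps)
    for _ in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == mask]
        if not masked:
            break
        guesses = predict(seq)
        # reveal the highest-confidence masked positions this pass
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:per_step]:
            seq[i] = guesses[i][0]
    return seq
```

Confidence-based remasking extends this loop: already-revealed tokens whose confidence drops below a threshold are set back to `[MASK]` and re-predicted on a later pass.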

MDLM-style models

Masked diffusion language models denoise token sequences through iterative unmasking rather than left-to-right decoding.

SEDD-style models

Score-entropy variants adapt score-based ideas to discrete vocabularies with principled probabilistic objectives.

Padding Tokens and Fixed Length

Batched diffusion often uses fixed sequence length. [PAD] tokens fill unused positions and attention masks prevent them from influencing content tokens.
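Inside attention, the usual mechanism is an additive bias: zero at content positions and negative infinity at [PAD] positions, so padding receives exactly zero weight after the softmax. A minimal sketch in plain Python (the token ids and scores are illustrative values, not from any real model):

```python
import math

PAD_ID = 0
ids = [5, 9, 3, PAD_ID, PAD_ID]  # one fixed-length, padded sequence

# Additive attention bias: 0 where attending is allowed, -inf at padding.
key_bias = [0.0 if t != PAD_ID else -math.inf for t in ids]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # exp(-inf) == 0.0
    s = sum(exps)
    return [e / s for e in exps]

# Raw attention scores from one query position (illustrative).
scores = [0.2, 1.1, -0.3, 0.9, 0.4]
weights = softmax([s + b for s, b in zip(scores, key_bias)])
# weights[3] and weights[4] are exactly 0: padding never influences content
```

The same mask is broadcast across all query positions, so no content token ever attends to padding.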

How It Differs from Autoregressive LMs

Autoregressive models predict the next token conditioned on previous tokens. Diffusion-style text models refine many positions in parallel over multiple denoising iterations.

Diffusion-style decoding

Parallel token refinement, repeated denoising steps, and optional remasking for error correction.

Autoregressive decoding

Strict left-to-right generation with causal dependency and one-pass token commitment.
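The two decoding regimes above can be contrasted as loop skeletons. `next_token` and `fill_all` are hypothetical model calls (assumptions for illustration): the first returns one token conditioned on the prefix, the second re-predicts all positions at once.

```python
def ar_decode(next_token, length):
    """Autoregressive: one model call per token, each conditioned on
    the prefix, committed in one pass with no later revision."""
    seq = []
    for _ in range(length):
        seq.append(next_token(seq))
    return seq

def diffusion_decode(fill_all, length, steps):
    """Diffusion-style: a fixed number of denoising passes, each
    predicting every position in parallel; masked positions can be
    revealed in any order across passes."""
    seq = ["[MASK]"] * length
    for _ in range(steps):
        seq = fill_all(seq)
    return seq
```

Note the trade-off: autoregressive decoding makes `length` sequential model calls, while diffusion makes `steps` calls regardless of length, at the cost of predicting all positions on every pass.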

Mask Refinement Demo

Watch [MASK] positions reveal and refine token by token, with [PAD] positions held out via masking.

Text Diffusion: Parallel Token Denoising

Unlike autoregressive models, which decode strictly left to right, diffusion reveals tokens in parallel and in an order guided by per-token confidence scores.

[Interactive demo: an 8-position sequence starts fully masked (step 0 of 8) and tokens are revealed over successive denoising steps in a model-chosen order. Legend: [MASK] = unknown; high confidence (≥80%); medium confidence; low confidence (may be re-masked).]

Key Takeaways

  • Text diffusion adapts denoising to discrete token spaces.
  • Mask-and-predict enables parallel refinement instead of strict left-to-right decoding.
  • [PAD] plus attention masks are essential for fixed-length batching.
  • Compared to autoregressive LMs, text diffusion trades extra steps for iterative correction.