Di[M]O: Distilling Masked Diffusion Models into One-step Generator
Masked Diffusion Models (MDMs) have emerged as a powerful generative modeling technique. Despite their remarkable results, they typically suffer from slow inference that requires multiple sampling steps. In this paper, we propose Di[M]O, a novel approach that distills masked diffusion models into a one-step generator. Di[M]O addresses two key challenges: (1) the intractability of using intermediate-step information for one-step generation, which we solve through token-level distribution matching that optimizes the model's output logits via an 'on-policy framework' with the help of an auxiliary model; and (2) the lack of entropy in the initial distribution, which we address through a token initialization strategy that injects randomness while maintaining similarity to the teacher's training distribution. We show Di[M]O's effectiveness on both class-conditional and text-conditional image generation, achieving performance competitive with multi-step teacher outputs while drastically reducing inference time. To our knowledge, we are the first to successfully achieve one-step distillation of masked diffusion models and the first to apply discrete distillation to text-to-image generation, opening new paths for efficient generative modeling.
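To make the token initialization concrete, below is a minimal sketch of one way to build such an initial sequence: a fraction of positions (written r_init, matching the figure further down) receives uniformly random codebook tokens to inject entropy, while the remaining positions are set to the mask token so the input stays close to what the teacher saw during training. The function name, argument names, and the exact mixing rule are illustrative assumptions, not the paper's verbatim procedure.

import torch

def init_tokens(batch_size, seq_len, vocab_size, mask_token_id, r_init, device="cpu"):
    # Hypothetical initialization: start fully masked, then replace a fraction
    # r_init of positions with uniformly random codebook tokens to inject entropy.
    x_init = torch.full((batch_size, seq_len), mask_token_id, dtype=torch.long, device=device)
    random_pos = torch.rand(batch_size, seq_len, device=device) < r_init   # positions to randomize
    random_tokens = torch.randint(0, vocab_size, (batch_size, seq_len), device=device)
    return torch.where(random_pos, random_tokens, x_init)

# Example: 2 sequences of 256 image tokens, a 1024-entry codebook, 30% random tokens.
x_init = init_tokens(batch_size=2, seq_len=256, vocab_size=1024, mask_token_id=1024, r_init=0.3)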
Di[M]O Pipeline. Our method distills a costly multi-step MDM teacher into a one-step generator. Given \( x_{\text{init}} \) sampled with our proposed token initialization strategy, the one-step generator (student \( \theta \)) produces logits \( z_\theta \), from which the image token sequence \( x_{\theta}=\{x_{\theta}^i\}_{i=1}^{L} \) is sampled. These tokens are then passed through the forward masked diffusion process to obtain an intermediate state \( \tilde{x}_t \). For each intermediate state \( \tilde{x}_t \), we update the one-step generator \( \theta \) and the auxiliary model \( \psi \) alternately: the one-step generator is optimized by minimizing the token-level conditional divergence \( D(p_\phi \,\|\, p_\psi)(\tilde{x}_t) \), while the auxiliary model is trained with a cross-entropy loss to model the distribution of the generated tokens \( x_{\theta} \) and thereby provides the gradient used to update \( \theta \). The teacher \( \phi \) is frozen during training. The detailed algorithm is shown in the figure below.
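For readers who prefer code, here is a minimal PyTorch-style sketch of one training iteration following the pipeline above. The module and function names (student, teacher, aux, forward_mask) are placeholders, the surrogate used to route the (detached) teacher-vs-auxiliary difference to the student's logits is only one plausible distribution-matching estimator, and losses are computed over all positions for brevity; none of this should be read as the paper's exact formulation.

import torch
import torch.nn.functional as F

def forward_mask(x, t, mask_token_id):
    # Forward masked diffusion: independently replace each token by [MASK]
    # with probability given by the per-sample masking level t in [0, 1].
    drop = torch.rand(x.shape, device=x.device) < t[:, None]
    return torch.where(drop, torch.full_like(x, mask_token_id), x)

def train_step(student, teacher, aux, x_init, cond, opt_student, opt_aux, mask_token_id):
    # 1) One-step generation: sample image tokens from the student's logits z_theta.
    z_student = student(x_init, cond)                                   # (B, L, V)
    x_student = torch.distributions.Categorical(logits=z_student).sample()

    # 2) Re-noise the generated tokens to an intermediate state x_t via forward masking.
    t = torch.rand(x_student.shape[0], device=x_student.device)
    x_t = forward_mask(x_student, t, mask_token_id)

    # 3) Student update: evaluate teacher and auxiliary token distributions at x_t
    #    and route their (detached) difference to the student's output distribution.
    #    This surrogate stands in for the paper's token-level divergence gradient.
    with torch.no_grad():
        logp_teacher = F.log_softmax(teacher(x_t, cond), dim=-1)
        logp_aux = F.log_softmax(aux(x_t, cond), dim=-1)
        score = logp_aux - logp_teacher                                 # (B, L, V)
    p_student = F.softmax(z_student, dim=-1)
    loss_student = (p_student * score).sum(dim=-1).mean()
    opt_student.zero_grad(); loss_student.backward(); opt_student.step()

    # 4) Auxiliary update: cross-entropy so that psi models the distribution of the
    #    student's generated tokens given the intermediate state x_t.
    logits_aux = aux(x_t, cond)
    loss_aux = F.cross_entropy(logits_aux.transpose(1, 2), x_student)
    opt_aux.zero_grad(); loss_aux.backward(); opt_aux.step()
    return loss_student.item(), loss_aux.item()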
ImageNet Generation Comparison: Initialization Strategies and Teacher Model. A comparison of one-step generated images from generators trained with different \( r_\text{init} \) values against images generated by the teacher model using 16 sampling steps.
The class labels of the samples from top to bottom are 388, 979, and 207, respectively.
Text-to-image generation comparison between the one-step student and the teacher model with different numbers of sampling steps. The teacher model's results degrade quickly as the number of sampling steps is reduced (e.g., around 4 steps).
Class-conditional image generation on the ImageNet 256×256 benchmark. * denotes numbers estimated from the original plot.
Comparison of text-to-image generation methods on HPS v2.0 benchmark. Scores are collected from https://github.com/tgxs002/HPSv2.
Additional one-step text-to-image generation results.
@article{zhu2025di,
title={Di$\mathtt{[M]}$O: Distilling Masked Diffusion Models into One-step Generator},
author={Zhu, Yuanzhi and Wang, Xi and Lathuili{\`e}re, St{\'e}phane and Kalogeiton, Vicky},
journal={arXiv preprint arXiv:2503.15457},
year={2025}
}
This work was supported by ANR-22-CE23-0007, ANR-22-CE39-0016, the Hi!Paris grant and fellowship, and the DATAIA Convergence Institute as part of the “Programme d'Investissement d'Avenir” (ANR-17-CONV-0003) operated by Ecole Polytechnique, IP Paris, and was granted access to the IDRIS High-Performance Computing (HPC) resources under allocations 2024-AD011014300R1 and 2025-AD011015894 made by GENCI and mesoGIP of IP Paris. We also sincerely thank Nacereddine Laddaoui for help with infrastructure, and Haoge Deng and Yao Teng for their insightful discussions that contributed to this work. We are also grateful to Nicolas Dufour, Robin Courant, and Lucas Degeorge for their meticulous proofreading.