01 / 13
// Research Talk · April 2026

Distillation as
Divergence Minimization

$$\min_{\theta} \; \mathbb{E}\!\left[ D\!\left( p_{\text{target}} \,\|\, p_{\theta} \right) \right]$$
Yuanzhi Zhu
Diffusion Models · Distillation · Reinforcement Learning
CRD teaser
Di[M]O · arXiv:2503.15457
Di-Bregman · arXiv:2510.16983
CRD · arXiv:2603.14128
02 Motivation

Almost every learning objective
minimizes a divergence

// The universal learning objective $$\min_{\theta} \; D\!\left( p \,\|\, p_{\theta} \right)$$
Classical
p = empirical data; D = KL
→ maximum likelihood estimation
Modern Post-Training
p = $p_{\text{target}}$: teacher output dist. or reward-tilted pretrained dist.
D = KL, Bregman divergence, $f$-divergence…
Diffusion distillation and diffusion RL are two instances of the same divergence minimization framework.
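A tiny numerical check of the classical case: minimizing the forward KL to the empirical distribution is exactly maximum likelihood, since the two objectives differ only by a $\theta$-independent entropy term (toy categorical numbers, purely illustrative):

```python
import numpy as np

# Classical case: p = empirical data, D = forward KL.
p_emp = np.array([0.5, 0.3, 0.2])     # empirical frequencies p
p_theta = np.array([0.4, 0.4, 0.2])   # model probabilities p_theta

kl = np.sum(p_emp * np.log(p_emp / p_theta))   # KL(p || p_theta)
nll = -np.sum(p_emp * np.log(p_theta))         # expected NLL under p
entropy = -np.sum(p_emp * np.log(p_emp))       # H(p), independent of theta

# KL = NLL - H(p): minimizing KL over theta is maximum likelihood.
```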
03 Framework

The On-Policy Generator

  • $x_1 = G_\theta(x_0, c)$
  • $G_\theta$ runs $T$ denoising steps; $c_{\text{ext}}$ is an optional external signal (reward, expert demo…)
  • Induces output distribution $p_{\theta}(x_1 \mid c)$
// Instantiating $\min_\theta D(p \,\|\, p_\theta)$ with $p = p_{\text{target}}$ $$\mathcal{L}(\theta) = \mathbb{E}_{x_0, c, c_{\text{ext}}}\!\left[ D\!\left( p_{\text{target}} \,\|\, p_{\theta} \right) \right]$$
On-policy generator: x₀, c → Model θ → x₁
On-policy distillation objective diagram
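A minimal sketch of the on-policy generator above; the interface and the one-line toy "denoiser" are illustrative stand-ins, not the actual $G_\theta$:

```python
import numpy as np

def on_policy_sample(G_theta, x0, c, num_steps):
    """Run T denoising steps so that x1 = G_theta(x0, c) is a sample
    from the induced output distribution p_theta(x1 | c)."""
    x = x0                              # x_0 ~ prior noise
    for t in reversed(range(num_steps)):
        x = G_theta(x, t, c)            # one denoising step, kept on-policy
    return x                            # x_1

# Toy one-line "denoiser": shrink toward a condition-dependent mean.
toy_step = lambda x, t, c: 0.5 * x + 0.5 * c
x1 = on_policy_sample(toy_step, np.ones(4), np.zeros(4), num_steps=4)
```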
04 Instantiation I — Distillation
$p_{\text{target}} = p_{\text{teacher}}$

Diffusion Step Distillation

Compress a slow multi-step teacher into a few-step (or one-step) student.

// Target = teacher distribution $$p_{\text{target}}(x_1 \mid c) = p_{\text{teacher}}(x_1 \mid c)$$
  • (a) Score-based — estimate the generator's score; no discriminator  DMD · Diff-Instruct
  • (b) Discriminator-based — classify student vs. teacher samples  Diffusion GAN
Distillation pipeline: (a) score-based, (b) discriminator-based approaches
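The score-based route (a) can be sketched with closed-form Gaussian scores standing in for the learned teacher and student score networks; the resulting DMD-style update pushes samples along the score difference, which is then backpropagated through $G_\theta$:

```python
import numpy as np

# Closed-form Gaussian score as a toy stand-in for a score network.
def score_gaussian(x, mu, sigma):
    return -(x - mu) / sigma**2          # grad_x log N(x; mu, sigma^2)

x = np.array([1.0, -0.5])                # a generator sample x = G_theta(z)
s_teacher = score_gaussian(x, mu=0.0, sigma=1.0)  # "real" (teacher) score
s_student = score_gaussian(x, mu=0.3, sigma=1.0)  # estimated p_theta score

# Gradient of KL(p_theta || p_teacher) w.r.t. x: the score difference.
# In DMD this vector is pushed back through G_theta to update theta.
grad_x = s_student - s_teacher
```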
05 Instantiation I — Distillation
arXiv:2503.15457

Di[M]O Distilling Masked Diffusion Models into One-step Generator

Masked Diffusion Models (MDMs) operate on discrete tokens — no PF-ODE, no score function — so existing distillation methods break. Two unique challenges:

Challenge 1
Intermediate states intractable for one-step generation
→ Estimate with auxiliary model $\psi$
Challenge 2
Fully-masked prior lacks diversity — mode collapse
→ Hybrid token init: random + mask tokens
Result
First one-step distillation of MDMs — c2i & t2i (class- & text-to-image)
DiMO token-level distribution matching pipeline
OVERVIEW OF DI[M]O PIPELINE
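Challenge 2's hybrid token initialization might look like the following sketch; the function name, `mask_ratio` parameter, and masking scheme are illustrative, not the paper's exact implementation:

```python
import numpy as np

def hybrid_init(seq_len, vocab_size, mask_id, mask_ratio, rng):
    """Sketch of a hybrid token init: a fully-masked prior collapses
    modes, so mix mask tokens with random tokens to diversify the
    one-step generator's input."""
    tokens = rng.integers(0, vocab_size, size=seq_len)  # random tokens
    masked = rng.random(seq_len) < mask_ratio
    tokens[masked] = mask_id                            # re-mask a fraction
    return tokens

rng = np.random.default_rng(0)
init = hybrid_init(seq_len=16, vocab_size=1024, mask_id=1024,
                   mask_ratio=0.5, rng=rng)
```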
06 Instantiation I — Distillation
arXiv:2510.16983

Di-Bregman One-step Diffusion via Bregman Density-Ratio Matching

Key insight: minimize $D_h(r \,\|\, 1)$ — match the density ratio to $\mathbf{1}$, not distributions directly:

// Distillation objective = matching r(x) to 1 $$r(x) = \frac{p_{\theta}(x)}{p_{\text{teacher}}(x)}, \qquad \min_\theta\; \mathbb{E}_{p_{\text{teacher}}}\!\left[D_h\!\left(r(x) \,\|\, 1\right)\right]$$
// Gradient — h''(r)·r reweights the score difference $$\nabla_\theta \mathcal{L} \;\propto\; h''(r) \cdot r \cdot \nabla_x \log r \cdot \nabla_\theta G_\theta$$
  • Choice of $h$ selects the divergence: KL, reverse-KL, and more
  • $r$ estimated via lightweight discriminator
  • General form $D_h(r \,\|\, r^*)$: $r^*$ is the normalized reward
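The role of $h$ can be checked numerically. A sketch of the $h''(r)\,r$ gradient weight for two classical choices, where the mapping of $h$ to each KL direction follows from expanding $\mathbb{E}_{p_{\text{teacher}}}[D_h(r \,\|\, 1)]$:

```python
import numpy as np

# Di-Bregman sketch: the convex h sets the weight h''(r) * r that
# multiplies the score difference grad_x log r in the gradient.
def bregman_weight(h_name, r):
    if h_name == "r_log_r":     # h(r) = r log r -> h''(r) = 1/r
        return (1.0 / r) * r    # constant weight 1; recovers KL(p_theta || p_teacher)
    if h_name == "neg_log_r":   # h(r) = -log r  -> h''(r) = 1/r^2
        return (1.0 / r**2) * r # weight 1/r; recovers KL(p_teacher || p_theta)
    raise ValueError(h_name)

r = np.array([0.5, 1.0, 2.0])              # density ratio p_theta / p_teacher
w_a = bregman_weight("r_log_r", r)         # [1, 1, 1]
w_b = bregman_weight("neg_log_r", r)       # [2, 1, 0.5]
```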
07 Instantiation II — Diffusion RL
$p_{\text{target}} \propto p_{\text{ref}} \cdot \exp(r)$

Diffusion Reinforcement Learning

Same framework — only the target changes. Instead of a teacher, it encodes a reward:

// Boltzmann (reward-tilted) target distribution $$p_{\text{target}}(x_1 \mid c) \;\propto\; p_{\text{ref}} \cdot \exp\!\left( r(x_1) \right)$$
  • $r(x_1)$: reward model — encodes desired behavior (alignment, text fidelity…)
  • $p_{\text{ref}}$: reference model — KL regularizer, prevents collapse
Diffusion RL training loop: reward signal + KL regularization
RL training loop: reward maximization penalized by KL from $p_{\text{ref}}$
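On a discrete toy support the reward-tilted target can be written out directly; this sketch just normalizes $p_{\text{ref}} \cdot \exp(r)$ to show how the reward reshapes the reference distribution:

```python
import numpy as np

# Boltzmann (reward-tilted) target on a 4-point toy support.
p_ref = np.array([0.25, 0.25, 0.25, 0.25])   # reference model
r = np.array([0.0, 1.0, 2.0, 0.0])           # reward per sample

unnorm = p_ref * np.exp(r)                   # p_ref * exp(r)
p_target = unnorm / unnorm.sum()             # normalize over the support

# Mass concentrates on high-reward samples while p_ref keeps support.
```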
08 Instantiation II — Diffusion RL

Expanding the Objective

Substituting the Boltzmann target into the reverse-KL divergence and expanding:

// Reward maximization + KL regularization $$\mathcal{L}(\theta) \;\propto\; -\mathbb{E}_{p_{\text{ref}}}\!\left[ \frac{p_{\theta}}{p_{\text{ref}}} \cdot r(x_1) \right] + \mathbb{E}_{p_{\theta}}\!\left[ \log \frac{p_{\theta}}{p_{\text{ref}}} \right]$$
Reward Maximization
$-\mathbb{E}_{p_{\text{ref}}}\!\left[\dfrac{p_{\theta}}{p_{\text{ref}}} \cdot r(x_1)\right]$
Importance-weighted reward maximization
KL Regularization
$\mathbb{E}_{p_{\theta}}\!\left[\log\dfrac{p_{\theta}}{p_{\text{ref}}}\right]$
Penalizes drift from the reference model

Diffusion ELBO makes this tractable: $\log(p_\theta/p_\text{ref})$ ≈ difference in denoising losses at each noise level $t$ — a weighted score-matching loss.
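A toy sketch of that ELBO trick, with hard-coded per-timestep losses standing in for real $\epsilon$-prediction MSEs; the point is the telescoping into a weighted loss difference, not the numbers:

```python
import numpy as np

def log_ratio_elbo(loss_theta, loss_ref, weights):
    """Estimate log(p_theta / p_ref)(x1) via the diffusion ELBO:
    each log-density is bounded by a weighted sum of denoising losses,
    so the log-ratio becomes a difference of those weighted sums."""
    return np.sum(weights * (loss_ref - loss_theta))

T = 4
w = np.ones(T)                                # per-noise-level weights w_t
loss_theta = np.array([0.9, 0.7, 0.5, 0.3])   # student denoising losses
loss_ref = np.array([1.0, 0.8, 0.6, 0.4])     # reference denoising losses
lr = log_ratio_elbo(loss_theta, loss_ref, w)  # positive: theta denoises better
```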

09 Instantiation II — Diffusion RL
arXiv:2603.14128

CRD Diffusion RL via Centered Reward Distillation

Motivation
The optimal policy satisfies $\beta \log(p_\theta/p_\text{ref}) = r - \beta \log Z(c)$ — pinned down only up to the intractable per-prompt normalizer $Z(c)$.
Key Insight: Centered Reward
$\beta\log Z(c)$ is intractable but prompt-constant. Subtracting the group mean cancels it exactly — well-posed objective, no approximation.
  • 01 Decouple sampler from reference — preserves meaningful log-ratio signal
  • 02 KL anchoring to CFG-guided pretrained model — prevents long-run drift
  • 03 Reward-adaptive KL strength $\propto$ reward to block reward-hacking loopholes
CRD training pipeline: centered reward distillation
CRD: centered reward + decoupled sampling + adaptive KL
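The centering step can be verified in a few lines: shifting every reward in a prompt's group by the same unknown constant (the $\beta \log Z(c)$ term) leaves the centered values unchanged, so the objective is well-posed without ever computing $Z(c)$:

```python
import numpy as np

def centered_rewards(rewards):
    """Subtract the group mean: a per-prompt baseline."""
    return rewards - rewards.mean()

group = np.array([2.0, 4.0, 6.0])   # rewards for one prompt's sample group
log_Z_term = 1.7                    # unknown, but constant within the prompt

# Adding the same constant to every reward cancels under centering.
same = np.allclose(centered_rewards(group),
                   centered_rewards(group + log_Z_term))
```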
10 Unification

The Unified Picture

Both paradigms minimize $D(p_{\text{target}} \,\|\, p_{\theta})$ — they differ only in what the target means.

| Method | Target $p_{\text{target}}$ | Divergence $D$ |
|---|---|---|
| Di[M]O (2503.15457) | $p_{\text{teacher}}(x_1 \mid c)$ | Generalized Jeffrey divergence |
| Di-Bregman (2510.16983) | $p_{\text{teacher}}(x_1 \mid c)$ | Bregman family |
| CRD (2603.14128) | $p_{\text{ref}} \cdot \exp\!\left( r(x_1) \right)$ | Reverse KL |
The two lines of research are not parallel — they are the same river.
11 Summary

Key Takeaways

  • Di[M]O (masked diffusion distillation) — pushes distillation into discrete token space
  • Di-Bregman (continuous diffusion distillation) — unifies distillation and reward fine-tuning under a principled divergence framework
  • CRD (diffusion RL alignment) — fast, stable diffusion RL training without reward hacking
// Thank You · Questions Welcome

Questions?

$$D\!\left( p_{\text{target}} \,\|\, p_{\theta} \right) \; \text{— one framework, many post-training methods}$$
Yuanzhi Zhu
yuanzhi-zhu.github.io
Di[M]O — arXiv:2503.15457
Di-Bregman — arXiv:2510.16983
CRD — arXiv:2603.14128