01 / 13
// Research Talk · April 2026

Distillation as
Divergence Minimization

$$\min_{\theta} \; \mathbb{E}\!\left[ D\!\left( p_{\text{target}} \,\|\, p_{\theta} \right) \right]$$
Yuanzhi Zhu
Diffusion Models · Distillation · Reinforcement Learning
CRD teaser
Di[M]O · arXiv:2503.15457
Di-Bregman · arXiv:2510.16983
CRD · arXiv:2603.14128
02 Motivation

Almost every learning objective
minimizes a divergence

// The universal learning objective $$\min_{\theta} \; D\!\left( p \,\|\, p_{\theta} \right)$$
Classical
p = empirical data; D = KL
→ maximum likelihood estimation
Modern Post-Training
p = $p_{\text{target}}$: teacher output dist. or reward-tilted pretrained dist.
D = KL, Bregman divergence, $f$-divergence…
Diffusion distillation and diffusion RL are two instances of the same divergence minimization framework.
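A tiny numerical check of the classical case: minimizing the forward KL to the empirical distribution is exactly maximum likelihood, since the two objectives differ only by a $\theta$-independent entropy term (toy categorical numbers, purely illustrative):

```python
import numpy as np

# Classical case: p = empirical data, D = forward KL.
p_emp = np.array([0.5, 0.3, 0.2])     # empirical frequencies p
p_theta = np.array([0.4, 0.4, 0.2])   # model probabilities p_theta

kl = np.sum(p_emp * np.log(p_emp / p_theta))   # KL(p || p_theta)
nll = -np.sum(p_emp * np.log(p_theta))         # expected NLL under p
entropy = -np.sum(p_emp * np.log(p_emp))       # H(p), independent of theta

# KL = NLL - H(p): minimizing KL over theta is maximum likelihood.
```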
03 Framework

The On-Policy Generator

  • $x_1 = G_\theta(x_0, c)$
  • $G_\theta$ runs $T$ denoising steps; $c_{\text{ext}}$ is an optional external signal (reward, expert demo…)
  • Induces output distribution $p_{\theta}(x_1 \mid c)$
// Instantiating $\min_\theta D(p \,\|\, p_\theta)$ with $p = p_{\text{target}}$ $$\mathcal{L}(\theta) = \mathbb{E}_{x_0, c, c_{\text{ext}}}\!\left[ D\!\left( p_{\text{target}} \,\|\, p_{\theta} \right) \right]$$
On-policy generator: x₀, c → Model θ → x₁
On-policy distillation objective diagram
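A minimal sketch of the on-policy generator above; the interface and the one-line toy "denoiser" are illustrative stand-ins, not the actual $G_\theta$:

```python
import numpy as np

def on_policy_sample(G_theta, x0, c, num_steps):
    """Run T denoising steps so that x1 = G_theta(x0, c) is a sample
    from the induced output distribution p_theta(x1 | c)."""
    x = x0                              # x_0 ~ prior noise
    for t in reversed(range(num_steps)):
        x = G_theta(x, t, c)            # one denoising step, kept on-policy
    return x                            # x_1

# Toy one-line "denoiser": shrink toward a condition-dependent mean.
toy_step = lambda x, t, c: 0.5 * x + 0.5 * c
x1 = on_policy_sample(toy_step, np.ones(4), np.zeros(4), num_steps=4)
```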
04 Instantiation I — Distillation
$p_{\text{target}} = p_{\text{teacher}}$

Diffusion Step Distillation

Compress a slow multi-step teacher into a few-step (or one-step) student.

// Target = teacher distribution $$p_{\text{target}}(x_1 \mid c) = p_{\text{teacher}}(x_1 \mid c)$$
  • (a) Score-based — estimate the generator's score; no discriminator  DMD · Diff-Instruct
  • (b) Discriminator-based — classify student vs. teacher samples  Diffusion GAN
Distillation pipeline: (a) score-based, (b) discriminator-based approaches
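The score-based route (a) can be sketched with closed-form Gaussian scores standing in for the learned teacher and student score networks; the resulting DMD-style update pushes samples along the score difference, which is then backpropagated through $G_\theta$:

```python
import numpy as np

# Closed-form Gaussian score as a toy stand-in for a score network.
def score_gaussian(x, mu, sigma):
    return -(x - mu) / sigma**2          # grad_x log N(x; mu, sigma^2)

x = np.array([1.0, -0.5])                # a generator sample x = G_theta(z)
s_teacher = score_gaussian(x, mu=0.0, sigma=1.0)  # "real" (teacher) score
s_student = score_gaussian(x, mu=0.3, sigma=1.0)  # estimated p_theta score

# Gradient of KL(p_theta || p_teacher) w.r.t. x: the score difference.
# In DMD this vector is pushed back through G_theta to update theta.
grad_x = s_student - s_teacher
```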
05 Instantiation I — Distillation
arXiv:2503.15457

Di[M]O Distilling Masked Diffusion Models into One-step Generator

Masked Diffusion Models (MDMs) operate on discrete tokens — no PF-ODE, no score function — so existing distillation methods break. Two unique challenges:

Challenge 1
Intermediate states intractable for one-step generation
→ Estimate with auxiliary model $\psi$
Challenge 2
Fully-masked prior lacks diversity — mode collapse
→ Hybrid token init: random + mask tokens
Result
First one-step distillation of MDMs — c2i & t2i (class- & text-to-image)
DiMO token-level distribution matching pipeline
OVERVIEW OF DI[M]O PIPELINE
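Challenge 2's hybrid token initialization might look like the following sketch; the function name, `mask_ratio` parameter, and masking scheme are illustrative, not the paper's exact implementation:

```python
import numpy as np

def hybrid_init(seq_len, vocab_size, mask_id, mask_ratio, rng):
    """Sketch of a hybrid token init: a fully-masked prior collapses
    modes, so mix mask tokens with random tokens to diversify the
    one-step generator's input."""
    tokens = rng.integers(0, vocab_size, size=seq_len)  # random tokens
    masked = rng.random(seq_len) < mask_ratio
    tokens[masked] = mask_id                            # re-mask a fraction
    return tokens

rng = np.random.default_rng(0)
init = hybrid_init(seq_len=16, vocab_size=1024, mask_id=1024,
                   mask_ratio=0.5, rng=rng)
```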
06 Instantiation I — Distillation
arXiv:2510.16983

Di-Bregman One-step Diffusion via Bregman Density-Ratio Matching

Key insight: minimize $D_h(r \,\|\, 1)$ — match the density ratio to $\mathbf{1}$, not distributions directly:

// Distillation objective = matching r(x) to 1 $$r(x) = \frac{p_{\theta}(x)}{p_{\text{teacher}}(x)}, \qquad \min_\theta\; \mathbb{E}_{p_{\text{teacher}}}\!\left[D_h\!\left(r(x) \,\|\, 1\right)\right]$$
// Gradient — h''(r)·r reweights the score difference $$\nabla_\theta \mathcal{L} \;\propto\; h''(r) \cdot r \cdot \nabla_x \log r \cdot \nabla_\theta G_\theta$$
  • Choice of $h$ selects the divergence: KL, reverse-KL, and more
  • $r$ estimated via lightweight discriminator
  • General form $D_h(r \,\|\, r^*)$: $r^*$ is the normalized reward
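The role of $h$ can be checked numerically. A sketch of the $h''(r)\,r$ gradient weight for two classical choices, where the mapping of $h$ to each KL direction follows from expanding $\mathbb{E}_{p_{\text{teacher}}}[D_h(r \,\|\, 1)]$:

```python
import numpy as np

# Di-Bregman sketch: the convex h sets the weight h''(r) * r that
# multiplies the score difference grad_x log r in the gradient.
def bregman_weight(h_name, r):
    if h_name == "r_log_r":     # h(r) = r log r -> h''(r) = 1/r
        return (1.0 / r) * r    # constant weight 1; recovers KL(p_theta || p_teacher)
    if h_name == "neg_log_r":   # h(r) = -log r  -> h''(r) = 1/r^2
        return (1.0 / r**2) * r # weight 1/r; recovers KL(p_teacher || p_theta)
    raise ValueError(h_name)

r = np.array([0.5, 1.0, 2.0])              # density ratio p_theta / p_teacher
w_a = bregman_weight("r_log_r", r)         # [1, 1, 1]
w_b = bregman_weight("neg_log_r", r)       # [2, 1, 0.5]
```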
07 Instantiation II — Diffusion RL
$p_{\text{target}} \propto p_{\text{ref}} \cdot \exp(r)$

Diffusion Reinforcement Learning

Same framework — only the target changes. Instead of a teacher, it encodes a reward:

// Boltzmann (reward-tilted) target distribution $$p_{\text{target}}(x_1 \mid c) \;\propto\; p_{\text{ref}} \cdot \exp\!\left( r(x_1) \right)$$
  • $r(x_1)$: reward model — encodes desired behavior (alignment, text fidelity…)
  • $p_{\text{ref}}$: reference model — KL regularizer, prevents collapse
Diffusion RL training loop: reward signal + KL regularization
RL training loop: reward maximization penalized by KL from $p_{\text{ref}}$
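On a discrete toy support the reward-tilted target can be written out directly; this sketch just normalizes $p_{\text{ref}} \cdot \exp(r)$ to show how the reward reshapes the reference distribution:

```python
import numpy as np

# Boltzmann (reward-tilted) target on a 4-point toy support.
p_ref = np.array([0.25, 0.25, 0.25, 0.25])   # reference model
r = np.array([0.0, 1.0, 2.0, 0.0])           # reward per sample

unnorm = p_ref * np.exp(r)                   # p_ref * exp(r)
p_target = unnorm / unnorm.sum()             # normalize over the support

# Mass concentrates on high-reward samples while p_ref keeps support.
```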
08 Instantiation II — Diffusion RL

Expanding the Objective

Substituting the Boltzmann target into the reverse-KL divergence and expanding:

// Reward maximization + KL regularization $$\mathcal{L}(\theta) \;\propto\; -\mathbb{E}_{p_{\text{ref}}}\!\left[ \frac{p_{\theta}}{p_{\text{ref}}} \cdot r(x_1) \right] + \mathbb{E}_{p_{\theta}}\!\left[ \log \frac{p_{\theta}}{p_{\text{ref}}} \right]$$
Reward Maximization
$-\mathbb{E}_{p_{\text{ref}}}\!\left[\dfrac{p_{\theta}}{p_{\text{ref}}} \cdot r(x_1)\right]$
Importance-weighted reward maximization
KL Regularization
$\mathbb{E}_{p_{\theta}}\!\left[\log\dfrac{p_{\theta}}{p_{\text{ref}}}\right]$
Penalizes drift from the reference model

Diffusion ELBO makes this tractable: $\log(p_\theta/p_\text{ref})$ ≈ difference in denoising losses at each noise level $t$ — a weighted score-matching loss.
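A toy sketch of that ELBO trick, with hard-coded per-timestep losses standing in for real $\epsilon$-prediction MSEs; the point is the telescoping into a weighted loss difference, not the numbers:

```python
import numpy as np

def log_ratio_elbo(loss_theta, loss_ref, weights):
    """Estimate log(p_theta / p_ref)(x1) via the diffusion ELBO:
    each log-density is bounded by a weighted sum of denoising losses,
    so the log-ratio becomes a difference of those weighted sums."""
    return np.sum(weights * (loss_ref - loss_theta))

T = 4
w = np.ones(T)                                # per-noise-level weights w_t
loss_theta = np.array([0.9, 0.7, 0.5, 0.3])   # student denoising losses
loss_ref = np.array([1.0, 0.8, 0.6, 0.4])     # reference denoising losses
lr = log_ratio_elbo(loss_theta, loss_ref, w)  # positive: theta denoises better
```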

09 Instantiation II — Diffusion RL
arXiv:2603.14128

CRD Diffusion RL via Centered Reward Distillation

Motivation
The optimal policy satisfies $\beta \log(p_\theta/p_\text{ref}) = r - \beta \log Z(c)$ — pinned down only up to the intractable per-prompt normalizer $Z(c)$.
Key Insight: Centered Reward
$\beta\log Z(c)$ is intractable but prompt-constant. Subtracting the group mean cancels it exactly — well-posed objective, no approximation.
  • 01 Decouple sampler from reference — preserves meaningful log-ratio signal
  • 02 KL anchoring to CFG-guided pretrained model — prevents long-run drift
  • 03 Reward-adaptive KL strength $\propto$ reward to block reward-hacking loopholes
CRD training pipeline: centered reward distillation
CRD: centered reward + decoupled sampling + adaptive KL
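The centering step can be verified in a few lines: shifting every reward in a prompt's group by the same unknown constant (the $\beta \log Z(c)$ term) leaves the centered values unchanged, so the objective is well-posed without ever computing $Z(c)$:

```python
import numpy as np

def centered_rewards(rewards):
    """Subtract the group mean: a per-prompt baseline."""
    return rewards - rewards.mean()

group = np.array([2.0, 4.0, 6.0])   # rewards for one prompt's sample group
log_Z_term = 1.7                    # unknown, but constant within the prompt

# Adding the same constant to every reward cancels under centering.
same = np.allclose(centered_rewards(group),
                   centered_rewards(group + log_Z_term))
```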
10 Unification

The Unified Picture

Both paradigms minimize $D(p_{\text{target}} \,\|\, p_{\theta})$ — they differ only in what the target means.

| Method | Target $p_{\text{target}}$ | Divergence $D$ |
|---|---|---|
| Di[M]O (2503.15457) | $p_{\text{teacher}}(x_1 \mid c)$ | Generalized Jeffrey divergence |
| Di-Bregman (2510.16983) | $p_{\text{teacher}}(x_1 \mid c)$ | Bregman family |
| CRD (2603.14128) | $p_{\text{ref}} \cdot \exp\!\left( r(x_1) \right)$ | Reverse KL |
The two lines of research are not parallel — they are the same river.
11 Summary

Key Takeaways

  • Di[M]O (masked diffusion distillation) — pushes distillation into discrete token space
  • Di-Bregman (continuous diffusion distillation) — unifies distillation and reward fine-tuning under a principled divergence framework
  • CRD (diffusion RL alignment) — fast, stable diffusion RL training without reward hacking
// Thank You · Questions Welcome

Questions?

$$D\!\left( p_{\text{target}} \,\|\, p_{\theta} \right) \; \text{— one framework, many post-training methods}$$
Yuanzhi Zhu
yuanzhi-zhu.github.io
Di[M]O — arXiv:2503.15457
Di-Bregman — arXiv:2510.16983
CRD — arXiv:2603.14128