In this blog post, I want to show a simple but intuitive connection between Variational Score Distillation (VSD/DMD) [1,2,3] and Diffusion-GAN [4] for diffusion distillation. I noticed this connection while trying to extend our recent workshop paper Di-Bregman [5].
The final conclusion is not tied to Di-Bregman; I simply use Di-Bregman as an example to illustrate the connection.
Preliminaries
In Di-Bregman, we derived a new distillation loss whose gradient extends the DMD loss with an extra coefficient $h^{\prime\prime}(r_t(x_t))\,r_t(x_t)$:

$$\nabla_\theta \mathcal{L}_{\text{Di-Bregman}} = \mathbb{E}_{\epsilon,\, t,\, z_t}\!\left[ w(t)\, h^{\prime\prime}(r_t(x_t))\, r_t(x_t)\, \big( \nabla_{x_t}\log q_{\theta,t}(x_t) - \nabla_{x_t}\log p_t(x_t) \big)\, \frac{\partial G_\theta(\epsilon)}{\partial \theta} \right],$$

where
- $r_t(x_t) = \frac{q_{\theta,t}(x_t)}{p_t(x_t)}$ is the density ratio between the student marginal $q_{\theta,t}$ and the teacher marginal $p_t$ at time $t$;
- $G_\theta(\epsilon)$ is the student generative model that maps noise $\epsilon$ to a clean sample;
- $h(r)$ is a convex function defining the Bregman divergence $\mathbb{D}_h$;
- $w(t)$ is a time-dependent weight function;
- $\epsilon \sim \mathcal{N}(0,I)$, $t \sim \mathcal{U}(0,1)$, and $x_t = \alpha_t G_\theta(\epsilon) + \sigma_t z_t$ with $z_t \sim \mathcal{N}(0,I)$.
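To make the notation concrete, here is a minimal PyTorch sketch of this score-based update. The callables `generator`, `teacher_score`, `fake_score`, `ratio`, and `h_pp` (for $h^{\prime\prime}$) are hypothetical placeholders for the student, the two score networks, a density-ratio estimator, and the chosen $h$; this is an illustration of the gradient above, not the actual Di-Bregman implementation.

```python
import torch

def di_bregman_generator_step(generator, teacher_score, fake_score, h_pp, ratio,
                              alpha_t, sigma_t, w_t, noise):
    """One generator update in the score-based form of the Di-Bregman gradient."""
    x0 = generator(noise)                                     # G_theta(eps), clean student sample

    with torch.no_grad():
        x_t = alpha_t * x0 + sigma_t * torch.randn_like(x0)   # diffused sample at level t
        # Score difference: grad_x log q_{theta,t}(x_t) - grad_x log p_t(x_t).
        score_diff = fake_score(x_t) - teacher_score(x_t)
        r = ratio(x_t)                                         # density-ratio estimate r_t(x_t)
        coef = h_pp(r) * r                                     # extra Di-Bregman coefficient h''(r) * r

    # Surrogate whose theta-gradient is  w(t) * coef * score_diff * dG/dtheta.
    surrogate = (w_t * coef * score_diff * x0).sum()
    surrogate.backward()
```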
From Scores to a Discriminator
One difference between Di-Bregman and DMD is that we need to estimate the density ratio $r_t(x_t)$ in the extra coefficient $h^{\prime\prime}(r_t(x_t))\,r_t(x_t)$, which is usually achieved by training a discriminator $D(x_t,t)$ to distinguish samples from $p_t$ and $q_{\theta,t}$.
With the usual logistic convention (where $D_t(x_t)$ outputs the probability that $x_t$ comes from $p_t$), the optimal discriminator is $D_t^*(x_t)=\frac{p_t(x_t)}{p_t(x_t)+q_{\theta,t}(x_t)}$.
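As a rough illustration of how this is done in practice, below is a minimal PyTorch sketch of a logistic discriminator update on diffused teacher and student samples, together with the ratio conversion $r_t = \frac{1-D_t}{D_t}$. The signature `disc(x_t, t)` (returning a logit) and the batch names are assumptions, not the interface of any specific codebase.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(disc, teacher_x0, student_x0, alpha_t, sigma_t, t):
    """Logistic loss for D_t: classify diffused samples as coming from p_t or q_{theta,t}."""
    real_xt = alpha_t * teacher_x0 + sigma_t * torch.randn_like(teacher_x0)
    fake_xt = alpha_t * student_x0 + sigma_t * torch.randn_like(student_x0)

    real_logits = disc(real_xt, t)   # teacher samples get label 1
    fake_logits = disc(fake_xt, t)   # student samples get label 0
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
            + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

def density_ratio(disc, x_t, t):
    """r_t(x_t) = q_{theta,t}(x_t) / p_t(x_t) = (1 - D_t) / D_t for the optimal discriminator."""
    d = torch.sigmoid(disc(x_t, t)).clamp(1e-6, 1 - 1e-6)
    return (1.0 - d) / d
```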
A natural question is: can we rewrite the Di-Bregman loss gradient entirely in terms of the discriminator $D_t$, eliminating explicit score functions?
Step 1: Substitute the density-ratio gradient identity
We have the identity

$$\nabla_{x_t}\log q_{\theta,t}(x_t) - \nabla_{x_t}\log p_t(x_t) = \nabla_{x_t}\log r_t(x_t) = \frac{\nabla_{x_t} r_t(x_t)}{r_t(x_t)}.$$

Substituting into the Di-Bregman gradient cancels the factor $r_t(x_t)$ and gives

$$\nabla_\theta \mathcal{L} = \mathbb{E}_{\epsilon,\, t,\, z_t}\!\left[ w(t)\, h^{\prime\prime}(r_t(x_t))\, \nabla_{x_t} r_t(x_t)\, \frac{\partial G_\theta(\epsilon)}{\partial \theta} \right].$$
Step 2: Express $\nabla_{x_t} r_t$ and $r_{t}$ in terms of $D_t$
Since $r_t = \frac{q_{\theta,t}}{p_t} = \frac{1-D_t}{D_t}$, we have $\frac{dr_t}{dD_t} = -\frac{1}{D_t^2}$, hence

$$\nabla_{x_t} r_t(x_t) = \frac{dr_t}{dD_t}\,\nabla_{x_t} D_t(x_t) = -\frac{\nabla_{x_t} D_t(x_t)}{D_t(x_t)^2}.$$

Substituting this back,

$$\nabla_\theta \mathcal{L} = -\,\mathbb{E}_{\epsilon,\, t,\, z_t}\!\left[ w(t)\, \frac{h^{\prime\prime}\!\big(\tfrac{1-D_t(x_t)}{D_t(x_t)}\big)}{D_t(x_t)^2}\, \nabla_{x_t} D_t(x_t)\, \frac{\partial G_\theta(\epsilon)}{\partial \theta} \right].$$
Step 3: Convert $\nabla_{x_t} D_t$ to $\nabla_\theta D_t$
Using the chain rule, we have

$$\nabla_\theta D_t(x_t) = \nabla_{x_t} D_t(x_t)\,\frac{\partial x_t}{\partial \theta} = \alpha_t\, \nabla_{x_t} D_t(x_t)\,\frac{\partial G_\theta(\epsilon)}{\partial \theta},$$

so the gradient can be written entirely in terms of the discriminator:

$$\boxed{\;\nabla_\theta \mathcal{L} = -\,\mathbb{E}_{\epsilon,\, t,\, z_t}\!\left[ w^{\prime}(t)\, \frac{h^{\prime\prime}\!\big(\tfrac{1-D_t(x_t)}{D_t(x_t)}\big)}{D_t(x_t)^2}\, \nabla_\theta D_t(x_t) \right].\;}$$

The conversion from $\nabla_{x_t}D_t$ to $\nabla_\theta D_t$ introduces the factor $\alpha_t$, which is absorbed into the weight $w^{\prime}(t)$.
The boxed equation shows that the Di-Bregman loss gradient can be computed by backpropagating through the discriminator $D_t$ only, without explicitly estimating the score functions. This suggests that Di-Bregman / VSD and Diffusion-GAN are closely connected.
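Read as a training rule, the boxed gradient amounts to backpropagating a reweighted discriminator output through the student generator. Here is a minimal PyTorch sketch under the same assumed interfaces as above (`disc` returns a logit, `h_pp` stands for $h^{\prime\prime}$, and `w_t` plays the role of $w^{\prime}(t)$); it is a sketch of the idea, not a reference implementation.

```python
import torch

def generator_step(generator, disc, h_pp, alpha_t, sigma_t, t, w_t, noise):
    """Generator update implementing the boxed gradient with the discriminator only."""
    x0 = generator(noise)
    x_t = alpha_t * x0 + sigma_t * torch.randn_like(x0)     # gradients reach theta through x_t

    d = torch.sigmoid(disc(x_t, t)).clamp(1e-6, 1 - 1e-6)   # D_t(x_t)

    # Coefficient h''((1 - D)/D) / D^2 from the boxed equation, treated as a constant.
    coef = (h_pp((1.0 - d) / d) / d.pow(2)).detach()

    # Surrogate whose theta-gradient is  -w'(t) * coef * grad_theta D_t(x_t);
    # backprop through x_t supplies the alpha_t factor absorbed into w'(t).
    # Only the generator's optimizer should be stepped; D_t is held fixed here.
    surrogate = -(w_t * coef * d).mean()
    surrogate.backward()
```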
Verifying with a Special Case: Reverse KL Divergence (DMD)
When $h(r)=r\log r$, we have $h^{\prime\prime}(r)=\frac{1}{r}$, hence the coefficient in the boxed equation becomes $\frac{h^{\prime\prime}(\frac{1-D_t}{D_t})}{D_t^2} = \frac{1}{D_t(1-D_t)}$, and the loss gradient becomes

$$\nabla_\theta \mathcal{L} = -\,\mathbb{E}_{\epsilon,\, t,\, z_t}\!\left[ w^{\prime}(t)\, \frac{\nabla_\theta D_t(x_t)}{D_t(x_t)\,\big(1-D_t(x_t)\big)} \right].$$
This matches the gradient of the GAN generator loss associated with the reverse KL divergence $\mathrm{KL}(q_\theta\,\|\,p)$:

$$\mathcal{L}_G = \mathbb{E}_{x_t \sim q_{\theta,t}}\!\left[\log\frac{1-D_t(x_t)}{D_t(x_t)}\right], \qquad \nabla_\theta \mathcal{L}_G = -\,\mathbb{E}\!\left[ \frac{\nabla_\theta D_t(x_t)}{D_t(x_t)\,\big(1-D_t(x_t)\big)} \right].$$
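As a quick sanity check, one can verify with autograd that the per-sample derivative of $\log\frac{1-D}{D}$ with respect to $D$ is indeed $-\frac{1}{D(1-D)}$, the coefficient that appears in both gradients above (a standalone snippet, independent of any particular model):

```python
import torch

d = torch.rand(5).clamp(1e-3, 1 - 1e-3).requires_grad_(True)

loss = torch.log((1 - d) / d).sum()           # reverse-KL generator loss, summed over samples
loss.backward()

expected = (-1.0 / (d * (1 - d))).detach()    # -1 / (D (1 - D))
print(torch.allclose(d.grad, expected))       # True
```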
Given that both DMD and Di-Bregman are, in this sense, equivalent to Diffusion-GAN training, and given my belief that scalable pre-training requires algorithms that directly optimize the data likelihood or an associated ELBO, I am skeptical that these approaches can serve as general-purpose pre-training methods. For example, it seems unlikely that we could pre-train a next-frame prediction video generative model using methods such as self-forcing.
References
[1] ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation.
Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Advances in Neural Information Processing Systems 36 (2023).
[2] One-Step Diffusion with Distribution Matching Distillation.
Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T. Freeman, and Taesung Park. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6613-6623. 2024.
[3] Diff-Instruct: A Universal Approach for Transferring Knowledge from Pre-trained Diffusion Models.
Weijian Luo, Tianyang Hu, Shifeng Zhang, Jiacheng Sun, Zhenguo Li, and Zhihua Zhang. Advances in Neural Information Processing Systems 36 (2023).
[4] Diffusion-GAN: Training GANs with Diffusion.
Zhendong Wang, Huangjie Zheng, Pengcheng He, Weizhu Chen, and Mingyuan Zhou. arXiv preprint arXiv:2206.02262 (2022).
[5] One-step Diffusion Models with Bregman Density Ratio Matching.
Yuanzhi Zhu, Eleftherios Tsonis, Lucas Degeorge, Vicky Kalogeiton. arXiv preprint, arXiv:2510.16983, 2025.