Offline DiffusionNFT [1] as KL-Regularized Reward Tilting
In my personal view, DiffusionNFT is a significant piece of work: it is the first native algorithm for diffusion RL that does not rely on differentiable reward models.
The paper derives an optimal velocity field update and interprets it as a positive/negative split, but the corresponding density-level optimum can be written explicitly as a reward-tilted distribution. This section provides a detailed derivation of this closed-form solution.
Setup (From DiffusionNFT)
Introduce a binary optimality variable $o\in\{0,1\}$ and a prompt/context $c$. DiffusionNFT defines the reward as an optimality probability
$$r(x_0,c) := p(o=1\mid x_0,c)\in[0,1].$$
Given a reference (old) model $\pi_{\mathrm{old}}(x_0\mid c)$, define the positive (optimality-conditioned) distribution
$$\pi_{+}(x_0\mid c) := \pi_{\mathrm{old}}(x_0\mid o=1,c) = \frac{r(x_0,c)\,\pi_{\mathrm{old}}(x_0\mid c)}{p(o=1\mid c)},\qquad p(o=1\mid c)=\int r(x_0,c)\,\pi_{\mathrm{old}}(x_0\mid c)\,\mathrm{d}x_0,$$
and the corresponding forward-noised marginals at diffusion time $t$ under a fixed kernel $\pi_{t\mid 0}(x_t\mid x_0)$:
$$\pi^{+}_t(x_t\mid c)=\int \pi_{t\mid 0}(x_t\mid x_0)\,\pi_{+}(x_0\mid c)\,\mathrm{d}x_0,\qquad \pi^{\mathrm{old}}_t(x_t\mid c)=\int \pi_{t\mid 0}(x_t\mid x_0)\,\pi_{\mathrm{old}}(x_0\mid c)\,\mathrm{d}x_0.$$
DiffusionNFT’s training objective [1] yields an optimal velocity field of the form
$$v^{*}(x_t,c,t) = v^{\mathrm{old}}(x_t,c,t) + \frac{2}{\beta}\,\Delta(x_t,c,t),$$
with guidance direction $\Delta(x_t,c,t)=\alpha(x_t,c)\bigl(v^{+}(x_t,c,t)-v^{\mathrm{old}}(x_t,c,t)\bigr)$, where $v^{+}$ and $v^{\mathrm{old}}$ denote the velocity fields generating $\pi^{+}_t$ and $\pi^{\mathrm{old}}_t$, respectively. The mixture coefficient is defined as
$$\alpha(x_t,c) := \mathbb{E}_{x_0\sim \pi_{\mathrm{old}}(x_0\mid x_t,c)}\bigl[r(x_0,c)\bigr].$$
Derivation of $\alpha(x_t,c)=p(o=1\mid x_t,c)$
Apply the forward diffusion to the positive distribution:
$$\pi^{+}_t(x_t\mid c)=\int \pi_{t\mid 0}(x_t\mid x_0)\,\pi_{+}(x_0\mid c)\,\mathrm{d}x_0=\frac{1}{p(o=1\mid c)}\int \pi_{t\mid 0}(x_t\mid x_0)\,r(x_0,c)\,\pi_{\mathrm{old}}(x_0\mid c)\,\mathrm{d}x_0,$$
where we use the identity
$$\pi_{t\mid 0}(x_t\mid x_0)\,\pi_{\mathrm{old}}(x_0\mid c)=\pi^{\mathrm{old}}_t(x_t\mid c)\,\pi_{\mathrm{old}}(x_0\mid x_t,c).$$
We obtain
$$\pi^{+}_t(x_t\mid c)=\frac{\pi^{\mathrm{old}}_t(x_t\mid c)}{p(o=1\mid c)}\,\mathbb{E}_{x_0\sim \pi_{\mathrm{old}}(\cdot\mid x_t,c)}\bigl[r(x_0,c)\bigr]=\frac{\alpha(x_t,c)}{p(o=1\mid c)}\,\pi^{\mathrm{old}}_t(x_t\mid c).$$
Since $r(x_0,c)=p(o=1\mid x_0,c)$ and $o$ is conditionally independent of $x_t$ given $x_0$, the tower property gives $\alpha(x_t,c)=\mathbb{E}_{x_0\sim \pi_{\mathrm{old}}(\cdot \mid x_t,c)}[r(x_0,c)]=p(o=1\mid x_t,c)$. This shows that the mixture coefficient equals the optimality posterior probability at the noisy state $x_t$ under the old model.
Remark
This equality is a direct consequence of the definition of $\pi_{+}$ as an optimality-conditioned distribution and Bayes’ rule under the fixed forward noising kernel. It may be omitted or left only implicit in the original DiffusionNFT paper. With this identity in hand, Lemma A.2 (Posterior Split) in the paper becomes immediate.
DiffusionNFT optimal distribution at each step
To compute the optimal distribution, we first simplify the guidance direction $\Delta(x_t,c,t)=\alpha(x_t,c)\bigl(v^{+}(x_t,c,t)-v^{\mathrm{old}}(x_t,c,t)\bigr)$.
Using the Bayes relation for the positive marginal derived above,
$$\pi^{+}_t(x_t\mid c)=\frac{\alpha(x_t,c)}{p(o=1\mid c)}\,\pi^{\mathrm{old}}_t(x_t\mid c),$$
and i) utilizing the relation between velocity fields and score functions under fixed Gaussian noising, and ii) taking the logarithm and the gradient, one has
$$v^{+}(x_t,c,t)-v^{\mathrm{old}}(x_t,c,t)=c_t\bigl(\nabla_{x_t}\log \pi^{+}_t(x_t\mid c)-\nabla_{x_t}\log \pi^{\mathrm{old}}_t(x_t\mid c)\bigr)=c_t\,\nabla_{x_t}\log \alpha(x_t,c),$$
where $c_t$ is a time-dependent coefficient determined by the noising schedule. Thus we can rewrite the guidance direction as
$$\Delta(x_t,c,t)=c_t\,\alpha(x_t,c)\,\nabla_{x_t}\log \alpha(x_t,c).$$
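For concreteness, here is a short derivation of this velocity–score relation under one common flow-matching convention; the linear Gaussian path and the schedule symbols $a_t$, $\sigma_t$ (chosen to avoid clashing with the mixture coefficient $\alpha$) are assumptions of mine, not notation taken from the DiffusionNFT paper. Assume $x_t=a_t x_0+\sigma_t\,\epsilon$ with $\epsilon\sim\mathcal{N}(0,I)$. The marginal velocity is $v(x_t,t)=\dot a_t\,\mathbb{E}[x_0\mid x_t]+\dot\sigma_t\,\mathbb{E}[\epsilon\mid x_t]$, and Tweedie’s formula gives $\mathbb{E}[x_0\mid x_t]=\bigl(x_t+\sigma_t^{2}\nabla_{x_t}\log p_t(x_t)\bigr)/a_t$, so
$$v(x_t,t)=\frac{\dot a_t}{a_t}\,x_t+\sigma_t\Bigl(\frac{\dot a_t}{a_t}\sigma_t-\dot\sigma_t\Bigr)\nabla_{x_t}\log p_t(x_t).$$
Since the $x_t$ term does not depend on the model, two velocity fields induced by the same noising kernel differ only through their scores:
$$v^{1}(x_t,t)-v^{2}(x_t,t)=c_t\bigl(\nabla_{x_t}\log p^{1}_t(x_t)-\nabla_{x_t}\log p^{2}_t(x_t)\bigr),\qquad c_t=\sigma_t\Bigl(\frac{\dot a_t}{a_t}\sigma_t-\dot\sigma_t\Bigr).$$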
Closed-form optimal distribution (density-level)
Since $\alpha(x_t,c)=p(o=1\mid x_t,c)$ and $\alpha(x_t,c)\nabla\log\alpha(x_t,c)=\nabla\alpha(x_t,c)$, we can rewrite the optimal velocity field above in score form (the conversion coefficient $c_t$ cancels). Writing $p^{\mathrm{old}}_t\equiv\pi^{\mathrm{old}}_t$ for the old marginal and $p^{*}_t$ for the marginal generated by $v^{*}$, we have
$$\nabla_{x_t}\log p^{*}_t(x_t\mid c)=\nabla_{x_t}\log p^{\mathrm{old}}_t(x_t\mid c)+\frac{2}{\beta}\,\nabla_{x_t}\alpha(x_t,c).$$
This implies the explicit density
$$p^{*}_t(x_t\mid c)\;\propto\;p^{\mathrm{old}}_t(x_t\mid c)\,\exp\!\Bigl(\frac{2}{\beta}\,\alpha(x_t,c)\Bigr).$$
At $t=0$, since $\alpha(x_0,c)=r(x_0,c)$, this reduces to $p^{*}(x_0\mid c)\propto p^{\mathrm{old}}(x_0\mid c)\exp\bigl(\frac{2}{\beta}r(x_0,c)\bigr)$.
Thus, DiffusionNFT with fixed $p^{\mathrm{old}}=p^{\mathrm{ref}}$ can be viewed as learning a KL-regularized exponential tilt of the reference model, where the “reward” is the optimality posterior at the noisy state.
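As a quick sanity check of this KL-regularized-tilting view, below is a minimal numerical sketch on a made-up discrete toy problem (the distribution, rewards, and $\beta$ are arbitrary choices of mine, not from the paper). It verifies that $p^{*}\propto p^{\mathrm{old}}\exp(\frac{2}{\beta}r)$ attains a higher value of $\mathbb{E}_q[r]-\frac{\beta}{2}\,\mathrm{KL}(q\,\|\,p^{\mathrm{old}})$ than random perturbations of it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete x_0 space: 5 outcomes with an arbitrary old model and reward in [0, 1].
p_old = np.array([0.35, 0.25, 0.20, 0.15, 0.05])
r = np.array([0.10, 0.80, 0.30, 0.95, 0.50])   # r(x_0, c) = p(o = 1 | x_0, c)
beta = 1.0

def kl(q, p):
    return np.sum(q * np.log(q / p))

def objective(q):
    # KL-regularized reward: E_q[r] - (beta / 2) * KL(q || p_old)
    return np.dot(q, r) - 0.5 * beta * kl(q, p_old)

# Closed-form tilt: p* proportional to p_old * exp((2 / beta) * r)
p_star = p_old * np.exp(2.0 / beta * r)
p_star /= p_star.sum()

# Check optimality against many random perturbations of p*.
best = objective(p_star)
for _ in range(10000):
    q = p_star * np.exp(0.1 * rng.normal(size=p_star.shape))
    q /= q.sum()
    assert objective(q) <= best + 1e-9

print("tilted distribution:", np.round(p_star, 4))
print("objective at p*:", round(best, 6))
```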
Online DiffusionNFT Leads to Reward Hacking via Accumulated Tilting
In practice, DiffusionNFT can be run online: after fitting $v_\theta$ for one epoch, the old model is updated (e.g., by copying weights (hard) or by EMA (soft)), and the next epoch is trained against this updated old model. This induces a recursion over the per-epoch reference distributions.
Idealized recursion with exact per-epoch optima
Let $p^{(k)}_t(x_t\mid c)$ denote the old marginal at epoch $k$ (the distribution induced by the current old velocity field), and treat the reward/optimality posterior $\alpha_t(x_t,c)=p(o=1\mid x_t,c)$ as fixed across epochs.
From the closed-form optimum, the population-optimal update at epoch $k$ is
$$p^{(k+1)}_t(x_t\mid c)\;\propto\;p^{(k)}_t(x_t\mid c)\,\exp\!\Bigl(\frac{2}{\beta}\,\alpha_t(x_t,c)\Bigr).$$
Unrolling the recursion yields
$$p^{(K)}_t(x_t\mid c)\;\propto\;p^{(0)}_t(x_t\mid c)\,\exp\!\Bigl(\frac{2K}{\beta}\,\alpha_t(x_t,c)\Bigr).$$
As $K\to\infty$, this distribution concentrates its mass on the maximizers of $\alpha_t(\cdot,c)$ within the support of $p^{(0)}_t(\cdot\mid c)$.
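The following is a minimal numerical sketch of this concentration effect on a made-up discrete marginal (the probabilities, the values of $\alpha_t$, and $\beta$ are arbitrary illustrations): iterating the per-epoch tilt drives the entropy down and piles mass onto the maximizer of $\alpha_t$.

```python
import numpy as np

# Toy discrete marginal at a fixed diffusion time t, plus an arbitrary
# optimality posterior alpha_t(x_t, c) in [0, 1] (both made up for illustration).
p0 = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
alpha_t = np.array([0.20, 0.55, 0.40, 0.90, 0.60])
beta = 4.0

def entropy(p):
    return -np.sum(p * np.log(p))

p = p0.copy()
for k in range(1, 51):
    p = p * np.exp(2.0 / beta * alpha_t)   # per-epoch exponential tilt
    p /= p.sum()
    if k in (1, 5, 10, 25, 50):
        print(f"epoch {k:2d}: mass on argmax alpha = {p[np.argmax(alpha_t)]:.3f}, "
              f"entropy = {entropy(p):.3f}")
# After K epochs, p is proportional to p0 * exp(2K/beta * alpha_t):
# mass concentrates on the maximizer of alpha_t.
```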
Remarks on EMA references
If the reference is updated via EMA in parameter space, the induced distribution recursion is not exactly the simple multiplicative update above. Nevertheless, EMA typically acts as a trust-region mechanism that interpolates between keeping $p^{(k)}_t$ fixed and fully replacing it with the latest student, effectively reducing the rate at which the tilt coefficient grows with $k$. The qualitative conclusion remains: in the absence of an anchoring term to the initial model, repeated online improvement tends to accumulate reward tilt, so the learned distribution becomes increasingly peaked around high-reward regions and is thus prone to reward hacking.
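To make the "reduced rate" intuition concrete, here is a back-of-the-envelope calculation under a distribution-space EMA proxy of my own choosing (geometric averaging of densities with rate $\tau$); this is not the parameter-space EMA actually used in practice. If the reference is updated as $\log p^{(k+1)}_{\mathrm{ref}}=(1-\tau)\log p^{(k)}_{\mathrm{ref}}+\tau\log p^{(k+1)}+\mathrm{const}$, where the per-epoch optimum is $p^{(k+1)}\propto p^{(k)}_{\mathrm{ref}}\exp\bigl(\frac{2}{\beta}\alpha_t\bigr)$, then
$$\log p^{(k+1)}_{\mathrm{ref}}=\log p^{(k)}_{\mathrm{ref}}+\frac{2\tau}{\beta}\,\alpha_t+\mathrm{const}\quad\Longrightarrow\quad p^{(K)}_{\mathrm{ref}}\;\propto\;p^{(0)}_{\mathrm{ref}}\,\exp\!\Bigl(\frac{2\tau K}{\beta}\,\alpha_t\Bigr),$$
so under this proxy the tilt coefficient still grows linearly in $K$, only at the reduced rate $2\tau/\beta$ per epoch.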
Online DiffusionNFT with an Additional KL to the Initial Reference
To prevent unbounded drift from the original model, one can augment the per-epoch objective with an additional regularizer that penalizes deviation from the initial reference distribution $p^{(0)}_t(\cdot\mid c)=p^{\mathrm{ref}}_t(\cdot\mid c)$.
At the distribution level, consider the KL-regularized problem at epoch $k$ [2,3]:
$$p^{(k+1)}_t(\cdot\mid c)\;=\;\arg\max_{q}\;\mathbb{E}_{x_t\sim q}\bigl[\alpha_t(x_t,c)\bigr]\;-\;\eta_1\,\mathrm{KL}\bigl(q\,\|\,p^{(k)}_t(\cdot\mid c)\bigr)\;-\;\eta_0\,\mathrm{KL}\bigl(q\,\|\,p^{(0)}_t(\cdot\mid c)\bigr).$$
Here $\eta_1$ controls the trust region to the current reference and $\eta_0$ controls anchoring to the initial reference; setting $\eta_1=\beta/2$ and $\eta_0=0$ recovers the original per-epoch optimum.
The unique optimizer has the closed form
$$p^{(k+1)}_t(x_t\mid c)\;\propto\;\bigl(p^{(k)}_t(x_t\mid c)\bigr)^{w}\,\bigl(p^{(0)}_t(x_t\mid c)\bigr)^{1-w}\,\exp\!\bigl(\lambda\,\alpha_t(x_t,c)\bigr),$$
with weights and effective tilt coefficient
$$w=\frac{\eta_1}{\eta_1+\eta_0}\in(0,1),\qquad \lambda=\frac{1}{\eta_1+\eta_0}.$$
Thus, the per-epoch solution is an exponential tilt of a geometric mixture of the current and initial references.
Since $w<1$, the log-density recursion is an affine contraction, so the iterates converge to a limiting distribution:
$$p^{(\infty)}_t(x_t\mid c)\;\propto\;p^{(0)}_t(x_t\mid c)\,\exp\!\Bigl(\frac{\lambda}{1-w}\,\alpha_t(x_t,c)\Bigr)\;=\;p^{(0)}_t(x_t\mid c)\,\exp\!\Bigl(\frac{1}{\eta_0}\,\alpha_t(x_t,c)\Bigr).$$
Notably, the limiting distribution depends only on the anchoring strength $\eta_0$; the trust-region parameter $\eta_1$ affects only the dynamics (convergence rate and stability) through $w$.
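The following is a minimal numerical sketch of this fixed-point claim on a made-up discrete marginal (the probabilities, the $\alpha_t$ values, and the $\eta$ values are arbitrary): for several choices of $\eta_1$, iterating the anchored update converges to the same limit $p^{(0)}_t\exp(\alpha_t/\eta_0)$.

```python
import numpy as np

# Toy discrete check that the anchored recursion
#   p_{k+1} ~ (p_k)^w * (p_0)^(1-w) * exp(lam * alpha_t),
#   with w = eta1 / (eta1 + eta0) and lam = 1 / (eta1 + eta0),
# converges to p_0 * exp(alpha_t / eta0) regardless of eta1.
p0 = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
alpha_t = np.array([0.20, 0.55, 0.40, 0.90, 0.60])
eta0 = 0.5

limit = p0 * np.exp(alpha_t / eta0)
limit /= limit.sum()

for eta1 in (0.1, 1.0, 10.0):
    w = eta1 / (eta1 + eta0)
    lam = 1.0 / (eta1 + eta0)
    p = p0.copy()
    for _ in range(2000):
        p = p**w * p0**(1.0 - w) * np.exp(lam * alpha_t)
        p /= p.sum()
    print(f"eta1 = {eta1:5.1f}: max |p_inf - limit| = {np.max(np.abs(p - limit)):.2e}")
```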
In practice, the finetuned model in DiffusionNFT is initialized from a pre-trained diffusion model without CFG, and the initial reference model is the same pre-trained model without CFG. Thus, with a large initial-KL strength, the online DiffusionNFT procedure effectively regularizes the finetuned model toward the CFG-free pre-trained model, and the learned model can only generate blurry samples; with the small initial-KL strength used in the experiments in the paper, the finetuned model generates samples with high reward but low diversity and pure-color backgrounds, which suggests severe reward hacking. Based on the analysis above, we suggest adding a medium-strength initial KL regularization and using the pre-trained model with CFG at inference as the initial reference to mitigate reward hacking in online DiffusionNFT.
Acknowledgements
The author thanks Huayu Chen for his insightful work, helpful discussions, and valuable feedback. The author also thanks Ruiqing Wang for proofreading.
References
[1] DiffusionNFT: Online Diffusion Reinforcement with Forward Process.
Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. arXiv preprint arXiv:2509.16117 (2025).
[2] Test-time Alignment of Diffusion Models without Reward Over-optimization.
Sunwoo Kim, Minkyu Kim, and Dongmin Park. The Thirteenth International Conference on Learning Representations (ICLR 2025).
[3] Feedback Efficient Online Fine-Tuning of Diffusion Models.
Masatoshi Uehara, Yulai Zhao, Kevin Black, Ehsan Hajiramezanali, Gabriele Scalia, Nathaniel Lee Diamant, Alex M. Tseng, Sergey Levine, and Tommaso Biancalani. Proceedings of the 41st International Conference on Machine Learning (ICML 2024).