Bayesian Posterior Sampling (3) Langevin Dynamics and Diffusion Models

2023/07/05 Research MCMC, Langevin Dynamics, Fokker-Planck Equation, Diffusion Models

Introduction

In the previous post, I presented some of the most classical MCMC methods, Metropolis-Hastings (MH) and its variants. In this post I’d like to focus on another important MCMC method, known as the Unadjusted Langevin Algorithm (ULA) in Bayesian statistics and as Langevin Monte Carlo (LMC) in machine learning. LMC is an MCMC method that utilizes Langevin dynamics to obtain random samples from a probability distribution $\pi$ for which direct sampling is difficult (high dimensionality, large datasets).

Langevin dynamics provides an MCMC procedure to sample from a distribution $p(x)$ using only its score function $\nabla_x \log p(x)$. Specifically, it initializes the chain from an arbitrary prior distribution $x_0 \sim \pi(x)$, and then iterates the following:

$$ \begin{align*} x_{i+1} \leftarrow x_i + \epsilon \nabla_x \log p(x) + \sqrt{2 \epsilon} z_i \label{Des_LD}\tag{1} \end{align*}$$

—— Generative Modeling by Estimating Gradients of the Data Distribution by Yang Song. Personal Blog, 2021.

Langevin Dynamics Sampling

Langevin Diffusion

The overdamped1 Langevin Itô diffusion can be written as the following Stochastic Differential Equation (SDE):

$$ \begin{align*} \mathrm{d}{X_t}=\underbrace{-\nabla U(X_t)\mathrm{d}t}_{\text{drift term}} + \underbrace{\sqrt {2}\,\mathrm{d}{B_t}}_{\text{diffusion term}}, \quad X_0\sim p_0 \label{LD}\tag{2} \end{align*} $$

where $U(X)$ is the potential energy function on $\mathbb{R}^d$ and $B_t$ is a $d$-dimensional standard Brownian Motion (also called the Wiener process). We assume that $U(X)$ is $L$-smooth (or $\nabla U(X)$ is $L$-Lipschitz): i.e. $U$ is continuously differentiable and there exists $L > 0$ such that $||\nabla U(x)-\nabla U(y)|| \leq L||x-y||$ for all $x, y$.

In order to sample the diffusion paths, we can discretize eq(\ref{LD}) using the Euler-Maruyama (EM) scheme as follows2 (similar to eq(\ref{Des_LD})):

$$ \begin{align*} x_{i+1} = x_i - \gamma_{i+1} \nabla_x U(x_i) + \sqrt{2 \gamma_{i+1}} z_i \label{Des_LD2}\tag{3} \end{align*} $$

where the $z_i$ are i.i.d. $\mathcal{N}(0,I_d)$ and $\gamma_{i}$ is the step-size, either constant or decreasing to 0.

The subject of mixing times in MCMC algorithms is quite involved and requires substantial technical analysis, so we won’t delve into it in this blog post.
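As a concrete illustration, here is a minimal NumPy sketch of the EM discretization in eq(\ref{Des_LD2}) with a constant step-size, applied to a toy target where $U(x)=\frac{1}{2}||x||^2$ (a standard Gaussian); the function and variable names are my own choices, not from any particular library:

```python
import numpy as np

def ula_sample(grad_U, x0, gamma, n_steps, seed=0):
    """Unadjusted Langevin Algorithm, eq (3):
    x_{i+1} = x_i - gamma * grad_U(x_i) + sqrt(2 * gamma) * z_i."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for i in range(n_steps):
        z = rng.standard_normal(x.shape)          # z_i ~ N(0, I_d)
        x = x - gamma * grad_U(x) + np.sqrt(2.0 * gamma) * z
        samples[i] = x
    return samples

# Toy target: pi = N(0, I_2), i.e. U(x) = ||x||^2 / 2 and grad U(x) = x.
samples = ula_sample(grad_U=lambda x: x, x0=np.zeros(2), gamma=0.05, n_steps=5000)
print(samples[1000:].mean(axis=0), samples[1000:].std(axis=0))  # roughly 0 and 1
```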

Fokker-Planck Equation

The Fokker-Planck (FP) equation, a form of partial differential equation (PDE), outlines how a probability distribution changes over time due to the influences of deterministic drift forces and stochastic fluctuations.

Let’s denote the law of $X_t$ by $p_t$; the FP equation for $p_t$ in eq(\ref{LD}) can then be written as3,4:

$$ \begin{align*} \partial_t p_t = \nabla\cdot(p_t\nabla U) + \Delta p_t \label{FP}\tag{4} \end{align*} $$

To verify that $\pi \propto \exp(-U(x))$ is the unique stationary distribution of this FP equation (\ref{FP}), and hence of the corresponding Langevin diffusion (this is what motivates using Langevin dynamics to sample from $\pi$), we can do either of the following5:

  • substitute $p_t$ with $\pi$ and verify that it is a stationary distribution ($\partial_t \pi = 0$); a quick check is given below the list
  • assume we already have a stationary distribution $p_\infty(x)$ such that $\partial_t p_\infty=0$ and verify that $p_\infty(x) \propto \exp(-U(x))$.
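For the first bullet, substitute $\pi \propto e^{-U}$ into the right-hand side of eq(\ref{FP}) and use $\Delta e^{-U} = \nabla\cdot(\nabla e^{-U}) = -\nabla\cdot(e^{-U}\nabla U)$:

$$ \nabla\cdot(e^{-U}\nabla U) + \Delta e^{-U} = \nabla\cdot(e^{-U}\nabla U) - \nabla\cdot(e^{-U}\nabla U) = 0, $$

so $\partial_t \pi = 0$ as claimed (the normalizing constant of $\pi$ does not affect the computation).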

Unadjusted Langevin Algorithm

When we use eq(\ref{Des_LD2}) for sampling, we are implementing the ULA. Even though the stationary distribution of the Langevin diffusion is $\pi$, the stationary distribution of ULA is not $\pi$, due to the discretization6! Note that when $\gamma_i$ is constant, its value controls the trade-off between convergence speed and accuracy7. ULA can easily become biased with a larger step-size $\gamma_i$; one can use a decreasing step-size or introduce an MH accept/reject step to eliminate the bias.

Metropolis-Adjusted Langevin Algorithm

The well-known MALA can be considered as MH with improved proposals, or, in the framing of this post, as ULA with an extra accept/reject step:

$$ \begin{align*} g(x'|x) &= \mathcal{N}(x';x+\tau \nabla \log\tilde{\pi}(x),2\tau I) \\ p_{\mathrm{accept}}(x') &= \min (1,\alpha) =\min \left(1, \frac{\tilde\pi\left(x^{\prime}\right) g\left(x | x^{\prime}\right)}{\tilde\pi(x) g\left(x^{\prime} | x\right)}\right) \label{MALA}\tag{5} \end{align*} $$

where the additional drift term $\tau \nabla \log\tilde{\pi}(x)$ pushes the proposals toward areas of high probability density, which makes them more likely to be accepted. One significant advantage of MALA over Random Walk Metropolis (RWM) lies in the optimal step-size: asymptotically MALA tolerates a larger step-size than RWM (the optimal acceptance rate is roughly 0.574 for MALA versus 0.234 for RWM), which reduces the correlation between subsequent samples.
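Below is a minimal NumPy sketch of one MALA transition implementing eq(\ref{MALA}); `log_pi` and `grad_log_pi` stand for $\log\tilde\pi$ and $\nabla\log\tilde\pi$, and all names here are my own:

```python
import numpy as np

def mala_step(x, log_pi, grad_log_pi, tau, rng):
    """One MALA transition: Langevin proposal + MH accept/reject, eq (5)."""
    def log_q(x_to, x_from):
        # log g(x_to | x_from) up to a constant: N(x_from + tau * grad log pi, 2 tau I)
        mean = x_from + tau * grad_log_pi(x_from)
        return -np.sum((x_to - mean) ** 2) / (4.0 * tau)

    x_prop = x + tau * grad_log_pi(x) + np.sqrt(2.0 * tau) * rng.standard_normal(x.shape)
    log_alpha = log_pi(x_prop) + log_q(x, x_prop) - log_pi(x) - log_q(x_prop, x)
    return x_prop if np.log(rng.uniform()) < log_alpha else x

# Example: sample a standard 2-D Gaussian, log pi(x) = -||x||^2 / 2.
rng = np.random.default_rng(0)
x, chain = np.zeros(2), []
for _ in range(5000):
    x = mala_step(x, lambda v: -0.5 * v @ v, lambda v: -v, tau=0.5, rng=rng)
    chain.append(x)
```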

Stochastic Gradient Langevin Dynamics

Stochastic Gradient Langevin Dynamics (SGLD) is a method proposed by Max Welling and Yee Whye Teh where the target density $\pi$ is the density of the posterior distribution $p(\theta|x)\propto {p(\theta)}\prod_{i=1}^{N}{p(x_i|\theta)}$. The authors first observe the parallels between stochastic gradient algorithms (which use a minibatch of samples in each iteration) and Langevin dynamics (which uses all available samples in each iteration; one can view ULA as gradient descent with Gaussian noise added to the gradient at each step), and then propose an update rule that merges these two ideas using unbiased estimates of the gradient:

$$ \begin{align*} \Delta\theta_{t}=\frac{\epsilon_{t}}{2}\left(\nabla\log p(\theta_{t})+\frac{N}{n}\sum_{i=1}^{n}\nabla\log p(x_{t i}|\theta_{t})\right)+\eta_{t} \label{SGLD}\tag{6} \end{align*} $$

where the step-sizes $\epsilon_{t}$ decrease towards zero, $n$ is the batch size, and $\eta_{t} \sim \mathcal{N}(0,\epsilon_{t})$.

Indeed, SGLD was proposed for training a model given a dataset8, and the idea is quite simple: we train the model with regular SGD, but add Gaussian noise to each step. There are thus two sources of randomness: the stochastic estimate of the gradient and the Gaussian noise injected into each update.
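A sketch of one SGLD update implementing eq(\ref{SGLD}); `grad_log_prior` and `grad_log_lik` are hypothetical callbacks for $\nabla_\theta\log p(\theta)$ and $\nabla_\theta\log p(x_i|\theta)$, not part of any real library:

```python
import numpy as np

def sgld_step(theta, x_batch, N, grad_log_prior, grad_log_lik, eps, rng):
    """One SGLD update, eq (6): half-step-size gradient of an unbiased
    log-posterior estimate from a minibatch, plus N(0, eps_t) noise."""
    n = len(x_batch)
    grad = grad_log_prior(theta) + (N / n) * sum(grad_log_lik(x, theta) for x in x_batch)
    eta = np.sqrt(eps) * rng.standard_normal(np.shape(theta))   # eta_t ~ N(0, eps_t I)
    return theta + 0.5 * eps * grad + eta
```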

From Langevin Dynamics to Diffusion Models

Pitfalls with Unadjusted Langevin Dynamics

Assume now that we have access to an oracle $s_\theta \approx \nabla \log p_\infty = -\nabla U$ obtained through score matching9:

$$ \begin{align*} \theta^{\star} = & \arg \min \frac{1}{2}\mathbb{E}_{p_{\mathrm{data}}(x)} \left[||s_\theta (x)-\nabla_{x} \log p_{\mathrm{data}}(x)||_{2}^{2} \right] \\ = & \arg \min \mathbb{E}_{p_{\mathrm{data}}(x)}\left[\mathrm{tr}(\nabla_{x}{s}_\theta(x))+\frac{1}{2}||{s}_\theta(x)||_{2}^{2}\right], \label{sm}\tag{7} \end{align*} $$

are we safe to use the Langevin algorithm (\ref{Des_LD2}) for image generation in high dimensions? No, there are pitfalls10!

  • The manifold hypothesis: the score function $s_\theta$ is ill-defined when $x$ is confined to a low-dimensional manifold in a high-dimensional space, and the score matching objective in eq(\ref{sm}) provides an inconsistent score estimator when the data reside on such a manifold.
  • The scarcity of data in low-density regions causes difficulties both for score estimation with score matching and for MCMC sampling with Langevin dynamics (which suffers from slow mixing when the data manifold has well-separated regions).

Annealed Langevin Dynamics

To overcome these issues, Yang Song et al. proposed the Noise Conditional Score Network (NCSN) $s_\theta(x,\sigma)$, which learns the scores of the Gaussian-perturbed data distributions $p_{\sigma_i}(x) = \mathbb{E}_{x' \sim p(x')}\left[p_{\sigma_i}(x|x')\right] = \int p(x') \mathcal{N}(x; x', \sigma_i^2 I) \mathrm{d} x'$ at various noise levels $\sigma_i$. Here $\{\sigma_i\}^T_{i=1}$ is a positive geometric sequence satisfying $\frac{\sigma_1}{\sigma_2}=\cdots=\frac{\sigma_{T-1}}{\sigma_T}>1$. Since the distributions $\{p_{\sigma_i}\}^T_{i=1}$ are all perturbed by Gaussian noise, their supports span the whole space and their scores are well-defined, avoiding the difficulties of the manifold hypothesis. To generate samples with annealed Langevin dynamics, we start from the largest noise level $\sigma_1$ and anneal the noise level down to $\sigma_T\approx 0$:

$$ \begin{align*} x_{i+1} = x_i + {\alpha_{i+1}} s_\theta(x_i,\sigma_{i+1}) + \sqrt{2\alpha_{i+1}} z_i \label{ALD}\tag{8} \end{align*} $$

where $\alpha_{i}=\frac{\gamma}{2} \frac{\sigma_i^2}{\sigma_{i+1}^2}$ is the step-size. The only difference between eq(\ref{ALD}) and eq(\ref{Des_LD2}) is that the drift term $s_\theta(x_i,\sigma_{i+1}) \approx -\nabla U_i(x_i)$ is no longer fixed and changes with time.

The sampling algorithm of NCSN, which consists of a double loop, can be found in10. The outer loop runs over the noise levels, while the inner loop performs a number of Langevin steps at each level to ensure that the samples follow $p_{\sigma_i}$.
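A minimal sketch of this double-loop sampler; `score` stands for the NCSN $s_\theta(x,\sigma)$, `n_inner` is the number of Langevin steps per level, and I use the NCSN-style step-size $\alpha_i \propto \sigma_i^2$ (a slightly different schedule from the $\alpha_i$ above):

```python
import numpy as np

def annealed_langevin(score, x0, sigmas, n_inner, gamma, rng):
    """Annealed Langevin dynamics, eq (8): the outer loop anneals the noise
    level sigma_1 > ... > sigma_T, the inner loop runs Langevin steps."""
    x = np.asarray(x0, dtype=float)
    for sigma in sigmas:
        alpha = 0.5 * gamma * (sigma / sigmas[-1]) ** 2   # step-size shrinks with sigma
        for _ in range(n_inner):
            z = rng.standard_normal(x.shape)
            x = x + alpha * score(x, sigma) + np.sqrt(2.0 * alpha) * z
    return x
```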

To learn the score function of the perturbed data distribution, we can use the objective of denoising score matching9:

$$ \begin{align*} \arg\min_\theta \mathbb{E}_{x_i\sim p_{\sigma_i}(x_i)} \left[||s_\theta (x_i,\sigma_i)-\nabla_{x_i} \log p_{\sigma_i}(x_i)||_{2}^{2} \right] = \arg\min_\theta \mathbb{E}_{x,{x}_i \sim p_{\mathrm{data}}(x)p_{\sigma_i}({x}_i|{x})}\left[||s_\theta (x_i,\sigma_i)-\nabla_{x_i} \log p_{\sigma_i}(x_i|x)||_{2}^{2}\right] \label{dsm}\tag{9} \end{align*} $$

From score matching to denoising score matching, the difference is a constant term independent of $\theta$11:

$$ \mathbb{E}_{x_i\sim p_{\sigma_i}(x_i)}\left[{\mathbb{E}_{x\sim p_{\sigma_i}(x|x_i)}\left[\left\|\nabla_{x_i}\log p_{\sigma_i}(x_i|x)\right\|^2\right] - \left\|\nabla_{x_i}\log p_{\sigma_i}(x_i)\right\|^2}\right] $$

By matching the score, we are actually minimizing the Fisher divergence between the parameterized and the target (data) distributions, $\frac{1}{2}\mathbb{E}_{p_{\mathrm{data}}}\left[||\nabla\log p_{\mathrm{data}}(x)-\nabla\log p_{\theta}(x)||^2\right]$, which is closely related to the KL divergence $D_\mathrm{KL}(p_{\mathrm{data}}(x)\,||\,p_{\theta}(x))$.
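For the Gaussian perturbation kernel, the conditional score in eq(\ref{dsm}) has the closed form $\nabla_{x_i}\log p_{\sigma_i}(x_i|x) = -(x_i-x)/\sigma_i^2$, so one DSM loss evaluation can be sketched as follows (`score_net` is a hypothetical network taking the perturbed sample and the noise level):

```python
import numpy as np

def dsm_loss(score_net, x, sigma, rng):
    """Monte-Carlo estimate of the denoising score matching loss, eq (9)."""
    noise = rng.standard_normal(x.shape)
    x_tilde = x + sigma * noise                   # sample from p_sigma(x_tilde | x)
    target = -(x_tilde - x) / sigma**2            # = grad log N(x_tilde; x, sigma^2 I)
    diff = score_net(x_tilde, sigma) - target
    return np.mean(np.sum(diff**2, axis=-1))
```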

With the learnt scores $s_\theta(x,\sigma_i) \approx \nabla_x \log p_{\sigma_i}(x)$, we can generate samples according to eq(\ref{ALD}). When $\sigma_i$ is large, the modes of $p(x)$ are smoothed out by the Gaussian kernel and $s_\theta(x,\sigma_i)$ points towards the mean of the modes; as $\sigma_i$ is annealed down, the dynamics are attracted to the actual modes of the target distribution12.

In the next section, we will explore another family of sampling algorithms: reverse SDEs.

Score-based Stochastic Differential Equations

The continuous SDE form of eq(\ref{ALD}) can be written as:

$$ \begin{align*} \mathrm{d}{X_t}=\underbrace{-\nabla U_t(X_t)\mathrm{d}t}_{\text{drift term}} + \underbrace{\sqrt {2}\,\mathrm{d}{B_t}}_{\text{diffusion term}}, \quad X_0\sim p_0 \label{ALD2}\tag{10} \end{align*} $$

where $U_t\propto -\log p_t$ now also depends on time $t$, in contrast to $U\propto -\log p_\infty$ in eq(\ref{LD})13.

In another wonderful work by Yang Song et al.14, the authors construct the forward diffusion process, which maps the target distribution to a (usually) simple distribution, in a more general SDE form:

$$ \begin{align*} \mathrm{d}{x_t}={f_t(x_t)\mathrm{d}t} + {g_t\,\mathrm{d}{B_t}} \label{SDE}\tag{11} \end{align*} $$

Each pair of $f_t$ and $g_t$ defines a unique forward process and the corresponding marginals $p_t$, given the initial distribution.

From now on, we will flip the time convention, such that at $t=0$ we have the target distribution $p_0 = \pi$ and at $t=T$ we have a simple distribution $p_T$.

Furthermore, we can start from samples of the simple distribution $p_T$ and generate samples from the target distribution $p_0$ by following the dynamics of the corresponding reverse SDE:

$$ \begin{align*} \mathrm{d}{x_t} = \left[f_t(x_t) - g_t^2 \nabla_{x_t} \log p_t({x_t})\right]\mathrm{d}t + g_t\, \mathrm{d}{\bar{B}_t}, \quad x_T\sim p_T \label{reverseSDE}\tag{12} \end{align*} $$

This SDE is meant to be run backwards in time from $T$ to 0, so $\mathrm{d}t$ is an infinitesimal negative timestep. We can now sample from the target distribution using this reverse SDE, as long as we have access to $s_\theta \approx \nabla \log p_t$.
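Here is a minimal Euler-Maruyama sketch for the reverse SDE (\ref{reverseSDE}); `f` and `g` define the forward SDE of eq(\ref{SDE}) and `score(x, t)` stands for $s_\theta \approx \nabla \log p_t$ (all names are mine):

```python
import numpy as np

def reverse_sde_sample(f, g, score, x_T, T, n_steps, rng):
    """Integrate the reverse SDE, eq (12), from t = T down to t = 0."""
    dt = T / n_steps                              # magnitude of the (negative) timestep
    x = np.asarray(x_T, dtype=float)
    for k in range(n_steps):
        t = T - k * dt
        drift = f(x, t) - g(t) ** 2 * score(x, t)
        z = rng.standard_normal(x.shape)
        x = x - drift * dt + g(t) * np.sqrt(dt) * z   # minus: we step backwards in time
    return x
```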

Note that eq(\ref{ALD2}) and eq(\ref{reverseSDE}) are two different sampling strategies, and we can apply both together14 (a.k.a. the corrector and the predictor): we use the corrector to ensure $x_t \sim p_t$ and the predictor to jump to $p_{t-1}$.

Still, for $f_t=\nabla_{x} \log p_t({x})$ and $g_t=\sqrt{2}$, we recover eq(\ref{ALD2}) as a special case of eq(\ref{reverseSDE}) for backward sampling (recall that eq(\ref{reverseSDE}) runs backwards in time). But with this choice the forward diffusion does not lead to a simple Gaussian distribution that we can sample from easily. In other words, we no longer enjoy a fast sampling trajectory from a Gaussian to the target distribution in this scenario.

For score-based models (or diffusion models) with forward SDE in the form of eq(\ref{SDE}), we can write the corresponding FP equation in the form of the continuity equation15:

$$ \begin{align*} \partial_t p_t = -\nabla\cdot(f_t p_t) + \frac{g_t^2}{2}\Delta p_t = -\nabla\cdot((f_t - \frac{g_t^2}{2} \nabla\log p_t) p_t) \label{FP2}\tag{13} \end{align*} $$

where we define the vector field as $v_t = f_t - \frac{g_t^2}{2} \nabla\log p_t$, $g_t$ is loosely related to the temperature of the stationary distribution, and the limiting distribution $p_\infty$ is not necessarily Gaussian.

The corresponding Ordinary Differential Equation (ODE) of a particle moving along this vector field is the so-called probability flow ODE (PFODE)14:

$$ \begin{align*} \mathrm{d} x = \bigg[f_t(x) - \frac{1}{2}g_t^2 \nabla_{x} \log p_t({x})\bigg] \mathrm{d}t \quad \Longleftrightarrow \quad \frac{\mathrm{d}x}{\mathrm{d}t} = v_t \label{prob_ode} \tag{14} \end{align*} $$

And we can use this probability flow ODE to travel either forward or backward in time.

Given $f_t$ and $g_t$, which define the forward diffusion process, we can travel backward in time with either the learned score function $s_\theta \approx \nabla \log p_t$ or the learned vector field $v_\theta \approx f_t - \frac{g_t^2}{2} \nabla\log p_t$.
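And a deterministic counterpart: explicit Euler integration of the probability flow ODE (\ref{prob_ode}) backward in time, under the same hypothetical `f`, `g`, `score` interfaces as in the reverse-SDE sketch above:

```python
import numpy as np

def pfode_sample(f, g, score, x_T, T, n_steps):
    """Integrate the probability flow ODE, eq (14), from t = T down to t = 0;
    flipping the sign of the update travels forward in time instead."""
    dt = T / n_steps
    x = np.asarray(x_T, dtype=float)
    for k in range(n_steps):
        t = T - k * dt
        v = f(x, t) - 0.5 * g(t) ** 2 * score(x, t)   # the vector field v_t
        x = x - v * dt                                # explicit Euler step, backwards in time
    return x
```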

It’s worth noting that, in order to sample from a target distribution via the reverse SDE (\ref{reverseSDE}), the score functions at all times $t$ are explicitly required (whatever the form of $f_t$ and $g_t$), rather than only the score function of the target distribution. Moreover, my understanding of the success of diffusion models is that they explicitly define paths in both the data space and the space of probability measures: given the initial state $x_0$, for each time-step $t$ along the path, we know where we are ($p_t$) and where we are heading ($v_t$).


"