Bayesian Posterior Sampling (2) MCMC Basics

2023/06/26 Research Bayesian, Acceptance-Rejection, MCMC, Metropolis-Hastings

Introduction

In the previous post, I showed that we can use MCMC methods to draw samples from a target distribution $\pi$. In this post, I will present the most classical MCMC method, Metropolis-Hastings (MH), and its variants. Before turning to MCMC methods, I first want to familiarize readers with another Monte Carlo approach, the acceptance-rejection method, which can also generate samples from an (unnormalized) distribution $\tilde\pi$ and whose shortcomings motivate the use of Markov chains.

Acceptance-Rejection Method

Sampling directly from the target distribution $\pi(x)$ is often infeasible. The technique named acceptance-rejection sampling, also known as rejection sampling, provides a solution to this issue. The fundamental concept is to propose a different probability distribution $G$, with density function $g(x)$, which not only has an efficient sampling algorithm1 readily available, but also spans (at least) the same domain as $\tilde\pi(x)$ (otherwise, there would be parts of the area under $\tilde\pi(x)$ that could never be reached). It then seems natural to construct a decision rule2 that determines whether to accept samples from the proposal $g$, based on our knowledge of $\tilde\pi(x)$. This decision rule should ensure that the long-run fraction of accepted samples at each state is proportional to $\tilde\pi(x)$.

Let's first have a look at the algorithm, which is quite simple: repeat the following two steps until enough samples have been accepted:

  1. sample $x$ from proposal $g$.
  2. accept with probability $p_{\mathrm{accept}}(x) = \frac{\tilde\pi(x)}{Mg(x)}$.

where $M$ is a sufficiently large constant ensuring that $\frac{\tilde\pi(x)}{Mg(x)}$ can be interpreted as a probability, that is, $0 \le p_{\mathrm{accept}}(x) \le 1$ for all values of $x$. This also requires $g(x)>0$ whenever $\tilde\pi(x)>0$, as mentioned above.
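
To make this concrete, here is a minimal sketch of rejection sampling in Python. The target $\tilde\pi(x) = x^2(1-x)$ on $[0,1]$ (an unnormalized Beta(3, 2) density), the uniform proposal, and the bound $M = 0.15$ are illustrative choices, not from the post; any $M \ge \max_x \tilde\pi(x)/g(x) = 4/27$ would do.

```python
import numpy as np

rng = np.random.default_rng(0)

def pi_tilde(x):
    # Unnormalized target: a Beta(3, 2)-shaped density on [0, 1]
    # (an illustrative choice).
    return x**2 * (1.0 - x)

M = 0.15  # must satisfy M * g(x) >= pi_tilde(x); here max pi_tilde = 4/27 ~ 0.148

def rejection_sample(n):
    samples, attempts = [], 0
    while len(samples) < n:
        attempts += 1
        x = rng.uniform(0.0, 1.0)            # 1. sample x from proposal g = Uniform(0, 1)
        if rng.uniform() < pi_tilde(x) / M:  # 2. accept with prob pi_tilde(x) / (M * g(x)), g(x) = 1
            samples.append(x)
    return np.array(samples), attempts

xs, attempts = rejection_sample(10_000)
print(xs.mean())           # close to E[Beta(3, 2)] = 3/5
print(len(xs) / attempts)  # empirical acceptance rate, close to Z / M = (1/12) / 0.15
```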

Let's now establish that the accepted samples indeed come from the distribution $\pi$ (in fact exactly, not merely asymptotically). Denote the event of accepting a sample by $A$. Then, for any given sample $s$, we can state that:

$$ \begin{align*} p(s|A) &= \frac{p(A|s)p(s)}{p(A)} = \frac{p_{\mathrm{accept}}(s)g(s)}{\int p_{\mathrm{accept}}(x)g(x) dx} = \frac{\tilde\pi(s)/M}{\int \tilde\pi(x)/M dx} = \frac{\tilde\pi(s)}{\int \tilde\pi(x) dx} = \pi(s) \label{AR}\tag{1} \end{align*} $$

The denominator $p(A)$ is the unconditional acceptance probability and equals $\frac{Z}{M}$, where $Z = \int \tilde\pi(x)\,dx$ is the normalizing constant (for a more thorough explanation, please refer to the detailed derivation available at this wiki):

$$ \begin{align*} p(A) = \int p(A|x)p(x)dx = \int \frac{\tilde\pi(x)}{M} dx = \frac{Z}{M} \label{denominator}\tag{2} \end{align*} $$

The unconditional acceptance probability $p(A)$ tells us that we need on average $\frac{M}{Z}$ draws to get one accepted sample, which can be very inefficient when $M$ is large. In particular, when the random variable $x$ is high-dimensional, rejection sampling suffers from the “curse of dimensionality”3.

Moreover, in the acceptance-rejection method, each draw is independent, and information from the last draw is discarded completely (e.g., if the last draw was accepted, we may want to explore its neighborhood). It is natural to consider adding some correlation between successive samples, and this is where the Markov chain (or the Markov transition kernel $p(x'|x)$) is introduced.

Recap of Markov Chain

A (discrete-time) Markov chain is a sequence of random variables $X_1, X_2, X_3, …$ with the Markov property:

$$ \begin{align*} p(X_{t}=x_{t}| X_{t-1}=x_{t-1},\dots ,X_{0}=x_{0})=p(X_{t}=x_{t}| X_{t-1}=x_{t-1}) \label{Markov_property}\tag{3} \end{align*} $$

and a (time-homogeneous) Markov process is uniquely defined by its initial distribution together with its transition probabilities $p(x'|x)$.

In our case, we always want the Markov chain to have a unique stationary distribution that is indeed $\pi$.

Markov Chain Properties

  • Time-homogeneity: the prior $p(X_1)$ is fixed and the transition probabilities $p(X_{t+1} | X_{t})=p(x'|x)$ are independent of $t$.
  • Ergodic Markov chains: there exists a finite $t$ such that every state can be reached from every state in exactly $t$ steps.
  • Stationary distribution: $\lim\limits_{T \rightarrow \infty} p\left(X_{T}=x\right)=\pi(x)$, which is independent of the initial distribution $p(X_1)$ (see the sketch below).
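
As a quick numerical illustration (with made-up numbers, not from the post), iterating $p(X_{t+1}) = p(X_t)P$ for an ergodic transition matrix $P$ converges to the same stationary distribution regardless of the initial distribution:

```python
import numpy as np

# A 3-state transition matrix (rows sum to 1; illustrative numbers).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])

# Two very different initial distributions p(X_1).
p = np.array([1.0, 0.0, 0.0])
q = np.array([0.0, 0.0, 1.0])

for _ in range(100):  # p(X_{t+1}) = p(X_t) P
    p = p @ P
    q = q @ P

print(p)  # both converge to the same stationary distribution
print(q)
```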

Detailed Balance Equation

A chain satisfies detailed balance4 if we have:

$$ \begin{align*} \pi(\mathbf{x}) p\left(\mathbf{x}^{\prime} | \mathbf{x}\right)=\pi\left(\mathbf{x}^{\prime}\right) p\left(\mathbf{x} | \mathbf{x}^{\prime}\right) \label{Detailed_Balance}\tag{4} \end{align*} $$

This means the in-flow to state $x'$ from $x$ (the probability of being in state $x$ times the transition probability from $x$ to $x'$) is equal to the out-flow from state $x'$ back to $x$, and vice versa (or: at stationarity, each elementary process is in equilibrium with its reverse process). If a chain with transition probability $p\left(\mathbf{x}^{\prime} | \mathbf{x}\right)$ satisfies the detailed balance above, then $\pi(\mathbf{x})$ is its stationary distribution. It is worth mentioning that detailed balance is in general a sufficient but not necessary condition for stationarity.
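
Here is a small sketch (with an arbitrary 3-state target) verifying numerically that a kernel constructed to satisfy detailed balance, here via a Metropolis-style acceptance with a symmetric proposal, indeed has $\pi$ as its stationary distribution:

```python
import numpy as np

# A 3-state target distribution (arbitrary illustrative numbers).
pi = np.array([0.2, 0.3, 0.5])

# Symmetric proposal: pick one of the other two states uniformly.
# Acceptance min(1, pi[j] / pi[i]) yields a kernel P satisfying
# detailed balance with respect to pi by construction.
P = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        if i != j:
            P[i, j] = 0.5 * min(1.0, pi[j] / pi[i])
    P[i, i] = 1.0 - P[i].sum()  # leftover mass: stay at i

# Detailed balance: pi_i P_ij == pi_j P_ji for all i, j.
flows = pi[:, None] * P
assert np.allclose(flows, flows.T)

# Hence pi is stationary: pi P == pi.
assert np.allclose(pi @ P, pi)
```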

Metropolis-Hastings Methods

In the previous section, we learned that the key ingredients of rejection sampling are a proposal and a decision rule. These principles remain applicable in MH methods. In MH, instead of generating samples independently from a proposal $g$, the proposal (with the Markov property) $g(x_{t+1}|x_t)$ generates a new candidate based on the current sample value (so the samples are correlated). And we still need a decision rule to ensure that $\pi$ is the stationary distribution.

Again, let’s have a look at the algorithm first, which is similar to rejection sampling. For each current state $x_t$:

  1. sample $x'$ from the proposal $g(x'|x=x_t)$.
  2. accept $x_{t+1}=x'$ with probability $p_{\mathrm{accept}}(x') := p_{\mathrm{accept}}(x'|x)$; set $x_{t+1}=x_{t}$ with probability $1-p_{\mathrm{accept}}(x')$.

where the probability $p_{\mathrm{accept}}(x')$ is defined as:

$$ \begin{align*} p_{\mathrm{accept}}(x') = \min (1,\alpha) =\min \left(1, \frac{\tilde\pi\left(x^{\prime}\right) g\left(x | x^{\prime}\right)}{\tilde\pi(x) g\left(x^{\prime} | x\right)}\right) \label{accept_probability}\tag{5} \end{align*} $$

The transition probability of the Markov chain defined in the MH algorithm can be written as:

$$ \begin{align*} p(x'|x)=\begin{cases} g(x'|x)p_{\mathrm{accept}}(x'|x), & \text{if } x'\neq x \\ g(x|x)+\sum_{x'\neq x}g(x'|x)(1-p_{\mathrm{accept}}(x'|x)), & \text{otherwise} \end{cases} \label{transition_kernel}\tag{6} \end{align*} $$

We need to ascertain that our proposal together with the decision rule satisfies the detailed balance equation (\ref{Detailed_Balance}). To accomplish this, we can verify that the two sides of the equation are equal using the target distribution $\pi$ and the transition probability $p(x'|x)$ defined in eq(\ref{transition_kernel}); the key step (left as an exercise) is that for $x' \neq x$, $\pi(x)p(x'|x)=\frac{1}{Z}\min\left(\tilde\pi(x)g(x'|x),\ \tilde\pi(x')g(x|x')\right)$, which is symmetric in $x$ and $x'$.
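
The algorithm above translates almost line by line into code. Below is a minimal sketch of a generic MH loop; the function names and the use of log densities for numerical stability are my choices, not from the post:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_hastings(log_pi_tilde, propose, log_g, x0, n_steps):
    """Generic MH loop: a sketch using log densities for numerical stability.

    log_pi_tilde(x) -- log of the unnormalized target pi_tilde
    propose(x)      -- draw a candidate x' ~ g(.|x)
    log_g(xp, x)    -- log g(x'|x), needed only up to an additive constant
    """
    x = x0
    chain = [x]
    for _ in range(n_steps):
        xp = propose(x)
        # log alpha = log[pi_tilde(x') g(x|x')] - log[pi_tilde(x) g(x'|x)]
        log_alpha = (log_pi_tilde(xp) + log_g(x, xp)
                     - log_pi_tilde(x) - log_g(xp, x))
        if np.log(rng.uniform()) < min(0.0, log_alpha):
            x = xp        # accept the candidate ...
        chain.append(x)   # ... otherwise x stays at the current state
    return np.array(chain)
```

Note that only $\tilde\pi$ enters: the normalizing constant $Z$ cancels in the ratio $\alpha$, which is precisely why MH works with unnormalized targets.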

Rejection Sampling as Special Case

When the proposal $g(x'|x)$ is independent of the previous state $x_t$, in other words $g(x'|x) = g(x')$ is an independence sampler, we recover something closely resembling the rejection sampling introduced at the beginning; note, however, that in MH a rejected candidate makes the chain repeat the current state rather than being discarded.

Random Walk Metropolis-Hastings5

When the proposal $g(x'|x)$ is Gaussian, we get random walk Metropolis-Hastings (RWM), where the new state is a random walk step from the previous state. This proposal can be written as follows:

$$ \begin{align*} g(x'|x) = \mathcal{N}(x';x,\tau^2\mathbf{I}) \label{Random_Walk}\tag{7} \end{align*} $$

where $\tau$ is a scale factor chosen to facilitate rapid mixing, often referred to as the random walk step size. This step size plays a crucial role in balancing the trade-off between exploration, which is essential for covering the distribution, and maintaining an acceptable acceptance rate.

RR01b prove that, if the (target) posterior is Gaussian, the asymptotically optimal value is to use $\tau^2 = 2.38^2/D$, where $D$ is the dimensionality of $x$; this results in an acceptance rate of 0.234, which (in this case) is the optimal tradeoff …

—— Probabilistic Machine Learning: Advanced Topics by Kevin Patrick Murphy. MIT Press, 2023.

Since the proposal $g(x'|x)$ is now symmetric, with $g(x'|x)=g(x|x')$, the acceptance probability simplifies to $p_{\mathrm{accept}}(x') = \min \left(1, \frac{\tilde\pi\left(x^{\prime}\right)}{\tilde\pi(x)}\right)$. This simplified acceptance probability is easy to interpret: if $x'$ is more probable than $x$, we unquestionably transition there; otherwise, we may still move there by chance, depending on the relative probabilities.
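
Continuing from the `metropolis_hastings` sketch above, RWM only needs a Gaussian proposal; since the proposal ratio cancels, `log_g` can simply return a constant. The standard-normal target and $\tau = 1$ are illustrative choices:

```python
# Random walk MH for a 1D standard-normal target (illustrative).
log_pi_tilde = lambda x: -0.5 * x**2           # log of unnormalized N(0, 1)
tau = 1.0                                      # random walk step size

chain = metropolis_hastings(
    log_pi_tilde,
    propose=lambda x: x + tau * rng.normal(),  # x' ~ N(x, tau^2)
    log_g=lambda xp, x: 0.0,                   # symmetric proposal: ratio cancels
    x0=0.0,
    n_steps=20_000,
)
print(chain.mean(), chain.var())               # roughly 0 and 1
```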

However, in high-dimensional spaces6, this classical MH algorithm breaks down: the chain can remain stuck without moving for a long time7,8. Due to the concentration of measure, most of the probability mass concentrates in the typical set, a narrow shell located away from the (high-density) mode. Because the ratio between the volume of the n-ball and that of the n-cube tends to 06 (so there is almost zero probability of a point in the cube falling inside the ball), almost all proposed jumps land outside the typical set and away from the mode, and are therefore rejected for mode-seeking algorithms such as MH.

Figure 1: Visual illustration of random walk Metropolis-Hastings in 2D space9.

Metropolis-Adjusted Langevin Algorithm (MALA)

The well-known Metropolis-Adjusted Langevin Algorithm, which we will revisit in the next post, can be viewed as MH with an improved proposal:

$$ \begin{align*} g(x'|x) = \mathcal{N}(x';x+\tau \nabla \log\tilde{\pi}(x),2\tau I) \label{MALA}\tag{8} \end{align*} $$

where the additional drift term $\tau \nabla \log\tilde{\pi}(x)$ pushes the samples toward areas of high probability density, which makes the proposed samples more likely to be accepted.
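
Plugging this into the `metropolis_hastings` sketch only requires the drifted proposal and its (asymmetric) log density, up to an additive constant that cancels in $\alpha$. Reusing `log_pi_tilde` and `rng` from above, with the standard-normal target (so $\nabla \log\tilde\pi(x) = -x$) and $\tau = 0.1$ as illustrative choices:

```python
tau = 0.1  # Langevin step size (illustrative)

def grad_log_pi_tilde(x):
    return -x  # gradient of log pi_tilde for a standard-normal target

def mala_propose(x):
    # Drift along grad log pi_tilde, plus Gaussian noise with variance 2 * tau.
    return x + tau * grad_log_pi_tilde(x) + np.sqrt(2.0 * tau) * rng.normal()

def mala_log_g(xp, x):
    # log N(x'; x + tau * grad log pi_tilde(x), 2 * tau), up to a constant
    mean = x + tau * grad_log_pi_tilde(x)
    return -((xp - mean) ** 2) / (4.0 * tau)

chain = metropolis_hastings(log_pi_tilde, mala_propose, mala_log_g,
                            x0=0.0, n_steps=20_000)
```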

Remarks

In this post, we mainly discussed the basic ideas of MCMC methods, especially the propose-accept scheme of the Metropolis-Hastings algorithm. In the next posts, we will explore a more intuitive family of methods: dynamics-based MCMC.


"