17 The basics of Markov chain Monte Carlo
Before we jump into using Stan, our go-to MCMC sampling software, we need to lay some groundwork, first understanding why we want to sample out of arbitrary distributions in the context of Bayesian inference. We then go over the basic ideas of MCMC.
17.1 Why MCMC?
When doing Bayesian inference, our goal is very often to understand a posterior distribution, \(g(\theta \mid y)\), where \(\theta = \{\theta^{(1)}, \theta^{(2)},\ldots\}\) is a set of possibly many parameters and \(y\) is the data set.1 However, just having an analytical expression for the posterior is of little use if we cannot compute any of its properties. Importantly, we often want to marginalize the posterior; that is, we want to integrate over the parameters we are not immediately interested in to get distributions only for those we are. This is necessary for understanding all but the simplest models. These marginalizations require computing (nasty) integrals, which is often impossible to do analytically.
Furthermore, we almost always want to compute statistical functionals, and expectations in particular, of the posterior. For example, we might want the mean, or expectation value, of parameter \(\theta^{(1)}\). If we know the posterior, this is
\[\begin{aligned} \langle \theta^{(1)}\rangle = \int \mathrm{d}\theta \,\theta^{(1)}\,g(\theta\mid y). \end{aligned} \tag{17.1}\]
Generally, we can compute the expectation of any function of the parameters, \(h(\theta)\), and we often want to. This is
\[\begin{aligned} \langle h(\theta)\rangle = \int \mathrm{d}\theta \,h(\theta)\,g(\theta\mid y). \end{aligned} \tag{17.2}\]
So, most things we want to know about the posterior require computation of an integral.
Markov chain Monte Carlo (MCMC) allows us to sample out of an arbitrary probability distribution, which includes pretty much any posterior we could write down.2 By sampling, I mean that we can choose values of the parameters, \(\theta\), where the probability that we choose a given value is proportional to the posterior probability or probability density. Note that each sample consists of a complete set of parameters \(\theta\); that is, a sample contains a value for \(\theta^{(1)}\), a value for \(\theta^{(2)}\), and so on.
Using MCMC, we can collect a large number of these samples. From these samples, we can trivially perform marginalizations. Say we are interested in the marginalized distribution
\[\begin{aligned} g(\theta^{(1)}\mid y) = \left[\int\mathrm{d}\theta^{(2)} \, \int \mathrm{d}\theta^{(3)}\cdots \right] g(\theta\mid y). \end{aligned} \tag{17.3}\]
Given a set of MCMC samples out of \(g(\theta\mid y)\), to get a set of samples out of \(g(\theta^{(1)}\mid y)\), we simply ignore the values of \(\theta^{(2)}\), \(\theta^{(3)}\), ...! Then, given the samples of the marginalized posterior, we can plot the CDF of the marginalized posterior as an ECDF of the samples, and the PDF of the marginalized posterior as a histogram of the samples.
To compute other expectations, the MCMC samples are again very convenient; we simply approximate the integral with an average over the samples,
\[\begin{aligned} \langle h(\theta)\rangle = \int \mathrm{d}\theta \,h(\theta)\,g(\theta\mid y) \approx \frac{1}{N}\sum_{i=1}^N h(\theta_i), \end{aligned} \tag{17.4}\]
where \(\theta_i\) is the \(i\)th of \(N\) MCMC samples taken from the posterior.
Finally, we can compute other statistical functionals, such as quantiles, directly from the samples. For example, the median of \(\theta^{(1)}\) is found by computing the median value of all of the samples of \(\theta^{(1)}\).
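As a concrete illustration, here is a minimal sketch of these computations, assuming the samples are stored as a NumPy array with one row per sample and one column per parameter; the array `samples` and the function `h` are stand-ins for illustration, not output of a real sampler.

```python
import numpy as np

# Stand-in array of MCMC samples: samples[i, j] is the ith sample of θ^(j).
# In practice this array would come from your MCMC sampler.
rng = np.random.default_rng(seed=3252)
samples = rng.normal(size=(10_000, 3))

# Marginalizing over θ^(2), θ^(3), ... amounts to ignoring those columns.
theta1 = samples[:, 0]

# Expectation of θ^(1): an average over samples (Eq. 17.4 with h(θ) = θ^(1)).
mean_theta1 = np.mean(theta1)

# Expectation of an arbitrary function h(θ); here, hypothetically, θ^(1)·θ^(2).
h = lambda th: th[:, 0] * th[:, 1]
expect_h = np.mean(h(samples))

# Quantiles come directly from the samples, e.g., the median of θ^(1) and a
# 95% credible interval.
median_theta1 = np.median(theta1)
cred_int = np.percentile(theta1, [2.5, 97.5])

# The ECDF of the samples approximates the CDF of the marginal posterior.
x = np.sort(theta1)
ecdf = np.arange(1, len(x) + 1) / len(x)
```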
It is now abundantly clear why the ability to generate samples from the posterior is so powerful. But generating samples that actually come from the probability distribution of interest is not a trivial matter. We will discuss how this is accomplished through MCMC.
17.2 The basic idea behind MCMC
Our goal is to draw independent samples from a target distribution. As we have seen, in order to get independent samples, we need to be able to define a transformation that converts draws from a Uniform distribution into draws from the target distribution. Unfortunately, we can almost never derive such a transformation for an arbitrary distribution, including the posterior distributions we get from Bayesian models.
But the samples need not be independent! Instead, we only need them to come from a process that generates samples from the target distribution in the correct proportions. If we draw enough3 of them, we do not need to worry about the lack of independence. In the case of the parameter estimation problem, the distribution we wish to draw from is the posterior distribution \(g(\theta\mid y)\). For notational simplicity in what follows, we will use \(P(\theta)\) as shorthand for the target distribution. The techniques I discuss are not limited to sampling out of posteriors; we can sample out of any distribution.
The approach of MCMC is to take random walks in parameter space such that **the probability that a walker arrives at point \(\theta\) is proportional to \(P(\theta)\)**. This is the main concept. You should re-read that bolded sentence.
If we can achieve such a walk, we can just take the walker positions as samples from the distributions. To implement this random walk, we define a transition kernel, \(T(\theta_{i+1}\mid \theta_i)\), the probability of a walker stepping from position \(\theta_i\) in parameter space to position \(\theta_{i+1}\). The transition kernel defines a Markov chain, which you can think of as a random walker whose next step depends only on where the walker is right now; i.e., it has no memory. Where the walker goes for step \(i+1\) depends only on where it was at step \(i\).
The condition that the probability of arrival at point \(\theta_{i+1}\) is proportional to \(P(\theta_{i+1})\) may be stated as
\[\begin{aligned} P(\theta_{i+1}) = \int \mathrm{d}\theta_i\, T(\theta_{i+1}\mid \theta_i)\, P(\theta_i). \end{aligned} \tag{17.5}\]
Here, we have taken \(\theta\) to be continuous. Were it discrete, we just replace the integral with a sum. When this relation holds, it is said that the target distribution is an invariant distribution or stationary distribution of the transition kernel. When this invariant distribution is unique, it is called a limiting distribution. We want to choose our transition kernel \(T(\theta_{i+1}\mid \theta_i)\) such that the target distribution \(P(\theta)\) is limiting. This is the case if the above equation holds and the chain is ergodic. An ergodic Markov chain has the following properties:
- It is aperiodic. A periodic Markov chain can only return to a given point in parameter space after \(k\), \(2k\), \(3k,\ldots\) steps, where \(k\) is the period. An aperiodic chain is not periodic.
- It is irreducible, which means that any point in parameter space is accessible to the walker from any other.
- It is positive recurrent, which means that the expected number of steps for the walker to return to any given point in parameter space is finite.
So, if our transition kernel satisfies this checklist and the invariance condition, it will eventually sample the posterior distribution. We will discuss how to come up with such a transition kernel in a moment; first we focus on the important concept of “eventually” in the preceding sentence.
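To make the invariance condition concrete, here is a small numerical sketch for a discrete parameter, where the integral in Equation 17.5 becomes a sum and the transition kernel becomes a matrix. The three-state transition matrix below is made up purely for illustration.

```python
import numpy as np

# A made-up column-stochastic transition matrix for a three-state chain;
# T[j, i] is the probability of stepping from state i to state j.
T = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.6, 0.3],
              [0.2, 0.2, 0.4]])

# The invariant distribution satisfies P = T @ P (Eq. 17.5 with a sum over
# states). It is the eigenvector of T with eigenvalue one, normalized to sum to one.
eigvals, eigvecs = np.linalg.eig(T)
P = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
P /= P.sum()

print(np.allclose(T @ P, P))  # True: P is an invariant distribution

# Because this little chain is ergodic, repeatedly applying T from any
# starting distribution converges to P, i.e., P is the limiting distribution.
p = np.array([1.0, 0.0, 0.0])
for _ in range(100):
    p = T @ p
print(np.allclose(p, P))  # True
```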
17.3 Warming-up an MCMC sampler
Imagine for a moment that we devised a transition kernel that satisfies the desired properties we have laid out. Say we start a walker at position \(\theta_0\) in parameter space and it starts walking according to the transition kernel. Most likely, for those first few steps, the walker is traversing a part of parameter space that has incredibly low probability. Once it gets to regions of high probability, the walker will almost never return to the region of parameter space in which it began. So, unless we sample for an incredibly long time, those first few samples are over-weighted. We therefore need to let the walker walk for a while without keeping track of the samples so that it can arrive at the limiting distribution. This is called warm-up, otherwise known as burn-in or tuning4.
There is no general way to tell if a walker has reached the limiting distribution, so we do not know how many warm-up steps are necessary. Fortunately, there are several heuristics. For example, Gelman and coauthors proposed generating several warm-up chains and computing the Gelman-Rubin \(\hat{R}\) statistic,
\[\begin{aligned} \hat{R} = \frac{\text{variance between the chains}}{\text{mean variance within the chains}}.\end{aligned} \tag{17.6}\]
Limiting chains have \(\hat{R} \approx 1\), and this is commonly used as a metric for having achieved stationarity. Gelman and his coauthors in their famous book Bayesian Data Analysis suggest that \(|1 - \hat{R}| < 0.1\) as a good rule of thumb for stationary chains.
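As a sketch of how this is computed, the snippet below implements the simplified \(\hat{R}\) of Equation 17.6 for a single parameter, assuming the draws are stored as a 2D NumPy array with one row per chain and using the Bayesian Data Analysis definitions of the between- and within-chain variances. This is not the refined, rank-normalized \(\hat{R}\) of the paper discussed next.

```python
import numpy as np

def simple_rhat(chains):
    """Simplified R-hat of Eq. 17.6 for a single parameter.

    chains: array of shape (n_chains, n_draws). The between-chain variance is
    n_draws times the variance of the per-chain means; the within-chain
    variance is the average of the per-chain variances.
    """
    n_chains, n_draws = chains.shape
    between = n_draws * np.var(np.mean(chains, axis=1), ddof=1)
    within = np.mean(np.var(chains, axis=1, ddof=1))
    return between / within

# Stand-in "chains" whose draws really do come from a common distribution;
# the resulting R-hat is near one, up to sampling noise from having few chains.
rng = np.random.default_rng(seed=3252)
chains = rng.normal(size=(4, 2000))
print(simple_rhat(chains))
```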
This simple metric, though widely used, has flaws. We will not go into them in great detail here, but will defer discussion of MCMC diagnostics to a later lecture. You can read more about ensuring chains reach stationarity in this paper by Vehtari and coworkers (2020). We will use the methods outlined in that paper, along with their more stringent suggestion that \(|1 - \hat{R}| < 0.01\).
17.4 Generating a transition kernel: The Metropolis-Hastings algorithm
The Metropolis-Hastings algorithm covers a widely used class of algorithms for MCMC sampling. I will first state the algorithm here, and then will show that it satisfies the necessary conditions for the walkers to be sampling out of the target posterior distribution.
17.4.1 The algorithm/kernel
Say our walker is at position \(\theta_i\) in parameter space.
We randomly choose a candidate position \(\theta'\) to step to next from an arbitrary proposal distribution \(K(\theta'\mid \theta_i)\).
We compute the Metropolis ratio,
\[\begin{aligned} r = \frac{P(\theta')\,K(\theta_i\mid \theta')} {P(\theta_i)\,K(\theta'\mid \theta_i)}. \end{aligned} \tag{17.7}\]
If \(r \ge 1\), accept the step and set \(\theta_{i+1} = \theta'\). Otherwise, accept the step with probability \(r\). If we do not accept the step, set \(\theta_{i+1} = \theta_i\).
The last two steps are used to define the transition kernel \(T(\theta_{i+1}\mid \theta_i)\). We can define the acceptance probability of the proposal step as
\[\begin{aligned} \alpha(\theta_{i+1}\mid \theta_i) = \min(1, r) = \min\left(1, \frac{P(\theta_{i+1})\,K(\theta_i\mid \theta_{i+1})} {P(\theta_i)\,K(\theta_{i+1}\mid \theta_i)}\right). \end{aligned} \tag{17.8}\]
Then, the transition kernel is
\[\begin{aligned} T(\theta_{i+1}\mid \theta_i) = \alpha(\theta_{i+1}\mid \theta_i)\,K(\theta_{i+1}\mid \theta_i). \end{aligned} \tag{17.9}\]
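To make the algorithm concrete, here is a minimal sketch of a Metropolis-Hastings sampler, assuming a symmetric Gaussian proposal (so the \(K\) terms cancel out of the Metropolis ratio) and working with the log of the target density to avoid numerical underflow. The function names and the example target are hypothetical, not part of any particular library.

```python
import numpy as np

def metropolis_hastings(log_target, theta_0, n_steps, step_size=0.1, seed=3252):
    """Sample a target distribution with Metropolis-Hastings using a symmetric
    Gaussian proposal, for which K(θ'|θ_i) = K(θ_i|θ') and the Metropolis
    ratio reduces to P(θ')/P(θ_i)."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta_0, dtype=float)
    log_p = log_target(theta)
    samples = np.empty((n_steps, theta.size))

    for i in range(n_steps):
        # Propose a candidate from a Gaussian centered at the current position.
        theta_prop = theta + step_size * rng.normal(size=theta.size)
        log_p_prop = log_target(theta_prop)

        # Accept with probability min(1, r); in log space, compare log r to log u.
        if np.log(rng.uniform()) < log_p_prop - log_p:
            theta, log_p = theta_prop, log_p_prop

        samples[i] = theta

    return samples

# Hypothetical target: a standard bivariate Normal, log density up to a constant.
log_target = lambda theta: -0.5 * np.sum(theta**2)
samples = metropolis_hastings(log_target, theta_0=[0.0, 0.0], n_steps=50_000)
```

Note that a sampler like this would still need warm-up and tuning of `step_size` before the samples are trusted; Stan handles these details automatically.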
17.4.2 Detailed balance
This algorithm seems kind of nuts! How on earth does this work? To investigate this, we consider the joint probability, \(P(\theta_{i+1}, \theta_i)\), that the walker is at \(\theta_i\) and \(\theta_{i+1}\) at sequential steps. We can write this in terms of the transition kernel,
\[\begin{aligned} P(\theta_{i+1}, \theta_i) &= P(\theta_i)\,T(\theta_{i+1}\mid \theta_i) \nonumber \\[1em] &= P(\theta_i)\,\alpha(\theta_{i+1}\mid \theta_i)\,K(\theta_{i+1}\mid \theta_i) \nonumber \\[1em] &= P(\theta_i)\,K(\theta_{i+1}\mid \theta_i)\,\min\left(1, \frac{P(\theta_{i+1})\,K(\theta_i\mid \theta_{i+1})} {P(\theta_i)\,K(\theta_{i+1}\mid \theta_i)}\right) \nonumber \\[1em] &= \min\left[P(\theta_i)\,K(\theta_{i+1}\mid \theta_i), P(\theta_{i+1})\,K(\theta_i\mid \theta_{i+1})\right] \nonumber \\[1em] &= P(\theta_{i+1})\,K(\theta_i\mid \theta_{i+1})\,\min\left(1,\frac{P(\theta_i)\,K(\theta_{i+1}\mid \theta_i)}{P(\theta_{i+1})\,K(\theta_i\mid \theta_{i+1})}\right) \nonumber \\[1em] &= P(\theta_{i+1})\, \alpha(\theta_i\mid \theta_{i+1})\,K(\theta_i\mid \theta_{i+1})\, \nonumber \\[1em] &= P(\theta_{i+1})\, T(\theta_i\mid \theta_{i+1}). \end{aligned} \tag{17.10}\]
Thus, we have
\[\begin{aligned} P(\theta_i)\,T(\theta_{i+1}\mid \theta_i) = P(\theta_{i+1})\, T(\theta_i\mid \theta_{i+1}). \end{aligned} \tag{17.11}\]
This says that the rate of transition from \(\theta_i\) to \(\theta_{i+1}\) is equal to the rate of transition from \(\theta_{i+1}\) to \(\theta_i\). In this case, the transition kernel is said to satisfy detailed balance.
Any transition kernel that satisfies detailed balance has \(P(\theta)\) as an invariant distribution. This is easily shown.
\[\begin{aligned} \int \mathrm{d}\theta_i \,P(\theta_i)\,T(\theta_{i+1}\mid \theta_i) &= \int \mathrm{d}\theta_i\,P(\theta_{i+1})\, T(\theta_i\mid \theta_{i+1}) \nonumber \\[1em] &= P(\theta_{i+1})\left[\int \mathrm{d}\theta_i\, T(\theta_i\mid \theta_{i+1})\right] \nonumber \\[1em] &= P(\theta_{i+1}), \end{aligned} \tag{17.12}\]
since the bracketed term is unity because the transition kernel is a probability density5.
Note that all transition kernels that satisfy detailed balance have an invariant distribution. (If the chain is ergodic, this is a limiting distribution.) But not all kernels that have an invariant distribution satisfy detailed balance. So, detailed balance is a sufficient condition, but not a necessary one, for a transition kernel to have an invariant distribution.
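As a quick sanity check of the algebra above, here is a small numerical sketch for a discrete parameter (where the kernel is a matrix; see the footnote on the discrete case), verifying that a Metropolis-Hastings kernel built from a made-up target and proposal satisfies detailed balance.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)
n = 4

# Made-up discrete target P and proposal K; K[j, i] is the probability of
# proposing state j from state i, with each column normalized to one.
P = rng.uniform(size=n)
P /= P.sum()
K = rng.uniform(size=(n, n))
K /= K.sum(axis=0)

# Build the Metropolis-Hastings transition kernel T from Eqs. 17.8 and 17.9
# for distinct states; the diagonal absorbs the rejected proposals.
T = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            alpha = min(1.0, P[j] * K[i, j] / (P[i] * K[j, i]))
            T[j, i] = alpha * K[j, i]
    T[i, i] = 1.0 - T[:, i].sum()

# Detailed balance (Eq. 17.11): P(θ_i) T(θ_{i+1}|θ_i) = P(θ_{i+1}) T(θ_i|θ_{i+1}),
# which here means the matrix lhs[j, i] = P[i] * T[j, i] is symmetric.
lhs = P[np.newaxis, :] * T
print(np.allclose(lhs, lhs.T))  # True
```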
17.4.3 Choosing the transition kernel
There is an art to choosing the transition kernel, which in practice means choosing the proposal distribution. The original Metropolis algorithm (1953) used a symmetric proposal distribution, \(K(\theta_{i+1}\mid \theta_i) = K(\theta_i\mid \theta_{i+1})\), so that the Metropolis ratio reduces to \(r = P(\theta')/P(\theta_i)\). As a rule of thumb, you want to choose a proposal distribution such that you get an acceptance rate of about 0.4. If you accept nearly every step, the proposed steps are too small, so the walker diffuses slowly and takes a long time to reach the limiting distribution. If you reject too many steps, the proposed steps are too large, so the walker rarely moves, and it again takes a long time to reach the limiting distribution. There are tricks to “tune” the proposal distribution to achieve the target acceptance rate.
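One common heuristic, sketched below under the same assumptions as the Metropolis-Hastings sampler above (symmetric Gaussian proposal, hypothetical function names), is to adjust the proposal scale during warm-up based on the recent acceptance fraction and then freeze it before collecting samples. Adapting the kernel while keeping samples would change the transition kernel on the fly, which is one reason tuning is confined to warm-up.

```python
import numpy as np

def tune_step_size(log_target, theta_0, step_size=0.1, target_accept=0.4,
                   n_rounds=20, steps_per_round=200, seed=3252):
    """Crudely tune the scale of a symmetric Gaussian proposal during warm-up:
    grow it when too many proposals are accepted, shrink it when too few are.
    Returns the tuned step size and the walker's final position."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta_0, dtype=float)
    log_p = log_target(theta)

    for _ in range(n_rounds):
        n_accept = 0
        for _ in range(steps_per_round):
            theta_prop = theta + step_size * rng.normal(size=theta.size)
            log_p_prop = log_target(theta_prop)
            if np.log(rng.uniform()) < log_p_prop - log_p:
                theta, log_p = theta_prop, log_p_prop
                n_accept += 1

        # Multiplicatively nudge the step size toward the target acceptance rate.
        accept_frac = n_accept / steps_per_round
        step_size *= np.exp(accept_frac - target_accept)

    return step_size, theta

# Usage with the hypothetical log target from the sampler sketch above.
log_target = lambda theta: -0.5 * np.sum(theta**2)
step_size, theta_start = tune_step_size(log_target, theta_0=[0.0, 0.0])
```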
Gibbs sampling is a more modern approach to defining transition kernels. It was very popular for many years, though we will not go into the details. It is a special case of a Metropolis-Hastings sampler. These days, most samplers use Hamiltonian Monte Carlo (HMC). Stan uses a highly optimized HMC sampler which gives significant performance improvements for important subclasses of problems. To my knowledge and in my opinion, Stan is the current state-of-the-art in MCMC sampling software. (I cannot overemphasize how much of a performance improvement Stan’s sampler offers; it allows sampling of posteriors that are impossible with a more naive implementation of Metropolis-Hastings.)
Importantly, the HMC sampler can only handle continuous variables; it cannot sample discrete variables. Depending on your problem, this could be a limitation, but in most applications we are interested in estimating continuous parameters. There are also ways around sampling discrete variables, such as explicitly marginalizing them out of the posterior, so we rarely need to sample them anyway.
It will become clear why I am using this strange superscripted parenthetical indexing of the parameters as we continue with the MCMC discussion. I need other subscripts later on.↩︎
Well, not any. For some cases, we may not be able to make a transition kernel that satisfies the necessary properties, which I describe in the following pages.↩︎
The word "enough" is pretty loaded here. We need to draw enough samples such that we get a sufficient number of effectively independent samples. One algorithm for drawing samples might require 1000 samples for each effectively independent one. Another, better one might only require 10. Margossian and Gelman have a nice discussion on this.↩︎
When using Stan’s sampler, the warm-up is a bit more than just burn-in, which is just regular sampling where we simply neglect the first bunch of samples. Stan’s algorithm is actively choosing stepping strategies during the warm-up phase.↩︎
In the event that we are dealing with discrete parameters \(\theta\), then the integral is replaced with a sum and the transition kernel is a probability, and not a probability density. The same results still hold and \(P(\theta)\) is an invariant distribution.↩︎