30  MCMC diagnostics via a case study: Artificial funnel of hell

# Colab setup ------------------
import os, shutil, sys, subprocess, urllib.request
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade iqplot colorcet bebi103 arviz cmdstanpy watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    from cmdstanpy.install_cmdstan import latest_version
    cmdstan_version = latest_version()
    cmdstan_url = f"https://github.com/stan-dev/cmdstan/releases/download/v{cmdstan_version}/"
    fname = f"colab-cmdstan-{cmdstan_version}.tgz"
    urllib.request.urlretrieve(cmdstan_url + fname, fname)
    shutil.unpack_archive(fname)
    os.environ["CMDSTAN"] = f"./cmdstan-{cmdstan_version}"
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------
import numpy as np

import cmdstanpy
import arviz as az

import bebi103

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

In previous lessons, we have seen that we can sample out of arbitrary probability distributions, most notably posterior probability distributions in the context of Bayesian inference, using Markov chain Monte Carlo. However, there are a few questions we need to answer to make sure our MCMC samplers are in fact sampling the target distribution.

  1. Have we achieved stationarity? That is, have the chains run long enough that they are effectively sampling out of the limiting (target) distribution?
  2. Can the chains access all areas of parameter space?
  3. Have we taken enough samples to get a good picture of the posterior?

There are diagnostic checks we can do to address these questions, and these checks are the topic of this lesson.

30.1 The funnel of hell

While we will not be directly considering a posterior, when learning how to diagnose issues with the sampler it is worthwhile to work with a distribution that is difficult to sample out of. We will therefore consider the funnel of hell that we visited in a previous exercise. As you will see in forthcoming lessons, it has many of the same pathologies that are often present in hierarchical models. Here is our simple-looking, but difficult-to-sample distribution.

\[\begin{align} & v \sim \text{Norm}(0, 3),\\[1em] & \theta \sim \text{Norm}(0, \mathrm{e}^{v/2}). \end{align}\]

That is, \(v\) is Normally distributed with mean zero and variance 9, and \(\theta\) is Normally distributed with mean zero and variance \(\mathrm{e}^v\). The joint distribution is then

\[\begin{align} P(\theta, v) = P(\theta\mid v) \,P(v) = \frac{\mathrm{e}^{-v/2}}{6\pi}\,\exp\left[-\frac{1}{2}\left(\frac{v^2}{9} + \frac{\theta^2}{\mathrm{e}^v}\right)\right] \end{align}\]

We can compute this analytically, so let’s make a plot of it so we know what we’re sampling out of.

theta = np.linspace(-4, 4, 400)
v = np.linspace(-15, 5, 400)

THETA, V = np.meshgrid(theta, v)
P = np.exp(-V/2) / 6 / np.pi * np.exp(-(V**2 / 9 + THETA**2 / np.exp(V))/2)

# Hack: use the contour plotter with invisible contour lines to show only the image
bokeh.io.show(bebi103.viz.contour(THETA, V, P, overlaid=True, line_kwargs=dict(alpha=0)))

Much of the probability density lies deep in the funnel, which is a region of high curvature. The sampler may have some real troubles down there.

Before proceeding to attempt to sample this, I note that the use of this funnel originates from Section 8 of a paper by Radford Neal, and this section of the tutorial draws from a paper by Betancourt and Girolami.

30.1.1 Sampling out of the funnel

Now, we’ll code up a Stan model for the funnel and draw some samples using MCMC. The Stan code is short and simple.

parameters {
  real theta;
  real v;
}


model {
  v ~ normal(0, 3);
  theta ~ normal(0, exp(v/2));
}

Let’s compile and sample!

with bebi103.stan.disable_logging():
    sm = cmdstanpy.CmdStanModel(stan_file='funnel.stan')
    samples = sm.sample(seed=3252)

samples = az.from_cmdstanpy(samples)

With samples in hand, we will proceed to define and compute diagnostics for the sampler.

30.2 Diagnostics for any MCMC sampler

We will first investigate diagnostics that apply to any MCMC sampler, not just Hamiltonian Monte Carlo samplers like Stan uses.

30.2.1 The Gelman-Rubin R-hat statistic

The Gelman-Rubin R-hat statistic is a useful metric to determine whether we have achieved stationarity with our chains. The idea is that we run multiple chains in parallel (at least four). For a given parameter, we compute the variance of the samples between the chains and the variance of the samples within the chains. The ratio of these two is the Gelman-Rubin R-hat statistic, usually denoted \(\hat{R}\), and we compute \(\hat{R}\) for each parameter.

\[\begin{align} \hat{R} = \frac{\text{variance between chains}}{\text{variance within chains}}. \end{align}\]

The value of \(\hat{R}\) approaches unity if the chains are properly sampling the target distribution because the chains should be identical in their sampling of the posterior if they have all reached the limiting distribution. As a rule of thumb, recommended by Vehtari, et al., 2021, the value of \(\hat{R}\) should be less than 1.01. There are more details involved in calculation of \(\hat{R}\), and you may read about them in the Vehtari, et al. paper.
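To build intuition for the computation, here is a minimal sketch of the classic split-\(\hat{R}\) calculation, without the rank normalization described in the Vehtari, et al. paper. The function rhat_classic is ours, written for illustration, not part of any package, so its output will differ somewhat from ArviZ's.

def rhat_classic(chains):
    """Classic split-R-hat for samples of shape (n_chains, n_draws)."""
    # Split each chain in half so within-chain drift also inflates R-hat
    half = chains.shape[1] // 2
    chains = np.vstack((chains[:, :half], chains[:, half : 2 * half]))
    n = chains.shape[1]

    # Between-chain variance of the per-chain means,
    # and average within-chain variance
    B = n * np.var(chains.mean(axis=1), ddof=1)
    W = np.mean(np.var(chains, axis=1, ddof=1))

    # Pooled estimate of the marginal posterior variance
    var_hat = (n - 1) / n * W + B / n

    return np.sqrt(var_hat / W)


# Compare with ArviZ's rank-normalized value for v below
rhat_classic(samples.posterior.v.values)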

ArviZ automatically computes \(\hat{R}\) using state-of-the-art rank normalization techniques (published in Vehtari, et al.).

az.rhat(samples)
<xarray.Dataset> Size: 16B
Dimensions:  ()
Data variables:
    theta    float64 8B 1.054
    v        float64 8B 1.099

We see that \(\hat{R}\) for each of the two parameters is above 1.01, violating the rule of thumb.

If we want to see a quick summary of the results of MCMC, including mean parameter values, we can use az.summary(). This gives a Pandas data frame, indexed by the parameter names, which is convenient for display.

az.summary(samples)
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
theta 0.060 7.233 -9.078 8.971 0.478 1.507 467.0 298.0 1.05
v 0.667 2.627 -4.464 5.145 0.392 0.313 41.0 18.0 1.10

We will discuss what some of these other statistics aside from \(\hat{R}\) mean momentarily.

We can take more samples to bring \(\hat{R}\) down, so let's do that.

with bebi103.stan.disable_logging():
    samples = sm.sample(iter_sampling=100_000, seed=3252)

samples = az.from_cmdstanpy(samples)

# Check R-hat
az.rhat(samples)
<xarray.Dataset> Size: 16B
Dimensions:  ()
Data variables:
    theta    float64 8B 1.003
    v        float64 8B 1.005

We are now in compliance with the rule of thumb. However, needing so many samples to achieve a good \(\hat{R}\) is indicative of other problems with the sampler. We will therefore continue our tour of diagnostics with the standard 1000 samples per chain.

with bebi103.stan.disable_logging():
    samples = sm.sample(seed=3252, show_progress=False)

samples = az.from_cmdstanpy(samples)

30.2.2 Effective sample size

Recall that MCMC samplers do not draw independent samples from the target distribution; rather, the samples are correlated. Ideally, though, we would draw independent samples. We would therefore like an estimate of the number of effectively independent samples we draw. This is referred to either as the effective sample size (ESS) or the number of effective samples (\(n_\mathrm{eff}\)).

ArviZ computes ESS according to the prescription laid out in the Vehtari, et al. paper using az.ess(). In the summary, this is given in the ess_bulk column.

az.summary(samples)
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
theta 0.060 7.233 -9.078 8.971 0.478 1.507 467.0 298.0 1.05
v 0.667 2.627 -4.464 5.145 0.392 0.313 41.0 18.0 1.10

We took a total of 4000 steps (1000 on each of four chains) and got a bulk ESS of about 470 for \(\theta\), but a tiny 41 for \(v\). As a rule of thumb, according to Vehtari, et al., you should have ESS > 400, so we have a problem here.

We will also consider ess_tail, commonly referred to as tail-ESS. Again, I will not go into the details of how this is calculated, but it is the effective sample size when considering the more extreme values of the posterior (by default, the lower and upper 5th percentiles). Note that this is not the number of samples that landed in the tails, but rather a measure of what the total number of effective samples would be if we were effectively sampling the tails. Again, as a rule of thumb, we want tail-ESS to be greater than 400. We have problems here.
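If we want the tail-ESS by itself, outside of the summary table, we can compute it directly with az.ess() using its method keyword.

# Tail-ESS directly; az.summary() reports this as ess_tail
az.ess(samples, method="tail")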

Bear in mind that the ESS calculation is approximate and subject to error. There are, as usual, other caveats, which are discussed in the Vehtari, et al. paper and the Stan manual.
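For intuition, here is a crude sketch of the idea underlying ESS: the number of draws deflated by the chain's autocorrelation, \(N/(1 + 2\sum_t \rho_t)\). ArviZ's actual calculation combines chains and uses rank normalization and a more careful truncation of the autocorrelation sum, so the numbers will differ; ess_crude is our own illustrative function.

def ess_crude(chain):
    """Crude ESS for one chain: N / (1 + 2 * sum of autocorrelations),
    truncating the sum at the first negative autocorrelation."""
    x = chain - chain.mean()
    n = len(x)

    # Empirical autocorrelation function, normalized so acf[0] = 1
    acf = np.correlate(x, x, mode="full")[n - 1:] / np.sum(x**2)

    rho_sum = 0.0
    for rho in acf[1:]:
        if rho < 0:
            break
        rho_sum += rho

    return n / (1 + 2 * rho_sum)


# Sum over the four chains of v for rough comparison with ess_bulk
sum(ess_crude(chain) for chain in samples.posterior.v.values)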

30.2.3 Monte Carlo standard error

The Monte Carlo standard errors (MCSE) are reported as mcse_mean and mcse_sd. They are, respectively, the standard error of the mean and the standard error of the standard deviation of the chains. They provide an estimate of how accurate the MCMC-based expectation values of the mean and standard deviation are. In practice, if the MCSE of the mean is much less than the standard deviation of the samples themselves (that is, the mcse_mean column is much less than the sd column), we have taken plenty of samples. The only reason to use the MCSE is if we have a particularly strong interest in a very precise measurement of the mean.
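The conceptual relation to keep in mind is that the MCSE of the mean is approximately the posterior standard deviation divided by the square root of the ESS. We can do a back-of-the-envelope check against the summary table; the match is only rough, since ArviZ internally uses a dedicated ESS variant for the mean rather than ess_bulk.

# Rough MCSE of the mean: sd / sqrt(ESS)
s = az.summary(samples)
s["sd"] / np.sqrt(s["ess_bulk"])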

I was hesitant to even discuss this here, since I agree with Gelman, “For Bayesian inference, I don’t think it’s generally necessary or appropriate to report Monte Carlo standard errors of posterior means and quantiles…”

30.3 Diagnostics for HMC

Both \(\hat{R}\) and ESS are useful diagnostics for any MCMC sampler, but Hamiltonian Monte Carlo offers other diagnostics to help ensure that the sampling is going as it should. It is important to note that these diagnostics are a feature of HMC, not a bug. By that I mean that the absence of these diagnostics, particularly divergences, from other sampling methods means that it is harder to ensure that they are sampling properly. The ability to check that it is working properly makes HMC all the more powerful.

30.3.1 Divergences

Hamiltonian Monte Carlo enables large step sizes by taking into account the shape of the target distribution and tracing trajectories along it. (This is of course a very loose description. You should read Michael Betancourt’s wonderful introduction to HMC to get a more complete picture.) When a trajectory encounters a region of parameter space where the posterior (target) distribution has high curvature, the trajectory can veer sharply. These events can be detected and are registered as divergences. A given Monte Carlo step ends in a divergence if this happens. This does not necessarily mean that there is a problem with the sample, but there is a good chance that there is.

Stan keeps track of divergences and reports them. In ArviZ InferenceData objects, they are stored in the sample_stats attribute. Let's take a look at the divergences among our samples.

samples.sample_stats.diverging
<xarray.DataArray 'diverging' (chain: 4, draw: 1000)> Size: 4kB
array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [ True, False,  True, ..., False, False, False],
       [ True, False, False, ..., False, False, False]])
Coordinates:
  * chain    (chain) int64 32B 0 1 2 3
  * draw     (draw) int64 8kB 0 1 2 3 4 5 6 7 ... 993 994 995 996 997 998 999

We can check how many divergences we had by summing them.

int(np.sum(samples.sample_stats.diverging))
222

Oof! The funnel gave us 222 divergences. This is indicative of a sampler in trouble! We will deal with this momentarily.
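We can also check whether the divergences are concentrated in particular chains by summing over draws.

# Number of divergences in each chain
samples.sample_stats.diverging.sum(dim="draw")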

30.3.2 Tree depth

The explanation of this diagnostic is a little computer-sciencey, so you can skip to the last sentence of this section if the CS terms are unfamiliar to you.

The HMC algorithm used by Stan uses recursion. In practice, when doing recursive calculations, you need to put a bound on how deep the recursion can go, i.e., you need to cap the tree depth, lest you get a stack overflow. Stan therefore has a limit on tree depth, with a default of 10. If this limit is hit while taking a sample, the sample is not wrong, just obtained less efficiently. Stan therefore reports the tree depth information for each sample. These are also included in sample_stats.

samples.sample_stats.tree_depth
<xarray.DataArray 'tree_depth' (chain: 4, draw: 1000)> Size: 32kB
array([[2, 4, 4, ..., 2, 3, 4],
       [3, 3, 2, ..., 2, 1, 2],
       [1, 1, 4, ..., 4, 4, 4],
       [1, 1, 1, ..., 4, 2, 2]])
Coordinates:
  * chain    (chain) int64 32B 0 1 2 3
  * draw     (draw) int64 8kB 0 1 2 3 4 5 6 7 ... 993 994 995 996 997 998 999

We can check how many draws hit the maximum tree depth of 10.

int(np.sum(samples.sample_stats.tree_depth == 10))
0

So, in this case, we never hit the maximum tree depth. When we do hit it often, sampling is typically less efficient and the ESS decreases.
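Out of curiosity, we can also tally the tree depths across all chains and draws.

# Count how many draws used each tree depth
depths, counts = np.unique(samples.sample_stats.tree_depth.values, return_counts=True)
dict(zip(depths, counts))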

30.3.3 E-BFMI

The energy Bayesian fraction of missing information, or E-BFMI, is another metric specific to HMC samplers. Loosely speaking (again), it is a measure of how effective the sampler is at taking long steps. Some details are given in the Betancourt paper on HMC; we will not go into them here, beyond noting that, as a rule of thumb, E-BFMI values below 0.2 can be indicative of inefficient sampling.

Stan records the energy of each draw, from which the E-BFMI is computed. The energies are also included in sample_stats.

samples.sample_stats.energy
<xarray.DataArray 'energy' (chain: 4, draw: 1000)> Size: 32kB
array([[ 2.38275 ,  4.71404 ,  4.36306 , ...,  1.57001 ,  3.59906 ,
         3.32988 ],
       [ 0.717477,  2.60399 ,  3.26155 , ...,  0.396415,  2.11211 ,
         2.62692 ],
       [-0.789693, -0.621038,  2.87434 , ...,  3.22471 ,  1.53535 ,
         1.90595 ],
       [ 0.604643, -0.27291 ,  1.58246 , ...,  2.04745 ,  0.59004 ,
         0.525632]])
Coordinates:
  * chain    (chain) int64 32B 0 1 2 3
  * draw     (draw) int64 8kB 0 1 2 3 4 5 6 7 ... 993 994 995 996 997 998 999

Importantly, the 0.2 rule of thumb applies to the E-BFMI itself, which is computed per chain from these energies; it does not apply to the raw energy values, so simply counting small energies tells us nothing.
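As a sketch of that calculation (using the standard definition: the mean squared successive difference of the energy divided by the variance of the energy), we can compute the E-BFMI for each chain by hand. The diagnostic checker in the next section performs this check for us.

# E-BFMI per chain: mean squared successive energy difference
# divided by the energy variance; values below ~0.2 are worrisome
energy = samples.sample_stats.energy.values
np.mean(np.diff(energy, axis=1) ** 2, axis=1) / np.var(energy, axis=1)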

30.4 Quickly checking the diagnostics

I wrote a function, based on work by Michael Betancourt, to quickly check these diagnostics for a set of samples. It is available in the bebi103.stan submodule.

bebi103.stan.check_all_diagnostics(samples)
tail-ESS for parameter theta is 297.52415027291516.
ESS for parameter v is 40.841114456860105.
tail-ESS for parameter v is 17.763583583602472.
  ESS or tail-ESS below 100 per chain indicates that expectation values
  computed from samples are unlikely to be good approximations of the
  true expectation values.

Rhat for parameter theta is 1.0538396525577298.
Rhat for parameter v is 1.0992399132159116.
  Rank-normalized Rhat above 1.01 indicates that the chains very likely have not mixed.

222 of 4000 (5.55%) iterations ended with a divergence.
  Try running with larger adapt_delta to remove divergences.

0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.

E-BFMI indicated no pathological behavior.
7

This is a quick check you can do to make sure everything is in order after obtaining samples. But it is very important to note that passing all of these diagnostic checks does not ensure that you achieved effective sampling. And perhaps even more importantly, getting effective sampling certainly does not guarantee that your model is a good one. Nonetheless, good, identifiable models tend to pass the diagnostic checks more often than poor ones.

The funnel is a tough one, and we are failing in many diagnostics.

30.5 Fixing the model: Be nice to your sampler

The diagnostics indicated several divergences, which, as I mentioned before, tend to happen in regions where the target distribution has high curvature. We also have poor effective sample sizes for the parameter \(v\), and the R-hats are large.

Let’s look at a plot of the samples, overlaid with samples we trust, which we can generate by sampling directly with Numpy as we did in the earlier funnel of hell exercise. (You can click on the legend to display or hide the respective samples.)

# Sample out of distribution using Numpy
np.random.seed(3252)
v = np.random.normal(0, 3, size=4000)
theta = np.random.normal(0, np.exp(v / 2))

p = bokeh.plotting.figure(
    height=400, width=450, x_range=[-100, 100], x_axis_label="θ", y_axis_label="v"
)
p.scatter(theta, v, alpha=0.3, color="#66c2a5", legend_label="indep. samples")
p.legend.location = "bottom_left"

# Overlay MCMC samples
p.scatter(
    samples.posterior.theta.values.flatten(),
    samples.posterior.v.values.flatten(),
    color="#fc8d62",
    alpha=0.3,
    legend_label="default sampling",
)
p.legend.click_policy = "hide"

bokeh.io.show(p)

Stan’s sampler is clearly not penetrating to the lower regions of the funnel. If we did not have the correctly generated independent samples to compare to, we might not ever discover that this is an issue. So how can we be aware of sampling issues like this?

First off, the divergences clue us in that there is a problem. We can start to investigate what the chains are doing by taking a graphical approach. We can start with the trace plot.

bokeh.io.show(bebi103.viz.trace(samples, parameters=['theta', 'v']))

We immediately see a pathology in the trace plot for \(v\). When \(v\) is small, the chains get stuck: the proposed steps keep ending in divergences and are rejected, so the chains cannot move.
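We can corroborate this by looking at the Metropolis acceptance statistic for each chain; a chain that spends time stuck rejecting proposals tends to show a depressed value. (This assumes ArviZ stores Stan's accept_stat__ in sample_stats as acceptance_rate, as recent versions do.)

# Mean acceptance statistic for each chain
samples.sample_stats.acceptance_rate.mean(dim="draw")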

We can look at this another way using a parallel coordinate plot. To allow for easy comparison, we will transform \(\theta\), plotting the logarithm of its absolute value. The function bebi103.viz.parcoord() displays divergent samples in orange.

bokeh.io.show(
    bebi103.viz.parcoord(
        samples,
        transformation={'theta': lambda x: np.log10(np.abs(x))},
        divergence_kwargs={"line_width": 1, "line_alpha": 0.15},
    )
)

From the parallel coordinate plot, most divergences come when \(v\) is small and \(\theta\) is close to zero, which is the bottom of the funnel. The log posterior is also high for these divergences. There is substantial probability mass in the funnel, so we do really need to sample it.

As an alternative plot, we can plot the divergent samples in a different color in a scatter plot of our samples. The bebi103.viz.corner() function automatically does this.

bokeh.io.show(bebi103.viz.corner(samples, parameters=["theta", "v"]))

The graphical displays of divergences, particularly the colored scatter plot above and the parallel coordinate plot, help diagnose the problem.

30.6 Conquering the Funnel of Hell

How can we get our MCMC sampler deep into the funnel? The funnel arises because the variance of the distribution of \(\theta\) becomes very small, narrowing the funnel so that nearly any step the sampler takes is too large and carries it out of the funnel. We nonetheless need to sample down into the funnel to get true samples out of the target distribution.

30.6.1 Adjusting adapt_delta

We could take the advice of Stan's warning messages and increase the adapt_delta parameter, which makes the sampler take smaller steps. The default value is 0.8, so let's crank it up to 0.99 and see if that works.

with bebi103.stan.disable_logging():
    samples = sm.sample(seed=3252, adapt_delta=0.99)
samples = az.from_cmdstanpy(samples)

# Check diagnostics
bebi103.stan.check_all_diagnostics(samples)

# Add plot of samples
p.scatter(
    samples.posterior.theta.values.flatten(),
    samples.posterior.v.values.flatten(),
    color="#8da0cb",
    alpha=0.3,
    legend_label="small adapt_delta",
)
bokeh.io.show(p)
tail-ESS for parameter theta is 353.4367993429466.
ESS for parameter v is 157.8386124897571.
tail-ESS for parameter v is 333.49820097378057.
  ESS or tail-ESS below 100 per chain indicates that expectation values
  computed from samples are unlikely to be good approximations of the
  true expectation values.

Rhat for parameter theta is 1.0183414104652264.
Rhat for parameter v is 1.0245451664600984.
  Rank-normalized Rhat above 1.01 indicates that the chains very likely have not mixed.

35 of 4000 (0.875%) iterations ended with a divergence.
  Try running with larger adapt_delta to remove divergences.

0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.

E-BFMI indicated no pathological behavior.

That helped. We have far fewer divergences. However, we are still just a bit shy of the bottom of the funnel.
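To quantify how far short we fall, we can compare the smallest sampled \(v\) to the smallest value among the independent samples (the Numpy array v from the plotting cell above).

# How deep into the funnel did the sampler get?
print("MCMC min v:       ", float(samples.posterior.v.min()))
print("independent min v:", v.min())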

30.6.2 Non-centering

Instead of making the sampler sample out of a distribution with tiny variance, we can make it sample out of a distribution with a more reasonable variance and then transform those samples into samples from the tiny-variance distribution. To devise a strategy for doing this, we use the change of variables formula for probability distributions. Imagine we have a probability distribution of \(\theta\) with probability density function \(\pi(\theta)\). If we instead wish to have a probability density function for another variable \(\tilde{\theta}\), which we can express as a function of \(\theta\), \(\tilde{\theta} = \tilde{\theta}(\theta)\), we need to ensure that \(\pi(\tilde{\theta})\) is normalized,

\[\begin{align} \int \mathrm{d}\tilde{\theta}\,\pi(\tilde{\theta}) = 1. \end{align}\]

To relate this integral to the integral of \(\pi(\theta)\), we need to properly change variables in the integral. This leads to the change of variables formula,

\[\begin{align} \pi(\tilde{\theta}) = \left|\frac{\mathrm{d}\theta}{\mathrm{d}\tilde{\theta}}\right|\,\pi(\theta). \end{align}\]

Now, if we choose

\[\begin{align} \tilde{\theta} = \frac{\theta - \mu}{\sigma}, \end{align}\]

then

\[\begin{align} \left|\frac{\mathrm{d}\theta}{\mathrm{d}\tilde{\theta}}\right| = \sigma \end{align}\]

and

\[\begin{align} \pi(\tilde{\theta}) = \sigma \pi(\theta). \end{align}\]

If \(\theta\) is Normally distributed with mean \(\mu\) and variance \(\sigma^2\), we have

\[\begin{align} \pi(\theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\mathrm{e}^{-(\theta-\mu)^2/2\sigma^2}. \end{align}\]

Then, to satisfy the change of variables formula,

\[\begin{align} \pi(\tilde{\theta}) = \frac{1}{\sqrt{2\pi}}\,\mathrm{e}^{-\tilde{\theta}^2/2}. \end{align}\]

This means that \(\tilde{\theta} \sim \text{Norm}(0, 1)\). Thus, we can reparametrize using the fact that \(\theta \sim \text{Norm}(\mu, \sigma)\) is equivalent to

\[\begin{align} &\tilde{\theta} \sim \text{Norm}(0, 1),\\[1em] &\theta = \mu + \sigma\,\tilde{\theta}. \end{align}\]

So, in our case, we can instead sample using \(\tilde{\theta}\) with

\[\begin{align} &\tilde{\theta} \sim \text{Norm}(0, 1),\\[1em] &\theta = \mathrm{e}^{v/2}\,\tilde{\theta}. \end{align}\]

This process is called non-centering. A non-centered parametrization has the sampler exploring away from the mean of the target distribution (hence, non-centered), and then a transformation ensures that the samples come from the target.
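As a quick numerical sanity check of this equivalence (with an arbitrarily chosen, fixed value of \(v\)), direct sampling and the non-centered recipe should give samples of \(\theta\) with the same standard deviation.

# Fix v and compare direct sampling with the non-centered recipe
rng = np.random.default_rng(3252)
v_fixed = -5.0
theta_direct = rng.normal(0, np.exp(v_fixed / 2), size=100_000)
theta_noncentered = np.exp(v_fixed / 2) * rng.normal(0, 1, size=100_000)
print(np.std(theta_direct), np.std(theta_noncentered))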

Let’s implement the non-centered parametrization of this pathological distribution in Stan. The Stan code is

parameters {
  real theta_tilde;
  real v; 
}


transformed parameters {
  real theta = exp(v/2) * theta_tilde;
}


model {
  v ~ normal(0, 3);
  theta_tilde ~ normal(0, 1);
}

Let’s compile and sample. We won’t bother adjusting adapt_delta; we’ll just see what we get.

with bebi103.stan.disable_logging():
    sm = cmdstanpy.CmdStanModel(stan_file='funnel_noncentered.stan')
    samples = sm.sample(seed=3252)

samples = az.from_cmdstanpy(samples)

bebi103.stan.check_all_diagnostics(samples)
Effective sample size looks reasonable for all parameters.

Rhat looks reasonable for all parameters.

0 of 4000 (0.0%) iterations ended with a divergence.

0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.

E-BFMI indicated no pathological behavior.
0

Excellent! No divergences and all diagnostics check out. Let’s overlay a plot of the samples to see if we got the whole funnel.

p.scatter(
    samples.posterior.theta.values.flatten(),
    samples.posterior.v.values.flatten(),
    color="#e78ac3",
    alpha=0.3,
    legend_label="non-centered",
)
bokeh.io.show(p)

Look at that! We have managed to sample all the way down the funnel! We have conquered the Funnel of Hell.

30.6.3 Hierarchical models feature a Funnel of Hell

It turns out that many hierarchical models feature a Funnel of Hell, so non-centering is often crucial. We will explore this in future lessons.

bebi103.stan.clean_cmdstan()

30.7 Computing environment

%load_ext watermark
%watermark -v -p numpy,cmdstanpy,bokeh,bebi103,jupyterlab
print("cmdstan   :", bebi103.stan.cmdstan_version())
Python implementation: CPython
Python version       : 3.12.11
IPython version      : 9.1.0

numpy     : 2.1.3
cmdstanpy : 1.2.5
bokeh     : 3.6.2
bebi103   : 0.1.27
jupyterlab: 4.3.7

cmdstan   : 2.36.0