34  Model comparison in practice

Open in Google Colab | Download notebook

Data set download


Code
# Colab setup ------------------
import os, shutil, sys, subprocess, urllib.request
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade polars iqplot colorcet bebi103 arviz cmdstanpy watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    from cmdstanpy.install_cmdstan import latest_version
    cmdstan_version = latest_version()
    cmdstan_url = f"https://github.com/stan-dev/cmdstan/releases/download/v{cmdstan_version}/"
    fname = f"colab-cmdstan-{cmdstan_version}.tgz"
    urllib.request.urlretrieve(cmdstan_url + fname, fname)
    shutil.unpack_archive(fname)
    os.environ["CMDSTAN"] = f"./cmdstan-{cmdstan_version}"
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------
import numpy as np
import polars as pl

import cmdstanpy
import arviz as az

import bebi103

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
Loading BokehJS ...

When comparing models, posterior predictive checks are a must. As discussed in the previous lesson, we can use information criteria to assess relative predictive effectiveness of models. In order to understand this lesson, you will need have carefully read and understand the previous lesson on the theory behind model comparison.

The key quantity we compute for more comparison is the expected log pointwise predictive density, or elpd. There were a few key ideas and assumptions in using elpd.

  1. The elpd is an approximation of the difference in the Kullback-Leibler divergence between the posterior predictive distribution and the true generative distribution.
  2. For our set of measurements \(y = (y_1, y_2, \ldots y_N)\), where each \(y_i\) may be multidimensional (as you would have, for example, of beak length/beak depth measurements for a single finch), we assume that \(y_i\)’s are independently distributed, both in the model and in the true generative distribution.

With these assumptions, we can approximately compute the elpd using the Watanabe-Akaike information criterion (WAIC) or leave-one-out cross validation (LOO). See this paper by Vehtari, Gelman, and Gabry (arXiv version) for more details about the implementation. As described in that paper, LOO, when computed using Pareto-smoothed importance sampling, is the preferred method for computing an approximate elpd.

Importantly, the (approximate) elpd by itself is not terribly useful in assessing a model. The elpd of one prospective model needs to be compared to another. For this comparison, we can compute Akaike weights. This is the most straightforward calculation of relative weights of respective models, and perhaps easiest to understand. However, it may not be the best way to assess the predictive capabilities of a model, especially in situations where the true generative model is not known (which is often the case for us as scientists). As we think about generative models, and we are not sure which model best generates observed data, it is useful to think about model averaging if our aim is to be predictive. The idea here is that we do not know which model generates data. Instead, we try to find a combination of models that spans all of the models we are considering, that best generate the data. The respective weights of the models give their contributions to this combination of models. As a scientist, I tend to shy away from model averaging; I am rather seeking to understand how nature generates the observations I see, and nature is not trying to predict, nor average models. However, taking a model averaging approach with an eye for optimizing predictive performance leads to more robust estimates of model weights, as outlined in this paper by Yao and coworkers, which describes a technique known as stacking.

In this tutorial, we will demonstrate how to calculate model weights both by using Akaike weights (and variants thereof), and stacking. Conveniently, this may be done approximately directly from samples out of the posterior distributions. The ArviZ package provides much of the functionality we need.

34.1 An example model comparison

To show how we can do an model comparison, we will consider a contrived data set that may come from a single Normal distribution or from a mixture of two Normals.

34.1.1 Computing the pointwise log likelihood

Recalling from Chapter 33, the elpd is the the logarithm of the posterior predictive distribution, \(f(\tilde{y}_i\mid y)\), averaged over the true generative distribution \(f_t(\tilde{y}_i)\).

\[\begin{align} \text{elpd} = \sum_{i=1}^N\int\mathrm{d}\tilde{y}_i\,\,f_t(\tilde{y}_i)\,\ln f(\tilde{y}_i\mid y). \end{align}\]

When generating our samples, we therefore also need to compute samples of the value of the log likelihood. To do this, for each set of parameters \(\theta\) that we sample out of the posterior, we compute the pointwise log likelihood of the data set, using the parameters \(\theta\). The Stan code below includes these log likelihood samples as well as posterior predictive checks for a univariate Gaussian mixture model. Be sure to read the code carefully.

data {
    int<lower=1> N;
    array[N] real y;

    // Number of mixtures
    int<lower=1> K;
}


transformed data {
    // Prior parameters
    real mu_mu = 0;
    real sigma_mu = 100;
    real sigma_sigma = 100;
}


parameters {
    // location and scale parameters of each component of the mixture
    ordered[K] mu;
    array[K] real<lower=0> sigma;

    // Weights of components of mixture
    simplex[K] pi_;
}


model {
    // Priors
    mu ~ normal(mu_mu, sigma_mu);
    sigma ~ normal(0.0, sigma_sigma);

    // Likelihood. Use logsumexp trick to compute likelihood of mixture model
    array[K] real log_pdf_components;
    for (i in 1:N) {
        for (k in 1:K) {
            log_pdf_components[k] = log(pi_[k]) + normal_lpdf(y[i] | mu[k], sigma[k]);
        }
        target += log_sum_exp(log_pdf_components);
    }
}


generated quantities {
    // Posterior predictive checks
    array[N] real y_ppc;
    
    for (i in 1:N) {
        int component = categorical_rng(pi_);
        y_ppc[i] = normal_rng(mu[component], sigma[component]);
    }

    // Pointwise log likelihood
    array[N] real log_lik;

    for (i in 1:N) {
        array[K] real log_pdf_components_log_lik;
        for (k in 1:K) {
            log_pdf_components_log_lik[k] = log(pi_[k]) + normal_lpdf(y[i] | mu[k], sigma[k]);
        }
        log_lik[i] = log_sum_exp(log_pdf_components_log_lik);
    }

}

In the array log_lik, I store the pointwise log likelihood. That is, for each measurement (in this case for each \(y_i\)), I compute the log likelihood for that data point using the parameters that I sampled out of the posterior. Conveniently, Stan’s distributions all have a function that ends in _lpdf that compute the log probability density function for the distribution (with _lpmf for discrete distributions that computes the log probability mass function for the distribution).

Let’s sample out of this generative model, keeping samples of posterior predictive data sets and pointwise log likelihoods.

# Load data and make data dictionary
df = pl.read_csv(os.path.join(data_path, "possibly_bimodal.csv"), comment_prefix="#")
data = dict(N=len(df), y=df["y"].to_numpy(), K=1)

with bebi103.stan.disable_logging():
    # Compile the model
    sm = cmdstanpy.CmdStanModel(stan_file="normal_mixture.stan")
    
    # Perform sampling
    samples_single = sm.sample(data=data)
                                                                                                                                                                                                                                                                                                                                

When we convert the samples to an ArviZ object, we need to specify the posterior predictive and log likelihood.

# Convert to ArviZ object
samples_single = az.from_cmdstanpy(
    posterior=samples_single, posterior_predictive="y_ppc", log_likelihood="log_lik"
)

Now, we will check the diagnostics (as we always should) and make a corner plot.

# Check diagnostics
bebi103.stan.check_all_diagnostics(samples_single)

# Make a corner plot
bokeh.io.show(bebi103.viz.corner(samples_single, parameters=['mu[0]', 'sigma[0]']))
Effective sample size looks reasonable for all parameters.

Rhat for parameter pi_[0] is NaN.

0 of 4000 (0.0%) iterations ended with a divergence.

0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.

E-BFMI indicated no pathological behavior.

Everything looks good. Actually, it doesn’t. The sampling looks good, but we should do posterior predictive checks.

y_ppc = samples_single.posterior_predictive.y_ppc.stack(
    {"sample": ("chain", "draw")}
).transpose("sample", "y_ppc_dim_0")

bokeh.io.show(
    bebi103.viz.predictive_ecdf(
        y_ppc,
        name="y_ppc",
        data=df["y"].to_numpy(),
        x_axis_label="y",
    )
)

We have clearly failed the posterior predictive checks. We can stop here, but we will continue to compute the WAIC and LOO for illustrative purposes. As we do that, let’s take a quick look at the output so we can see how the log likelihood samples are organized. The log likelihood is stored in ArviZ InferenceData objects in the log_likelihood attribute.

samples_single.log_likelihood
<xarray.Dataset> Size: 13MB
Dimensions:        (chain: 4, draw: 1000, log_lik_dim_0: 394)
Coordinates:
  * chain          (chain) int64 32B 0 1 2 3
  * draw           (draw) int64 8kB 0 1 2 3 4 5 6 ... 994 995 996 997 998 999
  * log_lik_dim_0  (log_lik_dim_0) int64 3kB 0 1 2 3 4 5 ... 389 390 391 392 393
Data variables:
    log_lik        (chain, draw, log_lik_dim_0) float64 13MB -4.635 ... -3.176
Attributes:
    created_at:                 2025-06-24T06:33:44.724251+00:00
    arviz_version:              0.21.0
    inference_library:          cmdstanpy
    inference_library_version:  1.2.5

The log likelihood is three dimensional, one dimension each for chain and draw, and then the final dimension is for the pointwise log-likelihood estimates.

Given that the log likelihood is stored in the ArviZ object, we can directly use the samples to compute the WAIC and LOO.

34.1.2 Computing the WAIC and LOO

We will start with the WAIC. We use the scale="deviance" kwarg to get the WAIC. By default, az.waic() will return the estimate for the elpd and not the traditionally used \(-2\mathrm{elpd}_\mathrm{WAIC}\), which is why we use the scale="deviance" kwarg.

az.waic(samples_single, scale="deviance")
Computed from 4000 posterior samples and 394 observations log-likelihood matrix.

              Estimate       SE
deviance_waic  2908.20    30.09
p_waic            2.18        -

The output gives an estimate for the WAIC, and also the contribution of \(p_\mathrm{waic}\). It also gives an estimate of the standard error in the WAIC.

Let’s now compute the LOO.

single_loo = az.loo(samples_single, scale="deviance")

single_loo
Computed from 4000 posterior samples and 394 observations log-likelihood matrix.

             Estimate       SE
deviance_loo  2908.21    30.10
p_loo            2.19        -
------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.70]   (good)      394  100.0%
   (0.70, 1]   (bad)         0    0.0%
   (1, Inf)   (very bad)    0    0.0%

We see that the LOO and WAIC give almost identical results (as they should). The Pareto k diagnostic is also good, meaning that the LOO calculation does not have obvious pathologies. Remember, though, that LOO has better performance across a wider variety of models.

34.1.3 Calculations with the mixture model

Now, let’s do the same calculation for the mixture model. Since we wrote Stan code for a general univariate Gaussian mixture model, we can just change out inputted \(K\) value (the number of components in the mixture) to 2 and sample.

data['K'] = 2

with bebi103.stan.disable_logging():
    samples_mix = sm.sample(data=data)

samples_mix = az.from_cmdstanpy(samples_mix, posterior_predictive='y_ppc', log_likelihood='log_lik')
                                                                                                                                                                                                                                                                                                                                

We’ll do our usual diagnostic checks and make a corner plot.

# Check diagnostics
bebi103.stan.check_all_diagnostics(samples_mix)

# Make corner plot
bokeh.io.show(
    bebi103.viz.corner(
        samples_mix,
        parameters=["mu[0]", "mu[1]", "sigma[0]", "sigma[1]", "pi_[0]"],
        xtick_label_orientation=np.pi / 4,
    )
)
Effective sample size looks reasonable for all parameters.

Rhat looks reasonable for all parameters.

0 of 4000 (0.0%) iterations ended with a divergence.

0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.

E-BFMI indicated no pathological behavior.

Everything looks good! We do see a clear separation in \(\mu\) values in the two components of the mixture. It also appears as though the smaller-\(\mu\) component of the mixture has a mixing coefficient of about 0.45. We can do a quick posterior predictive check.

y_ppc = samples_mix.posterior_predictive.y_ppc.stack(
    {"sample": ("chain", "draw")}
).transpose("sample", "y_ppc_dim_0")

bokeh.io.show(
    bebi103.viz.predictive_ecdf(
        y_ppc,
        name="y_ppc",
        data=df["y"].to_numpy(),
        x_axis_label="y",
    )
)

Much nicer! The model allows for plenty of flexibility to allow for the observed bimodal behavior. Let’s proceed to compute the LOO and WAIC, starting with the WAIC.

az.waic(samples_mix, scale="deviance")
/Users/bois/miniconda3/envs/datasai/lib/python3.12/site-packages/arviz/stats/stats.py:1655: UserWarning: For one or more samples the posterior variance of the log predictive densities exceeds 0.4. This could be indication of WAIC starting to fail. 
See http://arxiv.org/abs/1507.04544 for details
  warnings.warn(
Computed from 4000 posterior samples and 394 observations log-likelihood matrix.

              Estimate       SE
deviance_waic  2795.92    32.32
p_waic            5.12        -

There has been a warning during the calculation. Please check the results.

And the LOO.

mix_loo = az.loo(samples_mix, scale="deviance")

mix_loo
Computed from 4000 posterior samples and 394 observations log-likelihood matrix.

             Estimate       SE
deviance_loo  2795.95    32.33
p_loo            5.14        -
------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.70]   (good)      394  100.0%
   (0.70, 1]   (bad)         0    0.0%
   (1, Inf)   (very bad)    0    0.0%

As expected, we get almost the same value for the two information criteria. Let’s take a quick look at the difference of the LOO’s between the mixture model and the single Negative Binomial model.

mix_loo.elpd_loo - single_loo.elpd_loo
np.float64(-112.25895344885703)

Remember that for historical reasons,

\[\begin{align} \text{WAIC} \approx -2\,\text{elpd}, \\[1em] \text{LOO} \approx -2\,\text{elpd}. \\[1em] \end{align}\]

The bigger the elpd is, the smaller the Kullback-Leibler divergence is, so the better the model is. Thus, a bigger elpd means a smaller WAIC or LOO. So, the smaller the WAIC or LOO is, the closer the model is to the true generative model. This WAIC and LOO are smaller for the mixture model than for the single Normal model, so the mixture model is a better model.

34.1.4 Computing the weights

We can directly compute the Akaike weights from the values of the LOO, using

\[\begin{align} w_i = \frac{\exp\left[-(\text{LOO}_i-\text{LOO}_j)/2\right]}{1 + \exp\left[-(\text{LOO}_i-\text{LOO}_j)/2\right]}. \end{align}\]

d_loo = mix_loo.elpd_loo - single_loo.elpd_loo
w_single = np.exp(d_loo/2) / (1 + np.exp(d_loo/2))
w_mix = 1 - w_single

print('      Mixture model weight:', w_mix)
print('Single Normal model weight:', w_single)
      Mixture model weight: 1.0
Single Normal model weight: 4.200277524728524e-25

In agreement with our posterior predictive checks, the mixture model is far more predictive than the single negative binomial model.

As I mentioned above, ArviZ offers more a sophisticated means of computing the weights using stacking. The results tend to be less extreme (and therefore more conservative) that directly computing the Akaike weights. We can use the az.compare() function to do the calculation. We will do it using the LOO (WAIC is default, so we use the ic kwarg). The first input is a dictionary containing the MCMC samples, where the keys of the dictionary are the names of the models.

az.compare({"single": samples_single, "mix": samples_mix}, ic="loo", scale="deviance")
rank elpd_loo p_loo elpd_diff weight se dse warning scale
mix 0 2795.946936 5.138599 0.000000 1.000000e+00 32.327322 0.000000 False deviance
single 1 2908.205890 2.188323 112.258953 3.882406e-11 30.096514 19.915636 False deviance

The mixture model is still dominant. (Again, we are not going into the details of the stacking calculation, but you can read about it in this paper by Yao and coworkers.)

bebi103.stan.clean_cmdstan()

34.2 Computing environment

%load_ext watermark
%watermark -v -p numpy,polars,cmdstanpy,arviz,bokeh,bebi103,jupyterlab
print("cmdstan   :", bebi103.stan.cmdstan_version())
Python implementation: CPython
Python version       : 3.12.11
IPython version      : 9.1.0

numpy     : 2.1.3
polars    : 1.30.0
cmdstanpy : 1.2.5
arviz     : 0.21.0
bokeh     : 3.6.2
bebi103   : 0.1.27
jupyterlab: 4.3.7

cmdstan   : 2.36.0