25 Model building with prior predictive checks

| Download notebook

# Colab setup ------------------
import os, shutil, sys, subprocess, urllib.request
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade polars iqplot colorcet bebi103 arviz cmdstanpy watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    from cmdstanpy.install_cmdstan import latest_version
    cmdstan_version = latest_version()
    cmdstan_url = f"https://github.com/stan-dev/cmdstan/releases/download/v{cmdstan_version}/"
    fname = f"colab-cmdstan-{cmdstan_version}.tgz"
    urllib.request.urlretrieve(cmdstan_url + fname, fname)
    shutil.unpack_archive(fname)
    os.environ["CMDSTAN"] = f"./cmdstan-{cmdstan_version}"
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

import numpy as np
import polars as pl
import scipy.stats as st

import cmdstanpy
import arviz as az

import iqplot
import bebi103

import bokeh.io
bokeh.io.output_notebook()

Loading BokehJS ...

When we do generative modeling, it is important to understand what kind of data might be generated from our model before looking at the data. You may have been struggling to come up with good priors for your models. I find that this struggle is a symptom of the fact that we often think about what kind of data we might see in an experiment as opposed to how we might think the parameters of the generative process that produces the data may be distributed.

In this lesson, we discuss how prior predictive checks can help build appropriate priors for a model. The procedure is used check to make sure the generative model can actually generate reasonable data sets.

25.1 Building a generative model

When considering how to build the model, we should think about the data generation process. We generally specify the likelihood first, and then use it to identify what the parameters are, which tells us what we need to specify for the prior. Indeed, the prior can often only be understood in the context of the likelihood.

Note, though, that the split between the likelihood and prior need even be explicit. When we study hierarchical models, we will see that it is sometimes not obvious how to unambiguously define the likelihood and prior. Rather, in order to do Bayesian inference, we really only need to specify the joint distribution, \(\pi(y,\theta)\), which is the product of the likelihood and prior.

\[\begin{align} g(\theta \mid y) = \frac{f(y\mid \theta)\,g(\theta)}{f(\theta)} = \frac{\pi(y, \theta)}{f(\theta)}. \end{align} \]

After defining the joint distribution, either by defining a likelihood and a prior separately (which is what we usually do) or by directly defining the joint, we should check to make sure that making draws of data sets \(y\) out of it give reasonable results. In other words, we need to ask if the data generation process we have prescribed actually produces data that obeys physical constraints and matches our intuition. The procedure of prior predictive checks enables this. We use the joint distribution to generate parameter sets and data sets and then check if the results make sense. We will learn about doing this using both Numpy and Stan in this lesson.

25.2 The experiment

We will do a very simple experiment in which we take a sample of retinal tissue and expose it to a constant light source. We measure the spiking activity of a single retinal ganglion cell (RGC) over a time interval. We will consider two models, one with Poissonian spiking and one with Inverse Gaussian-distributed spiking. We will build generative models for each, starting with the Poissonian model. We will use prior predictive checks to hone in on the priors.

In order to do the prior predictive checks, we will assume we measure spiking for 30 seconds.

25.3 Model 1: Spikes are Poissonian

For our first model, we assume that spikes arise from a Poisson process, such that the interspike intervals are exponentially distributed.

25.3.1 The likelihood

Let \(y\) be the set of observed interspike intervals. Then, our likelihood is i.i.d. Exponentially distributed interspike intervals with rate parameter \(\beta\). Our likelihood is then

\[\begin{align} y_i \sim \text{Expon}(\beta) \;\forall i. \end{align} \]

This makes clear that the parameter we need to estimate is \(\beta\), so we will need to provide a prior for it.

25.3.2 The prior

With our likelihood specified, we need to give a prior for \(\beta\), \(g(\beta)\). For me, it is easier to think about a time scale than a rate, so I will think about priors of \(\tau = 1/\beta\). So, I ask, “What are reasonable values of waiting times for spikes?” I know that refractory times can be a few milliseconds, and we may have very sluggish spiking, so I want a prior that has probability mass between, say 1 ms and 1 s. I could choose a Log Normal prior for \(\tau\), since my prior knowledge ranges over three orders of magnitude. The Log Normal distribution also has the added benefit that \(\beta\) is guaranteed to be positive, which we need to properly parametrize the likelihood.

I will use the trick that if I want 95% of probability mass of a Normally distributed parameter to lie between two bounds, I choose a location parameter as the midpoint between the bounds and the scale parameter as a quarter of the distance between the two bounds. In this case, I want \(\log_{10} \tau\) to lie between 0 and 3, where \(\tau\) is in units of milliseconds. So, I choose the following prior.

\[\begin{align} &\log_{10} \tau \sim \text{Normal}(1.5, 0.75),\\[1em] &\beta = 10^{- \log_{10} \tau}. \end{align} \]

I have done the egregious sin of taking the logarithm of a quantity with units, but we understand that \(\tau\) is in units of milliseconds and \(\beta\) is in units of inverse milliseconds.

So, we now have a complete model; the likelihood and prior, and therefore the joint distribution, are specified. Here it is (all units are ms):

\[\begin{align} &\log_{10} \tau \sim \text{Normal}(1.5, 0.75),\\[1em] &\beta = 10^{- \log_{10} \tau},\\[1em] &y_i \sim \text{Expon}(\beta) \;\forall i. \end{align} \]

25.3.3 Prior predictive checks

Let us now generate samples out of this generative model. This procedure is known as prior predictive checking. We first generate parameter values drawing out of the prior distribution for \(\beta\). We then use those parameter values to generate a data set using the likelihood. We repeat this over and over again to see what kind of data sets we might expect out of our generative model. Each of these generated data sets is called a prior predictive sample. We can do this efficiently using Numpy’s random number generators.

# Time of recording
T = 30_000  # ms

# Instantiate random number generator
rng = np.random.default_rng()

# Number of prior predictive check samples
n_ppc_samples = 1000

# Draw parameters out of the prior
log_tau = rng.normal(1.5, 0.75, size=n_ppc_samples)
tau = 10**log_tau
beta = 1 / tau

# Draw data sets out of the likelihood for each set of prior params
def draw_spikes(beta, T):
    tau = 1 / beta

    t = rng.exponential(tau)
    spikes = []
    while t < T:
        spikes.append(t)
        isi = rng.exponential(tau)
        t += isi

    return np.array(spikes)


spikes = [draw_spikes(b, T) for b in beta]

There are many ways we could visualize the results. One informative plot is to make an ECDF the ISIs for each of the data sets we generated.

p = None
for spike_vals in spikes[::20]:
    p = iqplot.ecdf(
        np.diff(spike_vals),
        p=p,
        line_kwargs=dict(line_alpha=0.5, line_width=1),
        x_axis_label="ISI (ms)",
    )

bokeh.io.show(p)

This is perhaps better viewed on a logarithmic scale.

p = None
for spike_vals in spikes[::20]:
    p = iqplot.ecdf(
        np.diff(spike_vals),
        p=p,
        line_kwargs=dict(line_alpha=0.5, line_width=1),
        x_axis_label="ISI (ms)",
        x_axis_type='log',
    )

bokeh.io.show(p)

This looks reasonable, except that we do get some very short (and indeed also very long) interspike intervals. We could hone the model to try to get a narrower set of ISIs if our domain knowledge dictates that. For me, I am content to have extra broad priors.

Another option for display is to plot percentiles of the ECDFs. We can do this using the bebi103.viz.predictive_ecdf() function. It expects input as a Numpy array of shape \(n_s \times N\), where \(n_s\) is the number of samples and \(N\) is the number of data points. We would have to collect the ISIs in a bit of a different manner, choosing a specific number of ISIs to observe, instead of a total time for observation. We can do that for illustrative purposes, drawing 1,000 ISIs for each prior sample of \(\beta\).

# Draw 1,000 ISIs
N = 1000
isis = np.array([rng.exponential(1 / b, size=N) for b in beta])

bokeh.io.show(bebi103.viz.predictive_ecdf(isis, x_axis_label="ISI (ms)", x_axis_type='log'))

In this plot, the median ECDF is shown in the middle, the darker blue filled region contains 68% of the samples, and the light blue contains 95% of the samples. The extent of the ECDF gives an indication of the extreme values in the prior predictive data set. The bulk of the ISIs lie in a reasonable region, somewhere between 1 and 100 ms.

25.4 Prior predictive checks with Stan

While generating the data sets to be used in prior predictive checks is intuitive and easy using Numpy, I often find it is more convenient to do it in Stan. This may not be obvious now, but will become clearer later when we construct an entire workflow including prior and posterior predictive checks and use simulation-based calibration.

For ease, we will not generate spike times for a total of 30 minutes, but will rather generate \(N\) interspike intervals (from which the spike times may be easily calculated post facto.

Generating prior predictive samples typically does not require Markov chain Monte Carlo, but the random number generation we are more familiar with. Stan does allow for this kind of random number generation. If you want to draw a sample out of one of Stan’s distributions, append the name of the distribution with _rng. Unlike Numpy’s and Scipy’s random number generators, Stan’s RNGs can only draw one sample at a time (though they may be vectorized with array inputs). Therefore, we have to put the random number generation in a for loop. The code below demonstrates this.

data {
  int<lower=1> N;
}


generated quantities {
  // Parameters
  real log_tau;
  real<lower=0> tau;
  real<lower=0> beta_;

  // Data
  array[N] real isi;

  log_tau = normal_rng(1.5, 0.75);
  tau = 10^log_tau;
  beta_ = 1.0 / tau;
  
  for (i in 1:N) {
    isi[i] = exponential_rng(beta_);
  }
}

The data block contains N, the number of ISI values we want to generate. The generated quantities block defines variables we want to store, in this case beta_ and isi. We first generate values for beta_ and then use those values to generate data isi. Note that though we are not using this syntax here, parameters in the generated quantities block that are enclosed in braces (outside of if statements and for loops, of course) are not saved in the output.

The above code will be useful for our prior predictive checks, but if we want to tweak some of the parametrizations of the priors (e.g., we might want \(\log_{10}\tau\sim \text{Norm}(1, 1)\) instead of \(\log_{10}\tau \sim \text{Norm}(1.5, 0.75)\)), we will have to adjust the Stan code and recompile. For the purposes of prior predictive checks, we may also want to include these values in the data block and pass them in as the data kwarg when sampling.

data {
  int<lower=1> N;
  real log_tau_mu;
  real<lower=0> log_tau_sigma;
}


generated quantities {
  // Parameters
  real log_tau;
  real<lower=0> tau;
  real<lower=0> beta_;

  // Data
  array[N] real isi;

  log_tau = normal_rng(log_tau_mu, log_tau_sigma);
  tau = 10^log_tau;
  beta_ = 1.0 / tau;
  
  for (i in 1:N) {
    isi[i] = exponential_rng(beta_);
  }
}

Let’s build this Stan model.

with bebi103.stan.disable_logging():
    sm_prior_pred = cmdstanpy.CmdStanModel(
        stan_file="poisson_spiking_prior_predictive.stan"
    )

To draw the samples, we specify the contents of the inputted data block, and then sample. We need to use the kwarg fixed_param=True to alert Stan that it will not need to do any MCMC sampling, but rather just run the generated quantities block. We will also use the chains=1 kwarg since we do not need to do parallel sampling.

data = {
    "N": N,
    "log_tau_mu": 1.5,
    "log_tau_sigma": 0.75,
}

with bebi103.stan.disable_logging():
    samples = sm_prior_pred.sample(
        data=data, iter_sampling=1000, fixed_param=True, chains=1, show_progress=False
    )

As usual, we would like to convert the output to an ArviZ InferenceData instance.

samples = az.from_cmdstanpy(prior=samples, prior_predictive=['isi'])

We can pass the array stored in samples.prior_predictive['isi'] directly into the bebi103.viz.predictive_ecdf() function to get the predictive ECDF. (We can do this because we only used one chain. If we had more chains, we would have to reshape the samples into a 2D array, where the row index is the sample and the column index is the index of the data.)

bokeh.io.show(
    bebi103.viz.predictive_ecdf(
        samples.prior_predictive['isi'].squeeze(),
        x_axis_type='log',
        x_axis_label='ISI (ms)',
    )
)

The result is of course the same as when generating data sets with Numpy.

As you can see, it is a bit more verbose to use Stan to generate the prior predictive samples. The calculation time is also substantially longer because of the time required for compilation. Nonetheless, it is often advantageous to use Stan for this stage of an analysis pipeline because when we do simulation based calibration (SBC) of our models, it is useful to have everything written in Stan.

25.5 Model 2: Spikes are Inverse Gaussian distributed

We leave building this model with prior predictive checks as an exercise.

bebi103.stan.clean_cmdstan()

25.6 Computing environment

%load_ext watermark
%watermark -v -p numpy,scipy,cmdstanpy,arviz,bokeh,iqplot,bebi103,jupyterlab
print("cmdstan   :", bebi103.stan.cmdstan_version())

Python implementation: CPython
Python version       : 3.12.11
IPython version      : 9.1.0

numpy     : 2.1.3
scipy     : 1.15.3
cmdstanpy : 1.2.5
arviz     : 0.21.0
bokeh     : 3.6.2
iqplot    : 0.3.7
bebi103   : 0.1.27
jupyterlab: 4.3.7

cmdstan   : 2.36.0