43  Variate-covariate models with MCMC

Open in Google Colab | Download notebook

Data set download

# Colab setup ------------------
import os, shutil, sys, subprocess, urllib.request
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade polars colorcet datashader bebi103 arviz cmdstanpy watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    from cmdstanpy.install_cmdstan import latest_version
    cmdstan_version = latest_version()
    cmdstan_url = f"https://github.com/stan-dev/cmdstan/releases/download/v{cmdstan_version}/"
    fname = f"colab-cmdstan-{cmdstan_version}.tgz"
    urllib.request.urlretrieve(cmdstan_url + fname, fname)
    shutil.unpack_archive(fname)
    os.environ["CMDSTAN"] = f"./cmdstan-{cmdstan_version}"
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------
import numpy as np
import polars as pl

import cmdstanpy
import arviz as az

import iqplot
import bebi103

import bokeh.io
bokeh.io.output_notebook()

In this lesson, we will perform parameter estimation using a variate-covariate model. We will use the data set from Good et al. (2013), which investigates how the length of mitotic spindles varies with the size of the encapsulating droplet.

We start by loading in the data set and making a quick plot to remind ourselves of its structure.

# Load in Data Frame
df = pl.read_csv(os.path.join(data_path, "good_invitro_droplet_data.csv"), comment_prefix="#")

# Pull out numpy arrays
ell = df["Spindle Length (um)"].to_numpy()
d = df["Droplet Diameter (um)"].to_numpy()

# Make a plot
p = bokeh.plotting.figure(
    frame_height=200,
    frame_width=300,
    x_axis_label="droplet diameter (µm)",
    y_axis_label="spindle length (µm)",
    x_range=[0, 250],
    y_range=[0, 50],
)

p.scatter(
    source=df.to_dict(as_series=False),
    x="Droplet Diameter (um)",
    y="Spindle Length (um)",
    alpha=0.3,
)

bokeh.io.show(p)

43.1 Full generative model

We have previously worked out a likelihood for the conserved tubulin model. In this model, the spindle length \(l\) is a function of droplet diameter \(d\) according to

\[\begin{align} l(d;\gamma, \phi) = \frac{\gamma d}{\left(1+(\gamma d/\phi)^3\right)^{\frac{1}{3}}}. \end{align}\]
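As a quick sanity check on this expression, we can code up the theoretical curve in Python and verify its limiting behavior: for small droplets, \(l \approx \gamma d\), and for large droplets \(l\) plateaus at \(\phi\). (The `spindle_length` function below is a helper written here for illustration, not part of the lesson's code.)

```python
import numpy as np


def spindle_length(d, gamma, phi):
    """Theoretical spindle length for the conserved tubulin model."""
    return gamma * d / (1 + (gamma * d / phi) ** 3) ** (1 / 3)


# Limiting behavior: linear for small d, plateau at phi for large d
gamma, phi = 0.86, 38.0
print(spindle_length(1.0, gamma, phi))  # ≈ gamma * 1.0 = 0.86
print(spindle_length(1e6, gamma, phi))  # ≈ phi = 38.0
```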

We will model the variation off of this curve as Normally distributed (assuming homoscedasticity), such that our likelihood is defined as

\[\begin{align} l_i \mid d_i, \gamma, \phi, \sigma \sim \text{Norm}(l(d_i;\gamma, \phi), \sigma) \;\forall i. \end{align}\]

We are now left to choose priors to complete the model. We will build the prior for the three parameters \(\phi\), \(\gamma\), and \(\sigma\), taking each to be independent of the others. Let’s start with \(\sigma\). It is possible that the spindle size is very carefully controlled. It is also possible that it could be highly variable. So, we choose a Half-Normal prior for \(\sigma\) with a large scale parameter of 10 µm; \(\sigma \sim \text{HalfNorm}(10)\). For \(\gamma\), we know that physically it must be between zero and one (as we have worked out in previous lessons), so we will assume a prior that is close to uniform on that domain, but with the probability density dropping right at zero and one; \(\gamma \sim \text{Beta}(1.1, 1.1)\). For \(\phi\), the typical largest spindle size, we assume the smallest spindle to be about a micron (right where I would start to feel uncomfortable betting the farm) and the largest spindle to be about a millimeter (the size of a very large cell, like a Xenopus egg). We therefore take \(\log_{10}\phi \sim \text{Norm}(1.5, 0.75)\), where \(\phi\) is in units of microns.

We thus have our complete generative model.

\[\begin{align} &\log_{10} \phi \sim \text{Norm}(1.5, 0.75),\\[1em] &\gamma \sim \text{Beta}(1.1, 1.1), \\[1em] &\sigma \sim \text{HalfNorm}(10),\\[1em] &\mu_i = \frac{\gamma d_i}{\left(1+(\gamma d_i/\phi)^3\right)^{\frac{1}{3}}}, \\[1em] &l_i \mid d_i, \gamma, \phi, \sigma \sim \text{Norm}(\mu_i, \sigma) \;\forall i. \end{align}\]
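Before setting up the sampler, it can be useful to draw parameter sets out of these priors with Numpy to confirm the scales are sensible. This is a quick sketch, not part of the lesson's workflow; note that Half-Normal draws are obtained by taking absolute values of Normal draws.

```python
import numpy as np

rng = np.random.default_rng(3252)

# Draw parameter sets out of the priors
n = 1000
log10_phi = rng.normal(1.5, 0.75, size=n)
phi = 10 ** log10_phi
gamma = rng.beta(1.1, 1.1, size=n)
sigma = np.abs(rng.normal(0, 10, size=n))  # Half-Normal via folding

# Prior predictive spindle lengths for a hypothetical 50 µm droplet
d = 50.0
mu = gamma * d / (1 + (gamma * d / phi) ** 3) ** (1 / 3)
ell_prior = rng.normal(mu, sigma)
```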

43.2 Using Stan to sample

The Stan code we will use is below.

functions {
  real ell_theor(real d, real phi, real gamma_) {
    real denom_ratio = (gamma_ * d / phi)^3;
    return gamma_ * d / cbrt(1 + denom_ratio);
  }
}


data {
  int N;
  int N_ppc;
  array[N] real d;
  array[N_ppc] real d_ppc;
  array[N] real ell;
}


parameters {
  real log10_phi;
  real<lower=0, upper=1> gamma_;
  real<lower=0> sigma;
}


transformed parameters {
  real phi = 10 ^ log10_phi;

  array[N] real mu;

  for (i in 1:N) {
    mu[i] = ell_theor(d[i], phi, gamma_);
  }
}


model {
  // Priors
  log10_phi ~ normal(1.5, 0.75);
  gamma_ ~ beta(1.1, 1.1);
  sigma ~ normal(0.0, 10.0);

  // Likelihood
  ell ~ normal(mu, sigma);
}


generated quantities {
  array[N_ppc] real ell_ppc;

  for (i in 1:N_ppc) {
    real mu_ppc = ell_theor(d_ppc[i], phi, gamma_);
    ell_ppc[i] = normal_rng(mu_ppc, sigma);
  }
}

I pause to point out a few syntactical elements.

  1. Note how I used gamma_ as the variable name for gamma. This is because gamma is used as a distribution name in Stan.
  2. I have used the functions block in this Stan program. The definition of the function is real ell_theor(real d, real phi, real gamma_) { }, where the code to be executed in the function goes between the braces. This is how you declare a function in Stan. The first word, real, specifies the data type of the value the function returns. Then comes the name of the function, followed by its arguments in parentheses, each preceded by its data type.
  3. Within the function, I had to declare denom_ratio to be real. Remember, Stan is statically typed, so every variable needs a declaration of its type.
  4. The return statement is like Python; return followed by a space and then an expression for the value to be returned.
  5. There is no Half-Normal distribution in Stan. It is implemented by putting a lower bound on the variable (in this case real<lower=0> sigma;) and then modeling sigma as Normally distributed with location parameter zero.
  6. We computed an array of mu values in the transformed parameters block, which enabled us to use the simpler sampling statement of ell ~ normal(mu, sigma).
  7. We also specify the values of the covariate d for which we want posterior predictive values of the spindle length. These do not need to be exactly the same values as in the experimental observations; they simply need to span the range of the experimental observations.
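To illustrate point 5 above, the Half-Normal construction can be checked numerically: folding Normal(0, 10) draws about zero yields draws whose mean matches the Half-Normal mean of \(10\sqrt{2/\pi} \approx 7.98\) µm. This is a quick Numpy check, not part of the lesson's code.

```python
import numpy as np

rng = np.random.default_rng(3252)

# Folding Normal(0, 10) draws about zero gives Half-Normal(10) draws
draws = np.abs(rng.normal(0, 10, size=100_000))

print(draws.mean())  # ≈ 10 * sqrt(2 / pi) ≈ 7.98
```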

All right! Let’s do some sampling!

# Covariate values for posterior predictive checks
N_ppc = 200
d_ppc = np.linspace(0.1, 250, N_ppc)

# Package the data for Stan
data = {
    "N": len(ell),
    "d": d,
    "ell": ell,
    "N_ppc": N_ppc,
    "d_ppc": d_ppc,
}

with bebi103.stan.disable_logging():
    # Compile the model
    sm = cmdstanpy.CmdStanModel(stan_file='cons_tubulin_model.stan')

    # Sample from the posterior
    samples = sm.sample(
        data=data,
        chains=4,
        iter_sampling=1000
    )

# Convert to ArviZ InferenceData
samples = az.from_cmdstanpy(samples, posterior_predictive='ell_ppc')

With samples in hand, let’s make a plot of the posterior samples.

bokeh.io.show(
    bebi103.viz.corner(
        samples,
        parameters=[("phi", "ϕ (µm)"), ("gamma_", "γ"), ("sigma", "σ (µm)")],
        xtick_label_orientation=np.pi/4,
    )
)

The corner plot provides a complete picture of the posterior and allows for direct assessment of credible regions. Apparently, we have \(\phi \approx 38.25\) µm, \(\gamma \approx 0.86\), and \(\sigma \approx 3.75\) µm.
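The point estimates quoted above can be read off numerically as posterior medians, and a generic way to summarize any one-dimensional marginal is with percentiles of the draws. Below is a sketch using synthetic Normal draws as a stand-in for the actual posterior samples; with real samples you would instead flatten the draws, e.g. with samples.posterior["phi"].values.ravel().

```python
import numpy as np

rng = np.random.default_rng(3252)

# Synthetic stand-in for flattened posterior draws of phi (4 chains × 1000 draws)
phi_draws = rng.normal(38.25, 0.5, size=4000)

# Posterior median and a 95% credible interval from percentiles
median = np.percentile(phi_draws, 50)
ci_low, ci_high = np.percentile(phi_draws, [2.5, 97.5])
```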

43.3 Posterior predictive checks

Let’s make a plot for a graphical posterior predictive check. Because we have a covariate, we want to plot the values we might get for the variate (spindle length) for various values of the covariate (droplet diameter). We can make such a plot using the bebi103.viz.predictive_regression() function.

ell_ppc = (
    samples.posterior_predictive["ell_ppc"]
    .stack({"sample": ("chain", "draw")})
    .transpose("sample", "ell_ppc_dim_0")
)

bokeh.io.show(
    bebi103.viz.predictive_regression(
        ell_ppc,
        samples_x=d_ppc,
        percentiles=[30, 50, 70, 99],
        data=np.vstack((d, ell)).transpose(),
        x_axis_label="droplet diameter [µm]",
        y_axis_label="spindle length [µm]",
        x_range=[0, 250],
    )
)

Indeed, the observed measurements are consistent with the model.

bebi103.stan.clean_cmdstan()

43.4 Computing environment

%load_ext watermark
%watermark -v -p numpy,polars,cmdstanpy,arviz,bokeh,bebi103,jupyterlab
print("cmdstan   :", bebi103.stan.cmdstan_version())
Python implementation: CPython
Python version       : 3.13.5
IPython version      : 9.4.0

numpy     : 2.2.6
polars    : 1.31.0
cmdstanpy : 1.2.5
arviz     : 0.22.0
bokeh     : 3.7.3
bebi103   : 0.1.28
jupyterlab: 4.4.5

cmdstan   : 2.36.0