import os

import numpy as np
import polars as pl

import scipy.optimize
import scipy.stats as st

import statsmodels.tools.numdiff as smnd

import cmdstanpy
import arviz as az

import iqplot

import bebi103

import bokeh.io
bokeh.io.output_notebook()
In ?exr-gamma-spike, you considered a model for spiking in which interspike intervals are Gamma distributed. We will use that model here to demonstrate optimization procedures for MAP finding.
36.1 Exploratory data analysis
We’ll again make a quick ECDF of the data set as a reminder.
# Load in as dataframe
df = pl.read_csv(os.path.join(data_path, 'rgc_spike_times_1.csv'))

# Spike times in milliseconds for convenience
spike_times = df['spike time (ms)'].to_numpy()

# Interspike intervals
y = np.concatenate(((spike_times[0],), np.diff(spike_times)))

# Make ECDF
bokeh.io.show(
    iqplot.ecdf(y, 'interspike interval (ms)')
)
36.2 The model
We will model the interspike intervals as being Gamma distributed with broad priors for the Gamma distribution parameters. Our model is as follows.
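Written out to match the Stan code below, with Half-Normal priors on the (positive) parameters and a Gamma likelihood for the interspike intervals, the model is

\begin{align}
&\alpha \sim \text{HalfNorm}(0, 100), \\
&\beta \sim \text{HalfNorm}(0, 100), \\
&y_i \sim \text{Gamma}(\alpha, \beta) \;\;\forall i.
\end{align}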
We will be doing optimization, but to have a gold standard for comparison, we can code this model up in Stan as follows, modifying the Stan code we wrote in Chapter 26 (which is essentially what you did in ?exr-gamma-spike).
data {
  int<lower=2> n;
  array[n] real spike_times;
}


transformed data {
  // Sorted spike times
  array[n] real t = sort_asc(spike_times);

  // Interspike intervals
  array[n] real y;
  y[1] = t[1];
  for (i in 2:n) {
    y[i] = t[i] - t[i-1];
  }
}


parameters {
  real<lower=0> alpha;
  real<lower=0> beta_;
}


model {
  // Priors
  alpha ~ normal(0, 100);
  beta_ ~ normal(0, 100);

  // Likelihood
  y ~ gamma(alpha, beta_);
}
Let’s compile, sample, and make a corner plot to get our standard of comparison.
data = dict(n=len(y), spike_times=spike_times)

with bebi103.stan.disable_logging():
    sm = cmdstanpy.CmdStanModel(stan_file='gamma_spike.stan')
    samples = az.from_cmdstanpy(sm.sample(data=data))

corner = bebi103.viz.corner(samples)

bokeh.io.show(corner)
36.3 Obtaining MAP parameters by optimization
Finding the MAP amounts to an optimization problem in which we find the arguments (parameters) for which a function (the posterior distribution) is maximal. It is almost always much easier to find the maximum for the logarithm of the posterior. In fact, in the vast majority of Bayesian inference problems, we work with log posteriors instead of the posteriors themselves. Since the logarithm function is monotonic, a maximum of the log posterior is also a maximum of the posterior. So, our first step is to code up a function that returns the log posterior.
In considering a log posterior, it is useful to know that the distributions in the scipy.stats module provide logpdf() functions, so we do not need to hand-code the log PDFs of the prior and likelihood ourselves.
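The cell defining the log posterior does not appear above, so here is a minimal sketch of how it might be coded up, assuming Half-Normal(0, 100) priors to match the Stan model and using scipy.stats log PDFs. The helper names log_prior() and log_likelihood() are illustrative; log_posterior() is the function referenced in the rest of this section.

def log_prior(params):
    """Log prior for the Gamma ISI model: Half-Normal(0, 100) on α and β."""
    alpha, beta = params

    # Hard bounds: both parameters must be positive
    if alpha <= 0 or beta <= 0:
        return -np.inf

    return st.halfnorm.logpdf(alpha, 0, 100) + st.halfnorm.logpdf(beta, 0, 100)


def log_likelihood(params, y):
    """Log likelihood for Gamma-distributed interspike intervals."""
    alpha, beta = params

    # scipy.stats parametrizes the Gamma distribution by shape and scale = 1/β
    return np.sum(st.gamma.logpdf(y, alpha, loc=0, scale=1/beta))


def log_posterior(params, y):
    """Log posterior of the Gamma ISI model (up to an additive constant)."""
    lp = log_prior(params)

    if lp == -np.inf:
        return lp

    return lp + log_likelihood(params, y)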
Now that we have the log posterior, we can perform numerical optimization. The function scipy.optimize.minimize() offers many options for algorithms, which you can read about in the documentation. I find that Powell’s method tends to work well. It does not rely on derivative information, so discontinuities don’t hurt, nor do the \(-\infty\) values we get in the log posterior for parameter values that are out of bounds. That is particularly important for the present example because we have hard bounds on \(\alpha\) and \(\beta\); they must be positive. We could use constrained optimizers such as L-BFGS-B or COBYLA, but Powell’s method generally suffices. It is sometimes slow, though.
As an alternative, we could instead choose transformed parameters for \(\alpha\) and \(\beta\) that are unbounded. For example, we could let \(a = \log \alpha\) and \(b = \log \beta\) and find the optimal values of \(a\) and \(b\), which are unbounded. We could then use a more efficient unbounded optimizer like BFGS. For simplicity, we will just use Powell’s method, which is slower, but handles constraints without the need for transformations.
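For illustration only, a sketch of that alternative approach might look like the following. It wraps the log_posterior() function above, evaluating it at \(\alpha = \mathrm{e}^a\) and \(\beta = \mathrm{e}^b\); no Jacobian adjustment is needed, since we only care about the location of the maximum in the original parametrization. The helper name neg_log_posterior_logspace() is hypothetical.

def neg_log_posterior_logspace(log_params, y):
    """Negative log posterior evaluated at α = exp(a), β = exp(b)."""
    return -log_posterior(np.exp(log_params), y)


# Unconstrained optimization with BFGS in log-parameter space
res_log = scipy.optimize.minimize(
    neg_log_posterior_logspace, np.log([1.0, 0.04]), args=(y,), method="BFGS"
)

# Convert back to α and β
alpha_MAP_bfgs, beta_MAP_bfgs = np.exp(res_log.x)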
The optimizers offered by scipy.optimize.minimize() find minimizers, so we need to define our objective function as the negative log posterior. Since this is the objective function that will be passed into scipy.optimize.minimize(), we need to make sure it has the proper call signature. Specifically, the parameters that are being optimized, in this case \(\alpha\) and \(\beta\), need to be passed in as an array. All other arguments to the negative log posterior are passed in as separate arguments. In our case, that is the ISIs, y. (Note that we also built the log_posterior() function in this way, which is useful for when we compute the Hessian later.)
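The cell defining the negative log posterior is not shown above; consistent with the call signature just described, a minimal version would be

def neg_log_posterior(params, y):
    """Negative log posterior, the objective function to minimize."""
    return -log_posterior(params, y)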
Next, we have to provide an initial guess for parameter values. The optimization routine only finds a local maximum of the posterior and is not in general guaranteed to converge. Therefore, the initial guess can be very important. Looking at the data, we would expect \(\alpha\) to be somewhere around 1 and \(\beta\) to be around 40 Hz, which is 0.04 kHz since the interspike intervals are in units of milliseconds.
# Initial guess of alpha and beta
params_0 = np.array([1, 0.04])
We are now ready to use scipy.optimize.minimize() to compute the MAP. We use the args kwarg to pass in the other arguments to the neg_log_posterior() function. In our case, these arguments are the data points. The scipy.optimize.minimize() function returns an object with several attributes describing the result of the optimization. Most important is x, the optimal parameter values.
# Specify extra arguments into posterior function
args = (y,)

# Compute the MAP
res = scipy.optimize.minimize(
    neg_log_posterior, params_0, args=args, method="powell"
)

# Extract optimal parameters
popt = res.x

# For convenience...
alpha_MAP, beta_MAP = popt

# Print results
print("""
Most probable parameters
------------------------
α = {0:.2f}
β = {1:.4f} kHz
""".format(alpha_MAP, beta_MAP))
/Users/bois/miniconda3/envs/datasai/lib/python3.12/site-packages/scipy/optimize/_optimize.py:2502: RuntimeWarning: invalid value encountered in scalar multiply
tmp2 = (x - v) * (fx - fw)
We have successfully found the MAP, which gives us just that: the most probable parameter values. This tells us only one piece of information about the posterior, namely where it is maximal (and therefore where all of its first derivatives vanish). We want to learn more about the posterior, at least near the MAP. To that end, we can compute all of the second derivatives of the log posterior at the MAP. Using these second derivatives (the Hessian), we can start to understand the posterior, at least locally, by approximating it as Normal.
36.3.1 Normal approximation of the posterior
To compute the Normal approximation of the posterior, our task is two-fold. First, find the parameter values that give the maximal posterior probability, which we have already done. Second, compute the Hessian of the log posterior at the MAP and take the negative of its inverse to get the covariance matrix. Note that the covariance matrix is not obtained by inverting each element of the Hessian; it is the inverse of the Hessian matrix as a whole. The approximate credible interval for each parameter can then be calculated from the diagonal elements of the covariance matrix.
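In symbols, if \(\theta^*\) denotes the MAP parameter vector and \(\mathsf{H}\) the Hessian of the log posterior evaluated at \(\theta^*\), then the covariance matrix of the approximation is \(\mathsf{\Sigma} = -\mathsf{H}^{-1}\), and near the MAP the posterior \(g(\theta \mid y)\) is approximated as

\begin{align}
g(\theta \mid y) \approx \frac{1}{\sqrt{(2\pi)^k \det \mathsf{\Sigma}}}\,\exp\left[-\frac{1}{2}\,(\theta - \theta^*)^\mathsf{T}\, \mathsf{\Sigma}^{-1}\, (\theta - \theta^*)\right],
\end{align}

where \(k\) is the number of parameters, here \(k = 2\) for \(\alpha\) and \(\beta\).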
To compute the covariance matrix, we need to compute the Hessian of the log of the posterior at the MAP. We will do this numerically using the statsmodels.tools.numdiff module, which we imported as smnd. We use the function smnd.approx_hess() to compute the Hessian at the optimal parameter values. Note that since we already have the log posterior coded up and the MAP parameter values, we can just directly shove these into the numerical Hessian calculator.
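The cell performing this calculation is not shown; a minimal sketch, reusing log_posterior(), popt, and args from above, might look like

# Compute the Hessian of the log posterior at the MAP
hes = smnd.approx_hess(popt, log_posterior, args=args)

# Covariance matrix is the negative inverse of the Hessian
cov = -np.linalg.inv(hes)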
The diagonal terms give the approximate posterior variances of the parameters. The off-diagonal terms give the covariances, which describe how the parameters are related to each other; nonzero covariance indicates that they are not completely independent.
36.3.2 Credible intervals
We can compute approximate 95% credible intervals for the parameters by multiplying the square roots of the diagonal terms of the posterior covariance matrix \(\mathsf{\Sigma}\) (the approximate standard deviations) by 1.96.
# Report results
print("""
Credible intervals (≈ 95% of total probability density)
-------------------------------------------------------
α = {0:.2f} ± {1:.2f}
β = {2:.4f} ± {3:.4f} kHz
""".format(alpha_MAP, 1.96*np.sqrt(cov[0,0]), beta_MAP, 1.96*np.sqrt(cov[1,1])))
Credible intervals (≈ 95% of total probability density)
-------------------------------------------------------
α = 1.41 ± 0.09
β = 0.0337 ± 0.0027 kHz
36.3.3 How good is the approximation?
To assess how close the Normal approximation is to the posterior, let’s plot the Normal approximation and the exact (MCMC-sampled) marginalized posteriors together on the corner plot. We can use the scipy.stats.multivariate_normal.pdf() function to generate the PDF for the Normal approximation.

The calculation will take a while, since we have to evaluate the approximate posterior over a whole grid of parameter values in the neighborhood of the MAP in order to plot it. The whole reason we used optimization in the first place was to avoid this kind of calculation! But we are doing it here to visualize the Normal approximation.

# Extract plots
alpha_p = corner.children[1].children[0].children[0]
beta_p = corner.children[1].children[1].children[1]
sample_p = corner.children[1].children[1].children[0]

# Set up ranges for plots; should be near MAP
alpha = np.linspace(1.3, 1.55, 200)
beta = np.linspace(0.03, 0.038, 200)

# Compute approximate normal posterior
post_norm = np.empty((len(alpha), len(beta)))
for i in range(len(alpha)):
    for j in range(len(beta)):
        post_norm[i, j] = st.multivariate_normal.pdf(
            (alpha[i], beta[j]), popt, cov
        )

# Approximate marginal posteriors
post_alpha_norm = st.norm.pdf(alpha, popt[0], np.sqrt(cov[0, 0]))
post_beta_norm = st.norm.pdf(beta, popt[1], np.sqrt(cov[1, 1]))

# Add plots
sample_p = bebi103.viz.contour(
    alpha,
    beta,
    post_norm,
    line_kwargs=dict(line_color="orange"),
    p=sample_p,
)

alpha_p.line(alpha, post_alpha_norm, line_color='orange', line_width=2)
beta_p.line(beta, post_beta_norm, line_color='orange', line_width=2)

# Look!
bokeh.io.show(corner)
The Normal approximation, shown in orange, is not that far off from the posterior, especially near the peak. This is not always the case! This is one of the main dangers of using optimization: you are using very local information to summarize a posterior that may have relevant features away from the MAP. And even though the marginal distributions are close, the full posterior is slightly rotated relative to the approximation; we do not completely capture the covariance present in the posterior.
36.4 Optimization with Stan
Stan does have functionality for finding the MAP (docs). To find a MAP, use the sm.optimize() method of the compiled model. By default, Stan will transform the parameters appropriately and use the L-BFGS algorithm. The optimized parameters can then be accessed as a dictionary using the optimized_params_dict attribute of the output.
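The cell making this call is not shown; a minimal sketch, reusing the sm model and data dictionary from the sampling step above, might look like

# Find the MAP with Stan's optimizer
with bebi103.stan.disable_logging():
    map_res = sm.optimize(data=data)

# Dictionary of optimized parameters (plus lp__, the log posterior at the MAP)
print(map_res.optimized_params_dict)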
The lp__ entry in the dictionary is not an optimized parameter value; it is the value of the log posterior at the MAP. We see that Stan gave us the same MAP as we got using SciPy. Note, though, that Stan does not compute the Hessian for us. To do that, which enables us to get credible regions, we need to code up the log posterior and numerically differentiate it. Since we need to code that up anyway, I usually use SciPy for MAP finding. SciPy also has more optimizers and settings available.