import os

import numpy as np
import polars as pl

import scipy.optimize
import scipy.stats as st

import statsmodels.tools.numdiff as smnd

import cmdstanpy
import arviz as az

import iqplot

import bebi103

import bokeh.io
bokeh.io.output_notebook()
In ?exr-gamma-spike, you considered a model for spiking in which interspike intervals are Gamma distributed. We will use that model here to demonstrate optimization procedures for MAP finding.
36.1 Exploratory data analysis
We’ll again make a quick ECDF of the data set as a reminder.
# Load in as dataframe
df = pl.read_csv(os.path.join(data_path, 'rgc_spike_times_1.csv'))

# Spike times in milliseconds for convenience
spike_times = df['spike time (ms)'].to_numpy()

# Interspike intervals
y = np.concatenate(((spike_times[0],), np.diff(spike_times)))

# Make ECDF
bokeh.io.show(
    iqplot.ecdf(y, 'interspike interval (ms)')
)
36.2 The model
We will model the interspike intervals as being Gamma distributed with broad priors for the Gamma distribution parameters. Our model is as follows.
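Written out to match the Stan code below, with Half-Normal priors on the (positive) parameters and a Gamma likelihood for the interspike intervals, the model is

\begin{align}
&\alpha \sim \text{HalfNorm}(0, 100), \\
&\beta \sim \text{HalfNorm}(0, 100), \\
&y_i \sim \text{Gamma}(\alpha, \beta) \;\;\forall i.
\end{align}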
We will be doing optimization, but to have a gold standard for comparison, we can code this model up in Stan as follows, modifying the Stan code we wrote in Chapter 26 (which is essentially what you did in ?exr-gamma-spike).
data {
  int<lower=2> n;
  array[n] real spike_times;
}


transformed data {
  // Sorted spike times
  array[n] real t = sort_asc(spike_times);

  // Interspike intervals
  array[n] real y;
  y[1] = t[1];
  for (i in 2:n) {
    y[i] = t[i] - t[i-1];
  }
}


parameters {
  real<lower=0> alpha;
  real<lower=0> beta_;
}


model {
  // Priors
  alpha ~ normal(0, 100);
  beta_ ~ normal(0, 100);

  // Likelihood
  y ~ gamma(alpha, beta_);
}
Let’s compile, sample, and make a corner plot to get our standard of comparison.
data = dict(n=len(y), spike_times=spike_times)

with bebi103.stan.disable_logging():
    sm = cmdstanpy.CmdStanModel(stan_file='gamma_spike.stan')
    samples = az.from_cmdstanpy(sm.sample(data=data))

corner = bebi103.viz.corner(samples)

bokeh.io.show(corner)
36.3 Obtaining MAP parameters by optimization
Finding the MAP amounts to an optimization problem in which we find the arguments (parameters) for which a function (the posterior distribution) is maximal. It is almost always much easier to find the maximum for the logarithm of the posterior. In fact, in the vast majority of Bayesian inference problems, we work with log posteriors instead of the posteriors themselves. Since the logarithm function is monotonic, a maximum of the log posterior is also a maximum of the posterior. So, our first step is to code up a function that returns the log posterior.
In considering a log posterior, it is useful to know that the distributions in the scipy.stats module provide logpdf() functions, so we do not need to hand-code the log PDFs of the prior and likelihood ourselves.
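The cell defining the log posterior does not appear above, so here is a minimal sketch of how it might be coded up, assuming Half-Normal(0, 100) priors to match the Stan model and using scipy.stats log PDFs. The helper names log_prior() and log_likelihood() are illustrative; log_posterior() is the function referenced in the rest of this section.

def log_prior(params):
    """Log prior for the Gamma ISI model: Half-Normal(0, 100) on α and β."""
    alpha, beta = params

    # Hard bounds: both parameters must be positive
    if alpha <= 0 or beta <= 0:
        return -np.inf

    return st.halfnorm.logpdf(alpha, 0, 100) + st.halfnorm.logpdf(beta, 0, 100)


def log_likelihood(params, y):
    """Log likelihood for Gamma-distributed interspike intervals."""
    alpha, beta = params

    # scipy.stats parametrizes the Gamma distribution by shape and scale = 1/β
    return np.sum(st.gamma.logpdf(y, alpha, loc=0, scale=1/beta))


def log_posterior(params, y):
    """Log posterior of the Gamma ISI model (up to an additive constant)."""
    lp = log_prior(params)

    if lp == -np.inf:
        return lp

    return lp + log_likelihood(params, y)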
Now that we have the log posterior, we can perform numerical optimization. The function scipy.optimize.minimize() offers many options for algorithms, which you can read about in the documentation. I find that Powell’s method tends to work well. It does not rely on derivative information, so discontinuities don’t hurt, nor do the \(-\infty\) values we get in the log posterior for parameter values that are out of bounds. That is particularly important for the present example because we have hard bounds on \(\alpha\) and \(\beta\); they must be positive. We could use constrained optimizers such as L-BFGS-B or COBYLA, but Powell’s method generally suffices. It is sometimes slow, though.
As an alternative, we could instead choose transformed parameters for \(\alpha\) and \(\beta\) that are unbounded. For example, we could let \(a = \log \alpha\) and \(b = \log \beta\) and find the optimal values of \(a\) and \(b\), which are unbounded. We could then use a more efficient unbounded optimizer like BFGS. For simplicity, we will just use Powell’s method, which is slower, but handles constraints without the need for transformations.
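For illustration only, a sketch of that alternative approach might look like the following. It wraps the log_posterior() function above, evaluating it at \(\alpha = \mathrm{e}^a\) and \(\beta = \mathrm{e}^b\); no Jacobian adjustment is needed, since we only care about the location of the maximum in the original parametrization. The helper name neg_log_posterior_logspace() is hypothetical.

def neg_log_posterior_logspace(log_params, y):
    """Negative log posterior evaluated at α = exp(a), β = exp(b)."""
    return -log_posterior(np.exp(log_params), y)


# Unconstrained optimization with BFGS in log-parameter space
res_log = scipy.optimize.minimize(
    neg_log_posterior_logspace, np.log([1.0, 0.04]), args=(y,), method="BFGS"
)

# Convert back to α and β
alpha_MAP_bfgs, beta_MAP_bfgs = np.exp(res_log.x)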
The optimizers offered by scipy.optimize.minimize() find minimizers, so we need to define our objective function as the negative log posterior. Since this is the objective function that will be passed into scipy.optimize.minimize(), we need to make sure it has the proper call signature. Specifically, the parameters that are being optimized, in this case \(\alpha\) and \(\beta\), need to be passed in as an array. All other arguments to the negative log posterior are passed in as separate arguments. In our case, that is the ISIs, y. (Note that we also built the log_posterior() function in this way, which is useful for when we compute the Hessian later.)
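The cell defining the negative log posterior is not shown above; consistent with the call signature just described, a minimal version would be

def neg_log_posterior(params, y):
    """Negative log posterior, the objective function to minimize."""
    return -log_posterior(params, y)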
Next, we have to provide an initial guess for parameter values. The optimization routine only finds a local maximum of the posterior and is not in general guaranteed to converge. Therefore, the initial guess can be very important. Looking at the data, we would expect \(\alpha\) to be somewhere around 1 and \(\beta\) to be around 40 Hz, which is 0.04 kHz since the interspike intervals are in units of milliseconds.
# Initial guess of alpha and beta
params_0 = np.array([1, 0.04])
We are now ready to use scipy.optimize.minimize() to compute the MAP. We use the args kwarg to pass in the other arguments to the neg_log_posterior() function. In our case, these arguments are the data points. The scipy.optimize.minimize() function returns an object with several attributes describing the result of the optimization. Most important is x, the optimal parameter values.
# Specify extra arguments into posterior function
args = (y,)

# Compute the MAP
res = scipy.optimize.minimize(
    neg_log_posterior, params_0, args=args, method="powell"
)

# Extract optimal parameters
popt = res.x

# For convenience...
alpha_MAP, beta_MAP = popt

# Print results
print("""
Most probable parameters
------------------------
α = {0:.2f}
β = {1:.4f} kHz
""".format(alpha_MAP, beta_MAP))
/Users/bois/miniconda3/envs/datasai/lib/python3.12/site-packages/scipy/optimize/_optimize.py:2502: RuntimeWarning: invalid value encountered in scalar multiply
tmp2 = (x - v) * (fx - fw)
We have successfully found the MAP, which gives us just that: the most probable parameter values. This tells us only one piece of information about the posterior, namely where it is maximal (and therefore where all of its first derivatives vanish). We want to learn more about the posterior, at least near the MAP. To that end, we can compute all of the second derivatives of the log posterior at the MAP. Using these second derivatives (the Hessian), we can start to understand the posterior, at least locally, by approximating it as Normal.
36.3.1 Normal approximation of the posterior
To compute the Normal approximation of the posterior, our task is two-fold. First, find the parameter values that give the maximal posterior probability, which we have already done. Second, compute the Hessian of the log posterior at the MAP and take the negative of its inverse to get the covariance matrix. Note that the covariance matrix is not obtained by inverting each element of the Hessian; it is the inverse of the Hessian matrix as a whole. The approximate credible interval for each parameter can then be calculated from the diagonal elements of the covariance matrix.
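In symbols, if \(\theta^*\) denotes the MAP parameter vector and \(\mathsf{H}\) the Hessian of the log posterior evaluated at \(\theta^*\), then the covariance matrix of the approximation is \(\mathsf{\Sigma} = -\mathsf{H}^{-1}\), and near the MAP the posterior \(g(\theta \mid y)\) is approximated as

\begin{align}
g(\theta \mid y) \approx \frac{1}{\sqrt{(2\pi)^k \det \mathsf{\Sigma}}}\,\exp\left[-\frac{1}{2}\,(\theta - \theta^*)^\mathsf{T}\, \mathsf{\Sigma}^{-1}\, (\theta - \theta^*)\right],
\end{align}

where \(k\) is the number of parameters, here \(k = 2\) for \(\alpha\) and \(\beta\).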
To compute the covariance matrix, we need to compute the Hessian of the log of the posterior at the MAP. We will do this numerically using the statsmodels.tools.numdiff module, which we imported as smnd. We use the function smnd.approx_hess() to compute the Hessian at the optimal parameter values. Note that since we already have the log posterior coded up and the MAP parameter values, we can just directly shove these into the numerical Hessian calculator.
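The cell performing this calculation is not shown; a minimal sketch, reusing log_posterior(), popt, and args from above, might look like

# Compute the Hessian of the log posterior at the MAP
hes = smnd.approx_hess(popt, log_posterior, args=args)

# Covariance matrix is the negative inverse of the Hessian
cov = -np.linalg.inv(hes)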
The diagonal terms give the approximate posterior variances of the parameters. The off-diagonal terms give the covariances, which describe how the parameters are related to each other; nonzero covariance indicates that they are not completely independent.
36.3.2 Credible intervals
We can compute approximate 95% credible intervals for the parameters by multiplying the square roots of the diagonal terms of the posterior covariance matrix \(\mathsf{\Sigma}\) (the approximate standard deviations) by 1.96.
# Report results
print("""
Credible intervals (≈ 95% of total probability density)
-------------------------------------------------------
α = {0:.2f} ± {1:.2f}
β = {2:.4f} ± {3:.4f} kHz
""".format(alpha_MAP, 1.96*np.sqrt(cov[0,0]), beta_MAP, 1.96*np.sqrt(cov[1,1])))
Credible intervals (≈ 95% of total probability density)
-------------------------------------------------------
α = 1.41 ± 0.09
β = 0.0337 ± 0.0027 kHz
36.3.3 How good is the approximation?
To assess how close the Normal approximation is to the posterior, let’s plot the Normal approximation and the exact (MCMC-sampled) marginalized posteriors together on the corner plot. We can use the scipy.stats.multivariate_normal.pdf() function to generate the PDF for the Normal approximation.

The calculation will take a while, since we have to evaluate the approximate posterior over a whole grid of parameter values in the neighborhood of the MAP in order to plot it. The whole reason we used optimization in the first place was to avoid this kind of calculation! But we are doing it here to visualize the Normal approximation.

# Extract plots
alpha_p = corner.children[1].children[0].children[0]
beta_p = corner.children[1].children[1].children[1]
sample_p = corner.children[1].children[1].children[0]

# Set up ranges for plots; should be near MAP
alpha = np.linspace(1.3, 1.55, 200)
beta = np.linspace(0.03, 0.038, 200)

# Compute approximate normal posterior
post_norm = np.empty((len(alpha), len(beta)))
for i in range(len(alpha)):
    for j in range(len(beta)):
        post_norm[i, j] = st.multivariate_normal.pdf(
            (alpha[i], beta[j]), popt, cov
        )

# Approximate marginal posteriors
post_alpha_norm = st.norm.pdf(alpha, popt[0], np.sqrt(cov[0, 0]))
post_beta_norm = st.norm.pdf(beta, popt[1], np.sqrt(cov[1, 1]))

# Add plots
sample_p = bebi103.viz.contour(
    alpha,
    beta,
    post_norm,
    line_kwargs=dict(line_color="orange"),
    p=sample_p,
)

alpha_p.line(alpha, post_alpha_norm, line_color='orange', line_width=2)
beta_p.line(beta, post_beta_norm, line_color='orange', line_width=2)

# Look!
bokeh.io.show(corner)
The Normal approximation, shown in orange, is not that far off from the posterior, especially near the peak. This is not always the case! This is one of the main dangers of using optimization: you are using very local information to summarize a posterior that may have relevant features away from the MAP. And even though the marginal distributions are close, the full posterior is slightly rotated relative to the approximation; we do not completely capture the covariance present in the posterior.
36.4 Optimization with Stan
Stan does have functionality for finding the MAP (docs). To find a MAP, use the sm.optimize() method of the compiled model. By default, Stan will transform the parameters appropriately and use the L-BFGS algorithm. The optimized parameters can then be accessed as a dictionary using the optimized_params_dict attribute of the output.
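The cell making this call is not shown; a minimal sketch, reusing the sm model and data dictionary from the sampling step above, might look like

# Find the MAP with Stan's optimizer
with bebi103.stan.disable_logging():
    map_res = sm.optimize(data=data)

# Dictionary of optimized parameters (plus lp__, the log posterior at the MAP)
print(map_res.optimized_params_dict)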
The lp__ entry in the dictionary is not an optimized parameter value; it is the value of the log posterior at the MAP. We see that Stan gave us the same MAP as we got using SciPy. Note, though, that Stan does not compute the Hessian for us. To do that, which enables us to get credible regions, we need to code up the log posterior and numerically differentiate it. Since we need to code that up anyway, I usually use SciPy for MAP finding. SciPy also has more optimizers and settings available.