import warnings
warnings.simplefilter('ignore')

import numpy as np
import scipy.stats as st

import bebi103.viz

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
We have learned in previous lessons that the posterior distribution for a set of parameters \(\theta\) given a data set \(y\) is given by the likelihood \(f(y\mid \theta)\) and prior \(g(\theta)\) according to

\[g(\theta\mid y) = \frac{f(y\mid \theta)\,g(\theta)}{f(y)},\]

where \(f(y)\) is the evidence.
The central goal of Bayesian inference is making sense of this posterior distribution; writing it down is the easy part. We have learned how to make sense of the posterior using sampling. Now, we will consider an alternative approach that is generally far inferior, but nevertheless often necessary: optimization.
35.1 Summarizing the posterior near its maximum
For many posteriors, there exists a set of parameters for which the posterior is maximal. This set of parameters, denoted \(\theta^*\), is referred to as the maximum a posteriori parameter set, abbreviated MAP. Given that \(\theta^*\) maximizes the posterior, it is of some importance (but certainly not paramount importance), and may be useful in summarizing a posterior.
Finding the MAP is an optimization problem. In some simple, rare cases, the optimization problem may be solved analytically, though we will mostly focus on numerical methods. Specifically, we will discuss numerical methods for finding \(\theta^*\) for continuous \(\theta\) as we proceed through this lesson. For now, we will proceed assuming we have found \(\theta^*\) and seek to approximate the posterior distribution near the MAP.
Near the MAP, we can approximately express the log posterior as a Taylor expansion truncated to second order. For a single parameter \(\theta\),

\[\ln g(\theta\mid y) \approx \ln g(\theta^*\mid y) - \frac{(\theta-\theta^*)^2}{2\sigma^2}, \qquad \frac{1}{\sigma^2} \equiv -\left.\frac{\mathrm{d}^2\,\ln g(\theta\mid y)}{\mathrm{d}\theta^2}\right|_{\theta=\theta^*}.\]

The first-order term vanishes because the derivative of the log posterior is zero at the maximum, and the second derivative is negative there, so \(\sigma^2 > 0\).
We note that the \(-(\theta-\theta^*)^2/2\sigma^2\) term is the \(\theta\)-dependence of the logarithm of the PDF of a Normal distribution! We can therefore locally approximate the posterior distribution as Normal near the MAP,

\[g(\theta\mid y) \approx \frac{1}{\sqrt{2\pi\sigma^2}}\,\mathrm{e}^{-(\theta-\theta^*)^2/2\sigma^2}.\]
This generalizes to multiple dimensions, where the covariance matrix \(\mathsf{\Sigma}\) of the approximate multivariate Normal distribution is related to the Hessian matrix by

\[\mathsf{\Sigma} = -\mathsf{H}^{-1},\]

where the Hessian \(\mathsf{H}\) has entries \(H_{ij} = \partial^2 \ln g(\theta\mid y)/\partial\theta_i\,\partial\theta_j\), evaluated at \(\theta = \theta^*\).
Note that the covariance matrix is not the inverse of every element of the Hessian; it is the inverse of the Hessian matrix. Note also that optimization theory tells us that because the Hessian is evaluated at \(\theta^*\), which maximizes the posterior, the Hessian must be negative definite. Therefore, the covariance matrix must be positive definite, which we already know must be the case for a multivariate Normal distribution.
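To make this concrete, here is a minimal sketch of how the MAP and the covariance matrix of the Normal approximation might be computed numerically. The model (a Normal likelihood with flat priors) and the data set are made up for illustration. We minimize the negative log posterior with scipy.optimize.minimize; conveniently, for method='BFGS', the result's hess_inv attribute approximates the inverse Hessian of the negative log posterior, which is exactly \(\mathsf{\Sigma}\). (The BFGS approximation to the Hessian can be crude; in practice you may want to compute the Hessian more carefully.)

import numpy as np
import scipy.optimize
import scipy.stats as st

def neg_log_posterior(theta, y):
    """Negative log posterior for a toy model: Normal likelihood with
    parameters mu and sigma and flat priors, so the posterior is
    proportional to the likelihood. Purely illustrative."""
    mu, sigma = theta
    if sigma <= 0:
        return np.inf
    return -np.sum(st.norm.logpdf(y, mu, sigma))

# Synthetic data set for the illustration
rng = np.random.default_rng(3252)
y = rng.normal(10.0, 2.0, size=50)

# Find the MAP by minimizing the negative log posterior
res = scipy.optimize.minimize(
    neg_log_posterior, np.array([5.0, 1.0]), args=(y,), method='BFGS'
)

theta_map = res.x    # MAP estimate, θ*
cov = res.hess_inv   # inverse Hessian of -log posterior ≈ Σ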
35.1.1 Demonstration of the Normal approximation
Actually, the probability density function of any continuous, peaked distribution can be approximated as Normal near a peak. We can demonstrate this fact by approximating a Gamma distribution locally as Normal. Recall that the PDF for a Gamma distribution is

\[f(x\mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\,x^{\alpha-1}\,\mathrm{e}^{-\beta x}.\]

For \(\alpha > 1\), the mode is \(x^* = (\alpha-1)/\beta\), and the second derivative of the log PDF is

\[\frac{\mathrm{d}^2\,\ln f(x\mid \alpha, \beta)}{\mathrm{d}x^2} = -\frac{\alpha-1}{x^2},\]
which evaluated at \(x^*\) is \(-\beta^2 / (\alpha - 1)\). Therefore, the scale parameter for the approximate Normal distribution is \(\sigma = \sqrt{\alpha - 1} / \beta\). We can plot the distribution and its Normal approximation as a visualization. We do this for \(\alpha = 50\) and \(\beta = 20\).
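The plotting cell might look something like the following sketch, using scipy.stats for the PDFs and Bokeh for the plot (the figure sizing keywords assume a recent Bokeh version).

import numpy as np
import scipy.stats as st
import bokeh.io
import bokeh.plotting

def plot_gamma_with_normal_approx(alpha, beta):
    """Overlay the Gamma(alpha, beta) PDF with its Normal approximation
    centered at the mode. Assumes alpha > 1 so the mode is positive."""
    x_star = (alpha - 1) / beta          # mode of the Gamma distribution
    sigma = np.sqrt(alpha - 1) / beta    # scale of the Normal approximation

    x = np.linspace(max(0, x_star - 5 * sigma), x_star + 5 * sigma, 400)
    p = bokeh.plotting.figure(
        width=450, height=300, x_axis_label='x', y_axis_label='PDF'
    )
    p.line(x, st.gamma.pdf(x, alpha, scale=1 / beta), line_width=2,
           legend_label='Gamma')
    p.line(x, st.norm.pdf(x, x_star, sigma), line_width=2, color='orange',
           line_dash='dashed', legend_label='Normal approx.')
    return p

bokeh.io.show(plot_gamma_with_normal_approx(50, 20))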
35.2 Credible intervals
An \(\alpha\)-percent Bayesian credible interval for a parameter is an interval that contains \(\alpha\) percent of the posterior probability density. (This is different from a frequentist confidence interval, though a confidence interval is often erroneously interpreted as a credible interval.) For example, if \(g(\theta \mid y)\) is the marginalized posterior probability density function for a single parameter \(\theta\) and \(G(\theta\mid y)\) is the cumulative distribution function, then a 95% credible interval is \([G^{-1}(0.025),\;G^{-1}(0.975)]\).
If we approximate the posterior as a multivariate Normal distribution, we have analytical results for credible intervals for each parameter. This is the main benefit of using optimization; it affords us an approximate (Normal) expression for the posterior that has convenient, known analytical results. This enables us to make sense of the posterior.
Specifically, if \(\Sigma_{ii}\) is a diagonal element of the covariance matrix \(\mathsf{\Sigma}\), then, e.g., the 95% credible interval for parameter \(\theta_i\) is centered on the MAP estimate, \(\theta_i^*\), and is \([\theta_i^* - 1.96\sqrt{\Sigma_{ii}},\;\theta_i^* + 1.96\sqrt{\Sigma_{ii}}]\). Because the credible intervals are symmetric under this approximation, we can report the estimates as \(\theta_i^* \pm 1.96 \sqrt{\Sigma_{ii}}\).
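In code, given a MAP estimate and covariance matrix (made-up values here, purely for illustration), the credible intervals fall out directly:

import numpy as np

# Hypothetical MAP estimate and covariance matrix from an optimization
theta_map = np.array([1.2, 0.5])
cov = np.array([[0.040, 0.012],
                [0.012, 0.090]])

# 95% credible intervals: theta*_i ± 1.96 * sqrt(Sigma_ii)
half_width = 1.96 * np.sqrt(np.diag(cov))
ci_low = theta_map - half_width
ci_high = theta_map + half_width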
35.2.1 Credible intervals may be skewed
We are approximating the PDF of a distribution close to its mode as Normal. However, for an asymmetric distribution, which posterior distributions can certainly be, the mode may not be all that relevant as far as credible regions go. As an extreme example, consider an Exponential distribution. Its mode is zero, which is also the smallest allowed value. If we were to compute a central 95% credible region for a posterior that happened to be Exponential, the mode would not be included. In cases like these, though the Normal distribution approximately describes the PDF of the posterior near the MAP, the Normal approximation can lead to rather flawed credible regions.
To demonstrate this, let us again return to the Gamma example, this time with \(\alpha = 5\) and \(\beta = 2\). The mode of this Gamma distribution is 2, but the median is about 2.34. If we look graphically, we can see right away that the Normal approximation is not as good as in the \(\alpha = 50\), \(\beta = 20\) case.
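Reusing the plotting sketch from the previous demonstration:

bokeh.io.show(plot_gamma_with_normal_approx(5, 2))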
Let’s look at how this affects the credible intervals. We will compute the lower and upper bounds of credible intervals directly from the Gamma distribution and from its Normal approximation. We can compute credible intervals using the ppf() method of distributions in scipy.stats, which returns the inverse of the CDF. We can then plot the 1%, …, 99% credible intervals for the Gamma distribution and for its Normal approximation.
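Here is one way the computation might look as a sketch: we get the bounds from ppf() for both distributions and plot each bound against the credible-interval level.

import numpy as np
import scipy.stats as st
import bokeh.io
import bokeh.plotting

alpha, beta = 5, 2
x_star = (alpha - 1) / beta          # mode of the Gamma distribution
sigma = np.sqrt(alpha - 1) / beta    # scale of the Normal approximation

# Central credible intervals containing 1%, 2%, ..., 99% of the probability
levels = np.arange(0.01, 1.0, 0.01)
p_low = (1 - levels) / 2
p_high = 1 - p_low

p = bokeh.plotting.figure(
    width=450, height=300,
    x_axis_label='credible interval bound',
    y_axis_label='credible interval level',
)
p.line(st.gamma.ppf(p_low, alpha, scale=1 / beta), levels, line_width=2,
       legend_label='Gamma')
p.line(st.gamma.ppf(p_high, alpha, scale=1 / beta), levels, line_width=2)
p.line(st.norm.ppf(p_low, x_star, sigma), levels, line_width=2,
       color='orange', legend_label='Normal approx.')
p.line(st.norm.ppf(p_high, x_star, sigma), levels, line_width=2,
       color='orange')
p.legend.location = 'top_left'
bokeh.io.show(p)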
For narrow credible intervals, the shift is clear. The median of a Gamma distribution with these parameters lies to the right of its mode, whereas the mode and median of a Normal distribution coincide. For broader credible intervals, the Normal approximation's 99% interval notably contains negative values (which are not allowed for a Gamma distribution) and misses important values at the high end.
35.3 Why use the MAP for parameter estimation?
Despite these drawbacks, using the MAP/Normal approximation for parameter estimation is still useful in Bayesian inference, at least for simple, well-behaved problems. Why? The most important reason is speed. For problems with simple models, optimization is much faster than other methods for characterizing the posterior, such as directly plotting it or using Markov chain Monte Carlo. If you have high-throughput data, you may not have the computational resources to do a full Bayesian treatment with other methods. Provided your model is tractable with optimization methods, finding the MAP and summarizing the posterior with an approximate Normal distribution is an attractive approach.
Optimization methods do, however, suffer from the curse of dimensionality. In general, high-dimensional optimization problems, which arise when models have large numbers of parameters, are very difficult to solve. However, a common practice is to choose models for which the optimization problem is more tractable, even with large data sets and/or large numbers of parameters. Note that this is a purely pragmatic approach! The generative model that is convenient is often not a well-reasoned generative model for the actual observations. Nonetheless, practicality sometimes wins: when MCMC and its full characterization of the posterior are out of reach, you may need to turn to optimization methods. You have to take some water with your wine.
In many cases, the Normal approximation can also be quite good. But this is definitely not guaranteed! In real inference problems, it is very difficult to identify pathologies that would lead to a breakdown of the Normal approximation, since checking against the true posterior would require the very computationally intensive methods we are trying to avoid.