import warnings
warnings.simplefilter('ignore')

import numpy as np
import scipy.stats as st

import bebi103.viz

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
We have learned in previous lessons that the posterior distribution for a set of parameters \(\theta\) given a data set \(y\) is given by the likelihood \(f(y\mid \theta)\) and prior \(g(\theta)\) according to

\[g(\theta\mid y) = \frac{f(y\mid \theta)\,g(\theta)}{f(y)},\]

where \(f(y)\) is the evidence.
The central goal of Bayesian inference is making sense of this posterior distribution; writing it down is the easy part. We have learned how to make sense of the posterior using sampling. Now, we will consider an alternative approach that is generally far inferior, but nevertheless often necessary: optimization.
35.1 Summarizing the posterior near its maximum
For many posteriors, there exists a set of parameters for which the posterior is maximal. This set of parameters, denoted \(\theta^*\), is referred to as the maximum a posteriori parameter set, abbreviated MAP. Given that \(\theta^*\) maximizes the posterior, it is of some importance (but certainly not paramount importance), and may be useful in summarizing a posterior.
Finding the MAP is an optimization problem. In some simple, rare cases, the optimization problem may be solved analytically, though we will mostly focus on numerical methods. Specifically, we will discuss numerical methods for finding \(\theta^*\) for continuous \(\theta\) as we proceed through this lesson. For now, we will proceed assuming we have found \(\theta^*\) and seek to approximate the posterior distribution near the MAP.
Near the MAP, we can approximately express the log posterior as a Taylor expansion truncated to second order. For a single parameter \(\theta\),

\[\ln g(\theta\mid y) \approx \ln g(\theta^*\mid y) - \frac{(\theta-\theta^*)^2}{2\sigma^2}, \qquad \frac{1}{\sigma^2} \equiv -\left.\frac{\mathrm{d}^2\,\ln g(\theta\mid y)}{\mathrm{d}\theta^2}\right|_{\theta=\theta^*}.\]

The first-order term vanishes because the derivative of the log posterior is zero at the maximum, and the second derivative is negative there, so \(\sigma^2 > 0\).
We note that the \(-(\theta-\theta^*)^2/2\sigma^2\) term is the \(\theta\)-dependence of the logarithm of the PDF of a Normal distribution! We can therefore locally approximate the posterior distribution as Normal near the MAP,

\[g(\theta\mid y) \approx \frac{1}{\sqrt{2\pi\sigma^2}}\,\mathrm{e}^{-(\theta-\theta^*)^2/2\sigma^2}.\]
This generalizes to multiple dimensions, where the covariance matrix \(\mathsf{\Sigma}\) of the approximate multivariate Normal distribution is related to the Hessian matrix by

\[\mathsf{\Sigma} = -\mathsf{H}^{-1},\]

where the Hessian \(\mathsf{H}\) has entries \(H_{ij} = \partial^2 \ln g(\theta\mid y)/\partial\theta_i\,\partial\theta_j\), evaluated at \(\theta = \theta^*\).
Note that the covariance matrix is not the inverse of every element of the Hessian; it is the inverse of the Hessian matrix. Note also that optimization theory tells us that because the Hessian is evaluated at \(\theta^*\), which maximizes the posterior, the Hessian must be negative definite. Therefore, the covariance matrix must be positive definite, which we already know must be the case for a multivariate Normal distribution.
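To make this concrete, here is a minimal sketch of how the MAP and the covariance matrix of the Normal approximation might be computed numerically. The model (a Normal likelihood with flat priors) and the data set are made up for illustration. We minimize the negative log posterior with scipy.optimize.minimize; conveniently, for method='BFGS', the result's hess_inv attribute approximates the inverse Hessian of the negative log posterior, which is exactly \(\mathsf{\Sigma}\). (The BFGS approximation to the Hessian can be crude; in practice you may want to compute the Hessian more carefully.)

import numpy as np
import scipy.optimize
import scipy.stats as st

def neg_log_posterior(theta, y):
    """Negative log posterior for a toy model: Normal likelihood with
    parameters mu and sigma and flat priors, so the posterior is
    proportional to the likelihood. Purely illustrative."""
    mu, sigma = theta
    if sigma <= 0:
        return np.inf
    return -np.sum(st.norm.logpdf(y, mu, sigma))

# Synthetic data set for the illustration
rng = np.random.default_rng(3252)
y = rng.normal(10.0, 2.0, size=50)

# Find the MAP by minimizing the negative log posterior
res = scipy.optimize.minimize(
    neg_log_posterior, np.array([5.0, 1.0]), args=(y,), method='BFGS'
)

theta_map = res.x    # MAP estimate, θ*
cov = res.hess_inv   # inverse Hessian of -log posterior ≈ Σ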
35.1.1 Demonstration of the Normal approximation
Actually, the probability density function of any continuous, peaked distribution can be approximated as Normal near a peak. We can demonstrate this fact by approximating a Gamma distribution locally as Normal. Recall that the PDF for a Gamma distribution is

\[f(x\mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\,x^{\alpha-1}\,\mathrm{e}^{-\beta x}.\]

For \(\alpha > 1\), the mode is \(x^* = (\alpha-1)/\beta\), and the second derivative of the log PDF is

\[\frac{\mathrm{d}^2\,\ln f(x\mid \alpha, \beta)}{\mathrm{d}x^2} = -\frac{\alpha-1}{x^2},\]
which evaluated at \(x^*\) is \(-\beta^2 / (\alpha - 1)\). Therefore, the scale parameter for the approximate Normal distribution is \(\sigma = \sqrt{\alpha - 1} / \beta\). We can plot the distribution and its Normal approximation as a visualization. We do this for \(\alpha = 50\) and \(\beta = 20\).
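The plotting cell might look something like the following sketch, using scipy.stats for the PDFs and Bokeh for the plot (the figure sizing keywords assume a recent Bokeh version).

import numpy as np
import scipy.stats as st
import bokeh.io
import bokeh.plotting

def plot_gamma_with_normal_approx(alpha, beta):
    """Overlay the Gamma(alpha, beta) PDF with its Normal approximation
    centered at the mode. Assumes alpha > 1 so the mode is positive."""
    x_star = (alpha - 1) / beta          # mode of the Gamma distribution
    sigma = np.sqrt(alpha - 1) / beta    # scale of the Normal approximation

    x = np.linspace(max(0, x_star - 5 * sigma), x_star + 5 * sigma, 400)
    p = bokeh.plotting.figure(
        width=450, height=300, x_axis_label='x', y_axis_label='PDF'
    )
    p.line(x, st.gamma.pdf(x, alpha, scale=1 / beta), line_width=2,
           legend_label='Gamma')
    p.line(x, st.norm.pdf(x, x_star, sigma), line_width=2, color='orange',
           line_dash='dashed', legend_label='Normal approx.')
    return p

bokeh.io.show(plot_gamma_with_normal_approx(50, 20))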
35.2 Credible intervals
An \(\alpha\)-percent Bayesian credible interval for a parameter is an interval that contains \(\alpha\) percent of the posterior probability density. (This is different from a frequentist confidence interval, though a confidence interval is often erroneously interpreted as a credible interval.) For example, if \(g(\theta \mid y)\) is the marginalized posterior probability density function for a single parameter \(\theta\) and \(G(\theta\mid y)\) is the cumulative distribution function, then a 95% credible interval is \([G^{-1}(0.025),\;G^{-1}(0.975)]\).
If we approximate the posterior as a multivariate Normal distribution, we have analytical results for credible intervals for each parameter. This is the main benefit of using optimization; it affords us an approximate (Normal) expression for the posterior that has convenient, known analytical results. This enables us to make sense of the posterior.
Specifically, if \(\Sigma_{ii}\) is a diagonal element of the covariance matrix \(\mathsf{\Sigma}\), then, e.g., the 95% credible interval for parameter \(\theta_i\) is centered on the MAP estimate, \(\theta_i^*\), and is \([\theta_i^* - 1.96\sqrt{\Sigma_{ii}},\;\theta_i^* + 1.96\sqrt{\Sigma_{ii}}]\). Because the credible intervals are symmetric under this approximation, we can report the estimates as \(\theta_i^* \pm 1.96 \sqrt{\Sigma_{ii}}\).
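In code, given a MAP estimate and covariance matrix (made-up values here, purely for illustration), the credible intervals fall out directly:

import numpy as np

# Hypothetical MAP estimate and covariance matrix from an optimization
theta_map = np.array([1.2, 0.5])
cov = np.array([[0.040, 0.012],
                [0.012, 0.090]])

# 95% credible intervals: theta*_i ± 1.96 * sqrt(Sigma_ii)
half_width = 1.96 * np.sqrt(np.diag(cov))
ci_low = theta_map - half_width
ci_high = theta_map + half_width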
35.2.1 Credible intervals may be skewed
We are approximating the PDF of a distribution close to its mode as Normal. However, for an asymmetric distribution, which posterior distributions can certainly be, the mode may not be all that relevant as far as credible regions go. As an extreme example, consider an Exponential distribution. Its mode is zero, which is also the smallest allowed value. If we were to compute a central 95% credible region for a posterior that happened to be Exponential, the mode would not be included. In cases like these, though the Normal distribution approximately describes the PDF of the posterior near the MAP, the Normal approximation can lead to rather flawed credible regions.
To demonstrate this, let us again return to the Gamma example, this time with \(\alpha = 5\) and \(\beta = 2\). The mode of this Gamma distribution is 2, but the median is about 2.34. If we look graphically, we can see right away that the Normal approximation is not as good as in the \(\alpha = 50\), \(\beta = 20\) case.
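Reusing the plotting sketch from the previous demonstration:

bokeh.io.show(plot_gamma_with_normal_approx(5, 2))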
Let’s look at how this affects the credible intervals. We will compute the lower and upper bounds of credible intervals directly from the Gamma distribution and from its Normal approximation. We can compute credible intervals using the ppf() method of distributions in scipy.stats, which returns the inverse of the CDF. We can then plot the 1%, …, 99% credible intervals for the Gamma distribution and for its Normal approximation.
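Here is one way the computation might look as a sketch: we get the bounds from ppf() for both distributions and plot each bound against the credible-interval level.

import numpy as np
import scipy.stats as st
import bokeh.io
import bokeh.plotting

alpha, beta = 5, 2
x_star = (alpha - 1) / beta          # mode of the Gamma distribution
sigma = np.sqrt(alpha - 1) / beta    # scale of the Normal approximation

# Central credible intervals containing 1%, 2%, ..., 99% of the probability
levels = np.arange(0.01, 1.0, 0.01)
p_low = (1 - levels) / 2
p_high = 1 - p_low

p = bokeh.plotting.figure(
    width=450, height=300,
    x_axis_label='credible interval bound',
    y_axis_label='credible interval level',
)
p.line(st.gamma.ppf(p_low, alpha, scale=1 / beta), levels, line_width=2,
       legend_label='Gamma')
p.line(st.gamma.ppf(p_high, alpha, scale=1 / beta), levels, line_width=2)
p.line(st.norm.ppf(p_low, x_star, sigma), levels, line_width=2,
       color='orange', legend_label='Normal approx.')
p.line(st.norm.ppf(p_high, x_star, sigma), levels, line_width=2,
       color='orange')
p.legend.location = 'top_left'
bokeh.io.show(p)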
For narrow credible intervals, the shift is clear. The median of a Gamma distribution with these parameters lies to the right of its mode, whereas the mode and median of a Normal distribution coincide. For broader credible intervals, the Normal approximation's 99% interval notably contains negative values (which are not allowed for a Gamma distribution) and misses important values at the high end.
35.3 Why use the MAP for parameter estimation?
Despite these drawbacks, using the MAP/Normal approximation for parameter estimation is still useful in Bayesian inference, at least for simple, well-behaved problems. Why? The most important reason is speed. For problems with simple models, optimization is much faster than other methods for characterizing the posterior, such as directly plotting it or using Markov chain Monte Carlo. If you have high-throughput data, you may not have the computational resources to do a full Bayesian treatment with other methods. Provided your model is tractable with optimization methods, finding the MAP and summarizing the posterior with an approximate Normal distribution is an attractive approach.
Optimization methods do, however, suffer from the curse of dimensionality. In general, high-dimensional optimization problems, which arise when models have large numbers of parameters, are very difficult to solve. However, a common practice is to choose models for which the optimization problem is more tractable, even with large data sets and/or large numbers of parameters. Note that this is a purely pragmatic approach! The generative model that is convenient is often not a well-reasoned generative model for the actual observations. Nonetheless, practicality sometimes wins: when MCMC and its full characterization of the posterior are out of reach, you may need to turn to optimization methods. You have to take some water with your wine.
In many cases, the Normal approximation can also be quite good. But this is definitely not guaranteed! In real inference problems, it is very difficult to identify pathologies that would lead to a breakdown of the Normal approximation, since checking against the true posterior would require the very computationally intensive methods we are trying to avoid.