20 Basics of Bayesian modeling
20.1 Tasks of Bayesian modeling
We have learned about Bayes’s theorem as a way to update a hypothesis in light of new data. We use the word “hypothesis” very loosely here. Remember, in the Bayesian view, probability can describe the plausibility of any proposition. The value of a parameter is such a proposition, and we will apply the term “hypothesis” to parameter values as well. Again for concreteness, we will consider the problem of parameter estimation here, where \(\theta\) will denote a set of parameter values, our “hypothesis” in this context.
As we have seen, our goal is to obtain updated knowledge about \(\theta\) as codified in the posterior distribution, \(g(\theta \mid y)\), where \(y\) is the data we measured. Bayes's theorem gives us access to this distribution.
\[\begin{align} g(\theta \mid y) = \frac{f(y\mid\theta)\,g(\theta)}{f(y)}. \end{align}\]
Note, though, that the split between the likelihood and prior need not be explicit. In fact, when we study hierarchical models, we will see that it is not obvious how to unambiguously define the likelihood and prior. In order to do Bayesian inference, we really only need to specify the joint distribution, \(\pi(y,\theta) = f(y\mid\theta)\,g(\theta)\), which is the product of the likelihood and prior. The posterior can also be written as
\[\begin{align} g(\theta \mid y) = \frac{\pi(y,\theta)}{f(y)}. \end{align}\]
Because the evidence, \(f(y)\), is computed from the likelihood and prior (or more generally from the joint distribution) as
\[\begin{align} f(y) = \int\mathrm{d}\theta\,\pi(y,\theta) = \int\mathrm{d}\theta\,f(y\mid\theta)\,g(\theta), \end{align}\]
the posterior is fully specified by the joint distribution. In light of this, there are two broad tasks in Bayesian modeling.
1. Define the joint distribution \(\pi(y,\theta)\), most commonly specified as a likelihood and prior.
2. Make sense of the resulting posterior.
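To make the role of the joint distribution concrete, here is a minimal numerical sketch, with made-up numbers, for a hypothetical one-parameter model with a Normal likelihood and a Normal prior. Because this model is conjugate, the evidence computed by quadrature can be checked against a known analytical result.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Hypothetical one-parameter model (made-up numbers for illustration):
# likelihood y ~ Norm(theta, 1), prior theta ~ Norm(0, 1)
y = 1.3  # a single made-up measurement

# Joint distribution pi(y, theta) = f(y | theta) g(theta)
def joint(theta):
    return stats.norm.pdf(y, theta, 1.0) * stats.norm.pdf(theta, 0.0, 1.0)

# Evidence f(y) = ∫ dtheta pi(y, theta), computed by quadrature
evidence, _ = quad(joint, -np.inf, np.inf)

# This conjugate model has a known marginal distribution: y ~ Norm(0, sqrt(2))
evidence_exact = stats.norm.pdf(y, 0.0, np.sqrt(2.0))
```

Specifying the joint and integrating out \(\theta\) is all that is needed to get the evidence, and hence the fully normalized posterior.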
20.1.1 Model building
Step (1) is the task of model building, or modeling. This is typically done by first specifying a likelihood, \(f(y\mid \theta)\). This is a generative distribution in that it gives the full description of how data \(y\) are generated given parameters \(\theta\). Importantly, included in the likelihood is a theoretical model.
After the likelihood is constructed we need to specify prior distributions for the parameters of the likelihood. That is, the likelihood, as a model of the generative process, prescribes what the parameters \(\theta\) of the model are (but not their values). Knowing what parameters we need to write priors for, we go about codifying what we know about the parameters before we do the experiment as prior probabilities.
So, the process of model building involves prescribing a likelihood that describes the process of data generation. The parameters to be estimated are defined in the likelihood, and from these we construct priors. We will discuss how we go about building likelihoods and priors in this lecture and subsequently in the course.
20.1.1.1 The role of the prior
As an aside, I pause to note that there may be some philosophical issues with this approach. I think Gelman, Simpson, and Betancourt clearly stated the dilemma in their 2017 paper with the apt title, "The Prior Can Often Only Be Understood in the Context of the Likelihood" (emphasis added by me).
One can roughly speak of two sorts of Bayesian analyses. In the first sort, the Bayesian formalism can be taken literally: a researcher starts with a prior distribution about some externally defined quantity, perhaps some physical parameter or the effect of some social intervention, and then he or she analyzes data, leading to an updated posterior distribution. Here, the prior can be clearly defined, not just before the data are observed but before the experiment has even been considered. ...[W]e are concerned with a second sort of analysis, what might be termed Bayesian analysis using default priors, in which a researcher gathers data and forms a model that includes various unknown parameters and then needs to set up a prior distribution to continue with Bayesian inference. This latter approach is standard in Bayesian workflow...
One might say that what makes a prior a prior, rather than simply a probability distribution, is that it is destined to be paired with a likelihood. That is, the Bayesian formalism requires that a prior distribution be updated into a posterior distribution based on new data.
We are taking the second approach that Gelman, Simpson, and Betancourt laid out, the approach in which the prior is in service to a likelihood. We are trying to model a data generation process, which requires a likelihood to even identify what the parameters are that require priors.
20.1.2 Making sense of the posterior
If we can write down the likelihood and prior, or more generally the joint distribution \(\pi(y,\theta)\), we theoretically have the full posterior distribution. We have seen in our previous lessons that for simple models involving one parameter, we can often plot the posterior distribution in order to interpret it. When we have two parameters, it becomes more challenging to make plots, but it can be done. Things get more and more complicated as the number of parameters grows, a manifestation of the curse of dimensionality.
At the heart of the technical challenges of Bayesian inference is how to make sense of the posterior. Plotting it is useful, but in almost all cases, we wish to compute expectations of the posterior. An expectation computed from the posterior is of the form
\[\begin{align} \langle \xi \rangle = \int \mathrm{d}\theta'\, \xi(\theta')\,g(\theta'\mid y), \end{align}\]
with the integral replaced by a sum for discrete \(\theta\). As an example, we may have \(\theta = (\theta_1, \theta_2)\) and we wish to compute a marginal posterior \(g(\theta_2 \mid y)\). In this case, \(\xi(\theta') = \delta(\theta_2' - \theta_2)\), a Dirac delta function, and
\[\begin{align} g(\theta_2 \mid y) = \int \mathrm{d}\theta_1 \, g(\theta_1, \theta_2 \mid y). \end{align}\]
So, making sense of the posterior typically involves computing integrals (or sums).
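As a sketch of this kind of computation, the marginalization integral can be carried out numerically on a grid. The two-parameter "posterior" below is a stand-in, chosen for illustration because its marginal is known exactly.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

# A stand-in two-parameter "posterior" with independent Normal marginals:
# g(theta_1, theta_2 | y) = Norm(theta_1; 0, 1) * Norm(theta_2; 2, 0.5)
theta1 = np.linspace(-6.0, 6.0, 2001)
theta2 = np.linspace(-1.0, 5.0, 401)
T1, T2 = np.meshgrid(theta1, theta2, indexing="ij")
post = stats.norm.pdf(T1, 0.0, 1.0) * stats.norm.pdf(T2, 2.0, 0.5)

# Marginalize over theta_1 with the trapezoidal rule (axis 0 is theta_1)
marginal_theta2 = trapezoid(post, theta1, axis=0)

# The marginal is known exactly in this case: Norm(2, 0.5)
exact = stats.norm.pdf(theta2, 2.0, 0.5)
```

Grid-based integration like this works fine for one or two parameters, but the curse of dimensionality makes it hopeless for models with many parameters, which is why the methods listed below are needed.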
In addition to using conjugacy, which we have already discussed, we will explore a few ways to do this in coming lessons.
- Finding the values of the parameters \(\theta\) that maximize the posterior and then approximating the posterior as locally Normal. This involves a numerical optimization calculation to find the maximizing parameters, and then uses known analytical results about multivariate Normal distributions to automatically compute the integrals (without actually having to do integration!).
- Sampling out of the posterior. As you saw in an earlier homework, we can perform marginalization, and indeed most other integrals, by sampling. The trick is sampling out of the posterior, which is not as easy as the sampling you have done so far. We resort to using the sampling method called Markov chain Monte Carlo (MCMC) to do it.
- Performing an optimization to find which distribution in a family of distributions most closely approximates the posterior, and then using this approximate distribution as a substitute for the posterior. We choose the family of candidate distributions to have nice properties, known analytical results, ease of sampling, etc., thereby making exploration of the posterior much easier than by full MCMC. This is called variational inference.
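To preview why sampling is so useful, the sketch below pretends we already have posterior samples (here drawn directly from a Normal stand-in for a posterior; in practice MCMC would supply them) and computes expectations as simple averages.

```python
import numpy as np

rng = np.random.default_rng(3252)

# Pretend these are posterior samples. Here they are drawn directly from a
# Normal stand-in; in practice MCMC would supply them.
samples = rng.normal(loc=2.0, scale=0.5, size=100_000)

# Any posterior expectation is approximated by an average over samples
mean_est = np.mean(samples)            # <theta>
second_moment = np.mean(samples**2)    # <theta^2>
prob_above_3 = np.mean(samples > 3.0)  # posterior probability that theta > 3
```

No integrals were computed explicitly; with samples in hand, expectations, probabilities, and marginals all reduce to averaging.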
For the rest of this lecture, though, we will focus on model building.
20.2 Bayesian modeling example: parameter estimation from repeated measurements
We will consider one of the simplest examples of parameter estimation, and one that comes up often in research applications. Let’s say we repeat a measurement many times. This could be beak depths of finches on the Galápagos, fluorescence intensity in a cell, etc. The possibilities abound. To have a concrete example in mind for this example, let’s assume we are doing a behavioral assay where a mouse has to descend a 24-inch pole and we measure the amount of time it takes for the mouse to descend. This is a common assay for measuring mouse motor coordination.
We define our set of pole descent times as \(y \equiv\{y_1, y_2, \ldots y_n\}\). For now, we will loosely define a parameter of interest to be \(\mu\), the typical descent time. We will sharpen our definition of this parameter through specification of the likelihood.
We wish to calculate \(g(\mu \mid y)\), the posterior probability density function for the parameter \(\mu\), given the data. Values of \(\mu\) for which the posterior probability density is high are more probable (that is, more plausible) than those for which it is low. The posterior \(g(\mu \mid y)\) codifies our knowledge about \(\mu\) in light of our data \(y\).
To compute the posterior, we use Bayes’s theorem.
\[\begin{aligned} g(\mu \mid y) = \frac{f(y\mid \mu)\,g(\mu)}{f(y)}. \end{aligned}\]
Since the evidence, \(f(y)\), does not depend on the parameter of interest, \(\mu\), it is really just a normalization constant, so we do not need to consider it explicitly at this stage. Specification of the likelihood and prior is sufficient for the posterior, since we must have
\[\begin{aligned} f(y) = \int \mathrm{d}\mu \,f(y\mid \mu)\,g(\mu) \end{aligned}\]
to ensure normalization of the posterior \(g(\mu \mid y)\). So, we have just to specify the likelihood \(f(y\mid \mu)\) and the prior \(g(\mu)\). We begin with the likelihood.
20.2.1 The likelihood
To specify the likelihood, we have to ask what we expect from the data, given a value of \(\mu\). If every mouse descended in exactly the same amount of time and there are no errors or confounding factors at all in our measurements, we expect \(y_i = \mu\) for all \(i\). In this case
\[\begin{aligned} f(y\mid \mu) = \prod_{i=1}^n\delta(y_i - \mu), \end{aligned}\]
the product of Dirac delta functions. Of course, this is really never the case. There will be natural variation among mice and in their behavior, as well as errors in measurement and possibly confounding variables in the system. What, then, should we choose for our likelihood?
That choice is of course dependent on the story/theoretical modeling behind data generation. For our purposes here, we shall assume our data are generated from a Normal likelihood. Since this distribution gets heavy use, I will pause here to talk a bit more about it, even though we have already seen it.
20.2.2 The Normal distribution
A univariate Normal (also known as Gaussian) probability distribution has a probability density function (PDF) of
\[\begin{aligned} f(y \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\left[-\frac{(y - \mu)^2}{2\sigma^2}\right]. \end{aligned}\]
The parameter \(\mu\) is a location parameter and in the case of the Normal distribution is called the mean and \(\sigma\) is a scale parameter and is called the standard deviation in this case. The square of this scale parameter is referred to as the variance. Importantly (and confusingly), the terms "mean," "standard deviation," and "variance" in this context are names of parameters of the distribution; they are not what you compute directly from data as plug-in estimates. We will therefore use the terms location parameter and scale parameter, though the terms mean and standard deviation are used very widely in the literature.
The central limit theorem says that any quantity that emerges as the sum of a large number of subprocesses tends to be Normally distributed, provided none of the subprocesses is very broadly distributed. We will not prove this important theorem, but we make use of it when choosing likelihood distributions based on the stories behind the generative process. Indeed, in the simple case of estimating a single parameter where many processes may contribute to noise in the measurement, the Normal distribution is a good choice for a likelihood.
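The central limit theorem is easy to demonstrate numerically. The sketch below sums many i.i.d. Uniform draws (an arbitrary choice of subprocess) and compares the empirical distribution of the sums to the Normal distribution the theorem predicts.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8675309)

# Sum n_sub i.i.d. Uniform(0, 1) draws, many times over
n_sub = 50
sums = rng.uniform(size=(200_000, n_sub)).sum(axis=1)

# CLT prediction: sums are approximately Norm(n_sub/2, sqrt(n_sub/12)),
# since a Uniform(0, 1) draw has mean 1/2 and variance 1/12
mu_clt = n_sub / 2
sigma_clt = np.sqrt(n_sub / 12)

# Compare the empirical CDF of the sums to the Normal CDF at a few points
pts = mu_clt + sigma_clt * np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
ecdf = np.array([np.mean(sums <= p) for p in pts])
normal_cdf = stats.norm.cdf(pts, mu_clt, sigma_clt)
```

Even though each subprocess is Uniform, far from Normal, the sum is already very close to Normal at fifty subprocesses.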
More generally, the multivariate Normal distribution for \(y = (y_1, y_2, \cdots, y_n)^\mathsf{T}\) is
\[\begin{aligned} f(y \mid \boldsymbol{\mu}, \mathsf{\Sigma}) = (2\pi)^{-\frac{n}{2}} \left(\det \mathsf{\Sigma}\right)^{-\frac{1}{2}}\, \exp\left[-\frac{1}{2}(y - \boldsymbol{\mu})^\mathsf{T}\cdot \mathsf{\Sigma}^{-1}\cdot(y - \boldsymbol{\mu})\right], \end{aligned}\]
where \(\boldsymbol{\mu} = (\mu_1, \mu_2,\ldots, \mu_n)^\mathsf{T}\) is an array of location parameters. The parameter \(\mathsf{\Sigma}\) is a symmetric positive definite matrix called the covariance matrix. If off-diagonal entry \(\Sigma_{ij}\) is nonzero, then \(y_i\) and \(y_j\) are correlated (or anticorrelated). In the case where all \(y_i\) are independent, all off-diagonal terms in the covariance matrix are zero, and the multivariate Normal distribution reduces to
\[\begin{aligned} f(y \mid \boldsymbol{\mu}, \boldsymbol{\sigma}) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi \sigma_i^2}}\, \exp\left[-\frac{(y_i - \mu_i)^2}{2\sigma_i^2}\right], \end{aligned}\]
where \(\sigma^2_i\) is the \(i\)th entry along the diagonal of the covariance matrix. This is the variance associated with measurement \(i\). So, if all independent measurements \(y_i\) have the same location and scale parameters, which is to say that the measurements are independent and identically distributed (i.i.d.), the multivariate Normal distribution reduces to
\[\begin{aligned} f(y \mid \mu, \sigma) = \left(\frac{1}{2\pi \sigma^2} \right)^{\frac{n}{2}}\, \exp\left[-\frac{1}{2\sigma^2}\,\sum_{i=1}^n (y_i - \mu)^2\right]. \end{aligned}\]
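As a quick check of this expression, the sketch below (with made-up measurements) evaluates the i.i.d. Normal log likelihood two ways: using `scipy.stats` and directly from the formula above. The two must agree.

```python
import numpy as np
from scipy import stats

# Made-up i.i.d. "descent time" measurements, in seconds
y = np.array([12.1, 9.8, 14.3, 11.0, 10.5])

def log_likelihood(mu, sigma, y):
    """Log of the i.i.d. Normal likelihood f(y | mu, sigma), via scipy."""
    return np.sum(stats.norm.logpdf(y, mu, sigma))

def log_likelihood_by_hand(mu, sigma, y):
    """The same quantity written out directly from the formula above."""
    n = len(y)
    return -n / 2 * np.log(2 * np.pi * sigma**2) - np.sum((y - mu) ** 2) / (
        2 * sigma**2
    )
```

Working with the log likelihood, rather than the likelihood itself, avoids numerical underflow when \(n\) is large.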
20.2.3 The likelihood revisited, and another parameter
For the purposes of this demonstration of parameter estimation, we assume the Normal distribution is a good choice for our likelihood for repeated measurements (as it often is). To determine how many entries of the covariance matrix we need to treat as parameters, we have to decide how the measurements are related. It is often the case that the measurements are i.i.d., so that only a single location parameter and a single scale parameter are needed. So, we choose our likelihood to be
\[\begin{aligned} f(y\mid \mu, \sigma) = \left(\frac{1}{2\pi \sigma^2} \right)^{\frac{n}{2}}\, \exp\left[-\frac{1}{2\sigma^2}\,\sum_{i=1}^n (y_i - \mu)^2\right]. \end{aligned}\]
By choosing this as our likelihood, we are saying that we expect our measurements to have a well-defined mean \(\mu\) with a spread described by the variance, \(\sigma^2\).
But wait a minute; we had a single parameter, \(\mu\), that we sought to estimate, and now we have another parameter, \(\sigma\), beyond the one we’re trying to measure. So, our statistical model has two parameters, and Bayes’s theorem now reads
\[\begin{aligned} g(\mu, \sigma \mid y) = \frac{f(y\mid \mu, \sigma)\,g(\mu, \sigma)} {f(y)}. \end{aligned}\]
After we compute the posterior, we can still find the posterior probability distribution we are after by marginalizing.
\[\begin{aligned} g(\mu\mid y) = \int_0^\infty \mathrm{d}\sigma\,g(\mu, \sigma \mid y). \end{aligned}\]
20.2.4 Choice of prior
Now that we have defined a likelihood, we know what the parameters are and we can define a prior, \(g(\mu, \sigma)\). As is often the case, we assume \(\mu\) and \(\sigma\) are independent of each other, so that
\[\begin{aligned} g(\mu, \sigma) = g(\mu)\,g(\sigma). \end{aligned}\]
How might we choose prior distributions for \(\mu\) and \(\sigma\)? Remember, the prior probability distribution captures what we know about the parameter before we measure data. For the current example of mouse pole descent, we can guess that the mouse should take ten or fifteen seconds to descend, but we are not too sure about this. So, we will make the prior distribution broad; we take \(g(\mu)\) to be Normal with a location parameter of 15 seconds, but a scale parameter of 10 seconds. That is,
\[\begin{aligned} g(\mu) = \frac{1}{\sqrt{2\pi \sigma_\mu^2}}\,\exp\left[-\frac{(\mu-\mu_\mu)^2}{2\sigma_\mu^2}\right], \end{aligned}\]
with \(\mu_\mu = 15\) seconds and \(\sigma_\mu = 10\) seconds. There is some trepidation with this choice, as there is nonzero probability that the time to descend is negative, which is of course unphysical. If we wish to use a Normal prior, we will necessarily have to allow some disallowed parameter values to have nonzero prior probability, since the support of the Normal distribution is the whole real line. The Stan developers' excellent wiki on choosing priors contains this gem of advice: "…loss in precision by making the prior a bit too weak (compared to the true population distribution of parameters or the current expert state of knowledge) is less serious than the gain in robustness by including parts of parameter space that might be relevant." So, we will take some water with our wine for this prior.
For \(g(\sigma)\), we might think that the pole descent time might vary by about 50% of our guessed descent time, but not much more than that. We could again choose a Normal prior, with
\[\begin{aligned} g(\sigma) = \frac{1}{\sqrt{2\pi \sigma_\sigma^2}}\,\exp\left[-\frac{(\sigma-\mu_\sigma)^2}{2\sigma_\sigma^2}\right],\end{aligned}\]
with \(\mu_\sigma = 5\) seconds and \(\sigma_\sigma = 5\) seconds.
As I have already mentioned, we have the obvious issue that there is nonzero probability that \(\mu\) or \(\sigma\) is negative; a negative \(\mu\) is unphysical, and a negative \(\sigma\) is mathematically disallowed. We could refine our prior distributions to make sure this does not happen. Whatever approach we choose, the prior should roughly match what we would sketch on a piece of paper, covering any reasonable parameter values and excluding any that are unreasonable (or unphysical).
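To see how much prior probability actually spills into the disallowed region, we can evaluate the Normal CDF at zero for each prior, using the parametrization above.

```python
from scipy import stats

# Prior probability mass on disallowed negative values, using the
# parametrization above: mu ~ Norm(15, 10), sigma ~ Norm(5, 5)
p_mu_negative = stats.norm.cdf(0, loc=15, scale=10)   # P(mu < 0)
p_sigma_negative = stats.norm.cdf(0, loc=5, scale=5)  # P(sigma < 0)
```

These come out to roughly 7% and 16%, respectively, so the issue is not negligible, particularly for \(\sigma\); a truncated or Half-Normal prior would be one way to refine the model.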
20.2.5 Succinctly stating the model
Our model is complete, which means that we have now completely specified the posterior. We can write it out. First, we define the parametrization of the priors.
\[\begin{aligned} \begin{align} &\mu_\mu = 15 \text{ seconds} \nonumber \\[1em] &\sigma_\mu = 10 \text{ seconds} \nonumber \\[1em] &\mu_\sigma = 5 \text{ seconds} \nonumber \\[1em] &\sigma_\sigma = 5 \text{ seconds}. \end{align} \end{aligned}\]
Then, the posterior is
\[\begin{aligned} \begin{aligned} g(\mu, \sigma\mid y) = &\;\frac{1}{f(y)}\,\left\{\left(\frac{1}{2\pi \sigma^2} \right)^{\frac{n}{2}}\, \exp\left[-\frac{1}{2\sigma^2}\,\sum_{i=1}^n (y_i - \mu)^2\right] \right.\nonumber\\ &\;\;\;\;\;\;\;\;\;\;\;\times \, \frac{1}{\sqrt{2\pi \sigma_\mu^2}}\,\exp\left[-\frac{(\mu-\mu_\mu)^2}{2 \sigma_\mu^2}\right] \nonumber\\ &\;\;\;\;\;\;\;\;\;\;\;\times\,\left.\frac{1}{\sqrt{2\pi \sigma_\sigma^2}}\,\exp\left[-\frac{(\sigma-\mu_\sigma)^2}{2 \sigma_\sigma^2}\right]\right\}, \end{aligned} \end{aligned}\]
with
\[\begin{aligned} f(y) = \int \mathrm{d}\mu\,\int\mathrm{d}\sigma\, \{\text{term in braces in the above equation}\}. \end{aligned}\]
Oh my, this is a mess, even for this simple model! Even though we have the posterior, it is very hard to make sense of it. Essentially the rest of the course involves making sense of the posterior, which is the challenge. It turns out that writing it down is relatively easy.
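Although the posterior is a mess analytically, evaluating it numerically is straightforward for a two-parameter model. The sketch below (with made-up descent times) evaluates likelihood times prior on a grid, normalizes by quadrature (thereby computing the evidence implicitly), and locates the posterior maximum.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

# Made-up descent time data, in seconds (for illustration only)
y = np.array([12.1, 9.8, 14.3, 11.0, 10.5])

# Prior parametrization from the model
mu_mu, sigma_mu = 15.0, 10.0
mu_sigma, sigma_sigma = 5.0, 5.0

# Grid of parameter values
mu = np.linspace(5.0, 20.0, 300)
sigma = np.linspace(0.05, 10.0, 300)
MU, SIGMA = np.meshgrid(mu, sigma, indexing="ij")

# log(likelihood x prior), i.e., the unnormalized log posterior
log_post = (
    np.sum(stats.norm.logpdf(y[:, None, None], MU, SIGMA), axis=0)
    + stats.norm.logpdf(MU, mu_mu, sigma_mu)
    + stats.norm.logpdf(SIGMA, mu_sigma, sigma_sigma)
)

# Exponentiate stably (subtracting the max) and normalize on the grid
post = np.exp(log_post - log_post.max())
post /= trapezoid(trapezoid(post, sigma, axis=1), mu)

# Location of the posterior maximum on the grid
i_mu, i_sigma = np.unravel_index(np.argmax(post), post.shape)
mu_map, sigma_map = mu[i_mu], sigma[i_sigma]
```

With only two parameters this brute-force approach works, but it scales exponentially in the number of parameters, which is why we will need the methods described earlier in this lecture.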
One of the first things we can do to make sense of our model, and also to specify it, is to use a shorthand for model specification. First of all, we do not need to specify the evidence, since it is always given by integrating the product of the likelihood and prior; that is, by fully marginalizing the likelihood. So, we will always omit its specification. Now, we would like to have a notation for stating the likelihood and prior. English works well.
- The parameter \(\mu\) is Normally distributed with location parameter 15 seconds and scale parameter 10 seconds.
- The parameter \(\sigma\) is Normally distributed with location parameter 5 seconds and scale parameter 5 seconds.
- The descent times are i.i.d. and are Normally distributed with location parameter \(\mu\) and scale parameter \(\sigma\).
This is much easier to understand. We can write this with a convenient, and self-evident, shorthand.1
\[\begin{aligned} \begin{aligned} &\mu \sim \text{Norm}(\mu_\mu, \sigma_\mu),\\[1em] &\sigma \sim \text{Norm}(\mu_\sigma, \sigma_\sigma),\\[1em] &y_i \sim \text{Norm}(\mu, \sigma) \;\forall i. \end{aligned} \end{aligned}\]
Here, the symbol \(\sim\) may be read as “is distributed as.” The above three lines are completely sufficient to specify our model. Because we will be using a probabilistic programming language in practice, we will almost never need to code up any nasty mathematical expressions in our modeling and we can express a model as we have done here.
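The shorthand is also a recipe for simulation: reading its three lines top to bottom generates data. The sketch below draws parameter values from the priors and then data from the likelihood, a so-called prior predictive simulation. It also exposes the negative-\(\sigma\) problem noted earlier.

```python
import numpy as np

rng = np.random.default_rng(3252)

# Prior parametrization, in seconds (from the model above)
mu_mu, sigma_mu = 15.0, 10.0
mu_sigma, sigma_sigma = 5.0, 5.0

def draw_dataset(n, rng):
    """Draw one simulated data set directly from the model shorthand."""
    mu = rng.normal(mu_mu, sigma_mu)           # mu ~ Norm(mu_mu, sigma_mu)
    sigma = rng.normal(mu_sigma, sigma_sigma)  # sigma ~ Norm(mu_sigma, sigma_sigma)
    if sigma <= 0:
        # The Normal prior allows disallowed sigma values; flag and skip them
        return None
    return rng.normal(mu, sigma, size=n)       # y_i ~ Norm(mu, sigma) for all i

datasets = [draw_dataset(10, rng) for _ in range(1000)]
n_invalid = sum(d is None for d in datasets)
```

Roughly one in six simulated parameter sets has a negative \(\sigma\), a direct consequence of the Normal prior's infinite support, and a concrete argument for refining it.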
20.3 Choosing likelihoods
In our example of model building for measurements of mouse descent times, we chose a Normal likelihood for the descent times. We did so because the story of the repeated measurements matched that of the Normal distribution via the central limit theorem. When a measurement is the result of many processes, none of which has an enormous variance, the value of the measurement is Normally distributed.
This method of choosing a likelihood amounts to story matching. The idea is that we describe the data generation process with a story. We then find a distribution that describes the outcome of the story.
For example, one might do an electrophysiology experiment measuring spiking of an individual neuron in response to a constant stimulus and record the times between spikes. A possible model story for this is that the spikes are all independent and without memory; that is, their timing does not depend on previous spiking events. This means that we can model spiking as a Poisson process and are interested in the timing between arrivals (spikes) of the Poisson process. This story matches the story of the Exponential distribution, so we would use it for our likelihood.
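The sketch below (with a made-up spike rate) simulates such a memoryless spike train by discretizing time into tiny steps, each with the same small spike probability, and then checks that the interarrival times behave like Exponential draws: mean near \(1/\text{rate}\) and coefficient of variation near 1.

```python
import numpy as np

rng = np.random.default_rng(8675309)

# Simulate a memoryless (Poisson-process) spike train on [0, T] by
# discretizing time into tiny steps with constant spike probability
rate = 10.0  # spikes per second (hypothetical)
dt = 1e-4    # time step, seconds
T = 200.0    # total duration, seconds
spikes = rng.uniform(size=int(T / dt)) < rate * dt
spike_times = np.nonzero(spikes)[0] * dt

# Interarrival times should follow an Exponential distribution
interarrival = np.diff(spike_times)
mean_interval = np.mean(interarrival)
cv = np.std(interarrival) / mean_interval  # Exponential has CV = 1
```

The unit coefficient of variation is a signature of the Exponential story; heavier regularity in real spike trains (CV well below 1) would be a sign to look for a different likelihood, such as a Gamma distribution for interarrival times.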
The procedure of story matching is an important part of Bayesian modeling. In a great many cases, there exists a well-known, named distribution that matches the story you are using to model the generation of your data. In cases where no such distribution exists, or you do not know about it, you need to derive a PDF or PMF matching your story, which can be challenging. It is therefore well worth the time investment to know about distributions that can be useful in modeling. You should read the contents of the Distribution Explorer to get yourself familiar with named distributions. This will greatly facilitate choosing likelihoods (and priors).
20.4 Choosing a prior
Choice of likelihood is not trivial. It often involves physical/biological modeling and substantial domain knowledge. Compared to choosing prior distributions, however, there is typically less subtlety in choosing likelihoods beyond the domain expertise involved. We therefore reserve the choice of priors, and the important related concept of conjugacy, for the lessons that follow.
I understand that I should be providing units on all parameters that I am specifying with numbers. I am not doing this here, nor throughout the course, to avoid notational clutter and to maintain focus on the modeling.↩︎