53  Generalized linear models: An introduction

Imagine the following neural decoding problem. I measure the activity of several neurons, perhaps via their spiking rates, fluorescence from calcium imaging, or some other measure of activity. I simultaneously measure a behavior, which I will call \(y\). This could be a categorical description of behavior (like “sniff,” “attack,” “mount,” “ignore”) or some continuous description, like the force applied by the musculature. I want to build a mathematical model that links neuronal activity to behavior. I then may want to use that model to predict behavior from measured neuronal activity.

Let’s think about this mathematically. We define \(\mathbf{x}\) to be a vector of neuronal activities. That is, \(x_1\) is the instantaneous activity of neuron 1, \(x_2\) is that of neuron 2, etc. The observed behavior at any point in time will be some function of \(\mathbf{x}\). More specifically, the probability distribution of the quantitative description of the behavior depends on \(\mathbf{x}\). That is to say that the likelihood, \(f(y\mid \mathbf{x}, \theta)\), is conditioned on \(\mathbf{x}\) and some set of parameters \(\theta\). Most generally, we can say this conditioning is based on some function of \(\mathbf{x}\). For reasons that will become clear momentarily, we will write this functional dependence as a composed function \(\ell^{-1}(\xi (\mathbf{x}))\), where the functions \(\ell^{-1}\) and \(\xi\) may both be parametrized by some subset of the parameters \(\theta\). The function \(\ell^{-1}\) is referred to as the inverse link function, which is sometimes called the mean function.

53.1 An example: Multinomial logistic regression

As we turn to the question of how to choose \(\xi\) and \(\ell^{-1}\), it helps to have a concrete example in mind. Imagine that the observed behavior is one of four behaviors (sniff, attack, mount, and ignore), which we index as 1, 2, 3, and 4, respectively.

How should we choose \(\xi(\mathbf{x})\)? We could build the function \(\xi(\mathbf{x})\) through careful modeling based on physical and biological principles. But, as is often the case, we may have no idea what kind of function to use for \(\xi\)! We can take a page from the physicists’ handbook in this case, and write \(\xi(\mathbf{x})\) as a Taylor series to first order in \(\mathbf{x}\). For example, if the output of \(\xi\) needs to be a scalar for input into \(\ell^{-1}\), then

\[\begin{align} \xi(\mathbf{x} ; \alpha, \boldsymbol{\beta}) = \alpha + \boldsymbol{\beta}\cdot \mathbf{x}. \end{align} \]

In our case, the output needs to be a 4-vector, since \(\ell^{-1}(\xi(\mathbf{x}))\) needs to return a simplex of four probabilities. So, \(\boldsymbol{\alpha}\) becomes a 4-vector and \(\mathsf{\beta}\) becomes a \(4\times n\) matrix, where \(n\) is the number of neurons. In this case,

\[\begin{align} \xi(\mathbf{x} ; \boldsymbol{\alpha}, \mathsf{\beta}) = \boldsymbol{\alpha} + \mathsf{\beta}\cdot \mathbf{x}. \end{align} \]

How might we choose \(\ell^{-1}\), then? Because \(\xi(\mathbf{x})\) is a linear function, its output is unbounded. So, we need a function that converts the unbounded output of \(\xi(\mathbf{x})\) into a simplex of probabilities. The softmax function is a great choice! It is defined as follows. Let \(\mathbf{a} = (a_1, a_2, \ldots)\). Then,

\[\begin{align} \text{softmax}(\mathbf{a}) = \frac{\left(\mathrm{e}^{a_1}, \mathrm{e}^{a_2},\ldots\right)^\mathsf{T}}{\displaystyle{\sum_i \mathrm{e}^{a_i}}}. \end{align} \]
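As a quick sanity check on this definition, here is a minimal NumPy implementation. Shifting the arguments by their maximum before exponentiating is a standard trick to avoid overflow; it leaves the result unchanged because the shift cancels in the ratio.

```python
import numpy as np

def softmax(a):
    """Softmax of a 1D array of linear responses."""
    # Subtracting the max does not change the result, since the
    # constant factor exp(-max) cancels in numerator and denominator.
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

# The output is a simplex: positive entries that sum to one
softmax(np.array([1.0, 2.0, 3.0, 4.0]))
# array([0.0321, 0.0871, 0.2369, 0.6439])
```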

So, choosing the softmax function as our inverse link function and considering many i.i.d. measurements \(y_i\), we have as our model

\[\begin{align} &\boldsymbol{\alpha} \sim \text{prior for }\boldsymbol{\alpha}, \\[1em] &\mathsf{\beta} \sim \text{prior for }\mathsf{\beta}, \\[1em] &y_i \mid \mathbf{x}_i, \boldsymbol{\alpha}, \mathsf{\beta} \sim \text{Categorical}(\text{softmax}(\boldsymbol{\alpha} + \mathsf{\beta}\cdot \mathbf{x}_i)) \;\forall i. \end{align} \]

This model is referred to as a multinomial logistic regression model.
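To make the generative model concrete, here is a minimal NumPy sketch that simulates behaviors from it. The Normal priors, the parameter values, and the dimensions are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

n_neurons = 10   # dimension of x (assumed for illustration)
n_behaviors = 4  # sniff, attack, mount, ignore
n_obs = 200

# Assumed priors: Normal, chosen only for this sketch
alpha = rng.normal(0.0, 1.0, size=n_behaviors)
beta = rng.normal(0.0, 1.0, size=(n_behaviors, n_neurons))

# Made-up neuronal activities for illustration
x = rng.normal(0.0, 1.0, size=(n_obs, n_neurons))

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / np.sum(e)

# For each measurement, compute the simplex of behavior
# probabilities and draw a categorical behavior (coded 0-3)
y = np.array(
    [rng.choice(n_behaviors, p=softmax(alpha + beta @ xi)) for xi in x]
)
```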

53.2 Generalized linear models

The model we have just described is a special case of a class of models known as generalized linear models, or GLMs. The idea behind these models is that the output of a linear function of the input variables \(\mathbf{x}\) is passed into an inverse link function whose output parametrizes the likelihood of observations \(y\).[^1]

Several GLMs have special names. In what follows, it is useful to define the standard logistic function, also known as the inverse logit function, which is a special case of the softmax function when there are only two possible categories.

\[\begin{align} \text{logit}^{-1}(a) = \frac{\mathrm{e}^{a}}{1 + \mathrm{e}^{a}} = \frac{1}{1 + \mathrm{e}^{-a}}. \end{align} \]
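To see why this is a special case of the softmax function, fix the linear response of the first of two categories at zero and call the second \(a\). The probability of the second category under the softmax is then

\[\begin{align} \text{softmax}((0, a))_2 = \frac{\mathrm{e}^{a}}{\mathrm{e}^{0} + \mathrm{e}^{a}} = \frac{\mathrm{e}^{a}}{1 + \mathrm{e}^{a}} = \text{logit}^{-1}(a). \end{align} \]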

The likelihoods of some named GLMs are in the table below.

| GLM name | Likelihood | Inverse link function |
|:---|:---|:---|
| Linear regression | \(y\mid \mathbf{x}, \alpha, \boldsymbol{\beta}, \sigma \sim \text{Norm}(\alpha + \boldsymbol{\beta} \cdot \mathbf{x}, \sigma)\) | \(\ell^{-1}(\eta) = \eta\) |
| Logistic regression | \(y\mid \mathbf{x}, \alpha, \boldsymbol{\beta} \sim \text{Bernoulli}(\text{logit}^{-1}(\alpha + \boldsymbol{\beta} \cdot \mathbf{x}))\) | \(\ell^{-1}(\eta) = \text{logit}^{-1}(\eta)\) |
| Multinomial logistic regression | \(y\mid \mathbf{x}, \boldsymbol{\alpha}, \mathsf{\beta} \sim \text{Categorical}(\text{softmax}(\boldsymbol{\alpha} + \mathsf{\beta} \cdot \mathbf{x}))\) | \(\ell^{-1}(\boldsymbol{\eta}) = \text{softmax}(\boldsymbol{\eta})\) |
| Binomial regression | \(y \mid \mathbf{x}, N, \alpha, \boldsymbol{\beta} \sim \text{Binom}(N, \text{logit}^{-1}(\alpha + \boldsymbol{\beta}\cdot\mathbf{x}))\) | \(\ell^{-1}(\eta) = \text{logit}^{-1}(\eta)\) |
| Multinomial regression | \(\mathbf{y} \mid \mathbf{x}, N, \boldsymbol{\alpha}, \mathsf{\beta} \sim \text{Multinomial}(N, \text{softmax}(\boldsymbol{\alpha} + \mathsf{\beta}\cdot\mathbf{x}))\) | \(\ell^{-1}(\boldsymbol{\eta}) = \text{softmax}(\boldsymbol{\eta})\) |
| Poisson regression | \(y\mid \mathbf{x}, \alpha, \boldsymbol{\beta} \sim \text{Poisson}(\mathrm{e}^{\alpha + \boldsymbol{\beta}\cdot\mathbf{x}})\) | \(\ell^{-1}(\eta) = \mathrm{e}^{\eta}\) |
| Gamma regression | \(y\mid \mathbf{x}, \alpha, \boldsymbol{\beta}, a \sim \text{Gamma}(a, a/\ell^{-1}(\alpha + \boldsymbol{\beta}\cdot\mathbf{x}))\) | \(\ell^{-1}(\eta) = \mathrm{e}^{\eta}\) or \(\ell^{-1}(\eta) = 1/\eta\) |
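As an example of putting one of these models to work, here is a minimal sketch of a Poisson regression, using simulated data and fit by maximum likelihood with the statsmodels library. The library choice and parameter values are illustrative assumptions, not part of the text above.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(seed=3252)

# Simulate data from the Poisson regression generative model
# with illustrative parameters alpha = 0.5, beta = 1.2
x = rng.normal(size=100)
y = rng.poisson(np.exp(0.5 + 1.2 * x))  # exponential inverse link

# Fit by maximum likelihood; add_constant() supplies the intercept
X = sm.add_constant(x)
res = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(res.params)  # estimates of (alpha, beta)
```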

53.3 Hierarchical GLMs

Note that if the parameters \(\alpha\) and \(\boldsymbol{\beta}\) in the GLM are themselves dependent on hyperparameters, we have a hierarchical GLM. These are quite widely encountered and present the usual challenges of hierarchical models.
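As a generative sketch of the idea, consider a logistic regression in which each experimental subject has its own intercept drawn from a population-level distribution governed by hyperparameters. All distributions, parameter values, and dimensions here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

n_subjects = 6
n_obs = 50  # observations per subject

# Hyperparameters governing the population of intercepts (assumed)
mu_alpha, sigma_alpha = 0.0, 1.0

# Each subject's intercept is drawn from the population distribution
alpha = rng.normal(mu_alpha, sigma_alpha, size=n_subjects)
beta = 1.5  # shared slope, fixed for simplicity

for s in range(n_subjects):
    x = rng.normal(size=n_obs)
    p = 1.0 / (1.0 + np.exp(-(alpha[s] + beta * x)))  # inverse logit
    y = rng.binomial(1, p)  # Bernoulli draws for subject s
```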

[^1]: Strictly speaking, GLMs are defined based on the exponential dispersion family of distributions where the canonical parameters are linear functions of the input variables \(\mathbf{x}\), but I do not want to get too tied down with what qualifies as a GLM. I prefer to think generatively and develop the most appropriate models for my applications, which many times may end up being GLMs.