39 EM applied to a Gaussian mixture model
In preparing this section, I benefitted from discussions with and reading a notebook by Kayla Jackson.
39.1 Gaussian mixture models (GMMs)
A Gaussian mixture model, or GMM, is a model in which each of \(N\) i.i.d. data points may be drawn from one of \(K\) \(n\)-variate Normal (a.k.a. Gaussian) distributions. We index the data points with \(i\) and the Normal distributions with \(k\). The probability that an arbitrary data point in the data set is drawn from Normal distribution \(k\) is \(\pi_k\), a quantity referred to as a mixing coefficient. We denote the set of \(K\) mixing coefficients as \(\boldsymbol{\pi}\), with \(\sum_k \pi_k = 1\). We define \(z_{ik}\) to be one if data point \(i\) derives from a Normal distribution with parameters \((\boldsymbol{\mu}_k, \mathsf{\Sigma}_k)\) and zero otherwise. The prior probability of \(z_{ik}\) being one is \(\pi_k\), since we do not know a priori which distribution datum \(i\) came from. Thus, \(\mathbf{z}_i\) is a \(K\)-vector for data point \(i\) in which \(K-1\) entries are zero and one entry is one.
For notational convenience, when referring to the set of all parameters, we will drop the subscripts. That is, \(\mathbf{z}\) denotes the set of all \(\mathbf{z}_i\)’s, \(\boldsymbol{\mu}\) denotes the set of all \(\boldsymbol{\mu}_k\)’s, and \(\mathsf{\Sigma}\) denotes the set of all \(\mathsf{\Sigma}_k\)’s. In general, a Gaussian mixture model where we assume the parameters have independent priors is
\[\begin{align} &\boldsymbol{\pi} \sim \text{prior for mixing coefficients}, \\[1em] &\boldsymbol{\mu} \sim \text{prior for location parameters}, \\[1em] &\mathsf{\Sigma} \sim \text{prior for scale parameters}, \\[1em] &\mathbf{z}_i \mid \boldsymbol{\pi} \sim \text{Categorical}(\boldsymbol{\pi}) \;\forall i, \\[1em] &\mathbf{y}_i \mid \mathbf{z}_i, \boldsymbol{\mu}, \mathsf{\Sigma} \sim \prod_{k=1}^K (\mathrm{Norm}(\boldsymbol{\mu}_k,\mathsf{\Sigma}_k))^{z_{ik}} \;\forall i. \end{align} \tag{39.1}\]
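To make the generative process concrete, here is a minimal NumPy sketch of simulating data from this model for a hypothetical two-component, bivariate mixture. The parameter values and variable names (`pi`, `mus`, `Sigmas`) are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

# Hypothetical parameters for a two-component (K = 2), bivariate (n = 2) mixture
pi = np.array([0.6, 0.4])                        # mixing coefficients; sum to one
mus = np.array([[0.0, 0.0], [3.0, 3.0]])         # location parameters
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2)])  # scale (covariance) parameters

# Generative process for N data points
N = 500
z = rng.choice(len(pi), size=N, p=pi)  # z_i | pi ~ Categorical(pi)
y = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
```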
To make all conditioning clear, we can explicitly write out Bayes’s theorem.
\[\begin{align} g(\mathbf{z}, \boldsymbol{\pi}, \boldsymbol{\mu}, \mathsf{\Sigma} \mid \mathbf{y}) = \frac{1}{f(\mathbf{y})}\, \left(\prod_{i=1}^N f(\mathbf{y}_i\mid \mathbf{z}_i, \boldsymbol{\mu}, \mathsf{\Sigma})\right)\left(\prod_{i=1}^N g(\mathbf{z}_i \mid \boldsymbol{\pi})\right)\,g(\boldsymbol{\pi})\,g(\boldsymbol{\mu})\,g(\mathsf{\Sigma}). \end{align} \]
For further clarity, we write out the Categorical prior for the mixing coefficients,
\[\begin{align} g(\mathbf{z}_i \mid \boldsymbol{\pi}) = \prod_{k=1}^K \pi_k^{z_{ik}}, \end{align} \]
and the likelihood,
\[\begin{align} f(\mathbf{y} \mid \mathbf{z}, \boldsymbol{\mu}, \mathsf{\Sigma}) &= \prod_{i=1}^N f(\mathbf{y}_i \mid \mathbf{z}_i, \boldsymbol{\mu}, \mathsf{\Sigma}) \nonumber \\[1em] &= \prod_{i=1}^N\underbrace{\prod_{k=1}^K \left(\frac{1}{\sqrt{(2\pi)^n\det\mathsf{\Sigma}_k}}\,\exp\left[-\frac{1}{2}(\mathbf{y}_i-\boldsymbol{\mu}_k)^\mathsf{T}\cdot\mathsf{\Sigma}_k^{-1}\cdot(\mathbf{y}_i-\boldsymbol{\mu}_k)\right]\right)^{z_{ik}}}_{f(\mathbf{y}_i \mid \mathbf{z}_i, \boldsymbol{\mu}, \mathsf{\Sigma})}, \end{align} \]
where the fact that \(z_{ik}\) is one for only one value of \(k\) means that a single Normal distribution is assigned to datum \(i\) for a given value of \(\mathbf{z}_i\). To prevent the equations from becoming unwieldy, we define, as is commonly done,
\[\begin{align} \mathcal{N}(\mathbf{y}_i \mid \boldsymbol{\mu}_k, \mathsf{\Sigma}_k) \equiv \frac{1}{\sqrt{(2\pi)^n\det\mathsf{\Sigma}_k}}\,\exp\left[-\frac{1}{2}(\mathbf{y}_i-\boldsymbol{\mu}_k)^\mathsf{T}\cdot\mathsf{\Sigma}_k^{-1}\cdot(\mathbf{y}_i-\boldsymbol{\mu}_k)\right], \end{align} \tag{39.2}\]
such that the likelihood may more compactly be written as
\[\begin{align} f(\mathbf{y} \mid \mathbf{z}, \boldsymbol{\mu}, \mathsf{\Sigma}) = \prod_{i=1}^N \prod_{k=1}^K \mathcal{N}(\mathbf{y}_i \mid \boldsymbol{\mu}_k, \mathsf{\Sigma}_k)^{z_{ik}}. \end{align} \]
39.1.1 The marginalized GMM posterior
For many applications, such as for sampling with Markov chain Monte Carlo, it is useful to marginalize over the \(z_{ik}\)’s. This may be done analytically.
\[\begin{align} g(\boldsymbol{\pi}, \boldsymbol{\mu}, \mathsf{\Sigma} \mid \mathbf{y}) &= \sum_\mathbf{z} g(\mathbf{z}, \boldsymbol{\pi}, \boldsymbol{\mu}, \mathsf{\Sigma} \mid \mathbf{y}) =\frac{g(\boldsymbol{\pi})\,g(\boldsymbol{\mu})\,g(\mathsf{\Sigma})}{f(\mathbf{y})}\, \sum_\mathbf{z}\left(\prod_{i=1}^N f(\mathbf{y}_i\mid \mathbf{z}_i, \boldsymbol{\mu}, \mathsf{\Sigma})\right)\left(\prod_{i=1}^N \prod_{k=1}^K \pi_k^{z_{ik}}\right), \end{align} \]
So, the sum we need to evaluate is
\[\begin{align} \sum_\mathbf{z}\left(\prod_{i=1}^N f(\mathbf{y}_i\mid \mathbf{z}_i, \boldsymbol{\mu}, \mathsf{\Sigma})\right)\left(\prod_{i=1}^N \prod_{k=1}^K \pi_k^{z_{ik}}\right) &= \sum_\mathbf{z} \prod_{i=1}^N\prod_{k=1}^K \left[\pi_k\,\mathcal{N}(\mathbf{y}_i \mid \boldsymbol{\mu}_k, \mathsf{\Sigma}_k)\right]^{z_{ik}} \nonumber \\[1em] &= \prod_{i=1}^N\sum_{j=1}^K \prod_{k=1}^K \left[\pi_k\,\mathcal{N}(\mathbf{y}_i \mid \boldsymbol{\mu}_k, \mathsf{\Sigma}_k)\right]^{\delta_{jk}} \nonumber \\ &= \prod_{i=1}^N\sum_{j=1}^K \pi_j\,\mathcal{N}(\mathbf{y}_i \mid \boldsymbol{\mu}_j, \mathsf{\Sigma}_j), \end{align} \]
where \(\delta_{jk}\) is the Kronecker delta function, equal to one for \(j = k\) and zero otherwise. In the second step, we used the fact that the \(\mathbf{z}_i\) are independent, so the sum over \(\mathbf{z}\) factors into a product over data points of a sum over the \(K\) possible one-hot values of each \(\mathbf{z}_i\). The above result means that we can write the likelihood for the marginalized posterior as
\[\begin{align} \mathbf{y}_i \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \mathsf{\Sigma} \sim \sum_{k=1}^K \pi_k\, \text{Norm}(\boldsymbol{\mu}_k,\mathsf{\Sigma}_k)\;\forall i, \end{align} \]
and thereby write the model with the \(z\)-variables marginalized out as
\[\begin{align} &\boldsymbol{\pi} \sim \text{prior for mixing coefficients}, \\[1em] &\boldsymbol{\mu} \sim \text{prior for location parameters}, \\[1em] &\mathsf{\Sigma} \sim \text{prior for scale parameters}, \\[1em] &\mathbf{y}_i \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \mathsf{\Sigma} \sim \sum_{k=1}^K \pi_k\, \text{Norm}(\boldsymbol{\mu}_k,\mathsf{\Sigma}_k) \;\forall i. \end{align} \]
Thus, the log posterior is
\[\begin{align} \ln g(\boldsymbol{\pi}, \boldsymbol{\mu}, \mathsf{\Sigma} \mid \mathbf{y}) = \text{constant} + \ln g(\boldsymbol{\pi}, \boldsymbol{\mu}, \mathsf{\Sigma}) + \sum_{i=1}^N\ln\left[\sum_{k=1}^K \pi_k\, \mathcal{N}(\mathbf{y}_i \mid \boldsymbol{\mu}_k,\mathsf{\Sigma}_k)\right]. \end{align} \tag{39.3}\]
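In practice, the sum inside the logarithm in Equation 39.3 is prone to numerical under- and overflow, so it is best evaluated with a log-sum-exp. Below is a minimal sketch of computing the data-dependent term of Equation 39.3 with SciPy; the function name and signature are our own, and the prior term is omitted (as appropriate for uniform priors).

```python
import numpy as np
import scipy.stats
from scipy.special import logsumexp


def log_likelihood(y, pi, mus, Sigmas):
    """Log of the product over i of sum_k pi_k N(y_i | mu_k, Sigma_k)."""
    # log_pdfs[i, k] = ln N(y_i | mu_k, Sigma_k)
    log_pdfs = np.stack(
        [
            scipy.stats.multivariate_normal.logpdf(y, mean=mu, cov=Sigma)
            for mu, Sigma in zip(mus, Sigmas)
        ],
        axis=1,
    )

    # For each datum, ln sum_k pi_k N(y_i | mu_k, Sigma_k); then sum over data
    return np.sum(logsumexp(log_pdfs + np.log(pi), axis=1))
```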
39.2 The EM algorithm for a GMM
Now that we have a firm understanding of what a GMM is, we can proceed to develop an EM algorithm for GMMs.
39.2.1 The surrogate function for a GMM
Recall that we need to specify the parameter-dependent part of the surrogate function, which for an EM algorithm is given by Equation 38.6,
\[\begin{align} \mathcal{Q}(\theta, \phi) = \ln g(\theta) + \sum_z g(z \mid y, \phi)\,\ln \pi(y, z \mid \theta). \end{align} \]
For a Gaussian mixture model, \(z\) in the above equation is \(\mathbf{z}\), \(y\) is the set of all the data, and \(\phi\) and \(\theta\) are the two sets of parameters, which we will write respectively as \((\boldsymbol{\pi}, \boldsymbol{\mu},\mathsf{\Sigma})\) and \((\boldsymbol{\pi}^\theta, \boldsymbol{\mu}^\theta, \mathsf{\Sigma}^\theta)\). Since we will eventually employ EM, it will be useful to get expressions for \(g(z \mid y, \phi)\) and \(\pi(y, z \mid \theta)\). We start with the latter.
\[\begin{align} \pi(y, z \mid \theta) &= f(y \mid z, \theta)\,g(z \mid \theta) = \left(\prod_{i=1}^N f(\mathbf{y}_i\mid \mathbf{z}_i, \boldsymbol{\mu}^\theta, \mathsf{\Sigma}^\theta)\right)\left(\prod_{i=1}^N \prod_{k=1}^K (\pi_k^\theta)^{z_{ik}}\right) \nonumber \\[1em] &= \prod_{i=1}^N\prod_{k=1}^K \left[\pi_k^\theta\,\mathcal{N}(\mathbf{y}_i \mid \boldsymbol{\mu}_k^\theta, \mathsf{\Sigma}_k^\theta)\right]^{z_{ik}}. \end{align} \]
To compute \(g(z \mid y, \phi)\), we use Bayes’s theorem.
\[\begin{align} g(z \mid y, \phi) &= \frac{f(y\mid z, \phi)\,g(z\mid\phi)}{f(y\mid\phi)} \propto f(y\mid z, \phi)\,g(z\mid\phi) = \prod_{i=1}^N\prod_{k=1}^K \left[\pi_k\,\mathcal{N}(\mathbf{y}_i \mid \boldsymbol{\mu}_k, \mathsf{\Sigma}_k)\right]^{z_{ik}}. \end{align} \]
Evaluating the normalization constant for \(g(z\mid y, \phi)\), we get
\[\begin{align} g(z \mid y, \phi) &= \frac{\displaystyle{\prod_{i=1}^N\prod_{k=1}^K \left[\pi_k\,\mathcal{N}(\mathbf{y}_i \mid \boldsymbol{\mu}_k, \mathsf{\Sigma}_k)\right]^{z_{ik}}}}{\displaystyle{\sum_\mathbf{z} \prod_{i=1}^N\prod_{k=1}^K \left[\pi_k\,\mathcal{N}(\mathbf{y}_i \mid \boldsymbol{\mu}_k, \mathsf{\Sigma}_k)\right]^{z_{ik}}}}. \end{align} \tag{39.4}\]
We can plug these expressions in and write the parameter-dependent part of the surrogate function as
\[\begin{align} \mathcal{Q}(\theta, \phi) &= \ln g(\theta) + \sum_\mathbf{z} g(z\mid y, \phi)\,\sum_{i=1}^N\sum_{k=1}^K z_{ik}\left(\ln \pi_k^\theta + \ln \mathcal{N}(\mathbf{y}_i \mid \boldsymbol{\mu}_k^\theta, \mathsf{\Sigma}_k^\theta)\right) \nonumber \\[1em] &= \ln g(\theta) + \sum_{i=1}^N\sum_{k=1}^K\left[\sum_\mathbf{z} g(z\mid y, \phi)\,z_{ik}\right] \left(\ln \pi_k^\theta + \ln \mathcal{N}(\mathbf{y}_i \mid \boldsymbol{\mu}_k^\theta, \mathsf{\Sigma}_k^\theta)\right). \end{align} \]
We recognize the bracketed sum as the expectation of \(z_{ik}\) over the posterior distribution. That is, the bracketed term is the probability that \(z_{ik}\) is unity, i.e., the probability that datum \(i\) was generated by mixture component \(k\). This quantity is called the responsibility of component \(k\) for generating datum \(i\), and is typically denoted as \(\gamma(z_{ik})\). We can directly evaluate it.
\[\begin{align} \gamma(z_{ik}) &= \sum_\mathbf{z} g(z\mid y, \phi)\,z_{ik} \nonumber \\[1em] &= \sum_\mathbf{z} z_{ik}\,\frac{\displaystyle{\prod_{i=1}^N\prod_{k=1}^K \left[\pi_k\,\mathcal{N}(\mathbf{y}_i \mid \boldsymbol{\mu}_k, \mathsf{\Sigma}_k)\right]^{z_{ik}}}}{\displaystyle{\sum_\mathbf{z} \prod_{i=1}^N\prod_{k=1}^K \left[\pi_k\,\mathcal{N}(\mathbf{y}_i \mid \boldsymbol{\mu}_k, \mathsf{\Sigma}_k)\right]^{z_{ik}}}} \nonumber \\[1em] &= \displaystyle{\frac{\pi_k \,\mathcal{N}(\mathbf{y}_i \mid \boldsymbol{\mu}_k, \mathsf{\Sigma}_k)}{\sum_{j=1}^K\pi_j\,\mathcal{N}(\mathbf{y}_i \mid \boldsymbol{\mu}_j, \mathsf{\Sigma}_j)}}, \end{align} \tag{39.5}\]
where we have used the fact that \(z_{ik}\) takes a value of one for exactly one \(k\) and a value of zero otherwise. In evaluating the sum, the factors for data points other than \(i\) cancel between the numerator and the denominator, leaving only the term for datum \(i\).
The responsibilities are of use even if we do not use them directly in the computational tasks of inference (we will in the EM algorithm, but not, for example, when we do MCMC). They have the direct interpretation as the posterior probability that a given datum comes from a given component of the mixture model. Equation 39.5 gives a formula for computing the responsibilities from estimates (or samples) of the model parameters.
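Equation 39.5 translates directly to code. Here is a sketch, again working in log space for numerical stability; as above, the function name and signature are our own choices.

```python
import numpy as np
import scipy.stats
from scipy.special import logsumexp


def responsibilities(y, pi, mus, Sigmas):
    """Responsibilities of Equation 39.5, returned as an (N, K) array."""
    # log_weighted[i, k] = ln pi_k + ln N(y_i | mu_k, Sigma_k)
    log_weighted = np.stack(
        [
            np.log(pi_k)
            + scipy.stats.multivariate_normal.logpdf(y, mean=mu, cov=Sigma)
            for pi_k, mu, Sigma in zip(pi, mus, Sigmas)
        ],
        axis=1,
    )

    # Normalize each row: gamma(z_ik) = pi_k N_k / sum_j pi_j N_j
    return np.exp(log_weighted - logsumexp(log_weighted, axis=1, keepdims=True))
```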
Proceeding with the EM specification, now that we have the responsibilities, we can write the parameter-dependent part of the surrogate function as
\[\begin{align} \mathcal{Q}(\theta, \phi) = \ln g(\theta) + \sum_{i=1}^N\sum_{k=1}^K\gamma(z_{ik}) \left(\ln \pi_k^\theta + \ln \mathcal{N}(\mathbf{y}_i \mid \boldsymbol{\mu}_k^\theta, \mathsf{\Sigma}_k^\theta)\right), \end{align} \tag{39.6}\]
where \(g(\theta) = g(\boldsymbol{\pi}^\theta)\,g(\boldsymbol{\mu}^\theta)\,g(\mathsf{\Sigma}^\theta)\).
39.2.2 Summary of the EM algorithm for a GMM
We can write the EM algorithm for this specific case (see Section 38.4) as follows.
1. Guess a maximizer of the marginal log posterior. Define this initial guess as \((\boldsymbol{\pi}, \boldsymbol{\mu}, \mathsf{\Sigma})\).
2. Evaluate \(g(\mathbf{z}\mid \mathbf{y}, \boldsymbol{\pi}, \boldsymbol{\mu}, \mathsf{\Sigma})\) using Equation 39.4. (E step)
3. Use the result of step (2) to define \(\mathcal{Q}((\boldsymbol{\pi}^\theta, \boldsymbol{\mu}^\theta, \mathsf{\Sigma}^\theta), (\boldsymbol{\pi}, \boldsymbol{\mu}, \mathsf{\Sigma}))\). This involves computing the responsibilities \(\gamma(z_{ik})\) using the result from the E step according to Equation 39.5 and then using Equation 39.6.
4. Compute \((\boldsymbol{\pi}^*, \boldsymbol{\mu}^*, \mathsf{\Sigma}^*) = \text{arg max}_{(\boldsymbol{\pi}^\theta, \boldsymbol{\mu}^\theta, \mathsf{\Sigma}^\theta)}\, \mathcal{Q}((\boldsymbol{\pi}^\theta, \boldsymbol{\mu}^\theta, \mathsf{\Sigma}^\theta), (\boldsymbol{\pi}, \boldsymbol{\mu}, \mathsf{\Sigma}))\). (M step)
5. If the difference between the log posteriors (given by Equation 39.3) evaluated at \((\boldsymbol{\pi}^*, \boldsymbol{\mu}^*, \mathsf{\Sigma}^*)\) and \((\boldsymbol{\pi}, \boldsymbol{\mu}, \mathsf{\Sigma})\) is smaller than a predefined threshold, STOP and record \((\boldsymbol{\pi}^*, \boldsymbol{\mu}^*, \mathsf{\Sigma}^*)\) as the MAP estimate for the parameters. Otherwise, go to step 6.
6. Set \((\boldsymbol{\pi}, \boldsymbol{\mu}, \mathsf{\Sigma}) = (\boldsymbol{\pi}^*, \boldsymbol{\mu}^*, \mathsf{\Sigma}^*)\).
7. Go to step 2.
The intermediate step to compute the parameter-dependent part of the surrogate function, namely computing the responsibilities, is traditionally included in the E step.
While the E step is achieved analytically, the M step in general cannot be. However, for some specific priors, the M step may be solved analytically. The most widely used example is when all priors are uniform, and we describe that case below.
39.2.3 MAP estimation with uniform priors using EM
If we use uniform priors, \(g(\theta) = g(\boldsymbol{\pi}^\theta)\,g(\boldsymbol{\mu}^\theta)\,g(\mathsf{\Sigma}^\theta) = \text{constant}\), we can perform the M step analytically. First, we differentiate the surrogate function with respect to \(\boldsymbol{\mu}_k^\theta\), retaining only the terms of \(\mathcal{Q}\) that depend on \(\boldsymbol{\mu}_k^\theta\).
\[\begin{align} \frac{\partial \mathcal{Q}}{\partial \boldsymbol{\mu}_k} &= \frac{\partial}{\partial \boldsymbol{\mu}_k}\,\sum_{i=1}^N \gamma(z_{ik})\left(-\frac{1}{2}\,(\mathbf{y}_i - \boldsymbol{\mu}_k)^\mathsf{T}\cdot\mathsf{\Sigma}_k^{-1}\cdot (\mathbf{y}_i - \boldsymbol{\mu}_k)\right) \nonumber \\[1em] &= \sum_{i=1}^N \gamma(z_{ik})\,\mathsf{\Sigma}_k^{-1}\cdot (\mathbf{y}_i - \boldsymbol{\mu}_k). \end{align} \]
In the above, we have dropped the \(\theta\) superscripts for notational convenience. Setting the derivative to zero and multiplying by \(\mathsf{\Sigma}_k\) gives
\[\begin{align} \sum_{i=1}^N \gamma(z_{ik})\,(\mathbf{y}_i - \boldsymbol{\mu}_k) = \mathbf{0}. \end{align} \]
Solving gives the optimal \(\boldsymbol{\mu}_k\), perhaps unsurprisingly, as the responsibility-weighted average of the data,
\[\begin{align} \boldsymbol{\mu}^*_k = \frac{1}{\gamma_k}\sum_{i=1}^N\gamma(z_{ik})\,\mathbf{y}_i, \end{align} \]
where
\[\begin{align} \gamma_k \equiv \sum_{i=1}^N\gamma(z_{ik}). \end{align} \]
Similarly, we can differentiate the surrogate function with respect to \(\mathsf{\Sigma}_k^\theta\), where we will again drop the \(\theta\) superscripts for notational convenience. First, it helps to know a few identities from matrix calculus,
\[\begin{align} &\frac{\partial}{\partial \mathsf{A}}\,\mathbf{x}^\mathsf{T}\cdot \mathsf{A}\cdot\mathbf{x} = \mathbf{x}\cdot\mathbf{x}^\mathsf{T},\\[1em] &\frac{\partial \mathsf{A}^{-1}}{\partial \mathsf{A}} = -\mathsf{A}^{-1}\otimes\mathsf{A}^{-1},\\[1em] &\frac{\partial}{\partial \mathsf{A}}\,\ln \det\mathsf{A} = \mathsf{A}^{-1}, \end{align} \]
where the \(\otimes\) symbol denotes the outer product and \(\mathbf{x}\cdot\mathbf{x}^\mathsf{T} = \mathbf{x}\otimes\mathbf{x}\). It follows from the first two identities and the chain rule that
\[\begin{align} \frac{\partial}{\partial \mathsf{A}}\,\mathbf{x}^\mathsf{T}\cdot \mathsf{A}^{-1}\cdot\mathbf{x} = \frac{\partial \mathsf{A}^{-1}}{\partial \mathsf{A}}\frac{\partial}{\partial \mathsf{A}^{-1}}\,\mathbf{x}^\mathsf{T}\cdot \mathsf{A}^{-1}\cdot\mathbf{x} = -\mathsf{A}^{-1}\cdot\mathbf{x}\cdot\mathbf{x}^\mathsf{T}\cdot\mathsf{A}^{-1}. \end{align} \]
Now, computing the derivative,
\[\begin{align} \frac{\partial \mathcal{Q}}{\partial \mathsf{\Sigma}_k} &= \frac{\partial}{\partial \mathsf{\Sigma}_k}\,\sum_{i=1}^N \gamma(z_{ik})\,\left(-\frac{1}{2}\,\ln \det \mathsf{\Sigma}_k - \frac{1}{2} (\mathbf{y}_i - \boldsymbol{\mu}_k)^\mathsf{T}\cdot\mathsf{\Sigma}_k^{-1}\cdot (\mathbf{y}_i - \boldsymbol{\mu}_k)\right)\nonumber \\[1em] &= -\frac{1}{2}\sum_{i=1}^N \gamma(z_{ik})\left(\mathsf{\Sigma}_k^{-1} - \mathsf{\Sigma}_k^{-1}\cdot(\mathbf{y}_i - \boldsymbol{\mu}_k)\cdot (\mathbf{y}_i - \boldsymbol{\mu}_k)^\mathsf{T}\cdot\mathsf{\Sigma}_k^{-1}\right). \end{align} \]
Setting the derivative to zero and left multiplying and then right multiplying by \(\mathsf{\Sigma}_k\) gives
\[\begin{align} \mathsf{0} &= -\frac{1}{2}\sum_{i=1}^N \gamma(z_{ik})\left(\mathsf{\Sigma}_k\cdot\mathsf{\Sigma}_k^{-1}\cdot\mathsf{\Sigma}_k - \mathsf{\Sigma}_k\cdot\mathsf{\Sigma}_k^{-1}\cdot(\mathbf{y}_i - \boldsymbol{\mu}_k)\cdot (\mathbf{y}_i - \boldsymbol{\mu}_k)^\mathsf{T}\cdot\mathsf{\Sigma}_k^{-1}\cdot\mathsf{\Sigma}_k\right) \nonumber \\ &= -\frac{1}{2}\sum_{i=1}^N \gamma(z_{ik})\left(\mathsf{\Sigma}_k - (\mathbf{y}_i - \boldsymbol{\mu}_k)\cdot (\mathbf{y}_i - \boldsymbol{\mu}_k)^\mathsf{T}\right) , \end{align} \]
which is readily solved to give
\[\begin{align} \mathsf{\Sigma}_k^* = \frac{1}{\gamma_k}\sum_{i=1}^N \gamma(z_{ik})\,(\mathbf{y}_i - \boldsymbol{\mu}_k^*)\otimes(\mathbf{y}_i - \boldsymbol{\mu}_k^*). \end{align} \]
To find the value of \(\boldsymbol{\pi}^\theta\) that maximizes the surrogate function, we need to enforce the constraint that \(\sum_k\pi_k^\theta = 1\), which we can do using a Lagrange multiplier.
\[\begin{align} \frac{\partial}{\partial \pi_k} \left(\mathcal{Q} + \lambda\left(1 - \sum_{j=1}^K \pi_j\right)\right) &= \sum_{i=1}^N\frac{\gamma(z_{ik})}{\pi_k} - \lambda. \end{align} \]
Setting this derivative to zero and multiplying both sides by \(\pi_k\) gives
\[\begin{align} 0 &= \sum_{i=1}^N\gamma(z_{ik}) - \lambda\,\pi_k = \gamma_k - \lambda\,\pi_k. \end{align} \]
Summing both sides over \(k\), and noting that \(\sum_{k=1}^K \gamma(z_{ik}) = 1\) for each \(i\) and \(\sum_{k=1}^K \pi_k = 1\), gives \(\lambda = N\). Then, solving for \(\pi_k\), we have
\[\begin{align} \pi_k^* = \frac{\gamma_k}{N}. \end{align} \]
In summary, then, the M step involves updating the values of the parameters as follows:
\[\begin{align} &\pi_k^* = \frac{\gamma_k}{N}, \\[1em] &\boldsymbol{\mu}^*_k = \frac{1}{\gamma_k}\sum_{i=1}^N\gamma(z_{ik})\,\mathbf{y}_i, \\[1em] &\mathsf{\Sigma}_k^* = \frac{1}{\gamma_k}\sum_{i=1}^N \gamma(z_{ik})\,(\mathbf{y}_i - \boldsymbol{\mu}_k^*)\otimes(\mathbf{y}_i - \boldsymbol{\mu}_k^*). \end{align} \]
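Putting the pieces together, below is a minimal sketch of the full EM iteration for the uniform-prior case, reusing the hypothetical `log_likelihood` and `responsibilities` functions sketched earlier; the convergence tolerance and iteration cap are arbitrary illustrative choices.

```python
import numpy as np


def em_gmm(y, pi, mus, Sigmas, tol=1e-8, max_iters=1000):
    """MAP estimation for a GMM with uniform priors via EM.

    `pi`, `mus`, and `Sigmas` are initial guesses of the parameters.
    """
    log_post = log_likelihood(y, pi, mus, Sigmas)

    for _ in range(max_iters):
        # E step: responsibilities from the current parameters (Eq. 39.5)
        gamma = responsibilities(y, pi, mus, Sigmas)
        gamma_k = gamma.sum(axis=0)

        # M step: analytical updates for uniform priors
        pi = gamma_k / len(y)
        mus = (gamma.T @ y) / gamma_k[:, None]
        Sigmas = np.array(
            [
                (gamma[:, k, None] * (y - mus[k])).T @ (y - mus[k]) / gamma_k[k]
                for k in range(len(pi))
            ]
        )

        # Convergence check on the log posterior (Eq. 39.3)
        new_log_post = log_likelihood(y, pi, mus, Sigmas)
        if new_log_post - log_post < tol:
            break
        log_post = new_log_post

    return pi, mus, Sigmas
```

Because EM only finds a local maximum, in practice it is common to run it from several initial guesses and keep the result with the largest log posterior.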
39.3 GMM priors
Until now, except for the above treatment with uniform priors, we have left the choice of priors unspecified. There are a few challenges for choosing priors for Gaussian mixture models that arise from constraints on the parameters, and we address them one by one below.
39.3.1 Prior for the mixing coefficients
We must have \(0 \le \pi_k \le 1\) for all \(\pi_k\), with \(\sum_{k=1}^K \pi_k = 1\). The mixing coefficients thus lie on a simplex. The Dirichlet distribution is convenient for defining priors on simplices. It is parametrized by a \(K\)-vector \(\boldsymbol{\alpha}\) of positive numbers that push probability mass toward the corresponding mixing coefficients. An uninformative prior on a simplex is a Dirichlet distribution with every entry of \(\boldsymbol{\alpha}\) equal to one, which is uniform over the simplex.
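For example, drawing a set of mixing coefficients from an uninformative Dirichlet prior and evaluating its log density with SciPy might look like the following sketch; the value of \(K\) and the seed are arbitrary.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(seed=3252)

K = 3
alpha = np.ones(K)  # all ones: uniform over the simplex

# Draw mixing coefficients; entries are nonnegative and sum to one
pi = scipy.stats.dirichlet.rvs(alpha, random_state=rng)[0]

# Log prior density, as would appear in the prior term of the log posterior
log_prior = scipy.stats.dirichlet.logpdf(pi, alpha)
```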
39.3.2 Prior for the location parameters
Mixture models suffer from a label-switching nonidentifiability. To understand this issue, recall that we have a set of location parameters \(\{\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots \boldsymbol{\mu}_K\}\). We can permute the subscripts of these parameters and we still get the same model because the subscript labels are arbitrary. In practice, that means that if we, say, swapped the values of \(\boldsymbol{\mu}_1\) and \(\boldsymbol{\mu}_2\), \(\pi_1\) and \(\pi_2\), and \(\mathsf{\Sigma}_1\) and \(\mathsf{\Sigma}_2\) we get exactly the same posterior probability density.
This is a tricky problem in general. One approach is to enforce that \(\mu_{11} < \mu_{21} < \cdots < \mu_{K1}\). That is, the entries of the \(\boldsymbol{\mu}_k\)'s for the first dimension of the multivariate Gaussian are ordered.
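A simple way to impose this ordering on point estimates or posterior samples is to relabel the components after the fact, which resolves the same nonidentifiability. A minimal sketch (the function name is ours):

```python
import numpy as np


def order_components(pi, mus, Sigmas):
    """Relabel components so the first coordinates of the means are ordered."""
    order = np.argsort(mus[:, 0])
    return pi[order], mus[order], Sigmas[order]
```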
39.3.3 Prior for the covariance matrix
Any covariance matrix \(\mathsf{\Sigma}\) must be real, symmetric, and positive definite. It can therefore be tricky to come up with a prior distribution, and in particular to ensure positive definiteness.
It is useful to note that any real symmetric positive definite matrix has a unique Cholesky decomposition \(\mathsf{L}\) such that \(\mathsf{\Sigma} = \mathsf{L}\cdot \mathsf{L}^\mathsf{T}\), where \(\mathsf{L}\) is a lower triangular matrix with real entries and positive diagonal entries. It is therefore often easier to work with the Cholesky decomposition of a covariance matrix when performing optimization (or any other kind of inference), because in practice you can leave the off-diagonal entries of \(\mathsf{L}\) unconstrained (constraining only the diagonal entries to be positive, for example by working with their logarithms), and the resulting covariance matrix \(\mathsf{L}\cdot\mathsf{L}^\mathsf{T}\) is guaranteed to be positive definite.
It is further useful to note that any real symmetric positive definite matrix can be written as \(\mathsf{\Sigma} = \mathsf{S}\cdot \mathsf{C} \cdot \mathsf{S}\), where \(\mathsf{S}\) is a diagonal matrix whose diagonal entries are the square roots of the diagonal entries of \(\mathsf{\Sigma}\), and \(\mathsf{C}\) is a symmetric matrix with entry \(i,j\) given by \(C_{ij} = \Sigma_{ij} / \sqrt{\Sigma_{ii} \Sigma_{jj}}\). Note that the diagonal entries of \(\mathsf{C}\) are all one. It follows that \(\mathsf{\Sigma}\) is positive definite if and only if \(\mathsf{C}\) is positive definite. The matrix \(\mathsf{C}\) is referred to as a correlation matrix.
Given these two facts, a useful (and typical) approach to priors for covariance matrices is to specify priors for the standard deviations of each dimension (the square roots of the diagonal entries of \(\mathsf{\Sigma}\)) and then to specify a prior for the correlation matrix that ensures positive definiteness. We could just specify a prior for the Cholesky decomposition of the correlation matrix so that whatever values we get for the Cholesky decomposition will result in a positive definite covariance matrix. (Each row of the Cholesky factor of a correlation matrix is a unit vector, so the diagonal entries are determined by the off-diagonal entries and only the off-diagonal entries in the lower triangle need be inferred.)
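As an illustration of the \(\mathsf{\Sigma} = \mathsf{S}\cdot\mathsf{C}\cdot\mathsf{S}\) decomposition, the sketch below assembles a bivariate covariance matrix from standard deviations and the lower-triangular Cholesky factor of a correlation matrix; the particular numbers are arbitrary.

```python
import numpy as np

# Standard deviations of each dimension (square roots of the diagonal of Sigma)
sigma = np.array([1.5, 0.8])
S = np.diag(sigma)

# Lower-triangular Cholesky factor of a correlation matrix; each row is a
# unit vector, so the diagonal is determined by the off-diagonal entries
rho = 0.3
L_corr = np.array([[1.0, 0.0], [rho, np.sqrt(1.0 - rho**2)]])

C = L_corr @ L_corr.T  # correlation matrix with unit diagonal
Sigma = S @ C @ S      # covariance matrix, guaranteed positive definite

# Equivalently, S @ L_corr is a Cholesky factor of Sigma itself
assert np.allclose((S @ L_corr) @ (S @ L_corr).T, Sigma)
```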
The Lewandowski-Kurowicka-Joe (LKJ) distribution provides convenient priors for correlation matrices, and is implemented in Stan. It has a single positive scalar parameter, \(\eta\), which tunes the strength of the correlations. If \(\eta = 1\), the density is uniform over all correlation matrices. If \(\eta > 1\), matrices with a stronger diagonal (and therefore smaller correlations) are favored. If \(\eta < 1\), the diagonal is weak and correlations are favored.
In the bivariate case, more fine-grained control of the prior can easily be achieved. A covariance matrix may be decomposed as follows.
\[\begin{align} \mathsf{\Sigma} = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\[0.5em] \sigma_{12} & \sigma_2^2 \end{pmatrix} = \begin{pmatrix} \sigma_1 & 0 \\[0.5em] 0 & \sigma_2 \end{pmatrix} \cdot \begin{pmatrix} 1 & \rho \\[0.5em] \rho & 1 \end{pmatrix} \cdot \begin{pmatrix} \sigma_1 & 0 \\[0.5em] 0 & \sigma_2 \end{pmatrix}, \end{align} \]
where \(\rho\) is the bivariate correlation, also known as the Pearson correlation, given by
\[\begin{align} \rho = \frac{\sigma_{12}}{\sigma_1\,\sigma_2}. \end{align} \]
Positive definiteness of \(\mathsf{\Sigma}\) is achieved when \(-1 < \rho < 1\) (and \(\sigma_1, \sigma_2 > 0\)). Therefore, we can provide priors for \(\sigma_{1}\), \(\sigma_{2}\), and \(\rho\) for a covariance matrix we wish to parametrize. Given that \(\rho\) must be between \(-1\) and \(1\), a common prior is \(\rho \sim 2\,\mathrm{Beta}(\alpha, \beta) - 1\); that is, \((1 + \rho)/2\) has a \(\mathrm{Beta}(\alpha, \beta)\) prior. An \(\mathrm{LKJ}(\eta)\) prior is equivalent to \(\rho \sim 2\,\mathrm{Beta}(\eta, \eta) - 1\).
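As a sketch of this parametrization, the following draws a bivariate covariance matrix from priors of this form. The particular prior choices (half-Normal for the standard deviations, \(\mathrm{Beta}(2, 2)\) for the transformed \(\rho\)) are illustrative, not prescribed.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(seed=3252)

# Priors on the standard deviations (must be positive); here half-Normal
sigma_1 = np.abs(rng.normal(0.0, 1.0))
sigma_2 = np.abs(rng.normal(0.0, 1.0))

# rho ~ 2 Beta(2, 2) - 1, so -1 < rho < 1, equivalent to an LKJ(2) prior
rho = 2.0 * scipy.stats.beta.rvs(2.0, 2.0, random_state=rng) - 1.0

Sigma = np.array(
    [
        [sigma_1**2, rho * sigma_1 * sigma_2],
        [rho * sigma_1 * sigma_2, sigma_2**2],
    ]
)
```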