38  The expectation-maximization (EM) algorithm

Consider the case of a generative model in which we observe data \(y\) but also have unobserved data \(z\). Unobserved data, often referred to as latent variables, are parameters in the Bayesian sense, but we will treat them differently from the rest of the parameters, \(\theta\), in our generative model. So, if our goal is to obtain a MAP estimate for \(\theta\), we will need to marginalize away the latent variables \(z\).

38.1 The optimization problem solved by EM

For this case, the joint distribution is \(\pi(y, z, \theta)\), and Bayes’s theorem reads

\[\begin{align} g(z, \theta \mid y) = \frac{\pi(y, z, \theta)}{g(y)} = \frac{\pi(y, z\mid \theta)\,g(\theta)}{g(y)} = \frac{f(y\mid z, \theta)\,g(z\mid \theta)\,g(\theta)}{g(y)} \end{align} \tag{38.1}\]

The marginal posterior of interest is

\[\begin{align} g(\theta \mid y) &= \frac{1}{g(y)}\,\sum_z \pi(y, z, \theta) \nonumber \\[1em] &= \frac{1}{g(y)}\,\sum_z \pi(y, z\mid \theta)\,g(\theta) \nonumber \\[1em] &= \frac{f(y\mid \theta)\,g(\theta)}{g(y)}, \end{align} \tag{38.2}\]

where, as usual, the sum over \(z\) is replaced with an integral for continuous \(z\).

So, if we are trying to find the MAP, we wish to maximize \(g(\theta\mid y)\), or equivalently \(\ln g(\theta \mid y)\). As usual, the denominator in Bayes's theorem has no \(z\) or \(\theta\) dependence, so finding the MAP is equivalent to finding the value of \(\theta\) that maximizes

\[\begin{align} \pi(y, \theta) = \sum_z \pi(y, z, \theta) = f(y\mid \theta)\,g(\theta). \end{align} \tag{38.3}\]

Our goal, then, is to compute

\[\begin{align} \underset{\theta}{\text{arg max}} h(\theta), \end{align} \tag{38.4}\]

where the objective function is \[\begin{align} h(\theta) = \ln \pi(y,\theta). \end{align} \tag{38.5}\]

Note that defining the objective function requires evaluation of the sum over \(z\) to marginalize the joint distribution. This sum is in general very difficult to evaluate. However, we will see that we can get away without ever evaluating it by using the EM algorithm! This is where its real magic comes in.
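
To see what this sum looks like in practice, here is a minimal sketch for a toy model I am assuming purely for illustration (a two-component Normal mixture with known weights and variance, latent component labels \(z\), and a flat prior on the unknown means \(\theta = (\mu_0, \mu_1)\)), evaluating \(h(\theta)\) by brute-force marginalization. For i.i.d. data with a small discrete latent space the sum over \(z\) factorizes over data points and is tractable; in general it is not, which is exactly what EM lets us sidestep.

```python
# A minimal sketch (assumptions, not from the text): two-component Normal
# mixture with known weights w and variance, latent component labels z, and a
# flat prior on the unknown means theta = (mu0, mu1), so ln g(theta) is a
# constant we drop.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

rng = np.random.default_rng(0)
w = np.array([0.6, 0.4])      # known mixture weights
sigma = 1.0                   # known standard deviation
y = np.concatenate([rng.normal(-2, sigma, 60), rng.normal(3, sigma, 40)])

def h(mu, y=y, w=w, sigma=sigma):
    """Objective h(theta) = ln pi(y, theta), up to the flat-prior constant."""
    # ln pi(y_i, z_i = k | theta) for every data point i and component k
    log_joint = np.log(w) + norm.logpdf(y[:, None], mu[None, :], sigma)
    # marginalize over z_i for each data point, then sum over data points
    return logsumexp(log_joint, axis=1).sum()

print(h(np.array([-2.0, 3.0])), h(np.array([0.0, 0.0])))
```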

38.2 The E-step: Finding the minorizing surrogate function

We will take a somewhat convoluted path to constructing an appropriate surrogate function \(Q(\theta, \phi)\), but upon seeing the end result, it will make sense. We start by using conditional probability to write

\[\begin{align} \pi(y, z, \theta) = \pi(y, z\mid \theta)\,g(\theta) = g(z\mid y, \theta)\,f(y\mid \theta)\,g(\theta) = g(z\mid y, \theta)\,\pi(y, \theta), \end{align} \]

which we can rearrange to give

\[\begin{align} \pi(y, \theta) = \frac{\pi(y, z\mid \theta)\,g(\theta)}{g(z\mid y, \theta)}. \end{align} \]

Therefore, our objective function can be written as

\[\begin{align} h(\theta) = \ln \pi(y, \theta) = \ln \left[\frac{\pi(y, z \mid \theta)\,g(\theta)}{g(z\mid y, \theta)}\right] = \ln \pi(y, z \mid \theta) + \ln g(\theta) - \ln g(z \mid y, \theta). \end{align} \]

We now define a probability mass/density function \(q(z\mid \phi)\), which for now is arbitrary. We multiply both sides of the above equation by \(q(z\mid \phi)\) to get

\[\begin{align} q(z\mid \phi)\,h(\theta) = q(z\mid \phi)\,\ln \left[\pi(y, z \mid \theta)\,g(\theta)\right] - q(z\mid \phi)\,\ln g(z \mid y, \theta). \end{align} \]

We next use one of the best tricks in mathematics; we add zero to the right hand side. In this case, the “zero” we add is \(q(z\mid \phi) \ln q(z\mid \phi) - q(z\mid \phi) \ln q(z\mid \phi)\).

\[\begin{align} q(z\mid \phi)\,h(\theta) &= q(z\mid \phi)\,\ln \left[\pi(y, z \mid \theta)\,g(\theta)\right] - q(z\mid \phi)\,\ln g(z \mid y, \theta) + q(z\mid \phi) \ln q(z\mid \phi) - q(z\mid \phi) \ln q(z\mid \phi) \nonumber \\[1em] &= q(z\mid \phi)\,\ln \left[\frac{\pi(y, z \mid \theta)\,g(\theta)}{q(z\mid \phi)}\right] - q(z\mid \phi)\,\ln\left[\frac{g(z \mid y, \theta)}{q(z\mid \phi)}\right]. \end{align} \]

We now sum both sides of the equation over \(z\). The left hand side simply evaluates to \(h(\theta)\), since \(q(z\mid \phi)\) is normalized. We thus obtain

\[\begin{align} h(\theta) = \sum_z q(z\mid \phi)\,\ln \left[\frac{\pi(y, z\mid \theta)\,g(\theta)}{q(z\mid \phi)}\right] - \sum_z q(z\mid \phi)\,\ln\left[\frac{g(z \mid y, \theta)}{q(z\mid \phi)}\right] \end{align} \]

We recognize the second term as the negative Kullback-Leibler divergence between \(q(z\mid \phi)\) and \(g(z \mid y, \theta)\);

\[\begin{align} D_\mathrm{KL}(q(z\mid \phi) \, \| \, g(z \mid y, \theta)) = \sum_z q(z\mid \phi)\,\ln\left[\frac{q(z\mid \phi)}{g(z \mid y, \theta)}\right], \end{align} \]

so that

\[\begin{align} h(\theta) = \sum_z q(z\mid \phi)\,\ln \left[\frac{\pi(y, z\mid \theta)\,g(\theta)}{q(z\mid \phi)}\right] + D_\mathrm{KL}(q(z\mid \phi)\,\|\, g(z \mid y, \theta)). \end{align} \]

The Gibbs inequality tells us that the Kullback-Leibler divergence is nonnegative, and that it is zero only when \(q(z\mid \phi) = g(z \mid y, \theta)\). Therefore, the first term in the above equation is a lower bound on \(h(\theta)\), and it is a tight lower bound (a minorizer) if we choose \(q(z\mid \phi) = g(z \mid y, \phi)\), since the bound then touches \(h(\theta)\) at \(\theta = \phi\).

So our surrogate function that minorizes \(h(\theta)\) is

\[\begin{align} Q(\theta, \phi) &= \sum_z g(z \mid y, \phi)\,\ln \left[\frac{\pi(y, z \mid \theta)\,g(\theta)}{g(z \mid y, \phi)}\right]. \end{align} \]

Because the surrogate function is an expectation of \(\ln \left[\pi(y, z \mid \theta)\,g(\theta)/g(z \mid y, \phi)\right]\) over the distribution \(g(z \mid y, \phi)\), this minorizing step is referred to as the “expectation step,” or the “E step.”
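
Continuing the toy mixture sketch from above (reusing the assumed `y`, `w`, `sigma`, and `h`), we can check the minorization numerically: \(Q(\theta, \phi) \le h(\theta)\) for arbitrary \(\theta\), with equality at \(\theta = \phi\). The flat prior on the means again lets us drop \(\ln g(\theta)\).

```python
# A minimal sketch continuing the toy mixture above; q(z | phi) is taken to be
# g(z | y, phi), the posterior over the latent labels given the data and phi.
from scipy.special import xlogy

def responsibilities(phi, y=y, w=w, sigma=sigma):
    """E step: g(z_i = k | y_i, phi) for each data point i and component k."""
    log_joint = np.log(w) + norm.logpdf(y[:, None], phi[None, :], sigma)
    return np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))

def Q(theta, phi, y=y, w=w, sigma=sigma):
    """Surrogate Q(theta, phi), with the flat-prior ln g(theta) term dropped."""
    r = responsibilities(phi)
    log_joint = np.log(w) + norm.logpdf(y[:, None], theta[None, :], sigma)
    return np.sum(r * log_joint - xlogy(r, r))

phi = np.array([-1.0, 1.0])
print(np.isclose(Q(phi, phi), h(phi)))        # tight: Q(phi, phi) = h(phi)
for theta in [np.array([-2.0, 3.0]), np.array([0.5, 0.5])]:
    print(Q(theta, phi) <= h(theta))          # minorizer: Q(theta, phi) <= h(theta)
```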

38.3 The M-step: Finding the \(\theta\) that maximizes the surrogate function

Now that we have defined the surrogate function that minorizes the marginal log posterior, we need to find the value of \(\theta\) that maximizes \(Q(\theta, \phi)\). We note that we can write \(Q(\theta, \phi)\) as

\[\begin{align} Q(\theta, \phi) = \sum_z g(z \mid y, \phi)\,\ln \left[\pi(y, z \mid \theta)\,g(\theta)\right] - \sum_z g(z \mid y, \phi)\,\ln g(z \mid y, \phi). \end{align} \]

The last term is the entropy of the \(g(z\mid y, \phi)\) distribution, and is \(\theta\)-independent. So, the maximization problem involves finding the value of \(\theta\) that maximizes the quantity

\[\begin{align} \mathcal{Q}(\theta, \phi) = \sum_z g(z \mid y, \phi)\,\ln \left[\pi(y, z \mid \theta)\,g(\theta)\right], \end{align} \]

where \(\mathcal{Q}(\theta, \phi) = Q(\theta,\phi) - H[g(z \mid y, \phi)]\), with \(H\) denoting the entropy. We can further simplify \(\mathcal{Q}(\theta, \phi)\) by expanding the logarithm and using the normalization of \(g(z \mid y, \phi)\) to give

\[\begin{align} \mathcal{Q}(\theta, \phi) = \ln g(\theta) + \sum_z g(z \mid y, \phi)\,\ln \pi(y, z \mid \theta). \end{align} \tag{38.6}\]
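
For the toy mixture assumed in the sketches above (flat prior, known weights and variance), this maximization has a closed form: the component means that maximize \(\mathcal{Q}(\theta, \phi)\) are the responsibility-weighted sample means, the familiar Gaussian-mixture M step. A minimal sketch, reusing `responsibilities` from the previous block:

```python
def m_step(phi, y=y):
    """M step for the toy model: argmax over theta of the surrogate (flat prior)."""
    r = responsibilities(phi)        # shape (n, 2) responsibilities from the E step
    return r.T @ y / r.sum(axis=0)   # responsibility-weighted mean for each component
```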

38.4 Summary of the EM algorithm

We seek to find the value of the parameter \(\theta\) that maximizes the marginal log posterior,

\[\begin{align} g(\theta \mid y) = \frac{1}{g(y)}\sum_z \pi(y, z \mid \theta)\, g(\theta), \end{align} \]

where \(\pi(y,z\mid \theta) = f(y\mid z, \theta)\,g(z\mid \theta)\) and \(g(\theta)\) are known. This is equivalent to maximizing

\[\begin{align} h(\theta) = \ln \pi(y, \theta) = \ln[f(y\mid\theta)\,g(\theta)]. \end{align} \]

The EM algorithm goes as follows; a code sketch of the full loop appears after the list.

  1. Guess a maximizer of the marginal log posterior and define it as \(\phi\).
  2. Evaluate \(g(z\mid y, \phi)\). (E step)
  3. Use the result of step (2) to define \(\mathcal{Q}(\theta, \phi) = \sum_z g(z \mid y, \phi)\,\ln \left[\pi(y, z \mid \theta)\,g(\theta)\right]\).
  4. Compute \(\theta^* = \underset{\theta}{\text{arg max}}\;\mathcal{Q}(\theta, \phi)\). (M step)
  5. If the difference between \(h(\theta^*)\) and \(h(\phi)\) is smaller than a predefined threshold, STOP and record \(\theta^*\) as the maximizer of \(h(\theta)\). Otherwise, go to 6.
  6. Set \(\phi = \theta^*\).
  7. Go to 2.
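
Putting the pieces together for the toy mixture used in the sketches above (again, an illustrative model I am assuming, not one from the text), the full loop might look like the following, stopping when the increase in \(h(\theta)\) falls below a threshold.

```python
# A minimal sketch of the EM loop, reusing h, responsibilities, and m_step
# from the earlier blocks (all defined for the assumed toy mixture model).
def em(phi, tol=1e-8, max_iter=1000):
    """Iterate E and M steps from the initial guess phi; return the MAP estimate."""
    for _ in range(max_iter):
        theta_star = m_step(phi)            # E step (inside m_step), then M step
        if h(theta_star) - h(phi) < tol:    # h never decreases from one iterate to the next
            return theta_star
        phi = theta_star
    return phi

# Should land near the means used to simulate the data (about -2 and 3)
print(em(np.array([-1.0, 1.0])))
```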