37 Minorize-maximize algorithms

We have been taking the (suboptimal, but in some cases necessary) approach of making sense of posterior distributions by finding the parameter values for which the posterior probability density function is maximal, the so-called MAP. We have been employing various optimization methods to find the MAP, and different methods are effective in different contexts. Specifically, in the context of inferring latent variables, the expectation-maximization algorithm, or EM algorithm, is particularly powerful.

The EM algorithm is a special case of a minorize-maximize algorithm, or MM algorithm, also known as a bound optimization algorithm. Here, we introduce the main ideas behind MM algorithms with an eye toward applications of EM algorithms.

37.1 Description of the algorithm

Consider an optimization problem in which we seek parameters \(\theta\) that maximize some objective function \(h(\theta)\). This is the same general optimization problem we have been considering thus far.

The basic idea behind an MM algorithm is to define a surrogate function, \(Q(\theta, \phi)\), that is a tight lower bound of \(h(\theta)\), i.e., that minorizes \(h(\theta)\). That is to say, \(Q(\theta, \phi)\) is nowhere greater than \(h(\theta)\), with equality at \(\theta = \phi\), where \(Q(\phi, \phi) = h(\phi)\). Once we have defined the surrogate function \(Q(\theta, \phi)\), we find the value of \(\theta\) that maximizes it. Because \(Q(\theta, \phi)\) minorizes \(h(\theta)\), if we evaluate \(h(\theta)\) at the \(\theta\) that maximizes \(Q(\theta, \phi)\), the result will be at least as large as \(Q(\theta, \phi)\) there. The only time \(h(\theta)\) will fail to lie above \(Q(\theta, \phi)\) is when the maximizer of \(Q(\theta, \phi)\) is also the maximizer of \(h(\theta)\), which means we have found the maximizer of \(h(\theta)\).
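
Written compactly, the two requirements on the surrogate are

\[\begin{align} Q(\theta, \phi) \le h(\theta) \text{ for all } \theta, \qquad Q(\phi, \phi) = h(\phi). \end{align}\]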

So, we start with some guess for which \(\theta\) maximizes \(h(\theta)\); call it \(\phi\). We then construct the surrogate function \(Q(\theta, \phi)\). Next, we find the value of \(\theta\) that maximizes the surrogate function; call that \(\theta^*\).

\[\begin{align} \theta^* = \underset{\theta}{\text{arg max}} \,Q(\theta, \phi). \end{align} \tag{37.1}\]

Given that \(Q(\theta,\phi)\) minorizes \(h(\theta)\), we have

\[\begin{align} h(\theta^*) \ge Q(\theta^*,\phi) \ge Q(\phi, \phi) = h(\phi). \end{align} \tag{37.2}\]

This means that by updating \(\phi\) to \(\theta^*\), we have moved uphill on \(h(\theta)\). We continue in this way until we stop climbing uphill. An MM algorithm thus proceeds as follows.

  1. Guess a maximizer of \(h(\theta)\) and define it as \(\phi\).
  2. Define \(Q(\theta, \phi)\) which minorizes \(h(\theta)\).
  3. Compute \(\theta^* = \underset{\theta}{\text{arg max}}\, Q(\theta, \phi)\).
  4. If the difference between \(h(\theta^*)\) and \(h(\phi)\) is smaller than a predefined threshold, STOP and record \(\theta^*\) as the maximizer of \(h(\theta)\). Otherwise, go to 5.
  5. Set \(\phi = \theta^*\).
  6. Go to 2.

The alternating minorization of the objective function and maximization of the minorizing surrogate is where the MM algorithm gets its name.
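
To make these steps concrete, here is a minimal sketch in Python for an assumed toy objective, \(h(\theta) = -\sum_i |x_i - \theta|\), whose maximizer is a median of the data. It uses the quadratic minorizer \(-|x_i - \theta| \ge -\left[(x_i - \theta)^2/(2|x_i - \phi|) + |x_i - \phi|/2\right]\), which equals the corresponding term of \(h\) at \(\theta = \phi\). The function name `mm_median` and the numerical guard `eps` are ours for illustration only.

```python
import numpy as np


def mm_median(x, phi=None, tol=1e-8, max_iter=1000, eps=1e-12):
    """MM maximization of h(theta) = -sum_i |x_i - theta|.

    The surrogate Q(theta, phi) = -sum_i [(x_i - theta)^2 / (2 |x_i - phi|)
    + |x_i - phi| / 2] minorizes h, touches it at theta = phi, and is
    maximized analytically by a weighted average of the data.
    """
    x = np.asarray(x, dtype=float)

    def h(theta):
        return -np.sum(np.abs(x - theta))

    # Step 1: guess a maximizer of h
    phi = np.mean(x) if phi is None else phi

    for _ in range(max_iter):
        # Step 2: construct the minorizer; eps guards against division by zero
        w = 1.0 / np.maximum(np.abs(x - phi), eps)

        # Step 3: maximize the surrogate analytically (a weighted average)
        theta_star = np.sum(w * x) / np.sum(w)

        # Step 4: stop when we are no longer climbing appreciably
        if h(theta_star) - h(phi) < tol:
            return theta_star

        # Steps 5 and 6: update phi and repeat
        phi = theta_star

    return phi


rng = np.random.default_rng(seed=3252)
x = rng.normal(size=101)
print(mm_median(x), np.median(x))  # these two should nearly agree
```

Each maximization of the surrogate here is a closed-form weighted average, which is precisely the appeal discussed in the next section.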

37.2 This is clever, but what is the rub?

The MM algorithm is appealing. We repeatedly bound the objective function from below with a surrogate function, maximize the surrogate, move it up to the new maximizer, and repeat. But there are two challenges.

First, in the maximization step, we still need to do a maximization! Why not just maximize the objective function in the first place? The idea is that we can pick a surrogate function that is much easier to maximize, perhaps even one that we can maximize analytically. It is much easier to solve many easy maximization problems than one hard one.
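
For instance, for the toy surrogate in the sketch above, setting the derivative of the surrogate to zero yields a closed-form maximizer,

\[\begin{align} \frac{\partial Q}{\partial \theta} = \sum_i \frac{x_i - \theta}{|x_i - \phi|} = 0 \quad \Rightarrow \quad \theta^* = \frac{\sum_i x_i / |x_i - \phi|}{\sum_i 1/|x_i - \phi|}, \end{align}\]

so each maximization step amounts to computing a weighted average rather than running a numerical optimizer.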

The other issue is coming up with the surrogate function. We have already alluded to the fact that it is not worth using an MM algorithm if the surrogate function is no easier to optimize than the objective function. So, we have to pick a surrogate function that is easily optimizable. Beyond that, we need to make sure it is indeed a minorizer; the surrogate function must be a tight lower bound of the objective function!
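
As a quick numerical sanity check of the kind one might do in practice, we can verify on a grid that the toy surrogate from the sketch above never exceeds its objective and touches it at \(\theta = \phi\). (The data set and the value of \(\phi\) here are arbitrary choices for illustration.)

```python
import numpy as np

rng = np.random.default_rng(seed=3252)
x = rng.normal(size=50)


def h(theta):
    """Objective: negative total absolute deviation."""
    return -np.sum(np.abs(x - theta))


def Q(theta, phi):
    """Quadratic surrogate built at phi."""
    absdev = np.abs(x - phi)
    return -np.sum((x - theta) ** 2 / (2.0 * absdev) + absdev / 2.0)


phi = 0.3
theta_grid = np.linspace(-3.0, 3.0, 1001)

# The surrogate should never exceed the objective (up to floating point slop)...
print(all(Q(theta, phi) <= h(theta) + 1e-9 for theta in theta_grid))

# ...and should equal the objective exactly at theta = phi.
print(np.isclose(Q(phi, phi), h(phi)))
```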

So, ultimately, effective use of an MM algorithm comes down to choosing easily optimizable minorizers of the objective function. For a general optimization problem, this is nontrivial. In the next lesson, we will show that for a specific class of posterior distributions, we can conveniently pick a surrogate function that allows for effective MM approaches.