40  An example application of the EM algorithm to a Gaussian mixture model


Some of the material in this notebook is based on a similar notebook written by Kayla Jackson.

# Colab setup ------------------
import os, shutil, sys, subprocess, urllib.request
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade polars iqplot colorcet bebi103 arviz cmdstanpy watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    from cmdstanpy.install_cmdstan import latest_version
    cmdstan_version = latest_version()
    cmdstan_url = f"https://github.com/stan-dev/cmdstan/releases/download/v{cmdstan_version}/"
    fname = f"colab-cmdstan-{cmdstan_version}.tgz"
    urllib.request.urlretrieve(cmdstan_url + fname, fname)
    shutil.unpack_archive(fname)
    os.environ["CMDSTAN"] = f"./cmdstan-{cmdstan_version}"
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------
import numpy as np
import polars as pl

import scipy.stats as st
import scipy.optimize
import scipy.special
import statsmodels.tools.numdiff as smnd

import sklearn.mixture

import cmdstanpy
import arviz as az

import iqplot
import bebi103
import colorcet

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

As an example of an application of the EM algorithm to a Gaussian mixture model, we can use a now-classic data set featuring the waiting times and durations of eruptions of the Old Faithful geyser in Yellowstone National Park. The data set consists of a set of measurements of eruptions in 1985 and was published by Azzalini and Bowman in 1990. Let us first take a look at the data.

df = pl.read_csv(os.path.join(data_path, 'old_faithful.csv'))

p = bokeh.plotting.figure(
    frame_width=300,
    frame_height=300,
    x_axis_label='eruption duration (min)',
    y_axis_label='eruption waiting time (min)'
)

p.scatter(source=df.to_dict(), x='duration-minutes', y='waiting-minutes')

bokeh.io.show(p)

To build our model, we assume there are two bivariate Gaussians in our mixture. We will assume uniform priors for the mixing coefficients and for the correlation matrix. We choose reasonable priors for the remaining parameters. Our model is

\[\begin{align} &\pi_1 \sim \mathrm{Beta}(1, 1)\;\text{(equivalent to }\boldsymbol{\pi} \sim \text{Dirichlet}(1, 1)\text{)}, \\[1em] &\pi_2 = 1- \pi_1, \\[1em] &\mu_{d,1}, \mu_{d,2} \sim \mathrm{Norm}(5, 2), \\[1em] &\mu_{w,1}, \mu_{w,2} \sim \mathrm{Norm}(60, 10), \\[1em] &\boldsymbol{\mu}_1 = (\mu_{d,1}, \mu_{w,1})^\mathsf{T},\\[1em] &\boldsymbol{\mu}_2 = (\mu_{d,2}, \mu_{w,2})^\mathsf{T},\\[1em] &\sigma_{d,1}, \sigma_{d,2} \sim \mathrm{HalfNorm}(5), \\[1em] &\sigma_{w,1}, \sigma_{w,2} \sim \mathrm{HalfNorm}(20), \\[1em] &\rho_{1}, \rho_{2} \sim 2\,\mathrm{Beta}(1, 1) - 1\;\text{(equivalent to }\mathsf{C}_1, \mathsf{C}_2 \sim \text{LKJ}(1)\text{)}, \\[1em] &\mathsf{\Sigma}_1 = \begin{pmatrix} \sigma_{d,1} & 0 \\[0.5em] 0 & \sigma_{w,1} \end{pmatrix} \cdot \begin{pmatrix} 1 & \rho_1 \\[0.5em] \rho_1 & 1 \end{pmatrix} \cdot \begin{pmatrix} \sigma_{d,1} & 0 \\[0.5em] 0 & \sigma_{w,1} \end{pmatrix}, \\[1em] &\mathsf{\Sigma}_2 = \begin{pmatrix} \sigma_{d,2} & 0 \\[0.5em] 0 & \sigma_{w,2} \end{pmatrix} \cdot \begin{pmatrix} 1 & \rho_2 \\[0.5em] \rho_2 & 1 \end{pmatrix} \cdot \begin{pmatrix} \sigma_{d,2} & 0 \\[0.5em] 0 & \sigma_{w,2} \end{pmatrix}, \\[1em] &z_{i1} \mid \pi_1 \sim \text{Bernoulli}(\pi_1) \;\forall i, \\[1em] &z_{i2} = 1 - z_{i,1} \; \forall i, \\[1em] &\mathbf{y}_i \mid \mathbf{z}_i, \boldsymbol{\mu}, \mathsf{\Sigma} \sim \mathrm{Norm}(\boldsymbol{\mu}_1,\mathsf{\Sigma}_1)^{z_{i1}} \cdot \mathrm{Norm}(\boldsymbol{\mu}_2,\mathsf{\Sigma}_2)^{z_{i2}}\;\forall i. \end{align} \]

To avoid a label switching nonidentifiability, we stipulate that \(\mu_{d,1} < \mu_{d,2}\).
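To make the covariance construction concrete, here is a minimal sketch (using made-up parameter values for a single component, not estimates from the data) of how a covariance matrix \(\mathsf{\Sigma}\) is assembled from the standard deviations and the correlation coefficient as written above.

# Illustrative (made-up) values for one mixture component
sigma_d, sigma_w = 0.3, 6.0  # standard deviations of duration and waiting time
rho = 0.5                    # correlation coefficient

# Σ = diag(σ) · C · diag(σ), as in the model specification above
S = np.diag([sigma_d, sigma_w])
C = np.array([[1.0, rho], [rho, 1.0]])
Sigma_example = np.dot(S, np.dot(C, S))

Sigma_example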

40.0.1 Full MCMC of the generative model

We start by applying the gold standard: performing full MCMC to sample out of the posterior distribution. Because HMC (Stan’s algorithm) can only sample continuous variables, we must sample out of the marginal posterior in which we have marginalized out the \(\mathbf{z}_i\)’s. This is not problematic, as we can always compute the responsibilities for each sample. The Stan code below is written for general Gaussian mixture models, but we can use it in this case.
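Concretely, summing over the possible values of each \(\mathbf{z}_i\) gives the marginal likelihood of a data point,

\[f(\mathbf{y}_i \mid \boldsymbol{\pi}, \{\boldsymbol{\mu}_k\}, \{\mathsf{\Sigma}_k\}) = \sum_{k=1}^K \pi_k\,\mathrm{Norm}(\mathbf{y}_i \mid \boldsymbol{\mu}_k, \mathsf{\Sigma}_k),\]

and it is the logarithm of this sum that the Stan program accumulates using the log-sum-exp trick.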

data {
    int<lower=2> K;       // Number of mixture components (at least 2)
    int<lower=1> d;       // Dimensionality of observations

    int<lower=1> N;       // Number of multidimensional data points

    array[N] vector[d] y; // Measurements
}


parameters {
    // Mixing coefficients
    simplex[K] pi_;
 
    // The mu values for the first dimension are ordered to 
    // prevent label switching nonidentifiability.
    ordered[K] mu_1;
    array[K] vector[d-1] mu_2_to_K;   // Entries 2 through d of each mu

    // Cholesky decomposition of correlation matrix
    array[K] cholesky_factor_corr[d] L_Omega;

    // Sqrt of diagonal of covariance matrix
    array[K] vector<lower=0>[d] sigma;
}


transformed parameters {
    // mu's as vectors. We have to build these by hand because the first
    // entry of each mu is kept ordered across components to prevent the
    // label-switching nonidentifiability.
    array[K] vector[d] mu;
    for (k in 1:K) {
        mu[k][1] = mu_1[k];
        mu[k][2:d] = mu_2_to_K[k];
    }

    // Cholesky factor for covariance matrix
    array[K] cholesky_factor_cov[d] L_Sigma;
    for (k in 1:K) {
        L_Sigma[k] = diag_pre_multiply(sigma[k], L_Omega[k]);
    }
}


model {
    // Begin priors -- edit for specific use case
    // Prior for mixing coefficients
    pi_ ~ dirichlet(rep_vector(1.0, K));

    for (k in 1:K) {
        // For Old Faithful analysis, d = 2; explicit priors on mu's.
        mu[k][1] ~ normal(5.0, 2.0);
        mu[k][2] ~ normal(60.0, 10.0);

        // Sqrt of diagonal of covariance matrix
        // For Old Faithful analysis, d = 2; explicit priors.
        sigma[k][1] ~ normal(0.0, 5.0);
        sigma[k][2] ~ normal(0.0, 20.0);

        // Use LKJ priors with eta = 1 for correlation matrix
        L_Omega[k] ~ lkj_corr_cholesky(1.0);
    }
    // End priors

    // Likelihood for GMM
    for (i in 1:N) {
        // Use logsumexp trick to compute likelihood of mixture model
        array[K] real log_pdf_components;
        for (k in 1:K) {
            log_pdf_components[k] = log(pi_[k]) 
                + multi_normal_cholesky_lpdf(y[i] | mu[k], L_Sigma[k]);
        }
        target += log_sum_exp(log_pdf_components);
    }
}


generated quantities {
    // Compute responsibilities
    array[N, K] real responsibilities;

    for (i in 1:N) {
        array[K] real log_responsibilities;

        // Compute log of the numerator for the responsibilities
        for (k in 1:K) {
            log_responsibilities[k] = log(pi_[k]) 
                + multi_normal_cholesky_lpdf(y[i] | mu[k], L_Sigma[k]);
        }

        // Negative log of the normalization factor, which is the sum of the numerators
        real log_normalization_factor = -log_sum_exp(log_responsibilities);

        // Exponentiate log responsibilities to get responsibilities
        for (k in 1:K) {
            responsibilities[i, k] = exp(
                log_responsibilities[k] + log_normalization_factor
            );
        }
    }

    // Covariance matrices
    array[K] cov_matrix[d] Sigma;
    for (k in 1:K) {
        Sigma[k] = multiply_lower_tri_self_transpose(L_Sigma[k]);
    }
}

Let’s put this to use!

# Extract data set from data frame
y = df.select(pl.col('duration-minutes', 'waiting-minutes')).to_numpy()

# Prepare data for input, including 
# number of components in mixture and number of dimensions.
data = dict(
    K=2,
    d=2,
    N=len(df),
    y=y,
)

# Compile and sample!
with bebi103.stan.disable_logging():
    sm = cmdstanpy.CmdStanModel(stan_file='old_faithful_gmm.stan')
    samples = sm.sample(data=data)

# Get a convenient ArviZ object
samples = az.from_cmdstanpy(samples)

# As always, check diagnostics
bebi103.stan.check_all_diagnostics(samples, parameters=['mu', 'L_Sigma', 'pi_'])
Effective sample size looks reasonable for all parameters.

Rhat looks reasonable for all parameters.

0 of 4000 (0.0%) iterations ended with a divergence.

0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.

E-BFMI indicated no pathological behavior.
0

Looks great! We can check a corner plot to see where the centers of the two Gaussians of the mixture model are.

bokeh.io.show(
    bebi103.viz.corner(
        samples, 
        parameters=[
            ('mu[0,0]', r'$$\mu_{\mathrm{d},1}$$ (min)'), 
            ('mu[0,1]', r'$$\mu_{\mathrm{w},1}$$ (min)'), 
            ('mu[1,0]', r'$$\mu_{\mathrm{d},2}$$ (min)'), 
            ('mu[1,1]', r'$$\mu_{\mathrm{w},2}$$ (min)')
        ],
        xtick_label_orientation=np.pi/4,
    )
)

This matches what we expect from our exploratory plot of the data. Conveniently, the corner plot displays credible intervals for the parameters describing the centers of the bivariate Gaussians in the mixture. We can also calculate these explicitly from the samples.

# Credible intervals on all mu's
for component in [0, 1]:
    for dim, descriptor in zip([0, 1], ['duration (min)', 'waiting (min)']):
        print(f"95% cred int for mean {descriptor} for component {component}:  ", end='')
        print(np.percentile(samples.posterior.mu.sel(dict(mu_dim_0=component, mu_dim_1=dim)).values, [2.5, 97.5]))
95% cred int for mean duration (min) for component 0:  [1.98143675 2.09175075]
95% cred int for mean waiting (min) for component 0:  [53.302945 55.70961 ]
95% cred int for mean duration (min) for component 1:  [4.2240095  4.35365475]
95% cred int for mean waiting (min) for component 1:  [78.9693875 80.8362875]

We can further explore the results by making a plot of the data overlaid with a contour plot of the mixture of Gaussians given by point estimates of the parameter values. We can then color the data points according to the responsibilities. We start by computing the expectations of the responsibilities, averaging over all chains and draws.

responsibilities = (
    samples.posterior.responsibilities
    .sel(dict(responsibilities_dim_1=0))
    .mean(dim=['chain', 'draw'])
)

To color the data points according to which component of the mixture model they came from, we can convert the quantitative responsibilities to colors using the bebi103.viz.q_to_color() function.

colors = bebi103.viz.q_to_color(responsibilities, colorcet.CET_D11, low=0, high=1)

Next, we compute point estimates (posterior medians) of the \(\boldsymbol{\pi}\), \(\boldsymbol{\mu}\), and \(\mathsf{\Sigma}\) parameters.

pi_1, pi_2 = samples.posterior.pi_.median(dim=['chain', 'draw']).values
mu_1, mu_2 = samples.posterior.mu.median(dim=['chain', 'draw']).values
Sigma_1, Sigma_2 = samples.posterior.Sigma.median(dim=['chain', 'draw']).values

Now we make the data from which the contours will be calculated.

duration = np.linspace(1.5, 5.5, 200)
waiting = np.linspace(40, 100, 200)
X, Y = np.meshgrid(duration, waiting)
Z = np.exp(
    np.logaddexp(
        np.log(pi_1) + st.multivariate_normal.logpdf(np.dstack((X, Y)), mean=mu_1, cov=Sigma_1),
        np.log(pi_2) + st.multivariate_normal.logpdf(np.dstack((X, Y)), mean=mu_2, cov=Sigma_2)
    )
)

Finally, we can make the plot, first plotting the contours and then overlaying the appropriately shaded data.

p = bebi103.viz.contour(
    X, 
    Y, 
    Z,     
    frame_width=300,
    frame_height=300,
    x_axis_label='eruption duration (min)',
    y_axis_label='eruption waiting time (min)'
)

# Add data
p.scatter(x=df['duration-minutes'], y=df['waiting-minutes'], color=colors)

bokeh.io.show(p)

This is a reasonably effective visualization that is commonly employed for data sets of this type with a Gaussian mixture model. I would not use this visualization alone; corner plots and/or credible intervals for the parameters are important as well.

40.0.2 MAP estimation by direct optimization

We now turn to parameter estimation by finding the maximum a posteriori (MAP) estimate. The optimization methods we have discussed so far are in general difficult to apply to mixture models. Instead of coding up the log prior, log likelihood, etc., in Python and using scipy.optimize, we will use Stan’s optimization routine (by default an L-BFGS solver) to attempt to find the MAP.

# Perform optimization, cheating by giving starting points at mu's from MCMC
map_stan = sm.optimize(data=data, inits=dict(mu=[[2.0, 53.0], [4.25, 79.0]])).optimized_params_dict

# Check out mu values
print('mu_1d = ', map_stan['mu[1,1]'], 'min')
print('mu_2d = ', map_stan['mu[2,1]'], 'min')
print('mu_1w = ', map_stan['mu[1,2]'], 'min')
print('mu_2w = ', map_stan['mu[2,2]'], 'min')
22:26:28 - cmdstanpy - INFO - Chain [1] start processing
22:26:28 - cmdstanpy - INFO - Chain [1] done processing
mu_1d =  3.48583 min
mu_2d =  3.48583 min
mu_1w =  59.9959 min
mu_2w =  70.8621 min

We clearly have not converged to the correct positions that we saw from doing MCMC. We show this calculation only to demonstrate the challenge of naively performing optimization for a Gaussian mixture model. We now turn to using EM to find the MAP.

40.0.3 MAP estimation using EM

To estimate the MAP using EM, we need a function to compute the responsibilities for the E-step. This is accomplished by directly coding up Equation 39.5.
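In the notation of our model, the responsibility of component \(k\) for data point \(i\) is

\[\gamma_{ik} = \frac{\pi_k\,\mathrm{Norm}(\mathbf{y}_i \mid \boldsymbol{\mu}_k, \mathsf{\Sigma}_k)}{\sum_{k'=1}^{K} \pi_{k'}\,\mathrm{Norm}(\mathbf{y}_i \mid \boldsymbol{\mu}_{k'}, \mathsf{\Sigma}_{k'})},\]

which we evaluate in log space for numerical stability, just as we did in the Stan code.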

def responsibilities(y, pi, mu, Sigma):
    """Compute responsibilities for a multivariate Gaussian mixture 
    model containing K components.

    Parameters
    ----------
    y : Numpy array, shape (N, d)
        The measurements.
    pi : Numpy array, shape (K,)
        The mixing coefficients.
    mu : Numpy array, shape (K, d)
        mu[k, :] is a vector giving the center of Gaussian component k.
    Sigma : Numpy array, shape (K, d, d)
        Sigma[k, :, :] is the covariance matrix of Gaussian component k.

    Returns
    -------
    gamma : Numpy array, shape (N, K)
        gamma[i, k] is the responsibility of component k for data 
        point i.
    """
    # Problem dimensions
    K = len(pi)
    N = len(y)

    # Values of log numerator of responsibilities
    log_num = np.empty((N, K))
    for k in range(K):
        log_num[:, k] = np.log(pi[k]) + st.multivariate_normal.logpdf(y, mu[k], Sigma[k])

    # Negative log of the denominator (normalization) of the responsibilities
    log_denom = -scipy.special.logsumexp(log_num, axis=1)

    return np.exp(log_num + log_denom[:, np.newaxis])

Next, we need to code up an optimization routine to find the maximum of the parameter-dependent part of the surrogate function given the responsibilities. First, we code up the surrogate function using Equation 39.6. As we have done before when we have constraints to satisfy (specifically \(\sum_k \pi _k = 1\)), it is convenient to use Powell’s method for optimizing. As such, we need to make sure the surrogate function evaluates to \(-\infty\) when constraints on parameters are not met.
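Written out, the quantity the function below evaluates is the log prior plus

\[\sum_{k=1}^K\left(\sum_{i=1}^N \gamma_{ik}\right)\ln \pi_k + \sum_{k=1}^K\sum_{i=1}^N \gamma_{ik}\,\ln \mathrm{Norm}(\mathbf{y}_i \mid \boldsymbol{\mu}_k, \mathsf{\Sigma}_k),\]

with the responsibilities \(\gamma_{ik}\) held fixed at their values from the E-step.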

def surrogate(y, pi, mu, Sigma, gamma, log_prior):
    """Parameter-dependent part of the surrogate function Q(θ, φ) for
    Gaussian mixture models.
    
    Parameters
    ----------
    y : Numpy array, shape (N, d)
        The measurements.
    pi : Numpy array, shape (K,)
        The mixing coefficients.
    mu : Numpy array, shape (K, d)
        mu[k, :] is a vector giving the center of Gaussian component k.
    Sigma : Numpy array, shape (K, d, d)
        Sigma[k, :, :] is the covariance matrix of Gaussian component k.
    gamma : Numpy array, shape (N, K)
        gamma[i, k] is the responsibility of component k for data 
        point i.
    log_prior : function
        Function with call signature log_prior(pi, mu, Sigma) that 
        returns the scalar log prior.
        
    Returns
    -------
    Q : float
        The parameter-dependent part of the surrogate function for a
        Gaussian mixture model.
    """
    # Number of components
    K = len(pi)

    # Make sure pi is between zero and one
    # We already enforce it is a simplex when we unpack params
    if (pi < 0).any() or (pi > 1).any():
        return -np.inf

    # No need to check the covariance matrices because we use
    # Cholesky decomposition in the numerical routines

    # Prior
    Q = log_prior(pi, mu, Sigma)
    if Q == -np.inf:
        return Q

    # Sum of gamma's over the data
    gamma_sum = np.sum(gamma, axis=0)
    
    # Mixing coefficients
    Q += np.dot(gamma_sum, np.log(pi))

    # Terms from Gaussians
    for k in range(K):
        Q += np.dot(gamma[:, k], st.multivariate_normal.logpdf(y, mu[k], Sigma[k]))
    
    return Q

Since we will be using numerical optimization routines, we need to store all of the parameters in a single array. We will therefore need functions to pack and unpack this array. We will also use the Cholesky decomposition of the covariance matrices. If \(d\) is the dimensionality of the data, then each covariance matrix is \(d\times d\), and there are a total of \(d(d+1)/2\) nonzero entries in its Cholesky decomposition.

def pack_params(pi, mu, Sigma):
    """Pack parameters into a single 1D Numpy array.

    Parameters
    ----------
    pi : Numpy array, shape (K,)
        Mixing coefficients
    mu : Numpy array, shape (K, d)
        Centers of Gaussian components
    Sigma : Numpy array, shape (K, d, d)
        Covariance matrices for Gaussian components

    Returns
    -------
    output : Numpy array, shape (K + K*d + K*d*(d+1)//2 - 1,)
        A 1D array containing parameter values. The first K-1 entries
        are the mixing coefficients. The Kth mixing coefficient is
        omitted because it must be equal to 1 - sum(other pi's). The
        next K * d entries are the values of mu. The remaining entries
        are the Cholesky decompositions of the covariance matrices.
    """
    # Problem dimensions
    K, d = mu.shape

    # Cholesky decompositions of the Sigma's
    L = np.empty_like(Sigma)
    for k in range(K):
        L[k] = np.linalg.cholesky(Sigma[k])

    # Params array holds K-1 pi values, K*d mu values,
    # and K * d * (d + 1) / 2 values of Cholesky decomps
    params = np.empty((K - 1) + (K * d) + (K * d * (d + 1) // 2))
    
    # Omit the last entry in pi, since that is not manipulated by solver
    params[: K - 1] = pi[: K - 1]

    # Plop in the mu's
    params[K - 1 : K * (1 + d) - 1] = np.ravel(mu)
    
    # Put in nonzero terms of Cholesky decomposition
    params[K * (1 + d) - 1 : ] = np.concatenate([L_k[np.tril_indices(d)] for L_k in L])

    return params


def unpack_params(params, K, d):
    """Unpack parameters from a single 1D Numpy array.

    Parameters
    ----------
    params : Numpy array, shape (K + K*d + K*d*(d+1)//2 - 1,)
        A 1D array containing parameter values. The first K-1 entries
        are the mixing coefficients. The Kth mixing coefficient is
        omitted because it must be equal to 1 - sum(other pi's). The
        next K * d entries are the values of mu. The remaining entries
        are the Cholesky decompositions of the covariance matrices.
    K : int
        Number of components in the mixture model
    d : int
        Dimensionality of the data

    Returns
    -------
    pi : Numpy array, shape (K,)
        Mixing coefficients
    mu : Numpy array, shape (K, d)
        Centers of Gaussian components
    Sigma : Numpy array, shape (K, d, d)
        Covariance matrices for Gaussian components
    """
    # Reconstruct the full pi; the last entry is set by the simplex constraint
    pi = np.concatenate((params[:K-1], [1.0 - np.sum(params[:K-1])]))
    
    # Pull out mu's
    mu = params[K - 1 : K * (1 + d) - 1].reshape((K, d))

    # Grab Cholesky decompositions
    L = np.zeros((K, d, d))
    i = K * (1 + d) - 1
    for k in range(K):
        L[k][np.tril_indices(d)] = params[i + k * d * (d + 1) // 2 : i + (k + 1) * d * (d + 1) // 2]

    #  Convert Cholesky decompositions to covariance matrices
    Sigma = np.empty((K, d, d))
    for k in range(K):
        Sigma[k] = np.dot(L[k], L[k].transpose())
    
    return pi, mu, Sigma
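As a quick sanity check (not part of the analysis itself), we can verify that packing and then unpacking recovers the parameters. For \(K = 2\) components in \(d = 2\) dimensions, the packed array has \((K-1) + Kd + Kd(d+1)/2 = 1 + 4 + 6 = 11\) entries. The parameter values below are made up solely for this check.

# Round-trip check of pack_params() and unpack_params() with made-up values
pi_test = np.array([0.4, 0.6])
mu_test = np.array([[2.0, 55.0], [4.3, 80.0]])
Sigma_test = np.array([[[0.1, 0.5], [0.5, 30.0]], [[0.2, 0.6], [0.6, 35.0]]])

# Pack, then unpack, and check that we recover the inputs
params_test = pack_params(pi_test, mu_test, Sigma_test)
pi_out, mu_out, Sigma_out = unpack_params(params_test, 2, 2)

print(len(params_test))
print(np.allclose(pi_out, pi_test), np.allclose(mu_out, mu_test), np.allclose(Sigma_out, Sigma_test))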

Now that we have an array of parameters, we can code up an objective function for our solver.

def negative_Q(params, y, gamma, log_prior):
    """Negative of parameter-dependent part of the surrogate function 
    Q(θ, φ) for Gaussian mixture models.
    
    Parameters
    ----------
    params : Numpy array
        Packed parameter array as produced by pack_params(), containing
        the mixing coefficients, the centers, and the nonzero entries of
        the Cholesky factors of the covariance matrices.
    y : Numpy array, shape (N, d)
        The measurements.
    gamma : Numpy array, shape (N, K)
        gamma[i, k] is the responsibility of component k for data 
        point i.
    log_prior : function
        Function with call signature log_prior(pi, mu, Sigma) that 
        returns the scalar log prior.
        
    Returns
    -------
    negative_Q : float
        The parameter-dependent part of the surrogate function for a
        Gaussian mixture model multiplied by -1.
    """
    return -surrogate(y, *unpack_params(params, gamma.shape[1], y.shape[1]), gamma, log_prior)

To perform the M step, we first use the analytical results for uniform priors outlined in Section 39.2.3. If we do indeed have uniform priors, then the job is done. If not, we use these parameter values as an initial guess for a numerical solver. For simplicity, we will use Powell’s method, with the surrogate evaluating to \(-\infty\) whenever the parameters are out of bounds (e.g., any mixing coefficient outside of the interval [0, 1]).
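For reference, the responsibility-weighted updates implemented in m_step_no_prior() below are

\[\begin{align} &\pi_k = \frac{1}{N}\sum_{i=1}^N \gamma_{ik}, \\[1em] &\boldsymbol{\mu}_k = \frac{\sum_{i=1}^N \gamma_{ik}\,\mathbf{y}_i}{\sum_{i=1}^N \gamma_{ik}}, \\[1em] &\mathsf{\Sigma}_k = \frac{\sum_{i=1}^N \gamma_{ik}\,(\mathbf{y}_i - \boldsymbol{\mu}_k)(\mathbf{y}_i - \boldsymbol{\mu}_k)^\mathsf{T}}{\sum_{i=1}^N \gamma_{ik}}. \end{align}\]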

def m_step_no_prior(y, pi, mu, Sigma, gamma):  
    """Perform the M step of an EM algorithm for a GMM where the model
    has uniform priors.

    Parameters
    ----------
    y : Numpy array, shape (N, d)
        The measurements.
    pi : Numpy array, shape (K,)
        The mixing coefficients.
    mu : Numpy array, shape (K, d)
        mu[k, :] is a vector giving the center of Gaussian component k.
    Sigma : Numpy array, shape (K, d, d)
        Sigma[k, :, :] is the covariance matrix of Gaussian component k.
    gamma : Numpy array, shape (N, K)
        gamma[i, k] is the responsibility of component k for data 
        point i.

    Returns
    -------
    pi : Numpy array, shape (K,)
        Maximizing mixing coefficients.
    mu : Numpy array, shape (K, d)
        Maximizing mu where mu[k, :] is a vector giving the center of 
        Gaussian component k.
    Sigma : Numpy array, shape (K, d, d)
        Maximizing Sigma where Sigma[k, :, :] is the covariance matrix
        of Gaussian component k.
    """
    N = len(y)
    K = len(pi)
    gamma_sum = np.sum(gamma, axis=0)

    # Update mu values
    for k in range(K):
        mu[k] = np.dot(gamma[:, k], y) / gamma_sum[k]

    # Update Sigma values
    for k in range(K):
        Sigma[k] = np.zeros_like(Sigma[k])
        for i in range(N):
            Sigma[k] += gamma[i, k] * np.outer(y[i] - mu[k], y[i] - mu[k])
        Sigma[k] /= gamma_sum[k]

    # Update pi values
    pi = gamma_sum / N
    
    return pi, mu, Sigma


def m_step(y, pi, mu, Sigma, gamma, log_prior):
    """Perform the M step of an EM algorithm for a GMM.

    Parameters
    ----------
    y : Numpy array, shape (N, d)
        The measurements.
    pi : Numpy array, shape (K,)
        The mixing coefficients.
    mu : Numpy array, shape (K, d)
        mu[k, :] is a vector giving the center of Gaussian component k.
    Sigma : Numpy array, shape (K, d, d)
        Sigma[k, :, :] is the covariance matrix of Gaussian component k.
    gamma : Numpy array, shape (N, K)
        gamma[i, k] is the responsibility of component k for data 
        point i.
    log_prior : function
        Function with call signature log_prior(pi, mu, Sigma) that 
        returns the scalar log prior.
        
    Returns
    -------
    pi : Numpy array, shape (K,)
        Maximizing mixing coefficients.
    mu : Numpy array, shape (K, d)
        Maximizing mu where mu[k, :] is a vector giving the center of 
        Gaussian component k.
    Sigma : Numpy array, shape (K, d, d)
        Maximizing Sigma where Sigma[k, :, :] is the covariance matrix
        of Gaussian component k.
    """
    # Initial guess of parameters comes from optimization without prior
    pi, mu, Sigma = m_step_no_prior(y, pi, mu, Sigma, gamma)

    # If there is no prior, return
    if log_prior is None:
        return pi, mu, Sigma
    else:
        res = scipy.optimize.minimize(
            negative_Q, 
            pack_params(pi, mu, Sigma), 
            args=(y, gamma, log_prior), 
            method='powell'
        )

        return unpack_params(res.x, len(pi), y.shape[1])

To assess convergence of the EM algorithm, we will need to check how the log posterior changes from iteration to iteration. The log prior function is supplied as an argument, but the log likelihood function is the same for all GMMs. We code that up now.

def gmm_log_likelihood(y, pi, mu, Sigma):
    """Log likelihood function for a GMM.

    Parameters
    ----------
    y : Numpy array, shape (N, d)
        The measurements.
    pi : Numpy array, shape (K,)
        The mixing coefficients.
    mu : Numpy array, shape (K, d)
        mu[k, :] is a vector giving the center of Gaussian component k.
    Sigma : Numpy array, shape (K, d, d)
        Sigma[k, :, :] is the covariance matrix of Gaussian component k.
        
    Returns
    -------
    output : float
        The log likelihood.
    """
    N = len(y)
    K = len(pi)

    log_lik = 0.0
    log_likes = np.empty(K)

    for i in range(N):    
        for k in range(K):
            log_likes[k] = np.log(pi[k]) + st.multivariate_normal.logpdf(y[i], mu[k], Sigma[k])
        log_lik += scipy.special.logsumexp(log_likes)

    return log_lik    
    

We finally have all the pieces to code up the complete EM algorithm!

def em(y, pi, mu, Sigma, log_prior=None, max_iter=100, tol=1e-6):    
    """Find the MAP of a GMM using the EM algorithm.

    Parameters
    ----------
    y : Numpy array, shape (N, d)
        The measurements.
    pi : Numpy array, shape (K,)
        Initial guess of the mixing coefficients.
    mu : Numpy array, shape (K, d)
        Initial guess of the centers of the Gaussian components.
        mu[k, :] is a vector giving the center of Gaussian component k.
    Sigma : Numpy array, shape (K, d, d)
        Initial guess of the covariance matrices of the Gaussian 
        components. Sigma[k, :, :] is the covariance matrix of Gaussian
        component k.
    log_prior : function or None, default None
        Function with call signature log_prior(pi, mu, Sigma) that 
        returns the scalar log prior. If None, uniform priors on all
        parameters are assumed.
    max_iter : int, default 100
        Maximum number of iterations.
    tol : float, default 1e-6
        Tolerance for convergence. If the log posterior changes by less
        than `tol` from one iteration to the next, convergence is 
        achieved.
        
    Returns
    -------
    pi : Numpy array, shape (K,)
        MAP mixing coefficients.
    mu : Numpy array, shape (K, d)
        MAP mu where mu[k, :] is a vector giving the center of 
        Gaussian component k.
    Sigma : Numpy array, shape (K, d, d)
        MAP Sigma where Sigma[k, :, :] is the covariance matrix
        of Gaussian component k.
    """    

    log_post = -np.inf
    for i in range(max_iter):
        # E-step
        gamma = responsibilities(y, pi, mu, Sigma)

        # M-step
        pi, mu, Sigma = m_step(y, pi, mu, Sigma, gamma, log_prior)

        # Check for convergence of the log posterior
        log_post_new = gmm_log_likelihood(y, pi, mu, Sigma)
        if log_prior is not None:
            log_post_new += log_prior(pi, mu, Sigma)

        if log_post_new - log_post < tol:
            break
        else:
            log_post = log_post_new
    else:
        raise RuntimeError(
            f'Failed to converge in {max_iter} iterations with tolerance {tol}.'
        )

    return pi, mu, Sigma

Let’s put our EM algorithm to the test with the Old Faithful data! First, we code up the log prior.

def log_prior_old_faithful(pi, mu, Sigma):
    # No need to check if pi_k is out of bounds; checked in surrogate function
    # No need to add rho or pi terms, since their priors are uniform

    # Break the label-switching nonidentifiability by requiring that the
    # first entries of the mu's are ordered
    if mu[0, 0] > mu[1, 0]:
        return -np.inf
    
    lp = st.norm.logpdf(mu[0, 0], 5, 5) + st.norm.logpdf(mu[0, 1], 60, 10)
    lp += st.norm.logpdf(mu[1, 0], 5, 5) + st.norm.logpdf(mu[1, 1], 60, 10)
    lp += st.halfnorm.logpdf(Sigma[0, 0, 0], 0, 5) + st.halfnorm.logpdf(Sigma[0, 1, 1], 0, 10)
    lp += st.halfnorm.logpdf(Sigma[1, 0, 0], 0, 5) + st.halfnorm.logpdf(Sigma[1, 1, 1], 0, 10)

    return lp

We now supply initial guesses and let the EM algorithm get to work!

# Initial guesses
pi = np.array([0.5, 0.5])
mu = np.array([[1.0, 50.0], [3.0, 50.0]])
Sigma = np.array([[[2, 0.5],[0.5, 7]], [[2, 0.6],[0.6, 8]]])

# MAP!
pi, mu, Sigma = em(y, pi, mu, Sigma, log_prior=log_prior_old_faithful)

# Responsibilities are useful to ascribe data points 
# to components of mixture
gamma = responsibilities(y, pi, mu, Sigma)
/Users/bois/miniconda3/envs/datasai/lib/python3.12/site-packages/scipy/optimize/_optimize.py:2502: RuntimeWarning: invalid value encountered in scalar multiply
  tmp2 = (x - v) * (fx - fw)

Now that we have our MAP, we can compute an approximate posterior, as usual by making an assumption of local normality. To do this, we need to code up a log posterior function that we can differentiate.
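To be explicit, writing \(\theta\) for the full set of parameters and \(\theta^*\) for the MAP, the approximation is

\[g(\theta \mid y) \approx \mathrm{Norm}(\theta \mid \theta^*, \mathsf{\Sigma}_\mathrm{post}), \qquad \mathsf{\Sigma}_\mathrm{post} = -\mathsf{H}^{-1},\]

where \(\mathsf{H}\) is the Hessian of the log posterior evaluated at \(\theta^*\), which is what we compute numerically below.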

def log_posterior_old_faithful(params, y, K, log_prior):
    """Log posterior of a GMM, computed from packed parameters."""
    # Unpack parameters
    d = y.shape[1]
    pi, mu, Sigma = unpack_params(params, K, d)
    
    # log prior (use the log_prior function passed in)
    log_post = log_prior(pi, mu, Sigma)

    # log likelihood
    log_post += gmm_log_likelihood(y, pi, mu, Sigma)

    return log_post

We now numerically compute the Hessian of the log posterior to get the covariance matrix for the approximate posterior.

hes = smnd.approx_hess(
    pack_params(pi, mu, Sigma), 
    log_posterior_old_faithful, 
    args=(y, 2, log_prior_old_faithful)
)

posterior_cov = -np.linalg.inv(hes)

Focusing on the centers of the components of the GMM, we can report approximate credible intervals.

print(
    """
Most probable parameters in units of minutes
--------------------------------------------
Component 1:
  mean duration = {0:.2f} ± {1:.2f}  
  mean waiting = {2:.2f} ± {3:.2f}
  
Component 2:
  mean duration = {4:.2f} ± {5:.2f}  
  mean waiting = {6:.2f} ± {7:.2f}

""".format(
        mu[0, 0],
        1.96 * np.sqrt(posterior_cov[1, 1]),
        mu[0, 1],
        1.96 * np.sqrt(posterior_cov[2, 2]),
        mu[1, 0],
        1.96 * np.sqrt(posterior_cov[3, 3]),
        mu[1, 1],
        1.96 * np.sqrt(posterior_cov[4, 4]),
    )
)

Most probable parameters in units of minutes
--------------------------------------------
Component 1:
  mean duration = 2.04 ± 0.05  
  mean waiting = 54.50 ± 1.07

Component 2:
  mean duration = 4.29 ± 0.06  
  mean waiting = 79.94 ± 0.84

Finally, we can make a plot of the data overlaid with contours of the mixture given by the MAP parameter values, again coloring the data points according to their responsibilities.

# Color by responsibilities
colors = bebi103.viz.q_to_color(responsibilities(y, pi, mu, Sigma)[:,0], colorcet.CET_D11, low=0, high=1)

# Build contours
Z = np.exp(
    np.logaddexp(
        np.log(pi[0]) + st.multivariate_normal.logpdf(np.dstack((X, Y)), mean=mu[0], cov=Sigma[0]),
        np.log(pi[1]) + st.multivariate_normal.logpdf(np.dstack((X, Y)), mean=mu[1], cov=Sigma[1])
    )
)

p = bebi103.viz.contour(
    X, 
    Y, 
    Z,     
    frame_width=300,
    frame_height=300,
    x_axis_label='eruption duration (min)',
    y_axis_label='eruption waiting time (min)'
)

# Add data
p.scatter(x=df['duration-minutes'], y=df['waiting-minutes'], color=colors)

bokeh.io.show(p)

This is nearly the same plot we obtained with MCMC.

40.1 GMMs with scikit-learn

The scikit-learn package has built-in functionality to perform parameter estimation for Gaussian mixture models. However, it only works with uniform priors. That is, scikit-learn performs maximum likelihood estimation, not full Bayesian parameter estimation. Nonetheless, the functionality is useful, and we demonstrate its use here.

We start by instantiating a GMM with the appropriate number \(K\) of components in the mixture. For the present data set, \(K = 2\).

gmm = sklearn.mixture.GaussianMixture(2)

Next, we “fit” the model using our data set y.

gmm.fit(y)
GaussianMixture(n_components=2)

We can pull the MAP (again, it’s a MAP for a model with uniform priors) using the appropriate attributes. For example, our array mu above, in which the first row is \(\boldsymbol{\mu}\) for the first component and the second row is \(\boldsymbol{\mu}\) for the second component, is accessed using the means_ attribute of the GaussianMixture instance.

gmm.means_
array([[ 4.28977944, 79.96953298],
       [ 2.03652149, 54.47986018]])

This is quite similar to the result we got with the full prior, which is not surprising given that we had plenty of data and the model seems to be a good descriptor of the generative process. Note, though, that the order of the components may be different, since scikit-learn does not handle the label-switching nonidentifiability; we show one way to reorder the components below.

Similarly, we can get the values of \(\mathsf{\Sigma}\) using the covariances_ attribute.

gmm.covariances_
array([[[ 0.16982046,  0.93871793],
        [ 0.93871793, 36.02497019]],

       [[ 0.06927449,  0.43627723],
        [ 0.43627723, 33.70493352]]])

And the mixing coefficients using the weights_ attribute.

gmm.weights_
array([0.64407255, 0.35592745])
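In the output above, scikit-learn’s first component corresponds to our component 2 (the one with the larger mean duration). As a minimal sketch (just NumPy indexing, not a scikit-learn feature), we can reorder the components by mean duration to match the ordering we imposed in our model.

# Order components so that the first has the smaller mean duration,
# matching the ordering we imposed to break label switching
order = np.argsort(gmm.means_[:, 0])

means_ordered = gmm.means_[order]
covariances_ordered = gmm.covariances_[order]
weights_ordered = gmm.weights_[order]

means_ordered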

Finally, we can access the responsibilities using the predict_proba() method.

# Show first five for brevity
gmm.predict_proba(y)[:5]
array([[9.99999997e-01, 2.68469170e-09],
       [1.86358243e-09, 9.99999998e-01],
       [9.99991353e-01, 8.64671381e-06],
       [1.05070205e-05, 9.99989493e-01],
       [1.00000000e+00, 1.08537069e-21]])

bebi103.stan.clean_cmdstan()

40.2 Computing environment

%load_ext watermark
%watermark -v -p numpy,polars,scipy,statsmodels,sklearn,arviz,cmdstanpy,bokeh,bebi103,jupyterlab
print("cmdstan   :", bebi103.stan.cmdstan_version())
Python implementation: CPython
Python version       : 3.12.9
IPython version      : 9.1.0

numpy      : 2.1.3
polars     : 1.27.1
scipy      : 1.15.2
statsmodels: 0.14.4
sklearn    : 1.6.1
arviz      : 0.21.0
cmdstanpy  : 1.2.5
bokeh      : 3.6.2
bebi103    : 0.1.27
jupyterlab : 4.4.2

cmdstan   : 2.36.0