41  K-means clustering


# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade polars watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------
import polars as pl

import sklearn.cluster
import sklearn.preprocessing

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

We have seen the EM algorithm applied to a Gaussian mixture model. In computing the responsibilities, we are able to probabilistically assign each data point to one of the components of the mixture model. In this way, inference of the parameters of a GMM may be viewed as a form of clustering, in which each data point is assigned to a group, or cluster.

One of the simplest, and also fairly widely used, methods is K-means clustering (usually written with a lowercase k, but we use a capital K here to connect it to GMMs that have \(K\) components). This method is typically described in terms of its algorithm, which proceeds as follows for a set of \(d\)-dimensional data (a minimal implementation sketch follows the list).

  1. \(K\) \(d\)-dimensional means are chosen. Call them \(\mathbf{m}_1, \mathbf{m}_2,\ldots,\mathbf{m}_K\).
  2. For each data point \(\mathbf{y}_i\), assign it to cluster \(k\) such that it is closest to \(\mathbf{m}_k\) in a least squares sense. That is, assign data point \(i\) to the cluster \(k\) for which \(\lVert \mathbf{y}_i - \mathbf{m}_k \rVert\) is minimal.
  3. Recalculate the means \(\mathbf{m}_k\) based on the data points assigned to cluster \(k\). That is, \[\begin{align} \mathbf{m}_k = \frac{1}{N_k}\sum_{i \text{ in cluster }k}\mathbf{y}_i, \end{align} \] where \(N_k\) is the number of data points in cluster \(k\).
  4. If the cluster assignments are the same as in the previous iteration, stop. Otherwise, go to step 2.
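
To make the algorithm concrete, here is a minimal NumPy sketch of it. The function name kmeans_simple, the choice to initialize the means with randomly selected data points, and the max_iter cap are our own choices for illustration (and empty clusters are not handled); below we will use scikit-learn's implementation instead.

import numpy as np

def kmeans_simple(y, K, rng=None, max_iter=100):
    """Bare-bones K-means following the steps above (no empty-cluster handling)."""
    rng = np.random.default_rng() if rng is None else rng

    # Step 1: choose K means, here K randomly selected data points
    m = y[rng.choice(len(y), size=K, replace=False)]

    labels = np.full(len(y), -1)
    for _ in range(max_iter):
        # Step 2: assign each data point to the closest mean
        dists = np.linalg.norm(y[:, None, :] - m[None, :, :], axis=2)
        new_labels = np.argmin(dists, axis=1)

        # Step 4: stop if the assignments did not change from the previous iteration
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Step 3: recompute each mean from its assigned data points
        m = np.array([y[labels == k].mean(axis=0) for k in range(K)])

    return m, labels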

41.1 K-means with scikit-learn

While the K-means algorithm is easy to implement, we do not have to do it ourselves, as it is implemented in scikit-learn. Let us apply it to the Old Faithful data set we used in Chapter 40.

# Load in data
df = pl.read_csv(os.path.join(data_path, 'old_faithful.csv'))

# Pull data out as a NumPy array
y = df.select(pl.col('duration-minutes', 'waiting-minutes')).to_numpy()

# Instantiate a KMeans object with two clusters
kmeans = sklearn.cluster.KMeans(2)

# Perform optimization to get means and cluster assignments
kmeans.fit(y)
KMeans(n_clusters=2)

We can access the means (cluster centers) using the cluster_centers_ attribute and the cluster assignments using the labels_ attribute. We will use those to make a plot of the cluster assignments of the data in the Old Faithful data set.

# Color according to labels
colors = ['#00b6ff' if label==1 else '#f6906c' for label in kmeans.labels_]

# Plot data with label by color
p = bokeh.plotting.figure(
    frame_width=300,
    frame_height=300,
    x_axis_label='eruption duration (min)',
    y_axis_label='eruption waiting time (min)'
)
p.scatter(y[:,0], y[:,1], color=colors)

# Put in means
p.scatter(
    kmeans.cluster_centers_[:, 0], 
    kmeans.cluster_centers_[:, 1], 
    color='black', 
    marker='+',
    size=15
)


bokeh.io.show(p)

The result is similar to what we saw using EM (see Chapter 40), but the data points that apparently lie between the two clusters differ in their assignment.

For K-means clustering, it is wise to first center and scale the data. That is, perform a linear transformation on the observations such that each dimension of the data has a mean of zero and a variance of one.

# Center and scale y
scaler = sklearn.preprocessing.StandardScaler().fit(y)
y_scaled = scaler.transform(y)

# Do K-means on centered-and-scaled
kmeans = sklearn.cluster.KMeans(2).fit(y_scaled)

# Rescale the cluster centers
cluster_centers = scaler.inverse_transform(kmeans.cluster_centers_)

# Plot data
colors = ['#00b6ff' if label==1 else '#f6906c' for label in kmeans.labels_]

# Plot data with label by color
p = bokeh.plotting.figure(
    frame_width=300,
    frame_height=300,
    x_axis_label='eruption duration (min)',
    y_axis_label='eruption waiting time (min)'
)
p.scatter(y[:,0], y[:,1], color=colors)

# Put in means
p.scatter(
    cluster_centers[:, 0], 
    cluster_centers[:, 1], 
    color='black', 
    marker='+',
    size=15
)

bokeh.io.show(p)

These results are closer to what we got with a GMM.

41.2 K-means clustering as a GMM

As is often the case, heuristic/algorithmic methods have a Bayesian generative model underlying them. Indeed, the above algorithm for K-means clustering screams EM!

Recall that the generative model for a GMM (Equation 41.1) is

\[\begin{align} &\boldsymbol{\pi} \sim \text{prior for mixing coefficients}, \\[1em] &\boldsymbol{\mu} \sim \text{prior for location parameters}, \\[1em] &\mathsf{\Sigma} \sim \text{prior for scale parameters}, \\[1em] &\mathbf{z}_i \mid \boldsymbol{\pi} \sim \text{Categorical}(\boldsymbol{\pi}) \;\forall i, \\[1em] &\mathbf{y}_i \mid \mathbf{z}_i, \boldsymbol{\mu}, \mathsf{\Sigma} \sim \prod_{k=1}^K (\mathrm{Norm}(\boldsymbol{\mu}_k,\mathsf{\Sigma}_k))^{z_{ik}} \;\forall i. \end{align} \tag{41.1}\]

When doing K-means, the priors are all uniform, and all covariance matrices are fixed to be diagonal with the same very small entry \(\sigma^2\) on the diagonal. The model is thus

\[\begin{align} &\boldsymbol{\pi} \sim \text{Dirichlet}(1/K, 1/K, \ldots, 1/K), \\[1em] &\mathbf{z}_i \mid \boldsymbol{\pi} \sim \text{Categorical}(\boldsymbol{\pi}) \;\forall i, \\[1em] &\mathbf{y}_i \mid \mathbf{z}_i, \boldsymbol{\mu}, \sigma \sim \prod_{k=1}^K (\mathrm{Norm}(\boldsymbol{\mu}_k, \sigma))^{z_{ik}} \;\forall i. \end{align} \tag{41.2}\]
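
As a quick illustration of this generative model, the sketch below draws a data set from it with NumPy. Because the priors on the \(\boldsymbol{\mu}_k\) are uniform (and improper), we simply fix made-up location parameters; the values of K, sigma, mu, and the number of data points are arbitrary choices for illustration only.

import numpy as np

rng = np.random.default_rng(3252)

# Made-up parameters for illustration
K = 2
sigma = 0.1
mu = np.array([[0.0, 0.0], [3.0, 3.0]])

# pi ~ Dirichlet(1/K, ..., 1/K)
pi = rng.dirichlet(np.ones(K) / K)

# z_i | pi ~ Categorical(pi)
z = rng.choice(K, size=200, p=pi)

# y_i | z_i, mu, sigma ~ Norm(mu_{z_i}, sigma^2 I)
y_sim = rng.normal(loc=mu[z], scale=sigma)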

Approaching this model with an EM algorithm (see Section 39.2), the E step involves evaluating \(g(z\mid y, \phi)\), which in this case is

\[\begin{align} g(z \mid y, \phi) = \prod_{i=1}^N\prod_{k=1}^K \left[\pi_k\,\mathcal{N}(\mathbf{y}_i \mid \boldsymbol{\mu}_k, \sigma)\right]^{z_{ik}}, \end{align} \]

resulting in a parameter-dependent part of the surrogate function of

\[\begin{align} \mathcal{Q}(\theta, \phi) = \sum_{i=1}^N\sum_{k=1}^K \gamma(z_{ik}) \left(\ln \pi_k + \ln \mathcal{N}(\mathbf{y}_i \mid \boldsymbol{\mu}_k, \sigma)\right), \end{align} \]

where the responsibilities are

\[\begin{align} \gamma(z_{ik}) = \displaystyle{\frac{\pi_k \,\mathcal{N}(\mathbf{y}_i \mid \boldsymbol{\mu}_k, \sigma)}{\sum_{j=1}^K\pi_j\,\mathcal{N}(\mathbf{y}_i \mid \boldsymbol{\mu}_j, \sigma)}} = \frac{\pi_k\,\mathrm{e}^{-\lVert \mathbf{y}_i - \boldsymbol{\mu}_k \rVert^2/2\sigma^2}}{\sum_{j=1}^K\pi_j\,\mathrm{e}^{-\lVert \mathbf{y}_i - \boldsymbol{\mu}_j\rVert^2/2\sigma^2}}. \end{align} \]

As \(\sigma\) becomes very small, \(\exp[-\lVert \mathbf{y}_i - \boldsymbol{\mu}_k \rVert^2/2\sigma^2]\) is largest for the \(\boldsymbol{\mu}_k\) that is closest to \(\mathbf{y}_i\) and relatively tiny for every other term. Therefore, the responsibility is approximately one for the component \(k\) whose mean is closest to \(\mathbf{y}_i\) and approximately zero for the other components. Since the responsibilities are all essentially one or zero, each data point is assigned to the cluster whose mean is closest.
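
To see this hardening numerically, the short sketch below evaluates the responsibilities for a single data point and two means as \(\sigma\) shrinks. All of the values (the data point, means, and mixing coefficients) are made up solely for illustration.

import numpy as np

# Made-up data point, means, and mixing coefficients for illustration
y_i = np.array([1.0, 2.0])
mu = np.array([[0.0, 0.0], [2.0, 3.0]])
pi = np.array([0.5, 0.5])

for sigma in [10.0, 1.0, 0.1]:
    # Log of the numerators in the responsibility formula above
    log_num = np.log(pi) - np.sum((y_i - mu) ** 2, axis=1) / (2 * sigma**2)

    # Normalize, subtracting the max first for numerical stability
    gamma = np.exp(log_num - log_num.max())
    gamma /= gamma.sum()

    print(f"sigma = {sigma:4.1f} -> responsibilities = {np.round(gamma, 3)}")

As \(\sigma\) decreases, the responsibilities approach zero and one, which is exactly the hard assignment of K-means.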

Therefore, the part of the surrogate function that depends on the \(\boldsymbol{\mu}_k\) is

\[\begin{align} \mathcal{Q}(\theta, \phi) = \sum_{i=1}^N \ln \mathcal{N}(\mathbf{y}_i \mid \boldsymbol{\mu}_{ik}, \sigma) = -\frac{1}{2\sigma^2}\sum_{i=1}^N \lVert \mathbf{y}_i - \boldsymbol{\mu}_{ik} \rVert^2 + \text{constant}, \end{align} \]

where we have defined \(\boldsymbol{\mu}_{ik}\) to be the mean of the cluster to which data point \(i\) is assigned. So, the goal in the M step is to find the values of the \(\boldsymbol{\mu}_k\) that maximize the surrogate function. We differentiate with respect to \(\boldsymbol{\mu}_k\). The only terms of the sum that survive are those for data points assigned to cluster \(k\), such that

\[\begin{align} \frac{\partial \mathcal{Q}}{\partial \boldsymbol{\mu}_k} = \frac{1}{\sigma^2}\sum_{i\text{ in cluster }k} (\mathbf{y}_i - \boldsymbol{\mu}_{k}). \end{align} \]

Setting the derivative to zero gives

\[\begin{align} \boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{i\text{ in cluster }k} \mathbf{y}_i, \end{align} \]

which says that \(\boldsymbol{\mu}_k\) is updated to be the mean of all of the data points assigned to it. So, the EM algorithm applied to the mixture model described by Equation 41.2 is indeed the algorithm for K-means clustering, thereby making clear what the generative model underlying it is. It's a rather restrictive model, with all uniform priors and a rigidly prescribed covariance matrix for the likelihood.
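
We can check this result directly with the scikit-learn fit from above: each row of cluster_centers_ should (essentially) be the mean of the data points assigned to that cluster.

import numpy as np

# Compare each fitted cluster center to the mean of its assigned data points
for k in range(2):
    mean_of_cluster = y_scaled[kmeans.labels_ == k].mean(axis=0)
    print(k, mean_of_cluster, kmeans.cluster_centers_[k])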

41.3 Computing environment

%load_ext watermark
%watermark -v -p polars,sklearn,bokeh,jupyterlab
Python implementation: CPython
Python version       : 3.12.9
IPython version      : 9.1.0

polars    : 1.27.1
sklearn   : 1.6.1
bokeh     : 3.6.2
jupyterlab: 4.4.2