import os

import polars as pl

import sklearn.cluster
import sklearn.preprocessing

import bokeh.io
import bokeh.plotting

bokeh.io.output_notebook()
We have seen the EM algorithm applied to a Gaussian mixture model. In computing the responsibilities, we are able to probabilistically assign each data point to one of the components of the mixture model. In this way, inference of the parameters of a GMM may be viewed as a form of clustering, in which each data point is assigned to a group, or cluster.
One of the simplest, and also fairly widely used, methods is K-means clustering (usually written with a lowercase k, but we use a capital K here to connect it to GMMs that have \(K\) components). This method is typically described in terms of its algorithm. The algorithm proceeds as follows for a set of \(d\)-dimensional data (a minimal code sketch follows the steps).
1. \(K\) \(d\)-dimensional means are chosen. Call them \(\mathbf{m}_1, \mathbf{m}_2,\ldots,\mathbf{m}_K\).
2. For each data point \(\mathbf{y}_i\), assign it to cluster \(k\) such that it is closest to \(\mathbf{m}_k\) in a least squares sense. That is, assign data point \(i\) to the cluster \(k\) for which \(\lVert \mathbf{y}_i - \mathbf{m}_k \rVert\) is minimal.
3. Recalculate the means \(\mathbf{m}_k\) based on the data points assigned to cluster \(k\). That is, \[\begin{align}
\mathbf{m}_k = \frac{1}{N_k}\sum_{i \text{ in cluster }k}\mathbf{y}_i,
\end{align}
\] where \(N_k\) is the number of data points in cluster \(k\).
4. If the assignments are the same as in the previous iteration, stop. Otherwise, go to 2.
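To make the steps concrete, here is a minimal NumPy sketch of the algorithm above. The function name kmeans_sketch and the choice to initialize the means with randomly chosen data points are ours for illustration only; it does not handle edge cases such as empty clusters.

import numpy as np

def kmeans_sketch(y, K, max_iters=100, seed=None):
    """Minimal K-means sketch for an (N, d) array y.
    Returns the K means and the cluster assignment of each point.
    Illustrative only; empty clusters are not handled."""
    rng = np.random.default_rng(seed)

    # Step 1: choose K initial means (here, K randomly chosen data points)
    means = y[rng.choice(len(y), size=K, replace=False)].astype(float)

    labels = None
    for _ in range(max_iters):
        # Step 2: assign each point to the closest mean (least squares sense)
        dists = np.linalg.norm(y[:, None, :] - means[None, :, :], axis=-1)
        new_labels = np.argmin(dists, axis=1)

        # Step 4: stop if the assignments did not change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Step 3: recompute each mean from the points assigned to it
        for k in range(K):
            means[k] = y[labels == k].mean(axis=0)

    return means, labels

We will not actually use this function; in practice we use scikit-learn's implementation, as we do next.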
41.1 K-means with scikit-learn
While the K-means algorithm is easy to implement, we do not have to do it ourselves, as it is implemented in scikit-learn. Let us apply it to the Old Faithful data set we used in Chapter 40.
# Load in data
df = pl.read_csv(os.path.join(data_path, 'old_faithful.csv'))

# Pull data out as a Numpy array
y = df.select(pl.col('duration-minutes', 'waiting-minutes')).to_numpy()

# Instantiate KMeans instance with two clusters
kmeans = sklearn.cluster.KMeans(2)

# Perform optimization to get means and cluster assignments
kmeans.fit(y)
KMeans(n_clusters=2)
We can access the means (cluster centers) using the cluster_centers_ attribute and the cluster assignments using the labels_ attribute. We will use those to make a plot of the cluster assignments of the data in the Old Faithful data set.
# Color according to labels
colors = ['#00b6ff' if label == 1 else '#f6906c' for label in kmeans.labels_]

# Plot data with label by color
p = bokeh.plotting.figure(
    frame_width=300,
    frame_height=300,
    x_axis_label='eruption duration (min)',
    y_axis_label='eruption waiting time (min)',
)
p.scatter(y[:, 0], y[:, 1], color=colors)

# Put in means
p.scatter(
    kmeans.cluster_centers_[:, 0],
    kmeans.cluster_centers_[:, 1],
    color='black',
    marker='+',
    size=15,
)

bokeh.io.show(p)
The result is similar to what we saw using EM (see Chapter 40), but the data points that apparently lie between the two clusters differ in their assignment.
For K-means clustering, it is wise to first center and scale the data. That is, perform a linear transformation on the observations such that each measured variable has a mean of zero and a variance of one.
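Concretely, the transformation just subtracts each column's mean and divides by its standard deviation. A two-line equivalent of what scikit-learn's StandardScaler does in the next cell might look like this (assuming y is the array loaded above; the variable name y_scaled_manual is ours).

# Center and scale each column by hand (equivalent to StandardScaler below)
y_scaled_manual = (y - y.mean(axis=0)) / y.std(axis=0)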
# Center and scale y
scaler = sklearn.preprocessing.StandardScaler().fit(y)
y_scaled = scaler.transform(y)

# Do K-means on centered-and-scaled data
kmeans = sklearn.cluster.KMeans(2).fit(y_scaled)

# Rescale the cluster centers back to the original units
cluster_centers = scaler.inverse_transform(kmeans.cluster_centers_)

# Color according to labels
colors = ['#00b6ff' if label == 1 else '#f6906c' for label in kmeans.labels_]

# Plot data with label by color
p = bokeh.plotting.figure(
    frame_width=300,
    frame_height=300,
    x_axis_label='eruption duration (min)',
    y_axis_label='eruption waiting time (min)',
)
p.scatter(y[:, 0], y[:, 1], color=colors)

# Put in means
p.scatter(
    cluster_centers[:, 0],
    cluster_centers[:, 1],
    color='black',
    marker='+',
    size=15,
)

bokeh.io.show(p)
These results are closer to what we got with a GMM.
41.2 K-means clustering as a GMM
As is often the case, heuristic/algorithmic methods have a Bayesian generative model underlying them. Indeed, the above algorithm for K-means clustering screams EM!
Recall that the generative model for a GMM, given here as Equation 41.1, is
\[\begin{align}
&\boldsymbol{\pi} \sim \text{prior for mixing coefficients}, \\[1em]
&\boldsymbol{\mu} \sim \text{prior for location parameters}, \\[1em]
&\mathsf{\Sigma} \sim \text{prior for scale parameters}, \\[1em]
&\mathbf{z}_i \mid \boldsymbol{\pi} \sim \text{Categorical}(\boldsymbol{\pi}) \;\forall i, \\[1em]
&\mathbf{y}_i \mid \mathbf{z}_i, \boldsymbol{\mu}, \mathsf{\Sigma} \sim \prod_{k=1}^K (\mathrm{Norm}(\boldsymbol{\mu}_k,\mathsf{\Sigma}_k))^{z_{ik}} \;\forall i.
\end{align}
\tag{41.1}\]
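As a quick illustration of this generative model, here is a sketch of drawing data from it with NumPy. The mixing coefficients, means, and covariances below are made-up values for illustration, not parameters inferred from any data set.

import numpy as np

rng = np.random.default_rng(3252)

# Made-up parameter values for illustration
pi = np.array([0.3, 0.7])                        # mixing coefficients
mus = np.array([[0.0, 0.0], [4.0, 4.0]])         # component location parameters
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2)])  # component scale parameters

# z_i | pi ~ Categorical(pi) for each data point
z = rng.choice(len(pi), size=200, p=pi)

# y_i | z_i, mu, Sigma ~ Norm(mu_{z_i}, Sigma_{z_i})
y_gen = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])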
When doing K-means, the priors are all uniform and all covariance matrices are diagonal with the same fixed, very small entry \(\sigma^2\) on the diagonal, that is, \(\mathsf{\Sigma}_k = \sigma^2\,\mathsf{I}\) for all \(k\). The model is thus
\[\begin{align}
&\mathbf{z}_i \mid \boldsymbol{\pi} \sim \text{Categorical}(\boldsymbol{\pi}) \;\forall i, \\[1em]
&\mathbf{y}_i \mid \mathbf{z}_i, \boldsymbol{\mu} \sim \prod_{k=1}^K (\mathrm{Norm}(\boldsymbol{\mu}_k, \sigma^2\,\mathsf{I}))^{z_{ik}} \;\forall i,
\end{align}
\tag{41.2}\]
with uniform priors on \(\boldsymbol{\pi}\) and \(\boldsymbol{\mu}\).
In the E step, the responsibility of component \(k\) for data point \(i\) is proportional to \(\pi_k\,\exp[-\lVert \mathbf{y}_i - \boldsymbol{\mu}_k \rVert^2/2\sigma^2]\). As \(\sigma\) gets really small, \(\exp[-\lVert \mathbf{y}_i - \boldsymbol{\mu}_k \rVert^2/2\sigma^2]\) is largest for the \(\boldsymbol{\mu}_k\) that is closest to \(\mathbf{y}_i\) and relatively tiny for every other term. Therefore, the responsibility is approximately one for the component \(k\) whose mean is closest to \(\mathbf{y}_i\) and zero for the other components. Since the responsibilities are all one or zero, each data point is assigned to the cluster whose mean is closest.
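As a numerical check of this limit, we can compute the responsibilities for a single data point under two component means as \(\sigma\) shrinks. The point, means, and mixing coefficients below are invented solely for illustration.

import numpy as np

# One made-up data point and two made-up component means
y_i = np.array([1.0, 1.0])
mus = np.array([[0.0, 0.0], [2.5, 2.5]])
pi = np.array([0.5, 0.5])

for sigma in [1.0, 0.3, 0.1]:
    # log of pi_k * exp(-||y_i - mu_k||^2 / (2 sigma^2))
    log_w = np.log(pi) - np.sum((y_i - mus)**2, axis=1) / (2 * sigma**2)

    # Normalize, working with logs for numerical stability
    gamma = np.exp(log_w - log_w.max())
    gamma /= gamma.sum()

    print(f"sigma = {sigma}: responsibilities = {gamma.round(6)}")

As \(\sigma\) decreases, the responsibilities approach one for the closer mean and zero for the other.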
With the responsibilities all zero or one, the parameter-dependent part of the surrogate function is therefore
\[\begin{align}
-\frac{1}{2\sigma^2}\sum_i \lVert \mathbf{y}_i - \boldsymbol{\mu}_{ik} \rVert^2,
\end{align}
\]
where we have defined \(\boldsymbol{\mu}_{ik}\) to be the mean of the cluster to which data point \(i\) is assigned. So, the goal in the M step is to find the values of the \(\boldsymbol{\mu}_k\) that maximize the surrogate function. We differentiate with respect to \(\boldsymbol{\mu}_k\) and set the result to zero. The only terms of the sum that survive are those for data points assigned to cluster \(k\), such that
\[\begin{align}
\boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{i\text{ in cluster }k} \mathbf{y}_i,
\end{align}
\]
which says that \(\boldsymbol{\mu}_k\) is updated to be the mean of all of the data points assigned to it. So, the EM algorithm applied to the mixture model described by Equation 41.2 is indeed the algorithm for K-means clustering, thereby making clear what the generative model underlying it is. It is a rather restrictive model, with all uniform priors and a very rigid, prescribed covariance matrix for the likelihood.
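As a final sanity check of the update rule above, we can verify on simulated data that, after scikit-learn's K-means converges, each cluster center is (to numerical precision) the mean of the data points assigned to it. The two well-separated blobs below are made-up draws, used only for this check.

import numpy as np
import sklearn.cluster

rng = np.random.default_rng(0)

# Made-up data: two well-separated blobs
y_demo = np.concatenate(
    [rng.normal(0, 1, size=(50, 2)), rng.normal(5, 1, size=(50, 2))]
)

# Fit K-means with two clusters
kmeans_demo = sklearn.cluster.KMeans(2).fit(y_demo)

# Each converged cluster center should equal the mean of its assigned points
for k in range(2):
    cluster_mean = y_demo[kmeans_demo.labels_ == k].mean(axis=0)
    print(np.allclose(cluster_mean, kmeans_demo.cluster_centers_[k]))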