import os
import warnings

import numpy as np
import scipy.io
import scipy.stats as st

import sklearn
import sklearn.decomposition
import sklearn.preprocessing

import bebi103

import bokeh.io
import bokeh.plotting

bokeh.io.output_notebook()
In Chapter 49, we took a heuristic approach to principal component analysis in which we did the following.
- We noted that we have (centered and scaled) high-dimensional (dimension \(D\)) observations \(\mathbf{y}\) and posited that they could be generated from \(L\) causes, with \(L < D\), often with \(L \ll D\).
- We sought a linear projection of the \(D\)-dimensional data \(\mathbf{y}\) onto \(L\)-dimensional latent variables (causes) \(\mathbf{z}\).
- We defined a loss function that is the mean squared difference between the observed data and the data recovered from a linear transformation of the causes.
- We minimized the loss function to get the optimal transformation.
To be clear, this transformation is optimal only in the sense of the loss function we defined.
We now approach this problem with more careful probabilistic modeling.
50.1 Formulation of factor analysis
In factor analysis, we have a similar goal as in PCA in that we define a model \(\mathbf{y} = h(\mathbf{z};\theta)\), where \(\theta\) is some set of parameters, and we want to infer a function \(h(\mathbf{z};\theta)\). As in PCA, we restrict \(h(\mathbf{z};\theta)\) to be a linear function defined by a \(D\times L\) loading matrix \(\mathsf{W}\) such that

\[\begin{align}
\mathbf{y}_i \mid \boldsymbol{\mu}, \mathsf{W}, \mathbf{z}_i, \mathsf{\Psi} \sim \text{Norm}(\mathsf{W}\cdot \mathbf{z}_i + \boldsymbol{\mu}, \mathsf{\Psi}),
\end{align}\]
where \(\mathsf{\Psi}\) is a \(D\times D\) covariance matrix describing how the observed data vary from the linear model. We typically take \(\mathsf{\Psi}\) to be diagonal; the residuals of the observations are uncorrelated (though the observations are correlated through the latent variables \(\mathbf{z}\)). We therefore define the vector \(\boldsymbol{\sigma}\) such that \(\mathsf{\Psi} = \text{diag}(\boldsymbol{\sigma}^2)\), where \(\boldsymbol{\sigma}^2\) denotes element-wise squaring of the elements of \(\boldsymbol{\sigma}\).
We need to specify a prior on the latent variables \(\mathbf{z}\). In factor analysis, we take the latent variables to be \(L\)-variate Normally distributed. \[\begin{align}
\mathbf{z}_i \mid \boldsymbol{\mu}_0, \mathsf{\Sigma}_0 \sim \text{Norm}(\boldsymbol{\mu}_0, \mathsf{\Sigma}_0)\;\forall i.
\end{align}\]
Note that since we are using a linear model and \(\mathbf{z}_i\) gets transformed according to \(\mathsf{W}\cdot \mathbf{z}_i + \boldsymbol{\mu}\), we can absorb \(\boldsymbol{\mu}_0\) and \(\mathsf{\Sigma}_0\) into \(\mathsf{W}\) and \(\boldsymbol{\mu}\). (For example, writing \(\mathbf{z}_i = \boldsymbol{\mu}_0 + \mathsf{L}\cdot\mathbf{z}_i'\) with \(\mathsf{L}\cdot\mathsf{L}^\mathsf{T} = \mathsf{\Sigma}_0\) and \(\mathbf{z}_i' \sim \text{Norm}(\mathbf{0}, \mathsf{I})\) gives \(\mathsf{W}\cdot\mathbf{z}_i + \boldsymbol{\mu} = (\mathsf{W}\cdot\mathsf{L})\cdot\mathbf{z}_i' + (\boldsymbol{\mu} + \mathsf{W}\cdot\boldsymbol{\mu}_0)\), so we may simply redefine \(\mathsf{W}\) and \(\boldsymbol{\mu}\).) Absorbing these as described, we can write
\[\begin{align}
\mathbf{z}_i \sim \text{Norm}(\mathbf{0}, \mathsf{I})\;\forall i,
\end{align}
\]
where \(\mathsf{I}\) is the \(L\times L\) identity matrix. The full generative model for a factor analysis is then
\[\begin{align}
&\boldsymbol{\mu} \sim \text{prior for }\boldsymbol{\mu}, \\[1em]
&\mathsf{W} \sim \text{prior for }\mathsf{W}, \\[1em]
&\boldsymbol{\sigma} \sim \text{prior for }\boldsymbol{\sigma} \\[1em]
&\mathsf{\Psi} = \text{diag}(\boldsymbol{\sigma}^2) \\[1em]
&\mathbf{z}_i \sim \text{Norm}(\mathbf{0}, \mathsf{I})\;\forall i, \\[1em]
&\mathbf{y}_i \mid\boldsymbol{\mu}, \mathsf{W}, \mathbf{z}_i, \mathsf{\Psi} \sim \text{Norm}(\mathsf{W}\cdot \mathbf{z}_i + \boldsymbol{\mu}, \mathsf{\Psi})\; \forall i
\end{align}
\]
For clarity and for reference, the dimensions of the variables are shown in the table below. The index \(i\) indexes data points, and there are \(N\) of them.
| Variable | Description | Dimension |
|---|---|---|
| \(\boldsymbol{\mu}\) | location parameter of measurements | \(D\)-vector |
| \(\mathsf{W}\) | loading matrix | \(D \times L\) matrix |
| \(\boldsymbol{\sigma}\) | standard deviation for each measurement | positive \(D\)-vector |
| \(\mathsf{\Psi}\) | matrix representation of \(\boldsymbol{\sigma}^2\) | \(D \times D\) nonnegative diagonal matrix |
| \(\mathbf{z}_i\) | latent variable for datum \(i\) | \(L\)-vector |
| \(\mathbf{y}_i\) | datum \(i\) | \(D\)-vector |
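To make the generative model concrete, below is a small NumPy sketch that draws a synthetic data set according to the model. The dimensions and "true" parameter values are arbitrary choices for illustration only, not anything inferred from data.

import numpy as np

rng = np.random.default_rng(3252)

# Hypothetical dimensions: D-dimensional data, L-dimensional latent causes, N data points
D, L, N = 10, 2, 500

# Arbitrarily chosen parameter values (in a real analysis these are unknown)
mu = rng.normal(0, 1, size=D)                # location parameter, D-vector
W = rng.normal(0, 1, size=(D, L))            # loading matrix, D x L
sigma = np.abs(rng.normal(0, 0.3, size=D))   # residual std. dev., positive D-vector
Psi = np.diag(sigma**2)                      # D x D diagonal covariance matrix

# Latent variables: z_i ~ Norm(0, I) for each data point (rows of z)
z = rng.standard_normal(size=(N, L))

# Observations: y_i ~ Norm(W . z_i + mu, Psi) (rows of y)
y = z @ W.T + mu + rng.normal(0, sigma, size=(N, D))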
Obviously, we need priors for \(\mathsf{W}\), \(\boldsymbol{\mu}\), and \(\mathsf{\Psi}\). We will temporarily leave those priors unspecified and turn our attention to marginal distributions and then to identifiability of the parameters of a factor analysis.
50.1.2 Marginal likelihood in factor analysis
We note that we can write a marginalized likelihood for datum \(i\) as

\[\begin{align}
f(\mathbf{y}_i \mid \boldsymbol{\mu}, \mathsf{W}, \mathsf{\Psi}) = \int \mathrm{d}\mathbf{z}_i\, f(\mathbf{y}_i \mid \boldsymbol{\mu}, \mathsf{W}, \mathsf{\Psi}, \mathbf{z}_i)\, g(\mathbf{z}_i),
\end{align}\]

where we have used the fact that the latent variables \(\mathbf{z}_i\) are not conditioned on the parameters \(\boldsymbol{\mu}\), \(\mathsf{W}\), and \(\mathsf{\Psi}\). Computing the integral, we get a marginalized likelihood of

\[\begin{align}
\mathbf{y}_i \mid \boldsymbol{\mu}, \mathsf{W}, \mathsf{\Psi} \sim \text{Norm}(\boldsymbol{\mu}, \mathsf{W}\cdot\mathsf{W}^\mathsf{T} + \mathsf{\Psi}).
\end{align}\]
This marginalized likelihood makes clear what factor analysis does: It expresses the observed data as generated by a \(D\)-dimensional Gaussian distribution with mean \(\boldsymbol{\mu}\) and covariance matrix \(\mathsf{W}\cdot\mathsf{W}^\mathsf{T} + \mathsf{\Psi}\), in which the rank-\(L\) matrix \(\mathsf{W}\cdot\mathsf{W}^\mathsf{T}\) captures the covariance induced by the low-dimensional latent variables and the diagonal matrix \(\mathsf{\Psi}\) captures the residual variance of each measurement.
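To see this marginalized likelihood in action, here is a brief sketch that evaluates the log marginal likelihood of the synthetic data set generated above (it assumes the mu, W, Psi, and y from that sketch).

import numpy as np
import scipy.stats as st

# Marginal distribution of each y_i: Norm(mu, W . W^T + Psi)
cov_marg = W @ W.T + Psi

# Log marginal likelihood, summed over the N data points
log_marg_like = np.sum(st.multivariate_normal.logpdf(y, mean=mu, cov=cov_marg))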
50.1.3 Recognition distribution in factor analysis
The recognition distribution is the distribution of the latent variables conditioned on the observed data, \(g(\mathbf{z}_i\mid \mathbf{y}_i, \boldsymbol{\mu}, \mathsf{W}, \mathsf{\Psi})\). Bayes’s theorem gives

\[\begin{align}
g(\mathbf{z}_i\mid \mathbf{y}_i, \boldsymbol{\mu}, \mathsf{W}, \mathsf{\Psi}) = \frac{f(\mathbf{y}_i \mid \boldsymbol{\mu}, \mathsf{W}, \mathsf{\Psi}, \mathbf{z}_i)\,g(\mathbf{z}_i)}{f(\mathbf{y}_i \mid \boldsymbol{\mu}, \mathsf{W}, \mathsf{\Psi})}.
\end{align}\]

Because the denominator has no \(\mathbf{z}_i\)-dependence, it is a normalization constant. The recognition distribution is then proportional to a product of two Normal distributions, since

\[\begin{align}
g(\mathbf{z}_i\mid \mathbf{y}_i, \boldsymbol{\mu}, \mathsf{W}, \mathsf{\Psi}) \propto f(\mathbf{y}_i \mid \boldsymbol{\mu}, \mathsf{W}, \mathsf{\Psi}, \mathbf{z}_i)\,g(\mathbf{z}_i),
\end{align}\]

with \(\mathbf{y}_i \mid \boldsymbol{\mu}, \mathsf{W}, \mathsf{\Psi}, \mathbf{z}_i \sim \text{Norm}(\mathsf{W}\cdot\mathbf{z}_i + \boldsymbol{\mu}, \mathsf{\Psi})\) and \(\mathbf{z}_i \sim \text{Norm}(\mathbf{0}, \mathsf{I})\). Completing the square in \(\mathbf{z}_i\) shows that the recognition distribution is itself Normal,

\[\begin{align}
\mathbf{z}_i \mid \mathbf{y}_i, \boldsymbol{\mu}, \mathsf{W}, \mathsf{\Psi} \sim \text{Norm}(\boldsymbol{\mu}_{\mathbf{z}\mid\mathbf{y}_i}, \mathsf{\Sigma}_\mathbf{z}),
\end{align}\]

with

\[\begin{align}
&\mathsf{\Sigma}_\mathbf{z} = \left(\mathsf{I} + \mathsf{W}^\mathsf{T}\cdot\mathsf{\Psi}^{-1}\cdot\mathsf{W}\right)^{-1}, \\[1em]
&\boldsymbol{\mu}_{\mathbf{z}\mid\mathbf{y}_i} = \mathsf{\Sigma}_\mathbf{z}\cdot\mathsf{W}^\mathsf{T}\cdot\mathsf{\Psi}^{-1}\cdot(\mathbf{y}_i - \boldsymbol{\mu}).
\end{align}\]

Note that the covariance matrix \(\mathsf{\Sigma}_\mathbf{z}\) is the same for every data point; only the location parameter \(\boldsymbol{\mu}_{\mathbf{z}\mid\mathbf{y}_i}\) depends on \(\mathbf{y}_i\).
We note that evaluating the recognition distribution is part of the E step of an EM algorithm for finding the MAP of a factor model, as described in Section 50.2.
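As a sketch of how these expressions translate into code (again using the synthetic mu, W, sigma, and y from above as stand-ins for inferred values), the parameters of the recognition distribution can be computed directly.

import numpy as np

# Covariance of the recognition distribution (same for every data point)
Psi_inv = np.diag(1 / sigma**2)
Sigma_z = np.linalg.inv(np.eye(L) + W.T @ Psi_inv @ W)

# Location parameters; row i of mu_z is the location parameter for datum i
# (Sigma_z is symmetric, so no transpose is needed)
mu_z = (y - mu) @ Psi_inv @ W @ Sigma_z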
50.1.4 Nonidentifiability of the factor loading matrix and latent variables
In general, the factor loading matrix \(\mathsf{W}\) and latent variables \(\mathbf{z}\) are not identifiable. This can be seen by considering an arbitrary \(L\times L\) orthonormal matrix \(\mathsf{Q}\) (meaning that the rows and columns of \(\mathsf{Q}\) are orthonormal vectors such that \(\mathsf{Q}\cdot\mathsf{Q}^\mathsf{T} = \mathsf{Q}^\mathsf{T}\cdot\mathsf{Q} = \mathsf{I}\)). We define a new loading matrix and latent variable vector according to

\[\begin{align}
\tilde{\mathsf{W}} = \mathsf{W}\cdot\mathsf{Q}^\mathsf{T}, \qquad \tilde{\mathbf{z}}_i = \mathsf{Q}\cdot\mathbf{z}_i.
\end{align}\]

Then,

\[\begin{align}
\tilde{\mathsf{W}}\cdot\tilde{\mathbf{z}}_i = \mathsf{W}\cdot\mathsf{Q}^\mathsf{T}\cdot\mathsf{Q}\cdot\mathbf{z}_i = \mathsf{W}\cdot\mathbf{z}_i,
\end{align}\]

and \(\tilde{\mathbf{z}}_i\) is still distributed as \(\text{Norm}(\mathbf{0}, \mathsf{I})\). Thus, two different loading matrices and vectors of latent variables give exactly the same likelihood upon multiplication by an orthonormal matrix.
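A quick numerical check makes this concrete. The sketch below uses the synthetic W, z, and Psi from above, together with an arbitrarily chosen rotation matrix (an assumption of this example, not part of the analysis to come), to verify that the transformation leaves the model unchanged.

import numpy as np

# An arbitrary L x L orthonormal (rotation) matrix for L = 2
theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Transformed loading matrix and latent variables
W_tilde = W @ Q.T
z_tilde = z @ Q.T  # row i is Q . z_i

# W-tilde . z-tilde_i matches W . z_i for every data point...
print(np.allclose(z_tilde @ W_tilde.T, z @ W.T))

# ...and the marginal covariance W . W^T + Psi is unchanged
print(np.allclose(W_tilde @ W_tilde.T + Psi, W @ W.T + Psi))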
This presents a real problem! The variables we are most interested in, the latent variables and the factor loading matrix, are not in general identifiable. We will discuss methods to deal with this nonidentifiability momentarily, but for now we will proceed without it giving us too much concern. Chris Bishop gives a nice motivation for this approach in his book Pattern Recognition and Machine Learning.
We shall view factor analysis as a form of latent variable density model, in which the form of the latent space is of interest but not the particular choice of coordinates used to describe it.
50.1.5 Specifying priors
We have left priors for \(\boldsymbol{\mu}\), \(\mathsf{W}\), and \(\mathsf{\Psi}\) unspecified. The rotational nonidentifiability makes prior assignment difficult, as we would like to ascribe a prior that breaks the nonidentifiability. There are several approaches to this. One is to take the same approach we did with PCA (see Chapter 49), and insist in our prior that the columns of \(\mathsf{W}\) are orthonormal. This breaks much of the nonidentifiability because in this case, \(\mathsf{W}\cdot \mathsf{Q}^\mathsf{T}\) is itself orthonormal, as can be shown by noting that

\[\begin{align}
\left(\mathsf{W}\cdot\mathsf{Q}^\mathsf{T}\right)^\mathsf{T}\cdot\left(\mathsf{W}\cdot\mathsf{Q}^\mathsf{T}\right) = \mathsf{Q}\cdot\mathsf{W}^\mathsf{T}\cdot\mathsf{W}\cdot\mathsf{Q}^\mathsf{T} = \mathsf{Q}\cdot\mathsf{Q}^\mathsf{T} = \mathsf{I}.
\end{align}\]

However, any given column of \(\mathsf{W}\) can still be arbitrarily multiplied by \(-1\), so \(\mathsf{W}\), and therefore also the latent variables \(\mathbf{z}\), are not identifiable (see also Section 49.1.6).
Even if we wanted to formulate a prior to insist that the columns of \(\mathsf{W}\) are orthonormal, this is difficult to implement. An easier, but fraught, approach is to force \(\mathsf{W}\) to be lower triangular. Forcing this upon the loading matrix can lead to constraints on the resulting latent variables \(\mathbf{z}\), clouding their interpretation.
In our analysis, we will punt on this and take the egregious step of leaving the priors unspecified, following Bishop’s philosophy that we are using the factor analysis to expose the form of the latent space and not concerning ourselves with the particularities of coordinate choices. We will also forgo evaluating the full posterior using MCMC because of the awful nonidentifiabilities and will proceed to find a MAP (using an indefinite article because there are many MAPs!).
(Rick Farouni has written a couple of nice blog posts about the nonidentifiability challenges of factor analysis, available here and here.)
50.2 EM for factor analysis
As a latent variable model, factor analysis is well-suited for the EM algorithm. We can take the usual approach for deriving the EM algorithm by evaluating the recognition distribution, defining the parameter-dependent part of the surrogate function, and then finding the parameter values that maximize it. It turns out that the MAP estimate for \(\boldsymbol{\mu}\) is the sample mean,

\[\begin{align}
\boldsymbol{\mu}_\mathrm{MAP} = \bar{\mathbf{y}} \equiv \frac{1}{N}\sum_{i=1}^N \mathbf{y}_i,
\end{align}\]

so that can be computed directly outside of the EM algorithm. Denoting by \(\mathbf{E}_i\) and \(\mathsf{E}_{ii}\) the expectations of \(\mathbf{z}_i\) and \(\mathbf{z}_i\cdot\mathbf{z}_i^\mathsf{T}\) under the recognition distribution (computed in the E step below), maximization of the surrogate function gives the update

\[\begin{align}
\mathsf{W}^* = \left[\sum_{i=1}^N (\mathbf{y}_i - \bar{\mathbf{y}})\cdot\mathbf{E}_i^\mathsf{T}\right]\cdot\left[\sum_{i=1}^N \mathsf{E}_{ii}\right]^{-1},
\end{align}\]

and we take

\[\begin{align}
\boldsymbol{\sigma}^{*2} = \frac{1}{N}\sum_{i=1}^N\left[(\mathbf{y}_i - \bar{\mathbf{y}})\odot(\mathbf{y}_i - \bar{\mathbf{y}}) - (\mathsf{W}^*\cdot\mathbf{E}_i)\odot(\mathbf{y}_i - \bar{\mathbf{y}})\right]
\end{align}\]

to be the plug-in estimate for \(\boldsymbol{\sigma}^2\), where the symbol \(\odot\) denotes element-wise multiplication. The EM algorithm then proceeds as follows.
1. Initialize \(\mathsf{W}\) and \(\boldsymbol{\sigma}\).
2. Compute \(\mathsf{\Sigma}_\mathbf{z} = (\mathsf{I} + \mathsf{W}^\mathsf{T}\cdot\mathsf{\Psi}^{-1}\cdot \mathsf{W})^{-1}\). (E step)
3. Compute \(\mathbf{E}_i = \mathsf{\Sigma}_\mathbf{z}\cdot\mathsf{W}^\mathsf{T}\cdot\mathsf{\Psi}^{-1}\cdot(\mathbf{y}_i - \bar{\mathbf{y}})\) for all \(i\). (E step)
4. Compute \(\mathsf{E}_{ii} = \mathsf{\Sigma}_\mathbf{z} + \mathbf{E}_i\cdot \mathbf{E}_i^\mathsf{T}\) for all \(i\). (E step)
5. Compute \(\mathsf{W}^*\) and \(\boldsymbol{\sigma}^*\) from the update expressions above. (M step) If the difference between the log posterior (which in our case is equal to the log likelihood) evaluated at \(\boldsymbol{\mu}_\mathrm{MAP}\), \(\mathsf{W}^*\), and \(\boldsymbol{\sigma}^*\) and the log posterior evaluated at \(\boldsymbol{\mu}_\mathrm{MAP}\), \(\mathsf{W}\), and \(\boldsymbol{\sigma}\) is smaller than a predefined threshold, STOP and record \(\boldsymbol{\mu}_\mathrm{MAP}\), \(\mathsf{W}^*\), and \(\boldsymbol{\sigma}^*\) as the MAP estimate. Otherwise, go to 6.
6. Set \(\mathsf{W} = \mathsf{W}^*\) and \(\boldsymbol{\sigma} = \boldsymbol{\sigma}^*\).
7. Go to 2.
50.2.1 Factor analysis with scikit-learn
We will not code up the EM algorithm here, as we did for Gaussian mixture models (see ?sec-gmm-em-example), since it is not particularly instructive. Instead, we will utilize the implementation in scikit-learn. Note, though, that if we did have priors on the parameters, we would need to update the above algorithm and implement it ourselves, since scikit-learn’s implementation does not have priors.
# Load in the data set and separate into Numpy arrays
data = scipy.io.loadmat(
    os.path.join(data_path, 'hypothalamus_calcium_imaging_remedios_et_al.mat')
)
neural_data = data['neural_data']
attack_vector = data['attack_vector'].flatten()
sex_vector = data['sex_vector'].flatten()

# Time points at 30 Hz sampling
t = np.arange(len(attack_vector)) / 30
Scikit-learn’s implementation of factor analysis assumes centered data. This is not a problem, because we know the MAP value of \(\boldsymbol{\mu}\) analytically; it is the sample mean. So, we can go ahead and compute that directly.
# MAP estimate for parameter mu
mu = np.mean(neural_data, axis=1)
We should therefore center the data. We will not scale the data set, since the factor analysis involves estimation of the variance via the parameter \(\boldsymbol{\sigma}\) along each dimension.
# Center the data set by subtracting the mean
scaler = sklearn.preprocessing.StandardScaler(with_std=False).fit(neural_data.T)
y = scaler.transform(neural_data.T)
Next, we can instantiate a FactorAnalysis instance, which requires specification of the number of components. We will again use two components. We will go ahead and run the fit() method to perform the factor analysis.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fa = sklearn.decomposition.FactorAnalysis(n_components=2).fit(y)
We can now explore the results of the factor analysis. The MAP estimate for the location parameter (up to an additive constant) of the recognition distribution, \(\boldsymbol{\mu}_{\mathbf{z}\mid\mathbf{y}_i}\), is computed using the transform() method.
mu_z = fa.transform(y)
The parameter \(\boldsymbol{\sigma}^2\) is given by the noise_variance_ attribute.
sigma = np.sqrt(fa.noise_variance_)
The loading matrix is given by the components_ attribute.
# D x L loading matrix is the transpose of the components_ attribute
W = fa.components_.T
Scikit-learn does not compute the covariance matrix of the recognition distribution, \(\mathsf{\Sigma}_\mathbf{z}\). As shown in Section 50.1.3, \(\mathsf{\Sigma}_\mathbf{z}\) is the same for all data points, given by Equation 50.2. We can therefore compute it.
Sigma = np.linalg.inv(np.eye(2) + W.T @ np.diag(1 / fa.noise_variance_) @ W)

# Take a look
Sigma
Evidently, \(\mathsf{\Sigma}_\mathbf{z}\) is approximately diagonal. This is not the case in general for factor analysis, but is in this case; the components of the latent variables are essentially independent of each other.
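As a quick check on how close to diagonal \(\mathsf{\Sigma}_\mathbf{z}\) is, we can compute the correlation coefficient between the two latent components that it implies. This is a small sketch using the Sigma computed above.

# Correlation between the two latent components implied by Sigma
rho = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])

# Take a look
rho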
50.2.2 Visualizing FA results
In a factor analysis, instead of having a point estimate for the latent variables \(\mathbf{z}\) as in PCA, we have a posterior distribution for them. We have already computed its parameters \(\boldsymbol{\mu}_{\mathbf{z}\mid\mathbf{y}}\) and \(\mathsf{\Sigma}_\mathbf{z}\). If we want to plot one latent variable against another, we can do so using ellipse glyphs that encompass 95% of the posterior probability mass. To plot an ellipse shape, Bokeh requires a major axis length, minor axis length, and angle from horizontal of the ellipse. Let \(\lambda_1\) and \(\lambda_2\) be the eigenvalues of \(\mathsf{\Sigma}_\mathbf{z}\) with \(\lambda_1 \le \lambda_2\) and corresponding eigenvectors \(\mathbf{v}_1\) and \(\mathbf{v}_2\). Then, for a 2D Normal distribution, the ellipse that encompasses a fraction \(p\) of the total posterior probability mass has:
- Major axis length: \(2\sqrt{\lambda_2\,\chi_2^2(p)}\)
- Minor axis length: \(2\sqrt{\lambda_1\,\chi_2^2(p)}\)
- Angle from horizontal: \(\arctan(v_{2,y} / v_{2,x})\)
Here, \(\chi_2^2(p)\) denotes the \(p\)-th quantile (inverse CDF) of a chi-squared distribution with two degrees of freedom. We can calculate these quantities from \(\mathsf{\Sigma}_{\mathbf{z}}\).
chi = st.chi2.ppf(0.95, df=2)

# Eigensystem (eigenvalues in ascending order from eigh)
lam, v = np.linalg.eigh(Sigma)

# Major and minor axis lengths
b = 2 * np.sqrt(lam[1] * chi)
a = 2 * np.sqrt(lam[0] * chi)

# Angle of major axis (eigenvector for the largest eigenvalue) from horizontal
angle = np.arctan2(v[1, 1], v[0, 1])
Now let’s make the plot!
p = bokeh.plotting.figure(
    frame_height=350,
    frame_width=350,
    x_axis_label="z₁",
    y_axis_label="z₂",
)

# Set up data for the plot
time_color = bebi103.viz.q_to_color(t, bokeh.palettes.Viridis256)
sex_color = ["orchid" if s else "dodgerblue" for s in sex_vector]
cds = bokeh.models.ColumnDataSource(
    dict(z1=mu_z[:, 0], z2=mu_z[:, 1], t=t, color=time_color)
)

# We'll allow selection of which color we want to visualize
color_selector = bokeh.models.RadioButtonGroup(labels=["time", "sex"], active=0)

js_code = """
cds.data['color'] = color_selector.active == 0 ? time_color : sex_color;
cds.change.emit();
"""

color_selector.js_on_event(
    "button_click",
    bokeh.models.CustomJS(
        code=js_code,
        args=dict(
            time_color=time_color,
            sex_color=sex_color,
            cds=cds,
            color_selector=color_selector,
        ),
    ),
)

# Populate glyphs
p.ellipse(
    source=cds, x="z1", y="z2", width=b, height=a, angle=angle, alpha=0.2, color="color"
)

# Lay out and show!
bokeh.io.show(
    bokeh.layouts.row(
        p, bokeh.layouts.column(bokeh.models.Div(text="Color by"), color_selector)
    ),
)
This is a very similar result to what we got when doing PCA, but now we can also see a measure of the posterior variation in the inferred latent variables.
Reconstruction of the data based on the reduced representation is not as straightforward, as we would have to sample out of the generative distribution parametrized by the posterior distribution. We could instead take \(\boldsymbol{\mu}_{\mathbf{z}\mid \mathbf{y}}\) as a point estimate for the latent variable \(\mathbf{z}\).
# Compute the reconstruction
y_hat = np.dot(W, mu_z.T).T

# Uncenter the reconstruction
y_hat = scaler.inverse_transform(y_hat)

# Kwargs for plotting
kwargs = dict(
    x_interpixel_distance=1/30,
    y_interpixel_distance=1,
    frame_height=150,
    frame_width=600,
    x_axis_label="time (s)",
    y_axis_label="neuron",
)

# Plot original data
p_neuron = bebi103.image.imshow(neural_data, **kwargs)

# Plot reconstructed data
p_hat = bebi103.image.imshow(y_hat.T, **kwargs)

p_neuron.yaxis.major_label_text_font_size = '0pt'
p_hat.yaxis.major_label_text_font_size = '0pt'

bokeh.io.show(bokeh.layouts.column(p_neuron, p_hat))