51  Special cases of factor analysis

In Chapter 50, we wrote the generative model for factor analysis.

\[\begin{align} &\boldsymbol{\mu} \sim \text{prior for }\boldsymbol{\mu}, \\[1em] &\mathsf{W} \sim \text{prior for }\mathsf{W}, \\[1em] &\boldsymbol{\sigma} \sim \text{prior for }\boldsymbol{\sigma} \\[1em] &\mathsf{\Psi} = \text{diag}(\boldsymbol{\sigma}^2) \\[1em] &\mathbf{z}_i \sim \text{Norm}(\mathbf{0}, \mathsf{I})\;\forall i, \\[1em] &\mathbf{y}_i \mid\boldsymbol{\mu}, \mathsf{W}, \mathbf{z}_i, \mathsf{\Psi} \sim \text{Norm}(\mathsf{W}\cdot \mathbf{z}_i + \boldsymbol{\mu}, \mathsf{\Psi})\; \forall i \end{align} \]

Here, we consider three special cases of factor analysis that correspond to techniques commonly known as probabilistic principal component analysis (PPCA), PCA (which we have already seen from a heuristic point of view), and nonnegative matrix factorization (NMF).

51.1 Probabilistic PCA

We consider first the special case where each entry in \(\boldsymbol{\sigma}\) is the same, say \(\sigma\). In that case, \(\mathsf{\Psi} = \sigma^2\,\mathsf{I}\). We further restrict \(\mathsf{W}\) to have orthonormal columns. The assumption here is that each data point is drawn from a Normal distribution with the same variance, but with mean \(\mathsf{W}\cdot \mathbf{z}_i + \boldsymbol{\mu}\). This special case is called probabilistic principal component analysis (PPCA).

51.1.1 PPCA generative model

The full model is

\[\begin{align} &\boldsymbol{\mu} \sim \text{prior for }\boldsymbol{\mu} \text{ (usually assumed Uniform)}, \\[1em] &\mathsf{W} \text{ with orthonormal columns, otherwise uninformative prior}, \\[1em] &\sigma \sim \text{prior for }\sigma \text{ (usually assumed Uniform)} \\[1em] &\mathbf{z}_i \sim \text{Norm}(\mathbf{0}, \mathsf{I})\;\forall i, \\[1em] &\mathbf{y}_i \mid\boldsymbol{\mu}, \mathsf{W}, \mathbf{z}_i, \sigma \sim \text{Norm}(\mathsf{W}\cdot \mathbf{z}_i + \boldsymbol{\mu}, \sigma^2 \mathsf{I})\; \forall i \end{align} \]

Probabilistic PCA is a reasonable model; it assumes that each data point results from a linear combination of the latent variables plus homoscedastic noise, and that data points are uncorrelated with one another except through the latent variables.
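To make the model concrete, here is a minimal sketch of drawing a data set from this generative model with NumPy. The dimensions, parameter values, and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

# Made-up dimensions: data dimension, latent dimension, number of data points
D, L, N = 5, 2, 300

# Made-up parameter values
mu = rng.normal(size=D)    # location parameter
sigma = 0.1                # homoscedastic noise scale

# A W with orthonormal columns, obtained from a QR decomposition
W, _ = np.linalg.qr(rng.normal(size=(D, L)))

# Latent variables: z_i ~ Norm(0, I)
Z = rng.normal(size=(N, L))

# Observations: y_i ~ Norm(W . z_i + mu, sigma^2 I)
Y = Z @ W.T + mu + sigma * rng.normal(size=(N, D))
```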

51.1.2 MAP estimate for PPCA

PPCA has the added advantage that the MAP estimate may be computed analytically for uniform priors on \(\boldsymbol{\mu}\) and \(\sigma\). The result is

\[\begin{align} &\boldsymbol{\mu}_\mathrm{MAP} = \bar{\mathbf{y}},\\[1em] &\mathsf{W}_\mathrm{MAP} = \mathsf{U}\cdot(\mathsf{\Lambda} - \sigma_\mathrm{MAP}^2\,\mathsf{I})^{\frac{1}{2}}, \\[1em] &\sigma_\mathrm{MAP}^2 = \frac{1}{D - L}\sum_{j=L+1}^D \lambda_j, \end{align} \tag{51.1}\]

where \(\lambda_1, \ldots, \lambda_D\) are the eigenvalues of the empirical covariance matrix

\[\begin{align} \hat{\mathsf{\Sigma}} = \frac{1}{N}\sum_{i=1}^N(\mathbf{y}_i - \bar{\mathbf{y}})\cdot(\mathbf{y}_i - \bar{\mathbf{y}})^\mathsf{T} \end{align}\]

ordered from largest to smallest, \(\mathsf{\Lambda} = \text{diag}((\lambda_1, \ldots, \lambda_L))\), and \(\mathsf{U}\) is a matrix whose columns are the eigenvectors corresponding to eigenvalues \(\lambda_1, \ldots, \lambda_L\).
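As a sketch, the MAP estimates of Equation 51.1 might be computed from a data matrix with NumPy as follows. The function name `ppca_map`, the arrangement of the data as an \(N\times D\) array `Y`, and the choice of latent dimension `L` are ours for illustration.

```python
import numpy as np

def ppca_map(Y, L):
    """MAP estimates (mu, W, sigma^2) for PPCA with uniform priors.
    Y is an N x D array of data; L is the number of latent dimensions."""
    N, D = Y.shape

    # mu_MAP is the mean of the data
    mu = Y.mean(axis=0)

    # Empirical covariance matrix (normalized by N, as in the text)
    Y_centered = Y - mu
    Sigma_hat = Y_centered.T @ Y_centered / N

    # Eigenvalues and eigenvectors, sorted from largest to smallest eigenvalue
    lam, U = np.linalg.eigh(Sigma_hat)
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]

    # sigma^2_MAP is the mean of the discarded eigenvalues
    sigma2 = lam[L:].mean()

    # W_MAP = U . (Lambda - sigma^2 I)^(1/2), using the top L eigenpairs
    W = U[:, :L] * np.sqrt(lam[:L] - sigma2)

    return mu, W, sigma2
```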

51.1.3 Marginal likelihood for PPCA

Referring to the marginal likelihood for factor analysis (Equation 50.1) and considering the special case where all entries of \(\boldsymbol{\sigma}\) are the same, the marginal likelihood for PPCA is

\[\begin{align} \mathbf{y}_i\mid \boldsymbol{\mu}, \mathsf{W}, \sigma\sim \text{Norm}(\boldsymbol{\mu},\mathsf{W}\cdot\mathsf{W}^\mathsf{T} + \sigma^2 \,\mathsf{I}). \end{align} \]
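As a sketch, the log marginal likelihood could be evaluated with SciPy as follows; the function name is ours, and `Y` is again an \(N\times D\) array of data.

```python
import numpy as np
import scipy.stats

def ppca_log_marginal_likelihood(Y, mu, W, sigma):
    """Sum over data points of the PPCA log marginal likelihood."""
    D = Y.shape[1]

    # Marginal covariance W . W^T + sigma^2 I
    cov = W @ W.T + sigma**2 * np.eye(D)

    return np.sum(scipy.stats.multivariate_normal.logpdf(Y, mean=mu, cov=cov))
```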

51.1.4 Recognition distribution for PPCA

Recall from Section 50.1.3 that the recognition distribution for factor analysis is

\[\begin{align} \mathbf{z}_i\mid \mathbf{y}_i, \boldsymbol{\mu}, \mathsf{W}, \mathsf{\Psi} \sim \text{Norm}(\boldsymbol{\mu}_{\mathbf{z}\mid\mathbf{y}_i}, \mathsf{\Sigma}_{\mathbf{z}}), \end{align} \]

where

\[\begin{align} \mathsf{\Sigma}_{\mathbf{z}} = (\mathsf{I} + \mathsf{W}^\mathsf{T}\cdot\mathsf{\Psi}^{-1}\cdot\mathsf{W})^{-1} \end{align} \]

and

\[\begin{align} \boldsymbol{\mu}_{\mathbf{z}\mid\mathbf{y}_i} = \mathsf{\Sigma}_{\mathbf{z}} \cdot\mathsf{W}^\mathsf{T}\cdot \mathsf{\Psi}^{-1}\cdot(\mathbf{y_i - \boldsymbol{\mu}}). \end{align} \]

In the special case of PPCA where \(\mathsf{\Psi} = \sigma^2 \mathsf{I}\), the scale and location parameters for the recognition distribution are

\[\begin{align} &\mathsf{\Sigma}_{\mathbf{z}} = \sigma^2\,(\sigma^{2}\,\mathsf{I} + \mathsf{W}^\mathsf{T}\cdot\mathsf{W})^{-1},\\[1em] &\boldsymbol{\mu}_{\mathbf{z}\mid\mathbf{y}_i} = (\sigma^{2}\,\mathsf{I} + \mathsf{W}^\mathsf{T}\cdot\mathsf{W})^{-1}\cdot\mathsf{W}^\mathsf{T}\cdot(\mathbf{y_i - \boldsymbol{\mu}}) = (\sigma^{2}\,\mathsf{I} + \mathsf{W}^\mathsf{T}\cdot\mathsf{W})^{-1}\cdot\mathsf{W}^\mathsf{T}\cdot(\mathbf{y_i - \bar{\mathbf{y}}}), \end{align} \]

where in the last equality we have plugged in the MAP estimate for \(\boldsymbol{\mu}\).
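As a sketch, these recognition distribution parameters might be computed with NumPy as follows, using \(\bar{\mathbf{y}}\) in place of \(\boldsymbol{\mu}\); the function name is ours.

```python
import numpy as np

def ppca_recognition(Y, W, sigma):
    """Sigma_z and the locations mu_{z|y_i} (as rows) for PPCA."""
    L = W.shape[1]

    # M = sigma^2 I + W^T . W appears in both parameters
    M = sigma**2 * np.eye(L) + W.T @ W

    # Scale parameter, shared by all data points
    Sigma_z = sigma**2 * np.linalg.inv(M)

    # Location parameters: row i is M^(-1) . W^T . (y_i - ybar)
    mu_z = np.linalg.solve(M, W.T @ (Y - Y.mean(axis=0)).T).T

    return Sigma_z, mu_z
```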

51.2 PCA as a limit of PPCA

Consider now the limit of PPCA where \(\sigma\) goes to zero. That is, we consider the limit where there is no noise in the measured data. The scale parameter of the recognition distribution vanishes and the location parameter becomes

\[\begin{align} \boldsymbol{\mu}_{\mathbf{z}\mid\mathbf{y}_i} = (\mathsf{W}^\mathsf{T}\cdot\mathsf{W})^{-1}\cdot\mathsf{W}^\mathsf{T}\cdot(\mathbf{y_i - \bar{\mathbf{y}}}), \end{align} \]

which is the solution to the least squares problem

\[\begin{align} \mathbf{z}_i = \text{arg min}_{\mathbf{z}_i} \left\lVert (\mathbf{y}_i - \bar{\mathbf{y}}) - \mathsf{W}\cdot\mathbf{z}_i\right\rVert_2^2. \end{align} \]

Comparing to the PCA loss function, Equation 49.2, this is exactly what PCA does (provided the data are centered)! So, PCA is the noise-free (\(\sigma\to 0\)) limit of PPCA, which is itself the special case of a factor model with \(\boldsymbol{\sigma} = \sigma\,\mathbf{1}\).
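As a quick numerical check with a made-up data set and a random \(\mathsf{W}\) with orthonormal columns, the location parameter in the \(\sigma\to 0\) limit, computed here by least squares, matches the projection \(\mathsf{W}^\mathsf{T}\cdot(\mathbf{y}_i - \bar{\mathbf{y}})\) of the centered data onto the columns of \(\mathsf{W}\).

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

# Made-up data set and a W with orthonormal columns
Y = rng.normal(size=(300, 5))
W, _ = np.linalg.qr(rng.normal(size=(5, 2)))

Y_centered = Y - Y.mean(axis=0)

# (W^T.W)^(-1) . W^T . (y_i - ybar) for every data point, via least squares
Z_limit = np.linalg.lstsq(W, Y_centered.T, rcond=None)[0].T

# Projection of the centered data onto the columns of W
Z_pca = Y_centered @ W

print(np.allclose(Z_limit, Z_pca))    # True
```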

51.3 Nonnegative matrix factorization

In factor analysis and its special cases of PPCA and PCA, we have been organizing our data into \(D\)-vectors \(\mathbf{y}_i\), where \(i \in[1, N]\), and assuming a multivariate Normal likelihood. For example, for PPCA,

\[\begin{align} &\mathbf{z}_i \sim \text{Norm}(\mathbf{0}, \mathsf{I})\;\forall i, \\[1em] &\mathbf{y}_i \mid\boldsymbol{\mu}, \mathsf{W}, \mathbf{z}_i, \sigma \sim \text{Norm}(\mathsf{W}\cdot \mathbf{z}_i + \boldsymbol{\mu}, \sigma^2 \mathsf{I})\; \forall i, \end{align} \tag{51.2}\]

where we introduce an \(L\)-vector \(\mathbf{z}_i\) for each data point.

We can arrange this in a different way. We define an \(N\times D\) matrix of data \(\mathsf{Y}\) whose rows consist of the \(\mathbf{y}_i\) we have been considering. We can similarly group the latent variables into an \(N \times L\) matrix \(\mathsf{Z}\), where each row is an \(L\)-dimensional latent variable. We now write

\[\begin{align} y_{i,j} \mid \mathsf{W}, \mathsf{Z}, \sigma \sim \text{Norm}(A_{ij}, \sigma)\; \forall i,j, \end{align} \tag{51.3}\]

where \(A_{ij}\) is an entry in a matrix \(\mathsf{A}\) defined as

\[\begin{align} \mathsf{A} = (\mathsf{W}\cdot\mathsf{Z}^\mathsf{T})^\mathsf{T}, \end{align}\]

which is equivalent to Equation 51.2 with \(\boldsymbol{\mu} = \mathbf{0}\). If we have uniform priors and the likelihood given by Equation 51.3, the MAP finding problem is equivalent to finding the minimizer of

\[\begin{align} \left\lVert \mathsf{Y} - (\mathsf{W}\cdot\mathsf{Z}^\mathsf{T})^\mathsf{T} \right\rVert_F^2, \end{align} \]

where the Frobenius norm of a matrix \(\mathsf{M}\) is defined as the sum of the squares of its entries,

\[\begin{align} \left\lVert \mathsf{M}\right\rVert_F^2 = \sum_i\sum_j\,M_{ij}^2 = \mathrm{tr}(\mathsf{M}^\mathsf{T}\cdot\mathsf{M}). \end{align} \]
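As a small numerical check with made-up matrices, the squared Frobenius norm of the residual \(\mathsf{Y} - (\mathsf{W}\cdot\mathsf{Z}^\mathsf{T})^\mathsf{T}\) equals the sum of the squared residuals \(\sum_i \lVert\mathbf{y}_i - \mathsf{W}\cdot\mathbf{z}_i\rVert_2^2\) over data points, so the two arrangements give the same objective.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

# Made-up matrices with the shapes used in the text
N, D, L = 300, 5, 2
Y = rng.normal(size=(N, D))
W = rng.normal(size=(D, L))
Z = rng.normal(size=(N, L))

# Residual matrix Y - (W . Z^T)^T
R = Y - (W @ Z.T).T

# Three equivalent expressions for the squared Frobenius norm
fro_sum = np.sum(R**2)
fro_trace = np.trace(R.T @ R)
fro_rows = sum(np.sum((Y[i] - W @ Z[i]) ** 2) for i in range(N))

print(np.allclose(fro_sum, fro_trace), np.allclose(fro_sum, fro_rows))   # True True
```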

We would additionally need to find the MAP estimate for \(\sigma\), which in the PPCA case is given by Equation 51.1.

If \(\mathsf{Y}\) has all nonnegative entries, we can add the constraint that all entries of \(\mathsf{Z}\) and \(\mathsf{W}\) are nonnegative. This is conceptually done with the priors for \(\mathsf{W}\) and \(\mathsf{Z}\), but in practice the nonnegativity is imposed as a constraint on the optimization problem of minimizing the Frobenius norm. In this constrained case, Equation 51.1 no longer gives a MAP estimate for \(\sigma\). Nonetheless, we seek two nonnegative matrices, \(\mathsf{W}\) and \(\mathsf{Z}\), that minimize the above Frobenius norm. The result is referred to as the nonnegative matrix factorization (NMF) of the data set \(\mathsf{Y}\).

Some researchers claim that NMF results are more interpretable than those of PCA or PPCA because computing \(\mathsf{W}\cdot \mathsf{Z}^\mathsf{T}\) amounts to constructively adding components to reach an approximation of the observed data \(\mathsf{Y}\). In terms of the model, however, it is similar to PPCA, except with a nonnegativity constraint on the parameters.

NMF is implemented in scikit-learn.
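As a minimal sketch of how this might look, with the number of components, initialization, and seed below being arbitrary choices and the data matrix a random nonnegative stand-in:

```python
import numpy as np
import sklearn.decomposition

rng = np.random.default_rng(seed=3252)

# Stand-in nonnegative data matrix with shape (N, D)
Y = rng.uniform(size=(300, 5))

nmf = sklearn.decomposition.NMF(n_components=2, init="nndsvda", random_state=3252)

# scikit-learn's W (returned by fit_transform) is what we are calling Z
Z = nmf.fit_transform(Y)

# scikit-learn's H (components_) has shape (L, D); its transpose is our W
W = nmf.components_.T

# Approximate reconstruction of the data set
Y_approx = Z @ W.T
```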

As a final note about NMF, be careful about nomenclature. In much of the NMF literature, what we are calling \(\mathsf{Z}\) is called \(\mathsf{W}\) and what we are calling \(\mathsf{W}\) is called \(\mathsf{H}\). This is really confusing. Be sure to read the documentation of any package you use.