A Corrected Criterion for Selecting the Optimum Number of Principal Components

Abstract: Determining the optimum number of components to be retained is a key problem in principal component analysis (PCA). Besides rule-of-thumb estimates, there exist several sophisticated methods for automatically selecting the dimensionality of the data. Based on the probabilistic PCA model, Minka (2001) proposed an approximate Bayesian model selection criterion. In this paper we correct this criterion and present a modified version. We compare the novel criterion with various other approaches in a simulation study. Furthermore, we use it for finding the optimum number of principal components in hyper-spectral skin cancer images.


Introduction
Principal component analysis (PCA) was introduced by Pearson (1901) and is a linear orthogonal data transformation. The data is transformed in such a way that in the new coordinate system the components are uncorrelated and sorted in descending order according to their variance. PCA is a standard tool for reducing multidimensional data sets to lower dimensions for further statistical analysis by omitting higher-order components. Determining the optimum number of components to be retained is a major problem in PCA. Besides certain rule-of-thumb estimates there exist numerous sophisticated methods for automatically selecting the dimensionality of the data. For the Gaussian probabilistic PCA model of Tipping and Bishop (1999) an approximate Bayesian model selection criterion was proposed by Minka (2001). Although mentioned in books and used in various applications, we found some mistakes in its proof. In the following we correct this criterion and present a modified version.
The paper is organized as follows. Section 2 presents the probabilistic PCA model and mentions its basic properties. In Section 3 the model selection criterion proposed by Minka (2001) is corrected. Section 4 compares the corrected criterion with several other approaches in a simulation study, while Section 5 presents the results when it is applied to hyper-spectral skin cancer image data. Section 6 is devoted to conclusions.

Probabilistic PCA
A probabilistic model for PCA was introduced by Tipping and Bishop (1999) and is a special case of the factor analysis model (see Basilevsky, 1994). It assumes that the random vector x is a linear combination of basis vectors plus an additive noise term,

x = Hw + µ + ε,   (1)

where x has length d, w has smaller length k, µ is the mean of x, and the noise model is isotropic Gaussian, p(ε) ∼ N(0, σ²I_d). Moreover, the density of w is assumed to be standard Gaussian, p(w) ∼ N(0, I_k). Therefore, the probability of observing x follows a normal distribution, p(x | w, H, µ, σ²) ∼ N(Hw + µ, σ²I_d), and the marginal distribution of x is

p(x | H, µ, σ²) ∼ N(µ, HH^T + σ²I_d).   (2)

With this, the likelihood of a data set, D = (x_1, . . ., x_n), is

p(D | H, µ, σ²) = ∏_{i=1}^n p(x_i | H, µ, σ²) = (2π)^{−nd/2} |C|^{−n/2} exp{−(n/2) tr(C^{−1}Ŝ)},   (3)

where C = HH^T + σ²I_d and Ŝ = (1/n) ∑_{i=1}^n (x_i − µ)(x_i − µ)^T is the sample covariance matrix. Next we show how the maximum-likelihood estimates for H and σ² can be obtained.
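As an illustration, the generative model above can be simulated directly. In the sketch below the dimensions and the values of H, µ and σ² are arbitrary choices for the example, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 10, 3, 5000                # observed dim, latent dim, sample size
H = rng.normal(size=(d, k))          # basis vectors (arbitrary example values)
mu = rng.normal(size=d)              # mean of x
sigma2 = 0.5                         # isotropic noise variance

W = rng.normal(size=(n, k))                          # w ~ N(0, I_k)
E = rng.normal(scale=np.sqrt(sigma2), size=(n, d))   # eps ~ N(0, sigma^2 I_d)
X = W @ H.T + mu + E                                 # x = H w + mu + eps

# The marginal covariance of x is H H^T + sigma^2 I_d
C = H @ H.T + sigma2 * np.eye(d)
```

The rows of X are draws from the marginal distribution N(µ, HH^T + σ²I_d), so for large n their sample covariance approaches C.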
Using the symmetry of C and Ŝ, the gradient of (3) with respect to H is

∂ log p(D | H, µ, σ²)/∂H = n(C^{−1}ŜC^{−1}H − C^{−1}H).

Setting the gradient to zero and multiplying with C from the left yields

ŜC^{−1}H = H.   (4)

We are interested in the solutions of (4) for H. A possible solution would be Ĥ = 0, which corresponds to a minimum of the likelihood. A second solution is C = Ŝ, where the covariance model is exact. In this case HH^T = Ŝ − σ²I_d, which has the solution Ĥ = U(L − σ²I_d)^{1/2}R, where U ∈ R^{d×d} is the orthogonal matrix with the eigenvectors of Ŝ in its columns, L ∈ R^{d×d} is the diagonal matrix containing the corresponding eigenvalues, and R ∈ R^{d×d} is an arbitrary orthogonal matrix. As Tipping and Bishop (1999) note, having an exact covariance model is generally undesirable because this information is effectively discarded in the dimensionality reduction process anyway.
For the general and more interesting case consider all solutions of (4) except Ĥ = 0 and C = Ŝ. Using the singular value decomposition we know that the estimator can be written as Ĥ = U DR^T, where now U ∈ R^{d×k} is a matrix with orthonormal columns (u_1, . . ., u_k), D ∈ R^{k×k} is the diagonal matrix containing the singular values d_1, . . ., d_k, and R ∈ R^{k×k} is again an arbitrary orthogonal matrix. We deduce from (4) that ŜC^{−1}u_i d_i = u_i d_i for all i = 1, . . ., k. If d_i = 0, the vector u_i can be chosen arbitrarily such that U has orthonormal columns. If d_i ≠ 0, we have Ŝu_i = (σ² + d_i²)u_i, which means that each column of U has to be an eigenvector of Ŝ corresponding to the eigenvalue l_i = σ² + d_i², and all maximum-likelihood solutions for H must have the form

Ĥ = U(L − σ²I_k)^{1/2}R,   (5)

where now L ∈ R^{k×k} is the diagonal matrix with entries l_i and U contains the corresponding eigenvectors. With this, we can simplify

|C| = (σ²)^{d−q} ∏_{i=1}^q l_i,   (6)

where q is the number of non-zero d_i and equals the rank of Ĥ. Therefore, the eigenvalues l_1, . . ., l_q correspond to those components retained in PCA and l_{q+1}, . . ., l_d correspond to those discarded. Using the Woodbury matrix identity we find that

tr(C^{−1}Ŝ) = (1/σ²) ∑_{i=q+1}^d l_i + q.   (7)

Plugging the latter result and (6) into the log-likelihood (3) we get

log p(D | H, µ, σ²) = −(n/2) ( d log(2π) + ∑_{i=1}^q log l_i + (d − q) log σ² + (1/σ²) ∑_{i=q+1}^d l_i + q ).   (8)

The maximum-likelihood estimate for σ² turns out to be the average of the discarded eigenvalues:

σ̂² = (1/(d − q)) ∑_{i=q+1}^d l_i.   (9)

Substituting σ̂² into the log-likelihood (8) we arrive at

log p(D | H, µ, σ̂²) = −(n/2) ( d log(2π) + ∑_{i=1}^q log l_i + (d − q) log σ̂² + d ).   (10)

In order to find those eigenvalues of Ŝ that correspond to discarded components we have to maximize the above log-likelihood with respect to their choice. The sum of all eigenvalues is constant, implying that maximization of (10) with respect to the eigenvalues is equivalent to minimizing

(d − q) log( (1/(d − q)) ∑_{i=q+1}^d l_i ) − ∑_{i=q+1}^d log l_i.

The above function is minimized when all l_i, i = q + 1, . . ., d, have the same value. Hence, the discarded eigenvalues are adjacent in the ordered spectrum of Ŝ. From the fact that l_i > σ², ∀i = 1, . . ., q, and (9) we know that the smallest eigenvalue of Ŝ is definitely discarded. This implies that the smallest d − q eigenvalues of Ŝ are discarded and the top q eigenvalues are retained. Seen from this angle, the estimate for σ² can be interpreted as the average information loss when reducing the dimension of the data.
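The maximum-likelihood estimates can be sketched in a few lines of NumPy; the arbitrary rotation R is taken as the identity here, which is one valid choice among the equivalent solutions.

```python
import numpy as np

def ppca_ml(X, k):
    """Maximum-likelihood PPCA estimates (Tipping & Bishop, 1999):
    sigma2-hat is the average of the discarded eigenvalues and
    H-hat = U_k (L_k - sigma2 I)^(1/2), choosing R = I."""
    n, d = X.shape
    S = np.cov(X, rowvar=False, bias=True)     # sample covariance S-hat
    l, U = np.linalg.eigh(S)                   # eigh returns ascending order
    l, U = l[::-1], U[:, ::-1]                 # sort descending
    sigma2 = l[k:].mean()                      # average of discarded eigenvalues
    H = U[:, :k] @ np.diag(np.sqrt(l[:k] - sigma2))
    return H, sigma2, l
```

Since the retained eigenvalues are the k largest, l_i − σ̂² is never negative, so the square root is always defined.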

Choosing the Number of Principal Components Using Bayesian Model Selection
For the probabilistic PCA model with positive definite covariance matrix, Minka (2001) developed a fast and efficient criterion to select the optimum number of principal components. It is based on Bayesian model selection and is therefore sometimes called the MIBS criterion (Minka Bayesian model Selection) in the literature. A variety of books refer to this criterion, for example Cichocki and Amari (2002), Smidl and Quinn (2005), and Bishop (2008). It has shown good results for PCA and also for independent component analysis. Although the MIBS criterion is widely used, we found some errors in its proof.
In this section we give a corrected proof and propose a modified criterion for selecting the optimum number of principal components.

Choice of the Prior
In contrast to the maximum-likelihood estimation performed in Section 2, we have to assign priors to the parameters µ, H and σ² in the probabilistic PCA model (1) when we want to go the Bayesian way. For the mean µ we use a non-informative, flat prior, p(µ) ∝ 1.
Integrating the likelihood (2) with respect to µ yields the marginal likelihood (11). The prior for H is constructed similarly to decomposition (5), H = U(Λ − σ²I_k)^{1/2}R, where, in contrast to Section 2, the orthogonal matrix R and the matrix U, which has orthonormal columns, are model parameters. Λ is a diagonal matrix containing parameters λ_i, i = 1, . . ., k. Minka (2001) found a conjugate prior for U, Λ, R and σ². This prior is parameterized by a single parameter α > 0, and by applying simplifications similar to (6) and (7) it can be written as in (12). With this definition of the prior, the parameters are a-priori independent, since their joint prior density (12) factors into the product of the marginal prior densities for each of the parameters. To be least informative, the prior for U is chosen as the reciprocal of the area of the orthonormally constrained subset of the Cartesian product of k d-dimensional unit hyperballs, known as the Stiefel manifold (see Khatri and Mardia, 1977, and James, 1954). The prior for R can also be assumed constant, but as this variable does not appear in the likelihood (11) we can integrate it out.
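The area of this Stiefel manifold has the standard product form ∏_{i=1}^k 2π^{(d−i+1)/2}/Γ((d−i+1)/2); a sketch of its logarithm (assuming this standard expression, which for k = 1 reduces to the surface area of the unit sphere in R^d):

```python
from math import lgamma, log, pi, exp

def log_stiefel_area(d, k):
    """log area of the manifold of k orthonormal vectors in R^d;
    the uniform prior for U is the reciprocal of this area."""
    return sum(log(2.0) + 0.5 * (d - i + 1) * log(pi)
               - lgamma(0.5 * (d - i + 1))
               for i in range(1, k + 1))

# k = 1 in R^3 recovers the surface area of the unit sphere, 4*pi
area_sphere = exp(log_stiefel_area(3, 1))
```

Working in log space avoids overflow of the gamma functions when d is large, as in the hyper-spectral application later in the paper.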

The Corrected MIBS Criterion
Among the well-known statistical model selection criteria are Akaike's information criterion (AIC) and the minimum description length (MDL) criterion. These can easily be used for finding the optimum number of principal components. Assume we observe samples D = (x_1, . . ., x_n) from a zero-mean normal random vector. An appropriate number of principal components is the value of k = 1, . . ., d for which

AIC(k) = −2n(d − k) log ρ(k) + 2k(2d − k)   (13)

or

MDL(k) = −n(d − k) log ρ(k) + (k/2)(2d − k) log n   (14)

is minimized. In (13) and (14) the function ρ(k) is the ratio of the geometric to the arithmetic mean of the discarded eigenvalues,

ρ(k) = ( ∏_{i=k+1}^d l_i )^{1/(d−k)} / ( (1/(d − k)) ∑_{i=k+1}^d l_i ).
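A sketch of the two criteria in the classical Wax-Kailath form, with ρ(k) the ratio of the geometric to the arithmetic mean of the discarded eigenvalues (the exact constants are an assumption of this sketch):

```python
import numpy as np

def aic_mdl(eigvals, n):
    """AIC and MDL scores for k = 1, ..., d-1 (minimise over k).
    eigvals: eigenvalues of the sample covariance, descending."""
    l = np.asarray(eigvals, dtype=float)
    d = l.size
    aic, mdl = [], []
    for k in range(1, d):
        tail = l[k:]
        # rho(k): geometric / arithmetic mean of discarded eigenvalues
        rho = np.exp(np.mean(np.log(tail))) / np.mean(tail)
        aic.append(-2 * n * (d - k) * np.log(rho) + 2 * k * (2 * d - k))
        mdl.append(-n * (d - k) * np.log(rho)
                   + 0.5 * k * (2 * d - k) * np.log(n))
    return np.array(aic), np.array(mdl)
```

When the discarded eigenvalues are all equal, ρ(k) = 1 and the data term vanishes, so only the penalty term remains; k = d is excluded since ρ(d) is undefined.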
The criterion that was recommended in Minka (2001) has shown reasonable performance in various applications and is fast to evaluate (see Cichocki and Amari, 2002, and Leonowicz, Karvanen, Tanaka, and Rezmer, 2004):

p(D | k) ≈ p(U) ( ∏_{j=1}^k l_j )^{−n/2} (σ̂²)^{−n(d−k)/2} (2π)^{(m+k)/2} |A|^{−1/2} n^{−(m+k)/2},   (15)

where m = dk − k(k + 1)/2 and |A| is defined as

|A| = ∏_{i=1}^k ∏_{j=i+1}^d ( l̂_j^{−1} − l̂_i^{−1} ) ( l_i − l_j ) n,

with l̂_j = l_j for j ≤ k and l̂_j = σ̂² otherwise. The number of principal components to retain is taken to be the value of k = 1, . . ., d for which p(D | k) is maximized. Dropping all the terms that do not grow with n, a BIC approximation was also proposed in Minka (2001):

p(D | k) ≈ ( ∏_{j=1}^k l_j )^{−n/2} (σ̂²)^{−n(d−k)/2} n^{−(m+k)/2}.   (16)

The following theorem corrects the existing MIBS criterion (15) and serves as a guideline for easy implementation of the novel criterion proposed therein.
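Because the BIC approximation depends on the data only through the eigenvalue spectrum, it is cheap to evaluate. A sketch, assuming the commonly stated form of Minka's BIC approximation:

```python
import numpy as np

def minka_bic(eigvals, n):
    """log of the BIC approximation to the evidence p(D | k) for
    k = 1, ..., d-1 (maximise over k). eigvals descending."""
    l = np.asarray(eigvals, dtype=float)
    d = l.size
    scores = []
    for k in range(1, d):
        sigma2 = l[k:].mean()                  # ML noise variance
        m = d * k - k * (k + 1) // 2           # free parameters in U
        scores.append(-0.5 * n * np.sum(np.log(l[:k]))
                      - 0.5 * n * (d - k) * np.log(sigma2)
                      - 0.5 * (m + k) * np.log(n))
    return np.array(scores)
```

The first two terms are the profile log-likelihood from (10) up to constants; the last term penalizes the m + k effective parameters.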
Theorem: Consider the probabilistic principal component model (1) with a prior for the parameters H, µ, and σ² as discussed in Section 3.1, where α > 0 is the prior parameter. By applying the Laplace approximation, the marginal likelihood of the data given the number of components k is given by the corrected criterion (18).
The value for α is typically chosen very small to make the prior less informative. Large values for α may lead to inferior results, especially when the number of samples is small.

Proof of the Corrected MIBS Criterion
The marginal likelihood can be written as in (20), where N = n + 1 + α. The multiplier c_k defined in (19) combines the constants of the prior (12) and the integrated likelihood (11). To evaluate the integral in (20) we use the Laplace approximation as described in Lindley (1980). The matrix U ∈ R^{d×k} has m = dk − k(k + 1)/2 free parameters, since we have k(k + 1)/2 constraints because of the orthonormality condition for the columns of U. The Laplace approximation can therefore be written as in (21), where (Û, Λ̂, σ̂²) = arg max_{U,Λ,σ²} p(D, U, Λ, σ² | k) and A_{UΛσ²} is the negative Hessian of the integrand at (Û, Λ̂, σ̂²). It is important to know that the Laplace approximation is a basis-dependent method, so we have to choose an appropriate parameterization, (Z, Λ̃, σ̃²), for (U, Λ, σ²). In the case of Λ and σ² it is easy to guess a good parameterization, since the Laplace approximation works better when using integrals over the whole real axis than just over the positive real numbers. Therefore, one can use σ̃² = log σ² and λ̃_i = log λ_i for all i = 1, . . ., k, where λ̃_i are the diagonal elements of Λ̃. The Jacobian matrices for these transformations are diagonal, with entries exp{λ̃_i} and exp{σ̃²}, respectively. The matrix U is expressed in Euler vector coordinates (22), where Z ∈ R^{d×d} is a skew-symmetric matrix of parameters and U_d ∈ R^{d×d} is a fixed orthogonal matrix. The free parameters of Z are the first k rows of the upper triangle. The resulting density for U is known to be the von Mises-Fisher matrix distribution (see Khatri and Mardia, 1977). This density is maximized for U at Û, which contains the first k eigenvectors of Ŝ. In the parameterization (22) this corresponds to the case where Ẑ = 0 and U_d is equal to the matrix of eigenvectors of Ŝ.
Note that the density has the same value if the sign of a column of U is changed. This can happen 2^k times, so the density has 2^k extreme points with the same function values and we need to multiply (21) by 2^k. The determinant of the Jacobian for the transformation (22) at Ẑ is obviously equal to 1.
Using the same argument as in (7), it follows that p(D, Z, Λ̃, σ̃² | k) simplifies accordingly. The estimate for Λ̃ can be calculated by differentiation of the latter equation. If we set this derivative to zero, we get that exp{λ̃_i} = (n l_i + α)/(n − 1 + α), or equivalently λ̂_i = log((n l_i + α)/(n − 1 + α)). The cross derivatives are zero, and from the second derivatives with respect to λ̃_i the negative Hessian of log p(D, Ẑ, Λ̃, σ̃² | k) at Λ̃ = Λ̂ = diag(λ̂_1, . . ., λ̂_k) can be calculated as in (24). The estimate of σ̃² is obtained analogously, giving (25). Substituting the estimates Ẑ, Λ̂ and σ̂² into (23) leads to the corresponding equation for p(D, Ẑ, Λ̂, σ̂² | k). Next, we have to evaluate the Hessian of log p(D, Z, Λ̂, σ̂² | k) with respect to Z. Using the parameterization (22) and certain matrix differential rules, we obtain the first and second differential of U. The differential of log p(D, Z, Λ̂, σ̂² | k) can then be evaluated as in (28); the second differential (29) is obtained by taking the differential of (28). Using the matrix containing just the top k rows of L, we can rewrite (29) evaluated at Ẑ by applying (26) and (27) as in (30). If we define the diagonal matrix B and set T = B dZ^T + dZ B, we can simplify (30) by using the skew-symmetry of Z. If we further define a diagonal matrix Λ̄ with entries λ̄_i, i = 1, . . ., d, we get from the definition of B and T that the ij-th element of T can be written as in (31). The next step is to show that (31) can be simplified to (32). To prove (32) we note the form of the ij-th element of T L; hence, the elements of T L dZ can be written out explicitly. Taking into account the skew-symmetry of dZ, the trace of T L dZ can be computed and rewritten as (33). It remains to show that the second term of (33) is zero, and we prove it using induction.
Let B(k) denote the value of this second term in (33). It is clear that B(1) and B(2) are both equal to zero. Let us assume that the claim holds for B(s) for a certain s ≥ 2. For s + 1, a simple calculation yields an expression in which the first term on the right side is zero due to the induction hypothesis, and the sum of the last two terms is clearly also equal to zero. With this we have shown that the cross derivatives are zero and therefore the Hessian of log p(D, Z, Λ̃, σ̃² | k) is diagonal at Z = 0. Using (32), the determinant of the negative Hessian with respect to Z, required in (21), is given by (34). Since all the cross derivatives between Λ̃, σ̃² and Z are also zero, the negative Hessian of log p(D, Z, Λ̃, σ̃² | k) is block diagonal. Its determinant equals the product of the factors given in (24), (25), and (34), respectively. Now that we have all the terms and estimates needed in (21), the Laplace approximation of the integral directly leads to the criterion (18), which is our recommended criterion. In this formula a few errors of Minka (2001) have been corrected: in the original paper, the term exp{k + 1} and the Jacobians of the transformations were missing.

Simulation Study
In this section we test the performance of the different criteria for selecting the optimum number of principal components using synthetic data. In a simulation study we compare four criteria that make use of the probabilistic PCA model: the original MIBS criterion (15), its corrected version (18), the BIC (16), and the orthogonal variational Bayes approximation, called OVPCA (see Smidl and Quinn, 2005). Additionally, we consider the AIC (13) and MDL (14) criteria.
The prior parameter α for the corrected MIBS criterion is chosen to be α = 0.01 to make the prior diffuse. The subsequently presented results for the simulation study would not change much as long as the value for α is kept small, say smaller than 1. Large values for α, however, would adversely affect the results, especially for small sample sizes when the prior information dominates the information obtained from the likelihood.
In the first experiment the data is sampled from a 10-dimensional zero-mean Gaussian distribution with variances [10 8 6 4 2] in the first five directions. The remaining five directions have a variance equal to 1. The results for sample sizes varying between 5 and 350 are shown in Figure 1(a). For every sample size we simulate 1000 data sets and count how often the true dimensionality is recovered by a certain criterion. One can see that AIC is the most accurate criterion for sample sizes smaller than 75; however, it fails to increase the recovery rate when the number of samples gets large. The AIC is an inconsistent estimator of the true dimensionality, which is visible in nearly all subsequent experiments. The corrected MIBS criterion, abbreviated as corMIBS in Figures 1(a)-(h), has excellent recovery rates for both small and large numbers of sampled data. Notably, it gives far better results than the original MIBS criterion for all sample sizes. MDL turns out to be less accurate than the corrected and the original MIBS criterion but outperforms the BIC. The OVPCA shows the worst performance when there are fewer than 100 data samples but catches up to BIC and even MDL when we further increase the sample size.
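A recovery-rate simulation of this kind can be sketched as follows. For brevity the sketch scores only an MDL-style criterion in the Wax-Kailath form (an assumption of this sketch), whereas the paper compares six criteria, and it uses fewer replications than the 1000 reported above.

```python
import numpy as np

rng = np.random.default_rng(42)
# Experiment 1 setup: 5 signal directions, 5 noise directions
variances = np.array([10, 8, 6, 4, 2, 1, 1, 1, 1, 1], dtype=float)
d, true_k, n, reps = 10, 5, 500, 100

def mdl_pick(l, n):
    """Pick k minimising an MDL-style score; l descending."""
    d = l.size
    scores = []
    for k in range(1, d):
        tail = l[k:]
        rho = np.exp(np.mean(np.log(tail))) / np.mean(tail)
        scores.append(-n * (d - k) * np.log(rho)
                      + 0.5 * k * (2 * d - k) * np.log(n))
    return int(np.argmin(scores)) + 1

hits = 0
for _ in range(reps):
    X = rng.normal(size=(n, d)) * np.sqrt(variances)   # diagonal Gaussian
    l = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
    hits += (mdl_pick(l, n) == true_k)
rate = hits / reps   # fraction of runs recovering the true dimensionality
```

Plotting such rates over a grid of sample sizes for each criterion reproduces the kind of curves shown in Figure 1.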
The second experiment, displayed in Figure 1(b), is the same as the first except that the number of noise dimensions is changed from 5 to 10. Again, the corrected MIBS criterion performs best and gives higher recovery rates than the original MIBS criterion. Compared to the first experiment, the MDL criterion and BIC need many more samples to reach a high recovery rate. This effect is far smaller for both MIBS criteria and also for the OVPCA criterion, which leads to superior results compared to MDL and BIC in this experiment.
The third experiment differs from the second experiment only by the fact that here we have a noise variance of 0.5 instead of 1. Compared to the second experiment, all criteria need less data to reach high recovery rates. The corrected MIBS criterion is the top performer for all sample sizes; however, the difference to the original MIBS criterion is smaller than in the first two experiments. Again, OVPCA is weak when the number of samples is small. Detailed results are shown in Figure 1(c).
In the fourth experiment the noise variance is reduced even more and set equal to 0.1. As can be seen in Figure 1(d), all criteria yield recovery rates higher than 95% when the sample size exceeds 25. The corrected MIBS criterion shows a far better performance than the original MIBS criterion for the very small sample sizes of 8 and 11. Note that this is the only experiment in which the BIC gives slightly better results than the MDL criterion. OVPCA is as good as the MIBS criteria when the number of samples is 17.
The fifth experiment is the same as the third except that here we have 20 noise dimensions instead of 10. The observations already made in the second experiment also apply here. Figure 1(e) shows that the MDL criterion and BIC perform weakly compared to the third experiment due to the added noise dimensions. The MIBS criteria and the OVPCA criterion are much less affected. Again, the corrected MIBS criterion gives the best results, especially for small sample sizes. The sixth and the seventh experiment, displayed in Figure 1(f)-(g), examine the case where the number of noise dimensions is 95 and, therefore, very large compared to the true dimensionality of the data, which is equal to 5. When the data dimension is large, the OVPCA is a very time-consuming method; for this reason we omit it for these two experiments. The corrected MIBS criterion is the most accurate method, followed by the original MIBS criterion. The AIC shows good results for sample sizes larger than 100. As we would expect, both MDL and BIC show a poor performance because of the large data dimensionality.
The last experiment is designed to test the case in which the true dimensionality is larger than the number of noise dimensions. The data is generated from a five-dimensional Gaussian distribution with variances [10 8 6] in the first three directions. The remaining two directions have a variance equal to 3. Again, the corrected MIBS criterion leads to the best results. However, in this experiment the difference in terms of recovery rate between the corrected MIBS criterion and all other criteria is more striking. The original MIBS criterion and the MDL criterion perform equally well. The BIC and especially the OVPCA give poor results.

Analysis of Hyper-Spectral Skin Cancer Data
If spectral measurements using hundreds of narrow contiguous wavelength intervals are performed, the resulting image is called a hyper-spectral image and is often represented as a hyper-spectral image cube. In contrast to RGB images, where every pixel can be represented as a three-dimensional vector with entries corresponding to the red, green and blue channels, a hyper-spectral image contains pixels represented as multidimensional vectors with elements indicating the reflectivity at a specific wavelength. Thus, these vectors correspond to spectra in the physical meaning and are equal to spectra measured with, e.g., spectrometers.
A set of 310 hyper-spectral images (171 × 170 pixels and 270 spectral bands after preprocessing) of malign and benign lesions was taken in clinical studies at the Medical University Graz, Austria. The lesions are classified as melanoma or mole by human experts on the basis of a histological examination. Kazianka, Leitner, and Pilz (2008) used this data set and reported classification results for unobserved skin cancer images. In their paper they mention that, as the dimensionality of the data equals the number of spectral bands, using the full spectral information in classification and clustering approaches leads to high computational complexity. Moreover, the spectral bands are highly correlated and contain noise. To overcome the curse of dimensionality, PCA is used to reduce the dimensions of the data, and inherently also the unwanted noise. Preceding analysis and inspection of scores and loadings showed that the optimum number of principal components to be retained is 7 (see Kazianka, 2007).
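The dimension-reduction step can be sketched as follows, with a random array standing in for a real hyper-spectral cube (the real data have 171 × 170 pixels and 270 bands; the cube contents below are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
cube = rng.random((171, 170, 270))   # stand-in for a hyper-spectral cube

# Flatten to a (pixels x bands) matrix: each row is one pixel's spectrum
X = cube.reshape(-1, 270)

# Project onto the first 7 principal components
Xc = X - X.mean(axis=0)                    # center the spectra
S = np.cov(Xc, rowvar=False)               # 270 x 270 band covariance
l, U = np.linalg.eigh(S)                   # ascending eigenvalues
order = np.argsort(l)[::-1]                # descending order
scores = Xc @ U[:, order[:7]]              # 7-dim representation per pixel
```

Each pixel is thereby reduced from a 270-dimensional spectrum to 7 scores, which also suppresses noise concentrated in the discarded components.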
To test the performance of the AIC, BIC, MDL, MIBS and corrected MIBS criteria we selected 500 pixels from the training set. The prior parameter α for the corrected MIBS criterion is again set to α = 0.01. The dimensionality picked by each method is shown in Table 1. The original MIBS and the corrected MIBS criterion choose the number of components that was suggested by the preceding analysis. The values of the criteria for varying numbers of retained components are shown in Figure 2. As can be seen in Figure 2(a), the corrected MIBS criterion gives less probability mass to larger dimensionalities than the original formulation.

Conclusion
The corrected version of the MIBS criterion for automatically selecting the optimum data dimensionality in the probabilistic PCA model shows excellent results in a simulation study. It is the consistent top performer across all eight experiments and works especially well with small sample sizes. The proposed criterion not only outperforms the original MIBS criterion but also other well-known methods such as BIC, AIC, MDL and OVPCA. The study also reveals that the performance of the BIC and the MDL criterion strongly depends on the data dimension; both criteria give poor results when the number of dimensions is large. Besides the promising results for synthetic data, the application to the hyper-spectral skin cancer images shows that the corrected MIBS criterion is also suitable for real-world data which do not necessarily follow a Gaussian distribution.
as desired. If Λ and σ² are fixed, we can use the result in (7) and simplify p(D, U | k, Λ, σ²); if L = diag(l_1, . . ., l_k) denotes the diagonal matrix containing the first k eigenvalues of Ŝ, we can rewrite log p(D, Ẑ, Λ̂, σ̂² | k) accordingly.

Figure 1: Recovery rates for the different criteria in the simulation study. The results of experiments 1-8 are displayed in (a)-(h).

Figure 2: Values of the criteria for varying dimensionality in the skin cancer study.

Table 1: Number of components picked by the different criteria for the skin cancer data.