A Comparison of Three Procedures for Robust PCA in High Dimensions

In this paper we compare three procedures for robust Principal Components Analysis (PCA). The first method is called ROBPCA (see Hubert et al., 2005). It combines projection pursuit ideas with robust covariance estimation. The original algorithm for its computation is designed to construct an optimal PCA subspace of a fixed dimension k. If instead the optimal PCA subspace is searched within a whole range of dimensions k, this algorithm is not computationally efficient. Hence we present an adjusted algorithm that yields several PCA models in one single run. A different approach is the LTS-subspace estimator (see Wolbers, 2002; Maronna, 2005). It seeks for the subspace that minimizes an objective function based on the squared orthogonal distances of the observations to this subspace. It can be computed in analogy with the computation of the LTS regression estimator (see Rousseeuw and Van Driessen, 2000). The three approaches are compared by means of a simulation study.


Introduction
We compare different procedures for robust Principal Components Analysis (PCA).The first method ROBPCA (ROBust PCA) (see Hubert et al., 2005) combines projection pursuit ideas with the MCD covariance estimator (see Rousseeuw, 1984) and is extremely suitable for high-dimensional data (when the dimension p is larger than half the number of observations n).In Section 2 we describe two different algorithms for its computation.The first one corresponds with the original algorithm described in Hubert et al. (2005).It is used to find the optimal PCA-subspace of a certain fixed dimension k p. Consequently the output only includes the k eigenvectors and eigenvalues of this optimal k-dimensional subspace.When the PCA results are required for several dimensions, typically k = 1, . . ., k max , this algorithm is not very efficient as it needs to run the whole procedure for each k.However, some computations are common to all dimensions k and thus should not be repeated.This leads to the ROBPCA-k max algorithm, described in Section 2.2.
As an alternative to ROBPCA, we also consider the LTS-subspace estimator of Wolbers (2002); Maronna (2005) which seeks for a subspace that minimizes an objective function based on the squared orthogonal distances of the observations.The corresponding algorithm is described in Section 3.
In Section 4 a comparison of the three procedures is made by means of a simulation study, whereas Section 5 concludes.

The Original ROBPCA Algorithm
For a complete description of the ROBPCA method we refer to Hubert et al. (2005).Here, we only indicate the major steps of the algorithm.
The first step of ROBPCA consists of performing a singular value decomposition of the data in order to project the observations on the space spanned by themselves.If p is much larger than n this step already yields a huge dimension reduction, whereas no information is lost.
Then a measure of outlyingness is defined for every sample by projecting all the points on many univariate directions through two data points.On this line, the standardized distance of each point to the center of the data is determined.For every observation the largest distance over all directions is stored which is called the outlyingness.The h(> n/2) points with smallest outlyingness are then gathered in an h-subset H 0 and their empirical covariance matrix Σ0 is computed.The choice of h determines the robustness as well as the efficiency of the method.The larger h is taken, the more accurate ROBPCA will be, but the less robust.The default value of h is set equal to 75% of the total number of observations n.
As initial robust k-dimensional subspace estimate, ROBPCA considers the subspace V 0 spanned by the k dominant eigenvectors of Σ0 .Note that the influence function of this estimator is bounded, as shown in Debruyne and Hubert (2005).
To increase the efficiency a first reweighting is then carried out.All data points which are close to V 0 receive weight 1, whereas the observations far away from it receive zero weight.More precisely, for each observation its orthogonal distance is computed as with xi,k the projection of x i in the subspace V 0 .In Hubert et al. (2005) it is argued that the orthogonal distances to the power 2/3 are approximately normally distributed N (µ, σ 2 ).Consequently, observations with OD smaller than (μ + σΦ −1 (0.975)) 3/2 , with μ and σ robust estimates of µ and σ, are retained and their covariance matrix Σ1 is computed.An improved robust subspace estimate is now obtained as the subspace V 1 spanned by the k dominant eigenvectors of Σ1 .Note that this reweighting step is not yet described in the original paper (Hubert et al., 2005) but now has been added as it turned out to improve the results.
In the next step all the n data points are projected on V 1 .Within this subspace a slightly adapted version of the FAST-MCD algorithm (see Rousseeuw and Van Driessen, 1999) is performed in order to find a robust center and a robust covariance estimate of the projected samples.This means that the algorithm first looks for an optimal h-subset H 1 that contains the h observations of which the determinant of their empirical covariance matrix is minimal.The center μraw and scatter matrix Σraw of these h observations are calculated.Next, a second reweighting procedure is performed to increase further the efficiency of the algorithm.Weights are based on the robust distance of every point with respect to μraw and Σraw .Samples obtain a zero weight if their robust distance is too large.Regular observation receive weight 1.Finally, the reweighted center μ and covari-ance matrix Σ are determined as the classical mean and covariance matrix of the weighted observations.
The last stage of ROBPCA consists of representing μ and Σ in the original p-dimensional data space.The robust principal components then correspond with the eigenvectors of Σ.
Note that V 1 , the k-dimensional subspace spanned by the ROBPCA eigenvectors, depends on k through the first reweighting step.Hence, V 1 will in general not be nested into the k + 1-dimensional subspace that is found by applying ROBPCA with k + 1 components.Consequently, also the resulting principal components are not subsets of each other.

The Adjusted Algorithm for Several Components
When analyzing real data, we usually do not know in advance how many components k we need to select.In Hubert et al. (2005) it was suggested to make a scree plot of the eigenvalues of Σ0 , but this is a too rough approximation.A more refined technique is based on the Predicted Residual Error Sum of Squares (PRESS) value (see Joliffe, 1986), of which the robust version is defined for a certain dimension k as in Hubert and Engelen (2004) and in Engelen and Hubert (2004) : Here, the index j runs over all observations from a test set, and xj,k denotes the projection of test sample x j in the k-dimensional PCA subspace.The weights w i are added to obtain a robust PRESS-value.For more details see Hubert and Engelen (2004).If a test set is not available, a cross-validated version of the R-PRESS statistic can also be used.The optimal number of components is then selected as the k for which the R-PRESS value is small enough.
One sees from the definition of the R-PRESS value that the ROBPCA algorithm has to be run for every dimension k = 1, . . ., k max .It becomes even worse to compute the cross-validated R-PRESS as then also every observation i (i = 1, . . ., n) at a time has to be removed from the data set.This results in a very time consuming approach, especially for robust resampling algorithms such as ROBPCA.
Therefore we propose the ROBPCA-k max algorithm as a much faster alternative to ROBPCA.It proceeds as follows: 1.As the first step of ROBPCA (the singular value decomposition of the data) and the computation of the outlyingness do not depend on k, we compute it just once.We obtain again Σ0 as the covariance matrix of the h data points with smallest outlyingness.
2. Then we project the data on the k max dominant eigenvectors of Σ0 , perform the first reweighting based on the orthogonal distances, compute Σ1 and define V 1 as the k max -dimensional subspace spanned by the dominant eigenvectors of Σ1 .Within this subspace, we compute the MCD estimates of location μ and covariance Σ.
Austrian Journal of Statistics, Vol. 34 (2005), No. 2, 117-126 3. The optimal PCA subspace of dimension k is now defined as the space spanned by the k dominant eigenvectors of Σ.
We see that this approach avoids the computation of the outlyingness and the MCD covariance matrix k max times.Moreover, the principal components obtained by applying ROBPCA-k max with k components are now a subset of the components that are found by applying ROBPCA-k max with more than k components.
In Hubert and Engelen (2004) the ROBPCA-k max approach is used to obtain a fast method for the computation of the cross-validated R-PRESS value.The approximate method then needs e.g.only 19.03 seconds for a data set with n = 100 observations and p = 500 variables versus 5191.1 seconds for the naive approach, whereas the R-PRESS curves are very similar.Also robust calibration methods such as robust principal component regression and robust partial least squares benefit from ROBPCA-k max as shown in Engelen and Hubert (2005).
On the other hand, this algorithm has the disadvantage that it computes the MCD estimator in k max dimensions.This becomes less precise, less robust and more time consuming if k max is chosen too large.Of course it depends on the number of observations, but our experience is that with moderate data sets of sizes up to 100, the differences between the original ROBPCA algorithm and its adjusted version remain small if k max is at most 10 and if the curse of dimensionality is taken into account, i.e. n/k max > 5.This will also be illustrated in Section 4. Moreover in many data sets in chemometrics and bio-informatics with thousands of variables a small number of components is usually sufficient to summarize the data well.Hence we always set k max ≤ 10.
Another disadvantage of ROBPCA-k max is the estimation of the eigenvalues.The MCD covariance estimator is always multiplied with a consistency factor, which decreases with the dimension.If the optimal subspace has a dimension k opt , much smaller than k max , the estimate is not inflated enough.This affects the robust distances within the subspace, and consequently the reweighted MCD estimates.Hence, we propose to use the consistency factor based on k max /2 components instead of the factor associated with k max .But when ROBPCA-k max is used to compute R-PRESS values, which then allows to select a certain k, we apply ROBPCA for k components with the consistency factor for k-dimensional variables.

The LTS-subspace Estimator
The LTS-subspace estimator is introduced in Rousseeuw and Leroy (1987) and studied in Wolbers (2002) and in Maronna (2005).It is based on the connection between classical PCA and subspace estimation.The subspace spanned by the k first principal components (the dominant eigenvectors of the empirical covariance matrix of the data) has the property that it also minimizes the sum of the squared orthogonal distances of the observations to that subspace.From this point of view, a robust alternative can be found in searching the h-subset that minimizes the objective function: the ordered squared orthogonal distances and h defined as in ROBPCA.The resulting subspace is called the LTS-subspace (of dimension k).When k = p − 1, it coincides with robust orthogonal regression.
For its computation, an algorithm can be developed which is very similar to the FAST-LTS algorithm to compute the LTS regression estimator (see Rousseeuw and Van Driessen, 2000).It starts by drawing many random (k + 1)-subsets.A classical PCA is performed on these (k + 1) points and the orthogonal distances of all the observations are calculated with respect to this k-dimensional subspace.Then for each subspace the h points with smallest orthogonal distance are stored, and used to compute the classical PCA subspace of dimension k.Next, the orthogonal distances with respect to this subspace are calculated for all data, and the h points with the smallest orthogonal distance are again retained.This procedure is guaranteed to converge.The optimal h-subset is then defined as the hsubset which has the lowest objective function over all the obtained h-subsets.Finally classical PCA is performed on this optimal h-subset in order to obtain eigenvectors and eigenvalues.Remark that several time saving techniques as in the FAST-LTS algorithm can be included as well, but they will not be discussed here.

Simulations
In this section a simulation study is performed to gain insight in the performance of the three robust methods compared with classical PCA.Also the effect of k max on the ROBPCA-k max results is studied in more detail.
All the data are generated from the following contamination model : with the percentage outliers included in the data.This means that the bulk of the data is normally distributed with center µ = 0 and covariance matrix Σ = diag(10, 8, 2, 1, . ..),where the dots indicate eigenvalues that are negligibly small and diag represents a diagonal matrix.
By varying the values for , n, p, k, k max and μ different simulation settings are created.Here is an overview of the values assigned to these parameters : • The data matrix is n = 40 by p = 200, and n = 100 by p = 1000.
• The quantile h used in all the algorithms is set to roughly 0.75n (the default) in order to obtain an outlier resistance of approximately 25%.As we prefer to use the same h value for all algorithms, we take , n} with α = 0.75, k max = 8 if n = 40 and k max = 10 if n = 100.This formula for h yields an interpolated value between the minimal one [(n + k max + 1)/2] if α = 0.5 and the maximal value n if α = 1.Here, it implies that h = 32 if n = 40 and h = 77 if n = 100.
• In ROBPCA-k max , the maximal number of components k max is taken as 5 or 8 if n = 40 and as 7, 10 or 15 if n = 100.No higher values are considered to avoid the curse of dimensionality.
• The number of principal components is chosen as k = 2 and k = 4.For n = 40, two components account for 79% of the total variance, whereas four components explain 92% of the variance.For n = 100 these percentages are 84%, and 98% respectively.
For every setting we have constructed 100 data sets.We did not take a higher value because the LTS-estimator is rather computationally intensive, as we will see further on.
To evaluate the simulation results, we have computed the maximal angle between the space spanned by the estimated principal components and E k , where E k is the subspace spanned by the k dominant eigenvectors of Σ.Thus, E k = span{e 1 , . . ., e k } with e j the jth column of I p,k (the subscripts indicate the dimensions of the matrix).A measure to calculate this angle is proposed by Krzanowski (see Krzanowski, 1979) and is called maxsub (see Hubert et al., 2005).Let the loading matrix P p,k contain the estimated eigenvectors columnwise, then maxsub is computed as where λ k is the smallest eigenvalue of I k,p P p,k P k,p I p,k .It represents the largest angle between a vector in E k and the vector most parallel to it in the estimated PCA subspace.
To standardize this value, we have divided it by π 2 .The closer this Krzanowski measure is to 0, the more accurate the method.
From the simulation results in Tables 1-4, we conclude that at uncontaminated data both PCA and the three robust methods work well, with (not surprisingly) the best performance achieved by classical PCA.In all situations where outliers were included, we see that PCA immediately breaks down, whereas ROBPCA always yields very accurate results and clearly outperforms the two other robust approaches.These robust results indicate that the k-dimensional subspaces V 0 and V 1 are not influenced by outliers, nor is the MCD-estimator applied to the projected data in V 1 .
With 10% contamination, both LTS-subspace and ROBPCA-k max still attain robust results, but they break down in some situations with ε = 20%.With this large amount of contamination, we notice for the LTS-subspace estimator a high bias when n = 40 and k = 4, but the outliers are no longer a cause for concern when n = 100.
With ROBPCA-k max breakdown occurs in several situations where a large value of k max is taken (equal to 8 if n = 40 and to 15 when n = 100).When k max is small to moderate, we obtain low values for maxsub.From these results and the very accurate estimates of ROBPCA-k max for smaller k max -values, we can conclude that this method can be used as a robust alternative for PCA, at least when selecting k max small enough.
Finally, we compare the mean computation times (in seconds) of the three robust algorithms (on a Pentium IV with 2.40Ghz).Table 5 shows that their computation time is comparable in small dimensions, but the LTS-subspace method becomes significantly slower when the data matrix becomes larger.
Further we notice that ROBPCA is slightly faster than ROBPCA-k max , which is to be expected as ROBPCA computes the MCD only in a k-dimensional subspace.However, S.

Conclusions
We have shown that the original ROBPCA method is very robust as it can withstand many types of contamination.To save computation time (for example in order to select the number of principal components), we have constructed the fast ROBPCA-k max method.In our simulations we found that ROBPCA-k max performs almost as well as the original ROBPCA algorithm, unless k max is set too large.This makes ROBPCA-k max a valuable alternative to ROBPCA, especially to compute cross-validated statistics.We also compared these ROBPCA methods with the LTS-subspace estimator.This method appears to be less robust in some particular contamination settings, and it requires more computation time.We thus conclude that ROBPCA and ROBPCA-k max are more in favor to use.The ROBPCA method is available at the website http://www.wis.kuleuven.ac.be/stat/robust.htmlas part of LIBRA, the Matlab Library for Robust Analysis, (see Verboven and Hubert, 2005).In the future also ROBPCA-k max will be integrated into this library.

Table 4 :
The simulation results for n = 100, p = 1000 and k = 4. are required, for example the components for k = 2 and k = 4, the computation times for ROBPCA need to be cumulated, whereas ROBPCA-k max yields those components in one single run of the algorithm.

Table 5 :
The mean computation time in seconds of ROBPCA, ROBPCA-k max and the LTS-subspace.