Robust Independent Component Analysis Based on Two Scatter Matrices

It has been shown that, under general assumptions, any two scatter matrices with the so-called independent components property can be used to estimate the unmixing matrix in independent component analysis (ICA). The method is a generalization of Cardoso's (1989) FOBI estimate, which uses the regular covariance matrix and a scatter matrix based on fourth moments. Different choices of the two scatter matrices are compared in a simulation study. Based on the study, we recommend the use of two robust scatter matrices in all cases. For possibly asymmetric independent components, symmetrized versions of the scatter matrix estimates should be used.


Introduction
Let x_1, x_2, ..., x_n be a random sample from a p-variate distribution, and write X = (x_1 x_2 ... x_n) for the p × n data matrix. We assume that X is generated by

X = AZ,

where Z = (z_1 z_2 ... z_n), the z_1, ..., z_n are independent and identically distributed latent random vectors having independent components, and A is a full-rank p × p mixing matrix. This model is called the independent component (IC) model. The model is not well defined in the sense that it may also be written as

X = A*Z*,

where A* = AP'D^{-1} and Z* = DPZ for any diagonal matrix D (with nonzero diagonal elements) and any permutation matrix P. (A permutation matrix P is obtained from the identity matrix I_p by permuting its rows.) If Z has independent components, then so has Z* = DPZ. The problem in so-called independent component analysis (ICA) is to find an unmixing matrix B such that Bx_i has independent components. By the discussion above, the solution is then not unique: if B is an unmixing matrix, then so is DPB.
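The indeterminacy can be checked numerically. The following sketch (a toy example with hypothetical data; all variable names are our own) generates data from the IC model and verifies that the rescaled, permuted pair (A*, Z*) reproduces X exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 3, 5

# Latent matrix Z whose rows are independent components (toy data).
Z = rng.laplace(size=(p, n))
A = rng.normal(size=(p, p))              # full-rank p x p mixing matrix
X = A @ Z                                # the IC model: X = A Z

# The model is only identified up to scaling and permutation of the rows of Z:
D = np.diag([2.0, -0.5, 3.0])            # any diagonal D with nonzero diagonal
P = np.eye(p)[[2, 0, 1]]                 # a permutation matrix
A_star = A @ np.linalg.inv(D @ P)        # = A P' D^{-1}
Z_star = D @ P @ Z                       # still has independent components

print(np.allclose(A_star @ Z_star, X))   # True
```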
Most ICA algorithms then proceed as follows. (For a recent review of different approaches, see Hyvärinen, Karhunen, and Oja, 2001.)

Austrian Journal of Statistics, Vol. 37 (2008), No. 1, 91-100

1. To simplify the problem, it is first commonly assumed that the x_i are whitened so that E(x_i) = 0 and cov(x_i) = I_p. Then X = UZ* with an orthogonal matrix U and Z* with (columns having) independent components such that E(z_i^*) = 0 and cov(z_i^*) = I_p.

2. For the whitened data X, find a p × r matrix U with orthonormal columns (r ≤ p) which maximizes (or minimizes) a chosen criterion function, say g(U'X). Measures of marginal nongaussianity (negentropy, kurtosis measures) g(u'X) and likelihood functions with different choices of marginal distributions are often used.
In the FastICA algorithm (Hyvärinen and Oja, 1997), for example, in each iteration step (for stage 2) the columns of U are updated one by one and then orthogonalized. The FastICA criterion maximizes the negentropy, which is approximated by

J(u'x) ≈ [E{h(u'x)} − E{h(z)}]^2,    (1)

with z ~ N(0, 1) and with several possible choices for the function h(·).
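The negentropy approximation can be illustrated with a small Monte Carlo sketch (all function names are our own; we use the common choice h(y) = log cosh(y)): the approximate negentropy is near zero for a Gaussian sample and clearly positive for a non-Gaussian one.

```python
import numpy as np

rng = np.random.default_rng(1)

def h(y):
    return np.log(np.cosh(y))            # one common choice of h

# Monte Carlo reference value E h(z) for z ~ N(0, 1)
E_h_z = h(rng.normal(size=200_000)).mean()

def negentropy_approx(y):
    """Approximate negentropy [E h(y) - E h(z)]^2 of a standardized sample y."""
    return (h(y).mean() - E_h_z) ** 2

gauss = rng.normal(size=200_000)                               # Gaussian, unit variance
unif = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=200_000)  # uniform, unit variance

# Near zero for the Gaussian sample, clearly positive for the uniform one.
print(negentropy_approx(gauss), negentropy_approx(unif))
```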
A different solution to the ICA problem, called FOBI, was given by Cardoso (1989): after whitening the data as above (stage 1), an orthogonal matrix U is found as the matrix of eigenvectors of a kurtosis matrix (a matrix of fourth moments; this will be discussed later). The data transformation amounts to a joint diagonalization of the regular covariance matrix and of the scatter matrix based on fourth moments. FOBI was generalized in Oja et al. (2006) (real data) and Ollila et al. (2007) (complex data), where any two scatter matrices which have the so-called independent components property can be used. An interesting question then naturally arises: how should one choose these two scatter matrices in a good or optimal way?
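A minimal FOBI sketch, assuming the standard formulation (whiten with the regular covariance matrix, then eigendecompose the matrix of fourth moments of the whitened data); the function name and test data are our own:

```python
import numpy as np

def fobi(X):
    """FOBI unmixing matrix for a p x n data matrix X (a sketch)."""
    p, n = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)
    # Stage 1: whitening with the regular covariance matrix.
    evals, evecs = np.linalg.eigh(Xc @ Xc.T / n)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T   # symmetric inverse square root
    Y = W @ Xc
    # Stage 2: eigenvectors of the matrix of fourth moments of the whitened data.
    r2 = np.sum(Y ** 2, axis=0)                    # squared norms of whitened points
    S2 = (Y * r2) @ Y.T / n
    _, U = np.linalg.eigh(S2)
    return U.T @ W                                 # unmixing matrix B

# Mix two independent sources with distinct kurtosis values and recover them.
rng = np.random.default_rng(2)
Z = np.vstack([rng.uniform(-1.0, 1.0, 5000), rng.laplace(size=5000)])
A = np.array([[2.0, 1.0], [1.0, 3.0]])
B = fobi(A @ Z)
G = B @ A    # close to a rescaled permutation matrix if unmixing succeeded
```

FOBI requires the components to have distinct kurtosis values, which is why a uniform and a Laplace source are used here.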
The paper is organized as follows. First, in Section 2, scatter matrices and their use in the estimation of an unmixing matrix are reviewed. In Section 3 we describe the results of simulation studies in which new ICA estimates with several choices of scatter matrices are compared to the classical FastICA and FOBI estimates. An image analysis example is also given. The paper ends with some conclusions in Section 4.

Two Scatter Matrices and ICA
Let x be a p-variate random vector with cdf F_x. A functional T(F) is a p-variate location vector if it is affine equivariant in the sense that T(F_{Ax+b}) = AT(F_x) + b for all x, all full-rank p × p matrices A, and all p-variate vectors b. Using the same notation, a matrix-valued p × p functional S(F) is called a scatter matrix if it is positive definite, symmetric, and affine equivariant in the sense that S(F_{Ax+b}) = AS(F_x)A' for all x, A, and b. The regular mean vector E(x) and covariance matrix Cov(x) serve as first examples. There are numerous alternative techniques for constructing location and scatter functionals, e.g. M-functionals, S-functionals, etc.; see, e.g., Maronna, Martin, and Yohai (2006).
A scatter matrix S(F) is said to have the independent components (IC-) property if S(F_z) is a diagonal matrix for all z having independent components. The covariance matrix naturally has the IC-property. Other classical scatter functionals (M-functionals, S-functionals, etc.) developed for elliptical distributions do not generally possess the IC-property. However, if z has independent and symmetrically distributed components, then S(F_z) is a diagonal matrix for all scatter functionals S. It is therefore possible to develop a symmetrized version of a scatter matrix S(F), say S_sym(F), which has the IC-property; just define

S_sym(F_x) = S(F_{x_1 − x_2}),

where x_1 and x_2 are two independent copies of x. See Oja et al. (2006), Ollila et al. (2007), and Sirkiä, Taskinen, and Oja (2007).
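The sample version of the symmetrization device can be sketched as follows (function names are our own). As a sanity check, applying a second-moment "scatter" to all pairwise differences reproduces exactly twice the sample covariance matrix:

```python
import numpy as np

def symmetrized_scatter(X, scatter):
    """Apply a scatter statistic to all pairwise differences x_i - x_j.

    X is p x n; 'scatter' maps a p x m data matrix to a p x p matrix.
    Using all O(n^2) pairs is what makes symmetrized estimates
    computationally expensive.
    """
    p, n = X.shape
    D = X[:, :, None] - X[:, None, :]        # p x n x n array of differences
    return scatter(D.reshape(p, n * n))

# Sanity check with the second-moment scatter: pairwise differencing
# doubles the covariance matrix.
rng = np.random.default_rng(3)
X = rng.normal(size=(3, 400)) * np.array([[1.0], [2.0], [0.5]])
mom = lambda Y: Y @ Y.T / Y.shape[1]         # second-moment "scatter"
S_sym = symmetrized_scatter(X, mom)
C = np.cov(X, ddof=0)
print(np.allclose(S_sym, 2 * C))             # True
```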
An alternative approach to ICA using two scatter matrices with the IC-property (Oja et al., 2006; Ollila et al., 2007) has the following two steps:

1. The x_i are whitened using S_1 (instead of the covariance matrix) so that S_1(F_{x_i}) = I_p. Then X = UZ* with an orthogonal matrix U and Z* with (columns having) independent components such that S_1(F_{z_i^*}) = I_p.

2. For the whitened data X, find an orthogonal matrix U as the matrix of eigenvectors of S_2(F_{x_i}).
The resulting data transformation X → B̂X then jointly diagonalizes S_1 and S_2 (S_1(B̂X) = I_p and S_2(B̂X) = D), and the unmixing matrix B̂ solves

S_1(F_{B̂x}) = I_p and S_2(F_{B̂x}) = D.

The matrix B̂ is the matrix of eigenvectors, and the diagonal matrix D the matrix of eigenvalues, of S_2^{-1}S_1. Note the similarity between this ICA procedure and principal component analysis (PCA): the direction u of the first eigenvector of S_2^{-1}S_1 maximizes the criterion function (u'S_1u)/(u'S_2u), which is a measure of kurtosis (a ratio of two scale measures) rather than a measure of dispersion (as in PCA) in the direction u, and so on. The independent components are then ordered according to this specific kurtosis measure. The solution is unique if the eigenvalues of S_2^{-1}S_1 are distinct. Different choices of S_1 and S_2 naturally yield different estimates B̂: the resulting independent components B̂X are rescaled by S_1 and given in an order determined by S_2. The statistical properties of the estimates B̂ (convergence, limiting distributions, efficiency, robustness) also depend on the choices of S_1 and S_2.
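The joint diagonalization step can be sketched as follows (the rows of B are suitably rescaled eigenvectors of S_2^{-1}S_1; the function name and the generic demo matrices are our own):

```python
import numpy as np

def two_scatter_unmixing(S1, S2):
    """Candidate unmixing matrix B with B S1 B' = I_p and B S2 B' diagonal.

    The rows of B are eigenvectors of S2^{-1} S1, rescaled so that the
    S1-transform becomes the identity (a sketch of the joint
    diagonalization step).
    """
    evals, V = np.linalg.eig(np.linalg.inv(S2) @ S1)
    # Eigenvalues/-vectors are real for a pair of symmetric positive
    # definite matrices; drop numerical imaginary parts.
    evals, V = np.real(evals), np.real(V)
    B = V.T
    scale = np.sqrt(np.einsum('ij,jk,ik->i', B, S1, B))   # diag of B S1 B'
    B = B / scale[:, None]                   # enforce B S1 B' = I_p
    return B[np.argsort(-evals)]             # order by the kurtosis measure

# Demo with two generic symmetric positive definite "scatter" matrices.
rng = np.random.default_rng(4)
p = 4
M1, M2 = rng.normal(size=(p, p)), rng.normal(size=(p, p))
S1 = M1 @ M1.T + p * np.eye(p)
S2 = M2 @ M2.T + p * np.eye(p)
B = two_scatter_unmixing(S1, S2)
D = B @ S2 @ B.T                             # should be (nearly) diagonal
```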
Performance Study

The Estimates B̂ to be Compared
We now study the behavior of the new estimates B̂ with different (robust and non-robust) choices for S_1 and S_2. The classical FastICA procedures which use

h_1(y) = log cosh(y) and h_2(y) = −exp(−y^2/2)

in equation (1) serve as a reference; these algorithms will be denoted FastICA1 and FastICA2, respectively. According to Hyvärinen and Oja (2000), these choices are more robust than the traditional negentropy estimate with criterion h(y) = y^4. The FOBI estimate by Cardoso (1989) assumes that the centering is done using the mean vector, and

S_1(F_x) = Cov(x) and S_2(F_x) = (1/(p+2)) E[r^2 (x − E(x))(x − E(x))'], with r^2 = (x − E(x))'Cov(x)^{-1}(x − E(x)).

Then S_2 is a scatter matrix based on the fourth moments, both S_1 and S_2 possess the IC-property, and the independent components are ordered with respect to their classical kurtosis measure. The FOBI estimate is a member of the new class of estimates but is highly non-robust due to the choices of S_1 and S_2.
In our simulation study we consider scatter matrices which are (unsymmetrized and symmetrized) M-functionals. Simultaneous M-functionals for location and scatter corresponding to chosen weight functions w_1(r) and w_2(r) are functionals T(F) and S(F) which satisfy the implicit equations

T(F_x) = E[w_1(r)x] / E[w_1(r)] and S(F_x) = E[w_2(r)(x − T(F_x))(x − T(F_x))'],

where r is the Mahalanobis distance between x and T(F_x), i.e.

r^2 = (x − T(F_x))'S(F_x)^{-1}(x − T(F_x)).
In this paper we consider Huber's M-estimator (Maronna et al., 2006) with

w_1(r) = min{1, c/r} and w_2(r) = min{1, c^2/r^2}/σ^2.

The tuning constant c is chosen to satisfy q = Pr(χ^2_p ≤ c^2) and the scaling factor σ^2 so that E[χ^2_p w_2(χ_p)] = p. Tyler's shape matrix (Tyler, 1987) is often called the most robust M-estimator. Tyler's shape matrix and the simultaneous spatial median estimate (see Hettmansperger and Randles, 2002) have the weight functions

w_1(r) = 1/r and w_2(r) = p/r^2.

Symmetrized versions of Huber's estimate and Tyler's estimate then possess the IC-property. The symmetrized version of Tyler's shape matrix is also known as Dümbgen's shape matrix (Dümbgen, 1998).
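A fixed-point sketch of Tyler's shape matrix with the weight w_2(r) = p/r^2 (for brevity the location is fixed at the origin rather than estimated by the spatial median; the function name and demo data are our own):

```python
import numpy as np

def tyler_shape(X, iters=200, tol=1e-9):
    """Tyler's shape matrix for a p x n data matrix X, location fixed at 0.

    Fixed-point iteration for V = (p/n) sum_i x_i x_i' / (x_i' V^{-1} x_i),
    normalized so that trace(V) = p.
    """
    p, n = X.shape
    V = np.eye(p)
    for _ in range(iters):
        r2 = np.einsum('ji,jk,ki->i', X, np.linalg.inv(V), X)  # Mahalanobis^2
        V_new = (p / n) * (X / r2) @ X.T
        V_new *= p / np.trace(V_new)             # fix the trace normalization
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

# Elliptical data with very heavy tails: Gaussian directions rescaled by
# random exponential radii.  Tyler's estimate still recovers the shape.
rng = np.random.default_rng(5)
Sigma = np.array([[2.0, 1.0], [1.0, 1.5]])
X = (np.linalg.cholesky(Sigma) @ rng.normal(size=(2, 4000))) * rng.exponential(size=4000)
V = tyler_shape(X)
shape_true = Sigma * (2 / np.trace(Sigma))       # Sigma normalized to trace p = 2
```

Because the weight p/r^2 depends on the data only through directions, the estimate is unaffected by the heavy-tailed radial rescaling, which is the sense in which it is highly robust.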
In this simulation study we compare

• FastICA1 and FastICA2 estimates,
• E1: the FOBI estimate,
• E2: the estimate based on the covariance matrix and Tyler's shape matrix,
• E3: the estimate based on Tyler's shape matrix and the covariance matrix,
• E4: the estimate based on Tyler's shape matrix and Dümbgen's shape matrix,
• E5: the estimate based on Tyler's shape matrix and Huber's M-estimator (q = 0.9),
• E6: the estimate based on Dümbgen's shape matrix and symmetrized Huber's M-estimator (q = 0.9).
All computations were done in R 2.4.0 (R Development Core Team, 2006); the package fastICA (Marchini, Heaton, and Ripley, 2006) was used for the FastICA solutions and the package ICS (Nordhausen, Oja, and Tyler, 2006) for the new method.

Simulation Designs
In this simulation study the independent components are all symmetrically distributed; therefore all choices of S_1 and S_2 are acceptable. The designs were as follows:

• Design I: The p = 4 independent components were generated from (i) a normal distribution, (ii) a uniform distribution, (iii) a t_3 distribution, and (iv) a Laplace distribution, respectively (all distributions scaled to unit variance). The sample sizes ranged from n = 50 to n = 2000. For each sample size we had 300 repetitions, and for each sample the elements of the mixing matrix A were generated from a N(0, 1) distribution.
• Design II: As Design I, but with outliers. The max(1, 0.01n) observations x_i with the largest L_2 norms were multiplied by s_i u_i, where s_i is +1 or −1 with probabilities 1/2 and u_i has a Uniform(1, 5) distribution. This was intended to partially destroy the dependence structure.

Performance Index
Let A be the "true" mixing matrix in a simulation and B̂ an estimate of an unmixing matrix. For any true unmixing matrix B, BA = PD with a diagonal matrix D and a permutation matrix P. Write G = (g_ij) = B̂A. The performance index (Amari, Cichocki, and Yang, 1996)

PI(G) = (1 / (2p(p − 1))) [ Σ_i ( Σ_j |g_ij| / max_k |g_ik| − 1 ) + Σ_j ( Σ_i |g_ij| / max_k |g_kj| − 1 ) ]

is then often used in comparisons. Now clearly PI(PG) = PI(G), but PI(DG) = PI(G) is not necessarily true. Therefore, for a fair comparison, we standardize and reorder the rows of B̂ = (b̂_1 ... b̂_p)' (B̂ → PDB̂) in a fixed way; for the comparison, A^{-1} is standardized in the same way. The performance index PI(G) takes values in [0, 1]; the smaller PI(B̂A) is, the better the estimate B̂.
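The index can be sketched as follows, using the common [0, 1]-normalized form of the Amari index (the paper's exact variant may differ slightly in normalization; the function name is our own):

```python
import numpy as np

def amari_index(G):
    """Amari performance index, normalized to [0, 1]; 0 iff G = P D."""
    G = np.abs(np.asarray(G, dtype=float))
    p = G.shape[0]
    rows = (G / G.max(axis=1, keepdims=True)).sum(axis=1) - 1
    cols = (G / G.max(axis=0, keepdims=True)).sum(axis=0) - 1
    return (rows.sum() + cols.sum()) / (2 * p * (p - 1))

# A perfect recovery up to scale and permutation scores 0 ...
PD = np.array([[0.0, -2.0, 0.0],
               [0.5,  0.0, 0.0],
               [0.0,  0.0, 3.0]])
print(amari_index(PD))                   # 0.0

# ... while a completely mixed-up G scores the maximum, 1.
print(amari_index(np.ones((3, 3))))      # 1.0
```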

Simulation Results
The results of the simulations are summarized in Figure 1 and show that in the non-contaminated case (Design I) the two versions of the FastICA algorithm dominate all estimates based on two scatter matrices. Surprisingly, in this case the FOBI estimate seems to be the worst choice of all, whereas the best is estimate E6, which is based on two symmetrized scatter matrices. The differences are minor, however. The results change considerably when outliers are added (Design II). The procedures E4, E5, and E6, based on two robust scatter matrices, are least affected by the outliers. Estimate E6, using robust symmetrized estimates, presumably has the lowest breakdown point among the robust estimates, which may explain its slightly worse behavior here. The order in which the two scatter matrices are used has no effect on the results; naturally, E2 and E3 show the same performance in the simulations.

An Example
To demonstrate the effect of outliers in a real example, we attempt to unmix three mixed images. The original images, which show a cat, a forest track, and a sheep, are all greyscale images of 130 × 130 pixels each and are part of the R package ICS. In the analysis of image data, the pixels are regarded as individuals (n = 130 × 130 = 16900), and each individual has three measurements corresponding to the three pictures (p = 3). The three pictures are first mixed with a random 3 × 3 matrix using the vector representation of the pictures. Contamination is added to the first mixed image by blackening 60 pixels in the upper right corner, which corresponds to less than 1 percent of outliers. The algorithms E5 and FastICA2 are then applied to recover the original images. To map the independent components back to a reasonable greyscale, for all independent components, values smaller than the 2.5% quantile were replaced by that quantile, and values larger than the 97.5% quantile were treated in the same way. The result is shown in Figure 2.
As can be seen, some images are negatives of the original images. This is due to the arbitrary signs of the independent components. Nevertheless, it can be observed that E5 performs better than FastICA2 even though the amount of contamination is so small. The algorithm E5 recovers the two images of the sheep and the cat well, and only in the image of the forest track is the head of the cat slightly visible. In the images recovered by FastICA2, however, none could be called well separated. The picture with the cat still shows the windows that belong to the picture with the sheep, and in the pictures of the sheep and of the forest track the head of the cat is still visible. The good performance of E5 is noteworthy here, especially considering that the images probably do not have symmetric underlying distributions. Using two robust scatter matrices with the IC-property, such as symmetrized scatter matrices, might therefore improve the result even further. However, this example, with 16900 observations and three variates, is currently too large for symmetrized scatter matrices: computing the resulting large number of pairwise differences is not feasible.

Conclusion
Based on the simulation results, we recommend the use of two robust scatter matrices in all cases. For possibly asymmetric independent components, symmetrized versions of the scatter matrix estimates should be used. Symmetrized scatter matrices are, however, based on U-statistics and are computationally expensive; n = 1,000 observations, for example, already mean almost 500,000 pairwise differences. As the image example shows, ICA problems easily involve several thousand observations, so symmetrized estimates are not yet feasible there. To relieve the computational burden, the original estimate may then be replaced by an estimate based on an incomplete U-statistic. Further investigation is needed for situations where the components are not symmetric; for asymmetric independent components, the FastICA algorithms, for example, are known to perform more poorly.
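The incomplete-U-statistic idea can be sketched by sampling a random subset of pairs instead of forming all of them (names are our own; the sanity check below uses the fact that, for a second-moment scatter, the complete symmetrized version equals exactly twice the sample covariance matrix):

```python
import numpy as np

def incomplete_symmetrized_scatter(X, scatter, m=100_000, seed=None):
    """Symmetrized scatter from a random subset of m pairwise differences
    (an incomplete U-statistic) instead of all n(n-1)/2 pairs."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    i = rng.integers(0, n, size=m)
    j = rng.integers(0, n, size=m)
    keep = i != j                                # discard self-pairs
    return scatter(X[:, i[keep]] - X[:, j[keep]])

rng = np.random.default_rng(6)
X = rng.normal(size=(3, 2000)) * np.array([[1.0], [2.0], [0.5]])
mom = lambda Y: Y @ Y.T / Y.shape[1]             # second-moment "scatter"
S_inc = incomplete_symmetrized_scatter(X, mom, m=200_000, seed=7)
# S_inc should be close to the complete symmetrized version, 2 * cov(X),
# at a fraction of the ~2,000,000-pair cost of the full U-statistic.
```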

Figure 1: Results of the simulations. The top row shows the results for Design I and the bottom row for Design II. The left column shows the mean of PI(B̂A) over the 300 repetitions and the right column boxplots of PI(G) for n = 1000. The estimates based on two scatter matrices, besides FOBI, are E2: covariance matrix & Tyler's shape matrix; E3: Tyler's shape matrix & covariance matrix; E4: Tyler's shape matrix & Dümbgen's shape matrix; E5: Tyler's shape matrix & Huber's M-estimator; and E6: Dümbgen's shape matrix & symmetrized Huber's M-estimator.

Figure 2: ICA for three pictures. The first row shows the original pictures, the second row the mixed pictures including some contamination. The third row shows the pictures recovered using two robust scatter matrices (E5), and the fourth row those recovered by the FastICA2 algorithm.