A Simple Method for Testing Independence of High-Dimensional Random Vectors

Abstract: A simple, data-driven and computationally efficient procedure for testing independence of high-dimensional random vectors is proposed. The procedure is based on the interpretation of testing goodness-of-fit as a classification problem, a special sequential partition procedure, elements of sequential testing, resampling and randomization. Monte Carlo simulations are carried out to assess the performance of the procedure.


Introduction
Let X := (X(1), . . . , X(N)) be a sample of size N of i.i.d. observations of a random vector X having distribution function (d.f.) F on R^d. We are interested in testing some properties of F. Let F_H and F_A be two disjoint classes of d-dimensional distributions. Consider the nonparametric hypothesis testing problem

H : F ∈ F_H versus A : F ∈ F_A. (1)

The key example is testing the independence of two components X = (X_1, X_2) of the vector X, that is,

F_H = {G : G = G_1 × G_2}, (2)

where G_1 and G_2 denote the marginal distributions of G corresponding to the components X_1 and X_2, respectively.
Our goal is to propose a relatively simple, data-driven and computationally efficient procedure for testing problem (1), with key example (2), in the case where the dimension d of X is large. The procedure is based on the idea of Vapnik and Chervonenkis (1981) of bounding the discrepancy between an empirical and the true distribution by that between two independent empirical distributions, on the well-known interpretation of testing goodness-of-fit as a classification problem (see, e.g., Hastie, Tibshirani, and Friedman, 2001, pp. 447-449), on a special sequential data partition procedure, on randomization and resampling (bootstrap), and on elements of sequential testing. Monte Carlo simulations are used to assess the performance of the procedure.
Austrian Journal of Statistics, Vol. 37 (2008), No. 1, 101-108

Thus far, there is no generally accepted methodology for multivariate nonparametric hypothesis testing. Traditional approaches are based on the empirical characteristic function (Baringhaus and Henze, 1988), on nonparametric distribution density estimators and smoothing (Bowman and Foster, 1993; Huang, 1997), and on classical univariate nonparametric statistics calculated for data projected onto directions found via projection pursuit (L. Zhu, Fang, and Bhatti, 1997; Szekely and Rizzo, 2005).
A more advanced technique is based on Vapnik-Chervonenkis theory, the uniform functional central limit theorem and inequalities for large deviation probabilities (Vapnik, 1998; Bousquet, Boucheron, and Lugosi, 2004). Recently, especially in applications, the Bayes approach and Markov chain Monte Carlo methods have been widely used (see, e.g., Verdinelli and Wasserman, 1998, and references therein). Multidimensional copulas are a convenient way to represent the statistical dependence between components of random vectors; therefore the asymptotic behavior and power of independence tests based on empirical copula processes have been studied extensively (see, e.g., Genest and Remillard, 2004). However, these results are not directly applicable in our setting, since the components X_1 and X_2 themselves have large dimensionality.
To identify the dependence-independence structure of high-dimensional data, independent component analysis (ICA), a recent extension of principal component analysis and projection, is employed; we refer to the monograph by Hyvärinen, Karhunen, and Oja (2001). An efficient method for testing (conditional) independence is essential here. References related to our approach are Szekely and Rizzo (2006), Polonik (1999), and L.-X. Zhu and Neuhaus (2000).
In Section 2 the procedure of nonparametric hypothesis testing is introduced. The Monte Carlo simulation results and concluding remarks are presented in the last section.

Test Statistic
Let F := F_H ∪ F_A. Suppose that the mapping Ψ : F → F_H satisfies Ψ(F) = F if and only if F ∈ F_H; in the key example (2), Ψ(G) = G_1 × G_2. Consider a mixture model: fix a prior probability p ∈ (0, 1) and set F^(p) := (1 − p)F_H + pF.
Here f and f_H denote the distribution densities (with respect to a σ-finite measure µ) of F and F_H, respectively. Let us introduce a loss function ℓ(F, F_H), possibly dependent on Y, let P := {P_k, k = 0, 1, . . . , K} be a sequence of nested partitions of the sample space, and let {A_k, k = 0, 1, . . . , K} be the corresponding sequence of σ-algebras generated by these partitions.

Remark 1. A computationally efficient choice of P is the sequential dyadic coordinatewise partition minimizing at each step the mean square error, with some restrictions (bounds from below) on the number of the sample Y elements in the partition sets. An alternative might be a partition into sets with approximately equal numbers of the sample Y elements.
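Remark 1 describes the partition only informally. The following Python sketch is our illustration, not the authors' implementation (all function names are ours): it implements the equal-count variant, repeatedly splitting the most populated cell at a coordinate median, cycling through the coordinates, with a lower bound on the number of sample points per cell.

```python
import numpy as np

def sequential_coordinatewise_partition(Y, K, min_count=5):
    """Sequentially refine a coordinatewise partition of the sample Y:
    at each of K steps, split the most populated cell at the sample
    median of one coordinate (cycled), subject to a lower bound on the
    number of sample points per cell.  Returns index arrays, one per cell."""
    cells = [np.arange(len(Y))]                 # start with a single cell
    for step in range(K):
        i = max(range(len(cells)), key=lambda j: len(cells[j]))
        idx = cells[i]
        if len(idx) < 2 * min_count:            # splitting would violate bound
            break
        coord = step % Y.shape[1]               # cycle through the coordinates
        cut = np.median(Y[idx, coord])
        left = idx[Y[idx, coord] <= cut]
        right = idx[Y[idx, coord] > cut]
        if len(left) < min_count or len(right) < min_count:
            continue
        cells[i:i + 1] = [left, right]          # replace the cell by its halves
    return cells

rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 3))
cells = sequential_coordinatewise_partition(Y, K=10)
```

Each successful split increases the number of cells by one, so K = 10 splits of a single starting cell yield at most 11 cells.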
In view of the definition of the loss function ℓ(F, F_H), a natural choice of test statistic is a χ²-type statistic T_k, (3), where E stands for the expectation with respect to the empirical distribution F̂ of Y, for some k ∈ {1, . . . , K}. The integer k can be treated as a "smoothing" parameter: it characterizes how fine the partition is. We also consider a weighted version of (3), (4), where W_k is some A_k-measurable weight function; the choice W_k = |S ∩ Y| on the partition set S ∈ P_k yields the L2 distance between the observed frequencies and those expected under the hypothesis H. Since the optimal value of k is unknown, we prefer the test statistic

T := max_{k_0 ≤ k ≤ K} (T_k − a_k)/b_k, (5)

where k_0 ≥ 1, and a_k and b_k are centering and scaling parameters, respectively, to be specified.

Remark 2. Since the critical region of the criterion is of the form C_α := {T > c_α}, where c_α is the critical value corresponding to the significance level α, it is natural to express C_α as a sequential testing procedure:

Step 1: Set k = k_0 − 1.
Step 3: Fix the sample Y. For the partition P_k, let n(k) and ν(k) be the J_k-dimensional vectors of the frequencies observed in P_k of the elements of Y and X, respectively. Then Z^τ_k = ν^τ_j(k)/n_j(k) on the partition set S_{k,j}. Since the discrete random vector ν^τ(k) has the multivariate hypergeometric distribution with parameters N + M, n(k), and N, the conditional distribution of T^τ, given Y and the partition P, depends on Y only through the "sizes" n(k), k = 1, . . . , K, of the partition sets. This provides a basis for determining a_k and b_k in (5), for analyzing the asymptotic distribution of the statistic T, and for deriving exponential inequalities for the probabilities of large deviations of T. In this study, however, we prefer to perform a simulation experiment.
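Since the displayed formulas (3)-(5) are not reproduced in this extraction, the following Python sketch only illustrates a statistic of the stated max-over-k form under simplifying assumptions of ours: one-dimensional data, J_k = k + 1 equal-count cells of the pooled sample, and the χ² normal approximation a_k = k, b_k = sqrt(2k). It is not the authors' exact statistic.

```python
import numpy as np

def chi2_stat(obs, exp):
    """Chi-square-type distance between observed and expected cell counts."""
    mask = exp > 0
    return float(np.sum((obs[mask] - exp[mask]) ** 2 / exp[mask]))

def max_standardized_stat(x, x_null, k0=2, K=20):
    """T := max over k0 <= k <= K of (T_k - a_k)/b_k, with J_k = k + 1
    equal-count cells of the pooled sample; a_k = k and b_k = sqrt(2k)
    are the mean and standard deviation of a chi-square variable with
    k degrees of freedom (normal approximation)."""
    pooled = np.concatenate([x, x_null])
    best = -np.inf
    for k in range(k0, K + 1):
        edges = np.quantile(pooled, np.linspace(0.0, 1.0, k + 2))
        edges[0], edges[-1] = -np.inf, np.inf   # J_k = k + 1 cells
        obs, _ = np.histogram(x, bins=edges)
        null_counts, _ = np.histogram(x_null, bins=edges)
        exp = null_counts * len(x) / len(x_null)   # expected under H
        best = max(best, (chi2_stat(obs, exp) - k) / np.sqrt(2.0 * k))
    return best

rng = np.random.default_rng(1)
x_null = rng.normal(size=500)                   # reference sample from F_H
t_h = max_standardized_stat(rng.normal(size=500), x_null)
t_a = max_standardized_stat(rng.normal(loc=1.0, size=500), x_null)
```

Under the null the standardized statistic stays moderate, while a location shift inflates it sharply, which is the behavior the max-over-k construction exploits.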

Testing the Independence: A Simulation Experiment
To generate a sample from F_H = F_1 × F_2 we apply the bootstrap method and resample from the distribution F̂_H := Ψ(F̂) = F̂_1 × F̂_2, where F̂_i denotes the empirical distribution of F_i, i = 1, 2.
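The resampling step above admits a direct sketch (our code; the helper name and the block split at d_1 are ours): each block of coordinates is bootstrapped independently, so the resulting sample follows F̂_1 × F̂_2, preserving the empirical marginals while destroying any dependence between the blocks.

```python
import numpy as np

def resample_independent(X, d1, M, rng=None):
    """Bootstrap sample from hat F_H = hat F_1 x hat F_2: the first d1
    coordinates (block X_1) and the remaining coordinates (block X_2)
    are resampled with independent index vectors."""
    rng = np.random.default_rng(rng)
    i = rng.integers(0, len(X), size=M)     # bootstrap indices for block X_1
    j = rng.integers(0, len(X), size=M)     # independent indices for block X_2
    return np.hstack([X[i, :d1], X[j, d1:]])

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 2))
X[:, 1] = X[:, 0]                           # perfectly dependent components
XH = resample_independent(X, 1, 2000, rng=2)
```

Even for perfectly dependent input components, the resampled blocks are uncorrelated up to sampling error.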
Let X be a sample of repeated independent observations of X having the standard multivariate Student distribution with m degrees of freedom. Although the components of X are uncorrelated, they are dependent. Since X converges in distribution to a standard normal random vector as m → ∞, the dependence of the components vanishes for large m.
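Such a sample can be generated by the usual normal-over-chi-square construction (a standard fact, not specific to this paper): the common divisor W is shared by all coordinates, which makes them uncorrelated yet dependent, and the dependence weakens as m grows.

```python
import numpy as np

def standard_multivariate_t(N, d, m, rng=None):
    """Sample N observations of the standard multivariate Student vector
    X = Z / sqrt(W / m), Z ~ N(0, I_d), W ~ chi^2_m.  The divisor W is
    shared by all coordinates, so they are uncorrelated but dependent;
    as m -> infinity, X approaches a standard normal vector."""
    rng = np.random.default_rng(rng)
    Z = rng.standard_normal((N, d))
    W = rng.chisquare(m, size=(N, 1))       # one chi-square per observation
    return Z / np.sqrt(W / m)

X = standard_multivariate_t(5000, 3, 5, rng=0)
```

The dependence shows up, for example, as positive correlation between squared coordinates, even though the linear correlation is (up to sampling error) zero.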
The centering and scaling parameters for the statistics T_k are calculated using approximations by the normal distribution. In what follows it is assumed that M = N and J_k ≡ k + 1. Thus the standardized versions T̃_k and T̃^(2)_k of the χ²-type statistic (3) and of the L2 distance defined by (4) with the weight function W_k = |S ∩ Y|, respectively, are given by (6) and (7). In the sequel the test statistic T, (5), based on (6) is considered; as compared with (7), it places greater weight on the partition sets with a smaller number of sample elements. Our experience suggests that k_0 = 10 and K = (M + N)/5 is an appropriate choice for the maximization interval [k_0, K] of T_k. The critical value c_α for the test is chosen in such a way that the proportion of samples for which the valid null hypothesis is rejected does not exceed a given significance level α, say α = 0.05. The Monte Carlo method is used to find c_α. Preliminary results of Monte Carlo simulations show that for a wide range of dimensions, sample sizes and null distributions the behavior of the test statistics for samples from the null distribution (control data) is quite similar (see Figure 1 and Figure 2). In particular, we have also applied the procedure to testing goodness-of-fit for a mixture of multivariate Gaussian distributions. The choice c_{0.05} = 2.7 is admissible for most cases. In the sequel, the test procedure based on T, (5), is referred to as the JRS test for brevity. The performance of the procedure is compared with the classical independence criterion of Blum, Kiefer, and Rosenblatt (BKR).

Recall the mixture model from Section 2: fix p and let Y = Y^(p) ∼ F^(p) denote a random vector (r.v.) with the mixture distribution F^(p) of the two populations Ω_H and Ω with d.f. F_H and F, respectively, and let π(Y) denote the posterior probability of the population Ω given Y.
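The Monte Carlo calibration of c_α described above can be sketched as follows (our code; the toy statistic in the usage example is ours and is chosen only because its 0.95 quantile under the null is known to be about 1.96).

```python
import numpy as np

def critical_value(stat_fn, null_sampler, alpha=0.05, R=400, rng=None):
    """Monte Carlo choice of the critical value c_alpha: simulate R samples
    from the null distribution, evaluate the test statistic on each, and
    take the empirical (1 - alpha) quantile, so that the fraction of null
    samples rejected is approximately alpha."""
    rng = np.random.default_rng(rng)
    stats = np.array([stat_fn(null_sampler(rng)) for _ in range(R)])
    return float(np.quantile(stats, 1.0 - alpha))

# toy statistic: sqrt(n) * |sample mean|, approximately |N(0, 1)| under H
c = critical_value(
    stat_fn=lambda x: abs(x.mean()) * np.sqrt(len(x)),
    null_sampler=lambda rng: rng.normal(size=100),
    alpha=0.05, R=400, rng=4,
)
```

The same routine applies to the statistic T, with `null_sampler` replaced by resampling from F̂_H.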

Figure 1: Maximum, minimum and two-sided 0.9 confidence limits of T_k for a sample from the Cauchy distribution (m = 1) and for the corresponding control data; d = 20, d_1 = d_2 = 10, N = 1000.

Figure 2: Maximum, minimum and two-sided 0.9 confidence limits of T_k for a sample from the Student distribution with m = 3 degrees of freedom and for the corresponding control data; d = 10, d_1 = 1, d_2 = 9, N = 1000.

The computer simulations are performed for d ≤ 20, 200 ≤ N ≤ M ≤ 1000, and m = 1, . . . , 7, 25, 100, ∞. The dimensions d_1 and d_2 of the independent components X_1 and X_2, respectively, are chosen in two ways: in the first case d_1 = d_2 = d/2, and in the second case d_1 = 1, d_2 = d − 1. The typical number of simulations is R = 1000. Below only the results for d = 2, 10 and N = M = 1000 are presented.
G. Jakimauskas et al. 103

The posterior probability π(Y^(p)) is equal to the prior probability p if and only if F = F_H. Let X^(H) := (X^(H)(1), . . . , X^(H)(M)) be a sample of size M of i.i.d. random vectors from Ω_H, independent of X. The joint sample is denoted by Y := (X, X^(H)).

Let τ : I → I be a random permutation of I := {1, . . . , N + M} chosen with equal probabilities, and let Y^τ denote the corresponding permutation of Y. For any statistic ξ, let ξ^τ indicate that this statistic is calculated for the randomized sample Y^τ; in particular, X^τ = (Y^τ(1), . . . , Y^τ(N)). Under the hypothesis H, Y^τ = Y in distribution. Therefore one can deal with the conditional distribution of the randomized test statistic T^τ_k given the sample Y in order to assess the properties of the initial test statistic T_k.
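The randomization step can be sketched as follows (our code, for a one-dimensional chi-square-type cell statistic; the guard against small expected counts is ours): the permutation distribution of the statistic is obtained by repeatedly permuting the pooled sample Y and recomputing the statistic from the first N entries.

```python
import numpy as np

def randomized_statistics(y, n, cell_edges, n_perm=200, rng=None):
    """Permutation (randomization) distribution of a chi-square-type cell
    statistic given the pooled sample y: permute y, treat the first n
    entries as X^tau, and compare its cell frequencies with the expectation
    n_j(k) * N / (N + M) implied by the pooled cell sizes."""
    rng = np.random.default_rng(rng)
    pooled_counts, _ = np.histogram(y, bins=cell_edges)
    expected = pooled_counts * n / len(y)           # depends on y only via sizes
    stats = np.empty(n_perm)
    for b in range(n_perm):
        y_tau = rng.permutation(y)                  # Y^tau
        obs, _ = np.histogram(y_tau[:n], bins=cell_edges)
        # each obs count is hypergeometric(N + M, n_j(k), N) under H
        stats[b] = np.sum((obs - expected) ** 2 / np.maximum(expected, 1.0))
    return stats

rng = np.random.default_rng(3)
y = rng.normal(size=400)                            # pooled sample, N = M = 200
edges = np.quantile(y, np.linspace(0.0, 1.0, 11))   # 10 equal-count cells
edges[0], edges[-1] = -np.inf, np.inf
stats = randomized_statistics(y, 200, edges, n_perm=100, rng=3)
```

The empirical quantiles of `stats` can then serve as conditional critical values for the initial statistic, exactly in the spirit of the randomization argument above.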

Table 1: The power of the tests BKR and JRS.