Nonparametric Rank Tests for Independence in Opinion Surveys

Nonparametric rank tests for independence between two characteristics are commonly used in many social opinion surveys. When both characteristics are ordinal in nature, tests based on rank correlations such as those due to Spearman and Kendall are often used. The case where some ties exist has already been considered whereas Alvo and Cabilio (1995) have studied the case when there are missing values but no ties in the record. However, it frequently happens that the survey data may contain simultaneously many tied observations and/or many missing values. A naive approach is to simply discard the missing observations and then to make use of the rank correlations adjusted for ties. This approach would be less powerful as it does not fully utilize the information associated with the incomplete data set. In this article, we generalize Alvo and Cabilio’s notion of distance between two rankings to incorporate tied and missing observations, and define new test statistics based on the Spearman and Kendall rank correlation coefficients. We determine the asymptotic distribution of the Spearman test statistic and compare its efficiency with the corresponding statistic based on the naive approach. The proposed test is then applied to a real data set collected from an opinion survey conducted in Hong Kong.


Introduction
Discrete ordinal variables are common in social opinion surveys. To test for independence between two such variables, a nonparametric test based on the Spearman rank correlation is commonly used. However, survey data frequently contain missing values. For example, in a public opinion survey carried out in early 1999 in Hong Kong by the Social Science Research Centre of the University of Hong Kong, it was of interest to determine whether the age of the respondents is related to their level of satisfaction with the Policy Address of the Chief Executive of the Hong Kong Special Administrative Region. The response is an ordinal variable with seven options: 1 - very satisfied, 2 - satisfied, 3 - neutral, 4 - unsatisfied, 5 - very unsatisfied, 6 - not sure, and 7 - refuse to answer. Options 6 and 7 are classified as missing (non-response). The age at last birthday of the respondents was recorded when available. Table 1 shows the frequencies of female respondents classified by response and age group.
It can be seen from Table 1 that the problem of missing values is quite severe: about 29.3% of respondents did not respond to one or both questions. A naive approach is to simply discard the missing observations and use the classical Spearman rank test for testing the independence between the two variables (see Lehmann, 1975, p. 301).
Clearly, this approach would be less powerful, as it does not utilize the information associated with the incomplete data set. Since the responses and the ages of the respondents are classified into a few categories, the problem of ties is also very severe. Another approach consists of analyzing the data as a contingency table. However, in that case, the natural ordering which exists among the age groups and similarly among the responses would not be used. The objective of this paper is to develop rank tests for independence between two ordinal variables which can incorporate the presence of both missing values and ties. Only tests based on the Spearman and Kendall rank correlations will be considered here.
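For concreteness, the naive approach can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `midranks` and `naive_spearman`, the use of `None` to code a non-response, and the small data set are assumptions made here and are not taken from the survey. Pairs with a missing entry are discarded, midranks are assigned to the remaining values, and the Spearman correlation is computed as the Pearson correlation of the midranks.

```python
from math import sqrt

def midranks(values):
    """Return midranks (ties share the average of the ranks they occupy)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1, ..., j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def naive_spearman(x, y):
    """Spearman correlation after discarding pairs with a missing (None) entry."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    rx = midranks([a for a, _ in pairs])
    ry = midranks([b for _, b in pairs])
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined when one variable is constant")
    return sxy / sqrt(sxx * syy)

# Hypothetical coded data: age group and response, None = missing.
age = [1, 2, 2, 3, None, 4]
response = [5, 4, None, 4, 3, 2]
rho = naive_spearman(age, response)
```

In this toy data set two of the six respondents are dropped entirely, which is exactly the loss of information that motivates the tests developed in this paper.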
Rank-based correlations due to Spearman and Kendall play an important role as measures of association between two factors and as tests of independence between two random variables. When the data contain missing observations but no ties, Alvo and Cabilio (1995) proposed a new class of rank correlations based on the concept of distance between rankings and derived the corresponding asymptotic distributions of the test statistics. However, their method cannot be directly applied to the above-mentioned two-way ordinal classification problem, as the data contain many ties.
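Suppose that, in a group of $t$ respondents, each respondent is asked to assign scores on two characteristics, say A and B. Ranking the respondents (called objects hereafter) according to their scores on characteristics A and B results in two complete rankings of the $t$ objects, $\mu = (\mu(1), \ldots, \mu(t))$ and $\nu = (\nu(1), \ldots, \nu(t))$, which can be viewed as permutations of the integers $(1, 2, \ldots, t)$ if there are no ties and no missing values. The Spearman distance between $\mu$ and $\nu$, given by

$$d_S(\mu, \nu) = \frac{1}{2} \sum_{j=1}^{t} \left( \mu(j) - \nu(j) \right)^2,$$

can also be expressed in terms of a similarity measure $A_S$ as

$$d_S(\mu, \nu) = c_S - A_S(\mu, \nu),$$

where

$$A_S(\mu, \nu) = \sum_{j=1}^{t} \left( \mu(j) - \frac{t+1}{2} \right) \left( \nu(j) - \frac{t+1}{2} \right), \qquad c_S = \frac{t(t^2 - 1)}{12}.$$

The Spearman rank correlation $\rho_S$ (Spearman, 1904) can be expressed in terms of $c_S$ and $A_S$ as $\rho_S = A_S / c_S$. In an analogous manner, one may define the Kendall rank correlation with distance

$$d_K(\mu, \nu) = \sum_{i<j} \left\{ 1 - \mathrm{sgn}(\mu(i) - \mu(j)) \, \mathrm{sgn}(\nu(i) - \nu(j)) \right\},$$

where $\mathrm{sgn}(x)$ is either 1 or $-1$ depending on whether $x > 0$ or $x < 0$. In fact, the Kendall rank correlation (Kendall, 1938) is $\tau = 1 - d_K / c_K$ with $c_K = t(t-1)/2$. A detailed and complete review of the distance-based approach to the analysis of rank data is given in Alvo and Cabilio (1993).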
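The relationships between the distances, the similarity measure and the two correlations can be checked numerically for complete rankings. The sketch below assumes the usual conventions $d_S = \frac{1}{2}\sum_j (\mu(j) - \nu(j))^2$, $c_S = t(t^2-1)/12$, $d_K = \sum_{i<j}\{1 - \mathrm{sgn}\cdot\mathrm{sgn}\}$ and $c_K = t(t-1)/2$; the function names and the two example rankings are chosen here purely for illustration.

```python
def sgn(x):
    """Sign of x: 1, -1, or 0 (0 does not occur for complete rankings)."""
    return (x > 0) - (x < 0)

def spearman_distance(mu, nu):
    """Half the sum of squared rank differences."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(mu, nu))

def spearman_similarity(mu, nu):
    """Sum of products of centred ranks."""
    c = (len(mu) + 1) / 2
    return sum((a - c) * (b - c) for a, b in zip(mu, nu))

def kendall_distance(mu, nu):
    """Sum over pairs i < j of 1 - sgn(mu_i - mu_j) * sgn(nu_i - nu_j)."""
    t = len(mu)
    return sum(
        1 - sgn(mu[i] - mu[j]) * sgn(nu[i] - nu[j])
        for i in range(t)
        for j in range(i + 1, t)
    )

mu = (1, 2, 3, 4)
nu = (2, 1, 4, 3)
t = len(mu)
c_S = t * (t ** 2 - 1) / 12
c_K = t * (t - 1) / 2

d_S = spearman_distance(mu, nu)
A_S = spearman_similarity(mu, nu)
rho_S = A_S / c_S                        # Spearman rank correlation
tau = 1 - kendall_distance(mu, nu) / c_K  # Kendall rank correlation
```

For these two rankings the identity $d_S = c_S - A_S$ holds exactly, and $\tau$ agrees with the familiar (concordant minus discordant) pairs divided by $t(t-1)/2$.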
In the next section, we recall the notion of compatibility of Alvo and Cabilio (1995) and use it in Section 3 to introduce new test statistics based on the Spearman and Kendall distances when both ties and missing observations are present. The proposed tests specialize to the known test statistics of Alvo and Cabilio (1995) when there are only missing observations and to the classical Spearman and Kendall tests when only ties are present. The asymptotic distribution of the new Spearman test statistic is derived. It is also found that the two proposed tests are asymptotically equivalent. Some remarks on the asymptotic efficiency of the new Spearman test are made. In Section 4, we apply the new Spearman test to the above opinion survey data. We conclude with some remarks in Section 5.

Extensions to Incomplete Rankings with Ties
In this section, we propose two new test statistics, based on the Spearman and Kendall distances, which make use of all the data. First, we introduce two separate definitions of compatibility for missing and tied data.
Definition 1 A complete ranking of $t$ objects is said to be compatible with an incomplete ranking of a subset of $k$ of these objects, $2 \le k \le t$, if every pair of the specified $k$ objects is given the same relative ranking in both rankings.
A tied ordering of $t$ objects is partitioned into $e$ sets, $1 \le e \le t$, each containing $g_i$ objects, $g_1 + g_2 + \cdots + g_e = t$, so that the $g_i$ objects in set $i$ share the midrank $\sum_{j=1}^{i-1} g_j + (g_i + 1)/2$, $1 \le i \le e$. Such a tie pattern is denoted by $\delta = (g_1, g_2, \ldots, g_e)$.
Definition 2 A complete ranking of $t$ objects is said to be compatible with a tied ranking of these objects with tie pattern $\delta = (g_1, g_2, \ldots, g_e)$ if every pair of objects which receive distinct ranks is given the same relative ranking in both rankings.
The definition of compatibility when there are both ties and missing values is then a blend of the two previous definitions. We shall denote an incomplete ranking of $k$ out of $t$ objects with tie pattern $\delta$ by $\mu^* = (\mu^*(1), \mu^*(2), \ldots, \mu^*(k))$. At times, this $k$-vector is written as a $t$-vector in which the missing ranks are marked by a placeholder symbol. In the presence of ties, the concept of compatibility suggests that $\mu^*(i)$ be defined as the midrank of all the items tied with item $i$. We shall denote by $C_\delta(\mu^*)$ the class of complete rankings compatible with the ranking $\mu^*$.
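As a small illustration of these definitions, the sketch below computes the midranks implied by a tie pattern and counts the complete rankings compatible with an incomplete tied ranking. The counting formula $\frac{t!}{k!}\prod_i g_i!$ is a combinatorial observation added here, not a formula quoted from the paper: the $k$ ranked objects can be ordered in $\prod_i g_i!$ ways that respect the tie groups, and the remaining $t - k$ objects can then be inserted in $t!/k!$ ways. The function names are illustrative.

```python
from math import factorial, prod

def midranks_from_pattern(pattern):
    """Midrank shared by each tie group of a pattern (g_1, ..., g_e)."""
    ranks = []
    below = 0  # number of objects in earlier groups
    for g in pattern:
        ranks.append(below + (g + 1) / 2)
        below += g
    return ranks

def n_compatible(t, pattern):
    """Number of complete rankings of t objects compatible with a ranking
    of k = sum(pattern) of them having the given tie pattern."""
    k = sum(pattern)
    if k > t:
        raise ValueError("cannot rank more objects than exist")
    return factorial(t) // factorial(k) * prod(factorial(g) for g in pattern)

# Tie pattern (2, 1, 3): groups occupy ranks {1, 2}, {3}, {4, 5, 6}.
example_midranks = midranks_from_pattern([2, 1, 3])
# Four objects, three of them ranked with the first two tied.
example_count = n_compatible(4, [2, 1])
```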
The notion of distance of Alvo and Cabilio (1995) can be generalized to include both missing and tied rankings as follows.

Definition 3 The Generalized Distance
The distance between two incomplete rankings $\mu^*$ and $\nu^*$ is defined to be the average of the distances $d(\mu_i, \nu_j)$ taken over all pairs of complete rankings $\mu_i$ and $\nu_j$ compatible with $\mu^*$ and $\nu^*$, respectively. More formally, let $\kappa_1 = |C_{\delta_1}(\mu^*)|$ and $\kappa_2 = |C_{\delta_2}(\nu^*)|$ be the total numbers of complete $t$-rankings compatible with $\mu^*$ and $\nu^*$, respectively, and set $\kappa = \kappa_1 \kappa_2$. Then we have

$$d^*(\mu^*, \nu^*) = \frac{1}{\kappa} \sum_{i=1}^{\kappa_1} \sum_{j=1}^{\kappa_2} d(\mu_i, \nu_j).$$

It follows from Alvo and Cabilio (1995) that the generalized Spearman distance $d^*_S(\mu^*, \nu^*)$ can be expressed in terms of a similarity measure $A^*_S$ whose expression involves an indicator $\delta(j)$, equal to 1 if both rankings of item $j$ are not missing and 0 otherwise. Similarly, the generalized Kendall similarity measure is readily seen to be a function of pairwise scores $a_1(i, j)$ for ranking 1 and of scores for ranking 2 that are defined similarly.
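For very small $t$, the averaging in Definition 3 can be carried out by brute force, which is a useful check on any closed-form expression. The sketch below is such a check and not an efficient algorithm: it enumerates all $t!$ permutations, so it is only practical for a handful of objects. It assumes that a ranking is coded as a list in which `None` marks a missing rank and equal values mark tied objects, and that the base distance is the Spearman distance $\frac{1}{2}\sum_j(\mu(j) - \nu(j))^2$; these conventions and the function names are choices made here.

```python
from itertools import permutations

def is_compatible(complete, partial):
    """True if every pair of ranked objects with distinct ranks in `partial`
    has the same relative order in `complete`."""
    ranked = [i for i, v in enumerate(partial) if v is not None]
    return all(
        complete[a] < complete[b]
        for a in ranked
        for b in ranked
        if partial[a] < partial[b]
    )

def compatible_class(partial):
    """All complete rankings of t objects compatible with `partial`."""
    t = len(partial)
    return [p for p in permutations(range(1, t + 1)) if is_compatible(p, partial)]

def spearman_distance(mu, nu):
    return 0.5 * sum((a - b) ** 2 for a, b in zip(mu, nu))

def generalized_distance(mu_star, nu_star, d=spearman_distance):
    """Average of d over all pairs of compatible complete rankings."""
    class_1 = compatible_class(mu_star)
    class_2 = compatible_class(nu_star)
    total = sum(d(m, n) for m in class_1 for n in class_2)
    return total / (len(class_1) * len(class_2))

# Object 4 is missing in ranking 1; objects 1 and 2 are tied in ranking 1.
mu_star = [1.5, 1.5, 3, None]
nu_star = [1, 2, 3, 4]
d_star = generalized_distance(mu_star, nu_star)
```

When neither ranking has ties or missing values, each compatibility class contains a single permutation and `generalized_distance` reduces to the ordinary distance.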
In the next section, we shall study the null hypothesis $H_0$ of independence between two random variables. We assume that the numbers of ranked observations in $\mu^*$ and $\nu^*$ are fixed, as are the tie patterns and the pattern of missing observations. Moreover, under $H_0$, the elements of $C_{\delta_1}(\mu^*)$ are equally likely and are independent of those in $C_{\delta_2}(\nu^*)$. Hence, it is easily shown that under $H_0$ the measures $A^*(\mu^*, \nu^*)$ for both Spearman and Kendall are conditional expectations given the classes of complete compatible rankings:

$$A^*(\mu^*, \nu^*) = E\left[ A(\mu, \nu) \mid C_{\delta_1}(\mu^*), C_{\delta_2}(\nu^*) \right].$$

As noted in Alvo and Cabilio (1995), this remark, together with a further relation between the two measures, implies that the generalized test statistics are asymptotically equivalent as $t \to \infty$. Consequently, in what follows we shall be concerned only with the Spearman case.

S
The generalized Spearman distance remains unchanged under any permutation relabeling of the items.This is a property known as right invariance (see Alvo and Cabilio, 1993).
Assuming $k_1 \le k_2$, where $k_1$ and $k_2$ are the numbers of objects ranked in rankings 1 and 2, we may consequently relabel the items in such a way that, in ranking 2, the first $k_2$ objects are the ones ranked; tied items can be arranged among themselves in any order. Hence, the missing items can be placed at the end and the new ranks $\nu^*(j)$ appear in natural order in ranking 2. We let $o_j$ be the label of the $j$th item ranked in ranking 1, and $k^*$ be the number of items ranked in ranking 1 among the $k_2$ ranked in ranking 2. As an example, consider measurements $(X_1, X_2)$ from $t = 10$ individuals.
For the analyses which follow, we shall focus on the similarity measure $A^*_S$. Following Lehmann (1975, p. 360), let $U_1, \ldots, U_{k_1}$ be independent random variables uniformly distributed on $(0, 1)$, let $R_j$ be the rank of $U_j$ ($j = 1, \ldots, k_1$), and define the score function $a_{k_1}(u)$ as in (2). It follows that under $H_0$, $A^*_S$ has the same distribution as $S_{k_1}$, where $S_{k_1}$ is a linear rank statistic.
Theorem 1 Under $H_0$, and under the regularity conditions on the tie patterns used in the Appendix, the distribution of $S_{k_1}$ is asymptotically normal as $k^* \to \infty$.
Proof: See the Appendix.
By Theorem 1, $A^*_S$ is also asymptotically normally distributed under $H_0$. Moreover, by applying Theorem 'a' on p. 160 of Hájek and Šidák (1967), it can be shown that under $H_0$ the expected value of $A^*_S$ is zero and that the exact variance of $A^*_S$ can be obtained.
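The zero-mean property is easy to verify exactly in the simplest special case of complete rankings with no ties, where $A^*_S$ reduces to the classical similarity measure $A_S$. The sketch below enumerates all $t!$ equally likely rankings under $H_0$ and computes the exact null mean and variance of $A_S$ with rational arithmetic. The comparison value $c_S^2/(t-1)$ is the classical result $\mathrm{Var}(\rho_S) = 1/(t-1)$ rescaled by $c_S = t(t^2-1)/12$; it is quoted as a check for this special case only and is not the variance formula of the generalized statistic.

```python
from fractions import Fraction
from itertools import permutations

def null_moments(t):
    """Exact null mean and variance of A_S for complete rankings of t objects."""
    centre = Fraction(t + 1, 2)
    mu = range(1, t + 1)  # ranking 1 can be fixed by right invariance
    values = [
        sum((a - centre) * (b - centre) for a, b in zip(mu, nu))
        for nu in permutations(range(1, t + 1))
    ]
    mean = sum(values, Fraction(0)) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance

t = 4
mean, variance = null_moments(t)
c_S = Fraction(t * (t ** 2 - 1), 12)
classical_variance = c_S ** 2 / (t - 1)
```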

Efficiency of the Test Statistic
An important consideration in rank tests is the efficiency of the test statistic. In particular, we would like to compare the proposed statistic $A^*_S$ with the Spearman statistic obtained by discarding all the missing observations. It is shown below that under the location shift alternative to $H_0$, $A^*_S$ is always more powerful than the corresponding Spearman statistic based on the reduced sample.
Following the approach of Hájek and Šidák (1967), let $X_1, \ldots, X_t$ be independent random variables whose joint density under the location shift alternative to $H_0$ is given by $q_\beta = \prod_{j=1}^{t} f_0(x_j - \beta_j)$, where $f_0$ is a known density function having finite Fisher information $I(f_0)$ and $\beta = (\beta_1, \ldots, \beta_t)$ is an arbitrary vector. Upon deletion of all pairs with missing values, we let $k_2 = t$ and $k_1 = k$, where $k$ is simply the actual number of $X$'s observed. The Spearman-type statistic based on the reduced sample, denoted $\bar{A}_S$ and given in (3), is written in terms of $o^\#_j$, defined as the midrank of the $j$th item ranked in ranking 1. From the form of the statistic (3), it follows immediately that, under the alternative $q_\beta$, both $\bar{A}_S$ and $A^*_S$ are asymptotically normal, with means and variances that depend on $F$, the cumulative distribution function of $f_0$. Moreover, the asymptotic efficiencies $e_{\bar{A}_S}$ and $e_{A^*_S}$ of $\bar{A}_S$ and $A^*_S$ can be obtained in terms of $Q_1$, a positive function of $f_0$, where the limit is taken as $t \to \infty$, $k \to \infty$ with $k/t \to \lambda > 0$.
Therefore, the asymptotic relative efficiency of $A^*_S$ relative to $\bar{A}_S$ is given by $e_{A^*_S} / e_{\bar{A}_S}$. Consider the case where a subset of the $\beta_j$'s is specified and the remaining $\beta_j$'s are arbitrary. This situation includes alternatives of the form $E(X_i) = \psi_0 + \psi_1 R(X_i)$, $\psi_1 > 0$, where $R(X_i)$ is just the midrank of item $i$. It can be seen that, irrespective of the density $f_0$, the asymptotic relative efficiency of $A^*_S$ relative to $\bar{A}_S$ is given by a quantity $R(k, \nu^*)$. Note that $R(k, \nu^*) > 1$ in most cases. One exception is the case of no tied and no missing observations, in which both $A^*_S$ and $\bar{A}_S$ reduce to $A_S$. As in Alvo and Cabilio (1995), we may illustrate the results of the calculation of this efficiency. Suppose that $t = 19$, $k = 7$, ...
The analysis hence indicates a negative association between age and level of satisfaction. That is, young people tend to be less satisfied with the Policy Address. We also performed a test of independence using a contingency table analysis of the same data, in which the "missing" categories for age and response were dropped. The row corresponding to response "1" was also dropped since it has no occurrences. The chi-square statistic, based on 12 degrees of freedom, was 15.806, and the corresponding p-value is 0.200.
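The reported p-value of the contingency table analysis can be reproduced without statistical software, because the chi-square survival function has a closed form when the degrees of freedom are even: for $df = 2m$, $P(\chi^2_{2m} > x) = e^{-x/2} \sum_{i=0}^{m-1} (x/2)^i / i!$. The sketch below uses this standard identity; the function name is illustrative, and only the statistic 15.806 and the 12 degrees of freedom are taken from the text.

```python
from math import exp

def chi2_sf_even_df(x, df):
    """Upper tail probability of a chi-square variable with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2
    term = 1.0   # (x/2)^i / i! for i = 0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return exp(-half) * total

p_value = chi2_sf_even_df(15.806, 12)
```

The result agrees with the reported value of 0.200 to three decimal places.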

Concluding Remarks
Rank tests are applicable in many contexts. However, one main disadvantage of rank tests is that they may not be applicable when the data contain missing observations and/or tied values. This problem appears very often in the two-way ordinal classifications used in analyzing survey data. In this paper, we proposed a rank test for independence which generalizes the Spearman rank correlation, based on a natural extension of the concept of distance between two incomplete rankings to data containing both missing observations and ties. The test is simple and easily applicable. The test statistic reduces to the classical Spearman/Kendall rank statistic when there are no missing values; it reduces to the test proposed by Alvo and Cabilio (1995) when there are missing values but no ties.
Sometimes we might want a measure of association that indicates the direction of influence between the two characteristics when independence is rejected. This is easily done by defining a correlation measure $\alpha^*$ in terms of the generalized Spearman distance $d^*$ and of $M$ and $m$, the maximum and minimum values of $d^*$ taken over all possible patterns of the missing and tied observations when the numbers of tied groups in rankings 1 and 2, $e_1$ and $e_2$ respectively, are fixed. Note that $-1 \le \alpha^* \le 1$. The calculation of $M$ and $m$ is not straightforward, and this interesting problem merits future research.

Appendix
... tend to zero. Since $\max_i (g_{2i}/k^*)$ is bounded away from 1 as $k^* \to \infty$, there exists an $\epsilon_2$ ($0 < \epsilon_2 < 1$) such that $g_{2i} \le (1 - \epsilon_2) k^*$ for all $i$, and hence the above condition is satisfied. Further, condition (A.1) is obviously satisfied with $m = 0$ and $M = 1$. As to condition (A.2), note that from (2), the quantities $k_1 a_{k_1}$ are just the ...
By an argument similar to the above, with $0 < \epsilon_1 < 1$ such that $g_{1i} \le (1 - \epsilon_1) k_1$ for all $i$, we obtain a bound which shows that condition (A.2) is also satisfied.
To show that condition (A.3) is satisfied, recall the definition of $R_1$ and let $a^{(j)}_{k_1}(u)$, $j = 1, \ldots, e_1$, denote the components of the score function. It can be seen that $a_{k_1}(u) = \frac{1}{2k_1}(g_{11} + 1)\, a^{(1)}_{k_1}(u) + \cdots$.
From the inequality $(X_1 + \cdots + X_{e_1})^2 \le e_1 (X_1^2 + \cdots + X_{e_1}^2)$, and since the coefficients of the $a^{(j)}_{k_1}(u)$ are all less than or equal to 1, the expected squared difference in condition (A.3) is bounded by $e_1$ times the sum of the corresponding expectations for the individual components. For the first component, let $W$ be the number of $U$'s less than or equal to $g_{11}/k_1$; the variables $a^{(1)}$ are independently and identically distributed with mean 0, and $W$ has the binomial distribution with $k_1$ trials and probability of success $g_{11}/k_1$. Consequently, the expectation for the first component tends to 0 as $k_1 \to \infty$ when $g_{11}/k_1$ is bounded away from 1. Similarly, for $h < e_1$, the corresponding expectations are nonnegative and tend to 0. Hence, we have $E\left[ a_{k_1}(U_1) - a_{k_1}(R_1/k_1) \right]^2 \to 0$ as $k_1 \to \infty$, and this completes the proof.

Table 2: Results of the analyses