Comparison of Partially Ranked Lists

In this paper we introduce a measure of closeness of partial rankings based on a metric on permutations, and we analyze some of its properties. We consider two types of partial rankings: ranking the k favorite items out of n and classification into several ordered categories.


Introduction
In many situations, there are different methods for analyzing the same data. For example, several methods exist for finding differentially expressed genes using RNA-seq data. They tend to produce similar, but not identical, lists of significant genes and rankings of those genes. When comparing different methods applied to the same data, we are interested in how close their outputs are. The main idea is to define an appropriate distance on the sample space. Further, the raw distance between two rankings should be interpreted on the basis of its statistical significance. That means we need to know the distribution of the distance under some common hypotheses about a sample of rankings. In recent years, many new applications have appeared in different areas, including bioinformatics, pattern recognition, and information retrieval; see Jurman, Merler, Barla, Paoli, Galea, and Furlanello (2007), Jurman, Riccadonna, Visintainer, and Furlanello (2009), Chan, Yan, Kittler, and Mikolajczyk (2015), Fagin, Kumar, Mahdian, Sivakumar, and Vee (2006), and Fagin, Kumar, and Sivakumar (2003).
In this paper we define an appropriate mathematical framework that includes special cases of partially ranked lists of items. A ranked list can be complete, which means all n items are ranked, or incomplete, which means some items are not ranked. Incomplete rankings include the case where the most significant k items are ranked, with group k + 1 consisting of the remaining items. Any ranking of n items corresponds to a permutation α(1), …, α(n) from the set S_n of all permutations. We define appropriate distance measures on S_n in order to compare full or incomplete rankings, or rankings of different types. The distance can be thought of as a measure of the similarity of the two rankings.
Let α and β be two permutations from S_n corresponding to two rankings, and let d be a metric on the permutation group S_n. Then d : S_n × S_n → [0, ∞) satisfies the usual metric axioms. Invariance is natural in many problems. Right-invariance means that the distance does not depend on an arbitrary labeling or reordering of the data: d(α, β) = d(ατ, βτ) for all α, β, τ ∈ S_n. Here ατ is the product of the two permutations α and τ, defined by ατ(i) = α(τ(i)). Right-invariance allows us to compute the distance between two permutations α and β as the distance from αβ^{−1} to the identity permutation.
For α, β ∈ S_n, several functions are commonly used as statistical measures of association, among them Kendall's τ, Spearman's ρ, Spearman's footrule, and Chebyshev's metric. All these measures are right-invariant metrics on S_n. By right-invariance of a distance, it is sufficient to study its statistical properties when one of the rankings is the identity permutation.
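As an illustration of right-invariance, the following Python sketch implements four such distances in their standard textbook forms (an assumption, since the definitions are not written out here) and checks the invariance identity on random permutations. The function names are illustrative only, and Kendall's τ is taken in its distance form, the number of discordant pairs.

```python
from itertools import combinations
import random

# Permutations are stored as tuples of ranks: a[i] is the rank of item i.
# Ranks run over 0..n-1 here, which leaves every distance below unchanged.

def compose(a, t):
    """Product (a t)(i) = a(t(i))."""
    return tuple(a[t[i]] for i in range(len(a)))

def inverse(a):
    """Inverse permutation: inverse(a)[a[i]] = i."""
    inv = [0] * len(a)
    for i, r in enumerate(a):
        inv[r] = i
    return tuple(inv)

def footrule(a, b):
    """Spearman's footrule: sum of absolute rank differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def spearman_rho(a, b):
    """Spearman's rho as a metric: Euclidean distance between rank vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kendall_tau(a, b):
    """Kendall's tau distance: number of item pairs ordered differently."""
    return sum(1 for i, j in combinations(range(len(a)), 2)
               if (a[i] - a[j]) * (b[i] - b[j]) < 0)

def chebyshev(a, b):
    """Chebyshev's metric: largest absolute rank difference."""
    return max(abs(x - y) for x, y in zip(a, b))

METRICS = (footrule, spearman_rho, kendall_tau, chebyshev)

rng = random.Random(0)
n = 6
alpha, beta, tau = (tuple(rng.sample(range(n), n)) for _ in range(3))
identity = tuple(range(n))

for d in METRICS:
    # Right-invariance: relabeling the items by tau changes nothing.
    assert d(alpha, beta) == d(compose(alpha, tau), compose(beta, tau))
    # Consequence: d(alpha, beta) is the distance from alpha beta^{-1} to the identity.
    assert d(alpha, beta) == d(compose(alpha, inverse(beta)), identity)
```

The check is only a numerical spot test on one random triple, not a proof; the invariance itself follows by reindexing the sums and maxima over items.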

Complete or incomplete ranking
A ranking of n items is represented by an ordered n-tuple, which simply lists the items in their ranked order. The most preferred item is listed first, and the least preferred item appears in the n-th position.
Any ranking corresponds to a permutation, which is an element of the set S_n of permutations. Given a set of rankings, the problem of comparing them reduces to the problem of choosing an appropriate measure of association on the set of all rankings. Several useful distance measures on S_n are thoroughly discussed in the statistical literature, such as Kendall's τ, Spearman's ρ, and Spearman's footrule. For two permutations α, β ∈ S_n, the distance d(α, β) can therefore be thought of as a measure of similarity of the two rankings. Excellent references on the statistical analysis of rankings are the monographs by Diaconis 1988, Critchlow 1985, and Marden 1995.
There are many situations in which a complete ranking of all n items is not required. The goal might be to rank only the k favorite items out of n, or just to choose the k favorite items. In other cases it is important to classify items into groups or categories according to some criterion of "goodness". We then need appropriate distances to measure the closeness of such rankings.
The general partitioning problem can be described as follows. Let {1, …, n} be n given items. We wish to partition them into a fixed number r of disjoint categories, such that each category contains a preassigned number of items. The first category contains the n_1 favorite items, the second category contains the n_2 next preferred items, and so on; the final category contains the n_r least favorite items, where n_1 + … + n_r = n and n_i ≥ 1. We do not state any preferences among members of the same category.
If we assign values to r and n_i we obtain several special cases of interest:
(1) To choose the best single item (r = 2, n_1 = 1, n_2 = n − 1);
(2) To choose the best k items without regard to order (r = 2, n_1 = k, n_2 = n − k);
(3) To choose the best k items with regard to order (r = k + 1, n_1 = … = n_k = 1, n_{k+1} = n − k);
(4) To order all items (r = n, n_1 = … = n_r = 1);
(5) To partition the items into a fixed number of categories.
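The special cases can be encoded as tuples of category sizes. The sketch below is illustrative only: it assumes that goal (2) uses the sizes (k, n − k) and goal (3) the sizes (1, …, 1, n − k), and the helper names are hypothetical. It also counts the distinct partial rankings of a given type with the multinomial coefficient n!/(n_1! ⋯ n_r!).

```python
from math import factorial

def category_sizes(goal, n, k=None):
    """Category sizes (n_1, ..., n_r) for goals (1) to (4); k is needed for (2) and (3)."""
    if goal == 1:
        return (1, n - 1)
    if goal in (2, 3):
        if k is None or not 1 <= k < n:
            raise ValueError("goals 2 and 3 need 1 <= k < n")
        return (k, n - k) if goal == 2 else (1,) * k + (n - k,)
    if goal == 4:
        return (1,) * n
    raise ValueError("goal must be 1, 2, 3 or 4")

def number_of_partial_rankings(sizes):
    """Multinomial coefficient n! / (n_1! ... n_r!)."""
    count = factorial(sum(sizes))
    for size in sizes:
        count //= factorial(size)
    return count

for goal in (1, 2, 3, 4):
    sizes = category_sizes(goal, n=5, k=2)
    print(goal, sizes, number_of_partial_rankings(sizes))
```

Goal (5) corresponds to passing any other tuple of positive sizes summing to n directly to `number_of_partial_rankings`.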
Many of the decision procedures that one might use within the scope of these ranking problems have a corresponding structure which is invariant under a group of transformations. We consider suitable models for the analysis of such partially ranked data, thoroughly described by Critchlow 1985.
The full ranking (goal 4) of n items is viewed as an element of the permutation group S_n. The corresponding permutation α ∈ S_n is a bijection α : {1, …, n} → {1, …, n}, where α(i) denotes the rank given to item i and α^{−1}(i) denotes the item assigned to rank i.
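The two representations of a full ranking, the ordered list of items and the rank permutation α, can be converted into each other. The following minimal Python sketch shows the conversion; the function names and the example ordering are hypothetical.

```python
def ranks_from_ordering(ordering):
    """ordering[r-1] is the item placed at rank r; returns alpha with alpha[i-1] = rank of item i."""
    alpha = [0] * len(ordering)
    for rank, item in enumerate(ordering, start=1):
        alpha[item - 1] = rank
    return tuple(alpha)

def inverse(alpha):
    """alpha^{-1}: entry r-1 holds the item assigned to rank r."""
    inv = [0] * len(alpha)
    for item, rank in enumerate(alpha, start=1):
        inv[rank - 1] = item
    return tuple(inv)

ordering = (3, 1, 4, 2)             # item 3 is most preferred, item 2 least
alpha = ranks_from_ordering(ordering)
print(alpha)                        # rank given to items 1, 2, 3, 4
print(inverse(alpha) == ordering)   # alpha^{-1} recovers the ordered list
```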

The ranking of the k favorite items out of n
This is probably the most popular goal in ranking problems. Let S_{n−k} ⊂ S_n denote the subgroup of permutations that leave the first k integers fixed and permute the remaining n − k integers among themselves: S_{n−k} = {γ ∈ S_n : γ(i) = i, i = 1, …, k}. Define an equivalence relation on S_n as follows: two permutations α and β are equivalent if and only if there exists γ ∈ S_{n−k} such that α = γβ. For any α ∈ S_n, the equivalence class S_{n−k}α induced by α consists of all permutations equivalent to α. Hence, each partial ranking of k out of n items can be identified with the set of all full permutations which induce it, that is, with a right coset of S_{n−k}. There is thus a one-to-one correspondence between the partial rankings of type "k out of n" and the right cosets of S_{n−k}. This coset space is denoted by S_n/S_{n−k}.
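For small n the coset construction can be checked by brute force. The sketch below assumes the subgroup consists of the permutations γ with γ(i) = i for i ≤ k, acting on ranks; it enumerates the right coset S_{n−k}α and confirms that all its members share the same ordered top-k list. All names are illustrative.

```python
from itertools import permutations

def subgroup_fixing_first_k(n, k):
    """All gamma in S_n with gamma(i) = i for i = 1..k, stored as gamma[i-1] = gamma(i)."""
    head = tuple(range(1, k + 1))
    return [head + tail for tail in permutations(range(k + 1, n + 1))]

def right_coset(alpha, k):
    """The coset {gamma alpha}, where (gamma alpha)(i) = gamma(alpha(i))."""
    n = len(alpha)
    return {tuple(gamma[alpha[i] - 1] for i in range(n))
            for gamma in subgroup_fixing_first_k(n, k)}

def top_k(alpha, k):
    """Items holding ranks 1..k, listed in rank order."""
    return tuple(alpha.index(rank) + 1 for rank in range(1, k + 1))

alpha = (2, 4, 1, 3, 5)   # alpha[i-1] is the rank of item i
k = 2
coset = right_coset(alpha, k)

print(len(coset))                                            # (n - k)! members
print({top_k(pi, k) for pi in coset} == {top_k(alpha, k)})   # one shared top-k list
```

Conversely, every full ranking of the five items with the same top-2 list lies in this coset, which is the one-to-one correspondence described above.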

Classification into r ordered categories
Let n_1, …, n_r be an ordered sequence of r strictly positive integers summing to n. Such an ordered partition corresponds to a partial ranking with n_1 items in the first group, n_2 items in the second group, and so on. No further information is conveyed about orderings within each group. The special case of ranking the top k items corresponds to n_1 = … = n_k = 1, n_{k+1} = n − k. Formally, let N_1, …, N_r be the following partition of {1, …, n}: N_1 = {1, …, n_1}, N_2 = {n_1 + 1, …, n_1 + n_2}, …, N_r = {n_1 + … + n_{r−1} + 1, …, n}. Let S denote the subgroup of all rankings which permute the first n_1 items among the first n_1 ranks, the next n_2 items among the next n_2 ranks, and so on. The equivalence class [α], consisting of all rankings that assign the same set of ranks to the items of each category as α does, is the right coset Sα. There is a one-to-one correspondence between the partial rankings "of type n_1, …, n_r" and the right cosets of S.
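Membership in the same right coset of S can be tested without enumerating S. The sketch below assumes the blocks of ranks are consecutive, so that the first n_1 ranks form the first category, the next n_2 ranks the second, and so on; two full rankings then lie in the same coset exactly when every item falls into the same category under both. The helper names are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate

def category_vector(alpha, sizes):
    """For each item i, the index (1..r) of the category that contains its rank alpha(i)."""
    upper = list(accumulate(sizes))   # largest rank in each category
    return tuple(bisect_left(upper, rank) + 1 for rank in alpha)

def same_partial_ranking(alpha, beta, sizes):
    """True when alpha and beta assign every item to the same category."""
    return category_vector(alpha, sizes) == category_vector(beta, sizes)

sizes = (2, 1, 3)                   # n_1 = 2, n_2 = 1, n_3 = 3
alpha = (1, 2, 3, 4, 5, 6)
beta = (2, 1, 3, 6, 4, 5)           # reshuffles ranks only within categories
gamma = (3, 2, 1, 4, 5, 6)          # moves item 1 into the second category

print(category_vector(alpha, sizes))
print(same_partial_ranking(alpha, beta, sizes))
print(same_partial_ranking(alpha, gamma, sizes))
```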

Distances on partial rankings
In the above algebraic structure, the problem of comparing partial rankings reduces to the problem of extending the metrics on the permutation group S_n to metrics on the corresponding coset space. We discuss an extension of the above metrics to partial rankings. One natural way of extending a metric is to construct the induced Hausdorff metric. Its particular benefit is that it keeps the metric properties of the original distance. We focus on Chebyshev's metric between partial rankings, obtained by a suitable generalization of the M distance on S_n to coset spaces of S_n. The Hausdorff versions of five other metrics are due to Critchlow 1985.

Chebyshev's metric for partial rankings
In this section we derive an extension of Chebyshev's metric for partial rankings of types (3) and (5). Theorems 1 and 2 below state the extensions of Chebyshev's metric to metrics on the coset spaces S_n/S_{n−k} and S_n/S. The extensions preserve the invariance properties of the metric. The construction is based on the Hausdorff distance between cosets.
The Hausdorff metric on S_n/S_{n−k} induced by Chebyshev's metric is defined by taking G = S_n and K = S_{n−k} in Proposition 1.
Theorem 1. Let A, B, D, E be the following partition of {1, …, n}: […]. Then the Hausdorff metric on S_n/S_{n−k} induced by Chebyshev's metric is […]. Here p_1 < … < p_h is an ordering of the set ∪_{i∈B} {α(i)}, s_1 < … < s_h is an ordering of the set ∪_{i∈D} {β(i)}, h is the number of elements in B (or D), and δ(x) = 1 for x > 0 and δ(x) = 0 for x ≤ 0.
The proof is in the Appendix.
The Hausdorff metric on S_n/S induced by Chebyshev's metric is defined by taking G = S_n and K = S in Proposition 1.
Theorem 2. Let n_ij be the number of elements in the set {α […]
The proofs of these theorems are based on the fact that Chebyshev's metric on S_n satisfies the transposition property.
The proof can be found in Stoimenova (2000). For metrics possessing the transposition property, the permutations β_max, α_min(β_max), α_max and β_min(α_max) have a simple special form.

Distributional properties of the metrics
Suppose that two partial rankings α* and β* are generated independently from a uniform distribution on all possible partial rankings, and we calculate the distance d*(α*, β*). The distance is then a random variable, and one might study its distribution on the set of permutations. Figure 1 (left) shows the distribution of Chebyshev's metric for full rankings, based on 10000 choices of σ from a uniform distribution. Since the metric is right-invariant, we calculate the distribution of the distance from the identity permutation. Normal approximation is understood in the following sense.
Definition 2. The metric d* on S_n/S is asymptotically normally distributed if, for partial rankings α* and β*, the limit P{(d*(α*, β*) − E d*) / (var d*)^{1/2} ≤ x} → Φ(x) as n → ∞ holds for all real numbers x, where Φ is the standard normal cumulative distribution function.
By using the exact or the approximate distribution of a distance on permutations, one can calculate the probability that d* is less than or equal to the observed value d*(α*, β*). This probability is the p-value for α* and β*. Smaller values of p indicate stronger evidence that α* and β* are "similar".
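When the exact distribution is not at hand, the p-value can be approximated by simulation. The following Python sketch is a Monte Carlo illustration for full rankings only, not the R code mentioned in this paper: it draws uniform random permutations, computes Chebyshev's distance to the identity (which suffices by right-invariance), and returns the share of simulated distances not exceeding an observed value. The choices n = 10, the observed value 4, and the seed are arbitrary assumptions.

```python
import random
from collections import Counter

def chebyshev_to_identity(sigma):
    """max_i |sigma(i) - i| for a permutation stored as a 0-indexed sequence."""
    return max(abs(rank - item) for item, rank in enumerate(sigma))

def simulate_distances(n, draws, seed=0):
    """Chebyshev distances from the identity for `draws` uniform random permutations."""
    rng = random.Random(seed)
    sigma = list(range(n))
    sample = []
    for _ in range(draws):
        rng.shuffle(sigma)   # shuffling any arrangement yields a uniform permutation
        sample.append(chebyshev_to_identity(sigma))
    return sample

def empirical_p_value(observed, sample):
    """Share of simulated distances that are <= the observed distance."""
    return sum(1 for d in sample if d <= observed) / len(sample)

sample = simulate_distances(n=10, draws=10000)
histogram = Counter(sample)          # empirical distribution of the distance
for value in sorted(histogram):
    print(value, histogram[value] / len(sample))

print(empirical_p_value(observed=4, sample=sample))
```

For partial rankings the same scheme applies with the induced distance d* in place of the distance to the identity, at the cost of evaluating d* for each simulated pair.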
To compute the p-value, Critchlow finds the probability distribution of some popular metrics on permutations under the appropriate uniformity assumption. The critical values of the distribution of Chebyshev's metric under uniformity assumptions can be calculated for different choices of the sizes of the categories. The R code for Chebyshev's metric is available from the author on request. The significance of the distance can therefore be used to assess the similarity between the two partial rankings. The interpretation is very much like that of the significance of a correlation coefficient.
Chan et al. 2015 compute the distributions of several metrics between ranking descriptors in a texture image from a real dataset and apply them to image intensities and to filter responses. The distributions of Chebyshev's metric and Spearman's rho have the same features we see in Figure 1.

Proposition 1.
Let G be an arbitrary finite group, K be any subgroup of G, and d be a right-invariant metric on G. Then d induces a right-invariant metric on the coset space G/K defined by d*(Kα, Kβ) = max{ max_{σ∈Kβ} min_{π∈Kα} d(π, σ), max_{π∈Kα} min_{σ∈Kβ} d(π, σ) }. In this formula, the quantity min_{π∈Kα} d(π, σ) is the distance between σ and the set Kα. Therefore, the quantity max_{σ∈Kβ} min_{π∈Kα} d(π, σ) is the maximal distance of a member of Kβ to the set Kα. Similarly, max_{π∈Kα} min_{σ∈Kβ} d(π, σ) is the maximal distance of a member of Kα to the set Kβ.
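The induced metric of Proposition 1 can be evaluated directly from its definition when n is small. The Python sketch below is a brute-force illustration, not the closed form of Theorem 1: it takes G = S_n, K = S_{n−k} (assumed to fix the ranks 1, …, k) and d equal to Chebyshev's metric, enumerates both cosets, and applies the max-min formula. The function names and example rankings are hypothetical.

```python
from itertools import permutations

def chebyshev(a, b):
    """Chebyshev's metric: largest absolute rank difference."""
    return max(abs(x - y) for x, y in zip(a, b))

def right_coset(alpha, k):
    """K alpha for K = S_{n-k}: apply every gamma fixing ranks 1..k to the ranks in alpha."""
    n = len(alpha)
    coset = set()
    for tail in permutations(range(k + 1, n + 1)):
        gamma = tuple(range(1, k + 1)) + tail      # gamma[r-1] = gamma(r)
        coset.add(tuple(gamma[r - 1] for r in alpha))
    return coset

def hausdorff(coset_a, coset_b, d):
    """max of the two directed distances between the cosets."""
    b_to_a = max(min(d(p, s) for p in coset_a) for s in coset_b)
    a_to_b = max(min(d(p, s) for s in coset_b) for p in coset_a)
    return max(b_to_a, a_to_b)

def induced_chebyshev(alpha, beta, k):
    """d*(K alpha, K beta) by exhaustive enumeration; the cost grows like ((n - k)!)^2."""
    return hausdorff(right_coset(alpha, k), right_coset(beta, k), chebyshev)

alpha = (1, 2, 3, 4, 5)
beta = (2, 1, 3, 4, 5)     # swaps the two top-ranked items
print(induced_chebyshev(alpha, beta, k=2))
```

In this example every member of Kβ differs from every member of Kα by exactly 1 in the first coordinate, and members with matching tails attain that bound, so the induced distance is 1. The enumeration is practical only for small n − k, which is why closed forms are of interest.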

Figure 1: Distribution of Chebyshev distance and Spearman's rho between 2 random permutations.
It is evident in the figure that the distribution of Chebyshev's metric is left-skewed, and it has a similar form on partial rankings as well. We suggest a chi-square approximation of the distribution. The other metrics discussed in Section 1 have symmetric distributions, and for large n they exhibit normality. The distribution of Spearman's rho for full rankings is presented in Figure 1 (right).