The Analysis of Ranking Data Using Score Functions and Penalized Likelihood

In this paper, we consider diﬀerent score functions in order summarize certain characteristics for one and two sample ranking data sets. Our approach is ﬂexible and is based on embedding the nonparametric problem in a parametric framework. We make use of the von Mises-Fisher distribution to approximate the normalizing constant in our model. In order to gain further insight in the data, we make use of penalized likelihood to narrow down the number of items where the rankers diﬀer. We applied our method on various real life data sets and we conclude that our methodology is consistent with the data.


Introduction
Ranking data occur quite frequently in practice.There are examples in sport competitions (Deng, Han, Li, and Liu 2014), in the selection of candidates in political elections (Croon 1989;Lee and Philip 2012), in the arrangement of web-pages when using search engines (Aslam and Montague 2001) and in flagging disease related gene in bioinformatics (DeConde, Hawley, Falcon, Clegg, Knudsen, and Etzioni 2006) just to name a few.In all cases it is of interest to present summaries of the data.Some of the methods include the calculation of the average ranks.Borda counts (Baba 1986) usually sum up the rank as each item's score and present transformations such as ordering or averaging.There are also more sophisticated methods which involve building models to explain the ranking data.For example, Kemeny rankings (Mallows 1957;Kemeny and Snell 1962) are based on fitting a distance-based model with Kendall distance and using the modal rank as a summary analogous to the mean for numerical data.Deng et al. (2014) suggested a Bayesian approach to aggregate the observed rankings whereas Dwork, Kumar, Naor, and Sivakumar (2001) suggested a model using a Markov chains approach to have a summary of the word association or the web-page for example.
Here we first introduce a parametric model for the one sample case which incorporates an arbitrary score function.Such score functions include both the Spearman and Kendal scores.The former focuses attention on a single item at a time whereas the latter allows for comparisons between pairs of items.In the era of big data, the number of items being ranked is usually very large.For this reason, we may apply penalized likelihood methods to focus on those items that are particularly preferred by the rankers.The parametric model can be extended to the two sample case whereby we now compare groups of judges.Bootstrap methods can be particularly useful for constructing a confidence interval for the parameters in the models.
The article is organized as follows.In Section 2 we formally introduce the parametric model and indicate how to calculate the maximum likelihood estimates (MLE).We also obtain the estimates using penalized likelihood (PMLE) .In Section 3, we introduce the Spearman score function, explain its significance and then apply it to three real data sets.In Section 4 we consider the Kendall score function, explain its significance and use it on the same data sets.In Section 5 we consider the two sample problem and illustrate our results on two real life data sets.We briefly summarize our results in Section 6.
2. The probability models Neyman (1937) first proposed smooth tests for testing the null hypothesis that data come from a uniform distribution on the interval (0, 1).The smooth alternative density proposed by Neyman was given by p(y, θ) = exp where θ = (θ 1 , θ 2 , ..., θ k ) is a set of unknown parameters, K(θ) is the normalizing constant and h i (y) are the Legendre orthonormal polynomials with respect to the uniform on (0, 1).Motivated by this model, we can define a similar probability function for ranking data.Assume there are n judges each of whom ranks t items.Let {ω j } be the set of t! possible rankings and define the probability that X(ω j ) = x j by where θ = (θ 1 , ..., θ k ) is a k-dimensional vector of parameters, K (θ) is a normalizing constant.X(ω j ) is a k-dimensional vector transformed from the original ranking ω j .The transformation rule X (ω j ) = x j is defined by certain scoring function.The likelihood function is obtained from the multinomial distribution and is proportional to where n j is the number of judges choosing ω j .Let the total sample size n = t! j=1 n j .The log-likelihood function is then given by As mentioned in the introduction, when the number of items being ranked is large, the dimension t! becomes very large.Penalized likelihood is a parametric method which helps to narrow down the number of items to look at.The main idea is to add a penalty term in the log likelihood function.In this paper, we consider l 2 norm penalized terms and minimize this function in order to obtain the Maximum Penalized likelihood (MPL) estimates of the vector parameter θ: for some prescribed values of the constant c.When t is large (say t ≥ 10), the exact computation of the normalizing constant K(θ) involves a summation of t! items.McCullagh (1993) suggested using the normalizing constant from the von Mises-Fisher distribution.Following on this suggestion, we approximate K(θ) with where θ is the norm of θ and I υ (z) is the modified Bessel function of the first kind given by We may now find approximately the Maximum Penalized likelihood estimation for θ after ensuring that x = 1 in our model.There are several ways to minimize the target function Λ(θ, c).In this paper, we used three methods: the BFGS Quasi-Newton method (Broyden 1970;Fletcher 1970;Goldfarb 1970;Shanno 1970),the Trust-Region-Reflective method (Coleman andLi 1994, 1996) and the Interior Point Algorithm (Byrd, Hribar, and Nocedal 1999;Byrd, Gilbert, and Nocedal 2000;Waltz, Morales, Nocedal, and Orban 2006).As well, we imposed the constraint: θ ≥ 0 on θ in order to compare the results when there is no constraint.Differentiating with respect to θ we have The critical points of the minimization occur at saddle points, rather than at local maxima (or minima).For the Quasi-Newton method, the target function is the magnitude of the gradient which is the square root of the sum of the squares of the partial derivatives instead of the Λ(θ, c) to solve the problem.In the applications that follow, the results from the three different methods yield the same solutions and the algorithms implemented in MATLAB converge very fast.
Following the estimation of θ, we proceeded to apply the basic bootstrap method in order to assess the distribution of θ.The basic idea of the bootstrap is to sample n rankings with replacement from the data.Then we find the MLE of each bootstrap sample.Repeating this procedure several times say 10 4 , leads to a distribution of θ.We can draw useful inference from the distribution θ and determine whether or not the θ j s are significantly different from zero.Two sided confidence intervals can also be calculated.We illustrate this on real life data.

Using the Spearman score function
3.1.The Spearman score function and its meaning In this section we restrict our attention to the Spearman score function defined as: where ω j (i) is the rank given to Item i.

Score Functions and Penalized Likelihood for Ranking Data
Let T S be the t × t! matrix of possible values of X: T S = (X(ω j )).Taking t = 3 as an example, let ω j be the possible rankings: Then the matrix T S becomes: Let π j = n j /n and let π be the column vector of π j .The first part of our likelihood function Λ(θ, c) becomes: We can notice that for θ 1 , π 5 + π 6 = Pr(giving rank 3 to Item 1) and π 1 + π 2 = Pr(giving rank 1 to Item 1).
So here θ 1 weights the difference in probability between giving the top rank and the lowest rank to Item 1 (Pr(giving rank 3 to Item 1) − Pr(giving rank 1 to Item 1)).There is a similar interpretation for θ 2 and θ 3 .In general, the components of the matrix T S actually focus on a special characteristic of the data, namely on the difference in weighted average of the rankings to the t objects.The weights here are i − t+1 2 , i = 1, ..., t.The θ i 's here represent the coefficients attributed to each item.
Similarly, we can also compute the matrix of possible scores for t = 4.It can be seen that the first row element for T S π is −1.5 Pr(giving rank 1 to Item 1) − 0.5 Pr(giving rank 2) + 0.5 Pr(giving rank 3) + 1.5 Pr(giving rank 4).This is the weighted average of probability giving the high rank compare with the low rank to Item 1.Notice that when t is odd, the weight of the middle item is 0 which means that our comparison is symmetric and balanced.

Sutton data
For the first example, we consider the Sutton data (t = 3) analyzed by C. Sutton in her 1976 thesis on leisure preferences and attitudes on retirement of the elderly for 14 white and 13 black females in the age group 70-79 years.Each individual was asked: with which sex do you wish to spend your leisure?Each female was asked to rank the three responses: male(s), female(s) or both, assigning rank 1 for the most desired and 3 for the least desired.The first item in the ranking corresponds to "male", the second to "female" and the third to "both".To illustrate the approach in the one sample case, we combined the data from the two groups as in Table 1.
We applied our penalized likelihood in this situation and the results are shown in Table 2.
To better illustrate, we rearrange our result (unconstrained θ, c=1) and data in Table 3.It can be seen that θ 1 is the largest coefficient and Item 1 "male" shows the greatest difference between the number of judges choosing rank 1 or rank 3 which means that the judges dislike spending leisure with male the most.For Item 3 "both", the greater value of negative θ 3 means judges prefer to spend leisure with both sex the most.θ 2 is close to zero and we deduce the judges show no strong preference on Female.This is consistent with the hypothesis that θ close to zero means randomness.To conclude, the results also show that θ i weights the difference in probability giving the top rank and the lower rank to Item i. Negative θ i means the judges prefer Item i more and positive θ i means the judges are more likely to give a lower rank to Item i.
Applying the bootstrap method on the Sutton data we plot the distribution of θ in Figure 1.
The bootstrap sample size is 10 4 in this case.For H 0 : θ i = 0, we can see that θ 1 and θ 3 are significantly different from 0 and θ 2 is not significantly different from zero.We can also see that the distribution of θ 1 and θ 2 is not completely bell shaped.This is mainly because the sample size of the data is small.Using a traditional t-test method may be misleading in this case.

Song data
Our second example is the Song data (t=5) from Critchlow, Fligner, and Verducci (1991).Ninety-eight students were asked to rank 5 words, (1) score, (2) instrument, (3) solo, (4) benediction and (5) suit, according to the association with the word "song".Critchlow et al. (1991) reported that the average ranks for words (1) to ( 5) are 2.72, 2.27, 1.60, 3.71 and 4.69 respectively.However, the available data given in Critchlow et al. (1991) is in grouped format and the ranking of 15 students are unknown and hence discarded, resulting in 83 rankings, as shown in Table 4.We applied the penalized likelihood in this situation and the results are shown in Table 5.
We also rearranged the results (unconstrained θ, c=1) and data in Table 6.Note that for convenience we only show the number of judges who rank the top and the lowest but ranking 2 or 4 also is involved in determining the value of θ.From the results, we can see the value of θ successfully captures the properties of the data.θ 5 is the largest positive value and most of the judges think word "suit" is not related to the word "song".θ 3 is the largest negative value and 55 of the 83 judges think that word "solo" is the closest to the word "song".From the results, we see that θ i successfully weights the difference in probability giving the upper rank and the lower rank to Item i.
We also applied our bootstrap method on the Song data and plotted the distribution of θ in Figure 2. The bootstrap sample size is 10 4 in this case.For H 0 : θ i = 0, we can see that all the θ's are significantly different from zero.As well, their distributions of θ are more bell shaped in view of the larger sample size.

Goldberg data
Our final example is due to Goldberg (1976) data (t=10).In the data, 143 graduates were asked to rank 10 occupations according to the degree of social prestige.These 10 occupations are: (i) Faculty member in an academic institution (Fac), (ii) Mechanical engineer (ME), (iii) Operation researcher (OR), (iv) Technician (Tech), (v) Section supervisor in a factory (Sup), (vi) Owner of a company employing more than 100 workers (Own), (vii) Factory foreman (For), (viii) Industrial engineer (IE), (ix) Manager of a production department employing more than 100 workers (Mgr) and (x) Applied scientist (Sci).The data are given in Cohen and Mallows (1980) and have been analyzed by many researchers.Feigin and Cohen (1978) analyzed the Goldberg data and found three outliers due to the fact that the corresponding graduates wrongly presented rankings in reverse order.After reversing these 3 rankings, the average ranks received by the 10 occupations are 8. 57, 4.90, 6.29, 1.90, 4.34, 8.13, 1.47, 6.27, 5.29, 7.85, with the convention that bigger rank means more prestige.Then the preference of graduates is in the order: Fac Own Sci OR IE Mgr ME Sup Tech For.
We applied our penalized likelihood method and the results are shown in Table 7.
We also rearranged our results (unconstrained θ, c=1) and data in Table 8.For convenience we only show the number of judges who rank the top and the lowest but giving other rankings also involves in determining the value of θ.We can see that the results are consistent with the average ranks and our preference result from θ is the also consistent with the results from Feigin and Cohen (1978).For Item 7 "factory foreman", 93 of 143 graduates give rank 1 which means most of them think factory foreman has the lowest social prestige and our θ 7 is the lowest (that bigger rank means more prestige).For t=10, it is almost impossible to calculate the exact value of K(θ).Our results show that the von Mises-Fisher distribution approximation of K(θ) actually works well and can be easily used when t is large.The bootstrap distribution of θ is exhibited in Figure 3.The bootstrap sample size is 10 4 in this case.For H 0 : θ i = 0, we see that all the θ i except θ 9 are significantly different from zero.We can also see that the distribution of the θ's are all bell shaped because the sample size is large.In this section, we re-consider the previous data sets through the lens of the Kendall score statistic.Specifically, for the Kendall score function, the X(ω j ) vector takes values(t K (ω j )) q where the q th element is the pair comparison between jth rank given to item m and n denoted by which is of dimension t 2 × t!.And θ q from 1 to t 2 weights the pair m and n.In the case of the Kendall score function, the θ s focus on the comparison between pairs of items ranked by the judges.As an example, consider once again the case t = 3.Then, where θ 1 weights the comparison between Items 1 and 2, θ 2 weights Items 1, 3 and θ 3 Items 2, 3.As well, When θ q < 0 (for say Item i and Item j), it means that the judges prefer Item j over Item i (µ(i) > µ(j)).When θ q is close to zero, the judges have no special preference between this pair.In the next section we apply the Kendall score function to the previous data sets.

Sutton data
For the first example, we consider the Sutton data (t = 3).We applied our penalized likelihood with Kendall score function in this situation and the results are shown in Table 9.We rearrange the Sutton data focusing on pair comparison and our results (c=1) in Table 10.First, from our estimated θ, we can find that all θ i s are negative.This is consistent with our interpretation of θ.The judges strongly prefer Males to Both and Males to Females.They least prefer Females to Both.We can conclude that our θ s well represent the paired comparisons among the judges.

Song data
As to the Song data (t=5) the results are shown in Table 11.
Since the Song data is about how much 5 words are close to the word "song".We can summarize the results in Table 12.We note that θ 7 , θ 8 and θ 9 all have the same value 0.29.For these, all of the judges think Item i is closer to Item j.Once again, we conclude that θ well represents the paired preferences among judges.

Goldberg data
For the Goldberg (1976) data (t=10) we note that a large rank means more prestige, so our interpretation is reversed.We applied penalized likelihood with the Kendall score function in this situation and part of the results are shown in Table 13.Behavior of θ Interpretation of θ θ 1 , θ 2 < 0 score≺instrument, solo θ 3 > 0 score benediction, suit θ 5 < 0 instrument≺solo θ 6 , θ 7 > 0 instrument benediction, suit θ 8 , θ 9 > 0 solo benediction, suit θ 10 > 0 benediction suit  The Goldberg data compares occupations with more social prestige.We can also interpret the estimated θ to get the pair comparison results.We arrange the behavior and interpretation of θ in Table 14.In the table, the underlined θ i in the left column means that its absolute value is larger than 0.1.The underlined occupation in the column on the right means that the preference is very strong.From the interpretation of θ, we find that the pair comparisons make sense.To conclude, the Kendall score function provides a nice way to look at the data by pair preference.

Extension to the two-sample ranking problem
5.1.Extension to two-sample ranking problems and its meaning We may extend the approach to the two sample case.Let X 1 , .X 2 be two independent random variables under our model whose distributions are given respectively by: where θ l = (θ l1 , ..., θ lt ) represents the vector of parameters for population l and x j is as in the one-sample case, a t-dimensional vector of scores.We shall use the Spearman scores here throughout the two-sample case.Set γ = θ 1 − θ 2 and write Suppose that the observed vector of frequencies for the l th population is The logarithm of the likelihood L as a function of (µ, γ) is proportional to We may now consider penalized likelihood to determine significant components of γ which most separate the populations.Hence, we consider minimizing with respect to the parameters µ and γ the function: for some prescribed values of the constant c and λ.We may continue to use the normalizing constant from the von Mises-Fisher distribution to approximate K(θ).Differentiating we get Here γ i shows the difference between the two groups with respect to their preference on Item i.A negative value of γ i means that group 1 shows more preference for Item i compared to population 2. A positive value of γ i means that group 2 shows more preference on Item i compared to population 1.For γ i close to zero, there is no difference between the two groups on that item.As we shall see, this interpretation is consistent with the results in the real data applications.From the definition, we know that µ is the common part of θ 1 and θ 2 .More specifically, µ is the weight average of θ 1 and θ 2 taking into account the sample sizes of the populations.

Two-sample Sutton data
For the first example, we consider the Sutton data (t = 3) found in Table 15.We applied penalized likelihood and the results are shown in Table 16.We rearranged one of our estimation results (c=1) and the original data in Table 17.First, it is easy to see that µ is just like the θ's in the one-sample problem.For example, µ 3 is the smallest value and the whole population prefers Item "both" best.µ 3 is largest and the whole population mostly dislikes Item "male".This is not surprising since we know that µ is the common part of θ 1 and θ 2 .For the parameter γ, we first consider Item "female".We see that white females prefer to spend leisure time with females (8 assign rank 1) whereas black females do not (6 give rank 3).We find that γ 2 is negative and is largest in absolute value.There is a significant difference between the opinions with respect to Item 2 "female".For Item "male" and "both", we find black females prefer them more than white females.To conclude, the results are consistent with the interpretation of µ and γ.

Game data
For the second example, we consider the Game data (t = 6) (Fok, Paap, and Van Dijk 2012).In this data, 91 Dutch students were asked to consider buying a new platform to play computer games.They had to rank 6 different platforms suitable to play computer games.The 6 platforms are the X-box (360), the PlayStation (2 or 3), the Gamecube (or Wii), the PlayStation Portable, the Gameboy or regular PC.In addition, we know the average number of hours that each student spends on gaming each week.We separate the students into two groups: frequent player (49 students) and seldom player (42 students).We classify a student as a frequent player if the average number of hours that he spends on gaming each week is larger than two.Otherwise he is classified as a seldom player The game data is too large to exhibit here.Instead.we present the average ranks for the two groups in Table 18.
We applied penalized likelihood in this situation and the results are shown in Table 19.
To better illustrate the estimation results and the data, we calculate the weighted average of the observed probability of giving the high rank compared with the low rank for each item.The weight for each rank is i − t+1 2 , i = 1, ..., t (-2.5,-1.5,-0.5,0.5,1.5,2.5)here.The observations giving ranks close to the top and the bottom will get higher weight.A negative weighted average means preference in this case.We summary the results in table 20.First, for the parameter µ, there is a strong relationship with the weighted average of the total population.For the total population, GameBoy receives the most low ranks and µ 5 is the largest, which is consistent with our interpretation of µ.Then, it can be found that the results for γ are consistent with the trend of the difference of weighted average between two samples.For example, for the Item "PC", frequent players exhibits the largest difference of  opinion with seldom players and the γ 5 has the largest absolute value among γ.To conclude, our application on Game data also shows that our methodology is consistent with the data.

Conclusion
In this paper, we considered both the Spearman and Kendall score functions in the one or two sample problems as a way to summarize ranking data.A parametric model was formulated and then we applied penalized likelihood on the original parametric model to narrow down the items being ranked.We used the von Mises-Fisher distribution to approximate the normalizing constant and then determined the MLE estimates in several examples.This estimation procedure is fast and simple.In all cases, the estimation is shown to be consistent with the data.Our methodology was applied on various popular ranking data sets.
Figure 1: The distribution of θ for Sutton data by bootstrap method

Figure 2 :
Figure 2: The distribution of θ for Song data by bootstrap method

Figure 3 :
Figure 3: The distribution of θ for Goldberg data by bootstrap method

Table 1 :
Sutton data on leisure preferences

Table 2 :
Maximum penalized likelihood estimation for Sutton's data

Table 3 :
The Sutton data and the estimation of θ

Table 4 :
Song data set

Table 5 :
Maximum penalized likelihood estimation for Song's data

Table 6 :
The Song data and the estimation of θ

Table 7 :
Maximum penalized likelihood estimation for Goldberg's data

Table 8 :
The Goldberg data and the estimation of θ

Table 9 :
MPLE using Kendall score function for Sutton data

Table 10 :
Pair comparison from the Sutton data and the estimation of θ

Table 11 :
MPLE using Kendall score function for Song data

Table 12 :
Interpretation of θ in song data.A B means A is preferred to B, that is A is closer to the word "Song".

Table 13 :
MPLE using Kendall score function for Goldberg data

Table 15 :
Sutton data on leisure preferences (two-sample problem)

Table 16 :
Maximum penalized likelihood estimation for the two-sample Sutton data

Table 17 :
Two-Sample Sutton data and the estimation of µ, γ

Table 18 :
Average rank for the game data

Table 19 :
Maximum penalized likelihood estimation for the game data

Table 20 :
Weighted average of observed probability and the estimation for the game data