Some Properties of a Recently Introduced Approach to Ordinal Regression

Abstract: The statistical properties of an approach to ordinal regression recently introduced in the literature are discussed. It is shown that for ordinal explanatory variables the approach is equivalent to isotonic regression, with some advantages when dealing with two-sided alternatives. For ordinal response variables the procedure behaves very differently and is asymptotically equivalent to a two-sample t-test between the extreme categories. A penalized version is introduced to improve power, and the procedure is evaluated using Monte Carlo simulations. Finally, the method is applied to microarray gene expression data on prostate cancer.


Introduction
The study of regression models for ordered categorical response variables dates back to the 1950s; for a comprehensive recent review see Liu and Agresti (2005). Stimulated by the seminal article of McCullagh (1980), ordinal regression became a fruitful field of research. In a recent paper Torra, Domingo-Ferrer, Mateo-Sanz, and Ng (2006) proposed a new approach to regression with ordinal variables which does not rely on latent variables. They suggest mapping the levels of the ordinal variable into the interval [0, 1] and taking this mapping into account when calculating least squares estimates. To the best of our knowledge this approach has not been investigated before, and the purpose of this article is to analyze the statistical properties of such a procedure, a question not really dealt with in Torra et al. (2006).
To this end we will restrict ourselves to simple univariate models, and we will study separately the cases where only the explanatory variable or only the dependent variable is ordinal. In Section 2 we will show that in the first case the procedure is equivalent to isotonic regression, though it has some advantages when dealing with two-sided alternative hypotheses. The second case will be dealt with in Section 3, where the approach suggested by Torra et al. (2006) actually leads to a novel procedure, which is however asymptotically equivalent to a two-sample t-test between the values of the explanatory variable corresponding to the lowest and the highest level of the dependent variable. We will introduce penalties to obtain a more interesting procedure, and compare its power with simple regression, where the levels of the dependent variable are assumed to be fixed and equidistant, and with logistic regression based on the proportional odds assumption (McCullagh, 1980). Finally the method is applied to publicly available gene expression data from a study on clinical prostate cancer behavior (Singh et al., 2002). The performance of the new method is compared with the original data analysis and with an approach to ordinal regression based on Gaussian processes (Chu, Ghahramani, Falciani, and Wild, 2005).

Independent Variable Ordinal
Let X be an ordinal variable without an underlying latent variable, and Y a metric response variable. Without loss of generality we will code throughout this section the l + 1 levels of X as Ω_X := {0, 1, . . ., l}. We study models of the form

Y = b_0 + b_1 f(X) + ε,   (1)

where f ∈ F := {f : Ω_X → [0, 1], nondecreasing, not constant} is the set of all mappings of the levels of X into the interval [0, 1] with f(j) ≥ f(i) if j ≥ i. We do not allow constant mappings to avoid trivial degenerations. The error term ε is some random variable with E(ε) = 0, which we will not further specify.
The goal is to perform least squares estimation of b_0 and b_1, minimizing also over F. This means that for a random sample of observations (y_j, x_j), j = 1, . . ., n, we want to solve

min_{b_0, b_1, f ∈ F} Σ_{j=1}^{n} (y_j − b_0 − b_1 f(x_j))².   (2)

We can actually identify any function f ∈ F with a vector f of dimension l + 1, where f = (c_0, c_0 + c_1, . . ., c_0 + · · · + c_l)', with c_i ≥ 0 and c_0 + · · · + c_l ≤ 1. Due to linearity the infimum of (2) is clearly shift invariant with respect to f, and without loss of generality we can assume that c_0 = 0. Having done so, the infimum of (2) is also scale invariant and we can further assume that c_1 + · · · + c_l = 1. Denoting c = (c_1, . . ., c_l)', we can thus formulate a parameterized version of (2):

min_{b_0, b_1, c ∈ ∆_l} Σ_{j=1}^{n} (y_j − b_0 − b_1 g_j(c))²,   (3)

where ∆_l := {c ∈ R^l : c_i ≥ 0, c_1 + · · · + c_l = 1} is the standard simplex of dimension l, and g_j(c) := f(x_j) = c_1 + · · · + c_{x_j} corresponds to the state of the j-th observation of X. If we only want to consider nondecreasing regression lines, i.e. in the situation of a one-sided test, we ask for b_1 ∈ R_+, otherwise we have b_1 ∈ R.
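The cumulative parameterization behind (3) is easy to state in code. The following sketch (our own illustration; the function name is hypothetical and not from the paper) builds the mapped values f(0), . . ., f(l) from an increment vector c in the standard simplex:

```python
# Sketch of the parameterization in (2)-(3): a nondecreasing map f from the
# levels {0, ..., l} of X into [0, 1] is encoded by increments c in the
# standard simplex, with f(0) = 0, f(j) = c_1 + ... + c_j, and f(l) = 1.

def levels_to_unit_interval(c):
    """Given increments c = (c_1, ..., c_l) with c_i >= 0 and sum(c) == 1,
    return the mapped values f(0), f(1), ..., f(l)."""
    assert all(ci >= 0 for ci in c) and abs(sum(c) - 1.0) < 1e-9
    f, total = [0.0], 0.0
    for ci in c:
        total += ci
        f.append(total)
    return f
```

For example, the increments (0.25, 0.25, 0.5) for l = 3 map the four levels of X to 0, 0.25, 0.5, and 1.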
The optimization problem (3) has a very particular structure: it is a biquadratic program over R² ⊗ ∆_l. Keeping c fixed we have the usual convex quadratic program of ordinary linear regression, whereas keeping b_0 and b_1 fixed we obtain a so-called standard quadratic program (Bomze, 1998). Due to this special structure it is easy to prove that the minimum in (3) is actually attained (Proposition 2.1). The same result holds for the one-sided situation, where the minimum is attained in R ⊗ R_+ ⊗ ∆_l. We will denote by (b*_0, b*_1, c*) a vector that minimizes (3). Note that uniqueness of the solution is not always guaranteed; e.g. in case of b*_1 = 0 there are no conditions on c*. In general it is fairly easy to solve (3), because we can explicitly calculate the critical points. To this end we will introduce some notation. For a set J ⊂ Ω_X = {0, . . ., l} of categories, we denote the mean of all observations y_j with x_j ∈ J as ȳ_J := (1/n_J) Σ_{x_j ∈ J} y_j, where n_J = #(x_j ∈ J) is the number of observations with value in J. Without loss of generality we will assume that #(x_j = i) > 0 for all i ∈ Ω_X, i.e. we will neglect categories without any observations.
Furthermore, we define the index set I ⊆ {1, . . ., l}, where i ∈ I signifies that the corresponding boundary condition for c is not active, i.e. c_i > 0, whereas for i ∈ {1, . . ., l}\I we have c_i = 0. Because of c ∈ ∆_l it is not possible that c_i = 0 for all i ∈ {1, . . ., l}. For any such index set I = {s_1, . . ., s_ν} the categories of X fall into blocks J_0, . . ., J_ν which share the same value under f. We adopt the common convention that Σ_{r=1}^{0} c_{s_r} = 0, so in particular J_0 contains all categories of X which are mapped to 0 under f.
Let I := P({1, . . ., l})\{∅} be the set of all nonempty index sets, I_1 := {I ∈ I : |I| = 1} and I_2 := I\I_1. The index sets in I_1 with one index correspond to the vertices of ∆_l, whereas the index sets of I_2 other than {1, . . ., l} correspond to the boundary manifolds of ∆_l. The minimum of (3) can occur either at the critical points of the manifolds indexed by I_2, or at the points given by I_1. The following lemma describes all potential critical points:

Lemma 2.1 Let I = {s_1, . . ., s_ν} ∈ I_2 and assume that ȳ_{J_0} ≠ ȳ_{J_ν}. If the vector (4) is strictly (3)-feasible (i.e. c^I_{s_i} > 0 for all i ∈ {1, . . ., ν}), then it is a critical point of (3). The corresponding value of the objective function equals (5).

Remark: Note that for the critical point in the interior of R² ⊗ ∆_l (i.e. for I = {1, . . ., l}) we obtain just the usual residual sum of squares of a one-way ANOVA. This will be the solution of (3) if the means ȳ_i over the different categories are monotonic in i. Otherwise we will have a solution on the boundary of R² ⊗ ∆_l, which means that some categories are joined together. In the one-sided situation strict feasibility additionally requires that b_1 > 0.
Proof. For any given I we have c_i > 0 for all i ∈ I and c_i = 0 for all i ∈ {1, . . ., l}\I. The residual sum of squares becomes

Σ_{i=0}^{ν} Σ_{j: x_j ∈ J_i} T_ij²,

where we use the abbreviation T_ij := b_0 + b_1 Σ_{r=1}^{i} c_{s_r} − y_j. Partial differentiation with respect to b_0, b_1, and c_i for i ∈ I leads to a system (6) of linear equations. Now for b_1 ≠ 0 we can conclude that Σ_{j: x_j ∈ J_i} T_ij = 0 for each i ∈ I, and it follows that (4) is a solution of (6).
Obviously this problem is intimately related to isotonic regression. To clarify the connection we introduce the following definition:

Definition 2.1 For a finite set X = {x_0, . . ., x_l} with simple order x_0 < x_1 < · · · < x_l, a function g on X is called isotonic if x_i < x_j implies g(x_i) ≤ g(x_j).

Let us write ȳ(i) := ȳ_{{i}} for the sample mean over each category and ȳ := (ȳ(0), . . ., ȳ(l))'. As stated in the remark after Lemma 2.1 this natural estimate ȳ gives the optimal solution of (3) when the order restrictions are fulfilled. Otherwise we have:

Theorem 2.1 For any sample (y_j, x_j), j ∈ {1, . . ., n}, a solution of (3) in the one-sided situation (b_1 ∈ R_+) is given by the isotonic regression ȳ* of ȳ with weights (n_0, . . ., n_l), which is defined as the minimizer of

Σ_{i=0}^{l} (ȳ(i) − g(i))² n_i   (7)

in the class of isotonic functions g on Ω_X.
A general introduction to isotonic regression can be found in Robertson, Wright, and Dykstra (1988) or in Barlow, Bartholomew, Bremner, and Brunk (1972), where Chapter 2 presents several algorithms for finding the solution of (7). Specifically, the pool-adjacent-violators algorithm (PAVA) finds the isotonic regression ȳ* by starting with ȳ and amalgamating means over adjacent categories until the pooled means fulfill the order restriction. These pooled means are just of the form ȳ_{J_i} as in Lemma 2.1, which provides another way to establish the equivalence of (7) and (3).
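The amalgamation process just described can be sketched in a few lines. The following is a minimal weighted PAVA for the nondecreasing case; it is an illustrative sketch with hypothetical names, not the implementation used in the paper:

```python
# Minimal pool-adjacent-violators (PAVA) sketch for the weighted isotonic
# regression (7): start from the category means and repeatedly merge adjacent
# blocks that violate the ordering, replacing them by their weighted mean.

def pava(means, weights):
    """Weighted isotonic (nondecreasing) regression of the category means."""
    # each block holds [weighted sum, total weight, number of categories]
    blocks = []
    for m, w in zip(means, weights):
        blocks.append([m * w, w, 1])
        # merge while the last two block means violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, w2, n2 = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += w2
            blocks[-1][2] += n2
    fitted = []
    for s, w, n in blocks:
        fitted.extend([s / w] * n)
    return fitted
```

For instance, the category means (1, 3, 2, 4) with equal weights violate monotonicity in the middle; PAVA pools the second and third categories into their mean 2.5, exactly the kind of amalgamated mean ȳ_{J_i} appearing in Lemma 2.1.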
Isotonic regression is more naturally formulated for the one-sided case, which corresponds to testing the null hypothesis

H_0 : µ_0 = µ_1 = · · · = µ_l

against the one-sided alternative

H_1 : µ_0 ≤ µ_1 ≤ · · · ≤ µ_l, with at least one strict inequality,

where µ_i denotes the mean of Y in category i. This test was first introduced by Bartholomew (1959a), where one can find among other things the distribution under H_0 of the test statistic χ̄²_{l+1} := σ^{−2} Σ_{i=0}^{l} (ȳ(i) − ȳ*(i))² n_i. In a subsequent paper Bartholomew (1959b) considered the two-sided alternative, which he dealt with by applying isotonic regression separately under the increasing and under the decreasing alternative. This situation can be handled more easily within our framework by minimizing (1) with b_1 ∈ R. Solutions can be found e.g. via Lemma 2.1 and a PAVA-like algorithm. The following example illustrates a peculiarity of the two-sided case. Thus, in the non-trivial case we obtain two solutions, one with negative and one with positive slope. If the dependent variable Y itself is discrete this kind of behavior might be quite undesirable. Of course if Y is continuous, then such a situation occurs only with probability 0, but still there is some instability with respect to small changes in the data of the dependent variable. This sort of problem becomes more transparent in the current context than in the formulation of two-sided isotonic regression.
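Bartholomew's strategy of fitting both monotone directions and comparing the fits can be sketched directly. The snippet below (our illustration, with hypothetical names; a small PAVA is included to keep it self-contained) fits the increasing and the decreasing isotonic regression and keeps the one with smaller weighted residual sum of squares:

```python
# Two-sided isotonic fit: run PAVA under the increasing and the decreasing
# order (the latter via negation) and keep the fit with the smaller weighted
# residual sum of squares.

def pava(means, weights):
    """Weighted nondecreasing isotonic regression via pool-adjacent-violators."""
    blocks = []
    for m, w in zip(means, weights):
        blocks.append([m * w, w, 1])
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, w2, n2 = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += w2
            blocks[-1][2] += n2
    return [s / w for s, w, n in blocks for _ in range(n)]

def twosided_isotonic(means, weights):
    """Return the better of the increasing and decreasing isotonic fits."""
    def rss(fit):
        return sum(w * (m - f) ** 2 for m, f, w in zip(means, fit, weights))
    inc = pava(means, weights)
    dec = [-v for v in pava([-m for m in means], weights)]
    return inc if rss(inc) <= rss(dec) else dec
```

Note that for data such as the means (1, 3, 2) the increasing fit wins, while for (3, 1, 2) the decreasing fit does; near a tie between the two directions the selected slope sign can flip under tiny perturbations, which is exactly the instability discussed above.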

Dependent Variable Ordinal
We next consider a metric variable X and an ordinal response with l + 1 categories, i.e. Ω_Y := {0, 1, . . ., l}. The corresponding model has the form

f(Y) = b_0 + b_1 X + ε,   (8)

where f ∈ F := {f : Ω_Y → [0, 1], nondecreasing, f(l) = 1} and ε is some error term. We actually have to demand f(l) = 1, because otherwise the whole approach becomes entirely meaningless: for decreasing f(l) the sum of mean squared errors would naturally also decrease. In the extreme case we would obtain the trivial solution f ≡ 0 and b_0 = b_1 = 0. As in the previous section we obtain a parameterized version by letting f(i) = c_1 + · · · + c_i with c ∈ ∆_l. In this section we will only consider the two-sided case, thus both b_0 and b_1 take values in R. Similar arguments as in Proposition 2.1 guarantee that a least squares estimate (b*_0, b*_1, c*) can be obtained by solving

min_{b_0, b_1, c ∈ ∆_l} Σ_{j=1}^{n} (g_j(c) − b_0 − b_1 x_j)²,   (9)

where g_j(c) := f(y_j) = c_1 + · · · + c_{y_j}. Actually, (9) is in some sense simpler than (3), because we now have to deal with a quadratic optimization problem, and not with a biquadratic one. However, the critical points for this model do not have such a straightforward interpretation, and they are qualitatively very different from those of (3). We will again use the notation of Lemma 2.1; furthermore, for an index set J ⊂ Ω_Y we define the variance of the x_j values with corresponding y_j ∈ J as var(x)_J. To solve (9) we can use the following lemma:

Lemma 3.1 Let I = {s_1, . . ., s_ν} ∈ I_2. If the vector (10) is (9)-feasible (i.e. c_{s_i} > 0 for all i ∈ {1, . . ., ν}), then it is a critical point of (9). The corresponding value of the objective function equals (11).

Proof. Similar to the proof of Lemma 2.1 we can write the residual sum of squares for fixed index set I as a double sum over the blocks J_i and the observations with y_j ∈ J_i, and setting the partial derivatives to zero yields the system (12). From the last ν − 1 equalities of (12) we obtain (13). The first equality of (12) gives (14), and using (13) the second equality becomes (15). Inserting (14) into (15) and some easy calculations finish the proof.
We want to remark that just as in Section 2 the solution of (9) is given by the minimum of (11) over the finite number of critical points (with indices I ∈ I_2) and over the vertices of ∆_l (with indices I ∈ I_1). In the latter case ν = 1 and all categories of Y are either mapped to 0 or to 1. As we will see later on, this is a rather untypical situation which will hardly occur in practice. The optimal values of b^I_0 and b^I_1 are still given by the corresponding formulas of (10).
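To make the structure of (9) concrete, here is a small numerical illustration for l = 2 (three categories of Y), where the only free increment is c_1 = f(1): for fixed c_1 the optimal b_0, b_1 are the ordinary least squares coefficients of f(y_j) on x_j, and a grid search over c_1 approximates the minimizer. This brute-force route and its function names are our own illustration, not the paper's closed-form solution via the critical points of Lemma 3.1:

```python
# Problem (9) for l = 2: f(0) = 0, f(1) = c_1, f(2) = 1, so only c_1 in
# [0, 1] must be chosen. For fixed c_1, regress z_j = f(y_j) on x_j by OLS
# and record the residual sum of squares; then search a grid over c_1.

def rss_for_c1(x, y, c1):
    f = {0: 0.0, 1: c1, 2: 1.0}
    z = [f[yi] for yi in y]
    n = len(x)
    xbar, zbar = sum(x) / n, sum(z) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxz = sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z))
    b1 = sxz / sxx
    b0 = zbar - b1 * xbar
    return sum((zi - b0 - b1 * xi) ** 2 for zi, xi in zip(z, x))

def best_c1(x, y, grid_size=1000):
    grid = [i / grid_size for i in range(grid_size + 1)]
    return min(grid, key=lambda c1: rss_for_c1(x, y, c1))
```

For data in which the middle category sits exactly halfway between the extremes in X, the search recovers c_1 = 0.5, i.e. equidistant category scores; skewed data pull c_1 towards 0 or 1, merging the middle category with one of the extremes as in the vertex case discussed above.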
The usual simple regression line has the property that the point (x̄, ȳ) lies on it. Here this is typically not the case, but (10) shows that the regression line goes through (x̄_{J_0 ∪ J_ν}, ȳ_{J_0 ∪ J_ν}). The formula for the slope b^I_1 resembles simple regression considering only J_0 and J_ν, but the variances of J_1, . . ., J_{ν−1} add to the denominator, and thus the regression line gets flatter. Note that for b^I_1 > 0 the regression line defined by (10) will lie above the point (x̄_{J_0}, 0) and below (x̄_{J_ν}, 1). In absolute terms its slope is smaller than that of a line running through those two points. The larger the variances var(x)_{J_i}, the flatter the line.
Assume for a moment that we drop the order restriction implied by F and also allow non-monotonic f. Clearly in the unrestricted case an optimal solution of (9) is given by R(b^I_0, b^I_1, c^I) as in (11) for I = {1, 2, . . ., l}. In the terminology of isotonic regression this is again a natural estimate, and the solution under order restrictions is obtained by isotonization. This leads to the problem of finding the I which minimizes R(b^I_0, b^I_1, c^I) while maintaining the order relation, which can be accomplished as in the previous section by applying a PAVA. For a more general discussion of algorithms of this kind we refer to Best and Chakravarti (1990).

Inference
If we want to construct a statistical test based, for example, on the absolute value of b*_1, we have to face the problem of how to specify the error term ε. This is not particularly straightforward; actually even the specification of H_0 and H_1 is not as obvious as in Section 2. Taking into account that Y is an ordinal variable we may simply consider the case where Y has a discrete distribution with l + 1 levels and probabilities p_0, . . ., p_l. We can then formulate the null hypothesis

H_0 : p_i(x) = p_i for all x and all i ∈ {0, . . ., l},

where X might be considered as a random variable or as given data. Alternative hypotheses will involve dependence of Y on X with some kind of inherent order. Denoting by q_i(x) := p_0(x) + · · · + p_i(x) the cumulative probabilities for given X, one might consider e.g. the two-sided alternative

H_1 : all q_i(x) either increase or decrease with x.
The following theorem deals with consistency; under H_0 the regression line converges almost surely towards a constant.

Theorem 3.1 Let (X_i) and (Y_i) be i.i.d. sequences of independent random variables, X_i with finite mean µ, finite variance σ² > 0, and Y_i discrete with l + 1 levels and probabilities p_0, . . ., p_l. Furthermore let (b^(n)_0, b^(n)_1, c^(n)) be the solution of (9) for the first n members of the random sequences. Then

b^(n)_1 → 0, b^(n)_0 → b^lim_0, c^(n) → c^lim almost surely,   (16)

with the limits b^lim_0 and c^lim given below.

Proof. Note that due to the strong law of large numbers x̄_{J_i} → µ almost surely for any J_i. Because of σ² > 0 the variance terms in the denominator of b^I_1 are bounded away from zero for any I, and we immediately obtain that b^(n)_1 → 0 almost surely. Next consider that the scaled objective function of (9) formally converges to the limit (17).

Austrian Journal of Statistics, Vol. 39 (2010), No. 3, 182-202

It is obvious that the minimum of (17) is obtained at b^lim_0 := p_l(p_0 + p_l)^{−1} and c^lim := (p_l(p_0 + p_l)^{−1}, 0, . . ., 0, p_0(p_0 + p_l)^{−1}), and it remains to argue that b^(n)_0 and c^(n) converge almost surely towards this limit.
From (13) it is evident that lim sup_n b^(n)_0 is finite, and together with the fact that by our assumptions the first two moments of X are bounded we conclude that the convergence in (17) is almost surely uniform over the compact set b_0 ∈ [0, 1], c ∈ ∆_l. By standard arguments we then conclude that b^(n)_0 and c^(n) converge almost surely to the minimizer of (17).

So asymptotically b*_1 converges towards 0, and it is easy to see that E(b*_1) = 0 also for finite n. Therefore one might consider constructing a statistical test based on b*_1. For fixed I the numerator of b_1 in (10) can be rewritten as (18). Due to the central limit theorem, x̄_{J_i} is asymptotically normal for any J_i, and under H_0 the x_j corresponding to different J_i are independent, because by definition all J_i are disjoint. Unfortunately independence between x̄_{J_0} and x̄_{J_ν} is not given, because n_{J_0} and n_{J_ν} are dependent random variables. However, it is still possible to obtain the asymptotic distribution for fixed I; a formal proof is provided in the appendix. Using the notation S²_I for the denominator of b_1, we define the test statistic (19). Note that S²_I/σ² is asymptotically χ²-distributed with n − ν degrees of freedom, and therefore t_I will be approximately t-distributed with n − ν degrees of freedom.
Of course what we are actually interested in is t_{I*}, the test statistic corresponding to b*_1, the optimal solution of (9). By conditioning we immediately see that the distribution of t_{I*} will be a mixture of t-distributions. As in the case of isotonic regression one could try to determine the level probabilities for each I, but we will not pursue this direction here. For testing one might consider the quantile of a t-distribution with n − 1 degrees of freedom as an upper bound, and with n − l degrees of freedom as a lower bound, for the true critical value.
We want to emphasize that in the given form the procedure roughly resembles a t-test between the X-values corresponding to the largest and the smallest level of Y; as a consequence of Theorem 3.1 we can be almost sure for large n under H_0 that J_0 = {0} and J_ν = {l}, even when we cannot be certain what I* will look like exactly. We illustrate this behavior with a simple example, where we use step functions to simulate models under the alternative hypothesis H_1.
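As a point of reference for this observation, the two-sample comparison between the x-values in the extreme categories can be written down directly. The sketch below computes a plain pooled-variance t statistic; it is our own illustration of the benchmark the procedure resembles, not the statistic t_{I*} itself:

```python
import math

# Pooled-variance two-sample t statistic between the x-values whose
# y equals the lowest category ("low") and those whose y equals the
# highest category ("high").

def extreme_category_t(x, y, low, high):
    a = [xi for xi, yi in zip(x, y) if yi == low]
    b = [xi for xi, yi in zip(x, y) if yi == high]
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    pooled = (sum((v - ma) ** 2 for v in a)
              + sum((v - mb) ** 2 for v in b)) / (na + nb - 2)
    return (mb - ma) / math.sqrt(pooled * (1 / na + 1 / nb))
```

Observations in the intermediate categories never enter this statistic, which is precisely the loss of information analyzed in Example 3.1 below.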
Example 3.1 Let X be uniformly distributed on [0, 1] and Y categorical with five levels, i.e. l = 4. We generate test examples where the cumulative probabilities q_i(x) are non-increasing step functions with jumps at x = 0.1, 0.2, 0.8, 0.9. In other words we partition [0, 1] into five intervals M_0, . . ., M_4 and define on each of those intervals a different discrete distribution for Y, such that there is an order with respect to X. Note that non-increasing q_i(x) lead to positive correlation between X and Y.
The joint distribution of X and Y can be specified by a 5 × 5 matrix P, where P(i, j) is the probability that Y = i − 1 given that X ∈ M_{j−1}. We consider matrices of the form P = (1/5)E + δB, where E is the all-ones matrix, B is some fixed matrix specifying the effect of a particular alternative, and δ is a scalar parameter. For δ = 0 we obtain exactly the null hypothesis H_0, where Y is uniformly discrete. Letting δ grow we can study the power of (19), which we compare with simple linear regression (keeping the levels of Y fixed) and with logistic regression using the proportional odds assumption (McCullagh, 1980).
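A sketch of this simulation design follows. The break points are those of Example 3.1; everything else (the function names, and the requirement that the columns of B sum to zero so that each column of P remains a probability distribution) is our own illustrative framing, not the specific B_1, B_2, B_3 of the paper:

```python
import random

# Jump points of the step functions q_i(x) as in Example 3.1; they split
# [0, 1] into the five intervals M_0, ..., M_4.
BREAKS = [0.1, 0.2, 0.8, 0.9]

def interval_index(x):
    """Return j such that x lies in M_j."""
    return sum(x >= b for b in BREAKS)

def draw_pair(delta, B, rng):
    """Draw one (X, Y) pair from P = (1/5)E + delta * B.

    B must be a 5x5 matrix whose columns sum to zero, so that each
    column of P is a probability distribution over the five levels of Y.
    """
    x = rng.random()
    j = interval_index(x)
    probs = [0.2 + delta * B[i][j] for i in range(5)]
    y = rng.choices(range(5), weights=probs)[0]
    return x, y
```

Repeating `draw_pair` n = 100 times per replicate and applying each test to the resulting sample gives the Monte Carlo power estimates of the kind reported in Figure 2.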
We will consider three instances B_1, B_2, and B_3. Figure 1 shows the cumulative probabilities q_i(x) for δ = 0.09. In Figure 2 we provide the corresponding power functions obtained by simulating 10000 replicates with n = 100 for values of δ ranging from 0 to 0.1. Using the quantiles of a t-distribution with 99 degrees of freedom as critical values leads to reasonable asymptotic behavior: at a significance level of α = 0.05 and under the null hypothesis, 5.16% of all test statistics lie above the threshold. For B_1 the new procedure is only slightly less powerful than simple regression, for B_2 it is much more powerful, whereas for B_3 it cannot compete at all. In all three cases logistic regression based on the proportional odds assumption turned out to be strictly less powerful than simple regression.
The results of Example 3.1 are easy to understand when we consider that the new procedure is very similar to applying a two-sample t-test between the data x_j with corresponding y_j = 0 and y_j = l, respectively. Essentially none of the information contained in the middle three rows of the B_i enters into the decision process. This does not have a strong effect for instance B_1, though some power is lost by neglecting the information of the second and the fourth row. On the other hand, in case of B_2, considering only the three intermediate levels one observes an inverse trend, and discarding this information actually gives larger power. Instance B_3 is the other extreme, where the dependence between X and Y is entirely due to the three intermediate levels. It is actually quite interesting to observe that the procedure has some very small power to detect large effects and is therefore not entirely equivalent to a two-sample t-test. But we conclude that a test based on t_{I*} loses too much information to be of general interest, except in fairly particular situations.

Penalizing
One possible improvement consists in adding penalties for values of c_2, . . ., c_{l−1} getting too small, and for c_1 and c_l getting too large. To retain a quadratic optimization problem one might consider penalties of the form K_i(c_i − γ_i)², i ∈ {1, . . ., l}. To decide how to choose the K_i we note that under H_0 the residual sum of squares (11) of the unpenalized model converges asymptotically towards R* = n_0 n_l (n_0 + n_l)^{−1}. Thus if K_i = o(n) the penalties will hardly have any influence for large n, whereas if K_i grows faster than n the contribution of the penalty terms for c_i ≠ γ_i will dominate R*, which enforces c_i ≈ γ_i for large n. So the most interesting situation arises if the K_i are actually of order n, i.e. K_i = λ_i n.
We will present here a particularly simple penalization scheme with all γ_i set to 0; for the discussion of a more general penalization scheme we refer to Frommlet (2008). The crucial idea is to choose the vector λ = (λ_1, . . ., λ_l) in such a way that under the null hypothesis c_i → c^lim_i for some prespecified values c^lim_i. These asymptotic values can be interpreted as a subjective expression of how close one believes different categories to be.

Choosing for example c^lim_i = 1/l for all i amounts to the prior belief that all categories are equidistant, which relates to a simple regression model. However, depending on the actual data the values of c_i are not fixed, which distinguishes the method from simple weighted regression. The original test statistic (19) based on Torra et al. (2006) corresponds to the rather unusual prior belief that all categories except for the two extreme ones are more or less identical.
By adding the penalty terms we lose the simple structure of (9) which allowed us to give explicit solutions for b_0 and b_1 and a simple recursion formula to compute c. However, we can still follow the approach of Lemma 3.1 to obtain critical points. In the typical situation where no merging occurs for Y = 0 and Y = l, i.e. for I = (1, s_2, . . ., s_{ν−1}, l), we obtain expressions for b^I_0 and b^I_1 which both depend on c^I and can in general not be expressed in a simple way. In any case all Ψ_i are bounded and therefore consistency still holds: b^(n)_1 → 0 for n → ∞ under H_0. However, a general result like (16) in Theorem 3.1 cannot be established, because everything now depends on the choice of the penalty parameters. Still we obtain the asymptotic distribution of the scaled numerator of b^I_1. For the sake of convenient notation we write x̄_0 := x̄_{J_0} as well as x̄_l := x̄_{J_ν}, and thus obtain a central limit theorem in which Ω denotes the variance contribution from the penalty terms and Ω^lim its asymptotic limit. This leads immediately, for given I, to a test statistic which generalizes (19) by taking the penalties into account and which is again approximately t-distributed with n − ν degrees of freedom. To choose the penalties λ_i we assume that in the asymptotic limit under H_0 no merging between categories occurs, i.e. I^lim = {1, 2, . . ., l}. For given c^lim_1, . . ., c^lim_l the penalties λ_i are computed by solving a system of linear equations in which b^lim_0 = p_l/(p_0 + p_l) − (λ_1 c^lim_1 − λ_l c^lim_l)/(p_0 + p_l). This system has l − 1 equations for l penalty parameters. To fully determine the penalties we add an equation fixing the overall weight K of the penalty terms. According to simulations the actual choice of K has no strong influence on the results, and we choose K = l. Using this penalty scheme we study again Example 3.1. To assess the effect of various parameters on the statistical power of the given model we study three cases:
Model 1: c^lim_1 = c^lim_2 = 0.25, which corresponds to simple linear regression;
Model 2: c^lim_1 = 0.4, c^lim_2 = 0.1, which is closer to the model without penalty;
Model 3: c^lim_1 = 0.1, c^lim_2 = 0.4, and according to symmetry c^lim_4 = c^lim_1 and c^lim_3 = c^lim_2.

Table 1 provides the power of the various models at significance level α = 0.05 and for δ = 0.1, the largest effect parameter in our simulations. We used the same random instances as in the previous simulations. In all simulations the asymptotic approximation worked quite well, and one can rely on the test statistic being t-distributed under the null hypothesis. For sample sizes much smaller than n = 100 one might prefer permutation tests or bootstrap sampling to threshold levels based on asymptotics. As expected, for Model 1 with c^lim_i = 0.25 the results are fairly close to simple regression for all three instances B_1, B_2, and B_3. In particular the advantage of the new approach for instance B_2 is lost by introducing this kind of penalty. For Model 2, where c^lim_1 = 0.4 and c^lim_2 = 0.1, the general behavior is similar to the model without penalties, though a little less extreme; the model without penalties corresponds to c^lim_1 = 0.5 and c^lim_2 = 0. Compared to simple regression, Model 2 performs slightly worse for instance B_1, much better for instance B_2, and has smaller power for instance B_3. Finally, Model 3 shows exactly the opposite behavior: it outperforms simple regression for instance B_3, but does not work well for instance B_2. Figure 2 illustrates the general behavior of the power for the various models.

Gene expression analysis
We will now use our approach to analyze publicly available gene expression data from a study on prostate cancer (Singh et al., 2002). Based on microarray experiments the expression levels of 12600 genes are compared with the Gleason score, which evaluates how effectively cancer cells are able to structure themselves. A Gleason score between 2 and 4 means well differentiated cells, a score between 5 and 6 describes intermediate differentiation, a score of 7 is intermediate to badly differentiated, and a score between 8 and 10 means badly differentiated or undifferentiated tumors. The study included 52 patients, 26 with score 6 and 20 with score 7. The 6 patients with score larger than 7 were merged to form the top level.
Figure 2: Results for the three instances of Example 3.1: the power of the ordinal regression procedures with penalties (Models 1, 2, and 3) is compared with simple regression and with the ordinal regression procedure without penalties. The order of performance is reflected by the order in the legends.

In the original analysis of Singh et al. (2002) the Gleason score was treated as a metric variable. It is indicated there that significant association was determined by permutation tests with respect to Pearson correlation coefficients. However, 29 genes are reported with p-values smaller than 0.001, and it would appear that these are ordinary p-values rather than p-values based on permutation tests. Furthermore these 29 genes are presented in such a way that 8 of them are not at all identifiable in the original list of 26000 genes, and only 18 can be determined with certainty. For several other genes there are two possible interpretations of the given specification. We base our comparison with the original results on the 18 clearly identifiable genes and include, where possible, the ambiguous ones in our discussion.
In Chu et al. (2005) it was pointed out that the Gleason score is clearly an ordinal variable, and thus predicting the score from gene expression data is a typical problem of ordinal regression. Chu et al. (2005) develop an ordinal regression procedure based on Gaussian processes. They assume an unobservable latent function f(X_i) ∈ R associated with the gene expression levels X_i, where the categories of the ordinal scale correspond to intervals on the real line, and the specific value of Y_i is determined by f(X_i). They use a Bayesian approach for prediction of categories, and based on leave-one-out cross-validation they determine a set of 21 genes to predict the Gleason score.
For the genes selected by Chu et al. (2005) the serial numbers of the original data are given, which makes a comparison of results much easier. It is remarkable that not a single gene was selected both by Singh et al. (2002) and by Chu et al. (2005), which puts the results of both manuscripts somewhat into perspective. Clearly the focus of Chu et al. (2005) was on prediction, whereas Singh et al. (2002) applied a multiple testing approach, but the discrepancy between their results is quite disturbing.
We will reanalyze the data applying simple linear regression as well as our ordinal regression procedure, both without penalties and with the penalties of Models 1, 2, and 3 from the previous section.

Table 2: Results from microarray data analysis. The first column gives the serial numbers of the original data; bold print indicates genes that were significant under the permutation test for at least one model. The next five columns give unadjusted p-values for ordinal regression (based on the asymptotic approximation) as well as for linear regression; only p-values with p ≤ 0.0005 are listed. Genes are ordered according to which models had p ≤ 0.0005.
Symbols in the table indicate significance under permutation tests at level α = 0.05 and at level α = 0.1, respectively. The last two columns specify genes which were also detected by Chu et al. (2005) and Singh et al. (2002), respectively. The column 'Chu #' gives the serial number in Table 6 of Chu et al. (2005); the column 'Singh' gives the annotation provided in Figure 1 of Singh et al. (2002).
Table 2 provides the results of our analysis. Unadjusted p-values based on asymptotic theory are provided for all genes with p ≤ 0.0005. This threshold was chosen such that comparison with the results of Singh et al. (2002) and Chu et al. (2005) becomes possible. For a proper treatment of multiple testing, permutation tests were performed. Based on 10000 random permutations of the Gleason score variable, test statistics for the five models were computed and the maximum over the 12600 genes was taken. Threshold values were then obtained for each of the 5 models from the upper tail quantiles, controlling at α = 0.05 and α = 0.1. Test statistics exceeding those threshold values are marked accordingly in Table 2. Based on the permutation tests we would recommend considering only five (α = 0.05) or six (α = 0.1) genes as significant. Interestingly, four of them were also detected by Chu et al. (2005), though none of them by Singh et al. (2002).

Figure 3: Gleason score data for three genes: gene expression levels on the x-axis vs. Gleason score (GS) on the y-axis. For better visibility of the data random noise was added to GS. The symbol + gives the mean values of x within groups.
Concerning the original results reported by Singh et al. (2002), only three of the 29 reported genes show up in Table 2, one of them twice, because the description 'Collagen type I, alpha-2' fits both gene #1666 and gene #1667. Taking into account that the statistical analysis described in Singh et al. (2002) is fairly similar to the linear regression approach, it is difficult to understand how the results presented in their Figure 1 were actually obtained. Among the reported genes which were identifiable, in particular #2741 (HSU66684), #4756 (STE20-like kinase), #8935 (acetolactate synthase) and #9153 (cyclin H) showed no association with the Gleason score at all. It is not clear whether the particular results presented in Figure 1 of Singh et al. (2002) are actually based on the data published at http://www.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=75. It is certain, however, that the work of Chu et al. (2005) is based on the very same data. Figure 3(a) shows the measurements for gene #8803, one of the two genes which we consider important but which was not reported in Chu et al. (2005). There appears to be a rather strong positive trend, and the graph for gene #12092 looks very similar. One reason why those two genes might have been missed by Chu et al. (2005) is that they take a model-based approach, and as usual in microarray data there is a large amount of multicollinearity. Specifically, both genes #8803 and #12092 are negatively correlated with gene #583, which has the strongest association with the Gleason score. In this particular setting it would appear rather desirable to learn about all genes which are related to the Gleason score, and missing two important genes speaks against the algorithm of Chu et al. (2005).
On the other hand, Figure 3(b) presents data for the second of the 21 genes reported by Chu et al. (2005). From visual inspection it is difficult to argue that gene #7714 has any influence on the Gleason score at all. Among the 21 reported genes there are 5 which are entirely unrelated to GS (genes #7714, #9264, #7049, #9878 and #11233). All other reported genes had unadjusted p-values smaller than 0.05 in at least one of the considered models, but except for those listed in Table 2 no reported gene had p-values smaller than 0.0001.
We have already mentioned that the expression levels of the first three genes (#583, #8803, #12092) selected under permutation tests were strongly correlated. Their data suggest a trend where larger expression levels of gene #583 (as well as smaller expression levels of #8803 and #12092) are associated with less cell differentiation. This is best captured by Model 2 and the linear regression model. It is striking that under permutation tests none of the three genes was detected by the model without penalties. On the other hand, genes #4325 and #10787, whose expression levels were also strongly correlated, were only detected by Model 1 and the model without penalties. Figure 3(c) explains why this is the case: expression levels for GS = 6 and GS = 7 appear practically identical, but 4 of the 6 individuals with extremely bad cell differentiation had particularly large expression levels of genes #4325 and #10787. Both of these genes were also found by Chu et al. (2005), but our analysis makes it quite clear that for these two genes the influence on GS is qualitatively much different from that of the previous three genes.

Conclusion
In this article we have analyzed the statistical properties of an approach to ordinal regression introduced by Torra et al. (2006), where ordinal variables are mapped into the interval [0, 1]. The authors claim that fixing these mappings a priori would lead to a bias in the resulting models, and they therefore suggest estimating them by least squares optimization. We have shown in Section 2 that in the case of an ordinal explanatory variable the approach actually works, though it coincides with the well-known procedure of isotonic regression.
In Section 3 we have seen that if the dependent variable is ordinal, then the least squares estimation of the maps f in (8) does not at all lead to an "unbiased" situation. Asymptotically the procedure is equivalent to choosing the map f(i) = p_l/(p_0 + p_l) for 0 < i < l, a choice which in most situations leads to a significant loss of power compared, e.g., to simple regression, which is equivalent to the choice f(i) = i/l. Torra et al. (2006) actually introduced their new approach to ordinal regression for the case where both the dependent and the explanatory variables are ordinal; the shortcomings in this specific situation are discussed in Frommlet (2008).
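The contrast between the two maps can be made explicit. The following is a sketch in the paper's notation, with ordinal levels 0, …, l, p_i the probability of level i, and the usual normalization f(0) = 0, f(l) = 1 assumed for a map into [0, 1]:

```latex
% Asymptotic least squares choice: all interior levels collapse to a
% single constant, so only the extreme categories 0 and l carry
% information -- effectively a two-sample comparison between them.
\[
  f_{\mathrm{LS}}(0) = 0, \qquad
  f_{\mathrm{LS}}(i) = \frac{p_l}{p_0 + p_l} \quad (0 < i < l), \qquad
  f_{\mathrm{LS}}(l) = 1 .
\]
% Simple regression with equidistant scores instead uses
\[
  f_{\mathrm{lin}}(i) = \frac{i}{l}, \qquad i = 0, 1, \dots, l ,
\]
% which spreads the interior levels evenly over [0, 1] and hence lets
% the intermediate categories contribute to the test statistic.
```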
In summary, the procedure of Torra et al. (2006) cannot really be recommended for ordinal dependent variables. An improvement based on penalties was suggested and evaluated in a small simulation study. In principle, choosing such penalties based on asymptotic considerations is not much different from choosing a specific map f a priori. Different penalties are suitable for different alternative hypotheses, which can be described in terms of the relative proximity of adjacent ordinal scales. This was made transparent by Example 3.1 as well as by the analysis of the microarray data.
Specifically, in the gene expression data analysis the type of model for which a gene is significant has a direct interpretation: genes selected only by Model 1 are significantly differentially expressed only for Gleason score GS > 7, genes selected by Model 2 tend to show a trend, etc. In general our analysis suggests that the numbers of important genes reported both in Singh et al. (2002) and in Chu et al. (2005) are rather optimistic. In particular the findings of the original study Singh et al. (2002) can hardly be confirmed.
Based on the data we would suggest the following slightly more modest conclusions: three genes (#583, #8803, #12092) have a general influence on cell differentiation. Gene #583 (RET finger protein-like 3) is known to be related to oncogenic activity (Chu et al., 2005), and its role in the regulation of growth or differentiation of different cell types has been discussed for a long time (Tezel et al., 1999). Gene #8803 is known to express insulin-like growth factor-binding protein-3, and gene #12092 expresses apolipoprotein E. For two genes (#4325, #10787), both known to be related to the development of drug resistance, expression in patients with GS > 7 was significantly enhanced. Finally, for gene #6118, whose functional role is not yet known, expression was somewhat larger for better differentiated cells. Although one might expect that more genes have an influence on the differentiation of tumor cells, the evidence from the data does not appear to be particularly strong, which is largely due to the rather small sample size. We would therefore suggest that all other genes reported in Singh et al. (2002) and Chu et al. (2005) should be considered only with great caution.
In this article we have only considered the least squares approach suggested by Torra et al. (2006). As an alternative to the presented penalization scheme one might consider a likelihood ratio test. To this end it is necessary to specify a probabilistic model. The simplest scenario is to require that X, conditional on each level of Y, is normally distributed with fixed variance σ². It is then easy to show that under these assumptions the ML estimate and the least squares estimate with regard to (8) coincide. However, to construct a likelihood ratio test we also need the likelihood function under the null hypothesis, but for b_1 → 0 the likelihood flattens out and converges towards 0. We are confronted with the situation that under the null hypothesis the parameters c are not identifiable and have to be treated as nuisance parameters. It might be an interesting topic for further research to address this particular problem of inference where some nuisance parameters are not identified under the null hypothesis.
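The identifiability problem just described can be sketched as follows. This is a sketch under our reading of (8), namely a conditional mean of the form b_0 + b_1 f(i), with the nuisance parameters c taken to be the values of the map at the interior levels; (8) itself is not reproduced here:

```latex
% Conditional normal model for X given the ordinal response Y with
% levels 0, ..., l:
\[
  X \mid Y = i \;\sim\; \mathcal{N}\!\bigl(b_0 + b_1 f(i),\, \sigma^2\bigr).
\]
% Maximizing the normal log-likelihood over (b_0, b_1, f) is equivalent
% to minimizing the residual sum of squares, so ML and least squares
% estimates coincide.  Under the null hypothesis b_1 = 0 the map f
% drops out of the mean b_0 + b_1 f(i): the interior values of f are
% not identifiable, and the standard chi-square asymptotics for the
% likelihood ratio statistic no longer apply.
```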

Table 1 :
Power of various models for three instances of Example 3.1 at effect size δ = 0.1