Bayesian Variable Selection for Multiclass Classification using Bootstrap Prior Technique

In this paper, the one-way ANOVA model and its application to Bayesian multiclass variable selection are considered. A full Bayesian bootstrap prior ANOVA test function is developed within the framework of parametric empirical Bayes. The test function is then used for variable screening in a multiclass classification scenario. The proposed method is compared with the existing classical ANOVA method on simulated and real-life gene expression datasets. The results reveal a lower false positive rate and higher sensitivity for the proposed method.


Introduction
In machine or statistical learning, a multiclass classification problem involves grouping observed samples into three or more classes. Variable selection in multiclass classification is the process of identifying a relevant subset of input variables that can improve the performance of a multiclass classifier. The dual task of classification and subset selection often arises in biological and medical applications, especially in genomic studies (Liu, Bensmail, and Tan 2012). High-dimensional data with a large number of input variables and a small sample size are often observed in genomic studies. Many variable selection methods for the multiclass classification task have been developed within both the Bayesian and classical frameworks.
In a broader sense, variable selection methods can be divided into three types: filter, wrapper and embedded. Methods that are independent of the classifier are referred to as filters. Filter methods are fast and easy to apply but tend to select irrelevant variables. Wrapper methods are similar to filters except that the selection procedure is based on scores generated from a predetermined classifier. Their major drawback is that they are classifier sensitive, which implies that the selected subsets are often not optimal. Embedded methods are a hybrid of wrappers and filters, with the filter or wrapper applied at the training stage before proceeding to the classification stage (Guyon, Weston, Barnhill, and Vapnik 2002; Guyon and Elisseeff 2003; Guyon, Gunn, Nikravesh, and Zadeh 2008; Peng, Wu, and Jiang 2010). Filter methods have been applied in multiclass classification, especially for high-dimensional datasets, because of their fast computational time. Forman (2003) used the Chi-square method, while Wright and Simon (2003) used the classical one-way ANOVA method for preliminary variable selection. Both approaches rank the Chi-square or F statistic in descending order, with the top variables taken as the best subset. Alternatively, the p-values of the statistics may be reported, and variables with p-values lower than a threshold, say 0.05, are selected as the best candidates for the subsequent classification task (Hwang, Lee, and Park 2017). The classical one-way ANOVA method, used by authors such as Guyon and Elisseeff (2003), Wright and Simon (2003) and Qureshi, Oh, Min, Jo, and Lee (2017), among others, suffers from loss of information (Bertolino, Piccinato, and Racugno 1990; Solari, Liseo, and Sun 2008). Solari et al. (2008) worked on Bayesian one-way ANOVA as a safe haven from the loss-of-information issue. They suggested an objective prior through the Bayes factor as an alternative approach to handling the one-way ANOVA model. Objective Bayes methods (letting the data speak for themselves) are not necessarily better than the classical approach, as they are often used when subjective priors are difficult to compute or elicit (Yahya, Olaniran, and Ige 2014; Olaniran, Olaniran, Yahya, Banjoko, Garba, Amusa, and Gatta 2016; Olaniran and Yahya 2017; Olaniran and Abdullah 2018; Olaniran, Abdullah, Pillay, and Olaniran 2018). Thus, in this paper, we develop a Bayesian one-way ANOVA test function using the bootstrap prior of Olaniran and Yahya (2017) for variable selection in multiclass classification problems.

One-way ANOVA multiclass variable selection
Suppose we have the training dataset $[\tau_t, y_{t1}, y_{t2}, \ldots, y_{tp}]$, $t = 1, 2, \ldots, n$, where $\tau_t$ is a categorical outcome that assumes $i = 1, 2, \ldots, k$ values and $y_t$ is the vector of continuous input variables. The one-way ANOVA multiclass variable selection takes each input variable $y_t$ as response and $\tau_t$ as treatment effect in the one-way ANOVA model

$$y_{ij} = \mu + \tau_i + \epsilon_{ij}, \qquad (1)$$

where $y_{ij}$ is now the response variable of interest, $\mu$ is the overall mean, $\tau_i$ is the effect of the $i$th categorical predictor and $\epsilon_{ij}$ is the residual error, distributed $N(0, \sigma^2)$. The traditional analysis of variance focuses on testing the hypotheses

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_k \quad \text{against} \quad H_1: \mu_i \neq \mu_j$$

for at least one pair $i \neq j$.
The classical approach to testing the hypotheses relies on the statistic

$$F_c = \frac{\sum_{i=1}^{k} n_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2 / (k-1)}{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i\cdot})^2 / (n-k)}, \qquad (2)$$

which follows an $F_{k-1,\,n-k}$ distribution under $H_0$ (Solari et al. 2008).
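As a concrete illustration (not part of the original derivation), the statistic in (2) can be computed directly from grouped data; the helper below is a minimal pure-Python sketch.

```python
def anova_f(groups):
    """Classical one-way ANOVA F statistic F_c for a list of groups.

    F_c = [between-group SS / (k-1)] / [within-group SS / (n-k)].
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]
    # Between-group (treatment) sum of squares, k-1 degrees of freedom
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group (error) sum of squares, n-k degrees of freedom
    ssw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    return (ssb / (k - 1)) / (ssw / (n - k))

# Three groups whose means differ by one unit each
print(anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))  # -> 3.0
```

In a filter-style screen, this statistic (or its p-value) would be computed once per input variable, using the class labels as the grouping factor.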
Model (1) can also be reparameterized into a regression format with the categorical effect predictors treated as dummy variables. Let $1_p$ and $0_p$ be the $p \times 1$ vectors of 1s and 0s. Thus the reparameterized model in matrix notation is

$$Y = X\beta + \epsilon,$$

where $X$ is the design matrix of dummy variables and $\beta = (\mu, \tau_1, \ldots, \tau_k)'$.
The Maximum Likelihood Estimate for $\beta$ follows from the likelihood

$$L(\beta, \gamma \mid Y, X) \propto \gamma^{n/2} \exp\left\{-\frac{\gamma}{2}(Y - X\beta)'(Y - X\beta)\right\}, \qquad (5)$$

where $\gamma = \sigma^{-2}$ is the model precision, which yields

$$\hat\beta = (X'X)^{-1}X'Y. \qquad (6)$$
It is pertinent to note that the design matrix $X$ is not of full rank; one solution is to reparameterize such that $\sum_{i=1}^{k} \tau_i = 0$. Thus $X$ becomes $X^*$, with the rows of $X^*$ corresponding to the $\tau_k$ observations replaced with $-1$ and the column for $\tau_k$ omitted completely from $X^*$. Therefore, the classical ANOVA follows from (6) with $X$ replaced by $X^*$, i.e. $\hat\beta^* = (X^{*\prime}X^*)^{-1}X^{*\prime}Y$.
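To make the sum-to-zero (effect) coding concrete, the sketch below builds $X^*$ for a small balanced example; the column layout (intercept followed by $k-1$ effect columns) is an assumption consistent with the description above.

```python
def effect_design_matrix(labels, k):
    """Build the sum-to-zero (effect-coded) design matrix X*.

    Each row is [1, d_1, ..., d_{k-1}]: d_i = 1 for class i, d_i = 0
    otherwise, and every d_i = -1 for the last class k, which enforces
    the constraint sum(tau_i) = 0.
    """
    rows = []
    for lab in labels:  # labels take values 1..k
        if lab == k:
            rows.append([1] + [-1] * (k - 1))
        else:
            rows.append([1] + [1 if i == lab else 0 for i in range(1, k)])
    return rows

X_star = effect_design_matrix([1, 1, 2, 2, 3, 3], k=3)
for row in X_star:
    print(row)
# in a balanced design every effect column sums to zero
```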

Bayesian one-way ANOVA
The exponent term in (5) can be rearranged so that we have

$$L(\beta, \gamma \mid Y, X) \propto \gamma^{n/2} \exp\left\{-\frac{\gamma}{2}\left[(\beta - \hat\beta)'X'X(\beta - \hat\beta) + (Y - X\hat\beta)'(Y - X\hat\beta)\right]\right\}. \qquad (8)$$

The natural conjugate prior for the likelihood in (8) is the normal-gamma prior given by

$$\beta \mid \gamma \sim N(\beta_0, \gamma^{-1}\Sigma_0) \qquad (9)$$

and

$$\gamma \sim \text{Gamma}\left(\frac{v_0}{2}, \frac{v_0 s_0^2}{2}\right). \qquad (10)$$

The posterior distribution $p(\beta, \gamma \mid Y, X)$ can be obtained from the standard Bayes formula

$$p(\beta, \gamma \mid Y, X) = \frac{p(Y \mid \beta, \gamma, X)\, p(\beta, \gamma)}{p(Y \mid X)}. \qquad (11)$$

For simplicity, the denominator of (11) is often dropped, so that the posterior is of the form

$$p(\beta, \gamma \mid Y, X) \propto p(Y \mid \beta, \gamma, X)\, p(\beta, \gamma). \qquad (12)$$

From (12), it can be observed that $p(\beta, \gamma \mid Y, X)$ is also normal-gamma distributed. Notationally,

$$\beta \mid \gamma, Y, X \sim N(\beta_n, \gamma^{-1}\Sigma_n) \qquad (13)$$

and

$$\gamma \mid Y, X \sim \text{Gamma}\left(\frac{v_n}{2}, \frac{v_n s_n^2}{2}\right), \qquad (14)$$

with $\Sigma_n = (\Sigma_0^{-1} + X'X)^{-1}$, $\beta_n = \Sigma_n(\Sigma_0^{-1}\beta_0 + X'Y)$, $v_n = v_0 + n$ and $v_n s_n^2 = v_0 s_0^2 + Y'Y + \beta_0'\Sigma_0^{-1}\beta_0 - \beta_n'\Sigma_n^{-1}\beta_n$.
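As an illustrative sketch (simplified to a single coefficient so that all matrix operations reduce to scalars), the conjugate updates behind (13) and (14) can be written as:

```python
def normal_gamma_update(y, x, beta0, sigma0, v0, s0_sq):
    """Scalar normal-gamma conjugate update for the model y = x*beta + e.

    Prior: beta | gamma ~ N(beta0, sigma0/gamma), gamma ~ Gamma(v0/2, v0*s0_sq/2).
    Returns the posterior hyperparameters (beta_n, sigma_n, v_n, s_n_sq).
    """
    n = len(y)
    xtx = sum(xi * xi for xi in x)
    xty = sum(xi * yi for xi, yi in zip(x, y))
    yty = sum(yi * yi for yi in y)
    sigma_n = 1.0 / (1.0 / sigma0 + xtx)                  # Sigma_n
    beta_n = sigma_n * (beta0 / sigma0 + xty)             # beta_n
    v_n = v0 + n                                          # v_n
    s_n_sq = (v0 * s0_sq + yty + beta0**2 / sigma0
              - beta_n**2 / sigma_n) / v_n                # s_n^2
    return beta_n, sigma_n, v_n, s_n_sq

# With a vague prior the posterior mean is close to the least-squares estimate
y = [1.0, 2.1, 2.9, 4.2]
x = [1.0, 2.0, 3.0, 4.0]
beta_n, sigma_n, v_n, s_n_sq = normal_gamma_update(y, x, beta0=0.0,
                                                   sigma0=1e6, v0=1.0, s0_sq=1.0)
```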

Bootstrap prior one-way ANOVA
The Bayesian solution provided in Section 3.1 requires either subjective prior elicitation or an objective prior via Monte Carlo sampling from the posterior distribution. In variable selection, the Monte Carlo approach is computationally intensive, especially for high-dimensional datasets. The bootstrap prior technique follows the empirical Bayes principle, where the prior hyperparameters are estimated from the data. Therefore, the empirical Bayes estimates of $\beta$, $\Sigma_n$, $v_n$ and $s_n^{-2}$ are obtained from the observed data. The bootstrap Bayesian version of the estimates of $\beta$ and $\Sigma_n$ involves the following steps:

1. Generate bootstrap samples from the original data a desired number of times $B$;
2. Estimate the hyperparameters (prior parameters) each time the samples are generated using the Maximum Likelihood (ML) method;
3. Update the posterior estimates using the hyperparameters from step 2 via (13) and (14); and
4. Obtain the bootstrap empirical Bayesian estimates $\hat\beta_{BT}$ and $\hat\Sigma_{BT}$ by averaging the posterior estimates over the $B$ bootstrap replications.

The $\hat\beta_{BT}$ proposed here has good statistical properties in terms of bias and Mean Square Error (MSE).
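A minimal sketch of steps 1-4, reduced to a single mean parameter so that the conjugate update is scalar (the helper name and the guard constant are illustrative assumptions, not the paper's implementation), could look like:

```python
import random

def bootstrap_prior_mean(y, B=200, seed=1):
    """Bootstrap-prior estimate of a mean parameter (steps 1-4, scalar case).

    Each bootstrap resample supplies ML hyperparameters (a prior mean and a
    prior variance for it), which are combined with the observed data in a
    conjugate normal update; the B posterior means are then averaged.
    """
    rng = random.Random(seed)
    n = len(y)
    ybar = sum(y) / n
    s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)      # data variance
    posterior_means = []
    for _ in range(B):
        boot = [rng.choice(y) for _ in range(n)]        # step 1: resample
        mu0 = sum(boot) / n                             # step 2: ML prior mean
        tau0 = sum((v - mu0) ** 2 for v in boot) / (n - 1) / n
        tau0 = max(tau0, 1e-12)  # guard against a degenerate constant resample
        # step 3: precision-weighted conjugate update (scalar stand-in for 13 & 14)
        prec = 1.0 / tau0 + n / s2
        mu_n = (mu0 / tau0 + n * ybar / s2) / prec
        posterior_means.append(mu_n)
    return sum(posterior_means) / B                     # step 4: average over B

data = [4.8, 5.1, 5.3, 4.9, 5.2, 5.0]
est = bootstrap_prior_mean(data)
```

Because every bootstrap prior is itself centred on the resampled data, the averaged estimate stays close to the sample mean, which is the intuition behind the unbiasedness claim below.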
The bias property can be evaluated by taking the expectation of $\hat\beta_{BT}$; since each bootstrap posterior estimate is unbiased for $\beta$, so is their average. The MSE is the sum of the squared bias and the variance of the estimate, so, following from the above derivation, the MSE is just the variance of the estimate. The bootstrap prior Bayesian ANOVA then follows from (17), with the null hypothesis $H_0$ rejected when the statistic $F_{BT}$ exceeds the corresponding critical value. The multiclass variable selection procedure is repeated for all $p$ input variables in the training set. This implies that the test is carried out $p$ times, which leads to a multiple testing issue. To avert this, the False Discovery Rate (FDR) approach of Benjamini and Yekutieli (2001) is adopted, as it has been adjudged more powerful than other methods. The FDR procedure corrects the multiple comparison issue by adjusting the p-values returned from the test functions. The function p.adjust in R with the option "fdr" was used to adjust the resulting p-values yielded by the method.
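For illustration, the adjustment performed by R's p.adjust(p, method = "fdr") (the Benjamini-Hochberg step-up correction; the Benjamini-Yekutieli variant additionally inflates each value by the factor $c(m) = \sum_{i=1}^{m} 1/i$) can be sketched in a few lines:

```python
def fdr_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, mirroring R's p.adjust(..., "fdr").

    Sort the p-values, scale the i-th smallest by m/i, then enforce
    monotonicity with a cumulative minimum taken from the largest down.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):          # from the largest p-value down
        i = order[rank]
        running_min = min(running_min, pvals[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted

print(fdr_adjust([0.01, 0.02, 0.03, 0.04]))  # all four values are approximately 0.04
```

Variables whose adjusted p-values fall below the chosen level (0.05 here) are retained for the classification stage.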

Simulation study
The R software was used to investigate the performance of the classical ANOVA and the bootstrap prior ANOVA in multiclass variable selection. The simulation procedure was adapted from Olaniran et al. (2016) with slight modifications. We simulated n = 100 observations, representing the number of patient samples, with k = 3, 9 distinct biological groups corresponding to different types of disease outcomes. For each observation, p = 100, 1000, 5000 covariates, $Y = (y_1, \ldots, y_p)$, representing the observed gene expression profiles, were simulated.
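A minimal sketch of such a design (the class-mean shift of 1.0 and the choice of 10 active covariates are illustrative assumptions, not the settings of Olaniran et al. 2016) is:

```python
import random

def simulate_expression(n=100, p=100, k=3, n_active=10, shift=1.0, seed=7):
    """Simulate n samples of p gene-expression covariates across k classes.

    The first n_active covariates get class-dependent mean shifts
    (active genes); the rest are pure N(0, 1) noise (inactive genes).
    """
    rng = random.Random(seed)
    labels = [t % k + 1 for t in range(n)]        # roughly balanced classes 1..k
    data = []
    for lab in labels:
        row = [rng.gauss(shift * lab if j < n_active else 0.0, 1.0)
               for j in range(p)]
        data.append(row)
    return labels, data

labels, data = simulate_expression()
```

Screening methods can then be scored by how often they flag the first n_active columns (sensitivity) versus the remaining noise columns (false positive rate).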
Table 1: Comparison results between $F_c$ and $F_{BT}$ for p = 1 and k = 3, 9

The null hypothesis $H_0: \mu_1 = \cdots = \mu_k$ was tested against the alternative hypothesis $H_1: \mu_i \neq \mu_j$ for at least one pair $i \neq j$. The table is partitioned into two conditions corresponding to the situations where the null hypothesis is true or false. The level of significance used for the testing is 0.05. The F values of $F_{BT}$ are lower than those of $F_c$ at the various levels of k and conditions. The quantity $P(H|Y, X)$ is interpreted as the p-value for the frequentist procedure $F_c$ and as the probability of the null hypothesis for the Bayesian procedure $F_{BT}$. At the 5% significance level, $F_{BT}$ assigns a larger probability to $H_0$ being true when it is indeed true than $F_c$ does. These results correspond to lower Type I errors for $F_{BT}$ at k = 3, 9 when $H_0$ is true. For the second condition, when $H_1$ is true and k = 3, the two procedures return similar probabilities of $H_0$ being true, and thus similar Type II errors were achieved. Likewise, when k = 9 and $H_1$ is true, approximately similar probabilities of the null hypothesis and Type II errors were obtained. Thus, regarding validity, $F_{BT}$ is more valid than $F_c$, with a Type I error rate relatively closer to the imposed 0.05 level, especially when k = 3. Regarding the power to detect a true difference when it exists, the performance of $F_{BT}$ is relatively similar to that of $F_c$. The performance results for p = 100, corresponding to a moderately high-dimensional modeling scenario, are presented in Table 2. The two approaches maintained the same sensitivity level at the various k levels, implying that both approaches will always detect the relevant variables. However, the most important criterion is the false positive (false discovery) rate: a good selection procedure should have a reasonably low false positive rate as well as high power. The false positive rate of $F_{BT}$ is lower than that of $F_c$ at the various levels of k. Similar results were observed for p = 5000 in Table 4. However, for p = 1000 in Table 3, the FPR is approximately similar. The large number of gene subsets identified by $F_c$ can be attributed to the high false positive rate observed in the simulation studies. In addition, the p-values yielded by the two methods were used to rank the genes in increasing order of relevance. The three best subset genes were then used to plot the graph in Figure 1.
The plot showed that the two methods could only classify the tumors into three clear groups using the best three genes. However, the classification from $F_{BT}$ is more distinct than that from $F_c$. Also, two of the best three genes overlapped between the two methods. The tumors LUAD and BRCA were most evidently products of high expression levels of gene 220 and gene 219 under the $F_{BT}$ and $F_c$ selections. Therefore, further clinical examination can follow up on the identified genes for the tumors labeled LUAD and BRCA.

Conclusion
In this paper, we considered Bayesian variable selection for the multiclass classification task using the bootstrap prior technique. The bias derivation showed that the approach is unbiased while maintaining the lower mean square error property of Bayesian techniques. Simulation studies revealed that the proposed method $F_{BT}$ has a lower false positive rate in addition to the high power property of ANOVA-based methods. Additionally, the proposed method has higher accuracy when selecting biomarker subsets useful for disease classification, as observed in the PANCAN dataset. Hybridizing the proposed $F_{BT}$ method with a classification technique is a promising area for future research. We considered the one-way ANOVA because of its applicability to multiclass variable selection; the bootstrap prior technique can also be extended to multi-factor ANOVA.

Acknowledgement
We would like to thank Universiti Tun Hussein Onn Malaysia (UTHM) for supporting this research with grant Vot U607.

Figure 1: 3D classification plot for selection with $F_c$ and $F_{BT}$

Table 2: Performance results of the simulated data at p = 100

The performance metrics used to assess the methods are: True Positive Rate (Sensitivity or Power, TPR), which measures the expected proportion of active variables that are declared active; False Positive Rate (False Discovery, FPR), which measures the expected proportion of inactive variables that are declared active; True Negative Rate (Specificity, TNR), which measures the expected proportion of inactive variables that are declared inactive; and False Negative Rate (FNR), which measures the expected proportion of active variables that are declared inactive.

Table 3: Performance results of the simulated data at p = 1000

The dataset used here is a subset of the RNA-Seq (HiSeq) PANCAN data set (Weinstein, Collisson, Mills, Shaw, Ozenberger, Ellrott, Shmulevich, Sander, Stuart, Network, et al. 2013). It represents a random extraction of 16384 gene expression profiles of 801 patients with five different tumor labels: BRCA, KIRC, COAD, LUAD and PRAD. The two methods were used to identify the most relevant biomarker genes for the possible classification of tumors.

Table 4: Performance results of the simulated data at p = 5000