Comparing Spike and Slab Priors for Bayesian Variable Selection

An important task in building regression models is to decide which regressors should be included in the final model. In a Bayesian approach, variable selection can be performed using mixture priors with a spike and a slab component for the effects subject to selection. As the spike is concentrated at zero, variable selection is based on the probability of assigning the corresponding regression effect to the slab component. These posterior inclusion probabilities can be determined by MCMC sampling. In this paper we compare the MCMC implementations for several spike and slab priors with regard to posterior inclusion probabilities and their sampling efficiency for simulated data. Further, we investigate posterior inclusion probabilities analytically for different slabs in two simple settings. Application of variable selection with spike and slab priors is illustrated on a data set of psychiatric patients where the goal is to identify covariates affecting metabolism.


Introduction
A major task in building a regression model is to select those regressors from a large set of potential covariates which should be included in the final model. Correct classification of regressors as having (nearly) zero or non-zero effects is important: omitting regressors with non-zero effect will lead to biased estimates whereas inclusion of regressors with zero effect causes loss in estimation precision and predictive performance of the model.
For the regression coefficients, many Bayesian variable selection methods use mixture priors with two components: a spike concentrated around zero and a comparably flat slab. In this paper we compare spike and slab priors with two different specifications for the spike: absolutely continuous spikes and spikes defined by a point mass at zero, so-called Dirac spikes. We consider here Dirac spikes combined with different normal slabs and priors where both spike and slab are normal distributions as in George and McCulloch (1993) or scale mixtures of normals as in Ishwaran and Rao (2005) and Konrath, Kneib, and Fahrmeir (2008).
Bayesian variable selection with spike and slab priors can be accomplished by MCMC methods, but depending on the type of the spike the specific implementations differ: A Dirac spike requires computation of marginal likelihoods, i.e. integrating over the parameters subject to selection, in each MCMC iteration. This is not necessary for spikes specified by an absolutely continuous distribution. However, regression effects are not shrunk exactly to zero and therefore the dimension of the model is not reduced during MCMC. In this paper we compare posterior inclusion probabilities under different spike and slab priors as well as their MCMC sampling efficiency.
The rest of the paper is structured as follows. Section 2 describes the basic model and the two types of spike and slab priors. Implementation of MCMC sampling schemes is outlined for both spike types in Section 3 and Section 4 presents results from a simulation study comparing five different spike and slab priors on simulated data. To get further insight, posterior inclusion probabilities are investigated analytically in two simple settings for Dirac spikes combined with different slabs in Section 5. Section 6 illustrates application of Bayesian variable selection on a data set where the goal is to identify covariates which have an effect on metabolism of psychiatric patients. Finally, Section 7 summarizes the results and indicates modifications for the slab component to be considered in further research.

The Linear Regression Model
In the standard linear regression model the outcome $y = (y_1, \ldots, y_N)'$ of subjects $i = 1, \ldots, N$ is modeled as a linear function of the regressors with a Gaussian error term,
$$y = \mu \mathbf{1} + X\alpha + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I).$$
Here $\alpha$ is the $d \times 1$ vector of regression coefficients. We assume that the covariate vectors are centered with the null vector as mean, so that $X'\mathbf{1} = 0$ and the mean $\mu$ is constant over all models. As the columns of the design matrix are orthogonal to the unit vector, the log-likelihood can be written as
$$l(y|\mu, \alpha, \sigma^2) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left[N(\bar{y} - \mu)^2 + (y_c - X\alpha)'(y_c - X\alpha)\right],$$
where $y_c = y - \mathbf{1}\bar{y}$ denotes the vector of centered responses. In a Bayesian approach, model specification is completed with priors for the model parameters $(\mu, \sigma^2, \alpha)$. We assume a prior of the structure $p(\mu, \sigma^2, \alpha) = p(\mu, \sigma^2)\,p(\alpha|\sigma^2, \mu)$ with the usual uninformative prior for mean and error variance and use spike and slab priors for the regression coefficients $\alpha$.
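As a concrete check of the decomposition above, the following sketch (assuming NumPy; the function name `centered_loglik` is ours, not from the paper) evaluates the log-likelihood in the centered parameterization; when the columns of $X$ are centered it agrees with the uncentered Gaussian log-likelihood:

```python
import numpy as np

def centered_loglik(y, X, mu, alpha, sigma2):
    """Log-likelihood in the centered parameterization:
    -N/2 log(2 pi sigma2) - [N (ybar - mu)^2 + ||y_c - X alpha||^2] / (2 sigma2).
    Valid when the columns of X are centered (X'1 = 0)."""
    N = len(y)
    ybar = y.mean()
    y_c = y - ybar
    quad = N * (ybar - mu) ** 2 + np.sum((y_c - X @ alpha) ** 2)
    return -0.5 * N * np.log(2 * np.pi * sigma2) - quad / (2 * sigma2)
```

The identity holds because $\mathbf{1}'y_c = 0$ and $\mathbf{1}'X\alpha = 0$, so the cross term in the sum of squares vanishes.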

Spike and Slab Priors
Mixture priors with spike and slab components have been used extensively for variable selection, see e.g. Mitchell and Beauchamp (1988), George and McCulloch (1993, 1997) and Ishwaran and Rao (2005). The spike component, which concentrates its mass at values close to zero, allows shrinkage of small effects to zero, whereas the slab component has its mass spread over a wide range of plausible values for the regression coefficients.
To specify spike and slab priors we introduce indicator variables $\delta = (\delta_1, \ldots, \delta_d)$ where $\delta_j$ takes the value 1 if $\alpha_j$ is allocated to the slab component, and we denote by $\alpha_\delta$ the vector comprising those elements of $\alpha$ where $\delta_j = 1$. We consider priors where regression effects allocated to the spike component are independent of each other and independent of $\alpha_\delta$ a priori, whereas elements of $\alpha_\delta$ may be dependent. These spike and slab priors can be written as
$$p(\alpha|\delta) = \prod_{j:\,\delta_j = 0} p_{spike}(\alpha_j)\; p_{slab}(\alpha_\delta|\delta),$$
where $p_{spike}$ and $p_{slab}$ denote the univariate spike and the multivariate slab distribution, respectively. The prior inclusion probability $p(\delta_j = 1)$ of the effect $\alpha_j$ is specified hierarchically as
$$p(\delta_j = 1|\omega) = \omega, \qquad \omega \sim B(a_\omega, b_\omega).$$
Note that the indicator variables $\delta_j$ are independent conditional on the prior inclusion probability $\omega$, but dependent marginally. This might not be justified in practical applications and could be relaxed by using an individual inclusion probability $\omega_j$ for each regression effect $\alpha_j$,
$$p(\delta_j = 1|\omega_j) = \omega_j, \qquad \omega_j \sim B(a_{\omega_j}, b_{\omega_j}).$$
Prior information on individual inclusion probabilities could be incorporated by appropriate choice of the parameters $a_{\omega_j}$ and $b_{\omega_j}$. The introduction of indicator variables allows classification of regression effects as (practically) zero, if $\delta_j = 0$, and non-zero otherwise. Variable selection is based on the posterior probability of assigning the corresponding regression effect to the slab component, i.e. the posterior inclusion probability $p(\delta_j = 1|y)$, which can be sampled by MCMC methods. Basically two different types of spikes have been proposed in the literature: spikes specified by an absolutely continuous distribution and spikes specified by a point mass at zero, called Dirac spikes. Specifications of priors for both spike types, which are compared in this paper, are presented in more detail in the following sections.
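The marginal dependence of the indicators under a common hierarchical inclusion probability can be checked by simulation. A minimal sketch (assuming NumPy; the $B(1,1)$ choice matches the uninformative prior used later): under $\omega \sim B(1,1)$ the marginal covariance of two indicators equals $\mathrm{Var}(\omega) = 1/12$, while it vanishes for fixed $\omega$.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 200_000

# Hierarchical prior: omega ~ Beta(1, 1), delta_j | omega iid Bernoulli(omega).
omega = rng.beta(1.0, 1.0, size=M)
delta = (rng.random((M, 2)) < omega[:, None]).astype(float)
cov_hier = np.cov(delta[:, 0], delta[:, 1])[0, 1]   # ~ Var(omega) = 1/12 > 0

# Fixed omega = 0.5: the indicators are marginally independent.
delta_fix = (rng.random((M, 2)) < 0.5).astype(float)
cov_fix = np.cov(delta_fix[:, 0], delta_fix[:, 1])[0, 1]   # ~ 0
```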

Absolutely Continuous Spikes
To specify an absolutely continuous spike, in principle any unimodal continuous distribution with mode at zero could be used. Usually absolutely continuous spikes are combined with slabs where the components of $\alpha_\delta$ are independent conditional on $\delta$, i.e.
$$p_{slab}(\alpha_\delta|\delta) = \prod_{j:\,\delta_j = 1} p_{slab}(\alpha_j).$$
Here we consider priors where spike and slab components are specified by the same distribution family but with a variance ratio $r$ considerably smaller than 1. We use only spikes and slabs which can be represented as scale mixtures of normal distributions with zero mean,
$$p_{spike}(\alpha_j) = \int f_N(\alpha_j; 0, r\psi_j)\, p(\psi_j)\, d\psi_j, \qquad p_{slab}(\alpha_j) = \int f_N(\alpha_j; 0, \psi_j)\, p(\psi_j)\, d\psi_j,$$
where the distribution of $\psi_j$ may depend on a further parameter $\vartheta$. In particular, we consider normal spikes and slabs with constant $\psi_j \equiv V$ (called SSVS prior) and normal mixtures of inverse Gamma distributions (NMIG prior), where $\psi_j \sim \mathcal{G}^{-1}(\nu, Q)$. Priors with normal spikes and slabs were introduced in George and McCulloch (1993) to perform stochastic search variable selection, and NMIG spikes and slabs were proposed in Ishwaran and Rao (2003) and Ishwaran and Rao (2005) for variable selection in Gaussian regression models and used in Konrath et al. (2008) for survival data. Note that for the NMIG prior marginally both spike and slab component are Student-t distributions, $p_{spike}(\alpha_j) = t_{2\nu}(0, rQ/\nu)$ and $p_{slab}(\alpha_j) = t_{2\nu}(0, Q/\nu)$.
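The marginal Student-t form of the NMIG components is easy to verify by simulation; a sketch for the slab with the values $\nu = 5$, $Q = 4$ used later in the simulations (so the marginal slab variance is $Q/(\nu - 1) = 1$; NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
nu, Q = 5.0, 4.0
M = 400_000

# psi_j ~ InvGamma(nu, Q): the reciprocal of a Gamma(nu, rate=Q) draw.
psi = 1.0 / rng.gamma(shape=nu, scale=1.0 / Q, size=M)
# Slab draw as a scale mixture of normals; marginally t_{2 nu}(0, Q / nu).
alpha = rng.normal(0.0, np.sqrt(psi))
var_emp = alpha.var()   # should be close to Q / (nu - 1) = 1
```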

Dirac Spike
A Dirac spike is specified as $p_{spike}(\alpha_j) = p(\alpha_j|\delta_j = 0) = \Delta_0(\alpha_j)$. We combine Dirac spikes with slab components of the form
$$p_{slab}(\alpha_\delta|\delta) = f_N(\alpha_\delta;\, a_{0,\delta},\, A_{0,\delta}\sigma^2),$$
where $f_N(x; \mu, \Sigma)$ denotes the density of the multivariate $N(\mu, \Sigma)$-distribution. In particular we use
• the independence slab (i-slab), where $a_{0,\delta} = 0$ and $A_{0,\delta} = cI$,
• the g-slab, where $a_{0,\delta} = 0$ and $A_{0,\delta} = g(X_\delta'X_\delta)^{-1}$,
• the fractional slab (f-slab), where $a_{0,\delta} = (X_\delta'X_\delta)^{-1}X_\delta'y_c$ and $A_{0,\delta} = \frac{1}{b}(X_\delta'X_\delta)^{-1}$.
$X_\delta$ is the design matrix consisting only of those columns of $X$ corresponding to non-zero effects, i.e. where $\delta_j = 1$. The g-slab is Zellner's g-prior (Zellner, 1986) for these effects and the f-slab is the corresponding fractional prior (O'Hagan, 1995). The idea of the fractional prior is to use a fraction $b$ of the likelihood to determine a prior distribution for the parameters. In our specification the f-slab is not a fraction of the whole likelihood, but only of the part containing information on the regression coefficients $\alpha$. Note that in contrast to the i-slab, regression coefficients $\alpha_j$ are not independent conditional on $\delta$ for g- and f-slab, where the joint distribution of all effects with $\delta_j = 1$ is specified with a variance-covariance matrix equal to a scalar multiple of the inverse Fisher information matrix. However, their means differ: the g-slab is centered at the null vector, whereas the mean of the f-slab is the LS estimate of the regression effects with $\delta_j = 1$. Figure 1 illustrates the differences between the three priors for two regressors, showing the contours of the slab component for $\delta = (1, 1)$ together with the position of the spike for $\delta = (0, 0)$.
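The three slabs differ only in the prior moments $(a_{0,\delta}, A_{0,\delta})$; a sketch constructing them for a given set of selected columns (function and argument names are ours, not from the paper):

```python
import numpy as np

def slab_moments(X_delta, y_c, slab, c=5.0, g=None, b=None):
    """Prior mean a_0 and covariance factor A_0 (the slab is N(a_0, A_0*sigma2))
    for the independence (i), Zellner g, and fractional (f) slabs."""
    XtX = X_delta.T @ X_delta
    k = X_delta.shape[1]
    if slab == "i":                    # independence slab: a_0 = 0, A_0 = c I
        return np.zeros(k), c * np.eye(k)
    if slab == "g":                    # Zellner's g-prior: A_0 = g (X'X)^{-1}
        return np.zeros(k), g * np.linalg.inv(XtX)
    if slab == "f":                    # fractional prior: centered at the LS estimate
        a0 = np.linalg.solve(XtX, X_delta.T @ y_c)
        return a0, np.linalg.inv(XtX) / b
    raise ValueError(slab)
```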

Inference
For both types of spike and slab priors posterior inference is feasible using MCMC methods, where the model parameters $(\mu, \delta, \alpha, \omega, \sigma^2)$ and additionally, under the NMIG prior, the scale parameters $\psi$ are sampled from their conditional posterior distributions. Whereas for an absolutely continuous spike the indicators $\delta_j$ can be sampled conditionally on the effects $\alpha_j$, for a Dirac spike it is essential to draw $\delta$ from the marginal posterior $p(\delta|y)$, integrating over the parameters subject to selection, see Geweke (1996) and Smith and Kohn (1996). This requires evaluation of marginal likelihoods in each MCMC iteration. In normal regression models with conjugate priors (which are used here) analytical integration over the regression effects is feasible and hence marginal likelihoods can be computed rather cheaply. Details of the MCMC sampling schemes are given in the following two subsections.

MCMC for Absolutely Continuous Spikes
For priors with an absolutely continuous spike the full conditional distribution of $(\delta, \psi)$ is given as
$$p(\delta, \psi|\alpha, \omega) \propto p(\alpha|\delta, \psi)\, p(\psi)\, p(\delta|\omega).$$
Therefore, $\delta$ and $\psi$ can be sampled together in one block and the sampling scheme involves the following steps: (1.) Sample $\mu$ from its posterior $\mu|\sigma^2, y \sim N(\bar{y}, \sigma^2/N)$. (2.) Sample $(\delta_j, \psi_j)$ jointly from their full conditional given $\alpha_j$ and $\omega$. (3.) Sample $\omega$ from its conditional Beta posterior. (4.) Sample the regression effects $\alpha$ from their conditional normal posterior and the error variance $\sigma^2$ from its conditional inverse Gamma posterior.
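Under the SSVS prior the conditional probability $p(\delta_j = 1|\alpha_j, \omega)$ is a simple ratio of normal densities at the current value of $\alpha_j$. A minimal sketch of this indicator update (assuming NumPy; all names are ours, and the optional scaling of the variances by $\sigma^2$ is omitted):

```python
import numpy as np

def sample_delta_ssvs(alpha, omega, V=1.0, r=1e-4, rng=None):
    """One update of the indicators under the SSVS prior: delta_j | alpha_j, omega
    is Bernoulli with odds omega*N(alpha_j; 0, V) vs (1 - omega)*N(alpha_j; 0, r*V)."""
    rng = rng or np.random.default_rng()

    def log_phi(a, v):      # log of a zero-mean normal density
        return -0.5 * (np.log(2 * np.pi * v) + a ** 2 / v)

    log_odds = (np.log(omega) - np.log1p(-omega)
                + log_phi(alpha, V) - log_phi(alpha, r * V))
    p1 = 1.0 / (1.0 + np.exp(-log_odds))
    return (rng.random(alpha.shape) < p1).astype(int), p1
```

With the variance ratio $r = 1/10000$ used later, a coefficient near zero is assigned to the spike with high probability, while a sizable coefficient is almost surely assigned to the slab.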

Sampling Steps for a Dirac Spike
For a Dirac spike, $\delta_j = 0$ implies $\alpha_j = 0$ and vice versa. To avoid reducibility of the Markov chain, it is essential to draw $\delta$ from the marginal posterior
$$p(\delta|y) \propto p(y|\delta)\, p(\delta),$$
where the effects subject to selection are integrated out. Here $p(y|\delta)$ denotes the marginal likelihood of the linear regression model (1) with design matrix $X_\delta$. For Dirac spikes combined with i-, g- or f-slab on $\alpha_\delta$ the marginal likelihood can be derived analytically as
$$p(y|\delta) \propto \frac{|A_\delta|^{1/2}}{|A_{0,\delta}|^{1/2}}\, \frac{\Gamma(s_N)}{S_N(\delta)^{s_N}}, \qquad (4)$$
with $s_N = (N-1)/2$ and $S_N(\delta)$ as in Appendix A. $a_\delta$ and $A_\delta$ are the parameters of the posterior of $\alpha_\delta$: $A_\delta = \left((X_\delta'X_\delta) + \frac{1}{c}I\right)^{-1}$ for the i-slab, $A_\delta = \frac{g}{g+1}(X_\delta'X_\delta)^{-1}$ for the g-slab and $A_\delta = (X_\delta'X_\delta)^{-1}$ for the f-slab; the posterior mean is $a_\delta = A_\delta X_\delta'y_c$ for any of the three slabs. Details are given in Appendix A.
(1a.) Sample each element $\delta_j$ of the indicator vector $\delta$ separately from $p(\delta_j = 1|\delta_{\setminus j}, y)$ given as
$$p(\delta_j = 1|\delta_{\setminus j}, y) = \frac{1}{1 + \frac{1-\omega}{\omega}\, \frac{p(y|\delta_j = 0,\, \delta_{\setminus j})}{p(y|\delta_j = 1,\, \delta_{\setminus j})}}.$$
Here $\delta_{\setminus j}$ denotes the vector $\delta$ consisting of all elements of $\delta$ except $\delta_j$. Elements of $\delta$ are updated in a random permutation order. (1b.) Sample the error variance $\sigma^2$ from the $\mathcal{G}^{-1}(s_N, S_N(\delta))$-distribution. (2.) Sample $\mu$ from $N(\bar{y}, \sigma^2/N)$ and the inclusion probability $\omega$ from its conditional Beta posterior. (3.) Set $\alpha_j = 0$ if $\delta_j = 0$. Sample the non-zero elements $\alpha_\delta$ from the normal posterior $N(a_\delta, A_\delta\sigma^2)$.
For both g- and f-slab, the posterior variance-covariance matrix $A_\delta$ is a scalar multiple of the prior variance-covariance matrix $A_{0,\delta}$. Thus for computing the marginal likelihood (4) the determinant of $A_\delta$ is not required, which speeds up sampling compared to the i-slab.
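For the g-slab this can be made concrete: with known $\sigma^2$ (a simplification; the sampler in the text also integrates over $\sigma^2$), the log marginal likelihood up to a $\delta$-independent constant needs only the projection of $y_c$ onto the selected columns and the dimension of $\delta$, no determinant of $A_\delta$. A sketch (names ours):

```python
import numpy as np

def log_marglik_g(y_c, X_d, g, sigma2):
    """log p(y | delta, sigma2) under a Dirac spike with g-slab, up to an additive
    constant not depending on delta; no determinant of A_delta is needed."""
    if X_d.shape[1] == 0:
        return -y_c @ y_c / (2 * sigma2)
    fit = X_d @ np.linalg.solve(X_d.T @ X_d, X_d.T @ y_c)   # projection of y_c
    ess = y_c @ fit                                          # explained sum of squares
    return (-0.5 * X_d.shape[1] * np.log(1 + g)
            - (y_c @ y_c - g / (1 + g) * ess) / (2 * sigma2))
```

This uses the identity $(I + gP)^{-1} = I - \frac{g}{g+1}P$ for the projection matrix $P$ onto the columns of $X_\delta$, and $|I + gP| = (1+g)^{|\delta|}$.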

Simulated Data
We investigate performance of the different MCMC implementations for simulated data.
Interest lies in correct selection of regressors as well as sampling efficiency of posterior inclusion probabilities. We expect draws of the posterior probabilities $p^{(m)}(\delta_j = 1)$, $m = 1, \ldots, M$, to have higher autocorrelations for continuous than for Dirac spikes. It is however not obvious which implementation will have higher computational cost in CPU time: with a Dirac spike only coefficients with $\delta_j = 1$ have to be sampled, as those with $\delta_j = 0$ are restricted exactly to zero, whereas for a continuous spike the dimension of the model is not reduced during MCMC. On the other hand, specifying a continuous spike will save CPU time as no marginal likelihoods have to be computed. To investigate correct model selection we simulate 100 data sets with $N = 40$ observations from a linear regression model with mean $\mu = 1$ and $\sigma^2 = 1$ and nine covariates. We consider two setups for the covariate vectors $x_j$, which are drawn from a $N(0, C)$-distribution: independent regressors, where $C = I$, and correlated regressors generated as in Tibshirani (1996), where $C$ is a correlation matrix with $C_{jk} = \rho^{|j-k|}$ and $\rho = 0.8$. For both independent and correlated regressors we set three regression effects to each of the values 2 (strong effects), 0.2 (weak effects) and 0 (zero effects).
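A sketch of this data-generating process (the function name is ours; assuming NumPy):

```python
import numpy as np

def simulate_data(N=40, rho=0.8, seed=0):
    """One replicate of the simulation design: nine regressors drawn from
    N(0, C) with C_jk = rho^|j-k| (rho = 0 gives independent regressors),
    effects (2, 2, 2, 0.2, 0.2, 0.2, 0, 0, 0), mu = 1, sigma2 = 1."""
    rng = np.random.default_rng(seed)
    j = np.arange(9)
    C = rho ** np.abs(j[:, None] - j[None, :])
    X = rng.multivariate_normal(np.zeros(9), C, size=N)
    alpha = np.array([2, 2, 2, 0.2, 0.2, 0.2, 0, 0, 0], float)
    y = 1.0 + X @ alpha + rng.normal(size=N)
    return y, X
```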
In the simulation studies, we use an uninformative B (1, 1)-prior for ω. To mimic Dirac spikes closely, a small variance ratio r of continuous spikes and slabs would be preferred, however r should not be too small to avoid MCMC getting stuck in the spike component. Following the recommendations in George and McCulloch (1993) we set r = 1/10000.
It is well known that the choice of the slab distribution is critical for model selection. Our choice for the slab variance is motivated by the fact that model selection based on Bayes factors is consistent for the g-prior with g = N (see Fernández, Ley, and Steel, 2001). Hence we choose g = N and match the variances of the other slabs to equal the variance of the g-slab if regressors are orthogonal, i.e. we choose b = 1/N and V = 1. For the NMIG-prior we choose ν = 5, which corresponds to a t-distribution with 10 degrees of freedom, and Q = 4.
For each data set, MCMC was run for M = 5000 iterations after a burn-in of 1000 draws. The first 500 draws of the burn-in were drawn from the full model including all regressors.

Model Selection Performance
Posterior inclusion probabilities are estimated by their posterior mean, i.e. the average of the inclusion probabilities $p^{(m)}(\delta_j = 1)$ over the MCMC iterations. Figure 2 shows box-plots of these estimates in the 100 simulated data sets with independent regressors. Regressors with strong effect are perfectly classified, with estimated posterior inclusion probabilities being equal to 1 (rounded) in all 100 data sets. Variation of the estimated posterior inclusion probabilities is high for regressors with weak and zero effect, which indicates that regression coefficients of smaller magnitude are hard to classify. Posterior inclusion probabilities tend to be slightly smaller for the Dirac/i-slab and the Dirac/g-slab priors than for the other priors. For orthogonal regressors Barbieri and Berger (2004) showed that the median probability model, i.e. the model including regressors with posterior inclusion probability larger than 0.5, is the best model with regard to predictive performance. Table 1 reports the number of data sets where each of the regressors with weak or zero effect is included in the median probability model. Results are not shown for regressors with strong effect as these are included in all 100 data sets under any prior. Whereas under the Dirac spike combined with i- or g-slab classification is better for zero effects, weak effects are detected less often than under the other three priors. Overall performance is similar for all priors, with mean misclassification rates (computed over weak and zero effects) from 41.6 % (Dirac/f-slab) to 43.3 % (NMIG). Figure 3 shows the estimated posterior inclusion probabilities for simulated data with correlated regressors. The order of regressors with strong, weak and zero effects is different now, to get insight into the effects of correlations, which are highest for neighboring regressors.
Posterior inclusion probabilities of regressors with strong effects show more variation than for independent regressors but are close to 1 in almost all cases. A pronounced difference however occurs for regressors with weak and zero effects, whose posterior inclusion probabilities are slightly smaller for the Dirac/g-slab and Dirac/f-slab priors but considerably higher for priors with independent slabs (Dirac/i-slab, SSVS and NMIG) than in Figure 2. As a consequence, when regressors are correlated, regressors with weak and zero effects are included in the median probability model less often under the Dirac/g-slab and Dirac/f-slab priors but more often under priors with independent slabs. Table 2 reports in how many data sets each regressor is included in the median probability model. As estimated posterior inclusion probabilities of regressors with strong effects are higher than 0.5 in all data sets (except in one data set for the g-slab), only results for weak and zero effects are given. Mean misclassification rates (computed over weak and zero effects) are higher than for independent regressors, but again very similar, ranging from 47.5 % (Dirac/f-slab) to 48 % (Dirac/i-slab). Obviously the correlation structure of the prior has an effect on the posterior inclusion probability when regressors are correlated. We will return to this issue in Section 5.2, where we investigate this effect analytically, though in a simpler setting.
Further simulations carried out in Malsiner-Walli (2010) indicate that posterior inclusion probabilities depend on the variance of the slab component: Posterior inclusion probabilities decrease with increasing slab variance. This is another issue which we investigate analytically in Section 5 and illustrate in the application in Section 6.

Comparing Sampling Efficiencies
As MCMC draws are correlated, it is of interest to compare the MCMC implementations for the different priors with respect to their sampling efficiency. Table 3 reports mean inefficiency factors (also called integrated autocorrelation times) for regressors with weak and zero effects. Inefficiency factors, defined as $\tau = 1 + 2\sum_{l=1}^{L}\rho(l)$, where $\rho(l)$ is the empirical autocorrelation at lag $l$, were computed using the initial monotone sequence estimator (Geyer, 1992) for $L$. If inclusion probabilities $p^{(m)}(\delta_j = 1)$ are numerically equal to 1 in all iterations, inefficiency factors cannot be computed. This occurred for one effect in one data set under the SSVS prior, and hence the average reported in Table 3 for the SSVS prior is based only on the remaining posterior inclusion probabilities. Interestingly, for correlated regressors inefficiency factors are lower for the Dirac/g-slab and Dirac/f-slab priors and higher for priors with an independent slab. This might result from the decrease/increase of posterior inclusion probabilities: Wagner and Duller (2011) also observed smaller inefficiency factors for low inclusion probabilities, though in logit models. As expected, inefficiency factors are considerably higher for priors with continuous spikes than for Dirac spikes. Further simulations in Malsiner-Walli (2010) showed that, for continuous spikes, the choice of the variance ratio $r$ as well as the actual implementation can have an impact on sampling efficiency: autocorrelations and inefficiency factors are lower for higher values of $r$, e.g. $r = 1/1000$ yields similar estimates for posterior inclusion probabilities but with less autocorrelated draws. Under the NMIG prior, posterior inclusion probabilities could alternatively be computed conditional on the variance parameters $\psi_j$ as in Konrath et al. (2008), which however leads to considerably higher autocorrelations than using the marginal t-distribution.
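A sketch of the inefficiency factor computation (names ours; this is a simplified variant of Geyer's initial monotone sequence estimator, accumulating sums of adjacent autocorrelation pairs while they stay positive and non-increasing):

```python
import numpy as np

def inefficiency_factor(x):
    """tau = 1 + 2 * sum_{l=1}^{L} rho(l), with L chosen by a simplified variant
    of Geyer's (1992) initial monotone sequence estimator."""
    x = np.asarray(x, float)
    n = len(x)
    x = x - x.mean()
    # Autocovariances via FFT, zero-padded to avoid circular wrap-around.
    f = np.fft.rfft(x, 2 * n)
    acov = np.fft.irfft(f * np.conj(f))[:n] / n
    acf = acov / acov[0]
    tau, prev = 1.0, np.inf
    for l in range(1, n - 1, 2):
        pair = acf[l] + acf[l + 1]     # sum of an even-odd pair of lags
        if pair <= 0 or pair > prev:   # truncate: non-positive or increasing pair
            break
        tau += 2 * pair
        prev = pair
    return tau
```

The effective sample size used later is then $ESS = M/\tau$.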
To assess sampling efficiency with computing time taken into account, Table 4 reports effective sample sizes per second, averaged over weak and zero effects. The effective sample size $ESS = M/\tau$ estimates the number of independent samples that would yield a parameter estimate with the same precision as the $M$ correlated MCMC draws. Results in Table 4 are based on all MCMC chains where inefficiency factors could be computed. Though sampling efficiency is much higher for priors with Dirac spikes, differences in effective sample sizes are much less pronounced, and priors with absolutely continuous spikes perform roughly similarly to the Dirac/g- and Dirac/f-slab priors. Even in this rather low-dimensional model with only nine regressors, computational cost for the Dirac/i-slab prior is too high to be outweighed by its smaller inefficiency factors. Due to lower inefficiency factors, priors with g- and f-slabs have even higher ESS/sec for correlated regressors.

Posterior Inclusion Probabilities
Results of the simulation study indicate that posterior inclusion probabilities largely depend on the slab distribution. To get further insight into the effect of different slabs we investigate the inclusion probability of one regressor $x_j$ conditional on $\delta_{\setminus j}$ for priors with Dirac spikes (i.e. the Dirac/i-slab, Dirac/g-slab and Dirac/f-slab priors) in two simple special cases: for orthogonal regressors and in a model with only two correlated regressors. For simplicity we assume that the error variance $\sigma^2$ is known. Details on the computation of posterior inclusion probabilities are given in Appendix B. We denote by $s_y^2 = \frac{1}{N}y_c'y_c$ the sample variance of $y$, by $r_{yj}$ the sample correlation between $y$ and $x_j$, and by $s_j^2 = \frac{1}{N}x_j'x_j$ the sample variance of covariate $x_j$.

Orthogonal Regressors
For orthogonal regressors, i.e. $X'X = \mathrm{diag}(Ns_j^2)$, $j = 1, \ldots, d$, the posterior inclusion probability of $x_j$ can be written as a function of the LS estimate $\hat{\alpha}_j = r_{yj}s_y/s_j$ as
$$p(\delta_j = 1|y, \delta_{\setminus j}, \sigma^2) = \frac{1}{1 + \exp(h(\hat{\alpha}_j, \theta)/2)\,\frac{1-\omega}{\omega}}, \qquad (5)$$
where $\theta$ is the variance parameter of the slab distribution, i.e. $c$ for the i-slab, $g$ for the g-slab and $b$ for the f-slab. Under any of the three slabs, the inclusion probability does not depend on $\delta_{\setminus j}$. In particular we obtain (see Appendix B.1)
$$\text{i-slab:} \quad h(\hat{\alpha}_j, c) = -\frac{N\hat{\alpha}_j^2 s_j^2}{\sigma^2}\,\frac{Ncs_j^2}{Ncs_j^2 + 1} + \log(Ncs_j^2 + 1), \qquad (6)$$
$$\text{g-slab:} \quad h(\hat{\alpha}_j, g) = -\frac{N\hat{\alpha}_j^2 s_j^2}{\sigma^2}\,\frac{g}{g+1} + \log(g + 1), \qquad (7)$$
$$\text{f-slab:} \quad h(\hat{\alpha}_j, b) = -\frac{N\hat{\alpha}_j^2 s_j^2}{\sigma^2}\,(1 - b) + \log(1/b). \qquad (8)$$
In formulas (6)-(8) the first term is proportional to $N\hat{\alpha}_j^2/\sigma^2$ which, following Dey, Ishwaran, and Rao (2008), can be interpreted as the signal of the regression coefficient contained in the data. Hence, posterior inclusion probabilities increase with both the sample size $N$ and the size of the estimated effect $|\hat{\alpha}_j|$. The second term can be interpreted as a penalty term: it increases with the slab variance, and hence the posterior inclusion probability decreases as a function of the slab variance. Figure 4 shows posterior inclusion probabilities under the i-slab as a function of the LS estimate $\hat{\alpha}$ for various sample sizes $N$ and variances $c$. In contrast to the i-slab, the penalty term does not depend on the scale of the regressor under g- and f-slabs. For standardized orthogonal regressors ($s_j^2 = 1$) posterior inclusion probabilities are identical for g-slab and i-slab when $Nc = g$, and slightly higher under the f-slab when $b = 1/g$. This corresponds to the simulation results, see Figure 2.
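Formula (5) with the g-slab penalty can be evaluated directly; a sketch (the function name is ours, with $\omega$ fixed at 1/2):

```python
import numpy as np

def incl_prob_g(alpha_hat, s2_j, N, g, sigma2, omega=0.5):
    """Posterior inclusion probability of an orthogonal regressor under the
    Dirac/g-slab prior: formula (5) with the g-slab penalty h."""
    h = -N * alpha_hat ** 2 * s2_j / sigma2 * g / (g + 1) + np.log(g + 1)
    return 1.0 / (1.0 + np.exp(h / 2) * (1 - omega) / omega)
```

For a strong effect the probability is essentially 1, while for a null effect it falls with the slab variance $g$, illustrating the penalty term.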
To illustrate the dependence of posterior inclusion probabilities on the effect signal $\alpha$, we generated 100 data sets of size $N = 200$, with 21 regressors generated as independent standard normal random variables and effects from 0 to 0.4 in increments of 0.02. Posterior inclusion probabilities were estimated under the less restrictive assumption of unknown error variance using the MCMC scheme described in Section 3.2. Figure 5 shows estimated posterior inclusion probabilities for the Dirac/i-slab plotted versus $\hat{\alpha}$ (left panel). Posterior inclusion probabilities do not exactly equal the theoretical values computed from formula (5), which are shown as a line. This is not surprising as the assumptions for the derivation of the formula are not met exactly: firstly, due to stochastic variation regressors are not perfectly orthonormal, and secondly, in the MCMC scheme the marginal likelihood is computed using formula (4) with marginalization over the error variance $\sigma^2$. In the right panel of Figure 5 estimated posterior inclusion probabilities are plotted against the "true" effect sizes $\alpha$ used for data generation. Conditional on $\alpha$, variation of the posterior inclusion probabilities is much higher, as additionally the variation of the LS estimate $\hat{\alpha}$ is reflected.

Two Correlated Regressors
To investigate the effect of correlation between regressors we consider a model with only two standardized regressors $x_1$ and $x_2$ (i.e. $s_j^2 = 1$) and assume that $x_1$ is included in the model, i.e. $\delta_1 = 1$. We denote by $r_{12} = \frac{1}{N}x_1'x_2$ the sample correlation between $x_1$ and $x_2$ and by $\hat{\alpha}_2 = s_y(r_{y2} - r_{12}r_{y1})/(1 - r_{12}^2)$ the LS estimate of $\alpha_2$ in the model including both regressors. We are interested in the conditional posterior inclusion probability of $x_2$, which can be written as a function of
$$h(\hat{\alpha}_2, \theta) = 2\left[\log p(y|\delta_1 = 1, \delta_2 = 0, \sigma^2) - \log p(y|\delta_1 = 1, \delta_2 = 1, \sigma^2)\right]$$
as in equation (5). Under the g-slab it is straightforward to derive $h(\hat{\alpha}_2, g)$ as
$$h(\hat{\alpha}_2, g) = -\frac{N\hat{\alpha}_2^2}{\sigma^2}(1 - r_{12}^2)\,\frac{g}{g+1} + \log(g + 1),$$
and analogously for the f-slab; see Appendix B.2 for details. For a given value of the LS estimate $\hat{\alpha}_2$, the probability of including $x_2$ in addition to $x_1$ in the model therefore decreases with the square of the correlation $r_{12}$ between the two regressors. Figure 6 (left panel) shows the conditional posterior inclusion probabilities of $x_2$ under the Dirac/g-slab as a function of $\hat{\alpha}_2$ for different values of $r_{12}$. Obviously, for highly correlated regressors the inclusion probability of the second regressor can be reduced dramatically.
For the Dirac/i-slab prior, simple but tedious algebra yields
$$h(\hat{\alpha}_2, c) = -\frac{N}{Q\sigma^2}\left[\hat{\alpha}_2(1 - r_{12}^2) + \frac{r_{y2}s_y}{Nc}\right]^2 + \log\left(Nc(1 - r_{12}^2) + 1 + \frac{Ncr_{12}^2}{Nc + 1}\right),$$
where $Q = \frac{(Nc+1)^2}{(Nc)^3}\left[Nc(1 - r_{12}^2) + 1 + \frac{Ncr_{12}^2}{Nc+1}\right]$. The first summand in the function $h(\hat{\alpha}_2, c)$ is different from the corresponding term for g- and f-slab. However, as it is dominated by $-\frac{N\hat{\alpha}_2^2}{\sigma^2}(1 - r_{12}^2)$, this difference will vanish for increasing sample size $N$. Further, in contrast to g- and f-slab, the penalty term $\log(\cdot)$ depends on the regressor correlation $r_{12}$, leading to less penalization of the additional regressor $x_2$ compared both to orthogonal regressors and to g- and f-slabs. Therefore, posterior inclusion probabilities under the i-slab will be higher for correlated regressors. The conditional inclusion probabilities of $x_2$ under the i-slab depend not only on $\hat{\alpha}_2$ but also on $r_{y2}$, and are no longer symmetric in $\hat{\alpha}_2$, at least for small sample size $N$. This is shown in Figure 6 (right panel), which compares the inclusion probability of $x_2$ for g- and i-slab for different correlations $r_{12}$. Posterior inclusion probabilities are considerably smaller under the g-slab for small absolute values of $\hat{\alpha}_2$. Results from our simulations, see Figure 3, suggest that differences in posterior inclusion probabilities under i- and g-slab can be even more pronounced in models with more regressors.
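The shrinking effect of correlation under the g-slab is easy to evaluate numerically; a sketch (the function name is ours, with $\omega$ fixed at 1/2):

```python
import numpy as np

def incl_prob_x2_g(alpha2_hat, r12, N, g, sigma2, omega=0.5):
    """Conditional inclusion probability of x2 given delta_1 = 1 under the
    Dirac/g-slab prior with two standardized regressors."""
    h = (-N * alpha2_hat ** 2 / sigma2 * (1 - r12 ** 2) * g / (g + 1)
         + np.log(g + 1))
    return 1.0 / (1.0 + np.exp(h / 2) * (1 - omega) / omega)
```

For a fixed LS estimate, the inclusion probability drops monotonically as $r_{12}^2$ grows.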

Application
We illustrate application of the different variable selection methods on a data set of psychiatric patients. Metabolic disorders and weight gain are common problems and side effects of psychiatric medication. To investigate how bodyweight and parameters of lipid and glucose metabolism are influenced by psychiatric inpatient treatment, a prospective study was performed at a department of the Wagner-Jauregg hospital in Upper Austria from October 2003 to March 2004. Several lipid and glucose parameters, namely total cholesterol (chol), high density lipoprotein cholesterol (hdl), low density lipoprotein cholesterol (ldl), triglycerides (nf) and fasting glucose (nbz), were measured at admission to and at discharge from the department. Medication, if any, was assessed as prescribed at discharge and assigned to 16 drugs or types of drugs. Additionally, several patient-related variables were collected: age, sex, height, smoking, body mass index at admission and duration of the stay.
The focus of our analysis is to identify covariates influencing the change in HDL, and we used the lipid and glucose values at admission, the 16 different drug types and all patient variables as potential regressors. Excluding observations with missing values leaves data on 231 patients with 27 regressors for the analysis. Following Gelman, Jakulin, Pittau, and Su (2008), metric covariates were standardized and dummy covariates were centered. As a first step, an exploratory Bayesian analysis of the unrestricted model under the prior $N(0, cI)$ with $c = 5$ was carried out. Figure 7 shows the posterior estimates and 95 %-credible intervals of the regression effects. Only for 6 covariates (covariates number 6: chol admiss, 8: hdl admiss, 9: ldl admiss, 16: drug F, 20: drug J and 27: bmi admiss) do these intervals not contain zero, indicating that the corresponding effects "significantly" differ from zero.
As a next step, MCMC for variable selection was run for M = 5000 iterations (after a burn-in of 1000, with the first 500 draws of the burn-in drawn from the unrestricted model) for Dirac spike priors and M = 50000 iterations (after 10000 burn-in with the first 5000 draws from the unrestricted model) for priors with absolutely continuous spikes. To match the slab variances the response was standardized with the estimated residual standard deviation (s = 15.4) of the full regression model. Hyper-parameters were chosen as in the simulation studies: we used a variance ratio of r = 1/10000, ν = 5 and c = 1 and the other parameters were set to g = Nc, b = 1/g, V = c and Q = 4.
Posterior inclusion probabilities were roughly equal for all covariates under the Dirac/i-slab, the SSVS and the NMIG prior, but considerably smaller for the Dirac/g- and Dirac/f-slab priors. Table 5 reports estimated posterior inclusion probabilities for the covariates selected in the median probability model under the Dirac/i-slab prior. Results correspond well with the exploratory analysis of the unrestricted model: the selected covariates form a subset of those identified as having a "significant" effect, and in contrast to the exploratory analysis, Bayesian variable selection automatically controls for multiple testing.
From the medical point of view, the goal of the analysis was to obtain a classification of covariates into those which have nearly zero effect and can be excluded from the model, and others which potentially affect the response variable. Therefore, variable selection was not based on the Dirac/g- and f-slab priors, which penalize dependent regressors more heavily than independent slabs. Table 6 shows inefficiency factors and effective sample sizes per second averaged over all covariates (except covariate 8). Again, inefficiency factors of the posterior inclusion probabilities are considerably higher under priors with continuous spikes. However, when computational effort is taken into account, again all priors except the Dirac/i-slab prior perform similarly.
Finally, to study the effect of the slab variance, we ran MCMC for different values $c = 1, 2.5, 5, 10$ for the i-slab and corresponding parameters for the other priors. The resulting posterior inclusion probability paths, shown in Figure 8 for the Dirac/i-slab and Dirac/g-slab priors, demonstrate the effect of increasing penalization of regressors for larger slab variances.

Summary and Discussion
We compared different spike and slab priors which are widely used for Bayesian variable selection. Simulation studies suggest that for orthogonal regressors the different priors act rather similarly when the slab variances are matched, which is confirmed by theoretical results for Dirac spike priors (and known error variance). The posterior inclusion probability of a specific regressor increases with the signal of the effect in the data and decreases with the variance of the slab component. Compared to orthogonal regressors, both simulations and theoretical results indicate that for a given effect signal the posterior inclusion probability of a correlated regressor is lower under g- and f-slabs and higher under priors with independent slabs.

From a computational point of view, priors with continuous spikes are a fast alternative to the Dirac/i-slab prior as higher autocorrelations are outweighed by less computation time. Mixing of the sampler is better for the NMIG than for the SSVS prior, at the cost of a small additional computational effort. MCMC getting stuck at $p(\delta_j = 1) = 1$ is more severe for SSVS than NMIG priors, where it occurred only for regressors with strong effects.
A drawback of all priors considered here is that they do not discriminate well between regressors with zero and weak effects. Choosing a smaller variance for the slab component does not solve this problem, as inclusion probabilities of all effects, even of zero effects, will increase. For Bayesian testing, Johnson and Rossell (2010) recently proposed so-called non-local prior densities, which are zero in the parameter space of the null hypothesis, to facilitate separation between the null and the alternative. Spike and slab priors compared in this paper could be modified in this direction with slab components having a mode different from zero. Prior information on the size of "relevant" effects could be incorporated by specifying either one slab or, if no information on the effect sign is available, two slabs with a positive and a negative mode, respectively. For slabs which are normal or NMIG, the MCMC schemes presented in this work could be used with slight modifications.

A.1 Conjugate Prior
Under the conjugate prior $\alpha \sim N(a_0, A_0\sigma^2)$ analytical integration is feasible, and the conditional marginal likelihood and marginal likelihood are given as
$$p(y|\sigma^2) \propto (\sigma^2)^{-(N-1)/2}\,\frac{|A_N|^{1/2}}{|A_0|^{1/2}}\,\exp\left(-\frac{S_N}{\sigma^2}\right), \qquad (10)$$
$$p(y) \propto \frac{|A_N|^{1/2}}{|A_0|^{1/2}}\,\frac{\Gamma(s_N)}{S_N^{s_N}}. \qquad (11)$$
Here $a_N$, $A_N$ are the moments of the posterior distribution $p(\alpha|\sigma^2, y)$:
$$a_N = A_N(X'y_c + A_0^{-1}a_0), \qquad A_N = (X'X + A_0^{-1})^{-1},$$
with $S_N = \frac{1}{2}\left(y_c'y_c + a_0'A_0^{-1}a_0 - a_N'A_N^{-1}a_N\right)$ and $s_N = (N-1)/2$. Special cases are the independence prior $\alpha \sim N(0, cI\sigma^2)$ and the g-prior $\alpha \sim N(0, g(X'X)^{-1}\sigma^2)$. In both cases $a_N = A_NX'y_c$ and hence $S_N$ simplifies to
$$S_N = \frac{1}{2}\,y_c'(I - XA_NX')y_c.$$
For the independence prior, $|A_0| = c^d$ and $A_N = (X'X + \frac{1}{c}I)^{-1}$; for the g-prior $A_N = \frac{g}{g+1}(X'X)^{-1}$ and hence $|A_N|^{1/2}/|A_0|^{1/2} = (1 + g)^{-d/2}$.
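The simplification of $S_N$ can be verified numerically; a sketch for the independence prior (NumPy assumed, all variable names ours):

```python
import numpy as np

# Numerical check that S_N = (y_c'y_c - a_N' A_N^{-1} a_N)/2 equals
# y_c'(I - X A_N X') y_c / 2 when a_0 = 0 (here: independence prior A_0 = c I).
rng = np.random.default_rng(5)
N, d = 15, 3
X = rng.normal(size=(N, d))
X -= X.mean(axis=0)                      # centered columns, X'1 = 0
y = rng.normal(size=N)
y_c = y - y.mean()
c = 5.0
A_N = np.linalg.inv(X.T @ X + np.eye(d) / c)
a_N = A_N @ X.T @ y_c
S_full = 0.5 * (y_c @ y_c - a_N @ np.linalg.solve(A_N, a_N))
S_short = 0.5 * y_c @ (np.eye(N) - X @ A_N @ X.T) @ y_c
```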

A.2 Fractional Prior
The fractional prior is obtained as a fraction of the likelihood; more specifically, we define the fractional prior as
$$p(\alpha|\sigma^2) \propto p(y_c|\alpha, \sigma^2)^b, \qquad \text{i.e.} \quad \alpha|\sigma^2 \sim N\left((X'X)^{-1}X'y_c,\; \tfrac{1}{b}(X'X)^{-1}\sigma^2\right).$$
The posterior, obtained by combining the prior with the remaining fraction of the likelihood, is the normal distribution with moments
$$a_N = (X'X)^{-1}X'y_c, \qquad A_N = (X'X)^{-1}.$$
Conditional marginal likelihood and marginal likelihood can be computed from formulas (10) and (11) with
$$S_N = \frac{1}{2}(1 - b)\,y_c'\left(I - X(X'X)^{-1}X'\right)y_c \qquad \text{and} \qquad |A_N|^{1/2}/|A_0|^{1/2} = b^{d/2}.$$
We use the notation $x_j'x_j = Ns_j^2$, $y_c'x_j = Ns_js_yr_{yj}$, $j = 1, \ldots, d$, and $y_c'y_c = Ns_y^2$, and denote by $\hat{\alpha}_j = s_{yj}/s_j^2 = r_{yj}s_y/s_j$ the LS estimator of $\alpha_j$, where $s_{yj} = \frac{1}{N}x_j'y_c$. It will turn out that the conditional posterior inclusion probability of regressor $x_d$ can be written as a function of $\hat{\alpha}_d$ and additional parameters $\theta$, depending on the slab, as
$$p(\delta_d = 1|y, \delta_{\setminus d}, \sigma^2) = \frac{1}{1 + \exp(h(\hat{\alpha}_d, \theta)/2)\,\frac{1-\omega}{\omega}}.$$