Bayesian Inference in the Multinomial Logit Model

The multinomial logit model (MNL) possesses a latent variable representation in terms of random variables following a multivariate logistic distribution. Based on multivariate finite mixture approximations of the multivariate logistic distribution, various data-augmented Metropolis-Hastings algorithms are developed for Bayesian inference in the MNL model. Summary: The multinomial logistic (MNL) regression model possesses a latent variable representation involving a random error term that follows a multivariate logistic distribution. Building on a finite mixture approximation of the multivariate logistic distribution, several Metropolis-Hastings procedures for a Bayesian analysis of the MNL regression model are developed.


Introduction
In the past decades, finite mixture modeling has become a rapidly developing area with numerous applications in biometrics, economics, genetics, and medicine, among many others; see Frühwirth-Schnatter (2006) for a review. An early application of finite normal mixture models has been modeling aberrant observations in astronomical data of the transit of Mercury (Newcomb, 1886). One of the pioneering papers discussing a Bayesian approach to outlier analysis based on finite normal mixtures is the work by Guttman, Dutter, and Freeman (1978).
Finite normal mixture distributions are useful for practical data analysis because they capture many specific properties of real data such as multimodality, skewness, and kurtosis. Also, they arise in a natural way as marginal distributions of statistical models involving clustering or unobserved heterogeneity. Moreover, they are useful for developing efficient estimation procedures for non-Gaussian models, early examples being Sorenson and Alspach (1971) and Alspach and Sorenson (1972).
Austrian Journal of Statistics, Vol. 41 (2012), No. 1, 27-43

Finite mixture distributions possess the following approximation property (Titterington, Smith, and Makov, 1985). Let g(ε) be an arbitrary probability density function and let q_K(ε) be a mixture of normals,

q_K(ε) = \sum_{k=1}^{K} w_k f_N(ε; m_k, s_k^2).   (1)

For sufficiently large K the Kullback-Leibler (KL-)distance between g(ε) and q_K(ε),

d_{KL} = \int g(ε) \log \frac{g(ε)}{q_K(ε)} \, dε,

can be made arbitrarily small. To approximate g(ε) for a fixed K, one has to select the weights w_1, ..., w_K, the means m_1, ..., m_K, and the variances s_1^2, ..., s_K^2 such that d_KL is minimized. It should be noted that this is not a parameter estimation problem, but a problem of numerical optimization.
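To make this numerical optimization concrete, the following Python sketch fits a K-component normal mixture to a univariate target density (here the standard logistic) by minimizing the KL-distance on a quadrature grid. The function name, parametrization, and optimizer settings are illustrative assumptions, not the implementation used for the results reported below.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import logistic, norm

def fit_normal_mixture(K, grid=np.linspace(-15, 15, 2001)):
    """Fit a K-component normal mixture to the logistic density by KL minimization."""
    g = logistic.pdf(grid)
    de = grid[1] - grid[0]

    def unpack(theta):
        a, m, ls = np.split(theta, 3)
        w = np.exp(a) / np.exp(a).sum()   # positive weights summing to one
        return w, m, np.exp(ls)           # positive standard deviations

    def kl(theta):
        w, m, s = unpack(theta)
        q = sum(wk * norm.pdf(grid, mk, sk) for wk, mk, sk in zip(w, m, s))
        # d_KL = \int g log(g / q_K) de, approximated on the grid
        return np.sum(g * (np.log(g) - np.log(q + 1e-300))) * de

    theta0 = np.concatenate([np.zeros(K), np.linspace(-2, 2, K), np.zeros(K)])
    res = minimize(kl, theta0, method="Nelder-Mead",
                   options={"maxiter": 5000, "fatol": 1e-10, "xatol": 1e-6})
    return unpack(res.x), res.fun
```

Even K = 2 already drives the KL-distance close to zero in the univariate case, which anticipates the contrast with the multivariate setting discussed below.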
It has been noted by several authors that Bayesian inference is considerably simpler for many non-Gaussian models if a certain density g(ε) is replaced by an accurate finite mixture approximation q_K(ε). This is in particular true for Markov chain Monte Carlo (MCMC) estimation, where substitution of g(ε) by q_K(ε) leads to simple Gibbs-type sampling schemes; see Gamerman and Lopes (2006) for a review of MCMC methods. Typically, the density g(ε) is independent of any parameters or depends only on an integer parameter; hence the optimal parameters in (1) can be obtained beforehand.
In all of these papers, g(ε) is a univariate density, hence even moderate values of K yield a good approximation. The present work is a first attempt at taking the idea of auxiliary mixture sampling to higher dimensions, which requires that a multivariate density g(ε) is approximated by a multivariate mixture distribution. As an example, we consider Bayesian inference for multinomial logit regression modeling of discrete outcome variables with m + 1 categories. Data augmentation leads to an error term possessing an m-variate logistic distribution which is independent of any parameters and has a quite rigid structure. We will approximate this distribution by both multivariate normal and multivariate Student-t mixtures, again minimizing the KL-distance. However, due to the curse of dimensionality, we do not expect to obtain perfect approximations. Nevertheless, these mixture approximations may be used to construct a joint proposal for all regression parameters within a Metropolis-Hastings (MH) algorithm.

Data Augmentation for the Multinomial Logit Regression Model
Let {y_i} be a sequence of categorical data, i = 1, ..., N, where y_i is equal to one of m + 1 unordered categories, labeled by L = {0, ..., m}. Very often it is of interest to model the probability that y_i takes the value k, for each k ∈ {1, ..., m}, in terms of covariate information. A popular choice is the multinomial logit (MNL) model, for which the choice probabilities are easily computed. Usually, the MNL model takes the following somewhat restricted form,

Pr(y_i = k|β) = \frac{\exp(x_i β_k)}{1 + \sum_{l=1}^{m} \exp(x_i β_l)},   (2)

where β_1, ..., β_m are category specific unknown regression coefficients of dimension d and x_i is a (1 × d) row vector containing covariates which are not category specific. However, for many important applications the MNL model takes a more general form, where the choice probabilities contain regression coefficients that are not category specific. Examples include discrete choice models in marketing (Rossi, Allenby, and McCulloch, 2005) and the partial credit model, used in large educational assessment programs such as PISA (Fox, 2010). In its most general form, the probability that y_i takes the value k is modeled for k = 1, ..., m in the following way,

Pr(y_i = k|β) = \frac{\exp(x_{ki} β)}{\sum_{l=0}^{m} \exp(x_{li} β)},   (3)

where x_ki is a (1 × r)-dimensional, category specific covariate vector and β are unknown regression coefficients of dimension r.
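The choice probabilities of the general model (3) are a softmax over category specific linear predictors. A minimal sketch (function and argument names are our own, not from the paper):

```python
import numpy as np

def mnl_probabilities(X_i, beta):
    """Choice probabilities Pr(y_i = k), k = 0, ..., m, for one observation.

    X_i  : (m+1, r) array whose kth row is the category specific covariate x_ki
    beta : (r,) vector of regression coefficients
    """
    eta = X_i @ beta
    eta -= eta.max()          # stabilize the exponentials
    p = np.exp(eta)
    return p / p.sum()
```

Subtracting the maximal linear predictor leaves the probabilities unchanged while avoiding overflow for large predictors.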

The RUM and the dRUM Representation
Following McFadden (1974), the MNL model (3) may be written as the following random utility model (RUM):

y^u_{0i} = x_{0i} β + ϵ_{0i},   (4)
y^u_{ki} = x_{ki} β + ϵ_{ki},  k = 1, ..., m,   (5)
y_i = k  ⟺  y^u_{ki} = \max_{l ∈ L} y^u_{li}.   (6)

Thus the observed category is equal to the category with maximal utility. If the random utilities ϵ_0i, ϵ_1i, ..., ϵ_mi appearing in (4) and (5) are i.i.d. following an extreme value distribution, then the MNL model (3) results as the marginal distribution of y_i. An alternative way to write the MNL model (3) is a difference random utility model (dRUM), which is obtained by choosing a baseline category k_0, typically k_0 = 0, and considering the model in terms of the differences of the utilities. From (4) to (6) we obtain the following dRUM representation:

z_{ki} = (x_{ki} − x_{0i}) β + ε_{ki},  k ∈ L_{−0},   (7)
y_i = k  ⟺  z_{ki} = \max(0, z_{1i}, ..., z_{mi}),  with y_i = 0 if all z_{li} < 0,

where z_ki = y^u_ki − y^u_0i and ε_ki = ϵ_ki − ϵ_0i, and L_{−0} is the set of all categories but 0.
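The equivalence between the RUM with extreme value errors and the MNL choice probabilities can be checked by simulation; the toy linear predictors below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear predictors eta_k (baseline eta_0 = 0); these values are assumptions.
eta = np.array([0.0, 0.5, -0.3])
n = 200_000

# RUM: utilities are eta_k plus i.i.d. standard Gumbel (extreme value) errors,
# and the observed category is the argmax of the utilities.
utilities = eta + rng.gumbel(size=(n, 3))
y = utilities.argmax(axis=1)

p_mc = np.bincount(y, minlength=3) / n           # Monte Carlo frequencies
p_mnl = np.exp(eta) / np.exp(eta).sum()          # MNL choice probabilities
```

With 200,000 draws the empirical frequencies agree with the MNL probabilities to well within Monte Carlo error.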
Whereas in the multinomial probit model the error term follows a multivariate normal distribution, the vector ε_i that appears in the dRUM representation (7) of the MNL model has a multivariate logistic distribution. The multivariate logistic distribution was introduced by Malik and Abraham (1973) as a generalization of Gumbel's bivariate logistic distribution (Gumbel, 1961). If m denotes the number of variates, its pdf reads

f_{LO_m}(ε) = \frac{m! \exp(−\sum_{k=1}^{m} ε_k)}{(1 + \sum_{k=1}^{m} \exp(−ε_k))^{m+1}}.

As shown by Balakrishnan (1992, Section 11.2), a multivariate logistic distribution results if an i.i.d. sequence of (m + 1) random variables ϵ = (ϵ_0, ..., ϵ_m)′ from the extreme value distribution is transformed into the sequence of differences ε_k = ϵ_k − ϵ_0, k = 1, ..., m. This is exactly the transformation of the error ϵ in the RUM representation to the error ε in the dRUM representation. Kotz, Johnson, and Balakrishnan (2000, Chapter 51) provide a comprehensive review of further properties of the multivariate logistic distribution. For instance, while the errors in the RUM representation (5) are i.i.d., the errors in the dRUM representation (7) are no longer independent across categories, but correlated. The variance-covariance matrix R of ε is given by

R = \frac{π^2}{6} (I_m + e_m e_m′),   (8)

where e_m denotes the (m × 1) vector of ones. Since the correlation coefficient is equal to 0.5 for all pairs (ε_k, ε_l), R is a uniform covariance matrix. It follows immediately that the inverse R^{−1} can be computed explicitly:

R^{−1} = \frac{6}{π^2} \left( I_m − \frac{e_m e_m′}{m+1} \right).
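For reference, a sketch of the m-variate logistic density and of the uniform covariance matrix R together with its explicit inverse (assuming the standard parametrization with variance π²/3 and correlation 0.5):

```python
import numpy as np
from math import factorial

def mv_logistic_pdf(eps):
    """Density of the m-variate logistic distribution of Malik and Abraham (1973)."""
    m = len(eps)
    return factorial(m) * np.exp(-eps.sum()) / (1 + np.exp(-eps).sum()) ** (m + 1)

def mv_logistic_cov(m):
    """Uniform covariance matrix R of eps and its explicit inverse."""
    E = np.ones((m, m))
    R = (np.pi ** 2 / 6) * (np.eye(m) + E)            # variance pi^2/3, correlation 0.5
    R_inv = (6 / np.pi ** 2) * (np.eye(m) - E / (m + 1))
    return R, R_inv
```

For m = 1 the density reduces to the standard logistic density, and R · R^{−1} = I_m can be verified numerically for any m.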

Bayesian Inference
Subsequently, we pursue a Bayesian approach and assume that a priori the regression coefficient β follows a normal distribution N_r(b_0, B_0) with known hyperparameters b_0 and B_0. Since the posterior distribution p(β|y) of the regression coefficient β in the MNL model does not have any closed form, it is usual to apply data augmentation and Markov chain Monte Carlo estimation; see Frühwirth-Schnatter and Frühwirth (2010) for a recent review. Data augmentation has been based both on the RUM representation (Frühwirth-Schnatter and Frühwirth, 2007; Scott, 2011) and on the dRUM representation (Holmes and Held, 2006; Frühwirth-Schnatter and Frühwirth, 2010). The case studies in Frühwirth-Schnatter and Frühwirth (2010, Section 4) reveal that the corresponding MCMC samplers are much more efficient for the dRUM representation than for the RUM representation.
However, the presence of the multivariate logistic distribution complicates MCMC sampling for the dRUM representation. If only category specific coefficients are present, as in (2), then it is possible to derive a partial dRUM representation of the MNL model. For each category k, the corresponding latent variable representation is a dRUM representation of a binary logit model, with category k being one outcome and all alternative categories being the second outcome. This makes it possible to apply any of the efficient samplers that have been developed in Frühwirth-Schnatter and Frühwirth (2010, Section 3.2) for the dRUM representation of a binary logit model. The sampler developed by Holmes and Held (2006) also works with the partial dRUM representation, but is much more demanding in terms of computing time and therefore less efficient.
Regrettably, the partial dRUM representation does not lead to simple MCMC sampling for the more general model (3), which contains regression coefficients that are not category specific.Alternative sampling methods for this case have been developed and will be presented in the following section.

Data Augmented Metropolis-Hastings Algorithms in the dRUM Representation
The data augmented MH algorithm operates in the dRUM representation (7) of the MNL model. Following the MCMC literature on the multinomial probit model, the latent variables z = (z_1, ..., z_N), where z_i = (z_1i, ..., z_mi)′, are introduced as missing data. The sampler iterates between sampling from β|z and sampling from z|β, y.

A closed form Gibbs step is available for joint sampling of z|β, y. To sample z_i, for i = 1, ..., N, we sample the latent utilities y^u_i = (y^u_0i, ..., y^u_mi) in the RUM model (4) to (6) from the posterior y^u_i|β, y_i, which is given by:

y^u_{ki} = −\log\!\left( \frac{−\log U_i}{\sum_{l=0}^{m} λ_{li}} \right),  for k = y_i,
y^u_{ki} = −\log\!\left( \exp(−y^u_{y_i,i}) − \frac{\log V_{ki}}{λ_{ki}} \right),  for k ≠ y_i,

where U_i and V_1i, ..., V_mi are m + 1 independent uniform random numbers in [0, 1], λ_li = exp(x_li β) for l = 1, ..., m, and λ_0i = 1. Then we define z_i = (z_1i, ..., z_mi)′ as the differences in utility, i.e. z_ki = y^u_ki − y^u_0i, k = 1, ..., m.

To sample from β|z, we rewrite the dRUM model (7) as a multivariate regression model,

z_i = X_i β + ε_i,   (9)

where X_i is an (m × r)-matrix with the kth row being equal to x_ki. However, whereas sampling from β|z is straightforward for the multinomial probit model, because ε_i is multivariate normal, this step is non-standard in the MNL model because ε_i is multivariate logistic. Subsequently, we suggest and compare various MH algorithms for joint sampling from β|z.
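The Gibbs step for z|β, y can be implemented through the exponential representation exp(−y^u_li) ∼ Exponential(λ_li) of extreme value utilities, in the spirit of Frühwirth-Schnatter and Frühwirth (2007). The following sketch is a reconstruction under that assumption; names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_utilities(lam, k_obs):
    """Draw latent utilities y^u_i | beta, y_i = k_obs for one observation.

    lam : (m+1,) array of lambda_li = exp(x_li beta), with lam[0] = 1.
    Uses exp(-y^u_l) ~ Exponential(lambda_l): the maximal utility belongs to
    the observed category, and the remaining utilities are drawn conditional
    on being smaller (memorylessness of the exponential distribution).
    """
    U = rng.uniform()
    V = rng.uniform(size=len(lam))
    yu = np.empty(len(lam))
    yu[k_obs] = -np.log(-np.log(U) / lam.sum())          # maximal utility
    for l in range(len(lam)):
        if l != k_obs:
            yu[l] = -np.log(np.exp(-yu[k_obs]) - np.log(V[l]) / lam[l])
    return yu

# The utility differences z_ki = yu[k] - yu[0] then enter the dRUM representation.
```

By construction the drawn utility of the observed category is maximal, so the draw is always consistent with the observation y_i.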

A Data-augmented Independence Metropolis-Hastings Sampler
First, we construct an independence MH step by sampling β^new from a proposal density q(β|z) which is independent of the previous draw of β. As usual, β^new is accepted with probability P = min(α, 1), where

α = \frac{p(z|β^{new}) p(β^{new}) q(β^{old}|z)}{p(z|β^{old}) p(β^{old}) q(β^{new}|z)}.

The proposal density q(β|z) is based on approximating the distribution of ε_i in (9) by a multivariate normal distribution with the same expectation (which is equal to 0) and the same variance-covariance matrix R, given in (8), as the m-variate logistic distribution. This leads to a multivariate regression model with homoscedastic, equi-correlated errors, which reads for i = 1, ..., N:

z_i = X_i β + \tilde{ε}_i,  \tilde{ε}_i ∼ N_m(0, R).

Under the prior distribution β ∼ N_r(b_0, B_0), the posterior of this approximate model is equal to the multivariate normal distribution N_r(b_N, B_N) with moments

B_N = \left( B_0^{−1} + \sum_{i=1}^{N} X_i′ R^{−1} X_i \right)^{−1},  b_N = B_N \left( B_0^{−1} b_0 + \sum_{i=1}^{N} X_i′ R^{−1} z_i \right).

This posterior is then used as proposal q(β|z). By using the explicit expression for R^{−1} in (8) we obtain

X_i′ R^{−1} X_i = \frac{6}{π^2} \left( X_i′ X_i − \frac{w_i w_i′}{m+1} \right),  X_i′ R^{−1} z_i = \frac{6}{π^2} \left( X_i′ z_i − \frac{c_i w_i}{m+1} \right),

where w_i = X_i′ e_m is an (r × 1) column vector containing the column sums of X_i and c_i = e_m′ z_i is a scalar containing the sum of the elements of z_i. For the special case of model (2), where r = d · m and β = (β_1, ..., β_m)′ contains only category specific coefficients, these expressions simplify further, since X_i is block diagonal.
If each coefficient β k has the same normal prior
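A sketch of the computation of the proposal moments (b_N, B_N), exploiting the explicit form of R^{−1} so that no m × m inverse is needed; the interface is an illustrative assumption:

```python
import numpy as np

def proposal_moments(X, z, b0, B0):
    """Moments (b_N, B_N) of the Gaussian proposal q(beta|z).

    X : list of (m, r) design matrices X_i;  z : list of (m,) utility differences.
    Uses X_i' R^{-1} X_i = (6/pi^2)(X_i'X_i - w_i w_i'/(m+1)) with w_i = X_i' e_m.
    """
    m = X[0].shape[0]
    c = 6 / np.pi ** 2
    P = np.linalg.inv(B0)                 # prior precision B_0^{-1}
    rhs = P @ b0
    for X_i, z_i in zip(X, z):
        w_i = X_i.sum(axis=0)             # column sums of X_i
        c_i = z_i.sum()                   # sum of the elements of z_i
        P = P + c * (X_i.T @ X_i - np.outer(w_i, w_i) / (m + 1))
        rhs = rhs + c * (X_i.T @ z_i - c_i * w_i / (m + 1))
    B_N = np.linalg.inv(P)
    return B_N @ rhs, B_N
```

The result agrees with the direct computation that forms R^{−1} explicitly, while avoiding any matrix inversion inside the loop over observations.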

Doubly Data-augmented Metropolis-Hastings Samplers
In this section we construct rather general MH samplers for the multinomial logit model by approximating the distribution of ε_i in (9) by a mixture of multivariate distributions, where the pdf p(ε_i) results as the marginal density of an (m + 1)-variate random variable (ε_i, λ_i), with ε_i|λ_i following a normal distribution, i.e. ε_i|λ_i ∼ N_m(m_i, R_i), and λ_i ∼ p(λ_i). We focus on mixture distributions where it is easy to sample from the conditional density λ_i|ε_i, such as the multivariate Student-t distribution, which is a scale mixture of multivariate normal distributions, finite multivariate normal mixtures, and finite multivariate Student-t mixtures. Once the approximate distribution p(ε_i) has been chosen, no further tuning parameters appear in this MH sampler.

Sampling Scheme
The advantage of approximating the distribution of ε_i in (9) by a mixture of multivariate normals is that double data augmentation, i.e. conditioning on the latent variables λ = (λ_1, ..., λ_N) in addition to the latent variables z = (z_1, ..., z_N), leads to a multivariate regression model with normally distributed errors, which reads for i = 1, ..., N:

z_i = X_i β + ε_i,  ε_i | λ_i ∼ N_m(m_i, R_i).   (10)

The posterior q(β|λ, z) of this model is equal to a normal distribution N_r(b_N, B_N). All approximate error distributions discussed below have in common that the error covariance matrix R_i in the approximate model (10) is a uniform covariance matrix, i.e., R_i = σ_i^2 C_i, where C_i = (1 − ρ_i) I_m + ρ_i e_m e_m′. This leads to a straightforward way to compute the moments b_N and B_N:

B_N = \left( B_0^{−1} + \sum_{i=1}^{N} \frac{a_i X_i′ X_i + b_i w_i w_i′}{σ_i^2} \right)^{−1},
b_N = B_N \left( B_0^{−1} b_0 + \sum_{i=1}^{N} \frac{a_i X_i′ (z_i − m_i) + b_i d_i w_i}{σ_i^2} \right),

where the (r × 1)-vector w_i = X_i′ e_m contains the column sums of X_i, the scalar d_i = e_m′ (z_i − m_i) contains the sum of all elements of (z_i − m_i), and a_i and b_i follow from the explicit inverse C_i^{−1} = a_i I_m + b_i e_m e_m′:

a_i = \frac{1}{1 − ρ_i},  b_i = − \frac{ρ_i}{(1 − ρ_i)(1 + (m − 1) ρ_i)}.
Since q(β|λ, z) is an important building block of our MH algorithm, we call the resulting sampler a doubly data-augmented MH sampler. Conditional on the utilities z, the proposal q(β^new|β^old) is constructed in the following way:

q(β^{new}|β^{old}) = \int q(β^{new}|λ, z) \prod_{i=1}^{N} q(λ_i|β^{old}, z_i) \, dλ,

where q(β^new|λ, z) is the posterior of model (10) and q(λ_i|β^old, z_i) is equal to the conditional posterior of λ_i given ε_i = z_i − X_i β^old:

q(λ_i|β^{old}, z_i) ∝ p(ε_i|λ_i) p(λ_i) \big|_{ε_i = z_i − X_i β^{old}},

which is available in closed form according to our assumption.
It is easy to show that

\frac{q(β^{old}|β^{new})}{q(β^{new}|β^{old})} = \frac{p(β^{old}) \prod_{i=1}^{N} p(z_i − X_i β^{old})}{p(β^{new}) \prod_{i=1}^{N} p(z_i − X_i β^{new})}.

Therefore, the acceptance probability P = min(α, 1) may be expressed entirely in terms of likelihood ratios between the exact multivariate logistic distribution and the approximate distribution p(ε):

α = \prod_{i=1}^{N} \frac{f_{LO_m}(z_i − X_i β^{new}) \, p(z_i − X_i β^{old})}{f_{LO_m}(z_i − X_i β^{old}) \, p(z_i − X_i β^{new})}.

Hence, if p(ε) is a good approximation to f_{LO_m}(ε) over a wide range of ε, then the acceptance rate will be close to 1. Subsequently, we consider several error distributions p(ε), obtained by approximating f_{LO_m}(ε) in various ways. Finally, note that λ is an auxiliary variable sampled only in order to construct the proposal. Since the utilities are sampled from z|β using the exact dRUM model, the latent variables λ must not be stored and used in any subsequent MCMC sweep; see van Dyk and Park (2008) for a theoretical justification.
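The acceptance step only requires evaluating the exact multivariate logistic density and the approximating density p(ε) at the current residuals. A sketch in log scale (the callable log_p_eps stands for any of the approximations discussed below; names are illustrative):

```python
import numpy as np
from math import factorial

def log_f_lo(eps):
    """Log density of the m-variate logistic distribution."""
    m = len(eps)
    return (np.log(factorial(m)) - eps.sum()
            - (m + 1) * np.log1p(np.exp(-eps).sum()))

def log_accept_ratio(X, z, beta_new, beta_old, log_p_eps):
    """log alpha as a product of likelihood ratios between f_LO and p(eps)."""
    la = 0.0
    for X_i, z_i in zip(X, z):
        e_new = z_i - X_i @ beta_new
        e_old = z_i - X_i @ beta_old
        la += (log_f_lo(e_new) - log_p_eps(e_new)
               - log_f_lo(e_old) + log_p_eps(e_old))
    return la
```

Working on the log scale avoids underflow when N is large, and the ratio is identically 1 when β^new = β^old, as it must be.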

Using a Multivariate Student-t Distribution
Several authors (Albert and Chib, 1993; Liu, 2004) approximate the binary logit model by a binary discrete choice model based on the cdf of a univariate t_ν-distribution with ν in the range of 7 to 8, because the cdfs are very similar over a wide range. Since all univariate marginal distributions of the multivariate logistic distribution are logistic distributions, this suggests approximating the multivariate logistic distribution by a multivariate Student t_ν(0, Σ)-distribution. As the multivariate logistic distribution is invariant to permuting the elements of ε, Σ has to be a uniform covariance matrix: Σ = σ^2((1 − ρ)I_m + ρ e_m e_m′). The multivariate t_ν-distribution is a scale mixture of normal distributions, i.e., ε_i|λ_i ∼ N_m(0, Σ/λ_i) with λ_i ∼ G(ν/2, ν/2) being a scale variable taking values in ℜ^+. Hence, m_i = 0 and R_i = Σ/λ_i in the approximate model (10). The conditional posterior q(λ_i|β, z_i) is given by

λ_i | β, z_i ∼ G\!\left( \frac{ν + m}{2}, \frac{ν + ε_i′ Σ^{−1} ε_i}{2} \right),  ε_i = z_i − X_i β.

It remains to choose ν, σ^2, and ρ. We have determined them by minimizing the KL-distance. The minimization was performed by the MATLAB implementation of the simplex algorithm of Nelder and Mead (1965). The corresponding optimal parameters for m = 2, ..., 6 are reported in Table 1, along with the KL-distance to the target distribution.
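The conditional Gamma update for the scale variable λ_i follows from standard normal-Gamma conjugacy and can be sketched as follows (names are our own):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_lambda(eps, Sigma_inv, nu):
    """Draw lambda_i | eps_i for the Student-t approximation.

    With eps_i | lambda_i ~ N_m(0, Sigma/lambda_i) and lambda_i ~ G(nu/2, nu/2),
    conjugacy gives lambda_i | eps_i ~ G((nu + m)/2, (nu + eps' Sigma^{-1} eps)/2).
    """
    m = len(eps)
    shape = (nu + m) / 2
    rate = (nu + eps @ Sigma_inv @ eps) / 2
    return rng.gamma(shape, 1.0 / rate)   # numpy uses the scale parametrization
```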

Using Multivariate Finite Normal Mixture Distributions
In the univariate case the approximation can be made accurate enough to have an acceptance rate of virtually 1. Hence, a Gibbs-type sampler can be run without a rejection step.
In the multivariate case a finite mixture approximation has to be found in m variates.Due to the curse of dimensionality, it is not to be expected that an MH-step can be implemented without a rejection step.
The density of the multivariate finite normal mixture approximation reads

p(ε) = \sum_{k=1}^{K} w_k f_N(ε; μ_k, Σ_k).

The corresponding latent variable representation involves the discrete random variable λ_i ∼ MulNom(w_1, ..., w_K) taking values in the set {1, ..., K} and ε_i | λ_i = k ∼ N_m(μ_k, Σ_k). Hence, m_i = μ_{λ_i} and R_i = Σ_{λ_i} in the approximate model (10). The conditional posterior q(λ_i|β, z_i) is given by λ_i ∼ MulNom(p_i1, ..., p_iK), where

p_{ik} ∝ w_k f_N(z_i − X_i β; μ_k, Σ_k)

and \sum_{k=1}^{K} p_{ik} = 1 for all i = 1, ..., N. Finding an approximation with acceptance rate close to 1 is difficult. However, the multivariate logistic distribution has a rigid structure, which makes it possible to impose restrictions on the means and covariance matrices of the mixture components. In particular, the multivariate logistic distribution in m variates is invariant under a permutation of the variates and therefore has m-fold symmetry with respect to the axis vector a = (1, ..., 1)′. Consequently, each component k of the mixture consists of m copies. Each copy has the same weight w_k/m and the same covariance matrix σ_k^2 R_k. The conditional posterior factorizes as q(λ_i|β, z_i) = q(λ_1i|β, z_i) q(λ_2i|λ_1i, β, z_i), where λ_1i indexes the component and λ_2i the copy within the component, and λ_1i is sampled marginally from λ_1i ∼ MulNom(p_i1, ..., p_iK).
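Sampling the component indicator in the finite normal mixture approximation reduces to evaluating weighted normal densities at ε_i = z_i − X_i β and normalizing. A sketch (for an unrestricted mixture, ignoring the m-fold symmetry for simplicity):

```python
import numpy as np
from scipy.stats import multivariate_normal

def component_posterior(eps, w, mus, Sigmas):
    """Posterior probabilities p_ik of the component indicator lambda_i.

    p_ik is proportional to w_k f_N(eps; mu_k, Sigma_k), normalized over k.
    """
    dens = np.array([wk * multivariate_normal.pdf(eps, mu, S)
                     for wk, mu, S in zip(w, mus, Sigmas)])
    return dens / dens.sum()
```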
The approximating Student-t mixtures were obtained in the same way as the Gaussian mixtures, with the number of degrees of freedom as an additional parameter, which was assumed to be the same for all components. Figure 4 shows the KL-distances between the target distribution and the mixtures, as a function of the dimension m and the number K of mixture components. Again, for m > 12 the parameters of the mixtures for m = 12 have been used.

Illustrative Applications
For all examples, we take an independent standard normal prior for each regression coefficient, i.e., b 0 = 0 and B 0 = I r .We use each of the four MH methods presented above to produce M = 10000 draws from the posterior distribution after running burn-in for 2000 iterations.For any proposals based on a finite mixture approximation, we increase the number of mixture components K from 2 to 6.

Simulated Data Sets
To evaluate the acceptance rate of the four MH samplers with increasing dimension, we consider a simple example, namely N i.i.d. observations y_1, ..., y_N with m + 1 categories, drawn from Pr(y_i = k) = π_k, k = 0, ..., m. The latent equation in the corresponding dRUM model reads for i = 1, ..., N:

z_{ki} = β_k + ε_{ki},  k = 1, ..., m,   (11)

where β_k = log(π_k/π_0). Note that the mixture approximation is applied to equation (11) not only once, but N times.
We made various comparisons with respect to the true parameter β and investigated both balanced distributions, where the π_k are roughly the same, and unbalanced distributions, where some of the π_k are very small. We found, however, that the acceptance rates of the samplers were insensitive to these changes. Hence, we decided to assume that π_0, ..., π_m is a uniform distribution over the (m + 1) categories.
To evaluate the quality of the various mixture approximations, we start with a single observation, i.e., N = 1. Note that the proper prior distribution p(β) guarantees that the posterior p(β|y_1) is proper although the dimension of β is larger than N = 1. Table 2 shows the acceptance rate as a function of the number K of components (K = 1, ..., 6) and the dimension m (m = 2, ..., 6). We see the expected behavior: the acceptance rate drops with increasing m and rises with increasing K. For N = 100 observations the acceptance rate drops by about 15 to 20 percentage points, and the effect is more pronounced for small K. If N is increased to 1000, the acceptance rate drops only slightly, by a couple of percentage points in the worst case.

Real Data Sets
Furthermore, we consider two real data sets: first, the car data (Scott, 2011), a medium sized data set (N = 263) with 3 categories and 4 regressors, i.e. r = 8; second, the Caesarean birth data. Table 3 and Table 4 evaluate and compare the various MH samplers using common measures such as the average acceptance rate after burn-in (in percent, Acc) and the runtime in terms of the CPU time T_CPU after burn-in (in seconds, CPU). In addition, for each regression coefficient β_k, k = 1, ..., r, we compute the inefficiency factor τ = 1 + 2 \sum_{h=1}^{H} ρ(h), where ρ(h) denotes the empirical autocorrelation of the MCMC draws at lag h and H is determined by the initial monotone sequence estimator (Geyer, 1992); the effective sampling size (Kass, Carlin, Gelman, and Neal, 1998), defined by ESS = M/τ; as well as the effective sampling rate, defined as the ratio ESS/T_CPU. Table 3 and Table 4 report the median inefficiency factor (Ineff) and the median effective sampling rate (ESR) over all regression coefficients.
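The inefficiency factor and effective sample size can be sketched as follows; the truncation rule is a simplified variant of Geyer's (1992) initial monotone sequence estimator, so the numbers are illustrative rather than a replication of the tables:

```python
import numpy as np

def autocorr(x, h):
    """Empirical autocorrelation of the draws at lag h."""
    if h == 0:
        return 1.0
    xc = x - x.mean()
    return (xc[:-h] * xc[h:]).sum() / (xc * xc).sum()

def inefficiency(x):
    """Inefficiency factor tau = 1 + 2 sum_{h=1}^H rho(h) and ESS = M / tau.

    H is chosen by a simplified initial monotone sequence rule: the pair sums
    Gamma_t = rho(2t) + rho(2t+1) are accumulated while they stay positive
    and decreasing.
    """
    M = len(x)
    rho = [autocorr(x, h) for h in range(min(M // 2, 2000))]
    total, last = 0.0, np.inf
    for t in range(len(rho) // 2):
        g = rho[2 * t] + rho[2 * t + 1]
        if g <= 0 or g > last:
            break
        total += g
        last = g
    tau = max(1.0, 2.0 * total - 1.0)   # equals 1 + 2 sum_h rho(h)
    return tau, M / tau
```

For an i.i.d. chain τ is close to 1 and ESS close to M; strong positive autocorrelation inflates τ and shrinks ESS accordingly.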
In both examples, the acceptance rate rises with increasing number K of components, and the inefficiency factor drops.These effects, however, are not strong enough to compensate for the rise of the computational load with increasing K, so that we observe no net gain in terms of the effective sample rate, and even a net loss in three out of the four cases studied.

Concluding Remarks
We have shown how to construct data-augmented Metropolis-Hastings samplers for the general multinomial logit model. The data augmentation relies on two mixture approximations to the multivariate logistic error distribution, which is characteristic of the dRUM representation of the model. We have studied the corresponding MH samplers on simulated and on real data sets. The results show that the sampling scheme is sound, but that the approximations are not yet precise enough to yield an acceptance rate and an inefficiency factor that more than offset the rise in computational load inherent in the evaluation of the multivariate mixtures.

Figure 1 :
Figure 1: Top: Contour lines of the bivariate logistic distribution and the approximating Gaussian mixture with five components (m = 2, K = 5). The corresponding contour lines are at the same height. The plus signs indicate the locations of the ten component means. Bottom: the univariate marginal distributions; they are virtually identical.

Figure 2:
Figure 2: The KL-distance between the Gaussian mixtures and the multivariate logistic distribution as a function of the dimension m and the number K of mixture components.

Figure 3:

Figure 4 :
Figure 4: The KL-distance between the Student-t mixtures and the multivariate logistic distribution as a function of the dimension m and the number K of mixture components.

Table 1 :
Parameters of the optimal multivariate Student-t distribution.

Table 3 :
Evaluating the various MH-algorithms for the car data.

Table 4 :
Evaluating the various MH-algorithms for the Caesarean birth data.