Model Selection Using Modified Akaike's Information Criterion: An Application to Maternal Morbidity Data

The most commonly used model selection criterion, Akaike's Information Criterion (AIC), cannot be used when the Generalized Estimating Equations (GEE) approach is considered for analyzing multivariate binary responses. Recently, a modified version of the AIC (mAIC), which is based on the quasi-likelihood function, was proposed as a model selection criterion. This criterion can be used in the GEE setup. In this study, an application of the mAIC is shown in selecting important covariates associated with pregnancy related complications of Bangladeshi women.


Introduction
Millions of women in developing countries like Bangladesh experience life-threatening and other health related problems during pregnancy and the post-partum period, when they require professional care, but many of them either do not perceive the seriousness of their condition or do not have favorable conditions to seek care. About 16,000 maternal deaths occurred in Bangladesh due to pregnancy and delivery related complications in the year 2000.
Though experiencing complications during pregnancy and the post-partum period is very common among Bangladeshi women, not many studies have been conducted in Bangladesh on this topic. Recently, the Bangladesh Institute of Research for Promotion of Essential and Reproductive Health Technologies (BIRPERHT), a non-governmental organization, conducted a prospective survey on maternal morbidity in Bangladesh in which the selected women were followed during pregnancy and the post-partum period. Among other important pregnancy-related variables, the presence or absence of any complication during pregnancy was recorded over the follow-up period for each of the selected women.
Since several measurements are made on each woman at different time points, the responses are usually positively correlated; responses of this type are known as multivariate or correlated binary responses.
The methods for analyzing multivariate binary responses can be classified into two broad classes: likelihood based methods and estimating equation based methods. The likelihood based methods require complete specification of the joint distribution of the multivariate responses, whereas the estimating equation based methods can be employed when the joint distribution is not fully specified. The most common likelihood based methods for multivariate binary data are the multivariate probit and multivariate logit models, which consider univariate normal and logistic distributions as univariate margins, respectively (see Joe, 1997). On the other hand, the generalized estimating equations (GEE) methodology, an estimating equation based method proposed by Liang and Zeger (1986) (see also Zeger and Liang, 1986), is widely used for analyzing multivariate binary responses. GEE can be used for analyzing both continuous and discrete multivariate responses within the generalized linear model framework. This method provides consistent estimators of the regression parameters if the marginal means are correctly specified. Liang and Zeger introduced the "working" correlation matrix, in which a larger value of the working correlation parameter is used if there is more dependence in the data.
Model selection is an important part of data analysis and amounts to a search for the "best" model. By "best" model, we mean the best subset of the covariates available in the data. Usually model selection is done by using a specific criterion. For likelihood-based methods, Akaike's Information Criterion (AIC) (Akaike, 1973) is widely used as a model selection criterion. But for non-likelihood-based methods, e.g., GEE, no such criterion was available for model selection until recently, when a modified Akaike's Information Criterion (mAIC), which is based on the quasi-likelihood function (McCullagh and Nelder, 1989), was proposed as a model selection criterion (Pan, 2001a). Among other non-likelihood based methods for model selection, Pan (2001b) proposed the bootstrap smoothed cross-validation (BCV), a general model selection criterion that minimizes the expected predictive bias (EPB). Pan and Lee (2001) suggested the basic and bias-corrected bootstrap approaches to estimate the predictive mean squared error (PMSE) of a model and to use the PMSE for model selection. Cantoni et al. (2005) suggested a generalized version of Mallows's $C_p$ ($GC_p$), suitable for use with both parametric and non-parametric models, that provides an estimate of a measure of a model's adequacy for prediction. Recently, Cantoni et al. (2008) also proposed a cross-validation Markov Chain Monte Carlo (MCMC) procedure as a general variable selection tool which avoids the need to visit all candidate models.
The main objective of this paper is to select the best models from a given set of covariates when the response is multivariate binary. The generalized estimating equations (GEE) approach is considered for modeling the multivariate binary response, and an appropriate information criterion is used to select the best subset of the available covariates. A procedure for selecting the working correlation structure for the selected subset of covariates is also described with an example of maternal morbidity data. In Section 2, the method of generalized estimating equations and the modified Akaike's Information Criterion are briefly described. In Section 3, a short description of the sampling procedure and estimates of the different models considered in the analysis are given, and Section 4 contains the conclusion.

Generalized Estimating Equations
Let $y_i = (y_{i1}, \ldots, y_{id_i})'$ be the response vector corresponding to the $i$th woman, $i = 1, \ldots, n$, where the binary response $y_{ij}$ corresponds to the $j$th time-point of the $i$th woman and represents whether or not the woman suffers from a specific complication. Let $x_{ij} = (x_{ij1}, \ldots, x_{ijp})'$ be the vector of covariates corresponding to the $j$th response of the $i$th woman, where $x_{ij1} = 1$ for all $i, j$. Let us assume that $y_{ij}$ follows a distribution from the exponential family and that the dependence of the mean function $\mu_{ij} = \Pr(y_{ij} = 1)$ on the covariate set $x_{ij}$ can be expressed by the link function $h(\cdot)$ as

$$h(\mu_{ij}) = x_{ij}'\beta,$$

where $\beta = (\beta_1, \ldots, \beta_p)'$ is the vector of regression parameters.
Liang and Zeger (1986) used a working correlation matrix $R_i(\alpha)$, $i = 1, \ldots, n$, of order $d_i \times d_i$ to specify the within-subject dependence. The form of the working correlation matrix is assumed to be fully specified by the parameters $\alpha = (\alpha_1, \ldots, \alpha_q)'$. Common correlation structures such as the independence and the exchangeable correlation structure can be obtained by considering

$$R_i(\alpha) = I_{d_i} \quad \text{and} \quad R_i(\alpha) = (1 - \rho)\, I_{d_i} + \rho\, J_{d_i},$$

respectively, where $\rho = \mathrm{corr}(y_{ij}, y_{ik})$, $j, k = 1, \ldots, d_i$, $j \neq k$, $I_{d_i}$ is the identity matrix of order $d_i \times d_i$, and $J_{d_i}$ is a $d_i \times d_i$ matrix with all elements equal to one.
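The two working correlation structures can be written down directly. The following is a minimal sketch in plain Python; the function names are illustrative assumptions, not part of any GEE software.

```python
def independence_corr(d):
    """Working independence: R = I_d (identity matrix of order d)."""
    return [[1.0 if j == k else 0.0 for k in range(d)] for j in range(d)]


def exchangeable_corr(d, rho):
    """Exchangeable structure: R = (1 - rho) * I_d + rho * J_d,
    i.e. ones on the diagonal and rho everywhere else."""
    return [[1.0 if j == k else rho for k in range(d)] for j in range(d)]


# A woman observed at d_i = 3 time points, with an assumed rho of 0.4
R_exch = exchangeable_corr(3, 0.4)
for row in R_exch:
    print(row)
```

Setting rho to zero in the exchangeable structure recovers the independence structure, which is why the latter is often treated as the baseline case.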
For estimating the regression parameters, Liang and Zeger (1986) proposed the following set of estimating equations

$$U(\beta, \alpha, \phi) = \sum_{i=1}^{n} D_i' V_i^{-1} (y_i - \mu_i) = 0, \qquad (1)$$

where $\mu_i = (\mu_{i1}, \ldots, \mu_{id_i})'$, $D_i = \partial \mu_i / \partial \beta'$, and $V_i$ is the working covariance matrix considered for the $i$th subject, which can be expressed as a function of the working correlation matrix as

$$V_i = A_i^{1/2} R_i(\alpha) A_i^{1/2},$$

where $A_i = \mathrm{diag}(\mathrm{var}(y_{i1}), \ldots, \mathrm{var}(y_{id_i}))$ is a function of the known mean function $\mu_{ij}$ and the dispersion parameter $\phi$. Thus, the estimating equations (1) are functions of the regression parameters $\beta$, the correlation parameters $\alpha$, and the dispersion parameter $\phi$. If the regression parameters are of main interest, the estimating equations can be reduced to a function of $\beta$ alone by replacing $\alpha$ and $\phi$ by the estimators $\hat\alpha(y, \beta, \phi)$ and $\hat\phi(y, \beta)$, respectively. So the estimating equations can be written as

$$U(\beta, \hat\alpha(\beta, \hat\phi(\beta))) = 0. \qquad (2)$$

According to Liang and Zeger (1986), given the estimators of $\alpha$ and $\phi$, the estimator $\hat\beta$ of the regression parameters, which is the solution of (2), is consistent and asymptotically multivariate normal with mean $\beta$ and covariance matrix

$$V_R = \Big(\sum_{i=1}^{n} D_i' V_i^{-1} D_i\Big)^{-1} \Big\{\sum_{i=1}^{n} D_i' V_i^{-1} \mathrm{cov}(y_i) V_i^{-1} D_i\Big\} \Big(\sum_{i=1}^{n} D_i' V_i^{-1} D_i\Big)^{-1}. \qquad (3)$$

One attractive property of the GEE approach is that it provides a consistent estimator of $\beta$ even if the correlation matrix $R$ is misspecified. Liang and Zeger (1986) did not mention any connection between the GEE approach and the likelihood based approach. For multivariate binary responses, however, it can be shown that the estimating equations (2) are score functions derived from a multivariate logistic distribution (see Molenberghs and Lesaffre, 1994; Joe and Latif, 2005) with constant third and fourth order moments. It can also be shown that the estimating equations (2) are equivalent to the score functions obtained from the quasi-likelihood function (McCullagh and Nelder, 1989) with the independence correlation structure (Pan, 2001a). However, for a more general correlation structure there is no guarantee that a corresponding quasi-likelihood function exists unless certain conditions are satisfied.
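To make the estimating equations more concrete, the sketch below evaluates the estimating function in one special case only: a logit link, working independence, and dispersion parameter equal to one. Under these assumptions the contribution of each woman simplifies to $X_i'(y_i - \mu_i)$, where $X_i$ stacks her covariate vectors. The function names, the data layout, and the toy data are illustrative assumptions and are not taken from the survey.

```python
import math


def expit(eta):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def gee_score_independence(beta, clusters):
    """Estimating function U(beta) for a logit link under working
    independence with phi = 1, i.e. the sum over women of X_i'(y_i - mu_i).

    clusters: list of (X_i, y_i) pairs; X_i is a list of covariate rows
    (first entry 1 for the intercept) and y_i a list of 0/1 responses.
    """
    p = len(beta)
    score = [0.0] * p
    for X_i, y_i in clusters:
        for x_ij, y_ij in zip(X_i, y_i):
            mu_ij = expit(sum(b * x for b, x in zip(beta, x_ij)))
            for k in range(p):
                score[k] += x_ij[k] * (y_ij - mu_ij)
    return score


# Toy data (made up): two women, two time points each, intercept only.
clusters = [
    ([[1.0], [1.0]], [1, 0]),
    ([[1.0], [1.0]], [0, 0]),
]

# For an intercept-only model the solution is the logit of the overall
# proportion of ones (here 1/4), so the estimating function vanishes there.
beta_hat = [math.log(0.25 / 0.75)]
print(gee_score_independence(beta_hat, clusters))
```

For other working correlation structures the inverse of $V_i$ no longer cancels in this way, and the equations are solved iteratively together with updates of the correlation and dispersion estimates.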

Akaike's Information Criterion
Akaike's information criterion (Akaike, 1973) was introduced as an approximately unbiased estimator of the expected Kullback-Leibler information of the fitted model. Let $D = \{(y_i, x_{ij})\}$ be the data at hand, where $y_i$ is the response vector and $x_{ij}$ is a set of covariates as defined in the previous section. Also let $M$ and $M^*$ be a candidate and the true model, respectively. Further let $L(\beta; D)$ and $L(\beta^*; D)$ be the log-likelihood functions corresponding to the models $M$ and $M^*$, respectively, where $\beta$ and $\beta^*$ are the corresponding regression parameters. The Kullback-Leibler information, also known as cross entropy, between the models $M$ and $M^*$ is

$$\Delta(\beta, \beta^*) = E_{M^*}\{-2 L(\beta; D)\},$$

where the expectation $E_{M^*}$ is taken under the true model $M^*$. For a given set of competing models, we choose as the best model the one for which the Kullback-Leibler information $\Delta(\beta, \beta^*)$ is the smallest. In practice, both $\beta$ and $\beta^*$ are unknown, so an asymptotically unbiased estimator of $E_{M^*}(\Delta(\hat\beta, \beta^*))$, which is the AIC, is used as the model selection criterion, where $\hat\beta$ is the maximum likelihood estimator of $\beta$ under a competing model. Notationally, the AIC can be written as

$$\mathrm{AIC} = -2 L(\hat\beta; D) + 2p,$$

where $p$ is the order of the vector $\beta$. A model which minimizes the AIC is considered to be the "best" model. This definition implies that when there are several models whose maximized likelihoods are at about the same level, we should choose the one with the smallest number of free parameters. A more detailed discussion of the AIC can be found in Linhart and Zucchini (1986), and a review of model selection can be found in Zucchini (2000).
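The parsimony property can be illustrated numerically. The sketch below computes the AIC as minus twice the maximized log-likelihood plus twice the number of parameters; the two log-likelihood values are made-up numbers chosen for illustration, not results from this study.

```python
import math


def bernoulli_loglik(y, mu):
    """Log-likelihood of independent binary responses y with fitted means mu."""
    return sum(
        yi * math.log(mi) + (1 - yi) * math.log(1.0 - mi)
        for yi, mi in zip(y, mu)
    )


def aic(loglik, p):
    """AIC = -2 * maximized log-likelihood + 2 * number of parameters."""
    return -2.0 * loglik + 2.0 * p


# Two hypothetical models with almost the same maximized log-likelihood:
# the smaller model (p = 3) obtains the smaller AIC and is preferred.
aic_small = aic(-100.0, 3)
aic_large = aic(-99.8, 5)
print(aic_small, aic_large, aic_small < aic_large)
```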

Modified Akaike's Information Criterion
The AIC is one of the most widely used model selection criteria when the likelihood function can be fully specified. When the likelihood function cannot be fully specified, e.g., in the GEE setup, the AIC cannot be used for model selection purposes. In such a situation, the modified Akaike's Information Criterion (mAIC), which is based on the quasi-likelihood function (McCullagh and Nelder, 1989, p. 325), can be used instead. Under the working independence correlation structure, the new discrepancy measure based on the quasi-likelihood function $Q$ can be defined as

$$\Delta_m(\beta, \beta^*, I) = E_{M^*}\{-2 Q(\beta; I, D)\},$$

where $I$ stands for the independence working correlation structure. Then the modified Akaike's Information Criterion can be defined for a general working correlation structure $R$ as

$$\mathrm{mAIC}(R) = -2 Q(\hat\beta(R); I, D) + 2\, \mathrm{trace}(\hat\Omega_I \hat V_R), \qquad (4)$$

where $\hat\beta(R)$ is a solution of the estimating equations defined in (1) under the working correlation structure $R$, $\hat V_R$ is the estimated robust variance-covariance matrix under the general working correlation structure $R$, which is defined in (3), and $\hat\Omega_I = -\partial^2 Q(\beta; I, D) / \partial\beta\, \partial\beta' \,|_{\beta = \hat\beta(R)}$ is a consistent estimator of $\Omega_I = -E_{M^*}\{\partial^2 Q(\beta; I, D) / \partial\beta\, \partial\beta'\} \,|_{\beta = \beta^*}$. The right hand side of equation (4) is approximately equal to $E_{M^*}(\Delta_m(\hat\beta, \beta^*, I))$; the approximation ignores a term that is difficult to estimate (Pan, 2001a). However, this term converges to 0 if the model is correctly specified.
For analyzing regression models with dependent responses using a GEE approach, a minimum mAIC strategy can be used to find the best model from a set of competing models. The mAIC can be helpful not only for selecting the best set of covariates but also for selecting the best working correlation structure. Among all competing models, the best model is the one that has the smallest mAIC value. The size of the difference between two mAIC values, however, may not be meaningful: one of the limitations of the mAIC as a model selection criterion is that no probability distribution is associated with it, so the difference between two mAIC values cannot be assessed using any statistical hypothesis testing procedure. Another limitation of the mAIC is its weak consistency, in the sense that its consistency is assured only if the model is correctly specified. For details about the mAIC see Pan (2001a).
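The pieces of the mAIC can be assembled in a few lines for the binary case. The sketch below assumes a logit link and dispersion parameter equal to one, so that the independence quasi-likelihood is the Bernoulli log-likelihood and the negative second derivative matrix is the sum of $\mu_{ij}(1 - \mu_{ij})\, x_{ij} x_{ij}'$. The robust covariance matrix is taken as an input, since it depends on the fitted working correlation structure. All function names and the toy numbers are illustrative assumptions.

```python
import math


def quasi_loglik_binary(y, mu):
    """Independence quasi-likelihood Q(beta; I, D) for binary data, phi = 1."""
    return sum(
        yi * math.log(mi) + (1 - yi) * math.log(1.0 - mi)
        for yi, mi in zip(y, mu)
    )


def omega_independence(X, mu):
    """Minus the second derivative of Q under independence with a logit
    link: sum over all observations of mu * (1 - mu) * x x'."""
    p = len(X[0])
    omega = [[0.0] * p for _ in range(p)]
    for x, m in zip(X, mu):
        w = m * (1.0 - m)
        for j in range(p):
            for k in range(p):
                omega[j][k] += w * x[j] * x[k]
    return omega


def trace_product(A, B):
    """trace(A B) for two square matrices of the same order."""
    n = len(A)
    return sum(A[j][k] * B[k][j] for j in range(n) for k in range(n))


def maic(y, mu, X, V_R):
    """mAIC(R) = -2 Q(beta_hat(R); I, D) + 2 trace(Omega_I V_R).

    y, mu, X are stacked over all women and time points, with mu evaluated
    at beta_hat(R); V_R is the robust covariance matrix of beta_hat(R).
    """
    return -2.0 * quasi_loglik_binary(y, mu) + 2.0 * trace_product(
        omega_independence(X, mu), V_R
    )


# Toy intercept-only check (made-up data): when V_R equals the inverse of
# Omega_I, the trace equals p = 1 and the penalty reduces to the usual 2p.
y = [1, 0, 0, 0]
mu = [0.25] * 4
X = [[1.0]] * 4
V_R = [[1.0 / omega_independence(X, mu)[0][0]]]
print(maic(y, mu, X, V_R))
```

The toy check reflects a general feature of the criterion: the trace term replaces the parameter count of the AIC, and the two agree when the robust covariance matrix coincides with the model-based one under independence.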

Data and Variables
Austrian Journal of Statistics, Vol. 37 (2008), No. 2, 175-184
This paper is based on the data from the survey on maternal morbidity in Bangladesh conducted by the Bangladesh Institute of Research for Promotion of Essential and Reproductive Health Technologies (BIRPERHT) during the period of November 1992 to December 1993. A number of papers have been published using this data set, e.g., Islam et al. (2004), Gulshan et al. (2005), and Chakraborty et al. (2003).
A multistage sampling design was used in the survey. In the first stage, the districts were randomly selected in such a way that exactly one district was chosen from each division. In the second stage, one thana was randomly selected from each of the chosen districts, and in the third stage, two unions were randomly selected from each of the selected thanas. All women in the selected unions who were at most six months pregnant comprised the sample. All the selected women were followed until 90 days after delivery. A total of 1020 pregnant women were interviewed in the survey. Information on socio-economic and demographic characteristics, pregnancy related care and practice, morbidity during the period of follow-up as well as in the past, complications at the time of delivery and during the postpartum period, etc., was collected for all the selected pregnant women.
One of the objectives of the BIRPERHT survey was to identify important factors associated with pregnancy related complications. The major life-threatening antenatal complications are hemorrhage, oedema, excessive vomiting, and fits or convulsion. In this study the response variable is binary, taking the value 1 if at least one of these complications was present. Notationally, it can be defined as

$$y = \begin{cases} 1, & \text{if the woman suffers at least one of the major complications,} \\ 0, & \text{otherwise.} \end{cases}$$

Among the available covariates, only five important covariates are considered in this study: educational level of the respondent (EDU), age at marriage (AgeM), economic status (ECON), whether the index pregnancy was desired (DIP), and food supplement (FS). All these covariates are coded as binary, with the reference categories being never attended school for educational level, 15 years or less for age at marriage, less than average for economic status, and no for both desired index pregnancy and food supplement.
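The coding of the response can be sketched as follows. The field names and the example visits are hypothetical, since the variable names used in the survey data are not given here.

```python
# Hypothetical field names for the four major antenatal complications.
MAJOR_COMPLICATIONS = (
    "hemorrhage",
    "oedema",
    "excessive_vomiting",
    "fits_or_convulsion",
)


def complication_indicator(record):
    """Return 1 if at least one major complication is present, else 0."""
    return 1 if any(record.get(name, 0) for name in MAJOR_COMPLICATIONS) else 0


# Made-up follow-up visits of one woman, giving her response vector y_i.
visits = [
    {"hemorrhage": 0, "oedema": 0, "excessive_vomiting": 0, "fits_or_convulsion": 0},
    {"hemorrhage": 0, "oedema": 1, "excessive_vomiting": 0, "fits_or_convulsion": 0},
]
y_i = [complication_indicator(v) for v in visits]
print(y_i)  # [0, 1]
```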

Selection of Best Models
One of the main objectives of this paper is to show an application of the mAIC for selecting the best model within the GEE setup. All possible models that can be formed from the selected five covariates are examined, and the best models with different numbers of covariates are shown in Table 1 for different correlation structures.
Among the five models with one covariate, the model with FS as the only covariate (Model I) is found to be the best one because the corresponding mAIC value is the smallest. Model I is found to be the best model for all three correlation structures that have been considered in this analysis. Among the 10 models with two covariates, Model II, which includes AgeM and FS as covariates, is the best choice. For three covariates, the model with the covariates AgeM, ECON, and FS is found to be the best one; we denote this model as Model III. The best model with four covariates (Model IV) includes the covariate EDU in addition to the covariates of Model III. The only model with five covariates, denoted as Model V, contains all the covariates that are considered in this study. Among these five models (Model I to Model V), Model III can be considered the best model because the corresponding mAIC value is the smallest, and this is true for all three correlation structures. In every case, the same model is selected under all three correlation structures.
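The all-subsets search described above can be sketched with the standard library. The sketch enumerates the candidate models by size and leaves the criterion as a user-supplied function; in practice that function would fit the GEE model for a subset and return its mAIC, which is not implemented here. The function names are illustrative assumptions.

```python
from itertools import combinations

COVARIATES = ["EDU", "AgeM", "ECON", "DIP", "FS"]


def candidate_models(covariates):
    """Group all non-empty subsets of the covariates by their size."""
    return {
        k: list(combinations(covariates, k))
        for k in range(1, len(covariates) + 1)
    }


def best_per_size(models, criterion):
    """For each model size, pick the subset with the smallest criterion value.

    criterion: function mapping a tuple of covariate names to a number,
    e.g. the mAIC of the fitted GEE model (to be supplied by the user).
    """
    return {k: min(subsets, key=criterion) for k, subsets in models.items()}


models = candidate_models(COVARIATES)
counts = {k: len(subsets) for k, subsets in models.items()}
print(counts)                # {1: 5, 2: 10, 3: 10, 4: 5, 5: 1}
print(sum(counts.values()))  # 31
```

With five covariates an exhaustive search over 31 models is cheap; for larger covariate sets the number of subsets grows exponentially and a stepwise or stochastic search may be needed instead.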

Analysis of Morbidity Data using Different Correlation Structures
Table 2 shows the estimates of the parameters of the best model (Model III) under different correlation structures, namely, independence, exchangeable, and unstructured. It is found that only the covariate FS is significant irrespective of the choice of the correlation structure.
The analysis shows that taking special food during the pregnancy period reduces the number of complications. More specifically, women who do not take food supplements during pregnancy experience about twice as many pregnancy related complications as women taking food supplements. Age at marriage is found to be significant (at the 10 percent level) only if the unstructured correlation structure is assumed for the model. The analysis shows that women who got married before their fifteenth birthday experience more pregnancy related complications than women who married later. The other variable of the best model (Model III), economic status, is found to have a non-significant effect on pregnancy related complications.

Conclusion
Health related problems during pregnancy and the post-partum period are very common among Bangladeshi women. Not many studies have been conducted to identify the important covariates associated with such pregnancy related problems. Recently, BIRPERHT conducted a survey of pregnant women in Bangladesh to identify factors associated with pregnancy related problems. In this study, the BIRPERHT data are used to show an application of the recently proposed modified Akaike's Information Criterion (mAIC) as a model selection criterion. The mAIC is very useful in situations where the response is multivariate non-normal and a fully specified likelihood function is not available.
Among the five covariates considered in this analysis, age at marriage, economic status, and taking food supplements are found to form the best subset among all possible subsets of covariates. The selection of the best model does not depend on the choice of the correlation structure.
The analysis of the best model shows that taking food supplements during the pregnancy period significantly reduces complications during pregnancy. This means that the probability of developing some major complication during pregnancy is smaller for women who took special food during pregnancy than for those who did not. In a study conducted in the late nineties in Gambia, Ceesay et al. (1997) also found that food supplements have a significant effect on increasing weight gain during pregnancy and on increasing birth weight.
The variable age at marriage is also found to have a significant effect on pregnancy related complications, but only if the unstructured correlation structure is considered for the model. Age at marriage is an important covariate for pregnancy related studies in a developing country like Bangladesh, where more than 50% of women are married at the age of 18. Akhter et al. (1996) also found a significant effect of age at marriage on pregnancy related complications. Recent studies show that female education plays a vital role in reducing maternal mortality; more specifically, a low incidence of maternal morbidities was found among educated females (Choolani and Ratnam, 1995). Chowdhury et al. (2007) examined the trends in maternal mortality in Matlab, Bangladesh, over 30 years and found that female education and poverty reduction are two important factors in reducing maternal mortality. In our analysis, the variable female education has not been selected in the best model.

Table 1: Best models with different numbers of covariates for different correlation structures

Table 2: Estimates of the parameters of Model III (with p-values in parentheses)