Order Selection in a Finite Mixture of Birnbaum-Saunders Distributions

One of the most significant and difficult problems in a mixture study is the selection of the number of components. In this paper, using a Monte Carlo study, we evaluate and compare the performance of several information criteria for selecting the number of components arising from a mixture of Birnbaum-Saunders distributions. In our comparison, we consider information criteria based on likelihood-based statistics and classification likelihood-based statistics. The performance of information criteria is determined based on the success rate in selecting the number of components. In the simulation study, we investigate the effect of degrees of separation, sample sizes, mixing proportions, and true model complexity on the performance of information criteria. Furthermore, we compare the performance of the proposed information criteria under unpenalized and penalized estimation. Finally, we discuss the performance of the proposed information criteria for a real data set.


Introduction
Finite mixture models are a popular statistical modeling technique used to model situations in which data arise from a population consisting of several homogeneous sub-populations. Many authors have applied mixture modeling to practical problems in fields such as agriculture, botany, economics, industrial engineering, marketing, medicine, reliability, survival analysis, and zoology. Titterington, Smith, and Makov (1985), McLachlan and Basford (1988), Lindsay (1995), McLachlan and Peel (2000), Böhning, Seidel, Alfò, Garel, Patilea, and Walther (2007), Schlattmann (2009), and references therein provide detailed descriptions and applications of mixture modeling.
One of the most important tasks in mixture modeling is selecting the appropriate number of components to describe the data, since the number of components g refers to the number of separate groups in the data. Proper selection of the number of components is critical, because the number selected can have a significant effect on the practical interpretation of the modeling results. In the literature, different approaches have been suggested for selecting the number of components in mixture models, in both frequentist and Bayesian settings. These approaches can be classified as information-based approaches, penalized distance-based approaches, penalized log-likelihood-based approaches, Bayesian approaches, and hypothesis-testing approaches (see McLachlan and Peel (2000) and Niu (2014) for a detailed review).
In this study, we focus on information-based approaches to select the number of components in a Birnbaum-Saunders (BS) mixture model. The information criteria can be categorized into statistics based on the log-likelihood function and statistics based on the classification function. The methodology of using information criteria for order selection in mixture models consists of estimating models with different numbers of components using the expectation-maximization (EM) algorithm (Dempster, Laird, and Rubin 1977) and comparing those estimated models via information criteria, where the model with the lowest criterion value is chosen as the best model. Balakrishnan, Gupta, Kundu, Leiva, and Sanhueza (2011) introduced a two-component mixture model based on BS distributions for modeling heterogeneous populations whose subpopulations have skewed distributions, since the BS distribution (Birnbaum and Saunders 1969) is positively skewed, and studied the characteristics and estimation procedure of the proposed model. Benites, Maehara, Vilca, and Marmolejo-Ramos (2017) extended the work of Balakrishnan et al. (2011) by considering a g-component mixture of BS distributions for modeling multimodal populations, to provide flexibility in different data analysis situations. Benites et al. (2017) proved identifiability for a two-component mixture of BS distributions and suggested the k-bumps algorithm for initializing the EM algorithm used to estimate the parameters of the BS mixture model. They also implemented bootstrap procedures on real data to test hypotheses about the number of components g in a mixture of BS distributions. El-Sharkawy and Ismail (2020) proved identifiability for a g-component mixture of BS distributions and discussed parameter estimation and testing homogeneity in the mixture of BS distributions using the EM algorithm and the EM test, respectively, based on randomly censored data.
El-Sharkawy and Ismail (2021) proposed the modified likelihood ratio test and the shortcut method of the bootstrap test for testing the number of components g in a mixture of BS distributions under a random right censoring scheme. Nevertheless, to our knowledge there are no studies that use information criteria to select the number of components g in a mixture of BS distributions.
The purpose of the paper is to compare the performance of all the information criteria introduced in the next section for selecting the number of components g in a mixture of BS distributions. We investigate the performance of these criteria through a Monte Carlo simulation study under different conditions for degrees of separation, sample sizes, mixing proportions, and true model complexity. The objective is to evaluate the effect of these factors on the proper selection of the number of components and construct a comparison between criteria based on log-likelihood and classification functions. Also, we evaluate the performance of these criteria based on unpenalized and penalized estimates of the model's parameters.
The paper is organized as follows. Section 2 reviews the information criteria considered in our study for selecting the number of components in a finite mixture model. Section 3 describes the finite mixture of BS distributions and the model-fitting procedure using the EM algorithm. Section 4 presents the simulation study design and results. Section 5 provides an illustrative real-data example. Finally, Section 6 presents some concluding remarks.

Information criteria
In this section, we introduce several information criteria for order selection in finite mixture models. The information criteria can be classified into likelihood-based statistics and classification likelihood-based statistics, the latter also referred to as complete data likelihood-based statistics. These criteria have been developed to balance the accuracy and the complexity of the model. Their general form can be expressed as

IC = −2ℓ(θ̂) + P, (1)

where ℓ(θ̂) is the maximized log-likelihood of the fitted model and P is a penalty term that may depend on the number of free model parameters k, the sample size n, or both. The first term in (1) is interpreted as a measure of the accuracy of the model, while the second term is interpreted as a measure of its complexity. The optimal number of components is the one corresponding to the minimum value of the criterion in (1). In addition to these information criteria, classification likelihood-based statistics, which depend on the complete data likelihood function, have been developed for finite mixture models. These statistics measure the quality of the classification given by a model.

Likelihood-based information criteria
The most widely known and used model selection criterion is the Akaike information criterion (AIC), proposed by Akaike (1973, 1974). It is derived as an estimate of the expected relative Kullback-Leibler (K-L) divergence (Kullback and Leibler 1951) between the true model and the fitted model, based on the maximized empirical log-likelihood. The AIC is defined as

AIC = −2ℓ(θ̂) + 2k,

with penalty term P = 2k.
Another popular criterion used for model selection is the Bayesian information criterion (BIC) proposed by Schwarz (1978). This criterion has a Bayesian derivation, selecting from a set of candidate models the one with the largest posterior probability, but it can also be used for model selection in a non-Bayesian context. It is defined as

BIC = −2ℓ(θ̂) + k log n,

with penalty term P = k log n, a stronger complexity penalization than the 2k in the AIC. This criterion has the consistency property: as n grows large, it selects the true model with probability one (for more details see Andrews and Currim (2003), Andrew and Joseph (2012), and Vrieze (2012)).
Various extensions of the AIC have been proposed to overcome its shortcomings in model selection applications. A small-sample correction of the AIC was proposed by Hurvich and Tsai (1989) to counteract the tendency of the AIC to overfit in small samples, leading to the corrected AIC (AICc), defined as

AICc = −2ℓ(θ̂) + 2kn/(n − k − 1),

with penalty term P = 2kn/(n − k − 1) (also see Sugiura (1978)). Burnham and Anderson (2002) recommended using the AICc if the ratio n/k is small (say < 40); when n/k is sufficiently large, the AIC and AICc are nearly identical and have a strong tendency to choose the same model. Bozdogan (1987) extended the AIC to make it consistent; the resulting consistent Akaike information criterion (CAIC) is defined as

CAIC = −2ℓ(θ̂) + k(log n + 1),

with penalty term P = k(log n + 1).
In the context of mixture-model cluster analysis, Bozdogan (1994) modified the AIC for estimating the number of mixture clusters, arguing that the marginal cost per parameter in the AIC definition, the so-called magic number 2, is not adequate for mixture models; he increased it to 3, yielding the AIC3. The AIC3 is defined as

AIC3 = −2ℓ(θ̂) + 3k,

with penalty term P = 3k.
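As a concrete illustration, the likelihood-based criteria above reduce to a few lines of code. The sketch below (in Python, although the paper's computations were done in R) evaluates the AIC, BIC, AICc, CAIC, and AIC3 from the maximized log-likelihood, the number of free parameters k, and the sample size n; the fitted model minimizing a criterion is the one selected.

```python
import math

def likelihood_criteria(loglik, k, n):
    """Likelihood-based information criteria for a fitted mixture model.

    loglik : maximized log-likelihood of the fitted model
    k      : number of free parameters (3g - 1 for a g-component BS mixture)
    n      : sample size
    """
    return {
        "AIC":  -2 * loglik + 2 * k,                    # P = 2k
        "BIC":  -2 * loglik + k * math.log(n),          # P = k log n
        "AICc": -2 * loglik + 2 * k * n / (n - k - 1),  # P = 2kn/(n-k-1)
        "CAIC": -2 * loglik + k * (math.log(n) + 1),    # P = k(log n + 1)
        "AIC3": -2 * loglik + 3 * k,                    # P = 3k
    }
```

In practice one computes this dictionary for each candidate number of components g* and, for each criterion, picks the g* with the smallest value.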
Although the derivation of these criteria depends on regularity conditions that do not hold in the framework for mixture models, they are still used to select the order of a finite mixture model (see e.g. Bozdogan, Sclove, and Gupta (1994), Konishi and Kitagawa (2007), McLachlan and Peel (2000), McLachlan and Rathnayake (2014), M and van der Laan M J (2003), Fernández and Arnold (2016) for a theoretical and practical support).

Classification likelihood-based information criteria
The classification likelihood-based information criteria, which are based on the complete data log-likelihood, were developed for finite mixture models. Such criteria take classification into account and aim to pick models that classify the observations in a consistent fashion. A simple relation between the log-likelihood and the complete data log-likelihood, first noted by Hathaway (1986), can be written as

ℓ(θ̂) = ℓ_c(θ̂) + EN(g), (7)

where EN(g) = −Σ_{j=1}^{g} Σ_{i=1}^{n} τ̂_ij log τ̂_ij ≥ 0 is an entropy function whose elements τ̂_ij are the estimated posterior probabilities that subject i arises from the jth mixture component. This entropy can be interpreted as a penalization term that measures the overlap of the mixture components, and can be viewed as a measure of the model's ability to provide a relevant data partition. The entropy will be close to zero if the mixture components are well separated, but will take large values if they are poorly separated (see Celeux and Soromenho (1996) and McLachlan and Peel (2000)). Banfield and Raftery (1993) derived an approximate Bayesian criterion based on the complete data log-likelihood, called the approximate weight of evidence (AWE) criterion, having the form

AWE = −2ℓ_c(θ̂) + 2k(3/2 + log n).
When the mixture components are well separated, EN(g) ≈ 0 and hence ℓ_c(θ̂) ≈ ℓ(θ̂), so the AWE approximates the BIC. The penalty term of the AWE criterion is heavier, which makes it select more parsimonious models than the BIC (Fernández and Arnold 2016). The weakness of this criterion is that when the mixture components are not well separated, parameter estimation is biased (see McLachlan and Peel (2000) for more details). Biernacki and Govaert (1997) suggested using the relationship between the log-likelihood and the complete data log-likelihood for order selection. This criterion, known as the classification likelihood information criterion (CLC), is derived directly from equation (7) as

CLC = −2ℓ(θ̂) + 2EN(g).

The CLC works well under equal mixing proportions, but it appears to overestimate the appropriate number of components when no constraints are imposed on the mixing proportions. The explanation for this behavior is that the CLC does not penalize the number of parameters in the mixture model (see Biernacki, Celeux, and Govaert (1998)).
Based on the estimated entropy function, Celeux and Soromenho (1996) introduced the normalized entropy criterion (NEC), defined as

NEC(g) = EN(g) / (ℓ(g) − ℓ(1)),

where ℓ(1) is the maximized log-likelihood of a one-component mixture. The NEC cannot decide between one component and more than one component, because it is undefined at g = 1. An improved version of the NEC that deals with this problem was proposed by Biernacki, Celeux, and Govaert (1999) for the Gaussian mixture.
To overcome the drawbacks of the CLC and BIC, Biernacki et al. (1998) proposed the integrated classification likelihood (ICL) criterion and an approximated version of it using the BIC. McLachlan and Peel (2000) used ICL-BIC to refer to this approximate form of the ICL criterion, which is given by

ICL-BIC = −2ℓ(θ̂) + 2EN(g) + k log n.

See, for example, McLachlan and Peel (2000), Oliveira-Brochado and Martins (2005), and Depraetere and Vandebroek (2014) for a detailed review of order selection criteria in the framework of mixture models.
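The classification-based criteria can be computed directly from the estimated posterior probability matrix (τ̂_ij). The following Python sketch evaluates EN(g), CLC, AWE, ICL-BIC, and NEC; the NEC is skipped when the denominator ℓ(g) − ℓ(1) vanishes, mirroring the fact that it is undefined at g = 1.

```python
import math

def entropy(tau):
    """EN(g) = -sum_ij tau_ij log tau_ij, where tau is an n x g matrix of
    estimated posterior probabilities (rows sum to 1). Terms with tau_ij = 0
    contribute 0 by the usual convention 0 log 0 = 0."""
    return -sum(t * math.log(t) for row in tau for t in row if t > 0)

def classification_criteria(loglik, loglik_one, tau, k, n):
    """Classification likelihood-based criteria for a fitted g-component model.

    loglik     : maximized log-likelihood of the g-component model
    loglik_one : maximized log-likelihood of the one-component model
    tau        : n x g matrix of posterior probabilities
    k, n       : number of free parameters and sample size
    """
    en = entropy(tau)
    crit = {
        "CLC":     -2 * loglik + 2 * en,
        "AWE":     -2 * (loglik - en) + 2 * k * (1.5 + math.log(n)),
        "ICL-BIC": -2 * loglik + 2 * en + k * math.log(n),
    }
    if loglik != loglik_one:  # NEC undefined when g = 1
        crit["NEC"] = en / (loglik - loglik_one)
    return crit
```

The AWE line uses the relation ℓ_c = ℓ − EN(g) from equation (7), so all four criteria are obtained from the observed-data log-likelihood and the entropy alone.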

Model formulation
A finite mixture of BS distributions is a weighted sum of g component BS densities, given by

f(t; θ) = Σ_{j=1}^{g} π_j f(t; α_j, β_j), t > 0, (12)

where θ = (θ_1, ..., θ_g), θ_j = (π_j, α_j, β_j), with shape parameters α_j > 0, scale parameters β_j > 0, and mixing proportions π_j ≥ 0 satisfying Σ_{j=1}^{g} π_j = 1, and f(t; α_j, β_j) is the probability density function (pdf) of the jth BS component,

f(t; α_j, β_j) = [1 / (2 α_j β_j √(2π))] [ (β_j/t)^{1/2} + (β_j/t)^{3/2} ] exp( −(t/β_j + β_j/t − 2) / (2 α_j²) ), t > 0. (13)

The BS distribution is a two-parameter, unimodal, and positively skewed distribution used to model fatigue failure of materials subject to cyclic stress. It is also known as the fatigue life distribution: it originated in the fatigue of materials, but it has since been applied to problems in areas such as biology, business, economics, reliability, and the environmental and medical sciences, among many others. In several applications, the BS distribution has been considered a rival of the log-normal, Weibull, gamma, and inverse Gaussian distributions. For a comprehensive review and additional details on the BS distribution, see Balakrishnan (2019) and Leiva (2016).
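For reference, the component density and the mixture density in (12) can be evaluated as follows. This is a Python sketch (the paper's computations were done in R), writing the BS pdf as φ(a(t)) a'(t) with a(t) = (√(t/β) − √(β/t))/α, which is algebraically the same as the displayed form.

```python
import math

def bs_pdf(t, alpha, beta):
    """Density of the Birnbaum-Saunders(alpha, beta) distribution, t > 0."""
    if t <= 0:
        return 0.0
    a = (math.sqrt(t / beta) - math.sqrt(beta / t)) / alpha           # standardizing transform
    da = (math.sqrt(beta / t) + (beta / t) ** 1.5) / (2 * alpha * beta)  # its derivative
    return math.exp(-0.5 * a * a) / math.sqrt(2 * math.pi) * da

def bs_mixture_pdf(t, pi, alpha, beta):
    """g-component BS mixture density: sum_j pi_j * f(t; alpha_j, beta_j)."""
    return sum(p * bs_pdf(t, a, b) for p, a, b in zip(pi, alpha, beta))
```

At t = β the standardizing transform vanishes, so the density equals 1/(αβ√(2π)), a handy sanity check on any implementation.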

Parameter estimation via the EM algorithm
Let T = (T_1, T_2, ..., T_n) denote the observed data of size n from the mixture of BS distributions given in (12). The log-likelihood for the observed data is given by

ℓ(θ|t) = Σ_{i=1}^{n} log ( Σ_{j=1}^{g} π_j f(t_i; α_j, β_j) ). (14)

To use the EM algorithm for calculating the MLE of the parameters involved in (14), we formulate the mixture of BS distributions in the EM framework by introducing allocation variables Z_i = (Z_i1, ..., Z_ig), where Z_ij = 1 if observation T_i arises from the jth component and Z_ij = 0 otherwise. The log-likelihood for the complete data (T, Z) is then given by

ℓ_c(θ|t, z) = Σ_{i=1}^{n} Σ_{j=1}^{g} z_ij [ log π_j + log f(t_i; α_j, β_j) ]. (15)

By maximizing ℓ_c(θ|t, z), the resulting estimates may not enjoy the optimal properties of the likelihood approach (the consistency property): the maximum likelihood estimates of the mixing proportions may lie on or near the boundary of the parameter space, and as a result some of the ML estimates of the other parameters become inconsistent; see Chen, Chen, and Kalbfleisch (2004) and Wong and Li (2014). Chen, Chen, and Kalbfleisch (2001) suggested adding a penalty term, dependent on the mixing proportions, to the log-likelihood function to push these estimates away from the boundary. The penalized log-likelihood for the complete data (T, Z) is therefore given by

ℓ_p(θ|t, z) = ℓ_c(θ|t, z) + p(π),

where ℓ_c(θ|t, z) is the log-likelihood function defined in (15), p(π) = C Σ_{j=1}^{g} log(2π_j) is the penalty function on the mixing proportions, such that p(π) attains its maximum at π_j = 1/g and p(π) → −∞ when π_j → 0 or 1, j = 1, ..., g, and C is a positive constant.
The EM algorithm for ML estimation of the mixture of BS distributions works as follows:
1- Start with a random initial value θ^(0) of the parameter θ.
2- E-step: given the current estimate θ^(m), compute the posterior probabilities

τ_ij^(m) = π_j^(m) f(t_i; α_j^(m), β_j^(m)) / Σ_{l=1}^{g} π_l^(m) f(t_i; α_l^(m), β_l^(m)), i = 1, ..., n, j = 1, ..., g,

and form the expected penalized complete-data log-likelihood Q(θ|θ^(m)) = Σ_{i=1}^{n} Σ_{j=1}^{g} τ_ij^(m) [ log π_j + log f(t_i; α_j, β_j) ] + p(π) + A, where A is a constant independent of θ.
3- M-step: obtain θ^(m+1) by maximizing Q(θ|θ^(m)); in particular, the penalized update of the mixing proportions is π_j^(m+1) = (Σ_{i=1}^{n} τ_ij^(m) + C) / (n + gC).
4- Repeat the E- and M-steps until the log-likelihood converges.
To avoid getting stuck at local maxima while running the EM algorithm in our numerical illustrations, we use a random initialization technique to improve the chances of reaching the global maximum. This technique can be summarized as follows: 1- Generate random initial positions. This can be done by drawing random initial values for π_j, denoted by π_j^(0), j = 1, ..., g, from a Dirichlet distribution with all parameters equal to 1. Then, based on the π_j^(0) values, the ordered sample is separated into g sets, and for each set the modified moment estimates of α and β presented by Ng, Kundu, and Balakrishnan (2003) are calculated as the initial values of the parameters α_j and β_j, for j = 1, ..., g.
2- Run the EM algorithm for a few iterations with C = 0 and with C = 1.
3- Repeat Steps 1 and 2 r times, and select the solution with the highest likelihood among the r trials.
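The steps above can be sketched as below. This is a simplified Python illustration, not the paper's R implementation: the E-step and the penalized mixing-proportion update π_j = (Σ_i τ_ij + C)/(n + gC) follow the text, while the (α_j, β_j) updates use weighted versions of the modified moment estimates as a stand-in for the exact M-step (an assumption), and the initialization draws Dirichlet(1, ..., 1) weights to split the ordered sample.

```python
import math
import random

def bs_pdf(t, alpha, beta):
    # BS density (repeated here so the sketch is self-contained)
    if t <= 0:
        return 0.0
    a = (math.sqrt(t / beta) - math.sqrt(beta / t)) / alpha
    da = (math.sqrt(beta / t) + (beta / t) ** 1.5) / (2 * alpha * beta)
    return math.exp(-0.5 * a * a) / math.sqrt(2 * math.pi) * da

def em_bs_mixture(t, g, C=0.0, n_iter=200, tol=1e-8):
    """EM sketch for a g-component BS mixture with penalty constant C."""
    n = len(t)
    # 1- random initialization: Dirichlet(1,...,1) weights, split the ordered sample
    w = [random.expovariate(1.0) for _ in range(g)]
    pi = [x / sum(w) for x in w]
    ts = sorted(t)
    cuts = [0] + [round(sum(pi[:j + 1]) * n) for j in range(g)]
    alpha, beta = [], []
    for j in range(g):
        seg = ts[cuts[j]:max(cuts[j + 1], cuts[j] + 2)]
        s = sum(seg) / len(seg)                       # arithmetic mean
        r = len(seg) / sum(1.0 / x for x in seg)      # harmonic mean
        beta.append(math.sqrt(s * r))                 # moment-style estimates
        alpha.append(math.sqrt(max(2 * (math.sqrt(s / r) - 1), 1e-4)))
    loglik = -math.inf
    for _ in range(n_iter):
        # E-step: posterior probabilities tau_ij and current log-likelihood
        tau, new_loglik = [], 0.0
        for ti in t:
            dens = [pi[j] * bs_pdf(ti, alpha[j], beta[j]) for j in range(g)]
            tot = sum(dens) or 1e-300
            tau.append([d / tot for d in dens])
            new_loglik += math.log(tot)
        # M-step: penalized mixing proportions, then weighted moment updates
        for j in range(g):
            nj = sum(row[j] for row in tau)
            pi[j] = (nj + C) / (n + g * C)
            s = sum(row[j] * ti for row, ti in zip(tau, t)) / nj
            r = nj / sum(row[j] / ti for row, ti in zip(tau, t))
            beta[j] = math.sqrt(s * r)
            alpha[j] = math.sqrt(max(2 * (math.sqrt(s / r) - 1), 1e-6))
        if abs(new_loglik - loglik) < tol:
            break
        loglik = new_loglik
    return pi, alpha, beta, loglik
```

In practice one would wrap `em_bs_mixture` in a loop over r random restarts (Steps 2-3) and keep the solution with the highest likelihood.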

Simulation study
In this section, we explain the procedure and results of a Monte Carlo simulation study for evaluating and comparing the performance of the information criteria based on the log-likelihood function and the classification function for selecting the number of components in a mixture of BS distributions, using unpenalized and penalized estimation.

Simulation procedure and design
The simulation procedure involves three stages: (i) Generate data sets from the mixture of BS distributions. (ii) Use the EM algorithm to estimate the parameters of the mixture of BS distributions for different numbers of components, and then apply each information criterion to choose the best number of components. (iii) Evaluate and compare the performance of these information criteria by studying design factors that may affect the choice of the number of components in mixture models, namely the degree of separation between the components, the sample size n, the mixing proportion π, and the true number of components. The comparison between the different information criteria is based on the proportion of simulations in which the number of components is selected correctly. All computations have been implemented using the R software.
In particular, we randomly generate data sets of various sample sizes, namely n = 25, 50, 100, and 250, from the mixture of BS distributions when the true number of components is g = 2 and 3. The penalty constant C is set to 0 and 1. We consider different scenarios for the mixture models by varying the mixing proportions and the degrees of separation between the components. The Mahalanobis distance is used to measure the distance between two components of the mixture model and is defined as ∆ = |µ_1 − µ_2|/σ, where σ is the common standard deviation, and µ_1 and µ_2 are the means of the first and second components, respectively. Table 1 displays the full set of parameters for each scenario: models M1, M2, M3 and M8, M9, M10 are used to study the effect of the degree of separation ∆ when the true number of components is g = 2 and 3, respectively, while M4, M5 and M6, M7 are used to study the effect of the mixing proportions when g = 2, for ∆ = 2 and 4, respectively. The simulation procedure is repeated 5000 times.
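A separation ∆ of this kind can be computed directly from the component parameters, using the BS moments E[T] = β(1 + α²/2) and Var[T] = (αβ)²(1 + 5α²/4). In the Python sketch below, the "common standard deviation" σ is taken as the pooled (average-variance) standard deviation of the two components; this pooling choice is an assumption, since the text does not spell out how σ is formed when the component variances differ.

```python
import math

def bs_mean_sd(alpha, beta):
    """Mean and standard deviation of the BS(alpha, beta) distribution."""
    mean = beta * (1 + alpha ** 2 / 2)
    var = (alpha * beta) ** 2 * (1 + 1.25 * alpha ** 2)
    return mean, math.sqrt(var)

def separation(alpha1, beta1, alpha2, beta2):
    """Delta = |mu1 - mu2| / sigma, with sigma the pooled standard deviation
    (an assumed convention for the 'common standard deviation')."""
    m1, s1 = bs_mean_sd(alpha1, beta1)
    m2, s2 = bs_mean_sd(alpha2, beta2)
    sigma = math.sqrt((s1 ** 2 + s2 ** 2) / 2)
    return abs(m1 - m2) / sigma
```

Such a helper makes it easy to tune the component parameters of each scenario in Table 1 until the target values ∆ = 2, 3, or 4 are reached.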

Results
In each table, the quantities corresponding to the highest proportion of selected numbers of components are shown in boldface. Tables 2-4 display the results of each information criterion for the competing models g* = 1, 2, 3, and 4 when the true number of components is g = 2, for M1, M2, and M3, respectively. The overall findings in these tables show that the degree of separation and the sample size significantly affect the performance of the information criteria: the performance of all criteria improves as the degree of separation and the sample size increase, jointly or separately, holding all other factors fixed. It is also noted from these tables that all the information criteria perform well, in that they are able to select the correct model (a mixture of BS distributions with g* = 2) with varying percentages, regardless of the values of ∆ and n. Furthermore, we observe that the classification-based criteria perform better than the log-likelihood-based criteria when ∆ and n are small; however, the performances of the two groups of criteria converge as ∆ and n increase. A possible explanation for this result is that the percentages of selecting the mixture of BS distributions with g* = 1 are omitted for the classification-based criteria, since these criteria are not numerically defined for a mixture model with g = 1. The effect of unpenalized and penalized estimation on the performance of the information criteria is also visible in these tables: all the information criteria have higher percentages of correct fitting when C = 1 than when C = 0, and these percentages converge as ∆ and n increase.
Comparing the results of all criteria, it is observed that AWE and ICL-BIC perform the best in Tables 2-4, even when both ∆ and n are small, with percentages of correct fitting ≥ 85%, while the other criteria perform quite satisfactorily.
The effect of the mixing proportions on the performance of the information criteria can be observed for π = 0.5, 0.3, and 0.1 in Tables 2, 5, and 6 when ∆ = 2, and in Tables 4, 7, and 8 when ∆ = 4, respectively. The performance of the criteria decreases as the mixing proportions within a model become less similar. It is noted from Tables 2 and 4-8 that the classification-based criteria always select the correct model with varying percentages, where the percentages decrease as π decreases, increase as ∆ and n increase, and are higher for C = 1 than for C = 0, holding all other factors fixed in each case. On the other hand, the log-likelihood-based criteria do not perform well for lower mixing proportions π and small ∆. Looking across Tables 5-8 to compare the criteria, we note that AWE and ICL-BIC show the best performance even when both ∆ and n are small, with percentages of correct fitting > 70%, while the CLC and NEC perform considerably worse than AWE and ICL-BIC as the mixing proportion π decreases, especially when ∆ and n are small. The likelihood-based information criteria perform worst at small sample sizes, and their performance becomes more satisfactory as ∆ and n increase.
Tables 9-11 show the results of each information criterion for the competing models g* = 1, 2, 3, 4, and 5 when the true number of components is g = 3, for M8, M9, and M10, respectively. In Table 9, poor performance is observed for all information criteria, as the correct model (a mixture of BS distributions with g* = 3) is not selected even when n = 250. When both ∆ and n increase, all information criteria show improvement in selecting the correct model. It is observed from Table 10 that, for ∆ = 3 and n = 50, only CLC and NEC select the correct model, with percentages reaching 43%; however, all the information criteria select the correct model with percentages reaching 63% and 89% when n = 100 and 250, respectively. From Table 11, where ∆ = 4, only the classification criteria select the correct model when n = 25, with percentages reaching 46%, whereas when n = 50 all the criteria except CAIC and BIC select the correct model, with percentages reaching 59%, and the percentages increase to reach 100% when n = 250. Comparing the information criteria in these tables, we observe that the classification-based criteria perform better than the log-likelihood-based criteria for ∆ = 3 with n ≥ 100 and for ∆ = 4 with n < 100, while both groups of criteria perform similarly for ∆ = 4 and n ≥ 100.
On average, AWE and ICL-BIC perform much better across the different values of ∆ and n. We also observe from these tables that the percentages of correct fitting are lower than the corresponding results in Tables 2-4. This shows that a mixture model with more components is more difficult to detect when each pair of neighboring components is separated by the same distance, especially when ∆ is small. The same conclusion was given in Usami (2014).

Application
We consider a data set consisting of 245 observations of enzymatic activity in the blood, which is involved in the metabolism of carcinogenic substances. The data were first analyzed by Bechtel, Bonaiti-Pellie, Poisson, Magnette, and Bechtel (1993), who observed that these enzyme data are suited to a mixture of two skewed distributions. They were also recently studied by Balakrishnan et al. (2011) and Benites et al. (2017), who considered a two-component mixture of BS distributions for modeling these data. Here we fit mixtures of BS distributions with different numbers of components to the enzyme data and use the proposed information criteria to select the appropriate number of components. The resulting fits from the information criteria for the enzyme data, based on unpenalized and penalized estimates, are given in Table 12. According to the results, all the proposed information criteria successfully estimate the correct number of components for the enzyme data set, under both the unpenalized and penalized estimates of the model parameters.

Conclusion
In this paper, a Monte Carlo simulation study is conducted to evaluate and compare the performance of the log-likelihood-based and classification-based information criteria for selecting the number of components in a mixture of BS distributions, considering two different types of estimation, unpenalized and penalized. Based on the simulation analysis, all the information criteria show both good and poor performance in detecting the correct number of components; therefore, in general, we can conclude that no particular information criterion is always the best. According to the comparison results, the classification-based information criteria have the best overall performance in selecting the number of components; in particular, both AWE and ICL-BIC perform better than any other criterion. It is also found that the information criteria based on penalized estimates perform better than those based on unpenalized estimates. We also investigate the influence of the degrees of separation, sample sizes, mixing proportions, and true model complexity on the performance of the information criteria. The investigation shows that the degree of separation is the most influential factor, that the sample size has less influence than the degree of separation, and that the influence of the mixing proportions is small compared to those of the degree of separation and the sample size. It is also observed that the performance of all information criteria deteriorates as the true model complexity increases. Finally, the enzyme data set is used to illustrate the performance of the proposed information criteria for selecting the appropriate number of components in a mixture of BS distributions.