On Zero-Modified Poisson-Sujatha Distribution to Model Overdispersed Count Data

In this paper we propose the zero-modified Poisson-Sujatha distribution as an alternative to model overdispersed count data exhibiting inflation or deflation of zeros. It will be shown that the zero modification can be incorporated by using the zero-truncated Poisson-Sujatha distribution. A simple reparametrization of the probability function will allow us to represent the zero-modified Poisson-Sujatha distribution as a hurdle model. This trick leads to the fact that proposed model can be fitted without any previously information about the zero modification present in a given dataset. The maximum likelihood theory will be used for parameter estimation and asymptotic inference concerns. A simulation study will be conducted in order to evaluate some frequentist properties of the developed methodology. The usefulness of the proposed model will be illustrated using real datasets of the biological sciences field and comparing it with other models available in the literature.


Introduction
Most applications involving the analysis of count data are performed using the Poisson and Negative Binomial distributions.The latter is a well-known 2-parameter Poisson compound model that arises as alternative to fit overdispersed data since Poisson models are not applicable in this case.The literature concerning discrete models that accommodate different levels of dispersion is wide and provides several composed distributions as Poisson-Lindley (Sankaran 1970), Negative Binomial-Lindley (Zamani and Ismail 2010), Poisson-Exponential (Cancho, Louzada-Neto, and Barriga 2011), Poisson-Shanker (Shanker 2016a), among others.
A relevant drawback of such compound models is the fact that they do not fit well when a large amount of zeros is observed.To overcome this issue, several zero-inflated and hurdle approaches for standard Poisson model were proposed (Mullahy 1986;Lambert 1992;Zorn 1996).McDowell (2003) provides a insightful discussion about hurdle models.Such approaches were considered by several authors for applications and we will pointed out a few.Bohara and Krieg (1996) show that the modelling of migratory frequency data can be improved using zero-inflated Poisson models.Gurmu and Trivedi (1996) seek to deal with the excess of zeros on data from recreational trips.Ridout, Demétrio, and Hinde (1998) use several zero-inflated Poisson regression models for Apple shoot propagation data.In the social sciences, Bahn and Massenburg (2008) consider the hurdle Poisson model for the number of homicides in Chicago (Illinois -USA).Further applications were considered for quantitative studies on HIV-risk reduction (Heilbron and Gibson 1990;Hu, Pavlicova, and Nunes 2011) and for DNA sequencing data (Beuf, Schrijver, Thas, Criekinge, Irizarry, and Clement 2012).As zero-deflated data seldom arises in practice, there are very few literature addressing this case (Angers and Biswas 2003;Conceição, Andrade, and Louzada 2013) even if this situation is often referred in papers dealing with zero-inflated models.
Recently, the Poisson-Sujatha distribution was obtained by compounding the Poisson with a Sujatha distribution.The latter was introduced by Shanker (2016b) for modelling real lifetime in biological and engineering contexts.The author has shown that this model is a three component mixture of an Exponential distribution with scale parameter θ, a Gamma distribution having shape parameter 2 and scale parameter θ and a Gamma distribution having shape parameter 3 and scale parameter θ with mixing proportions given, respectively, by θ 2 υ −1 , θυ −1 and 2υ −1 , being υ = θ 2 + θ + 2. A comprehensive discussion about the statistical properties of the Sujatha distribution such as moments, hazard function, stochastic orderings, parameter estimation, among others is also presented on the mentioned paper.
The Poisson-Sujatha distribution was introduced and extensively studied by Shanker (2016c) which have discussed its various mathematical properties.Shanker and Fesshaye (2016a) consider the Poisson-Sujatha distribution to model overdispersed counts provided by ecological and genetic experiments.Shanker and Fesshaye (2016b) obtained the size-biased version of the Poisson-Sujatha distribution, presenting its properties and discussing its applications.The zero-truncated Poisson-Sujatha distribution, which will be of particular interest in this paper, was presented by Shanker and Fesshaye (2016c).Further, a detailed report on zerotruncated Poisson, Poisson-Lindley and Poisson-Sujatha distributions is provided by Shanker and Fesshaye (2016d).
Zero-modified models may arise when no information about the kind of zero modification in a given dataset is available.Dietz and Böhning (2000) proposed the zero-modified Poisson regression model for zero inflated/deflated samples and Conceição et al. (2013) consider a Bayesian approach for this model as an alternative to model Brazilian leptospirosis notification data.Once zero inflated/deflated models may also be useful to deal with data presenting overdispersion, this paper aims to introduce and present the usefulness of the zero modified version of the Poisson-Sujatha distribution, which is itself overdispersed.The proposed model is naturally more flexible than the original one since it takes into account inflation or deflation of zeros, being the first an issue often encountered when analysing count data.For our purpose, we consider a reparameterization of the zero-modified Poisson-Sujatha probability mass function, which will allow the likelihood function to be separable on the model parameters.The estimation procedure will be conducted under the frequentist point of view by the usual likelihood theory.A simulation study will be conducted in order to evaluate some frequentist properties of the maximum likelihood estimators.The usefulness of the proposed model will be illustrated by considering applications to real datasets from the biological sciences field.Standard model comparison will be also provided.This paper is organized as follows.In Section 2, we briefly present the Poisson-Sujatha distribution, some of its mathematical properties and its zero-truncated version.In Section 3, we introduce the zero-modified Poisson-Sujatha distribution, demonstrating its flexibility to deal with zero inflated/deflated data.In Section 4, the zero-modified Poisson-Sujatha distribution is presented as a hurdle model.In Section 5, maximum likelihood estimation for the unknown parameters as well the asymptotic standard errors and confidence intervals are discussed.In Section 6, a simulation study is presented.In Section 7, the proposed model is considered for application to real datasets.Concluding remarks are addressed in Section 8.
being the ratio involving the parameter θ always positive.This implies that the PS distribution is overdispersed, i.e. whichever θ > 0 we have that σ 2 > µ.Further, the useful index of dispersion (τ ) is clearly greater than 1, also implying overdispersion since τ = σ 2 µ −1 .On the other hand, we have that τ → 1 σ 2 → µ as θ → ∞, i.e. the PS distribution has the property of equidispersion for large values of θ.
Again, using the relationship between the moments about mean and the moments about origin, some useful measures as the coefficient of variation (β), the coefficient of skewness (γ) and the coefficient of kurtosis (ζ) can be derived from equation (2).The expressions of such measures are The definition of the zero-truncated Poisson-Sujatha (ZTPS) distribution will be quite useful for the purpose of this paper.A random variable X is said to have ZTPS if its pmf can be written as for θ > 0. See Shanker and Fesshaye (2016c) for further details about the ZTPS distribution.

Zero-modified Poisson-Sujatha distribution
Let X be a random variable defined on X 0 .Thus, X is said to have zero-modified Poisson-Sujatha (ZMPS) distribution if its pmf can be written as for θ > 0 and the parameter π is subject to the condition (called π-condition) given by being f (x; θ) the pmf of a PS random variable.Further, δ x is the indicator function, so that δ x = 1 if x = 0 and δ x = 0 otherwise.Note that (7) is not a mixture distribution typically fitted to zero-inflated data, since parameter π can assume values greater than 1.However, for all values of π between 0 and its upper boundary, the equation ( 7) corresponds to a properly pmf since f ZMPS (x; θ, π) is positive for each x and sums to 1 on X 0 .The expected value and the variance of X are where µ and σ 2 are given in equations ( 3) and (4).Under the ZMPS distribution, the index of dispersion and the coefficients of variation, skewness and kurtosis are given, respectively, by The ZMPS distribution may be considered an interesting alternative to the usual zero-modified Poisson (ZMP) model since the basis distribution of the former can accommodate several levels of overdisperson, issue that the P distribution generally fails in deal with.Table 1 summarizes the nature and the behaviour of the presented measures, using selected values for the parameters θ and π.It is clear that the coefficient of variation, the coefficient of skewness, and the coefficient of kurtosis are increasing as θ increases and π decreases.The higher values for the index of dispersion are obtained for small values of θ and π.On the other hand, combining small values of θ with higher as possible values of π will provide bigger values for the expected value and for the variance.
Theorem 1.The following statements holds.
i) If π = 0 then f ZMPS (0; θ, π) = 1 and therefore, equation (7) relates to a degenerate distribution with all mass at zero; and therefore, the ZMPS distribution has a proportion of zeros greater than the usual PS distribution; and therefore, the ZMPS distribution has a proportion of zeros smaller than the usual PS distribution.
Proof.Define the proportion of additional or missing zeros by Follows from the previous expression that (i) and (ii) are obvious.As for (iii), For the latter, whichever π > 1, (1 − π) < 0 and the result follows by the same argument for (iv).Hence, f ZMPS (0; θ, π) < f (0; θ), which completes the proof.
The inflation and deflation of zeros are characterized, respectively, by statements (iv) and (v) of the previous Theorem.We observe from (10) that the main role of the parameter π is to control the frequency of zeros.In such a way, very different values of π lead to completely different ZMPS distributions.For instance, fixing θ = 1.5, if π = 0.05 then P (X = 0) ≈ 0.97 and if π = 0.95 then P (X = 0) ≈ 0.43.

Hurdle version of the ZMPS distribution
The class of hurdle models was introduced by Mullahy (1986).The relevant feature of such models is that the zero outcomes are treated separately from the positive ones.In the formulation, a binary probability model determines whether a zero or a non-zero outcome occurs and hence, an appropriated truncated discrete distribution is chosen to describe the positive outcomes (Saffar, Adnan, and Greene 2012).
Let us define the hurdle version of the ZMPS distribution.Firstly, the equation ( 7) can be expressed as being f ZTPS (x; θ) the ZTPS distribution given by ( 6).Since 0 The pmf (12) can be seen as a hurdle version of the ZMPS distribution, where the probability of X = 0 is (1 − ω) and the probability of X > 0, is ω f ZTPS (x; θ).Moreover, from equation ( 9) we have that the expected value and the variance of the ZMPS distribution, expressed in its hurdle version, depends on the probability of X = 0 under the PS distribution.
The ZMPS distribution expressed in a hurdle version contains the ZTPS distribution as one of its components, which differs from the traditional mixture representation of zero-inflated distributions.Indeed, this representation of the ZMPS distribution can be interpreted as a superposition of two processes, i.e. one that produces positive observations from a ZTPS distribution and another that produces only zero valued observations with probability (1−ω).
The hurdle version of the ZMPS distribution can be used to derive the maximum likelihood estimators (MLEs) for the parameters θ and ω.Furthermore, such approach allow us to use only the positive observations in a given dataset to estimate the parameter θ assuming that these observations comes from a ZTPS distribution, while the parameter ω can be estimated as the proportion of zeros in the sample.Hence, parameter π can be estimated subsequently using the equation It is noteworthy that make inferences about the parameter π is essential to identify the kind of zero modification (inflation or deflation) is present in the analysed dataset.

Maximum likelihood estimation
Let X = (X 1 , . . ., X n ) a random sample of size n from the ZMPS distribution and x = (x 1 , . . ., x n ) its observed values.Thus, the likelihood function for the parameters θ and ω is given by One can note that the likelihood values of hurdle models are computed separately for each pmf.The MLEs of θ and ω can be obtained by direct maximization of the log-likelihood function where m denotes the number of positive outcomes and q the number of zero ones.Indeed m + q = n.Moreover, x m is the sample mean obtained from the set of positive values.
From (15) it is straightforward to see that the parameters θ and ω are orthogonal and that all terms in the log-likelihood function depending on θ take into account only the positive values of the sample vector x.Denoting by x (m) the vector of positive values from x, the log-likelihood function for θ based on the assumption that x (m) j , j = 1, . . ., m, are generated from a ZTPS distribution is given by Indeed n θ, ω; x (m) = m (θ; x), since each x j present in the log-likelihood of θ are generated by a zero-truncated distribution.Therefore, evaluate θ under ZMPS distribution is equivalent to assuming that the positive values of x comes entirely from a ZTPS distribution.On the other hand, denoting by x (q) the vector of zero outcomes from x, the log-likelihood function for ω is given by n θ, ω; Now, the corresponding score vector is given by where and The observed information, i.e. the Hessian matrix is given by where and By orthogonality of θ and ω, the crossed partial derivatives are null, then k θω = k ωθ = 0. Hence the set up of the information matrix is complete.As usual, the curvature of the log-likelihood function can be evaluated locally at θ and ω.
Proposition 1.Let ω the MLE of parameter ω.The following statements holds.
i) ω = mn −1 ; ii) ω is an unbiased estimator for ω; iii) The lower bound for the variance of ω is that for binary probability models.
Proof.Item (i) is straightforward.Take n − q = m and isolate ω in the equation and (ii) holds.For (iii), firstly note that the set {x : f ZMPS (0; θ, ω) > 0} does not depend on θ nor ω.Moreover, it is clear from (20) that for all x 0, u ω exists and is finite whenever ω = 0.For the moment, such conditions are sufficient and allow us to make use of the Cramér-Rao bound for the variance of an unbiased MLE, which is the case of ω.Then, Var ( ω) J −1 (ω) , where J = −E X (K) is the expected information, i.e. the Fisher information matrix.From ( 22) we have that and hence, which completes the proof.It is straightforward to show that the variance of ω is exactly n −1 ω (1 − ω) and therefore, coincides with the Cramér-Rao lower bound.
There is no closed form for the MLE of θ, see ( 19).However, using (17) the parameter θ can be estimated using standard numeric optimization algorithms such the Newton-Raphson, the Bisection and the Regula-Falsi methods.By the usual maximum likelihood theory, an asymptotic approximation for the variance of θ can be obtained from k −1 θθ , which evaluated at θ provides a consistent estimator for such a measure.On the other hand, the variance of ω can be estimated by its lower bound provided by the previous Proposition which, in fact, corresponds to the exact variance.
As aforementioned, the parameter π is a non-linear function depending on θ and ω.By the invariance principle, the MLE of π can be obtained as . Now, the variance of π can be estimated using the delta-method.Since θ and ω are orthogonal, Cov θ, ω = 0. Hence, being the variance of θ estimated numerically.The terms inside the brackets are evaluated at the MLEs of θ and ω.Now, to obtain intervallic estimates, we can use large sample approximations for the 100 (1 − α) % two sided confidence intervals (CIs) for the parameters θ, ω and π that are given, respectively, by being z α the upper α th percentile of the standard Normal distribution.The standard errors (SEs) are estimated as the squared root of the variance of the MLE of each model parameters.
In the following two sections, we presented the results obtained in the simulation study and the application of the proposed model to real datasets.To attain the numerical results, all computations were performed under the R environment (R Development Core Team 2007).

Simulation study
In this section, we seek to evaluate the frequentist properties of the proposed methodology by performing a simulation study.The simulation process consists in generating N = 10, 000 pseudo-random samples of sizes n = 50, 100, 150 and 200 of a variable X having ZMPS distribution in its hurdle version.Our procedure is based on the Monte Carlo simulation method to estimate the average bias and the mean squared error of the MLEs of the parameters θ and ω as well the coverage probability of the asymptotic CIs derived from such estimates.The results obtained for the parameter π are also presented.Assuming φ = θ, ω or π, the measures computed using the generated samples are given by , where A j = φ j − zα /2 se φ < φ < φ j + zα /2 se φ and therefore, δ A j assumes 1 whenever the CI obtained from the j th simulation contains the true value φ.
The following algorithm can be used to generate a single random variable from a ZMPS distribution.The process to generate a random sample consists to run the algorithm as often as necessary, say n times.The sequential-search is a black-box type of algorithm and works with any computable probability vector.The main advantage of such procedure is its ease of implementation.More informations on this algorithm can be found at Hörmann, Leydold, and Derflinger (2013).
Generate u ∼ U (0, 1) while u > p do 6: end while 9: return x 10: end procedure Under ZMPS distribution, the expected number of iterations (NI), i.e. the expected number of comparisons in the while condition is given by To run the simulation, we have established four scenarios in which a single value of θ was chosen for each one.The selected values were θ = 0.5, 1.0, 2.0 and 3.0.Moreover, we consider ω = 0.1, 0.5 and 0.9 varying in each scenario.On the other hand, since parameter π depends on the values of θ and ω, its values vary within each scenario and are closer of ω for small θ and when ω approaches to zero.Further, in order to evaluate if the coverage probability of the asymptotic CIs are around the nominal level of 95%, we fix α = 0.05 to compute such measures in the simulation process.
In Table 2, the bias and the coverage probability of the MLEs are presented for each parameter involving the ZMPS model.Figures 2 and 3 depicts the mean squared error of such estimates.
The results shows that both bias and mean squared error tends to zero when the sample size increases.It is noteworthy that the MLE of ω is negative biased in some cases.The mean squared error of ω remains quite small even for small n, which also occurs for π the smaller the value of θ.The coverage probabilities were found between 93% and 97% in most cases, indicating that the coverage of the 95% asymptotic CIs is relatively accurate.On the other hand, one can note that for small n, the coverage probability of the CIs obtained for ω decre-  ases at ≈ 89% when its true value is close to boundaries of the parametric space.The last relevant result is that the coverageprobability of the CIs obtained for π increases as the values of θ and ω increases, attaining values greater than 99% in the last cases of the third and fourth scenarios.

Application to real data
In this section, the ZMPS distribution is considered as an attempt to adequately model four datasets from biological science field.The goodness of fit of the proposed model is compared with those accessed by the P, the PS and the ZMP distributions.The first dataset is due to Beall (1940).The response is the number of Pyrausta nubilalis observed in small unit areas of a field in 1937.The second one refers to the number of chromatid aberrations observed on chemically induced chromosome aberrations in cultures of human leukocytes (Loeschcke and Köhler 1976).The third and fourth datasets relates to the number of mammalian cytogenetic dosimetry lesions induced in rabbits by lymphoblast streptonigrin at the exposure of 60 mg/kg and 90 mg/kg, respectively.A broader description of the last three datasets is provided by Shanker and Fesshaye (2016a).
Table 3 presents some descriptive statistics.The last column shows the observed proportion of zeros (PZ) in each dataset.One can note that more than 50% of the observations are zero-valued in all samples.Also, the initial analysis highlights the presence of overdispersion (see the index of dispersion), justifying the choice of the ZMPS model to describe such data.In Table 4 we present the frequency distribution of each sample.The expected frequencies was obtained using the estimated probabilities considering the MLEs of the parameters θ and π.Frequencies in bold relate to those one closer to the real values.The results show that the expected frequencies provided by the ZMPS model are the closest in most cases.The MLEs, the SEs and the 95% asymptotic CIs for the parameter θ of each fitted model are presented in Table 5.The model selection was performed using the Akaike information criterion with correction for finite samples (AICc) and the Bayesian information criterion (BIC).The goodness of fit was evaluated by the χ 2 statistic.It is noteworthy that the smaller AICc's are provided by the ZMPS model.On the other hand, in some cases the PS distribution presents similar fit when compared with its zero-modified version, as can be seen by that obtained from Dataset 4. In fact, there exist evidences that the proposed model adheres better to the considered datasets and hence, can be considered as a suitable option to model zero inflated/deflated count data in the presence of overdispersion.
The summary for the parameters ω and π can be found at Table 6.Under the ZMPS model, the inference about the parameter π allow us to identify that may exists an evidence that the first three datasets are zero-inflated while the last one may be classified as zero-deflated.By the 95% CIs obtained for the parameter π, we can also conclude that the PS distribution were computed and indicated the suitability of the considered methodology.The usefulness of the proposed distribution was evaluated by fitting it to four datasets obtained from biological science field with characteristics of overdispersion and zero modification.The model selection was performed using the AICc and the BIC criteria.The goodness of fit was accessed by the χ 2 statistic.The provided results demonstrate the superiority of the proposed model over the P, the PS and the ZMP distributions, confirming its applicability to model overdispersed and zero-modified count data.

Figure 1 :
Figure 1: Behaviour of the ZMPS distribution for different values of θ and π.

Table 1 :
Theoretical descriptive measures for different values of θ and π.

Table 2 :
Estimated bias of the MLEs and coverage probability of the CIs.

Table 3 :
Variables and descriptive statistics for each dataset.

Table 4 :
Observed and expected frequencies from each fitted model.