Using the Discrete Lindley Distribution to Deal with Over-dispersion in Count Data

Count data in environmental epidemiology and ecology often display substantial over-dispersion, and failing to account for it can result in biased estimates and underestimated standard errors. This study develops a new generalized linear model family for over-dispersed count data by assuming that the response variable follows the discrete Lindley distribution. An iterative weighted least squares algorithm is developed to fit the model. Furthermore, asymptotic properties of the estimators and goodness-of-fit statistics are derived. Lastly, simulation studies and empirical data applications are carried out, in which the generalized discrete Lindley linear model shows better performance than the Poisson model.


Introduction
The Poisson distribution is standard in modeling count data because it is easy to use and interpret. For instance, Almeida, Casimiro, and Calheiros (2010) fit a generalized additive model (GAM) to describe the effect of daily average temperature on Poisson-distributed daily mortality in the Lisbon and Oporto cities of Portugal. Linares and Diaz (2010) investigate the relationship between daily PM2.5 concentrations and hospital admissions in the city of Madrid (Spain) with a generalized linear model (GLM) with the Poisson distribution. However, a critical property of the Poisson distribution is that the variance equals the expectation. For most observed count data, this property is violated: count data often display substantial over-dispersion, in which the variance of the outcome is greater than its mean. Over-dispersion can be attributed to omitted variables, the existence of outliers, and heteroskedasticity (Rigby, Stasinopoulos, and Akantziliotou 2008). Crawley (2012) and Hilbe (2011) state that applying the Poisson model to over-dispersed data can result in biased estimates because the dispersion parameter is ignored in the fitted model. Another disadvantage is that the standard errors of the estimates are downward biased, so explanatory variables tend to appear to have significant impacts on the response variable even when they do not (Faddy and Smith 2011).
Various statistical procedures have been developed to overcome the over-dispersion problem. A common approach is the Poisson quasi-likelihood (McCullagh and Nelder 2019), which specifies only the dependence of the mean on the explanatory variables, with the variance written as a multiplicative constant times the mean. Besides, the negative binomial model is another valuable technique for over-dispersion. Also, Zuur, Ieno, Walker, Saveliev, and Smith (2009) introduce the zero-inflated Poisson model as a tool for an excess of zeroes in the data, and Harrison (2014) suggests the Poisson-lognormal model, fitted as an observation-level random-effect model. Abebe and Shanker (2018) use a discretization method based on an infinite series to generate the discrete Lindley distribution from the continuous Lindley distribution (Lindley 1958). They discuss statistical properties of the discrete Lindley distribution, including moments, skewness, kurtosis, and parameter estimation. They conclude that its index of dispersion is greater than 1, so the discrete Lindley distribution is suitable for modeling an over-dispersed response variable.
In this study, we introduce a new method to address over-dispersion in modeling count data by combining the discrete Lindley distribution with the GLM, named the generalized discrete Lindley linear model (GDLLM). Section 2 covers parameter estimation and some statistics derived for the GDLLM. A simulation study is carried out in Section 3 to explore the performance of the discrete Lindley model compared to the Poisson model at different levels of over-dispersion. Furthermore, demonstrations of the GDLLM on realistic data sets are included in Section 4. The first data set relates to the ecological field and is extracted from the faraway package in R (R Core Team 2013); the discrete Lindley model is compared with the Poisson and negative binomial models to select the best model. The second example is an empirical analysis of the impact of daily temperature on all-cause mortality in the Dien Chau district, Nghe An province, Vietnam.

Discrete Lindley (DL) distribution
The single random variable Y follows the discrete Lindley distribution with a single parameter θ, denoted by Y ∼ DL(θ), if its probability mass function can be written in the following form (Abebe and Shanker 2018):

P(Y = y) = (1 + y)(1 − e^(−θ))² e^(−θy),   (1)

for y ∈ ℕ and θ > 0. The function in Equation (1) can be rewritten as

f(y; θ) = exp{−θy + 2 ln(e^θ − 1) − 2θ + ln(1 + y)},   (2)

showing that the discrete Lindley distribution is in the exponential family. In it, γ = −θ is the canonical parameter, the cumulant function is b(γ) = −2[ln(e^(−γ) − 1) + γ], the dispersion parameter equals 1, and the normalizing function is ln(1 + y). From the properties of distributions in the exponential family, we derive the expectation and the variance of a random variable Y following the discrete Lindley distribution as follows:

E(Y) = b′(γ) = 2/(e^θ − 1),   V(Y) = b″(γ) = 2e^θ/(e^θ − 1)².   (3)

The index of dispersion (IOD) is defined by

IOD = V(Y)/E(Y) = e^θ/(e^θ − 1).   (4)

The index of dispersion is greater than 1 for all θ > 0. Hence, the discrete Lindley distribution can be used to describe over-dispersed data sets. Besides, for further analysis, denoting µ = E(Y) and rewriting V(Y) with respect to µ, we have

V(Y) = µ(1 + µ/2).   (5)
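The moment formulas above can be checked numerically. The following Python sketch (illustrative only; the paper's computations are in R) assumes the pmf form P(Y = y) = (1 + y)(1 − e^(−θ))²e^(−θy) implied by the exponential-family components stated above, with an arbitrary θ chosen for the check.

```python
import numpy as np

# Numerical check of the discrete Lindley pmf and moments; theta = 0.7 is
# an arbitrary illustrative value.
theta = 0.7
y = np.arange(0, 2000)                      # truncation is safe: the tail decays like e^(-theta*y)
pmf = (1 + y) * (1 - np.exp(-theta))**2 * np.exp(-theta * y)

total = pmf.sum()                           # should be ~1: the pmf is proper
mean = np.sum(y * pmf)                      # should match 2 / (e^theta - 1)
var = np.sum(y**2 * pmf) - mean**2          # should match mu * (1 + mu / 2)
mu = 2 / np.expm1(theta)
iod = var / mean                            # index of dispersion, > 1 for all theta
```

The same check for other positive θ values confirms that the index of dispersion always exceeds 1.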

Generalized discrete Lindley linear model (GDLLM)
Fitting the model

Suppose Y₁, Y₂, · · · , Yₙ are independent random variables, each following a discrete Lindley distribution, and

ln(µᵢ) = Xᵢβ = ηᵢ,   (6)

where µᵢ = E(Yᵢ), Xᵢ is the i-th row of the design matrix containing data on the k explanatory variables, and β is the k-vector of parameters. Then, model (6) can be rewritten in the equivalent form

η = Xβ,   (7)

where η = (η₁, η₂, · · · , ηₙ)ᵀ. The log-likelihood function for all Yᵢ's is

ℓ(µ; y) = Σᵢ [yᵢ ln(µᵢ/(µᵢ + 2)) + 2 ln(2/(µᵢ + 2)) + ln(1 + yᵢ)].   (8)

In general, the n-vector score u and the n × n diagonal information matrix J with respect to η are computed by

uᵢ = ∂ℓ/∂ηᵢ = (yᵢ − µᵢ)/V(Yᵢ) · ∂µᵢ/∂ηᵢ,   Jᵢᵢ = (∂µᵢ/∂ηᵢ)²/V(Yᵢ).   (9)

Applying these to the discrete Lindley distribution with the log link, the elements of the vector u and the matrix J are

uᵢ = 2(yᵢ − µᵢ)/(2 + µᵢ),   Jᵢᵢ = 2µᵢ/(2 + µᵢ).   (10)

Also, we easily have that ∂η/∂β = X.
To obtain the maximum likelihood estimates of the parameters, we differentiate (8) with respect to the elements of β and set the derivatives equal to zero. It turns out that the estimates β̂ are the solutions of Xᵀu = 0.
Applying the Newton–Raphson method, the updated β at the (m + 1)-th iteration is

β⁽ᵐ⁺¹⁾ = (Xᵀ J⁽ᵐ⁾ X)⁻¹ Xᵀ J⁽ᵐ⁾ w⁽ᵐ⁾,   (11)

where w is the vector of "pseudo data", calculated by

w = η + J⁻¹u.   (12)

The iterative weighted least squares algorithm is implemented to update η, µ, and w at each iteration. Developed from the general algorithm by Nelder and Wedderburn (1972), our specific algorithm for the GDLLM is presented as follows.

Algorithm 1
1. Initialize β⁽⁰⁾; compute η⁽⁰⁾ = Xβ⁽⁰⁾ and µ⁽⁰⁾ = exp(η⁽⁰⁾).
2. At iteration m, compute u⁽ᵐ⁾, J⁽ᵐ⁾, and the pseudo data w⁽ᵐ⁾ = η⁽ᵐ⁾ + (J⁽ᵐ⁾)⁻¹u⁽ᵐ⁾.
3. Update β⁽ᵐ⁺¹⁾ = (Xᵀ J⁽ᵐ⁾ X)⁻¹ Xᵀ J⁽ᵐ⁾ w⁽ᵐ⁾ and recompute η and µ.
4. Repeat steps 2–3 until the change in β is below a chosen tolerance.
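The fitting procedure can be sketched in a few lines. The Python sketch below is an illustrative translation (the paper's own implementation is in R, given in its Appendix); it assumes the log link, under which the working weights are Jᵢᵢ = 2µᵢ/(2 + µᵢ) and the η-scores are uᵢ = 2(yᵢ − µᵢ)/(2 + µᵢ). The function name `fit_gdllm` and the simulated data are hypothetical.

```python
import numpy as np

def fit_gdllm(X, y, max_iter=100, tol=1e-10):
    """Iterative weighted least squares for the discrete Lindley GLM,
    assuming the log link ln(mu) = X beta. Since V(mu) = mu*(1 + mu/2),
    the weights are J = mu^2/V = 2*mu/(2+mu) and the eta-score is
    u = (y - mu)*(dmu/deta)/V = 2*(y - mu)/(2 + mu)."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean() + 0.5)               # crude starting value
    for _ in range(max_iter):
        eta = X @ beta
        mu = np.exp(eta)
        J = 2 * mu / (2 + mu)                      # diagonal of the information matrix
        u = 2 * (y - mu) / (2 + mu)                # score with respect to eta
        w = eta + u / J                            # pseudo data
        beta_new = np.linalg.solve(X.T @ (J[:, None] * X), X.T @ (J * w))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Illustrative use on simulated counts (the true coefficients are arbitrary).
rng = np.random.default_rng(7)
n = 400
X = np.column_stack([np.ones(n), rng.normal(1.0, 0.5, size=n)])
y = rng.poisson(np.exp(0.4 + 0.3 * X[:, 1])).astype(float)
beta_hat = fit_gdllm(X, y)
```

At convergence, β̂ satisfies the estimating equation Xᵀu = 0 stated above.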
Each step of the iterative weighted least squares adjusts the pseudo dependent variable w. We define an asymptotic approximation w⁰ of w, evaluated at the true parameter values:

w⁰ = η⁰ + (J⁰)⁻¹u⁰.   (13)

Since E(u) = 0 and Var(u) = J, w has asymptotic mean η⁰ and asymptotic variance (J⁰)⁻¹. We therefore have the variance-covariance matrix for β̂:

Var(β̂) = (Xᵀ J X)⁻¹.   (14)

When all the regularity conditions are met, we can extend this to show that β̂ is asymptotically

β̂ ∼ N(β, (Xᵀ J X)⁻¹),   (15)

or, in the form of a Wald statistic,

(β̂ − β)ᵀ(Xᵀ J X)(β̂ − β) ∼ χ²ₖ.   (16)

Deviance
The definition of deviance for a fitted model is

D = 2[ℓ(µ̂_max; y) − ℓ(µ̂; y)],   (17)

where ℓ(µ̂_max; y) indicates the maximized log-likelihood of the full model: the model with the maximum number of parameters that can be estimated, while ℓ(µ̂; y) denotes the maximum value of the log-likelihood function for the model of interest. If the response variables Y₁, Y₂, · · · , Yₙ are independent and each Yᵢ ∼ DL(θᵢ), the log-likelihood function is as in (8). The maximum likelihood estimator for each θᵢ is

θ̂ᵢ = ln((yᵢ + 2)/yᵢ),   (18)

or equivalently µ̂ᵢ = yᵢ. Then, the maximum log-likelihood for the full model is

ℓ(µ̂_max; y) = Σᵢ [yᵢ ln(yᵢ/(yᵢ + 2)) + 2 ln(2/(yᵢ + 2)) + ln(1 + yᵢ)].   (19)

Because µᵢ = 2/(e^(θᵢ) − 1), the log-likelihood as a function of µ takes the form in (8); substituting the estimate µ̂ for µ, ℓ(µ̂; y) is derived. Therefore, the deviance is

D = 2 Σᵢ [yᵢ ln(yᵢ(µ̂ᵢ + 2)/(µ̂ᵢ(yᵢ + 2))) + 2 ln((µ̂ᵢ + 2)/(yᵢ + 2))].   (20)

Under the hypothesis that the model is correct, the deviance D approximately follows the Chi-square distribution

D ∼ χ²₍ₙ₋ₖ₎,   (21)

where k is the number of parameters in the model.
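The deviance can be computed directly from the fitted means. A minimal Python sketch, assuming the unit log-likelihood ℓ(µ; y) = y ln(µ/(µ+2)) + 2 ln(2/(µ+2)) + ln(1+y) (which follows from µ = 2/(e^θ − 1)) and a saturated model with µ̂ᵢ = yᵢ; the function name is hypothetical:

```python
import numpy as np

def dl_deviance(y, mu):
    """Deviance of a discrete Lindley GLM fit: twice the log-likelihood gap
    between the saturated model (mu_i = y_i) and the fitted means mu."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        # the y * log(...) term vanishes when y = 0
        term = np.where(y > 0,
                        y * np.log(y * (mu + 2) / (mu * (y + 2))),
                        0.0)
    return float(np.sum(2 * (term + 2 * np.log((mu + 2) / (y + 2)))))
```

As a sanity check, the deviance is exactly zero when every fitted mean equals its observation, and positive otherwise.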

Goodness of fit
Testing the goodness of fit of a generalized linear model is not often based on the raw residuals eᵢ = yᵢ − µ̂ᵢ because they do not reflect the mean-variance relationship. Common residual types used in generalized linear models are the Pearson residuals and the deviance residuals.
For the discrete Lindley GLM, the concrete formula of the Pearson residual is

rᵢᴾ = (yᵢ − µ̂ᵢ)/√(µ̂ᵢ(1 + µ̂ᵢ/2)).   (22)

Besides, if we write the deviance as the summation of n components dᵢ, each computed on the i-th datum, the deviance residuals can be defined as

rᵢᴰ = sign(yᵢ − µ̂ᵢ)√dᵢ.   (23)

These residuals should approximately follow a standardized Normal distribution. Furthermore, the residuals should not show any apparent pattern or trend when plotted against the fitted values or any explanatory variable. The Akaike Information Criterion (AIC) is used to compare different models. Also, for a model with an over-dispersed outcome, model choice is based on the modified Akaike Information Criterion QAIC (Hastie and Tibshirani 1990), given by

QAIC = −2ℓ(µ̂; y)/φ̂ + 2k,   (24)

in which φ is the over-dispersion parameter, and it can be estimated by

φ̂ = Σᵢ (rᵢᴾ)²/(n − k).
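These diagnostics are straightforward to compute. A Python sketch with hypothetical helper names, assuming the discrete Lindley variance function V(µ) = µ(1 + µ/2) and the Pearson-statistic estimate of φ (one common convention):

```python
import numpy as np

def pearson_residuals(y, mu):
    # (y - mu) scaled by the discrete Lindley standard deviation sqrt(V(mu))
    return (y - mu) / np.sqrt(mu * (1 + mu / 2))

def deviance_residuals(y, mu, unit_deviances):
    # r_i = sign(y_i - mu_i) * sqrt(d_i), with d_i the i-th deviance component
    return np.sign(y - mu) * np.sqrt(unit_deviances)

def qaic(loglik, y, mu, k):
    # Quasi-AIC: rescale the log-likelihood by the estimated over-dispersion
    # parameter before applying the usual 2k penalty.
    phi_hat = np.sum(pearson_residuals(y, mu) ** 2) / (len(y) - k)
    return -2 * loglik / phi_hat + 2 * k
```

A residual of zero whenever yᵢ = µ̂ᵢ and a positive φ̂ are quick internal checks on these formulas.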

Simulation study
To generate over-dispersed data, we apply the result derived by Cameron and Trivedi (1998). Let Yᵢ have a Poisson distribution with parameter θᵢ, where θᵢ = µᵢεᵢ and εᵢ follows a gamma distribution with mean 1 and variance 1/α. The unconditional distribution of y is then negative binomial, NB(r = α, p = µ/(µ + α)). The mean and the variance are E(Y) = µE(εᵢ) = µ and V(Y) = µ(1 + µ/α), exhibiting over-dispersion.
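The generator above can be sketched as follows; this is an illustrative Python version (the study's simulations are in R), with arbitrary values of µ and α:

```python
import numpy as np

# Poisson-gamma mixture: a gamma multiplicative error with mean 1 and
# variance 1/alpha, then Poisson counts with rate mu * eps. Smaller alpha
# means stronger over-dispersion; marginally V(Y) = mu * (1 + mu / alpha).
rng = np.random.default_rng(2024)
n, mu, alpha = 100_000, 4.0, 2.0
eps = rng.gamma(shape=alpha, scale=1.0 / alpha, size=n)   # E(eps) = 1, V(eps) = 1/alpha
y = rng.poisson(mu * eps)

print(y.mean())   # close to mu = 4
print(y.var())    # close to mu * (1 + mu/alpha) = 12
```

With α = 2 the variance is three times the mean, a clearly over-dispersed sample.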
We simulate 5000 samples, with sample sizes n ∈ {50, 100, 200}, through the following steps:
1. Two independent variables x₁ and x₂ are generated: x₁ is randomly drawn from the Normal distribution N(1, 0.5), and x₂ is drawn from the Bernoulli distribution with success probability p = 0.5.

2. Compute µᵢ = exp(β₀ + β₁x₁ᵢ + β₂x₂ᵢ) and generate the response yᵢ from the negative binomial mixture above at the chosen over-dispersion level α.
Once the data are generated, the generalized linear model assuming the response variable follows the discrete Lindley distribution is fitted. The generalized linear model with a Poisson response variable is applied as well. We then compare the two methods under each condition using the common bias (CB), the standard deviation (SD), and the mean squared error (MSE) of each estimated parameter based on the 5000 replicates:

CBⱼ = β̄ⱼ − βⱼ,   SDⱼ = √((1/5000) Σᵣ (β̂ⱼᵣ − β̄ⱼ)²),   MSEⱼ = (1/5000) Σᵣ (β̂ⱼᵣ − βⱼ)²,

where β̂ⱼᵣ is the estimated value of the parameter βⱼ (j = 0, 1, 2) at the r-th simulation experiment and β̄ⱼ = (1/5000) Σᵣ β̂ⱼᵣ. The simulation results are reported in Tables 1, 2, and 3.
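The three Monte Carlo summaries can be computed with a small helper. The Python sketch below uses the usual definitions (the paper's exact convention for CB may differ, e.g. in taking an absolute value), and the helper name is hypothetical:

```python
import numpy as np

def mc_metrics(est, true_value):
    """Monte Carlo summaries for one parameter over R replicates:
    CB = mean(est) - true, SD = std(est), MSE = mean((est - true)^2).
    Note the identity MSE = CB^2 + SD^2."""
    est = np.asarray(est, dtype=float)
    cb = est.mean() - true_value
    sd = est.std()                       # population form, divides by R
    mse = np.mean((est - true_value) ** 2)
    return cb, sd, mse
```

The bias-variance identity MSE = CB² + SD² is a useful internal consistency check when tabulating the results.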
The results in Tables 1–3 show that, at all levels of over-dispersion and all sample sizes, estimators based on the discrete Lindley distribution outperform those based on the Poisson distribution in terms of common bias, standard deviation, and mean squared error. Besides, the common bias, standard deviation, and MSE of the estimators in both models increase as the level of over-dispersion increases (α decreases). A larger sample size leads to more precise parameter estimates in the discrete Lindley model, and the same holds for the Poisson model. In summary, the GDLLM appears to be a good recommendation for over-dispersed data.

Empirical data analysis

4.1. Example 1: Species diversity on the Galapagos Islands
The gala data set is available in the faraway R package (Faraway 2004). To demonstrate the performance of the GDLLM in handling over-dispersion, several popular GLMs for count data, including Poisson regression and negative binomial regression, are applied to the data set and compared with the GDLLM through the AIC and QAIC criteria. The model with the smallest values of these criteria is considered the best-fitting model for the data set.
The response variable is the number of endemic species found on 30 Galapagos islands. The predictors used in this study include four variables, among them:
• Area (km²): the area of the island

The residual plots for the fitted models are presented in Figure 1.

4.2. Example 2: The effect of daily temperature on all-cause mortality

The association between temperature and mortality is typically nonlinear and delayed over time (Guo, Barnett, Pan, Yu, and Tong 2011). The distributed lag nonlinear model (DLNM) proposed by Gasparrini, Armstrong, and Kenward (2010) has become a powerful technique in the epidemiological field because it controls for the bi-dimensional effect of the exposure and its lags on the outcome. The main idea of the DLNM is to generate cross-basis variables and include them in the design matrix of a generalized linear model to estimate the parameters.
In this example, we apply the DLNM to investigate the effect of temperature on all-cause mortality under the assumption that the response variable follows the discrete Lindley distribution. The procedures for generating the cross-basis variables and applying the GDLLM are discussed in detail.

Data description
The empirical analysis is based on a time-series data set collected at Dien Chau, a coastal plain district in Nghe An Province, Vietnam. Geographically, Dien Chau is located in the central part of Vietnam. Its climate is influenced by a strong monsoon, with hot, dry summers and cold, cloudy, drizzly winters.
For the study, the daily number of deaths from all causes in 2014 at Dien Chau is obtained from the A6 death register, in which medical workers at commune health stations record every death event occurring across the country. The meteorological indicators are daily observations measured at the Nghe An Observatory and provided by the Vietnam Meteorological and Hydrological Administration, including temperature (°C) and total precipitation (mm). Descriptive statistics of these variables are presented in Table 5. Besides, the histogram of all mortality cases is shown in Figure 2, with the observed frequencies of the number of deaths and the expected frequencies fitted by the Poisson and discrete Lindley distributions. The expected frequencies from the discrete Lindley distribution are closer to the observed frequencies. The variance of the mortality counts is about twice the mean, so it is appropriate to apply the GLM with the discrete Lindley distribution.
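A Figure-2-style comparison can be reproduced in a few lines. The Python sketch below uses synthetic counts (not the Dien Chau data); it relies on two facts implied by the formulas above: for DL(θ), µ = 2/(e^θ − 1), so the maximum likelihood estimate from an iid sample is θ̂ = ln(1 + 2/ȳ), and the DL(θ) pmf coincides with a negative binomial with 2 "successes" and success probability 1 − e^(−θ), which gives a convenient sampler.

```python
import numpy as np

# Synthetic daily death counts, for illustration only.
rng = np.random.default_rng(5)
theta_true = 0.9
y = rng.negative_binomial(2, 1 - np.exp(-theta_true), size=365)  # DL(theta) sampler

# Fit DL by maximum likelihood (closed form via the sample mean) and
# compare expected with observed frequencies, as in the histogram.
theta_hat = np.log(1 + 2 / y.mean())
k = np.arange(y.max() + 1)
expected = len(y) * (1 + k) * (1 - np.exp(-theta_hat))**2 * np.exp(-theta_hat * k)
observed = np.bincount(y)
```

The fitted mean 2/(e^(θ̂) − 1) reproduces the sample mean exactly, which is a quick check that the closed-form MLE is implemented correctly.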

The model
In this example, we focus on the relationship between daily temperature and the number of all-cause deaths at Dien Chau, Nghe An, Vietnam. The suggested model is

ln(µₜ) = β₀ + f·g(Tempₜ) + s(trendₜ) + s(Preₜ) + DOWₜδ,   (25)

where Yₜ is the daily all-cause mortality following the discrete Lindley distribution, µₜ = E(Yₜ), Temp is the daily temperature, and f·g is the bi-dimensional function describing the relationship of temperature and its lags with the outcome. Also, s(trend) and s(Pre) denote smooth functions of the long-time trend and the total precipitation (Pre). DOW is a set of six dummy variables for the days of the week, with Monday being the base category to which no dummy is assigned.
f·g combines the exposure-outcome and lag-outcome functions, in which f and g can be parameterized as linear combinations of basis functions. Denote by R the n × (L + 1)vₓ matrix containing the evaluations of each basis function at the lagged occurrences of the variable Temp, where vₓ is the number of basis functions for the exposure and n is the sample size. In the same vein, C is an (L + 1) × v matrix (v < L) representing the basis functions applied to the lag vector. Gasparrini et al. (2010) construct a cross-basis matrix W encompassing the values of the bi-dimensional function by re-arranging and summing, along the lag dimension, a matrix built from R and C via Kronecker (⊗) and Hadamard (⊙) products with all-one vectors (denoted 1, with length given by a subscript).
The representation of f·g is then Wγ, with γ being the vector of new parameters corresponding to the new design matrix W.
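The cross-basis construction can be illustrated numerically. The Python sketch below assumes the convention W[t, (j,k)] = Σₗ fⱼ(x_{t−l}) C[l, k], i.e. each column of W pairs one exposure basis function with one lag basis function; the function name is hypothetical, and in practice this step is performed by crossbasis in the dlnm R package.

```python
import numpy as np

def cross_basis(x, exposure_basis, lag_basis_C, L):
    """Build an n x (vx*v) cross-basis matrix: exposure_basis is a list of
    vx callables f_j, and lag_basis_C is an (L+1) x v matrix whose column k
    is the k-th lag basis function evaluated at lags 0..L."""
    n = len(x)
    vx, v = len(exposure_basis), lag_basis_C.shape[1]
    W = np.zeros((n, vx * v))
    for t in range(L, n):                        # the first L rows lack a full lag history
        lagged = x[t - L:t + 1][::-1]            # x[t], x[t-1], ..., x[t-L]
        R_t = np.array([[f(val) for f in exposure_basis] for val in lagged])  # (L+1, vx)
        W[t] = (R_t.T @ lag_basis_C).ravel()     # (vx, v), flattened row-wise
    return W
```

With a trivial lag basis that puts all weight on lag 0, the cross-basis reduces to the exposure basis evaluated at the current value, which makes the construction easy to verify.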
The types of smooth functions for the temperature and the lag dimensions are chosen independently. The natural cubic spline is chosen to describe the relationship in each dimension because of its flexibility near the two boundary points, where some degree of non-linearity is expected (Goldberg, Gasparrini, Armstrong, and Valois 2011). The knots are evenly spread over the temperature range and placed at equal intervals over the logarithm of the lags. These choices are motivated by substantive papers in the epidemiological literature (Gasparrini et al. 2010; Guo et al. 2011). Besides, a natural cubic spline with one degree of freedom is chosen to control for the long-time trend and the total precipitation. The model (25) now becomes

ln(µₜ) = β₀ + wₜᵀγ + B_trend α₁ + B_Pre α₂ + DOWₜδ,   (28)

where B_trend and B_Pre refer to the bases relating to the long-time trend and the total precipitation.
With the given basis functions for the exposure and the lags, we can generate the cross-basis function using the crossbasis command of the dlnm package in R (Gasparrini, Armstrong, and Scheipl 2011). The model (28) with the new set of predictors is treated as a GDLLM. Then, the estimates and inferences for the parameters can be carried out through the iterative weighted least squares procedure discussed in Section 2.2. Detailed R code to generate the cross-basis variables and estimate the GDLLM is available in the Appendix.
For mortality analyses, we are often interested in the relative risk of an exposure x with respect to a given reference value x₀, which under the log link is RR(x) = µ(x)/µ(x₀) = exp{η(x) − η(x₀)}.

Results and discussion
The QAIC is used to select the best degrees of freedom (df) for the smooth functions of the predictor Temp and the lag. The best degrees of freedom are two for the exposure dimension and four for the lag dimension. The maximum lag is set to 20 days. The relative risks and their 90% confidence intervals are presented, with 25°C (the median temperature) chosen as the reference for the data analysis. Figure 3 presents the three-dimensional plot of the relative risk along the temperature and lag dimensions. It is found that the effects of heat and cold occur strongly and immediately but last for only two days. Furthermore, the shape of the temperature-mortality relationship changes with the lag. Figure 4 shows the effects of extreme cold and extreme heat, corresponding to 13°C (the 1st percentile) and 33°C (the 99th percentile), on the all-cause mortality risk over a 20-day lag period. Strong effects appear in the first two days and decline rapidly afterward. There is no significant effect of heat after lag 1. However, extremely cold temperatures also show delayed effects on all-cause mortality, with significant relative risks 11 to 15 days after exposure. Generally, the temperature effect on mortality manifests as a nonlinear, U-shaped curve. At lags 0 and 1, the minimum-mortality temperature is 24°C, and temperatures below 18°C or above 29°C cause a significant increase in mortality risk. Though the causes of death are not examined in this study, we suppose that sudden deaths can be attributed to heart attacks, cardiac arrests, and strokes. Besides, the mortality risks relating to cold temperatures (less than 18°C) are elevated over the 11-to-15-day lag period. This result tallies with Carder, McNamee, Beverland, Elton, Cohen, Boyd, and Agius (2005) and Wang, Shi, Zanobetti, and Schwartz (2016).

Conclusion
It is essential to construct specialized models that adapt well to over-dispersed count data. This study focuses on the generalized linear model assuming that the response variable follows the discrete Lindley distribution, and shows how this model accommodates over-dispersion in the response. The iterative weighted least squares algorithm is employed to estimate the model's parameters.
The performance of the discrete Lindley model is evaluated through a simulation study at different levels of over-dispersion. The discrete Lindley model performs better than the Poisson model under several criteria, including common bias, standard deviation, and mean squared error (MSE). Data applications are also carried out to demonstrate modeling and assessment in practice. The first example uses a cross-sectional data set from ecology; there, the discrete Lindley model outperforms the Poisson model and performs comparably to the negative binomial model in terms of AIC, QAIC, and residual plots. Hence, the discrete Lindley model can be recommended for over-dispersed data.
Finally, the empirical analysis in the second example offers evidence for the usefulness of the proposed model in environmental epidemiology. Because the association between temperature and all-cause mortality is nonlinear and delayed, the distributed lag nonlinear model is used to generate cross-basis variables, which are then included in the regression model. The final model retains the representation of a generalized linear model. Applying the estimating technique for the GDLLM, we obtain the estimates of the parameters relating to these cross-basis variables, which describe the bi-dimensional effect of the predictor and the lag. This example provides a flexible method that can be used to investigate complex associations in exposure-health studies.