An Extension of the Geometric Distribution with Properties and Applications

A new parameter is introduced to extend the geometric distribution using Azzalini's method. Several important structural properties of the proposed two-parameter extended geometric distribution are investigated. Characterizations, including one of the geometric distribution in terms of the proposed model, are established. Maximum likelihood estimation, method of moments estimation and relative-frequency-based estimation of the parameters are discussed in detail. The likelihood ratio test regarding the relevance of the additional parameter is presented. Bayesian estimation of the parameters using STAN is also discussed. The proposed model is compared with some recently introduced two-parameter count models by analyzing two real-life datasets. The findings clearly indicate the superiority of the proposed model over the rest.


Introduction
There are several existing methods of generating and generalizing continuous probability distributions. In particular, skewed continuous distributions have been thoroughly examined and extended by many researchers, stemming from the popular skew normal distribution of Azzalini (1985). The standard skew normal distribution with skewing parameter α ∈ (−∞, ∞) has the probability density function (pdf)

f(z) = 2 φ(z) Φ(αz), z ∈ (−∞, ∞). (1)

Here, φ and Φ denote the pdf and cumulative distribution function (cdf) of the standard normal distribution. Let g and G denote the pdf and cdf of a continuous distribution with support S ⊂ (−∞, ∞), respectively. Thus, in general,

f(z) = g(z) G(αz) / ∫_S g(t) G(αt) dt, z ∈ S. (2)
Clearly, Azzalini's method is a particular case of the above weighting mechanism, with the cdf as the weighting function. While numerous continuous distributions have been introduced over the last three decades using Azzalini's method, no such attempt appears to have been made for discrete distributions.
Thus, it is of interest to apply Azzalini's method to a discrete distribution whose cdf has a compact form, such as the geometric distribution. Extension of the geometric distribution has attracted many researchers over the years; see, for instance, Jain and Consul (1971), Philippou, Georghiou, and Philippou (1983), Tripathi, Gupta, and White (1987), Mačutek (2008), Gómez-Déniz (2010) and Nekoukhou, Alamatsaz, and Bidram (2012). Many of these generalizations rely on the discretization of continuous distributions or on extension via the Marshall-Olkin family of distributions. Recently, Chakraborty and Gupta (2015) proposed a generalization known as the exponentiated geometric distribution, and Bhati, Sastry, and Maha Qadri (2015) introduced a weighted geometric distribution. The extended geometric distribution introduced in this article using Azzalini's method is different from the generalizations of the geometric distribution mentioned above. It is to be noted that the motivation behind the use of Azzalini's method here is to extend the geometric distribution, not to skew a symmetric distribution.
In the next section, we introduce the extended geometric distribution. We investigate its distributional properties in Section 3. In Section 4, we discuss a few related characterizations. Section 5 deals with estimation of the model parameters from both the frequentist and Bayesian perspectives. Two real-life applications of the proposed distribution are presented in Section 6. We conclude the article by pointing out several directions for future work and some limitations of the current study.

Proposed model
Consider the random experiment of tossing a coin independently until a head appears. Let the probability of obtaining a head in each toss be p, 0 < p < 1. If Y denotes the number of tails obtained before getting the first head, then Y has the following probability mass function (pmf).

p_Y(y) = p q^y, q = 1 − p, y = 0, 1, 2, ...
Here, Y is said to follow the geometric distribution with parameter p and we write Y ∼ G(p). The cdf of Y is given below.

F_Y(y) = 1 − q^(y+1), y = 0, 1, 2, ...
Since Y has non-negative support, F_Y(αy) = 0 for α < 0. In analogy with (2), we propose an extension of the geometric distribution with the following pmf.

p_X(x) = p q^x (1 − q^(αx+1)) / W(p, α), x = 0, 1, 2, ..., α ≥ 0, (5)

where W(p, α) = (1 − q + q² − q^(α+1)) / (1 − q^(α+1)) is the normalizing constant. We write X ∼ EG(p, α).
Now, we present a stochastic formulation of the EG(p, α) distribution through the following coin tossing experiment. Consider two persons A and B tossing two different coins independently, both coins having the same probability p of showing a head. Let the number of tails obtained by A and B before the first head be U and V, respectively. Given that V ≤ U, U is distributed as EG(p, 1). Generalizing this situation with an additional parameter α ∈ {0, 1, 2, ...}, EG(p, α) can be viewed as the distribution of U when it is known that V ≤ αU. The genesis of EG(p, 0), that is G(p), is clear on putting α = 0 in the conditioning event V ≤ αU. This formulation can be justified as follows. Clearly, U and V are iid G(p). Then

P(U = u | V ≤ αU) = P(U = u) P(V ≤ αu) / P(V ≤ αU) = p q^u (1 − q^(αu+1)) / W(p, α), u = 0, 1, 2, ...

The converse is also true and we discuss it further in Section 4. Though the formulation given above is for non-negative integer α, we study the proposed distribution for non-negative real α, as motivated by Azzalini's mechanism. Remark 1 is justified by Figure 1. For larger values of α and p = 0.5, the distribution has two modes, at 0 and 1. For smaller values of p, the shape of the distribution changes from the J-shape of the geometric distribution to a shape with a non-zero mode. For larger values of p, the distribution is able to accommodate outliers. The distribution has a longer tail for smaller values of p.
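This genesis can be verified by simulation. The following is a minimal Python sketch (ours, not the authors'): eg_pmf implements the pmf p q^x (1 − q^(αx+1))/W(p, α) of EG(p, α), and the empirical conditional distribution of U given V ≤ αU is compared with it for an integer α.

```python
import random

def eg_pmf(x, p, alpha):
    """pmf of EG(p, alpha): p * q**x * (1 - q**(alpha*x + 1)) / W(p, alpha)."""
    q = 1.0 - p
    w = (1.0 - q + q * q - q ** (alpha + 1)) / (1.0 - q ** (alpha + 1))
    return p * q ** x * (1.0 - q ** (alpha * x + 1)) / w

def geom(p, rng):
    """Number of tails before the first head."""
    k = 0
    while rng.random() > p:
        k += 1
    return k

rng = random.Random(1)
p, alpha = 0.5, 2
kept = [u for u, v in ((geom(p, rng), geom(p, rng)) for _ in range(200_000))
        if v <= alpha * u]                 # condition on V <= alpha * U
emp0 = kept.count(0) / len(kept)
print(emp0, eg_pmf(0, p, alpha))           # empirical vs exact: both near 0.35
```

With p = 0.5 and α = 2, the exact value of p_X(0) is 0.35, and the empirical frequency agrees to within Monte Carlo error.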

Distributional properties
The EG(p, α) model, proposed in the last section, is simple, having only two parameters, yet flexible, with some useful shapes. Therefore, in this section, we explore different properties of the proposed model in detail.
Thus, EG(p, α) is log-concave and hence, by Theorem 3 of Keilson and Gerber (1971), it is also strongly unimodal.

Recurrence relation and tail length
The probability recurrence relation plays an important role in computing the mass function and in determining the tail behaviour of a distribution. For the proposed model, the ratio of successive probabilities is

p_X(x + 1) / p_X(x) = q (1 − q^(α(x+1)+1)) / (1 − q^(αx+1)). (6)

From (6), we have the following two-term recurrence formula.

p_X(x + 1) = q (1 − q^(α(x+1)+1)) / (1 − q^(αx+1)) p_X(x), x = 0, 1, 2, ...
The behaviour of the tail of the distribution may be studied from the ratio of successive probabilities (Ong and Muthaloo 2012). As x → ∞, the RHS of (6) tends to q. Hence the distribution is short- or long-tailed depending on the value of p. When p → 1, we get a Poisson-type tail, and for p → 0, a longer tail appears.
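The limiting ratio is easy to check numerically: the ratio p_X(x+1)/p_X(x) computed from the pmf settles at q as x grows. A small Python sketch (ours; eg_pmf is a helper implementing the EG pmf):

```python
def eg_pmf(x, p, alpha):
    q = 1.0 - p
    w = (1.0 - q + q * q - q ** (alpha + 1)) / (1.0 - q ** (alpha + 1))
    return p * q ** x * (1.0 - q ** (alpha * x + 1)) / w

p, alpha = 0.3, 1.5
q = 1.0 - p
for x in (0, 1, 5, 50):
    ratio = eg_pmf(x + 1, p, alpha) / eg_pmf(x, p, alpha)
    print(x, ratio)       # the ratio decreases towards q = 0.7
```

For these parameters the ratio starts above 1 (consistent with a non-zero mode for small p) and decreases to q.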

Random sample generation
As an explicit form of the cdf is available, one can use the probability integral transform to generate a random observation x from EG(p, α) by solving F_X(x) = u for x, given the values of p, α and u ∼ Uniform(0, 1). Writing y = q^(x+1) and C = pq/(1 − q^(α+1)), after some algebraic manipulation, from (9) we obtain

C y^(α+1) − y + B = 0, (10)

where B = 1 − C − u W(p, α). It is obvious from (10) that, for the proposed distribution, a direct solution for x as outlined above is not feasible, since α + 1 is in general not an integer. Application of different numerical routines for solving (10) yields inconsistent and imprecise results. Note that y ∈ (0, 1) since q ∈ (0, 1). Thus, we prescribe the following exhaustive-search-based algorithm that provides consistently precise output.
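A simple variant of such an exhaustive search accumulates the pmf until the cumulative probability first reaches u; this is ordinary cdf inversion and avoids solving (10) altogether. A minimal Python sketch (ours, not the authors' algorithm; eg_pmf implements the EG pmf):

```python
import random

def eg_pmf(x, p, alpha):
    q = 1.0 - p
    w = (1.0 - q + q * q - q ** (alpha + 1)) / (1.0 - q ** (alpha + 1))
    return p * q ** x * (1.0 - q ** (alpha * x + 1)) / w

def eg_sample(p, alpha, rng):
    """Smallest x with F_X(x) >= u, for u ~ Uniform(0, 1)."""
    u = rng.random()
    x, cum = 0, eg_pmf(0, p, alpha)
    while cum < u and x < 10_000:      # cap guards against float round-off
        x += 1
        cum += eg_pmf(x, p, alpha)
    return x

rng = random.Random(7)
draws = [eg_sample(0.5, 2, rng) for _ in range(100_000)]
print(sum(draws) / len(draws))         # close to the exact mean 47/35 ≈ 1.3429 here
```

For p = 0.5 and α = 2, the exact mean works out to 47/35, so the sample mean provides a quick sanity check of the sampler.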

Generating functions
Using (5) in P_X(t) = Σ_{j=0}^∞ t^j p_X(j), we get the probability generating function (pgf) as

P_X(t) = [p/(1 − qt) − pq/(1 − q^(α+1) t)] / W(p, α), |t| < 1/q. (11)

Similarly, the moment generating function (mgf) of X is readily obtained in terms of Q(t) = p/(1 − q e^t), the mgf of G(p), as

M_X(t) = [Q(t) − q Q(t + α log q)] / W(p, α).

Moments
The r-th factorial moment can be obtained by differentiating the pgf in (11) r times with respect to t and putting t = 1.
In particular, the mean is

E(X) = [q/p − p q^(α+2)/(1 − q^(α+1))²] / W(p, α). (13)

The variance follows immediately from the second factorial moment as Var(X) = E[X(X − 1)] + E(X) − [E(X)]². Other higher-order moments can be derived similarly. The index of dispersion (ID) is given by ID = Var(X)/E(X).
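The moment formulas can be cross-checked by direct summation of the pmf, truncating the series where the tail is numerically negligible (a quick sketch of ours; eg_pmf implements the EG pmf):

```python
def eg_pmf(x, p, alpha):
    q = 1.0 - p
    w = (1.0 - q + q * q - q ** (alpha + 1)) / (1.0 - q ** (alpha + 1))
    return p * q ** x * (1.0 - q ** (alpha * x + 1)) / w

def eg_moments(p, alpha, nmax=2000):
    mean = sum(x * eg_pmf(x, p, alpha) for x in range(nmax))
    ex2 = sum(x * x * eg_pmf(x, p, alpha) for x in range(nmax))
    var = ex2 - mean * mean
    return mean, var, var / mean          # mean, variance, index of dispersion

m, v, idx = eg_moments(0.5, 2)
print(m, v, idx)   # 1.342857..., 2.323265..., 1.730091... (over-dispersed here)
```

For p = 0.5 and α = 2 the exact values are E(X) = 47/35 and Var(X) = 2846/1225, so the index of dispersion exceeds 1 for this parameter choice, in line with the over-dispersion noted later.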

Reliability properties
The reliability function of a discrete random variable T at time t is the probability that the survival time T is at least t. From (8), the reliability function of EG(p, α) is given by

S(t) = P(X ≥ t) = [q^t − C q^((α+1)t)] / W(p, α), t = 0, 1, 2, ..., (16)

where C = pq/(1 − q^(α+1)). The hazard function of a discrete random variable T at time point t is defined as the conditional probability of failure at t, given that the survival time T is at least t. Now, using (5) and (16), the hazard function is obtained as follows.

h(t) = p_X(t)/S(t) = p (1 − q^(αt+1)) / (1 − C q^(αt)), t = 0, 1, 2, ...
From Figure 3, it is seen that EG(p, α) has a monotonically increasing hazard rate.
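The monotonicity seen in Figure 3 can also be checked numerically from the pmf alone, computing h(t) = p_X(t)/P(X ≥ t) by direct summation (our own sketch; eg_pmf implements the EG pmf):

```python
def eg_pmf(x, p, alpha):
    q = 1.0 - p
    w = (1.0 - q + q * q - q ** (alpha + 1)) / (1.0 - q ** (alpha + 1))
    return p * q ** x * (1.0 - q ** (alpha * x + 1)) / w

def hazard(t, p, alpha, nmax=4000):
    surv = sum(eg_pmf(x, p, alpha) for x in range(t, nmax))   # P(X >= t)
    return eg_pmf(t, p, alpha) / surv

hs = [hazard(t, 0.5, 2) for t in range(12)]
print(all(hs[i] < hs[i + 1] for i in range(11)))              # True: increasing
```

For p = 0.5 and α = 2 the hazard increases from 0.35 towards its limit p = 0.5.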

Characterizations
We investigate three interesting stochastic characterizations of the proposed extension of the geometric distribution through the following results. Proof. If part: Discussed in Section 2.
Therefore, P(U = u) = q^u P(U = 0) = p q^u, and thus U ∼ G(p); U and V being identically distributed, V ∼ G(p) as well.
Theorem 2. Let X be any discrete random variable with support {0, 1, 2, ...}. For any α > 0, X follows EG(p, α) iff condition (26) holds. Proof. If part: We have seen that (26) can be written as (27). From this we write (28). Taking the difference of (27) and (28), we get the result. Theorem 3. Let X be any discrete random variable with support {0, 1, 2, ...}. For any α > 0, X follows EG(p, α) iff recurrence (30) holds, with initial condition (31). Proof. If part: The result readily follows from (30) and then from (31). Only if part: Can easily be checked.

Estimation
Here, we discuss four different methods of estimating the parameters of the EG(p, α) model. Consider a random sample X = (X_1, X_2, ..., X_n) from X ∼ EG(p, α). A realization of X is denoted by x = (x_1, x_2, ..., x_n). From the viewpoint of application, x is the working dataset. From (13), E(X) = µ_X(q, α) is a function of q and α.

Maximum likelihood estimation
From (5), the log-likelihood function is

l(q, α|x) = n log(1 − q) + n log(1 − q^(α+1)) − n log(1 − q + q² − q^(α+1)) + (Σ_{i=1}^n x_i) log q + Σ_{i=1}^n log(1 − q^(α x_i + 1)).

By differentiating l(q, α|x) with respect to q, we obtain one score function, s_1(q, α|x). Similarly, by differentiating l(q, α|x) with respect to α, we obtain the other score function, s_2(q, α|x). It is not feasible to solve s_1(q, α|x) = 0 and s_2(q, α|x) = 0 in q and α explicitly. In the absence of closed-form solutions, a common practice is to numerically maximize l(q, α|x) with respect to q and α. For the initialization required by numerical optimization routines, the estimates discussed in Section 5.1 and Section 5.2 are found suitable. However, Algorithm 3 is computationally less intensive compared to Algorithm 2. If one can afford such exhaustive searching for mere initialization, it is reasonable to employ a similar exhaustive search directly for finding the maximum likelihood estimates. This also removes the instability that might creep into numerical optimization routines. As in Section 5.1, transforming α to β = q^α, we obtain Algorithm 4 (for estimating (p, α) by the method of maximum likelihood).
• Step-1: Fix q ∈ q = {0.0001, 0.0002, ..., 0.9999}.
• Step-2: For each q ∈ q, find β ∈ β = {0.0001, 0.0002, ..., 0.9999, 1.0000} such that l(q, β|x) is maximum.
• Step-3: Among the pairs (q, β) obtained in Step-2, find (q̂_L, β̂_L) for which l(q, β|x) is maximum.
• Step-4: Return p̂_L = 1 − q̂_L and α̂_L = log(β̂_L)/log(q̂_L).
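Algorithm 4 can be sketched in Python as follows. The grid step is coarsened to 0.01 purely to keep the illustration fast, and the log-likelihood is written in terms of (q, β) using q^(αx+1) = q β^x; this is our own sketch of the exhaustive search, not the authors' code.

```python
import math, random

def eg_pmf(x, p, alpha):
    q = 1.0 - p
    w = (1.0 - q + q * q - q ** (alpha + 1)) / (1.0 - q ** (alpha + 1))
    return p * q ** x * (1.0 - q ** (alpha * x + 1)) / w

def loglik(q, beta, counts):
    # l(q, beta | x) with beta = q**alpha, so q**(alpha*x + 1) = q * beta**x
    n = sum(counts.values())
    sx = sum(x * c for x, c in counts.items())
    return (n * math.log(1 - q) + n * math.log(1 - q * beta)
            - n * math.log(1 - q + q * q - q * beta) + sx * math.log(q)
            + sum(c * math.log(1 - q * beta ** x) for x, c in counts.items()))

rng = random.Random(3)
def draw(p, alpha):                    # cdf-inversion sampler
    u, x, cum = rng.random(), 0, eg_pmf(0, p, alpha)
    while cum < u and x < 10_000:
        x += 1
        cum += eg_pmf(x, p, alpha)
    return x

data = [draw(0.5, 2) for _ in range(400)]
counts = {}
for x in data:
    counts[x] = counts.get(x, 0) + 1

grid = [i / 100 for i in range(1, 100)]           # q and beta grids
qh, bh = max(((q, b) for q in grid for b in grid),
             key=lambda qb: loglik(qb[0], qb[1], counts))
print(1 - qh, math.log(bh) / math.log(qh))        # estimates of p and alpha
```

On data simulated from EG(0.5, 2), the grid maximiser attains a log-likelihood at least as high as that of the true pair (q, β) = (0.5, 0.25), which lies on the grid; refining the step towards 0.0001 recovers Algorithm 4 as stated.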
The second-order partial derivatives are given as follows.
The Fisher information matrix for (q, α) is obtained from the expectations of the negative second-order partial derivatives. It can be approximated by substituting the maximum likelihood estimates (q̂_L, α̂_L). As n → ∞, the limiting distribution of √n(q̂_L − q, α̂_L − α) is bivariate normal with mean vector (0, 0) and variance-covariance matrix given by the inverse of the Fisher information matrix. From the relation p̂_L = 1 − q̂_L, it is evident that √n(p̂_L − p, α̂_L − α) also asymptotically follows a bivariate normal distribution with mean vector (0, 0), with the same variances and the sign of the covariance term reversed.
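In the absence of the closed-form second derivatives, the entries of the information matrix can be approximated numerically as the expected outer product of the score, with the score obtained by central finite differences of the log-pmf (our own sketch, not the authors' derivation):

```python
import math

def log_pmf(x, q, alpha):
    # log of the EG pmf, parameterised by q = 1 - p
    return (math.log(1 - q) + x * math.log(q)
            + math.log(1 - q ** (alpha * x + 1))
            + math.log(1 - q ** (alpha + 1))
            - math.log(1 - q + q * q - q ** (alpha + 1)))

def fisher_info(q, alpha, h=1e-6, nmax=2000):
    """Approximate I(q, alpha) = E[score score'] via central differences."""
    i11 = i12 = i22 = 0.0
    for x in range(nmax):
        w = math.exp(log_pmf(x, q, alpha))          # pmf weight
        if w < 1e-15:
            break
        gq = (log_pmf(x, q + h, alpha) - log_pmf(x, q - h, alpha)) / (2 * h)
        ga = (log_pmf(x, q, alpha + h) - log_pmf(x, q, alpha - h)) / (2 * h)
        i11 += w * gq * gq
        i12 += w * gq * ga
        i22 += w * ga * ga
    return [[i11, i12], [i12, i22]]

I = fisher_info(0.5, 2.0)
det = I[0][0] * I[1][1] - I[0][1] ** 2
print(I, det)      # positive diagonal and positive determinant
```

Inverting this 2×2 matrix gives the asymptotic variance-covariance matrix described above.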

Bayesian estimation
When prior information is available, the Bayesian method of estimation with informative prior distributions on the parameters proves to be efficient in many applications. However, in the absence of a reasonable prior guess, one may assume non-informative flat priors on the parameters and compute the posteriors. Here, the two parameters are p with support (0, 1) and α with support (0, ∞). For the informative case, a beta prior on p and a gamma prior on α are reasonable. Unfortunately, with these prior choices, the posterior and hence the full conditionals, though proper, do not belong to any identifiable family of distributions. Thus, for sampling from the marginal posteriors, the Hamiltonian Monte Carlo technique implemented in STAN (Carpenter, Gelman, Hoffman, Lee, Goodrich, Betancourt, Brubaker, Guo, Li, and Riddell 2017) is a viable alternative. Obviously, the proposed model is not available in STAN as a standard model. First, we present the function for writing the proposed model in STAN below, obtained directly from the pmf (5).

functions{
  real EG_lpmf(int y, real p, real alpha){
    real q = 1 - p;
    return log(p) + y * log(q) + log1m(q^(alpha * y + 1))
           + log1m(q^(alpha + 1)) - log(1 - q + q^2 - q^(alpha + 1));
  }
}

For both the informative and non-informative cases, the above code for the data-model remains the same. For the informative case, we set p ∼ Beta(δ_0, τ_0) and α ∼ Gamma(α_0, β_0). Now, we present the remaining part of the STAN code below, comprising the 'data' part containing information on the hyper-parameters.

data{
  int<lower=0> n;
  int<lower=0> y[n];
  real<lower=0> alpha0;
  real<lower=0> beta0;
  real<lower=0> delta0;
  real<lower=0> tau0;
}
parameters{
  real<lower=0,upper=1> p;
  real<lower=0> alpha;
}
model{
  p ~ beta(delta0, tau0);
  alpha ~ gamma(alpha0, beta0);
  for(i in 1:n)
    y[i] ~ EG(p, alpha);
}

The information required for the data block is to be supplied through R, interfacing STAN with R via the RStan library. For the non-informative case, we set π(p, α) ∝ 1, where π denotes the prior density. For this case, the remaining part of the STAN code is given below.
data{
  int<lower=0> n;
  int<lower=0> y[n];
}
parameters{
  real<lower=0,upper=1> p;
  real<lower=0> alpha;
}
model{
  for(i in 1:n)
    y[i] ~ EG(p, alpha);
}

After obtaining the posterior samples through the specified number of chains, usually the initial half of the samples is discarded, and the estimates of p and α, namely p̂_B and α̂_B, are obtained by averaging the corresponding posterior samples.
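As an independent cross-check on the STAN output under the flat prior π(p, α) ∝ 1, note that the posterior is then proportional to the likelihood, so it can be approximated on a grid in plain Python. This is our own sketch, not part of the paper; the data are illustrative and the α-grid is truncated at 10.

```python
import math

def log_lik(q, alpha, data):
    # log-likelihood of EG with q = 1 - p
    return sum(math.log(1 - q) + x * math.log(q)
               + math.log(1 - q ** (alpha * x + 1))
               + math.log(1 - q ** (alpha + 1))
               - math.log(1 - q + q * q - q ** (alpha + 1)) for x in data)

data = [0, 0, 1, 1, 1, 2, 2, 3, 4, 6]              # illustrative counts
ps = [i / 100 for i in range(1, 100)]
alphas = [j / 10 for j in range(1, 101)]

logs = {(p, a): log_lik(1 - p, a, data) for p in ps for a in alphas}
m = max(logs.values())
w = {k: math.exp(v - m) for k, v in logs.items()}  # unnormalised posterior
z = sum(w.values())
p_mean = sum(p * wt for (p, a), wt in w.items()) / z
a_mean = sum(a * wt for (p, a), wt in w.items()) / z
print(p_mean, a_mean)                              # approximate posterior means
```

The grid averages play the role of the posterior means p̂_B and α̂_B obtained from the MCMC samples.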

Applications
Here, two datasets are considered for modelling applications. The first dataset records the numbers of claims made by automobile insurance policy holders (Klugman, Panjer, and Willmot 2012). The second one records the numbers of ticks counted on sheep (Fisher 1941). For comparing the proposed model, we consider the following discrete distributions with the same support.
• N B(r, q): The negative binomial distribution discussed in Johnson, Kemp, and Kotz (2005) having the following mass function.
• The new discrete distribution discussed in Gómez-Déniz, Sarabia, and Ojeda (2011) having the following mass function.
• N GP L(α, θ): The new generalized Poisson-Lindley distribution discussed in Bhati et al. (2015) having the following mass function.
The method of maximum likelihood estimation has been employed for estimating (p, α) of the proposed distribution when modelling the datasets. The results for the numbers of claims and the numbers of ticks are presented in Table 1 and Table 2, respectively.
As pointed out earlier in Section 2, the geometric distribution is a particular case of the proposed extended geometric distribution when α = 0. Thus, for a given dataset, a natural question regarding the necessity of the additional parameter can be settled by testing the hypothesis H_0: α = 0 against H_1: α > 0.
Here, p̃ = 1/(1 + x̄) is the maximum likelihood estimate of p under H_0, and Λ denotes the likelihood ratio test statistic, that is, twice the difference between the maximized log-likelihoods under the full and the null models. H_0 is rejected at the 5% level of significance when Λ exceeds 3.841. For the dataset in Table 1, Λ = 1.2502, and for the dataset in Table 2, Λ = 10.434. Thus, for the first dataset, the additional parameter has no significant role. Still, from Table 1, it is observed that the proposed model marginally beats its closest competitors in terms of χ² and AIC. On the contrary, for the second dataset, the additional parameter in the proposed model plays an important role. Accordingly, from Table 2, it is seen that the proposed model beats its closest competitors substantially in terms of χ² and AIC.
Now we consider estimation of p and α for the two datasets under the Bayesian framework. We do not have any prior guess about the parameters for either dataset. Thus, for the sake of demonstration only, we estimate p and α from the Bayesian perspective using the flat prior. The flat prior considered here is π(p, α) ∝ 1, the non-informative prior mentioned in Section 5.4. For a detailed elicitation on the choice of priors, one may refer to Robert (2007). For both applications, we set the number of iterations to be 5000 and the warm-up length to be 2500. The 'iteration' and 'warm-up' are two important arguments of the stan function in R. While the former indicates the number of samples to be generated for each of the Markov chains, the latter stands for the size of the initial sample to be discarded. Multiple Markov chains are simulated simultaneously to analyze issues related to convergence. We fix the number of chains to be 4 and note that the Rhat value in both cases turns out to be 1. Rhat is a convergence diagnostic that compares the between- and within-chain estimates of the model parameters; an Rhat exceeding 1 is an indication of the chains not having mixed well. Thus, it is indicated that the posteriors are proper and the chains have converged. For the dataset in Table 1, we obtain p̂_B = 0.187 and α̂_B = 7.411. For the dataset in Table 2, we obtain p̂_B = 0.726 and α̂_B = 3.424. The within-chain variability of α̂_B is found to be higher compared to that of p̂_B.

Remarks
Azzalini's skewing technique is used for the first time in this work for generalizing a discrete distribution, with the geometric distribution as the baseline. Additionally, we have presented a genesis of the proposed extension. The proposed model is very flexible and possesses compact forms for its pmf, cdf, survival function, probability generating function and moments. It is an over-dispersed distribution and can be long-tailed, zero-inflated or zero-deflated for appropriate choices of the parameters. Three useful characterizations of the new distribution have been proved. Parameter estimation is discussed in detail, and the performance is found to be satisfactory in applications. Thus, it is envisaged that the proposed two-parameter model will find application in count data modelling. Of course, the idea can be used for other count distributions as well. Further work on numerical comparison of the performances of the different estimation procedures is warranted. Modelling count data with covariates using an appropriately re-parameterized version of the proposed model will also be taken up in a follow-up work.