A Skewed Model Combining Triangular and Exponential Features: The Two-faced Distribution and its Statistical Properties

Abstract: A new continuous distribution model is introduced, joining triangular and exponential features, respectively on the left and right side of a hinge point. The cumulative distribution function is derived, as well as the first three moments. Expected values and the Pearson index of skewness are tabulated. A possible step-by-step approach to parameter estimation is outlined. An application to Italian geographical data is given, referring to a set of municipalities classified by population, showing a very satisfactory goodness of fit.


Introduction
Research on probability density models is ongoing, since many researchers look for simple (or fairly simple) functions that can fit a set of experimental data. Nowadays a great deal of quantitative information is available from many fields, from biology to sociology, from chemistry to economics, from sports to music, and the list could go on. We therefore often have to deal with "new" sets of data, and it is increasingly important to have a large choice of probability models, discrete or continuous, in order to find the best-fitting one.
Recently, the triangular distribution, a very simple and well-known model, has been reconsidered and developed by Johnson (1997), Johnson and Kotz (1999), and Van Dorp and Kotz (2002b). One of the most recently developed families of densities, the STSP family proposed by Van Dorp and Kotz (2002a), includes the triangular distribution as a particular case. Here we propose a mixed distribution that has triangular and exponential features at the same time. The new distribution is then fitted to a set of Italian population data.

Two-faced Distribution: Characterizing Functions and Moments
Suppose that a random variable Y has a probability density function (pdf) that increases linearly up to a modal point θ and then decreases exponentially. We call such a variable, having a triangular left side and an exponential right side, a two-faced (TF) variable. The original parameters are θ (hinge point) and λ (exponential coefficient).
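The pdf of such a TF distribution follows the pattern

$$f(y) = \begin{cases} \alpha\, y, & 0 \le y \le \theta, \\ \beta\, e^{-\lambda y}, & y > \theta, \end{cases} \tag{1}$$

where

$$\alpha = \frac{2\lambda}{\theta(\lambda\theta + 2)}, \qquad \beta = \alpha\,\theta\, e^{\lambda\theta}.$$

The values of α and β are determined by two necessary conditions:
1. the integral of the pdf has to be equal to 1;
2. the value at the hinge point has to be unique, i.e., the pdf has to be continuous at θ.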
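To simplify the notation, it is useful to change the second parameter. Let κ = λθ denote a shape parameter. With this notation we rewrite (1) as

$$f(y) = \begin{cases} \dfrac{2\kappa}{(\kappa+2)\,\theta^{2}}\, y, & 0 \le y \le \theta, \\[2ex] \dfrac{2\kappa}{(\kappa+2)\,\theta}\, \exp\left\{-\kappa\,\dfrac{y-\theta}{\theta}\right\}, & y > \theta. \end{cases} \tag{2}$$

In Figure 1 we represent the pdf (2) for various values of κ, and in Figure 2 for different values of θ.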
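As a quick numerical check of the density, the sketch below codes the pdf in the (θ, κ) parametrization and verifies that it integrates to one and is continuous at the hinge point. The names `tf_pdf` and `midpoint_integral`, the parameter values, and the closed form used (a linear face on [0, θ] joined to an exponential face, with constants fixed by the two conditions stated above) are assumptions of this illustration, not code from the paper.

```python
import math

def tf_pdf(y, theta, kappa):
    """Two-faced density with hinge point theta and shape kappa = lambda * theta."""
    if y < 0.0:
        return 0.0
    peak = 2.0 * kappa / (theta * (kappa + 2.0))   # density value at the hinge point
    if y <= theta:
        return peak * y / theta                     # linear (triangular) face
    return peak * math.exp(-kappa * (y - theta) / theta)  # exponential face

def midpoint_integral(f, a, b, steps=20000):
    """Plain midpoint rule; adequate for a sanity check only."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

theta, kappa = 2.0, 1.5                             # illustrative values
left = midpoint_integral(lambda y: tf_pdf(y, theta, kappa), 0.0, theta)
right = midpoint_integral(lambda y: tf_pdf(y, theta, kappa), theta, 40.0 * theta)
total = left + right                                # should be close to 1
print(left, right, total)
```

The left-face mass printed first should agree with κ/(κ + 2), which is the value of the cdf at the hinge point.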
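The cumulative distribution function (cdf) is directly derived from (2) as

$$F(y) = \begin{cases} \dfrac{\kappa}{\kappa+2}\left(\dfrac{y}{\theta}\right)^{2}, & 0 \le y \le \theta, \\[2ex] 1 - \dfrac{2}{\kappa+2}\, \exp\left\{-\kappa\,\dfrac{y-\theta}{\theta}\right\}, & y > \theta. \end{cases} \tag{3}$$

At the hinge point θ, in particular, the pdf and the cdf are

$$f(\theta) = \frac{2\kappa}{(\kappa+2)\,\theta}, \qquad F(\theta) = \frac{\kappa}{\kappa+2}.$$

The expected value E(Y) is calculated by considering separately (and then adding) the two parts (the two "faces") of the TF distribution:

$$E(Y) = \int_{0}^{\theta} \frac{2\kappa}{(\kappa+2)\,\theta^{2}}\, y^{2}\, dy + \int_{\theta}^{\infty} \frac{2\kappa}{(\kappa+2)\,\theta}\, y\, \exp\left\{-\kappa\,\frac{y-\theta}{\theta}\right\} dy = \theta\, \frac{2\kappa^{2}+6\kappa+6}{3\kappa(\kappa+2)}. \tag{5}$$

Therefore, moving the hinge point to the right (i.e., increasing the value of θ) makes the expected value E(Y) larger, while E(Y) decreases as κ increases. In particular, E(Y) = θ when κ = √6, since

$$E(Y) - \theta = \theta\, \frac{6-\kappa^{2}}{3\kappa(\kappa+2)}.$$

In the same way we derive the second moment

$$E(Y^{2}) = \theta^{2}\, \frac{\kappa^{3}+4\kappa^{2}+8\kappa+8}{2\kappa^{2}(\kappa+2)}$$

and determine the variance as

$$\mathrm{Var}(Y) = E(Y^{2}) - [E(Y)]^{2} = \theta^{2}\, \frac{\kappa^{4}+6\kappa^{3}+24\kappa^{2}+72\kappa+72}{18\kappa^{2}(\kappa+2)^{2}}.$$

The variance, like the mean, increases with θ and decreases as κ increases.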
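The statements about the cdf, the mean and the variance can be checked numerically. The following sketch assumes the closed forms obtained by integrating the two faces of the density separately; the function names (`tf_cdf`, `tf_mean`, `tf_second_moment`, `tf_variance`) and the parameter values are hypothetical choices made for this illustration.

```python
import math

def tf_cdf(y, theta, kappa):
    """Cdf of the two-faced distribution (assumed closed form)."""
    if y <= 0.0:
        return 0.0
    if y <= theta:
        return kappa / (kappa + 2.0) * (y / theta) ** 2
    return 1.0 - 2.0 / (kappa + 2.0) * math.exp(-kappa * (y - theta) / theta)

def tf_mean(theta, kappa):
    """E(Y): sum of the contributions of the two faces."""
    return theta * (2.0 * kappa**2 + 6.0 * kappa + 6.0) / (3.0 * kappa * (kappa + 2.0))

def tf_second_moment(theta, kappa):
    """E(Y^2), derived in the same way as the mean."""
    num = kappa**3 + 4.0 * kappa**2 + 8.0 * kappa + 8.0
    return theta**2 * num / (2.0 * kappa**2 * (kappa + 2.0))

def tf_variance(theta, kappa):
    return tf_second_moment(theta, kappa) - tf_mean(theta, kappa) ** 2

theta = 3.0                                   # illustrative hinge point
print(tf_mean(theta, math.sqrt(6.0)))         # approximately theta when kappa = sqrt(6)
print(tf_cdf(theta, theta, 2.0))              # mass to the left of the hinge, kappa / (kappa + 2)
print(tf_variance(theta, 1.0), tf_variance(theta, 4.0))  # variance shrinks as kappa grows
```

For very large κ the exponential face becomes negligible and the variance approaches θ²/18, the variance of a right-angled triangular density on [0, θ], which is a convenient plausibility check.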
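We now compute the third moment, useful for evaluating the skewness, and get

$$E(Y^{3}) = 2\theta^{3}\, \frac{\kappa^{4}+5\kappa^{3}+15\kappa^{2}+30\kappa+30}{5\kappa^{3}(\kappa+2)}.$$

The well-known Pearson index of skewness is defined as

$$\gamma = \frac{\mu_{3}}{\sigma^{3}}, \qquad \mu_{3} = E\left[(Y-E(Y))^{3}\right], \qquad \sigma^{2} = \mathrm{Var}(Y).$$

The resulting formula for γ is quite complicated and inconvenient to use. Thus, we report in Table 2 the first three relative moments as functions of κ. Looking at Table 2, we see that the moments increase with the parameter θ and decrease as κ increases.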
The skewness index depends only on κ; it is positive for small values of κ and becomes negative for larger values. The value of κ corresponding to γ = 0 is approximately κ = 6.15. Obviously, even in this case the pdf is not perfectly symmetric, since it has a bounded left tail and an unbounded right one.
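The sign change of the skewness index can be located numerically. The sketch below is only an illustration: it assumes the closed-form raw moments that follow from integrating the two faces of the density, sets θ = 1 (γ does not depend on θ), and uses a hand-written bisection; `raw_moments`, `skewness` and `bisect_root` are hypothetical names.

```python
def raw_moments(kappa):
    """First three raw moments of Y / theta, i.e. with theta = 1 (assumed closed forms)."""
    m1 = (2.0 * kappa**2 + 6.0 * kappa + 6.0) / (3.0 * kappa * (kappa + 2.0))
    m2 = (kappa**3 + 4.0 * kappa**2 + 8.0 * kappa + 8.0) / (2.0 * kappa**2 * (kappa + 2.0))
    m3 = (2.0 * (kappa**4 + 5.0 * kappa**3 + 15.0 * kappa**2 + 30.0 * kappa + 30.0)
          / (5.0 * kappa**3 * (kappa + 2.0)))
    return m1, m2, m3

def skewness(kappa):
    """Pearson index gamma = mu3 / sigma^3."""
    m1, m2, m3 = raw_moments(kappa)
    variance = m2 - m1**2
    mu3 = m3 - 3.0 * m1 * m2 + 2.0 * m1**3      # third central moment
    return mu3 / variance**1.5

def bisect_root(f, lo, hi, tol=1e-10):
    """Bisection; assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0.0) == (f_lo > 0.0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

kappa_zero = bisect_root(skewness, 5.0, 8.0)    # bracket chosen by inspection
print(kappa_zero)                                # should land near the value 6.15 quoted above
print(skewness(1.0), skewness(20.0))             # positive for small kappa, negative for large
```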

Parameter Estimation
Let y_1, ..., y_n be an iid sample from a two-faced distribution. There are two parameters to be estimated. If θ is known, the maximum likelihood (ML) estimator of κ is

$$\hat{\kappa} = -1 + \sqrt{1 + \frac{2n}{S}}, \qquad S = \sum_{i:\, y_i > \theta} \frac{y_i - \theta}{\theta}. \tag{9}$$

The more complicated problem is to estimate θ, since the form of the pdf depends directly on its value. Since θ denotes the modal point, a possible estimating procedure is the following:
1. Find the part of the sample where the points are densest, thus identifying a subset of, say, n* elements with values not too far from each other.
2. Let θ be equal to each of the n* points in the subset in turn and calculate the corresponding value of κ by applying (9). This gives n* pairs of parameter estimates (θ_j, κ_j), j = 1, ..., n*.
3. Calculate the log-likelihood for each pair (θ_j, κ_j), and select the pair (θ*, κ*) with the largest log-likelihood.
4. Change the value of θ* slightly in both directions. If the resulting likelihood does not increase, we keep (θ*, κ*) as the ML estimate.
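Steps 2 and 3 of the procedure can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: the closed form used for the conditional estimate of κ is the root of the score equation obtained from the (θ, κ) density, the function names are hypothetical, and the data are the twelve-point sample and five-point dense kernel of the example that follows.

```python
import math

def kappa_hat(sample, theta):
    """Conditional ML estimate of kappa for a given hinge point theta.

    Requires at least one observation strictly above theta.
    """
    n = len(sample)
    s = sum((y - theta) / theta for y in sample if y > theta)
    return -1.0 + math.sqrt(1.0 + 2.0 * n / s)

def log_likelihood(sample, theta, kappa):
    """Log-likelihood of a positive sample under the two-faced density."""
    peak = 2.0 * kappa / (theta * (kappa + 2.0))    # density at the hinge point
    total = 0.0
    for y in sample:
        if y <= theta:
            total += math.log(peak * y / theta)      # triangular face
        else:
            total += math.log(peak) - kappa * (y - theta) / theta  # exponential face
    return total

def fit_two_faced(sample, kernel):
    """Try each kernel point as theta and keep the pair with the largest log-likelihood."""
    best = None
    for theta in kernel:
        kappa = kappa_hat(sample, theta)
        ll = log_likelihood(sample, theta, kappa)
        if best is None or ll > best[2]:
            best = (theta, kappa, ll)
    return best

sample = [2.5, 3.4, 4.1, 4.8, 5.2, 5.3, 5.9, 6.8, 7.2, 8.9, 10.5, 13.4]
kernel = [4.1, 4.8, 5.2, 5.3, 5.9]                   # the "dense kernel" of the example
theta_star, kappa_star, ll_star = fit_two_faced(sample, kernel)
print(theta_star, round(kappa_star, 3), round(ll_star, 3))
```

Under these assumptions the scan selects θ = 5.3 and reproduces, up to rounding, the estimate κ = 1.662 and the log-likelihood of about −29.299 reported in the example. Step 4 could be added by evaluating `log_likelihood` on a fine grid of θ values around `theta_star`.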
Example: Consider the sample (2.5, 3.4, 4.1, 4.8, 5.2, 5.3, 5.9, 6.8, 7.2, 8.9, 10.5, 13.4) with n = 12 values. Its central part (five sample points, with values ranging from 4.1 to 5.9) is the "dense kernel"; for each kernel point we compute the corresponding estimate of κ and the log-likelihood. Results are shown in Table 3. The log-likelihood increases until it reaches its maximum at θ = 5.3, then decreases again. It is easy to verify that neighboring values of θ all give log-likelihood values below −29.299. Therefore, we can take θ = 5.3 and κ = 1.662 as the ML estimates of the parameters θ and κ.

Application

We found a good application of the two-faced model in the distribution of 328 (out of 341) municipalities of the Emilia Romagna region. We excluded 13 cities with more than 50000 residents, since the model does not seem to fit big cities well. The region is located between the Po river and the Apennine ridge. Its biggest city is Bologna, with about 375000 inhabitants, and there are many middle-sized towns. The total population is around 4 million people, corresponding to 7% of the Italian population. The working variable Y denotes the population at the 2001 Census (in thousands of inhabitants). We tried to find the hinge point by analyzing the modal range of the population. The densest interval clearly corresponds to small municipalities between Y = 1 and Y = 3. We divided this subset of small municipalities into classes of width 0.2 (200 people) and identified two main classes: "2.0 to 2.2" with 14 units and "2.2 to 2.4" with 12 units. These classes are noticeably denser than the neighboring classes. We then chose the intermediate value as hinge point, i.e., θ* = 2.2. Applying (9) we derive the corresponding ML estimate κ = 0.384. The estimated model (see formula 2) is

$$\hat{f}(y) = \begin{cases} \dfrac{2 \cdot 0.384}{2.384 \cdot 2.2^{2}}\, y \approx 0.0666\, y, & 0 \le y \le 2.2, \\[2ex] \dfrac{2 \cdot 0.384}{2.384 \cdot 2.2}\, \exp\left\{-0.384\, \dfrac{y-2.2}{2.2}\right\} \approx 0.1464\, e^{-0.1745\,(y-2.2)}, & y > 2.2. \end{cases}$$

The resulting cdf and the empirical cdf are compared in Figure 3.
Considering the many factors that may affect the distribution of the population, the similarity between the two curves is quite impressive. The maximum absolute deviation between the model cdf and the empirical cdf, i.e., the Kolmogorov distance, is 0.0394.
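The Kolmogorov distance can be computed as sketched below. Since the municipal data are not reproduced here, the illustration uses the small twelve-point sample of the estimation example with its estimates θ = 5.3 and κ = 1.662; the function names and the closed-form cdf are assumptions of this sketch, and the sample values are assumed to be distinct.

```python
import math

def tf_cdf(y, theta, kappa):
    """Cdf of the two-faced distribution (assumed closed form)."""
    if y <= 0.0:
        return 0.0
    if y <= theta:
        return kappa / (kappa + 2.0) * (y / theta) ** 2
    return 1.0 - 2.0 / (kappa + 2.0) * math.exp(-kappa * (y - theta) / theta)

def kolmogorov_distance(sample, theta, kappa):
    """Largest gap between the empirical cdf and the model cdf.

    The empirical cdf jumps at each observation, so the model is compared
    with its value just before (i / n) and just after ((i + 1) / n) the jump.
    """
    ys = sorted(sample)
    n = len(ys)
    distance = 0.0
    for i, y in enumerate(ys):
        model = tf_cdf(y, theta, kappa)
        distance = max(distance, abs((i + 1) / n - model), abs(i / n - model))
    return distance

sample = [2.5, 3.4, 4.1, 4.8, 5.2, 5.3, 5.9, 6.8, 7.2, 8.9, 10.5, 13.4]
d = kolmogorov_distance(sample, theta=5.3, kappa=1.662)
print(d)
```

With only twelve observations the distance is necessarily coarser than the value obtained for the 328 municipalities, because the empirical cdf moves in steps of 1/12.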
To give another visual representation of the good fit of the model, a quantile-quantile plot is shown in Figure 4, with the observed centiles on the abscissa and the corresponding centiles of the fitted two-faced model on the ordinate. A perfectly fitting model would give a straight line. In the plot the correspondence is evident, and slight deviations from linearity are found only in the right tail (upper centiles). This gives further evidence of the satisfactory goodness of fit of the proposed model.
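The model centiles needed for such a plot follow from inverting the cdf on each face. The sketch below is a hedged illustration: `tf_quantile` and `qq_points` are hypothetical names, the inverse is derived from the assumed closed-form cdf, and the plotting positions (i + 0.5)/n are a common convention chosen here, not necessarily the one used for Figure 4.

```python
import math

def tf_cdf(y, theta, kappa):
    """Cdf of the two-faced distribution (assumed closed form)."""
    if y <= 0.0:
        return 0.0
    if y <= theta:
        return kappa / (kappa + 2.0) * (y / theta) ** 2
    return 1.0 - 2.0 / (kappa + 2.0) * math.exp(-kappa * (y - theta) / theta)

def tf_quantile(p, theta, kappa):
    """Inverse cdf for 0 < p < 1, obtained by inverting each face separately."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    p_hinge = kappa / (kappa + 2.0)              # probability mass left of the hinge
    if p <= p_hinge:
        return theta * math.sqrt(p / p_hinge)    # triangular face
    return theta * (1.0 - math.log((1.0 - p) * (kappa + 2.0) / 2.0) / kappa)

def qq_points(sample, theta, kappa):
    """Pairs (observed value, model quantile) for a quantile-quantile plot."""
    ys = sorted(sample)
    n = len(ys)
    return [(y, tf_quantile((i + 0.5) / n, theta, kappa)) for i, y in enumerate(ys)]

theta, kappa = 2.2, 0.384                        # estimates reported for the municipal data
centiles = [tf_quantile(k / 100.0, theta, kappa) for k in range(1, 100)]
print(centiles[49])                              # model median, in thousands of inhabitants
```

Points returned by `qq_points` that lie close to the diagonal indicate agreement between the data and the fitted model.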

Concluding Remarks
The two-faced distribution may be suitable for fitting natural or social phenomena in which there is a "hinge point" separating "small" from "large" values: the density reaches its maximum at this point and then decreases exponentially, with a very long tail. The application to the Emilia Romagna municipal population data gives a hinge point of 2.2, with some observations that are 15 to 20 times larger than the modal value. It would be interesting to fit the same distribution to other geographical data in which there is a large majority of small and middle-sized observations and a small minority of very large ones. A similarly uneven distribution may well be found in some economic and social variables, and it would be interesting to check the adequacy of the present model there. The TF pattern could also be extended to other "classical" models in which the left and right tails have different shapes. The triangular distribution may be "faced" with other models, such as the Gaussian or the Gamma (a triangular-Gamma model would include the TF distribution). Another challenging problem would be to find real phenomena that behave in this way. Scientific research, as always, has a primary role both in developing (and improving) techniques and models and in finding connections with real life. Both aspects are of great importance and are strictly connected to each other.

Figure 3: Population of municipalities in Emilia Romagna: comparison of the theoretical and the empirical cumulative distribution functions

Table 1: Expected value E(Y) for some parameter values

Table 2: First three relative moments and some related indices, where μ3 = E[(Y − E(Y))³]; it depends, essentially, on the right-tail values of the sample.

Table 3: ML estimates of both TF parameters