Analysis of Count Time Series: A Bayesian GARMA(p, q) Approach

Extensions of the Autoregressive Moving Average, ARMA(p, q), class for modeling non-Gaussian time series have been proposed in the literature in recent years, with applications to phenomena such as counts and rates. One of them is the Generalized Autoregressive Moving Average, GARMA(p, q), class, which is supported by Generalized Linear Models theory and has been studied from the Bayesian perspective. This paper studies models for time series of counts using the Poisson, Negative binomial, and Poisson inverse Gaussian distributions, adopting the Bayesian framework. To do so, we carried out a simulation study and, in addition, presented a practical application and evaluation of these models using a set of real data, corresponding to the number of vehicle thefts in Brazil.


Introduction
The literature on time series was pointed out by Cox, Gudmundsson, Lindgren, Bondesson, Harsaae, Laake, Juselius, and Lauritzen (1981) and Khandelwal, Adhikari, and Verma (2015) as one of the most active topics in statistics, being employed in economics, physics, and engineering, for example. In the time domain, the ARMA(p, q) class, proposed by Box and Jenkins (1976), is widely used in applied studies, being an efficient option for producing forecasts (Silva 2020).
Despite the usefulness and efficiency of the ARMA(p, q) models, this class is considered more appropriate for fitting Gaussian data, assuming a linear correlation structure (Silva 2020). In financial or count time series, these assumptions can be restrictive, since such series can exhibit asymmetric behavior and heavier tails than the Gaussian distribution.
According to Dunsmuir and Scott (2015); Davis, Fokianos, Holan, Joe, Livsey, Lund, Pipiras, and Ravishanker (2021); and Kong and Lund (2022), there has been recent interest in proposing and building models for non-Gaussian time series. This theme was discussed by Cox et al. (1981), who specified observation-driven models, in which the dependence structure is related to the past of the series, and parameter-driven models, in which the dependence follows a latent process. For further details about observation-driven and parameter-driven models, see Davis et al. (2021).
Concerning observation-driven models, Benjamin, Rigby, and Stasinopoulos (2003) introduced the GARMA(p, q) class considering that the conditional distribution, given the past information, belongs to the exponential family, with the conditional mean related to a linear predictor that contains explanatory variables and the dependence structure in the ARMA(p, q) form.
A Bayesian analysis of the GARMA(p, q) class was presented and discussed in the study of Andrade, Andrade, and Ehlers (2015), who used conditional distributions such as the Poisson, Binomial, and Negative binomial. The authors reported gains from this approach in terms of point estimation and the width of the credible intervals of the parameters when modeling count data.
When dealing with count time series, overdispersion is commonly observed (Barreto-Souza 2017; Gonçalves and Barreto-Souza 2020), a feature that makes the Poisson unsuitable, since it is equidispersed. A solution to this problem is to consider mixed Poisson distributions such as the Negative binomial (NB) and the Poisson inverse Gaussian (PIG) (Dean, Lawless, and Willmot 1989). In addition, alternative approaches for modeling correlated counts with different conditional distributions, such as the Conway-Maxwell-Poisson and the Bernoulli-Geometric, have arisen in the literature (Davis et al. 2021; Sales, Alencar, and Ho 2022).
The PIG distribution was introduced into the time series context by Barreto-Souza (2017) using the INteger-valued AutoRegressive (INAR) structure. The use of the PIG in regression models, however, had already been proposed by Dean et al. (1989) for modeling insurance data, the distribution being considered attractive in the presence of heavy tails. For further information on modeling insurance data using PIG regression, see Willmot (1987).
Our main goal in this paper is to study models for time series of counts using the Poisson, Negative binomial, and Poisson inverse Gaussian distributions, adopting the Bayesian framework used by Andrade et al. (2015) and extending it to the Poisson inverse Gaussian. As an application example, we analyze the performance of these models by forecasting the number of vehicle thefts in the region of Campinas, Brazil. This paper is organized as follows: Section 2 describes the models and Section 3 describes the Bayesian analysis. Section 4 focuses on computational and simulation results. In Section 5, we present the real data application. Some remarks and topics for future research are discussed in Section 6.
The GARMA(p, q) class

The Autoregressive and Moving Average components that compose the dependence structure (τ_t) are represented by the vectors Φ = (ϕ_1, . . ., ϕ_p)⊤ and Θ = (θ_1, . . ., θ_q)⊤, respectively, where p and q are the Autoregressive and Moving Average orders of an ARMA(p, q) process. The vector of coefficients, which is related to the r explanatory variables in x_t, is expressed by β = (β_1, β_2, . . ., β_r)⊤. Also consider that the likelihood function, denoted by L(θ | Y), is given by the product of the conditional distributions, and it can be approximated as shown in Equation 1:

L(θ | Y) ≈ ∏_{t=m+1}^{n} p(y_t | F_{t−1}),    (1)

where m = max(p, q) is the number of first observations of Y on which we condition and θ represents the set of parameters of each model.

Poisson
In this model, we assume that the conditional distribution of each observation, given the past information, follows a Poisson distribution. So, p(y_t | F_{t−1}) is given by Equation 2:

p(y_t | F_{t−1}) = e^{−µ_t} µ_t^{y_t} / y_t!,    (2)

with Ω = {µ_t | µ_t > 0} and y_t ∈ N. As known, the Poisson belongs to the exponential family with the logarithm as canonical link function, making it a submodel of the GARMA(p, q) class, where E(y_t | F_{t−1}) = µ_t. Considering the link function, the linear predictor of this model is given by:

log µ_t = x_t⊤β + ∑_{j=1}^{p} ϕ_j {log y_{t−j} − x_{t−j}⊤β} + ∑_{j=1}^{q} θ_j {log y_{t−j} − log µ_{t−j}},

which can also be written using the back-shift operator (B) (Briet, Amerasinghe, and Vounatsou 2013).
In some cases, it is necessary to insert a threshold, c, to guarantee the existence of the link function. A possibility is to replace y_{t−j} by y*_{t−j} = max(y_{t−j}, c), c ∈ (0, 1). Given the conditioning of L(β, Φ, Θ | Y) on the first m observations of the series, we can suppose that the m first errors are zero. Moreover, the residuals associated with the moving average term can be constructed by using the Pearson residuals, residuals on the original scale, or residuals on the predictor scale (Benjamin et al. 2003; Rocha and Cribari-Neto 2008).

Negative binomial
We suppose that y_t | F_{t−1} follows a Negative binomial distribution, i.e., NB(µ_t, σ). The conditional density is given by Equation 3:

p(y_t | F_{t−1}) = [Γ(y_t + 1/σ) / (Γ(y_t + 1) Γ(1/σ))] (σµ_t / (1 + σµ_t))^{y_t} (1 + σµ_t)^{−1/σ}.    (3)

There are several forms of the Negative binomial in the literature. Andrade et al. (2015), for example, considered the form where ν = 1/σ is supposed to be known, which belongs to the exponential family and results in a GARMA(p, q) submodel.
We relaxed the assumption of the GARMA(p, q) class that p(y_t | F_{t−1}) belongs to the exponential family, assuming that the dispersion parameter (σ) is unknown. Therefore, the conditional mean is given by E(y_t | F_{t−1}) = µ_t and the conditional variance by V(y_t | F_{t−1}) = µ_t(1 + σµ_t). Considering the logarithmic link function to ensure that µ_t ∈ Ω, the linear predictor can be written exactly as for the Poisson model, with the same restriction replacing y_{t−j} by y*_{t−j} = max(y_{t−j}, c), c ∈ (0, 1). Given the conditioning of L(β, Φ, Θ, σ | Y), the m first errors are assumed to be zero, following the conditions established for the Poisson model.
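The mean-variance relation above can be made concrete by generating NB(µ_t, σ) draws through a Poisson-gamma mixture whose mixing variable has unit mean and variance σ (a sketch; the function name and the numeric values are illustrative):

```python
import numpy as np

def nb_draw(mu, sigma, rng):
    """NB(mu, sigma) via the Poisson-gamma mixture:
    g ~ Gamma(1/sigma, scale=sigma) has E(g)=1 and V(g)=sigma,
    so y | g ~ Poisson(mu*g) has E(y)=mu and V(y)=mu*(1+sigma*mu)."""
    g = rng.gamma(shape=1.0 / sigma, scale=sigma)
    return rng.poisson(mu * g)

rng = np.random.default_rng(0)
draws = np.array([nb_draw(5.0, 0.25, rng) for _ in range(100_000)])
# sample mean ~ 5 and sample variance ~ 5*(1 + 0.25*5) = 11.25
```

The law of total variance gives V(y) = µ E(g) + µ² V(g) = µ(1 + σµ), matching the conditional variance stated above.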

Poisson inverse Gaussian
Supposing that y_t | F_{t−1} follows a Poisson inverse Gaussian, PIG(µ_t, σ), the conditional density involves the modified Bessel function of the third kind,

K_λ(t) = (1/2) ∫_0^∞ x^{λ−1} exp{−(t/2)(x + x^{−1})} dx.

The conditional mean and variance are similar to those of the Negative binomial, that is, E(y_t | F_{t−1}) = µ_t and V(y_t | F_{t−1}) = µ_t(1 + µ_t σ). We consider the same linear predictor as in the Poisson model and the approximated likelihood is analogous to the Negative binomial model.
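Analogously to the Negative binomial, a PIG(µ_t, σ) draw can be sketched as a Poisson mixture whose mixing variable is inverse-Gaussian with unit mean and variance σ, using NumPy's `wald` generator (the function name and numeric values are illustrative):

```python
import numpy as np

def pig_draw(mu, sigma, rng):
    """PIG(mu, sigma) via the Poisson-inverse-Gaussian mixture:
    g ~ IG(mean=1, scale=1/sigma) has E(g)=1 and V(g)=mean^3/scale=sigma,
    so y | g ~ Poisson(mu*g) has E(y)=mu and V(y)=mu*(1+sigma*mu)."""
    g = rng.wald(1.0, 1.0 / sigma)   # NumPy's Wald is the inverse-Gaussian
    return rng.poisson(mu * g)

rng = np.random.default_rng(1)
draws = np.array([pig_draw(5.0, 0.25, rng) for _ in range(100_000)])
# sample mean ~ 5 and sample variance ~ 5*(1 + 0.25*5) = 11.25
```

The only change from the NB sketch is the mixing law, which is what produces the heavier tails mentioned above.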
Bayesian analysis

Due to the complexity of the likelihood functions of the Poisson, Negative binomial, and Poisson inverse Gaussian models, iterative methods can be used to estimate the parameters. In this paper, we adopt a Bayesian analysis, justified by the inferential gains reported in Andrade et al. (2015) and by the possibility of incorporating prior knowledge.
For the parameters that correspond to the effects of explanatory variables, we suppose that each component of β is normally distributed, i.e.:

β_j ∼ N(µ_j, τ_j),

for j = {1, . . ., r}, with µ_j and τ_j the hyperparameters associated with β_j. A similar structure was adopted for Φ and Θ, that is:

ϕ_k ∼ N(µ_k, τ_k) and θ_l ∼ N(µ_l, τ_l),

where k = {1, . . ., p} and l = {1, . . ., q}. The hyperparameters were fixed at values that yield vague priors. For the dispersion parameter of the Negative binomial and Poisson inverse Gaussian models, we considered a non-informative Gamma(s, a) prior with hyperparameters a = 1 and s = 100^{−1}, that is given by:

π(σ) ∝ σ^{s−1} e^{−aσ}, σ > 0.

In Figure 1, we present the behavior of the prior densities of β_j (a) and σ (b). Analogously, the behavior of the densities of ϕ_k and θ_l is similar to that shown in Figure 1(a).
Considering the algebraic complexity of the joint posterior distribution, namely π(θ | Y), the inference procedure can be performed by using Markov chain Monte Carlo (MCMC) methods, such as the Metropolis-Hastings (MH) algorithm to sample from the joint posterior distribution, or by using iterative algorithms for numerical optimization. According to Korn, Korn, and Kroisandt (2010), the basic idea of MCMC methods is to draw samples from a target distribution by simulating a Markov chain whose stationary distribution is the target. The MH is a popular MCMC algorithm, proposed by Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller (1953) and generalized by Hastings (1970). It builds a chain started in an arbitrary state, with a transition probability p_{i,j} representing the probability of moving from state i to state j and satisfying the reversibility condition. Defining q(i, j) as the transition kernel, the MH generates candidates from the kernel and evaluates their acceptance probability (Korn et al. 2010). Because this process is dynamic, the chain will reach the stationary distribution over time. For properties and extensions of the MH algorithm, see Hastings (1970) and Korn et al. (2010).
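The MH mechanics described above can be sketched as a random-walk sampler (a minimal, generic sketch, not the MHadaptive implementation used later in the paper; the function name and tuning defaults are ours):

```python
import numpy as np

def random_walk_mh(log_post, theta0, n_iter=5000, step=0.5, seed=0):
    """Random-walk Metropolis: propose theta' ~ N(theta, step^2 I);
    with a symmetric kernel the acceptance probability reduces to
    min(1, pi(theta') / pi(theta)), evaluated on the log scale."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    lp = log_post(theta)
    chain = np.empty((n_iter, theta.size))
    accepted = 0
    for i in range(n_iter):
        prop = theta + step * rng.standard_normal(theta.size)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept or keep current state
            theta, lp = prop, lp_prop
            accepted += 1
        chain[i] = theta
    return chain, accepted / n_iter
```

Passing the (unnormalized) log posterior of any of the three models as `log_post` yields a chain whose draws, after burn-in and thinning, approximate π(θ | Y).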

Predictive density
In this Subsection, we derive the density of y_{t+h} conditioned on all parameters and past observations, also called the predictive density. Combining the joint posterior π(θ | Y) with the density of the future observation y_{t+h}, p(y_{t+h} | θ, F_{t+h−1}), the predictive density is:

p(y_{t+h} | Y) = ∫ p(y_{t+h} | θ, F_{t+h−1}) π(θ | Y) dθ,

which does not have a closed form. In this case, one strategy is to produce a Monte Carlo approximation of the predictive density, drawing N samples θ^{(i)}, i = {1, . . ., N}, from the posterior, as follows:

p(y_{t+h} | Y) ≈ (1/N) ∑_{i=1}^{N} p(y_{t+h} | θ^{(i)}, F_{t+h−1}).

This procedure was also performed by Andrade et al. (2015) and discussed in Gamerman and Lopes (2006) and Krüger, Lerch, Thorarinsdottir, and Gneiting (2020). In this way, the expected value of y_{t+h} is:

E(y_{t+h} | Y) = ∫ µ_{t+h} π(θ | Y) dθ,

which can be approximated from µ_{t+h}, drawing N samples θ^{(i)}, i = {1, . . ., N}, i.e.:

E(y_{t+h} | Y) ≈ (1/N) ∑_{i=1}^{N} µ_{t+h}^{(i)}.
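The Monte Carlo approximation above can be sketched for a one-step-ahead forecast in a POI-AR(1) model; the posterior draws below are synthetic placeholders standing in for MCMC output, and all numeric values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 5000
# Synthetic stand-ins for N posterior draws of (beta_1, phi_1)
beta1 = rng.normal(1.0, 0.05, size=N)
phi = rng.normal(0.10, 0.03, size=N)

y_last, c = 4, 0.1  # last observed count and log-threshold (illustrative)
# mu_{t+1} under each draw theta^(i); averaging approximates E(y_{t+1} | Y)
mu = np.exp(beta1 + phi * (np.log(max(y_last, c)) - beta1))
point_forecast = mu.mean()
# Predictive draws mix parameter and observation uncertainty
pred = rng.poisson(mu)
lower, upper = np.quantile(pred, [0.025, 0.975])
```

The quantiles of `pred` give a credible interval for the future count, which is how the interval forecasts in the application section can be produced.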

A simulation study
In this Section, we implement a simulation study to analyze the performance of the Bayesian approach for the POI-AR(1), NB-AR(1), and PIG-AR(1) models. Several parameter settings are taken into account, with three sample sizes, n = {75; 125; 225}, and each model replicated w = 1,000 times.
To simulate the w series, we used the gamlss.dist package (Stasinopoulos and Rigby 2020) with the inverse transform method, available in the R software (R Core Team 2021). Data were generated following the respective distributions and, for the NB-AR(1) and PIG-AR(1) models, we fixed the dispersion parameter at σ = 0.25. We used the MH algorithm to estimate the parameters through the MHadaptive package of Chivers (2015). The code was executed in Python, with the library rpy2 (Gautier 2021). Configurations such as burn-in, thinning, and total number of samples were determined by a pilot study.
The convergence was assessed in each replicated model via HW (Heidelberger and Welch 1983), G (Geweke 1992), and the Dependence Factor (I) (Raftery and Lewis 1992) criteria.
To evaluate the convergence, we adopted an α level of 0.05 in the HW diagnostic and a threshold equal to 0.10 (eps) in the half-width test; we compared the |G| statistic with the Z_{1−α/2} quantile and verified whether the value of I tended to one.
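These screening rules can be collected into a small gate function (a sketch; the cutoff for the dependence factor I, here 5.0, is a common rule of thumb and our own assumption, since the text only requires I to tend to one):

```python
from statistics import NormalDist

def passes_convergence(hw_pvalue, geweke_z, dep_factor, alpha=0.05, i_max=5.0):
    """Apply the three screening rules: the HW stationarity p-value must
    exceed alpha, |G| must fall below the z_{1-alpha/2} quantile, and the
    Raftery-Lewis dependence factor I must stay small (cutoff assumed)."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # ~1.96 for alpha = 0.05
    return hw_pvalue > alpha and abs(geweke_z) < z_crit and dep_factor < i_max
```

A replicate failing any one of the three rules would be excluded from the summaries, as described in the next paragraph.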
In the simulation step, only the processes in which convergence was obtained, according to the three criteria shown above, were considered. We evaluated the performance of the inference by using the Corrected Bias (CB) and Corrected Error (CE), in a similar way as Andrade et al. (2015). These metrics were estimated as CB = |θ̄̂ − θ| / τ̂ and CE² = (w τ̂²)^{−1} ∑_{i=1}^{w} (θ̂_i − θ)², where θ̄̂ is the mean of the estimates and τ̂ is the standard deviation of θ̂ among the w replicates. In Algorithm 1, we describe our simulation procedure; some simulation results are available in Subsection 4.1. In the Appendix, we present the settings of the MH, including the burn-in period and convergence criteria. For convenience, we present the mean of the convergence criteria and the acceptance probability.

Algorithm 1 Steps of the simulation process
1: Start
2: Set the model, the number of samples, and the burn-in and thin periods.
3: Set the parameters and n for generating the data.
4: for w in 1 until w = 1,000 do
5:    Draw the series Y of length n.
6:    Sample from the joint posterior distribution using the MH algorithm.
7:    Evaluate the convergence using the HW, G, and I criteria.
8:    if the criteria indicate convergence then
9:        Estimate the mean, mode, and standard deviation of the posterior distributions.
10:       Store (9) and the convergence results.
11:       Store the acceptance probability and the Monte Carlo error.
12:       w = w + 1.
13:   end if
14: end for
15: Save the average of the w results of the posterior mean, mode, and standard deviation. Name these quantities as mean, mode, and SD, respectively.
16: End

Parameter settings and results
We simulated the series according to the following equation:

log µ_t = x_t⊤β + ϕ_1 {log y*_{t−1} − x_{t−1}⊤β},    (5)

where β = (β_1)⊤ is the vector associated with the level and x_t = (x_1), with x_1 = (1, 1, . . ., 1)⊤. Table 1 shows the values used for the simulation of the artificial series. In Table 2, the results of the first scenario are presented, where the parameter ϕ_1 = 0.10. In general, the CB values decreased as the sample size increased, suggesting good properties of the estimation process. This was also seen in the CE values, which tended to one.
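The generating scheme of Equation 5 for the POI-AR(1) case can be sketched as follows (a sketch of the data-generating process, not the gamlss.dist-based generator actually used; the function name and the threshold value are ours):

```python
import numpy as np

def simulate_poi_ar1(n, beta1, phi, c=0.1, seed=0):
    """Generate a POI-AR(1) series with
    log mu_t = beta1 + phi*(log max(y_{t-1}, c) - beta1)."""
    rng = np.random.default_rng(seed)
    y = np.empty(n, dtype=np.int64)
    y[0] = rng.poisson(np.exp(beta1))          # start at the marginal level
    for t in range(1, n):
        eta = beta1 + phi * (np.log(max(y[t - 1], c)) - beta1)
        y[t] = rng.poisson(np.exp(eta))
    return y
```

Calling this with n in {75, 125, 225} and the parameter values of Table 1 reproduces the kind of artificial series used in the study.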
The estimates of the dispersion parameter, σ, in the NB-AR(1) and PIG-AR(1) models improved as n increased. Our findings suggest a better performance of the posterior mode estimates, when compared to the posterior means, for inferring the parameter σ, producing a good approximation when n = 225.
The other scenarios are included in the Appendix. When increasing ϕ_1 to 0.40 (Table 7), the CB and CE metrics were similar to those of the first scenario, indicating improvement of the estimates with the increase in sample size. The mode estimates of the marginal distribution of σ were closer to the real value than the mean estimates. Similar results were found in the second and third scenarios, where the estimates of ϕ_1 improved as the sample size increased, indicating the need for large sample sizes, mainly in models with dispersion parameters. The results of these scenarios can be seen in Tables 7 and 8, respectively.
The importance of large sample sizes in time series was noted by Barreto-Souza (2017) when studying the PIG model with the INAR(1) structure, with bias and standard errors decreasing as n increased. Regarding the scenarios studied in this paper, series in which n ≪ 75, with the same linear predictor, are not considered suitable for modeling.
We verified a high computational cost of the MH algorithm in settings where ϕ_1 was greater than 0.80. In the situation where ϕ_1 = 0.90 and n = 225 in the PIG-AR(1) model, the pilot model with thin = 20 and 80,000 total samples resulted in a low acceptance rate and high values of I, suggesting the need to reparametrize or to adopt another sampling technique.

Real data analysis
We used a trend test, proposed by Cox and Stuart (1955), to evaluate the trend component.
Graphically, this component can be verified by analyzing the behavior of the series over time, presented in Figure 3. To model this series, we considered a linear predictor with level and trend components and an AR(1) dependence structure, similar to Equation 5, but with β = (β_1, β_2)⊤ and x_t = (x_1, x_2)⊤, where x_1 = (1, 1, . . ., 1)⊤ and x_2 = (1, 2, . . ., n)⊤. The MH algorithm was used for sampling from the joint posterior, setting the burn-in period to 2,000, the total number of samples to 10,000, and keeping every twentieth sampled value (thinning). The estimation results are in Table 4 and the graphs of the posterior marginal density of each parameter are available in Figures 4, 5, and 6. Regarding convergence to the stationary distribution, all the processes met the established criteria.
We compared the Bayesian estimates with the maximum likelihood (ML) ones.The likelihood functions were numerically optimized using the L-BFGS-B algorithm (Byrd, Lu, Nocedal, and Zhu 1995) and the standard errors were computed based on the inverse of the Hessian matrix.
When analyzing the ML results, available in Table 12, we can verify the proximity of the estimates of the two approaches, especially for the parameter β_2 of the POI-AR(1) and NB-AR(1) models. The greatest distances between the Bayesian and maximum likelihood estimates were observed in β_1 and ϕ_1 of the PIG-AR(1) model. Considering the results shown in Table 4, we can observe a significant trend effect in the three models, indicating a reduction in the number of thefts during the analyzed period. Furthermore, the credible intervals of β_1 and ϕ_1 were wider in the NB-AR(1) and PIG-AR(1) models.
In the PIG-AR(1) model, the dispersion parameter was estimated at 0.012, CI_σ = {0.089; 0.015}, and a similar result occurred with the NB-AR(1). This indicates that the variability of the series is greater than the mean and that the phenomenon has slightly heavier tails. According to these models, the conditional variance is given by V(y_t | F_{t−1}) ≈ µ_t(1 + 0.012µ_t), for all t.
The deviance information criterion (DIC) and the conditional predictive ordinate (CPO) results are available in Table 5, suggesting a preference for the mixed distributions over the Poisson, since they presented lower DIC and higher CPO values. We proceeded with the analysis by estimating the randomized quantile residuals proposed by Dunn and Smyth (1996), defined as r_t = Φ^{−1}(F_{y_t}(y_t | F_{t−1})), where Φ^{−1} is the inverse cumulative distribution function of a standard normal and F_{y_t} is the fitted conditional distribution used for modeling. According to the Box and Pierce (1970) test, based on 20 lags, there is no evidence of correlation within the residuals (p-values equal to 0.419, 0.127, and 0.128 for POI-AR(1), NB-AR(1), and PIG-AR(1), respectively). The Kolmogorov-Smirnov test did not reject the null hypothesis of normality of the residuals, returning p-values equal to 0.123, 0.414, and 0.421.
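For a discrete fit, the randomized quantile residual draws u uniformly between F(y_t − 1) and F(y_t) before applying Φ^{−1}, which is what makes the residuals continuous; a sketch for the Poisson case (helper names are ours):

```python
import numpy as np
from statistics import NormalDist
from math import exp

def poisson_cdf(k, mu):
    """P(Y <= k) for Y ~ Poisson(mu), by direct summation of the pmf."""
    if k < 0:
        return 0.0
    term, total = exp(-mu), exp(-mu)
    for i in range(1, int(k) + 1):
        term *= mu / i
        total += term
    return min(total, 1.0)

def rq_residual(y, mu, rng):
    """Randomized quantile residual: u ~ Uniform(F(y-1), F(y)),
    then r = Phi^{-1}(u); approximately N(0,1) under a correct model."""
    u = rng.uniform(poisson_cdf(y - 1, mu), poisson_cdf(y, mu))
    u = min(max(u, 1e-12), 1.0 - 1e-12)   # keep the normal quantile finite
    return NormalDist().inv_cdf(u)
```

Under a well-specified model these residuals behave like standard normal noise, which is exactly what the Kolmogorov-Smirnov check above examines.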
Predictions for April and May 2022 are shown in Table 6. We estimated the Mean Absolute Percentage Error (MAPE) and the Mean Absolute Error (MAE) for these predictions, obtaining MAPE = {14.55; 14.70; 14.68} and MAE = {60.60; 61.67; 61.32} for POI-AR(1), NB-AR(1), and PIG-AR(1), respectively. Overall, the mixed models obtained good results according to the information criteria, but their MAPE and MAE metrics were similar to the Poisson's. In addition, the predictions made with the PIG-AR(1) showed a lower error than those of the NB-AR(1). The predictions for April were overestimated in all models, with values consistently above the real counts and credible intervals that did not include them, whereas the May predictions of the PIG-AR(1) and NB-AR(1) models were relatively close to the real values.
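The two forecast metrics can be computed as follows (a minimal sketch with illustrative helper names):

```python
import numpy as np

def mae(actual, forecast):
    """Mean absolute error of the point forecasts."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.mean(np.abs(actual - forecast)))

def mape(actual, forecast):
    """Mean absolute percentage error, in percent (assumes no zero actuals)."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(100.0 * np.mean(np.abs((actual - forecast) / actual)))
```

MAPE scales each error by the observed count, so the two metrics can rank models differently when the actual values vary in magnitude.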

Final remarks
In this paper we investigated models of count time series using Poisson, Negative binomial, and Poisson inverse Gaussian distributions, the last two being alternatives for overdispersed count data.We extended the Bayesian analysis presented in the literature to the Poisson inverse Gaussian and relaxed the assumption of known dispersion parameter in the GARMA(p, q) class for the Negative binomial distribution.
Through a simulation study, we verified the advantages of the Bayesian inference in terms of implementation, estimation, and flexibility of the models, as well as the possibility of modeling the dispersion parameter in the mixed distributions.The Bayesian perspective works well for the given scenarios.The analysis with real data showed that the Poisson inverse Gaussian distribution is an alternative for modeling count data, providing good predictions.
Possible directions for future research include constructing the Probability Integral Transform (PIT) of Czado, Gneiting, and Held (2009), studying higher AR(p) orders in the Poisson inverse Gaussian model under Bayesian inference, and considering the Hamiltonian Monte Carlo (HMC) algorithm to draw samples from the posterior distribution.

Figure 1 :
Figure 1: Prior densities of β j and σ, respectively, that were used in the Bayesian analysis

Figure 3 :
Figure 3: (a): Number of thefts reported in the region of Campinas; (b): Autocorrelation function of ∆Y ; (c): Partial autocorrelation function of ∆Y

Table 1 :
Parameter values used to generate the artificial time series

Table 2 :
Main results of the first scenario for POI-AR(1), NB-AR(1), and PIG-AR(1) models based on the w replications, where β_1 = 1.00, ϕ_1 = 0.10, and σ = 0.25

Figure 2 :
Geographic location of the cities included in the Campinas region, state of São Paulo, Brazil

The samples from April and May 2022 were removed from the training stage, being used to perform an out-of-sample forecasting analysis. A brief summary of the data is available in Table 3. Note that the skewness and kurtosis measures indicate an asymmetric and leptokurtic behavior of the data.
We considered the number of vehicle thefts in the region of Campinas, São Paulo, Brazil, from January 2010 to May 2022. It is a monthly time series, made available by the Secretaria de Segurança Pública do Estado de São Paulo (2021). For additional details about the database, visit the following website: http://www.ssp.sp.gov.br/estatistica/pesquisa.aspx. The series contains data from 38 localities near the city of Campinas, state of São Paulo, shown graphically in Figure 2.

Table 3 :
Descriptive statistics of the number of thefts reported between January 2010 and May 2022 in Campinas, Brazil

Table 9 :
MH settings and convergence analysis in the first scenario based on the w replications, where AC is the mean of the acceptance rate, G is the mean of the G statistic, I is the mean of I, and HW is the mean of the p-values of the HW test.

Table 10 :
MH settings and convergence analysis in the second scenario based on the w replications, where AC is the mean of the acceptance rate, G is the mean of the G statistic, I is the mean of I, and HW is the mean of the p-values of the HW test.

Table 11 :
MH settings and convergence analysis in the third scenario based on the w replications, where AC is the mean of the acceptance rate, G is the mean of the G statistic, I is the mean of I, and HW is the mean of the p-values of the HW test.