GARCH Models under Power Transformed Returns: Empirical Evidence from International Stock Indices

This study evaluates the empirical performance of four power transformation families (extended Tukey, Modulus, Exponential, and Yeo-Johnson) in modeling returns in the context of GARCH(1,1) models with two error distributions: Gaussian (normal) and Student-t. We employ the Adaptive Random Walk Metropolis method in a Markov Chain Monte Carlo scheme to draw parameters. Using 19 international stock indices from the Oxford-Man Institute and judging by the log-likelihood, Akaike Information Criterion, Bayesian Information Criterion, and Deviance Information Criterion, applying the power transformation families to the return series clearly improves the fit of the normal GARCH(1,1) model. In particular, the Modulus transformation family provides the best fit. Under the Student-t error distribution assumption, the GARCH(1,1) models under power transformed returns perform better in only a few cases.


Introduction
Normality is assumed in various statistical methods such as hypothesis testing, linear regression analysis, econometric modeling, and quality control. Many processes, however, follow non-normal distributions, since exactly normally distributed characteristics are rarely present in practice. Transformation can be employed as an alternative approach for dealing with non-normal data. One popular transformation is the Box-Cox family, introduced by Box and Cox (1964). In the case of econometric models, especially the AutoRegressive Conditional Heteroscedastic (ARCH) model of Engle (1982), which was extended by Bollerslev (1986) to the Generalized ARCH (GARCH) model, Sarkar (2000) and Utami and Subanar (2013) applied the extended Box-Cox (Bickel and Doksum 1981) and shifted Box-Cox (Box and Cox 1964) transformations, respectively, to returns data whose volatilities follow a GARCH(1,1)-type model. Through an empirical study, Sarkar (2000) found that the residuals of the proposed model were closer to normality than those of the basic one. The proposed model is also able to capture the strong nonlinear dependence in financial time series data. In the study of realized variance (RV) models, Gonçalves and Meddahi (2011) and Nugroho and Morimoto (2016) applied the Box-Cox transformation to RV data and showed that it outperforms the logarithmic transformation.
As in Sarkar (2000), our preliminary study showed that the parameter estimate of the extended Box-Cox transformation, i.e. T(x, λ) = sign(x)(|x|^λ − 1)/λ for λ > 0, is significant and larger than 1, but the extended model is outperformed by the basic one. When the simpler version of the extended Box-Cox was applied, namely the extended Tukey transformation (Tukey 1957), i.e. T(x, λ) = sign(x)|x|^λ for λ > 0, the parameter estimate is significant and the extended model outperforms the basic model in terms of the likelihood ratio test. This result confirms the findings of Nugroho, Susanto, Prasetia, and Rorimpandey (2019). Given the scarcity of studies on transformations for improving statistical models, the first objective of this study is to examine the performance of power transformation families for returns in the context of the GARCH model. Findings are illustrated by real examples. We give special attention to the transformation families of extended Tukey (Tukey 1957), the Exponential (Manly 1976), the Modulus (John and Draper 1980), and the Yeo-Johnson (Yeo and Johnson 2000), as all of them accommodate negative values of returns. The last three transformations were successfully applied by Tsiotas (2009) and Nugroho and Morimoto (2014) to the lagged volatility process in the context of stochastic volatility. Recently, Nugroho (2019) used the Yeo-Johnson function to transform the returns data in estimating a normal GARCH(1,1) model; the proposed model gives a better fit than the basic model. To the best of our knowledge, no study has been conducted on the employment of these four transformations in the context of the GARCH model. Furthermore, previous studies show evidence that financial time series often exhibit leptokurtic characteristics (meaning the return distribution is heavy-tailed), and are even asymmetric.
Financial literature accordingly suggests replacing the assumption of error normality with flexible distribution functions capable of accommodating heavy tails or asymmetry, such as the Student-t, generalized error, and skewed Student-t; see, e.g., Angelidis, Benos, and Degiannakis (2004), Braione and Scholtes (2016), Nugroho, Mahatma, and Pratomo (2018), and Nugroho and Morimoto (2019). This study focuses on only two distributions for the error terms, the normal and Student-t distributions, and compares their performance under the power transformation families. The second goal of this study is thus to extend the study of Sarkar (2000) by assuming the error term follows a Student-t distribution.
In contrast with Sarkar (2000) and Utami and Subanar (2013), who respectively suggest the maximum likelihood (ML) and second-order least squares (SLS) methods for estimating their models, we use the Adaptive Random Walk Metropolis (ARWM) method of Atchade and Rosenthal (2005) in a Markov Chain Monte Carlo (MCMC) algorithm to estimate the model parameters. In some cases of GARCH-type models it is difficult to apply an ML-based method because of the positivity and stationarity conditions on the variance and the complexity of the models (see, e.g., Nakatsuma (2000), Ardia (2009), Henneke, Rachev, Fabozzi, and Nikolov (2011), and Boonyakunakorn, Pastpipatkul, and Sriboonchitta (2019)). The MCMC algorithm offers a flexible and attractive alternative, and has been widely used in statistics and other scientific fields.
The rest of this article is arranged as follows: Section 2 provides the methodology including the proposed model, power transformation family, and estimation method. Section 3 presents the data analysis and empirical application of the model on real financial data. The last section provides the conclusion and future work.

Proposed model and estimation
This section extends the GARCH(1,1) model with normal and Student-t distributed error terms by applying a power transformation family to the return series. We give attention to four power transformation families that work on all real values. The GARCH model is usually constructed under the assumption of normally distributed return errors, which cannot accommodate the leptokurtosis commonly found in financial time series. In Value-at-Risk (VaR) estimation using GARCH models, the general finding is that the normality assumption may underestimate the VaR. Angelidis et al. (2004) and Braione and Scholtes (2016) showed the importance of allowing for heavy tails in the distributional assumption, such as the Student-t, since this distribution produces more accurate VaR forecasts from GARCH models. Meanwhile, Satoyoshi and Mitsui (2011) and Liu, Li, and Ng (2015) considered the use of GARCH-type models with Student-t distributed errors to price options. Their empirical illustrations suggest that the proposed pricing models provide more accurate price estimates than the normal GARCH-type models.
We acknowledge that the ML method is the most popular method for estimating GARCH parameters, but this study employs the powerful MCMC technique to handle the Bayesian inference problem. In the context of the normal GARCH model, Nakatsuma (2000) and Haitao (2010) compared the MCMC method to the ML method using real financial data and showed that the MCMC method is more precise and reliable. In particular, Nakatsuma (2000) employed the Metropolis-Hastings algorithm in the MCMC procedure and found that the numerical standard error of the MCMC is much smaller than the standard error of the ML for each parameter in the ARMA-GARCH model. In the context of ARCH models, Andrade (2011) also recommended the Metropolis-Hastings MCMC algorithm since it provides more accurate estimates and is more robust than the ML method in estimating the credibility intervals of the parameters.

GARCH(1,1) models with power transformed returns
The GARCH model is one of the econometric models widely employed for modeling and forecasting time-varying variance. A member of the family is the GARCH(1,1), under which tomorrow's variance is simply a function of the long-term average variance, today's squared return, and today's variance. It takes the form

R_i = σ_i ε_i, ε_i ~ N(0, 1),
σ_i² = ω + α R_{i−1}² + β σ_{i−1}²,  (1)

where R_i denotes the return of an asset at time i.

Model (1) has constraints including ω > 0 and α, β ≥ 0 for a positive variance, and α + β < 1 for covariance stationarity. In most cases, we obtain a model with a small α and a large β. The magnitude of α + β is called the variance persistence, measuring the endurance of innovation effects on future volatility (or, how quickly volatility decays towards its average). When the sum equals 1.0, volatility has long memory and the IGARCH model should be employed. High persistence (approaching 1.0) indicates slow decay, while low persistence indicates fast decay.
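The variance recursion of Model (1) and its constraints can be sketched as follows; this is an illustrative implementation (the function name and the choice of the unconditional variance as the start-up value are our own, not from the paper):

```python
import numpy as np

def garch11_variance(returns, omega, alpha, beta):
    """Conditional variance recursion of a GARCH(1,1) model:
    sigma2[i] = omega + alpha * R[i-1]**2 + beta * sigma2[i-1].
    Start-up value: the unconditional variance omega / (1 - alpha - beta)."""
    if not (omega > 0 and alpha >= 0 and beta >= 0 and alpha + beta < 1):
        raise ValueError("need omega > 0, alpha, beta >= 0, alpha + beta < 1")
    n = len(returns)
    sigma2 = np.empty(n)
    sigma2[0] = omega / (1.0 - alpha - beta)
    for i in range(1, n):
        sigma2[i] = omega + alpha * returns[i - 1] ** 2 + beta * sigma2[i - 1]
    return sigma2
```

With a small α and large β, as typically estimated, the persistence α + β governs how slowly the recursion forgets past shocks.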
This study generalizes the GARCH(1,1) model by applying four power transformation families to the return series. Given a transformation function T with transformation parameter λ_T, the return equation in Model (1) is now expressed as

T(R_i, λ_T) = σ_i ε_i,

where T(R_i, λ_T) represents a power transformation of the data R_i. For any family T with parameter λ_T, the likelihood function of the original data R_i, denoted L(R_i | σ_i², λ_T, T), is fully determined through the inverse transformation T(R_i, λ_T) → R_i. It thus consists of the likelihood of the transformed data multiplied by the Jacobian of the transformation, i.e.

L(R_i | σ_i², λ_T, T) = L(T(R_i, λ_T) | σ_i²) × J(R_i, λ_T | T),  (2)
where the Jacobian term is J(R_i, λ_T | T) = ∂T(R_i, λ_T)/∂R_i. Let us denote the observed data by R = (R_1, R_2, . . . , R_n). Taking the natural logarithm of the likelihood function L_i in Eq. (2), the log-likelihood function for the model with normally distributed errors is given by

log L(R | σ², λ_T, T) = Σ_{i=1}^{n} [ −(1/2) log(2π σ_i²) − T(R_i, λ_T)² / (2σ_i²) + log J(R_i, λ_T | T) ].

When the conditional distribution of R_i is the standardised t-distribution with zero mean, variance σ_i², and degrees of freedom ν, following the definition of the t-distribution in Bollerslev (1987), the log-likelihood for the model is

log L(R | σ², ν, λ_T, T) = Σ_{i=1}^{n} [ log Γ((ν+1)/2) − log Γ(ν/2) − (1/2) log(π(ν−2)σ_i²) − ((ν+1)/2) log(1 + T(R_i, λ_T)² / ((ν−2)σ_i²)) + log J(R_i, λ_T | T) ].
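The two log-likelihoods, evaluated on the transformed returns with the log-Jacobian correction, can be sketched as below. This is a minimal illustration under the standard Gaussian and standardised Student-t densities; function names and the array-based interface are our own:

```python
import numpy as np
from math import lgamma, pi, log

def loglik_normal_transformed(T_R, log_jac, sigma2):
    """Gaussian log-likelihood of the transformed returns T(R_i, lam_T),
    plus the log-Jacobian correction log J(R_i, lam_T | T)."""
    return float(np.sum(-0.5 * np.log(2 * pi * sigma2)
                        - T_R ** 2 / (2 * sigma2)
                        + log_jac))

def loglik_student_t_transformed(T_R, log_jac, sigma2, nu):
    """Standardised Student-t log-likelihood (zero mean, variance sigma2,
    degrees of freedom nu > 2, as in Bollerslev 1987), plus log-Jacobian."""
    c = lgamma((nu + 1) / 2) - lgamma(nu / 2) - 0.5 * log(pi * (nu - 2))
    return float(np.sum(c - 0.5 * np.log(sigma2)
                        - (nu + 1) / 2 * np.log1p(T_R ** 2 / ((nu - 2) * sigma2))
                        + log_jac))
```

As ν grows large, the Student-t expression converges to the Gaussian one, which provides a convenient numerical sanity check.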

Transformation families
A conventional power transformation was first proposed by Tukey (1957, 1977) to simplify data analysis, targeting three characteristics: (1) effects are additive; (2) the error variability is constant; and (3) the error distribution is symmetrical and possibly nearly normal. The transformation was then modified by Box and Cox (1964) to improve additivity, normality, and homoscedasticity of a set of observed data. The transformations of Tukey and of Box and Cox work only on positive real values. To accommodate all real numbers, several modifications have been proposed. This study discusses and compares four transformation families, namely the Extended Tukey (ET), Exponential (Exp), Modulus (Mod), and Yeo-Johnson (YJ), applied to return series whose volatility follows the GARCH model. Manly (1976) proposed an exponential transformation allowing negative values. It was reported to be effective in transforming skewed unimodal distributions to normal-like, almost symmetrical distributions, but not good enough for bimodal or U-shaped distributions. John and Draper (1980) proposed a modification, called the Modulus transformation, allowing negative values. They reported that the transformation works most effectively on a distribution that is nearly symmetrical about a centre point, and changes a long-tailed distribution to become more normal. The basic idea is to apply the same power transformation to the two tails of a distribution that is symmetric about zero. Yeo and Johnson (2000) proposed a power transformation family suitable for reducing skewness towards normality while retaining many of the good characteristics of the Box-Cox power transformation family. For estimating the transformation parameter, they chose the value that minimises the Kullback-Leibler distance between the normal and the transformed distribution.

Thus, while the BC, Exp, and YJ transformations are used to make a skewed distribution more symmetrical and normal, the Modulus transformation family proposed by John and Draper (1980) can be used to eliminate positive kurtosis.
The formulas for the transformation families compared in this study are shown in Table 1, together with the associated Jacobians. The signum function is defined as sign(x) = 1 if x > 0, sign(x) = 0 if x = 0, and sign(x) = −1 if x < 0. The values λ_ET = 1, λ_Exp = 0, λ_Mod = 1, and λ_YJ = 1 correspond to no transformation.
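The four families can be sketched using their standard definitions from the cited papers (Tukey 1957; Manly 1976; John and Draper 1980; Yeo and Johnson 2000); the function names are our own, and the no-transformation parameter values quoted above can be verified numerically:

```python
import numpy as np

def ext_tukey(x, lam):
    """Extended Tukey: sign(x) |x|^lam; identity at lam = 1."""
    return np.sign(x) * np.abs(x) ** lam

def manly_exp(x, lam):
    """Manly exponential: (exp(lam x) - 1) / lam; identity at lam = 0."""
    return x if lam == 0 else (np.exp(lam * x) - 1) / lam

def modulus(x, lam):
    """John-Draper Modulus: sign(x) ((|x|+1)^lam - 1)/lam; identity at lam = 1."""
    if lam == 0:
        return np.sign(x) * np.log1p(np.abs(x))
    return np.sign(x) * ((np.abs(x) + 1) ** lam - 1) / lam

def yeo_johnson(x, lam):
    """Yeo-Johnson piecewise power transformation; identity at lam = 1."""
    x = np.asarray(x, dtype=float)
    pos = x >= 0
    out = np.empty_like(x)
    out[pos] = (np.log1p(x[pos]) if lam == 0
                else ((x[pos] + 1) ** lam - 1) / lam)
    out[~pos] = (-np.log1p(-x[~pos]) if lam == 2
                 else -((1 - x[~pos]) ** (2 - lam) - 1) / (2 - lam))
    return out
```

Note that the Modulus transformation with λ < 1 shrinks both tails toward the centre symmetrically, which is the mechanism behind its kurtosis-reducing behaviour.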

Parameter estimation
The MCMC algorithm involves two steps; see Casella (2011) and Geyer (2011). The first step is the construction of a Markov chain having the posterior distribution as its stationary distribution. The second step uses the output of the Markov chain to compute posterior estimates such as the mean, standard deviation, and credible set via Monte Carlo approximation. Previous papers have applied the MCMC algorithm to volatility modeling and demonstrated its applicability. Further explanation of MCMC can be found in Robert and Casella (2011).
There are several ways to construct the Markov chain. One of the most popular classes is based on the random walk Metropolis (RWM) method. The theoretical underpinnings of RWM are provided by Roberts and Rosenthal (2001) on optimal scaling and posterior shape, and by Roberts, Robert, and Frigesso (2003) on convergence. In an effort to improve the efficiency of RWM in terms of convergence, Atchade and Rosenthal (2005) proposed an adaptive version of the method, named the adaptive RWM (ARWM), which sequentially tunes the proposal step size to achieve the optimal acceptance rate. Nugroho and Susanto (2017) and Nugroho (2018) applied the method in the context of GARCH-type models. The ARWM algorithm to update a parameter value x is as follows:

(i) Initialise the parameter x_0 and step size s_0.

(ii) At the (j + 1)-th iteration, given x_j and s_j:
(a) Generate a proposal x* = x_j + s_j ε with ε ~ N(0, 1).
(b) Compute the acceptance probability δ = min{1, p(x*|R)/p(x_j|R)}.
(c) If δ > u for u ~ Uniform(u; 0, 1), then the proposal is accepted and x_{j+1} = x*; otherwise, the proposal is rejected and x_{j+1} = x_j.
(d) Update the step size via s* = s_j + j^{−λ}(δ − τ_s), where δ is the acceptance probability of the proposal x* and τ_s the expected (target) acceptance probability. If s* > s_max, then s_{j+1} = s_max; if s* < s_min, then s_{j+1} = s_min; otherwise s_{j+1} = s*.
Following Roberts and Rosenthal (2009), we set s_min = 10^{−5} and s_max = 10 so that the asymptotic acceptance rate of the algorithm is approximately 0.44, and λ = 0.6. The choice of the step size s_j has a strong impact on the proposal moves. Intuitively, when s_j² is very small, the resulting movement of the algorithm is also very small, which leads to a poorly mixing chain. Conversely, if s_j² is very large, then the proposed movements are also large and are frequently rejected, again resulting in a poorly mixing algorithm.
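One ARWM update of a scalar parameter, with the step-size adaptation and clamping described above, can be sketched as follows. This is an illustrative single-parameter version (names and the Robbins-Monro-style update s* = s_j + j^{−λ}(δ − τ_s) follow the algorithm as stated in the text, not a definitive implementation of Atchade and Rosenthal's sampler):

```python
import numpy as np

def arwm_step(x, s, j, log_post, rng,
              s_min=1e-5, s_max=10.0, lam=0.6, tau=0.44):
    """One adaptive random walk Metropolis update of a scalar parameter x
    with step size s at iteration j, targeting exp(log_post)."""
    x_prop = x + s * rng.standard_normal()
    # Metropolis acceptance probability delta = min(1, p(x*|R)/p(x_j|R))
    delta = min(1.0, np.exp(log_post(x_prop) - log_post(x)))
    if rng.uniform() < delta:
        x = x_prop
    # adapt the step size toward the target acceptance rate tau, then clamp
    s = min(max(s + j ** (-lam) * (delta - tau), s_min), s_max)
    return x, s
```

Run in a loop (one such update per parameter per sweep), the adaptation drives the empirical acceptance rate toward τ_s ≈ 0.44.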
Furthermore, note that the Metropolis ratio in the ARWM method is calculated from the posterior distribution p(θ|R) of the parameter vector θ = (ω, α, β, ν, λ_T). In the Bayesian framework, the posterior distribution is proportional to the product of the likelihood function and the prior distribution:

p(θ|R) ∝ L(R|θ) p(θ),

where p(θ) is the prior distribution capturing prior uncertainty about the parameter θ. In this study we adopt priors expressed with the indicator function 1[x] = 1 if condition x holds and 0 otherwise: for (ω, α, β) as in Ardia and Hoogerheide (2010), for ν as in Deschamps (2006), and λ_T ~ N(0, 1000).

Data and preliminary findings
Before estimation, we examine the sample data, namely the daily returns. Table 2 reports standard summary statistics of the 19 stock indices. The skewness and kurtosis of the return series indicate some deviation from the standard normal distribution, as shown by the Jarque-Bera (JB) normality test with critical value 5.99 at the 5% significance level. Therefore, the assumption of normality for the errors when estimating the models should be rejected; the error distribution of the returns should be assumed to be non-normal. Specifically, the S&P CNX Nifty data is highly skewed, as its skewness is less than −1; the Nikkei 225 data is negatively skewed, as its skewness is between −1 and −0.5; the other series are fairly symmetrical, with skewness between −0.5 and 0.5. Given this characteristic, it is plausible to transform the return data so that its distribution approximates a symmetric distribution (Sarkar 2000), apart from the transformation's ability to capture the presence of nonlinear dependence in time series data. Meanwhile, the kurtosis for all series is greater than 3, which indicates a leptokurtic distribution (i.e. tails heavier than the normal distribution). Therefore our presumption is that the Modulus transformation introduced by John and Draper (1980) will outperform the others.
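The summary diagnostics used here (sample skewness, kurtosis, and the JB statistic, which is compared against the chi-squared critical value with 2 degrees of freedom) can be computed as in this sketch; the function name is illustrative:

```python
import numpy as np

def jarque_bera(x):
    """Sample skewness, kurtosis, and the Jarque-Bera statistic
    JB = n/6 * (S^2 + (K - 3)^2 / 4), asymptotically chi-squared(2)
    under normality (5% critical value approx. 5.99)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    m = x.mean()
    s2 = ((x - m) ** 2).mean()
    skew = ((x - m) ** 3).mean() / s2 ** 1.5
    kurt = ((x - m) ** 4).mean() / s2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    return skew, kurt, jb
```

For the leptokurtic return series in Table 2 the (K − 3)² term dominates, so the JB test rejects normality even for the nearly symmetric indices.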

Estimation results
For parameter estimation, this study employs the MCMC simulation with 6000 iterations of the Markov chain, where the initial 1000 samples are discarded (the so-called burn-in) and the remaining samples are stored for the Monte Carlo calculation. The burn-in is done to eliminate the worst early samples, and its implementation was successful in our MCMC calculations although not reported (see, e.g., Figure 1); that is, our Markov chains (after the burn-in period) have reached stationarity. For our data set, we set the plausible default initial values ω_0 = 0.01, α_0 = 0.2, β_0 = 0.7, λ_{T,0} = 0.5, and ν_0 = 10.
Notice that although we varied the initial values in our experiments (even to the very poor values λ_{T,0} = 4 and ν_0 = 40), the posterior results are almost identical to those obtained for the above values. This result is consistent with Ardia and Hoogerheide (2010). This is one of the advantages of the MCMC method over the ML method, which is often sensitive to initial values (Takaishi 2010).
Furthermore, the efficiency of the Markov chain can be measured by the Integrated Autocorrelation Time (IACT). In this study, the IACT was estimated using the adaptive truncated periodogram estimator of Sokal (1997). Although not reported, we note in particular that the Markov chains of ν and λ_T are well mixed, as indicated by IACT values of less than 50. The IACT can be interpreted as the number of MCMC samples required to obtain one independent sample, or equivalently, the length of the Markov chain (after burn-in) divided by the Effective Sample Size (ESS). Therefore, we have ESSs larger than 100. According to Givens and Hoeting (2013), this indicates that the estimation method is efficient for estimating the degrees of freedom and power transformation parameters, which are not simple, and supports reliable inference on the parameters of interest. Visually, for example, Figure 1 displays the trace plots of the posterior samples for the NLR-GARCHt(1,1) models applied to the FTSE 100 returns and shows that the chains fluctuate around their means.
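The IACT/ESS relationship can be illustrated with the following sketch. Note that for simplicity this uses a crude truncation at the first non-positive autocorrelation, as a stand-in for Sokal's adaptive truncated periodogram estimator used in the paper; the function names are our own:

```python
import numpy as np

def iact(chain):
    """Integrated autocorrelation time tau = 1 + 2 * sum_k rho_k,
    truncating the sum at the first non-positive autocorrelation
    (a simple stand-in for Sokal's adaptive estimator)."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = x.size
    var = np.dot(x, x) / n
    tau = 1.0
    for k in range(1, n // 2):
        rho = np.dot(x[:-k], x[k:]) / (n * var)
        if rho <= 0:
            break
        tau += 2.0 * rho
    return tau

def ess(chain):
    """Effective Sample Size = chain length / IACT."""
    return len(chain) / iact(chain)
```

With 5000 post-burn-in draws, an IACT below 50 thus corresponds directly to an ESS above 100, as stated above.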
Due to space limitation, this study only presents the posterior mean for λ T in the GARCHn(1,1) model as in Table 3 and for (ν, λ T ) in the GARCHt(1,1) model as in Table 4. The asterisk symbol indicates a significant estimated value in terms of 95% Highest Posterior Density (HPD) interval, meaning that the interval does not contain a value corresponding to no transformation. Here the HPD interval is calculated using the algorithm proposed by Chen and Shao (1999).
For the GARCHn(1,1) model, it can be seen that almost all sample data provide significant estimates of the transformation parameter λ_T. Although not reported, this study has also examined and compared the residuals of the competing models to determine which transformations successfully transform the returns so that the distribution of the residuals approaches the normal distribution. For all cases, we find that, although they still deviate from normality, all power transformations successfully reduce the coefficients of skewness and kurtosis as well as the JB statistics, even when the estimate of λ_T is not significant. The Exp, Mod, and YJ transformations provide the best performance in reducing the skewness coefficient in the cases of six, five, and eight data series, respectively. When comparing the kurtosis coefficient, this study finds that the Mod transformation is the most effective at reducing kurtosis in all data cases. Furthermore, the results of the normality test confirm the earlier presumption that the Mod transformation is the most suitable for all return data, as indicated by the smallest JB statistics. Moreover, the quantile-quantile (QQ) plots in Figure 2, for example, demonstrate that the estimated residuals from fitting the S&P CNX Nifty daily returns to the NLR-GARCH(1,1) models are closer to the normal distribution. In particular, panel (d) suggests that the Mod transformation specification provides the best fit.
Moving to the models with the Student-t specification for the return errors, Table 4 shows that the power transformation parameter is no longer significant in most cases when the normal distribution is replaced by the Student-t distribution. This is unsurprising, because the use of a different distribution for the return errors affects the treatment of the returns; in this case, the Student-t distribution yields only weak support for the power transformation of the returns. The parameter estimates are significant for six, five, nine, and six data series when applying the ET, Exp, Mod, and YJ transformations, respectively. Regarding the degrees of freedom parameter ν, the ET, Exp, and YJ transformations do not affect its estimate, meaning that the estimates of ν from the two model classes are close to each other. In contrast, the Mod transformation increases the degrees of freedom in all data cases except the FT Straits Times data, meaning that the tail-thickness of the Student-t distribution is reduced, in accordance with the objective of the Mod transformation of changing a long-tailed distribution to be more normal. Although the power transformations are supported by several data series, the next section discusses whether the significance of λ_T is sufficient to provide evidence against the basic GARCHt(1,1) model.

Model selection
In modeling, one of the important parts of statistical analysis is the selection of the best model from a set of candidates, given a set of data. Since the NLR-GARCH models nest the basic GARCH model, the Likelihood Ratio Test (LRT) can be used to compare the two. The LRT statistic for the general model M_g against the simpler model M_s is calculated as

LRT = 2 [log L(R|θ̂_g) − log L(R|θ̂_s)],

where θ̂ denotes the vector of parameter estimates. The criterion is based on the likelihood ratio having an asymptotic chi-squared distribution, with degrees of freedom equal to the number of additional parameters in the general model. For our study, note that at the 1%, 5%, and 10% levels of significance with 1 degree of freedom, we reject M_s when the LRT is greater than 6.64, 3.84, and 2.71, respectively. In these cases, the LRT tells us to favor M_g over M_s.
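The test above can be sketched as follows, using the 1-degree-of-freedom chi-squared critical values quoted in the text; function names are illustrative:

```python
def lrt_statistic(loglik_general, loglik_simple):
    """LRT = 2 * (logL_g - logL_s), asymptotically chi-squared with
    df equal to the number of extra parameters in the general model."""
    return 2.0 * (loglik_general - loglik_simple)

# chi-squared critical values for 1 degree of freedom, as quoted in the text
CRIT_1DF = {0.01: 6.64, 0.05: 3.84, 0.10: 2.71}

def reject_simple(lrt, level=0.05):
    """True if the simpler (nested) model is rejected at the given level."""
    return lrt > CRIT_1DF[level]
```

For example, a log-likelihood gain of 3 between the NLR-GARCH model and the basic model gives LRT = 6, which rejects the basic model at the 5% level but not at the 1% level.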
Furthermore, the models within the class of NLR-GARCH models are non-nested, so we cannot compare them via the LRT. We therefore computed three selection criteria that do not assume nested models, namely the Akaike Information Criterion (AIC) of Akaike (1976), the Bayesian Information Criterion (BIC) of Schwarz (1978), and the Deviance Information Criterion (DIC) of Spiegelhalter, Best, Carlin, and van der Linde (2002). The AIC and BIC are respectively calculated as AIC = 2k − 2 log L(R|θ_max) and BIC = k log(s) − 2 log L(R|θ_max), where k is the number of model parameters, s is the number of data points, and θ_max is the parameter set that maximizes the observed-data likelihood L(R|θ). Meanwhile, the DIC is a Bayesian generalization of the AIC and is based on the posterior distribution of the deviance D(θ) = −2 log p(R|θ). The DIC is expressed as DIC = 2 D̄ − D(θ̄), where D̄ is the posterior mean of the deviance and θ̄ the posterior mean of the parameters. Note that models with lower AIC, BIC, and DIC values are more adequate. Table 5 reports the log-likelihood values for all models fitted to the observed data. It can be seen that, in the case of the NLR-GARCHn(1,1) models with a significant λ_T parameter, the LRT at the 5% and 10% significance levels favors the transformed models and provides evidence against the basic model. This finding is confirmed by the AIC and DIC results summarised in Table 7, the only exception being the Russel 2000 data series. The third and fourth best ranks are very competitive between the Exp and YJ transformations. In this case, the BIC selected fewer data series than the AIC and DIC: both the AIC and DIC provide evidence for 16 data series, whereas the BIC provides evidence for only seven. These findings therefore confirm the earlier finding that financial return series should be transformed using power transformations, specifically the Mod transformation, in the context of the GARCH(1,1) model with normally distributed errors.
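The three criteria, using the formulas just given (with the DIC written as twice the posterior mean deviance minus the deviance at the posterior mean), can be computed as in this sketch; the function names are illustrative:

```python
import numpy as np

def aic(loglik_max, k):
    """AIC = 2k - 2 log L(R | theta_max); lower is better."""
    return 2 * k - 2 * loglik_max

def bic(loglik_max, k, n):
    """BIC = k log(n) - 2 log L(R | theta_max), n = number of data points."""
    return k * np.log(n) - 2 * loglik_max

def dic(deviance_draws, deviance_at_mean):
    """DIC = 2 * mean(D(theta)) - D(theta_bar), where deviance_draws are
    the deviances D(theta) over the posterior MCMC draws and
    deviance_at_mean is the deviance at the posterior mean theta_bar."""
    return 2 * np.mean(deviance_draws) - deviance_at_mean
```

The BIC's log(n) penalty per parameter explains why, in the comparison above, it selects the transformed model for fewer data series than the AIC and DIC do.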
Comparing the models with the Student-t specification based on the LRT statistics, there are seven data series for the ETR-GARCHt(1,1) model, five for the ExpR-GARCHt(1,1) model, five for the ModR-GARCHt(1,1) model, and six for the YJR-GARCHt(1,1) model that provide evidence against the GARCHt(1,1) model. Based on these results, both the AIC and DIC indicate that each of the ET, Mod, and YJ transformations provides the best fit on three sample data series. A similar result is produced by the BIC (except for CAC 40 and IPC Mexico in the Mod specification). These results thus only weakly support the power transformation specifications for the return series when the return errors are Student-t distributed.
Furthermore, both the AIC and DIC indicate that the superiority of the NLR-GARCHt(1,1) model is supported by two sample data series with an insignificant ET parameter and three sample data series with insignificant Exp and Mod parameters. Meanwhile, among the data providing significant evidence for the transformation parameter, the ETR-GARCHt(1,1), ExpR-GARCHt(1,1), ModR-GARCHt(1,1), and YJR-GARCHt(1,1) models outperform the GARCHt(1,1) model on 5 of 6, 2 of 5, 1 of 9, and 3 of 6 sample data series, respectively. The results for the Mod and YJ specifications are similar to the BIC results; according to the BIC, the ETR-GARCHt(1,1) and ExpR-GARCHt(1,1) models outperform the GARCHt(1,1) model on 3 of 6 and 1 of 5 sample data series with a significant transformation parameter. These findings show that the statistical significance of the transformation parameter in the NLR-GARCHt(1,1) models cannot guarantee that the model performs better than the basic GARCHt(1,1). On the basis of the DIC estimates, the FTSE 100, FT Straits Times, Hangseng, IPC Mexico, and KOSPI Composite data series support all transformation specifications, with the best performance produced by different transformations.

Conclusion and future work
This study investigated the performance of four power transformation families, i.e. the extended Tukey, Exponential, Modulus, and Yeo-Johnson, in transforming the returns in GARCH(1,1) models. The proposed models, namely the NLR-GARCH(1,1) models, are extensions of the basic GARCH model of Bollerslev (1986, 1987) and provide alternatives to the model of Sarkar (2000). We performed Bayesian inference for the proposed models via the ARWM sampler within the MCMC method. First, in terms of autocorrelation time, we observed that the ARWM sampler has high efficiency in sampling the transformation parameter in the nonlinear function. We conclude that our MCMC algorithm works well and that the ARWM sampler can reliably estimate the key parameters of the NLR-GARCH(1,1) models. The reader can easily extend and employ the computing algorithm for other specifications of NLR-GARCH models. Second, all observed return series were found to support all considered power transformation families in the context of the normal GARCH(1,1) model, where the Modulus transformation appears to provide the best-fitting model, and this outperformance is always statistically significant. Therefore this study concludes that the power transformation families can be applied successfully to transform the returns for the GARCH(1,1) model with normally distributed errors. Meanwhile, the results for the Student-t NLR-GARCH(1,1) models indicate that the performance of the transformations does not depend on the significance of the transformation parameter.
Areas for further research include investigating skewed versions of the error distribution for the returns. It would also be interesting to apply the power transformation families to the volatility process itself in order to extend earlier results.