Robust Estimation for a Generalised Ratio Model

It is known that data such as business sales and household income require transformation prior to regression estimation because they have heteroscedastic errors. However, data transformations make the estimation of means and totals unstable. Therefore, the ratio model is often used for imputation in the field of official statistics to avoid this problem. Our study aims to robustify the estimator following the ratio model by means of M-estimation. Reformulating the conventional ratio model with a homoscedastic quasi-error term provides quasi-residuals that can be used as a measure of outlyingness, just as in a linear regression model. A generalisation of the model, which accommodates error terms with different degrees of heteroscedasticity, is also proposed. Functions for the robustified estimators of the generalised ratio model are implemented with the iteratively re-weighted least squares algorithm in the R environment and illustrated using random datasets. Monte Carlo simulation confirms the accuracy of the proposed estimators, as well as their computational efficiency. A comparison of the scale parameters between the average absolute deviation (AAD) and the median absolute deviation (MAD) is made for Tukey's biweight function. The results with Huber's weight function are also provided for reference. The proposed robust estimator of the generalised ratio model is used for imputation of major corporate accounting items in the 2016 Economic Census for Business Activity in Japan.


Introduction
Ratio imputation is a special case of regression imputation (De Waal, Pannekoek, and Scholtus (2011), pp. 244-245). When there are missing values in the target variable y, the observed auxiliary variable x is used to estimate the missing y values; therefore, x must be chosen from the variables that are highly correlated with y. The imputation model is

y_i = β x_i + ϵ_i,  ϵ_i ∼ N(0, x_i σ²),   (1)

where data i = 1, . . . , n of (x, y) are the observed units in the imputation class of size N. The true ratio β is obtained by ȳ/x̄ over all N units; however, it is usually unknown due to the existence of missing values in y. The estimated ratio β̂ is used to substitute for the missing y values such that ŷ_i = β̂ x_i. The ratio model (1) is a best linear unbiased estimator (BLUE) under the following two conditions: (i) the relationship between the variables y and x is a straight line through the origin, and (ii) the variance of y about this line is proportional to x (Cochran (1977), pp. 158-159). The model (1) looks like a single regression model without an intercept; however, the error term of a linear regression model is ε_i ∼ N(0, σ²) while that of the ratio model is ϵ_i ∼ N(0, x_i σ²).
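As a minimal illustration of ratio imputation (a Python sketch of our own, not the paper's R implementation; all variable names are ours), the ratio is estimated from the observed pairs and the missing y values are filled in as ŷ = β̂x:

```python
import random

random.seed(1)

# Simulate an imputation class: y = beta * x with error variance proportional to x.
beta_true = 2.0
x = [random.uniform(1.0, 10.0) for _ in range(200)]
y = [beta_true * xi + random.gauss(0.0, 0.3 * xi ** 0.5) for xi in x]

# Suppose the last 20 y-values are missing; estimate the ratio from observed pairs.
x_obs, y_obs = x[:180], y[:180]
beta_hat = sum(y_obs) / sum(x_obs)     # conventional ratio estimator

# Impute the missing y-values as y_hat = beta_hat * x.
y_imputed = [beta_hat * xi for xi in x[180:]]
```

Note that no transformation of x or y is needed even though the error variance grows with x, which is the practical appeal of the model.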
The ratio model is useful for imputation, as it accommodates heteroscedastic data without transformation. On the other hand, the ratio model is easily affected by outliers just like regression models (e.g. Farrella and Salibian-Barrerab (2006)).
In this paper, the idea of M-estimation for regression models is briefly explained, and a reformulation of the ratio model is described so that it has a homoscedastic error term, like a regression model. Then a generalisation of the ratio model and a robustified estimator for the generalised ratio model are proposed in section 2. Estimation of the proposed model by the iteratively re-weighted least squares (IRLS) algorithm is explained in section 3. The reasons for selecting Tukey's biweight function and its tuning constant in relation to the scale parameter are also discussed in that section. R functions based on the robustified estimator are implemented and evaluated in section 4. Monte Carlo simulation is conducted with random datasets to compare their accuracy and computational efficiency with different scale parameters for Tukey's biweight function. The results with Huber's weight function are also provided in the Appendix. An application to a real dataset is illustrated in section 5, and section 6 concludes the paper.

M-estimation for regression models
A regression model,

y_i = x_i'β + ε_i,   (3)

has a homoscedastic error term ε_i, which is assumed to be normal with a mean of 0 and constant variance, V(ε_i) = σ², where x_i = (1, x_i1, . . . , x_ip)', β = (β_0, β_1, . . . , β_p)', and p is the number of explanatory variables. The estimation equation for β can be expressed as Σ_{i=1}^n (y_i − x_i'β)x_i = 0 (e.g. Huber and Ronchetti (2009), p. 155). Huber (1973) extended his idea of M-estimation for a location parameter (Huber 1964) to the case of linear regression. The proposed estimation equation is

Σ_{i=1}^n w_i (y_i − x_i'β)x_i = 0,

where w_i is the weight w_i = w(e_i) given by a weight function of the standardised residuals e_i = r_i/σ. The idea of M-estimation is to control the influence of outliers by the weights w_i derived from a weight function. The value of w_i, which lies between 0 and 1, is determined according to the magnitude of the standardised residual. A smaller weight is allocated to an outlying observation, so that the observation has less influence on the parameter estimation.
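To make the weighting idea concrete, here is a small Python sketch (our own illustration, not part of the paper's code) of Huber-type weights: observations whose standardised residuals stay within the tuning constant k keep full weight 1, while larger ones are down-weighted to k/|e|:

```python
def huber_weight(e, k=1.345):
    """Huber's weight function: 1 inside [-k, k], k/|e| outside."""
    return 1.0 if abs(e) <= k else k / abs(e)

# A clearly outlying residual gets a small weight; typical ones keep weight 1.
residuals = [0.5, -1.0, 1.3, 6.0]
weights = [huber_weight(e) for e in residuals]
print(weights)   # the outlier at 6.0 is down-weighted to 1.345/6.0
```

Since the weights never reach exactly zero, no observation is fully discarded, a property that matters in the later discussion of weight-function choice.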

Reformulation and generalisation of the ratio model
The obstacle of M-estimation for the ratio model is its heteroscedastic error term i . Because of the error term's characteristic, residuals of the ratio model cannot be used as a measure of outlyingness. Therefore, we first reformulate the conventional ratio model so that it has a homoscedastic error term like a regression model.
The error term of a regression model (3) is assumed to be normal with a mean of 0 and constant variance, which can be written as ε_i ∼ N(0, σ²). Meanwhile, the error term ϵ_i of the ratio model (1) is proportional to √x_i; i.e., the variance of ϵ_i is proportional to x_i and can be written as ϵ_i ∼ N(0, x_i σ²). Since these two different error terms have the relationship ϵ_i = √x_i ε_i, the ratio model can be expressed in the following form:

y_i = β x_i + √x_i ε_i,  ε_i ∼ N(0, σ²).   (4)

We refer to ε_i in the ratio model hereafter as the quasi-error term, because the true error term of the model is ϵ_i. We then propose extending the model (4) so that the error term is proportional to x_i^γ, as follows:

y_i = β x_i + x_i^γ ε_i,  ε_i ∼ N(0, σ²).   (5)

The corresponding estimator of the generalised ratio model is

β̂ = Σ_{i=1}^n x_i^{1−2γ} y_i / Σ_{i=1}^n x_i^{2−2γ},   (6)

and its quasi-residuals are

ř_i = (y_i − β̂ x_i) / x_i^γ.   (7)

The model (5) and the estimator (6) with γ = 1, γ = 1/2 and γ = 0 correspond to cases A', B' and C', respectively, and these cases have different features. In this paper, we discuss cases A' and B' in particular, since our focus is on models with a heteroscedastic error term. Case C' is a regression model without an intercept and has a homoscedastic error.
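The closed-form estimator (6) can be checked numerically for the three special cases. The following Python sketch (ours, not the REGRM package) confirms that γ = 1 gives the mean of the ratios, γ = 1/2 the ratio of the sums, and γ = 0 ordinary least squares through the origin:

```python
import random

random.seed(7)
x = [random.uniform(1.0, 10.0) for _ in range(50)]
y = [2.0 * xi + random.gauss(0.0, 1.0) for xi in x]

def beta_grm(x, y, gamma):
    """Weighted LS estimator (6) for y_i = beta*x_i + x_i^gamma * eps_i."""
    num = sum(yi * xi ** (1.0 - 2.0 * gamma) for xi, yi in zip(x, y))
    den = sum(xi ** (2.0 - 2.0 * gamma) for xi in x)
    return num / den

mean_of_ratios = sum(yi / xi for xi, yi in zip(x, y)) / len(x)    # case A'
ratio_of_sums = sum(y) / sum(x)                                   # case B'
ols_origin = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)  # case C'

assert abs(beta_grm(x, y, 1.0) - mean_of_ratios) < 1e-10
assert abs(beta_grm(x, y, 0.5) - ratio_of_sums) < 1e-10
assert abs(beta_grm(x, y, 0.0) - ols_origin) < 1e-10
```

The single parameter γ thus interpolates between the familiar ratio-type estimators.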
As the ratio β of case B' is estimated by the sum of y divided by the sum of x over the observed data, its value is mostly decided by very large observations. This estimator has a relatively small variance compared with case A', inherent to its definition, even when x and y contain extreme values. One may be able to demonstrate under what conditions estimator A' has a smaller sampling variance than estimator B'. On the other hand, the influence of very large observations is much smaller in case A', since the estimation is made by the mean of the ratios of the individual observations; however, this definition makes the value of β̂ relatively unstable, especially when there are very small observations in x. Scatter plots of data sets following these models are shown in Figure 1 to render the differences. The size of each data set is n = 1000, the explanatory variable follows x ∼ N(5, 1), and β = 2. The objective variable y is derived from each model with a normally distributed quasi-error term ε ∼ N(0, 1).

Robustification
The robustified estimator for the generalised ratio model (5) is derived by means of M-estimation as follows:

β̂_rob = Σ_{i=1}^n w_i x_i^{1−2γ} y_i / Σ_{i=1}^n w_i x_i^{2−2γ},

where w_i is a weight function of the quasi-residuals ř_i. The role of the weight function is to alleviate the influence of observations with large residuals. There is a variety of choices, for example in Holland and Welsch (1977) and Zhang (1997). The following two are the most popular functions among them. One is Tukey's biweight function, described in Beaton and Tukey (1974), and the other is Huber's weight function, proposed by Huber (1964). The standardised residuals e_i are the quasi-residuals ř_i divided by an estimated scale parameter σ̂. The selection of the scale parameter and the tuning constants c and k is discussed in the next section. The quasi-residuals ř_i, based on the homoscedastic quasi-error term ε_i, are obtained by (7).
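As an illustration (a sketch under our own naming, not the package code), the quasi-residuals (7) and Tukey's biweight can be computed as:

```python
def quasi_residuals(x, y, beta, gamma):
    """r_i = (y_i - beta*x_i) / x_i^gamma, homoscedastic under model (5)."""
    return [(yi - beta * xi) / xi ** gamma for xi, yi in zip(x, y)]

def tukey_biweight(e, c=8.0):
    """Tukey's biweight: (1 - (e/c)^2)^2 inside [-c, c], zero outside."""
    return (1.0 - (e / c) ** 2) ** 2 if abs(e) <= c else 0.0

# Standardised residuals beyond c are eliminated entirely (weight 0).
print(tukey_biweight(0.0), tukey_biweight(8.0), tukey_biweight(20.0))  # → 1.0 0.0 0.0
```

Unlike Huber's weights, the biweight reaches exactly zero, which is what allows extreme outliers to be removed from the estimation.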
The cases with γ = 1, γ = 1/2 and γ = 0 are shown in Table 2. The corresponding models are similar to those for cases A', B' and C'.

Selection of weight function
It is important to consider the purpose of the estimation and the policy toward outliers when selecting a weight function. Of the two functions described in the previous section, we adopt Tukey's biweight function, which can eliminate the influence of extreme outliers, since our purpose is imputation. The underlying policy corresponding to Tukey's biweight function is to assume that outliers are not representative of the part of the population under scrutiny. The estimation is made to complete missing data, and elimination from the estimate for imputation does not mean exclusion of the outlying observations from the survey results.
In contrast, Huber's weight function may be preferred if the purpose is a population estimate, since this function does not eliminate any observation from the estimation. Survey observations should not be wasted for population estimates unless they are erroneous or invalid. Figure 2 shows the difference between these weight functions. While the tails of the biweight function reach zero when the absolute value of the standardised residuals exceeds a certain threshold, the tails of Huber's weight function only approach zero at infinity.

Scale parameter and tuning constant
Beaton and Tukey (1974) use the interquartile range as the scale parameter for Tukey's biweight function. Bienias, Lassman, Scheleur, and Hogan (1997) adopt the average absolute deviation (AAD) instead and recommend the range of the tuning constant c from 4 to 8. As for Huber's weight function, Huber (1964) adopts the median absolute deviation (MAD), σ_MAD = median(|ř_i − median(ř_i)|). Holland and Welsch (1977), which proposes the IRLS algorithm, provides tuning constants for 95% asymptotic efficiency under the standard normal distribution for both of these functions with σ_MAD as their scale parameter.
It is known that the scale parameters based on the standard deviation σ_SD and σ_AAD have the relation σ_AAD = √(2/π) · σ_SD ≈ 0.80 · σ_SD. Similarly, σ_MAD = Φ⁻¹(3/4) · σ_SD ≈ 0.67 · σ_SD, where Φ(·) is the cumulative distribution function of the standard normal distribution.
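These two constants are easy to verify numerically; a short check using only the Python standard library (our illustration):

```python
import math
from statistics import NormalDist

# AAD of a N(0, sigma^2) variable is sqrt(2/pi) * sigma, roughly 0.80 * sigma.
aad_factor = math.sqrt(2.0 / math.pi)
print(round(aad_factor, 4))      # 0.7979

# MAD of a N(0, sigma^2) variable is Phi^{-1}(3/4) * sigma, roughly 0.67 * sigma.
mad_factor = NormalDist().inv_cdf(0.75)
print(round(mad_factor, 4))      # 0.6745
```

These factors are what link tuning constants defined on the σ_SD scale to their σ_AAD and σ_MAD counterparts.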
The tuning constants of these three scale parameters for 95% asymptotic efficiency under the standard normal distribution are obtained by Holland and Welsch (1977). Based on these figures, the range of the tuning constant k for Huber's weight function with σ_MAD corresponding to Tukey's c proposed by Bienias et al. (1997) is derived as in Table 3. Wada and Noro (2019) made a comparison between Tukey's biweight function and Huber's weight function with σ_MAD and σ_AAD for a simple regression model by Monte Carlo experiments with random error terms following various t-distributions. The results indicate that Tukey's biweight is more compatible with σ_AAD, while Huber's weight function works better with σ_MAD. For the tuning constants, a larger value is recommended for Tukey's function with σ_AAD and a smaller value for Huber's with σ_MAD. Smaller values of these tuning constants make the estimation more robust but reduce the weights and the efficiency.

The algorithm and other settings
Among the few well-known iterative schemes for obtaining M-estimators in regression, which include Newton's method, we adopt the iteratively re-weighted least squares (IRLS) algorithm according to Holland and Welsch (1977). For the weight function and the scale parameter, we choose Tukey's biweight function with σ_AAD in accordance with Bienias et al. (1997), whose convergence condition we also follow. Our choice of the tuning constant is c = 8, to minimise the reduction of weights and efficiency.
The modified IRLS algorithm for a robust estimator of the generalised ratio model is as follows. The superscript index in parentheses (j) on each variable shows the iteration number.

i) Set j = 1 and obtain the initial estimate β̂(1) by the non-robust estimator (6).
ii) Obtain the quasi-residuals ř(1)_i by (7), the scale parameter σ(1) by (10), and the weights w(1)_i by (9).
iii) Obtain the weighted estimate β̂(2) using the weights w(j)_i.
iv) Obtain new quasi-residuals ř(2)_i by (7) corresponding to β̂(2), update the scale parameter σ(2) by (10), and the weights w(2)_i by (9).
v) If the scale parameters σ(1) and σ(2) satisfy the convergence condition, make β̂(2) the final estimator β̂_rob and stop the iteration. Otherwise, increment the index j by 1 and go back to iii).

The functions shown in Table 4 are implemented together with a parent function named REGRM, which calls an appropriate child function according to the arguments. An R package containing all these functions is created and stored at the repository, https://github.com/kazwd2008/REGRM.

Illustration
To see the effect of robustification, experiments are conducted with 30% contaminated two-variable datasets, as shown in Figures 3 and 4. One dataset is of size 1000 and the other of size 15. The 70% normal data follow a normal distribution with a correlation of 0.6. The outliers, which follow a normal distribution without correlation, are placed at some distance from the normal data. The results of estimators A and B are shown together with those of estimators A' and B' applied to all data, to the normal data (excluding outliers) and to the outliers (excluding normal data).
The experiments reveal that estimator B' is affected more than A' by outliers with larger values, while the robust estimators A and B successfully provide results similar to those of the non-robust estimators applied to the normal data only.

Evaluation of the proposed estimators
Monte Carlo simulation with random data is performed based on the models A', B', and C' shown in Table 2. In the simulation, the variable x consists of uniformly distributed random numbers from 1 to 10, and the ratio is β = 5. The quasi-error term ε_i is also random, following a t-distribution with 1, 2, 3, 5, 10, and infinite degrees of freedom. The objective variable y is calculated from each model equation in Table 2 using the above-mentioned components. The experiments to estimate ŷ by the estimators A, B, and C with σ_AAD and σ_MAD are performed k = 100,000 times with size n = 100 for each degree of freedom of the t-distribution for the quasi-error term. Hereafter, we denote the estimators A, B, and C with σ_AAD as A_AAD, B_AAD, and C_AAD, and those with σ_MAD as A_MAD, B_MAD, and C_MAD.
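One replication of this design can be sketched as follows (our illustration; since the Python standard library has no Student-t generator, the t variate is built from a normal and a chi-square draw, a standard construction):

```python
import math
import random

random.seed(11)
beta, n, df = 5.0, 100, 3

def rand_t(df):
    """Student-t draw: N(0,1) / sqrt(chi2_df / df); chi2_df is Gamma(df/2, scale 2)."""
    z = random.gauss(0.0, 1.0)
    chi2 = random.gammavariate(df / 2.0, 2.0)
    return z / math.sqrt(chi2 / df)

x = [random.uniform(1.0, 10.0) for _ in range(n)]
eps = [rand_t(df) for _ in range(n)]

# Model B' (gamma = 1/2): y_i = beta * x_i + sqrt(x_i) * eps_i
y = [beta * xi + math.sqrt(xi) * ei for xi, ei in zip(x, eps)]
```

Models A' and C' are generated the same way, with the error term multiplied by x_i and by 1, respectively.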

Accuracy
For the above-mentioned data, the observed values of y contain errors, while the true values are known and derived by y = βx. Comparisons among the three sets of estimators, i.e., (A', A_AAD, A_MAD), (B', B_AAD, B_MAD), and (C', C_AAD, C_MAD), are made based on the root mean squared errors of the estimated values ŷ against the true values, denoted RMSE_A, RMSE_B, and RMSE_C, respectively. The relative efficiency is then calculated by dividing the RMSE values of A_AAD and A_MAD by that of A', those of B_AAD and B_MAD by that of B', and those of C_AAD and C_MAD by that of C', to see the improvement achieved by the robustification. The results are shown in Table 5.
The set of 100,000 experiments was repeated four times for each degree of freedom of the quasi-error term, and the figures in Table 5 are from one of these sets. The superior figures, determined by comparing the estimators of the same model with different scale parameters, are underlined in the table. The pairs with no underline show disagreement in the results among the four sets of experiments. Those figures may not be stable because of mismatched models (e.g., data following model A estimated with estimators other than A) or because the datasets with df = 1 may contain extreme outliers.
Nevertheless, the robust estimators are better alternatives than the non-robust one when the data fit the model of the estimator. In addition, σ AAD is a better choice than σ MAD with Tukey's biweight function, unless the data are highly contaminated with extreme outliers or in the ideal situation that the quasi-error term follows the normal distribution.

Computational efficiency
The maximum and mean numbers of iterations needed to compute the estimators are shown in Tables 6 and 7, respectively. At least two iterations are necessary because of the algorithm shown in the previous section, and in most of the cases investigated here, fewer than 10 iterations were needed. The number of iterations tends to increase as the tails of the quasi-error term become longer.

Other aspects
The problem of non-convergence of Tukey's biweight function is reported by Wada (2012) and Wada and Noro (2019) in the context of estimating a linear model. We did not encounter this problem, since the models used in this paper have only one parameter to estimate. Another problem reported by Wada and Noro (2019) is the phenomenon that all the robust weights w_i reach zero during the computation of a robust estimator with Tukey's biweight function and the MAD scale. It occurred in our experiments for estimator A with the MAD scale for the data with df = 1 and 2, as shown in Table 8.
Although this study does not focus on a comparison of the two weight functions, results for the estimators with Huber's weight function on the same datasets are shown in the Appendix.

Application with real data
The robust estimators for the generalised ratio model proposed in this paper were developed for the imputation of the 2016 Economic Census for Business Activity in Japan. The 2016 Census was conducted by the Ministry of Internal Affairs and Communications and the Ministry of Economy, Trade and Industry on June 1, 2016. It aims to identify the structure of establishments and enterprises in all industries at the national and regional levels, and to obtain basic information for conducting various statistical surveys by investigating the economic activities of these establishments and enterprises.
The major corporate accounting items, such as sales, expenses, and salaries, surveyed by the census require imputation to avoid bias. Although ratio imputation was a leading candidate from the beginning, it is well known that the estimation is very sensitive to outliers; therefore, we needed to take appropriate measures against this problem.
After implementing the functions based on the robust estimator of the generalised ratio model, estimators A and B were compared by Monte Carlo simulation using previous census data. Estimator B was selected to estimate missing sales from expenses, salaries from expenses, and expenses from sales.
Estimator B has the problem of being influenced by extremely large observations in spite of the robustification. Such observations are not regarded as outliers even when they show tendencies different from the majority of the other observations. Figure 5 shows a good example. The left side of the scatter plot shows all the enterprises in the industry together with the results of estimators A and B. There are a few extremely large values, and estimator B is seriously affected by them. The right side of the plot is a close-up of the smaller observations with higher density in the same dataset. To cope with this problem, extremely large outliers, if any, in each imputation class were removed from the estimation in the course of data processing for the 2016 Census.

Conclusions
The proposed generalised ratio model broadens the conventional definition of the ratio model with regards to the variance of the error term. Robustified estimators based on the model effectively alleviate the influence of outliers.
Application of the robust estimator may contribute to the accuracy of official statistics, as the survey data tend to have longer tails. The R functions based on the proposed estimator are implemented, evaluated and provided at a public repository.
A user's policy toward outliers may be reflected in the choice of a weight function, and a suitable scale parameter for Tukey's biweight function is the AAD for longer-tailed datasets.
As with the conventional ratio model, the robustified estimator is highly affected by very large observations; therefore, removing extremely large outliers may be necessary before estimation.

Future work
The simulations in this paper use uniformly distributed x values and quasi-error terms with fatter tails than the normal distribution. However, further simulation may be needed for economic data with lognormally distributed x, since that would be more realistic.
Another interesting topic is application to restricted data. Economic surveys gather multiple financial variables for each establishment, for example. Those variables may be subject to restrictions, such as a total and its components. Further study is necessary.
Appendix
The results of the Monte Carlo simulation for the estimators with Huber's weight function are shown in this appendix. The datasets used are identical to those used in section 4. Tables 9, 10 and 11 are comparable with Tables 5, 6 and 7, respectively. Looking at the cases in which the models of the data and the estimators are consistent, the simulation results indicate that Tukey's biweight function is slightly more efficient than Huber's, while Huber's converges a little faster, except for the unstable results with the data of df = 1. However, those differences may be negligibly small in practical use.