Computing Robust Regression Estimators: Developments since Dutter (1977)

: The proposal of M estimators for regression (Huber, 1973) and the development of an algorithm for its computation (Dutter, 1977) has lead to an increased activity for further research in this area. New regression estimators were introduced that combine a high level of robustness with high efﬁciency. Also fast algorithms have been developed and implemented in several software packages. We provide a review of the most important methods, and compare the performance of the algorithms implemented in R . Zusammenfassung: Die Einf¨uhrung von M Sch¨atzern f¨ur Regression (Hu-ber, 1973) und die Entwicklung eines Algorithmus f¨ur die Berechnung (Dut-ter, 1977) hat zu einer regen Forschungst ¨ atigkeit auf diesem Gebiet gef ¨ uhrt. Neue Sch ¨ atzer f ¨ ur Regression wurden eingef ¨ uhrt, die einen hohen Grad von Robustheit mit hoher Efﬁzienz kombinieren. Auch schnelle Algorithmen wurden entwickelt und in diversen Softwarepaketen implementiert. Wir geben einen ¨ Uberblick ¨ uber die wichtigsten Methoden und vergleichen die in R im-plementierten Algorithmen.


Introduction
Linear regression belongs to the most important methods in statistics.There are numerous applications of linear regression in practice, and usually the least squares (LS) estimator is taken as the standard estimator.The LS estimator has excellent theoretical properties, and its computation is usually very simple.However, this estimator also relies on quite strict assumptions, and a violation of these assumptions may lead to useless results.
The idea of robust statistics is to account for certain deviations from idealized model assumptions.Typically, robust methods reduce the influence of outlying observations on the estimator.In the context of regression, there are two types of outliers: • vertical outliers: these are outliers from the linear trend along the response variable; • leverage points: these are outliers in the space of the explanatory variables.
Leverage points can even improve the estimation if the values of the response variable correspond to the linear trend.Here we assume that their response values are also deviating from the linear trend, which is often referred to as bad leverage points (e.g.Rousseeuw and Leroy, 2003).The LS estimator is sensitive with respect to both types of outliers.Leverage points may even "lever" the regression line or hyperplane.The goal of robust regression estimators is to protect against both types of outliers.
The history of robust regression dates back to the 19 th century, where L 1 regression was treated in Edgeworth (1887); it was even earlier mentioned than LS regression (Boscovich, 1757).However, it turns out that L 1 regression is only robust with respect to vertical outliers, but not to leverage points.A first formal approach to robust regression was done with the M estimator in Huber (1973).This estimator was difficult to compute, and Dutter (1977) introduced an algorithm and compared with several other proposals.Extensions of this work are available in Dutter (1983).Since this time, a lot of research was devoted to developing robust regression estimators and algorithms for their computation.In this contribution we focus on the theoretical and numerical developments in robust regression since Dutter (1977).Several key proposals are compared in a simulation study.

Basic Concepts
For the following sections we consider the linear regression model where x i is a p-dimensional vector of explanatory variables, y i is the response, β = (β 1 , . . ., β p ) is the vector of unknown regression coefficients, and i ∼ N (0, σ 2 ) are independent normally distributed errors.By setting x i1 = 1, an intercept term is included in the model.The regression residuals are defined as r i (β) = y i − x i β.It is clear that meaningful regression estimators attempt to minimize the regression residuals, or more precisely, a function of the residuals.

Least Squares Regression
The most common approach for estimating β is least squares (LS) regression, which minimizes the sum of squared residuals: The squared loss function is not bounded, which means that large (absolute) residuals contribute much to the overall sum.This is a particular problem if leverage points are present in the data.Figure 1 depicts the effect of a leverage point (triangle) in simple linear regression.While the six regular observations (circles) form a linear trend, the leverage point deviates from this trend and is outlying in the x space.The amount of outlyingness of the leverage point is smaller in the left picture, and more severe in the right plot.The LS regression line (solid line) is attracted by the leverage point in both situations, and the effect is much more severe in the right plot.Any diagnostics based on the size of the residuals would be misleading, since the residual of the leverage point is smaller than some of the residuals of the regular observations.We conclude that already a single observation may cause the LS estimator to break down.In general, we define the proportion of observations that cause an estimator to give arbitrary results as breakdown point.For an exact definition of the breakdown point see, e.g., Hampel (1971).Thus, LS regression has a breakdown point of 0 %.X y q q q q q q q regular observations leverage point X y q q q q q q q regular observations leverage point LS regression is straightforward and very fast to compute because of its explicit solution.In the software environment R it is available as function lm: > lm ( y ~x )

L 1 Regression
One of the main reasons for the strong influence of outliers on LS regression are the properties of the squared loss function in (1), since large residuals have an even magnified effect on the overall sum.A possibility to reduce this effect is to replace the squared residuals with their absolute values: This type of regression is called L 1 regression (Edgeworth, 1887).It was the first step in the direction of robustifying LS regression.However, it can be shown that L 1 regression does not achieve a higher breakdown point.Again, a single outlying observation in the x space may corrupt the results of the regression.L 1 regression comes at the cost of a more complex solution.As there exists no analytical way of solving the minimization problem (2), an iterative algorithm is required.A popular method is the simplex based Barrodale-Roberts algorithm (Barrodale and Roberts, 1973), an implementation of which can be found in R in the quantreg package (function rq):

M Estimators
M estimators for regression were introduced by Huber (1973), and they generalize the idea of LS regression in the following way: The squared loss function is replaced by a function ρ with certain properties.This leads to the M regression criterion: We see that for ρ(r) = r 2 , (3) is equivalent to (1) and for ρ(r) = |r| we obtain the L 1 regression criterion in (2).The criterion in (3) has the disadvantage of not being equivariant with respect to scaling of the response variable.In order to obtain the desired regression equivariance, the residuals r i are standardized using a robust scale estimator σ: Differentiation of the objective function ( 4) with respect to the regression coefficients β, and setting the result to zero leads to a system of p equations, where ψ = ρ .For LS regression we have ψ(r) = r, and in this case (5) are the usual normal equations.

If we define weights
depending on the scaled residuals, we obtain a system of weighted equations For LS regression we get w i = 1 for all i.In contrast, robust methods attempt to downweight outliers.Popular choices of the functions ρ, ψ, and the weights w i are shown in Table 1.M estimates based on Huber's ψ function have computational advantages, but they are sensitive to leverage points.A bounded ρ function as given by Tukey's biweight is able to handle large x-values in a better way, but it results in a more difficult computation.For details, see Maronna et al. (2006).Unless the explanatory variables follow a fixed design (Mizera and Müller, 1999), the breakdown point of M estimators is 1/p, and thus for larger p it can become very small.In order to solve the system (7) with respect to β, an iterative procedure called iteratively reweighted least squares (IRWLS) can be applied.The method repeatedly solves (7) for β.The convergence is guaranteed for a decreasing weight function w(r) (Maronna  , 2006).Since there may exist many local minima, it is important to initialize the IR-WLS algorithm with a suitable starting value β0 .Furthermore, the solution depends on the choice of σ.
An alternative is the so-called H-algorithm, as presented by Huber and Dutter (1974) and Dutter (1977).The method iteratively performs LS regression using pseudo observations ỹi = ŷi + ψ(r i ), where ŷi are the fitted values obtained from the previous step and r i are the corresponding residuals.In each step, the estimates of β and σ are computed simultaneously.The procedure is repeated until convergence to obtain a final estimate.An implementation of the algorithm is given in the computer program LINWDR (see Dutter, 1976Dutter, , 1983)).
In R , the M estimator is available as function rlm in the package MASS, which uses the IRWLS algorithm: > require ( MASS ) > rlm ( y ~x , psi = psi .huber , method = " M " )

S Estimators
In the following we want to investigate regression estimators which are based on a robust scale measure σ: An S estimator βS is a solution of (8), with σ being a solution of where δ is a constant.Equation ( 9) can be solved iteratively.The solution σ is called an M estimator of scale.After that, the S estimator can be computed using IRWLS.Again, a reliable starting value is needed in order to obtain a global minimum in the procedure.
The S estimator is implemented in the R package robustbase: > require ( robustbase ) > lmrob .S (x , y , lmrob .control ()) Note that this function is not intended to be used on its own, because the S estimator has a low efficiency.The advantage of the S estimator is its high breakdown point of 50 %, and therefore it is used as an initial estimator for MM regression, see Section 2.7.

Least Trimmed Squares Regression
Another intuitive way of robustifying LS regression results from the following idea: If the sum of squared residuals in (1) is minimized including not all but only h < n observations, possible outliers would not affect the parameter estimate β.This leads to the least trimmed squares (LTS) estimator which was introduced in Rousseeuw (1984).The criterion can be written as follows: where H ⊆ {1, . . ., n} and |H| = h < n.Of course, the parameter h has to be chosen carefully, in order to not exclude too much of the information from the regression.Note that this criterion is equivalent to (8), if where r (i) are the ordered residuals.If h = n/2 + 1, a breakdown point of ( n/2 − p + 1)/n is obtained (see Rousseeuw, 1984).
A fast algorithm for the LTS regression estimator is available in the R package robustbase: > require ( robustbase ) > ltsReg ( y ~x )

Least Median of Squares Regression
The advantage of robustness of the median over the mean is well known and can be extended to the regression context: The LS criterion in (1) is equivalent to minimizing the mean of squared residuals.Rousseeuw (1984) introduced the least median of squares (LMS) estimator by replacing the mean with the median: This corresponds to (8) for σ = med {r i (β) 2 }.Like for LTS regression, an asymptotic breakdown point of 0.5 is obtained.
The LMS regression estimator is available in the R package MASS: > require ( MASS ) > lqs ( y ~x , method = " lms " ) However, a reliable solution in higher dimension is computationally expensive, and thus this estimator will not be used in the simulations in Section 3.

MM Estimators
MM estimators (Yohai, 1987) reach a high level of robustness as well as high (tunable) efficiency, by combining the properties of M estimators and S estimators.Let β0 be an S estimator, and let σ be the corresponding M estimator of scale, solving (9) for β = β0 .The MM estimator is then defined as local solution of obtained with IRWLS and initial value β0 .A implementation of MM estimators is available in the package MASS (function rlm).A fast alternative can be found in the R package robustbase: > require ( robustbase ) > lmrob ( y ~x )

Comparison
The robust regression methods listed in the previous sections have been intensively studied during the last years (decades).Accordingly, a lot is known about their theoretical properties, in particular about their robustness properties.These properties, however, can be partially weakened by the algorithms for the computation of the estimators.In R we can find fast algorithms, like a fast algorithm for LTS regression (Rousseeuw and Van Driessen, 2006), or a fast algorithm for regression S estimators (Salibian- Barrera and Yohai, 2006) (and therefore for MM estimators).Without these fast implementations it would be impossible to deal with larger data sets, but one has to be aware that the algorithms not necessarily find the global optimum.This is in particular problematic if the number of explanatory variables gets larger.In the following we compare the previously described regression methods in simulated data examples, where the true regression parameters are known.

Different Contamination Schemes
In this simulation experiment we generate the data as follows: x i ∼ N (0, I p ) and i ∼ N (0, 0.5), for i = 1, . . ., n, β j = 1/ √ p for j = 1, . . .p, and thus β = 1, This setting is referred to as uncontaminated data.In a second setting we create vertical outliers by changing the first n out elements of the error term to i ∼ N (10, 0.1), and y i = x i β + i , for i = 1, . . ., n out .
The third setting refers to leverage points, where the first n out observations of the explanatory variables and the response are replaced with x i = (10, . . ., 10) and where a is orthogonal to β, and is constructed using a vector v with entries (−1) j by a = v − (v β)β, normalized to unit norm.For p = 1 we take a = −1.The leverage points are thus placed on the least favorable position, see also Figure 5.
In this simulation we take n = 200 observations, and n out = 20 are replaced by either vertical outliers or leverage points.The number p of explanatory variables is taken from the set {1, 5, 10, 25, 50}.For the three scenarios, we apply the described regression methods in m = 100 replications.As a measure of performance we use the mean squared error between the estimated parameters in the l-th repetition, β(l) , and the true parameters β: The results are shown in Figure 2. The upper plot panels represent the uncontaminated case, the middle plot panels are for the vertical outliers, and the lower plot panels for the leverage points.The right panels zoom into the left figures to show some details.The legend is placed in the lower left panel.In the uncontaminated case we see that, independent of the dimension, LS regression gives the smallest MSE (as expected), closely followed by M and MM regression.L 1 and LTS regression are slightly worse, and S regression is the worst among these estimators due to its low efficiency.When including vertical outliers, LS regression is only reliable for p = 1.This situation is a special case, caused by the specific design of the simulated data.For the other methods we see that MM regression gives the best results, closely followed by LTS.For the scenario with leverage points, MM estimation gives again the best performance, followed by LTS and S estimation.Their resulting MSEs are even close to the uncontaminated case.Figure 3 (lower left panel) shows that LS, L 1 and M regression break down, however, one gets the impression that with increasing dimension these regression methods improve.This is not correct for the following reasons.Since the leverage points were included in the orthogonal direction to β, the worst possible breakdown of a regression estimator is a vector of estimated regression coefficients β orthogonal to β, i.e. β β = 0. Since the simulation design is without intercept, for p = 1 we have β = 1 and the worst possible estimation β = −1.The resulting expected MSE according to (13) is 4, which, due to Figure 3, is indeed attained for LS regression.For larger p, the expected MSE gives in the worst possible situation if we assume that β has norm 1. LS regression is indeed close to these values, which have to decrease with increasing dimension.
Figure 3 shows the average computation times in seconds of the regression methods (left: original scale, right: log-scale), using a standard personal computer.With an increasing number of explanatory variables, the time for computing the S estimator increases rapidly.Since the MM estimator is based on the S estimator, its computing time is almost the same.The computation time for LTS also increases exponentially, but it is much lower in absolute terms.The time increase for the other estimators is much smaller.

Breakdown
We use a very similar simulation design as above to evaluate the regression methods (algorithms) for their breakdown behavior.The idea is to include an increasing amount of Austrian Journal of Statistics, Vol. 41 (2012) contamination.The most severe contamination is to include leverage points, and thus the simulation design is as follows: x i ∼ N (0, I p ) and i ∼ N (0, 0.1), for i = 1, . . ., n, β j = 1/ √ p for j = 1, . . .p, and thus β = 1, The first n out values of the response and the explanatory variables are replaced with x i = (10, . . ., 10) , and y i = x i a, for i = 1, . . ., n out , where a is taken as before, and n out = f • n with f = 0, 0.05, . . ., 0.5 and n = 200.
Then the MSE from ( 13) is computed over m = 100 repetitions of the simulation.Figure 4 shows the results (legend in the bottom panels).Theoretically, the breakdown point of the estimators S, MM and LTS, is 50 %, and also the default parameters in their R implementations are set to achieve this maximum possible breakdown point.Practically, the breakdown depends on the simulation setting, and mainly on the implemented algorithm.We can see that with increasing number of explanatory variables, the breakdown occurs much earlier.The S, MM and LTS estimator show a very comparable performance.On the other hand, LS regression breaks down already with 5 % contamination.It is interesting that for p = 50, L 1 and M regression are still robust against 5 % contamination (leverage points).This might be due to the special geometry in higher-dimensional spaces.In a final simulation example we demonstrate that the breakdown of a regression line can indeed depend on the regression problem.We use the simulation design described in this subsection for 25 % contamination and p = 1.Here the leverage points are not placed on one point but spread along the orthogonal direction for visual purposes.Figure 5 shows a generated data set according to this setting.In the left plot we use i ∼ N (0, 0.5), as in the simulation of Section 3.1, in the right plot we increase the error variance using

Conclusions
The R package robustbase offers various possibilities for robust regression, in particular routines for computing the MM and LTS regression estimators (lmrob and ltsReg).These functions are implementations of fast algorithms (Salibian- Barrera and Yohai, 2006;Rousseeuw and Van Driessen, 2006), which can handle regression problems with a reasonably large number of explanatory variables.As we have shown by simulations, the theoretical robustness properties of these estimators can be far too optimistic in practice, which is due to the algorithm but also due to the problem at hand.Especially in higher dimension the estimators can break down already at moderate levels of contamination, much lower than 50 %.This could be avoided by changing the default parameters of the routines-a task that is not straightforward to the average user.This does not mean that robust regression becomes useless in higher dimension; in contrary: Thanks to the fast algorithms it is possible to obtain results in reasonable time, but these are a compromise between computation time and optimal robustness properties.

Figure 1 :
Figure 1: Effect of a single leverage point (triangle) on the LS regression line.

Figure
Figure Mean squared errors between estimated and true regression parameters for different numbers of explanatory variables.The upper panels are for the uncontaminated case, the middle panels for vertical outliers, and the lower panes for leverage points.The right panels zoom into interesting parts of the left panels.The legend is placed in the lower left panel.

Figure 3 :
Figure 3: Average computation time (in seconds) depending on the number of explanatory variables; left: original scale, right: log-scale.

Figure 5 :
Figure 5: Simulated data example with 25 % leverage points.MM and LTS regression is robust in the left panel, but fails in the right panel where the residual variance has been increased.

Table 1 :
Different ρ functions, together with the corresponding derivatives ψ and the resulting weights w.