On Robust Estimation of Power Spectra

Let us consider the problem of robust spectral density estimation. Conventional methods of estimating the spectral density function are not robust in the presence of outlying observations. We present several methods of robustly estimating the spectral density function which are insensitive to outliers. The proposed methods are applied to simulated and real data and the results are compared. As a special practical application we focus on the frequency-domain analysis of short-term heart rate variability measurements of diabetes patients.


INTRODUCTION
In practice, statistical models only rarely describe a given situation exactly. One particularly important kind of inadequacy stems from the fact that the errors are usually assumed to have a Gaussian distribution. But actual data frequently contain a small fraction of unusual data points, or "outliers", which are not consistent with the Gaussian assumption. Such data could in principle be modelled by distributions which have a nearly Gaussian shape in the central region and tails which are considerably "heavier" or "thicker" than those of a Gaussian distribution corresponding to the central part. But accurate estimation of the tails of such distributions requires sample sizes which are much larger than one normally encounters in statistical practice. Thus, one will usually not be able to use fully efficient estimates, because the exact structure of a heavy-tailed nearly Gaussian density is not known.
On the other hand, it is often argued that the careful statistician will examine the data, discover the outliers, take appropriate ad hoc remedial action (such as removing or downweighting questionable data points) and then apply conventional least squares or Gaussian maximum-likelihood procedures. The presumption here is that the ensuing inferences are more or less correct. Responsible and serious investigators do indeed often proceed this way with considerable success. For some outstanding examples see, for instance, Daniel and Wood (1971). However, there are serious shortcomings to this approach. First, the ad hoc nature of the procedures obviates the possibility of providing precise distributional characterizations of the final estimates (such as estimator efficiency under both Gaussian and alternative distributions). Second, the amount of effort, time and insight required to spot all but the most obvious outliers is often quite considerable (see again Daniel and Wood, 1971), and often the sheer bulk of data to be processed or real time requirements essentially rule out careful ad hoc approaches.
What is desired are statistical procedures whose performance may be evaluated for various distributions and which remain quite good for a broad class of underlying distributions (including the Gaussian and heavy-tailed nearby alternatives) but which are not necessarily best for any one of them. Such procedures are called robust.
There now exists a large and rapidly growing statistical literature dealing with robust estimation. A large portion of this literature treats location and linear regression models with independent and identically distributed errors. A few contributions deal with robust estimation of covariance matrices. Research on robust estimation in the time series context has lagged behind, and understandably so, in view of the increased difficulties imposed by dependency between the data points. Furthermore, in spectrum analysis it is sometimes claimed that, although outliers in time series may frequently occur, conventional methods of estimating power spectra have enough inherent robustness to make special methods unnecessary. This claim is based on the assumptions that one is usually interested primarily in the shape of a power spectrum and that the shape is quite insensitive to all but extremely bad contamination. This may be true for spectra consisting primarily of a narrow band or low frequency component with little else of interest. As we will show below, this is not the case when one is looking for comparatively low amplitude peaks in a wide-range spectrum, a situation frequently encountered in practice. We will also show that outliers in the original series need not be extremely large with respect to the scale of the observed process before they can obscure important features of the corresponding power spectrum. They only need to be large compared to the scale of the prediction residuals (sometimes called the innovations process) and may, therefore, be easily overlooked.
In the next section, we review some of the central notions of robustness and discuss theoretical and computational aspects of M-estimates. This appealing class of estimates was introduced by Huber for location (1964) and regression (1973) and is motivated by maximum-likelihood procedures. Section 3 deals with the lack of robustness of both parametric and non-parametric estimates of the spectrum. Section 4 recalls the conventional motivation for prewhitening techniques and proposes a general approach for robustly estimating spectral densities via robust prewhitening combined with the downweighting of outliers in the whitened process.
In Sections 5 and 6, two particular tactical approaches are suggested for carrying out the above strategy. Both procedures are modifications of the ordinary M-estimate approach to regression, and can be regarded as (perhaps approximate) solutions to an intuitively appealing minimization problem. Both procedures have proved to be successful over a wide variety of data sets with real and simulated data, as we demonstrate in Section 7.

ROBUSTNESS CONCEPTS AND M-ESTIMATES
To provide motivation and focus for the remainder of the paper this section summarizes some basic notions and background material on robustness. In addition robust maximum-likelihood type estimates (M-estimates) and generalized M-estimates (GM-estimates) for regression are discussed. The presentation is of necessity brief, but missing details are available in the references cited.

Robustness Concepts
The loose definition of a robust estimate given in the previous section can be generalized. Suppose F is the nominal model distribution and {G_i} a family of potential true observation distributions where each G_i is not too far from F in some appropriate metric. Then an estimate T is robust at F if its performance is good over the family {F} ∪ {G_i}. The definition of the previous section and the focus of interest in this paper correspond to a Gaussian F and heavy-tailed G_i's. Although this is the case which has received the most attention in the literature, other situations may be of interest too. For example, Hogg (1974) has suggested that more attention should be devoted to short-tailed distributions, and Hampel (1968) has discussed some possibilities where F is some non-Gaussian parametric model, such as the Gamma distribution.
Whether or not we limit ourselves to the nominally Gaussian F with heavy-tailed alternatives, as we shall do for the remainder of this paper, more precise notions of robustness are useful. The papers by Tukey (1960), Huber (1964) and Hampel (1971) provide three important concepts of robustness, namely efficiency robustness, min-max robustness and qualitative robustness (see also Box, 1953, who introduced the term "robust" to the field of statistics).
An efficiency robust estimate is an estimate whose efficiency is high over {F} ∪ {G_i}. While the Cramér-Rao bound is usually adequate for asymptotic comparisons of efficiencies, in finite samples other standards of comparison, such as the best known estimate in a given class, may be appropriate (see, for example, Andrews et al., 1972), with suitably high efficiencies being somewhat arbitrarily taken as 90-95 per cent. In practice only a finite number of distributions can be considered, and Tukey (1975) has emphasized "tri-efficiency", selecting the Gaussian, a moderately contaminated normal and the Cauchy distribution to comprise {G_1, G_2, G_3}. Huber's min-max robust location estimates minimize the maximum asymptotic variance over uncountably infinite families of symmetric distributions. These estimates also yield good, though not best, efficiency robustness.
Hampel's qualitative robustness is a rather natural equicontinuity requirement. Suppose that F_0 and F are two distributions for independent and identically distributed observations, {T_n} a sequence of estimators indexed by sample size n, and H(T_n, F_0), H(T_n, F) the distributions of T_n under F_0 and F respectively. Then {T_n} is qualitatively robust at F_0 if the sequence of mappings F → H(T_n, F) is equicontinuous at F_0 in appropriate metrics. Hampel's (1971) results do not apply in the time series setting and suitable alternatives are needed. One possibility has been recently proposed by Kazakos and Gray (1979).
Estimates which possess all three of the above robustness properties are quite attractive. However, the applied statistician might well be quite happy with just efficiency robustness and qualitative robustness.
The Lack of Robustness of Least Squares
Least squares estimates are notoriously lacking in robustness. Consideration of the sample mean ȳ, for instance, reveals that the value of ȳ can be affected greatly by the presence of a single outlier. To use Tukey's (1975) data-oriented term, the sample mean lacks "resistance". Simple alternatives such as the sample median and trimmed means, as well as other modern robust estimates, are resistant.
The lack of efficiency robustness of ȳ towards contaminated normal models of the form F = (1 − γ) N(0, 1) + γ N(0, σ²) was pointed out by Tukey (1960), and Hampel (1968) showed that ȳ is not qualitatively robust. The sample mean does possess certain min-max properties (Mallows, 1975), but since ȳ is neither efficiency robust nor qualitatively robust these are of no particular interest from the robustness point of view.
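The inflation of the sampling variance of ȳ under such a mixture is easy to demonstrate numerically. The following sketch (all constants are illustrative, not taken from the paper) compares Monte Carlo variances of the sample mean and the sample median under the contaminated normal model.

```python
import numpy as np

# Sampling variance of the mean vs. the median under the contaminated
# normal F = (1-gamma)N(0,1) + gamma N(0,sigma^2).
rng = np.random.default_rng(0)
gamma, sigma = 0.1, 10.0          # illustrative contamination fraction and width
n, reps = 100, 4000

mean_vals, median_vals = [], []
for _ in range(reps):
    y = rng.standard_normal(n)
    contaminated = rng.random(n) < gamma
    y[contaminated] *= sigma      # outliers drawn from the wide component
    mean_vals.append(y.mean())
    median_vals.append(np.median(y))

var_mean = np.var(mean_vals)      # theory: about (0.9 + 0.1*sigma^2)/n
var_median = np.var(median_vals)  # barely inflated relative to the clean case
print(var_mean, var_median)
```

Even 10 per cent contamination is enough to make the mean several times more variable than the median here, while under the pure Gaussian model the ordering would be reversed.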
Similar comments apply to least squares estimates of coefficients in linear regression models. However, the regression situation is considerably more complicated than the location problem. One reason for this is discussed in the next section.

Regression M-estimates
There are by now quite a number of robust alternatives to least squares regression, including some which enjoy all three of the robustness properties we have mentioned. In fact, the rate at which new estimates are being proposed for some estimation problems may well exceed our ability to sort out the better estimates and understand the issues involved. For an indication of the wide variety of estimates that have been proposed for regression see, for example, Hill's recent (1977) Ph.D. thesis.
Despite the large number of proposed estimates for some problems, there has been almost no attention devoted to the robust estimation of power spectra. Thus it seems appropriate to determine those situations for which the conventional procedures are not robust, and to devise robust procedures for estimating spectral densities as needed.
Since both classes of estimates we will propose for robust estimation of power spectra make use of modifications of ordinary M-estimates for regression (Huber, 1973), these estimates are briefly described here.
Consider the usual linear model
y_i = x_i^T β + ε_i, i = 1, ..., n, (2.1)
where x_i^T = (x_{i1}, ..., x_{ip}). An M-estimate β̂ of β is a solution of
Σ_{i=1}^n x_i ψ{(y_i − x_i^T β̂)/ŝ} = 0, (2.2)
where ψ is a suitably chosen psi-function and ŝ is an estimate of scale. The choices ψ(t) = t and ψ(t) = sgn(t) yield the least squares and least absolute deviations estimates respectively, and in each of these cases β̂ does not depend on ŝ. However, for other choices of robustifying ψ(t) it is necessary to compute ŝ to ensure scale invariance for β̂. Probably the best method of doing this is to follow Huber's (1973, 1977) proposal to solve (2.2) simultaneously with the equation
(1/n) Σ_{i=1}^n ψ²{(y_i − x_i^T β̂)/ŝ} = A. (2.3)
The constant A is chosen so that ŝ converges to σ when the ε_i have a normal distribution with mean zero and variance σ².
Two popular choices for ψ are Huber's psi-function
ψ_H(t) = t for |t| ≤ c, ψ_H(t) = c sgn(t) for |t| > c,
and Tukey's bisquare function
ψ_BS(t) = t(1 − t²)² for |t| ≤ 1, ψ_BS(t) = 0 for |t| > 1.
It is known that ψ_BS yields higher efficiencies than ψ_H for extremely heavy-tailed distributions (Andrews et al., 1972).
Another form of ψ, which was used implicitly by statisticians and engineers long before psi-functions entered the statistical literature, is the hard rejection rule. It rejects all points which are more than 3σ away from the centre of the data, i.e.
ψ_R(t) = t for |t| < 3, ψ_R(t) = 0 for |t| ≥ 3.
Due to its discontinuities, however, ψ_R should not be used in robust regression. For monotone ψ, Yohai and Maronna (1977) have established the existence and uniqueness of solutions of (2.2) and (2.3) under certain assumptions, but non-monotone ψ such as ψ_BS give rise to multiple solutions and consequently require special computational strategies.
The estimating equation (2.2) can be regarded as a weighted-least-squares (WLS) equation in which the weights w_i = w{(y_i − x_i^T β̂)/ŝ}, with w(t) = ψ(t)/t, are data-dependent. For ψ_H, ψ_BS, or any other good ψ, large standardized residuals
r_i = (y_i − x_i^T β̂)/ŝ
produce small weights (of the order r_i^{−1} for ψ_H, and weights identically zero for ψ_BS when |r_i| > 1). Thus a small fraction of outliers will not have much influence on the estimate.
In addition to providing an intuitively appealing equivalent form of the M-estimate at the solution point, the WLS form suggests how to compute M-estimates iteratively using existing weighted-least-squares computer programs. Write (2.3) as well as (2.2) in WLS form and iterate both equations to approximate convergence, starting with β̂_LS and the robust initial scale estimate s_0 = median(|y_i − x_i^T β̂_LS|)/0.6745. This iterated-weighted-least-squares (IWLS) algorithm works well in practice provided ψ is monotone. A strategy to find the "correct" root of (2.2)-(2.3) for redescending ψ (such as ψ_BS) which works well in practice is as follows: first obtain an IWLS solution (β̂, ŝ) using a monotone ψ, and then compute an IWLS solution for a redescending ψ by using (β̂, ŝ) as a starting point and iterating only once or twice. Note that more than two iterations can be dangerous since the IWLS procedure need not converge for non-monotone ψ. Some other computational algorithms for solving (2.2)-(2.3) are discussed by Huber (1977). Hampel (1971, 1974) also introduced the breakdown point and the influence curve, both of which provide useful indicators of robustness (the former being global in nature while the latter is local).
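A minimal sketch of the IWLS idea may be helpful. For simplicity it holds the scale fixed at a robust initial estimate rather than iterating the scale equation (2.3) jointly, a common simplification; the function names, the tuning constant c = 1.345 and the simulated data are illustrative, not from the paper.

```python
import numpy as np

def huber_psi(t, c=1.345):
    """Huber's psi-function: linear in the middle, clipped at +/- c."""
    return np.clip(t, -c, c)

def iwls_huber(X, y, c=1.345, n_iter=50):
    """Huber regression M-estimate by iterated weighted least squares.
    The scale is held fixed at a robust initial (MAD-based) estimate
    instead of solving the joint scale equation (2.3)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]        # least squares start
    r = y - X @ beta
    s = np.median(np.abs(r - np.median(r))) / 0.6745   # robust initial scale
    for _ in range(n_iter):
        t = (y - X @ beta) / s
        w = np.where(t != 0, huber_psi(t, c) / t, 1.0) # w(t) = psi(t)/t
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, WX.T @ y)     # weighted normal eqns
    return beta

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([2.0, 3.0]) + rng.standard_normal(n)
y[:10] += 50.0                                         # 5% gross outliers in y
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
beta_m = iwls_huber(X, y)
```

With 5 per cent of the responses displaced by 50, the least squares intercept drifts far from its true value of 2, while the M-estimate stays close to (2, 3).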

Other Robustness Tools
The breakdown point of an estimate is essentially the largest fraction of the data, over all combinations for each fraction, which can be taken to ±∞ without taking the (asymptotic) value of the estimate to the boundary of the parameter space. In location problems the natural boundary points are ±∞. For estimating a correlation the points ±1 might appear to be the natural boundary points; however, under the hypothesis that the true correlation is non-zero, it would be reasonable to include the origin as well as ±1 in the boundary set. The influence curve measures the effect on an estimate of adding a data point x in a limiting manner as the sample size n → ∞. Given a distribution function F and a statistic T which is itself a function of a distribution function in the neighbourhood of F, the influence curve of the statistic T at F is defined as
IC(x; T, F) = lim_{ε→0} [T{(1 − ε)F + ε δ_x} − T(F)]/ε,
where δ_x is the distribution function having all its mass at x. Hampel (1974) argues that a highly robust estimate should have the following two properties (among others): (i) the highest possible breakdown point (½ is an upper bound) and (ii) a bounded and continuous influence curve which "redescends" to zero.
Consider the special location estimate case of (2.2),
Σ_{i=1}^n ψ{(y_i − μ̂)/ŝ} = 0. (2.7)
Unlike in the general regression case, it is possible here to obtain an adequate robust ŝ directly from the data, for example
ŝ = median{|y_i − median{y_j}|}/0.6745. (2.8)
It turns out (Hampel, 1971) that the resulting μ̂ has a breakdown point of ½ provided ŝ has a breakdown point of ½ (which is the case for the estimate (2.8)). By way of comparison the sample mean has a breakdown point of zero, and an α-trimmed mean has breakdown point α. The influence curve IC(y; μ̂, F) turns out to be equal to ψ{(y − μ)/s} except for a scale factor depending on F, where μ and s are the asymptotic values of μ̂ and ŝ. Thus for location M-estimates continuity and boundedness of ψ will result in the influence curve being continuous and bounded. This in turn yields qualitative robustness (which is essentially a weak-continuity property) for μ̂. The sample mean lacks qualitative robustness because it has an unbounded influence curve. In the general regression problem the situation is more complicated and additional steps are needed to obtain bounded and continuous influence curves.

GM-estimates for Regression
If the possibility of contamination in the x_{ij} is ignored and only contamination of the y_i is allowed, then the breakdown point of the regression M-estimate is ½, as long as ψ is bounded.
However, allowing for the realistic possibility of contamination or gross errors in the x_{ij} yields a breakdown point of zero, a clearly unsatisfactory situation.
Computation of the influence curve for a regression M-estimate gives
IC(y, x; β̂, F) = B x ψ{(y − x^T β)/s},
where x^T = (x_1, ..., x_p), β and s are the asymptotic values of β̂ and ŝ, and B is a p × p matrix which depends upon the distribution F of (y, x^T). Even for monotone and bounded ψ this influence curve remains unbounded as a function of x, for each fixed y. Thus outliers in the independent variables can have a considerable influence on the estimate. In order to deal with this difficulty Hampel (1974) and Mallows (1975) have proposed the use of bounded influence regression estimates. Such estimates might naturally be called generalized M-estimates (GM-estimates). One version of GM-estimate is obtained by solving
Σ_{i=1}^n W(x_i) x_i ψ{(y_i − x_i^T β̂)/ŝ} = 0, (2.9)
where W(x_i) is a continuous scalar-valued weight function chosen so that W(x) x is bounded.
The scale estimate ŝ would be obtained by simultaneously solving (2.3) along with (2.9).
The influence curve for β̂_GM is then
IC(y, x; β̂_GM, F) = B W(x) x ψ{(y − x^T β)/s},
which is a bounded and continuous function of (y, x^T). A GM-estimate having such an influence curve is qualitatively robust. A good choice for W(x) is
W(x) = w{(x^T Ĉ^{-1} x)^{1/2}},
where w is a non-negative scalar-valued weight function which is monotone decreasing on [0, ∞), and Ĉ is a robust estimate of the covariance matrix for the x_i. The breakdown point of the resulting GM-estimate is positive, and is actually ½ if Ĉ has a breakdown point of ½.
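The following sketch illustrates a Mallows-type GM-estimate. It is not the authors' procedure: as a simple stand-in for the robust covariance estimate Ĉ it standardizes each carrier coordinatewise by median and MAD, and the particular weight function and tuning constants are illustrative.

```python
import numpy as np

def psi_huber(t, c=1.345):
    return np.clip(t, -c, c)

def gm_estimate(X, y, c=1.345, b=2.0, n_iter=50):
    """Mallows-type GM-estimate sketch: residual weights psi(t)/t combined
    with carrier weights W(x_i) that shrink outlying rows of X.  Carriers
    are standardized coordinatewise by median/MAD as a crude stand-in for
    a robust covariance estimate; b is an illustrative tuning constant."""
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0) / 0.6745
    mad[mad == 0] = 1.0                               # protect constant columns
    d = np.sqrt((((X - med) / mad) ** 2).sum(axis=1)) # robust carrier distance
    Wx = np.minimum(1.0, b / np.maximum(d, 1e-12))    # W(x): small for leverage pts
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    s = np.median(np.abs(r - np.median(r))) / 0.6745  # fixed robust scale
    for _ in range(n_iter):
        t = (y - X @ beta) / s
        wr = np.where(t != 0, psi_huber(t, c) / t, 1.0)
        WX = X * (Wx * wr)[:, None]
        beta = np.linalg.solve(X.T @ WX, WX.T @ y)
    return beta

rng = np.random.default_rng(2)
n = 100
x1 = rng.standard_normal(n)
x1[0] = 20.0                                          # gross error in the carrier
X = np.column_stack([np.ones(n), x1])
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(n)
y[0] = 0.0                                            # the leverage point misfits
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
beta_gm = gm_estimate(X, y)
```

A single bad leverage point is enough to wreck the least squares slope, while the carrier weights W(x_i) let the GM-estimate recover the slope of 2.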
Favourable small-sample efficiency robustness properties of GM-estimates have been reported by Denby and Larsen (1977) and Hill (1977). The use of GM-estimates for fitting autoregressions is discussed in Section 6.

THE POWER SPECTRUM AND THE NEED TO ESTIMATE IT ROBUSTLY
The simplest and most commonly used representation for a stationary time series {y_k} is the second-order characterization in terms of its mean, μ = E y_k, and its autocovariance sequence C(l) = cov(y_k, y_{k+l}), l = 0, 1, 2, .... Although μ and {C(l): l = 0, 1, 2, ...} provide a complete characterization in the Gaussian case, this is not true for an arbitrary stationary series, for which specification of all finite-order joint distribution functions is required. Since the estimation of higher order properties of a time series is rather cumbersome, most investigators content themselves with estimating the second-order properties such as the mean and the covariances C(l) for a finite number of lags l = 0, 1, ..., L, or equivalently the spectrum as defined below.

The Non-parametric Spectrum
Any second-order stationary process {y_t} has the spectral representation
y_t = ∫_{−1/2}^{1/2} exp(i2πft) dZ(f), (3.1)
where Z(f), −½ ≤ f ≤ ½, has orthogonal increments. The process Z(f) defines a monotone increasing function F through
F(f) = E |Z(f) − Z(−½)|². (3.2)
The function F(f) is called the spectral distribution function, and, when it exists, its derivative S(f) = F′(f) is called the spectral density function of {y_t}. Other commonly used terms for S(f) are power spectral density, spectrum and power spectrum.
It follows from (3.1) and the orthogonal increments property of Z(f) that, when the spectral density S(f) exists,
C(l) = ∫_{−1/2}^{1/2} exp(i2πfl) S(f) df. (3.3)
Thus the covariances C(l), l = 0, ±1, ..., are the Fourier coefficients of S(f) and so
S(f) = Σ_{l=−∞}^{∞} C(l) exp(−i2πfl), (3.4)
which is periodic, with period unity. Many textbooks (especially those oriented towards application) define the spectral density directly by (3.4) without ever mentioning the spectral decomposition (3.1). However, the latter is quite basic and in essence says that any stationary time series can be interpreted as the limit of finite sums of sinusoids of the form
Σ_j {a_j cos(2πf_j t) + b_j sin(2πf_j t)}
with uncorrelated random amplitudes a_j and b_j.

The Autoregressive Spectrum
There are three major time domain models for stationary time series: the moving average model, the autoregressive model, and the mixed autoregressive moving average model. While the finite mixed model often provides the most parsimonious representation and has therefore been used extensively in time domain modelling (see Box and Jenkins, 1970), the finite autoregressive model has received by far the most attention in the context of spectrum estimation. One reason for this is the computational attractiveness of the associated linear least-squares estimation problem, whereas the presence of moving average terms leads to non-linear problems. Furthermore, under fairly reasonable assumptions (a continuous and non-zero spectrum) a purely non-deterministic stationary time series has the infinite order autoregressive representation
x_k = Σ_{j=1}^∞ φ_j x_{k−j} + ε_k, (3.5)
where Σ_{j=1}^∞ |φ_j| < ∞ and the ε_k are uncorrelated random variables with mean 0 and finite variance σ_ε². The ε_k were labelled the innovations process by Wold (1954). The spectrum is of the form
S(f) = σ_ε² / |1 − Σ_{j=1}^∞ φ_j exp(−i2πfj)|². (3.6)
For any process {x_k} with a continuous spectrum S(f) there exists a finite-order autoregressive process whose spectrum approximates S(f) arbitrarily well uniformly in f (this is essentially the Weierstrass theorem for approximating a continuous function on a closed interval with polynomials, in this case trigonometric polynomials).
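The finite-order version of (3.6) is easy to evaluate numerically. The sketch below (illustrative code, not from the paper) computes the AR(1) spectrum and checks, via (3.3) with l = 0, that it integrates to the process variance σ_ε²/(1 − φ²).

```python
import numpy as np

def ar_spectrum(phi, sigma2, freqs):
    """Finite-order version of (3.6):
    S(f) = sigma2 / |1 - sum_j phi_j exp(-i 2 pi f j)|^2."""
    phi = np.asarray(phi, float)
    lags = np.arange(1, len(phi) + 1)
    H = 1.0 - (phi * np.exp(-1j * 2 * np.pi * np.outer(freqs, lags))).sum(axis=1)
    return sigma2 / np.abs(H) ** 2

# AR(1) with phi = 0.5 and sigma2 = 1: the total power should equal the
# process variance sigma2/(1 - phi^2) = 4/3.
freqs = np.linspace(-0.5, 0.5, 20001)
S = ar_spectrum([0.5], 1.0, freqs)
df = freqs[1] - freqs[0]
total_power = ((S[:-1] + S[1:]) / 2).sum() * df   # trapezoidal integral over f
```

For positive φ the spectrum peaks at f = 0, reflecting the low-frequency character of a positively correlated autoregression.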

Non-parametric Estimation of the Spectrum
The most frequently used method of estimating the spectrum today is non-parametric and is based on smoothing the periodogram. Given the observed time series x_k, 1 ≤ k ≤ n, the basic steps in constructing the estimate are as follows. A data window or taper d_k, 1 ≤ k ≤ n, which slopes gently towards zero at both ends, is used to form the modified data x̃_k = d_k x_k. The purpose of this tapering is to reduce "leakage" (see, for example, Bloomfield, 1976) associated with the finiteness of any actual set of data. Then the discrete Fourier transform of x̃_k is computed using the fast Fourier transform, and the periodogram is formed. Since the periodogram is not a consistent estimate, it is smoothed, by locally averaging neighbouring periodogram ordinates, to get a respectable estimate of the power spectrum. It is also possible to reduce variability by averaging periodograms computed from different (often overlapping) time segments (see Welch, 1967).
1979] KLEINER, MARTIN AND THOMSON - Robust Estimation of Power Spectra
A good spectrum analysis by the above method usually requires computing several estimates based on different amounts of smoothing in order to exploit the trade-offs between bias and variance. The leakage problem can be alleviated by either applying one of the better tapering functions such as the cosine window, the Parzen window or the prolate spheroidal window (see Thomson, 1977), or through prewhitening (discussed in Section 4).
Further details concerning the above "direct" method (by way of contrast with the pre-fast Fourier transform approach based on using covariance estimates Ĉ(l) in (3.4)) may be found, for example, in the introductory text by Bloomfield (1976).
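The basic steps, taper, FFT periodogram and smoothing, can be sketched as follows for a simulated AR(1) series; the full cosine (Hanning) taper, the smoothing span and the normalization are illustrative choices, not those of the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4096
# simulate a Gaussian AR(1), x_k = 0.9 x_{k-1} + e_k, dropping a transient
e = rng.standard_normal(n + 200)
x = np.zeros(n + 200)
for k in range(1, n + 200):
    x[k] = 0.9 * x[k - 1] + e[k]
x = x[200:]

# step 1: taper the (demeaned) data with a full cosine window d_k
d = 0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(n) / n))
xt = d * (x - x.mean())

# step 2: periodogram via the FFT, normalized so E[I(f)] is roughly S(f)
I = np.abs(np.fft.rfft(xt)) ** 2 / (d ** 2).sum()

# step 3: smooth the periodogram with a simple moving average
m = 15
S_hat = np.convolve(I, np.ones(2 * m + 1) / (2 * m + 1), mode="same")
freqs = np.fft.rfftfreq(n)
```

For this AR(1) the true spectrum falls by more than a factor of 30 from low to high frequencies, and the smoothed estimate reproduces that shape.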

The Parametric Autoregressive Estimate
In recent years there has been considerable interest in estimating the spectrum by fitting a finite-order autoregressive model, an approach heavily based on the approximation property mentioned in Section 3.2. Thus least-squares estimates φ̂_1, φ̂_2, ..., φ̂_p, σ̂_ε² are obtained from the data and S(f) is estimated by the autoregressive spectral density estimate
Ŝ(f) = σ̂_ε² / |1 − Σ_{j=1}^p φ̂_j exp(−i2πfj)|², (3.10)
computed at a grid of frequencies.
For the autoregressive estimate there is of course the problem of choosing a good approximating order p, which is analogous to choosing the amount of smoothing in the smoothed-periodogram approach.
Two commonly used and reasonable order-selection methods are Akaike's (1969) final prediction error criterion and Parzen's (1974) criterion autoregressive transfer function (CAT).
In Akaike's stopping rule the order p̂ of the autoregression is the value of p which minimizes the final prediction error
FPE(p) = s_p² (n + p + 1)/(n − p − 1),
where s_p² is an estimate of the residual variance of the fitted autoregressive process of order p. The FPE stopping rule is an approximate version of the AIC maximum likelihood criterion of Akaike (1974) under the assumption of a Gaussian process.
Parzen's criterion chooses the order p which minimizes CAT(p), a function of ŝ_∞² and s_p², where ŝ_∞² is an estimate of the residual variance of the infinite-order autoregressive fit and s_p² is the estimated residual variance of the autoregressive fit of order p. Although Parzen's and Akaike's criteria are philosophically different, Parzen's experience (1974), as well as ours, shows that the two criteria yield very similar results in practice.
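An order-selection loop based on the FPE is easy to sketch. The least squares fitting routine and the simulated AR(2) below are illustrative, and the FPE penalty is written in the (n + p + 1)/(n − p − 1) form, one common variant.

```python
import numpy as np

def fit_ar_ls(x, p):
    """Least squares AR(p) fit: regress x_k on x_{k-1}, ..., x_{k-p}."""
    X = np.column_stack([x[p - j: len(x) - j] for j in range(1, p + 1)])
    z = x[p:]
    phi = np.linalg.lstsq(X, z, rcond=None)[0]
    return phi, (z - X @ phi).var()

rng = np.random.default_rng(4)
n = 2000
e = rng.standard_normal(n + 100)
x = np.zeros(n + 100)
for k in range(2, n + 100):
    x[k] = 0.75 * x[k - 1] - 0.5 * x[k - 2] + e[k]   # a true AR(2)
x = x[100:]

# FPE(p) = s_p^2 (n+p+1)/(n-p-1); pick the order minimizing it
orders = list(range(1, 11))
fpe = []
for p in orders:
    _, s2 = fit_ar_ls(x, p)
    fpe.append(s2 * (n + p + 1) / (n - p - 1))
p_hat = orders[int(np.argmin(fpe))]
```

The residual variance drops sharply from p = 1 to p = 2 and then flattens, so the penalty term decides among the higher orders.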

Time Series Outliers and the Need for Robust Spectrum Estimates
Most of the work on robust estimation to date has been concerned with distributional robustness for problems where the observations are independent and identically distributed. In time series analysis realistic modelling of outliers poses a more serious challenge because the observations are no longer independent and outliers can occur under a wide variety of circumstances or be due to very different mechanisms. In the face of these difficulties the problem is to specify some simple but reasonably representative outlier models which are tractable, and which lead to the construction of good robust estimates. Fox (1972) introduced two distinct kinds of outliers for autoregressive time series: innovations outliers (IO) and additive outliers (AO). Their probabilistic models, in their simplest forms, can be described in terms of a single heavy-tailed univariate distribution. The IO model is of the form
y_k = Σ_{j=1}^p φ_j y_{k−j} + ε_k, (3.13)
where the distribution F_ε of the innovations ε_k is heavy-tailed. The AO model is
y_k = x_k + v_k, (3.14)
where x_k is a Gaussian pth-order autoregression and the v_k are independent and identically distributed with common distribution F_v = (1 − γ) δ_0 + γH; δ_0 is the degenerate distribution having all its mass at the origin and H is a heavy-tailed symmetric distribution with mean 0 and variance σ². Thus P(v_k = 0) = 1 − γ and, if γ is small, the x_k are observed perfectly most of the time. As we will argue below and show by a striking example in Section 7, H need not be heavy-tailed relative to the scale of the process in order to have a serious impact on spectral density estimates. The above models are obviously rather simple and it is easy to imagine several modifications which might result in more realistic outlier generating mechanisms. Of these perhaps the most important would be to drop the assumption that the v_k are independent while maintaining P(v_k = 0) = 1 − γ with γ small. This would account for patchy outliers, which occur rather frequently in practice.
On the other hand, even the simplest outlier models raise serious enough difficulties and questions to warrant considerable study.
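A small simulation makes the point about scales concrete. In the sketch below (all constants illustrative), the additive outliers are three times the innovations scale yet sit well inside the scale of the highly correlated process, so they would not stand out in a plot of the raw series.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
sigma_e = 0.1                                 # innovations scale
e = rng.normal(0.0, sigma_e, n + 100)
x = np.zeros(n + 100)
for k in range(1, n + 100):
    x[k] = 0.98 * x[k - 1] + e[k]             # highly correlated AR(1)
x = x[100:]

# AO model (3.14): v_k = 0 with probability 1 - gamma, else drawn from H;
# here H = N(0, (3 sigma_e)^2), large vs. the innovations, small vs. x
gamma = 0.02
v = np.where(rng.random(n) < gamma, rng.normal(0.0, 3 * sigma_e, n), 0.0)
y = x + v

scale_process = x.std()                       # roughly sigma_e/sqrt(1 - 0.98^2)
outlier_scale = 3 * sigma_e
```

Here the process scale is about five times σ_ε, so outliers of size 3σ_ε are visually unremarkable while being badly inconsistent with the one-step prediction errors.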
Innovations outliers do not pose serious problems for spectrum estimation when the shape of the spectrum is of primary interest. Only the scaling factor requires a robust estimate. For large sample autoregressive spectral density estimates of the form (3.10) this follows from the fact that the asymptotic covariance matrix of the least squares estimates of the autoregressive coefficients is independent of F_ε. This was pointed out by Whittle (1962). Additive outliers are quite another matter. If, for example, a single outlier of amplitude A is added to the series at time j, the periodogram is raised by A²/n and retains its shape only if the oscillatory term (A/n) 2 Re{X(f_l) exp(i2πf_l j)} is considerably smaller than S_x(f_l) for all frequencies f_l, where X(f) denotes the discrete Fourier transform of the series. This means that if the spectrum varies over a wide range (e.g. if the spectrum contains a large portion of its total power in a narrow frequency band and has small but important peaks elsewhere), even relatively minor outliers can cause serious problems, because their effect, although small compared to the total power, may be large relative to the power in small peaks. Two single outliers will give rise to oscillatory behaviour in the low-amplitude portion of the spectrum. Example 4 in Section 7 illustrates this point as well as the fact that outliers can be quite small and still have serious consequences. More than two outliers will give rise to more complicated behaviour which is best described in terms of the mean and variance of the spectrum estimate. Thus Ŝ_y(f) can be badly biased and also have a seriously inflated variance. This can be the case even for σ_v² ≪ σ_x² if S_x(f) is small in the region of interest (as it often is).
The bias and increased variability in the periodogram will result in a corresponding distortion and increased variability in the smoothed periodogram as an estimate of the power spectrum.
Under the additive outlier model (3.14) the least squares estimates φ̂_LS computed from the observed series y_k will be biased. This bias can be disastrous for large v_k's, even if only a small fraction of the v_k's are non-zero. For example, when x_k is an autoregression of order 1 with parameter φ,
φ̂_LS → φ_0 = φ σ_x²/(σ_x² + σ_v²) almost surely,
where σ_v² = γσ². Therefore the asymptotic bias of φ̂_LS for independent v_k is
φ_0 − φ = −φ σ_v²/(σ_x² + σ_v²), (3.18)
which only vanishes for φ = 0 or σ_x²/σ_v² → ∞, but can be quite substantial even for moderate amounts of contamination. For instance if γ = 0.1 and σ²/σ_x² = 10 the bias is 50 per cent. The variance of φ̂_LS is also seriously inflated. Similar results can be obtained for higher order processes and there is a corresponding impact on the autoregressive spectrum estimate.
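The limiting value φ_0 = φσ_x²/(σ_x² + σ_v²) can be checked numerically. In the sketch below the contamination is taken to be always-on Gaussian noise (in effect γ = 1), which has the same second-order structure; with σ_v² = σ_x² the 50 per cent bias appears directly.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
phi = 0.6
e = rng.standard_normal(n + 100)
x = np.zeros(n + 100)
for k in range(1, n + 100):
    x[k] = phi * x[k - 1] + e[k]              # AR(1) with unit innovations
x = x[100:]
sigma_x2 = 1.0 / (1.0 - phi ** 2)             # true process variance

# additive contamination as large as the process itself: sigma_v^2 = sigma_x^2
sigma_v2 = sigma_x2
y = x + rng.normal(0.0, np.sqrt(sigma_v2), n)

# lag-1 least squares coefficient from the contaminated series
phi_ls = np.dot(y[1:], y[:-1]) / np.dot(y[:-1], y[:-1])
phi_limit = phi * sigma_x2 / (sigma_x2 + sigma_v2)   # = 0.3, a 50% bias
```

The sample coefficient lands near 0.3 rather than 0.6, in agreement with (3.18).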
We have seen that the use of either the smoothed periodogram approach or the autoregressive method to estimate the power spectrum can yield poor results when the data contain additive outliers. Furthermore, additive outliers in the original series need not be extremely large with respect to the scale of the observed process to obscure important features of the corresponding power spectrum. They only need to be large compared to the scale of the innovations process and may therefore be easily overlooked.
In the following sections we discuss alternative ways of estimating power spectra which are not easily disturbed by such outliers.

Prewhitening
Prewhitening is an all-too-often neglected technique for reducing the dynamic range of the spectrum by filtering the data so as to transform the original time series into a series whose spectrum is nearly flat (or white). Its main advantage is that smoothed periodogram estimates based on the whitened series do not suffer nearly as much from leakage as do estimates based on the original data. Thus fine structure in the spectrum, such as two close low-amplitude peaks, is more likely to be spotted in such an estimate.
The need for prewhitening along with its utility and interpretation was discussed by Blackman and Tukey (1958). In a later article Tukey (1967) says: "If low frequencies are 10³, 10⁴ or 10⁵ times as active as high ones, a not infrequent phenomenon in physical situations, even a fairly good window is too leaky for comfortable use. The cure is not to go in for fancier windows, but rather to preprocess the data towards a flatter spectrum, to analyse this 'prewhitened' series, and then to adjust its estimated spectrum estimates for the easily computable effects of preprocessing." The smoothed periodogram and autoregressive spectral density estimates are limiting versions of the prewhitening approach. The former amounts to no prewhitening and the latter corresponds to total prewhitening if the process is truly a finite-order autoregression with known order p. Since this is an unlikely event in practice, the autoregressive approach is almost always an approximation. Once we openly admit the approximate nature of the finite-order autoregressive spectral density, we should be rather uncomfortable with the use of σ̂_ε² in the numerator of the estimate (3.6), because the approximation may very well miss important low-amplitude details.
Thus we prefer an autoregressive prewhitened spectral density estimate having the general character suggested by Tukey. The main aspects of prewhitening which have changed since Blackman and Tukey (1958) are (i) the willingness to fit autoregressive models with orders much higher than two or three (using modern numerical algorithms for the least squares fitting procedure, orders of 20, 30 or higher may even be appropriate for some applications; Thomson, 1977), and (ii) the availability of reasonable order selection criteria such as AIC(p) and CAT(p).
It is not entirely clear to us at this point how to use order selection rules in computing the hybrid parametric/non-parametric estimate (4.1). Though our preference is for orders somewhat lower than those obtained with a minimum AIC(p) or CAT(p) criterion, provided the data contain no outliers, we will not belabour the point here since precise estimation of the order seems relatively unimportant. What is important is that the order be high enough so that the residual spectrum is relatively flat.
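As an illustration of a prewhitened estimate of the general form (4.1), the following Python sketch fits an autoregression by least squares, forms a lightly smoothed periodogram of the residuals and divides by the squared gain of the whitening filter. The function name, the raw Fourier sum and the simple moving-average smoother are illustrative stand-ins (in particular for the tapering and modified Parzen weights described later), not the authors' implementation.

```python
import numpy as np

def ar_prewhitened_spectrum(y, p, n_freq=256):
    """Hybrid estimate in the spirit of (4.1): smoothed periodogram of the
    AR(p) residuals divided by the squared gain of the whitening filter.
    The moving-average smoother is an illustrative simplification."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    n = len(y)
    # least squares fit of y_k on (y_{k-1}, ..., y_{k-p})
    X = np.column_stack([y[p - j - 1:n - j - 1] for j in range(p)])
    phi, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    r = y[p:] - X @ phi                      # prediction residuals
    f = np.arange(n_freq) / (2.0 * n_freq)   # frequencies in [0, 0.5)
    # raw periodogram of the residuals, lightly smoothed
    t = np.arange(len(r))
    per = np.abs(np.array([np.sum(r * np.exp(-2j * np.pi * fk * t))
                           for fk in f])) ** 2 / len(r)
    s_r = np.convolve(per, np.ones(5) / 5.0, mode="same")
    # squared gain |1 - sum_j phi_j exp(-2 pi i j f)|^2 of the whitening filter
    j = np.arange(1, p + 1)
    H = np.abs(1.0 - np.array([np.sum(phi * np.exp(-2j * np.pi * fk * j))
                               for fk in f])) ** 2
    return f, s_r / H
```

Dividing the nearly flat residual spectrum by the squared gain reintroduces the spectral shape captured by the autoregression.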

Robust Prewhitening
When a time series contains outliers, the advantages of prewhitening as a technique of estimating the spectral density function become even more apparent.
As pointed out in Section 3, it is not sufficient to trim or delete only those time series outliers which are large relative to the scale of the process, but one must also contend with outliers which are only large relative to the scale σ_ε of the innovations or prediction error process. Since prewhitening procedures involve the computation of linear predictor coefficients and associated prediction residuals, they provide a natural framework for dealing with such outliers in a manner that will result in a robust spectral density estimate.
Our basic proposal is to robustify both the numerator S_r(f) and the denominator |H(f)|² in (4.1). Thus we shall make use of (i) robust estimates φ̂₁, φ̂₂, …, φ̂_p of the autoregressive/linear predictor parameters, and (ii) a procedure for obtaining prediction residuals which are free of outliers but are otherwise relatively unaltered; these "clean" residuals are then used in computing S_r(f). In the next two sections we describe two distinct methods for carrying out this idea.

The Estimation Procedure
For simplicity let us assume that the observed values y_k have mean zero (translation-equivariant versions can be obtained in a straightforward manner). Then the iterative procedure is initiated by fitting (by least squares) the autoregression (5.1) to the data y_k, k = 1, 2, …, n, where p is chosen according to one of the stopping rules described in Section 3.4. The estimate φ̂ᵀ = (φ̂₁, φ̂₂, …, φ̂_p) is then used in a robust filtering algorithm which operates on the original observations y_k to produce {x̂_k}, a robust estimate of the underlying process {x_k}. In the robust filtering algorithm (5.2), x̂ᵀ_{k−1} = (x̂_{k−1}, x̂_{k−2}, …, x̂_{k−p}) and ŝ² is an estimate of the innovations variance σ_ε² of x_k; c is a constant chosen to obtain high efficiencies both for γ = P(v_k ≠ 0) = 0 and for γ > 0 with γ not too large. Finally S_r(f) is computed from the {r_k} and φ̂ is inserted in (4.2) to give the robust spectrum estimate (4.1). For computational details see Thomson (1977).
If {x_k} is a finite autoregressive process, φ in (5.2) is known and ŝ is replaced by an appropriate fixed value, the above recursion is a member of a class of robust filters studied by Masreliez and Martin (1977). These filters are robust in the sense that their mean-squared error is bounded over a broad family of distributions including the Gaussian and heavy-tailed alternatives. For the special case of a first-order autoregression this bound has been shown to be quite close to the minimum attainable mean-squared error over a wide range of parameter values.
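The flavour of the filtering recursion can be conveyed by a simplified sketch in which outlying one-step prediction errors are shrunk through a bounded ψ-function before the filtered value is formed. Here a Huber ψ stands in for the ψ-function of the paper, the scale s is taken as known, and the state-covariance updates of the full Masreliez-Martin filter are omitted; the code is illustrative only.

```python
import numpy as np

def huber_psi(t, c=1.5):
    """Huber's psi-function: the identity in the centre, bounded in the tails."""
    return np.clip(t, -c, c)

def robust_filter(y, phi, s, c=1.5):
    """Simplified sketch of robust filtering in the spirit of (5.2)-(5.3):
    each one-step prediction error is passed through a bounded psi-function,
    so a gross outlier cannot enter the filtered series or the residuals."""
    y = np.asarray(y, dtype=float)
    p = len(phi)
    xhat = list(y[:p])               # initialise with the first p raw values
    resid = []
    for k in range(p, len(y)):
        # one-step prediction from the cleaned past values
        pred = float(np.dot(phi, xhat[k - p:k][::-1]))
        e = y[k] - pred              # raw prediction error
        r = s * huber_psi(e / s, c)  # cleaned (bounded) residual
        resid.append(r)
        xhat.append(pred + r)        # filtered value replaces the observation
    return np.array(xhat), np.array(resid)
```

Because the residuals are bounded by c·s, an additive outlier perturbs the filtered series by at most c·s rather than by its full magnitude.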

An Optimality Property of x̂_k
The algorithm described above is actually an approximate solution to an intuitively appealing minimization problem. Replace φ̂ by φ′ and denote the estimate (5.2) by x̂_k(φ′).
Then we define an estimate φ̂ as a solution of the minimization problem

min over φ′, s′ of  Σ_{k=p+1}^{n} ρ{(y_k − x̂ᵀ_{k−1}(φ′) φ′)/s′} + (n − p) ln s′.   (5.5)

If we assume that the derivative term G_{k−1}(φ) φ is small compared with x̂_{k−1}(φ) for all k = p+1, …, n and all φ corresponding to a stationary process (heuristic arguments and Monte Carlo results indicate that this is a reasonable assumption), then (5.5) reduces to the estimating equations

Σ_{k=p+1}^{n} x̂_{k−1}(φ̂) ψ{(y_k − x̂ᵀ_{k−1}(φ̂) φ̂)/s} = 0.   (5.8)

It is now easy to see that (5.8) is equivalent to the estimation procedure described in the previous section (with the exception of the scale parameter s, which is estimated differently).
Just rearrange (5.2) and use the fact that the φ̂ satisfy equations of the Yule-Walker form obtained by replacing the x_k's with the x̂_k's (note that from now on the φ arguments of x̂_k(φ̂) are suppressed for convenience). Thus robust filtering is approximately equivalent to iteratively computing the estimates (5.8) to convergence, while being much simpler computationally.
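For reference, the Yule-Walker computation mentioned above might be sketched as follows; applying it to the filtered values {x̂_k} in place of the raw data yields coefficients of the form used in the iteration. The code is a generic sketch, not the authors' implementation.

```python
import numpy as np

def yule_walker(x, p):
    """Yule-Walker estimates of AR(p) coefficients from a (possibly cleaned)
    series: solve the Toeplitz system R phi = r built from biased sample
    autocovariances, which guarantees a positive definite system."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    # sample autocovariances r_0, ..., r_p (biased, divisor n)
    r = np.array([np.dot(x[:n - j], x[j:]) / n for j in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:])
```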
The price paid for using the robust filter estimate when x_k is Gaussian and v_k = 0 is increased variability and some bias. These are obviously difficult to evaluate analytically in view of the complexity of the estimator; some limited Monte Carlo experience indicates that the loss in mean-squared-error efficiency will be small in the Gaussian situation. Some rough variational arguments described in Appendix A indicate that the bias in the spectrum is small and is proportional to the true spectrum.

ROBUST AUTOREGRESSIVE SPECTRUM ESTIMATION
This approach obtains estimates of the spectral density by first robustly fitting an autoregressive model to the data and then operating on the resulting residuals.
For simplicity let us again assume that the observed values y_k have mean zero. Then the basic idea behind our estimation procedure for the robust autoregressive coefficients φ̂ᵀ = (φ̂₁, …, φ̂_p) is a generalization of Huber's (1973) M-estimates for regression so that the summands of the estimating equations are bounded and continuous functions of the data. This in turn results in an influence function having the same properties, the desirability of which has been stressed by Hampel (1974). The estimating equations (6.1) and (6.2) involve a weight function W(·), a non-negative continuous function such that W(y_i) y_i is bounded (cf. Section 2.5).
The p × p matrix Ĉ⁻¹ is a robust estimate of the inverse of the covariance matrix C of the underlying process {x_k}; it has to be estimated from the observed data y_k. The constant A is chosen so that s is an asymptotically unbiased estimate of the standard deviation σ when the data are Gaussian (some specific values are given by Martin and Zeh, 1978). Equations (6.1) and (6.2) can be solved by using iterated weighted least squares (IWLS); their solutions are called generalized M-estimates (GM-estimates). It can be shown that GM-estimates are qualitatively robust and have a positive breakdown point under reasonable assumptions. It is somewhat disappointing that the breakdown point is bounded above by 1/(p + 1) under the AO model (3.14); it can be higher if the outliers v_k are not independent.
For W(y_i) = 1 the GM-estimates reduce to the autoregressive version of ordinary M-estimates for regression. These are, however, quite non-robust toward additive outliers, having asymptotic biases nearly as large as those of least squares estimates (Denby and Martin, 1979).
Computational details concerning the IWLS procedure and the estimate of Ĉ⁻¹ can be found in Martin and Zeh (1978). This reference also gives examples of applications of GM-estimates to real data.
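A minimal sketch of the IWLS idea behind GM-estimation is given below: both large scaled residuals and outlying regressor vectors are downweighted through bounded weight functions, so no single observation can dominate the fit. The specific weights used here (Huber weights on MAD-scaled residuals, and on the norm of the regressor relative to its median) are illustrative choices, not the exact scheme of Martin and Zeh (1978).

```python
import numpy as np

def gm_estimate_ar(y, p, c=1.5, n_iter=20):
    """Illustrative GM-estimation of AR(p) coefficients by iterated weighted
    least squares, downweighting both outlying responses and outlying
    regressor rows."""
    y = np.asarray(y, dtype=float) - np.median(y)
    n = len(y)
    # regressor matrix: row for index k contains (y_{k-1}, ..., y_{k-p})
    X = np.column_stack([y[p - j - 1:n - j - 1] for j in range(p)])
    z = y[p:]
    phi = np.linalg.lstsq(X, z, rcond=None)[0]        # least squares start
    for _ in range(n_iter):
        resid = z - X @ phi
        s = 1.483 * np.median(np.abs(resid)) + 1e-12  # robust scale (MAD)
        u = np.abs(resid) / s
        w_r = np.where(u <= c, 1.0, c / np.maximum(u, 1e-12))  # residual weight
        norms = np.linalg.norm(X, axis=1)
        d = norms / (np.median(norms) + 1e-12)
        w_x = np.minimum(1.0, c / np.maximum(d, 1e-12))        # regressor weight
        sw = np.sqrt(w_r * w_x)
        phi = np.linalg.lstsq(sw[:, None] * X, sw * z, rcond=None)[0]
    return phi
```

On data contaminated by additive outliers such weighted fits stay far closer to the true coefficients than plain least squares, whose bias toward zero can be severe.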
To compute the robust spectral density (4.1) the GM-estimates φ̂ are inserted into (4.2).
The residuals r_i used to compute the numerator S_r(f) can be obtained either from

r_i = s W(y_{i−1}) ψ{(y_i − yᵀ_{i−1} φ̂)/s}   (6.4)

or by inserting φ̂ into (5.2) and computing the residuals according to (5.3). The latter version appears to give somewhat better results.

EXAMPLES
Four examples are shown here, two with real data, two with artificial data.
For all examples exactly the same computational procedures are used and all parameters are kept fixed at the same values. This is done to demonstrate that the procedures can be used for quite different types of data without the need for constantly adjusting parameters. Of course this does not preclude the use of specific knowledge about a particular set of data to fine-tune the procedures and parameters for even better results.
All examples show a basic set of four to five curves (sometimes augmented by further figures) which includes: (1) At least half of the raw data in the time domain.
(2) The non-robust classical periodogram-based power spectrum (always shown as a dashed line): The time series, minus its mean, is tapered by the prolate spheroidal window with parameter c = 4π (see Thomson, 1977); then its discrete Fourier transform is computed at 1,025 (regardless of sample size) equidistant frequencies between 0 and 0·5 (the Nyquist frequency), the Fourier transform is squared and then smoothed over 61 neighbouring values by modified Parzen weights (Cleveland and Parzen, 1975). Every eighth point is plotted.
(3) The power spectrum from the robust filtering approach (always shown as a solid line): The ψ-function used represents a middle ground between ψ_BS (which it resembles for q around 1·75) and the hard rejection rule (which it approaches for q > 5). In the central region it stays close to the line ψ(t) = t. In examples 1, 3 and 4 the value of q was 3·1, in example 2 the value of q was 2·9. Four iterations were used, unless noted otherwise. The residual spectrum S_r(f) of (4.1) was computed in the same way as the periodogram-based spectrum but with a prolate parameter c = π.
(4) The power spectrum from the robust autoregressive approach (always shown as a solid line): The GM-estimates φ̂ of the autoregressive coefficients are computed as described in Section 6. The robust estimates φ̂ are inserted into the robust filter (5.3), using the same ψ-function and values of q as in (3) and an estimate of the innovations scale s from (6.2); S_r(f) is computed in the same way as in the periodogram-based method. Experience shows that the final spectrum is not too dependent on the order of the autoregressive fit; since we have not yet been able to find a very reliable robust stopping rule, we always picked order 7. This is a compromise between a too low order, which might not represent an adequate dynamic range for some data sets of interest, and a too high order, which could overfit and would lower the breakdown point of the GM-estimation procedure. Order 7 followed by robust filtering proved to be adequate even for spectra where max_f S(f)/min_f S(f) was as large as 10⁵-10¹⁰; for most cases orders 2-3 would suffice.
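The classical estimate of item (2) can be sketched as follows. A Hann taper and uniform smoothing weights stand in for the prolate spheroidal window and the modified Parzen weights, and the frequency grid is fixed independently of the sample size, as described above; all names and window choices are illustrative.

```python
import numpy as np

def tapered_smoothed_periodogram(y, n_freq=1025, m=61):
    """Sketch of the non-robust estimate: taper, Fourier transform on a
    fixed frequency grid in [0, 0.5], square, then smooth over m
    neighbouring values."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    n = len(y)
    taper = 0.5 * (1.0 - np.cos(2.0 * np.pi * (np.arange(n) + 0.5) / n))
    yt = y * taper / np.sqrt(np.mean(taper ** 2))   # normalise taper power
    f = np.linspace(0.0, 0.5, n_freq)               # grid independent of n
    k = np.arange(n)
    per = np.abs(np.exp(-2j * np.pi * np.outer(f, k)) @ yt) ** 2 / n
    kernel = np.ones(m) / m
    # reflect the ends so the moving average is defined everywhere
    padded = np.concatenate([per[m - 1:0:-1], per, per[-2:-m - 1:-1]])
    return f, np.convolve(padded, kernel, mode="same")[m - 1:m - 1 + n_freq]
```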
Example 1
This is an AO-like artificial example composed of three autoregressive processes u_k, w_k and z_k driven by ε_k ~ N(0, 1): u_k is a process with a large amount of power at very low frequencies, w_k constitutes a narrow-band component with a peak at 0·17 and a bandwidth of 0·1, and z_k has a peak at 0·22 with a bandwidth of 0·1. We standardize u_k, w_k and z_k to unit variance, compute y_k = √75 u_k + w_k + z_k, k = 1, 2, …, n, and then standardize y_k to unit variance. Fig. 1A shows the first 512 points of a realization of length 1,024 of y_k, while Fig. 1B shows the first 512 points of the same realization, but with additive noise from the mixture 0·9 δ₀ + 0·1 N(0, 10). Fig. 1C shows, on a logarithmic scale, the theoretical spectrum (dotted line), the periodogram-based spectrum and the spectrum resulting from the robust filtering approach. Fig. 1D shows the same spectra with the robust filtering spectrum replaced by the robust autoregressive spectrum. Both robust spectrum estimates follow the theoretical spectrum reasonably well, while the periodogram-based spectrum does not bear any resemblance to the true spectrum. Fig. 1E shows the spectrum based on the transfer function of the least squares autoregressive model of order 7 (dashed line), the spectrum based on the robust autoregressive fit of order 7 (solid line) and the theoretical spectrum (dotted line). Here, as in many other instances, the autoregressive spectrum is not able to reproduce the shape of the theoretical spectrum, and one would have to go to much higher orders of autoregression (which would lower the breakdown point of the GM-estimates considerably). Note, however, the large gain in range of the robust spectrum over the least squares method.
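The contamination mechanism used for Fig. 1B, in which each observation receives an additive Gaussian disturbance with small probability and is otherwise untouched, can be sketched as follows. Reading N(0, 10) as a variance of 10 is an assumption, and the function and its defaults are illustrative.

```python
import numpy as np

def add_ao_noise(x, gamma=0.1, sd=np.sqrt(10.0), rng=None):
    """Additive-outlier contamination: with probability gamma each point
    receives an independent N(0, sd**2) disturbance; with probability
    1 - gamma it is left untouched.  This realises the mixture
    (1 - gamma)*delta_0 + gamma*N(0, 10), with N(0, 10) read as
    variance 10 (an assumption)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    hit = rng.random(len(x)) < gamma
    return x + hit * rng.normal(0.0, sd, size=len(x))
```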

Example 2
This synthetic example is taken from Brillinger (1973) and consists of 512 values of white noise with five adjacent outliers close to the centre of the data set. Fig. 2A shows the data in the time domain. Fig. 2B shows the theoretical spectrum without the outliers (dotted), the periodogram-based spectrum and the robust filter spectrum. Only one iteration was used here in the robust filtering approach, because the stopping rule chose an autoregressive representation of order 0. In Fig. 2C the robust autoregressive spectrum replaces the robust filter spectrum. The periodogram-based spectrum is heavily influenced by the five outliers, while both robust techniques produce more or less flat (i.e. white) spectra, as does Brillinger's technique. However, while his technique would break down if, for instance, confronted with two sets of outliers not close to each other, the techniques described here would still work in that case.

Example 3
This geophysical data set consists of 900 adjusted voltage measurements taken every 2 sec from a relative ionosphere opacity meter stationed at the Siple observation post in Antarctica. This data set not only has a relatively large number of isolated outliers (due to electrical discharges caused by snow blowing across the antenna), but also a very large transient signal resulting from calibration. Fig. 3A shows the 900 data points in the time domain; the first outlier is not shown completely (it runs as low as -300 over two data points); the other outliers are around -250. The transient signal due to calibration creates enormous outliers; note the break in the axis. Fig. 3B shows the periodogram-based spectrum and the robust filtering spectrum; Fig. 3C the periodogram-based spectrum and the robust autoregressive spectrum. While the periodogram-based spectrum shows nothing but the effects of the calibration period, both robust techniques show several distinctive peaks at frequencies 0·07, 0·18, 0·20, 0·24 and 0·29. These peaks could be due to periodic modulation of the ionospheric conductivity, which could be caused by hydromagnetic waves and particle precipitation from the magnetosphere into the ionosphere (Lanzerotti et al., 1977).

Example 4
This telecommunications example consists of 1,000 measurements of diameter distortions of a dielectric-lined waveguide, a copper-plated circular steel tube capable of carrying over 200,000 simultaneous telephone conversations. Here relatively minor outliers, barely detectable on a plot of the original data, hide important features of a spectrum with a wide dynamic range. Fig. 4A shows the data in the time domain, with part of the curve enlarged. The two outliers are close together shortly after 2 m and are both due to a malfunction of the recording instrument. These outliers are extremely small on the scale of the original process; however, since the ratio of the prediction variance to the process variance for the autoregressive fit of order 7 is about 10⁻⁵, they are very large on the scale of the innovations and distort the power spectrum accordingly. Fig. 4B shows first differences of the data; here the outliers do not appear to be negligible. Fig. 4C shows the periodogram-based spectrum and the spectrum based on robust filtering. Only one iteration was used because, owing to round-off errors on our computer (a Honeywell 6078 with 36 bits per word), further iteration would not have yielded significantly better results. Fig. 4D shows the periodogram-based spectrum and the robust autoregressive spectrum. Above frequency 0·1, the periodogram-based spectrum just reflects the effects of the two outliers, while the two robust spectra exhibit a much wider range and show several distinctive peaks. For example, the peak at 0·074 is twice the basic frequency of the straightener applied to the steel tubes housing the waveguide (see Boyd and Thomson, 1976); the peak at 0·136 is twice the basic frequency of the extruder from which the plastic interior lining of the waveguide flows (the basic frequencies of both the straightener and the extruder generally do not manifest themselves in the spectrum of diameter distortions, but could be seen in the spectrum of the curvature).
The relatively minor peak at 0·111 is at three times the basic frequency of the straightener, the large peak at 0·185 at five times, while the broad hump around 0·265 consists of contributions at seven times the basic frequency of the straightener and at four times the basic frequency of the extruder. The peak at 0·445 is at twelve times the straightener's fundamental frequency.
SUMMARY AND CONCLUSIONS
Two classes of robust estimates of the power spectral density have been introduced: a robust filtering procedure and a robust autoregressive approach. Each method provides solutions of an intuitively appealing minimization problem; these solutions are exact in the autoregressive case and approximate for robust filtering. Both methods make use of robust prewhitening and weighting of prediction residuals in an essential way. The procedures have proved to be successful over a wide variety of data sets.
The robust filtering approach and the robust autoregressive procedure only differ in the way "good" autoregressive coefficients for prewhitening are computed: robust filtering computes its filtered values (and their autoregressive representation) iteratively, while the autoregressive technique gets the autoregressive coefficients by iterated weighted least squares. For small p both procedures use roughly the same amount of core and computer time and give equally good results.
Both estimates have rather obvious and appealing resistance properties, as evidenced by the examples in Section 7. Their robustness properties have, however, not yet been evaluated in detail. This will be a challenging task for several reasons. First, the results of Hampel (1971) on qualitative robustness do not apply to the general time series case and new theory in this area will be needed. Second, although finite sample mean-squared-error efficiency might be investigated by Monte Carlo techniques, computing asymptotic efficiencies requires knowledge of the asymptotic distributions of our hybrid estimates, which remain to be determined. It might be worth noting that the GM-estimates φ̂_GM have a positive breakdown point and a purely asymptotic form of qualitative robustness (Martin, 1979). We also believe this to be true for the robust filtering estimates φ̂ and {x̂_k}, but have not yet given a formal proof. Thus it is expected that the resulting robust prewhitened estimates have positive breakdown points and at least an asymptotic form of qualitative robustness.
There is of course always the question of whether simpler and faster methods could not do the job just as well. Earlier (Kleiner et al., 1976) we tried to use non-linear smoothers such as local M-estimates to remove outliers before computing the spectrum; this failed because such smoothers usually remove not only the outliers but interesting higher frequencies as well. As mentioned previously, there is generally no need for sophisticated robust techniques if the spectrum is reasonably flat. However, robust techniques may be needed in lightly coloured situations and are definitely a necessity if the spectrum has a dynamic range of more than three or four decades. Such spectra are commonplace in physical sciences such as telecommunications (where wide-range spectra arise, for example, in transmitting signals by cable, waveguide or optical fibres) and geophysics (e.g. in measuring phenomena of the magnetosphere); they also occur in the field of medicine, where Dumermuth (1971) showed spectra of electroencephalograms ranging over six to seven decades, and even economics, where, for instance, the spectrum of the Federal Reserve Board Index of Industrial Production ranges over five decades (Nerlove, 1964).

APPENDIX A
We give a simple perturbation argument which, for the uncontaminated case, indicates that the bias in the spectrum introduced by using the robust filtering procedure is small and is proportional to the true spectrum.
Let

x̂_k = x_k + δ_k,  ψ(t) = t + λ(t),   (A.1)

and assume that the original process x_k can be represented by a finite-order autoregressive process (A.2). With this notation the autocovariances {ρ̂_i}, from which we can compute the spectrum Ŝ(f) of x̂_k, are of the form

ρ̂_i = ρ_i + 2E x_{k+i} δ_k + E δ_{k+i} δ_k.   (A.3)

If the scale of the process x_k is much larger than that of the errors, the dominant error term in (A.3) is 2E x_{k+i} δ_k. The most useful measure for characterizing a spectral density estimate is the relative error E ΔS(f)/S(f); this can be obtained not only from the processes {x} and {x̂}, but also by prewhitening both the data {x} and the error process {δ} using the autoregressive coefficients φ₁, φ₂, …, φ_p. Prewhitening {x} in this way results in {ε}, while for the error sequence one …

Taking first the specific problem of spectrum estimation with which the authors have been concerned, they have demonstrated convincingly that contamination due to outliers in the time domain can cause substantial bias in any frequency-domain analysis throughout the whole range of the spectrum. The examples the authors have produced are very dramatic and convincing. The modification of standard robustness procedures that they have developed and applied to this problem seems to me to have dealt with it satisfactorily. I can claim no great expertise or experience in spectral analysis, but it seems to me that their basic technique is the right way to estimate the spectral density. The authors have been able to draw on the extensive experience of Bell Telephone Laboratories, both in spectral estimation and in robustness. They have gone a long way towards producing a definitive treatment of this problem.