Robust Regression Analysis of Longitudinal Data under Censoring

We consider regression analysis of longitudinal data when the temporal correlation is modeled by an autoregressive process. Robust R estimators of regression and autoregressive parameters are obtained. Our estimators are valid under censoring caused by detection limits. Efficient computation of the estimators is discussed. Theoretical and simulation studies of the estimators are presented. We analyze a real data set on air pollution using our methodology.


Introduction
We consider a time series {X t : t ≥ 1} and an associated series of covariate vectors {Z t : t ≥ 1}, in q , for some q ≥ 1.We postulate a linear model of the form X t = β 0 + β Z t + α t , where the model errors α t , is a stationary autoregressive time series of order p, for some p ≥ 1 : α t = φ 1 α t−1 + • • • + φ p α t−p + t , where { t } are i.i.d.from a symmetric continuous distribution, and α s = 0, for s ≤ 0. We assume that the coefficients φ satisfy the usual invertibility condition.
In some situations, the exact values of X t may be unavailable due to censoring.In this paper, we develop our methodology for the situation when the censoring is to the left which may occur when the values of the time series X t fall below a detection limit D t .Thus, the observed data consists of X c t = X t ∨ D t and the censoring indicators δ t = I(D t ≤ X t ).Our method can easily be adopted to the case of right censored data by simple changes in various formulas leading to our estimators.They can also be extended to the case when an observation is doubly censored but it requires more work.
The number of papers dealing with some form of censored time series data is limited (Vasudaven et al., 1996) although Zeger and Brookmeyer (1986) argue that censoring may occur naturally in longitudinal studies when there are detection limits on the observation that are being collected in time.They took a fully parametric approach to the above problem and fitted a Gaussian error model using the maximum likelihood approach via an EM type algorithm.In this paper, we take an estimating equation approach that is a robustified form of the least squares estimating function.
There is a sizable literature on R-estimators in the regression context (Hettmansperger and McKean, 2011) with i.i.d.errors but not for auto-regressive errors.Furthermore, an added complication arises due to censoring.As use the "approximate unbiasedness" principle of reweighting data to construct our R-type estimating equation for the set of regression and the error autoregression parameters.Since this estimating equation involve ranks of quantities that are not computable due to censoring, the re-weighting principle is used again to compute the approximate ranks to be used in the estimating equation.
The rest of the paper is organized as follows.Section 2 describes our estimation method and discusses an efficient method of computing the estimator.We also present a model based resampling procedure for making inference using our estimators.Section 3 presents results from a number of simulation studies demonstrating the performance of our estimators.We illustrate our methodology on a real dataset dealing with air pollution in Section 4. The paper ends with a discussion section (Section 5).

The estimators
We develop two different estimators of the regression and the error autoregression parameters.In the first approach, we ignore the fact that the models errors are dependent and estimate the regression parameters first which are then used to estimate the autoregressive parameters.In the second approach, a joint objective function of both sets of parameters is formed.

The complete data case
First we consider the situation when there is no censoring so that we have fully observed the time series X t , 1 ≤ t ≤ n.We form an estimating equation that is partly based on ranks of certain model residuals and is therefore yields more robust estimators than the corresponding least squares estimators.Define, for any vector b ∈ q , the residuals for the linear model part a t (b) : = X t − b T Z t , for 1 ≤ t ≤ n, and a t (b) : = 0, for t ≤ 0. Note that even though the true errors α t are not independent, they are still ergodic and thus we could use the same estimating equation of a traditional R-estimation in this context.Thus, we could obtain a "quick and dirty" consistent estimator of β by minimizing the objective function where Here φ 1 is defined on (0, 1) such that φ 1 is monotonic and φ 2 1 < ∞, and φ 1 = n −1 n i=1 φ 1 (i/(n + 1)) .After obtaining an estimate β, the intercept parameter can be (robustly) estimated as Having estimated the regression parameters, a similar objective function can now be formed to estimate the autoregressive part of the error time series: where e t (h) : In the rest of the paper, we refer to these estimators as "partial R estimators".
Finally, we consider a second approach where estimators are obtained by minimizing a joint objective function.For b 0 ∈ , b ∈ q and h ∈ p , define the model residuals (as a function of b 0 , b and h), by e t (b 0 , b, h) : and e t (b 0 , b, h) : = 0, for t ≤ 0. We then form a joint objective function where R and φ 3 are as before.In accordance with the earlier name, the resulting estimators obtained by minimizing D J will be called the "full R-estimators".

Modification for censored data
Next, we will describe how to modify this estimating function in presence of left-censoring.
Both the estimating function and the ranks have to be computed on the basis of observed data.However, in order to avoid any selection bias, the contribution of such a term has to be re-weighted by the corresponding inverse selection probability.These probabilities will have to be estimated from appropriate models fitted to the censoring distribution.
The censored data version of the objective function corresponding to (1) is of the form with Wt , where the presence of δ t indicates that the corresponding a t (b) is computable from the available data and W t is the corresponding selection weight that is described later.The quantity R c denotes a modified "rank" that accounts for censoring.To motivate the definition of rank for censored data, it will be useful to first consider the following sum representation for R(a t (b)) in the full (uncensored) data situation: Using the same re-weighting principle as before, we can define an "estimated rank" that is computable from left censored data by , for any t with δ t = 1.
In the censored data case, a robust estimator of the intercept term can be obtained as where F α is an estimator of the distribution function of the stationary distribution of the a t based on the same re-weighting principle Next note that, in order to calculate the residual function e t (h) corresponding to time t, we need to have complete (i.e., uncensored) observations on X j , t − p ≤ j ≤ t.Thus, the modified objective function for estimating φ will be of the form where the W s are as before, for any t with t j=t−p δ j = 1.In the same way, the joint objective function can be modified to account for censored data as where W j are the same as before and R c is similarly defined as in (7).

Computation of the estimator
The estimating function can be optimized using a general purpose optimizer such as "optim" or "optimize" in R. For the p = q = 1 case, we can perform a grid search algorithm which we describe below.
Note that D 1,M (b) is a linear function in b in regions where the ranks R c (a t (b)) do not change for t's with δ t = 1.For b to be such a change point, there will exist pairs of integers t and i such that In the same way, D c 2,M can be maximized over the grid of points

Computation of the weights
The weights W j are estimates of the conditional (given X j and Z j ) cumulative distribution function of the D j , i.e., W j = P r{D j ≤ X j |X j , Z j }.The simplest way to estimate these will be to consider the corresponding (forward in time) hazard of where the equality is an independent censoring assumption that we impose throughout this paper.We now need a regression model on these hazard rates on the C. A flexible model is given by Aalen's linear hazards model that admits a closed form estimates of these quantities; see, e.g., Aalen (1989) or Datta and Satten (2002).A special case of these models, where we assume that λ c (c|Z j ) is free of the covariate Z j , also yields the simplest choice of W j obtained by the Kaplan-Meier estimator of the survival function based on the (right censored) C j evaluated at (−X j ) − .

Bootstrap inference
While it is possible to develop a large sample theory for our estimators by combining elements from Hettmansperger and McKean (2011), Datta and Satten (2002), and Datta and Beck (2014), we prefer to use model based resampling to perform statistical inference since it avoids the use of tuning parameter that is necessary for smoothing based asymptotic variance estimation.
Having fitted the regression model to the original data, we compute the model residuals ).The rest of the bootstrap procedure is standard leading to either a percentile based confidence interval or a large sample normality based confidence interval where the asymptotic variance is replaced by the empirical variance of independent replicates of bootstrapped estimates of the parameter of interest.

Simulation studies
We consider a single continuous covariate Z that is generated from a N (0, .64)distribution; we simulate the errors from an AR(1) model α t = 0.5α t−1 + t .A number of distributions for the were investigated.The regression parameters used for the simulation were β 0 = 2 and β 1 = 1.The censoring times were generated as D = 1/(E + 3) + m, where E has a standard exponential distribution and m ∈ is chosen to control the censoring rate.Three choices of the censoring rates were used.The bias and variance of the estimators were empirically estimated based on M = 1000 Monte Carlo samples each.
Table 1 reports the results of the simulation.Some general trends are observed from this table.The joint estimators of the slope and autoregressive parameters have better performance that the corresponding partial estimators.The joint estimator of intercept parameter, on the other hand, exhibits substantial bias which worsens with the amount of censoring; however it is corrected by the modified estimator in all cases.The standard deviation, of the estimators of slope and autoregression parameters increases, albeit slightly, with the censoring level.
Next we compute the empirical coverage of the bootstrap based confidence intervals using the percentile methods, as well as the standardized statistic using bootstrap based variance estimate.The coverage appears to be very good when there is no censoring and is adequate even with 30% censoring (Table 2).Overall, confidence intervals using standardized statistics have better coverage as expected.

An application
We illustrate our methodology using monthly data on the chemical composition of atmospheric deposition of dry NH 4 collected by the Environmental Measurements Laboratory between 1977 and early 1980 at a number of sites in the United States (Toonkel 1981); the same data set was used by Zeger and Brookmeyer (1986) to illustrate their method.Since there are lower detection limits of the assays, the data is left-censored.Altogether, there were 43 data points out of which 6 were left-censored.In addition, there were three observations that were missing; in order to accommodate them into our framework, we treat them as left censored by an artificially set high value (larger than all the observed values in the data set).The data were log-transformed as in Zeger and Brookmeyer (1986).A plot of the log-transformed data is shown in Figure 1; where the incomplete observations are denoted by the symbol "+".
One of the main research question was to determine if the amount of deposit is increasing with time.To that end we fit a regression model taking time as a covariate of the form X t = β 0 + β 1 t + α t , where α t was modeled by an AR(1) process.The resulting parameter estimates are given in Table 1; we include the parametric estimates by Zeger and Brookmeyer (ZB) for comparison.
While the two sets of parametric estimates are similar, the robust estimate of the intercept term is slightly smaller.More importantly, all confidence intervals for the slope term include 0 indicating that there is no significant change of the deposit levels with time.In a sense, the fact that the different analyses yielded the same scientific conclusion is reassuring.

Discussion
We introduce a robust and relatively model free technique of analyzing temporally correlated data that are subject to left-censoring.Although, our formulas are given here for left censored data, it is a mater of triviality to change them for right censored.With additional effort, it may be possible to extend the basic regression technique to other form of incomplete data.Another technical extension will be to consider other form of temporal correlation structures for the longitudinal responses.This paper presents a number of novel components which may be useful for other incomplete data problems.In particular, the concept of an approximate or estimated "rank" may be applied to extend other rank based inference for censored data settings.
b) is piecewise linear and continuous on B. Hence β can be obtained by maximizing D 1,M on the grid of points B. If D 1,M (b) is constant on [b j , b j+1 ], we will take β to be the midpoint (b j + b j+1 )/2.

Figure 1 :
Figure 1: Log-transformed data of monthly deposition of dry NH 4 ; the incomplete data values are indicated by "+".

Table 1 :
Performance of various estimators as measured by empirical bias and standard deviation in a simulation experiment.

Table 2 :
Empirical coverages of bootstrap confidence intervals in simulated data

Table 3 :
Parameter estimates for the Dry Deposition data