Empirical Density Estimation for Interval Censored Data

This paper is concerned with the nonparametric estimation of a density function when the data are incomplete due to interval censoring. The Nadaraya-Watson kernel density estimator is modified to allow description of such interval data. An interactive R application is developed to explore different estimates.

The observed data Y_1, …, Y_n are independent realizations of Y, conditionally on unobservable complete realizations of X. The goal is to estimate the density f of X.
The above general framework covers a wide variety of incomplete-data situations. It is studied in the literature mainly in the context of nonparametric estimation of the distribution function of X. Some references include Laird (1978), Liu and Zhu (2007), and Goutis (1997).
We relate this problem to the problem of smoothing over a discrete distribution. Our goal is to achieve a more precise approximation for small probabilities and, at the same time, to find a good visual representation of the true density.
The existing literature on smoothing over discrete distributions concentrates mainly on asymptotic properties of the proposed estimators, even when considering sparse tables. The Nadaraya-Watson kernel density estimator is modified by Simonoff (1996) for estimating ordered categorical data. Moreover, local polynomial estimators have been proposed for smoothing discrete distributions.
Results on the asymptotic behavior of the mean sum of squared errors are studied in Simonoff (1983, 1995) and Aerts, Augustyns, and Janssen (1997). The asymptotics of the local polynomial smoothers were studied by Aerts et al. (1997), who establish sufficient conditions for sparse consistency.
A natural extension of these estimators allows interval data. In Section 2 we give an appropriate definition of the empirical density function for interval censored data. In Section 2.1 we derive the maximum likelihood density estimate using a uniformity condition. In Section 3 the Nadaraya-Watson density estimator is given. Finally, in Section 5 we present a real data example and demonstrate an R program for interactive control of the smoothing parameter.

Empirical Density Function for Interval Censored Data
In the study of the empirical density function (e.d.f.) the first difficulty is to find an appropriate definition. Revesz (1972, 1974) gives a number of possible definitions of the e.d.f. based on a complete sample from the underlying distribution. We follow this construction to define the e.d.f. based on a censored sample.
For an interval censored random variable X the subsets {S_i} are intervals specified by some known points τ_1, τ_2, … of the real line, chosen independently of X. For each X_i the corresponding interval Y_i may be defined on a different partition of the support of X. The conditional density of X given Y is itself unknown.
Let Y_1, …, Y_n be a sample of observed intervals, where Y_i = [L_i, R_i]. We assume that each X_i has an equal impact in the sample, and so should the corresponding interval. A common approach is to assume the conditional distribution over the interval Y_i to be uniform, i.e., we suppose that X_i occurs at any point of the observed interval [L_i, R_i] equally likely, and that the total mass in the observed interval is 1/n.
Further, let [a, b] be an interval of the real line and denote by M_n(a, b) the restricted mass of all elements of the sample Y_1, …, Y_n in this interval. More precisely, the restriction function is defined as follows. For a non-degenerate interval (c, d), the restriction function over an interval (a, b) is denoted by J_{(a,b)}(c, d) and equals

J_{(a,b)}(c, d) = |(a, b) ∩ (c, d)| / (d − c),

where |·| denotes interval length. For two intersecting intervals, the restriction function gives the ratio of the length of the common part of the intervals to the length of the observed interval (here (c, d)). Then M_n(a, b) is the sum of all parts of the observed intervals falling in (a, b).
Therefore, the probability ∫_a^b f(t) dt can be estimated by the relative mass M_n(a, b)/n. The nonparametric maximum likelihood estimate for a density with censored data can be constructed using this approach.
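The paper's implementation is an interactive R program; purely for illustration, a minimal Python sketch of the restriction function J_{(a,b)}(c, d) and the restricted mass M_n(a, b) (all names are ours) may read:

```python
def restriction(a, b, c, d):
    """Restriction function J_{(a,b)}(c, d): the fraction of the
    observed interval (c, d) that falls inside (a, b)."""
    overlap = max(0.0, min(b, d) - max(a, c))
    return overlap / (d - c)

def restricted_mass(a, b, intervals):
    """M_n(a, b): total restricted mass of the observed intervals in (a, b)."""
    return sum(restriction(a, b, L, R) for L, R in intervals)

# Three observed intervals; half of the first one falls in (0, 1).
intervals = [(0.0, 2.0), (1.0, 3.0), (2.5, 3.5)]
mass = restricted_mass(0.0, 1.0, intervals)   # M_n(0, 1) = 0.5
estimate = mass / len(intervals)              # relative mass M_n(0, 1)/n
```

Each observation contributes at most 1 in total over disjoint intervals, so the relative masses over a partition sum to one.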

Maximum Likelihood Density Estimate from Censored Data
Let Y_1, …, Y_n be a sample of observed intervals, where Y_i = [L_i, R_i]. Then the mass of the observed intervals Y_1, …, Y_n over an interval (a, b) can be expressed through the function J_{(a,b)} as the sum of the restricted masses of all Y_i,

M_n(a, b) = Σ_{i=1}^n J_{(a,b)}(L_i, R_i).
To construct the maximum likelihood density estimate, let C be the set of disjoint intervals (cells) defined by the ordered endpoints c_1 < c_2 < … < c_{r+1} of all observed intervals Y_1, …, Y_n. For any observed interval (c, d), the restriction function over a cell (a, b) ∈ C equals (b − a)/(d − c) when (a, b) ⊂ (c, d), and zero otherwise. That is, any observation fully covers some cells of the partition C and misses the rest.

Austrian Journal of Statistics, Vol. 37 (2008), No. 1, 119-128

Define the mass m_i = m(c_i, c_{i+1}) of each cell as the restricted mass of all observations,

m_i = Σ_{j=1}^n J_{(c_i, c_{i+1})}(L_j, R_j).

Denoting p̂_i = (1/n) m_i, we obtain the vector p̂ = (p̂_1, …, p̂_r) that represents the observed probability mass of the sample Y_1, …, Y_n over the support of f and, obviously, Σ_i p̂_i = 1.
Thus, the nonparametric maximum likelihood estimate of f(x) is a piecewise constant function over the cells. On the cell (c_i, c_{i+1}) it equals the sum of the restricted masses of all observed intervals divided by the cell length, f̂(x) = p̂_i/(c_{i+1} − c_i). The quantity p̂_i is an estimate of the true probability p_i of the i-th cell, that is, the probability that a random variable with density f takes a value within the interval (c_i, c_{i+1}). It is easy to see that, for fixed categories specified by the endpoints in C, the vector (p̂_1, …, p̂_r) has a multinomial distribution with E(p̂_i) = p_i.
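A short Python sketch of the resulting piecewise-constant ML estimate under the uniformity assumption above (function and variable names are ours):

```python
def ml_density(intervals):
    """Piecewise-constant ML density estimate from observed intervals
    [L_i, R_i]. Returns the ordered cell endpoints c_1 < ... < c_{r+1},
    the cell probabilities p_i, and the densities p_i / (c_{i+1} - c_i)."""
    n = len(intervals)
    cuts = sorted({e for L, R in intervals for e in (L, R)})
    probs, dens = [], []
    for a, b in zip(cuts, cuts[1:]):
        # Each observation either fully covers the cell or misses it.
        m = sum((b - a) / (R - L) for L, R in intervals if L <= a and b <= R)
        probs.append(m / n)
        dens.append(m / (n * (b - a)))
    return cuts, probs, dens

cuts, probs, dens = ml_density([(0.0, 2.0), (1.0, 3.0)])
# cells (0,1), (1,2), (2,3) carry probabilities 0.25, 0.5, 0.25
```

Because every cell of C is either fully covered or missed by each observation, the cell probabilities always sum to one.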
The estimate p̂_i is an accurate estimate of the true probability p_i if the probability mass in the interval (c_i, c_{i+1}) is large. However, if the number of cells is large compared with the number of observations, the vector p̂ may not be a good estimator of the true density f. Sparse asymptotics require the size of the cells to tend to zero as the sample size tends to infinity. While this requirement is not natural for the multinomial distribution, it is useful for modelling observations in a large number of cells. Under sparse asymptotics the estimates p̂_i are not consistent, in the sense that if n → ∞ and sup_i (c_{i+1} − c_i) → 0, the relative errors (p̂_i − p_i)/p_i need not tend to zero. The sparse asymptotic setting is natural for the vector p̂ obtained from censored data, since it may represent probability mass observed in a sequence of binnings. The simplest example is a sequence of different roundings (see Hall, 1982, for instance).
Since it is reasonable to assume that the probability mass changes smoothly as x increases, the mass falling in one particular cell provides information about the probability of falling in its neighbours. Smoothing therefore makes sense whenever we assume that the density is continuous. The improvement from smoothing is most evident when the distribution is sparse, in the sense that the mass of Y falling in each cell is small.

Kernel Density Estimates
In density estimation with complete data, one observes a sample X_1, …, X_n from a distribution on the real line admitting a density f. The kernel estimator of f, introduced by Rosenblatt (1956), is defined by

f̂(x) = (1/n) Σ_{i=1}^n K_h(x − X_i),

where X_1, …, X_n are the data, K_h(x) = (1/h) K(x/h), K is a kernel function, and h = h(n) is a sequence of constants tending to zero as n → ∞. The kernel function is chosen to satisfy the conditions K(u) ≥ 0 and ∫ K(u) du = 1. We shall consider both the kernel density estimation and the censored data problems.
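A minimal sketch of the Rosenblatt kernel estimator, taking the standard normal density as kernel (the text leaves K generic):

```python
import math

def kernel_density(x, data, h):
    """Rosenblatt kernel estimator f_hat(x) = (1/n) sum_i K_h(x - X_i),
    with K the standard normal density and K_h(u) = K(u/h) / h."""
    K = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return sum(K((x - xi) / h) for xi in data) / (len(data) * h)

data = [0.0, 0.5, 1.0]
f_half = kernel_density(0.5, data, h=1.0)   # roughly 0.368
```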
Local polynomial smoothers for categorical data are suggested by Simonoff (1995) and further studied by Aerts et al. (1997), Baek (1998), and others. For the equispaced design c_j = j/K, with K the number of cells, the local polynomial estimator p̃_i of degree t is defined as the constant term β̂_0 of the minimizer of

Σ_{j=1}^K { p̂_j − β_0 − β_1(c_j − c_i) − … − β_t(c_j − c_i)^t }² K_h(c_j − c_i).

The asymptotic behavior of this estimate is derived by Aerts et al. (1997). The advantage of p̃_i over p̂_i is that it is consistent under sparse asymptotics.

Modified Nadaraya-Watson Density Estimator
We shall consider the density estimation and the nonparametric regression problems related to density estimation with censored data. We have already mentioned that for complete data the kernel estimator of f is defined by (3). In nonparametric regression with complete data, one observes n pairs (X_1, Z_1), …, (X_n, Z_n) obeying the model Z_i = m(X_i) + ε_i, where the regression function m(x) is the conditional expectation m(x) = E(Z | X = x), and the random errors ε_i satisfy E(ε_i | X = x) = 0 and var(ε_i | X = x) = σ²(x), not necessarily constant.
There are many different methods for estimating m(x). Here we focus on the estimator known as the Nadaraya-Watson kernel estimator (Nadaraya, 1964), defined by

m̂(x) = Σ_{i=1}^n K_h(x − X_i) Z_i / Σ_{i=1}^n K_h(x − X_i),

where K and h are as before. Note that this estimator is the local polynomial estimate of degree t = 0. When the data are interval censored, so that X_i ∈ Y_i, i = 1, …, n, and only the intervals Y_i are observed, we smooth the cell probabilities instead. Then p̃_i is the Nadaraya-Watson kernel regression estimator based on the points (c_j, p̂_j), j = 1, …, r, i.e., p̃_i is the solution of the natural weighted least squares problem, being the minimizer β̂_0 of

Σ_{j=1}^r (p̂_j − β_0)² K((c_j − c_i)/h).

Thus p̃_i corresponds to locally approximating the p̂_j by a constant, weighting the values p̂_j corresponding to intervals closer to (c_i, c_{i+1}) more heavily.
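The locally constant fit above can be sketched as follows, again with a normal kernel and names of our own choosing: the smoothed value at each cell is a kernel-weighted average of all cell probabilities.

```python
import math

def nw_smooth(centers, probs, h):
    """Nadaraya-Watson smoothing of the cell probabilities p_j observed
    at the cell centers c_j: a locally constant weighted least squares fit."""
    K = lambda u: math.exp(-0.5 * u * u)   # normal kernel; constants cancel
    out = []
    for ci in centers:
        w = [K((ci - cj) / h) for cj in centers]
        out.append(sum(wj * pj for wj, pj in zip(w, probs)) / sum(w))
    return out

centers = [0.5, 1.5, 2.5]
probs = [0.25, 0.5, 0.25]
smoothed = nw_smooth(centers, probs, h=1.0)   # symmetric, peak in the middle
```

The weighted average solves the degree-zero least squares problem in closed form, which is why no explicit minimization appears in the code.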
One of the main problems in density estimation is to find a good connection between the length of the smoothing intervals and the sample size n. This aspect of density estimation was first discussed by Rosenblatt (1956). Since under regular assumptions all estimators of the density function are biased, the mean squared error, comprising both the bias and the variance of the estimator, is the standard measure of performance.

Variable Bandwidth and Interactive Choice
The performance of f̂(x) depends critically on the choice of h, which controls the smoothness of the estimates. Various methods have been proposed for automatically selecting an optimal h for uncensored data. Often the optimal value of h minimizes the mean squared error or some equivalent functional. Note that any automatic choice of the smoothing parameter should be viewed as a benchmark, to be adjusted based on subjective impression.
In this study we apply variable bandwidths h_i in (4), depending on the lengths of the cells (c_i, c_{i+1}). An adjusting parameter λ stretches all h_i and yields a family of estimators. The choice of the kernel is known not to have a great effect on the practical performance of the estimator; we use the normal kernel throughout our examples.
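A sketch of the variable-bandwidth construction (names ours), assuming the λ²/24 scaling of the interval lengths reported in the data example of Section 5:

```python
def variable_bandwidths(cuts, lam):
    """Per-cell bandwidths proportional to the cell lengths
    (c_{i+1} - c_i), stretched by the adjusting parameter lambda."""
    return [(b - a) * lam ** 2 / 24.0 for a, b in zip(cuts, cuts[1:])]

h = variable_bandwidths([0.0, 1.0, 3.0], lam=2.0)
# doubling lambda quadruples every bandwidth
```

In the interactive program, moving the slider re-evaluates this scaling and redraws the estimate for the new λ.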
The R implementation of the proposed estimators runs in an interactive graphic environment. Options are changed in an easy way, with a rich graphical response. The program uses the tcltk package, which provides simple functions for building a control panel for the graphics.
We recall that in the literature authors have mainly been concerned with the asymptotic behavior when both the number of cells of the distribution and the number of observations converge to infinity, maintaining some relation implying that the number of cells in the support does not become too small with respect to the sample size. In this paper we are mainly interested in the behavior when the sample is fixed, and we try to find estimators with good performance. Since we have no objective quantification of what "good" means, we propose that the user explore many different estimators to find an appropriate description of the data.
The numerical results included were obtained from the KT-DigiCult-Bg project collection of metadata on mediaeval manuscripts, studied for another purpose by Stoimenova, Mateev, and Dobreva (2006).

Application to Chronological Data
Temporal data are often subject to dilatation and translation effects. Most of the manuscripts, chronicles and other historical documents that survive today from ancient and medieval times were not dated exactly but assigned to likely periods. The two values "Notbefore" and "Notafter" are common in catalogue descriptions; they specify the interval [L, R] of possible origin of the item and can also be extracted from other types of description (see Stoimenova et al., 2006).
In this section the methods considered in this paper are applied to a sequence of 807 intervals of "Notbefore" and "Notafter" dates of origin of mediaeval manuscripts. Figure 1 represents the data as line segments with length equal to the observed interval length R − L and height approximately proportional to log(1/(R − L)). The exactly dated documents are less than 25% of the data and are depicted at the top level of Figure 1. The left histogram in Figure 2 is the maximum likelihood density estimate calculated using (2) and p̂_i = (1/n) m_i. The set C of endpoints consists of 258 points. The right histogram is a Nadaraya-Watson estimate calculated using (4). Here the variable bandwidths h_i are calculated as (R_i − L_i)λ²/24 with λ = 2.
This is the natural way to define the empirical density function. Using f̂(x) as an estimator of f(x) one makes two errors: the first lies in the fact that f(x) is estimated by its integral mean (1/(b − a)) ∫_a^b f(t) dt; the second is made when the probability ∫_a^b f(t) dt is estimated by the relative mass (1/n) M_n(a, b). The first error is small if the interval [a, b] is short, while the second one is small if the interval contains enough mass of elements, i.e., if the interval [a, b] is not too short.

Figure 1: Raw interval data for medieval manuscripts.

Figure 3 illustrates the real-time evolution of the modified Nadaraya-Watson density estimator. The computation begins with a given adjusting parameter λ (the stretch of the bandwidths). The user can then change it with a slider and explore different shapes of the distribution. Here λ varies in [2, 40]. A snapshot of the estimators obtained using the interactive program is shown; the values of λ used were 2, 18, and 32. The top left picture shows the control panel. It is clear that using all endpoints in C for visual representation does not give an easily interpretable estimator. The second interactive adjusting parameter chooses a subset of C.

Figure 4 illustrates the different shapes of the distribution for different subsets of C. The slider values vary in [2, 40]. A snapshot of the estimators obtained using the interactive program is shown; the numbers of intervals used were 42, 28, 18, and 9.