Kernel Density Estimation: Theory and Application in Discriminant Analysis
Austrian Journal of Statistics, Vol. 33 (2004), No. 3, 267-279

Abstract: A huge set of methods is available nowadays to estimate the density function of a random variable nonparametrically. Since the first version of the most elementary nonparametric density estimator, the histogram, researchers have produced a vast number of ideas, especially concerning the choice of the bandwidth parameter in a kernel density estimator. Beyond purely descriptive applications, the model is well suited for discriminant analysis, where (multivariate) class densities form the basis for assigning a vector to a given class. This article gives insight into the most popular bandwidth selectors as well as into the performance of the kernel density estimator as a classification method, compared with classical linear and quadratic discriminant analysis, respectively. Both direct estimation in a multivariate space and an application of the concept to marginal normalizations of the single variables are taken into consideration. The report also points out the gap between theory and application.


Introduction
Since non-parametric smoothing methods provide an interesting alternative to classical parametric estimation methods, this paper is concerned with the genesis of kernel density estimation itself (for descriptive purposes) as well as with its application in discriminant analysis, where it serves as a competitor for estimating class densities against the corresponding model-based discrimination rules, linear and quadratic discriminant analysis (LDA, QDA). The starting point is the first description of the kernel density estimation concept by Rosenblatt (1956),

\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right),    (1)

where K(·) denotes the kernel function (subject to some appropriate restrictions, e.g. \int K(u)\,du = 1 and K(u) \ge 0 for all u) and h is the so-called bandwidth parameter. Many proposals have been made on how to select the bandwidth appropriately. The most commonly used kernel is the N(\mu, \sigma^2) density, but several other kernel densities with bounded support are in use as well. Section 2 gives an overview of the improvement in techniques and strategies for optimizing K and h with respect to different optimization criteria during the last 20 to 25 years. Results are available for the univariate and the multivariate model. In addition, the issue of handling data in higher dimensions, so that the model remains a competitor to LDA and QDA for discriminatory purposes, is treated. Here, another optimization problem arises, as well as the so-called curse of dimensionality, which prevents the user from generalizing the concept easily without sufficient data. Finally, Section 3 presents hard numbers: the theoretical ideas are applied to different types of datasets in a detailed simulation study in which altogether 14 estimators are used on 21 different datasets.
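The estimator in (1) can be sketched in a few lines of Python. This is an illustrative implementation (not part of the original study) using the Gaussian kernel; the sample values are invented:

```python
import math

def gaussian_kernel(u):
    """Standard normal density, the most common kernel choice."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, data, h):
    """Rosenblatt's estimator: f_h(x) = (1/(n*h)) * sum_i K((x - x_i) / h)."""
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (len(data) * h)

# Toy sample; since K integrates to one, so does the estimate.
sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
density_at_zero = kde(0.0, sample, h=0.5)
```

Because every summand is a rescaled density, \hat{f}_h is itself a density for any h > 0; the entire difficulty lies in choosing h.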

Ideas in Kernel Density Estimation and Techniques for Application in the Discrimination Context

The Univariate Setting
To judge whether a kernel density estimate \hat{f}_h(·) with specified kernel K and bandwidth h is well estimated, one has to create some measure of deviation from the underlying original density f(·). A straightforward idea is to consider the integrated p-th power of the difference (usually p = 1, p = 2, or p → ∞),

d_p(\hat{f}_h, f) = \int \big|\hat{f}_h(x) - f(x)\big|^p \, dx.    (2)

Setting p = 2 here leads to the results that are easiest to calculate. For p = 1, Devroye and Györfi (1985) offer several optimality results for bandwidths concerning different kernel shapes. Nevertheless, in most cases the easier results for p = 2 are used. As the construction in (2) is a random variable (for p = 2 the so-called integrated squared error, ISE), it is useful to take the expectation into account,

MISE(\hat{f}_h) = E\left[\int \big(\hat{f}_h(x) - f(x)\big)^2 \, dx\right].    (3)

Jones (1991) provides an overview of which of the two is preferable. Again, to circumvent difficulties, a Taylor approximation of (3) (called AMISE) is often used. Marron and Wand (1992) show that those step-by-step adaptations (AMISE instead of MISE, with p = 2 instead of p = 1 in (2), for reasons of easier calculation) can cause essential changes in the resulting estimated model parameter (in particular the optimal bandwidth). One has to be aware that the quality of the tail estimate decreases as p increases: p → ∞ stresses a good overall fit and does not care about a bad fit in the tails. Besides, one can think of other criteria, such as comparing the number (and location) of the estimated modes to the number (and location) of the modes of the original density (see Park and Turlach, 1992), but such calculations cannot be carried out in applications, where the original density is unknown. Marron and Tsybakov (1995) go even further and suggest including a kind of horizontal distance between the curves for the sake of a more intuitive fit. Nevertheless, the only practical way to derive automatically data-driven parameters in applications is to work with a compact formula like (2). Focusing now on the AMISE,
Wand and Jones (1995) derive the following decomposition into a variance and a bias term:

AMISE(\hat{f}_h) = \frac{R(K)}{nh} + \frac{h^4}{4}\,\mu_2(K)^2\, R(f''),    (4)

where \mu_2(K) = \int z^2 K(z)\,dz denotes the variance of the kernel and R(g) = \int g(x)^2\,dx (so in particular R(K) = \int K(x)^2\,dx).
After separating K as in (4), the Epanechnikov kernel (see, e.g., Silverman, 1986) minimizes the AMISE, but the choice of the kernel is not that crucial: even for the triangular and the uniform kernel, fewer than 10% additional data points are necessary to reach the same AMISE. Small improvements are possible by dropping the restriction K(x) ≥ 0 for all x, which leads to higher-order kernels (Wand and Jones, 1995) and to values \hat{f}_h(x) < 0 for some x; these are not adequate for the estimation of class densities in discriminant analysis. The more important bandwidth choice is carried out by minimizing (4) with respect to h, which leads to

h_{AMISE} = \left[\frac{R(K)}{\mu_2(K)^2\, R(f'')\, n}\right]^{1/5}.    (5)

Before considering ideas and techniques for a concrete estimation of h_{AMISE} (the unknown density f has to be substituted), we discuss two further ways of improving the concept. First, it is possible to smooth the density with different bandwidths, depending on the distance d_{j,k} of the data point x_j to its k-th nearest neighbor (Breiman et al., 1977),

\hat{f}_h(x) = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{h\, d_{j,k}} K\left(\frac{x - x_j}{h\, d_{j,k}}\right),

so that the curve is smoothed more radically in sparse regions, which is useful for highly skewed distributions (see Sheather, 1992, p. 246, for an example of failure of the standard model). The problem of choosing another parameter k, which is again more complicated in high dimensions, means that this model is not fully data-driven and therefore not suitable for inexperienced users. Another concept is the so-called transformed kernel density estimator. Its aim is to transform the observations into others whose densities are easier to estimate (that is, much more effective to estimate with respect to the number of observations). Wand and Jones (1995) give a parametric transformation rule, which again involves at least one parameter to be estimated by the inexperienced user, while Ruppert and Cline (1994) provide a calculation-intensive non-parametric approach that was used in the simulation study described below.
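The transformation idea can be sketched as follows. This fragment (an illustration, not the Ruppert-Cline routine itself) uses a Gaussian pilot estimate of the distribution function F and the standard normal as the target distribution G; the sample and the pilot bandwidth h = 0.4 are invented:

```python
from statistics import NormalDist

_std = NormalDist()  # standard normal: target distribution G and kernel CDF

def cdf_kde(x, data, h):
    """(Pilot) kernel estimate of the distribution function F(x)."""
    return sum(_std.cdf((x - xi) / h) for xi in data) / len(data)

def normalize(x, data, h):
    """t(x) = G^{-1}(F_h(x)): maps the sample towards normality."""
    p = cdf_kde(x, data, h)
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against p = 0 or 1
    return _std.inv_cdf(p)

skewed = [0.1, 0.2, 0.3, 0.5, 0.9, 1.8, 3.5]  # exponential-like toy sample
transformed = [normalize(x, skewed, 0.4) for x in skewed]
```

Since both \hat{F}_h and G^{-1} are strictly increasing, the transformation is monotone and the original order of the observations is preserved.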
They refer to the fact that if F and G are the cumulative distribution functions of the densities f and g, then Y = G^{-1}(F(X)) has density g. One can in principle choose any target distribution, but the user will typically choose a normal distribution, because normalization is wanted. In application t(x) = G^{-1}(\hat{F}_h(x)) is used, where \hat{F}_h(x) is a (pilot) kernel estimate of F(x). The issue of bandwidth selection in the univariate case is discussed in detail in the literature. Authors offer many proposals for applying (5) in practice, starting with the use of a Gaussian kernel N(0, \sigma^2) for K and the replacement of f in (5) by the corresponding functional of N(0, \sigma^2) as well. This rule of thumb or normal rule was suggested by Silverman (1986) and yields the AMISE-optimal bandwidth h_{opt} ≈ 1.06\,\sigma n^{-1/5}, which is easy to calculate and therefore often used. \sigma can be estimated by the sample standard deviation or by a more robust estimate such as the inter-quartile range R (with a corresponding adaptation of the coefficient). However, the rule often oversmooths the density. Another idea well known in statistics is the concept of cross-validation. Bowman (1984) states the optimization problem by giving unbiased estimators for the minimization of the ISE(\hat{f}_h) itself; in Section 3 the corresponding formula for the multivariate case is given. This estimator is known as least-squares cross-validation. Biased cross-validation is similar, but minimizes MISE(\hat{f}_h). Both biased and least-squares cross-validation have the drawback of high variance of the estimators and sometimes the occurrence of more than one local minimum. Sheather (1992) and Marron, in his discussion of Sheather (1992), give different views on which one is best. Finally, the likelihood cross-validation method (Silverman, 1986) leads to problems if kernels with bounded support are used.
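The normal rule is easy to state in code. A minimal sketch, using the constant 1.06 quoted above for the Gaussian kernel and the sample standard deviation for \sigma:

```python
from statistics import stdev

def normal_rule_bandwidth(data):
    """Silverman's rule of thumb h = 1.06 * sigma * n^(-1/5); sigma is
    estimated here by the sample standard deviation (a robust variant
    would use the inter-quartile range with an adapted constant)."""
    return 1.06 * stdev(data) * len(data) ** (-0.2)

h = normal_rule_bandwidth([float(i) for i in range(10)])
```

Its only data-dependent ingredients are a scale estimate and the sample size, which is exactly why it is cheap, and also why it oversmooths densities that are far from normal.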
An interesting set of methods are the plug-in methods (see Sheather and Jones, 1991, or Park and Marron, 1990), which appear to represent the state of the art in bandwidth selection, because many simulation studies identify them as the best, or at least among the best, estimators with respect to an intuitive fit, the variance of the estimator, and the performance on harder-to-estimate densities (Jones et al., 1996; Sheather, 1992; Park and Turlach, 1992; Cao et al., 1994). Common to all plug-in methods is that they include an estimate of the unknown density functional R(f'') in (5), obtained by a kernel estimate, but in general with a different kernel and smoothing parameter than those used for estimating f. This distinguishes them from the biased cross-validation concept. Moreover, regarding the asymptotic analysis of the bandwidth selectors, the plug-in approach has a faster convergence rate for h than the cross-validation methods.
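To make the plug-in idea concrete, here is a deliberately naive one-stage sketch: R(f'') in (5) is replaced by a numerically integrated kernel estimate built with a pilot bandwidth g. Real selectors such as Sheather-Jones choose g itself from the data in a further stage; here it is simply handed in, and the sketch is not the published algorithm:

```python
import math

SQRT2PI = math.sqrt(2.0 * math.pi)

def phi(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / SQRT2PI

def phi2(u):
    """Second derivative of the standard normal density."""
    return (u * u - 1.0) * phi(u)

def f2_hat(x, data, g):
    """Pilot kernel estimate of f''(x)."""
    return sum(phi2((x - xi) / g) for xi in data) / (len(data) * g ** 3)

def plugin_bandwidth(data, g, lo=-6.0, hi=6.0, step=0.01):
    """Naive one-stage plug-in: estimate R(f'') by midpoint integration on
    a grid and insert it into (5). For the Gaussian kernel
    R(K) = 1 / (2 * sqrt(pi)) and mu_2(K) = 1. Note that the diagonal
    i = j terms bias this estimate of R(f'') upward when g is small,
    one reason real plug-in rules are more refined."""
    r_f2 = sum(f2_hat(lo + (i + 0.5) * step, data, g) ** 2 * step
               for i in range(int((hi - lo) / step)))
    rk = 1.0 / (2.0 * math.sqrt(math.pi))
    return (rk / (r_f2 * len(data))) ** 0.2
```

For a sample that really is standard normal, the result lands in the same region as the normal-rule value 1.06 n^{-1/5}, while for non-normal samples the data-driven R(f'') is what sets the two rules apart.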

The Multivariate Setting
When generalizing the univariate model to d dimensions one gets

\hat{f}_H(x) = \frac{1}{n} \sum_{i=1}^{n} |H|^{-1/2}\, K\!\left(H^{-1/2}(x - x_i)\right),    (6)

where H is a symmetric positive definite d × d matrix, the bandwidth matrix, and K is a multivariate kernel satisfying \int K(x)\,dx = 1 (Wand and Jones, 1995). Whereas in the univariate model a single parameter has to be chosen, now up to d^2 entries of H (d(d+1)/2 free parameters for a symmetric matrix, which also control the direction of smoothing) have to be estimated rather than just d, one per dimension. One can therefore use either a single parameter h controlling H = h^2 I, or parameterizations with d or d(d+1)/2 parameters. It should be mentioned that selecting more than one parameter requires much computation time for algorithms more sophisticated than the normal rule; therefore the simplest parametrization was chosen in the simulation study. Given the large variances of models with more parameters, this choice is a good compromise. Essentially the same ideas as in the univariate setting can be generalized, but the application suffers from exhaustive computation time and limited mathematical tractability (for difficulties in the generalization of the plug-in estimator see Wand and Jones, 1994). The crucial problem, however, is the so-called curse of dimensionality (see Silverman, 1986, or Scott, 1992), which is based on the fact that there is, roughly speaking, much more space in high dimensions, and hence strongly increasing numbers of observations are required to maintain a constant estimation accuracy (see again Silverman, 1986). Since the calculation time for estimators in high dimensions is even higher (d times) for equal observation numbers, the problem becomes transparent.
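For the simplest parametrization H = h^2 I with a Gaussian product kernel, (6) collapses to a sum of radially symmetric bumps and can be written down directly. An illustrative sketch:

```python
import math

def mv_kde(x, data, h):
    """Multivariate kernel density estimate with H = h^2 * I and a
    Gaussian product kernel; x is a d-tuple, data a list of d-tuples."""
    d = len(x)
    norm = (h * math.sqrt(2.0 * math.pi)) ** d
    total = sum(
        math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, xi)) / (h * h))
        for xi in data
    )
    return total / (len(data) * norm)
```

The normalizing constant grows like h^d, which already hints at the curse of dimensionality: to keep the bumps from vanishing into the extra volume, n has to grow sharply with d.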

The Context to Discriminant Analysis
An important fact is that, in the discrimination context, minimizing classification error rates is not necessarily the same aim as minimizing the error criteria discussed in Section 2.1. Since the theoretical misclassification rate is an L_1-based measure, a MISE-optimal bandwidth selector puts less weight on the fit in the tails of the distributions. Yet in higher dimensions an increasing part of the data lies in the tails of the distribution. For that reason, a good fit in the tails is required in classification tasks, and Ripley (1996) underlines that estimating differences of log-densities is crucial in classification. Hall and Wand (1988) treat at least the topic of estimating differences of densities (without the logarithm); however, this is ignored in other sources. They use kernel functions with negative values, and the multivariate generalization cannot be derived easily.
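A minimal sketch of the resulting classification rule (univariate, two classes, all numbers invented for illustration): the assignment depends only on the difference of the two log-scores, so any error in the tail of a class density feeds straight into the decision.

```python
import math

def kde(x, data, h):
    """Univariate Gaussian-kernel density estimate."""
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data) / (
        len(data) * h * math.sqrt(2.0 * math.pi))

def classify(x, class1, class2, h, prior1=0.5):
    """Bayes rule with kernel class densities: the sign of the difference
    of log(prior * density) decides the assignment."""
    s1 = math.log(prior1) + math.log(kde(x, class1, h) + 1e-300)
    s2 = math.log(1.0 - prior1) + math.log(kde(x, class2, h) + 1e-300)
    return 1 if s1 >= s2 else 2

g1 = [-1.5, -1.0, -0.7, -0.2, 0.1]   # toy sample, class 1
g2 = [0.9, 1.2, 1.6, 2.0, 2.4]       # toy sample, class 2
```

Far from both training samples, both kernel estimates are nearly zero, and the decision rests on the ratio of two tiny tail values, which is exactly where a MISE-optimal bandwidth gives the least protection.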

Preliminaries
The existing results on using kernel density estimation as a classification method go back to the late 1970s and provide a very limited view of the problem. Ness and Simpson (1976), Ness (1980), Habbema et al. (1978), and Remme et al. (1980) essentially treat uncorrelated multivariate normal distributions in high dimensions, in which the groups were separated by only one variable. From the viewpoint of the curse of dimensionality, improvements can then only happen accidentally, and the essential issue of applying the concept to non-normal data is not taken into account. Moreover, the parameters were estimated with the likelihood cross-validation method, which is somewhat outdated and often results in oversmoothing. All studies used two classes, and there was no reason to change this in our simulation study, because the basic problem will probably be the same in a setting with more than two classes.

Construction of the Datasets
The first effect to study is the behavior of the model when the densities gradually move away from the normal distribution. In addition, skewed and bimodal distributions are of interest. Table 1 and Figure 1 show the univariate prototype distributions used; they exhibit different types of deviations from the normal case. Table 2 lists the construction principles for the datasets based on the univariate prototype distributions. Each dataset has dimension 1400 × 10 and consists of two classes with 700 observations each, 600 for estimating the class density and 100 for calculating the classification criteria. Table 1 and Figure 1 show the distributions of class 1. For class 2 the parameter of the exponential distribution changes from λ = 1 to λ = 2, and the normals and bimodals are shifted by 0.5 to the right.
After this linking step, these ten datasets (1-10) were transformed linearly by ten 10 × 10 matrices, which are the roots of ten self-produced correlation matrices. Datasets 11-20 were produced in exactly the same way as datasets 1-10, but population 1 and population 2 were transformed by different transformation matrices. In Table 2 and Figures 2-5, the datasets with equal covariance matrices have 1 as their last digit, the others 2. For example, the dataset Bi42 originally consists of eight skewed distributions and two bimodals (whose bumps are strongly separated), and the transformation was carried out with unequal transformation matrices for the two groups. The 30 (10 + 2 × 10) correlation matrices were produced by assuming a common factor in the ten variables with a regression coefficient whose absolute value is uniformly distributed between 0.3 and 1. Finally, an insurance dataset with the same dimensions as the synthetic data was used (dataset 21).
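The construction can be imitated as follows (a sketch with an invented seed and sample size; the paper works with matrix roots of the one-factor correlation matrices, and here a Cholesky factor is used, since any root L with L Lᵀ = R produces data with correlation matrix R):

```python
import random

random.seed(1)
d = 10

# One-factor model: corr(i, j) = l_i * l_j for i != j, with |l_i| in (0.3, 1).
loadings = [random.uniform(0.3, 1.0) * random.choice([-1.0, 1.0]) for _ in range(d)]
R = [[1.0 if i == j else loadings[i] * loadings[j] for j in range(d)]
     for i in range(d)]

def cholesky(a):
    """Lower-triangular L with L * L^T = a (one possible matrix root)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = (a[i][i] - s) ** 0.5
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

L = cholesky(R)

def correlated_row():
    """One d-dimensional observation with correlation matrix R."""
    z = [random.gauss(0.0, 1.0) for _ in range(d)]
    return [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(d)]

data = [correlated_row() for _ in range(600)]
```

Because |l_i| < 1, the one-factor matrix R is positive definite by construction, so the factorization always exists.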
Table 1: Prototype distributions of the synthetic datasets.

Construction of the Estimators
As mentioned above, the simulation study uses 14 estimators. First of all, LDA and QDA serve as conservative competitors. The multivariate densities were estimated by the multivariate normal rule (Scott, 1992) and by the multivariate least-squares cross-validation (LSCV) selector (Bowman, 1984), which is given by the minimizer of

LSCV(H) = \int \hat{f}_H(x)^2\, dx - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{H,-i}(x_i)    (7)

with respect to h (H = h^2 I), where \hat{f}_{H,-i} denotes the estimate computed without the observation x_i.
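For the Gaussian kernel the integral in (7) has a closed form, which the following sketch exploits (toy data and a crude grid search instead of a proper numerical minimizer; none of the numbers come from the study):

```python
import math

def gauss_d(sq_dist, s2, d):
    """d-variate N(0, s2 * I) density at squared distance sq_dist."""
    return math.exp(-0.5 * sq_dist / s2) / (2.0 * math.pi * s2) ** (d / 2.0)

def lscv(h, data):
    """LSCV criterion for H = h^2 * I with a Gaussian kernel: the integral
    of the squared estimate reduces to the pairwise sum below, the second
    term is the leave-one-out sum of (7)."""
    n, d = len(data), len(data[0])
    sq = [[sum((a - b) ** 2 for a, b in zip(xi, xj)) for xj in data]
          for xi in data]
    term1 = sum(gauss_d(sq[i][j], 2.0 * h * h, d)
                for i in range(n) for j in range(n)) / (n * n)
    loo = sum(sum(gauss_d(sq[i][j], h * h, d) for j in range(n) if j != i)
              for i in range(n)) / (n - 1)
    return term1 - 2.0 * loo / n

pts = [(0.0, 0.0), (0.4, 0.1), (1.0, 0.8), (1.3, 1.1), (2.0, 1.7), (0.2, 0.9)]
best_h = min((k / 10.0 for k in range(2, 16)), key=lambda h: lscv(h, pts))
```

The pairwise-distance form makes the O(n^2) cost of LSCV explicit, which is one source of the computation-time problems mentioned above.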
To mitigate the curse of dimensionality, the original datasets were projected classwise by a principal component analysis (PCA) onto subspaces of 2 to 5 dimensions, respectively, and the densities were estimated with both selectors, normal rule and LSCV, leading to estimators 3-10. In this way, the information loss caused by the PCA and the accuracy loss of the kernel density estimators caused by additional dimensions can be weighed against each other. The use of marginal normalizations leads to estimators 11-14; these are constructed as described in Subsection 2.1 for each of the ten variables in each dataset.
Here, the univariate plug-in bandwidth selector of Sheather and Jones (1991) and the normal rule (Silverman, 1986) were applied to normalize the datasets; in a subsequent step, LDA and QDA were used on the transformed ten-dimensional distributions, resulting in the 2 × 2 = 4 last estimation procedures.
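The classwise projection step can be sketched without any linear-algebra library via power iteration with deflation. This is an illustration of the principle, not the PCA routine used in the study; the three-dimensional toy data are invented:

```python
import random

def top_components(data, k, iters=200):
    """Leading k principal axes via power iteration with deflation."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    X = [[row[j] - means[j] for j in range(d)] for row in data]
    C = [[sum(X[r][i] * X[r][j] for r in range(n)) / (n - 1)
          for j in range(d)] for i in range(d)]
    comps = []
    for _ in range(k):
        v = [random.random() + 0.1 for _ in range(d)]
        for _ in range(iters):
            w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(C[i][j] * v[j] for j in range(d)) for i in range(d))
        for i in range(d):        # deflate: remove the found component
            for j in range(d):
                C[i][j] -= lam * v[i] * v[j]
        comps.append(v)
    return means, comps

def project(row, means, comps):
    """Coordinates of one observation in the chosen subspace."""
    centered = [row[j] - means[j] for j in range(len(row))]
    return [sum(c[j] * centered[j] for j in range(len(row))) for c in comps]

# Toy data concentrated along the direction (1, 2, 0).
data3 = [[t, 2.0 * t, 0.01 * ((-1) ** i)]
         for i, t in enumerate(j * 0.2 - 1.0 for j in range(11))]
means, comps = top_components(data3, 1)
scores = [project(row, means, comps) for row in data3]
```

The projected scores then feed directly into the multivariate kernel estimators described above, trading the information discarded by the PCA against the accuracy gained from a lower dimension.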

The Performance Measure
To evaluate the goodness of classification, the classical error rate (percentage of wrongly classified elements) and the Brier score (Hand, 1997) were considered. The latter is a similar but less robust measure and is given by

B = \frac{1}{n} \sum_{i=1}^{n} \big(\hat{p}(x_i) - c_i\big)^2,

where \hat{p}(x_i) denotes the estimated posterior probability of group 2 and c_i = 0 if x_i is from group 1 and c_i = 1 otherwise. Good discrimination is indicated when both measures are small. The following results refer to the Brier score only.
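Both measures are one-liners; the following sketch (with invented posterior probabilities) shows how they differ, since the Brier score also penalizes correct assignments made with little confidence:

```python
def brier_score(posteriors, labels):
    """Brier score: mean squared difference between the estimated posterior
    probability of group 2 and the 0/1 group indicator c_i."""
    return sum((p - c) ** 2 for p, c in zip(posteriors, labels)) / len(labels)

def error_rate(posteriors, labels):
    """Classical error rate with the 0.5 assignment threshold."""
    wrong = sum((p >= 0.5) != (c == 1) for p, c in zip(posteriors, labels))
    return wrong / len(labels)

post = [0.1, 0.3, 0.8, 0.6, 0.9]   # invented estimates of P(group 2 | x_i)
truth = [0, 0, 1, 1, 1]            # c_i = 0 for group 1, 1 for group 2
```

Here every observation is classified correctly (error rate 0), yet the Brier score is positive because none of the posteriors equals 0 or 1 exactly.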

Results
One of the most important results is that the bandwidth selection procedure was not crucial. Just as the choice of the kernel is not very important for descriptive applications, the choice of the bandwidth selector is not for discrimination purposes. The Sheather-Jones selector performed similarly to the normal rule in the univariate normalizations, and the LSCV selector resembled the normal rule in the multivariate setting, so using the normal rule, which saves an enormous amount of computation time, is completely sufficient.
The better performance of LDA compared to QDA in the case of equal covariance matrices was also evident after marginal normalizations by the kernel method, and vice versa in the case of unequal covariance matrices. The most interesting result appeared in the comparison of the two main estimation techniques. The rates of the multivariate kernel estimators compared to the classical methods, LDA (in Figure 2 for equal correlation matrices) and QDA (in Figure 4 for unequal correlation matrices), are quite disappointing, and the euphoria of past simulation studies (e.g., Remme et al., 1980) is from this point of view not comprehensible. The direct density estimation in 2 to 5 dimensions performed poorly compared to the parametric counterparts (LDA and QDA). Among the non-parametric estimators, the projections to 3 and 4 dimensions performed better than those to 2 and 5, but all in all this attempt failed.

Figure 1: Prototype distributions of the synthetic datasets.

Figure 2: Brier-score of the datasets having equal correlation matrices. Comparison between the LDA and the Bayes-rule kernel methods constructed by the LSCV selector.
Figure 3:

Figure 4: Brier-score of the datasets having unequal correlation matrices. Comparison between the QDA and the Bayes-rule kernel methods constructed by the LSCV selector.

Figure 5: Brier-score of the datasets having unequal correlation matrices. Comparison between the QDA before and after univariate normalizations derived by kernel estimates.

Table 2: Description of the datasets used.