The Problem of Classification when the Data are Non-precise

Abstract: Non-precise data arise in a natural way in several contexts. For example, the water level of a river does not usually consist of a single number, as can be seen from the intensity of the wetness as a function of depth on a survey rod. The temperature of a room varies as a function of distance from a reference point. The color intensities associated with a pixel which describe observations from remote sensing are non-precise numbers because they vary as a function of the reflection from the sun. In these examples, it is the imprecision of the observation itself that is of interest rather than the uncertainty due to statistical variation. Even in the absence of stochastic error, there would still be an imprecision in the measurement. Viertl (1997) developed the subject of statistical inference for such non-precise data and associated it very closely with fuzzy set theory. Precise data can be described by an indicator function, whereas non-precise data are described by characterizing functions. In this article, we first review the notation and then consider the problem of classification for non-precise data.


Characterizing Functions
In the presentation below, we shall draw heavily on the analogy with inference for precise data. We shall put aside the presence of statistical error and assume that a precise measurement of a given quantity yields a single value. A precise measurement can be uniquely represented by an indicator function $I_{[x_0]}(x)$, which takes the value 1 if $x = x_0$ and 0 otherwise. A non-precise observation will be modelled mathematically in terms of characterizing functions, which are a generalization of indicator functions. Quoting from Viertl (1997),

Definition 1 A characterizing function $\xi(\cdot)$ of a non-precise number is a real function of a real variable such that
(i) $\xi : \mathbb{R} \to [0, 1]$;
(ii) there exists $x_0 \in \mathbb{R}$ with $\xi(x_0) = 1$;
(iii) for every $\alpha \in (0, 1]$, the set $B_\alpha = \{x \in \mathbb{R} : \xi(x) \ge \alpha\} = [a_\alpha, b_\alpha]$ is a finite closed interval, called an α-cut of $\xi(\cdot)$.

Viertl (1997) has shown that a characterizing function can be uniquely determined by the family of α-cuts $\{B_\alpha : \alpha \in (0, 1]\}$ and moreover

$$\xi(x) = \max_{\alpha \in (0,1]} \alpha \, I_{B_\alpha}(x), \quad \forall x \in \mathbb{R}. \qquad (1)$$

It should be noted that continuous functions fulfilling only conditions (i) and (ii) above can, through the notion of the convex hull, be made to also obey condition (iii). The characterizing function is the unique representation of a non-precise measurement. All inference is drawn on the basis of this representation.

Viertl (1997) points out that characterizing functions can be viewed as representing the rate of change of values, and he provides a prescription for their construction. Referring to the example on the water level of a river, let $w(h)$ represent the intensity of the wetness of a survey rod as a function of the depth $h$, $h_1 \le h \le h_2$, where $h_1, h_2$ provide the range of values. Then the characterizing function can be given as

$$\xi(h) = \frac{|w'(h)|}{\max\{|w'(u)| : u \in [h_1, h_2]\}}, \quad h_1 \le h \le h_2.$$

The derivative measures the rate of change of the wetness. For values of $h$ close to $h_1$ or $h_2$, $\xi(h)$ should be near 0, since the rod would be either always wet or always dry and one expects very little change. Precise data are described by an indicator function, showing that the rate of change is 0 for values on either side of $x_0$.
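The representation of a characterizing function through its α-cuts can be checked numerically. The following is a minimal sketch, assuming a hypothetical triangular characterizing function on [0, 2] with peak at 1 (this shape, the function names, and the grid of α levels are illustrative choices, not taken from Viertl); it rebuilds ξ(x) as the largest α whose α-cut contains x.

```python
def triangular(x, left=0.0, peak=1.0, right=2.0):
    """Hypothetical triangular characterizing function with xi(peak) = 1."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def alpha_cut(alpha, left=0.0, peak=1.0, right=2.0):
    """Closed interval {x : xi(x) >= alpha} of the triangle, for 0 < alpha <= 1."""
    return (left + alpha * (peak - left), right - alpha * (right - peak))


def from_alpha_cuts(x, levels):
    """Approximate xi(x) as the max of alpha * I_{B_alpha}(x) over a finite grid of levels."""
    best = 0.0
    for alpha in levels:
        a, b = alpha_cut(alpha)
        if a <= x <= b:
            best = max(best, alpha)
    return best


# A finite grid of alpha levels stands in for the full interval (0, 1].
levels = [k / 1000 for k in range(1, 1001)]
for x in (0.25, 1.0, 1.6):
    print(x, triangular(x), from_alpha_cuts(x, levels))
```

On a finite grid the reconstruction agrees with ξ only up to the grid spacing, which is why the comparison is approximate rather than exact.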
Example 2 Consider the characterizing function given by: The α-cut boundaries are given by

Characterizing functions can also be defined for a non-precise $n$-dimensional vector $x^*$.
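Definition 3 A characterizing function $\xi_{x^*}(\cdot)$ of a non-precise vector $x^*$ is a real function of $n$ variables such that
(i) $\xi_{x^*} : \mathbb{R}^n \to [0, 1]$;
(ii) there exists $x_0 \in \mathbb{R}^n$ with $\xi_{x^*}(x_0) = 1$;
(iii) for every $\alpha \in (0, 1]$, the α-cut $B_\alpha(x^*) = \{x \in \mathbb{R}^n : \xi_{x^*}(x) \ge \alpha\}$ is a closed, bounded and convex set, by which we mean that the line segment joining any two points in the set lies entirely in the set.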
An example of a non-precise vector is the location of an object on a radar screen. The object appears as a cloud in two-dimensional space. The characterizing function may be constructed in terms of the light intensity function.

Given $n$ non-precise observations $x_1^*, x_2^*, \dots, x_n^*$, each taking values in a space $M$ with corresponding characterizing functions $\xi_1, \dots, \xi_n$, it is possible to define a characterizing function $\xi : M^n \to [0, 1]$ for the combined sample via the product or the minimum rule, respectively, as

$$\xi(x_1, \dots, x_n) = \prod_{i=1}^{n} \xi_i(x_i) \quad \text{or} \quad \xi(x_1, \dots, x_n) = \min_{1 \le i \le n} \xi_i(x_i).$$

The α-cuts for the combined sample (and hence the characterizing function) based on the minimum combination rule are easy to obtain from the α-cuts of the individual non-precise observations. Referring to Example 2, if every observation in a sample of size $n$ has the same characterizing function, then the characterizing function for the combined sample using the product rule is: It can be shown that both the product and the minimum combination rules lead to functions which satisfy the conditions of Definition 3 above. In practice, the minimum rule appears to be the more useful of the two. We now turn attention to functions of non-precise observations, such as the usual sample mean and sample variance.
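The two combination rules can be compared on a small example. The sketch below assumes Gaussian-shaped characterizing functions exp(-(x - µ)²/(2σ²)) for the individual observations (an illustrative choice; the parameter values and function names are hypothetical). It also illustrates why the minimum rule is convenient: min ξ_i(x_i) ≥ α holds exactly when every ξ_i(x_i) ≥ α, so the combined α-cut is simply the box formed by the individual α-cuts.

```python
import math


def gaussian_xi(x, mu, sigma):
    """Gaussian-shaped characterizing function with value 1 at x = mu (assumed shape)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


def product_rule(point, params):
    """Combined characterizing function under the product rule."""
    value = 1.0
    for x, (mu, sigma) in zip(point, params):
        value *= gaussian_xi(x, mu, sigma)
    return value


def minimum_rule(point, params):
    """Combined characterizing function under the minimum rule."""
    return min(gaussian_xi(x, mu, sigma) for x, (mu, sigma) in zip(point, params))


def gaussian_cut(alpha, mu, sigma):
    """alpha-cut [mu - h, mu + h] of the Gaussian shape, for 0 < alpha <= 1."""
    half = sigma * math.sqrt(max(0.0, -2.0 * math.log(alpha)))
    return (mu - half, mu + half)


def in_combined_cut(point, params, alpha):
    """Membership in the box of individual alpha-cuts (the minimum-rule alpha-cut)."""
    cuts = [gaussian_cut(alpha, mu, sigma) for mu, sigma in params]
    return all(lo <= x <= hi for x, (lo, hi) in zip(point, cuts))


params = [(0.0, 1.0), (2.0, 0.5)]   # hypothetical (mu, sigma) per observation
point = (0.5, 2.2)                  # a precise candidate value for the sample
print(product_rule(point, params), minimum_rule(point, params))
print(in_combined_cut(point, params, 0.8), in_combined_cut(point, params, 0.9))
```

Because every factor lies in [0, 1], the product rule never exceeds the minimum rule, so its α-cuts are contained in those of the minimum rule.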
Definition 4 Let $g : \mathbb{R}^n \to \mathbb{R}$ be a real-valued continuous function whose argument is a non-precise vector $x^*$ with characterizing function $\xi$. The characterizing function of the non-precise value $y^* = g(x^*)$ is defined $\forall y \in \mathbb{R}$ as

$$\psi(y) = \begin{cases} \sup\{\xi(x) : x \in \mathbb{R}^n,\ g(x) = y\} & \text{if } g^{-1}(\{y\}) \ne \emptyset, \\ 0 & \text{otherwise.} \end{cases}$$

To demonstrate that the definition is reasonable, consider the sample sum. If the characterizing function of the individual measurements is represented by a rate of change, then the range of change in the sum is dictated by the greatest rate of change among the individual components. It can be shown that once again, $\psi$ defined above is a characterizing function.
Austrian Journal of Statistics, Vol. 34 (2005), No. 4, 375-390

Assuming that $g : M \to \mathbb{R}$ is a continuous function with $M \subseteq \mathbb{R}^n$ and $\mathrm{supp}(\xi(\cdot)) \subseteq M$, Viertl (1997) showed that the α-cuts $(B_\alpha(y^*);\ \alpha \in (0, 1])$ that define the characterizing function above consist of intervals of the form

$$B_\alpha(y^*) = \left[ \min_{x \in B_\alpha(x^*)} g(x),\ \max_{x \in B_\alpha(x^*)} g(x) \right].$$

In practice, this result coupled with (1) leads to the construction of the characterizing function. In order to deal with more general problems of inference in point and interval estimation as well as with Bayesian analysis, Viertl (1997) , in the first case.
This can be seen from the fact that n i=1 and we can choose the x i 's so that In the next example, we consider the sample variance.
, in the first case.
To demonstrate this, using (7), we may write The first term is constant, and choosing $x = \omega$, the second term is 0, so that In both examples, the characterizing functions decrease exponentially fast as the sample size increases.

The basis for inference involving non-precise data is the construction of the characterizing function $\xi(\cdot)$ of the $n$-dimensional non-precise vector describing the combined sample $x^*$. The statistical function $S(x_1, x_2, \dots, x_n)$ which is the basis of inference for precise data $x = (x_1, x_2, \dots, x_n)$ is then adapted for non-precise data by computing its characterizing function in accordance with the rule

$$\psi(y) = \sup\{\xi(x_1, \dots, x_n) : S(x_1, \dots, x_n) = y\}, \quad \forall y \in \mathbb{R}.$$

We make use of this procedure in the problem of classification.
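The α-cut construction for a statistic can be sketched for the sample mean. The example assumes triangular characterizing functions for the individual observations and the minimum combination rule (both illustrative choices; the data values are hypothetical). Under the minimum rule the combined α-cut is a box, and since the mean is increasing in each argument, its smallest and largest values over the box are attained at the lower and upper corners.

```python
def triangular_cut(alpha, left, peak, right):
    """alpha-cut [a, b] of a triangular characterizing function (hypothetical shape)."""
    return (left + alpha * (peak - left), right - alpha * (right - peak))


def mean_cut(alpha, observations):
    """alpha-cut of the non-precise sample mean under the minimum rule.

    The combined alpha-cut is the box of individual cuts; the mean is
    increasing in every argument, so its extremes sit at the box corners.
    """
    cuts = [triangular_cut(alpha, *obs) for obs in observations]
    n = len(cuts)
    return (sum(c[0] for c in cuts) / n, sum(c[1] for c in cuts) / n)


# Three hypothetical non-precise observations, each given as (left, peak, right).
sample = [(0.0, 1.0, 2.0), (1.5, 2.0, 3.0), (2.0, 3.0, 3.5)]

# Stacking the cuts for several levels sketches the characterizing function of the mean.
for alpha in (0.25, 0.5, 0.75, 1.0):
    low, high = mean_cut(alpha, sample)
    print(f"alpha={alpha:.2f}  cut=[{low:.3f}, {high:.3f}]")
```

For statistics that are not monotone in each argument (the sample variance, for instance) the corner shortcut does not apply, and the minimum and maximum over the box would have to be found by optimisation.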

The Problem of Classification
As an application of a classification problem involving non-precise data, we may wish to identify the species of fish on the basis of echo sounder measurements. The depth at which fish travel does not consist of a single number but rather is a non-precise number. Moreover, different species may travel at depths described by different characterizing functions.

In the simplest situation, for precise data, samples are observed from two populations $\pi_1, \pi_2$ described respectively by density functions $f_1(\cdot)$ and $f_2(\cdot)$. Let $c(i|j)$ be the cost of misclassification of class $j$ as class $i$, $i \ne j$, and let $p(i)$ be the prior probability for class $i$. We would like to classify a new observation, which may be vector valued, into either $\pi_1$ or $\pi_2$ so as to minimize the expected cost of misclassification (ECM).
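It can be shown (Johnson and Wichern, 1999, chapter 11) that the optimal rule consists of classifying the new observation $x$ into population $\pi_1$ provided

$$\frac{f_1(x)}{f_2(x)} \ge \frac{c(1|2)}{c(2|1)} \cdot \frac{p(2)}{p(1)}.$$

In what follows we will assume equal priors and equal costs, so that the condition becomes

$$\frac{f_1(x)}{f_2(x)} \ge 1.$$

As an example, suppose that $\pi_1, \pi_2$ are described by normal populations with known means and covariances given respectively by $(\mu_1, \Sigma_1)$ and $(\mu_2, \Sigma_2)$, with $\Sigma_1 = \Sigma_2 = \Sigma$. The optimal classification rule for precise data consists of classifying a new observation $x$ into population $\pi_1$ if and only if

$$(\mu_1 - \mu_2)' \Sigma^{-1} x - \tfrac{1}{2} (\mu_1 - \mu_2)' \Sigma^{-1} (\mu_1 + \mu_2) \ge 0.$$

When the parameters $(\mu_1, \mu_2, \Sigma)$ are unknown, samples of sizes $n_1, n_2$ respectively are taken and used to calculate the corresponding sample means $\hat\mu_1, \hat\mu_2$ and pooled sample covariance

$$\hat\Sigma = \frac{(n_1 - 1)\hat\Sigma_1 + (n_2 - 1)\hat\Sigma_2}{n_1 + n_2 - 2},$$

where $\hat\Sigma_1, \hat\Sigma_2$ are the respective sample covariances. The rule then consists of classifying a new observation $x$ into population $\pi_1$ if and only if

$$T(x; \hat\mu_1, \hat\mu_2, \hat\Sigma) = (\hat\mu_1 - \hat\mu_2)' \hat\Sigma^{-1} x - \tfrac{1}{2} (\hat\mu_1 - \hat\mu_2)' \hat\Sigma^{-1} (\hat\mu_1 + \hat\mu_2) \ge 0.$$

The statistic $T$ can be considered a score function. Viewed in terms of $T(x; \hat\mu_1, \hat\mu_2, \hat\Sigma)$, the region for classification of points into $\pi_1$ is always a subset of $\mathbb{R}$.

For non-precise data $x^{(1)*}, x^{(2)*}, x^*$, the characterizing function of the non-precise value $t^*$ of the statistic $T(x; \hat\mu_1, \hat\mu_2, \hat\Sigma)$ is given by its values ($\forall t \in \mathbb{R}$)

$$\psi(t) = \sup\{\xi(x^{(1)}, x^{(2)}, x) : x^{(i)} \in M^{(i)},\ x \in M,\ T(x; \hat\mu_1, \hat\mu_2, \hat\Sigma) = t\},$$

where $\xi$ is the characterizing function of the combined non-precise data and $M, M^{(i)}$ are the respective spaces for the non-precise observations. If the support of $t^*$ is contained in either the interval $[0, \infty)$ or its complement $(-\infty, 0)$, the observation is classified into either $\pi_1$ or $\pi_2$, respectively. On the other hand, if the support has a non-empty intersection with both of the intervals $[0, \infty)$ and $(-\infty, 0)$, then the classification is ambiguous. We now consider some examples.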
The decision of where to place the measurement clearly depends on the support of $t^*$. Assume that $\mu_1 > \mu_2$. For values of $\omega \gg (\mu_1 + \mu_2)/2$, the characterizing function will be centered around a large positive number. Consequently, the measurement is likely to be classified into $\pi_1$. Conversely, for values of $\omega \ll (\mu_1 + \mu_2)/2$, the measurement is likely to be classified into $\pi_2$. For values of $\omega \approx (\mu_1 + \mu_2)/2$, the characterizing function will be centered around 0 and then the classification will be ambiguous.
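The three outcomes described above can be illustrated for the univariate case with known parameters. This is a sketch under stated assumptions: it uses the usual linear score for two normal populations with common variance, T(x) = (µ1 - µ2)(x - (µ1 + µ2)/2)/σ², and represents the non-precise measurement only through the interval that forms its support (the numerical values and function names are hypothetical). Since T is linear in x, the support of t* is the image of the support of x*.

```python
def score(x, mu1, mu2, sigma2):
    """Linear score T(x) for two univariate normal classes with common variance (assumed form)."""
    return (mu1 - mu2) / sigma2 * (x - 0.5 * (mu1 + mu2))


def score_support(support, mu1, mu2, sigma2):
    """Image of the support interval of x* under T; T is linear, so endpoints map to endpoints."""
    ends = sorted(score(e, mu1, mu2, sigma2) for e in support)
    return (ends[0], ends[1])


def classify(support, mu1, mu2, sigma2):
    """Classify from the support of t*: inside [0, inf), inside (-inf, 0), or straddling 0."""
    t_lo, t_hi = score_support(support, mu1, mu2, sigma2)
    if t_lo >= 0.0:
        return "pi_1"
    if t_hi < 0.0:
        return "pi_2"
    return "ambiguous"


mu1, mu2, sigma2 = 2.0, 0.0, 1.0             # hypothetical known parameters, mu1 > mu2
print(classify((2.5, 3.0), mu1, mu2, sigma2))    # support well above the midpoint
print(classify((-1.0, 0.5), mu1, mu2, sigma2))   # support well below the midpoint
print(classify((0.5, 1.5), mu1, mu2, sigma2))    # support straddling the midpoint
```

Working with the support alone gives the strictest reading of the rule; repeating the same computation on α-cuts for α > 0 would show at which levels an ambiguous case becomes decidable.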
The example above can be generalized to the case where the parameters are unknown and are estimated on the basis of non-precise data.The latter will then serve to modify the characterizing function of t * .
We now consider the general classification problem involving several populations. For precise data, Fisher recommended the use of sample linear discriminants. These are defined as follows. Let $(x_{i1}, \dots, x_{in_i})$ be a sample of observations from the $i$th population and define the corresponding mean vectors. Define as well the between-groups and within-groups variation matrices, $B$ and $W$ respectively. Let $(\lambda_s)$ denote the non-zero eigenvalues of $W^{-1}B$ arranged in decreasing order and let $(e_s)$ denote the corresponding eigenvectors. Then the vector of coefficients $l$ which maximizes the ratio $l'Bl / l'Wl$ is given by $l_1 = e_1$. The first sample linear discriminant is given by $d_1 = e_1'x$. In general, the $s$th sample discriminant is given by $d_s = e_s'x$, and these are used to classify a future observation $x$ as follows. Compute the discriminants $y = (d_1, d_2, \dots)$ along with their vector of means $\mu_{Y_i} = (e_1'\mu_i, e_2'\mu_i, \dots)$ under population $\pi_i$. We assign $x$ to that population for which the distance $\|y - \mu_{Y_i}\|^2$ is smallest.
Let $D_i = \|y - \mu_{Y_i}\|^2$ represent the distance of the discriminants to their mean under the $i$th population. Then the characterizing function of $D_i$ is given by the following, $\forall t \in \mathbb{R}$:

$$\psi_{D_i}(t) = \sup\{\xi(\mathbf{x}) : D_i(\mathbf{x}) = t\}, \qquad (14)$$

where $\xi$ is the characterizing function of the combined non-precise data $\mathbf{x}$ (the samples together with the new observation). In the case where a single characterizing function, say from population $\pi_k$, emerges clearly to the left of all the others, the decision consists of classifying the measurement into $\pi_k$. In instances where the regions of support of the characterizing functions overlap, there will be ambiguity in the classification. The calculations involved in (14) are illustrated in the next section for normal populations using the notion of α-cuts.
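Example 9 Suppose that it is desired to classify a non-precise measurement $x^*$ into one of several multivariate populations having means $\mu_i$ and covariances $\Sigma_i$. The usual classification rule in the case where the data are precise consists of allocating $x$ to that population $\pi_k$ for which the quadratic score

$$d_k^Q(x) = -\tfrac{1}{2} \ln|\Sigma_k| - \tfrac{1}{2} (x - \mu_k)' \Sigma_k^{-1} (x - \mu_k) + \ln p_k$$

is largest. Here, $\{p_k\}$ represent the prior probabilities of selection of the populations. If the parameters are unknown, they are replaced by standard estimates $\hat\mu_k, \hat\Sigma_k, \hat p_k$ and the quadratic score becomes

$$\hat d_k^Q(x) = -\tfrac{1}{2} \ln|\hat\Sigma_k| - \tfrac{1}{2} (x - \hat\mu_k)' \hat\Sigma_k^{-1} (x - \hat\mu_k) + \ln \hat p_k.$$

For non-precise data, the characterizing function corresponding to the $i$th score becomes, $\forall t \in \mathbb{R}$,

$$\psi_i(t) = \sup\{\xi(\mathbf{x}) : \hat d_i^Q(\mathbf{x}) = t\},$$

where $\xi$ is the characterizing function of the combined non-precise data. We now consider some numerical examples.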

Classification -Example
In order to illustrate our classification rule with non-precise observations, we consider three population classes with truncated Gaussian characterizing functions. (For all $x$ such that $\xi(x) < \alpha$ for some small $\alpha$, we set $\xi(x) = 0$. This yields a finite support.) We consider samples of size $n_i = 25$ for each class $i = 1, 2, 3$, and
• We set $c_1 = 0$, $c_2 = 2$, and $c_3 = 3$, the "centers" of each class.
• For each non-precise observation $j$ from class $i$, we set $\mu_{i,j} = c_i + U_{i,j}$, where the $U_{i,j}$ are iid uniform random variables on $[-1, 1]$. We also set $\sigma_{i,j} = V_{i,j}$, iid uniform random variables on $[0.1, 0.5]$.
• We generated a "new observation", $x^*$, also with a truncated Gaussian characterizing function with $\mu = 0.65$ and $\sigma = 0.05$.
We then build the characterizing functions for the $D_i$, the distances of the discriminants to their means under each class, as given in formula (14). This is done as follows. We set a fixed value $\alpha$, and we consider the α-cuts for each observation to evaluate the minimum and maximum values taken by $D_i$ for this value of $\alpha$. We do this for several values of $\alpha$ to obtain a sketch of the characterizing functions for the $D_i$. This is illustrated in the top graphic of Figure 1 where, from left to right, we see the characterizing functions for classes 1, 2, and 3, respectively. From this plot, we see that for $\alpha > 0.78$ (roughly), $x^*$ belongs to class 1. For $0.1 < \alpha < 0.78$, there is ambiguity between populations 1 and 2, and for $\alpha < 0.1$, there is ambiguity between all three populations. We define the values $\alpha_c(1, 2) = 0.78$ and $\alpha_c(1, 3) = 0.1$ as critical points. Another way to describe the classification of $x^*$ is to say that it belongs to class 1 with confidence $1 - \alpha_c(1, 2) = 0.22$, and to class 1 or 2 with confidence $1 - \alpha_c(1, 3) = 0.9$.
Formally, looking at classes $i \ne j$, we define the critical value $\alpha_c(i, j)$ as the level at which the α-cuts of $D_i$ and $D_j$ intersect, if any. In the bottom plot of Figure 1, we compute $\alpha_c(1, 2)$ and $\alpha_c(1, 3)$ for a range of precise observations $x \in [0, 2]$. When we overlay the characterizing function of $x^*$, we clearly see the critical values 0.78 and 0.1. This approach can be generalized to any number of classes and dimensions.
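The search for a critical value can be sketched in a simplified one-dimensional setting. The code below makes several assumptions that are not in the paper: the characterizing functions are untruncated Gaussian shapes, the minimum combination rule is used, the distance for class i is taken to be D_i = (x - m_i)² with m_i the class sample mean (the scaling by the discriminant coefficient is ignored), and the random seed is arbitrary, so the values obtained will not reproduce the 0.78 and 0.1 read from Figure 1. Because α-cuts are nested, overlap is monotone in α and bisection locates the largest level at which two α-cuts of the distances still intersect.

```python
import math
import random


def gaussian_cut(alpha, mu, sigma):
    """alpha-cut of the Gaussian-shaped function exp(-(x - mu)^2 / (2 sigma^2))."""
    half = sigma * math.sqrt(max(0.0, -2.0 * math.log(alpha)))
    return mu - half, mu + half


def mean_cut(alpha, sample):
    """alpha-cut of the class sample mean under the minimum rule: average the endpoints."""
    cuts = [gaussian_cut(alpha, mu, sigma) for mu, sigma in sample]
    n = len(cuts)
    return sum(c[0] for c in cuts) / n, sum(c[1] for c in cuts) / n


def distance_cut(alpha, new_obs, sample):
    """Range of D = (x - m)^2 as x and the class mean m range over their alpha-cuts."""
    a, b = gaussian_cut(alpha, *new_obs)
    m_lo, m_hi = mean_cut(alpha, sample)
    d_lo, d_hi = a - m_hi, b - m_lo          # range of the difference x - m
    d_min = 0.0 if d_lo <= 0.0 <= d_hi else min(d_lo * d_lo, d_hi * d_hi)
    return d_min, max(d_lo * d_lo, d_hi * d_hi)


def overlap(alpha, new_obs, sample_i, sample_j):
    """True when the alpha-cuts of D_i and D_j intersect."""
    lo_i, hi_i = distance_cut(alpha, new_obs, sample_i)
    lo_j, hi_j = distance_cut(alpha, new_obs, sample_j)
    return lo_i <= hi_j and lo_j <= hi_i


def critical_alpha(new_obs, sample_i, sample_j, tol=1e-6):
    """Largest alpha at which the cuts of D_i and D_j still overlap (bisection)."""
    lo, hi = 1e-12, 1.0
    if overlap(hi, new_obs, sample_i, sample_j):
        return 1.0
    if not overlap(lo, new_obs, sample_i, sample_j):
        return 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if overlap(mid, new_obs, sample_i, sample_j):
            lo = mid
        else:
            hi = mid
    return lo


rng = random.Random(0)                        # arbitrary seed for reproducibility
centers = [0.0, 2.0, 3.0]
samples = [[(c + rng.uniform(-1.0, 1.0), rng.uniform(0.1, 0.5)) for _ in range(25)]
           for c in centers]
new_obs = (0.65, 0.05)                        # (mu, sigma) of the new observation x*

a12 = critical_alpha(new_obs, samples[0], samples[1])
a13 = critical_alpha(new_obs, samples[0], samples[2])
print(f"alpha_c(1,2) = {a12:.3f}, alpha_c(1,3) = {a13:.3f}")
```

Above the first critical value the new observation is assigned to class 1 without ambiguity; between the two values classes 1 and 2 cannot be separated.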

Classification -Discussion
The main difference in classification between precise and non-precise numbers lies in the interpretation of $f_i(x)$, the class $i$ probability density function (pdf) for precise quantities, versus $\xi_i(x)$, the class $i$ characterizing function for non-precise quantities. For non-precise numbers, class $i$ numbers take all values with $\xi_i(x) > 0$ simultaneously, and $\xi_i(x)$ represents the intensity at the (precise) value $x$. Properties such as the area under the curve being 1 (in the continuous case) no longer hold here. This interpretation is crucial in our definitions of classification functions.
As an illustration, consider some two-dimensional objects, such as weather patterns, groups of animals, etc., in $\mathbb{R}^2$. In the simple example shown in Figure 2, we have two square-shaped objects, with respective areas of 1 and 25. We assume that the density is uniform for both objects, with characterizing functions $\xi_i(x, y) = 1$, $i = 1, 2$, inside the respective squares, and 0 elsewhere. In this case, a point $(x, y)$ that belongs to both objects is such that $\xi_1(x, y)/\xi_2(x, y) = 1$, since both objects have the same intensity at $(x, y)$. However, if we consider the objects via pdf's, we get $f_1(x, y)/f_2(x, y) = 25$, which gives much more weight to the smaller object.
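The two ratios can be verified directly. The sketch assumes, for illustration only, that both squares are axis-aligned and centred at the origin with sides 1 and 5 (the actual placement in Figure 2 is not specified here); any point common to both squares gives the same result.

```python
def xi_square(x, y, side):
    """Characterizing function: 1 inside an origin-centred square of the given side, else 0."""
    return 1.0 if abs(x) <= side / 2 and abs(y) <= side / 2 else 0.0


def pdf_square(x, y, side):
    """Uniform density on the same square: the indicator divided by the area."""
    return xi_square(x, y, side) / (side * side)


point = (0.2, -0.3)                       # a point lying inside both squares
xi_ratio = xi_square(*point, 1.0) / xi_square(*point, 5.0)
pdf_ratio = pdf_square(*point, 1.0) / pdf_square(*point, 5.0)
print(xi_ratio, pdf_ratio)
```

The density ratio equals the ratio of the areas, which is why normalising each object to unit mass favours the smaller one.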

Two Likelihood Scores
If we deal with precise numbers, the pdf $f_i(x)$ is a measure of the likelihood of class $i$ for observation $x$, and the straightforward classification strategy is to choose the class $i$ that maximizes $f_i(x)$. We can assign the following scores for each class,

$$S_{f_i}(x) = \frac{f_i(x)}{\sum_j f_j(x)},$$

which are interpreted as a membership function, or fuzzy classification values. We use this definition first and consider the following scores for each class $i$:

$$S^{I}_{\xi_i}(x) = \frac{\xi_i(x)}{\sum_j \xi_j(x)}.$$

In this case however, since the $\xi_i(\cdot)$ all take values in the same range $[0, 1]$, we can define another score based on the α-cuts. For a given observation $x$, assume that $\xi_1(x) \ge \xi_2(x) \ge \dots \ge \xi_N(x) \ge 0$ without loss of generality (this is simply a re-definition of the class labels). We see that
• $x$ belongs to the class 1 α-cut for $\xi_1(x) \ge \alpha > \xi_2(x)$.
At a given $\alpha$ value, the interpretation is that $x$ could belong to all classes $i$ such that $x$ is in the α-cut for this class. We assign scores according to the following table.
Summing all entries from this table we get $\xi_1(x)$, so the score assigned to class $i$ is given by the sum of all values in column $i$, divided by $\xi_1(x)$.
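The two scores can be sketched as follows. The first function is the normalised ratio of characterizing-function values. The second is an assumption about the α-cut score: since the table itself is not reproduced here, the sketch supposes that each band of α values between two consecutive ordered values of ξ is shared equally among the classes whose α-cuts contain x. This choice makes the entries sum to ξ_1(x), as stated above, but the function names and the equal-sharing rule are illustrative.

```python
import math


def ratio_scores(xi_values):
    """Score of each class as its xi value divided by the sum over classes."""
    total = sum(xi_values)
    if total <= 0.0:
        raise ValueError("at least one characterizing function must be positive")
    return [v / total for v in xi_values]


def alpha_cut_scores(xi_values):
    """alpha-cut score, assuming each alpha band is split equally among eligible classes."""
    n = len(xi_values)
    order = sorted(range(n), key=lambda i: xi_values[i], reverse=True)
    ranked = [xi_values[i] for i in order] + [0.0]
    top = ranked[0]
    if top <= 0.0:
        raise ValueError("at least one characterizing function must be positive")
    scores = [0.0] * n
    for k in range(n):
        # Band ranked[k+1] < alpha <= ranked[k]: x lies in the cuts of the k+1 top classes.
        share = (ranked[k] - ranked[k + 1]) / (k + 1)
        for i in order[: k + 1]:
            scores[i] += share / top
    return scores


# Gaussian-shaped values at x = 0.65 for unit-width classes centred at 0, 2 and 3.
x = 0.65
xi_values = [math.exp(-0.5 * (x - c) ** 2) for c in (0.0, 2.0, 3.0)]
print(ratio_scores(xi_values))
print(alpha_cut_scores(xi_values))
```

Both score vectors sum to 1; the α-cut version gives the leading class credit for the whole band of levels at which it is the only candidate.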

Example
We consider a simple 2-class example where $\xi_1(x)$ is a triangle defined in the range $[0, 2]$ with $\xi_1(1) = 1$, while $\xi_2(x) = 1$ over the range $[0, 2]$. This is illustrated in Figure 3, along with the corresponding probability density functions. Here are some values for the classification scores of class 1.
We see that $S_{f_1}(1)$ differs from the other ones at $x = 1$, since it must be the case that $\int f_i(x)\,dx = 1$, so the triangular-shaped pdf gets more weight at $x = 1$. In the non-precise case however, observing $x = 1$ is as likely to come from either class. We also see that $S^{I}_{\xi}$ and $S^{II}_{\xi}$ differ, and in the next subsection, we show that $S^{II}_{\xi_1}(x)$ is always a better choice with respect to some global error criterion.

An Error Criterion
In the $N$-class classification problem for non-precise numbers with characterizing functions $\xi_i(x)$, $i = 1, \dots, N$, let $S_\xi$ be a scoring function such that $S_{\xi_i}(x) \ge 0$ is the score for class $i$ at $x$, and $\sum_{i=1}^{N} S_{\xi_i}(x) = 1$, $\forall x$. We define the following error function in the continuous case for some large $\Delta$. (In practice, we usually assume finite support so $\Delta$ is finite.) So for each class $i$, we sum all the scores given to the other classes, weighted by $\xi_i(\cdot)$. In the discrete case, the integral is replaced by a summation over all cases for which $\xi_i(x) > 0$.
In the example seen previously where $\xi_1(x)$ is triangular-shaped and $\xi_2(x)$ is uniform, we get $\Upsilon(S^{I}_{\xi}) \approx 0.301$, $\Upsilon(S^{II}_{\xi}) \approx 0.292$, and $\Upsilon(S_f) \approx 0.338$ when using the pdf's. For our next example, we consider two Gaussian distributions (respectively, characterizing functions) with mean and variance $(0, 1)$ for class 1 and $(1, \gamma)$ for class 2, and we look at several values for $\gamma$. In Figure 4, we plot the quantities $\Upsilon(S^{I}_{\xi})/\Upsilon(S_f)$ (ratio I) and $\Upsilon(S^{II}_{\xi})/\Upsilon(S_f)$ (ratio II). When $\gamma = 1$, we see that $\Upsilon(S^{I}_{\xi}) = \Upsilon(S_f)$ as expected. Moreover, we see that $\Upsilon(S^{II}_{\xi}) < \Upsilon(S^{I}_{\xi})$ for all cases considered. This is in fact a general result that we prove next.
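The error criterion can be approximated numerically for the triangular and uniform example. This is a sketch under explicit assumptions: the error is taken to be the integral over [0, 2] of the ξ_i-weighted scores given to the other classes, summed over classes and divided by the number of classes times the length of the interval (this normalisation is a guess, chosen because it gives values of the same order as those quoted above), and the α-cut score shares each band of α equally among eligible classes. The helper names are hypothetical.

```python
def xi1(x):
    """Triangular characterizing function on [0, 2] with xi1(1) = 1."""
    return max(0.0, 1.0 - abs(x - 1.0))


def xi2(x):
    """Uniform characterizing function equal to 1 on [0, 2]."""
    return 1.0 if 0.0 <= x <= 2.0 else 0.0


def ratio_scores(values):
    """Each value divided by the total."""
    total = sum(values)
    return [v / total for v in values]


def alpha_cut_scores(values):
    """alpha-cut score with equal sharing inside each alpha band (assumed rule)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i], reverse=True)
    ranked = [values[i] for i in order] + [0.0]
    scores = [0.0] * n
    for k in range(n):
        share = (ranked[k] - ranked[k + 1]) / (k + 1)
        for i in order[: k + 1]:
            scores[i] += share / ranked[0]
    return scores


def error(score_at, xis, lo=0.0, hi=2.0, steps=20000):
    """Midpoint-rule estimate of the assumed error criterion for a scoring rule."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        scores = score_at(x)
        for i, xi in enumerate(xis):
            # Scores sum to 1, so the score given to the other classes is 1 - scores[i].
            total += xi(x) * (1.0 - scores[i]) * h
    return total / (len(xis) * (hi - lo))


xis = [xi1, xi2]
err_I = error(lambda x: ratio_scores([xi1(x), xi2(x)]), xis)
err_II = error(lambda x: alpha_cut_scores([xi1(x), xi2(x)]), xis)
# pdf-based scores: the triangle already integrates to 1, the uniform pdf is 1/2.
err_f = error(lambda x: ratio_scores([xi1(x), 0.5 * xi2(x)]), xis)
print(err_I, err_II, err_f)
```

Under these assumptions the α-cut score gives the smallest error of the three and the pdf-based score the largest, in line with the ordering reported in the text.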
We illustrate the likelihood scores given in (18) and (19) with an example based on the one presented in Section 3. Here, we assume that each population is represented by a truncated Gaussian characterizing function with respective centers $\mu_1 = 0$, $\mu_2 = 2$, and $\mu_3 = 3$. We assume that all $\sigma = 1$. In Figure 5, we compute the likelihood functions for the three classes for precise values of $x \in [0, 5]$. In particular, we look at $x = 0.65$ and notice that with both measures, this point is more likely to belong to class 1, with respective scores of 0.64 and 0.75.

Conclusion
In this paper, we presented a framework to address the problem of classification for non-precise quantities. This was achieved by writing the characterizing functions of the distances to discriminants with respect to each class of observations. The notion of critical value $\alpha_c(i, j)$ between classes $i$ and $j$ was also introduced. We also compared a new likelihood score to a straightforward extension from probability density functions. We showed that our likelihood score is always a better choice under some global error criterion.

Figure 2: Three classes in two dimensions.
Example 8 Consider the case of two univariate normal populations with known means $\mu_1, \mu_2$ and common known variance $\sigma^2$. Assume that the characterizing function of a new measurement $x$ is given as in Example 2. Then the characterizing function of the non-precise value $t^*$ is given by