Application of Interactive Regularized Discriminant Analysis to Wine Data

Testing the possibility of determining the geographical origin (country) of wines on the basis of chemico-analytical parameters was the aim of the European project "Establishing of a wine data bank for analytical parameters for wines from Third countries (G6RD-CT-2001-00646-WINE DB)", supported by the European Commission. To this end a data base containing 400 samples of commercial and authentic wines from Hungary, the Czech Republic, Romania and South Africa was created. For each of these samples around 100 analytical parameters, among them rare earth elements and isotopic ratios, were measured. Besides other multivariate statistical methods of discrimination and classification, the method of regularized discriminant analysis (RDA) was used to distinguish the wines of the different countries on the basis of a minimal number of the most important parameters. A MATLAB program developed by Vandev (2004), which allows interactive stepwise discriminant model building based on an optimal choice of the "nonlinearity" parameter α, was used. This program is described briefly, and models for commercial wines with the corresponding classification and prediction error rates are given. As a result of using RDA it was possible to reduce the number of analytical parameters to the eight most important for inferring the geographical origin of these commercial wines.

Austrian Journal of Statistics, Vol. 35 (2006), No. 1, 45–55

Introduction
The responsible wine control authorities are often confronted with products which are not correctly labelled with regard to their origin, vintage and quality parameters. To detect such adulterations, the identification of the geographical origin of wines is of great interest to wine consumers and producers (Römisch et al., 2001). This was the background for creating a data base of wines from Hungary, the Czech Republic, Romania and South Africa over a period of three years (2001–2003) within the scope of a European project.
Every year 400 commercial and authentic wine samples were collected according to a sampling plan. Commercial wines were purchased directly from the wine producers of the respective countries, whereas authentic wines were produced under standardized conditions in a laboratory. For each of the samples of the first year around 100 chemical parameters were analyzed. After these first analyses, taking the experience of the oenologists involved into consideration, it was possible to reduce this number to 63: 58 regular parameters plus 5 rare earth ratios that the chemists suggested including.
Data management included the handling of missing and censored data, log-transformations of 90% of the data and the identification of univariate and multivariate outliers. Then descriptive and inferential univariate methods, variance and correlation analyses and multivariate classification and projection methods were applied to all wine data and separately to authentic as well as commercial red and white wines.
For the case of commercial wines some results of linear, quadratic and regularized discriminant analyses are presented.

Discriminant Analysis
Discriminant analysis is used to analyze differences between two or more groups with respect to a set of variables measured on the objects of these groups. Two questions are to be answered:
1. Which variables are the most important for discriminating between the groups? (discrimination problem)
2. To which groups should objects (elements, cases) whose group membership is unknown be assigned on the basis of their variable values? Which correct classification rates can be achieved with the estimated discriminant model? (classification and prediction problem)
The influence of the independent variables on the groups is to be investigated. Discriminant functions containing significant variables are estimated, and objects are classified on the basis of the estimated discriminant model. Different methods of discrimination (e.g. linear, quadratic, regularized, nonparametric, ...) can be used. "Good" discriminant models contain the most important variables for separating the groups with minimal misclassification rates.
Here we restrict ourselves to presenting the results of a one-parameter regularized discriminant analysis, which includes the linear and quadratic cases.

Classification
Methods of discriminant analysis (McLachlan, 1992; Fahrmeir et al., 1996) allow assigning objects to one of K (K ≥ 2) distinct groups on the basis of a feature vector x = (x_1, ..., x_p) containing the measurements of each object. Moreover, the separability of the groups in the feature space can be analyzed.
Let the categorical variable Y denote the group membership of an object, where Y = k means that it belongs to the group with index k (k = 1, ..., K). Moreover, each object is characterized by the p-dimensional feature vector X. Let p_k = P(Y = k) be the prior probability that an object belongs to the group with index k, and let f(x|k) be the conditional density of X given Y = k. The unconditional density of X is then

    f(x) = Σ_{k=1}^{K} p_k f(x|k).

Of special interest for classification problems is the posterior probability p(k|x), i.e. the probability that an object with observed feature vector x belongs to group k. According to Bayes' formula this conditional probability of Y given X = x can be written as

    p(k|x) = p_k f(x|k) / f(x).    (1)

Two well known allocation rules can be derived:
• Maximum likelihood allocation rule, for the special case that p_k = p, ∀k: assign x to the group k with f(x|k) = max_j f(x|j).
• Bayes allocation rule: assign x to the group k with p(k|x) = max_j p(j|x).
That is, an object with feature vector x is assigned to the group with the largest posterior probability. The Bayes rule achieves minimal misclassification risk among all possible rules. All allocation rules considered have the general structure

    assign x to group k if d_k(x) = max_j d_j(x),    (2)

where the d_j(x) are called discriminant functions.
In practice the conditional densities f(x|k), and sometimes also the prior probabilities p_k, are unknown and have to be estimated from a learning sample; for this purpose, for example, a distributional assumption for the groups can be used.
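To make the allocation rules concrete, here is a minimal numerical sketch (the one-dimensional numbers are invented for illustration, not taken from the wine data): the posterior probabilities p(k|x) are computed via Bayes' formula and the object is assigned to the group with the largest posterior.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

def posterior(x, priors, mus, sigmas):
    """Posterior probabilities p(k|x) via Bayes' formula (1)."""
    joint = [p * normal_pdf(x, m, s) for p, m, s in zip(priors, mus, sigmas)]
    total = sum(joint)                      # unconditional density f(x)
    return [j / total for j in joint]

# Two hypothetical groups with equal priors and unit variances
priors = [0.5, 0.5]
mus, sigmas = [0.0, 2.0], [1.0, 1.0]
post = posterior(1.5, priors, mus, sigmas)
k_hat = max(range(len(post)), key=lambda k: post[k])   # Bayes rule: largest posterior
```

Since x = 1.5 lies closer to the second group mean, the Bayes rule assigns it to the second group.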
2.2 Linear, Quadratic, and Regularized Discriminant Analysis
We assume normality for the p-dimensional feature vector X_k in group k, X_k ~ N_p(µ_k, Σ_k), where µ_k denotes the group mean and Σ_k the group covariance matrix. The conditional distribution of X given Y = k can then be described by the density of the normal distribution

    f(x|k) = (2π)^{-p/2} |Σ_k|^{-1/2} exp(-½ (x - µ_k)' Σ_k^{-1} (x - µ_k)).    (3)

Inserting (3) into (1) and (2) and taking the logarithm leads to the discriminant function of the form

    d_k(x) = -½ ln|Σ_k| - ½ (x - µ_k)' Σ_k^{-1} (x - µ_k) + ln p_k.    (4)

Using allocation rule (2) with equation (4) minimizes the misclassification risk and is called Quadratic Discriminant Analysis (QDA), since it separates the disjoint regions of the feature space corresponding to each group assignment by quadratic boundaries. Linear Discriminant Analysis (LDA) is used if the group covariance matrices are identical, i.e., Σ_k = Σ, ∀k. In this case the rule that minimizes the misclassification risk leads to a linear separation of the groups: the quadratic term in the discriminant function is then the same for all groups and can be eliminated. Whether LDA or QDA should be preferred depends on the structure of the data. For real data the parameters µ_k and Σ_k are unknown and have to be estimated (µ̂_k and Σ̂_k) from a given training sample. In practice LDA often leads to better classification results than QDA, even when the true group covariance matrices are not equal, because fewer model parameters have to be estimated and LDA is more robust against violations of its basic assumptions.
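Discriminant function (4) and allocation rule (2) can be sketched in a few lines; the group means, covariances and priors below are invented for illustration and are not estimates from the wine study.

```python
import numpy as np

def qda_discriminant(x, mu, Sigma, prior):
    """Quadratic discriminant function (4):
       d_k(x) = -1/2 ln|Sigma_k| - 1/2 (x - mu_k)' Sigma_k^{-1} (x - mu_k) + ln p_k."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)        # numerically stable log-determinant
    return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(Sigma, diff) + np.log(prior)

# Two hypothetical groups with equal priors but different covariance matrices
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigma = [np.eye(2), 2.0 * np.eye(2)]
x = np.array([0.5, 0.5])
scores = [qda_discriminant(x, mu[k], Sigma[k], 0.5) for k in range(2)]
k_hat = int(np.argmax(scores))                  # allocation rule (2): largest d_k(x)
```

With identical covariance matrices the quadratic term is common to all groups and the boundary becomes linear, which is the LDA special case described above.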
Regularization techniques are successfully used in solving ill- and poorly posed problems. If the number of parameters to be estimated is comparable to or even larger than the sample size, the parameter estimates can be highly unstable. Friedman (1989) proposed Regularized Discriminant Analysis (RDA) as a compromise between linear and quadratic discriminant analysis, with two steps of regularization. First, the estimated group covariance matrix Σ̂_k is regularized by

    Σ̂_k(λ) = ((1 − λ) n_k S_k + λ n S) / ((1 − λ) n_k + λ n),

where S_k and S are the sample-based covariance matrix estimates and n_k and n the corresponding sample sizes. The regularization parameter λ ∈ [0, 1] controls the degree of shrinkage of the group covariance matrix estimates toward the pooled estimate. If n is less than or comparable to p, the estimate of Σ_k should be regularized further by

    Σ̂_k(λ, γ) = (1 − γ) Σ̂_k(λ) + γ c_k I_p,

where I_p is the p × p identity matrix and c_k = tr(Σ̂_k(λ))/p. For a given value of λ ∈ [0, 1], the additional regularization parameter γ ∈ [0, 1] controls shrinkage toward a multiple of the identity matrix. The multiplier c_k is the average of the eigenvalues of Σ̂_k(λ). This shrinkage decreases the larger eigenvalues and increases the smaller ones of Σ̂_k(λ), thereby counteracting the bias of the estimates. In Vandev (2004) the covariance matrices are stabilized by a single parameter α, i.e.,

    Σ̂_k(α) = α S_k + (1 − α) S.

This parameter α ∈ [0, 1] corresponds to (1 − λ) above. The limiting cases correspond to LDA (α = 0) and QDA (α = 1). To determine the optimal value of α, the estimated error rate has to be minimized during the model building process. Resubstitution, cross validation or simulation methods are often used for error rate estimation; the methods we have used are described in Section 3.
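A minimal sketch of this one-parameter stabilization, assuming the stabilized matrix is the convex combination α·S_k + (1 − α)·S of the group and pooled covariance estimates:

```python
import numpy as np

def regularized_cov(S_k, S_pooled, alpha):
    """One-parameter stabilization of the group covariance matrix:
       alpha = 0 yields the pooled matrix (LDA limit),
       alpha = 1 the individual group matrix (QDA limit)."""
    return alpha * S_k + (1.0 - alpha) * S_pooled

# Invented 2x2 covariance estimates for illustration
S_k = np.array([[2.0, 0.0], [0.0, 1.0]])
S_pooled = np.eye(2)
Sigma_mid = regularized_cov(S_k, S_pooled, 0.5)
```

Intermediate values of α interpolate between the two limits, which is what makes the interactive search over α ∈ [0, 1] in the model building process meaningful.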
3 The MATLAB-Program "ldagui"
The MATLAB-program "ldagui" is described in detail in Vandev (2004). It can be operated by means of menus, shortcuts and listboxes. Figure 1 shows the main window of the program.

Menus
Five menus File, Model, Diagnostics, Use and Help can be activated.
• In File a csv data file can be loaded; after choosing a classification and a selection variable, missing data are replaced with group means.
• Model allows one to build a model interactively, guided by minimal classification and test errors and an optimal choice of the regularization parameter α ∈ [0, 1].
• Diagnostics contains three tools for making adequate decisions:
  - Test: A small random test sample with 600 observations per group is generated according to the estimated group means and covariance matrices and is then classified.
  - "Leave-one-out" (LOO, a special case of cross validation): Classical: for each observation in the training sample a model with the same variables is built, but without that particular observation; each removed observation is then classified with this model, all misclassifications are counted and the LOO error is estimated. Modification: not only the removed observation but all observations of the training sample are classified, all misclassifications are counted and the LOO error is estimated.
  - Plot: The second and third canonical variables are plotted against the first.
• In Use other csv data files can be loaded for testing the model ("hold-out" method).
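The two LOO variants described above can be sketched as follows; a nearest-group-mean classifier stands in for the full discriminant model, so this is an illustration of the procedure, not of the ldagui code itself:

```python
import numpy as np

def nearest_mean_predict(X_train, y_train, X_eval):
    """Classify rows of X_eval by the nearest group mean
       (a simplified stand-in for the full discriminant model)."""
    labels = np.unique(y_train)
    means = {k: X_train[y_train == k].mean(axis=0) for k in labels}
    return np.array([min(means, key=lambda k: np.linalg.norm(x - means[k]))
                     for x in X_eval])

def loo_errors(X, y):
    """Classical LOO: refit without observation i and classify only that
       observation.  Modified LOO: classify the whole training sample with
       each reduced model and average the resulting error rates."""
    n = len(y)
    classical, modified = 0, 0.0
    for i in range(n):
        mask = np.arange(n) != i                      # leave observation i out
        pred_i = nearest_mean_predict(X[mask], y[mask], X[i:i + 1])[0]
        classical += int(pred_i != y[i])
        pred_all = nearest_mean_predict(X[mask], y[mask], X)
        modified += np.mean(pred_all != y)
    return classical / n, modified / n

# Two well-separated toy groups; both LOO errors should be zero here
X = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
              [5.0, 5.0], [5.2, 5.0], [5.0, 5.2]])
y = np.array([0, 0, 0, 1, 1, 1])
err_classical, err_modified = loo_errors(X, y)
```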
More detailed results are printed in the MATLAB command window, e.g.
• the variables in the model, ordered, with their F- and p-values,
• Wilks' Λ and its p-value,
• the results of the error estimation for the training sample by resubstitution, simulation (test and theoretical error) and cross validation (classical and modified LOO), including the number and cases of misclassifications as well as the cases classified with posterior probability < 0.8.
The theoretical error was estimated in the same way as the test error, but using a large simulated data sample (6000 per group); the LOO error was obtained as the proportion of all errors relative to the size of the training sample.
The algorithms are based on papers of Jennrich (1977) and Einslein et al. (1977).
4 Results of Applying RDA to Wine Data

Overview of Models for Commercial Wines
Several models for commercial wines obtained by using RDA are presented in Table 1.
Here we have used the following strategy: First we searched for our "best" model (Model 1) by choosing the optimal parameter α manually, so that the model has zero or only a small number of classification and test errors. Then we considered the same model for α = 0 (LDA) and α = 1 (QDA). In a next step we tried to find better linear and quadratic models, and we considered some other acceptable models for different α.
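This manual search for an optimal α can be imitated by a simple grid search over resubstitution errors; the plug-in Gaussian classifier below is a simplified, hypothetical reimplementation for illustration, not the ldagui code:

```python
import numpy as np

def rda_fit_predict(X, y, alpha):
    """Plug-in Gaussian classifier with the one-parameter regularized
       covariance alpha*S_k + (1-alpha)*S_pooled; returns resubstitution
       predictions for the training sample."""
    labels = np.unique(y)
    n, p = X.shape
    S_pooled = sum((y == k).sum() * np.cov(X[y == k].T, bias=True)
                   for k in labels) / n
    params = {}
    for k in labels:
        Xk = X[y == k]
        Sk = np.cov(Xk.T, bias=True)
        # small jitter keeps the matrix invertible in degenerate cases
        Sig = alpha * Sk + (1.0 - alpha) * S_pooled + 1e-8 * np.eye(p)
        params[k] = (Xk.mean(axis=0), Sig, len(Xk) / n)

    def d(x, mu, Sig, prior):
        diff = x - mu
        _, logdet = np.linalg.slogdet(Sig)
        return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(Sig, diff) + np.log(prior)

    return np.array([max(labels, key=lambda k: d(x, *params[k])) for x in X])

def best_alpha(X, y, grid):
    """Pick the alpha on the grid with the smallest resubstitution error."""
    errs = [(float(np.mean(rda_fit_predict(X, y, a) != y)), a) for a in grid]
    return min(errs)[1]

# Two well-separated synthetic groups stand in for the wine data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(3.0, 0.3, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
alpha_opt = best_alpha(X, y, [0.0, 0.25, 0.5, 0.75, 1.0])
```

In practice cross-validation or simulated test errors (as in ldagui) are better criteria than the resubstitution error used in this sketch, which is optimistically biased.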
Classification and prediction errors and misclassified samples will be given.

Description of RDA-Model 1
On the basis of the printed results of "ldagui", our preferred model (RDA-Model 1 with α = 0.95) is described in more detail.

Models for White and Red Wines
Table 4 contains our preferred RDA models for white and red wines. In both cases only six variables were selected as important for separating the four countries. By simulating 6000 samples per group, very small "theoretical" error rates were obtained.

Conclusions
The classical methods of discriminant analysis are suitable for distinguishing wines from different countries. Discriminant models containing the most important parameters and yielding minimal misclassification rates can be given. In particular, methods of regularized discriminant analysis led to good results in our investigation of commercial wines. Using our preferred RDA Model 1 for all commercial wines, which is much better than the corresponding LDA model and comparable with the QDA model, all 195 wines could be classified correctly by the resubstitution method. A Wilks' Λ near 0 indicates a high discriminating power of the chosen model. Only 13 wines were classified with posterior probability < 0.8. By simulating 6000 wine samples per country, a "theoretical" correct classification rate of 96.32% was obtained. The "leave-one-out" method led to correct classification rates between 96.4% (classical LOO) and 99.95% (modified LOO).
The eight most important variables are: the isotopic ratios Ethanol (D/H)1 and Ethanol (D/H)2, the trace elements Strontium and Zinc, the macroelements Calcium and Silicon, the biogenic amine Ethanolamine, and the classical parameter Gluconic Acid. Figure 2 shows the good separation of the countries by Model 1.
As expected, the South African wines could be separated very easily from those of the other countries. Only the isotopic ratios were identified as important and sufficient parameters in the discriminant model.
Considering only white or only red commercial wines, RDA models with six variables led to very good discrimination of the wines of the four countries.

Table 1: Model results for commercial wines (N = 195).
* No. of LOO cases leading to one or more misclassifications of cases of the whole training sample
** LOO mean error of misclassifications over the whole training sample

Table 2 contains the variables resulting from interactive model building, and Figure 2 illustrates this model.

Table 4: Model results for white and red commercial wines.
* No. of LOO cases which lead to one or more misclassifications of cases of the whole training sample
** LOO mean error of misclassifications over the whole training sample