Austrian Journal of Statistics, Vol. 41 (2012), No. 1, 5-26

Domain-Based Benchmark Experiments: Exploratory and Inferential Analysis

Benchmark experiments are the method of choice for comparing learning algorithms empirically. For collections of data sets, the empirical performance distributions of a set of learning algorithms are estimated, compared, and ordered. Usually this is done for each data set separately. The present manuscript extends this single data set-based approach to a joint analysis of the complete collection, the so-called problem domain. This makes it possible to decide which algorithms to deploy in a specific application or to compare newly developed algorithms with well-known algorithms on established problem domains. Specialized visualization methods allow for easy exploration of huge amounts of benchmark data. Furthermore, we take the benchmark experiment design into account and use mixed-effects models to provide a formal statistical analysis. Two domain-based benchmark experiments demonstrate our methods: the UCI domain as a well-known domain when one is developing a new algorithm; and the Grasshopper domain as a domain where we want to find the best learning algorithm for a prediction component in an enterprise application software system.

Introduction
In statistical learning, benchmark experiments are empirical investigations with the aim of comparing and ranking learning algorithms with respect to a certain performance measure. In particular, on a data set of interest the empirical performance distributions of a set of learning algorithms are estimated. Exploratory and inferential methods are used to compare the distributions and to finally set up mathematical (order) relations between the algorithms. The foundations for such a systematic modus operandi are defined by Hothorn et al. (2005); they introduce a theoretical framework for inference problems in benchmark experiments and show that standard statistical test procedures can be used to test for differences in the performances. The practical toolbox is provided by Eugster (2011, Chapter 2); there we introduce various analysis methods and define a systematic four-step approach from exploratory analyses via formal investigations through to the algorithms' orders based on a set of performance measures.
Modern computer technologies like parallel, grid, and cloud computing (i.e., technologies subsumed by the term High-Performance Computing; see, for example, Hager and Wellein, 2010) now enable researchers to compare sets of candidate algorithms on sets of data sets within a reasonable amount of time. Especially in the Machine Learning community, services like MLcomp (Abernethy and Liang, 2010) and MLdata (Henschel et al., 2010), which provide technical frameworks for computing performances of learning algorithms on a wide range of data sets, have recently gained popularity. Of course there is no algorithm which is able to outperform all others for all possible data sets, but it still makes a lot of sense to order algorithms for specific problem domains. Typical application scenarios for the latter are deciding which algorithm to deploy in a specific application, or comparing a newly developed algorithm with other algorithms on a well-known domain.
A problem domain in the sense of this paper is a collection of data sets. For a benchmark experiment the complete domain or a sample from the domain is available. Note that such a domain might even be indefinitely large, e.g., the domain of all fMRI images of human brains. Naturally, domain-based benchmark experiments produce a "large bunch of raw numbers" and sophisticated analysis methods are needed; in fact, automatisms are required as inspection by hand is no longer possible. This motivation is related to Meta-Learning, i.e., predicting algorithm performances for unseen data sets (see, for example, Pfahringer and Bensusan, 2000; Vilalta and Drissi, 2002). However, we are interested in learning about the algorithms' behaviors on the given problem domain.
From our point of view the benchmarking process consists of three hierarchical levels: (1) In the Setup level the design of the benchmark experiment is defined, i.e., data sets, candidate algorithms, performance measures and a suitable resampling strategy are declared. (2) In the Execution level the defined setup is executed. Here, computational aspects play a major role; an important example is the parallel computation of the experiment on different computers. (3) In the Analysis level the computed raw performance measures are analyzed using exploratory and inferential methods.

This paper covers the Setup and Analysis levels and is organized as follows: Section 2 reviews the theoretical framework for benchmark experiments defined by Hothorn et al. (2005) and extends it to sets of data sets. In Section 3 we first define how the local (single data set-based) benchmark experiments have been done. Given the computation of local results for each data set of the domain, Section 3.1 introduces visualization methods to present the results in their entirety. In Section 3.2 we take the design of the domain-based benchmark experiments into account and model it using mixed-effects models. This enables an analysis of the domain based on formal statistical inference. In Section 4 we demonstrate the methods on two problem domains: the UCI domain (Section 4.1) as a well-known domain, useful, for example, when one is developing a new algorithm; and the Grasshopper domain (Section 4.2) as a black-box domain, where we simply want to find the best learning algorithm for predicting whether a grasshopper species is present or absent in a specific territory. Section 5 concludes the article with a summary and future work. All computational details are provided in the section "Computational details". All methods proposed in this paper have been fully automated and will be made available as open source software upon publication of this manuscript.

Design of Experiments
The design elements of benchmark experiments are the candidate algorithms, the data sets, the learning samples (and corresponding validation samples) drawn with a resampling scheme from each data set, and the performance measures of interest. In each trial the algorithms are fitted on a learning sample and validated on the corresponding validation sample according to the specified performance measures.
Formally, following Hothorn et al. (2005), a benchmark experiment (for one data set and one performance measure) is defined as follows: Given is a data set $L = \{z_1, \dots, z_N\}$. We draw $b = 1, \dots, B$ learning samples of size $n$ using a resampling scheme, such as sampling with replacement (bootstrapping, usually of size $n = N$) or subsampling without replacement ($n < N$):

$$L^b = \{z_1^b, \dots, z_n^b\}$$

We assume that there are $K > 1$ candidate algorithms $a_k$, $k = 1, \dots, K$, available for the solution of the underlying learning problem. For each algorithm $a_k$ the function $a_k(\cdot \mid L^b)$ is the fitted model based on a learning sample $L^b$ ($b = 1, \dots, B$). This function itself has a distribution $A_k$ as it is a random variable depending on $L$:

$$a_k(\cdot \mid L^b) \sim A_k(L), \quad k = 1, \dots, K$$

The performance of the candidate algorithm $a_k$ when provided with the learning data $L^b$ is measured by a scalar function $p$:

$$p_{bk} = p(a_k, L^b) \sim P_k = P_k(L)$$

The $p_{bk}$ are samples drawn from the distribution $P_k(L)$ of the performance measure of the algorithm $a_k$ on the data set $L$.

This paper focuses on the important case of supervised learning tasks, i.e., each observation $z \in L$ is of the form $z = (y, x)$ where $y$ denotes the response variable and $x$ describes a vector of input variables (note that for readability we omit the subscript $i = 1, \dots, N$ for $x$, $y$, and $z$). The aim of a supervised learning task is to construct a learner $\hat{y} = a_k(x \mid L^b)$ which, based on the input variables $x$, provides us with information about the unknown response $y$. The discrepancy between the true response $y$ and the predicted response $\hat{y}$ for an arbitrary observation $z \in L$ is measured by a scalar loss function $l(y, \hat{y})$. The performance measure $p$ is then defined by some functional $\mu$ of the loss function's distribution over all observations:

$$p_{bk} = p(a_k, L^b) = \mu\bigl(l(y, a_k(x \mid L^b))\bigr) \sim P_k(L)$$

Typical loss functions for classification are the misclassification and the deviance (or cross-entropy). The misclassification error for directly predicted class labels is

$$l(y, \hat{y}) = I(y \neq \hat{y}),$$

and the deviance for predicted class probabilities $\hat{y}_g$ ($g = 1, \dots, G$ classes) is

$$l(y, \hat{y}) = -2 \sum_{g=1}^{G} I(y = g) \log \hat{y}_g.$$

The absolute error and the squared error are common loss functions for regression. Both measure the difference between the true and the predicted value; in the case of the squared error this difference is penalized quadratically:

$$l(y, \hat{y}) = |y - \hat{y}|, \qquad l(y, \hat{y}) = (y - \hat{y})^2$$

Reasonable choices for the functional $\mu$ are the expectation and the median (in association with absolute loss). With a view to practicability in real-world applications, further interesting performance measures are the algorithm's execution time and the memory requirements (for fitting and prediction, respectively).
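The loss functions and the functional $\mu$ described above can be written down in a few lines. The following is a minimal sketch, not the authors' implementation; the function names and the dictionary representation of class probabilities are assumptions made for illustration only.

```python
import math
import statistics


def misclassification(y, y_hat):
    """0/1 loss I(y != y_hat) for a directly predicted class label."""
    return 1.0 if y != y_hat else 0.0


def deviance(y, probs):
    """Deviance -2 * sum_g I(y = g) * log(y_hat_g).

    `probs` (assumed representation) maps each class label g to the
    predicted probability y_hat_g; only the true class contributes.
    A predicted probability of 0 for the true class is not handled here.
    """
    return -2.0 * math.log(probs[y])


def absolute_error(y, y_hat):
    """Absolute loss |y - y_hat| for regression."""
    return abs(y - y_hat)


def squared_error(y, y_hat):
    """Squared loss (y - y_hat)^2 for regression."""
    return (y - y_hat) ** 2


def performance(losses, functional="mean"):
    """Summarize per-observation losses with an empirical functional."""
    if functional == "mean":
        return statistics.mean(losses)
    if functional == "median":
        return statistics.median(losses)
    raise ValueError("unknown functional: " + functional)
```

For example, `performance([misclassification(y, p) for y, p in pairs])` gives a misclassification rate, while `performance(abs_losses, "median")` pairs the median with absolute loss as suggested in the text.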
The focus of benchmark experiments is on the general performance of the candidate algorithms. Therefore, using the performance on the learning data set $L^b$ as the basis for further analyses is not a good idea (as is commonly known). Thus, as in most cases we are not able to compute $\mu$ analytically, we use the empirical functional $\hat{\mu}_T$ based on a test sample $T$:

$$\hat{p}_{bk} = \hat{p}(a_k, L^b) = \hat{\mu}_T\bigl(l(y, a_k(x \mid L^b))\bigr) \sim \hat{P}_k(L)$$

This means we compute the performance of the model (fitted on the learning sample $L^b$) for each observation in the test sample $T$ and apply the empirical functional $\hat{\mu}$ to summarize over all observations. Due to resampling effects it is not given that $\hat{P}_k$ approximates $P_k$; for example, cross-validation overestimates the true performance distribution. The most common example for the empirical functional $\hat{\mu}$ is the mean, that is, the empirical analogue of the expectation. Most further analysis methods require independent observations of the performance measure; therefore we define the validation sample $T$ in terms of out-of-bag observations: $T = L \setminus L^b$.

Now, in real-world applications we are often interested in more than one performance measure (e.g., misclassification and computation time) within a domain of problems (e.g., the domain of patients' data in a clinic). A domain is specified by a collection of data sets. In detail, for the candidate algorithm $a_k$ we are interested in the $j = 1, \dots, J$ performance distributions on the $m = 1, \dots, M$ data sets which define the problem domain $\mathcal{D} = \{L_1, \dots, L_M\}$:

$$p_{mbkj} = p_j(a_k, L_m^b) \sim P^j_{mk} = P^j_k(L_m)$$

The $p_{mbkj}$ are samples drawn from the $j$th performance distribution $P^j_k(L_m)$ of the algorithm $a_k$ on the data set $L_m$. Analogously to the above, the performance is measured on a validation sample, i.e., $\hat{p}_{mbkj}$ is computed and the empirical performance distribution $\hat{P}^j_k(L_m)$ is estimated.
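To make the design concrete, the following minimal sketch runs a benchmark experiment on one data set with bootstrap learning samples and out-of-bag validation samples. The two "algorithms" are toy stand-ins chosen so that the example needs no external library; the data, the function names, and the parameter values are all assumptions for illustration.

```python
import random
import statistics


def majority_learner(learn):
    """Toy algorithm: always predict the most frequent label of the learning sample."""
    labels = [y for y, _ in learn]
    top = max(sorted(set(labels)), key=labels.count)
    return lambda x: top


def nearest_neighbor_learner(learn):
    """Toy algorithm: 1-nearest neighbor on a single numeric input."""
    return lambda x: min(learn, key=lambda z: abs(z[1] - x))[0]


def benchmark(data, algorithms, B, seed=0):
    """Return B misclassification estimates per algorithm.

    Each observation z = (y, x). Learning samples are bootstrap samples
    of size n = N; the validation sample holds the out-of-bag observations.
    """
    rng = random.Random(seed)
    N = len(data)
    perf = {name: [] for name in algorithms}
    for _ in range(B):
        idx = [rng.randrange(N) for _ in range(N)]       # learning sample L^b
        drawn = set(idx)
        oob = [i for i in range(N) if i not in drawn]    # T = L \ L^b
        if not oob:                                      # practically never happens
            continue
        learn = [data[i] for i in idx]
        for name, fit in algorithms.items():
            model = fit(learn)
            losses = [1.0 if model(data[i][1]) != data[i][0] else 0.0 for i in oob]
            perf[name].append(statistics.mean(losses))   # empirical functional: mean
    return perf


# Hypothetical data set: the label is 1 whenever the input exceeds 0.5.
data = [(int(i > 20), i / 40) for i in range(40)]
algorithms = {"majority": majority_learner, "nn": nearest_neighbor_learner}
perf = benchmark(data, algorithms, B=25, seed=1)
```

Each list in `perf` is a sample from the empirical performance distribution of one algorithm; repeating the call for every data set of a domain and every performance measure yields the $\hat{p}_{mbkj}$.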

Analysis of Experiments
The execution of a benchmark experiment as defined in Section 2 results in $M \times B \times K \times J$ raw performance measures, i.e., $M \times K \times J$ empirical performance distributions $\hat{P}^j_{mk}$. This allows a multitude of questions to be analyzed with a wide variety of methods, for example: computing an order of the algorithms based on some simple summary statistics from the empirical performance distributions; or, more sophisticated, testing hypotheses of interest by modeling the performance measure distributions and using statistical inference. Additionally, each question can be answered on different scopes, i.e., locally for each data set, or globally for the domain. For the remainder of this paper we assume that the following systematic stepwise approach has been executed for each given data set $L_m$:

1. Compare candidate algorithms: The candidate algorithms are compared pairwise based on their empirical performance distributions $\hat{P}^j_{mk}$ by simple summary statistics or statistical tests (parametric or non-parametric); this results in $J$ comparisons.
Example: The algorithms svm, rpart, and rf are compared; the pairwise comparisons according to their misclassification errors are {svm ≺ rf, rpart ≺ rf, svm ∼ rpart} (based on a statistical test), and according to their computation times {rpart ≺ rf, rf ≺ svm, rpart ≺ svm} (based on the mean statistics).

2. Compute performance relations: The $J$ comparisons are interpreted as an ensemble of relations $\mathcal{R}_m = \{R_1, \dots, R_J\}$. Each $R_j$ represents the relation of the $K$ algorithms with respect to a specific performance measure and the data set's preference as to the candidate algorithms.

Example (cont.): The data set's preferences are $R_1$ = svm ∼ rpart ≺ rf in the case of the misclassification error and $R_2$ = rpart ≺ rf ≺ svm in the case of the computation time.

3. Aggregate performance order relations: The ensemble $\mathcal{R}_m$ is aggregated to, for example, a linear or partial order $\bar{R}_m$ of the candidate algorithms. As a suitable class of aggregation methods we use consensus rankings. The individual relations $R_j$ can be weighted to express the importance of the corresponding performance measure.

Example (cont.): The linear order with the weights $w_1 = 1$ and $w_2 = 0.2$ (i.e., computation time is much less important than the misclassification error) is then rpart ≺ svm ≺ rf.
These data of different aggregation levels are available for each data set $L_m$ of the problem domain $\mathcal{D}$. The obvious extension of the local approach to compute a domain-based order relation is the further aggregation of the data sets' algorithm orders (Hornik and Meyer, 2007):

4. Aggregate local order relations: The domain-specific order relation $\bar{R}$ of the algorithms is computed by aggregating the ensemble of consensus relations $\mathcal{R} = \{\bar{R}_1, \dots, \bar{R}_M\}$ using consensus methods.
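The aggregation in Steps 3 and 4 can be illustrated with a weighted Borda count over average ranks. Note that this is a deliberately simple stand-in: the paper uses consensus rankings, which are optimization-based and need not agree with a Borda count in general. The representation of a relation as a chain of tie groups and all function names are assumptions for this sketch.

```python
def average_ranks(chain):
    """Map each algorithm to its average rank.

    `chain` is a list of tie groups, best first; for example
    [["svm", "rpart"], ["rf"]] encodes svm ~ rpart < rf.
    """
    ranks, position = {}, 1
    for group in chain:
        avg = position + (len(group) - 1) / 2   # tied algorithms share a rank
        for algorithm in group:
            ranks[algorithm] = avg
        position += len(group)
    return ranks


def weighted_borda(chains, weights):
    """Aggregate several relations into one linear order (best first)."""
    scores = {}
    for chain, weight in zip(chains, weights):
        for algorithm, rank in average_ranks(chain).items():
            scores[algorithm] = scores.get(algorithm, 0.0) + weight * rank
    # Lower weighted rank sum is better; names break exact ties deterministically.
    return sorted(scores, key=lambda a: (scores[a], a))


# Relations from the running example in the text.
R1 = [["svm", "rpart"], ["rf"]]      # misclassification error
R2 = [["rpart"], ["rf"], ["svm"]]    # computation time
local_order = weighted_borda([R1, R2], weights=[1.0, 0.2])
```

With the weights of the running example, `local_order` is rpart, svm, rf, which matches the linear order stated in the text. Applying the same function with equal weights to the local orders of all data sets gives a simple analogue of Step 4.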
This approach allows the automatic computation of a statistically correct domain-based order of the algorithms. But the "strong" aggregation to relations does not allow statements on the problem domain to a greater extent. In the following we introduce methods to visualize and to model the problem domain based on the individual benchmark experiment results. On the one hand these methods provide support for the global order $\bar{R}$; on the other hand they uncover structural interrelations of the problem domain $\mathcal{D}$.

Visualizing the Domain
A benchmark experiment results in $M \times K \times J$ estimated performance distributions $\hat{P}^j_{mk}$. The simplest visualizations are basic plots which summarize the distributions, like strip charts, box plots, and density plots, conditioned by the domain's data sets. So-called Trellis plots (Becker et al., 1996) allow a relative comparison of the algorithms within and across the set of data sets.
Figure 1 shows a Trellis plot with box plots of six algorithms' misclassification errors (knn, lda, nnet, rf, rpart, and svm) on a problem domain defined by 21 UCI data sets (Section 4.1 provides the experiment details). Using this visualization we see that there are data sets in this domain which are equally "hard" for all candidate algorithms, like ttnc, mnk3 or BrsC, while the algorithms perform much more heterogeneously on other data sets, like prmnt and rngn. From an algorithm's view, lda, for example, has the highest misclassification error of the problem domain on the data sets Crcl and Sprl (which are circular data). Moreover, whenever lda solves a problem well, other algorithms perform equally well.
Further basic visualizations allowing relative comparisons of the estimated performance distributions $\hat{P}^j_{mk}$ based on descriptive statistics are stacked bar plots, spine plots and mosaic plots. In all visualizations one axis contains the data sets and the other the stacked performance measure (either raw or relative). Figure 2a shows, as an example, the stacked bar plot of the UCI domain's mean misclassification errors (the order of the data sets is explained below). Notice, for example, that for the candidate algorithms the data set mnk3 is on average much "less hard" to solve than livr. This plot is an indicator of a learning problem's complexity; if all candidate algorithms solve the problem well, it is probably an easy one. (Figure 2b is explained further down.)

Now, in addition to the empirical performance distributions $\hat{P}^j_{mk}$, the pairwise comparisons, the resulting set of relations $\mathcal{R}_m$, and the locally aggregated orders $\bar{R}_m$ are available. To incorporate this aggregated information into the visualizations we introduce an appropriate distance measure. Kemeny and Snell (1972) show that for order relations there is only one unique distance measure $d$ which satisfies axioms natural for preference relations (we refer to the original publication for the definition and explanation of the axioms). The symmetric difference distance $d_\Delta$ satisfies these axioms; it is defined as the cardinality of the relations' symmetric difference, or equivalently, the number of pairs of objects being in exactly one of the two relations $R_1$, $R_2$ ($\oplus$ denotes the logical XOR operator):

$$d_\Delta(R_1, R_2) = \#\{(a_k, a_{k'}) \mid (a_k, a_{k'}) \in R_1 \oplus (a_k, a_{k'}) \in R_2,\; k, k' = 1, \dots, K\}$$

Computing all pairwise distances for the relations $R_m$ ($m = 1, \dots, M$) results in a symmetric $M \times M$ distance matrix $D$ representing the distances of the domain $\mathcal{D}$ based on the candidate algorithms' performances. An obvious way to analyze $D$ is to hierarchically cluster it. Because detecting truly similar data sets within a domain is most interesting (from our point of view), we propose to use agglomerative hierarchical clustering with complete linkage (see, e.g., Hastie et al., 2009). Figure 3 shows the corresponding dendrogram for the UCI domain's relations $\mathcal{R} = \{R_1, \dots, R_{21}\}$ based on the algorithms' misclassification errors. Crcl and Sonr, for example, are in one cluster; this means that the candidate algorithms are in similar relations (note that the raw performance measures are not involved anymore). Therefore, it is hard to see these similarities in basic visualizations like the stacked bar plot (Figure 2a), even if the data sets are ordered according to the data sets' linear order determined by the hierarchical clustering.
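The distance computation and the clustering step can be sketched as follows. Relations are encoded as sets of ordered pairs, so the symmetric difference distance is simply the size of the set-wise XOR. The three data sets A, B, and C and their relations are hypothetical, and the naive complete-linkage routine is an illustration only, not the implementation used for Figure 3.

```python
from itertools import combinations


def relation_from_chain(chain):
    """Encode a chain of tie groups (best first) as the set of pairs
    (a, b) meaning 'a performs at least as well as b'."""
    pairs = set()
    for i, better in enumerate(chain):
        for worse in chain[i:]:
            pairs.update((a, b) for a in better for b in worse)
    return pairs


def symdiff_distance(r1, r2):
    """Number of pairs contained in exactly one of the two relations."""
    return len(r1 ^ r2)


def complete_linkage(names, dist):
    """Naive agglomerative clustering with complete linkage.

    Returns the merge history as (cluster, cluster, height) tuples.
    """
    def height(c1, c2):
        return max(dist[a][b] for a in c1 for b in c2)

    clusters = [frozenset([n]) for n in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: height(*pair))
        merges.append((set(c1), set(c2), height(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges


# Hypothetical local relations of three data sets.
chains = {
    "A": [["svm"], ["rf"], ["lda"]],
    "B": [["svm"], ["rf", "lda"]],
    "C": [["lda"], ["rf"], ["svm"]],
}
relations = {name: relation_from_chain(chain) for name, chain in chains.items()}
names = sorted(relations)
dist = {a: {b: symdiff_distance(relations[a], relations[b]) for b in names} for a in names}
merges = complete_linkage(names, dist)
```

Here A and B differ in a single pair and are merged first; C, whose relation is the reverse of A's, joins last. The merge heights are what a dendrogram would display on its vertical axis.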
The benchmark summary plot (bsplot) overcomes these difficulties by adapting the stacked bar plot and incorporating a set of relations $\mathcal{R}$. Each bar uses the total width, and is evenly divided into as many partitions as there are candidate algorithms. The partitions are assigned to the candidate algorithms and their order is determined by the corresponding (linear or partial) relation $\bar{R}_m$. A descriptive statistic of the corresponding empirical performance distribution $\hat{P}^j_{mk}$ of interest is then plotted as a bar from the bottom up in each partition; the values are scaled in such a way that the domain's worst performance fills the partition. Color coding is used to simplify interpretation: partition backgrounds use light colors, performance bars dark colors. Moreover, the relations $\bar{R}_m$ are visualized using borders between partitions. So, if there is, for example, a significant difference in the performance of two candidate algorithms, a (black) border is shown; otherwise no border is shown. The bars are sorted according to a linear order of the distance matrix $D$, just like the one computed by the hierarchical clustering. The axis representing the data sets is equivalent to the one in the stacked bar plot; the other axis is a podium for the candidate algorithms. Obviously, this plot only works in the case of linear orders. Visually interpreted, the aggregated global consensus relation $\bar{R}$ is the aggregation of the light colors over the data set axis.
Figure 2b shows the UCI domain's bsplot. In comparison with the stacked bar plot (Figure 2a) the individual benchmark experiment results are now more clearly visible. For example, svm (blue) has the most first places (13 times, 6 of them exclusively) and is never worse than third place. lda (purple) is the algorithm with the highest misclassification error (on data set Crcl) and knn (yellow) is the algorithm with the most last places. Based on lda the domain splits into two clusters, one where it performs well (i.e., a set of linearly separable data sets) and one where it does not (the non-linearly separable data sets). rf (orange) also performs well within this domain, while nnet (green) is of medium performance in most cases.
In the UCI problem domain the resulting relations are all transitive; this is not generally true for relations based on statistical tests (see Eugster, 2011). [...] layout follows a spring model and is computed using the Kamada-Kawai algorithm (see, e.g., Gansner and North, 2000, for a description and software implementation). The laid-out graph is then visualized with additional information available from the individual benchmark experiments. Our implementation shows the data sets' winner algorithms by filling the nodes with the corresponding colors; if there is no unique winner algorithm for a data set the node is unfilled. The edges' widths and colors correspond to the distances, i.e., the shorter the distance the wider and darker the edge. Our implementation allows showing only a subset of edges corresponding to a subset of distances to make the visualization clearer.
Figure 4 shows the UCI domain's bsgraph with those edges visible which correspond to the tenth smallest distance. Here, for example, it is clearly visible that subset A of the domain's data sets has similar algorithm performances (although only the winners are visualized). It is also visible that the domain splits into two sub-domains: sub-domain B where the algorithm svm (blue) performs best, and sub-domain C where the algorithms rf (orange) and lda (purple) perform best. For unfilled nodes, the dominant sub-domain algorithms are always among the winners, together with other algorithms (compare with Figure 2b).
The benchmark summary graph defines the basis for more complex visualizations. One item of future work is an interactive version along the lines of the gcExplorer, an interactive exploration tool for neighbor gene clusters represented as graphs (Scharl and Leisch, 2009, cf. Figure 1). This tool provides access to the nodes' complex information using interactivity; the same idea can be used for the benchmark summary graph. See Eugster and Leisch (2010) for the interactive analysis of benchmark experiments based on a single data set. Furthermore, looking at the introduced visualizations raises the question of why some candidate algorithms perform similarly on some data sets and not on others: which data set characteristics affect the algorithms' performances and lead to such clusters as seen in Figure 4? We investigate this question in Eugster et al. (2010), where we introduce a formal framework based on (recursively partitioning) Bradley-Terry models (the most widely used method to study preferences in psychology and related disciplines) for automatic detection of important data set characteristics and their joint effects on the algorithms' performances in potentially complex interactions.

Modeling the Domain
The analysis methods introduced so far (the aggregation of local relations to a domain-based order relation and the visualization methods) rely on locally (data set-based) computed results. In this section we take the design of domain-based benchmark experiments into account and model the $M \times K \times J$ estimated performance distributions $\hat{P}^j_{mk}$ for $J = 1$ accordingly. This enables a domain's analysis based on formal statistical inference.
A domain-based benchmark experiment with one performance measure of interest is a type of experiment with two experimental factors (the candidate algorithms and the domain's data sets), their interactions, and blocking factors at two levels (the blocking per data set and the replications within each data set). It is written as

$$p_{mbk} = \kappa_k + \beta_m + \beta_{mk} + \beta_{mb} + \epsilon_{mbk} \qquad (1)$$

with $m = 1, \dots, M$, $b = 1, \dots, B$, and $k = 1, \dots, K$. $\kappa_k$ represents the algorithms' mean performances, $\beta_m$ the mean performances on the domain's data sets, $\beta_{mk}$ the interactions between data sets and algorithms, $\beta_{mb}$ the effect of the subsampling within the data sets, and $\epsilon_{mbk}$ the systematic error. Linear mixed-effects models are the appropriate tool to estimate the parameters described in Formula 1. Mixed-effects models incorporate both fixed effects, which are parameters associated with an entire population or with certain repeatable levels of experimental factors, and random effects, which are associated with individual experimental or blocking units drawn at random from a population (Pinheiro and Bates, 2000). The candidate algorithms' effect $\kappa_k$ is modeled as a fixed effect, the data sets' effect $\beta_m$ as a random effect (as the data sets can be seen as randomly drawn from the domain they define). Furthermore, $\beta_{mk}$, $\beta_{mb}$ and $\epsilon_{mbk}$ are defined as random effects as well. The random effects follow $\beta_m \sim N(0, \sigma_1^2)$, $\beta_{mk} \sim N(0, \sigma_2^2)$, $\beta_{mb} \sim N(0, \sigma_3^2)$, and $\epsilon_{mbk} \sim N(0, \sigma^2)$. Analogous to single data set-based benchmark experiments, we can rely on the asymptotic normal and large sample theory (see Eugster, 2011).
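To see how the terms of the model interact, it can help to simulate data from it. The sketch below only generates observations according to the model equation; it does not fit the model (fitting by REML requires a mixed-effects library and is not shown). All parameter values and names are assumptions, and the simulated values are not constrained to the unit interval as real misclassification errors would be.

```python
import random
import statistics


def simulate(kappa, M, B, sd_m, sd_mk, sd_mb, sd_eps, seed=0):
    """Draw observations p_mbk = kappa_k + beta_m + beta_mk + beta_mb + eps_mbk.

    `kappa` maps algorithm names to fixed effects; the sd_* arguments are
    the standard deviations of the four random terms.
    """
    rng = random.Random(seed)
    rows = []
    for m in range(M):
        beta_m = rng.gauss(0, sd_m)                           # data set effect
        beta_mk = {k: rng.gauss(0, sd_mk) for k in kappa}     # interaction
        for b in range(B):
            beta_mb = rng.gauss(0, sd_mb)                     # replication effect
            for k, kappa_k in kappa.items():
                p = kappa_k + beta_m + beta_mk[k] + beta_mb + rng.gauss(0, sd_eps)
                rows.append((m, b, k, p))
    return rows


def algorithm_means(rows):
    """Average the simulated performances per algorithm."""
    values = {}
    for _, _, k, p in rows:
        values.setdefault(k, []).append(p)
    return {k: statistics.mean(v) for k, v in values.items()}


# Hypothetical fixed effects for two algorithms.
kappa = {"svm": 0.15, "rpart": 0.25}
rows = simulate(kappa, M=50, B=20, sd_m=0.10, sd_mk=0.02, sd_mb=0.01, sd_eps=0.03)
means = algorithm_means(rows)
difference = means["rpart"] - means["svm"]
```

Because both algorithms are evaluated on the same data sets and the same replications, the terms $\beta_m$ and $\beta_{mb}$ cancel in `difference`; only the interaction and the error term add noise around the true difference of 0.10. This is the reason the blocking structure matters for the pairwise comparisons.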
The most common method to fit linear mixed-effects models is to estimate the "variance components" by optimizing the restricted maximum likelihood (REML) through EM iterations or through Newton-Raphson iterations (see Pinheiro and Bates, 2000). The results are the estimated parameters: the variances $\hat{\sigma}_1^2$, $\hat{\sigma}_2^2$, $\hat{\sigma}_3^2$, and $\hat{\sigma}^2$ of the random effects; and the $K$ fixed effects. The model allows the following interpretation (of course conditional on the domain $\mathcal{D}$) for an algorithm $a_k$ and a data set $L_m$: $\hat{\kappa}_k$ is the algorithm's mean performance, $\hat{\beta}_m$ is the data set's mean complexity, and $\hat{\beta}_{mk}$ is the algorithm's performance difference from its mean performance conditional on the data set (colloquially, "how does the algorithm like the data set").
The parametric approach of mixed-effects models allows statistical inference, in particular hypothesis testing, as well. The most common null hypothesis of interest is "no difference between algorithms". The F-test is a global test of whether there are any differences between the algorithms which do not come from the "randomly drawn data sets" or the sampling. Pairwise comparisons, i.e., which algorithms actually differ, can be done using Tukey contrasts. The calculation of simultaneous confidence intervals enables controlling the experiment-wise error rate (we refer to Hothorn et al., 2008, for a detailed explanation).
Figure 5a shows simultaneous 95% confidence intervals for the algorithms' misclassification error based on a linear mixed-effects model. Two algorithms are significantly different if the corresponding confidence interval does not contain zero. The confidence intervals are large because of the very heterogeneous algorithm performances over the data sets (cf. Figure 2b; Section 4.1 describes the result in detail).

Now, statistical significance does not imply a practically relevant difference. As is commonly known, the degree of significance can be affected by drawing more or fewer samples. A possibility to control this characteristic of benchmark experiments is to define and quantify how large a significant difference has to be in order to be relevant. Let $[\Delta_1, \Delta_2]$ be the area of equivalence (zone of non-relevance). The null hypothesis is rejected if the $(1 - \alpha) \cdot 100\%$ confidence interval is completely contained in the area of equivalence (equivalence tests are the general method which considers relevance; see, for example, Wellek, 2003). Figure 5b shows the UCI domain's pairwise comparisons with $[-0.10, 0.10]$ as the area of equivalence. For example, the difference between rpart and rf is significant (the interval does not contain zero) but is not relevant (the area of equivalence completely contains the interval); the difference between svm and lda is significant and relevant. Of course, the definition of the area of equivalence contains the subjective view of the practitioner; normally, it is based on domain-specific knowledge.
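The two decision rules, significance and equivalence, reduce to simple interval checks once the simultaneous confidence intervals are available. The following minimal sketch assumes the interval bounds have already been computed elsewhere; the function name and the example intervals are hypothetical and not taken from Figure 5.

```python
def classify(ci_low, ci_high, delta_low, delta_high):
    """Return (significant, equivalent) for one pairwise comparison.

    significant: the confidence interval does not contain zero.
    equivalent:  the confidence interval lies completely inside the
                 area of equivalence [delta_low, delta_high], i.e. the
                 difference is not practically relevant.
    """
    significant = not (ci_low <= 0.0 <= ci_high)
    equivalent = delta_low <= ci_low and ci_high <= delta_high
    return significant, equivalent


# Hypothetical intervals with [-0.10, 0.10] as the area of equivalence.
small_but_clear = classify(0.02, 0.06, -0.10, 0.10)    # significant, not relevant
large_difference = classify(0.15, 0.30, -0.10, 0.10)   # significant, not equivalent
inconclusive = classify(-0.05, 0.20, -0.10, 0.10)      # neither
```

The first case mirrors the rpart versus rf situation described in the text: the interval excludes zero but sits entirely within the area of equivalence. How to treat intervals that only partially overlap the area of equivalence is left open in this sketch.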
Finally, the pairwise comparisons of candidate algorithms' significant or relevant differences make it possible to establish a preference relation based on the performance measure for the domain $\mathcal{D}$. From now on, the analysis of domain-based benchmark experiments proceeds analogously to the analysis of data set-based benchmark experiments. Following Eugster (2011, Chapter 2), we use the concept of preference relations based on statistical tests and their combination using consensus methods. Now, the methodology introduced above makes it possible to model the experiment for each performance measure; i.e., to fit $J$ linear mixed-effects models, to compute the significant or relevant pairwise comparisons, and to establish a preference relation $R_j$ for each performance measure ($j = 1, \dots, J$). Consensus methods aggregate the $J$ preference relations to a single domain-based order relation of the candidate algorithms. This can be seen as a multi-criteria or multi-objective optimization and makes it possible, for example, to select the best candidate algorithm with respect to a set of performance measures for the given domain $\mathcal{D}$.

Benchmarking UCI and Grasshopper Domains
We present two domain-based benchmark experiments, one for each application scenario sketched in the introduction. The UCI domain serves as a domain for the scenario of comparing a newly developed algorithm with other well-known algorithms on a well-known domain. We already used the UCI domain in the previous sections to illustrate the presented methods; we now give more details on this benchmark experiment and complete the analysis. The Grasshopper domain serves as a domain where we want to find the best candidate algorithm for predicting whether a grasshopper species is present or absent in a specific territory. The algorithm is then used as a prediction component in an enterprise application software system.

UCI Domain
The UCI domain is defined by 21 data sets of binary classification problems available from Asuncion and Newman (2007). We are interested in the behavior of six common learning algorithms: linear discriminant analysis (lda, purple), k-nearest neighbor classifiers (knn, yellow), classification trees (rpart, red), support vector machines (svm, blue), neural networks (nnet, green), and random forests (rf, orange); for all of them see, for example, Hastie et al. (2009). The benchmark experiment is defined with $B = 250$ replications, bootstrapping as the resampling scheme to generate the learning samples $L^b$, and the out-of-bag scheme for the corresponding validation samples $T^b$. Misclassification on the validation samples is the performance measure of interest.
Austrian Journal of Statistics, Vol. 41 (2012), No. 1, 5-26
Table 1: UCI domain's chains of preference relations $R = \{R_1, \ldots, R_{21}\}$.
A benchmark experiment is executed and analyzed on each data set according to the local systematic stepwise approach (Steps 1-3) given in the beginning of Section 3 (and defined in Eugster, 2011). The results are 21 × 6 × 1 estimated performance distributions $P^{j}_{mk}$, the corresponding pairwise comparisons based on mixed-effects models and test decisions for a given α = 0.05, and the resulting preference relations $R = \{R_1, \ldots, R_{21}\}$. Note that we present selected results; the complete results are available in the supplemental material (see the section on computational details on page 23).
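The resampling scheme of this experiment (bootstrap learning samples with out-of-bag validation samples, and misclassification as performance measure) can be sketched as follows. The function names, the seed, and the sample size of 200 observations are assumptions chosen for illustration only.

```python
import random


def bootstrap_oob_splits(n, B, seed=0):
    """Draw B bootstrap learning samples of size n; the validation sample of
    each replication consists of the observations not drawn (out-of-bag)."""
    rng = random.Random(seed)
    splits = []
    for _ in range(B):
        learn = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        in_bag = set(learn)
        oob = [i for i in range(n) if i not in in_bag]
        splits.append((learn, oob))
    return splits


def misclassification(y_true, y_pred):
    """Proportion of validation observations that are predicted incorrectly."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical data set with n = 200 observations and B = 250 replications.
splits = bootstrap_oob_splits(n=200, B=250)
print(len(splits), misclassification([0, 1, 1, 0], [0, 1, 0, 0]))
```

Each candidate algorithm would be fitted on every learning sample and evaluated on the matching out-of-bag sample, which yields B performance values per algorithm and data set.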
The Trellis plot in Figure 1 shows the box plots of the estimated performance distributions. Table 1 lists the resulting preference relations $R_m$; in this benchmark experiment all relations are transitive, therefore the listed chains of preferences can be built ($a_k \sim a_{k'}$ indicates no significant difference, $a_k \prec a_{k'}$ indicates a significantly better performance of $a_k$). The domain-based linear order relation R computed by the consensus method (Step 4) is:
This order coincides with the impression given by the bsplot's light colors in Figure 2b: svm (blue) has the most first places, rf (orange) the most second and some first places, lda (purple) has some first places, rpart (red) and nnet (green) share the middle places, and knn (yellow) has the most last places.
Computing the linear mixed-effects model leads to a model with the estimated candidate algorithm effects:
For data set mnk3, all algorithms except rf perform better than their mean performance; for livr only lda does. These estimated parameters conform with the performance visualizations in Figures 1 and 2. Figure 5 shows the (a) significant and (b) relevant pairwise comparisons. There is, for example, a significant difference between svm and rpart in favor of svm and no significant difference between svm and rf. The interpretation of the pairwise significant differences results in the incidence matrix shown in Figure 6a. The corresponding relation is neither a linear nor a partial order relation (as can be verified). However, plotting only the asymmetric part of its transitive reduction as a Hasse diagram enables a visualization and an interpretation of the relation (Hornik and Meyer, 2010). Figure 6b shows this Hasse diagram; nodes are ordered bottom-up. For the UCI domain and based on the mixed-effects model analysis we can state that rf and svm are better than knn, nnet, and rpart. In the case of lda this analysis allows no conclusion. This result corresponds with the global linear order relation R computed by the aggregation of the individual preference relations.
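The step from simultaneous confidence intervals to a preference relation and its Hasse diagram can be sketched as follows. The decision rule is an assumption: a difference is treated as significant when the interval excludes zero, and as relevant when the whole interval lies outside a symmetric area of equivalence. The algorithm names a1, a2, a3 and all interval bounds are hypothetical.

```python
def relation_from_intervals(intervals, equivalence=0.0):
    """intervals maps (a, b) to the (lower, upper) confidence bounds for
    'error of a minus error of b'. Returns the set of pairs (a, b) for which
    a is strictly better (lower error) than b. With equivalence=0 this tests
    significance; with equivalence>0 it tests relevance (assumed rule)."""
    better = set()
    for (a, b), (lower, upper) in intervals.items():
        if upper < -equivalence:
            better.add((a, b))
        elif lower > equivalence:
            better.add((b, a))
    return better


def transitive_closure(edges):
    """Add (a, d) whenever (a, b) and (b, d) are present, until nothing changes."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def transitive_reduction(edges):
    """Keep only edges not implied by a longer path (valid for acyclic relations);
    these are the edges drawn in a Hasse diagram."""
    closure = transitive_closure(edges)
    nodes = {x for edge in edges for x in edge}
    return {
        (a, c)
        for (a, c) in edges
        if not any(
            (a, b) in closure and (b, c) in closure
            for b in nodes
            if b not in (a, c)
        )
    }


# Hypothetical simultaneous confidence intervals for three algorithms.
intervals = {
    ("a1", "a2"): (-0.05, -0.01),
    ("a2", "a3"): (-0.04, -0.02),
    ("a1", "a3"): (-0.08, -0.04),
}
significant = relation_from_intervals(intervals)
relevant = relation_from_intervals(intervals, equivalence=0.03)
hasse_edges = transitive_reduction(significant)
print(sorted(significant), sorted(relevant), sorted(hasse_edges))
```

In this toy input all three differences are significant, but only one of them is relevant under the assumed equivalence area of 0.03.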

Grasshopper Domain
In this application example we are interested in finding the best algorithm among the candidate algorithms as a prediction component of an enterprise application software system. The domain is the domain of grasshopper species in Bavaria (Germany); the task is to learn whether a species is present or absent in a specific territory.
The data were extracted from three resources. The grasshopper species data are available in the "Bavarian Grasshopper Atlas" (Schlumprecht and Waeber, 2003). In this atlas, Bavaria is partitioned into quadrants of about 40 km$^2$. Presence or absence of each species is registered for each quadrant. The territory data consist of climate and land usage variables. The climate variables are available from the WorldClim project (Hijmans et al., 2005) in a 1 km$^2$ resolution. These 19 bioclimatic (metric) variables describe, for example, the seasonal trends and extreme values for temperature and rainfall. The data were primarily collected between 1960 and 1990. The land usage variables are available from the CORINE LandCover project CLC 2000 (Federal Environment Agency, 2004). Based on satellite images, the territory is partitioned into its land usage in a 100 m$^2$ resolution using FRAGSTATS 3.3 (McGarigal et al., 2002). These 20 (metric and categorical) land usage variables describe the percentage of, for example, forest, town, and traffic (we binarized a variable if not enough metric values were available). The climate and land usage variables are averaged for each quadrant for which the grasshopper data are available. Additionally, the Gauss-Krüger coordinates and the altitude are available for each quadrant. We use the standardized altitude but omit the coordinates, as the candidate algorithms are not able to estimate spatial autocorrelation and heterogeneity.
Figure 7: Trellis graphic with box plot of the candidate algorithms' misclassification error on the Grasshopper domain. Each data set is one grasshopper species.
To define the domain, we understand each grasshopper species as an individual data set. The quadrants where a species is present are positively classified; as negatively classified quadrants we draw random samples from the remaining ones. If enough remaining quadrants are available we create a balanced classification problem, otherwise we use all remaining quadrants. We only use data sets with more than 300 positively classified quadrants, so the Grasshopper domain is finally defined by 33 data sets.
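The construction of one data set per species can be sketched as follows. The function and argument names are hypothetical, and quadrants are represented by plain integer identifiers for illustration; the rules (balanced sampling of negatives when possible, and a minimum of more than 300 positive quadrants) follow the description above.

```python
import random


def build_species_dataset(present, all_quadrants, rng, min_positive=300):
    """Return a list of (quadrant, label) pairs for one species, or None if the
    species has too few presences. Label 1 = present, label 0 = absent."""
    positives = sorted(present)
    if len(positives) <= min_positive:
        return None  # only species with more than min_positive presences are used
    remaining = sorted(set(all_quadrants) - set(positives))
    if len(remaining) >= len(positives):
        # Enough absences: draw a balanced sample of negative quadrants.
        negatives = rng.sample(remaining, len(positives))
    else:
        # Not enough absences: use all remaining quadrants.
        negatives = remaining
    return [(q, 1) for q in positives] + [(q, 0) for q in negatives]


# Hypothetical territory with 1000 quadrants and three hypothetical species.
rng = random.Random(1)
quadrants = range(1000)
balanced = build_species_dataset(range(400), quadrants, rng)
unbalanced = build_species_dataset(range(700), quadrants, rng)
too_rare = build_species_dataset(range(300), quadrants, rng)
print(len(balanced), len(unbalanced), too_rare)
```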
The candidate algorithms of interest are linear discriminant analysis (lda, purple), k-nearest neighbor classifiers (knn, yellow), classification trees (rpart, red), support vector machines (svm, blue), naive Bayes classifier (nb, green), and random forests (rf, orange); for all of them see, for example, Hastie et al. (2009). The benchmark experiment is defined with $B = 100$ replications, bootstrapping as the resampling scheme to generate the learning samples $L^b$, and the out-of-bag scheme for the corresponding validation samples $T^b$. Misclassification on the validation samples is the performance measure of interest. Note that we present selected results; the complete results are available in the supplemental material (see the section on computational details on page 23).
Figure 7 shows the Trellis plot with box plots for the six algorithms' misclassification errors. We see that for most data sets the relative order of the candidate algorithms seems to be similar, but that the individual data sets differ in how "hard" they are to solve. The locally computed preference relations $R = \{R_1, \ldots, R_{33}\}$ (using mixed-effects models; see Eugster, 2011) contain non-transitive relations; therefore, a visualization using the benchmark summary plot is not possible. One possibility is to plot the asymmetric part of the transitive reduction (as in Figure 6b) for each of the 33 relations in a Trellis plot. However, such a plot is very hard to read, and the benchmark summary graph provides a simpler visualization (albeit with less information). Figure 8a shows the bsgraph with the six smallest distance levels visible. The nodes show the color of the algorithm with the minimum median misclassification error. We see that for most data sets rf (orange) is the best algorithm. The nodes' cross-linking indicates that the relations do not differ much in general. The algorithms follow the general order pattern even for data sets where this plot indicates a big difference, for example the NEUS data set (cf. Figure 7).
A consensus aggregation of R results in the following linear order:
This order confirms the exploratory analysis. To formally verify this order we compute the domain-based linear mixed-effects model and the resulting pairwise comparisons. Figure 8b shows the corresponding simultaneous 95% confidence intervals, and the resulting order is: $\text{rf} \prec \text{lda} \prec \text{svm} \prec \text{rpart} \prec \text{nb} \sim \text{knn}$.
All three analyses (exploratory, consensus-based, and mixed-effects model-based) lead to the same conclusion: the random forest learning algorithm is the best algorithm (according to the misclassification error) for the Grasshopper domain.

Summary
The large number of published benchmark experiments shows that this method is the primary choice to evaluate learning algorithms. Hothorn et al. (2005) define the theoretical framework for inference problems in benchmark experiments. Eugster (2011) introduces the practical toolbox with a systematic four-step approach from exploratory analysis via formal investigations through to a preference relation of the algorithms. The present publication extends the framework theoretically and practically from single data set-based benchmark experiments to domain-based (set of data sets) benchmark experiments.
Given the computation of local (single data set-based) benchmark experiment results for each data set of the problem domain, the paper introduces two specialized visualization methods. The benchmark summary plot (bsplot) is an adaptation of the stacked bar plot. It allows the visualization of statistics of the algorithms' estimated performance distributions together with the data sets' preference relations. This plot only works in the case of linear or partial order relations, while the benchmark summary graph (bsgraph) enables a general visualization. The problem domain is represented by a complete weighted graph with vertices for the domain's data sets. The edges' weights are defined by the pairwise symmetric difference distances. The laid-out graph is visualized with additional information from the local benchmark experiments, which makes it possible to find patterns within the problem domain.
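The distance computation behind the benchmark summary graph can be sketched as follows: each data set's preference relation is stored as a set of ordered pairs, and the edge weight between two data sets is the size of the symmetric difference of their relations. The data set names d1 to d3, the preference chains, and the helper names are hypothetical; the graph layout itself is not covered by this sketch.

```python
from itertools import combinations


def relation_from_chain(chain):
    """chain lists groups of tied algorithms from best to worst, e.g.
    [["svm", "rf"], ["knn"]]. Returns all pairs (a, b) where a performs
    at least as well as b (ties contribute both directions)."""
    pairs = set()
    for i, group in enumerate(chain):
        for a in group:
            for later_group in chain[i:]:
                for b in later_group:
                    pairs.add((a, b))
    return pairs


def distance_matrix(relations):
    """Symmetric difference distance for every pair of data sets."""
    return {
        (m1, m2): len(relations[m1] ^ relations[m2])
        for m1, m2 in combinations(sorted(relations), 2)
    }


def edges_at_levels(dist, n_levels):
    """Keep only the edges belonging to the n_levels smallest distinct distances."""
    levels = sorted(set(dist.values()))[:n_levels]
    return {pair: d for pair, d in dist.items() if d in levels}


# Hypothetical local preference relations for three data sets.
relations = {
    "d1": relation_from_chain([["svm"], ["rf"], ["knn"]]),
    "d2": relation_from_chain([["svm", "rf"], ["knn"]]),
    "d3": relation_from_chain([["knn"], ["rf"], ["svm"]]),
}
dist = distance_matrix(relations)
visible = edges_at_levels(dist, n_levels=1)
print(dist, visible)
```

Restricting the drawn edges to the smallest distance levels keeps the complete graph readable while still showing which data sets have similar relations.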
An analysis of the domain based on formal statistical inference is enabled by taking the experiment design (two experimental factors, their interactions, and blocking factors at two levels) into account. We use linear mixed-effects models to estimate the parameters, where the algorithms are defined as fixed effects and all others as random effects. The estimated model allows the interpretation, conditional on the domain, of the algorithms' mean performances, the data sets' mean complexities, and how suitable an algorithm is for a data set. Furthermore, testing hypotheses of interest is possible as well. A global test of the most common hypothesis of "no difference between the algorithms" can be performed with an F-test; a pairwise comparison can be performed using Tukey contrasts. The definition of an area of equivalence makes it possible to incorporate practical relevance instead of statistical significance. Finally, the pairwise comparisons establish a preference relation of the candidate algorithms based on the domain-based benchmark experiment. The two domain-based benchmark experiments show that this domain-based relation conforms with the exploratory analysis and the approach of aggregating the individual local relations using consensus methods.

Computational Details
All computations and graphics have been done using R 2.11.1 (R Development Core Team, 2010) and additional add-on packages.
Setup and Execution: For the candidate algorithms of the two benchmark experiments the following functions and packages have been used: Function lda from package MASS for linear discriminant analysis. Function knn from package class for the k-nearest neighbor classifier. The hyperparameter k (the number of neighbors) has been determined between 1 and $\sqrt{n}$, with $n$ the number of observations, using 10-fold cross-validation (using the function tune.knn from package e1071, Dimitriadou et al., 2009). Function nnet from package nnet for fitting neural networks. The number of hidden units has been determined between 1 and $\log(n)$ using 10-fold cross-validation (using function tune.nnet from package e1071); each fit has been repeated 5 times. All three algorithms are described by Venables and Ripley (2002). Function rpart from package rpart (Therneau and Atkinson, 2009) for fitting classification trees. The 1-SE rule has been used to prune the trees. Functions naiveBayes and svm from package e1071 for fitting naive Bayes models and C-classification support vector machines. The two C-classification support vector machine hyperparameters γ (the kernel parameter) and C (the cost of constraints violation) have been determined using a grid search over the two-dimensional parameter space (γ, C) with γ from $2^{-5}$ to $2^{12}$ and C from $2^{-10}$ to $2^{5}$ (using function tune.svm from package e1071). And function randomForest from package randomForest (Liaw and Wiener, 2002) for fitting random forests.
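The hyperparameter search spaces described above can be written down as in the following sketch. The text only gives the end points of each range, so the grid step (one exponent at a time), the use of the natural logarithm, and the rounding down of $\sqrt{n}$ and $\log(n)$ are assumptions; the sample size n = 200 is hypothetical.

```python
import math


def power_of_two_grid(min_exp, max_exp, step=1):
    """Values 2^min_exp, ..., 2^max_exp (exponent step is an assumption)."""
    return [2.0 ** e for e in range(min_exp, max_exp + 1, step)]


def knn_candidates(n):
    """Candidate numbers of neighbors between 1 and sqrt(n), rounded down."""
    return list(range(1, math.isqrt(n) + 1))


def nnet_candidates(n):
    """Candidate numbers of hidden units between 1 and log(n), rounded down."""
    return list(range(1, math.floor(math.log(n)) + 1))


# Ranges for the support vector machine as stated in the text.
gamma_grid = power_of_two_grid(-5, 12)
cost_grid = power_of_two_grid(-10, 5)
svm_grid = [(gamma, cost) for gamma in gamma_grid for cost in cost_grid]

n = 200  # hypothetical number of observations
print(len(svm_grid), knn_candidates(n)[-1], nnet_candidates(n)[-1])
```

Every grid point would then be evaluated by cross-validation, and the best-performing setting would be used for the final fit on the learning sample.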

Figure 1: Trellis graphic with box plot panels. The plot shows the misclassification error of the UCI domain benchmark experiment described in Section 4.1.

Figure 3: Dendrogram showing the clustering of the UCI domain's data sets based on their candidate algorithms' performance relations.

Figure 4: Benchmark summary graph visualizing the relation of the UCI domain's data sets based on the distance matrix. The color of a node indicates the unique winner algorithm; nodes without a unique winner are unfilled. The dashed circle highlights a subset A with similar algorithm performances. The dashed line indicates two sub-domains: B, where svm performs best, and C, where rf and lda perform best.

Figure 5: The UCI domain's simultaneous 95% confidence intervals for multiple (a) significant and (b) relevant comparisons for a fitted linear mixed-effects model on the algorithms' misclassification errors.

Figure 8: (a) The Grasshopper domain's benchmark summary graph; the color of a node indicates the algorithm with the minimum median misclassification error. (b) The Grasshopper domain's simultaneous 95% confidence intervals for multiple significant comparisons for a fitted linear mixed-effects model on the algorithms' misclassification errors.