Monitoring Robust Estimates for Compositional Data

In a number of recent articles Riani, Cerioli, Atkinson and others advocate the technique of monitoring robust estimates computed over a range of key parameter values. Through this approach the diagnostic tools of choice can be tuned in such a way that highly robust estimators which are as efficient as possible are obtained. The approach is applicable to various robust multivariate estimators such as S- and MM-estimates, MVE and MCD, as well as to the Forward Search, in which monitoring is an integral part of the robust method. A key tool for the detection of multivariate outliers and for the monitoring of robust estimates is the Mahalanobis distance and statistics related to it. However, the results obtained with this tool in the case of compositional data might be unrealistic, since compositional data contain relative rather than absolute information and need to be transformed to the usual Euclidean geometry before standard statistical tools can be applied. Various transformations of compositional data have been introduced in the literature, and theoretical results exist on the equivalence of the additive, the centered, and the isometric logratio transformation in the context of outlier identification. To illustrate the problem of monitoring compositional data and to demonstrate the usefulness of monitoring in this case we start with a simple example and then analyze a real-life data set presenting the technological structure of manufactured exports. The analysis is conducted with the R package fsdaR, which makes the analytical and graphical tools provided in the MATLAB FSDA library available to R users.


Introduction
In many cases data sets consist of multivariate observations (vectors) containing the relative contributions of parts to a whole. Examples are the geochemical composition of rocks, household budget patterns, time budgets and ceramic compositions. A plethora of further examples can be found in Aitchison (1986, 2005) and the hundreds of papers published on this topic. One example, which was the motivation for this contribution, is in order. Since 2002 the United Nations Industrial Development Organization (UNIDO) has published the Competitive Industrial Performance (CIP) Index and an accompanying report (Todorov and Pedersen, 2017), see http://stat.unido.org/cip. Monitoring the industrial competitiveness of countries through this index reflects, to a great extent, how well they manage to adapt to new challenges and embrace new opportunities. The CIP Index is an essential tool for countries to view and compare their industrial competitiveness with that of others. It is composed of eight sub-indicators defined within the framework of three key dimensions that capture different aspects of a country's industrial competitive performance. One of these sub-indicators is the technological structure of manufactured exports, representing their "quality". There exists a well-established decomposition analysis of the export structure by technology level (Lall, 2000), presenting the manufactured exports in four categories: Resource-based, Low technology, Medium technology and High technology (for the data sources and how these categories are defined see Todorov and Pedersen, 2017). Figure 1 is an excerpt from one country profile based on the CIP edition 2018. The bar chart in the top right corner represents the structure of the manufactured exports of the country. The percentages shown sum up to 100. However, the country does not export only manufactured goods; agricultural, mineral and energy goods or services can also make up part of total exports.
This is shown by the four circles above the corresponding bars: the percentages shown in the circles are the shares of the respective manufacturing category in total exports. A vast number of works in the economics and development literature show that exports and their composition, particularly in terms of their technological intensity, are among the most important conditions for economic growth in a country or region (Crespo Cuaresma and Woerz (2005); Hausmann, Hwang, and Rodrik (2007); Raiher, Souza do Carmo, and Stege (2017)). Hausmann et al. (2007) show evidence that economic growth is influenced by the composition of the export agenda, and that more sophisticated exports (which they characterize as high productivity exports) are associated with higher economic growth rates. Crespo Cuaresma and Woerz (2005) investigate the hypothesis of qualitative differences between high and low tech exports with respect to output growth and show that the superior performance of high tech exports stems from their positive productivity differential to the domestic sector. However, to our knowledge, none of the studies on export structure treats the exports as compositional data, although the focus is on the structure of the exports and their absolute values are not relevant for the analysis. This motivating example will be analyzed in more detail in Section 4.2 as a first step in this direction, where we will address the following two questions: (1) are there outliers in the data, and (2) can we identify the presence of group patterns?
Compositional data have traditionally been defined as vectors of proportions, i.e. vectors with strictly positive components whose sum is a constant (Aitchison (1986); Aitchison (2005); Pawlowsky-Glahn, Egozcue, and Tolosana-Delgado (2008); Pawlowsky-Glahn and Egozcue (2006)). This definition has since been broadened, and nowadays it covers any set of multivariate observations with strictly positive components for which relative rather than absolute information is relevant to the analysis (Pawlowsky-Glahn, Egozcue, and Lovell (2015a); Pawlowsky-Glahn, Egozcue, and Tolosana-Delgado (2015b); Egozcue and Pawlowsky-Glahn (2019)). Due to the relative character of compositional data, the application of standard multivariate statistical methods, which mostly rely on Euclidean geometry, might lead to misleading results; it is therefore usual to apply a suitable transformation first. A simple log-transformation would often reduce data skewness but would not accommodate the compositional nature of the data. Aitchison (1986) suggested several transformations from the family of logratio transformations: the additive logratio (alr) and the centered logratio (clr) transformations, but these present certain inconveniences for multivariate analysis (Argote-Espino, Lopez-García, and Facevicova (2018); Filzmoser, Hron, and Reimann (2012)). These disadvantages have led to the development of the isometric logratio (ilr) transformation, which relates the geometry on the simplex directly to the Euclidean geometry (Egozcue, Pawlowsky-Glahn, Mateu-Figueras, and Barceló-Vidal, 2003).
After suitably transforming the compositional data we have a sample from a single population in which outliers are possibly present. To make the analysis insensitive to the influence of any outliers, robust methods are called for. Many robust estimators of location and covariance have been introduced in the literature. Perhaps the most popular is the Minimum Covariance Determinant (MCD), introduced by Rousseeuw (1985), for which a fast computational algorithm exists (Rousseeuw and Van Driessen, 1999). The Minimum Volume Ellipsoid (MVE) was also introduced by Rousseeuw (1985) and was even more popular than the MCD in the past. Nowadays it is recommended only as an initial estimator for S-estimation (Maronna, Martin, Yohai, and Salibian-Barrera, 2019). The multivariate S-estimates, which are a smooth version of the MVE estimator, were introduced by Davies (1987) and further studied by Lopuhaä (1989) (see also Rousseeuw and Leroy, 1987, p. 263). The MM-estimator (Tatsuoka and Tyler, 2000) is an extension of the S-estimator that has high efficiency under multivariate normality. The orthogonalized Gnanadesikan-Kettenring (OGK) estimator of Maronna and Zamar (2002) is much faster than the MCD, has a high breakdown point and can work even with data containing more variables than observations. If we perform a very robust analysis with a 50% breakdown point, it will be safeguarded against outliers; however, this results in unnecessarily low efficiency for clean data without any outliers. The usual approach to improve efficiency is to apply reweighting (for the MCD) or to use MM-estimates instead of S-estimates. The forward search (FS) and related methods provide a tool for adaptively choosing a suitable combination of breakdown point and efficiency by monitoring a series of fits to the data over a range of values of these quantities, i.e.
repeating the estimation process for different choices of the tuning parameters (Atkinson, Riani, and Cerioli (2004); Cerioli, Riani, Atkinson, and Corbellini (2018); Greco and Farcomeni (2015)). In the FS we start from a small subset of the data, and observations which are close to the fitted model are added sequentially to the observations used in parameter estimation. At each step of this process, while the subset grows, we monitor parameter estimates, test statistics and measures of fit such as the squared Mahalanobis distance (MD). The FS process is presented in monitoring plots which display the squared Mahalanobis distance of all observations at each step (i.e. for each intermediate, growing subset of observations). Thus, for each observation in the data set a trajectory of MDs is obtained, and abrupt (visual) changes in these trajectories indicate an abrupt change in the model (i.e. outliers are entering the model and distorting the estimates). In the case of S-estimates, in a similar manner we vary the breakdown point (bdp) from 0.5 (maximal breakdown point) to 0 (maximum likelihood estimation) by a step of, say, 0.01; for each value of bdp we estimate the location and covariance and with these estimates compute the Mahalanobis distance of each observation. Again these can be presented in a monitoring plot of squared Mahalanobis distances against bdp, which will highlight any abrupt change in the MD trajectories. Continuing with this analogy, in the case of MM-estimates we obtain monitoring plots of squared MDs against efficiency, which is varied from 0.5 to 0.99 by a suitable step. Such plots are presented in the examples, in Figure 8 and later.
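The monitoring loop itself is conceptually simple. The following sketch (Python/numpy, purely for illustration; the function name `monitor_trimmed_md` and the use of MCD-style concentration steps as a crude stand-in for S-estimation at a given breakdown point are our assumptions, not the FSDA implementation) builds a matrix of MD trajectories over a grid of trimming proportions:

```python
import numpy as np

def monitor_trimmed_md(X, alphas=None, n_steps=10):
    """Store every observation's Mahalanobis distance for a grid of
    trimming proportions alpha (a crude stand-in for monitoring the
    breakdown point of S-estimates).  For each alpha, location and
    scatter come from a few MCD-style concentration steps; no
    consistency correction is applied."""
    n, _ = X.shape
    if alphas is None:
        alphas = np.arange(0.50, 0.0, -0.05)   # bdp-like grid: 0.50 ... 0.05
    traj = np.empty((n, len(alphas)))
    for k, alpha in enumerate(alphas):
        h = int(np.ceil((1 - alpha) * n))      # size of the "clean" subset
        keep = np.arange(n)                    # start from all observations
        for _ in range(n_steps):
            mu = X[keep].mean(axis=0)
            Sigma = np.cov(X[keep], rowvar=False)
            diff = X - mu
            d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
            keep = np.argsort(d2)[:h]          # concentrate on the h closest
        traj[:, k] = np.sqrt(d2)
    return np.asarray(alphas), traj

# toy data: 54 clean bivariate points plus 6 clear outliers
rng = np.random.default_rng(7)
X = rng.normal(size=(60, 2))
X[:6] += 8.0
alphas, traj = monitor_trimmed_md(X)
```

Plotting each row of `traj` against `alphas` gives a rough analogue of the monitoring plots: the trajectories of the planted outliers stay large for high trimming and drop abruptly once the trimmed subset is large enough to absorb them.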
The rest of the paper is structured as follows. Section 2 discusses the need for robust methods and briefly presents the forward search method, the monitoring of various estimators and the available software. In Section 3 the specifics of robust methods in the case of compositional data are considered and illustrated with a simple example. Section 4 presents the monitoring of compositional data on two examples and Section 5 concludes.

Forward search and monitoring of robust estimates
In statistical modeling and estimation, assumptions such as normality or independence are commonly made; in practice, however, data sets often do not follow these strict assumptions. There might be several different mechanisms inherent in the data generating process, or other effects that cannot be controlled. It is then often unclear how reliable the results are if the model assumptions are violated. The multivariate nature of the data makes the task of outlier identification particularly challenging: the outliers can be completely hidden in one- or two-dimensional views of the data. This makes univariate outlier detection methods largely useless, although they are often favored by researchers because of their simplicity.
Outlier detection and robust estimation are closely related (see Hampel, Ronchetti, Rousseeuw, and Stahel, 1986; Hubert, Rousseeuw, and van Aelst, 2008) and the following two problems are essentially equivalent:

• Robust estimation: develop statistical techniques which are inherently insensitive to the presence of outliers and find an estimate which is not influenced by these outliers, even if their number is large (many good robust techniques can tolerate up to 50% contamination). The ability of the estimators to cope with a large number of outliers is measured by their breakdown point (bdp), which can reach a maximum of 50%. Estimators which can cope with this maximum amount of contamination in the sample are known as high breakdown point (HBDP) estimators; examples of popular HBDP estimators are the Minimum Covariance Determinant (MCD) estimator (Hubert, Debruyne, and Rousseeuw, 2017), S- and MM-estimators (Maronna et al., 2019) as well as the Forward Search estimator (Cerioli, Farcomeni, and Riani, 2014).

• Outlier detection: find all outliers which could distort the estimate.

A classical approach to detecting multivariate outliers in a given sample X = (x_1, x_2, . . . , x_n) of n observations in the D-dimensional real space R^D is to compute the Mahalanobis distance for each x_i:

MD(x_i) = [(x_i − x̄)^T S^{−1} (x_i − x̄)]^{1/2},    (1)

where x̄ and S are the sample mean and covariance matrix of the data set X. Outliers may be identified by large values of MD(x_i). Since for multivariate normally distributed data the squared Mahalanobis distances computed with the sample mean x̄ and covariance matrix S approximately follow a chi-square distribution with D degrees of freedom (Seber, 1984), the square root of a certain quantile β of this distribution can be taken as a cut-off value. The customarily adopted rule is to take β = 0.975, and points with Mahalanobis distances larger than the cut-off value sqrt(χ²_{D;0.975}) are flagged as potential outliers (Maronna and Zamar, 2002). Unfortunately this approach suffers from two problems: (a) masking: multiple outliers can distort the classical estimates of the mean x̄ and the covariance S in such a way (attracting x̄ and inflating S) that the outliers do not necessarily get large values of MD(x_i), and (b) swamping: multiple outliers can distort the classical estimates in such a way that observations which are consistent with the majority of the data get large values of MD(x_i). To cope with these undesirable effects it is necessary to base the diagnostic tools on high breakdown point methods and to replace the classical Mahalanobis distances by their robust counterparts

RD(x_i) = [(x_i − T)^T C^{−1} (x_i − T)]^{1/2},    (2)

where (T, C) is an HBDP robust estimator of multivariate location and scatter (Rousseeuw and van Zomeren (1990); Maronna et al. (2019)).
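The classical detection rule described above can be sketched in a few lines (Python/numpy, purely for illustration; the χ²_{2;0.975} quantile 7.3778 is hardcoded to keep the sketch dependency-free, and the planted outliers are our toy example):

```python
import numpy as np

def mahalanobis(X):
    """Classical Mahalanobis distances computed from the sample mean
    and covariance matrix of X."""
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    diff = X - xbar
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff))

# bivariate toy data with five planted outliers
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
X[:5] += 6.0
# chi^2_{2;0.975} = 7.3778, hardcoded to avoid an external dependency
cutoff = np.sqrt(7.3778)
flagged = np.flatnonzero(mahalanobis(X) > cutoff)
```

With only five distant outliers among 100 observations the classical rule still succeeds; masking and swamping become visible with heavier or more cleverly placed contamination, which is exactly when the robust distances of Equation (2) are needed.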
A solution to the first problem allows us to identify the outliers using their robust distances while, on the other hand, if we knew the outliers we could remove or downweight them and then use the classical estimation methods. In many research areas the first approach is the preferred one, but often the second is more appropriate.
The exact distribution of the robust distances in Equation (2) is not known for finite sample sizes, but the popular approach is to compare them to the same threshold sqrt(χ²_{D;0.975}) used for the classical Mahalanobis distances given in Equation (1) (Rousseeuw and Van Driessen (1999); Pison, Van Aelst, and Willems (2002)). Apart from the arbitrary choice of the 0.975 quantile, the χ² approximation is known to flag too many observations as outliers in data sets of small to moderate size. We refer here to several alternatives to this approach for outlier identification. Hardin and Rocke (2005) suggested an improved approximation based on the F-distribution. Their results are based on the robust distances computed with the MCD estimator and do not consider the supposedly more efficient reweighted alternatives. A more advanced approach for choosing the cut-off value was proposed by Filzmoser, Garrett, and Reimann (2005), which accounts for the actual numbers of observations and variables in the data set and also tries to distinguish between extremes of the data distribution and outliers coming from a different distribution. Based on this approach, Filzmoser et al. (2012) proposed graphical tools for the interpretation of multivariate outliers. Cerioli et al. (2009) proposed to calibrate the asymptotic cut-off values by Monte Carlo simulation: calibration is first performed for some selected values of n and D and then extended to any n and D by parametric non-linear interpolation. Cerioli (2010) introduced an accurate approximation to the distribution of one-step reweighted robust distances, based on a scaled Beta distribution for the observations not suspected of being outliers and on a scaled F-distribution for the units which are trimmed in the reweighting step.
When applying robust methods we want them to behave as similarly as possible to the classical methods in the case of clean data; a first requirement of robust estimators is therefore high efficiency at the specified model distribution. The loss of efficiency can be evaluated by the relative efficiency, the ratio of the variance of the classical estimator to that of the robust one. The second most important requirement for robust estimates is a high breakdown point: robust estimators need to resist a large amount of contamination before they become useless. Another important and desirable feature of statistical estimates is affine equivariance, which guarantees that the estimate has a predictable behavior if the data are subjected to an affine transformation. For an n × D data matrix X, a location estimate T(X) and a scatter estimate S(X) are called affine equivariant if, for any D × 1 vector a and any nonsingular D × D matrix B,

T(XB + 1_n a^T) = T(X)B + a^T and S(XB + 1_n a^T) = B^T S(X)B,

where 1_n is the vector (1, . . . , 1)^T with n elements. If a Mahalanobis distance-like measure is computed using an affine equivariant location vector T and covariance matrix C, the resulting distances are invariant under affine transformations of the data. Most of the well-known multivariate location and scatter estimators such as MCD, MVE, S and MM are affine equivariant, while OGK is not.
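The practical consequence of affine equivariance is easy to verify numerically with the classical, affine equivariant sample estimates (an illustrative Python/numpy check, with an arbitrary nonsingular B and shift a of our choosing):

```python
import numpy as np

def md2(X):
    """Squared Mahalanobis distances from the sample mean and covariance."""
    diff = X - X.mean(axis=0)
    return np.einsum('ij,jk,ik->i', diff,
                     np.linalg.inv(np.cov(X, rowvar=False)), diff)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
B = np.array([[2., 1., 0.],
              [0., 1., 3.],
              [1., 0., 1.]])          # an arbitrary nonsingular D x D matrix
a = np.array([5., -2., 7.])           # an arbitrary shift vector
Y = X @ B + a                         # affine image of the data
# sample mean/cov are affine equivariant, so the distances coincide
assert np.allclose(md2(X), md2(Y))
```

The same invariance holds for distances built from any affine equivariant robust pair (T, C), which is why the ranking of observations by outlyingness does not depend on the coordinate system chosen.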
In our study we assume that we have a single multivariate population possibly containing outliers. The straightforward approach is to use an estimator with 50% breakdown point, such as the MCD or an S-estimator; however, in the case of clean data the results will have very low efficiency. A first remedy is to use reweighting (for the MCD) or MM-estimates instead of S-estimates. For the MCD the breakdown point and the efficiency are directly connected through the parameter h, n/2 ≤ h < n, the number of observations on which the estimator is based (Maronna et al., 2019): the larger h, the lower the breakdown point and the higher the efficiency. For S-estimates the results of Riani, Cerioli, and Torti (2014b) show an asymptotic relationship between the breakdown point and the efficiency: as one increases, the other decreases. A solution to this dilemma should be provided by MM-estimation, introduced by Yohai (1987) as a two-step extension of S-estimation. In the first step the breakdown point of the scale estimator is fixed at its maximum of 0.5, and in the second step the already obtained highly robust estimate is used to compute a new estimate of μ and Σ, for which the tuning constant of the ρ function can be chosen to provide high efficiency.
An alternative approach is the use of adaptive methods based on monitoring a series of fits to the data that indicate good choices of efficiency or breakdown point (Cerioli et al., 2018; Riani, Atkinson, Cerioli, and Corbellini, 2019). The forward search for multivariate analysis is an algorithm for avoiding outliers by recursively constructing subsets of "good" observations, thus providing an automatic form of monitoring. A formal description of the forward search method can be found in many papers, see for example Riani et al. (2019). We start by choosing (by some robust criterion) a small subset of observations of size m_0 (usually m_0 = D + 1 or slightly larger) and then repeatedly extend it in such a way that outliers and other influential observations enter only toward the end of the search, arriving at a final fit that corresponds to the classical statistical estimates. During this process we monitor a suitable diagnostic measure, and the inclusion of outliers is typically signalled by a sharp increase in this measure. Denote by S^(m) the subset used by the forward search at step m, with m = m_0, . . . , n. At step m the outlyingness of an observation x_i is evaluated by its squared Mahalanobis distance computed by Equation 1 with the current location and covariance estimates μ̂ and Σ̂ instead of x̄ and S, where μ̂ and Σ̂ are the sample estimates computed from the observations in S^(m). The squared distances d²_1(m), . . . , d²_n(m) are ordered and the observations with the m + 1 smallest distances form the subset S^(m+1) for the next step. As long as S^(m) does not contain any outlier, neither masking nor swamping will affect the distances d²_i(m) and they can be taken as robust estimates of the population Mahalanobis distances. To detect outliers at step m we look at the minimum Mahalanobis distance among the observations not in the subset S^(m): d_min(m) = min{d_i(m) : i ∉ S^(m)}.
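A minimal sketch of this search (Python/numpy, purely illustrative: it uses unscaled distances, a crude median-based starting subset of our choosing, and none of the scaling or envelope machinery of FSDA):

```python
import numpy as np

def forward_search(X, m0):
    """Minimal forward search sketch: grow the subset one observation at
    a time and record the minimum Mahalanobis distance among the
    observations outside the subset."""
    n, _ = X.shape
    # crude robust start: the m0 points closest to the coordinatewise median
    start = np.argsort(np.abs(X - np.median(X, axis=0)).sum(axis=1))[:m0]
    subset = np.zeros(n, dtype=bool)
    subset[start] = True
    d_min = []
    for m in range(m0, n):
        mu = X[subset].mean(axis=0)
        Sigma = np.cov(X[subset], rowvar=False)
        diff = X - mu
        d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
        d_min.append(np.sqrt(d2[~subset].min()))   # the "signal" statistic
        subset = np.zeros(n, dtype=bool)
        subset[np.argsort(d2)[:m + 1]] = True      # next subset S^(m+1)
    return np.array(d_min)

# toy data: 36 clean points and 4 outliers that join at the end of the search
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))
X[:4] += 8.0
traj = forward_search(X, m0=5)
```

Plotting `traj` against the subset size m reproduces the qualitative behavior described above: the trajectory stays moderate while only clean observations are in the subset and peaks sharply when the nearest observation outside the subset is an outlier.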
To conduct the corresponding tests a reference distribution for d_i(m) and d_min(m) is required. Since Σ̂ is estimated from a subset of all observations, the variability is underestimated and a scaling factor c(m, n) is used in order to obtain an asymptotically unbiased estimate of Σ (Cerioli et al., 2018). The scaled Mahalanobis distances are obtained by replacing Σ̂(m) by c(m, n)Σ̂(m) in Equation 1. The distributional results derived in Riani et al. (2009) lead to the distribution of d_min(m) for a given m and allow us to create forward plots of this quantity during the search and to compare them with the envelopes (confidence bands) formed by the forward plots of several quantiles. If at some step m* during the search the observation nearest to the observations already in the subset appears to be an outlier according to an appropriate envelope of the distribution of the test statistic, this event is called a "signal". If a "signal" occurs at step m*, this means that observation m* and all observations following it may be outliers. To precisely identify the outliers after a "signal" has occurred at step m*, envelopes for a series of smaller sample sizes (starting from m* − 1 onwards until an outlier is recognized) are superimposed. This process is automatic and all details are given in Riani et al. (2009). In Section 4.2 these steps will be illustrated in the case of compositional data.
These underlying ideas can be extended to many other techniques such as S- and MM-estimates. The resulting estimates are presented in monitoring plots of all n squared Mahalanobis distances, which can be combined with brushing to relate the Mahalanobis distances to the data points exhibited in the scatterplot matrix. In this way a direct relationship between statistical results and individual observations is established.
From a practical standpoint, the availability of such tools and their software implementation is essential for their applicability to a wide range of data analysis problems. All the methods discussed in the papers on the forward search and monitoring are implemented in the Flexible Statistics and Data Analysis (FSDA) toolbox (Riani, Perrotta, and Torti, 2012), freely available to users with a MATLAB license from http://rosa.unipr.it. It features robust and efficient statistical analysis of data sets, not only in the multivariate context but also for regression and cluster analysis problems. An R package interfacing the FSDA toolbox, fsdaR, is available on CRAN (Todorov and Sordini, 2020). Todorov (2018) presents a brief study of the computational efficiency of the different monitoring methods and notes that this is still an open issue: further work is necessary to make the monitoring of S-estimation over consecutive values of the breakdown point efficient for large data sets.

Compositional data and robust methods
Compositional data were traditionally defined as multivariate data with positive values that sum up to a constant (1, 100 percent or any other constant), i.e. constrained data (Aitchison, 1986). Nowadays this definition is generalized, in more practical terms, to any set of multivariate observations with strictly positive components where relative rather than absolute information is relevant for the analysis (Pawlowsky-Glahn et al. (2015a,b); Egozcue and Pawlowsky-Glahn (2019)). Mathematically, a D-part compositional data set is a matrix X of n compositional vectors lying in the simplex

S^D = {x = (x_1, . . . , x_D) : x_i > 0, Σ_{i=1}^D x_i = κ},

where κ > 0 is an arbitrary constant. This vector space is characterized by its own geometry, called the Aitchison geometry (Egozcue, Barceló-Vidal, Martín-Fernández, Jarauta-Bragulat, Díaz-Barrero, and Mateu-Figueras, 2011). Each compositional vector x can be rescaled by a constant c > 0, and the compositions x and y = cx are then compositionally equivalent. This rescaling can be defined formally by the so-called closure operator given by

C(x) = (κx_1 / Σ_{i=1}^D x_i, . . . , κx_D / Σ_{i=1}^D x_i).

Even if the data are not subject to a constant sum constraint, standard statistical methods could lead to doubtful results if directly applied to the original data (Filzmoser, Hron, and Templ, 2018). A common practice for analyzing compositions is to first project them onto the unconstrained real space by expressing them as logarithms of ratios between parts. Several types of transformations have been proposed in the literature, and the choice of the most appropriate one depends on the data at hand and on the type of envisaged statistical analysis. Standard statistical methods can then be applied to the transformed data and the results back-transformed to the original space. The additive logratio transformation (alr) introduced by Aitchison (1986) maps a D-part composition in the simplex non-isometrically to a (D − 1)-dimensional vector in R^{D−1}, treating one part (usually the last one) as a common denominator of the others.
While very easy to interpret, the disadvantage of this transformation is that distances are not preserved, because the corresponding basis on the simplex is not orthonormal with respect to the Aitchison geometry (Pawlowsky-Glahn et al., 2008). Another transformation, also proposed by Aitchison (1986), is the centered logratio (clr) transformation, which maps a D-part composition from the simplex isometrically to R^D, using the geometric mean as a common denominator. While very useful in terms of interpretation, this transformation produces observations whose components sum up to zero, so that the resulting data matrix does not have full rank. The covariance matrix of the transformed data is singular, which precludes the application of most multivariate robust statistical methods. The desired orthonormality is fulfilled by the isometric logratio (ilr) transformations from S^D to R^{D−1}, which for a given orthonormal basis can be defined as (Egozcue et al., 2003), here for one standard choice of basis,

z = (z_1, . . . , z_{D−1}), z_i = sqrt(i/(i+1)) ln((x_1 · · · x_i)^{1/i} / x_{i+1}), i = 1, . . . , D − 1.

A key tool for the detection of multivariate outliers, and for the monitoring approach described in the previous section, are the Mahalanobis distances (see Equations 1 and 2) and statistics related to them. However, the results obtained with this tool in the case of compositional data might be unrealistic. If a transformation is applied, it is not immediately clear how it affects the Mahalanobis distances used for ranking the data points according to their outlyingness. Filzmoser and Hron (2008) considered the three well-known transformations above and showed how the additive, the centered and the isometric logratio transformations affect the Mahalanobis distances computed by classical and robust methods. They show that in the case of the classical location and covariance estimators all three transformations lead to the same Mahalanobis distances; however, only alr and ilr extend this property to any affine equivariant estimator. The former (alr) is simpler and could be used in the context of outlier detection.
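The closure operator and the three logratio transformations are straightforward to implement. The following Python/numpy sketch is purely for illustration (the paper's own analyses use R and MATLAB tools); the ilr shown uses the pivot-coordinate basis, one standard choice among several conventions:

```python
import numpy as np

def closure(x, kappa=1.0):
    """Closure operator: rescale a composition to sum to kappa."""
    x = np.asarray(x, dtype=float)
    return kappa * x / x.sum()

def alr(x):
    """Additive logratio: the last part is the common denominator."""
    x = np.asarray(x, dtype=float)
    return np.log(x[:-1] / x[-1])

def clr(x):
    """Centered logratio: the geometric mean is the denominator;
    the resulting components sum to zero."""
    x = np.asarray(x, dtype=float)
    return np.log(x) - np.mean(np.log(x))

def ilr(x):
    """Isometric logratio in pivot coordinates (one standard basis)."""
    x = np.asarray(x, dtype=float)
    D = x.size
    z = np.empty(D - 1)
    for i in range(1, D):
        g = np.exp(np.mean(np.log(x[:i])))     # geometric mean of x_1..x_i
        z[i - 1] = np.sqrt(i / (i + 1.0)) * np.log(g / x[i])
    return z
```

All three transformations are invariant to rescaling by closure; the clr coordinates sum to zero (which is exactly why the clr-transformed data matrix is rank-deficient), while the ilr vector has the same Euclidean norm as the clr vector, reflecting the isometry.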
In many of the examples in the next sections alr produces the same results as ilr. However, the use of alr is not recommended because it does not result in an orthogonal basis system, which might be necessary for diagnostic tools following outlier detection or for any further analysis involving distances (Filzmoser et al., 2012). If a multivariate normal distribution on the simplex is assumed, i.e., the orthonormal coordinates are normally distributed (Mateu-Figueras and Pawlowsky-Glahn, 2008), the distribution of the (classical) squared Mahalanobis distances can be approximated by a chi-square distribution with D degrees of freedom, as discussed in Section 2, and the same distribution might be used for the robust distances.
To illustrate the problem of applying robust methods to compositional data we start with a simple example based on the data set Vegetables. The data set, available in the R package easyCODA (Greenacre, 2019) contains the compositions of protein, carbohydrate and fat as a percentage of their respective totals for ten different vegetables and is shown in Table 1.
Based on different types of logratios, Greenacre (2019, p. 22, 31) shows that the vegetables fall into two groups: {Potatoes, Onions, Peas, Asparagus, Spinach, Broccoli} and {Beans (soya), Carrots, Corn, Mushrooms}. This is visible in a number of plots and can be confirmed by formal tests. As pointed out by Filzmoser and Hron (2008) for other compositional data sets, we cannot apply standard outlier detection based on Mahalanobis distances, neither classical nor robust, directly to this data set, because the covariance matrix estimated from the original data is singular due to the constant sum constraint. Applying the outlier detection methods from the R package rrcov (Todorov and Filzmoser, 2009) as well as the methods from the MATLAB toolbox FSDA (Riani et al., 2012) results in an error. After applying the ilr transformation to the data, the bivariate structure is revealed, as shown in the distance-distance plot (Rousseeuw and Van Driessen, 1999) in Figure 2, where the robust Mahalanobis distances computed by the MCD are plotted against the classical Mahalanobis distances. Four observations, namely mushrooms, carrots, corn and beans, which form the second group, are identified as potential outliers (using the cutoff sqrt(χ²_{2;0.975}) = 2.7162). The classical Mahalanobis distances, computed with the sample mean vector and covariance matrix, do not identify any outliers.
As already described in Section 2, the MCD is not the only robust estimator which can be used for outlier detection. Other estimators can be applied to data where the number of variables is larger than the number of observations and thus the covariance matrix is singular, for example the Orthogonalized Gnanadesikan-Kettenring (OGK) estimator (Maronna and Zamar (2002); Todorov and Filzmoser (2009)). We can apply this estimator directly to the original data and obtain a robust estimate of the covariance matrix; however, this will not solve the outlier detection problem, because the estimated robust covariance will again be singular and cannot be used for computing Mahalanobis distances.
Another approach to outlier detection, especially when the number of variables is larger than the number of observations (and the covariance matrix of the data is singular), is to use robust principal component analysis (PCA). In the literature on robust compositional data analysis, PCA based on the MCD is usually applied (Filzmoser et al. (2018)), but nothing prevents us from using ROBPCA (Hubert, Rousseeuw, and Vanden Branden (2005); Todorov and Filzmoser (2009)), which combines ideas from projection pursuit and MCD estimation and thus can work on data with a singular covariance matrix. It should be noted that a disadvantage of PCA for outlier detection is that it is not affine equivariant, only rotation equivariant. Robust PCA on the original data is shown in the right-hand panel of Figure 2. It presents the so-called orthogonal distances against the score distances computed from the PCA (see Hubert et al. (2005) for the definition of these distances and the cut-off values represented by the horizontal and vertical lines in the plot). Although ROBPCA works on data with a singular covariance matrix, and we could carry out the analysis on the original data in spite of the constant sum constraint, the identified outliers are not the correct ones: ROBPCA identifies four vegetables as outliers, but these observations actually belong to the larger group. It should be noted, however, that if ROBPCA is carried out on the ilr-transformed data, it correctly identifies the four outliers.
One could be tempted to look at each variable separately and apply robust univariate methods for outlier detection. However, compositional data can never be treated as truly univariate, because when we look at a single variable we observe not absolute values but the relative information of this variable with respect to all the others in the composition. Therefore we speak not of variables but of parts of the composition. A discussion of the problems and possibilities of univariate statistical analysis of compositional data can be found in Filzmoser, Hron, and Reimann (2009).
Since the original data in this example are three-dimensional, they can conveniently be presented in a ternary diagram (right-hand panel of Figure 3). To better visualize the multivariate data structure we superimpose 0.975 tolerance ellipses of the Mahalanobis distances computed from the sample mean and covariance (blue) and from the MCD (red), respectively. The ellipses are back-transformed to the original space using the inverse ilr transformation, as proposed in Filzmoser and Hron (2008). The ellipse corresponding to the classical estimates (blue) covers all data points, while the robust one (red), based on MCD, excludes the four points identified as potential outliers.
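For reference, the closure operator, the ilr transformation and its inverse (used above to back-transform the tolerance ellipses) can be sketched as follows. This is a minimal NumPy illustration under one common choice of orthonormal basis; the paper itself works in R, and the function names here are ours:

```python
import numpy as np

def closure(x):
    """Rescale each composition so its parts sum to 1 (the closure operator)."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    return x / x.sum(axis=1, keepdims=True)

def ilr_basis(D):
    """Orthonormal (D, D-1) basis of the clr hyperplane (Helmert-type contrasts)."""
    V = np.zeros((D, D - 1))
    for j in range(1, D):
        V[:j, j - 1] = 1.0 / j
        V[j, j - 1] = -1.0
        V[:, j - 1] *= np.sqrt(j / (j + 1.0))
    return V

def ilr(x):
    """Isometric logratio transform: D-part compositions -> R^(D-1)."""
    x = closure(x)
    # centered logratio (clr), then projection onto the orthonormal basis
    clr = np.log(x) - np.log(x).mean(axis=1, keepdims=True)
    return clr @ ilr_basis(x.shape[1])

def ilr_inv(z):
    """Back-transform ilr coordinates to the simplex (inverse ilr)."""
    z = np.atleast_2d(np.asarray(z, dtype=float))
    V = ilr_basis(z.shape[1] + 1)
    return closure(np.exp(z @ V.T))
```

The transformation is one-to-one, so any object computed in ilr coordinates, such as a tolerance ellipse, can be mapped back to the simplex with `ilr_inv` and drawn in the ternary diagram.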

Monitoring for compositional data
Now we will illustrate the methods and ideas presented in Sections 2 and 3 on two extensive examples. Neither data set has previously been analyzed in the literature in the context of outlier detection, and we have no prior information about the presence of outliers. Therefore we start with the standard outlier detection methods in the R package rrcov, based on MCD and robust Mahalanobis distances. Since neither data set is subject to a constant sum constraint, the MCD (as well as other robust estimators) can be computed on the original data, but the results might be doubtful, as often pointed out in the literature. Therefore we first suitably transform the data using the ilr transformation. After demonstrating outlier detection with MCD on the ilr-transformed data, we continue with S- and MM-estimation, as well as the Forward Search, and the corresponding monitoring functions from the R package fsdaR. Finally, we demonstrate the brushing and linking functionalities for establishing a direct relationship between the statistical results obtained and the individual observations.

Example 1: Fish morphology data set
For our first example illustrating the problem of monitoring compositional data we use the FishMorphology data set from the package easyCODA (Greenacre, 2019; see also Greenacre and Primicerio, 2010). The data set consists of 26 morphometric measurements, in millimeters, on a sample of 75 fish of the species Arctic charr (Salvelinus alpinus). Additionally, for each observation, sex (male or female), habitat (littoral, close to the shore, or pelagic, in deeper water far from the shore) and body mass are recorded. For the sake of our example we select only the littoral habitat (59 observations) and take the first 10 (out of the 26) morphometric measures, because we want a single population of moderate size. A rule of thumb for the methods we are considering is that there should be at least five observations per dimension. Of course we could choose any other set of variables, and then the results might differ from those shown here, but the principle of the illustration would not change.
Thus we remain with a data set of 59 observations on the following 10 variables: {Bg, Bd, Bcw, Jw, Jl, Bp, Bac, Bch, Fc, Fdw}. It should be noted that this data set contains one "obvious" outlier, the observation with ID = 51, which in our data set is number 16; it can be identified with any outlier detection method and will appear in all graphs shown below. The data set is not subject to a constant sum constraint, but it is nevertheless compositional in nature according to the more general definition (see Section 3). Greenacre (2019), the source of our data set, first closed the data before starting the analysis by applying the closure operator (Equation 5), but for our example this is not necessary.
Compositional data are by definition multivariate data, therefore it does not make sense to apply univariate outlier detection methods to the original data. We illustrate this with the boxplot of the original data set shown in the left-hand panel of Figure 4. A single outlier is visible, in variable Fdw, and this is observation #5. Univariate methods for outlier detection in compositional data can be used only on log-ratios or on coordinates, after transforming the data with a suitable transformation (Filzmoser et al., 2018, p. 94).
Since the data are not subject to a constant sum constraint, as in the example in the previous section, outlier detection methods based on robust estimates of location and covariance can be applied. A plot of the robust Mahalanobis distance based on MCD versus the classical Mahalanobis distance is shown in the middle panel of Figure 4; it identifies six outliers. The outlier detection using robust PCA is shown in the right-hand panel of Figure 4; only a single outlier is detected, observation #16. Finally, using OGK for outlier detection (not shown here) identifies eight observations as potential outliers. The drastic differences among these three results are already cause for concern. As often mentioned in the literature, even when standard statistical analysis methods can be computed on the original compositional data (when the required covariance matrices are not singular), the results can be doubtful.
After applying the ilr transformation, the multivariate structure is revealed, as shown in the distance-distance plot in the left-hand panel of Figure 5. Observations 20 and 55 are flagged as potential outliers by the robust distances, while observation 16 is flagged by both the classical and the robust distances (using the square root of the 0.975 quantile of the χ²₉ distribution as cut-off in both cases).

[Figure 5: Fish Morphology data set (littoral habitat), ilr-transformed. Left-hand panel: robust Mahalanobis distance based on MCD versus classical Mahalanobis distance; the horizontal and vertical lines are at the cut-off used (square root of the 0.975 quantile of the χ²₉ distribution). Right-hand panel: classical and robust χ² QQ-plots.]

The right-hand panel of Figure 5 shows the robust and classical χ² QQ-plots, which present the squared (robust) Mahalanobis distances against the quantiles of the χ²₉ distribution. The χ² QQ-plot is useful for visualizing the deviation of the data distribution from multivariate normality in the tails and is often used for outlier detection. For example, Garrett (1989) and later Filzmoser et al. (2005) use these ideas to define adaptive outlier detection methods. In our case the result is identical to the plot shown in the left-hand panel: observation 16 is identified by both the classical and the robust distances, while 20 and 55 are visible only with the robust method. Apart from these observations, all the rest lie more or less on the diagonal line, confirming the χ² approximation. Since the original data are in more than three dimensions, they cannot be conveniently presented in a ternary diagram as the Vegetables data set was in the right-hand panel of Figure 3.
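The ingredients of such a distance-distance plot can be sketched as follows (a Python illustration using scikit-learn's MinCovDet as the MCD implementation; the function name and defaults are ours, not the fsdaR API):

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import EmpiricalCovariance, MinCovDet

def dd_plot_data(Z, alpha=0.975, random_state=0):
    """Classical and MCD-based robust Mahalanobis distances, with the
    usual chi-square cut-off, as used in a distance-distance plot.

    Z is assumed to hold ilr coordinates, shape (n, D-1).
    Note: scikit-learn's .mahalanobis() returns *squared* distances.
    """
    Z = np.asarray(Z, dtype=float)
    d_classic = np.sqrt(EmpiricalCovariance().fit(Z).mahalanobis(Z))
    d_robust = np.sqrt(MinCovDet(random_state=random_state).fit(Z).mahalanobis(Z))
    # cut-off: square root of the alpha-quantile of chi2 with D-1 d.o.f.
    cutoff = np.sqrt(chi2.ppf(alpha, df=Z.shape[1]))
    return d_classic, d_robust, d_robust > cutoff
```

Plotting `d_robust` against `d_classic` with horizontal and vertical lines at `cutoff` reproduces the layout of the left-hand panel of Figure 5; points above the horizontal line but not to the right of the vertical one are the outliers masked in the classical analysis.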
Computing the S-estimates of the multivariate location and covariance matrix of the ilr-transformed data with 50% breakdown point and Tukey's biweight function (Fig. 6, left panel) produces a similar result to the reweighted MCD, identifying the three outliers; however, the S-estimates with a reduced breakdown point (in the hope of obtaining better efficiency) do not identify any outliers (right panel in Fig. 6). Similarly, computing MM-estimates with 80% efficiency and Tukey's biweight function (Fig. 7, left panel) produces a similar result to the reweighted MCD, but the MM-estimates with 95% efficiency (which is often advocated and is the default value in most software packages) do not identify any outliers except observation 16, which, as already noted, is an obvious outlier identified by any method (right panel in Fig. 7). As Cerioli et al. (2018) point out, the recommended default efficiency of 95% or 99% for the MM-estimates might be too optimistic, as is also the case here. Following their approach for a data-driven balance between robustness and efficiency, which we apply to the case of compositional data, we present in the following the monitoring of the estimation parameters (breakdown point and efficiency), resulting in plots of the squared Mahalanobis distances of the ilr-transformed data. Figure 8 presents the monitoring of the MM-estimation. A series of robust MM fits is conducted by varying the efficiency from 0.5 to 0.99 in steps of 0.01 (this is the default setup, but other configurations can be selected), and for each of them the robust distances (2) of all observations in the data set are computed. The estimated location and covariance matrix, the distances and other related statistics are stored at each step of the process. The left-hand panel shows the squared Mahalanobis distances plotted against the values of the efficiency at each step. Each line running horizontally from 0.5 to 0.99 represents the trajectory of one observation.
The color of the lines varies from light blue to dark blue, becoming darker as the maxima of the individual trajectories increase. Thus the eye is drawn to the behaviour of the most outlying units, represented by a darker color. The red horizontal lines show the 0.975 and 0.99 quantiles of the χ² distribution with 9 degrees of freedom. The trajectories are stable until the efficiency reaches 0.54, when the fit changes abruptly; it remains so until 0.77, when it changes again, becomes similar to the maximum likelihood fit and remains stable until the efficiency reaches 0.99. This reveals why the index plot of the MM-estimates in the right-hand panel of Figure 7 did not show any outliers: for efficiency higher than 0.77 the fit is identical to the maximum likelihood one. This is also clearly seen from the correlation monitoring in the right-hand panel of Figure 8, which summarizes the structure of the plot in the left-hand panel by the correlation between the squared Mahalanobis distances at adjacent monitoring values. The three standard measures of correlation are:

1. Spearman: the correlation between the ranks of the two sets of observations.
2. Kendall: the correlation based on the concordance of pairs of observations.
3. Pearson: the product-moment correlation coefficient.
All three correlations indicate the abrupt change at 54% and then a further change at 77%. As we have already seen in Figure 7, for an efficiency of 80% the analysis is still robust, but increasing the efficiency to 95% or more results in a non-robust analysis.
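The monitoring idea itself is generic: refit the estimator over a grid of a tuning parameter, store the squared distances of all observations at each step, and correlate adjacent fits to detect abrupt changes. Since MM-estimation is not readily available in Python's scientific stack, the sketch below monitors the MCD over a grid of support fractions (playing the role that the efficiency plays for the MM-estimates in the text); the Spearman correlations between adjacent columns mirror the correlation monitoring plot. Function name and grid are ours:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.covariance import MinCovDet

def monitor_mcd(Z, fractions=np.linspace(0.6, 0.95, 8), random_state=0):
    """Monitoring sketch: squared robust distances over a grid of MCD
    support fractions, plus Spearman correlations between adjacent fits.

    Returns
    -------
    D2  : (n, n_steps) array, one distance trajectory per observation
    rho : (n_steps - 1,) Spearman correlations of adjacent columns
    """
    Z = np.asarray(Z, dtype=float)
    D2 = np.column_stack([
        MinCovDet(support_fraction=h, random_state=random_state)
        .fit(Z).mahalanobis(Z)                 # squared robust distances
        for h in fractions
    ])
    rho = np.array([spearmanr(D2[:, k], D2[:, k + 1])[0]
                    for k in range(D2.shape[1] - 1)])
    return D2, rho
```

A sudden drop in `rho` at some grid value flags the point where the fit switches regime, exactly the pattern the right-hand panel of Figure 8 exposes for the MM efficiency.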
Using the brushing functionality of the package, we can identify the outlying units, as shown in Figure 9: in the right-hand panel the outliers are shown as red circles. A more advanced version of the brushing function allows the brushing to be done in several steps, selecting different points in each. The points selected at each step are added to those selected in the previous steps but are presented with a different pattern/color. In the first step we can select the outliers which affect the model results before 90% efficiency; they will be shown as red circles. In the second step the "obvious" outlier, which can be identified even by the maximum likelihood estimates (MLE), can be selected; it will be shown as a light blue star.
Computing S-estimates with a 50% (asymptotic) breakdown point and Tukey's biweight function produces a similar result to the reweighted MCD; however, this is not the case if the breakdown point is reduced to, say, 25%. As we have already seen in Figure 6, for a breakdown point of 50% the analysis is robust, but reducing the breakdown point to 25% (in the hope of increasing the efficiency) yields a result identical to the maximum likelihood analysis. That is, with a bdp of 50% three outliers are detected, and when we reduce the bdp we expect that these three outliers will still be identified. However, this does not happen: by reducing the bdp we have destroyed the robustness of the estimator. This problem of S-estimates, namely that with increasing dimension, despite the high breakdown point, the ability to detect outliers decreases, was pointed out already by Rocke (1996); see also Maronna et al. (2019, p. 190). The forward search methodology and its extension to S-estimates provides a solution: in this case it tells us that we cannot reduce the breakdown point without losing the robustness of the estimator. Alternatively, to achieve higher efficiency we should use MM-estimates, tuning their efficiency as shown above in our example.

Example 2: Technology intensity of exports
We turn now to the motivating example about the technological structure of manufactured exports presented in the introduction. The data set is available in the R package rrcov3way (Todorov, Simonacci, Di Palma, and Gallo, 2020). For our example we select only one year, 2017, and remove any country with missing data, remaining with 129 observations. Needless to say, applying the outlier detection methods from the R package rrcov or the methods from the MATLAB toolbox FSDA to the original data is meaningless: the reweighted MCD, for example, identifies 63 outliers out of 129 observations (49%). The data set is not subject to a constant sum constraint and the covariance matrix is not singular; thus a robust covariance matrix (MCD, S or MM) can be computed. However, it is ill-conditioned, and the outlier detection results in flagging almost half of the data as outliers. After applying the ilr transformation the structure is revealed, as shown in the distance-distance plot in Figure 10. Now 18 observations are identified as outliers by the reweighted MCD estimator. This is definitely a compositional data set (the four categories are parts of one whole), but the closure is not visible when inspecting the row sums. This is because we consider only the manufactured exports, while the countries also export agricultural, mining and other products. This demonstrates the problem of so-called subcompositions (Aitchison, 1986): we cannot hope that the effect of the closure will disappear if not all parts are included in the analysis, and an appropriate transformation is needed.
The original data in this example are four-dimensional, and if we want to present them in a ternary diagram we need to drop one of the four dimensions. We have two options: (i) remove the resource-based category and remain with LT, MT and HT; in fact, the first version of the technology classification of exports (Lall, 1998, 2000) contained only these three categories, and processed foods like sugar, cheese and vegetable preparations were classified as primary products, not as resource-based manufactures as in the present classification; (ii) alternatively, combine RB and LT into one category, at some risk of oversimplification, calling it "easy" technologies, with the main drivers of competitiveness being natural resource endowments in the former case and low wages in the latter. In the right-hand panel of Figure 10 we show the first option, with LT, MT and HT, but the picture with option (ii) is similar, the difference being that the "LT+RB" corner looks a bit "heavier". As in Figure 3, to better visualize the multivariate data structure we superimpose 0.975 tolerance ellipses of the Mahalanobis distances computed from the sample mean and covariance (blue) and from the MCD (red), respectively. The ellipses are back-transformed to the original space using the inverse ilr transformation as proposed in Filzmoser and Hron (2008). A legend shows the classification of the countries according to their development (Upadhyaya, 2013). The three least developed countries identified by both the classical and the robust method are Belize, Maldives and Cambodia. Three industrialized countries are identified as outliers; from the left-hand panel of Figure 10 we see that these are Qatar, Bahrain and Kuwait, all oil-producing countries. Only Qatar is identified by both the classical and the robust method. The two outlying observations from the category of emerging industrial economies, visible near the lower-left corner (LT), are Vietnam and Saudi Arabia.
With Nigeria and Algeria, we have six outlying countries which are among the top 20 oil-producing countries, and two more, Congo and Vietnam, come close.
We continue by running the automatic outlier detection procedure based on the forward search; the top left panel of Figure 11 shows the corresponding plot. The magenta-colored line is the trajectory of the minimum Mahalanobis distance d_min(m) among the observations not in the subset S(m), for m = m_0, ..., n. The dashed lines present the envelopes computed for different quantiles at each step m of the search, as described in Section 2; see also Riani et al. (2009), where a two-stage procedure is described in detail. By monitoring the bounds over all n observations we check whether a "signal" is obtained, say at step m*, meaning that the value of the statistic lies beyond the threshold given by the envelope with significance level 99.99%. This indicates that observation m*, and therefore all succeeding observations, may be outliers. As visible in our plot in Figure 11, the "signal" for outliers occurs at observation 106, indicating that it and the succeeding observations might be outliers. In the second stage of the procedure we superimpose envelopes for values of n (the subsample size) from the point of the signal until the first time that we introduce an observation we recognize as an outlier. In the four bottom panels of Figure 11 we show the superimposition of envelopes for n = 105 (just before the signal), followed by n = 106 and n = 114, in which no outlier is detected. However, for n = 115 we see that d_min(106) goes beyond the 99.9% envelope computed with n = 115. This means that there is an outlier-free subset of 114 observations and the remaining 15 observations are outliers. The identified outliers are shown in the top right panel of Figure 11 as red circles.

[Figure 11: Technological structure of manufactured exports, ilr-transformed. The top left-hand panel shows the forward search plot of the minimum Mahalanobis distance d_min(m), with a "signal" for the presence of outliers around step 106. The bottom four panels show re-superimpositions of envelopes computed for different sample sizes. The top right-hand panel shows the scatter plot of the data with the 15 observations identified as outliers by the FS as red circles.]
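The core of the forward search, growing the fitted subset one unit at a time and recording the minimum distance d_min(m) among the units outside the subset, can be sketched as follows. This is an illustrative Python sketch, not the FSDA implementation: the initial-subset choice is simplified (here the units with the smallest MCD-based distances) and the envelope computations are omitted:

```python
import numpy as np
from sklearn.covariance import MinCovDet

def forward_search_dmin(X, m0=None, random_state=0):
    """Minimal forward search sketch: starting from a small robust subset,
    grow the subset one unit at a time and record d_min(m), the smallest
    Mahalanobis distance among units not yet in the subset."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    if m0 is None:
        m0 = p + 1
    # initial subset: the m0 units with the smallest MCD-based distances
    d0 = MinCovDet(random_state=random_state).fit(X).mahalanobis(X)
    subset = np.argsort(d0)[:m0]
    dmin = []
    for m in range(m0, n):
        mu = X[subset].mean(axis=0)
        S = np.cov(X[subset], rowvar=False)
        Sinv = np.linalg.pinv(S)
        # Mahalanobis distances of ALL units to the current fit
        d = np.sqrt(np.einsum('ij,jk,ik->i', X - mu, Sinv, X - mu))
        outside = np.setdiff1d(np.arange(n), subset)
        dmin.append(d[outside].min())
        # next subset: the m+1 units closest to the current fit
        subset = np.argsort(d)[:m + 1]
    return np.array(dmin)
```

With clean data the d_min(m) trajectory stays inside the envelopes; when outliers are present they enter the subset last, so d_min(m) spikes near the end of the search, which is exactly the "signal" inspected in Figure 11.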
It turns out that these 15 outliers are quite similar to, but still different from, the outliers detected by the MCD (Saudi Arabia, Vietnam, Pakistan, Madagascar and the Central African Republic are no longer outliers, but Namibia and Mongolia are identified additionally). Since in this automatic outlier detection procedure multiple outlier tests are conducted and we do not rely on the rather arbitrary threshold based on the 0.975 quantile as in the MCD outlier detection, we can conclude that the outlier list obtained is much more precise than the one obtained by the MCD and shown in Figure 10. Performing the same analysis on the original, untransformed data (not shown here) indicates a "signal" at observation 79 and identifies 51 observations as outliers (40%), which is similar to the result of the MCD applied to the original data.

Conclusions
Robust methods are not only useful but a required tool when analyzing real life data, which are very often plagued by the presence of outliers. Whether the robust methods are used directly to fit models or indirectly to identify outliers, arbitrarily chosen parameters can have a destructive effect on the results. In a number of recent articles Riani, Cerioli, Atkinson and others advocate the technique of monitoring robust estimates computed over a range of key parameter values. Through this approach the diagnostic tools of choice can be tuned in such a way that highly robust estimators which are as efficient as possible are obtained. This approach is applicable to various robust multivariate estimates like S- and MM-estimates, MVE and MCD, as well as to the Forward Search, in which monitoring is part of the robust method. We show that in order to apply these adaptive methods to compositional data, which are parts of some whole and are often recorded as data subject to a constant-sum constraint, it is necessary to transform the data first. The affine equivariance of the key measure for detecting outliers, the Mahalanobis distance (when computed with affine equivariant robust estimates of location and scatter), ensures that the identified irregularities are invariant to the choice of transformation (Filzmoser and Hron, 2008). We demonstrate on several examples how the monitoring can be conducted, providing highly efficient estimates, and illustrate the role of advanced dynamic graphics like brushing and linking for establishing a direct relationship between statistical results and individual observations. All computations were performed with the fsdaR package, available on CRAN, which brings almost all the functions of the MATLAB toolbox FSDA to the R user.
Since the scope of these functions also covers robust regression (Riani, Cerioli, Atkinson, and Perrotta, 2014a) and robust clustering (Riani et al., 2019), a natural extension of this study is to consider monitoring for robust regression and clustering of compositional data in the future.