A Logistic Normal Mixture Model for Compositional Data Allowing Essential Zeros

The usual candidate distributions for modeling compositions, the Dirichlet and the logistic normal distribution, do not include zero components in their support. Methods have been developed and refined for dealing with zeros that are rounded, or due to a value being below a detection level. Methods have also been developed for zeros in compositions arising from count data. However, essential zeros, cases where a component is truly absent, in continuous compositions are still a problem. The most promising approach is based on extending the logistic normal distribution to model essential zeros using a mixture of additive logistic normal distributions of different dimension, related by common parameters. We continue this approach, and by imposing an additional constraint, develop a likelihood, and show ways of estimating parameters for location and dispersion. The proposed likelihood, conditional on parameters for the probability of zeros, is a mixture of additive logistic normal distributions of different dimensions whose location and dispersion parameters are projections of a common location or dispersion parameter. For some simple special cases, we contrast the relative efficiency of different location estimators.


Introduction
An essential zero in compositional data is a zero component which is not caused by rounding or some other difficulty in measurement, but rather, is genuinely believed to be zero. This is fundamentally a different problem from the one addressed by recent work on rounded zeros, or below-detection-level zeros, such as in Palarea-Albaladejo and Martín-Fernández (2015) and references therein. Although there are recent workable Bayesian approaches to zeros in compositions from count data, see Martín-Fernández, Hron, Templ, Filzmoser, and Palarea-Albaladejo (2014) and references therein, essential zeros in continuous compositions still present a problem.
We develop an approach proposed by Aitchison and Kay (2003) to extend the logistic normal distribution to accommodate essential zeros. Aitchison (1986) and Aitchison and Kay (2003) note that a key feature of compositional data is that ratios of the components contain all pertinent information about the composition. Essential zeros complicate this feature in that they contain no information about the other components of the composition. In addition, an observation containing an essential zero is at the boundary of the simplex and is a composition of smaller dimension.

Previous work
In addition to the work mentioned above, there have been other approaches to zeros in compositions. Work by Butler and Glasbey (2008) mapped a latent Gaussian variable to a composition, but seems only to work for two- and three-part compositions. An additional concern is that it does not preserve ratios of parts in subcompositions. In contrast, Leininger, Gelfand, Allen, and Silander Jr (2013) have developed a more practical treatment of compositions as coming from a latent Gaussian random variable where the compositional component is zero when the latent Gaussian component is less than or equal to zero. They develop a hierarchical power model with the transformation
$$X_k = \frac{(\max(0, Z_k))^{\gamma}}{1 + \sum_{j=1}^{d} (\max(0, Z_j))^{\gamma}},$$
where Z_k is the k-th normal component and X_k is the corresponding compositional component, D is the number of parts in the composition, and d = D − 1. The corresponding inverse transformation is Z_k = (X_k / X_D)^{1/γ} if X_k > 0, and Z_k ≤ 0 (latent) if X_k = 0, for k = 1, 2, . . ., d. To estimate parameters they use MCMC. One limitation of their approach is also a limitation of ours: we require one component of the composition to be strictly positive.
Work by Stewart and Field (2011) uses a multiplicative logistic normal mixture model that allows them to consider the univariate log odds for the i-th component to be normally distributed where the i-th component is not zero. It works well for their applications, in particular regression, but does not capture covariance easily. Scealy and Welsh (2011) transform compositions into directional data on the hypersphere, and develop a regression model using the Kent distribution, Kent (1982), which tolerates zeros, though they write, "When any of the components of u are distributed too close to 0, boundary issues arise and in this case we need to pursue alternative approaches since the fitted Kent model (and the von Mises-Fisher model) may have significant support outside the positive orthant." A further issue with their approach is that their square root transformation does not preserve ratios of parts in subcompositions.
Our goal here is to extend the additive logistic normal distribution to handle essential zeros for continuous data.

Motivating example
Suppose we have compositional data on how much Bill spends on rice, lentils, and spices when he buys food. Suppose he buys in bulk, and occasionally the store is out of either the spices or lentils, but they always have plenty of rice. Table 1 shows a set of such compositions where some of the entries, for spices or lentils, are zero. Our goal is to develop a model for data like these by extending the additive logistic normal distribution.

Definitions (from Aitchison, 1986)

Definition: The d-dimensional simplex embedded in D-dimensional real space is the set of compositions, x, defined by
$$S^d = \{ x = (x_1, \ldots, x_D)^T : x_i > 0 \ (i = 1, \ldots, D),\ x_1 + \cdots + x_D = 1 \}.$$
The additive logratio transformation, alr, is defined as follows. We define the shorthand x_{−D} = (x_1, . . ., x_d)^T, and let
$$y = \mathrm{alr}(x) = \log(x_{-D}/x_D) = \bigl( \log(x_1/x_D), \ldots, \log(x_d/x_D) \bigr)^T. \quad (1)$$
Since alr is one-to-one, its inverse exists. It is called the logistic transformation, alr^{−1}, defined as x = alr^{−1}(y), where for i = 1, . . ., d,
$$x_i = \frac{e^{y_i}}{1 + \sum_{j=1}^{d} e^{y_j}}, \qquad x_D = \frac{1}{1 + \sum_{j=1}^{d} e^{y_j}}.$$
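The alr transformation and its inverse can be sketched numerically. This is an illustrative implementation (not from the paper), with function names of our choosing, using NumPy:

```python
import numpy as np

def alr(x):
    """Additive logratio transform: y_i = log(x_i / x_D) for i = 1..d."""
    x = np.asarray(x, dtype=float)
    return np.log(x[:-1] / x[-1])

def alr_inv(y):
    """Logistic (inverse alr) transform, mapping R^d back onto the simplex."""
    y = np.asarray(y, dtype=float)
    e = np.exp(y)
    denom = 1.0 + e.sum()
    # first d parts, then the final reference part x_D
    return np.append(e / denom, 1.0 / denom)
```

Applying `alr_inv` after `alr` recovers the original composition, which is the one-to-one property used throughout the paper.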

Simplifying assumption
In this section we outline our method for building a mixture distribution for dealing with compositions containing essential zeros, but leave most of the details about the weights for later. A key simplifying assumption we make throughout is that one of the parts of the composition, the D-th component, is never zero. We allow zeros anywhere else but not in the last component.
In a set of logistic normal data without zeros, the likelihood has been shown to be permutation invariant (Aitchison 1986). In our extension which allows zeros, if some parts are never zero, the likelihood is invariant to the choice of which one of those nonzero parts is chosen as the reference, provided the same reference part is used throughout the data set.
Let x = (x_1, x_2, . . ., x_d, x_D)^T be a composition with x_i < 1 for all i ∈ {1, 2, . . ., d, D} and x_D > 0. For i ∈ {1, 2, . . ., d}, there are two possibilities: either x_i > 0 or x_i = 0. Define
$$W = \{ i : x_i > 0,\ i \in \{1, 2, \ldots, d\} \}.$$
That is, W is the set of indices for the parts of x (other than x_D) which are nonzero (positive). For any given composition x, W is the set of all the indices of the nonzero components of x. There are 2^d − 1 possible sets W, and not 2^d, because W cannot be empty. If W were empty that would require x_D = 1 in order for x to be a composition, but we have already said we require all x_i < 1, including x_D. Each pattern of zeros corresponds to a different set W. We index them as W_ℓ with ℓ ∈ {1, 2, . . ., 2^d − 1}. They are elements of the power set, W_ℓ ∈ P({1, 2, . . ., d}). Sometimes we refer to these sets by incidence vectors whose i-th component is 1 if i ∈ W_ℓ and 0 otherwise. Each W_ℓ has some probability of occurrence, P(W_ℓ). Although some pattern may be absent, P(W_ℓ) = 0, the probabilities must sum to one:
$$\sum_{\ell=1}^{2^d - 1} P(W_\ell) = 1.$$
We use the probabilities P(W_ℓ) as the weights in a mixture distribution. For the other distributions making up our mixture, we use logistic normal distributions L(x; µ_{W_ℓ}, Ω_{W_ℓ}) derived from a single parent logistic normal distribution L(x; µ, Ω). They are in fact projections from the parent. We will call the distributions derived from the parent distribution subdistributions once we define them. So the mixture distribution will be denoted as follows, once we define a few more terms:
$$f(x) = \sum_{\ell=1}^{2^d - 1} P(W_\ell)\, L(x; \mu_{W_\ell}, \Omega_{W_\ell}).$$
In the parent distribution, L(x; µ, Ω), µ is a d-part location parameter vector, µ ∈ R^d, and Ω is a d × d positive definite dispersion matrix. To ease the discussion we will refer to µ and Ω as the mean vector and variance-covariance matrix respectively, although they are not moments of the distribution. For the distributions derived from the logistic normal parent distribution, the parameters µ_W and Ω_W are defined in terms of the parameters µ and Ω, the set W of indices of nonzero components of x, and a selection matrix B_W.
• ||W || refers to the cardinality of the set W .
• µ is a d-part vector in R d .
• µ W is a subvector of µ corresponding to the W pattern of zeros.
• Ω W is a submatrix of a d × d positive definite covariance matrix corresponding to the W pattern of zeros.
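The nonzero-index set W and the selection of µ_W and Ω_W can be sketched as follows. This is an illustrative sketch under the paper's assumptions (x_D always positive); the function names are ours:

```python
import numpy as np

def nonzero_index_set(x):
    """W: 0-based indices of the nonzero parts among x_1..x_d.
    The reference part x_D must be nonzero and is excluded from W."""
    x = np.asarray(x, dtype=float)
    assert x[-1] > 0, "reference part x_D must be nonzero"
    return tuple(i for i in range(len(x) - 1) if x[i] > 0)

def selection_matrix(W, d):
    """B_W: |W| x d 0/1 matrix that picks out the entries of mu and
    the rows/columns of Omega corresponding to the pattern W."""
    B = np.zeros((len(W), d))
    for row, i in enumerate(W):
        B[row, i] = 1.0
    return B
```

With these, the subdistribution parameters are the projections `B @ mu` and `B @ Omega @ B.T`, matching the definitions of µ_W and Ω_W above.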

Multivariate normal foundation
Now we extend the notation for the inverse of the additive logratio transformation, alr^{−1}, from Aitchison (1986). We use the new symbols, ãlr and ãlr^{−1}. We define them in terms of W and D, the maximum index. Let W ⊂ {1, 2, . . ., d} be a pattern of zeros, i.e., a set of indices of nonzero components of x, and denote them W = {i_1, i_2, . . ., i_r}; let j ∈ {1, 2, . . ., d, D}.
In our approach there is a tight correspondence between the y i variables of a multivariate normal vector and the x i parts of a composition, possibly one containing essential zeros.
When there is an essential zero in the composition in one of the x_i parts, e.g., in x_2, we use a placeholder so the components line up, as illustrated in (3). When we have an essential zero in the i-th part of the composition, the selected subvector of the y's does not contain an element corresponding to y_i. The requirement that x_D > 0 is what allows us to maintain this strict correspondence between x_i and y_i.

Now we define ãlr and its inverse. For a composition x with nonzero pattern W = {i_1, . . ., i_r}, ãlr(x, W, D) takes the logratios log(x_i/x_D) of the nonzero parts indexed by W, and ãlr^{−1}(y, W, D) applies alr^{−1} to y and places the resulting parts at the positions indexed by W and D, with zeros elsewhere. (4)
Next we turn to defining an extension to the logistic normal distribution. Let x = (x_1, x_2, x_3, . . ., x_d, x_D)^T be a composition, and let y = log(x_{−D}/x_D) = (y_1, y_2, y_3, . . ., y_d)^T. Then x is defined to have a logistic normal distribution with location µ and dispersion Ω, written x ∼ L(µ, Ω), if y ∼ N(µ, Ω). Note also that if y ∼ N(µ, Ω), and B is a selection matrix as mentioned in Section 5, then By is also multivariate normal with distribution By ∼ N(Bµ, BΩB^T). And finally, recall that also in Section 5 we used W to represent the set of indices of the nonzero components of x. B_W is the selection matrix that selects those components when we perform the matrix multiplications
$$\mu_W = B_W \mu, \qquad \Omega_W = B_W \Omega B_W^T.$$
With all these definitions in place we are now in a position to define a logistic normal distribution with essential zeros. Definition: Let B_W be the corresponding selection matrix.
Let y = log(B W x −D /x D ) = (y i 1 , y i 2 , . . ., y ir ) T = ãlr(x, W , D) be the logratios of the nonzero components of x.
If for every set W of indices of nonzero components of x we have y ∼ N(µ_W, Ω_W), then x has a logistic normal distribution with essential zeros, written x ∼ L(µ, Ω), with probability density function given by the mixture
$$f(x) = \sum_{\ell=1}^{2^d - 1} P(W_\ell)\, L(x; \mu_{W_\ell}, \Omega_{W_\ell}),$$
where µ_W is a subvector of µ corresponding to the W pattern of zeros.
Ω W is a square submatrix of Ω, corresponding to the W pattern of zeros.
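The ãlr map of the definition above can be sketched numerically; this illustrative function (our naming) returns both the pattern W and the logratios of the nonzero parts:

```python
import numpy as np

def alr_tilde(x):
    """alr-tilde: logratios log(x_i / x_D) for the nonzero parts i in W only.
    Returns (W, y), where W holds the 0-based indices of nonzero parts
    among x_1..x_d; x_D is assumed strictly positive."""
    x = np.asarray(x, dtype=float)
    assert x[-1] > 0, "reference part x_D must be nonzero"
    W = [i for i in range(len(x) - 1) if x[i] > 0]
    y = np.log(x[W] / x[-1])
    return W, y
```

For a composition with an essential zero, e.g. (0.25, 0, 0.75), only the logratio of the first part is produced, mirroring the projection onto a smaller-dimensional subdistribution.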

Common expectations and variances
The definition of ãlr^{−1} enables compositions from different subdistributions to be used to estimate parameters of their shared parent distribution. Let x_1 = (x_{11}, x_{21}, . . ., x_{D1})^T with zero pattern W_1, and let x_2 = (x_{12}, x_{22}, . . ., x_{D2})^T with zero pattern W_2. The two sets of nonzero indices, W_1 and W_2, need not have any elements in common, nor do they need to have the same number of elements, though x_1 and x_2 both have D elements. Suppose they have an index, m, in common: m ∈ W_1 ∩ W_2. By properties of the logistic normal distribution (Aitchison 1986, p. 116), and the definition of ãlr^{−1} in Equation 4, we have
$$E[\log(x_{m1}/x_{D1})] = \mu_m = E[\log(x_{m2}/x_{D2})],$$
and similarly
$$\mathrm{Var}[\log(x_{m1}/x_{D1})] = \Omega_{mm} = \mathrm{Var}[\log(x_{m2}/x_{D2})].$$
Thus, compositions from different subdistributions of the same logistic normal distribution can be used to estimate the parameters of their shared parent distribution.
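The shared-parameter property can be checked by simulation. This is an illustrative Monte Carlo sketch (ours, not the paper's): we sample transformed vectors from a parent normal distribution and verify that projections onto two different zero patterns sharing an index agree on that index's mean.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -0.2, 1.0])           # parent location (d = 3)
Omega = np.array([[1.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 1.0]])       # parent dispersion
Y = rng.multivariate_normal(mu, Omega, size=200_000)

# Two different nonzero patterns that share index m = 0 (0-based):
W1, W2 = [0, 1], [0, 2]
m1 = Y[:, W1].mean(axis=0)    # estimates (mu_1, mu_2) from pattern W1
m2 = Y[:, W2].mean(axis=0)    # estimates (mu_1, mu_3) from pattern W2
```

Both `m1[0]` and `m2[0]` estimate the same parent component µ_1, so observations from different subdistributions can be pooled, which is exactly what the simple estimators below do.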

Data blocks
Now that we have a correspondence between multivariate normal variables and compositions with zeros, we could derive a density function using the standard formula for transformed variables, analogous to Aitchison (1986, chapter 6). However, for estimating parameters it is more convenient to work in the space of the transformed variables (multivariate normal projections).
Here we apply the techniques and notation of block matrices and matrix calculus to do some preparation in order to build a likelihood and attack the problem of finding estimators for the parameters.We discuss two sets of estimators, a general maximum likelihood estimator, and a simpler pair of estimators reminiscent of method of moments estimators.

Block matrices of compositions
We write a collection of compositional data with zeros, X, as a column of blocks of compositions where each block, X_ℓ, has a particular pattern of zeros throughout. That is, for a particular block X_ℓ and i ∈ {1, 2, . . ., d}, the i-th column of X_ℓ is either all positive or all zero. Let
$$X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_b \end{pmatrix}.$$
The dimensions of the blocks are r_ℓ × D, and the sum of their vertical dimensions is r_1 + r_2 + . . . + r_b = n, where n is the number of data points.
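Grouping a data matrix into such blocks is mechanical; here is an illustrative sketch (our naming), assuming as in the text that the last column is always positive:

```python
import numpy as np
from collections import defaultdict

def blocks_by_zero_pattern(X):
    """Group the rows of an n x D compositional data matrix into blocks
    X_l that share one pattern of zeros among the first d = D-1 parts."""
    X = np.asarray(X, dtype=float)
    groups = defaultdict(list)
    for row in X:
        W = tuple(i for i in range(X.shape[1] - 1) if row[i] > 0)
        groups[W].append(row)
    return {W: np.vstack(rows) for W, rows in groups.items()}
```

Each dictionary value is one block X_ℓ with r_ℓ rows; the r_ℓ sum to n.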

Transformations -ratios and logratios
We have already defined the alr transformation for the case where there are no zeros in (1). Next we extend alr to ãlr for a block matrix of compositions, X_ℓ (r_ℓ × D), which may contain zeros. We do this by defining a selection matrix B_{W_ℓ} corresponding to the set W_ℓ. We still have W_ℓ ⊂ {1, 2, 3, . . ., d} being a nonempty set of indices of the nonzero components of x, and without loss of generality we can order the indices from least to greatest: W_ℓ = {i_1 < i_2 < . . . < i_J}. Now we define our (J + 1) × D selection matrix B_{W_ℓ}, which selects the components indexed by W_ℓ together with the D-th component. We use J + 1 here because we construct the selection matrix so that the final, D-th, component of the data is always selected. This is slightly different than before: previously we constructed B to conform to the parameters µ (d × 1) and Ω (d × d).
We define Y_ℓ = ãlr(X_ℓ, W_ℓ, D) = alr(X_ℓ B_{W_ℓ}^T). Each row vector in Y_ℓ is a vector of reals, all potentially from the multivariate normal distribution corresponding to the ℓ-th pattern of zeros. Note that we cannot form a single block matrix, Y, from the collection of Y_ℓ because they can have different numbers of columns.

Illustration -spices, lentils, and rice
In our example about compositions of spending on spices, lentils, and rice (Table 1), there are three patterns of zeros. Tables 2-4 show the result of applying the ãlr transformation.
X_1 corresponds to rows 1-3 and its set of indices is W_1 = {1}. X_2 corresponds to rows 4-6 and its set of indices is W_2 = {2}. X_3 corresponds to rows 7-12 and its set of indices is W_3 = {1, 2}.

Means
The matrix Y_ℓ contains rows of compositions with the same pattern of zeros. We refer to the t-th row vector of Y_ℓ as y_{ℓt}^T. We refer to the mean as the vector ȳ_ℓ, and define it as
$$\bar{y}_\ell = \frac{1}{r_\ell} Y_\ell^T 1_{r_\ell}.$$
Here we are using 1_{r_ℓ} to represent an r_ℓ × 1 column vector of ones. ȳ_ℓ is a column vector. We define it this way because of how we intend to use it in quadratic forms from the multivariate normal density.

Mean
It is also possible to construct simpler estimators relying on properties of the normal distribution. For the location, let n_i = ||{t : x_ti > 0}|| be the number of observations whose i-th part is nonzero, and define, for i ∈ {1, 2, . . ., d},
$$\overset{*}{\mu}_i = \frac{1}{n_i} \sum_{t : x_{ti} > 0} \log(x_{ti}/x_{tD}).$$
By the assumption of normality of the logratios, the estimator *µ is unbiased.
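The simple location estimator pools every observation whose i-th part is nonzero, regardless of zero pattern. An illustrative implementation (our naming):

```python
import numpy as np

def simple_location(X):
    """Simple estimator *mu: for each part i < D, average log(x_ti / x_tD)
    over the n_i rows where x_ti > 0 (last part assumed always positive)."""
    X = np.asarray(X, dtype=float)
    d = X.shape[1] - 1
    mu = np.empty(d)
    for i in range(d):
        pos = X[:, i] > 0          # rows contributing to part i
        mu[i] = np.mean(np.log(X[pos, i] / X[pos, -1]))
    return mu
```

Note that different components of the estimate may be based on different numbers of observations n_i, which is the source of the efficiency differences studied later.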

Variance
Here we show how to find estimators for variances and covariances using maximum likelihood estimators for normal random variables. For a single random composition, x, with components x_1, x_2, . . ., x_D, we substitute log(x_i/x_D) into the MLE for variances of normal random variables. We use *σ²_ii for the estimator of the variance of the logratio log(x_i/x_D); for i ∈ {1, 2, . . ., d},
$$\overset{*}{\sigma}{}^2_{ii} = \frac{1}{n_i} \sum_{t : x_{ti} > 0} \bigl( \log(x_{ti}/x_{tD}) - \overset{*}{\mu}_i \bigr)^2.$$
If we want an unbiased estimator, we can divide by (n_i − 1) instead of n_i. As with means, the different *σ_ii are based on different numbers of observations, n_i.

Covariance
It only makes sense to talk about estimating the covariance of the variables log(x_i/x_D) and log(x_j/x_D) when both x_ti and x_tj are not 0, so we define n_ij = ||{t : x_ti ≠ 0 and x_tj ≠ 0}||. That is, n_ij is the number of data points where both x_ti and x_tj are not 0. As we did with variance, we can start with the canonical maximum likelihood formula for estimating covariance among normally distributed variables, and substitute in the appropriate logratios:
$$\overset{*}{\sigma}_{ij} = \frac{1}{n_{ij}} \sum_{t : x_{ti} > 0,\ x_{tj} > 0} \bigl( \log(x_{ti}/x_{tD}) - \overset{*}{\mu}_i \bigr) \bigl( \log(x_{tj}/x_{tD}) - \overset{*}{\mu}_j \bigr).$$
Note that *σ_ij is based on n_ij observations, while *µ_i and *µ_j are based on n_i and n_j observations, respectively. The formula in Equation 20 is based on the maximum likelihood estimator for covariance of normal variables. For unbiased estimators we would divide by (n_ij − 1) instead of n_ij.
Our estimator for the d × d variance-covariance matrix is *Ω = [*σ_ij]. There are two potential problems with this approach. There could be i ≠ j such that whenever x_i > 0, x_j = 0; in that case we cannot estimate the covariance. Also, irrespective of that, the estimate of the variance-covariance matrix, *Ω, might not be positive definite.
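The pairwise construction of *Ω, including the non-estimable case, can be sketched as follows. This is an illustrative implementation (our naming); as in the text, each covariance uses the n_ij jointly positive rows but centers with the full *µ_i and *µ_j:

```python
import numpy as np

def simple_dispersion(X):
    """Simple estimator *Omega of the d x d dispersion matrix.
    Entries for pairs that are never jointly positive are left as NaN;
    the result may also fail to be positive definite (see text)."""
    X = np.asarray(X, dtype=float)
    d = X.shape[1] - 1
    # *mu_i over the n_i rows where part i is nonzero
    mu = np.array([np.mean(np.log(X[X[:, i] > 0, i] / X[X[:, i] > 0, -1]))
                   for i in range(d)])
    Omega = np.full((d, d), np.nan)
    for i in range(d):
        for j in range(i, d):
            both = (X[:, i] > 0) & (X[:, j] > 0)
            if not both.any():
                continue          # pair never jointly positive: not estimable
            yi = np.log(X[both, i] / X[both, -1])
            yj = np.log(X[both, j] / X[both, -1])
            Omega[i, j] = Omega[j, i] = np.mean((yi - mu[i]) * (yj - mu[j]))
    return Omega
```

A NaN off-diagonal entry signals the first problem noted above; a subsequent eigenvalue check would detect the second.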

Maximum likelihood estimators
For the case where there are no zeros, the location estimator described earlier is a maximum likelihood estimator (MLE), but in general the estimator we found earlier is not an MLE.
From now on we will call that estimator the simple estimator, to contrast it with the MLE, which we derive next.
We start by finding the location MLE given Ω for 3-part compositions, show it is unbiased, and then show the relative efficiency of the simple estimator with respect to the MLE. Assume we have a set of logistic normal compositional data with b different patterns of zeros as in (9).
In a block of data, as in (21), we use x_ℓt to refer to the t-th compositional observation with the W_ℓ pattern of zeros. We define y_ℓt = ãlr(x_ℓt, W_ℓ, D), and to ease notation, we write the likelihood in terms of y_ℓt.

Likelihood
First we write the full likelihood and log likelihood for D-part compositions, and then restrict ourselves to 3-part compositions. The full likelihood is
$$L(\mu, \Omega \mid r_1, \ldots, r_b, y_{11}, \ldots, y_{b r_b}) = C \prod_{\ell=1}^{b} \prod_{t=1}^{r_\ell} |2\pi \Omega_{W_\ell}|^{-1/2} \exp\Bigl( -\tfrac{1}{2} (y_{\ell t} - \mu_{W_\ell})^T \Omega_{W_\ell}^{-1} (y_{\ell t} - \mu_{W_\ell}) \Bigr),$$
where the factor involving the probabilities of the zero patterns is independent of µ, so for purposes of maximizing the likelihood with respect to µ we can treat it as a single constant, C. Taking the log gives
$$\log L = \log C - \tfrac{1}{2} \sum_{\ell=1}^{b} \Bigl( r_\ell \log |2\pi \Omega_{W_\ell}| + \sum_{t=1}^{r_\ell} (y_{\ell t} - \mu_{W_\ell})^T \Omega_{W_\ell}^{-1} (y_{\ell t} - \mu_{W_\ell}) \Bigr).$$
For the simple case of three-part compositional data with some zeros in component one and some zeros in component two, the parent distribution for the transformed data is bivariate normal, with
$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Omega = \begin{pmatrix} s_{11} & s_{12} \\ s_{12} & s_{22} \end{pmatrix}. \quad (28)$$
For the full bivariate normal distribution,
$$\Omega^{-1} = \frac{1}{s_{11} s_{22} - s_{12}^2} \begin{pmatrix} s_{22} & -s_{12} \\ -s_{12} & s_{11} \end{pmatrix},$$
and for the two univariate normal distributions, the inverses of the variances are 1/s_{11} and 1/s_{22}. In these formulas, y_{1j} is the j-th data point among the univariate data from the first component, y_{2j} is the j-th data point among the univariate data from the second component, and y_{3j} is a 2-part vector with data from both components. In the example, the y_{1j} correspond to elements of Table 2, log(spices/rice); the y_{2j} correspond to elements of Table 3, log(lentils/rice); and the y_{3j} correspond to elements of Table 4, both log(spices/rice) and log(lentils/rice).
We define the means of these matrices in the usual way.
Partial derivatives: MLE for location, given Ω. We set the partial derivatives equal to zero, replace µ with μ̂, and solve; the resulting closed-form expressions are linear combinations of the block means. In the case where there are no univariate data from the second component, i.e., r_2 = 0, the MLE (μ̂_1 | Ω, r_1, r_2, r_3) is equal to our simple estimator for µ_1. Similarly, when r_1 = 0, (μ̂_2 | Ω, r_1, r_2, r_3) is equal to our simple estimator for µ_2. It also turns out that when r_3 = 0, (μ̂_1 | Ω, r_1, r_2, r_3) = ȳ_11, and the simple estimator is also ȳ_11, so they are equal in that case as well.

Unbiasedness of conditional MLE for 3-part composition
To show that μ̂ | Ω, r_1, r_2, r_3 is unbiased, we start by pointing out the expectations of the various means: E[ȳ_11] = µ_1, E[ȳ_22] = µ_2, E[ȳ_31] = µ_1, and E[ȳ_32] = µ_2. When we take the expectation in expression (34), the term with (ȳ_22 − ȳ_32) vanishes because E[ȳ_22 − ȳ_32] = µ_2 − µ_2 = 0. That leaves only terms with E[ȳ_11] = µ_1 and E[ȳ_31] = µ_1, which we can factor, and whose coefficients sum to one. This shows that μ̂_1 is unbiased. By symmetry we get that μ̂_2 is unbiased.

General maximum likelihood estimators
For the general case of MLE for higher dimensions than shown here, the log likelihood can be differentiated, and the score functions can be solved with a computer algebra system.In addition, the Hessian can be checked to verify the solution is a maximum.We have done this for the case of 3-part compositions and do not anticipate any obstacles to extending the program to handle the general case of D-dimensions.
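The paper solves the score equations symbolically; as an illustrative numerical alternative (ours, not the authors' implementation), one can maximize the 3-part log likelihood for µ given Ω directly with SciPy. The data layout (two univariate blocks and one bivariate block) follows the text; function names are our assumptions:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal, norm

def neg_log_lik(mu, Omega, y1, y2, y3):
    """-log L(mu | Omega, data) for the 3-part case: y1, y2 are the
    univariate logratios (one-zero patterns), y3 the (n, 2) bivariate ones."""
    ll = norm.logpdf(y1, mu[0], np.sqrt(Omega[0, 0])).sum()
    ll += norm.logpdf(y2, mu[1], np.sqrt(Omega[1, 1])).sum()
    ll += multivariate_normal.logpdf(y3, mean=mu, cov=Omega).sum()
    return -ll

def mle_location(Omega, y1, y2, y3):
    """Numerical MLE of mu given Omega, started at the simple estimator."""
    start = np.array([np.mean(np.r_[y1, y3[:, 0]]),
                      np.mean(np.r_[y2, y3[:, 1]])])
    res = minimize(neg_log_lik, start, args=(Omega, y1, y2, y3))
    return res.x
```

The same scheme extends to D dimensions by summing one multivariate normal log density per zero pattern, which is one way to realize the general case discussed above.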

Variances of location estimators
Next we find variances of the two location estimators, the MLE and the simple estimator; both are unbiased. The question we need to answer is: what is the efficiency of the simple estimator relative to the MLE? We have been using μ̂ for the MLE, and we continue to use *µ for the simple estimator (of the location). Evaluating the variance expressions Var(μ̂_1 | Ω, r_1, r_2, r_3) and Var(μ̂_2 | Ω, r_1, r_2, r_3) at s_12 = 0 reduces them to the variances of the corresponding simple estimators. We have already shown in Section 8.1.2 that when r_2 = 0, μ̂_1 = *µ_1; when r_1 = 0, μ̂_2 = *µ_2; and when r_3 = 0, μ̂_1 = *µ_1 and μ̂_2 = *µ_2. Next we need to compare the variance of *µ with the variance of μ̂ in cases where the estimators are not obviously the same.

We consider a sample of 100 compositions from a logistic normal distribution with the number of zeros in part 1 ranging from 0 to 100, and similarly for part 2, and we calculate the relative efficiency. These are not simulations; they are calculations based on the expressions for the variances of the estimators. We consider all possible combinations of r_1, r_2, r_3 such that r_1 + r_2 + r_3 = 100. A larger sample would give roughly the same picture, just with finer granularity. In addition, while we want to understand the effect of the covariance term s_12 for every possible value between −1 and 1, we get a feel for the space by choosing three values, s_12 ∈ {0, 0.2, 0.8}. For simplicity we choose s_11 = s_22 = 1.
Figure 1 shows the relationship between the efficiency of *µ_1 and *µ_2 and the relative sizes of r_1 and r_2. In the worst case, when r_1 ≫ r_2, the efficiency of *µ_1 approaches 1, and the efficiency of *µ_2 falls off toward 0.97. A point to note here is that for a relatively small covariance, 0.2, the simple estimator *µ has a variance almost as small as that of μ̂. We will save discussion of the bands or striations for Figure 3.
Figure 2, which shows efficiency based on a covariance of 0.8, has the same pattern as Figure 1, but with larger variances for *µ and hence smaller efficiency. Here the worst cases can have an efficiency of less than 0.5 for either component of *µ, though when the efficiency of *µ_1 is that small, the efficiency of *µ_2 is very near 1.
Figure 3 shows the same points, for a covariance of 0.8, but shaded by the value of r_3. To help decipher it, we show a subset of the points in Figure 4.
Figure 4 shows a subset of the points, only those where r_3 ∈ {1, 2, 3, 4, 61, 62, 63, 64}. When r_3 is very small there is a wide range of possibilities for r_1 and r_2. The four leftmost points in the upper left of Figure 4 are points where r_1 is 1 or 2, r_2 is somewhere between 94 and 97, and r_3 is 2, 3, or 4. In these cases, the sample for estimating µ_1 is very small, from 3 to 6 points, some from the univariate data and some from the bivariate data; there, the MLE has a much smaller variance than the simple estimator. In the same cases, there is a much larger sample from univariate data for estimating µ_2, upwards of 90 points, plus a handful of points from the bivariate data, and the difference between the variance of *µ_2 and μ̂_2 is very small.
Graphs with negative covariances, −0.2, and −0.8 look the same as with positive covariances, and are omitted for the sake of brevity.

Summary of relative efficiency
Both the simple estimator for the location, *µ, and the maximum likelihood estimator, μ̂, are unbiased given Ω. The simple estimator's efficiency relative to the MLE tends to decrease as the covariance component of Ω increases. We say "tends" because even with a covariance of 0.8, there are cases where the efficiency of both components of *µ relative to μ̂ is very close to one.
When there are relatively few zeros, and they are balanced, *µ has a variance almost as small as μ̂. The more zeros there are, or the more unbalanced their distribution is, the larger the variance of one or more components of the simple estimator.

Subcompositional coherence
One of the reasons for using the logistic normal approach is that, in the base case without zeros, it preserves subcompositional coherence, described by Aitchison and Egozcue (2005, p. 831) as follows: "Subcompositional coherence demands that two scientists, one using full compositions and the other using subcompositions of these full compositions, should make the same inference about relations within the common parts." This implies that the subcomposition of a location estimator equals the location estimator for the subcomposition.
In the presence of zeros, do we maintain this property? It depends on which estimators are used. We have shown that in general when there are zeros, the MLE for the mean is not the same as the simple estimator for the mean. The MLE does not preserve subcompositional coherence when we have zeros. The simple estimators, by construction, do preserve subcompositional coherence provided the same D-th component is in both. Thus for inference, there is a choice to be made between maintaining subcompositional coherence and maximizing likelihood.
The issue of the relationship between compositions containing zeros and subcompositional coherence has been addressed from other points of view as well. Greenacre (2011) introduced a measure of subcompositional incoherence and suggested ways of getting it small enough for practical purposes in the paradigm of correspondence analysis. Scealy and Welsh (2014) argue more generally that although logratio methods for analyzing compositions have their uses, some of the principles that have been used to motivate them, such as subcompositional coherence, should not be taken to be as important as has been argued, e.g., by Aitchison (1994).

Discussion
The goal has been to extend the additive logistic normal distribution to cope with essential zeros. We have done that by requiring that the final component of each composition be nonzero, and by projecting compositions with zeros onto smaller-dimensional subspaces, thereby addressing the issues of division by zero and the log of zero. We arrive at a mixture of logistic normals where each distribution has a mean and a covariance parameter which are projections from a common mean and covariance.
We construct two sets of estimators: simple estimators, *µ and *Ω, and maximum likelihood estimators, μ̂ and Ω̂. These are estimated using all of the compositions in the data, regardless of where the zeros occur, assuming that the D-th component is always nonzero. The simple estimators preserve subcompositional coherence, while the maximum likelihood estimators do not.
There are some limitations to this approach. In addition to the assumption that the D-th part is always nonzero, we assume that each composition has at least one more nonzero part, i.e., the vertices of the simplex are not in the support of the distribution. We assume a common mean and variance. Obviously, for a data set where different zero patterns have different means or variances or both, this model would not be appropriate. It is possible for the simple estimator of the covariance to produce a nonpositive definite matrix. If that happens, one possible approach is to estimate the covariance matrix using only the compositions that do not contain zeros. Another possible approach, once more work is done, would be to use the MLE. Currently, though, we do not have a general software solution for finding the MLE. One last concern is that a data set might have two parts which are never positive at the same time, in which case the simple estimator for the covariance cannot be found.

Figure 3 :
Figure 3: Efficiency of * µ relative to μ with high covariance (0.8) and relative to r 3

Table 1 :
Composition of food expense