How to keep the Type I Error Rate in ANOVA if Variances are Heteroscedastic

Abstract: One essential prerequisite to ANOVA is homogeneity of variances in the underlying populations. Violating this assumption may lead to an increased type I error rate. This undesirable effect is due to the way the corresponding $F$-value is calculated. A slightly different test statistic keeps the level α. The underlying distribution of this alternative method is Hotelling's $T^2$. As Hotelling's $T^2$ can be approximated by Fisher's $F$-distribution, this alternative test is very similar to an ordinary analysis of variance.


Introduction
ANOVA is one of the most frequently used methods in statistics. A correct application of this method depends on three preconditions: (i) independence of samples, (ii) normally distributed populations, and (iii) homoscedasticity. Dependence can be eliminated by an appropriate model. The effects of non-normally distributed data on the significance level are small (see Box and Andersen, 1955) and can be ignored in most cases (see Lindman, 1992). Inhomogeneity of variances, however, affects α as well as test efficiency. Although Box (1954a) reported only little influence on this error rate with small differences in variances, Box and Andersen (1955) found the effect of unequal variances to be appreciable even when the ratio of block variances is moderate. In a second study, Box (1954b) investigated the effects of inequality of variance in the two-way classification. For an assumed variance ratio of main effects of $1 : \cdots : 1 : 3$, a type I error rate of about 7% was found. In many practical trials the variance ratio is much wider and exceeds these values, as for example in Figure 1.
Figure 1: Boxplots for the refraction index of apple juice (6 apples per variety), gathered at Landesversuchszentrum Haidegg ($s_1 : \cdots : s_5 = 4.4 : 2.1 : 1.7 : 4.9 : 1$).
A method proposed by Nelson and Dudewicz (2002) is applicable to such situations, but its hypothesis differs from that of the analysis of variance and a new test statistic has to be used. Transformation of data (e.g. log, arcsine, and similar transformations) is another frequently used practice in situations where variances are inhomogeneous. In a one-factorial experiment this may be useful if the standard deviations are tied to the size of the means. In multi-factorial analysis of variance, transformations of that kind are not appropriate because of problems with the interpretation of parameters and probabilities. In this article a method similar to the analysis of variance, with an identical hypothesis, is introduced, and the impact of inhomogeneous variances on the test of main effects is examined. The type I error rate as well as the test efficiency are checked by means of a simulation study.

Another View on the F-Ratio
As a very simple case of ANOVA, a block analysis is used (although the method is applicable to more complicated situations). An appropriate model looks like
$x_{ij} = \mu + \alpha_i + \beta_j + e_{ij}$,
where $i = 1, \ldots, I$ and $j = 1, \ldots, J$. Here $x_{ij}$ is the observation on factor A at level $i$ and factor B at level $j$, $\mu$ is the overall mean, $\alpha_i$ denotes the effect of level $i$ of factor A (treatment effect), $\beta_j$ the effect of level $j$ of factor B (block effect), $e_{ij}$ is a random effect associated with $x_{ij}$, $I$ stands for the number of levels of A (number of treatments) and $J$ for the number of levels of B (number of blocks). An appropriate test statistic for the hypothesis of interest $H_0: \alpha_1 = \cdots = \alpha_I = 0$ is
$F = MS_A / MS_E$
with $df_A = I - 1$ and $df_E = (I - 1)(J - 1)$ degrees of freedom. $MS_A = SS_A / df_A$ is the mean square for the factor of interest A, $SS_A$ its sum of squares and $df_A$ its degrees of freedom; $MS_E = SS_E / df_E$ is the mean square for the error term, $SS_E$ its sum of squares and $df_E$ its degrees of freedom.
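Let $\bar{x}_{i.} = \sum_{j=1}^{J} x_{ij}/J$ and $\bar{x}_{..} = \sum_{i=1}^{I} \bar{x}_{i.}/I$, then $SS_A$ can be calculated as
$SS_A = J \sum_{i=1}^{I} (\bar{x}_{i.} - \bar{x}_{..})^2 = \frac{J}{I} \sum_{i<i^*} (\bar{x}_{i.} - \bar{x}_{i^*.})^2 = \frac{J}{I} \sum_{i<i^*} \bar{d}_{ii^*.}^{\,2}$,
where $\bar{d}_{ii^*.} = \bar{x}_{i.} - \bar{x}_{i^*.}$ denotes the mean difference between levels $i$ and $i^*$ of factor A. This means that $SS_A$ is the sum of squares over all possible differences between means of factor A. Now let $\bar{x}_{.j} = \sum_{i=1}^{I} x_{ij}/I$, then the sum of squares for factor B (blocks) can be calculated as
$SS_B = I \sum_{j=1}^{J} (\bar{x}_{.j} - \bar{x}_{..})^2 = \frac{1}{I} \left( \sum_{i=1}^{I} SS_i + 2 \sum_{i<i^*} SP_{ii^*} \right)$,
where $SS_i = \sum_{j=1}^{J} (x_{ij} - \bar{x}_{i.})^2$ is the sum of squares within level $i$ and $SP_{ii^*} = \sum_{j=1}^{J} (x_{ij} - \bar{x}_{i.})(x_{i^*j} - \bar{x}_{i^*.})$ is the sum of crossproducts for levels $i$ and $i^*$ of factor A. Utilizing
$\sum_{i=1}^{I} \sum_{j=1}^{J} (x_{ij} - \bar{x}_{..})^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} (x_{ij} - \bar{x}_{i.})^2 + J \sum_{i=1}^{I} (\bar{x}_{i.} - \bar{x}_{..})^2$
we further get for the total sum of squares
$SS_T = \sum_{i=1}^{I} SS_i + SS_A$.
Finally, the sum of squares for the error term ($SS_E$) can be calculated as
$SS_E = SS_T - SS_A - SS_B = \frac{1}{I} \sum_{i<i^*} (SS_i + SS_{i^*} - 2\,SP_{ii^*}) = \frac{1}{I} \sum_{i<i^*} \sum_{j=1}^{J} (d_{ii^*j} - \bar{d}_{ii^*.})^2$,
where $d_{ii^*j} = x_{ij} - x_{i^*j}$ is the individual difference between levels $i$ and $i^*$ in block $j$. This means that $SS_E$ is calculated as a sum of squares of all individual differences between observations of two samples each, minus the corresponding mean difference, over all combinations of samples.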
The $F$-value of interest thus results from a pooled estimate of all squared mean differences, divided by a pooled value of the individual differences between observations of two samples each, over all combinations of samples. In a heteroscedastic situation this pooling is responsible for an inflated type I error rate.
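The pairwise-difference view of the two sums of squares can be checked numerically. The following is a minimal sketch in Python, assuming the identities $SS_A = (J/I) \sum_{i<i^*} \bar{d}_{ii^*.}^{\,2}$ and $SS_E = (1/I) \sum_{i<i^*} \sum_j (d_{ii^*j} - \bar{d}_{ii^*.})^2$; the function names and the random test data are hypothetical and not taken from the original study.

```python
import random
from itertools import combinations


def block_sums_of_squares(x):
    """Classical SS_A and SS_E for a complete block layout.

    x[i][j] is the observation for treatment level i in block j.
    """
    I, J = len(x), len(x[0])
    row_mean = [sum(row) / J for row in x]
    col_mean = [sum(x[i][j] for i in range(I)) / I for j in range(J)]
    grand = sum(row_mean) / I
    ss_a = J * sum((m - grand) ** 2 for m in row_mean)
    ss_e = sum(
        (x[i][j] - row_mean[i] - col_mean[j] + grand) ** 2
        for i in range(I)
        for j in range(J)
    )
    return ss_a, ss_e


def pairwise_sums_of_squares(x):
    """SS_A and SS_E rebuilt from all pairwise differences between levels."""
    I, J = len(x), len(x[0])
    ss_a = 0.0
    ss_e = 0.0
    for i, k in combinations(range(I), 2):
        d = [x[i][j] - x[k][j] for j in range(J)]  # individual differences
        d_bar = sum(d) / J                         # mean difference
        ss_a += J * d_bar ** 2 / I
        ss_e += sum((v - d_bar) ** 2 for v in d) / I
    return ss_a, ss_e


# Hypothetical data: 4 treatment levels, 6 blocks, unequal spreads.
random.seed(1)
data = [[random.gauss(10 + i, 1 + i) for _ in range(6)] for i in range(4)]
classical = block_sums_of_squares(data)
pairwise = pairwise_sums_of_squares(data)
print(classical)
print(pairwise)
```

Both functions should return the same pair of values up to rounding, which illustrates that the ordinary $F$-ratio pools the paired differences of all combinations of samples.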
For the paired t-test, homogeneity of variances is of no concern, as only one variable is created from two dependent ones. By replacing the pooled sum of differences with a sum of individual paired differences, we obtain a test statistic that follows Hotelling's $T^2$ distribution (see Hotelling, 1947), the counterpart of Student's paired t-value.

Hotelling's $T^2$
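Hotelling's $T^2$ for a single group of samples is calculated as
$T^2 = J\,(\bar{\mathbf{X}} - \boldsymbol{\mu})'\,\mathbf{S}^{-1}\,(\bar{\mathbf{X}} - \boldsymbol{\mu})$,
where $J$ is the number of observations within each sample, $\bar{\mathbf{X}}$ denotes the sample mean vector of $I$ elements, $\boldsymbol{\mu}$ the mean parameter vector of $I$ elements, and $\mathbf{S}$ is the $(I \times I)$ sample covariance matrix.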
The null hypothesis is formulated as $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$. As mentioned above, the analysis of variance is a test based on all possible pairwise mean differences. For this situation Hotelling's $T^2$ can easily be applied:
1. $\boldsymbol{\mu}$ is replaced by $\mathbf{0}$, a vector of zeros.
2. $\bar{\mathbf{X}}$ is replaced by $\mathbf{D}$, a vector of all $I(I-1)/2$ possible pairwise mean differences of a factor, or alternatively the vector of all $I-1$ independent pairwise mean differences, which leads to equivalent results, i.e. $\mathbf{D} = (\bar{d}_{12.}, \bar{d}_{23.}, \ldots, \bar{d}_{I-1,I.})'$.
3. $\mathbf{S}$ is calculated from all samples of individual differences corresponding to the mean difference vector $\mathbf{D}$. As individual differences include covariances between particular samples, it is not necessary to calculate off-diagonal elements in $\mathbf{S}$. Thus, it is sufficient to calculate $\mathbf{S}$ as a matrix of the individual variances of sample differences. For independent differences, $\mathbf{S}$ looks like $\mathbf{S} = \mathrm{diag}(s^2_{d_{12}}, \ldots, s^2_{d_{I-1,I}})$, where $s^2_{d_{i-1,i}}$ is the variance of the differences between all mean-adjusted observations in samples $i-1$ and $i$, for $i = 2, \ldots, I$.
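As a consequence, Hotelling's $T^2$ simplifies to
$T^2 = J \sum_{i=2}^{I} \bar{d}_{i-1,i.}^{\,2} / s^2_{d_{i-1,i}}$.
Probability levels for $T^2$ can be found by approximating it by Fisher's $F$-distribution.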
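With a diagonal matrix of difference variances, the statistic reduces to a sum of squared paired t-values over adjacent levels. The following Python sketch computes it under that assumption, i.e. $T^2 = J \sum_{i=2}^{I} \bar{d}_{i-1,i.}^{\,2} / s^2_{d_{i-1,i}}$; the function name `simplified_t2` and the example data are hypothetical. The conversion of $T^2$ into a probability level is deliberately left out, because the scaling of the $F$-approximation is not given here.

```python
from statistics import mean, stdev, variance


def simplified_t2(x):
    """Simplified Hotelling T^2 from adjacent paired differences.

    x[i][j] is the observation for treatment level i in block j.
    Assumes a diagonal S holding the sample variances of the differences
    between levels i-1 and i.
    """
    J = len(x[0])
    t2 = 0.0
    for upper, lower in zip(x[:-1], x[1:]):
        d = [a - b for a, b in zip(upper, lower)]  # paired differences
        t2 += J * mean(d) ** 2 / variance(d)       # one squared paired t-value
    return t2


# Hypothetical data: 3 treatment levels, 4 blocks.
example = [
    [5.1, 4.8, 6.0, 5.5],
    [4.2, 4.9, 5.1, 4.7],
    [7.9, 3.1, 6.4, 2.2],
]
print(simplified_t2(example))

# With only two levels the statistic equals the squared paired t-value.
two_levels = example[:2]
d = [a - b for a, b in zip(*two_levels)]
paired_t = mean(d) / (stdev(d) / len(d) ** 0.5)
print(simplified_t2(two_levels), paired_t ** 2)
```

The two-level case makes the link to Student's paired t-test explicit: no variance is pooled across levels, so a single noisy level only affects the terms it takes part in.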

Simulation Results
By means of a simulation study, the impact of inhomogeneous variances on the empirical type I error rate at a given α = 0.05 was investigated. Figures 2 to 8 are based on 8 × 8 = 64 simulation configurations with 10000 runs each (treatment factor $i = 1, \ldots, I$ with number of levels $I = 3, \ldots, 8$, block factor $j = 1, \ldots, J$ with number of replications $J = 3, \ldots, 8$). The software packages R and SAS were used for this purpose. The errors $e_{ij}$ were generated from $N(0, \sigma_i^2)$. With homoscedastic variances both procedures meet the α-level. This is not true for the analysis of variance as soon as there are differences in the $\sigma_i^2$-levels. From Figure 2 we find that the empirical significance level of ANOVA, when the ratio of the true standard deviations is $\sigma_1 : \sigma_2 : \cdots : \sigma_I = 3 : 1 : \cdots : 1$, rises up to 12% (depending on the number of factor levels), whereas the alternative test keeps the predefined value of α.
Figure 3: Empirical significance levels when the ratio of the true standard deviations is $6 : 1 : \cdots : 1$.
For Figure 3 the ratio of standard deviations is wider ($6 : 1 : \cdots : 1$) than for Figure 2. As a consequence, the type I error rate rises up to 18% for ANOVA. The data in Figure 1 may reflect this situation: ANOVA rejects the null hypothesis (p = 0.0136), whereas the alternative method does not (p = 0.1727).
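The inflation of the ANOVA error rate can be reproduced with a small simulation. The following Python sketch is not the R or SAS code used for the figures; it assumes one configuration only ($I = 5$ treatments, $J = 5$ blocks, 4000 runs, no true effects) and uses the tabulated upper 5% point of the $F$-distribution with 4 and 16 degrees of freedom as a hard-coded constant. All names are hypothetical.

```python
import random

# Tabulated upper 5% point of F(4, 16); valid only for I = 5 and J = 5.
F_CRIT_4_16 = 3.007


def block_f(x):
    """F-ratio MS_A / MS_E for a complete block layout x[i][j]."""
    I, J = len(x), len(x[0])
    row_mean = [sum(row) / J for row in x]
    col_mean = [sum(x[i][j] for i in range(I)) / I for j in range(J)]
    grand = sum(row_mean) / I
    ss_a = J * sum((m - grand) ** 2 for m in row_mean)
    ss_e = sum(
        (x[i][j] - row_mean[i] - col_mean[j] + grand) ** 2
        for i in range(I)
        for j in range(J)
    )
    return (ss_a / (I - 1)) / (ss_e / ((I - 1) * (J - 1)))


def rejection_rate(sigmas, J, runs, rng):
    """Share of runs in which ANOVA rejects a true null hypothesis."""
    rejections = 0
    for _ in range(runs):
        # Errors e_ij ~ N(0, sigma_i^2); treatment and block effects are zero.
        x = [[rng.gauss(0.0, s) for _ in range(J)] for s in sigmas]
        if block_f(x) > F_CRIT_4_16:
            rejections += 1
    return rejections / runs


rng = random.Random(2024)
homo = rejection_rate([1, 1, 1, 1, 1], J=5, runs=4000, rng=rng)
hetero = rejection_rate([3, 1, 1, 1, 1], J=5, runs=4000, rng=rng)
print("equal standard deviations:", homo)
print("ratio 3 : 1 : 1 : 1 : 1:  ", hetero)
```

With equal standard deviations the empirical rate should stay close to the nominal 5%, while the 3 : 1 : ⋯ : 1 configuration should give a clearly larger value. Other configurations need the matching critical value for $I - 1$ and $(I - 1)(J - 1)$ degrees of freedom.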

Power Comparison
An important question that arises with all kinds of tests concerns test efficiency. In the following figures, several situations with a true alternative hypothesis and different variance ratios are investigated.
Figure 4 shows a higher power for the analysis of variance if all variances are homogeneous, especially with a low number of replications. In Figure 5 the power of ANOVA seems to be superior to that of the alternative method. The apparent advantages are partly due to an inflated type I error rate. This means that many significant results are not caused by differences in factor levels but by random influences. Figure 6 shows results comparable to those in Figure 5, but it is difficult to find any differences in the factor levels even if the number of observations is high. Whereas in Figures 5 and 6 the factor level with the largest effect was tied to the largest standard deviation, this is not true in the following.
If the largest level of the factor does not correspond to the level with the largest standard deviation, as in Figure 7, the alternative method is superior to ANOVA in most situations.
If heteroscedasticity is high (as in Figure 8), the significant results of ANOVA are similar to those of Figure 3. This means that it is almost impossible to find differences in factor levels even if the sample size is large. However, the alternative method shows a large power.

Tests on Homogeneity of Variances
There are various tests on homogeneity of variances available (Conover, Johnson, and Johnson, 1981). Levene's test (Levene, 1960) is one of the most popular ones. O'Brien's test (see O'Brien, 1979) is a modification of Levene's test, which is believed to be one of the most sensitive ones (Abdi, 2007), especially with platykurtic distributions (Algina, Olejnik, and Ocanto, 1989). In a simulation study with 1000 runs each, these tests are investigated.
The simulation is performed in such a way that with 3 levels of the factor the ratio of standard deviations is 1 : 7 : 5. With 4 levels this ratio is set to 1 : 7 : 5 : 3. Following this strategy, the ratio of standard deviations for 7 factor levels is set to 1 : 7 : 5 : 3 : 2 : 4 : 6. The power functions of these tests are shown in Figure 9. In the case of normally distributed data, Levene's test performs better than O'Brien's. But no matter which of these tests is used, there is a relatively high risk of overlooking inhomogeneous variances, even with a wide ratio of standard deviations.
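A power simulation of this kind can be sketched for Levene's test in a few lines of Python. The sketch below uses the mean-centred form of the statistic (a one-way ANOVA on absolute deviations from the group means) and assumes 3 independent groups of 6 normally distributed observations with standard deviations 1 : 7 : 5; the group size, the seed, and the hard-coded tabulated critical value of $F$ with 2 and 15 degrees of freedom are assumptions and are not taken from the study. O'Brien's test is not covered.

```python
import random

# Tabulated upper 5% point of F(2, 15); valid only for 3 groups of 6.
F_CRIT_2_15 = 3.682


def levene_w(groups):
    """Levene's W: one-way ANOVA F on absolute deviations from group means."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    z = []
    for g in groups:
        m = sum(g) / len(g)
        z.append([abs(v - m) for v in g])
    z_means = [sum(zi) / len(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within


def levene_power(sigmas, n, runs, rng):
    """Share of runs in which Levene's test detects the unequal variances."""
    hits = 0
    for _ in range(runs):
        groups = [[rng.gauss(0.0, s) for _ in range(n)] for s in sigmas]
        if levene_w(groups) > F_CRIT_2_15:
            hits += 1
    return hits / runs


rng = random.Random(7)
power = levene_power([1, 7, 5], n=6, runs=1000, rng=rng)
print("empirical power of Levene's test:", power)
```

The printed share of rejections is the empirical power for this single configuration; repeating the call over several group sizes and numbers of levels would trace out a power function comparable in spirit to the one discussed above.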

Conclusions
Heteroscedasticity can be found in many practical trials. The consequences of such a situation for the analysis of variance are summarized in the following:
• ANOVA leads to an inflated type I error rate if variances are non-homogeneous.
• An alternative test based on Hotelling's $T^2$ keeps the α-level independently of the variance ratio.
• As soon as a factor effect comes with an increased standard deviation, the power of each test is very low.
• If the increased standard deviation is not tied to an increased factor effect, the alternative method shows very large power compared to ANOVA.
• If variances are homogeneous, ANOVA shows larger power than the alternative.
• Tests on homogeneity of variances show only low power.
If there are doubts concerning homogeneity of variances, an alternative procedure is preferable.