Abstract


Of the several tests for comparing population means, the best known are the ANOVA, Welch, Brown–Forsythe, and James tests. Each performs appropriately only under certain conditions, and none performs well in every setting. Researchers therefore have to select the appropriate procedure and run the risk of a bad choice and, consequently, of erroneous conclusions. It would be desirable to have a test that performs well in any situation, obviating the need for a preliminary analysis of the data. We assess and compare several tests for equality of means in a simulation study, including non-parametric bootstrap techniques, and find that the bootstrap ANOVA and bootstrap Brown–Forsythe tests behave similarly and exceptionally well.

1. Introduction


Comparing treatment groups is a common task in the behavioural sciences. The analysis of variance (ANOVA) and the Welch (1951) test are the main procedures used for comparing population means and are commonly included in standard statistical packages.

The ANOVA F-test is known to be the best under the classical assumptions (normality, homoscedasticity and independence). However, when one or more of these basic assumptions are violated, especially independence and/or homoscedasticity, the F-test can be overly conservative or liberal. The only instance in which the researcher may be able to obtain a valid test under heteroscedasticity is when the degree of variance heterogeneity is small and group sizes are equal. Numerous papers have investigated the behaviour of the ANOVA F-test under assumption violations and under various degrees of each violation (e.g., Akritas & Papadatos, 2004; Bathke, 2004; De Beuckelaer, 1996; Glass, Peckham, & Sanders, 1972; Harwell, Rubinstein, Hayes, & Olds, 1992; Kenny & Judd, 1986; Keselman, Rogan, & Feir-Walsh, 1977; Krutchkoff, 1988; Rogan & Keselman, 1977; Scheffé, 1959; Weerahandi, 1995). The balancedness of the design is also a problem in heteroscedastic settings, in particular positive/negative pairings of unequal sample sizes and unequal variances (Keselman et al., 1977).

The heteroscedastic alternatives to the ANOVA F-test that have received the most attention are Welch's (1951) test, James's (1951) second-order method, Brown and Forsythe's (1974) test, and Alexander and Govern's (1994) test. All these procedures have been investigated in empirical studies. The evidence suggests that they can generally control the Type I error rate when group variances are heterogeneous and the data are normally distributed (Dijkstra & Werter, 1981; Wilcox, 1990; Oshima & Algina, 1992; Alexander & Govern, 1994; Lix, Keselman, & Keselman, 1996; Gastwirth, Gel, & Miao, 2009). However, the literature also indicates that these tests may be liberal when variances are heterogeneous and the data are non-normal, particularly when the design is unbalanced.

The search continues for a good procedure to test the equality of several normal means when the variances are unknown and arbitrary. Even though several tests are available, none performs well in terms of the Type I error probability under every degree of heteroscedasticity and for non-normal populations. Indeed, Type I error rates can be highly inflated or deflated for some of the commonly used tests. Finding statistical techniques that perform well under all conditions, in particular across probability distributions and sample-size configurations, therefore remains a research goal.

Works based on the parametric bootstrap have recently appeared. Krishnamoorthy, Lu, and Mathew (2007) propose a parametric bootstrap procedure for comparing the means of independent groups when the variances of the groups are unequal. This procedure has been studied and compared with the Welch and James tests by Cribbie, Fiksenbaum, Keselman, and Wilcox (2012). Chang, Pal, Lim, and Lin (2010) test the equality of several means, but under homoscedastic normal populations. They introduce a recently developed computational approach test (CAT), which is essentially a parametric bootstrap method; Chang, Lin, and Pal (2011) study it for the gamma distribution. Li, Wang, and Liang (2011) propose a fiducial-based test of the equality of several normal means when the variances are unknown and unequal. They compare it with the generalized F-test of Weerahandi (1995), the trimmed-mean-based test of Lix and Keselman (1998), and the parametric bootstrap test of Krishnamoorthy et al. (2007). In this paper the focus is on the non-parametric bootstrap, that is, no assumptions are made about any of the parent populations. It has been shown that non-parametric bootstrap methods lead to better Type I error control than non-bootstrap methods when testing trimmed means (Wilcox, Keselman, & Kowalchuk, 1998; Luh & Guo, 1999).

Our interest lies in finding a method for comparing means that controls the Type I error rate under heteroscedasticity and non-normality. The purpose of this paper is to compare the Type I error rates of the usual tests used in applied research for comparing population means, such as ANOVA, the Welch test and the Brown–Forsythe test, with the bootstrap versions of ANOVA, the Welch test, the Brown–Forsythe test and the James test. Bootstrapping James's test may overcome the complex calculations needed to obtain its critical values. A competitor of the James test is the Alexander and Govern (1994) test, although we omit it owing to the complexity of the test statistic for applied researchers.

We do not consider non-parametric alternatives (e.g., the Wilcoxon–Mann–Whitney and Kruskal–Wallis tests) for comparing means, since they are known to be sensitive to differences in the shapes and variances of the distributions. They compare a location parameter only when the population distributions have the same shape and identical variances (Vargha & Delaney, 1998; Fagerland & Sandvik, 2009); otherwise they test for stochastic homogeneity rather than equality of means. When these conditions do not hold, the Welch test is preferable in many situations (Skovlund & Fenstad, 2001; Fagerland & Sandvik, 2009). The Wilcoxon–Mann–Whitney test shows a lack of robustness (small differences in variances and moderate degrees of skewness can produce large deviations from the nominal Type I error rate), even in large samples (Pratt, 1964; Skovlund & Fenstad, 2001; Fagerland & Sandvik, 2009). In fact, the same population mean and shape, but a different scale, are enough to produce considerable inflation of the Type I error rate (Reiczigel, Zakariás, & Rózsa, 2005).

This simulation study differs from others in its use of non-parametric bootstrap techniques to test equality of population means; see Lix et al. (1996), Chang et al. (2010, 2011) and Cribbie et al. (2012). We also study the behaviour of the tests when the population distributions are dissimilar. Cribbie et al. (2012) likewise study the effect of dissimilar distribution shapes across treatment groups, noting that this situation has received very little attention in the methodological literature despite being far from uncommon in applied research. The main differences between our research and that of Cribbie et al. (2012) are that we investigate more standard tests (we include the ANOVA and Brown–Forsythe tests) and that we use non-parametric bootstrap versions of the tests.

The simulation results show that bootstrap tests have a better control of the Type I error rate than non-bootstrap tests. The bootstrap ANOVA and bootstrap Brown–Forsythe tests in particular perform very well.

2. Description of tests


Let Y_{ij}, i = 1, …, k, j = 1, …, n_i, denote the jth observation from the ith group, where n_i is the sample size of the ith group and k is the number of groups or treatments.

The one-way ANOVA (A) test statistic is

  F = \frac{\sum_{i=1}^{k} n_i (\bar{Y}_i - \bar{Y})^2 / (k-1)}{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2 / (N-k)},

where

  \bar{Y}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} Y_{ij},
  \bar{Y} = \frac{1}{N} \sum_{i=1}^{k} n_i \bar{Y}_i,
  N = \sum_{i=1}^{k} n_i.

The null hypothesis, that the population means are equal, is rejected if the computed F statistic is higher than F_{k-1,N-k;\alpha}, the 1−α quantile of the F distribution with k−1 and N−k degrees of freedom.
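The computation is straightforward; the following is a minimal sketch in Python (NumPy/SciPy), not the authors' code, with illustrative function and variable names.

```python
# Minimal sketch (illustrative): one-way ANOVA F test for k independent groups,
# each group supplied as a 1-D NumPy array.
import numpy as np
from scipy import stats

def anova_f(groups, alpha=0.05):
    k = len(groups)
    n = np.array([len(g) for g in groups])
    N = n.sum()
    means = np.array([g.mean() for g in groups])
    grand_mean = np.concatenate(groups).mean()
    ss_between = np.sum(n * (means - grand_mean) ** 2)            # between-group sum of squares
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)  # within-group sum of squares
    F = (ss_between / (k - 1)) / (ss_within / (N - k))
    p = stats.f.sf(F, k - 1, N - k)                               # P(F_{k-1, N-k} > F)
    return F, p, p < alpha                                        # reject H0 if p < alpha
```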

The Welch (W) test statistic is given by

  Q = \frac{\sum_{i=1}^{k} w_i (\bar{Y}_i - \tilde{Y})^2 / (k-1)}{1 + \dfrac{2(k-2)}{k^2-1} \sum_{i=1}^{k} \dfrac{(1 - w_i/W)^2}{n_i - 1}},

where

  w_i = \frac{n_i}{s_i^2},
  W = \sum_{i=1}^{k} w_i,
  \tilde{Y} = \frac{1}{W} \sum_{i=1}^{k} w_i \bar{Y}_i,

  s_i^2 = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2.

The Welch statistic is approximately F-distributed with k−1 and v degrees of freedom, where

  v = \frac{k^2 - 1}{3 \sum_{i=1}^{k} (1 - w_i/W)^2 / (n_i - 1)}.

The null hypothesis, that the population means are equal, is rejected if the computed Q statistic is higher than F_{k-1,v;\alpha}, the 1−α quantile of the F distribution with k−1 and v degrees of freedom.
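A corresponding sketch of the Welch statistic, again with purely illustrative names, follows the formulas above.

```python
# Minimal sketch (illustrative): Welch's heteroscedastic ANOVA, returning the
# statistic Q, the approximate degrees of freedom v and the p-value.
import numpy as np
from scipy import stats

def welch_q(groups, alpha=0.05):
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    means = np.array([g.mean() for g in groups])
    s2 = np.array([g.var(ddof=1) for g in groups])   # unbiased sample variances s_i^2
    w = n / s2                                       # weights w_i = n_i / s_i^2
    W = w.sum()
    y_tilde = np.sum(w * means) / W                  # variance-weighted grand mean
    tail = np.sum((1 - w / W) ** 2 / (n - 1))
    Q = (np.sum(w * (means - y_tilde) ** 2) / (k - 1)) / (1 + 2 * (k - 2) / (k ** 2 - 1) * tail)
    v = (k ** 2 - 1) / (3 * tail)                    # approximate denominator df
    p = stats.f.sf(Q, k - 1, v)
    return Q, v, p, p < alpha
```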

The Brown–Forsythe (BF) test statistic is

  F^* = \frac{\sum_{i=1}^{k} n_i (\bar{Y}_i - \bar{Y})^2}{\sum_{i=1}^{k} (1 - n_i/N) s_i^2}.

F* is approximately distributed as an F variable with v1 and v2 degrees of freedom, where

  v_1 = k - 1,

  \frac{1}{v_2} = \sum_{i=1}^{k} \frac{c_i^2}{n_i - 1},

and

  c_i = \frac{(1 - n_i/N) s_i^2}{\sum_{j=1}^{k} (1 - n_j/N) s_j^2}.

Mehrotra (1997) shows that if the numerator degrees of freedom of F* are modified such that

  v_1 = \frac{\left[\sum_{i=1}^{k} (1 - n_i/N) s_i^2\right]^2}{\sum_{i=1}^{k} s_i^4 + \left[\sum_{i=1}^{k} (n_i/N) s_i^2\right]^2 - 2 \sum_{i=1}^{k} (n_i/N) s_i^4},

the Type I error rate is inflated less often. This gives rise to the Brown–Forsythe–Mehrotra (BFM) test.
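The two variants differ only in the numerator degrees of freedom, as the following sketch (illustrative names, not the authors' code) makes explicit.

```python
# Minimal sketch (illustrative): Brown–Forsythe statistic F*, its approximate
# denominator degrees of freedom, and Mehrotra's (1997) modified numerator df.
import numpy as np
from scipy import stats

def brown_forsythe(groups, mehrotra=False, alpha=0.05):
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    N = n.sum()
    means = np.array([g.mean() for g in groups])
    s2 = np.array([g.var(ddof=1) for g in groups])
    grand_mean = np.sum(n * means) / N
    denom = (1 - n / N) * s2                          # terms (1 - n_i/N) s_i^2
    F_star = np.sum(n * (means - grand_mean) ** 2) / denom.sum()
    c = denom / denom.sum()
    v2 = 1.0 / np.sum(c ** 2 / (n - 1))               # Satterthwaite-type denominator df
    if mehrotra:                                      # BFM: modified numerator df
        v1 = denom.sum() ** 2 / (np.sum(s2 ** 2)
                                 + np.sum(n * s2 / N) ** 2
                                 - 2 * np.sum(n * s2 ** 2 / N))
    else:                                             # original BF: numerator df = k - 1
        v1 = k - 1
    p = stats.f.sf(F_star, v1, v2)
    return F_star, v1, v2, p, p < alpha
```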

Finally, the James test statistic is given by

  J = \sum_{i=1}^{k} w_i (\bar{Y}_i - \tilde{Y})^2,

with w_i, W and \tilde{Y} defined as for the Welch test. The null hypothesis is rejected when J exceeds a critical value obtained from James's (1951) second-order series expansion, whose computation is considerably more laborious than looking up an F quantile.

The idea of the bootstrap method is to approximate the distribution of the test statistic under the null hypothesis. Following the usual method of bootstrap hypothesis testing (Efron & Tibshirani, 1993), the samples Y_{ij} are first transformed so as to satisfy the null hypothesis; denote the transformed observations by Y*_{ij}. The null distribution of the test statistic is obtained by making B draws of k bootstrap samples, one from each pseudo-population Y*_{i1}, …, Y*_{in_i}, and calculating the test statistic for each group of k bootstrap samples. This leads to B observations of the test statistic based on the bootstrap samples, that is, the bootstrap distribution of the test statistic. A bootstrap procedure for comparing means is as follows. We shift each sample so that it has mean 0 and then resample each pseudo-population separately; the transformed observations are given by Y*_{ij} = Y_{ij} − \bar{Y}_i. The statistic T is calculated for each group of k bootstrap samples, giving T*(b), b = 1, …, B. Let R be the number of times that T*(b) ≥ T; then the bootstrap p-value is R/B. We set B = 1,000.

We bootstrap the ANOVA, Welch, Brown–Forsythe and James tests, denoted as B-A, B-W, B-BF and B-J, respectively.
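A minimal sketch of this resampling scheme is given below, assuming a generic `statistic` function that returns the test statistic for a list of samples; any of the statistics above could be plugged in. The names are illustrative, not the authors' code.

```python
# Minimal sketch (illustrative) of the non-parametric bootstrap test: centre each
# group at its own mean (imposing H0 of equal means), resample with replacement
# within groups, and compare the observed statistic with its bootstrap distribution.
import numpy as np

def bootstrap_pvalue(groups, statistic, B=1000, seed=None):
    rng = np.random.default_rng(seed)
    T_obs = statistic(groups)                               # statistic on the original data
    centred = [g - g.mean() for g in groups]                # pseudo-populations with mean 0
    count = 0
    for _ in range(B):
        star = [rng.choice(c, size=len(c), replace=True) for c in centred]
        if statistic(star) >= T_obs:                        # T*(b) >= T
            count += 1
    return count / B                                        # bootstrap p-value R/B

# Example: bootstrap ANOVA (B-A) using SciPy's F statistic.
# from scipy import stats
# p = bootstrap_pvalue(groups, lambda gs: stats.f_oneway(*gs).statistic)
```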

3. Design of the simulation


The Type I error rate of the tests is investigated in a simulation study under different settings: with and without homoscedasticity; normal and non-normal distributions; symmetric and asymmetric distributions; identical and dissimilar distributions; light- and heavy-tailed distributions; negative and positive pairings of variances and sample sizes; and small, large, equal and unequal sample sizes. The nominal 5% significance level is used throughout. The simulation results are based on 5,000 replications for the non-bootstrap tests, and on 1,000 bootstrap replications within each of 5,000 Monte Carlo replications for the bootstrap tests.

For non-normal distributions we use Student's t with 4 degrees of freedom, the uniform (symmetric and light-tailed), the chi square with 4 d.f. (skewed and light-tailed), the exponential with mean 1/3 (skewed and light-tailed) and the g-and-h distribution. Depending on the values of g and h (see Table 1) this may be symmetric or asymmetric, with low, medium or high kurtosis.

Table 1. Characteristics of the g-and-h distributions used in the simulation study

                Symmetric                              Asymmetric
                (0, 0.14)   (0, 0.2)   (0, 0.22)       (0.81, 0)   (1, 0)
  g             0           0          0               0.81        1
  h             0.14        0.2        0.22            0           0
  μ             0           0          0               0.48        0.65
  σ             1.28        1.47       1.54            1.65        2.16
  γ1            0           0          0               3.8         6.2
  γ2            5.7         33.2ᵃ      103ᵃ            33.3ᵃ       111ᵃ

ᵃ Heavy-tailed distributions.
In order to generate data from populations with the desired variances inline image and so maintain equal population means, the samples from Student's t (t4) distributions are transformed by multiplying by inline image . To obtain data from the chi square distribution, the transformation is inline image, where Xi is a chi square variable with 4 d.f. Data from an exponential distribution are transformed by inline image, assuming an exponential distribution with mean 1/3 for Gi. For the uniform distribution, appropriate parameters are used with the same objective. Data are generated from a uniform, with minimum and maximum values given by inline image and inline image, respectively.

To generate data from the g-and-h distribution, we follow Cribbie et al. (2012) and Hoaglin (1985). The values of g and h selected for the investigation are included in Table 1 and the g-and-h distributions are denoted in subsequent tables as (g, h). Tables 1 and 2 include the mean (μ), standard deviation (σ), asymmetry (γ1) and excess kurtosis (γ2) of the distributions used in this study:

  \gamma_1 = \frac{\mu_3}{\sigma^3}, \qquad \gamma_2 = \frac{\mu_4}{\sigma^4} - 3,

where

  \mu_r = E\left[(Y - \mu)^r\right] denotes the rth central moment.
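For illustration, the sketch below generates g-and-h data from standard normal deviates using the usual Hoaglin (1985) transform, which the paper cites; the code is an assumption-laden illustration rather than the authors' exact generation routine, and any rescaling to the target group variances (described in the text above) is omitted.

```python
# Minimal sketch: draw from a g-and-h distribution by transforming standard
# normal deviates Z (Hoaglin, 1985). For g = 0 the transform reduces to
# Z * exp(h * Z^2 / 2); h = 0 gives the skewed g-distribution.
import numpy as np

def g_and_h_sample(size, g, h, seed=None):
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(size)
    if g == 0:
        return z * np.exp(h * z ** 2 / 2)
    return (np.exp(g * z) - 1) / g * np.exp(h * z ** 2 / 2)

# e.g., one of the asymmetric cases listed in Table 1:
y = g_and_h_sample(25, g=0.81, h=0)
```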
Table 2. Characteristics of the normal, t, uniform, chi square and exponential distributions used in the simulation study

                Symmetric                              Asymmetric
                N(0,1)      t4         Uniform         χ²(4)       Exp(3)
  μ             0           0          0               4           1/3
  σ             1           1.41                       2.83        1/3
  γ1            0           0          0               1.41        2
  γ2            0                      1.8             3           6

Unequal group sizes, when paired with unequal variances, can affect Type I error control for tests that compare the typical score across groups (Keselman et al., 1998; Keselman, Cribbie, & Holland, 2002; Othman, Keselman, Padmanabhan, Wilcox, Algina, & Fradette, 2004; Syed Yahaya, Othman, & Keselman, 2006). Therefore, we pair the sample sizes and variances positively and negatively. A positive pairing occurs when the largest sample size is generated from the population with the largest variance, while the smallest sample size is associated with the smallest population variance. In a negative pairing, the largest sample size is paired with the smallest population variance and the smallest sample size with the largest.

Seven configurations of small samples are studied. Three have equal sample sizes: (10, 10, 10), (15, 15, 15), and (25, 25, 25). Four are unbalanced designs: (10, 10, 15), (10, 10, 25), (10, 15, 15), and (15, 20, 25). Four configurations of large samples are also studied: (30, 30, 30), (60, 60, 60), (30, 40, 60), and (50, 60, 70).

We use a homoscedastic setting with all variances equal to 1, and different heteroscedastic settings. In particular, the following combinations of standard deviations (σ1, σ2, σ3), from mild to extreme heteroscedasticity, are applied: inline image, inline image and inline image. Seven combinations are included, taking into account the reverse of the above (for negative pairing) and homoscedasticity.

Unequal sample size combinations are studied with each of the seven combinations of variances, giving a total of 42 cases. Equal sample size combinations are studied with four variance combinations, giving a total of 20 cases. High heteroscedasticity is represented by inline image and inline image. The first combination is studied with all sample size combinations (11 cases) and the second with unequal sample size combinations (6 cases). So, of 62 cases considered (for each combination of distributions), 17 are cases of high heteroscedasticity and the rest are of low heteroscedasticity, including homoscedasticity.

Empirical Type I error rates were recorded for all tests. The robustness of a procedure, with respect to Type I error control, was determined using Bradley's (1978) liberal criterion: a procedure is deemed robust if its empirical Type I error rate falls within the range [0.5α, 1.5α]. For α = .05 this interval is [.025, .075]. An empirical rate above .075 indicates a liberal test, and one below .025 a conservative test.
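As an illustration of how such empirical rates are obtained, the following hedged sketch runs the Monte Carlo loop under H0 and applies Bradley's check; it uses normal populations only for brevity, and the sample sizes, standard deviations and test function are placeholders rather than the paper's exact configurations.

```python
# Minimal sketch: empirical Type I error rate of a test under H0 (equal means,
# possibly unequal variances), classified by Bradley's liberal criterion.
import numpy as np
from scipy import stats

def empirical_type1(test_pvalue, sizes, sds, reps=5000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        # all groups share the same mean (H0 true); variances may differ
        groups = [rng.normal(0.0, sd, size=m) for m, sd in zip(sizes, sds)]
        if test_pvalue(groups) < alpha:
            rejections += 1
    rate = rejections / reps
    robust = 0.5 * alpha <= rate <= 1.5 * alpha      # Bradley: [.025, .075] at alpha = .05
    return rate, robust

# Example: classical ANOVA under a negative pairing of sizes and variances.
rate, ok = empirical_type1(lambda gs: stats.f_oneway(*gs).pvalue,
                           sizes=(25, 15, 10), sds=(1.0, 2.0, 3.0))
```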

4. Simulation results


Tables 3–6 show the results obtained for the Type I error rate in the simulation study. Tables 3 and 5 show the percentages of cases where the tests do not exhibit control of the Type I error rate under different combinations of population distributions, that is, where the Type I error rate falls outside the interval [.025, .075]. These percentages are calculated separately for low (45 cases) and high (17 cases) heteroscedasticity. Tables 4 and 6 summarize the corresponding mean Type I error rates.

Table 3. Percentage of cases where the Type I error rate falls outside the interval [.025, .075] when parent populations are identical

                                                Non-bootstrap tests            Bootstrap tests
  Distribution             Heteroscedasticity   A      W      BF     BFM       B-A    B-W    B-BF   B-J
  Symmetric distributions
  3 × Normal               Low                  4.4    0      0      8.9       0      0      0      4.4
                           High                 41.2   0      0      17.6      0      0      0      0
  3 × Student's t          Low                  6.7    0      0      0.0       0      2.2    0      37.8
                           High                 47.1   0      0      5.9       0      23.5   0      17.6
  3 × Uniform              Low                  4.4    0      0      6.7       0      0      0      4.4
                           High                 47.1   0      0      47.1      0      0      0      0
  3 × (0, 0.2)             Low                  4.4    0      0      0.0       0      6.7    0      51.1
                           High                 47.1   0      0      5.9       0      17.6   0      29.4
  Asymmetric distributions
  3 × Chi square           Low                  8.9    0      0      8.9       0      0      0      2.2
                           High                 52.9   58.8   0      70.6      0      0      0      0
  3 × Exponential          Low                  4.4    11.1   0      0         0      4.4    0      2.2
                           High                 35.3   70.6   17.6   41.2      0      0      0      0
  3 × (0.81, 0)            Low                  4.4    8.9    0      0         42.2   42.2   42.2   0
                           High                 58.8   100    52.9   70.6      0      64.7   0      5.9
Table 4. Mean Type I error rates when parent populations are identical

Note. Shaded cells are those situations where the test controls the Type I error rate in all cases considered according to Bradley's criterion (see Table 3). Liberal tests are in bold.

                                                Non-bootstrap tests            Bootstrap tests
  Distribution             Heteroscedasticity   A      W      BF     BFM       B-A    B-W    B-BF   B-J
  Symmetric distributions
  3 × Normal               Low                  .053   .051   .051   .067      .049   .046   .047   .038
                           High                 .077   .051   .063   .070      .050   .043   .049   .044
  3 × Student's t          Low                  .050   .044   .047   .061      .041   .036   .040   .029
                           High                 .073   .045   .057   .063      .042   .033   .041   .035
  3 × Uniform              Low                  .053   .053   .052   .068      .049   .046   .048   .039
                           High                 .079   .055   .064   .071      .050   .043   .050   .050
  3 × (0, 0.2)             Low                  .049   .042   .045   .059      .038   .032   .038   .025
                           High                 .072   .043   .055   .061      .038   .031   .038   .032
  Asymmetric distributions
  3 × Chi square           Low                  .052   .055   .049   .063      .043   .044   .042   .038
                           High                 .083   .077   .067   .073      .051   .055   .050   .047
  3 × Exponential          Low                  .051   .058   .047   .059      .038   .040   .037   .037
                           High                 .066   .088   .063   .071      .048   .059   .047   .042
  3 × (0.81, 0)            Low                  .048   .056   .042   .053      .029   .031   .030   .035
                           High                 .096   .129   .078   .084      .054   .080   .054   .053
Table 5. Percentage of cases where the Type I error rate falls outside the interval [.025, .075] when parent populations are dissimilar

Note. HT = heavy-tailed.

                                                          Non-bootstrap tests            Bootstrap tests
  Distributions                      Heteroscedasticity   A      W      BF     BFM       B-A    B-W    B-BF   B-J
  Symmetric distributions
  3 symmetric
    Normal, t, Unif                  Low                   11.1   0      0      4.4       0      0      0      15.6
                                     High                  52.9   0      0      17.6      0      0      0      0
  3 symmetric (1 HT)
    Normal, Unif, (0, 0.2)           Low                   8.9    0      0      4.4       0      0      0      11.1
                                     High                  47.1   0      0      11.8      0      0      0      5.9
  3 symmetric (2 HT)
    Normal, (0, 0.14), (0, 0.2)      Low                   6.7    0      0      2.2       0      0      0      28.9
                                     High                  41.2   0      0      5.9       0      11.8   0      5.9
  3 symmetric (3 HT)
    (0, 0.14), (0, 0.2), (0, 0.22)   Low                   6.7    0      0      2.2       0      4.4    0      46.7
                                     High                  41.2   0      0      11.8      0      29.4   0      17.6
  Two symmetric and one asymmetric distributions
  2 symmetric, 1 asymmetric
    Normal, Unif, Chi                Low                   8.9    0      0      11.1      0      0      0      2.2
                                     High                  47.1   5.9    0      47.1      0      0      0      0
    Normal, Unif, Exp                Low                   8.9    0      0      15.6      0      0      0      0
                                     High                  52.9   35.3   23.5   58.8      0      0      0      0
  2 symmetric, 1 asymmetric and HT
    Normal, Unif, (1, 0)             Low                   13.3   6.7    0      8.9       0      0      0      2.2
                                     High                  70.6   64.7   58.8   64.7      0      35.3   0      0.0
  2 symmetric (1 HT), 1 asymmetric
    Normal, Exp, (0, 0.2)            Low                   8.9    8.9    0      6.7       0      0      0      2.2
                                     High                  47.1   47.1   0      0         0      0      0      0
  One symmetric and two asymmetric distributions
  1 symmetric, 2 asymmetric
    Normal, Chi, Exp                 Low                   4.4    0      0      11.1      0      0      0      0
                                     High                  58.8   52.9   0      47.1      0      0      0      0
  1 symmetric, 2 asymmetric (1 HT)
    Normal, Exp, (1, 0)              Low                   13.3   15.6   0      6.7       0      0      0      0
                                     High                  64.7   76.5   35.3   52.9      0      47.1   0      11.8
  1 symmetric and HT, 2 asymmetric
    Chi, Exp, (0, 0.2)               Low                   6.7    11.1   0      6.7       0      0      0      0
                                     High                  41.2   41.2   0      23.5      0      0      0      17.6
  1 symmetric and HT, 2 asymmetric (1 HT)
    (0, 0.2), Chi, (0.81, 0)         Low                   8.9    4.4    0      2.2       0      0      0      4.4
                                     High                  52.9   64.7   29.4   41.2      0      17.6   0      5.9
  Asymmetric distributions
  3 asymmetric
    Chi, Chi, Exp                    Low                   4.4    0      0      2.2       0      0      0      0
                                     High                  52.9   64.7   0      70.6      0      0      0      0
    Chi, Exp, Exp                    Low                   4.4    2.2    0      2.2       0      2.2    0      0
                                     High                  52.9   70.6   17.6   64.7      0      5.9    0      0
  3 asymmetric (1 HT)
    Chi, Exp, (0.81, 0)              Low                   8.9    2.2    0      2.2       4.4    6.7    4.4    0
                                     High                  58.8   88.2   17.6   58.8      0      23.5   0      0
  3 asymmetric (3 HT)
    2 × (0.81, 0), (1, 0)            Low                   4.4    11.1   0      0         37.8   35.6   37.8   6.7
                                     High                  64.7   100    76.5   88.2      0      70.6   0      0
Table 6. Mean Type I error rates when parent populations are dissimilar

Note. Shaded cells are those situations where the test controls the Type I error rate in all cases considered according to Bradley's criterion (see Table 5). Liberal tests are in bold. HT = heavy-tailed.

                                                          Non-bootstrap tests            Bootstrap tests
  Distributions                      Heteroscedasticity   A      W      BF     BFM       B-A    B-W    B-BF   B-J
  Symmetric distributions
  3 symmetric
    Normal, t, Unif                  Low                   .052   .049   .049   .065      .047   .042   .045   .035
                                     High                  .079   .048   .062   .067      .047   .038   .046   .046
  3 symmetric (1 HT)
    Normal, Unif, (0, 0.2)           Low                   .054   .050   .050   .066      .046   .043   .045   .035
                                     High                  .073   .050   .058   .065      .044   .038   .044   .038
  3 symmetric (2 HT)
    Normal, (0, 0.14), (0, 0.2)      Low                   .053   .046   .048   .064      .043   .039   .043   .032
                                     High                  .075   .047   .060   .066      .044   .035   .043   .039
  3 symmetric (3 HT)
    (0, 0.14), (0, 0.2), (0, 0.22)   Low                   .051   .044   .046   .060      .040   .034   .039   .028
                                     High                  .073   .043   .057   .063      .040   .030   .040   .032
  Two symmetric and one asymmetric distributions
  2 symmetric, 1 asymmetric
    Normal, Unif, Chi                Low                   .056   .054   .051   .067      .048   .046   .047   .039
                                     High                  .079   .060   .064   .071      .050   .047   .050   .045
    Normal, Unif, Exp                Low                   .057   .056   .052   .068      .047   .046   .047   .039
                                     High                  .085   .068   .068   .076      .052   .050   .051   .047
  2 symmetric, 1 asymmetric and HT
    Normal, Unif, (1, 0)             Low                   .064   .062   .052   .068      .045   .049   .045   .040
                                     High                  .094   .092   .073   .081      .056   .067   .056   .051
  2 symmetric (1 HT), 1 asymmetric
    Normal, Exp, (0, 0.2)            Low                   .055   .059   .051   .065      .044   .047   .044   .037
                                     High                  .075   .074   .059   .064      .041   .052   .041   .039
  One symmetric and two asymmetric distributions
  1 symmetric, 2 asymmetric
    Normal, Chi, Exp                 Low                   .056   .057   .050   .065      .044   .045   .044   .038
                                     High                  .085   .080   .067   .074      .049   .056   .049   .049
  1 symmetric, 2 asymmetric (1 HT)
    Normal, Exp, (1, 0)              Low                   .060   .062   .050   .063      .040   .044   .040   .038
                                     High                  .092   .113   .072   .079      .050   .072   .049   .054
  1 symmetric and HT, 2 asymmetric
    Chi, Exp, (0, 0.2)               Low                   .054   .060   .050   .064      .043   .045   .042   .037
                                     High                  .076   .080   .062   .068      .043   .055   .043   .040
  1 symmetric and HT, 2 asymmetric (1 HT)
    (0, 0.2), Chi, (0.81, 0)         Low                   .057   .058   .049   .063      .040   .043   .040   .035
                                     High                  .085   .087   .066   .073      .049   .059   .048   .044
  Asymmetric distributions
  3 asymmetric
    Chi, Chi, Exp                    Low                   .052   .055   .048   .062      .041   .043   .041   .037
                                     High                  .087   .083   .069   .075      .051   .057   .051   .048
    Chi, Exp, Exp                    Low                   .052   .056   .048   .061      .039   .042   .039   .037
                                     High                  .087   .096   .070   .077      .051   .063   .051   .049
  3 asymmetric (1 HT)
    Chi, Exp, (0.81, 0)              Low                   .053   .055   .045   .058      .036   .038   .036   .036
                                     High                  .089   .104   .069   .076      .050   .067   .050   .049
  3 asymmetric (3 HT)
    2 × (0.81, 0), (1, 0)            Low                   .050   .058   .043   .053      .028   .031   .028   .033
                                     High                  .101   .142   .081   .089      .056   .085   .055   .055

4.1. Identical distributions

When distributions are symmetric and identical (Table 3), the W and BF tests always control the Type I error rate, even if kurtosis is high. For asymmetric distributions, however, the performance of both tests deteriorates as kurtosis increases, and the BF test can be considered more robust than the W test. In asymmetric distributions the W test requires low heteroscedasticity and very low kurtosis, whereas the BF test controls the Type I error rate whenever heteroscedasticity is low; high heteroscedasticity is not a problem provided the excess kurtosis is low (in our simulation study an excess kurtosis of 6, that of the exponential distribution, already seems to be high).

Simulation results appear to be better for the bootstrap tests than for the non-bootstrap tests. The bootstrap versions of the ANOVA and Brown–Forsythe tests control the Type I error rate for symmetric distributions. If distributions are identical and asymmetric, there is one situation in which both tests show a lack of control and tend to be conservative: high kurtosis (heavy-tailed distributions) combined with low heteroscedasticity. Surprisingly, if heteroscedasticity is high, both perform well. To understand what is happening, see Table 4: the bootstrap technique reduces the Type I error rate relative to the corresponding non-bootstrap test. When distributions are heavy-tailed, this reduction leads in many cases to conservative tests if heteroscedasticity is low, giving mean Type I error rates of .029 and .030, respectively, with a minimum of .018 and a maximum of .049 for the B-A test, and .018 and .047 for the B-BF test. When heteroscedasticity is high, the non-bootstrap A and BF tests tend to be liberal, with mean Type I error rates of .096 and .078, respectively; the reduction induced by the bootstrap yields appropriate rates in all cases for both tests.

4.2. Dissimilar distributions

We also study situations where the population distributions are dissimilar. Percentages of cases where tests do not exhibit control of the Type I error rate under different combinations of dissimilar parent populations are in Table 5. The corresponding mean Type I error rates are included in Table 6.

When distributions are dissimilar, the BF test again seems to be the most level-robust of the non-bootstrap tests, although there are some situations where it fails to control the Type I error rate. Its behaviour is comparable to that observed for identical distributions: it is level-robust if the distributions are symmetric, regardless of kurtosis. The same behaviour is observed for the W test.

When one or two distributions are asymmetric, the BF test may fail only if heteroscedasticity is high and one of the asymmetric distributions is heavy-tailed. In these cases, the maximum value of the estimated Type I error rate was .092. If heteroscedasticity is low, it always controls the Type I error rate and kurtosis does not seem to be a problem. For the W test, however, asymmetry seems to be a problem even with low heteroscedasticity.

When all three distributions are asymmetric, low heteroscedasticity guarantees that the BF test is level-robust, once again regardless of the level of kurtosis. When at least two distributions have an excess kurtosis of about 6 or more, the test tends to be liberal. The percentage of non-robust cases seems to increase as the highest excess kurtosis increases, although the maximum estimated Type I error rate is .095.

The simulation results are much better for the bootstrap tests. There are two tests with similar behaviour, the B-A and B-BF tests, that perform exceptionally well. If at least one of the dissimilar parent populations is symmetric, both control the Type I error rate, whatever the level of kurtosis or heteroscedasticity. When all three distributions are asymmetric and heteroscedasticity is high, both tests control the Type I error rate, regardless of the level of kurtosis. Problems appear only when two conditions hold simultaneously: heteroscedasticity is low and at least one distribution is heavy-tailed. In this case both tests tend to be conservative. This is illustrated by two settings, one with one heavy-tailed distribution [Chi, Exp, (0.81, 0)] and the other with three heavy-tailed distributions [2 × (0.81, 0), (1, 0)]. The minimum estimated values in each scenario are .021 and .018, respectively (for both tests), and the maximum values are .057 and .047, respectively.

In general, simulation results show that bootstrap tests tend to reduce the Type I error rate of the corresponding non-bootstrap test, as is observed in Tables 4 and 6. In all cases, the mean value in the bootstrap test is lower than that in the corresponding non-bootstrap test, with reduction ranging from 8% to 47%.

5. Concluding remarks


When comparing means, the applied researcher has various test procedures to choose from. The best known are the ANOVA, Welch, Brown–Forsythe, Brown–Forsythe–Mehrotra, and James tests. The first requires homoscedasticity; the others can be used under heteroscedasticity. In this paper we evaluate the performance of these procedures with and without bootstrapping (the James test only in its bootstrap version) under different settings: symmetric and asymmetric distributions, light- and heavy-tailed distributions, identical and dissimilar distributions, equal and unequal sample sizes, homoscedasticity and heteroscedasticity. The main purpose is to offer a general view of the performance of these tests in order to compare them and to identify, if they exist, tests that guarantee good results in any situation.

The simulation results reveal that the B-A and B-BF tests perform very well. These tests give practically identical results and fail to control the Type I error rate only when three circumstances occur simultaneously: the three distributions are asymmetric (identical or dissimilar); heteroscedasticity is low; and at least one distribution is heavy-tailed (an excess kurtosis above 33). In that case both tests tend to be conservative, with a minimum estimated Type I error rate of about .018. Both tests control the Type I error rate in all the other settings considered in the simulation study.

References

  • Akritas, M. G., & Papadatos, N. (2004). Heteroscedastic one-way ANOVA and lack-of-fit tests. Journal of the American Statistical Association, 99, 368–382.
  • Alexander, R. A., & Govern, D. M. (1994). A new and simpler approximation for ANOVA under variance heterogeneity. Journal of Educational Statistics, 19, 91–101.
  • Bathke, A. (2004). The ANOVA F test can still be used in some balanced designs with unequal variances and nonnormal data. Journal of Statistical Planning and Inference, 126, 413–422.
  • Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144–152.
  • Brown, M. B., & Forsythe, A. B. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16, 129–132.
  • Chang, C. H., Lin, J. J., & Pal, N. (2011). Testing the equality of several gamma means: A parametric bootstrap method with applications. Computational Statistics, 26, 55–76.
  • Chang, C. H., Pal, N., Lim, W. K., & Lin, J. J. (2010). Comparing several population means: A parametric bootstrap method, and its comparison with usual ANOVA F test as well as ANOM. Computational Statistics, 25, 71–95.
  • Cribbie, R. A., Fiksenbaum, L., Keselman, H. J., & Wilcox, R. R. (2012). Effect of non-normality on test statistics for one-way independent groups designs. British Journal of Mathematical and Statistical Psychology, 65, 56–73.
  • De Beuckelaer, A. (1996). A closer examination on some parametric alternatives to the ANOVA F-test. Statistical Papers, 37, 291–305.
  • Dijkstra, J. B., & Werter, P. S. (1981). Testing the equality of several means when the population variances are unequal. Communications in Statistics – Simulation and Computation, 10, 557–569.
  • Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York, NY: Chapman & Hall.
  • Fagerland, M. W., & Sandvik, L. (2009). The Wilcoxon–Mann–Whitney test under scrutiny. Statistics in Medicine, 28, 1487–1497.
  • Gastwirth, J. L., Gel, Y. R., & Miao, W. (2009). The impact of Levene's test of equality of variances on statistical theory and practice. Statistical Science, 24, 343–360.
  • Glass, G. V., Peckham, P. D., & Sanders, J. R. (1972). Consequences of failure to meet assumptions underlying the fixed effects analyses of variance and covariance. Review of Educational Research, 42, 237–288.
  • Harwell, M. R., Rubinstein, E. N., Hayes, W. S., & Olds, C. C. (1992). Summarizing Monte Carlo results in methodological research: The one- and two-factor fixed effects ANOVA cases. Journal of Educational Statistics, 17, 315–339.
  • Hoaglin, D. C. (1985). Summarizing shape numerically: The g-and-h distributions. In D. C. Hoaglin, F. Mosteller & J. W. Tukey (Eds.), Exploring data tables, trends, and shapes (pp. 461–513). New York, NY: Wiley.
  • James, G. S. (1951). The comparison of several groups of observations when the ratios of the population variances are unknown. Biometrika, 38, 324–329.
  • Kenny, D. A., & Judd, C. M. (1986). Consequences of violating the independence assumption in analysis of variance. Psychological Bulletin, 99, 422–431.
  • Keselman, H. J., Cribbie, R. A., & Holland, B. (2002). Controlling the rate of Type I error over a large set of statistical tests. British Journal of Mathematical and Statistical Psychology, 55, 27–39.
  • Keselman, H. J., Huberty, C., Lix, L. M., Olejnik, S., Cribbie, R. A., Donahue, B., Kowalchuk, R. K., Lowman, L. L., Petoskey, M. D., & Keselman, J. C. (1998). Statistical practices of educational researchers: An analysis of their ANOVA, MANOVA, and ANCOVA analyses. Review of Educational Research, 68, 350–386.
  • Keselman, H. J., Rogan, J. C., & Feir-Walsh, B. J. (1977). An evaluation of some non-parametric and parametric tests for location equality. British Journal of Mathematical and Statistical Psychology, 30, 213–221.
  • Krishnamoorthy, K., Lu, F., & Mathew, T. (2007). A parametric bootstrap approach for ANOVA with unequal variances: Fixed and random models. Computational Statistics & Data Analysis, 51, 5737–5742.
  • Krutchkoff, R. G. (1988). One-way fixed effects analysis of variance when the error variances may be unequal. Journal of Statistical Computation and Simulation, 30, 259–271.
  • Li, X., Wang, J., & Liang, H. (2011). Comparison of several means: A fiducial based approach. Computational Statistics and Data Analysis, 55, 1993–2002.
  • Lix, L. M., & Keselman, H. J. (1998). To trim or not to trim: Tests of location equality under heteroscedasticity and nonnormality. Educational and Psychological Measurement, 58, 409–429.
  • Lix, L. M., Keselman, J. C., & Keselman, H. J. (1996). Consequences of assumption violations revisited: A quantitative review of alternatives to the one-way analysis of variance F test. Review of Educational Research, 66, 579–619.
  • Luh, W. M., & Guo, J. H. (1999). A powerful transformation trimmed mean method for one-way fixed effects ANOVA model under non-normality and inequality of variances. British Journal of Mathematical and Statistical Psychology, 52, 303–320.
  • Mehrotra, D. V. (1997). Improving the Brown–Forsythe solution to the generalized Behrens–Fisher problem. Communications in Statistics – Simulation and Computation, 26, 1139–1145.
  • Oshima, T. C., & Algina, J. (1992). Type I error rates for James's second-order test and Wilcox's Hm test under heteroscedasticity and non-normality. British Journal of Mathematical and Statistical Psychology, 45, 255–263.
  • Othman, A. R., Keselman, H. J., Padmanabhan, A. R., Wilcox, R. R., Algina, J., & Fradette, K. (2004). Comparing measures of the 'typical' score across treatment groups. British Journal of Mathematical and Statistical Psychology, 57, 215–234.
  • Pratt, J. W. (1964). Robustness of some procedures for the two-sample location problem. Journal of the American Statistical Association, 59, 665–680.
  • Reiczigel, J., Zakariás, I., & Rózsa, L. (2005). A bootstrap test of stochastic equality of two populations. American Statistician, 59, 1–6.
  • Rogan, J. C., & Keselman, H. J. (1977). Is the ANOVA F-test robust to variance heterogeneity when sample sizes are equal? American Educational Research Journal, 14, 493–498.
  • Scheffé, H. (1959). The analysis of variance. New York, NY: Wiley.
  • Skovlund, E., & Fenstad, G. U. (2001). Should we always choose a nonparametric test when comparing two apparently nonnormal distributions? Journal of Clinical Epidemiology, 54, 86–92.
  • Syed Yahaya, S. S., Othman, A. R., & Keselman, H. J. (2006). Comparing the 'typical score' across independent groups based on different criteria for trimming. Metodološki Zvezki, 3, 49–62.
  • Vargha, A., & Delaney, H. D. (1998). The Kruskal–Wallis test and stochastic homogeneity. Journal of Educational and Behavioral Statistics, 23, 170–192.
  • Weerahandi, S. (1995). ANOVA under unequal error variances. Biometrics, 51, 589–599.
  • Welch, B. L. (1951). On the comparison of several mean values: An alternative approach. Biometrika, 38, 330–336.
  • Wilcox, R. R. (1990). Comparing the means of two independent groups. Biometrical Journal, 32, 771–780.
  • Wilcox, R. R., Keselman, H. J., & Kowalchuk, R. K. (1998). Can tests for treatment group equality be improved? The bootstrap and trimmed means conjecture. British Journal of Mathematical and Statistical Psychology, 51, 123–134.