Abstract
Of the several tests for comparing population means, the best known are the ANOVA, Welch, Brown–Forsythe, and James tests. Each performs appropriately only under certain conditions, and none performs well in every setting. Researchers therefore have to select the appropriate procedure, running the risk of making a bad selection and, consequently, of drawing erroneous conclusions. It would be desirable to have a test that performs well in any situation, obviating preliminary analysis of the data. We assess and compare several tests for equality of means in a simulation study, including nonparametric bootstrap techniques, and find that the bootstrap ANOVA and bootstrap Brown–Forsythe tests exhibit similar and exceptionally good behaviour.
1. Introduction
Comparing treatment groups is a common task in the behavioural sciences. The analysis of variance (ANOVA) and the Welch (1951) test are the main procedures used for comparing population means and are commonly included in standard statistical packages.
The ANOVA F test is known to be the best under the classical assumptions (normality, homoscedasticity and independence). However, when one or more of these basic assumptions is violated, especially independence and/or homoscedasticity, the F test becomes overly conservative or liberal. The only instance in which the researcher may obtain a valid test under heteroscedasticity is when the degree of variance heterogeneity is small and group sizes are equal. Numerous papers have investigated the behaviour of the ANOVA F test under assumption violations and under various degrees of each violation (e.g., Akritas & Papadatos, 2004; Bathke, 2004; De Beuckelaer, 1996; Glass, Peckham, & Sanders, 1972; Harwell, Rubinstein, Hayes, & Olds, 1992; Kenny & Judd, 1986; Keselman, Rogan, & Feir-Walsh, 1977; Krutchkoff, 1988; Rogan & Keselman, 1977; Scheffé, 1959; Weerahandi, 1995). The balancedness of the design is also a problem in heteroscedastic settings, in particular positive/negative pairings of unequal sample sizes and unequal variances (Keselman et al., 1977).
The heteroscedastic alternatives to the ANOVA F test that have received most attention are Welch's (1951) test, James's (1951) second-order method, Brown and Forsythe's (1974) test, and Alexander and Govern's (1994) test. All these procedures have been investigated in empirical studies. The evidence suggests that they can generally control the Type I error rate when group variances are heterogeneous and the data are normally distributed (Dijkstra & Werter, 1981; Wilcox, 1990; Oshima & Algina, 1992; Alexander & Govern, 1994; Lix, Keselman, & Keselman, 1996; Gastwirth, Gel, & Miao, 2009). However, the literature also indicates that these tests may be liberal when the data are heteroscedastic and non-normal, particularly when the design is unbalanced.
The search continues for a good procedure for testing the equality of several normal means when the variances are unknown and arbitrary. Although several tests are available, none performs well in terms of the Type I error probability under every degree of heteroscedasticity and non-normality; indeed, Type I error rates can be highly inflated or deflated for some commonly used tests. Finding statistical techniques that perform well under all conditions, especially across probability distributions and sample sizes, therefore remains a research goal.
Works based on the parametric bootstrap have recently appeared. Krishnamoorthy, Lu, and Mathew (2007) propose a parametric bootstrap procedure for comparing the means of independent groups when the group variances are unequal. This procedure has been studied and compared with the Welch and James tests by Cribbie, Fiksenbaum, Keselman, and Wilcox (2012). Chang, Pal, Lim, and Lin (2010) test the equality of several means, but under homoscedastic normal populations; they introduce a recently developed computational approach test (CAT), which is essentially a parametric bootstrap method, and Chang, Lin, and Pal (2011) study it for the gamma distribution. Li, Wang, and Liang (2011) propose a fiducial-based test of the equality of several normal means when the variances are unknown and unequal, comparing it with the generalized F test of Weerahandi (1995), the trimmed-mean-based test of Lix and Keselman (1998), and the parametric bootstrap test of Krishnamoorthy et al. (2007). In this paper the focus is on the nonparametric bootstrap, which involves no assumptions about any of the parent populations. Nonparametric bootstrap methods have been shown to yield better Type I error control than non-bootstrap methods when testing trimmed means (Wilcox, Keselman, & Kowalchuk, 1998; Luh & Guo, 1999).
Our interest lies in finding a method for comparing means that controls the Type I error rate under heteroscedasticity and non-normality. The purpose of this paper is to compare the Type I error rates of the tests usually employed in applied research for comparing population means, such as ANOVA, the Welch test and the Brown–Forsythe test, with the bootstrap versions of ANOVA, the Welch test, the Brown–Forsythe test and the James test. Bootstrapping James's test may overcome the complex calculations needed to obtain its critical values. A competitor of the James test is the Alexander and Govern (1994) test, which we omit because of the complexity of its test statistic for applied researchers.
We do not consider nonparametric alternatives (e.g., the Wilcoxon–Mann–Whitney and Kruskal–Wallis tests) for comparing means, since they are known to be sensitive to differences in the shapes and variances of the distributions. They compare a location parameter only when the population distributions have the same shape and identical variances (Vargha & Delaney, 1998; Fagerland & Sandvik, 2009); otherwise they test stochastic homogeneity. When such conditions do not hold, the Welch test is preferable in many situations (Skovlund & Fenstad, 2001; Fagerland & Sandvik, 2009). The Wilcoxon–Mann–Whitney test shows a lack of robustness (small differences in variances and moderate degrees of skewness can produce large deviations from the nominal Type I error rate), even in large samples (Pratt, 1964; Skovlund & Fenstad, 2001; Fagerland & Sandvik, 2009). In fact, the same population mean and shape but a different scale are enough to produce considerable Type I error rate inflation (Reiczigel, Zakariás, & Rózsa, 2005).
This simulation study differs from others in its use of nonparametric bootstrap techniques to test equality of population means; compare Lix et al. (1996), Chang et al. (2010, 2011) and Cribbie et al. (2012). We also study the behaviour of the tests when the population distributions are dissimilar. Cribbie et al. (2012) likewise study the effect of dissimilar distribution shapes across treatment groups, noting that this issue has received very little attention in the methodological literature, despite being a not uncommon situation in applied research. The main differences between our research and that of Cribbie et al. (2012) are that we investigate more standard tests (we include the ANOVA and Brown–Forsythe tests) and that we use nonparametric bootstrap versions of the tests.
The simulation results show that the bootstrap tests achieve better control of the Type I error rate than the non-bootstrap tests. The bootstrap ANOVA and bootstrap Brown–Forsythe tests in particular perform very well.
2. Description of tests
Let Y_{ij}, i = 1, …, k and j = 1, …, n_{i}, denote the jth observation from the ith group, where n_{i} is the sample size of the ith group and k is the number of groups or treatments.
The one-way ANOVA (A) test statistic is
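In the notation above, with $\bar{Y}_i$ the mean of the $i$th group, $\bar{Y}$ the grand mean and $N=\sum_{i=1}^{k} n_i$, the statistic takes the standard form

$$F=\frac{\sum_{i=1}^{k} n_i(\bar{Y}_i-\bar{Y})^2/(k-1)}{\sum_{i=1}^{k}\sum_{j=1}^{n_i}(Y_{ij}-\bar{Y}_i)^2/(N-k)}.$$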
The null hypothesis, that the population means are equal, is rejected if the computed F statistic is higher than F_{k−1,N−k;α}, the 1−α quantile of the F distribution with k−1 and N−k degrees of freedom.
The Welch (W) test statistic is given by
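Writing $s_i^2$ for the $i$th sample variance, $w_i=n_i/s_i^2$, $U=\sum_{i=1}^{k} w_i$ and $\tilde{Y}=\sum_{i=1}^{k} w_i\bar{Y}_i/U$, the statistic is usually stated as

$$Q=\frac{\sum_{i=1}^{k} w_i(\bar{Y}_i-\tilde{Y})^2/(k-1)}{1+\dfrac{2(k-2)}{k^2-1}\sum_{i=1}^{k}\dfrac{(1-w_i/U)^2}{n_i-1}}.$$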
The Welch statistic is approximately F-distributed with k−1 and v degrees of freedom, where
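With $w_i=n_i/s_i^2$ and $U=\sum_{i=1}^{k} w_i$, the approximate degrees of freedom are

$$v=\frac{k^2-1}{3\sum_{i=1}^{k}(1-w_i/U)^2/(n_i-1)}.$$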
The null hypothesis, that the population means are equal, is rejected if the computed Q statistic is higher than F_{k−1,v;α}, the 1−α quantile of the F distribution with k−1 and v degrees of freedom.
The Brown–Forsythe (BF) test statistic is
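With $\bar{Y}$ the grand mean, $s_i^2$ the $i$th sample variance and $N=\sum_{i=1}^{k} n_i$, the statistic has the standard form

$$F^{*}=\frac{\sum_{i=1}^{k} n_i(\bar{Y}_i-\bar{Y})^2}{\sum_{i=1}^{k}(1-n_i/N)\,s_i^2}.$$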
F* is approximately distributed as an F variable with v_{1} and v_{2} degrees of freedom, where
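In Brown and Forsythe's (1974) approximation, $v_1=k-1$ and $v_2$ is given by the Satterthwaite-type formula

$$\frac{1}{v_2}=\sum_{i=1}^{k}\frac{c_i^2}{n_i-1},\qquad c_i=\frac{(1-n_i/N)\,s_i^2}{\sum_{j=1}^{k}(1-n_j/N)\,s_j^2}.$$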
Mehrotra (1997) shows that if the numerator degrees of freedom of F^{*} are modified such that
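(in the form usually quoted in the literature)

$$v_1=\frac{\left[\sum_{i=1}^{k}(1-n_i/N)\,s_i^2\right]^2}{\sum_{i=1}^{k} s_i^4+\left[\sum_{i=1}^{k}(n_i/N)\,s_i^2\right]^2-2\sum_{i=1}^{k}(n_i/N)\,s_i^4},$$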
the Type I error rate is inflated less often. This gives rise to the Brown–Forsythe–Mehrotra (BFM) test.
Finally, the James test statistic is given by
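With $w_i=n_i/s_i^2$ and $\tilde{Y}=\sum_{i=1}^{k} w_i\bar{Y}_i/\sum_{i=1}^{k} w_i$ as in the Welch test, the statistic is

$$J=\sum_{i=1}^{k} w_i(\bar{Y}_i-\tilde{Y})^2,$$

which James (1951) compares with a second-order critical value obtained from a lengthy series expansion in the weights $w_i$; it is precisely this computation that bootstrapping avoids.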
We bootstrap the ANOVA, Welch, Brown–Forsythe and James tests, denoted as BA, BW, BBF and BJ, respectively.
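Although implementations differ in detail, the usual nonparametric bootstrap scheme for such tests centres each group at its own sample mean (so that the null hypothesis of equal means holds in the resampling world), resamples with replacement within groups, and takes the p-value as the proportion of bootstrap statistics at least as large as the observed one. The following Python sketch (illustrative only, not the authors' code) shows the idea for the ANOVA statistic:

```python
import numpy as np

def anova_f(groups):
    """One-way ANOVA F statistic for a list of 1-D arrays."""
    k = len(groups)
    n = np.array([len(g) for g in groups])
    N = n.sum()
    means = np.array([g.mean() for g in groups])
    grand = np.concatenate(groups).mean()
    ss_between = np.sum(n * (means - grand) ** 2)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (N - k))

def bootstrap_anova(groups, n_boot=1000, seed=0):
    """Nonparametric bootstrap ANOVA test (sketch).

    Each group is centred at its own mean so that H0 holds in the
    resampled data; the p-value is the proportion of bootstrap F
    statistics at least as large as the observed one.
    """
    rng = np.random.default_rng(seed)
    f_obs = anova_f(groups)
    centred = [g - g.mean() for g in groups]
    count = 0
    for _ in range(n_boot):
        resampled = [rng.choice(c, size=len(c), replace=True) for c in centred]
        if anova_f(resampled) >= f_obs:
            count += 1
    return f_obs, count / n_boot
```

The BW, BBF and BJ versions simply replace `anova_f` with the corresponding statistic; the centring step is what makes the procedure nonparametric, since no distributional form is imposed on the resampled data.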
3. Design of the simulation
The Type I error rate of the tests is investigated in a simulation study under different settings: with and without homoscedasticity; normal and non-normal distributions; symmetric and asymmetric distributions; identical and dissimilar distributions; light- and heavy-tailed distributions; negative and positive pairings of variances and sample sizes; and small, large, equal and unequal sample sizes. The nominal 5% significance level is used throughout. The simulation results are based on 5,000 replications for the non-bootstrap tests, and on 1,000 bootstrap replications within each of 5,000 Monte Carlo replications for the bootstrap tests.
For non-normal distributions we use Student's t with 4 degrees of freedom (symmetric and heavy-tailed), the uniform (symmetric and light-tailed), the chi-square with 4 d.f. (skewed and heavy-tailed), the exponential with mean 1/3 (skewed and heavy-tailed) and the g-and-h distribution, which, depending on the values of g and h (see Table 1), may be symmetric or asymmetric, with low, medium or high kurtosis.
Table 1. Characteristics of the g-and-h distributions used in the simulation study

          Symmetric               Asymmetric
g         0      0      0        0.81    1
h         0.14   0.2    0.22     0       0
μ         0      0      0        0.48    0.65
σ         1.28   1.47   1.54     1.65    2.16
γ1        0      0      0        3.8     6.2
γ2        5.7    33.2a  103a     33.3a   111a
To generate data from the g-and-h distribution, we follow Cribbie et al. (2012) and Hoaglin (1985). The values of g and h selected for the investigation are given in Table 1, and the g-and-h distributions are denoted in subsequent tables as (g, h). Tables 1 and 2 include the mean (μ), standard deviation (σ), skewness (γ1) and excess kurtosis (γ2) of the distributions used in this study.
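Hoaglin's (1985) g-and-h transformation of a standard normal variate Z is Y = g⁻¹(e^{gZ} − 1)·e^{hZ²/2}, with the limiting form Y = Z·e^{hZ²/2} when g = 0; g controls skewness and h controls tail weight. A minimal Python sketch (function names are ours, for illustration):

```python
import numpy as np

def g_and_h(z, g=0.0, h=0.0):
    """Hoaglin's (1985) g-and-h transform of standard normal values z."""
    if g == 0.0:
        y = z  # symmetric case: skewness parameter absent
    else:
        y = (np.exp(g * z) - 1.0) / g  # g induces skewness
    return y * np.exp(h * z ** 2 / 2.0)  # h thickens the tails

def sample_g_and_h(n, g, h, rng=None):
    """Draw a g-and-h sample of size n via the normal transform."""
    rng = np.random.default_rng() if rng is None else rng
    return g_and_h(rng.standard_normal(n), g, h)
```

With g = h = 0 the transform is the identity, recovering the standard normal, which is why the (0, 0) case need not appear in Table 1.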
Table 2. Characteristics of the normal, t, uniform, chi-square and exponential distributions used in the simulation study

          Symmetric                     Asymmetric
          N(0,1)   t(4)    Uniform     χ²(4)    Exp(3)
μ         0        0       0           4        1/3
σ         1        1.41                2.83     1/3
γ1        0        0       0           1.41     2
γ2        0        ∞       −1.2        3        6
Unequal group sizes, when paired with unequal variances, can affect Type I error control for tests that compare the typical score across groups (Keselman et al., 1998; Keselman, Cribbie, & Holland, 2002; Othman, Keselman, Padmanabhan, Wilcox, Algina, & Fradette, 2004; Syed Yahaya, Othman, & Keselman, 2006). Therefore, we pair the sample sizes and variances positively and negatively. A positive pairing occurs when the largest sample size is generated from the population with the largest variance, while the smallest sample size is associated with the smallest population variance. In a negative pairing, the largest sample size is paired with the smallest population variance and the smallest sample size with the largest.
Seven configurations of small samples are studied. Three have equal sample sizes: (10, 10, 10), (15, 15, 15), and (25, 25, 25). Four are unbalanced designs: (10, 10, 15), (10, 10, 25), (10, 15, 15), and (15, 20, 25). Four configurations of large samples are also studied: (30, 30, 30), (60, 60, 60), (30, 40, 60), and (50, 60, 70).
Unequal sample size combinations are studied with each of the seven combinations of variances, giving a total of 42 cases. Equal sample size combinations are studied with four variance combinations, giving a total of 20 cases. High heteroscedasticity is represented by and . The first combination is studied with all sample size combinations (11 cases) and the second with unequal sample size combinations (6 cases). So, of 62 cases considered (for each combination of distributions), 17 are cases of high heteroscedasticity and the rest are of low heteroscedasticity, including homoscedasticity.
Empirical Type I error rates were recorded for all tests. The robustness of a procedure, with respect to Type I error control, was determined using Bradley's (1978) liberal criterion: a procedure is deemed robust if its empirical Type I error rate falls within the range [0.5α, 1.5α]. For α = .05 this interval is [.025, .075]. An empirical rate above .075 indicates a liberal test, and one below .025 a conservative test.
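As a small illustration (a hypothetical helper, not part of the study's code), classifying an empirical rate against Bradley's liberal criterion amounts to:

```python
def bradley_classify(rate, alpha=0.05):
    """Classify an empirical Type I error rate by Bradley's (1978)
    liberal criterion: robust if 0.5*alpha <= rate <= 1.5*alpha."""
    lo, hi = 0.5 * alpha, 1.5 * alpha
    if rate > hi:
        return "liberal"
    if rate < lo:
        return "conservative"
    return "robust"
```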