### Abstract

- Abstract
- 1. Introduction
- 2. Simulation method
- 3. Results of simulations
- 4. Results of explicit selection of samples
- 5. Power curves with and without selection
- 6. Correlation of sample heterogeneity and probability of rejecting *H*_{0}
- 7. Decisions based on preliminary tests of equality of variances
- 8. Approximate *t* tests without pooled variance estimates
- 9. Some practical implications
- References

The two-sample Student *t* test of location was performed on random samples of scores and on rank-transformed scores from normal and non-normal population distributions with unequal variances. The same test also was performed on scores that had been explicitly selected to have nearly equal sample variances. The desired homogeneity of variance was brought about by repeatedly rejecting pairs of samples having a ratio of standard deviations that exceeded a predetermined cut-off value of 1.1, 1.2, or 1.3, while retaining pairs with ratios less than the cut-off value. Despite this forced conformity with the assumption of equal variances, the tests on the selected samples were no more robust than tests on unselected samples, and in most cases substantially less robust. Under conditions where sample sizes were unequal, so that Type I error rates were inflated and power curves were atypical, the selection procedure produced still greater inflation and distortion of the power curves.

### 1. Introduction


Numerous studies over the years have examined the extent to which familiar hypothesis tests of location, such as the Student *t* test, the ANOVA *F* test, and non-parametric counterparts of those methods, are robust to heterogeneity of variance of treatment groups, also called *heteroscedasticity* (Boneau, 1960; Hsu, 1938; Overall, Atlas & Gibson, 1995; Posten, Yeh & Owen, 1982; Scheffé, 1959, 1970). Typically, studies have found those widely used tests to be very sensitive to variance differences, with unfavourable effects on Type I error rates and power, especially when sample sizes are unequal.

Modifications of the familiar methods have been successful under some conditions (for example, Aspin, 1948; Satterthwaite, 1946; Welch, 1938, 1947), but often are not robust under other conditions that frequently arise in practice. For example, a modified test that successfully protects the Type I error rate for normal distributions may not have sufficient power or may fail entirely for non-normal distributions (see, for example, Bradstreet, 1997; Cressie & Whitford, 1986; Fagerland & Sandvik, 2009; Stonehouse & Forrester, 1998).

The purpose of the present paper is to examine the consequences of making a decision about heterogeneity of variance by inspecting sample data, under specific assumptions about the distributions of populations and sample sizes. If population distributions do in fact have unequal variances, what is the outcome when a researcher decides, by inspecting samples with only slightly different variances, that a Student *t* test is sufficiently robust to proceed? On the other hand, if population variances are in fact equal, how often will sample variances turn out to be noticeably different, to the extent that a researcher would not proceed?

An example of the problem can be seen from the data in Table 1, which illustrates that samples with heterogeneous variances can be drawn from homogeneous populations, and homogeneous samples can be drawn from heterogeneous populations. In both parts of the table, samples were randomly drawn from populations with no differences between means. In Table 1(a), the variances were the same in both populations, that is, σ_{1}/σ_{2} = 1.0. However, as a result of random sampling, it turned out that the sample standard deviations were decidedly unequal, *s*_{1}/*s*_{2} = 5.496. Normally, researchers would be inclined to reject such data as not suitable for statistical analysis using the Student *t* test. However, the *t* statistic is actually 0.433 for these data, which would yield a correct statistical decision.

Table 1. Examples of (a) samples with heterogeneous variances drawn from a homogeneous population and (b) samples with homogeneous variances drawn from a heterogeneous population

| (a) *x*_{1} | (a) *x*_{2} | (b) *x*_{1} | (b) *x*_{2} |
|---|---|---|---|
| 8.9 | 10.1 | 7.4 | 11.9 |
| 11.5 | 9.6 | 9.1 | 9.6 |
| 12.1 | 10.3 | 8.6 | 11.1 |
| 11.2 | 10.5 | 6.8 | 6.9 |
| 8.7 | 9.7 | 5.5 | 9.3 |
| 7.0 | 9.8 | | 9.9 |
| 10.6 | 10.1 | | 10.7 |
| 9.9 | 10.3 | | 10.8 |
| 8.6 | 10.0 | | 9.5 |
| 9.6 | 9.7 | | 8.8 |
| | | | 12.3 |
| | | | 11.0 |
| μ_{1} − μ_{2} = 0 | | μ_{1} − μ_{2} = 0 | |
| σ_{1}/σ_{2} = 1.0 | | σ_{1}/σ_{2} = 5.0 | |
| *s*_{1} = 2.439 | | *s*_{1} = 2.090 | |
| *s*_{2} = 0.081 | | *s*_{2} = 2.180 | |
| *s*_{1}/*s*_{2} = 5.496 | | *s*_{1}/*s*_{2} = 1.021 | |
| *t* = 0.433 | | *t* = 3.415 | |
| *p* < .357 | | *p* < .003 | |

For the scores in Table 1(b), the variances of the populations were heterogeneous, with σ_{1}/σ_{2} = 5.0. Nevertheless, it turned out that the sample standard deviations indicated homogeneity, that is, *s*_{1}/*s*_{2} = 1.021, which would be suitable for a *t* test. But, because the population means are equal, the *t* statistic of 3.415 would yield an incorrect statistical decision. These examples are extreme in order to indicate the problem and are not representative of hypothesis testing in practical research. Nevertheless, the extent to which similar discrepancies occur as a result of random sampling is largely unknown. The present study employed simulations to provide an indication of how likely those outcomes may be and how Type I error rates and power are affected under various assumptions about distributions and sample sizes.

Pairs of samples were repeatedly drawn from normal and non-normal population distributions having predetermined parameters. On each replication of the sampling procedure, a record was made of the ratio of the sample standard deviations, together with whether or not the difference between the two sample means on that occasion exceeded the cut-off value for significance at the .05 level. These records from individual replications made it possible to compare the probability of rejecting the null hypothesis on trials where the ratio exceeded a given cut-off value with the probability on trials where it did not. The records also indicated how that probability depended on the nature of the distributions, the sample sizes, and the extent of the inequality of variances in the populations. Some results of transformation to ranks, of preliminary tests of equality of variances, and of separate-variances tests such as the Welch test were also investigated.
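The per-replication record-keeping described above can be sketched as follows. This is a minimal Monte Carlo illustration in Python with NumPy/SciPy rather than the authors' *Mathematica* code; the population parameters and replication count are arbitrary examples, not the study's exact conditions.

```python
import numpy as np
from scipy import stats

def record_replications(n1=30, n2=30, sd1=1.0, sd2=1.0, reps=10_000, seed=0):
    """Draw pairs of normal samples, recording on each replication the ratio
    of sample standard deviations and whether the pooled-variance Student
    t test rejects H0 at the .05 level."""
    rng = np.random.default_rng(seed)
    ratios = np.empty(reps)
    rejected = np.empty(reps, dtype=bool)
    for i in range(reps):
        x1 = rng.normal(0.0, sd1, n1)
        x2 = rng.normal(0.0, sd2, n2)
        ratios[i] = np.std(x1, ddof=1) / np.std(x2, ddof=1)
        t, p = stats.ttest_ind(x1, x2)  # pooled-variance Student t test
        rejected[i] = p < .05
    return ratios, rejected

ratios, rejected = record_replications()
print(rejected.mean())  # empirical Type I error rate, close to .05 here
```

With the two record arrays in hand, conditional rejection rates for any cut-off on the ratio can be computed after the fact, which is how the comparisons in the later sections are organized.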

### 2. Simulation method


The simulations were programmed using *Mathematica*, version 4.1, together with *Mathematica* statistical add-on packages (Wolfram, 1999). With the exception of the mixed normal distribution, random deviates were obtained from built-in functions in the statistical add-on packages, and the programming of the simulation procedures was also done in the *Mathematica* programming language. The main concern of the study was finding accurate estimates of Type I error probabilities and power for significance tests of location (the *t* test on initial scores and the *t* test on ranks), for normal and eight non-normal distributions, and for both selected and unselected samples.

The number of iterations of the sampling procedure for each condition was 10,000 for Tables 3 and 6, 50,000 for Table 4, and 20,000 for Table 5. Each point representing a probability in Figures 1-4 and 7-9 was obtained from 10,000 iterations, and each frequency distribution in Figures 5 and 6 from 100,000 iterations. Those numbers were chosen to be sufficiently large to give accurate estimates of conditional probabilities, when only a fraction of the total number of iterations contributed to an estimated probability of interest.

Non-parametric rank tests are often used to protect the significance level and to increase power for non-normal data. It is known that a rank transformation, in which a Student *t* test is performed on ranks replacing the original scores, has essentially the same outcome as the normal approximation form of the Wilcoxon–Mann–Whitney rank-sum test performed on the same data (Conover & Iman, 1980). In the present study, some comparisons were made between the *t* test on initial scores resulting from the simulations and the *t* test on corresponding ranks of the scores in the combined samples. Therefore, some of the findings in the study can be regarded as comparisons between the parametric *t* test and the non-parametric Wilcoxon–Mann–Whitney test. Throughout the study, the hypothesis tested, for both *t* and *t*_{R}, was *H*_{0}: |μ_{1} − μ_{2}| = 0 against the alternative *H*_{1}: |μ_{1} − μ_{2}| > 0.
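The Conover and Iman equivalence can be checked directly: a Student *t* test applied to the ranks of the pooled scores gives essentially the same result as the asymptotic Wilcoxon–Mann–Whitney test. A sketch in Python (the data are arbitrary normal samples, not the study's):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x1 = rng.normal(0.0, 1.0, 30)
x2 = rng.normal(0.5, 1.0, 30)

# Rank-transform: rank the pooled scores, then split back into the two groups.
ranks = stats.rankdata(np.concatenate([x1, x2]))
r1, r2 = ranks[:30], ranks[30:]

t_R, p_tR = stats.ttest_ind(r1, r2)  # Student t on ranks
u, p_w = stats.mannwhitneyu(x1, x2, alternative='two-sided',
                            method='asymptotic')  # normal-approximation WMW

print(p_tR, p_w)  # the two p-values are close but not identical
```

The two procedures are not algebraically identical (the *t* on ranks uses a *t* reference distribution, the Wilcoxon–Mann–Whitney normal approximation a *z* with continuity correction), but for samples of this size the p-values track each other closely.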

The parameters of the nine distributions used in the simulations are described in Table 2 (see also Evans, Hastings & Peacock, 2000). Random deviates from all distributions except the mixed normal were obtained from the *Mathematica* statistical add-on package, using the parameters given in Table 2. A mixed normal distribution, used to represent samples with outliers, consisted of samples from *N*(0, 1) with probability .95 and from *N*(8, 1) with probability .05. To make comparisons possible, all distributions were standardized to have mean 0 and variance 1.

Table 2. Skewness and kurtosis of distributions. All distributions were standardized to μ = 0 and σ = 1

| Distribution | Parameters | Skewness | Kurtosis |
|---|---|---|---|
| Normal | μ = 0, σ = 1 | 0 | 3 |
| Exponential | Scale parameter 1 | 2 | 9 |
| Laplace | Location parameter 1, scale parameter 1 | 0 | 6 |
| Lognormal | Derived from *N*(0, 1) | 6.185 | 113.936 |
| Extreme value (Gumbel) | Location parameter 1, scale parameter 1 | 1.140 | 5.400 |
| Half-normal | Scale parameter 0.75 | 0.995 | 3.869 |
| Mixed normal | *N*(0, 1) with prob. .95, *N*(8, 1) with prob. .05 | 2.697 | 11.539 |
| Rayleigh | Scale parameter 1 | 0.631 | 3.250 |
| Logistic | Location parameter 0, scale parameter 0.5 | 0 | 4.2 |
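The mixed normal above, used to mimic samples with outliers, is straightforward to generate and standardize: before standardizing, its mean is 0.05 × 8 = 0.4 and its variance is 0.95 · 1 + 0.05 · (1 + 8²) − 0.4² = 4.04. A sketch in Python (the generator name is ours, not the paper's):

```python
import numpy as np

def mixed_normal(size, rng):
    """N(0,1) with probability .95, N(8,1) with probability .05,
    standardized to population mean 0 and variance 1."""
    outlier = rng.random(size) < 0.05
    x = np.where(outlier,
                 rng.normal(8.0, 1.0, size),
                 rng.normal(0.0, 1.0, size))
    mu = 0.05 * 8.0                                   # mixture mean = 0.4
    var = 0.95 * 1.0 + 0.05 * (1.0 + 64.0) - mu**2    # mixture variance = 4.04
    return (x - mu) / np.sqrt(var)

rng = np.random.default_rng(7)
sample = mixed_normal(100_000, rng)
print(sample.mean(), sample.std())  # both close to 0 and 1
```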

### 3. Results of simulations


Table 6 indicates that the various normal and non-normal distributions in Table 2 affect both the *t* test and its non-parametric counterpart based on ranks in ways that are consistent with previous simulation studies. At the same time, the data reveal some unexpected effects when samples were selected so that variances were homogeneous. First, it is well known that non-normality often severely disrupts both Type I error rates and power of the *t* test for many distributions, whereas non-parametric tests, such as the Wilcoxon–Mann–Whitney test and the Wilcoxon signed-ranks test, restore the probabilities to values typical of a normal distribution.

However, many studies over the years have indicated that heterogeneity of variance, in combination with unequal sample sizes, disrupts the Type I error rates and power of the *t* test (see, for example, Hsu, 1938; Scheffé, 1959, 1970; Stonehouse & Forrester, 1998). When a larger variance is associated with a smaller sample size, both the Type I error probability and power of the *t* test are spuriously elevated, and when a larger variance is associated with a larger sample size, the probabilities are depressed. When sample sizes are equal, the influence of variance heterogeneity is considerably smaller.
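The direction of these effects is easy to reproduce. The following Monte Carlo sketch in Python (not the authors' code; the 2:1 standard deviation ratio and the (10, 50) sample sizes follow conditions used in the tables) shows the pooled *t* test running well above the nominal .05 level when the larger variance is paired with the smaller sample, while a separate-variances (Welch) test, of the kind taken up in Section 8, stays near it:

```python
import numpy as np
from scipy import stats

def type1_rates(n1, n2, sd1, sd2, reps=10_000, seed=3):
    """Estimate Type I error rates of the pooled and Welch t tests
    when both populations are normal with mean 0."""
    rng = np.random.default_rng(seed)
    pooled = welch = 0
    for _ in range(reps):
        x1 = rng.normal(0.0, sd1, n1)
        x2 = rng.normal(0.0, sd2, n2)
        pooled += stats.ttest_ind(x1, x2).pvalue < .05
        welch += stats.ttest_ind(x1, x2, equal_var=False).pvalue < .05
    return pooled / reps, welch / reps

# Larger variance paired with the smaller sample: the pooled t is liberal.
rates = type1_rates(10, 50, 2.0, 1.0)
print(rates)
```

Reversing the pairing (larger variance with the larger sample) depresses the pooled test's rate below .05 instead, matching the pattern described above.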

The data in Table 6 consistently exhibit those familiar effects for the various distributions and combinations of sample sizes. When *n*_{1} = *n*_{2} = 30, the *t* test on scores performed well for normal distributions despite inequality of variances. However, when sample sizes were unequal and the larger sample size was associated with the smaller variance (the first two pairs of columns in the table), both the Type I error rates and power of *t* were inflated, and the same was true of *t*_{R} to a somewhat lesser extent. The influence of unequal variances was consistently greater when *n*_{1} = 10 and *n*_{2} = 50 than when *n*_{1} = 20 and *n*_{2} = 40.

In a similar way, when sample sizes were unequal and the larger sample size was associated with the larger variance (the last two pairs of columns in the table), both Type I error rates and power were depressed below the nominal significance level. Moreover, the effect was larger when *n*_{1} = 50 and *n*_{2} = 10 than when *n*_{1} = 40 and *n*_{2} = 20. The changes associated with unequal variances were not eliminated when the *t* test on ranks replaced the *t* test on scores.

### 4. Results of explicit selection of samples


A picture quite different from what might be expected emerged when explicit selection ensured nearly equal variances in those pairs of samples retained for the hypothesis test. In the case of the normal distribution, the results were almost the same for selected and unselected samples, and that was true for both *t* and *t*_{R}. There was little or no effect on Type I error rates or power. In cases where variances were unequal and the larger variance was associated with the smaller sample, selection of homogeneous samples resulted in a change for the worse. The Type I error rates were inflated still further above the nominal significance level, and the probability of rejecting *H*_{0} for non-zero differences between means was spuriously increased still more. Again, the effect was similar for both *t* and *t*_{R}, although the changes were somewhat less for *t*_{R}. When a larger variance was associated with a larger sample size, selection brought the Type I error probability slightly closer to the significance level.

In the case of non-normal distributions, when variances were equal, selection typically reduced the Type I error probabilities for *t*_{R} substantially below the nominal significance level. That is, the non-parametric method no longer maintained the Type I error rate close to the significance level despite non-normality. The probabilities were close to zero in many instances. For a difference between means of 0.4, selection of homogeneous samples typically reduced the probability of rejecting *H*_{0} slightly, and for differences of 0.8 and 1.2 selection increased the probability of rejecting *H*_{0}, in many cases to a much larger extent.

The table indicates that both *t* and *t*_{R} were biased in the case of several non-normal distributions, more severely for *t*_{R} than for *t*. That is, the probability of rejecting *H*_{0} actually decreased as the difference between means increased. For example, consider *t*_{R} performed on the exponential and lognormal distributions when σ_{1}/σ_{2} = 2.0. For those distributions, the probability of rejecting *H*_{0} when μ_{1} − μ_{2} = 0.4 was considerably less than the Type I error probability. The same bias also occurred for the extreme value, half-normal, and mixed normal distributions.

For non-normal distributions with unequal variances, the selection process often resulted in extensive bias where it did not exist before, for both *t* and *t*_{R}. For example, consider the exponential distribution for σ_{1}/σ_{2} = 1.6 and μ_{1} − μ_{2} of 0, 0.4, and 0.8. For all combinations of sample sizes, the probability decreased when the difference was 0.4 and then increased again when it was 0.8. The same occurred for the lognormal, extreme value, and half-normal distributions.
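The selection procedure itself is simple to emulate: draw a pair of samples, discard it if the ratio of the larger to the smaller standard deviation exceeds the cut-off, and test only the retained pairs. The vectorized sketch below (in Python, not the authors' code; parameters chosen to match one equal-variance, equal-*n* condition) uses the standardized exponential distribution, where sample means and sample variances are correlated, so screening for similar variances also screens for similar means and depresses the Type I error rate:

```python
import numpy as np
from scipy import stats

def selection_demo(n=30, reps=20_000, cutoff=1.2, seed=11):
    """Standardized exponential populations (mean 0, variance 1) with equal
    variances; compare the pooled t test's rejection rate on all pairs with
    the rate on pairs retained by the variance-ratio screen."""
    rng = np.random.default_rng(seed)
    x1 = rng.exponential(1.0, (reps, n)) - 1.0  # standardized exponential
    x2 = rng.exponential(1.0, (reps, n)) - 1.0
    s1 = x1.std(axis=1, ddof=1)
    s2 = x2.std(axis=1, ddof=1)
    ratio = np.maximum(s1, s2) / np.minimum(s1, s2)
    t, p = stats.ttest_ind(x1, x2, axis=1)      # pooled t, one test per row
    retained = ratio < cutoff
    return (p < .05).mean(), (p[retained] < .05).mean()

all_rate, selected_rate = selection_demo()
print(all_rate, selected_rate)  # selection depresses the Type I error rate
```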

### 5. Power curves with and without selection


A more detailed picture of some of these relationships is given in Figures 1 and 2, which plot the probability of rejecting *H*_{0} as a function of μ_{1} − μ_{2}, over a range from 0 to 1, in increments of 0.1. The upper part of Figure 1 shows power curves for the normal distribution when variances were equal but sample sizes were unequal (square symbols), and the function was typical of a power curve of a normal distribution. When variances were unequal (filled circles), the Type I error probability was inflated, and the increase in power as μ_{1} − μ_{2} increased was considerably smaller than usual. When only pairs of samples having a ratio of standard deviations less than 1.2 were retained, the Type I error probability was inflated still more, and the curve was slightly elevated above the one for unselected samples with unequal variances.

The lower part of Figure 1 shows power functions of the *t* test performed on samples from an exponential distribution under the same conditions. The two curves for unselected samples with equal and unequal variances were much the same as for the normal distribution, but the curve for the selected samples declined abruptly and showed pronounced bias over an extensive range. Figure 2 shows similar results for *t*_{R} performed on samples from exponential and lognormal distributions. When variances were unequal, the curves for both selected and unselected samples again were close together, and both exhibited extreme bias. For the exponential and lognormal distributions, and for both *t* and *t*_{R}, selection of samples brought no improvement and distorted the form of the curves to an even greater extent.

Table 5 provides a somewhat different perspective by comparing the probability of rejection of *H*_{0} on those occasions when the ratio of sample standard deviations exceeded a cut-off value and on those occasions when the ratio was below that value. The three cut-off values were 1.1, 1.2, and 1.3, and the two conditions are represented in the columns labelled *r* > *c* and *r* < *c*. The first column in each group of three columns, labelled *p*(*r* > *c*), is the proportion of all samples in which the ratio of sample standard deviations exceeded the cut-off value. In the first part of the table, the population ratio σ_{1}/σ_{2} was 1.0, and in the second part it was 1.5.

It is clear that, for all distributions, the proportion of sample ratios exceeding the cut-off value declined as the cut-off value increased. However, the probability of rejecting *H*_{0} remained about the same for all cut-off ratios, although it varied depending on the type of distribution. For the normal, Laplace, and logistic distributions the probability was nearly the same above and below the cut-off ratio. However, for all other distributions, the probability was somewhat larger below the cut-off ratio, which is consistent with the results in Table 6.
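The quantities in this comparison are straightforward to estimate: for each cut-off *c*, record the proportion of replications with *r* > *c* and the rejection rates above and below the cut-off. A vectorized Monte Carlo sketch in Python (one normal, equal-variance condition shown as an arbitrary example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
reps, n = 20_000, 30
x1 = rng.normal(0.0, 1.0, (reps, n))
x2 = rng.normal(0.0, 1.0, (reps, n))
s1 = x1.std(axis=1, ddof=1)
s2 = x2.std(axis=1, ddof=1)
r = np.maximum(s1, s2) / np.minimum(s1, s2)  # ratio of larger to smaller SD
reject = stats.ttest_ind(x1, x2, axis=1).pvalue < .05

for c in (1.1, 1.2, 1.3):
    above = r > c
    print(f"c = {c}: p(r > c) = {above.mean():.3f}, "
          f"reject | r > c = {reject[above].mean():.3f}, "
          f"reject | r < c = {reject[~above].mean():.3f}")
```

For the normal distribution the two conditional rejection rates come out nearly equal, consistent with the first row of the comparison above; substituting a skewed generator for `rng.normal` reproduces the above/below discrepancy described for the other distributions.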

Figures 3 and 4 provide more detailed power curves for these conditions for normal, exponential, lognormal, and extreme value distributions. For the normal distribution, there was almost no difference in the proportions for any value of μ_{1} − μ_{2}. For the other three distributions, the curves representing selection of samples differed in ways consistent with Tables 6 and 5. However, the curves representing more heterogeneous samples (*r* > *c*) were close to those representing all samples. Again, for the exponential and lognormal distributions, the selection procedure resulted in bias where it did not exist before, and the same was true for the extreme value distribution.

### 6. Correlation of sample heterogeneity and probability of rejecting *H*_{0}


When considering heterogeneity of variance of individual pairs of samples in practical hypothesis testing, the data in Table 3 are instructive. The table shows two correlation coefficients obtained from 20,000 iterations of the sampling procedure under the conditions that produced the frequency distributions in Figures 5 and 6 (to be discussed below). The correlation in the first column of each pair of columns (labelled *r*_{PB}) is the point-biserial correlation of the ratio of the standard deviations of samples, *s*_{1}/*s*_{2}, and the rejection of *H*_{0} (denoted by 1), or non-rejection (denoted by 0). The second correlation in each pair (labelled *r*) is the Pearson correlation of the ratio of standard deviations and the value of the *t* statistic. Two population ratios of standard deviations, 1.0 and 1.5, were included.
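Both coefficients can be computed directly from the per-replication records: the point-biserial correlation is simply the Pearson correlation between the ratio *s*_{1}/*s*_{2} and a 0/1 rejection indicator. A sketch in Python (one normal, equal-variance condition shown as an arbitrary example; under these conditions both correlations should be near zero):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
reps, n = 20_000, 30
x1 = rng.normal(0.0, 1.0, (reps, n))
x2 = rng.normal(0.0, 1.0, (reps, n))
ratio = x1.std(axis=1, ddof=1) / x2.std(axis=1, ddof=1)
t, p = stats.ttest_ind(x1, x2, axis=1)

reject = (p < .05).astype(float)         # 1 = H0 rejected, 0 = not rejected
r_pb = np.corrcoef(ratio, reject)[0, 1]  # point-biserial correlation
r = np.corrcoef(ratio, t)[0, 1]          # Pearson r with the t statistic
print(r_pb, r)
```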

Table 3. Probability of rejecting *H*_{0} by *t* tests on all pairs of samples of scores (*t*) and ranks (*t*_{R}), and also on pairs of samples of scores (*t*_{H}) and ranks (*t*_{RH}) selected for homogeneity of varianceStudent *t*, σ_{1}/σ_{2} = 1.0 cut-off = 1.2 | Sample sizes |
---|

Distribution | μ_{1} − μ_{2} | *n* = 10 *n* = 50 | *n* = 20 *n* = 40 | *n* = 30 *n* = 30 | *n* = 40 *n* = 20 | *n* = 50 *n* = 10 |
---|

*t* | *t* _{H} | *t* | *t* _{H} | *t* | *t* _{H} | *t* | *t* _{H} | *t* | *t* _{H} |
---|

Normal | 0 | .051 | .051 | .052 | .050 | .053 | .050 | .053 | .056 | .048 | .050 |

0.4 | .199 | .206 | .293 | .289 | .327 | .330 | .301 | .303 | .207 | .200 |

0.8 | .615 | .624 | .826 | .825 | .869 | .864 | .821 | .817 | .616 | .606 |

1.2 | .926 | .930 | .993 | .993 | .994 | .996 | .990 | .991 | .930 | .930 |

Exponential | 0 | .044 | .011 | .047 | .007 | .044 | .008 | .046 | .007 | .044 | .011 |

0.4 | .222 | .191 | .318 | .307 | .356 | .318 | .332 | .259 | .225 | .100 |

0.8 | .608 | .776 | .810 | .924 | .858 | .930 | .822 | .884 | .661 | .647 |

1.2 | .923 | .991 | .985 | .999 | .989 | .998 | .984 | .996 | .920 | .950 |

Laplace | 0 | .057 | .048 | .050 | .051 | .052 | .045 | .048 | .050 | .051 | .055 |

0.4 | .221 | .213 | .311 | .320 | .349 | .355 | .316 | .318 | .205 | .218 |

0.8 | .627 | .640 | .827 | .819 | .863 | .873 | .821 | .828 | .638 | .638 |

1.2 | .924 | .923 | .986 | .986 | .992 | .993 | .988 | .985 | .919 | .926 |

Lognormal | 0 | .050 | .006 | .040 | .006 | .041 | .004 | .043 | .004 | .050 | .004 |

0.4 | .265 | .315 | .394 | .474 | .431 | .480 | .415 | .400 | .281 | .163 |

0.8 | .669 | .906 | .822 | .961 | .854 | .964 | .843 | .926 | .724 | .744 |

1.2 | .900 | .994 | .958 | .997 | .967 | .998 | .958 | .992 | .904 | .957 |

Extreme value | 0 | .048 | .024 | .049 | .029 | .051 | .027 | .049 | .024 | .050 | .027 |

0.4 | .205 | .201 | .305 | .301 | .331 | .332 | .316 | .294 | .208 | .169 |

0.8 | .626 | .676 | .815 | .868 | .856 | .898 | .829 | .848 | .644 | .649 |

1.2 | .925 | .967 | .991 | .996 | .994 | .998 | .985 | .993 | .916 | .944 |

Half-normal | 0 | .050 | .024 | .049 | .024 | .051 | .027 | .048 | .023 | .048 | .025 |

0.4 | .204 | .185 | .294 | .279 | .331 | .332 | .305 | .262 | .212 | .134 |

0.8 | .612 | .674 | .821 | .868 | .856 | .898 | .813 | .846 | .624 | .625 |

1.2 | .928 | .970 | .989 | .997 | .994 | .998 | .984 | .994 | .917 | .934 |

Mixed normal | 0 | .043 | .011 | .042 | .006 | .051 | .022 | .047 | .007 | .042 | .013 |

0.4 | .225 | .176 | .330 | .246 | .337 | .314 | .368 | .202 | .259 | .177 |

0.8 | .626 | .694 | .810 | .871 | .855 | .892 | .808 | .870 | .654 | .481 |

1.2 | .919 | .973 | .983 | .997 | .993 | .998 | .975 | .998 | .906 | .944 |

Rayleigh | 0 | .047 | .033 | .051 | .038 | .049 | .035 | .044 | .036 | .049 | .035 |

0.4 | .206 | .197 | .298 | .284 | .335 | .319 | .301 | .289 | .214 | .183 |

0.8 | .618 | .646 | .815 | .840 | .862 | .884 | .819 | .833 | .629 | .642 |

1.2 | .922 | .951 | .990 | .995 | .995 | .998 | .990 | .992 | .923 | .931 |

Logistic | 0 | .049 | .053 | .051 | .052 | .048 | .051 | .050 | .042 | .046 | .049 |

0.4 | .204 | .211 | .312 | .310 | .337 | .345 | .307 | .310 | .210 | .217 |

0.8 | .635 | .636 | .823 | .821 | .862 | .868 | .827 | .824 | .629 | .638 |

| 1.2 | .926 | .926 | .987 | .990 | .994 | .995 | .989 | .988 | .922 | .928 |

*t* on ranks, σ_{1}/σ_{2} = 1.0 cut-off = 1.2 | *t*_{R} | *t*_{RH} | *t* _{ R } | *t* _{ RH } | *t* _{ R } | *t* _{ RH } | *t* _{ R } | *t* _{ RH } | *t* _{ R } | *t* _{ RH } |

Normal | 0 | .045 | .048 | .051 | .047 | .051 | .053 | .051 | .049 | .051 | .046 |

0.4 | .200 | .182 | .281 | .288 | .305 | .311 | .284 | .287 | .186 | .190 |

0.8 | .593 | .592 | .807 | .802 | .844 | .851 | .803 | .800 | .589 | .585 |

1.2 | .910 | .909 | .990 | .989 | .995 | .996 | .987 | .989 | .911 | .193 |

Exponential | 0 | .048 | .031 | .046 | .033 | .048 | .031 | .045 | .031 | .047 | .033 |

0.4 | .352 | .413 | .566 | .602 | .606 | .618 | .549 | .537 | .394 | .306 |

0.8 | .886 | .938 | .972 | .983 | .972 | .981 | .944 | .949 | .790 | .751 |

1.2 | .996 | .999 | 1 | 1 | .999 | 1 | .997 | .998 | .944 | .949 |

Laplace | 0 | .043 | .050 | .049 | .047 | .045 | .051 | .048 | .046 | .045 | .049 |

0.4 | .277 | .261 | .410 | .402 | .443 | .441 | .407 | .406 | .271 | .262 |

0.8 | .728 | .712 | .905 | .896 | .934 | .933 | .904 | .900 | .729 | .717 |

1.2 | .951 | .950 | .996 | .994 | .999 | .998 | .995 | .995 | .948 | .948 |

Lognormal | 0 | .051 | .034 | .052 | .038 | .055 | .039 | .052 | .039 | .052 | .032 |

0.4 | .715 | .813 | .888 | .919 | .896 | .914 | .838 | .824 | .637 | .539 |

0.8 | .996 | .999 | .999 | 1 | .998 | .999 | .993 | .995 | .926 | .914 |

1.2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | .984 | .986 |

Extreme value | 0 | .052 | .038 | .050 | .041 | .045 | .040 | .050 | .036 | .048 | .034 |

.4 | .221 | .234 | .347 | .361 | .394 | .393 | .354 | .334 | .248 | .196 |

.8 | .698 | .732 | .886 | .903 | .910 | .919 | .864 | .879 | .682 | .649 |

1.2 | .973 | .977 | .996 | .999 | .997 | .997 | .992 | .995 | .923 | .932 |

Half-normal | 0 | .052 | .036 | .049 | .041 | .052 | .040 | .050 | .038 | .050 | .034 |

0.4 | .212 | .230 | .352 | .359 | .392 | .394 | .364 | .333 | .258 | .189 |

0.8 | .683 | .732 | .876 | .894 | .892 | .907 | .838 | .847 | .640 | .611 |

1.2 | .966 | .979 | .996 | .998 | .995 | .998 | .985 | .991 | .896 | .910 |

Mixed-normal | 0 | .049 | .038 | .048 | .037 | .050 | .036 | .050 | .039 | .050 | .038 |

0.4 | .514 | .529 | .713 | .728 | .755 | .769 | .701 | .700 | .506 | .396 |

0.8 | .977 | .980 | .998 | .999 | .999 | .999 | .996 | .999 | .941 | .953 |

1.2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | .991 | .999 |

Rayleigh | 0 | .050 | .039 | .049 | .042 | .048 | .040 | .050 | .042 | .041 | .038 |

0.4 | .189 | .199 | .300 | .299 | .337 | .327 | .308 | .296 | .211 | .176 |

0.8 | .612 | .639 | .835 | .838 | .863 | .868 | .808 | .813 | .614 | .596 |

1.2 | .944 | .953 | .992 | .993 | .993 | .995 | .985 | .988 | .897 | .903 |

Logistic | 0 | .050 | .046 | .045 | .048 | .046 | .046 | .051 | .044 | .049 | .043 |

0.4 | .208 | .212 | .318 | .327 | .354 | .357 | .318 | .313 | .217 | .216 |

0.8 | .649 | .628 | .845 | .840 | .882 | .880 | .836 | .843 | .640 | .643 |

| 1.2 | .922 | .928 | .992 | .993 | .994 | .996 | .992 | .991 | .929 | .929 |

Student *t*, σ_{1}/σ_{2} = 2.0 cut-off = 1.2 | *t* | *t* _{H} | *t* | *t* _{H} | *t* | *t* _{H} | *t* | *t* _{H} | *t* | *t* _{H} |

Normal | 0 | .205 | .257 | .106 | .210 | .051 | .136 | .019 | .052 | .004 | .008 |

0.4 | .296 | .346 | .219 | .346 | .160 | .287 | .081 | .179 | .025 | .044 |

0.8 | .487 | .544 | .513 | .653 | .472 | .641 | .347 | .500 | .129 | .188 |

1.2 | .715 | .764 | .817 | .900 | .824 | .915 | .736 | .837 | .403 | .508 |

Exponential | 0 | .210 | .361 | .127 | .408 | .060 | .315 | .024 | .081 | .008 | .014 |

0.4 | .247 | .066 | .206 | .055 | .126 | .032 | .054 | .013 | .007 | .063 |

0.8 | .455 | .079 | .530 | .072 | .506 | .047 | .347 | .018 | .096 | .240 |

1.2 | .733 | .340 | .868 | .465 | .872 | .447 | .781 | .296 | .446 | .524 |

Laplace | 0 | .197 | .131 | .118 | .095 | .056 | .053 | .019 | .029 | .004 | .014 |

0.4 | .291 | .258 | .247 | .271 | .176 | .214 | .091 | .149 | .023 | .063 |

0.8 | .507 | .554 | .548 | .644 | .505 | .610 | .378 | .478 | .141 | .240 |

1.2 | .725 | .810 | .816 | .909 | .817 | .898 | .740 | .819 | .440 | .524 |

Lognormal | 0 | .199 | .203 | .132 | .240 | .077 | .202 | .040 | .130 | .021 | .055 |

0.4 | .238 | .016 | .184 | .009 | .120 | .004 | .048 | .002 | .006 | .001 |

0.8 | .497 | .169 | .607 | .200 | .564 | .162 | .436 | .070 | .146 | .005 |

1.2 | .788 | .726 | .895 | .830 | .900 | .816 | .829 | .650 | .553 | .205 |

Extreme value | 0 | .206 | .345 | .122 | .344 | .058 | .261 | .021 | .147 | .005 | .046 |

0.4 | .261 | .164 | .224 | .096 | .140 | .048 | .069 | .016 | .010 | .005 |

0.8 | .459 | .200 | .527 | .168 | .490 | .115 | .356 | .050 | .110 | .003 |

1.2 | .716 | .433 | .838 | .511 | .845 | .482 | .763 | .336 | .431 | .049 |

Half-normal | 0 | .208 | .491 | .121 | .542 | .057 | .430 | .021 | .234 | .006 | .059 |

0.4 | .263 | .229 | .203 | .165 | .138 | .082 | .062 | .029 | .012 | .005 |

0.8 | .469 | .149 | .503 | .086 | .477 | .057 | .327 | .026 | .102 | .003 |

1.2 | .703 | .292 | .842 | .336 | .842 | .334 | .751 | .252 | .402 | .049 |

Mixed-normal | 0 | .190 | .190 | .130 | .267 | .072 | .240 | .032 | .144 | .020 | .035 |

0.4 | .227 | .055 | .186 | .036 | .113 | .019 | .042 | .004 | .005 | .001 |

0.8 | .463 | .217 | .558 | .294 | .516 | .265 | .366 | .110 | .085 | .003 |

1.2 | .760 | .623 | .887 | .799 | .892 | .780 | .819 | .550 | .472 | .092 |

Rayleigh | 0 | .205 | .390 | .116 | .384 | .049 | .312 | .016 | .164 | .003 | .035 |

0.4 | .271 | .251 | .215 | .163 | .150 | .076 | .070 | .021 | .012 | .003 |

0.8 | .462 | .277 | .524 | .245 | .486 | .152 | .350 | .072 | .112 | .014 |

1.2 | .713 | .477 | .833 | .535 | .840 | .501 | .758 | .371 | .406 | .119 |

Logistic | 0 | .200 | .574 | .113 | .130 | .047 | .071 | .018 | .026 | .003 | .011 |

0.4 | .283 | .572 | .228 | .284 | .171 | .215 | .085 | .134 | .022 | .050 |

0.8 | .469 | .599 | .533 | .627 | .489 | .598 | .355 | .462 | .136 | .206 |

| 1.2 | .706 | .665 | .818 | .889 | .826 | .897 | .741 | .809 | .422 | .507 |

*t* on ranks, σ_{1}/σ_{2} = 2.0 cut-off = 1.2 | *t*_{R} | *t*_{RH} | *t*_{R} | *t*_{RH} | *t*_{R} | *t*_{RH} | *t*_{R} | *t*_{RH} | *t*_{R} | *t*_{RH} |

Normal | 0 | .120 | .240 | .090 | .190 | .058 | .131 | .034 | .055 | .014 | .008 |

0.4 | .167 | .309 | .180 | .338 | .164 | .285 | .124 | .183 | .057 | .046 |

0.8 | .334 | .521 | .422 | .625 | .456 | .617 | .398 | .488 | .226 | .198 |

1.2 | .544 | .734 | .726 | .887 | .781 | .904 | .768 | .838 | .541 | .508 |

Exponential | 0 | .238 | .442 | .266 | .546 | .268 | .535 | .218 | .454 | .084 | .223 |

0.4 | .124 | .152 | .096 | .156 | .067 | .120 | .036 | .072 | .011 | .027 |

0.8 | .215 | .124 | .276 | .121 | .289 | .106 | .243 | .068 | .138 | .018 |

1.2 | .645 | .496 | .846 | .698 | .890 | .723 | .847 | .640 | .661 | .323 |

Laplace | 0 | .101 | .125 | .086 | .088 | .065 | .060 | .040 | .029 | .016 | .016 |

0.4 | .206 | .265 | .246 | .294 | .251 | .272 | .202 | .185 | .093 | .087 |

0.8 | .446 | .580 | .602 | .713 | .657 | .720 | .619 | .632 | .400 | .324 |

1.2 | .685 | .830 | .862 | .942 | .913 | .951 | .912 | .920 | .758 | .664 |

Lognormal | 0 | .340 | .411 | .443 | .596 | .473 | .640 | .411 | .596 | .201 | .382 |

0.4 | .114 | .086 | .086 | .069 | .058 | .051 | .036 | .030 | .019 | .014 |

0.8 | .500 | .425 | .701 | .586 | .755 | .609 | .700 | .508 | .492 | .221 |

1.2 | .981 | .982 | .999 | .997 | .997 | .994 | .993 | .981 | .930 | .788 |

Extreme value | 0 | .145 | .314 | .138 | .332 | .108 | .280 | .064 | .176 | .022 | .074 |

0.4 | .129 | .165 | .119 | .111 | .092 | .062 | .058 | .028 | .023 | .008 |

0.8 | .272 | .223 | .342 | .222 | .366 | .182 | .309 | .120 | .164 | .032 |

1.2 | .518 | .467 | .703 | .579 | .762 | .592 | .736 | .505 | .517 | .198 |

Half-normal | 0 | .164 | .454 | .155 | .523 | .128 | .466 | .088 | .325 | .026 | .101 |

0.4 | .128 | .219 | .097 | .185 | .066 | .124 | .042 | .058 | .014 | .012 |

0.8 | .212 | .163 | .251 | .111 | .254 | .074 | .200 | .050 | .099 | .013 |

1.2 | .444 | .298 | .609 | .358 | .675 | .374 | .625 | .309 | .417 | .126 |

Mixed-normal | 0 | .158 | .191 | .167 | .233 | .139 | .238 | .105 | .197 | .042 | .112 |

0.4 | .156 | .114 | .169 | .103 | .142 | .083 | .111 | .038 | .056 | .011 |

0.8 | .489 | .389 | .658 | .558 | .716 | .597 | .680 | .464 | .464 | .167 |

1.2 | .836 | .785 | .964 | .944 | .984 | .963 | .974 | .922 | .873 | .612 |

Rayleigh | 0 | .130 | .350 | .111 | .347 | .071 | .276 | .044 | .161 | .015 | .045 |

0.4 | .136 | .227 | .121 | .156 | .096 | .086 | .062 | .029 | .021 | .006 |

0.8 | .262 | .268 | .317 | .252 | .331 | .176 | .282 | .106 | .134 | .028 |

1.2 | .481 | .466 | .640 | .556 | .710 | .526 | .678 | .434 | .454 | .172 |

Logistic | 0 | .109 | .174 | .089 | .123 | .057 | .068 | .033 | .028 | .012 | .011 |

0.4 | .180 | .281 | .199 | .290 | .187 | .227 | .140 | .141 | .064 | .054 |

0.8 | .360 | .512 | .486 | .632 | .530 | .608 | .476 | .500 | .289 | .231 |

1.2 | .590 | .777 | .787 | .893 | .856 | .903 | .832 | .859 | .637 | .565 |

In the case of the normal distribution, both correlations were close to zero under all conditions. Apparently the degree of heterogeneity of a given pair of samples provides no information about the value of the *t* statistic or the likelihood of rejecting *H*_{0}, contrary to the belief that prior examination of variance heterogeneity is beneficial in hypothesis testing. Although advance knowledge of variance heterogeneity at the population level can warn researchers away from conventional two-sample tests of location, such knowledge apparently is not very helpful when obtained from the particular samples that happen to be available. In the case of non-normal distributions, the correlations in Table 3 were higher, but in those cases, as we have seen, the effects of heterogeneity are anomalous, and selection of samples is still of little or no advantage in hypothesis testing.
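A correlation of this kind is easy to reproduce in a small simulation. The sketch below (standard-library Python only; the function names are ours, not the paper's) correlates the larger-to-smaller SD ratio of each pair of normal samples with a 0/1 indicator of rejection by the pooled-variance *t* test, using the fixed critical value *t*_{.975}(18) ≈ 2.10 for *n*_{1} = *n*_{2} = 10 as an assumed constant. For normal populations with equal variances the resulting point-biserial correlation is near zero, as the text describes.

```python
import math
import random
from statistics import mean, stdev

def pearson(a, b):
    """Pearson correlation; with a dichotomous variable in one slot
    this is the point-biserial correlation."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)
    return cov / (stdev(a) * stdev(b))

def heterogeneity_vs_rejection(n=10, reps=20000, t_crit=2.101, seed=2):
    """Correlate the SD ratio of each sample pair with whether the
    pooled-variance t test rejects H0 (normal populations, H0 true)."""
    rng = random.Random(seed)
    ratios, rejected = [], []
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]
        s1, s2 = stdev(x), stdev(y)
        sp2 = (s1 * s1 + s2 * s2) / 2          # pooled variance (equal n)
        t = (mean(x) - mean(y)) / math.sqrt(sp2 * 2 / n)
        ratios.append(max(s1 / s2, s2 / s1))   # heterogeneity of this pair
        rejected.append(1.0 if abs(t) > t_crit else 0.0)
    return pearson(ratios, rejected)

print(heterogeneity_vs_rejection())  # near zero for normal populations
```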

### 7. Decisions based on preliminary tests of equality of variances


In practical hypothesis testing, a decision usually has to be made, by inspecting a single pair of samples, as to whether or not homogeneity of variance is satisfied. One strategy often employed is to use a preliminary test of equality of variances as a basis for the decision (Albers, Boon & Kallenberg, 2000; Brown & Forsythe, 1974; Gastwirth, Gel & Miao, 2009; Levene, 1960; O'Brien, 1981; Scheffé, 1959, 1970). Many preliminary tests have been proposed over the years, but the method has not been very successful. One problem with this approach is that the use of two hypothesis tests in succession compounds the Type I error rates. That is, the Type I error rate of the preliminary test of variance equality is superimposed on the Type I error probability of the subsequent hypothesis test of location (Zimmerman, 2004).

The method of the present study is, in a sense, a limiting case of the use of a preliminary test, because it is not subject to Type I errors resulting from initial sampling variability. That is, the simulation procedure reveals what might be expected of an ideal preliminary test based on sample statistics that makes no Type I errors. The finding that the explicit selection method in the present study was not successful suggests that a preliminary test of equality of variances cannot accomplish what is intended, even if it were to make no Type I errors at all.

Some further evidence for this interpretation is given by Figure 7. Here, the same procedures depicted in Figures 1 and 2 were repeated with normal and exponential distributions with the addition of two preliminary tests of equality of variances that have been used widely in the past, the *F* test and the Levene test. On each replication, one of those tests was first performed, and if it rejected the hypothesis of variance equality at the .05 significance level, that pair of samples was rejected and another sample was drawn, continuing until the preliminary test failed to reject the null hypothesis of equal variances.
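The two-stage procedure just described can be sketched in standard-library Python. In this sketch (our code, not the paper's), the preliminary test is a two-sided *F* test of the sample variances; the critical values *F*_{.975}(9, 9) ≈ 4.03 and *t*_{.975}(18) ≈ 2.10 are fixed constants assumed for *n*_{1} = *n*_{2} = 10 and α = .05 rather than computed. With normal populations of equal variance, the estimated Type I error rate of the conditional *t* test stays near the nominal .05.

```python
import math
import random
from statistics import mean, variance

# Assumed fixed critical values for n1 = n2 = 10 at alpha = .05:
# two-sided F test of variances on (9, 9) df, and t test on 18 df.
F_CRIT = 4.03    # approx. F_{.975}(9, 9)
T_CRIT = 2.101   # approx. t_{.975}(18)

def two_stage_type1(n=10, reps=20000, seed=1):
    """Estimate the Type I error rate of a pooled-variance t test
    performed only on sample pairs that pass a preliminary F test
    of equality of variances (pairs that fail are redrawn)."""
    rng = random.Random(seed)
    rejects = 0
    for _ in range(reps):
        while True:
            x = [rng.gauss(0, 1) for _ in range(n)]
            y = [rng.gauss(0, 1) for _ in range(n)]
            f = variance(x) / variance(y)
            if max(f, 1.0 / f) < F_CRIT:   # preliminary test does not reject
                break                      # keep this pair of samples
        sp2 = (variance(x) + variance(y)) / 2.0   # pooled variance, equal n
        t = (mean(x) - mean(y)) / math.sqrt(sp2 * 2.0 / n)
        if abs(t) > T_CRIT:
            rejects += 1
    return rejects / reps

print(two_stage_type1())  # estimated rejection rate under H0
```

Replacing the *F* test with a Levene-type test, or making the population variances unequal, only changes the preliminary-screening step; the compounding of the two decisions is the same.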

The graphs reveal that a test of location conducted only after a favourable outcome of either preliminary test of equality of variances resulted in a power curve slightly below the one obtained from simulations based on samples selected with an absolute cut-off. The introduction of preliminary tests did not move the power curves appreciably closer to the reference curve having the expected Type I error probability and no spurious increases in power.

It should be emphasized that these results do not call into question use of the Levene test as a test of equality of variances (see Gastwirth *et al*., 2009, for further details). Rather, they show that even an effective method used for that purpose is not suitable in conjunction with a *t* test as protection against variance heterogeneity, because Type I errors in the two separate steps are compounded. The present results suggest that any two-stage procedure would be subject to the same limitations, no matter how effective the preliminary test might be (see also Zimmerman, 2004).

### 8. Approximate *t* tests without pooled variance estimates


Another widely used attempt to overcome the effects of heterogeneous variances in the *t* test is the separate-variances approximation introduced by Welch (1938), Satterthwaite (1946), and Aspin (1948). In this method, the sample variances in the denominator of the *t* statistic are not pooled to estimate a common population variance but remain separate, and a compensating adjustment is made to the number of degrees of freedom. That procedure has proved successful in maintaining Type I error rates for normal distributions when variances and sample sizes are simultaneously unequal.
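Under the Welch–Satterthwaite approximation, the statistic and its degrees of freedom can be computed as in the following minimal sketch (standard-library Python; `welch_t` is a name of our choosing). When the two sample variances and the two sample sizes happen to be equal, the approximate degrees of freedom reduce to the pooled value *n*_{1} + *n*_{2} − 2.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Welch's separate-variances t statistic and its
    Welch-Satterthwaite approximate degrees of freedom."""
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x), variance(y)        # unbiased sample variances
    se2 = v1 / n1 + v2 / n2                  # squared standard error, unpooled
    t = (mean(x) - mean(y)) / math.sqrt(se2)
    # Welch-Satterthwaite degrees of freedom
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

# Equal sample variances and equal n: df equals n1 + n2 - 2 exactly
t, df = welch_t([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
print(round(t, 3), round(df, 3))  # → -10.0 8.0
```

In general, df falls between min(*n*_{1} − 1, *n*_{2} − 1) and *n*_{1} + *n*_{2} − 2, shrinking toward the smaller sample's degrees of freedom as the variance imbalance grows.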

On the other hand, the results of the method applied to non-normal distributions are not encouraging. Table 6 presents results comparing the Welch *t* test, using separate variances, with the Student *t* test, using pooled variances, for the normal and the same nine non-normal distributions previously studied. The method was applied to both scores and ranks. In the case of the normal distribution, the Welch test performed well and eliminated the inflation or depression of Type I error rates. For many of the non-normal distributions, however, the Type I error rates and power remained anomalous. In the case of the exponential, lognormal, extreme value, half-normal, and mixed-normal distributions, the Welch *t* test proved to be biased in the same way as the Student *t* test, despite the attempted correction for unequal variances.

Table 4. Proportion of samples in which the *t* statistic exceeded its critical value (α = .05) when the ratio of sample standard deviations, *r*, was greater than a cut-off value (*r* > *c*) or less than the cut-off value (*r* < *c*). Columns labelled *p*(*r* > *c*) give the proportion of all samples in which *r* > *c*.

*n*_{1} = 10 | *n*_{2} = 50 | σ_{1}/σ_{2} = 1.0 |
---|

Distribution | μ_{1} − μ_{2} | *c* = 1.0 | *c* = 1.2 | *c* = 1.3 |
---|

*p*(*r *> *c*) | *r* > *c* | *r* < *c* | *p*(*r *> *c*) | *r* > *c* | *r* < *c* | *p*(*r *> *c*) | *r* > *c* | *r* < *c* |
---|

Normal | 0 | .711 | .049 | .053 | .490 | .050 | .050 | .318 | .050 | .049 |

0.4 | .715 | .207 | .209 | .488 | .208 | .205 | .318 | .200 | .205 |

0.8 | .717 | .622 | .623 | .489 | .629 | .622 | .322 | .620 | .624 |

1.2 | .714 | .928 | .927 | .486 | .928 | .928 | .321 | .925 | .927 |

Exponential | 0 | .837 | .054 | .009 | .693 | .062 | .010 | .571 | .074 | .014 |

0.4 | .838 | .216 | .199 | .691 | .224 | .195 | .571 | .226 | .206 |

0.8 | .838 | .589 | .784 | .689 | .552 | .778 | .571 | .512 | .766 |

1.2 | .838 | .913 | .990 | .691 | .893 | .989 | .567 | .878 | .987 |

Laplace | 0 | .803 | .049 | .050 | .634 | .053 | .051 | .494 | .048 | .051 |

0.4 | .803 | .206 | .221 | .634 | .202 | .221 | .494 | .202 | .219 |

0.8 | .806 | .635 | .645 | .635 | .633 | .640 | .496 | .626 | .639 |

1.2 | .801 | .922 | .920 | .634 | .921 | .920 | .491 | .921 | .924 |

Lognormal | 0 | .900 | .053 | .006 | .809 | .059 | .006 | .726 | .063 | .010 |

0.4 | .901 | .259 | .321 | .809 | .253 | .311 | .733 | .245 | .305 |

0.8 | .898 | .643 | .895 | .805 | .617 | .903 | .726 | .584 | .897 |

1.2 | .899 | .889 | .994 | .807 | .877 | .994 | .729 | .866 | .992 |

Extreme value | 0 | .783 | .056 | .024 | .597 | .062 | .026 | .453 | .069 | .030 |

0.4 | .781 | .212 | .195 | .597 | .217 | .196 | .448 | .219 | .201 |

0.8 | .782 | .600 | .678 | .595 | .575 | .675 | .449 | .552 | .673 |

1.2 | .783 | .916 | .970 | .605 | .899 | .968 | .450 | .880 | .963 |

Half-normal | 0 | .758 | .058 | .022 | .552 | .068 | .024 | .391 | .084 | .027 |

0.4 | .756 | .208 | .188 | .550 | .213 | .190 | .391 | .219 | .194 |

0.8 | .755 | .584 | .676 | .551 | .554 | .672 | .391 | .511 | .664 |

1.2 | .757 | .908 | .978 | .552 | .888 | .974 | .392 | .863 | .970 |

Mixed normal | 0 | .932 | .049 | .012 | .867 | .051 | .012 | .795 | .053 | .011 |

0.4 | .931 | .232 | .168 | .864 | .232 | .184 | .799 | .237 | .194 |

0.8 | .932 | .616 | .653 | .864 | .607 | .693 | .796 | .587 | .727 |

1.2 | .931 | .916 | .967 | .866 | .911 | .973 | .797 | .905 | .976 |

Rayleigh | 0 | .722 | .055 | .032 | .504 | .062 | .034 | .340 | .071 | .035 |

0.4 | .724 | .209 | .191 | .503 | .208 | .197 | .341 | .215 | .200 |

0.8 | .724 | .603 | .647 | .504 | .589 | .642 | .340 | .566 | .638 |

1.2 | .727 | .919 | .959 | .507 | .904 | .955 | .339 | .889 | .950 |

Logistic | 0 | .762 | .050 | .048 | .559 | .050 | .051 | .401 | .047 | .051 |

0.4 | .762 | .205 | .201 | .562 | .203 | .212 | .408 | .202 | .216 |

0.8 | .762 | .626 | .633 | .560 | .624 | .628 | .405 | .622 | .636 |

1.2 | .761 | .923 | .926 | .560 | .921 | .930 | .404 | .922 | .926 |

*n*_{1} = 10 | *n*_{2} = 50 | σ_{1}/σ_{2} = 1.5 |

Normal | 0 | .894 | .129 | .149 | .788 | .133 | .146 | .678 | .125 | .150 |

0.4 | .893 | .249 | .263 | .790 | .245 | .265 | .680 | .240 | .274 |

0.8 | .894 | .535 | .553 | .787 | .533 | .559 | .679 | .525 | .563 |

1.2 | .894 | .805 | .833 | .785 | .807 | .834 | .679 | .808 | .826 |

Exponential | 0 | .865 | .135 | .060 | .746 | .149 | .067 | .642 | .162 | .064 |

0.4 | .867 | .254 | .039 | .748 | .284 | .043 | .639 | .321 | .050 |

0.8 | .869 | .567 | .264 | .748 | .610 | .283 | .643 | .649 | .306 |

1.2 | .869 | .843 | .794 | .745 | .850 | .795 | .642 | .861 | .791 |

Laplace | 0 | .866 | .133 | .089 | .748 | .142 | .096 | .637 | .145 | .096 |

0.4 | .869 | .253 | .235 | .743 | .263 | .234 | .638 | .266 | .240 |

0.8 | .868 | .546 | .569 | .746 | .544 | .573 | .638 | .543 | .570 |

1.2 | .866 | .812 | .863 | .749 | .803 | .863 | .636 | .796 | .857 |

Lognormal | 0 | .890 | .127 | .030 | .793 | .140 | .027 | .702 | .149 | .031 |

0.4 | .889 | .261 | .053 | .790 | .288 | .055 | .701 | .316 | .063 |

0.8 | .889 | .581 | .560 | .790 | .587 | .555 | .704 | .592 | .556 |

1.2 | .888 | .843 | .950 | .794 | .828 | .948 | .701 | .819 | .946 |

Extreme value | 0 | .864 | .128 | .141 | .739 | .130 | .135 | .633 | .132 | .128 |

0.4 | .866 | .257 | .087 | .744 | .282 | .103 | .631 | .304 | .110 |

0.8 | .865 | .559 | .313 | .743 | .600 | .325 | .631 | .636 | .347 |

1.2 | .864 | .850 | .692 | .742 | .867 | .705 | .630 | .888 | .722 |

Half-normal | 0 | .879 | .126 | .166 | .766 | .126 | .156 | .654 | .124 | .145 |

0.4 | .881 | .252 | .062 | .765 | .275 | .075 | .651 | .305 | .094 |

0.8 | .877 | .557 | .253 | .766 | .594 | .270 | .653 | .634 | .291 |

1.2 | .883 | .846 | .610 | .764 | .873 | .655 | .654 | .894 | .668 |

Mixed normal | 0 | .916 | .109 | .080 | .836 | .116 | .084 | .764 | .117 | .078 |

0.4 | .917 | .236 | .097 | .836 | .253 | .083 | .765 | .265 | .081 |

0.8 | .914 | .540 | .478 | .839 | .548 | .463 | .766 | .561 | .449 |

1.2 | .914 | .843 | .900 | .839 | .837 | .885 | .763 | .838 | .864 |

Rayleigh | 0 | .886 | .128 | .182 | .778 | .122 | .172 | .668 | .115 | .162 |

0.4 | .890 | .252 | .131 | .780 | .267 | .138 | .669 | .281 | .151 |

0.8 | .888 | .547 | .339 | .777 | .573 | .357 | .668 | .600 | .374 |

1.2 | .888 | .837 | .673 | .779 | .854 | .683 | .668 | .873 | .706 |

Logistic | 0 | .875 | .131 | .114 | .756 | .132 | .117 | .648 | .140 | .116 |

0.4 | .875 | .250 | .247 | .757 | .252 | .244 | .643 | .252 | .250 |

0.8 | .874 | .545 | .561 | .757 | .533 | .569 | .645 | .535 | .567 |

1.2 | .874 | .811 | .856 | .758 | .803 | .848 | .647 | .800 | .846 |

Table 5. Point-biserial correlation between ratio of sample standard deviations and rejection of *H*_{0} by individual pairs of samples (*r*_{PB}), and Pearson correlation between ratio of sample standard deviations and the *t* statistic (*r*).

Distribution | σ_{1}/σ_{2} = 1.0 | σ_{1}/σ_{2} = 1.5 |
---|

*r* _{PB} | *r* | *r* _{PB} | *r* |
---|

Normal | −.001 | .005 | −.032 | −.037 |

Exponential | .227 | .429 | .279 | .311 |

Laplace | −.022 | −.060 | .095 | .110 |

Lognormal | .180 | .361 | .333 | .411 |

Extreme value | .135 | .216 | .088 | .101 |

Half-normal | .201 | .302 | .057 | .055 |

Mixed normal | .122 | .230 | .290 | .334 |

Rayleigh | .098 | .142 | −.012 | −.021 |

Logistic | −.006 | −.020 | .034 | .043 |

Table 6. Probability of rejecting *H*_{0} by Student *t* test (*t*) and Welch approximate *t* test (*t*_{W}) performed on scores and on ranks, as a function of difference between means (μ_{1} − μ_{2}), for unequal sample sizes and unequal variances (σ_{1}/σ_{2} = 2.0).

Distribution | μ_{1} − μ_{2} | Test on scores | Test on ranks |
---|

*n*_{1} = 10 *n*_{2} = 50 | *n*_{1} = 50 *n*_{2} = 10 | *n*_{1} = 10 *n*_{2} = 50 | *n*_{1} = 50 *n*_{2} = 10 |
---|

*t* | *t* _{W} | *t* | *t* _{W} | *t* | *t* _{W} | *t* | *t* _{W} |
---|

Normal | 0 | .205 | .052 | .003 | .047 | .115 | .059 | .014 | .060 |

0.4 | .288 | .092 | .028 | .152 | .171 | .093 | .054 | .151 |

0.8 | .484 | .198 | .135 | .430 | .335 | .210 | .221 | .428 |

1.2 | .709 | .390 | .403 | .774 | .536 | .382 | .540 | .754 |

Exponential | 0 | .208 | .097 | .009 | .047 | .233 | .115 | .078 | .276 |

0.4 | .239 | .050 | .004 | .207 | .116 | .057 | .012 | .061 |

0.8 | .459 | .110 | .092 | .497 | .224 | .161 | .142 | .236 |

1.2 | .730 | .345 | .448 | .767 | .650 | .632 | .643 | .620 |

Laplace | 0 | .199 | .046 | .004 | .049 | .096 | .052 | .015 | .056 |

0.4 | .288 | .097 | .023 | .156 | .199 | .127 | .101 | .215 |

0.8 | .505 | .245 | .142 | .470 | .441 | .308 | .403 | .572 |

1.2 | .730 | .451 | .423 | .776 | .683 | .533 | .766 | .853 |

Lognormal | 0 | .201 | .145 | .021 | .037 | .336 | .183 | .197 | .525 |

0.4 | .232 | .047 | .005 | .258 | .114 | .063 | .014 | .057 |

0.8 | .489 | .130 | .142 | .610 | .505 | .485 | .481 | .457 |

1.2 | .784 | .546 | .546 | .805 | .981 | .994 | .926 | .850 |

Extreme value | 0 | .204 | .063 | .004 | .049 | .142 | .069 | .026 | .098 |

0.4 | .262 | .055 | .009 | .179 | .124 | .065 | .022 | .086 |

0.8 | .475 | .142 | .106 | .476 | .261 | .175 | .157 | .304 |

1.2 | .720 | .351 | .429 | .771 | .519 | .394 | .521 | .631 |

Half-normal | 0 | .209 | .070 | .005 | .049 | .157 | .075 | .022 | .113 |

0.4 | .260 | .062 | .011 | .170 | .117 | .058 | .014 | .076 |

0.8 | .464 | .130 | .106 | .456 | .216 | .130 | .096 | .223 |

1.2 | .714 | .325 | .407 | .759 | .440 | .330 | .424 | .545 |

Mixed normal | 0 | .190 | .108 | .016 | .042 | .152 | .081 | .045 | .131 |

0.4 | .227 | .036 | .004 | .257 | .150 | .089 | .054 | .144 |

0.8 | .465 | .138 | .086 | .567 | .480 | .363 | .470 | .577 |

1.2 | .756 | .456 | .482 | .720 | .838 | .744 | .871 | .852 |

Rayleigh | 0 | .206 | .059 | .005 | .055 | .124 | .064 | .014 | .074 |

0.4 | .269 | .067 | .012 | .170 | .135 | .071 | .025 | .098 |

0.8 | .471 | .158 | .115 | .461 | .258 | .161 | .140 | .304 |

1.2 | .713 | .345 | .415 | .761 | .471 | .344 | .455 | .622 |

Logistic | 0 | .195 | .049 | .003 | .050 | .110 | .056 | .011 | .060 |

0.4 | .286 | .090 | .022 | .153 | .178 | .102 | .065 | .170 |

0.8 | .492 | .216 | .129 | .445 | .368 | .245 | .283 | .480 |

1.2 | .721 | .417 | .425 | .784 | .602 | .453 | .629 | .795 |

In the case of the test on ranks applied to the non-normal distributions, the results were similar. The Type I error rates remained too high or too low for most distributions. Furthermore, the bias of the test on ranks was even more extreme than the bias of the test on scores for the same five non-normal distributions mentioned above. More detailed power curves are exhibited in Figures 8 and 9. In both figures, sample sizes are *n*_{1} = 10 and *n*_{2} = 50, and, except for the curves with filled circles, the variances were unequal (σ_{1}/σ_{2} = 2.0), conditions under which inflation of the power curves is expected in the case of normal distributions. The significance level was .05. The figures compare the *t* test applied to the same sample sizes when the variances are equal (filled circles) with the *t* test and the Welch test when variances are unequal.

In the case of the normal distribution, the Welch test restored the inflated Type I error rate to its usual value, for both scores and ranks. However, the power remained considerably below the usual value found when variances are equal. In the case of the exponential distribution those anomalous effects were magnified. For both the Student *t* test and the Welch *t* test applied to ranks, there was extreme bias in the case of the exponential distribution.

Apparently the Welch test and related separate-variances tests do an excellent job of maintaining Type I error rates for normal distributions. However, they are ineffective in maintaining either Type I error rates or power in the case of many non-normal distributions, and even for normal distributions their power is not what might be desired.

### 9. Some practical implications


It is not easy to decide when the assumption of homogeneity of variance in hypothesis testing is violated and when it is not. Discussions of the topic in the past have not always distinguished between population variances and sample variances. It is possible for two samples drawn from populations with decidedly different variances to have nearly the same variances, as a result of ordinary sampling variability. It is also possible for two samples drawn from a single distribution, or from two distributions with the same variance, to have quite different variances. Researchers seldom have prior knowledge of the specific forms of distributions, so there is usually some uncertainty and a need to make decisions based on inspection of samples.

An idea of what can be expected in practice is given by Figures 5 and 6, which plot relative frequency distributions of the individual ratios of standard deviations of pairs of samples, *s*_{1}/*s*_{2}, taken from normal, exponential, and lognormal distributions over 100,000 replications, when the difference between means was 0.5σ. It is clear that the sample ratios covered an extensive range of values, and it would be difficult to decide in any one case about the suitability of a test of location. In some instances (Figure 5), the populations had equal variances, but samples falling in a fairly large region in the tails of the distributions would be taken as evidence of a violation of the assumption when there is none. In other instances (Figure 6), the population variances were unequal, but many pairs of samples in the central part of the distribution could be taken as evidence that homogeneity is satisfied when it is not. Despite selection, there is still an extensive distribution of ratios that are not close to 1.00. The extent to which these possibilities would arise in practice depends on sample sizes and the consequent variability of the ratios.
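The sampling variability of the SD ratio is easy to demonstrate directly. The sketch below (our code, not the paper's) draws pairs of samples of size 10 from a single standard normal population, so that the population variances are exactly equal, and counts how often the larger-to-smaller SD ratio nevertheless exceeds each cut-off used in the study:

```python
import random
from statistics import stdev

def ratio_exceeds(cut, n=10, reps=10000, seed=3):
    """Proportion of sample pairs, both drawn from one standard normal
    population, whose SD ratio (larger over smaller) exceeds `cut`."""
    rng = random.Random(seed)
    count = 0
    for _ in range(reps):
        s1 = stdev([rng.gauss(0, 1) for _ in range(n)])
        s2 = stdev([rng.gauss(0, 1) for _ in range(n)])
        if max(s1 / s2, s2 / s1) > cut:
            count += 1
    return count / reps

for cut in (1.1, 1.2, 1.3):
    print(cut, ratio_exceeds(cut))
```

With samples this small, a substantial fraction of pairs exceeds even the most lenient cut-off despite perfectly homogeneous populations, which is exactly the difficulty the figures illustrate.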

The simulations in the present study revealed that arbitrary selection of samples with nearly equal variances, specifically ratios of standard deviations as small as 1.1, 1.2, or 1.3, did not improve the robustness of the Student *t* test or its non-parametric counterpart, the Wilcoxon–Mann–Whitney test, when population variances were unequal. On the contrary, selection of homogeneous samples often made the situation worse. When normal distributions had equal variances, the Type I error rates and power of the Student *t* test performed on selected or unselected samples were almost indistinguishable. However, for non-normal distributions, Type I error rates typically fell considerably below the nominal significance level, and power functions were anomalous. In the case of the *t* test on ranks, the Type I error rates for unselected samples were maintained, as is well known. However, for selected samples, that desirable property of non-parametric methods was no longer evident, and the Type I error rates declined below the significance level. Often there was extreme bias and power declined as the difference between means increased.

When population variances were heterogeneous, the Type I errors and power of tests on unselected samples exhibited the usual inflation or depression that occurs with unequal sample sizes. However, both *t* and *t*_{R} performed on selected samples resulted in an even greater degree of inflation or depression. As a general rule, selection of samples to ensure homogeneity appears to move the probabilities of Type I and Type II errors still further in the direction that the selection process is intended to counteract.

In the case of normal distributions, there is no correlation between the presence of unequal variances in a pair of samples and the outcome of a hypothesis test based on those samples. In the case of non-normal distributions there are some correlations, but not in a direction that would be useful in practical research. It may be necessary to accept the fact that further modifications or adjustments to the *t* test and Wilcoxon–Mann–Whitney test cannot produce a method that will protect Type I error rates and power under variance heterogeneity. On the other hand, it is possible that newer methods, not necessarily based on selection of samples or on ranking, can better handle the two-sample location problem when variances are unequal. Bootstrap methods, use of trimmed means, and other techniques that have been explored in recent years (Wilcox, 2003, 2005) may prove more useful in practical research.