Abstract

  1. Top of page
  2. Abstract
  3. 1. Introduction
  4. 2. Simulation method
  5. 3. Results of simulations
  6. 4. Results of explicit selection of samples
  7. 5. Power curves with and without selection
  8. 6. Correlation of sample heterogeneity and probability of rejecting H0
  9. 7. Decisions based on preliminary tests of equality of variances
  10. 8. Approximate t tests without pooled variance estimates
  11. 9. Some practical implications
  12. References

The two-sample Student t test of location was performed on random samples of scores and on rank-transformed scores from normal and non-normal population distributions with unequal variances. The same test also was performed on scores that had been explicitly selected to have nearly equal sample variances. The desired homogeneity of variance was brought about by repeatedly rejecting pairs of samples having a ratio of standard deviations that exceeded a predetermined cut-off value of 1.1, 1.2, or 1.3, while retaining pairs with ratios less than the cut-off value. Despite this forced conformity with the assumption of equal variances, the tests on the selected samples were no more robust than tests on unselected samples, and in most cases substantially less robust. Under conditions where sample sizes were unequal, so that Type I error rates were inflated and power curves were atypical, the selection procedure produced still greater inflation and distortion of the power curves.

1. Introduction

Numerous studies over the years have examined the extent to which familiar hypothesis tests of location, such as the Student t test, the ANOVA F test, and non-parametric counterparts of those methods, are robust to heterogeneity of variance of treatment groups, also called heteroscedasticity (Boneau, 1960; Hsu, 1938; Overall, Atlas & Gibson, 1995; Posten, Yeh & Owen, 1982; Scheffé, 1959, 1970). Typically, studies have found those widely used tests to be very sensitive to variance differences, with unfavourable effects on Type I error rates and power, especially when sample sizes are unequal.

Modifications of the familiar methods have been successful under some conditions (for example, Aspin, 1948; Satterthwaite, 1946; Welch, 1938, 1947), but often are not robust under other conditions that frequently arise in practice. For example, a modified test that successfully protects the Type I error rate for normal distributions may not have sufficient power or may fail entirely for non-normal distributions (see, for example, Bradstreet, 1997; Cressie & Whitford, 1986; Fagerland & Sandvik, 2009; Stonehouse & Forrester, 1998).

The purpose of the present paper is to examine the consequences of making a decision about heterogeneity of variance by inspecting sample data, under specific assumptions about the distributions of populations and sample sizes. If population distributions do in fact have unequal variances, what is the outcome when a researcher decides, by inspecting samples with only slightly different variances, that a Student t test is sufficiently robust to proceed? On the other hand, if population variances are in fact equal, how often will sample variances turn out to be noticeably different, to the extent that a researcher would not proceed?

An example of the problem can be seen from the data in Table 1, which illustrates that samples with heterogeneous variances can be drawn from homogeneous populations, and homogeneous samples can be drawn from heterogeneous populations. In both parts of the table, samples were randomly drawn from populations with no differences between means. In Table 1(a), the variances were the same in both populations, that is, σ1/σ2 = 1.0. However, as a result of random sampling, it turned out that the sample standard deviations were decidedly unequal, s1/s2 = 5.496. Normally, researchers would be inclined to reject such data as not suitable for statistical analysis using the Student t test. However, the t statistic is actually 0.433 for these data, which would yield a correct statistical decision.

Table 1. Examples of samples with (a) heterogeneous variances drawn from a homogeneous population and (b) homogeneous samples drawn from a heterogeneous population

(a)                 (b)
x1      x2          x1      x2
8.9     10.1        7.4     11.9
11.5    9.6         9.1     9.6
12.1    10.3        8.6     11.1
11.2    10.5        6.8     6.9
8.7     9.7         5.5     9.3
7.0     9.8                 9.9
10.6    10.1                10.7
9.9     10.3                10.8
8.6     10.0                9.5
9.6     9.7                 8.8
                            12.3
                            11.0
μ1 − μ2 = 0         μ1 − μ2 = 0
σ1/σ2 = 1.0         σ1/σ2 = 5.0
s1 = 2.439          s1 = 2.090
s2 = 0.081          s2 = 2.180
s1/s2 = 5.496       s1/s2 = 1.021
t = 0.433           t = 3.415
p < .357            p < .003

For the scores in Table 1(b), the variances of the populations were heterogeneous, with σ1/σ2 = 5.0. Nevertheless, it turned out that the sample standard deviations indicated homogeneity, that is, s1/s2 = 1.021, which would be suitable for a t test. But, because the population means are equal, the t statistic of 3.415 would yield an incorrect statistical decision. These examples are extreme in order to indicate the problem and are not representative of hypothesis testing in practical research. Nevertheless, the extent to which similar discrepancies occur as a result of random sampling is largely unknown. The present study employed simulations to provide an indication of how likely those outcomes may be and how Type I error rates and power are affected under various assumptions about distributions and sample sizes.
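As a check on the arithmetic, the following Python sketch recomputes the ratio of sample standard deviations and the pooled-variance Student t statistic from the scores printed in Table 1(b). The function name pooled_t and the larger-over-smaller definition of the ratio are assumptions made for illustration, and because the printed scores are rounded, the results agree with the reported values only approximately.

```python
import math
import statistics

def pooled_t(x1, x2):
    """Two-sample Student t statistic with a pooled variance estimate."""
    n1, n2 = len(x1), len(x2)
    v1, v2 = statistics.variance(x1), statistics.variance(x2)  # n - 1 denominators
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))
    return (statistics.mean(x1) - statistics.mean(x2)) / se

# Scores as printed in Table 1(b)
x1_b = [7.4, 9.1, 8.6, 6.8, 5.5]
x2_b = [11.9, 9.6, 11.1, 6.9, 9.3, 9.9, 10.7, 10.8, 9.5, 8.8, 12.3, 11.0]

s1, s2 = statistics.stdev(x1_b), statistics.stdev(x2_b)
sd_ratio = max(s1, s2) / min(s1, s2)  # larger over smaller (an assumption)
t_b = pooled_t(x1_b, x2_b)
print(f"SD ratio = {sd_ratio:.3f}, |t| = {abs(t_b):.3f}")
```

The nearly equal sample standard deviations and the large t statistic appear together, even though the populations were specified with equal means.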

Pairs of samples were repeatedly drawn from normal and non-normal population distributions having predetermined parameters. On each replication of the sampling procedure, a record was made of the ratio of the sample standard deviations. In addition, a record was made of whether or not the difference in means of the two samples on that particular occasion exceeded the cut-off value for significance at the .05 level. The data from individual replications made it possible to compare the probability of rejecting the null hypothesis on trials where the ratio exceeded a certain cut-off value and trials on which it did not. The data also indicated how that probability depended on the nature of the distributions, the sample sizes, and the extent of the inequality of variances in the populations. Some results of transformation to ranks, preliminary tests of equality of variances, and separate-variances tests such as the Welch test were also investigated.
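The record-keeping described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the study's own program (which was written in Mathematica): the populations are normal, n1 = n2 = 30, the ratio is taken as larger over smaller standard deviation, and the critical value is hard-coded for 58 degrees of freedom. All function names are hypothetical.

```python
import math
import random
import statistics

T_CRIT_DF58 = 2.0017  # two-sided .05 critical value of t for 58 df (n1 + n2 = 60 only)

def one_replication(rng, n1=30, n2=30, sd1=1.0, sd2=1.0, diff=0.0):
    """Draw one pair of normal samples; return (SD ratio, H0 rejected?)."""
    x1 = [rng.gauss(diff, sd1) for _ in range(n1)]
    x2 = [rng.gauss(0.0, sd2) for _ in range(n2)]
    v1, v2 = statistics.variance(x1), statistics.variance(x2)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t = (statistics.mean(x1) - statistics.mean(x2)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    ratio = math.sqrt(max(v1, v2) / min(v1, v2))  # larger over smaller (assumption)
    return ratio, abs(t) > T_CRIT_DF58

def conditional_rejection_rates(reps=2000, cutoff=1.2, seed=1):
    """Return (proportion of ratios above cut-off, rejection rate above, rejection rate below)."""
    rng = random.Random(seed)
    records = [one_replication(rng) for _ in range(reps)]
    above = [rejected for ratio, rejected in records if ratio > cutoff]
    below = [rejected for ratio, rejected in records if ratio <= cutoff]

    def rate(flags):
        return sum(flags) / len(flags) if flags else float("nan")

    return len(above) / reps, rate(above), rate(below)
```

Calling conditional_rejection_rates() with other cut-off values gives the kind of above/below comparison that the text describes, here for the null case only.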

2. Simulation method

The simulations were programmed using Mathematica, version 4.1, together with Mathematica statistical add-on packages (Wolfram, 1999).1 With the exception of the mixed normal distribution, random deviates were obtained from built-in functions in the statistical add-on packages, and the programming of the simulation procedures was also done with the Mathematica programming language. The main concern of the study was finding accurate estimates of Type I error probabilities and power for significance tests of location (the t test on initial scores and the t test on ranks), for normal and eight non-normal distributions, and for both selected and unselected samples.

The number of iterations of the sampling procedure for each condition was 10,000 for Tables 6 and 4, 50,000 for Table 4, and 20,000 for Table 5. Each point representing a probability in Figures 1-4 and 7-9 was obtained from 10,000 iterations, and each frequency distribution in Figures 5 and 6 from 100,000 iterations. Those numbers were chosen to be sufficiently large to give accurate estimates of conditional probabilities, when only a fraction of the total number of iterations contributed to an estimated probability of interest.
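To see why the iteration counts matter for conditional probabilities, the binomial standard error of an estimated probability can be computed directly. The short Python sketch below is illustrative only; the function name and the example fraction of 0.25 are assumptions, not values from the study.

```python
import math

def mc_standard_error(p, iterations, fraction_used=1.0):
    """Binomial standard error of an estimated probability p when only
    a fraction of the iterations contributes to the estimate."""
    effective = iterations * fraction_used
    return math.sqrt(p * (1.0 - p) / effective)

se_all = mc_standard_error(0.05, 10_000)         # every iteration contributes
se_part = mc_standard_error(0.05, 10_000, 0.25)  # hypothetical: one quarter contributes
print(f"{se_all:.4f} {se_part:.4f}")
```

When only a quarter of the iterations meets a condition, the standard error doubles, which is why larger totals are needed for estimates that depend on a subset of replications.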

Figure 1. Probability of rejecting H0 by the Student t test on scores from normal and exponential distributions with and without selection of homogeneous samples.

Figure 2. Probability of rejecting H0 by the Student t test on ranks of scores from exponential and lognormal distributions with and without selection of homogeneous samples.

Figure 3. Probability of rejecting H0 by the Student t test on scores from normal and exponential distributions for samples with variance ratios above and below a cut-off value.

Figure 4. Probability of rejecting H0 by the Student t test on scores from lognormal and extreme value distributions for samples with variance ratios above and below a cut-off value.

Figure 5. Relative frequency distributions of ratios of sample standard deviations of pairs of samples when the t statistic is above or below its critical value. Ratio of population standard deviations was 1.0.

Figure 6. Relative frequency distributions of ratios of sample standard deviations when the t statistic is above or below its critical value. Ratio of population standard deviations was 1.5.

Non-parametric rank tests are often used to protect the significance level and to increase power for non-normal data. It is known that a rank transformation, in which a Student t test is performed on ranks replacing the original scores, has essentially the same outcome as the normal approximation form of the Wilcoxon–Mann–Whitney rank-sum test performed on the same data (Conover & Iman, 1980). In the present study, some comparisons were made between the t test on initial scores resulting from the simulations and the t test on corresponding ranks of the scores in the combined samples. Therefore, some of the findings in the study can be regarded as comparisons between the parametric t test and the non-parametric Wilcoxon–Mann–Whitney test. Throughout the study, the hypothesis tested, for both t and tR, was H0: |μ1 − μ2| = 0 against the alternative |μ1 − μ2| > 0.
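The rank transformation described above can be sketched in Python as follows: the two samples are combined, scores are replaced by their ranks in the combined sample, and the ordinary pooled-variance t statistic is computed on the ranks. The helper names are hypothetical, and the use of average ranks for ties is an assumption, since the text does not say how ties were handled.

```python
import math
import statistics

def midranks(values):
    """Ranks 1..N of the values, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def pooled_t(x1, x2):
    """Two-sample Student t statistic with a pooled variance estimate."""
    n1, n2 = len(x1), len(x2)
    pooled = ((n1 - 1) * statistics.variance(x1)
              + (n2 - 1) * statistics.variance(x2)) / (n1 + n2 - 2)
    return (statistics.mean(x1) - statistics.mean(x2)) / math.sqrt(pooled * (1 / n1 + 1 / n2))

def t_on_ranks(x1, x2):
    """tR: the Student t statistic computed on ranks of the combined samples."""
    ranks = midranks(list(x1) + list(x2))
    return pooled_t(ranks[:len(x1)], ranks[len(x1):])
```

Because only the ordering of the scores enters tR, replacing an extreme score by an even more extreme one leaves the statistic unchanged.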

The parameters of the nine distributions used in the simulations are described in Table 2 (see also Evans, Hastings & Peacock, 2000). Random deviates from all distributions except the mixed normal were obtained from the Mathematica statistical add-on package, using the parameters given in Table 2. A mixed normal distribution, used to represent samples with outliers, consisted of samples from N(0, 1) with probability .95 and from N(8, 1) with probability .05. To make comparisons possible, all distributions were standardized to have mean 0 and variance 1.

Table 2. Skewness and kurtosis of distributions. All distributions were standardized to μ = 0 and σ = 1

Distribution             Parameters                                     Skewness   Kurtosis
Normal                   μ = 0, σ = 1                                   0          3
Exponential              Scale parameter 1                              2          9
Laplace                  Location parameter 1, scale parameter 1        0          6
Lognormal                Derived from N(0,1)                            6.185      113.936
Extreme value (Gumbel)   Location parameter 1, scale parameter 1        1.140      5.400
Half-normal              Scale parameter 0.75                           0.995      3.869
Mixed normal             N(0,1) with prob. .95, N(8,1) with prob. .05   2.697      11.539
Rayleigh                 Scale parameter 1                              6.325      3.250
Logistic                 Location parameter 0, scale parameter 0.5      0          4.2
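The mixed normal distribution and its standardization can be sketched in Python as follows. The mixture probabilities and component means are those stated in the text; the function names are hypothetical, and the moment formulas are the standard ones for a mixture of unit-variance normal components. The analytic skewness and kurtosis come out close to the mixed normal entries in Table 2.

```python
import math
import random

P_OUT, MU_OUT = 0.05, 8.0                             # outlier probability and mean (from the text)
MIX_MEAN = P_OUT * MU_OUT                             # 0.4
MIX_VAR = 1.0 + P_OUT * (1.0 - P_OUT) * MU_OUT ** 2   # 4.04

def standardized_mixed_normal(rng):
    """One deviate from the mixture, rescaled to mean 0 and variance 1."""
    raw = rng.gauss(MU_OUT, 1.0) if rng.random() < P_OUT else rng.gauss(0.0, 1.0)
    return (raw - MIX_MEAN) / math.sqrt(MIX_VAR)

def mixture_shape():
    """Analytic skewness and kurtosis of the two-component mixture."""
    # (weight, component mean minus overall mean); both components have variance 1
    components = [(1.0 - P_OUT, 0.0 - MIX_MEAN), (P_OUT, MU_OUT - MIX_MEAN)]
    m3 = sum(p * (d ** 3 + 3.0 * d) for p, d in components)
    m4 = sum(p * (d ** 4 + 6.0 * d ** 2 + 3.0) for p, d in components)
    return m3 / MIX_VAR ** 1.5, m4 / MIX_VAR ** 2
```

Standardizing does not change skewness or kurtosis, so the shape values apply to the rescaled deviates as well.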

3. Results of simulations

Table 6 indicates that both the t test and its non-parametric counterpart based on ranks respond to the various normal and non-normal distributions in Table 2 in ways that are consistent with previous simulation studies. At the same time, the data reveal some unexpected effects when samples were selected so that variances were homogeneous. First, it is well known that non-normality often severely disrupts both Type I error rates and power of the t test for many distributions, whereas non-parametric tests, such as the Wilcoxon–Mann–Whitney test and the Wilcoxon signed-ranks test, restore the probabilities to values typical of a normal distribution.

However, many studies over the years have indicated that heterogeneity of variance, in combination with unequal sample sizes, disrupts the Type I error rates and power of the t test (see, for example, Hsu, 1938; Scheffé, 1959, 1970; Stonehouse & Forrester, 1998). When a larger variance is associated with a smaller sample size, both the Type I error probability and power of the t test are spuriously elevated, and when a larger variance is associated with a larger sample size, the probabilities are depressed. When sample sizes are equal, the influence of variance heterogeneity is considerably smaller.

The data in Table 6 consistently exhibit those familiar effects for the various distributions and combinations of sample sizes. When n1 = n2 = 30, the t test on scores performed well for normal distributions despite inequality of variances. However, when sample sizes were unequal and the larger sample size was associated with the smaller variance (the first two pairs of columns in the table), both the Type I error rates and power of t were inflated, and the same was true of tR to a somewhat lesser extent. The influence of unequal variances was consistently greater when n1 = 10 and n2 = 50 than when n1 = 20 and n2 = 40.

In a similar way, when sample sizes were unequal and the larger sample size was associated with the larger variance (the last two pairs of columns in the table), both Type I error rates and power were depressed below the nominal significance level. Moreover, the effect was larger when n1 = 50 and n2 = 10 than when n1 = 40 and n2 = 20. The changes associated with unequal variances were not eliminated when the t test on ranks replaced the t test on scores.

4. Results of explicit selection of samples

A picture quite different from what might be expected emerged when explicit selection ensured nearly equal variances in those pairs of samples retained for the hypothesis test. In the case of the normal distribution with equal population variances, the results were almost the same for selected and unselected samples, and that was true for both t and tR. There was little or no effect on Type I error rates or power. In cases where variances were unequal and the larger variance was associated with the smaller sample, selection of homogeneous samples resulted in a change for the worse. The Type I error rates were inflated still further above the nominal significance level, and the probability of rejecting H0 for non-zero differences between means was spuriously increased still more. Again, the effect was similar for both t and tR, although the changes were somewhat less for tR. When a larger variance was associated with a larger sample size, selection brought the Type I error probability slightly closer to the significance level.

In the case of non-normal distributions, when variances were equal, selection typically reduced the Type I error probabilities for tR substantially below the nominal significance level. That is, the non-parametric method no longer maintained the Type I error rate close to the significance level despite non-normality. The probabilities were close to zero in many instances. For a difference between means of 0.4, selection of homogeneous samples typically reduced the probability of rejecting H0 slightly, and for differences of 0.8 and 1.2 selection increased the probability of rejecting H0, in many cases to a much larger extent.

The table indicates that both t and tR were biased in the case of several non-normal distributions, more severely for tR than for t. That is, the probability of rejecting H0 actually decreased as the difference between means increased. For example, consider tR performed on the exponential and lognormal distributions when σ1/σ2 = 2.0. For those distributions, the probability of rejecting H0 when μ1 − μ2 = 0.4 was considerably less than the Type I error probability. The same bias also occurred for the extreme value, half-normal, and mixed normal distributions.

For non-normal distributions with unequal variances, the selection process often resulted in extensive bias where it did not exist before, for both t and tR. For example, consider the exponential distribution for σ1/σ2 = 1.6 and μ1 − μ2 of 0, 0.4, and 0.8. For all combinations of sample sizes, the probability decreased when the difference was 0.4 and then increased again when it was 0.8. The same occurred for the lognormal, extreme value, and half-normal distributions.
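The explicit selection of homogeneous samples can be sketched in Python as a rejection-sampling loop: pairs are drawn repeatedly, and a pair is retained only when its ratio of sample standard deviations falls below the cut-off. This is an illustration with normal populations, not the study's Mathematica program; the function names, the larger-over-smaller definition of the ratio, and the max_tries safeguard are assumptions.

```python
import random
import statistics

def sd_ratio(x1, x2):
    """Ratio of sample standard deviations, larger over smaller (assumption)."""
    s1, s2 = statistics.stdev(x1), statistics.stdev(x2)
    return max(s1, s2) / min(s1, s2)

def draw_selected_pair(rng, n1, n2, sd1, sd2, cutoff=1.2, diff=0.0, max_tries=100_000):
    """Redraw pairs of normal samples until the SD ratio is below the cut-off.

    Returns the retained pair and the number of pairs drawn."""
    for tries in range(1, max_tries + 1):
        x1 = [rng.gauss(diff, sd1) for _ in range(n1)]
        x2 = [rng.gauss(0.0, sd2) for _ in range(n2)]
        if sd_ratio(x1, x2) < cutoff:
            return x1, x2, tries
    raise RuntimeError("no pair met the cut-off within max_tries")
```

When the population standard deviations differ, most pairs are discarded, so the retained pairs are an unusual subset of all possible samples rather than a representative one.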

5. Power curves with and without selection

A more detailed picture of some of these relationships is given in Figures 1 and 2, which plot the probability of rejecting H0 as a function of μ1 − μ2, over a range from 0 to 1, in increments of 0.1. The upper part of Figure 1 shows power curves for the normal distribution when variances were equal but sample sizes were unequal (square symbols), and the function was typical of a power curve of a normal distribution. When variances were unequal (filled circles), the Type I error probability was inflated, and the increase in power as μ1 − μ2 increased was considerably smaller than usual. When only pairs of samples having a ratio of standard deviations less than 1.2 were retained, the Type I error probability was inflated still more, and the curve was slightly elevated above the one for unselected samples with unequal variances.

The lower part of Figure 1 shows power functions of the t test performed on samples from an exponential distribution under the same conditions. The two curves for unselected samples with equal and unequal variances were much the same as for the normal distribution, but the curve for the selected samples declined abruptly and showed pronounced bias over an extensive range. Figure 2 shows similar results for tR performed on samples from exponential and lognormal distributions. When variances were unequal, the curves for both selected and unselected samples again were close together, and both exhibited extreme bias. For the exponential and lognormal distributions, and for both t and tR, selection of samples brought no improvement and distorted the form of the curves to an even greater extent.
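A power curve of the kind plotted in Figures 1 and 2 can be estimated by repeating the test at each value of μ1 − μ2 from 0 to 1 in steps of 0.1. The Python sketch below is illustrative only: it assumes normal populations and unselected samples, uses hypothetical function names, and hard-codes the critical value for n1 + n2 = 60.

```python
import math
import random
import statistics

T_CRIT_DF58 = 2.0017  # two-sided .05 critical value of t, 58 df (valid only when n1 + n2 = 60)

def rejects_h0(x1, x2, t_crit=T_CRIT_DF58):
    """Pooled-variance Student t test; True when |t| exceeds the critical value."""
    n1, n2 = len(x1), len(x2)
    pooled = ((n1 - 1) * statistics.variance(x1)
              + (n2 - 1) * statistics.variance(x2)) / (n1 + n2 - 2)
    t = (statistics.mean(x1) - statistics.mean(x2)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return abs(t) > t_crit

def power_curve(n1=20, n2=40, sd1=1.0, sd2=1.0, reps=500, seed=3):
    """Estimated probability of rejecting H0 for mean differences 0, 0.1, ..., 1.0."""
    rng = random.Random(seed)
    curve = []
    for step in range(11):
        diff = step / 10.0
        hits = 0
        for _ in range(reps):
            x1 = [rng.gauss(diff, sd1) for _ in range(n1)]
            x2 = [rng.gauss(0.0, sd2) for _ in range(n2)]
            hits += rejects_h0(x1, x2)
        curve.append((diff, hits / reps))
    return curve
```

Setting sd1 larger than sd2 with n1 smaller than n2 reproduces the condition in which the first point of the curve, the Type I error rate, is inflated.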

Table 5 provides a somewhat different perspective by comparing the probability of rejection of H0 on those occasions when the ratio of sample standard deviations exceeded a cut-off value and on those occasions when the ratio was below that value. The three cut-off values were 1.1, 1.2, and 1.3, and the two conditions are represented in the columns labelled r > c and r < c. The first column in each group of three columns, labelled p(r > c), is the proportion of all samples in which the ratio of sample standard deviations exceeded the cut-off value. In the first part of the table, the population ratio σ1/σ2 was 1.0, and in the second part it was 1.5.

It is clear that, for all distributions, the proportion of sample ratios exceeding the cut-off value declined as the cut-off value increased. However, the probability of rejecting H0 remained about the same for all cut-off ratios, although it varied depending on the type of distribution. For the normal, Laplace, and logistic distributions the probability was nearly the same above and below the cut-off ratio. However, for all other distributions, the probability was somewhat larger below the cut-off ratio, which is consistent with the results in Table 6.

Figures 3 and 4 provide more detailed power curves for these conditions for normal, exponential, lognormal, and extreme value distributions. For the normal distribution, there was almost no difference in the proportions for any value of μ1 − μ2. For the other three distributions, the curves representing selection of samples differed in ways consistent with Tables 6 and 5. However, the curves representing more heterogeneous samples (r > c) were close to those representing all samples. Again, for the exponential and lognormal distributions, the selection procedure resulted in bias where it did not exist before, and the same was true for the extreme value distribution.

6. Correlation of sample heterogeneity and probability of rejecting H0

When considering heterogeneity of variance of individual pairs of samples in practical hypothesis testing, the data in Table 3 are instructive. The table shows two correlation coefficients obtained from 20,000 iterations of the sampling procedure under the conditions that produced the frequency distributions in Figures 5 and 6 (to be discussed below). The correlation in the first column of each pair of columns (labelled rPB) is the point-biserial correlation of the ratio of the standard deviations of samples, s1/s2, and the rejection of H0 (denoted by 1), or non-rejection (denoted by 0). The second correlation in each pair (labelled r) is the Pearson correlation of the ratio of standard deviations and the value of the t statistic. Two population ratios of standard deviations, 1.0 and 1.5, were included.
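The two correlations described above can be sketched in Python as follows. The point-biserial correlation is simply the Pearson correlation computed with the rejection indicator coded 1 or 0. The function names are hypothetical and the four data points in the usage lines are invented for illustration, not taken from the study.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def point_biserial(ratios, rejected):
    """rPB: correlation of the SD ratio with rejection (1) or non-rejection (0) of H0."""
    return pearson_r(ratios, [1.0 if flag else 0.0 for flag in rejected])

# Hypothetical records from four replications: SD ratio and test decision
example_ratios = [1.0, 1.1, 1.5, 1.6]
example_rejected = [False, False, True, True]
r_pb = point_biserial(example_ratios, example_rejected)
```

A value of rPB near zero means that knowing the ratio of sample standard deviations says little about whether the test rejected H0.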

Table 3. Probability of rejecting H0 by t tests on all pairs of samples of scores (t) and ranks (tR), and also on pairs of samples of scores (tH) and ranks (tRH) selected for homogeneity of variance
Student t, σ1/σ2 = 1.0, cut-off = 1.2
Sample sizes            n1=10, n2=50  n1=20, n2=40  n1=30, n2=30  n1=40, n2=20  n1=50, n2=10
Distribution   μ1 − μ2  t      tH     t      tH     t      tH     t      tH     t      tH
Normal         0        .051   .051   .052   .050   .053   .050   .053   .056   .048   .050
               0.4      .199   .206   .293   .289   .327   .330   .301   .303   .207   .200
               0.8      .615   .624   .826   .825   .869   .864   .821   .817   .616   .606
               1.2      .926   .930   .993   .993   .994   .996   .990   .991   .930   .930
Exponential    0        .044   .011   .047   .007   .044   .008   .046   .007   .044   .011
               0.4      .222   .191   .318   .307   .356   .318   .332   .259   .225   .100
               0.8      .608   .776   .810   .924   .858   .930   .822   .884   .661   .647
               1.2      .923   .991   .985   .999   .989   .998   .984   .996   .920   .950
Laplace        0        .057   .048   .050   .051   .052   .045   .048   .050   .051   .055
               0.4      .221   .213   .311   .320   .349   .355   .316   .318   .205   .218
               0.8      .627   .640   .827   .819   .863   .873   .821   .828   .638   .638
               1.2      .924   .923   .986   .986   .992   .993   .988   .985   .919   .926
Lognormal      0        .050   .006   .040   .006   .041   .004   .043   .004   .050   .004
               0.4      .265   .315   .394   .474   .431   .480   .415   .400   .281   .163
               0.8      .669   .906   .822   .961   .854   .964   .843   .926   .724   .744
               1.2      .900   .994   .958   .997   .967   .998   .958   .992   .904   .957
Extreme value  0        .048   .024   .049   .029   .051   .027   .049   .024   .050   .027
               0.4      .205   .201   .305   .301   .331   .332   .316   .294   .208   .169
               0.8      .626   .676   .815   .868   .856   .898   .829   .848   .644   .649
               1.2      .925   .967   .991   .996   .994   .998   .985   .993   .916   .944
Half-normal    0        .050   .024   .049   .024   .051   .027   .048   .023   .048   .025
               0.4      .204   .185   .294   .279   .331   .332   .305   .262   .212   .134
               0.8      .612   .674   .821   .868   .856   .898   .813   .846   .624   .625
               1.2      .928   .970   .989   .997   .994   .998   .984   .994   .917   .934
Mixed normal   0        .043   .011   .042   .006   .051   .022   .047   .007   .042   .013
               0.4      .225   .176   .330   .246   .337   .314   .368   .202   .259   .177
               0.8      .626   .694   .810   .871   .855   .892   .808   .870   .654   .481
               1.2      .919   .973   .983   .997   .993   .998   .975   .998   .906   .944
Rayleigh       0        .047   .033   .051   .038   .049   .035   .044   .036   .049   .035
               0.4      .206   .197   .298   .284   .335   .319   .301   .289   .214   .183
               0.8      .618   .646   .815   .840   .862   .884   .819   .833   .629   .642
               1.2      .922   .951   .990   .995   .995   .998   .990   .992   .923   .931
Logistic       0        .049   .053   .051   .052   .048   .051   .050   .042   .046   .049
               0.4      .204   .211   .312   .310   .337   .345   .307   .310   .210   .217
               0.8      .635   .636   .823   .821   .862   .868   .827   .824   .629   .638
               1.2      .926   .926   .987   .990   .994   .995   .989   .988   .922   .928

t on ranks, σ1/σ2 = 1.0, cut-off = 1.2
Distribution   μ1 − μ2  tR     tRH    tR     tRH    tR     tRH    tR     tRH    tR     tRH
Normal         0        .045   .048   .051   .047   .051   .053   .051   .049   .051   .046
               0.4      .200   .182   .281   .288   .305   .311   .284   .287   .186   .190
               0.8      .593   .592   .807   .802   .844   .851   .803   .800   .589   .585
               1.2      .910   .909   .990   .989   .995   .996   .987   .989   .911   .193
Exponential    0        .048   .031   .046   .033   .048   .031   .045   .031   .047   .033
               0.4      .352   .413   .566   .602   .606   .618   .549   .537   .394   .306
               0.8      .886   .938   .972   .983   .972   .981   .944   .949   .790   .751
               1.2      .996   .999   1      1      .999   1      .997   .998   .944   .949
Laplace        0        .043   .050   .049   .047   .045   .051   .048   .046   .045   .049
               0.4      .277   .261   .410   .402   .443   .441   .407   .406   .271   .262
               0.8      .728   .712   .905   .896   .934   .933   .904   .900   .729   .717
               1.2      .951   .950   .996   .994   .999   .998   .995   .995   .948   .948
Lognormal      0        .051   .034   .052   .038   .055   .039   .052   .039   .052   .032
               0.4      .715   .813   .888   .919   .896   .914   .838   .824   .637   .539
               0.8      .996   .999   .999   1      .998   .999   .993   .995   .926   .914
               1.2      1      1      1      1      1      1      1      1      .984   .986
Extreme value  0        .052   .038   .050   .041   .045   .040   .050   .036   .048   .034
               0.4      .221   .234   .347   .361   .394   .393   .354   .334   .248   .196
               0.8      .698   .732   .886   .903   .910   .919   .864   .879   .682   .649
               1.2      .973   .977   .996   .999   .997   .997   .992   .995   .923   .932
Half-normal    0        .052   .036   .049   .041   .052   .040   .050   .038   .050   .034
               0.4      .212   .230   .352   .359   .392   .394   .364   .333   .258   .189
               0.8      .683   .732   .876   .894   .892   .907   .838   .847   .640   .611
               1.2      .966   .979   .996   .998   .995   .998   .985   .991   .896   .910
Mixed normal   0        .049   .038   .048   .037   .050   .036   .050   .039   .050   .038
               0.4      .514   .529   .713   .728   .755   .769   .701   .700   .506   .396
               0.8      .977   .980   .998   .999   .999   .999   .996   .999   .941   .953
               1.2      1      1      1      1      1      1      1      1      .991   .999
Rayleigh       0        .050   .039   .049   .042   .048   .040   .050   .042   .041   .038
               0.4      .189   .199   .300   .299   .337   .327   .308   .296   .211   .176
               0.8      .612   .639   .835   .838   .863   .868   .808   .813   .614   .596
               1.2      .944   .953   .992   .993   .993   .995   .985   .988   .897   .903
Logistic       0        .050   .046   .045   .048   .046   .046   .051   .044   .049   .043
               0.4      .208   .212   .318   .327   .354   .357   .318   .313   .217   .216
               0.8      .649   .628   .845   .840   .882   .880   .836   .843   .640   .643
               1.2      .922   .928   .992   .993   .994   .996   .992   .991   .929   .929

Student t, σ1/σ2 = 2.0, cut-off = 1.2
Distribution   μ1 − μ2  t      tH     t      tH     t      tH     t      tH     t      tH
Normal         0        .205   .257   .106   .210   .051   .136   .019   .052   .004   .008
               0.4      .296   .346   .219   .346   .160   .287   .081   .179   .025   .044
               0.8      .487   .544   .513   .653   .472   .641   .347   .500   .129   .188
               1.2      .715   .764   .817   .900   .824   .915   .736   .837   .403   .508
Exponential    0        .210   .361   .127   .408   .060   .315   .024   .081   .008   .014
               0.4      .247   .066   .206   .055   .126   .032   .054   .013   .007   .063
               0.8      .455   .079   .530   .072   .506   .047   .347   .018   .096   .240
               1.2      .733   .340   .868   .465   .872   .447   .781   .296   .446   .524
Laplace        0        .197   .131   .118   .095   .056   .053   .019   .029   .004   .014
               0.4      .291   .258   .247   .271   .176   .214   .091   .149   .023   .063
               0.8      .507   .554   .548   .644   .505   .610   .378   .478   .141   .240
               1.2      .725   .810   .816   .909   .817   .898   .740   .819   .440   .524
Lognormal      0        .199   .203   .132   .240   .077   .202   .040   .130   .021   .055
               0.4      .238   .016   .184   .009   .120   .004   .048   .002   .006   .001
               0.8      .497   .169   .607   .200   .564   .162   .436   .070   .146   .005
               1.2      .788   .726   .895   .830   .900   .816   .829   .650   .553   .205
Extreme value  0        .206   .345   .122   .344   .058   .261   .021   .147   .005   .046
               0.4      .261   .164   .224   .096   .140   .048   .069   .016   .010   .005
               0.8      .459   .200   .527   .168   .490   .115   .356   .050   .110   .003
               1.2      .716   .433   .838   .511   .845   .482   .763   .336   .431   .049
Half-normal    0        .208   .491   .121   .542   .057   .430   .021   .234   .006   .059
               0.4      .263   .229   .203   .165   .138   .082   .062   .029   .012   .005
               0.8      .469   .149   .503   .086   .477   .057   .327   .026   .102   .003
               1.2      .703   .292   .842   .336   .842   .334   .751   .252   .402   .049
Mixed normal   0        .190   .190   .130   .267   .072   .240   .032   .144   .020   .035
               0.4      .227   .055   .186   .036   .113   .019   .042   .004   .005   .001
               0.8      .463   .217   .558   .294   .516   .265   .366   .110   .085   .003
               1.2      .760   .623   .887   .799   .892   .780   .819   .550   .472   .092
Rayleigh       0        .205   .390   .116   .384   .049   .312   .016   .164   .003   .035
               0.4      .271   .251   .215   .163   .150   .076   .070   .021   .012   .003
               0.8      .462   .277   .524   .245   .486   .152   .350   .072   .112   .014
               1.2      .713   .477   .833   .535   .840   .501   .758   .371   .406   .119
Logistic       0        .200   .574   .113   .130   .047   .071   .018   .026   .003   .011
               0.4      .283   .572   .228   .284   .171   .215   .085   .134   .022   .050
               0.8      .469   .599   .533   .627   .489   .598   .355   .462   .136   .206
               1.2      .706   .665   .818   .889   .826   .897   .741   .809   .422   .507

t on ranks, σ1/σ2 = 2.0, cut-off = 1.2
Distribution   μ1 − μ2  tR     tRH    tR     tRH    tR     tRH    tR     tRH    tR     tRH
Normal         0        .120   .240   .090   .190   .058   .131   .034   .055   .014   .008
               0.4      .167   .309   .180   .338   .164   .285   .124   .183   .057   .046
               0.8      .334   .521   .422   .625   .456   .617   .398   .488   .226   .198
               1.2      .544   .734   .726   .887   .781   .904   .768   .838   .541   .508
Exponential    0        .238   .442   .266   .546   .268   .535   .218   .454   .084   .223
               0.4      .124   .152   .096   .156   .067   .120   .036   .072   .011   .027
               0.8      .215   .124   .276   .121   .289   .106   .243   .068   .138   .018
               1.2      .645   .496   .846   .698   .890   .723   .847   .640   .661   .323
Laplace        0        .101   .125   .086   .088   .065   .060   .040   .029   .016   .016
               0.4      .206   .265   .246   .294   .251   .272   .202   .185   .093   .087
               0.8      .446   .580   .602   .713   .657   .720   .619   .632   .400   .324
               1.2      .685   .830   .862   .942   .913   .951   .912   .920   .758   .664
Lognormal      0        .340   .411   .443   .596   .473   .640   .411   .596   .201   .382
               0.4      .114   .086   .086   .069   .058   .051   .036   .030   .019   .014
               0.8      .500   .425   .701   .586   .755   .609   .700   .508   .492   .221
               1.2      .981   .982   .999   .997   .997   .994   .993   .981   .93    .788
Extreme value  0        .145   .314   .138   .332   .108   .280   .064   .176   .022   .074
               0.4      .129   .165   .119   .111   .092   .062   .058   .028   .023   .008
               0.8      .272   .223   .342   .222   .366   .182   .309   .120   .164   .032
               1.2      .518   .467   .703   .579   .762   .592   .736   .505   .517   .198
Half-normal    0        .164   .454   .155   .523   .128   .466   .088   .325   .026   .101
               0.4      .128   .219   .097   .185   .066   .124   .042   .058   .014   .012
               0.8      .212   .163   .251   .111   .254   .074   .200   .050   .099   .013
               1.2      .444   .298   .609   .358   .675   .374   .625   .309   .417   .126
Mixed normal   0        .158   .191   .167   .233   .139   .238   .105   .197   .042   .112
               0.4      .156   .114   .169   .103   .142   .083   .111   .038   .056   .011
               0.8      .489   .389   .658   .558   .716   .597   .680   .464   .464   .167
               1.2      .836   .785   .964   .944   .984   .963   .974   .922   .873   .612
Rayleigh       0        .130   .350   .111   .347   .071   .276   .044   .161   .015   .045
               0.4      .136   .227   .121   .156   .096   .086   .062   .029   .021   .006
               0.8      .262   .268   .317   .252   .331   .176   .282   .106   .134   .028
               1.2      .481   .466   .640   .556   .710   .526   .678   .434   .454   .172
Logistic       0        .109   .174   .089   .123   .057   .068   .033   .028   .012   .011
               0.4      .180   .281   .199   .290   .187   .227   .140   .141   .064   .054
               0.8      .360   .512   .486   .632   .530   .608   .476   .500   .289   .231
               1.2      .590   .777   .787   .893   .856   .903   .832   .859   .637   .565

In the case of the normal distribution, both correlations were close to zero under all conditions. Apparently the degree of heterogeneity of a given pair of samples does not provide information as to the value of the t statistic or the likelihood of rejecting H0, contrary to the belief that prior examination of variance heterogeneity is beneficial in hypothesis testing. Although advance knowledge of variance heterogeneity at the population level can be a warning to researchers to avoid conventional two-sample tests of location, such knowledge apparently is not very helpful when obtained from particular samples that happen to be available. In the case of non-normal distributions, the correlations in Table 3 were higher, but in those cases, we have seen, the effects of heterogeneity are anomalous and selection of samples still is of little or no advantage in hypothesis testing.

7. Decisions based on preliminary tests of equality of variances

In practical hypothesis testing, a decision usually has to be made, by inspecting a single pair of samples, as to whether or not homogeneity of variance is satisfied. One strategy often employed is using a preliminary test for equality of variances as a basis for the decision (Albers, Boon & Kallenberg, 2000; Brown & Forsythe, 1974; Gastwirth, Gel & Miao, 2009; Levene, 1960; O'Brien, 1981; Scheffé, 1959, 1970). Many preliminary tests have been proposed over the years, but the method has not been very successful. One problem of this approach is that the use of two hypothesis tests in succession compounds the Type I error rates. That is, the Type I error rate of the preliminary test of variance equality is superimposed on the Type I error probability of the subsequent hypothesis test of location (Zimmerman, 2004).

The method of the present study in a sense is a limiting case of the use of a preliminary test, because it is not subject to Type I errors resulting from initial sampling variability. That is, the simulation procedure reveals what might be expected of an ideal preliminary test based on sample statistics that contains no Type I errors. The finding that the explicit simulation method in the present study was not successful suggests that a preliminary test of equality of variances cannot accomplish what is intended, not even if a preliminary test made no Type I errors at all.

Some further evidence for this interpretation is given by Figure 7. Here, the same procedures depicted in Figures 1 and 2 were repeated with normal and exponential distributions with the addition of two preliminary tests of equality of variances that have been used widely in the past, the F test and the Levene test. On each replication, one of those tests was first performed, and if it rejected the hypothesis of variance equality at the .05 significance level, that pair of samples was rejected and another sample was drawn, continuing until the preliminary test failed to reject the null hypothesis of equal variances.
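The two-stage procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: populations are normal, the preliminary test is a two-group Levene test computed on absolute deviations from group means, both samples are redrawn after a rejection, and 2.0017 is hardcoded as the two-sided .05 critical value of t with 58 degrees of freedom, so the sketch is valid only for n1 + n2 = 60. All function and parameter names are hypothetical.

```python
import random
from math import sqrt

T_CRIT_58 = 2.0017  # assumed two-sided .05 critical value of t, 58 df

def mean(v):
    return sum(v) / len(v)

def var(v):
    # Unbiased sample variance.
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

def pooled_t(x, y):
    # Student t statistic with a pooled variance estimate.
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / sqrt(sp2 * (1 / n1 + 1 / n2))

def levene_rejects(x, y, t_crit=T_CRIT_58):
    # With two groups, the Levene F statistic equals the squared pooled t
    # computed on absolute deviations, so |t| is compared with t_crit.
    mx, my = mean(x), mean(y)
    zx = [abs(a - mx) for a in x]
    zy = [abs(b - my) for b in y]
    return abs(pooled_t(zx, zy)) > t_crit

def rejection_rate(n_reps=2000, n1=10, n2=50, sd1=1.5, sd2=1.0,
                   shift=0.0, screen=True, seed=1):
    # Proportion of replications in which the t test of location rejects H0.
    rng = random.Random(seed)

    def draw():
        return ([rng.gauss(shift, sd1) for _ in range(n1)],
                [rng.gauss(0.0, sd2) for _ in range(n2)])

    rejections = 0
    for _ in range(n_reps):
        x, y = draw()
        # Discard pairs until the preliminary test fails to reject.
        while screen and levene_rejects(x, y):
            x, y = draw()
        if abs(pooled_t(x, y)) > T_CRIT_58:
            rejections += 1
    return rejections / n_reps
```

Comparing `rejection_rate(screen=False)` with `rejection_rate(screen=True)` at `shift=0.0` gives the estimated Type I error rates without and with the preliminary test; increasing `shift` traces out a power curve.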

Figure 7. Probability of rejecting H0 for unselected samples, for samples with ratios of standard deviations below a cut-off value, and for samples selected by preliminary hypothesis tests of variance equality.

The graphs reveal that a test of location conducted only after a favourable outcome of either initial test of equality of variances resulted in a power curve that was slightly below the one obtained from simulations based on selected samples with an absolute cut-off. The introduction of preliminary tests did not move the power curves appreciably closer to the curve that has the expected Type I error probability and no spurious increases in the probability of rejection.

It should be emphasized that these results do not call into question the use of the Levene test as a test of equality of variances (see Gastwirth et al., 2009, for further details). Rather, they show that even an effective method used for that purpose is not suitable in conjunction with a t test to protect against variance heterogeneity, because Type I errors in the two separate steps are compounded. The present results suggest that any two-stage procedure would be subject to the same limitations no matter how effective a preliminary test might be (see also Zimmerman, 2004).

8. Approximate t tests without pooled variance estimates

Another widely used attempt to overcome effects of heterogeneous variances in the t test is the separate-variances approximation introduced by Welch (1938), Satterthwaite (1946), and Aspin (1948). In this method, the sample variances in the denominator of the t statistic are not pooled in order to estimate the population variance, but remain separate, and an adjustment is made in the number of degrees of freedom. That procedure has proved successful in maintaining Type I error rates in the case of normal distributions, when variances are unequal and at the same time sample sizes are unequal.
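The separate-variances statistic and its adjusted degrees of freedom can be sketched as follows. This is a minimal illustration using the standard Welch-Satterthwaite formula; the function name `welch_t` is hypothetical, and the p-value step (referring t to a t distribution with the returned degrees of freedom) is omitted.

```python
from math import sqrt

def mean(v):
    return sum(v) / len(v)

def var(v):
    # Unbiased sample variance.
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

def welch_t(x, y):
    # Separate-variances t statistic and Welch-Satterthwaite degrees of freedom.
    n1, n2 = len(x), len(y)
    a = var(x) / n1  # estimated variance of the first sample mean
    b = var(y) / n2  # estimated variance of the second sample mean
    t = (mean(x) - mean(y)) / sqrt(a + b)  # variances are not pooled
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df
```

The adjusted degrees of freedom always lie between min(n1, n2) − 1 and n1 + n2 − 2, reaching the upper limit when the sample sizes and the sample variances are both equal.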

On the other hand, the results of the method applied to non-normal distributions are not encouraging. Table 4 presents results comparing the Welch t test, using separate variances, with the Student t test, using pooled variances, for the normal distribution and the same nine non-normal distributions previously studied. The method was applied to both scores and ranks. In the case of the normal distribution, the Welch test performed well and eliminated the inflation or depression of Type I error rates. For many of the non-normal distributions, however, the Type I error rates and power remained anomalous. In the case of the exponential, lognormal, extreme value, half-normal, and mixed normal distributions, the Welch t test again proved to be biased in the same way as the Student t test, despite the attempted correction for unequal variances.

Table 4. Proportion of samples in which the t statistic exceeded its critical value (α = .05) when the ratio of sample standard deviations, r, was greater than a cut-off value (> c) or less than the cut-off value (< c). Columns labelled p(c) are the proportion of all samples in which r > c

n1 = 10, n2 = 50, σ1/σ2 = 1.0

                         c = 1.1              c = 1.2              c = 1.3
Distribution   μ1 − μ2   p(c)   r > c  r < c  p(c)   r > c  r < c  p(c)   r > c  r < c
Normal         0         .711   .049   .053   .490   .050   .050   .318   .050   .049
               0.4       .715   .207   .209   .488   .208   .205   .318   .200   .205
               0.8       .717   .622   .623   .489   .629   .622   .322   .620   .624
               1.2       .714   .928   .927   .486   .928   .928   .321   .925   .927
Exponential    0         .837   .054   .009   .693   .062   .010   .571   .074   .014
               0.4       .838   .216   .199   .691   .224   .195   .571   .226   .206
               0.8       .838   .589   .784   .689   .552   .778   .571   .512   .766
               1.2       .838   .913   .990   .691   .893   .989   .567   .878   .987
Laplace        0         .803   .049   .050   .634   .053   .051   .494   .048   .051
               0.4       .803   .206   .221   .634   .202   .221   .494   .202   .219
               0.8       .806   .635   .645   .635   .633   .640   .496   .626   .639
               1.2       .801   .922   .920   .634   .921   .920   .491   .921   .924
Lognormal      0         .900   .053   .006   .809   .059   .006   .726   .063   .010
               0.4       .901   .259   .321   .809   .253   .311   .733   .245   .305
               0.8       .898   .643   .895   .805   .617   .903   .726   .584   .897
               1.2       .899   .889   .994   .807   .877   .994   .729   .866   .992
Extreme value  0         .783   .056   .024   .597   .062   .026   .453   .069   .030
               0.4       .781   .212   .195   .597   .217   .196   .448   .219   .201
               0.8       .782   .600   .678   .595   .575   .675   .449   .552   .673
               1.2       .783   .916   .970   .605   .899   .968   .450   .880   .963
Half-normal    0         .758   .058   .022   .552   .068   .024   .391   .084   .027
               0.4       .756   .208   .188   .550   .213   .190   .391   .219   .194
               0.8       .755   .584   .676   .551   .554   .672   .391   .511   .664
               1.2       .757   .908   .978   .552   .888   .974   .392   .863   .970
Mixed normal   0         .932   .049   .012   .867   .051   .012   .795   .053   .011
               0.4       .931   .232   .168   .864   .232   .184   .799   .237   .194
               0.8       .932   .616   .653   .864   .607   .693   .796   .587   .727
               1.2       .931   .916   .967   .866   .911   .973   .797   .905   .976
Rayleigh       0         .722   .055   .032   .504   .062   .034   .340   .071   .035
               0.4       .724   .209   .191   .503   .208   .197   .341   .215   .200
               0.8       .724   .603   .647   .504   .589   .642   .340   .566   .638
               1.2       .727   .919   .959   .507   .904   .955   .339   .889   .950
Logistic       0         .762   .050   .048   .559   .050   .051   .401   .047   .051
               0.4       .762   .205   .201   .562   .203   .212   .408   .202   .216
               0.8       .762   .626   .633   .560   .624   .628   .405   .622   .636
               1.2       .761   .923   .926   .560   .921   .930   .404   .922   .926

n1 = 10, n2 = 50, σ1/σ2 = 1.5

                         c = 1.1              c = 1.2              c = 1.3
Distribution   μ1 − μ2   p(c)   r > c  r < c  p(c)   r > c  r < c  p(c)   r > c  r < c
Normal         0         .894   .129   .149   .788   .133   .146   .678   .125   .150
               0.4       .893   .249   .263   .790   .245   .265   .680   .240   .274
               0.8       .894   .535   .553   .787   .533   .559   .679   .525   .563
               1.2       .894   .805   .833   .785   .807   .834   .679   .808   .826
Exponential    0         .865   .135   .060   .746   .149   .067   .642   .162   .064
               0.4       .867   .254   .039   .748   .284   .043   .639   .321   .050
               0.8       .869   .567   .264   .748   .610   .283   .643   .649   .306
               1.2       .869   .843   .794   .745   .850   .795   .642   .861   .791
Laplace        0         .866   .133   .089   .748   .142   .096   .637   .145   .096
               0.4       .869   .253   .235   .743   .263   .234   .638   .266   .240
               0.8       .868   .546   .569   .746   .544   .573   .638   .543   .570
               1.2       .866   .812   .863   .749   .803   .863   .636   .796   .857
Lognormal      0         .890   .127   .030   .793   .140   .027   .702   .149   .031
               0.4       .889   .261   .053   .790   .288   .055   .701   .316   .063
               0.8       .889   .581   .560   .790   .587   .555   .704   .592   .556
               1.2       .888   .843   .950   .794   .828   .948   .701   .819   .946
Extreme value  0         .864   .128   .141   .739   .130   .135   .633   .132   .128
               0.4       .866   .257   .087   .744   .282   .103   .631   .304   .110
               0.8       .865   .559   .313   .743   .600   .325   .631   .636   .347
               1.2       .864   .850   .692   .742   .867   .705   .630   .888   .722
Half-normal    0         .879   .126   .166   .766   .126   .156   .654   .124   .145
               0.4       .881   .252   .062   .765   .275   .075   .651   .305   .094
               0.8       .877   .557   .253   .766   .594   .270   .653   .634   .291
               1.2       .883   .846   .610   .764   .873   .655   .654   .894   .668
Mixed normal   0         .916   .109   .080   .836   .116   .084   .764   .117   .078
               0.4       .917   .236   .097   .836   .253   .083   .765   .265   .081
               0.8       .914   .540   .478   .839   .548   .463   .766   .561   .449
               1.2       .914   .843   .900   .839   .837   .885   .763   .838   .864
Rayleigh       0         .886   .128   .182   .778   .122   .172   .668   .115   .162
               0.4       .890   .252   .131   .780   .267   .138   .669   .281   .151
               0.8       .888   .547   .339   .777   .573   .357   .668   .600   .374
               1.2       .888   .837   .673   .779   .854   .683   .668   .873   .706
Logistic       0         .875   .131   .114   .756   .132   .117   .648   .140   .116
               0.4       .875   .250   .247   .757   .252   .244   .643   .252   .250
               0.8       .874   .545   .561   .757   .533   .569   .645   .535   .567
               1.2       .874   .811   .856   .758   .803   .848   .647   .800   .846

Table 5. Point-biserial correlation between ratio of sample standard deviations and rejection of H0 by individual pairs of samples (rPB) and Pearson correlation between ratio of sample standard deviations and the t statistic (r)

               σ1/σ2 = 1.0       σ1/σ2 = 1.5
Distribution   rPB      r        rPB      r
Normal         −.001    .005     −.032    −.037
Exponential    .227     .429     .279     .311
Laplace        −.022    −.060    .095     .110
Lognormal      .180     .361     .333     .411
Extreme value  .135     .216     .088     .101
Half-normal    .201     .302     .057     .055
Mixed normal   .122     .230     .290     .334
Rayleigh       .098     .142     −.012    −.021
Logistic       −.006    −.020    .034     .043

Table 6. Probability of rejecting H0 by Student t test (t) and Welch approximate t test (tW) performed on scores and on ranks, as a function of difference between means (μ1 − μ2), for unequal sample sizes and unequal variances (σ1/σ2 = 2.0)

                         Test on scores                      Test on ranks
                         n1 = 10, n2 = 50  n1 = 50, n2 = 10  n1 = 10, n2 = 50  n1 = 50, n2 = 10
Distribution   μ1 − μ2   t        tW       t        tW       t        tW       t        tW
Normal         0         .205     .052     .003     .047     .115     .059     .014     .060
               0.4       .288     .092     .028     .152     .171     .093     .054     .151
               0.8       .484     .198     .135     .430     .335     .210     .221     .428
               1.2       .709     .390     .403     .774     .536     .382     .540     .754
Exponential    0         .208     .097     .009     .047     .233     .115     .078     .276
               0.4       .239     .050     .004     .207     .116     .057     .012     .061
               0.8       .459     .110     .092     .497     .224     .161     .142     .236
               1.2       .730     .345     .448     .767     .650     .632     .643     .620
Laplace        0         .199     .046     .004     .049     .096     .052     .015     .056
               0.4       .288     .097     .023     .156     .199     .127     .101     .215
               0.8       .505     .245     .142     .470     .441     .308     .403     .572
               1.2       .730     .451     .423     .776     .683     .533     .766     .853
Lognormal      0         .201     .145     .021     .037     .336     .183     .197     .525
               0.4       .232     .047     .005     .258     .114     .063     .014     .057
               0.8       .489     .130     .142     .610     .505     .485     .481     .457
               1.2       .784     .546     .546     .805     .981     .994     .926     .850
Extreme value  0         .204     .063     .004     .049     .142     .069     .026     .098
               0.4       .262     .055     .009     .179     .124     .065     .022     .086
               0.8       .475     .142     .106     .476     .261     .175     .157     .304
               1.2       .720     .351     .429     .771     .519     .394     .521     .631
Half-normal    0         .209     .070     .005     .049     .157     .075     .022     .113
               0.4       .260     .062     .011     .170     .117     .058     .014     .076
               0.8       .464     .130     .106     .456     .216     .130     .096     .223
               1.2       .714     .325     .407     .759     .440     .330     .424     .545
Mixed normal   0         .190     .108     .016     .042     .152     .081     .045     .131
               0.4       .227     .036     .004     .257     .150     .089     .054     .144
               0.8       .465     .138     .086     .567     .480     .363     .470     .577
               1.2       .756     .456     .482     .720     .838     .744     .871     .852
Rayleigh       0         .206     .059     .005     .055     .124     .064     .014     .074
               0.4       .269     .067     .012     .170     .135     .071     .025     .098
               0.8       .471     .158     .115     .461     .258     .161     .140     .304
               1.2       .713     .345     .415     .761     .471     .344     .455     .622
Logistic       0         .195     .049     .003     .050     .110     .056     .011     .060
               0.4       .286     .090     .022     .153     .178     .102     .065     .170
               0.8       .492     .216     .129     .445     .368     .245     .283     .480
               1.2       .721     .417     .425     .784     .602     .453     .629     .795

In the case of the test on ranks applied to the non-normal distributions, the results were similar. The Type I error rates remained too high or too low for most distributions. Furthermore, the bias of the test on ranks was even more extreme than the bias of the test on scores for the same five non-normal distributions mentioned above. Some more detailed power curves are exhibited in Figures 8 and 9. In both figures, sample sizes are n1 = 10 and n2 = 50, and, except for the curves with filled circles, the variances were different, that is, σ1/σ2 = 2.0, which are conditions under which inflation of power curves is expected in the case of normal distributions. The significance level was .05. The figures compare the t test applied to the same sample sizes when the variances are equal (filled circles) with the t test and Welch test when variances are unequal.

Figure 8. Power curves of Student t test with equal variances, Student t test with unequal variances, and Welch t test with unequal variances (performed on scores).

Figure 9. Power curves of Student t test with equal variances, Student t test with unequal variances, and Welch t test with unequal variances (performed on ranks).

In the case of the normal distribution, the Welch test restored the inflated Type I error rate to its usual value, for both scores and ranks. However, the power remained considerably below the usual value found when variances are equal. In the case of the exponential distribution those anomalous effects were magnified. For both the Student t test and the Welch t test applied to ranks, there was extreme bias in the case of the exponential distribution.
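For readers who want to see what a test on ranks involves, the following is a minimal sketch, assuming the usual rank transformation in which the two samples are combined, ranked jointly with midranks for ties, and then analysed with the ordinary t statistic. The function names are hypothetical and the sketch is not the code used in the study.

```python
from math import sqrt

def rank_transform(x, y):
    # Replace each score by its rank in the combined sample (midranks for ties).
    combined = sorted(list(x) + list(y))
    rank_of = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        # Positions i+1 .. j share the same value; assign their average rank.
        rank_of[combined[i]] = (i + 1 + j) / 2
        i = j
    return [rank_of[v] for v in x], [rank_of[v] for v in y]

def mean(v):
    return sum(v) / len(v)

def var(v):
    # Unbiased sample variance.
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

def pooled_t(x, y):
    # Student t statistic with a pooled variance estimate.
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / sqrt(sp2 * (1 / n1 + 1 / n2))

def t_on_ranks(x, y):
    # Student t statistic computed on the rank-transformed scores.
    rx, ry = rank_transform(x, y)
    return pooled_t(rx, ry)
```

A separate-variances version is obtained in the same way by passing the ranks to a Welch statistic instead of `pooled_t`.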

Apparently the Welch test and related separate-variances tests do an excellent job of maintaining Type I error rates for normal distributions. However, they are ineffective in maintaining either Type I error rates or power in the case of many non-normal distributions, and even for normal distributions their power is not what might be desired.

9. Some practical implications

It is not easy to decide when the assumption of homogeneity of variance in hypothesis testing is violated and when it is not. Discussions of the topic in the past have not always distinguished between population variances and sample variances. It is possible for two samples drawn from populations with decidedly different variances to have nearly the same variances, as a result of ordinary sampling variability. Also it is possible for two samples drawn from a single distribution or from two distributions with the same variance to have different variances. Researchers do not often have prior knowledge of the specific forms of distributions, so there is usually some uncertainty and a need to make decisions based on inspection of samples.

An idea of what can be expected in practice is indicated by Figures 5 and 6, which plot relative frequency distributions of the individual ratios of standard deviations of pairs of samples, s1/s2, taken from normal, exponential, and lognormal distributions over 100,000 replications, when the difference between means was 0.5σ. It is clear that the sample ratios covered an extensive range of values, and it would be difficult to decide in any one case about the suitability of a test of location. In some instances (Figure 5), the populations had equal variances, but the samples represented in a fairly large region in the tails of the distributions would be taken as evidence of violation of the assumption when there is none. In other instances (Figure 6), the population variances were unequal, but many pairs of samples in the central part of the distribution could be taken as evidence that homogeneity is satisfied when it is not. Despite selection, there is still an extensive distribution of ratios that are not close to 1.00. The extent to which these possibilities would arise in practice would depend on sample sizes and consequent variability of the ratios.
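The variability of sample ratios can be explored with a short simulation. The sketch below is an illustration under stated assumptions rather than the procedure used for the figures: populations are normal, the ratio is taken as the larger sample standard deviation divided by the smaller, and the mean difference is omitted because it does not affect the standard deviations. The function names are hypothetical.

```python
import random
from math import sqrt

def sample_sd(v):
    # Square root of the unbiased sample variance.
    m = sum(v) / len(v)
    return sqrt(sum((a - m) ** 2 for a in v) / (len(v) - 1))

def proportion_exceeding(cutoff, n_reps=4000, n1=10, n2=50,
                         sd1=1.0, sd2=1.0, seed=1):
    # Proportion of sample pairs whose ratio of standard deviations exceeds cutoff.
    rng = random.Random(seed)
    count = 0
    for _ in range(n_reps):
        s1 = sample_sd([rng.gauss(0.0, sd1) for _ in range(n1)])
        s2 = sample_sd([rng.gauss(0.0, sd2) for _ in range(n2)])
        ratio = max(s1, s2) / min(s1, s2)  # larger over smaller, an assumption
        if ratio > cutoff:
            count += 1
    return count / n_reps
```

For example, `proportion_exceeding(1.2)` estimates how often two samples from populations with equal variances would nevertheless fail a cut-off of 1.2, and setting `sd1=1.5` shows how often samples from populations with unequal variances would pass it.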

The simulations in the present study revealed that arbitrary selection of samples with nearly equal variances, specifically ratios of standard deviations as small as 1.1, 1.2, or 1.3, did not improve the robustness of the Student t test or its non-parametric counterpart, the Wilcoxon–Mann–Whitney test, when population variances were unequal. On the contrary, selection of homogeneous samples often made the situation worse. When normal distributions had equal variances, the Type I error rates and power of the Student t test performed on selected or unselected samples were almost indistinguishable. However, for non-normal distributions, Type I error rates typically fell considerably below the nominal significance level, and power functions were anomalous. In the case of the t test on ranks, the Type I error rates for unselected samples were maintained, as is well known. However, for selected samples, that desirable property of non-parametric methods was no longer evident, and the Type I error rates declined below the significance level. Often there was extreme bias and power declined as the difference between means increased.

When population variances were heterogeneous, the Type I errors and power of tests on unselected samples exhibited the usual inflation or depression that occurs with unequal sample sizes. However, both t and tR performed on selected samples resulted in an even greater degree of inflation or depression. As a general rule, selection of samples to ensure homogeneity appears to move the probabilities of Type I and Type II errors still further in the direction that the selection process is intended to counteract.

In the case of normal distributions, there is no correlation between the presence of unequal variances in a pair of samples and the outcome of a hypothesis test based on those samples. In the case of non-normal distributions there are some correlations, but not in a direction that would be useful in practical research. It may be necessary to accept the fact that further modifications or adjustments to the t test and Wilcoxon–Mann–Whitney test cannot produce a method that will protect Type I error rates and power under variance heterogeneity. On the other hand, it is possible that newer methods, not necessarily based on selection of samples or ranking, can better handle the two-sample location problem when variances are unequal. Bootstrap methods, use of trimmed means, and other techniques that have been explored in recent years (Wilcox, 2003, 2005) may prove to be more useful in practical research.

Footnote
  1. The Mathematica code used in the simulations in this study can be obtained by writing to the author.

References

  • Albers, W., Boon, P. C., & Kallenberg, W. C. M. (2000). The asymptotic behavior of tests for normal means based on a variance pre-test. Journal of Statistical Planning and Inference, 88, 47–57.
  • Aspin, A. A. (1948). An examination and further development of a formula arising in the comparison of two mean values. Biometrika, 43, 88–96.
  • Boneau, C. A. (1960). The effects of violation of assumptions underlying the t test. Psychological Bulletin, 57, 49–64.
  • Bradstreet, T. E. (1997). A Monte Carlo study of Type I error rates for the two-sample Behrens-Fisher problem with and without rank transformation. Computational Statistics and Data Analysis, 25, 167–179.
  • Brown, M. B., & Forsythe, A. B. (1974). The ANOVA and multiple comparisons for data with heterogeneous variances. Biometrics, 30, 719–724.
  • Conover, W. J., & Iman, R. L. (1980). Rank transformations as a bridge between parametric and nonparametric statistics. American Statistician, 35, 124–129.
  • Cressie, N. A. C., & Whitford, H. J. (1986). How to use the two sample t test. Biometrical Journal, 28, 131–148.
  • Evans, M., Hastings, N., & Peacock, B. (2000). Statistical distributions (3rd ed.). New York: Wiley.
  • Fagerland, M. W., & Sandvik, L. (2009). Performance of five two-sample location tests for skewed distributions with unequal variances. Contemporary Clinical Trials, 30, 490–496.
  • Gastwirth, J. L., Gel, Y. R., & Miao, W. (2009). The impact of Levene's test of equality of variances on statistical theory and practice. Statistical Science, 24(3), 343–360.
  • Hsu, P. L. (1938). Contributions to the theory of Student's t test as applied to the problem of two samples. Statistical Research Memoirs, 2, 1–24.
  • Levene, H. (1960). Robust tests for equality of variance. In I. Olkin (Ed.), Contributions to probability and statistics (pp. 278–299). Palo Alto, CA: Stanford University Press.
  • O'Brien, R. G. (1981). A simple test for variance effects in experimental designs. Psychological Bulletin, 89, 570–574.
  • Overall, J. E., Atlas, R. S., & Gibson, J. M. (1995). Tests that are robust against variance heterogeneity in k × 2 designs with unequal cell frequencies. Psychological Reports, 76, 1011–1017.
  • Posten, H. O., Yeh, H. C., & Owen, D. B. (1982). Robustness of the two-sample t-test under violations of the homogeneity of variance assumption. Communications in Statistics – Theory and Methods, 11, 109–126.
  • Satterthwaite, F. E. (1946). An approximate distribution of estimates of variance components. Biometrics Bulletin, 2, 110–114.
  • Scheffé, H. (1959). The analysis of variance. New York: Wiley.
  • Scheffé, H. (1970). Practical solutions of the Behrens-Fisher problem. Journal of the American Statistical Association, 65, 1501–1508.
  • Stonehouse, J. M., & Forrester, G. J. (1998). Robustness of the t and U tests under combined assumption violations. Journal of Applied Statistics, 25, 63–74.
  • Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362.
  • Welch, B. L. (1947). The generalization of Student's problem when several different population variances are involved. Biometrika, 34, 29–35.
  • Wilcox, R. R. (2003). Applying contemporary statistical techniques. New York: Academic Press.
  • Wilcox, R. R. (2005). Introduction to robust estimation and hypothesis testing. New York: Academic Press.
  • Wolfram, S. (1999). The Mathematica book (4th ed.). Cambridge, UK: Cambridge University Press.
  • Zimmerman, D. W. (2004). A note on preliminary tests of equality of variances. British Journal of Mathematical and Statistical Psychology, 57, 173–181.