Evaluation of whole effluent toxicity data characteristics and use of Welch's T-test in the test of significant toxicity analysis



The U.S. Environmental Protection Agency (U.S. EPA) and state agencies evaluate the toxicity of effluent and surface water samples based on statistical endpoints derived from multiconcentration tests (e.g., no observed effect concentration, EC25). The test of significant toxicity (TST) analysis is a two-sample comparison test that uses Welch's t test to compare organism responses in a sample (effluent or surface water) with responses in a control or site sample. In general, any form of t test (Welch's t included) is appropriate only if the data meet assumptions of normality and homogeneous variances. Otherwise, nonparametric tests are recommended. TST was designed to use Welch's t as the statistical test for all whole effluent toxicity (WET) test data. The authors evaluated the suitability of using Welch's t test for analyzing two-sample toxicity (WET) data, and within the TST approach, by examining the distribution and variances of data from over 2,000 WET tests and by conducting multiple simulations of WET test data. Simulated data were generated having variances and nonnormal distributions similar to observed WET test data for control and the effluent treatment groups. The authors demonstrate that (1) moderately unequal variances (similar to WET data) have little effect on coverage of the t test or Welch t test (for normally distributed data), and (2) for nonnormally distributed data (similar in distribution to WET data) TST, using Welch's t test, has close to nominal coverage on the basis of simulations with up to a ninefold difference in variance between the effluent and control groups (∼95th percentile based on observed WET test data). Environ. Toxicol. Chem. 2013;32:468–474. © 2012 SETAC


The hypothesis test approach is commonly used in the National Pollutant Discharge Elimination System Program in the United States to analyze whole effluent toxicity (WET) test data. The test of significant toxicity (TST) was recently proposed as an alternative statistical approach for analyzing WET data. The TST approach uses hypothesis-testing techniques based on previous U.S. Environmental Protection Agency (U.S. EPA) guidance 1 as well as work by many researchers 2–5. The TST examines whether the results of two treatments differ by an a priori prescribed amount rather than whether they are the same, as in traditional hypothesis testing 1, 4. The TST approach for WET testing uses the null hypothesis µT ≤ b × µC, where µT and µC are the mean responses in the sample and in the control, respectively. This null hypothesis includes a specific value for the ratio µT/µC, designated b (where b is a constant, 0.0 < b < 1.0), to delineate unacceptable and acceptable levels of toxicity. It also reverses the inequalities so that the sample is assumed to have an unacceptable level of toxicity until demonstrated otherwise.
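As an illustration only, the null hypothesis above can be tested with a Welch-type statistic in which the control mean and its variance are scaled by b. The Python sketch below assumes that formulation (it is not quoted from this article), uses hypothetical replicate counts, and approximates the t distribution by numerical integration. The names t_upper_tail and tst_welch, and the default alpha, are illustrative choices.

```python
import math
from statistics import mean, variance


def t_upper_tail(t, df, steps=200):
    """P(T > t) for Student's t, by Simpson integration of the density."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    h = abs(t) / steps
    area = 0.0
    for i in range(steps + 1):
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        x = i * h
        area += weight * math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))
    half = min(area * h / 3, 0.5)  # P(0 < T < |t|)
    return 0.5 - half if t >= 0 else 0.5 + half


def tst_welch(effluent, control, b=0.75, alpha=0.05):
    """One-sided Welch-type test of H0: mu_T <= b * mu_C (assumed formulation).

    Requires at least one group with nonzero variance.
    """
    n_t, n_c = len(effluent), len(control)
    v_t = variance(effluent) / n_t         # squared standard error, effluent mean
    v_c = b * b * variance(control) / n_c  # squared standard error, b * control mean
    t = (mean(effluent) - b * mean(control)) / math.sqrt(v_t + v_c)
    # Welch-Satterthwaite degrees of freedom with the b-scaled control term
    df = (v_t + v_c) ** 2 / (v_t ** 2 / (n_t - 1) + v_c ** 2 / (n_c - 1))
    p = t_upper_tail(t, df)
    return t, df, p, p < alpha


# Hypothetical neonate counts, 10 replicates per treatment
control = [24, 27, 22, 30, 25, 28, 21, 26, 29, 23]
effluent_mild = [23, 26, 20, 29, 24, 27, 19, 25, 28, 22]
effluent_severe = [12, 15, 10, 14, 13, 16, 9, 11, 17, 13]

for name, sample in (("mild", effluent_mild), ("severe", effluent_severe)):
    t, df, p, reject = tst_welch(sample, control)
    print(name, round(t, 2), round(df, 1), reject)
```

In this formulation, rejecting the null hypothesis means the sample is declared to have an acceptable level of toxicity; failing to reject leaves the default assumption of unacceptable toxicity in place.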

The TST approach uses Welch's t test to compare organism response in an effluent or receiving water sample with the response in a control or reference site sample. Welch's t test is a modification of the Student's t test and is intended for use in two-sample comparisons when there is a possibility of unequal variances between the two treatments 6. Welch's t test accounts for different variances in two groups and assumes data are normally distributed 6–10. Many researchers report that when unequal variances are combined with nonnormal distributions, both the traditional t test and nonparametric methods (e.g., Mann-Whitney-Wilcoxon test) have type I error rates that strongly deviate from nominal error rates 11–14. In these situations, Welch's t test has been recommended because it is more robust to type I error than other statistical tests 13–15. However, for nonnormal data that have skewed, long-tailed distributions (e.g., log normal or exponential distribution), the Welch's t test is known to have poor coverage 14; that is, the realized error rate (α) under the null hypothesis is greater than the intended, nominal value. The WET data are subject to unequal variances between a control and an effluent treatment, particularly for those test methods that measure acute mortality 16. Thus, use of Welch's t test rather than the traditional t test (which assumes equal variances) is advisable. However, the issue of nonnormality is also a concern when applying the Welch's t test. If WET test data are nonnormally distributed in a way that does not substantially compromise coverage of the Welch's test, such as a leptokurtic distribution 10–14, Welch's t test would still be an appropriate test for analyzing two-sample (e.g., concentration) comparisons of WET data.

In the present study, we examined the distribution and variance of typical WET test data to determine the suitability of using Welch's t test, particularly within the TST approach. We demonstrate that: (1) moderately unequal variances observed in WET test data have little effect on coverage of the t test or Welch t test (for normally distributed data), and (2) for the type of nonnormally distributed data observed in WET tests using a two-sample design, Welch's t test yields close to nominal coverage using the TST approach.


Characterization of WET data

Valid WET data were received from several reliable sources and compiled into a database to characterize statistical properties of each WET test method 1. Sources included the Washington State Department of Ecology, U.S. EPA's Office of Science and Technology, the North Carolina Department of the Environment and Natural Resources, the California State Water Resources Control Board, and the Virginia Department of Environmental Quality. More than 2,000 valid WET tests of interest were incorporated, representing many permittees and laboratories. Only data from WET tests meeting U.S. EPA's test acceptability criteria were used in the analyses.

Because WET test methods differ in experimental design, and thus could represent different distribution functions, six WET test methods were examined to determine the frequency and magnitude of unequal variances between the control and the effluent treatment, as well as the frequency and type of nonnormality in these methods. The WET test endpoints included Ceriodaphnia dubia (water flea) reproduction (n = 10 for each treatment), Pimephales promelas (fathead minnow) growth and survival (n = 4), Americamysis bahia (mysid shrimp) growth (n = 8), and Macrocystis pyrifera (giant kelp) germ-tube length and percentage of germination (n = 5) tests 17. In addition, standard data transformations were applied when test data were nonnormal to see whether the transformed data would meet assumptions of normality.

Statistical tests

All statistical analyses were performed using R 18. Standard F tests (significance level of 0.01) were conducted for each valid WET test (effluent and control) to determine whether variances were unequal. The Shapiro-Wilk normality test 17, 19 was used to evaluate whether WET test data were normally distributed. Measures of skewness and Pearson's measure of kurtosis (R moments package) were used to determine whether skewness or kurtosis, or both, were major sources of nonnormality. The critical values of those moments for a normal distribution are shown in Table 1. A skewness measure less than 0 indicates that the sample comes from a population that is skewed to the left, and a skewness measure larger than 0 indicates that the distribution is skewed to the right 20. A kurtosis measure significantly larger than the median value (50th percentile) for a given test design indicates an underlying leptokurtic distribution 21. We also used the D'Agostino test of skewness 20 and the Anscombe–Glynn test of kurtosis 21 to test for significant levels of skewness and kurtosis.
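The moment-based measures described above can be sketched in a few lines of Python. This is a minimal illustration, not the R code used in the study: it assumes the usual moment definitions (skewness as m3 / m2^1.5 and Pearson's kurtosis as m4 / m2^2, so that a normal distribution has a kurtosis of 3), and it assumes that each treatment group is centered on its own mean before pooling. The function names and replicate values are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment using the 1/n (population) divisor."""
    m = sum(values) / len(values)
    return sum((v - m) ** k for v in values) / len(values)


def skewness(values):
    """Moment skewness: negative = left-skewed, positive = right-skewed."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def pearson_kurtosis(values):
    """Pearson's kurtosis (not excess kurtosis); equals 3 for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


def pooled_residuals(control, effluent):
    """Center each group on its own mean, then pool (an assumed convention)."""
    out = []
    for group in (control, effluent):
        m = sum(group) / len(group)
        out.extend(v - m for v in group)
    return out


# Hypothetical replicate data for one two-sample test (N = 20 pooled residuals)
control = [24, 27, 22, 30, 25, 28, 21, 26, 29, 23]
effluent = [18, 19, 19, 20, 19, 18, 20, 19, 3, 33]
res = pooled_residuals(control, effluent)
print(round(skewness(res), 3), round(pearson_kurtosis(res), 3))
```

A pooled kurtosis well above the critical range for the corresponding N would point to a leptokurtic distribution, whereas a skewness far from 0 would point to asymmetry.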

Table 1. Distribution of critical skewness and kurtosis ranges for different sample sizes (N) based on 1,000,000 simulation runs

^a N = 20 corresponds to the Ceriodaphnia dubia reproduction test (10 replicates each in effluent and control); N = 16 corresponds to the mysid chronic test (eight replicates per treatment); N = 10 corresponds to the two giant kelp chronic test endpoints (five replicates per treatment); N = 8 corresponds to the fathead minnow acute and chronic tests (four replicates per treatment).



Simulations of both unequal variance and nonnormal distribution scenarios were performed using R. The objective of the simulations was to confirm that the α error rate is relatively stable when variances are unequal, and when data are nonnormally distributed and variances are also unequal, for both the traditional t test and Welch's t test. The nonparametric Wilcoxon rank sum test (computing the exact p value) was also used to provide a comparison with results from parametric tests for the unequal variance and nonnormal distribution conditions.

Unequal variances

Various simulations were conducted, using the chronic Ceriodaphnia test method as an example, to examine α error rate using the traditional t test, Welch's t test, and Wilcoxon rank sum test, with data having different relationships between control and effluent variance. From analyses of WET test data summarized in Table 2, a variance ratio (effluent/control) of 9:1 (95th percentile of variance ratio) was found to be a reasonable upper limit. Therefore, four different simulations were examined: (1) equal variances and no mean difference between control and effluent; (2) effluent with nine times the control variance and no mean difference; (3) equal variance and a 25% mean effect in the effluent treatment; and (4) effluent having nine times the control variance and a 25% mean effect. Equal sample size (n = 10 using the Ceriodaphnia chronic test method) was assumed for both control and effluent treatment groups, which is most often the case in routine WET testing conducted in the United States and elsewhere 17.
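A reduced version of scenarios (3) and (4) can be sketched in Python as follows. This is only an illustration under stated assumptions: normally distributed data, a control mean of 25 and standard deviation of 5.65 (borrowed from the control population described for the nonnormal simulations), a pooled-variance TST statistic formed by analogy with the Welch version, and far fewer runs than the study used. The function names are hypothetical.

```python
import math
import random
from statistics import mean, variance


def t_upper_tail(t, df, steps=200):
    """P(T > t) for Student's t, by Simpson integration of the density."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    h = abs(t) / steps
    area = 0.0
    for i in range(steps + 1):
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        x = i * h
        area += weight * math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))
    half = min(area * h / 3, 0.5)
    return 0.5 - half if t >= 0 else 0.5 + half


def simulate_alpha(variance_ratio, n=10, b=0.75, alpha=0.01, runs=2000, seed=1):
    """Estimate the rejection rate of H0: mu_T <= b * mu_C at its boundary.

    variance_ratio is effluent variance / control variance.
    Returns (pooled t test rate, Welch t test rate).
    """
    rng = random.Random(seed)
    mu_c, sd_c = 25.0, 5.65                # assumed control population
    mu_t = b * mu_c                        # boundary of the null hypothesis
    sd_t = math.sqrt(variance_ratio) * sd_c
    reject_pooled = reject_welch = 0
    for _ in range(runs):
        c = [rng.gauss(mu_c, sd_c) for _ in range(n)]
        e = [rng.gauss(mu_t, sd_t) for _ in range(n)]
        diff = mean(e) - b * mean(c)
        s2_c, s2_t = variance(c), variance(e)
        # Traditional (pooled-variance) t test, assumed TST form
        sp2 = (s2_c + s2_t) / 2            # equal n, so a simple average
        t_pooled = diff / math.sqrt(sp2 * (1 + b * b) / n)
        if t_upper_tail(t_pooled, 2 * n - 2) < alpha:
            reject_pooled += 1
        # Welch's t test with Welch-Satterthwaite degrees of freedom
        v_t, v_c = s2_t / n, b * b * s2_c / n
        t_welch = diff / math.sqrt(v_t + v_c)
        df = (v_t + v_c) ** 2 / ((v_t ** 2 + v_c ** 2) / (n - 1))
        if t_upper_tail(t_welch, df) < alpha:
            reject_welch += 1
    return reject_pooled / runs, reject_welch / runs


for ratio in (1.0, 9.0):
    print(ratio, simulate_alpha(ratio))
```

With only 2,000 runs the estimated rates are noisy; increasing runs tightens them at the cost of run time.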

Table 2. Number (and percentage) of tests with nonnormal distribution and unequal variances for different types of whole effluent toxicity (WET) tests, as well as the effect of data transformation on distribution, including skew and kurtosis
| Test name | Number of tests | Data transformation | # (%) Nonnormal tests (p ≤ 0.01) | # (%) Tests failing F test for unequal variances (p ≤ 0.01) | Range of skewness statistic for nonnormal tests | # (%) Tests failing D'Agostino test for skewness (p ≤ 0.01) | Range of kurtosis statistic for nonnormal tests | # (%) Tests failing Anscombe test for kurtosis (p ≤ 0.01) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ceriodaphnia dubia reproduction | 1,382 | Raw | 285 (20.6) | 390 (28.2) | −1.529 to −0.26 | 33 (2.4) | 3.821–6.571 | 159 (11.5) |
| | | Sqrt trans | 418 (30.2) | 545 (39.4) | −1.790 to −0.385 | 89 (6.4) | 4.013–7.45 | 268 (19.4) |
| | | Log + 1 | 525 (37.9) | 630 (45.6) | −2.058 to −0.564 | 143 (10.3) | 4.06–8.43 | 343 (24.9) |
| Fish growth | 108 | Raw | 2 (1.9) | 18 (16.7) | −1.253 to 1.250 | 0 (0) | 3.261–4.213 | 0 (0) |
| Mysid growth | 907 | Raw | 10 (1.1) | 37 (4.0) | −0.423 to 1.443 | 1 (0.1) | 2.52–4.912 | 7 (0.77) |
| Giant kelp germ-tube length | 100 | Raw | 9 (9.0) | 22 (22) | −1.478 to 1.548 | 0 (0) | 4.025–5.456 | 6 (6) |
| | | Log + 1 | 8 (8.0) | 30 (30) | −1.571 to 1.234 | 0 (0) | 4.25–6.080 | 8 (8) |
| | | Sqrt | 9 (9.0) | 29 (29) | −1.625 to 1.381 | 0 (0) | 4.238–6.068 | 8 (8) |
| Giant kelp germination | 100 | Raw | 3 (3.0) | 15 (15) | −0.9 to 1.281 | 0 (0) | 3.465–4.697 | 3 (3) |
| | | Arcsin(sqrt) | 1 (1.0) | 9 (9) | −0.872 to 1.04 | 0 (0) | 3.465–4.698 | 0 (0) |
| Fish survival | 108 | Percent | 44 (40.7) | 61 (56.5) | −1.633 to 0.654 | 0 (0) | 2–4.67 | 3 (2.8) |
| | | Arcsin(sqrt) | 42 (38.9) | 61 (56.5) | −1.633 to 0 | 0 (0) | 2–4.67 | 3 (2.8) |


The distribution of control and effluent reproduction data from 281 C. dubia multi-concentration tests was examined. Generally, each test included five effluent concentrations and a control. Although most tests indicated that control reproduction follows a normal distribution (mean = 24.5, standard deviation = 5.65), reproduction data in the effluent treatments tend to deviate from a normal distribution (Fig. 1). Effluents with low toxicity appear more likely to have reproduction data derived from a normal distribution (Fig. 1B), whereas effluents exhibiting high toxicity appear more likely to have data with a skewed, nonnormal distribution (Fig. 1C, D). To address this observation, two populations were simulated on the basis of the shape of the frequency distribution in the highest effluent concentration in each C. dubia test (Fig. 2). The first simulated effluent population had a mean of 25 (equal to the population mean for the control group) and a standard deviation of 8.35, whereas the second had a population mean of b × 25 (where b = 0.75 for chronic test methods 1), resulting in an effluent mean of 18.75. The variance of the two effluent populations was the same. Random samples taken from these two populations were compared with samples from the control population (mean = 25, standard deviation = 5.65).

Figure 1. Histograms of observed Ceriodaphnia reproduction at different levels of effluent concentrations based on 281 multiple concentration tests.

Figure 2. Simulated frequency distributions of Ceriodaphnia reproduction data with two populations having nonnormal data and different means. Both populations have a standard deviation of 7.7.


Characterization of WET data

Some WET test methods and endpoints demonstrated a higher frequency of unequal variances than other test methods (Table 2). For example, more than half of the P. promelas (fish) acute survival tests had unequal variances (F test, p ≤ 0.01). This result is expected because control acute survival typically has little or no variance (e.g., all control replicates display 100% survival). Ceriodaphnia dubia reproduction had the next highest frequency of tests with unequal variances (28.2%). The giant kelp germ-tube length and percent germination endpoints as well as P. promelas (fish) chronic growth WET endpoint each had a lower frequency of tests with unequal variances (15–22%), whereas the mysid growth endpoint had the lowest frequency of unequal variances of the six test endpoints evaluated (4%).

Using the C. dubia reproduction endpoint as an example of a WET method having a higher frequency of heterogeneous variances, the variance ratio between effluent and control was generally less than 9:1 (95th percentile ratio), with a median variance ratio of 2.5. Examination of data using other growth/reproduction endpoints indicates that most tests have a variance ratio of less than 10:1 (95th percentile) and a median variance ratio of less than 3.0. Percent data (e.g., giant kelp germination) appear to be subject to higher variance ratios (20:1 to 30:1); however, the fish acute percent survival endpoint has a variance ratio generally less than 6.2:1 (95th percentile).

The number of tests failing the Shapiro-Wilk normality test at the 1% significance level is reported in Table 2. Approximately 21% of the C. dubia reproduction tests (285 of 1,382 cases) failed the normality test (Table 2). Neither the square root nor the logarithm transformation corrected the nonnormal distribution problem; both instead increased the total number of tests failing the normality test (Table 2). The D'Agostino test of skewness indicated that 33 tests (<3%) were highly skewed. The test of kurtosis indicated that 11% of tests (160) had a significantly leptokurtic distribution (Table 2). Thus, most of the C. dubia test data that failed the normality test did so because of kurtosis (leptokurtic distribution), and the occasional asymmetric distribution was mostly attributable to outliers (Fig. 3). In general, most WET test growth endpoints (i.e., P. promelas growth, mysid growth, or giant kelp germ-tube length) were normally distributed. Both fish and mysid growth data exhibited nonnormal distribution in only a few cases (<2%), and those were generally related to leptokurtic distributions that were short-tailed (Table 2). Approximately 41% of the acute fish survival tests had nonnormally distributed data (Table 2). Zero variance in many tests for either the control (34 cases) or the effluent treatment (26 cases) was the main cause of failing the normality test. Nonnormality in acute fish survival data was attributable to leptokurtic distribution of the data (Table 2).

Figure 3. Histograms of examples of Ceriodaphnia chronic reproduction test data showing nonnormal distribution and especially leptokurtic distribution. Each histogram shows the combined centered (standardized) values for both control and treatment.

These analyses indicate that WET data in general do not have long-tailed, highly skewed distributions, suggesting that Welch's t test should be appropriate for analyzing two-concentration WET data.


Unequal variances

When equal variances are present and the true difference is zero, the observed error rates from both the traditional t test and Welch's t test were very similar to the nominal error rates, which was expected because this condition meets the assumptions of the t test (Table 3). When control and treatment groups have unequal variance (i.e., effluent variance = 9 × the control variance), the traditional t test had a slightly higher type I error rate, but Welch's t test had a type I error rate similar to the expected value (Table 3). When the true response at the in-stream waste concentration (IWC) is 0.75 × control mean (b = 0.75), and both populations have equal variances, α error rates were very similar to the expected error rate using both the traditional t test and Welch's t test (Table 3). When the true response at the IWC is 0.75 × control mean and population variances are not equal (i.e., effluent variance = 9 × the control variance), error rates were approximately 2 to 3% higher than expected using the traditional t test but similar to expected error rates using Welch's t test (Table 3). Higher type I error rates using the t test translate to more tests being declared toxic when in fact they are not. The Wilcoxon rank sum test did not perform better than either the traditional t test or Welch's t test in any of the situations.

Table 3. Results of Monte Carlo simulations evaluating alpha error rate using either the traditional t test or Welch's t test with data having different relationships between control and effluent variances^a

| Variance relationship | Alpha | µc = µt: t test | µc = µt: Welch t test | µc = µt: Wilcoxon rank sum | µt = 0.75 µc: t test | µt = 0.75 µc: Welch t test | µt = 0.75 µc: Wilcoxon rank sum |
| --- | --- | --- | --- | --- | --- | --- | --- |
| S²c = S²t | 0.010 | 0.010 | 0.009 | 0.009 | 0.010 | 0.010 | 0.005 |
| S²c = S²t/9 | 0.010 | 0.013 | 0.010 | 0.014 | 0.021 | 0.010 | 0.013 |

The b value is set at 0.75. Results are based on 1,000,000 simulation runs per scenario.

^a S²c = control variance; S²t = IWC variance; µc = control mean; and µt = IWC mean.


Simulation results indicated that when the effluent population had the same mean as the normally distributed control population but a different distribution shape, the α error rates using the traditional t test and Welch's t test were almost identical, as expected (Table 4). The α error rate for the Wilcoxon rank sum test varied but tended to be 1 to 5% higher than expected; in other words, the Wilcoxon rank sum test tended to reject the null hypothesis more often than expected. When the true population mean difference between control and effluent is 25% of the control mean (b = 0.75), and when the effluent population is not normally distributed, the α error rate was only slightly higher than the expected value using the traditional t test (Table 4), whereas Welch's t test resulted in an α error rate approximately 1% lower than the nominal rate using the TST approach. The nonparametric Wilcoxon test did not perform as well, resulting in an α error rate 2 to 5% lower than the nominal rate (i.e., less chance to reject the null hypothesis or a higher rate of declaring a sample toxic when in fact it is not). Thus, when WET data are extremely nonnormal and variances between the control and effluent treatment are heterogeneous, Welch's t test is less likely to reject the null hypothesis (i.e., the analysis will be more conservative). As test data approach a normal distribution, α error rates using Welch's t test are closer to nominal values.

Table 4. Results of Monte Carlo simulation analyses (100,000 simulations per scenario) indicating alpha error rates based on comparisons between two nonnormally distributed populations and a normal distribution (control population, µ0 = 25, standard deviation = 5.65)^a

Columns: Alpha; for population mean µ0 = µ1 = 25: Student t test, Welch's t test, Wilcoxon rank sum; for µ0 = 25, µ2 = 18.75, b = 0.75: TST t test, TST Welch's t test, Wilcoxon rank sum.

^a The nonnormally distributed population means (µ1 and µ2) are 25 and 18.75, respectively, and the standard deviation is 8.35 in both populations. TST = test of significant toxicity.



When population variances are not equal, or test samples are nonnormally distributed (or both), concerns could be raised about using the two-concentration t test or the bioequivalence t test 2, because statistical assumptions might not be met. The U.S. EPA WET test methods 17 specify that if the data fail the Shapiro-Wilk normality test or Bartlett's homoscedasticity test (or both), a nonparametric test such as the Wilcoxon rank sum test should be used. Extension of such nonparametric tests to TST, however, is complicated because the null hypothesis for nonparametric tests is that results from control and effluent are from the same population. This is stated as the null hypothesis of no difference among treatments. Because an effect size of (1 − b) × µc, which is tied to the control population mean, is specified in the TST approach, a nonparametric equivalent to a t test approach using a bioequivalence formulation (such as with the TST approach) has been difficult to demonstrate 13, 22, 23. Another drawback of nonparametric methods such as the Wilcoxon rank sum test is that WET test data tend to have many ties (equal ranks), so that computation of an exact p value is often not possible. Our simulation results demonstrated that the Wilcoxon rank sum test generated an α error rate that deviated from the expected rate when WET data are extremely nonnormal and variances between the control and effluent treatment are heterogeneous. Applying the Wilcoxon rank sum test could declare more than 5% of tests as toxic when they actually are not. For these types of WET data, the nonparametric alternative may not be appropriate.
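The effect of ties on the rank sum statistic can be illustrated with a short Python sketch. It assigns midranks (the average rank within each group of tied values), computes the rank sum for the effluent group, and reports whether any ties are present, in which case the exact permutation distribution of the untied statistic no longer applies directly. The function names and replicate counts are hypothetical.

```python
def midranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_sum(effluent, control):
    """Return (rank sum of the effluent group, whether any ties occur)."""
    combined = list(effluent) + list(control)
    ranks = midranks(combined)
    w = sum(ranks[:len(effluent)])
    has_ties = len(set(combined)) < len(combined)
    return w, has_ties


# Hypothetical integer neonate counts; integer endpoints make ties common
control = [24, 27, 22, 30, 25, 28, 21, 26, 29, 23]
effluent = [22, 24, 19, 25, 24, 22, 18, 21, 25, 20]
print(rank_sum(effluent, control))
```

Count and percent endpoints take few distinct values with small replicate numbers, which is why ties arise so often in this kind of data.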

Data compiled from more than 2,000 valid WET tests in this study confirmed that the types of distributions exhibited in most WET test data do not seriously compromise the use of a t test. The data can be dealt with appropriately using Welch's t test for unequal variances, as shown in simulation analyses in this study. Use of Welch's t test for TST analysis is supported on the basis of analysis of actual WET test data, which indicate that most WET test data are normally distributed or have a leptokurtic distribution with short tails such that the use of Welch's t test produces type I error rates very close to expected error rates. Although the specific results pertain to the C. dubia reproduction endpoint, the general conclusions of this analysis would apply to all WET methods and endpoints. Such results confirm that Welch's t test has better coverage than the traditional t test using the TST approach when variances are unequal. Statistical literature indicates that actual power of the t test (and by extension Welch's t test) is greater when populations are leptokurtic, especially for small sample sizes typical of most WET test designs 10.

The WET test data are expected to have short-tailed distributions, supporting the use of Welch's t test, because of the test method's required test acceptability criteria and test termination times, which constrain the range of endpoint responses encountered. For example, a chronic C. dubia test must have 80% or more survival and an average of 15 or more young per surviving female in the control for the test to meet the required test acceptability criteria (i.e., a valid test 17). Additionally, test termination is prescribed in the method as the time at which at least 60% of the surviving control females generate at least three broods, which can be 6 to 8 d (maximum 8 d), also a test requirement 17. These requirements result in a lower distribution bound (e.g., reproduction responses in controls start at 15). In addition, the upper part of the distribution cannot go to infinity, even if populations were to survive and reproduce beyond the prescribed test requirements because of biological constraints. Similar test method and biological constraints apply to all other WET test endpoints (e.g., growth, survival).

Furthermore, Welch's t test is robust to nonnormal distributions when the underlying distribution is symmetric and skewness is low, especially with sample sizes greater than 10 24–26. The West Coast U.S. WET methods, the C. dubia reproduction method, and the mysid chronic growth method evaluated have sample sizes greater than 10 as part of the minimum test design. Therefore, Welch's t test is expected to be robust for these methods. Thus, at least for those WET methods and others with similarly large sample sizes (more organisms either per replicate or per treatment), Welch's t test should not result in a substantial underestimation of the type I error rate.

In addition, the type I error rate using TST for several WET methods is set at 0.05 or greater. The higher α levels include WET test methods that have smaller sample sizes such as the fathead minnow acute test. The slight overestimation of the nominal type I error rate that can occur using Welch's t test when WET test data are strongly nonnormally distributed is insignificant given the higher nominal α levels established for these methods 1, 4. For the West Coast WET test methods that have α levels set at 0.05, effect size examined in those test methods is large and, in many cases, data are normally distributed even without data transformation (e.g., giant kelp germination and germ-tube length endpoints, Table 2).

Whereas the type II error rate was not the focus of this study, we note that α values for all test methods were developed such that the type II error rate at a mean effect of 10% or less as compared with the control was 0.05 or less. Thus, a slight increase in the type I error rate using Welch's t test and nonnormal, heterogeneous data would have little effect on the type II error rate at a 10% or less mean effect. As noted in previous U.S. EPA guidance 1 and other publications 4, 5 regarding the TST approach, the rate at which a sample is declared toxic will increase as the mean effect increases above 10%. Thus, the type II error rate has been controlled in the TST approach, as discussed in previous publications 1, 4. Therefore, so long as the type I error rate remains stable with different types of data distributions or unequal variances, as demonstrated in the present study, very little change in the type II rate with percent mean effect is expected.

The observed sample distribution from 281 C. dubia multiple concentration tests indicates that test populations at low effluent concentrations (<20% effluent) are less likely to deviate from normal distribution. A similar trend is expected for other WET endpoints, such as growth, because low toxicity is less likely to lead to sudden changes to these endpoints. The simulation based on the distribution shape of the high effluent concentration population also indicates that the α error rate using Welch's t test is less than expected. That is, Welch's t test is more conservative (i.e., a lower than expected error rate) when toxicity is high. Therefore, the type of nonnormal distribution observed in WET tests should not negatively affect the outcome of TST analyses.

Analyses used to develop the TST approach indicate that data transformation (log or square root) does not correct the nonnormality issue for WET test data (Table 2). This is usually because of the leptokurtic distribution rather than because of skewness of data (Table 2). Therefore, data transformation before TST analysis is not recommended except for percent data, which should be arcsine square root transformed before TST analysis (consistent with current U.S. EPA analysis recommendations 17). This precaution is suggested because percent data (especially acute percent survival) are most prone to nonnormality.

In conclusion, given the leptokurtic and short-tailed distribution of most WET test data, as well as the other factors noted, Welch's t test is appropriate to use for one-tailed, two-sample comparisons using TST. Furthermore, because Welch's t test performs as effectively as the t test in terms of type I error when data are normally distributed and variances are equal 8, 9, Welch's t test should be used for all WET test data analysis using TST. Many researchers have shown that the combination of using a preliminary variance test (e.g., F test) plus a t test does not control type I error rates as well as simply always performing an unequal variance t test such as Welch's t test 27, 28. Several researchers therefore report that deciding whether to perform one statistical test on the basis of the outcome of another is generally unwise 29–31.


J. Fox provided statistical and analytical guidance in this research. J. Gilliam, J. Roberts, and M. Bowersox provided assistance on data compilation, database organization, and data presentation. This document has been reviewed in accordance with U.S. Environmental Protection Agency policy and approved for publication. Approval does not signify that the contents necessarily reflect the views or policies of the Agency nor does mention of trade names or commercial products constitute endorsement or recommendation for use.