Several recent letters to Integrated Environmental Assessment and Management and in Environmental Toxicology and Chemistry have discussed the need to replace the no observed effect concentration (NOEC) with the x% effects concentration (ECx) in environmental guidelines and publications (Landis and Chapman, 2011a, 2011b; Fox, 2011; Jager, 2012), and these follow numerous earlier letters and articles on this subject. In our view, calls for a ban on hypothesis testing in general, and NOECs in particular, are seriously misguided. The limitations of hypothesis testing and the benefits of regression analysis have both been exaggerated. There is no question that regression analysis is a vitally important tool in the statistician's toolbox and is ideally suited to evaluating some types of data. Conceptually, the advantages of ECx over NOEC are clear, in part because of the context in which the issue is usually framed. In an experiment with 5 or more concentrations of a chemical and a reasonably well-behaved quantal response (such as mortality) or continuous response (such as body or body-part weight or length, or biomass), it is often quite straightforward to fit a regression model from which meaningful ECx estimates can be derived for some range of values of x. For data of this sort, the ECx approach is appropriate, with some cautions discussed below. Alas, the conditions that are compatible with the ECx approach do not always match the reality of ecotoxicity testing.
Hypothesis testing has much wider application than determining NOECs, and attempts to eliminate NOECs should not be interpreted as calls for the elimination of hypothesis testing in general, which appears to be the implication of Landis and Chapman (2011b). The authors suggesting the ban on NOECs seem to take a very narrow view of the questions that are asked in exposure–response testing, and this narrowness makes their arguments misleading. Likewise, it is an error to equate calculation of ECx values with the use of regression analysis. Use of regression analysis and calculation of ECx values are 2 separate (although related) concepts. Indeed, ECx values can be calculated without use of regression methods; for example, by using an up-and-down sequential estimation method.
The question should be framed in terms of whether the use of ECx values is preferable to the use of NOEC values and in what situations. The answer to this question is yes in many situations, but by no means all. Moreover, this answer does not have any implication regarding the use of hypothesis testing per se. Concerns regarding the shortcomings of using NOECs have been discussed widely (see Landis and Chapman, 2011a for citations) and are appropriate. However, there are also problems with use of ECx values that need to be addressed but typically are not. Increased emphasis on the calculation of ECx values should be accompanied by attendant efforts to provide guidance that deals with these problems, or in many cases, we will be trading one type of deficiency for another.
SIMPLIFIED STATISTICAL THINKING HAS MALIGNED THE NOEC
Regression analysis is contrasted in the letter of Landis and Chapman (2011a) with a hypothesis testing approach that is neither statistically nor biologically sound, as though that were all that hypothesis testing in this area has to offer. The letter indicates that hypothesis testing starts with an ANOVA test for differences among the control and test concentrations. If that is not significant, one concludes that there is no effect, and the NOEC is the highest tested concentration. In the scheme that they describe, if this omnibus test is significant, then one uses some pairwise multiple comparison procedure to compare each treatment group against the control, independent of what happens in other treatment groups. They lament that valuable information from the experiment is lost. They are correct, but this is not an appropriate approach to this analysis on several grounds. An omnibus test for differences among the control and treatments, such as an ANOVA F test, is not an appropriate gatekeeper for whether to test for treatment effects in a toxicity test where the treatments form an ordered set of exposure concentrations of a single chemical. The F test (and similar statements apply to the Kruskal–Wallis test for nonparametric analysis) guards against many comparisons that are typically of no interest in toxicology, such as comparisons of one treatment group against another. Consequently, the F test may not be significant when there are real treatment effects that show up in the form of trends in the concentration–response. Such effects would be missed, because no further testing would be done. On the other hand, the F test may be significant because of some difference among the treatments that does not indicate a significant difference of any treatment compared to the control. In either case, the use of the F test as a gatekeeper distorts the significance levels of subsequent comparisons of treatments to control.
Hochberg and Tamhane (1987) and Hsu (1996) discuss this issue in a broader context. Even if one were to do pairwise comparisons of treatments to control, those comparisons do not, and should not, depend on a prior significant F test. The significance levels associated with Dunnett's test, for example, are based on an assumption of its use independent of the F test or any other prior test. However, Dunnett's test is not appropriate for toxicity experiments, and there are much better ways to analyze toxicity data that do take the concentration–response trends into account.
Several trend-based tests are available that assume only a monotone concentration–response rather than a specific mathematical shape and that have much better power properties than pairwise tests. To be clear, the model underlying trend tests is that the means satisfy µ0 ≥ µ1 ≥ µ2 ≥ µ3 ≥…≥ µk (or with all inequalities reversed). Trend tests include Williams' test (that has the usual requirement of normality and variance homogeneity) and the Jonckheere–Terpstra test (that is much more broadly applicable). Both follow a step-down testing strategy that makes direct use of any concentration–response trend in the data and does not require adjusting the significance levels by the number of treatment groups to preserve the nominal false positive rate. These trend-based tests are in several recent test guidelines (e.g., Organisation for Economic Co-operation and Development [OECD] TG 229–231, 234) and are discussed in detail in OECD (2006).
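To make the step-down strategy concrete, the sketch below implements a permutation version of the Jonckheere–Terpstra trend test and the step-down determination of a NOEC. This is a minimal illustration rather than a validated implementation: the group data, concentrations, permutation count, and seeded random generator are all arbitrary choices for demonstration, and a guideline analysis would use an exact or large-sample version of the test.

```python
import numpy as np

rng = np.random.default_rng(0)

def jt_stat(groups):
    # Jonckheere-Terpstra statistic: for every pair of groups (i < j),
    # count observation pairs ordered in the direction of increasing
    # concentration; ties count one half.
    s = 0.0
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            for x in groups[i]:
                for y in groups[j]:
                    if y > x:
                        s += 1.0
                    elif y == x:
                        s += 0.5
    return s

def jt_perm_p(groups, n_perm=2000, direction="decreasing"):
    # One-sided permutation p-value for a monotone trend across
    # ordered groups (here: response decreasing with concentration).
    pooled = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    cuts = np.cumsum([len(g) for g in groups])[:-1]
    obs = jt_stat(groups)
    hits = 0
    for _ in range(n_perm):
        s = jt_stat(np.split(rng.permutation(pooled), cuts))
        if (s <= obs) if direction == "decreasing" else (s >= obs):
            hits += 1
    return (hits + 1) / (n_perm + 1)

def step_down_noec(groups, concs, alpha=0.05, direction="decreasing"):
    # Step down: test the full set of groups; while the trend is
    # significant, drop the highest concentration and retest. The NOEC
    # is the highest concentration in the first nonsignificant subset.
    for k in range(len(groups), 1, -1):
        if jt_perm_p(groups[:k], direction=direction) > alpha:
            return concs[k - 1]
    return None  # significant effects down to the lowest concentration

# Hypothetical growth data: control plus 3 concentrations (mg/L).
control = [10.1, 9.8, 10.3, 10.0]
low     = [9.9, 10.2, 9.7, 10.1]
mid     = [9.0, 8.8, 9.2, 8.9]
high    = [7.1, 7.4, 6.9, 7.2]
noec = step_down_noec([control, low, mid, high],
                      concs=[0.0, 1.0, 3.2, 10.0])
# The trend remains significant until only control and low are left,
# so the NOEC is 1.0 mg/L.
```

Note that no adjustment of the per-step significance level is needed: the step-down sequence itself preserves the nominal false positive rate, which is one of the advantages over pairwise comparisons noted above.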
It is interesting that van Dam et al. (2012) cite OECD (2006) as supporting the ban on NOECs in regulatory work. Two of the undersigned were major contributors to OECD (2006). Although some members of the team were highly critical of NOECs, others argued that estimation of ECx values using standard regression techniques was also severely flawed, and favored eliminating regression as well, in favor of what some call biologically based models. It was not the judgment of the team writing the OECD document to support such a ban, and the extensive descriptions in the document of hypothesis testing methods recommended for use in ecotoxicological testing do not support the conclusion that a ban on using NOECs was intended.
DATA TYPES UNSUITABLE FOR ECx DETERMINATION
One may still prefer ECx approaches where the experimental design is adequate. However, before a blanket requirement for EC10 or other ECx values (e.g., EC5, EC20, or other ECx for unspecified x) is imposed, useful models must be developed for several types of responses and situations, and some consideration needs to be given to the quality of the estimates produced by these models. Several types of data that arise in ecotoxicity experiments will illustrate this need. The EC10 is discussed in these illustrations, but we in no way want to imply that a 10% effect is the appropriate level of effect for all types of tests and endpoints. Indeed, the selection of the value for x is very important and often ill-considered.
Consider severity scores that arise in fish or amphibian histopathology often carried out in conjunction with studies of potential endocrine disruptors. These scores are typically on an ordinal scale ranging from 0 for normal to 3 for severe abnormality (for amphibian thyroid glands) or 0 to 4 (for fish gonads). Obviously, ordinary regression models are inappropriate for analysis of such scores. For example, the severity scores that are observed for many features assessed in gonads of control fish in the Fish Short Term Reproduction Assay typically are 0 or 1 with scores up to 4 occurring infrequently. It is also not clear what an EC10 would mean for such data. It cannot mean a 10% increase in the incidence of scores above 0 (or above 1 if all or most controls are scored 1), for that ignores the severity of the effect. It cannot mean a 10% increase in mean score, because mean score has no meaning for ordinal-scale variables. One could come up with some sort of weighted average of scores for all levels, but it is not at all clear whether there is biological meaning to that idea. There is also the question of how to properly accommodate the replicate nature of the experimental design in the estimation of an ECx value when multiple tanks of multiple fish are tested at each concentration. Green et al. (2012) have developed a hypothesis testing approach that handles this type of response and accounts for replicates and the expected increase in severity with increased concentration.
For fish and daphnia experiments, first and last day of reproduction, hatching, or swim-up illustrate another, somewhat similar, class of responses for which ordinary regression is not suitable. The direction of effect can be up or down, although generally in only 1 direction for a given chemical. The range of values (days) is typically very limited (with only 1 to 4 values) because the events are measured only daily, and all occur over a 1- to 4-day interval. One might prefer measurements to be taken more frequently, but it is not practical to do so and, in any event, the data to be analyzed are as described. The problem cannot be dismissed with a call for a better metric. To avoid confusion, these responses are not recorded on individual fish, but on the replicate, so there is no averaging of first day of swim-up across fish within a replicate. Each replicate has 1 value, and it is an integer. It is not unusual for the response to be constant for the control and lowest concentrations. It sometimes happens that only the high concentration shows any difference from the control. Unlike the severity scores, it does make sense conceptually to ask for the concentration that produces a specified (e.g., 10%) increase (or decrease) in the mean number of days to swim-up. However, it is not clear what model(s) should be used for the purpose, given the large number of tied scores, or what advantage there is to the regression approach beyond finding a significant delay or acceleration in the response. One suggestion that has been considered is to change the definition of ECx for these responses from a percent change in the time to swim up to the percentage of the population that is delayed. Apart from the confusion that would then arise from 2 meanings of the term, this approach would ignore the severity of the effect. That is, it would treat a delay of 3 days the same as a delay of 1 day.
An exact permutation version of the 2-sided step-down Jonckheere–Terpstra test has proven effective in capturing increases or delays in these responses.
The analysis of the proportion of abnormal or deformed larvae in fish early-life-stage experiments is also problematic. Abnormalities are infrequently observed in control organisms, and the highest observed incidence rate in treated organisms can be under 10% but still statistically significant. Should we merely report that EC10 exceeds the highest tested concentration? What model should be used? If we model replicate tank proportions, then the requirements of ordinary regression are violated by the many zero-proportion tanks. If we do probit, logistic, or Weibull modeling, we ignore the replicate nature of the experiment; moreover, there may not be enough variation to estimate the model parameters, and there very likely will not be enough to construct confidence bounds on ECx. Is an ECx estimate without confidence bounds better than a NOEC? If so, how is it better?
Consider survival analysis in avian and fish studies. It is not uncommon for every concentration to have either 0% or 100% survival, or for only 1 concentration to have partial survival. In these cases, it is either not possible to fit a probit, logistic, or Weibull model, or a model can be fitted but without any possibility of computing confidence bounds. It is still possible to compute an EC50 estimate with confidence bounds using moving average angle or binary methods. Do these qualify as regression estimates? If these methods are used, then the confidence bounds are based on hypothesis tests, even though an ECx is estimated. Alternatively, OECD TG 425 provides an up-and-down method for estimating EC50 with confidence bounds that is not regression based. Use of these methods should not be discouraged simply because they are not based on regression techniques.
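To make the contrast concrete, here is a sketch of one long-established non-regression LC50 estimator, the Spearman–Kärber method (similar in spirit to, though not identical with, the binary and moving-average methods mentioned above). It assumes monotone mortality with 0% at the lowest and 100% at the highest concentration; the concentrations and mortality proportions below are hypothetical.

```python
import math

def spearman_karber_lc50(concs, p_mort):
    # Spearman-Karber estimator on the log-concentration scale.
    # Assumes concentrations sorted ascending, monotone proportions,
    # p = 0 at the lowest and p = 1 at the highest concentration.
    logs = [math.log10(c) for c in concs]
    acc = 0.0
    for i in range(len(concs) - 1):
        acc += (p_mort[i + 1] - p_mort[i]) * (logs[i] + logs[i + 1]) / 2.0
    return 10.0 ** acc

# A pure step response: mortality jumps from 0% to 100% between
# adjacent concentrations. No probit or logistic slope is estimable,
# yet the Spearman-Karber LC50 is well defined: the geometric mean of
# the 2 concentrations bracketing the jump.
lc50 = spearman_karber_lc50([1.0, 2.0, 4.0, 8.0], [0.0, 0.0, 1.0, 1.0])
# lc50 equals sqrt(2 * 4), about 2.83
```

The point is not that this estimator is always appropriate, but that a defensible EC50 can exist for exactly the data patterns where maximum likelihood regression fails, and that such methods should remain available.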
Another interesting case is the measurement of vitellogenin (VTG) in fish to evaluate possible endocrine effects of chemicals. The measured response, VTG concentration, is extremely variable, and effects of 1000% or higher increases in VTG are observed; frequently there is also very high interlaboratory and even intralaboratory variability in this measurement. The data are continuous but by no means normally distributed or homogeneous in variance. For hypothesis testing purposes, a log-transform or even a rank-order transform is used to deal with the huge spread in the data. Regression models, even on log-transformed responses, are often very poor and generate very wide confidence bounds. It is totally pointless to estimate EC10 with such data, so what size effect should be estimated? Furthermore, do we estimate an x% effect based on the untransformed control mean or on the log-transform? If the former, the model-fitting algorithm will usually not converge. If the latter, the meaning of ECx will vary from experiment to experiment in a much bigger way than with more well-behaved data. For example, if the mean control response is 10, 100, 1000, or 10 000, then a 10% increase in the logarithm corresponds to a 26%, 58%, 100%, or 151% effect in the untransformed values.
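The arithmetic behind those figures is easy to verify: a fixed 10% increase on the log10 scale translates into a control-dependent percent effect on the original scale.

```python
import math

# A 10% increase in log10(response) corresponds to very different
# percent effects on the original scale, depending on the control mean.
for control_mean in (10, 100, 1000, 10000):
    shifted = 10.0 ** (1.10 * math.log10(control_mean))
    pct_effect = 100.0 * (shifted / control_mean - 1.0)
    print(f"control mean {control_mean:>6}: {pct_effect:.0f}% effect")
# Prints effects of 26%, 58%, 100%, and 151%, matching the text.
```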
At the other extreme, length measurements in daphnia chronic studies typically have very low variability and show only small changes in mean length, quite often less than 10%. Nonetheless, we often find statistically significant effects as small as 2% to 3%, and these are considered biologically important. Do we change this practice and report only that EC10 exceeds the highest tested concentration? Or do we undertake the challenging process of insisting that the biologists and/or ecotoxicologists specify the size effect to be estimated? Statisticians would like to have that information regardless of the statistical approach used but have found it difficult to obtain agreement on appropriate effect sizes.
Analysis of sex ratio in the fish sexual development test involves evaluating increases or decreases in the proportion of one phenotypic sex that may be attributable to an endocrine disrupting chemical. For those species where a genetic marker of sex is available, we can model true sexual reversal (i.e., when genetic sex is not the same as phenotypic sex, and the expected rate of reversal equals zero), and this is statistically equivalent to analysis of mortality data where there are well known regression models that can be used, either ignoring the replicate experimental design or, preferably, taking it into account. For those species where only the phenotypic sex can be determined, the control proportion for males or females is expected to be approximately 0.5, which is where the sample proportions have their maximum variance. Thus, there is considerable background incidence to take into account and more variability in control proportions than encountered in survival studies. It is not unusual to find a 20% standard error in the estimated control proportion of fish with a given phenotypic sex. If a regression model can be fit to these data (and usually, but not always, it can be), it is possible to estimate EC10, but does it make sense to do so? More generally, does it make sense to use an ECx estimate if its confidence interval includes zero or spans several test concentrations, or if the confidence interval for the estimated proportion at ECx contains the control mean proportion of the sex? In what way is such an ECx an improvement over a NOEC? Much of the advocacy for ECx over NOEC mentions the value of confidence intervals for ECx and the absence of same for the NOEC. Yet in application in the regulatory context, almost no consideration is given to the confidence interval and what it says about the quality of the estimate. 
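The variance problem noted above follows directly from binomial sampling: the standard error of a sample proportion, sqrt(p(1 - p)/n), is maximized at p = 0.5, exactly where control sex ratios sit. The tank size in this sketch is a hypothetical illustration, not a guideline value.

```python
def prop_se(p, n):
    # Standard error of a sample proportion: sqrt(p * (1 - p) / n),
    # which is maximized at p = 0.5.
    return (p * (1.0 - p) / n) ** 0.5

# With a hypothetical 25 sexed fish per treatment level:
se_sex_ratio = prop_se(0.5, 25)    # 0.10, i.e., 20% of the 0.5 estimate
se_background = prop_se(0.05, 25)  # about 0.044: low-incidence endpoints
                                   # such as control mortality are far
                                   # less noisy
```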
There is a real danger of imparting an illusion of precision through a mathematical equation based on data that cannot support such precision. With all the limitations that people claim for the NOEC, science is not advanced by replacing it with a very precise but highly uncertain ECx estimate.
In the “simple” case of survival analysis, it is common to see step responses, even when test concentrations are very close together (e.g., in acute fish tests, mortality jumps from 0% to 100% in adjacent test concentrations). ECx values can be calculated in various ways in this situation (methods include regression, but not using standard regression models). In such cases, calculation of an EC50 may give a meaningful value (based on an assumption of symmetry of the response distribution), but estimation of an EC10 is highly problematic because no information is available to derive its location. Indeed, an examination of results from a large number of acute studies that were carried out by Rufli and Springer (2011) suggests that mortality patterns in most fish acute toxicity tests carried out according to US Environmental Protection Agency (USEPA) and OECD test guidelines cannot be analyzed using standard maximum likelihood regression methods (e.g., fitting probit or logistic models). Regression analysis of mortality (e.g., probit or logistic models) using standard maximum likelihood methods requires that partial mortality (mortality >0% and <100%) occurs in at least 2 test concentrations. Rufli and Springer (2011) examined results in 2 databases: an “Industry Laboratory Database” consisting of data from a series of 523 96-h LC50 studies carried out according to OECD TG 203 from 1990 to 2000 by laboratories of Ciba-Geigy AG, Novartis, and Syngenta Crop Protection AG; and a second database consisting of 4010 studies carried out according to the comparable USEPA guidelines (OPPTS 850.1075 and FIFRA 72-1) that was extracted from the OPP Pesticide Ecotoxicity Database. Mortality was observed in at least 1 concentration in all tests, but partial mortality was observed in 2 or more concentrations in only 26% of the studies in the Industry Laboratory Database.
The percentage of studies from the OPPTS database with fewer than 2 partial mortalities could not be determined directly, but slope estimates were provided for only 16% of studies in the database, suggesting that many of the studies had fewer than the 2 treatment groups with partial mortalities that would be required to estimate the slope using standard regression methods. Clearly, discussion of estimating ECx values for fish acute toxicity tests carried out according to USEPA and OECD test guidelines should focus on techniques that are appropriate for the majority of studies, and such techniques will not be standard regression methods. Relatively widely accepted methods are available for estimating EC50 values when fewer than 2 partial mortalities occur in a study, but no such methods exist for estimating EC10 or EC20 values; if such ECx estimates are to be required, guidance on appropriate methods will have to be developed.
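The identifiability condition behind the Rufli and Springer tabulation can be checked mechanically: a probit or logistic fit needs both a location and a slope, and hence at least 2 concentrations with partial mortality. The data in this sketch are hypothetical.

```python
def n_partial(dead, total):
    # Count test concentrations with partial mortality (0 < dead < n).
    # Standard maximum likelihood probit/logistic fitting needs at
    # least 2 such concentrations to estimate location and slope.
    return sum(0 < d < n for d, n in zip(dead, total))

# A typical near-step response from a hypothetical acute fish test:
dead  = [0, 0, 3, 10, 10]
total = [10, 10, 10, 10, 10]
slope_estimable = n_partial(dead, total) >= 2  # False: 1 partial kill
```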
Analyses of split plots and other more complicated designs are not readily approached using ECx calculations, and indeed, it may be important to address questions such as the significance of the effects of plots. There should be no restriction on the use of hypothesis tests in the analysis of such studies.
Although the letter from Fox (2011) dismissed concerns about the challenges of model selection, it should be kept in mind that most analyses of data from routine studies are not done by professional statisticians, and this is unlikely to change no matter how much we in the statistics field might wish otherwise. Fitting regression models is more demanding than hypothesis testing. It allows greater interpretive capability but at a price. If a professional statistician is fitting the model, there should be little problem with model selection, except in those cases where no model fits. However, many labs that do testing for regulatory submissions do not have competent professional statisticians on staff or available for consultation. In these situations, what will likely happen is that a model of convenience will be fit, usually with the best of intentions but sometimes ignoring real problems with the model or data. Anyone who doubts that should consider the state of affairs with species sensitivity distributions, where the default distribution is log-normal and goodness-of-fit is often ignored. There are fewer problems when nonstatisticians calculate NOECs, because it is possible for the software to be programmed with a more complete decision-making process.
Although there are several widely accepted models for the sort of continuous responses often encountered in ecotoxicity testing (Slob, 2002; Bruce and Versteeg, 1992; OECD, 2006), that cannot be said of models for data where hormesis or low dose stimulation is observed. Failure to take hormesis into account can over- or underestimate ECx by raising the estimated control mean from which an x% effect is calculated. Although several hormetic models are discussed in the literature, such as the Brain–Cousens model (Brain and Cousens, 1989; Schabenberger et al., 1999), none can be said to be universally or even widely accepted in the regulatory community. For that reason, and because of the huge financial implications of rejected product submissions, some companies and labs are reluctant to use them. The issues in this and the previous paragraph are practical, not technical, but cannot be ignored.
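For readers unfamiliar with it, the Brain–Cousens model adds a linear stimulation term to a log-logistic curve. The sketch below uses the common 5-parameter form (as parameterized, to our understanding, in the R drc package's BC.5 function); the parameter values are arbitrary and chosen only to show the hormetic bump.

```python
import math

def brain_cousens(x, b, c, d, e, f):
    # 5-parameter Brain-Cousens hormetic model: a log-logistic curve
    # (lower limit c, upper limit d, slope b, location e) with a linear
    # stimulation term f * x added to the numerator.
    return c + (d - c + f * x) / (
        1.0 + math.exp(b * (math.log(x) - math.log(e))))

# With f > 0 the response at low doses exceeds the control asymptote d,
# which is why ignoring hormesis distorts the baseline against which an
# x% effect is measured.
y_low  = brain_cousens(0.1, b=2.0, c=0.0, d=100.0, e=1.0, f=50.0)
y_high = brain_cousens(10.0, b=2.0, c=0.0, d=100.0, e=1.0, f=50.0)
# y_low is above 100 (stimulation); y_high is well below 100 (toxicity)
```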
As the last example (but not the last type of response where the regression-only approach raises questions), consider repeated-measures studies, possibly including multiple generations, such as for the Japanese medaka multigeneration studies being developed by Japan and the USEPA under OECD auspices. Although it is certainly possible to fit a regression model to weekly egg production data using dummy variables for generations and measurement times within generations and base it on the proper covariance structure, there is no meaning to interpolating between generations and often no interest in interpolating between weeks over the 5 to 6 weeks of prime egg laying. Interpretation of the ANOVA model, with comparisons of generations and egg production weeks within and across generations among other comparisons, is easier than that of the regression model, even though it does not allow interpolation of the effect between tested concentrations. We can do regression in this situation, but if the scientists and regulators do not want it and see no need for it, do we insist they use it anyway?
In closing, it is clearly appropriate to use regression models and to estimate ECx values where the data are sufficient and “fit-for-purpose” and the choice of the percent effect to be estimated has a biological basis. However, we believe that the claim that this one statistical tool is the only meaningful way to analyze ecotoxicological data cannot be supported. We are further concerned by the imposition of some arbitrary percent effect (10%; 20% is suggested in some OECD test guidelines; 25% is mentioned in some USEPA test guidelines) that can be too small or too large depending on the response and species. One of the undersigned has evaluated raw data from 41 algal toxicity tests and found a reasonable correspondence between the NOEC and the EC20 based on biomass, whereas the NOEC correlates with the EC10 based on growth rate. This is not to say that the 2 types of results are interchangeable, but merely to illustrate that “no one size fits all” when it comes to the selection of the value for “x.” In fact, we do not believe that NOECs and ECx values are interchangeable at all, because ideally, they are derived from completely different experimental designs. However, it is common to see regulatory guidance (such as the USEPA) use, for example, the “NOAEC or EC05” from a terrestrial plant toxicity test for the evaluation of risks to endangered plants, in effect equating the two.
We have only briefly alluded to situations where regression is normally appropriate, but for a particular data set, no viable model can be found. Nor have we addressed the implications of a tiered regulatory evaluation process on experimental designs and the statistical methods. Tier 1 studies typically may have only 1 to 3 test concentrations (e.g., USEPA EDSP Tier 1 ecotoxicity tests). The goal of carrying out these studies is not to quantify the concentration at which a specific size effect occurs but to determine whether there is an effect that needs to be explored further in a higher tier study. Regression models for these experiments, and ECx estimates from them, have little or no value. Nor is there any practical way to change all the Tier 1 studies into studies that involve larger numbers of test concentrations that would allow regression. We do not see any use in doing so in any event. There are good tools to be brought to bear on the types of data and situations we have indicated, and our hands should not be tied to prevent us from using them. Large numbers of important ecotoxicological studies will still be carried out to meet regulatory requirements for NOEC values. Indeed, the studies involved, such as avian reproduction studies, fish full lifecycle studies, and endocrine disruptor screening tests are among the most expensive, data intensive, and important of all ecotoxicological studies. Barring the results from such studies from the scientific literature seems a most unfortunate suggestion. It is possible that some of the results from these tests could be reanalyzed to provide ECx values for publication, if scientists could agree on the value of x for each study type, species, or endpoint, but the reality is that this is not likely to happen even in those cases when it is possible.