The effective design of field studies requires that sample size requirements be estimated for important endpoints before conducting assessments. This a priori calculation of sample size requires initial estimates for the variability of the endpoints of interest, decisions regarding significance levels and the power desired, and identification of an effect size to be detected. Although many programs have called for use of critical effect sizes (CES) in the design of monitoring programs, few attempts have been made to define them. This paper reviews approaches that have been or could be used to set specific CES. The ideal method for setting CES would be to define the level of protection that prevents ecologically relevant impacts and to set a warning level of change that would be more sensitive than that CES level to provide a margin of safety; however, few examples of this approach being applied exist. Program-specific CES could be developed through the use of numbers based on regulatory or detection limits, a number defined through stakeholder negotiation, estimates of the ranges of reference data, or calculation from the distribution of data using frequency plots or multivariate techniques. The CES that have been defined often are consistent with a CES of approximately 25%, or two standard deviations, for many biological or ecological monitoring endpoints, and this value appears to be reasonable for use in a wide variety of monitoring programs and with a wide variety of endpoints.
The traditional paradigm in ecological studies and environmental monitoring bases both research and management decisions on statistical tests of null hypotheses (H0). Statistical tests of collected data are used to evaluate the probability of observing a test statistic at least as extreme as the one observed, assuming H0 is true. Newman recently reviewed the use of statistical tests and highlighted some of the concerns. The statistical test sets a predefined critical significance value (α), which is the type I error rate, and a second predefined critical value (β), which describes the overall type II error rate. Typically, α is set at 0.05, but a range of values is used for β. In addition, insufficient attention has been given to setting β in a justifiable manner. Some programs (e.g., the Canadian Environmental Effects Monitoring [EEM] program) have set α = β for field programs so that the risk of a false conclusion is equally balanced between risk to industry (type I error) and risk to the ecosystem (type II error).
If the probability computed by the test (the p value) is less than α, a significant departure from H0 is indicated; HA (the logical opposite of H0) is therefore supported, and a difference, or effect, likely exists. The size of the difference between treatments that is defined as a meaningful effect is the effect size [1,3]. The designation of this value for purposes of study design often is referred to as the critical effect size (CES). No clear guidelines exist on how to (or who should) determine how large an effect is unacceptable, and this determination can vary with the design, purpose, and regulatory basis for a monitoring program. Differences exist between the statistical threshold for detecting changes and the threshold that describes either ecologically important changes or the changes thought to be of importance to higher levels of organization. Few studies have attempted to link changes across levels of organization [6,7], and even fewer have tried to define levels of change that would be protective of higher levels of organization or important ecological processes.
The specification of a CES has been argued to be one of the most crucial aspects of environmental monitoring programs [3,8–10], although failure to consider CES a priori has been widespread . In practice, CES rarely are used to guide ecological experiments or environmental management decisions in an effective manner, and few examples exist of aquatic monitoring programs implementing CES in a useful way. This has resulted in considerable criticism regarding the adequacy of environmental monitoring programs [11,12].
Critical effect sizes are defined by two components: The form and the magnitude (i.e., the type and the size) of the impact to be detected. The form of impact involves deciding what endpoints to monitor (e.g., individual- and/or community-level characteristics), deciding whether we are concerned with the means and/or variances at impacted sites relative to control sites, and specifying at what scales the impact is expected to occur. The magnitude of the impact is a measure of the amount by which means or variances change.
Critical effect sizes can be used in two main ways: As an a priori component of study design to set target levels for sample collection sufficient to detect a difference that would trigger more detailed monitoring [1,2], or a posteriori to identify the minimum effect size (difference between H0 and HA) that would be deemed unacceptable and trigger management actions [3,13,14]. Decisions regarding management action are site-specific and depend on a number of factors, including the magnitude and extent of changes, species sensitivity, number of endpoints responding, and their trend over time [2,15]. Ideally, a monitoring framework would be intentionally designed with endpoints that would be more sensitive than regulatory endpoints (e.g., species presence and absence) to allow a response time that would mitigate serious impacts. In any case, the CES becomes a component of study design used to calculate the number of animals, samples, or replicates required to make decisions.
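As a rough illustration of how a CES enters an a priori sample size calculation, the sketch below uses the standard normal-approximation formula for comparing two means, n = 2(z_{1-α/2} + z_{1-β})^2 (σ/δ)^2, where δ is the CES and σ is the endpoint standard deviation in the same units. The function name, the choice of a two-sided test, and the example variability are illustrative assumptions, not values taken from any monitoring program.

```python
import math
from statistics import NormalDist


def samples_per_group(ces, sd, alpha=0.05, beta=0.05):
    """Approximate n per site for a two-sided comparison of two means.

    ces and sd must share units (e.g., both as % of the reference mean).
    Normal approximation: slightly underestimates n when n is small.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided type I error
    z_beta = z.inv_cdf(1 - beta)        # type II error (power = 1 - beta)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / ces) ** 2)


# Hypothetical endpoint: SD equal to 30% of the mean, CES of 25%, alpha = beta = 0.05
n = samples_per_group(ces=25, sd=30)
print(n)  # 38 under these assumptions
```

Because n scales with (σ/δ)^2, halving the CES roughly quadruples the required sample size, which is why the choice of CES dominates the cost of a design.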
Table 1. Potential methods for determining critical effect sizes in monitoring programs
Used by government in monitoring
Used in research
Ecologically relevant differences
Maybe; data not available
Would require time and monetary investment
Legal mandate or program policy
Already exist in the Canadian Environmental Effects Monitoring program
Using a number from outside the monitoring program
Probably insufficient data
Universal effect size
Data distributions at reference sites
Range of natural variability
Within-site natural variability
Range of differences between studies
Data distributions across impacted and reference sites
Distribution of impacted site data relative to reference data
Distribution of relative comparisons between sites
Already exist in the Canadian Environmental Effects Monitoring program
Distribution of statistical results of comparisons between sites
Pooled effect size
Analysis of reference conditions
Combination multimetric and reference condition
× (European Union)
Probably not useful for Environmental Effects Monitoring
Magnitude of effects relative to impacts
Maybe; not enough detail outlined in reports
Spatially based modeling
Based on fish community approach, may be suitable for benthos
A variety of other papers have focused on issues related to power [3,8,10,12,16–21] and sampling design [10,19,22–25]. The present paper focuses on approaches that have been or could be used to set a CES or on approaches and processes that have been used to set targets and their potential relevance for setting CES. Three approaches are ideal: Knowing the size of a change that is ecologically relevant, negotiating a level of change that would be considered as ecologically relevant, and adopting a number set from ecological studies and environmental monitoring [3,26], medical research [27,28], or behavioral sciences . Because so few attempts have been made to set CES, the present review examines alternative approaches, and it divides them into numbers derived from outside the specific monitoring program, the examination of past data using distributions, or multivariate techniques (Table 1). Regardless of the approach taken or the endpoints in question, the magnitude of the change that programs have used for a CES has been similar (Table 2). Detailed descriptions of the potential approaches follow.
Table 2. Summary of critical effect sizes found in the literature review
Magnitude of difference
Ecologically relevant differences
Abundance of invertebrates settling relative to reference areas
Three main approaches to develop an ideal CES include using published data to determine the magnitude of changes in similar endpoints that would result in changes at higher levels of biological organization, negotiating a CES a priori between stakeholders, and adopting proven CES used in other monitoring programs that use similar endpoints.
Citing or developing sufficient background data
Few attempts have been made to use comparative studies and ecological theory to develop a CES. Lincoln-Smith et al. examined the effects of a marine reserve on the recovery of valuable, commercially exploited tropical invertebrates and designed a study to detect 25 and 50% changes in proportional abundance relative to data collected before establishment of the marine reserve. Critical effect sizes were based on the percentage difference in the abundance of selected invertebrates at the marine reserve (before its establishment) relative to the abundance of selected invertebrates reported in the literature in areas free of exploitation elsewhere in the tropical Pacific.
Other examples of successful approaches are limited, but hypothetical examples of defining a CES from ecological-response thresholds and sensitivity were provided by Mapstone regarding the effect of loss of key algal species on the functioning of rocky intertidal platforms and by Downes et al. concerning the potential impacts on natural benthic macroinvertebrate species of liming Welsh streams to promote the survival of trout. These hypothetical examples employ data and safety margins or stakeholder negotiation, and they could be used to guide data development to define CES. In reality, however, few studies have the luxury of the time lag required to develop the baseline databases needed to define these CES a priori. Specifying a CES is not a simple procedure, and unfortunately, ecological information at the level of detail presented in the mentioned examples [3,9] seldom is available. Consequently, few regulations state a required CES.
Deriving numbers from stakeholder negotiation
Although science is involved in quantifying relationships, the strength of impacts, and the variables being measured, social and economic values need to be considered when trying to decide what constitutes “harmless” or “acceptable”. As a consequence, several authors emphasize the importance of public consultation with landowners, environmental groups, and industry when determining CES so that the result is credible and widely accepted by the stakeholders involved [3,9,12].
One of the major challenges involved with determining the magnitude of change deemed important is achieving consensus between stakeholders regarding how to interpret the results. For example, although descriptions of the impacts of pulp mill effluent on sexual maturation and reproductive development in fish have been available for more than 15 years , a lack of consensus remains among stakeholders concerning whether these changes are real, consistent, or important . The inability to reach consensus about the existence of impacts, or about their causes, is based on a number of factors, including a lack of agreement regarding the importance of measurement endpoints, confusion about the relevance of changes, and confusion about what would have to be done if an impact was declared. The decision concerning the acceptability of changes can be based on science or can be developed through consensus agreement on the nature of changes that will be socially or economically unacceptable. These decision points can be agreed to through stakeholder consultation a priori, before data collection, or a posteriori.
In Sweden, a multistakeholder group met over a considerable period of time and proposed levels of impact that would be considered as unacceptable . The group divided indicators into functional groups  and recommended that if three or more variables in the same functional group were significantly affected (meaning a statistical difference), this should be interpreted as an unacceptable disturbance of the function. For physiological functions, an unacceptable disturbance was two or more statistical differences, which would represent an unacceptable disturbance of fish health and an evident risk of population effects through increased mortality. If statistical changes occur in a functional group in one (physiological) or two (other) variables, further investigations would be needed to confirm the responses and to analyze their wider significance.
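The counting rule described above can be written as a small decision function. This is a minimal sketch of the rule as stated in the text; the function name and return labels are hypothetical, and the text does not say how a single significant variable in a nonphysiological group should be treated, so that case is simply reported as below the stated thresholds.

```python
def classify_functional_group(n_significant, physiological=False):
    """Classify a functional group by its count of statistically significant variables.

    Thresholds follow the text: an unacceptable disturbance is >= 2 significant
    variables for physiological functions and >= 3 for other functional groups;
    one fewer than that triggers further investigation.
    """
    unacceptable_at = 2 if physiological else 3
    if n_significant >= unacceptable_at:
        return "unacceptable disturbance"
    if n_significant == unacceptable_at - 1:
        return "further investigation"
    # Not addressed in the text (assumption): no stated threshold was reached
    return "below stated thresholds"


print(classify_functional_group(2))                      # further investigation
print(classify_functional_group(2, physiological=True))  # unacceptable disturbance
```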
This strategy of multistakeholder negotiation to a priori define levels of impact that would be considered as unacceptable does not require that the changes be ecologically relevant, only that all parties agree in advance that the changes would be considered important enough to correct. It should be emphasized that the Swedish EEM program includes a wide variety of endpoints across multiple levels of organization, ranging from physiological changes in fish to community-level disturbances .
Criteria set by program policy—The Canadian EEM example
In 1992, the Canadian Pulp and Paper Effluent Regulations were developed, and these included an EEM program. The EEM program is a cyclical monitoring study that provides information regarding the potential effects of effluent on fish populations, fish tissue, and benthic invertebrate communities. Nine decision endpoints for fish and macroinvertebrate communities currently are used in the EEM program. Although effects are determined based on consistent, statistically significant effects over two consecutive monitoring cycles, significant environmental impacts requiring further study currently are identified based on the exceedance of responses beyond a CES. Critical effect sizes initially were developed after the completion of cycle 1 monitoring by fish and macroinvertebrate expert working groups composed of both industry and government scientists [15,36]. Fish and macroinvertebrate working groups operated independently to develop the monitoring requirements, study designs, and CES for their respective taxonomic groups. Critical effect sizes currently are set for fish at a 25% difference in relative gonad and liver size and a 10% difference in condition, whereas macroinvertebrate population- and community-level endpoints are set at two standard deviations as derived from reference area data (Table 2).
For fish, new pulp mill EEM requirements were developed largely as a result of concerns about potential reproductive responses to effluent. As a consequence, changes in the gonadosomatic index were of primary concern, and CES initially were set at 25%, based on the magnitude of difference observed at pulp mills at Jackfish Bay (Canada)  and Norrsundet (Sweden) , where reproductive impacts were known to have occurred. Effectively, the target was set to define how often changes were seen that were as large as those that were commonly accepted to represent significant changes. Other endpoints developed for the EEM program, such as liver size and condition, were determined to be less variable than gonad size, and as a consequence, sample sizes for these endpoints were based on the power and sample size requirements to detect a statistical difference of 25% in the gonadosomatic index (actually, a range of 20–30%). These levels of change subsequently were reexamined after two more cycles of data collection, based on the distributions of data (discussed below).
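To make the fish CES concrete, the sketch below expresses an exposed-site mean as a percentage difference from the reference mean and compares its magnitude with the CES values stated above (25% for relative gonad and liver size, 10% for condition). The function names and the example means are hypothetical, and the check covers magnitude only; it does not reproduce the program's requirement for statistically significant effects over two consecutive cycles.

```python
# CES for fish endpoints as stated in the text, in percent of the reference mean
FISH_CES_PERCENT = {
    "relative gonad size": 25.0,
    "relative liver size": 25.0,
    "condition": 10.0,
}


def percent_difference(exposed_mean, reference_mean):
    """Difference of the exposed mean from the reference mean, in percent."""
    return 100.0 * (exposed_mean - reference_mean) / reference_mean


def exceeds_ces(endpoint, exposed_mean, reference_mean):
    """True if the magnitude of the difference is larger than the endpoint's CES."""
    diff = percent_difference(exposed_mean, reference_mean)
    return abs(diff) > FISH_CES_PERCENT[endpoint]


# Hypothetical site means (not program data)
print(exceeds_ces("relative gonad size", exposed_mean=2.8, reference_mean=4.0))  # True (about -30%)
print(exceeds_ces("condition", exposed_mean=1.05, reference_mean=1.00))          # False (about +5%)
```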
The present review did not find other examples of CES defined by developing baseline data, through stakeholder negotiation, or by legal mandate. If information from comparable studies, ecological theory, or legal mandates is insufficient and published numbers for CES are not available or not acceptable, it may be possible to develop CES using generic numbers or using numbers generated from within the monitoring program after it is operating (see later sections).
SOURCES UNRELATED TO MONITORING PROGRAM DATA
Several methods have been proposed for setting CES that could be based on a threshold that would trigger a regulatory response, a value based on a universal effect size, or a value based on detection thresholds.
Regulatory response threshold
It is possible to set a CES based on a regulatory threshold identified through a review of regulatory decisions that have been made relevant to the endpoint in question (e.g., how often a discharger has been fined for a specific outcome). Analysis would require reviewing the regulatory decisions that have been made as a consequence of environmental impacts, together with the relevant effect sizes, and then deciding how large a difference in endpoints has been associated with these decisions. One of the challenges is that the record of enforcement of environmental regulations is too sparse to generate a sufficiently large database for this purpose.
Universal effect size
Critical effect sizes could be established through the use of standard/universal effect sizes. Within the social sciences, specific magnitudes of effect size for small (0.10), medium (0.25), and large (0.40) were proposed by Cohen. The biological and ecological basis for their use in environmental monitoring remains unproven, however, and even Cohen acknowledged that these values were relative to the specific content and method applicable to a given research situation. Researchers from several fields of study have recommended against adopting CES based on this method [9,27] and have emphasized that no single value is applicable in all situations. Although Cohen's effect sizes have been used in the environmental monitoring literature, thus far they have been used only in a theoretical context focused on optimizing sample sizes in monitoring programs.
Detection thresholds
It is possible to choose a CES based on analytical detection limits, especially for chemically based monitoring endpoints. These have been used, including in Canada, for endpoints such as the presence of chlorinated dioxins and furans in effluents (e.g., dioxins must be nondetectable in effluents or <10 ppq, which was the detection limit at the time the regulation was developed). The limit could be defined based on practicality (e.g., a detection limit), experience (difference from normal that is able to be detected), or data collection requirements, but the difficulty is that the relationship to effects on receiving water biota is unclear. Critical effect sizes based on detection thresholds make sense only when the target is well defined; otherwise, a CES would have to be developed using data collected from similar monitoring programs when they are available.
DATA DISTRIBUTIONS AT REFERENCE SITES
Once a program is running, potential CES can be defined in a variety of ways based on data generated within a monitoring program, parallel research to generate the necessary background data for the program, or data from similar programs. The variability estimated could be derived from the ranges of data found across a range of natural variability within or among reference areas (temporal or spatial) or from the distributions of reference data. A variety of approaches can be used to define reference sites ; these include best professional judgment; using minimally disturbed or least-disturbed sites; following preset chemical, physical, or biological criteria; or interpreting historical conditions.
Range of natural variability among reference areas
Several researchers have cited the importance of using natural variability among reference sites within a monitoring program to identify the range of expected response levels and then subsequently defining CES as an observed value either outside of or at the extreme of this range. This approach would involve sampling multiple reference sites, potentially in multiple reference seasons, to assess natural variation in characteristics. Balk et al.  sampled Eurasian perch (Perca fluviatilis) at two reference sites for six years to document natural variability in whole-organism characteristics. Swedish scientists have a number of long-term reference databases to use in defining natural variability [38,41,42]. Few studies have defined natural variability, although some databases are available, including long-term studies on fish endpoints at pulp mills  (K.R. Munkittrick, unpublished data). It should be noted that natural variability among reference sites frequently is used to derive the equivalent of CES for invertebrate and fish monitoring studies using both multimetric approaches (see, e.g., [44,45]) and multivariate approaches (see, e.g., ). These kinds of approaches are described in more detail below.
Effectively, this approach defines the CES as values outside of those seen at reference sites. This approach has definite disadvantages for use in a monitoring program, including the fact that many monitoring endpoints change seasonally and are affected by habitat. As a consequence, a CES based on this method would have the added disadvantage of needing to be region and habitat specific, and a nationwide standard would not be possible. Dramatic differences can occur in reference levels within a species regionally and between lake and river environments [43,47] such that exceeding the normal range of variability would require about half the observed fish in an exposed site to actually have a condition that is outside the normal range of variability observed in reference sites. This is less acceptable than it is for more variable invertebrate community endpoints, such as indices of benthic community composition, in which use of the normal range generally is accepted.
Natural variability within reference areas
Critical effect sizes have been set for macroinvertebrate communities using a measure of natural variability within reference areas, such as two standard deviations calculated from the mean of the reference area data [15,37,48,49]. A general relationship exists between effect sizes expressed as percentage differences (from the reference mean) and as reference area standard deviation units (R.B. Lowell, unpublished analysis, Canadian pulp and paper EEM data). Fixed percentage CES for invertebrate community data (e.g., a −50% to +200% change relative to the reference mean) ultimately are based on measures of reference area variability but would vary widely in magnitude among different geographic areas. One of the challenges of a CES based on standard deviation is that the value would be free to vary between sites and studies and could lead sampling personnel to artificially inflate reference area variability and so reduce the potential for finding significant differences.
Basing CES on variability has the advantage of being endpoint specific, but the magnitude of impact determined to be important can vary between programs or cycles when converted to a fixed percentage (as in the multivariate approach described below). This could potentially be viewed as a means of implementing adaptive management (i.e., adapting CES to natural cycle-to-cycle variability in reference conditions).
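The relationship between a two-standard-deviation CES and its percentage equivalent can be sketched as follows. The reference values are hypothetical station data, the function names are illustrative, and the use of the sample standard deviation (n − 1 denominator) is an assumption rather than a documented program choice.

```python
from statistics import mean, stdev


def two_sd_ces(reference_values):
    """Summarize a CES of two reference-area standard deviations."""
    m = mean(reference_values)
    s = stdev(reference_values)  # sample SD (n - 1); an assumption
    return {
        "mean": m,
        "sd": s,
        "lower": m - 2 * s,
        "upper": m + 2 * s,
        "ces_percent": 100.0 * 2 * s / m,  # the same CES as % of the reference mean
    }


def sd_units(exposed_mean, reference_values):
    """Exposed-area mean expressed in reference-area SD units."""
    return (exposed_mean - mean(reference_values)) / stdev(reference_values)


# Hypothetical taxon richness at five reference stations
reference = [18, 22, 20, 25, 15]
summary = two_sd_ces(reference)
print(round(summary["ces_percent"], 1))          # percentage equivalent of 2 SD
print(abs(sd_units(10, reference)) > 2)          # does an exposed mean of 10 exceed the CES?
```

Because the percentage equivalent is 200·s/m, the same two-SD rule translates into different fixed percentages wherever the reference coefficient of variation differs, which is the site-to-site variation noted above.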
Distributions of reference data
The distribution of data from the reference sites could be used to set a CES based on ambient distributions, and these have been based on 5th or 25th percentiles of important indicators  or on the 10th percentile . Impairment also has been defined as a percentile of the reference data distribution, including assigning the 95th percentile as maximum impairment  or the 25th percentile of the least-impacted streams as minimum impairment . In addition, it is possible to set a target level as the maximum (or minimum) level found at reference sites by using a range associated with two standard deviations from the reference sites or by using the 95% confidence interval of the reference sites; this would represent a combination of the approaches, using variance to define ranges and natural variability to set the CES .
Based on these approaches, Meador et al.  judged a 20% change in biological condition as being degraded, although they caution that “a 20% change in biological condition should represent a reasonable threshold for statistical comparison and a biologically relevant response to disturbance [but] should not necessarily be considered a standard for regulatory purposes.”
A variation of the data distribution approach also has been advocated by the U.S. Environmental Protection Agency for nutrient and algal monitoring in their reference reach approach . Data from relatively undisturbed stream segments are used to identify the natural range of nutrient and algal indices, and potentially impacted streams are classified, based on their condition relative to the reference streams, as at reference, at risk, or impaired. The reference values can be selected based on best professional judgement, a percentile of the distribution (i.e., 75th), or a percentile of the streams thought to represent reference streams (i.e., 5th to 25th) .
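A percentile-based threshold of this kind can be derived directly from a set of reference values. The sketch below is illustrative only: the percentile helper uses simple linear interpolation (one of several common conventions), the data are hypothetical, and the assumption that larger values indicate poorer condition applies to endpoints such as nutrient concentrations but not to every index.

```python
def percentile(values, p):
    """Linear-interpolation percentile (p from 0 to 100) of a list of numbers."""
    xs = sorted(values)
    if not xs:
        raise ValueError("values must not be empty")
    k = (len(xs) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)


def exceeds_reference_percentile(value, reference_values, p=75.0):
    """True if value lies above the p-th percentile of the reference distribution."""
    return value > percentile(reference_values, p)


# Hypothetical total phosphorus (ug/L) at 11 reference streams
reference = [8, 10, 11, 12, 14, 15, 17, 20, 22, 26, 35]
print(percentile(reference, 75))                    # 21.0
print(exceeds_reference_percentile(24, reference))  # True
print(exceeds_reference_percentile(18, reference))  # False
```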
Data distributions across impacted and reference sites
If similar monitoring programs have been developed, or if the monitoring program is ongoing, data distributions across all sites can be used to generate a CES using a variety of procedures, including the distributions of impacted sites versus reference data, the distributions of comparisons relative to reference data, and comparisons of the distributions of the statistical differences between sites or the magnitude of the differences between sites (the pooled effect size approach described below).
The scoring approaches common in multimetric index development typically use the distributions of data from both reference and impacted sites. A multimetric index takes data from a variety of endpoints, calibrates the various endpoints (e.g., size, lesions, and percentage omnivorous species) individually against the distributions of data, scales its values, and obtains a unitless score, which is then aggregated with other scores to form a multimetric index (see, e.g., [55–57]). Total rankings can be summed, scaled to 100 based on the number of metrics used, or standardized to a scale of 0 to 1 through dividing by the sum of the reference sites. Ideal multimetric indices incorporate multiple levels of biological organization, address structure and function within the community, and incorporate broad sensitivities and ranges of habitat. Final metrics usually are selected from a large group of initial metrics based on criteria of power, consistency, uniqueness, or overlap with defined reference sites, and they are designed to be responsive to stressors and to exhibit low natural variability.
The relevant issue related to developing CES is the designation of the marks for the metrics and the use of the distributions of data across all sites. Sites usually are preclassified based on human judgment or stressor data, and scores are picked based on the distributions of data. The metrics can be assigned scores on a subjective scale (e.g., slight deviation or moderate) based on dividing the distributions of data by percentiles or quartiles or based on lines that divide the data into two or three groups [51,61] or that bisect interquartiles. Candidate scores can be distributed across a variety of rankings, ranging from 0 to 10, 1 to 5, 0 to 6, or 0, 10, or 20. The scales can be discontinuous (e.g., only scores of 0, 2, 4, and 6 are possible in Applegate et al., and only scores of 1, 3, and 5 in Karr) or continuous (as in Mayon et al.). In all cases, assigning scores still requires some assumption about what represents normal and what represents impairment for each system and each endpoint.
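The mechanics of a discontinuous 1, 3, 5 scoring scheme can be sketched in a few lines. Everything specific here is hypothetical: the metric names, the cut points (which in practice would come from percentiles or other divisions of the data distributions), and the function names. The example also shows why the cut points, not the arithmetic, carry the assumption about what counts as impairment.

```python
def score_metric(value, low_cut, high_cut, higher_is_better=True):
    """Score one metric as 5, 3, or 1 against two cut points (low_cut < high_cut)."""
    if not higher_is_better:
        # Mirror the scale so that small raw values receive high scores
        value, low_cut, high_cut = -value, -high_cut, -low_cut
    if value >= high_cut:
        return 5
    if value >= low_cut:
        return 3
    return 1


def multimetric_index(site_metrics, cut_points):
    """Sum the metric scores and rescale the total to 0-100."""
    total = sum(score_metric(site_metrics[name], *cut_points[name]) for name in cut_points)
    return total, 100.0 * total / (5 * len(cut_points))


# Hypothetical cut points: (low_cut, high_cut, higher_is_better)
cuts = {
    "species richness": (8, 15, True),
    "percent omnivores": (20, 45, False),
    "percent with lesions": (2, 5, False),
}
site = {"species richness": 12, "percent omnivores": 30, "percent with lesions": 1}
total, scaled = multimetric_index(site, cuts)
print(total, round(scaled, 1))  # 11 73.3
```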
A similar approach has been taken with the biological condition gradient as a model of biological response to increasing effects of stressors. This model encompasses the complete range, or gradient, of aquatic resource conditions from natural (e.g., undisturbed or minimally disturbed conditions) to severely altered conditions. The biological condition gradient uses changes in 10 ecological attributes and divides responses into six condition tiers, with tier 1 representing natural or undisturbed conditions and tier 6 representing severely altered conditions. The attributes vary from taxonomic composition and tolerance to nonnative taxa to organism condition and ecosystem function. Organism condition indicators include fecundity, morbidity, mortality, growth rates, and anomalies (lesions, tumors, and deformities). For individual-level indicators, tiers 1 and 2 represent background conditions; tier 3 represents infrequent changes; tier 4 indicates that the incidence of anomalies may be slightly higher than expected; tier 5 indicates that biomass may be reduced and anomalies are increasingly common; and tier 6 indicates that long-lived taxa may be absent, biomass reduced, anomalies common and serious, and reproduction minimal except in extremely tolerant groups. The tiers are combined into a weight-of-evidence approach, and levels of impairment are not defined for specific levels, resulting in limitations similar to those seen with the multimetric approaches.
Distribution of impacted site data relative to reference data
It is possible to develop a CES by examining the distribution of the data from impacted sites relative to the reference sites, similar to what is done within the sediment-quality triad approach . The triad involves a comparison of data for sediment chemistry, sediment toxicity testing, and benthic community indices, although it can be extended to include a wide variety of other lines of evidence . Each station is classified for contamination (based on the values of the mean effects range relative to contaminant levels [high, medium, or low]), toxicity (subjective scale from toxic to marginally toxic to nontoxic), and quality of the benthic assemblage (impaired, slightly impaired, or not impaired). Each metric can be scored as 5, 3, or 1, depending on whether it approximates, deviates slightly from, or deviates greatly from conditions at reference sites, respectively. The data also can be plotted graphically (on a scale of 0–1 based on the ratio to reference)  or scaled from 0 to 100 based on the ratio of the difference to that of the maximum difference from reference site . The data also can be ranked nonparametrically, compared by rank correlations or spatial correspondence [68,70], and subjected to a principal component analysis , cluster analyses, or descriptive discriminant analysis .
The sediment-quality triad does not use a CES approach per se; rather, it uses a difference relative to the maximum difference from the reference site to rank sites and so identify the most different sites and the degree of confidence in the relationship between altered communities, chemical contamination, and inherent toxicity. The final approach is not quantitative, requires the use of multiple stations, and is based on relative differences to the reference site, basically ranking the most different as the worst. As with the distributions of data across all sites, the relative distribution of impacted sites could be used to define CES by employing any of the types of distributions discussed in the multimetric approach.
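The relative ranking described above, in which each station's difference from reference is scaled to the largest difference observed, can be sketched as follows. The station names and values are hypothetical, and the function is a generic illustration of scaling to the maximum difference rather than a reproduction of any published triad calculation.

```python
def scale_to_max_difference(station_values, reference_value):
    """Scale each station's absolute difference from reference to 0-100,
    where 100 is the largest difference observed among the stations."""
    diffs = {name: abs(v - reference_value) for name, v in station_values.items()}
    max_diff = max(diffs.values())
    if max_diff == 0:
        return {name: 0.0 for name in diffs}
    return {name: 100.0 * d / max_diff for name, d in diffs.items()}


# Hypothetical values of one line of evidence at three stations; reference = 10
scaled = scale_to_max_difference({"A": 12, "B": 30, "C": 20}, reference_value=10)
ranking = sorted(scaled, key=scaled.get, reverse=True)
print(scaled)   # {'A': 10.0, 'B': 100.0, 'C': 50.0}
print(ranking)  # ['B', 'C', 'A'], most different first
```

The output depends entirely on which stations are included: adding a more extreme station lowers every other score, which is why the approach ranks sites but does not by itself yield a fixed CES.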
Distribution of relative comparisons between sites
Since the completion of cycles 2 and 3 of the pulp and paper program, the Canadian EEM approach has used the distributions of relative comparisons between sites. Depending on the endpoint and cycle, studies are conducted at 60 to more than 110 sites; each study compares reference and impacted sites (>95% of studies use a single reference and a single impacted site) and defines a magnitude of effect in that study (for males or females, targeting two species). The program prioritizes sites based on the distributions of differences between sites for the measured endpoints (Fig. 1). Using this structure, CES could then be based on defining relatively rare differences as representing potential areas of concern in which more information collection is warranted [72,73]. Currently, CES for fish are set at 25% for most endpoints. The distributions of responses also could be subjected to a distribution analysis using the 90th or 95th percentile of the differences to define the magnitude of a CES (Fig. 2); for gonad sizes, the 90th percentile difference averaged 33.9% over cycles 1 to 3 of the program (data not shown).
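A percentile-of-differences analysis of this kind might look like the sketch below. The list of absolute percentage differences is invented for illustration and does not reproduce program data; the function names are hypothetical, and the "inclusive" interpolation method of `statistics.quantiles` is one of several reasonable conventions.

```python
from statistics import quantiles


def percentile_ces(abs_percent_differences, pct=90):
    """Candidate CES: the pct-th percentile of absolute between-site differences."""
    cuts = quantiles(abs_percent_differences, n=100, method="inclusive")
    return cuts[pct - 1]  # cuts[0] is the 1st percentile


def fraction_exceeding(abs_percent_differences, ces=25.0):
    """Share of studies whose difference is larger than a given CES."""
    exceed = sum(d > ces for d in abs_percent_differences)
    return exceed / len(abs_percent_differences)


# Hypothetical absolute differences (%) between exposed and reference sites, one per study
differences = [2, 5, 8, 10, 12, 15, 18, 22, 27, 34, 48]
print(percentile_ces(differences, 90))        # 34.0
print(round(fraction_exceeding(differences), 2))  # share of studies above 25%
```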
This approach is similar in intent to the benchmark dose (BMD) approach advocated for determining human exposure limits to toxic compounds [27,74]. The BMD uses a mathematical dose-response curve fitted to a specified benchmark response, defined as the degree of change assumed to distinguish between an adverse and a nonadverse effect. The lower limit of the statistical confidence interval of the critical effect dose is called the BMD and has been proposed as an alternative to the use of no-effect levels to define a point of departure for reference dose/concentration calculations. The benchmark response often is guided by a threshold response in a dose-response curve. Threshold responses often are difficult to specify unambiguously in practice, and values often are left to best professional judgment and are open to interpretation. Conceptually, however, the approach is built with the purpose of preventing effects at any magnitude, and it requires the initial determination of a benchmark response for a chosen endpoint in response to an exposure variable. Regardless, the choice of the benchmark response typically is arbitrary, and it is common to use additional 10-fold safety factors to compensate for the uncertainty of responses within the human population and for extrapolating the results of animal exposures to humans.
Although this approach is not typically used to set a CES, it could be. The BMD could be set where a proportional increase occurs in the response of test subjects to an effect, along a dose-response curve [27,28]. An example of the approach can be seen in Figure 2 using the Canadian EEM fish data from the pulp and paper monitoring program. A significant change in the slope of the line would signify a biologically relevant change in the magnitude of the response and could be used to determine the benchmark response.
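One simple way to locate a change in slope is to split the ordered data into two segments and choose the split that minimizes the combined residual sum of squares of two straight-line fits. The sketch below does only that. It is an assumption chosen for illustration, not the procedure used to produce Figure 2, and it does not test whether the change in slope is statistically significant. All names are hypothetical, and each segment is assumed to contain at least two distinct x values.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept, sse)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def find_slope_break(xs, ys, min_points=3):
    """Two-segment split of (x, y) data that minimizes the total SSE.

    Returns the index of the first point of the right-hand segment,
    both slopes, and the response at that point, which is a candidate
    benchmark response under the assumptions stated above.
    """
    pairs = sorted(zip(xs, ys))
    sx = [p[0] for p in pairs]
    sy = [p[1] for p in pairs]
    best = None
    for k in range(min_points, len(sx) - min_points + 1):
        left = fit_line(sx[:k], sy[:k])
        right = fit_line(sx[k:], sy[k:])
        total_sse = left[2] + right[2]
        if best is None or total_sse < best["sse"]:
            best = {
                "index": k,
                "left_slope": left[0],
                "right_slope": right[0],
                "response_at_break": sy[k],
                "sse": total_sse,
            }
    if best is None:
        raise ValueError("not enough points for two segments")
    return best
```

In practice, the break would need to be supported by a formal comparison against a single-line fit before being treated as biologically relevant.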
Distribution of statistical results of comparisons between sites
Yeom and Adams developed an integrated index to evaluate effects of stressors over several levels of biological organization, ranging from the suborganism to the community level, using applied integrative star plot analysis. Accumulated values are plotted after being summed and divided by the total score at the reference site to give a number between 0 and 1. The area of the star plot is then calculated (see Beliaeff and Burgeot). The least-disturbed areas are assigned a value of 2.0, whereas the maximal impairment approaches 0. Sampling sites are evaluated according to three categories of health status, including acceptable (star plot area, >1.60), marginally impaired (star plot area, 1.20–1.60), and impaired (star plot area, <1.20).
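The star plot calculation can be sketched as follows. The area formula is an assumption: it sums the triangles formed by adjacent axes separated by equal angles, which is elementary geometry rather than a formula quoted from the cited work. With four axes each scored 1, it yields 2.0, which is consistent with the value given above for least-disturbed areas. The classification thresholds are those stated in the text; the function names are hypothetical.

```python
import math


def star_plot_area(scores):
    """Area of a star plot with equally spaced axes.

    scores: relative values (0 to 1), one per axis, in plotting order.
    Each pair of adjacent axes spans a triangle of area
    0.5 * a * b * sin(2 * pi / n).
    """
    n = len(scores)
    if n < 3:
        raise ValueError("a star plot needs at least three axes")
    angle = 2.0 * math.pi / n
    return sum(
        0.5 * scores[i] * scores[(i + 1) % n] * math.sin(angle)
        for i in range(n)
    )


def health_status(area):
    """Health category from the star plot area, using the text's cutoffs."""
    if area > 1.60:
        return "acceptable"
    if area >= 1.20:
        return "marginally impaired"
    return "impaired"
```

Because the area depends on products of adjacent scores, the result can change with the order in which the axes are arranged, so the ordering should be fixed before sites are compared.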
The approach uses statistical results to assign the scores for the metrics: Those metrics with no difference from reference conditions at p > 0.05 are assigned a score of 4, those with a difference from reference conditions at p < 0.05 receive a score of 3, those with a difference from reference conditions at p < 0.01 receive a score of 2, and those with a difference from reference conditions at p < 0.001 receive a score of 1. For other metrics, the 95th percentile of the data distribution is used to eliminate outliers, and remaining values are standardized as a percentage of the 95th percentile value to provide a range of scores: Values of 75% and higher receive a score of 4, whereas sites with values of 50 to 74% receive a score of 3, sites with values of 25 to 49% a score of 2, and sites with values less than 25% a score of 1. As with the multimetric and sediment-quality triad approaches described above, the distribution of the data could be used to define CES for specific endpoints, with the added advantage that it would be based on statistical differences.
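The two scoring rules can be written compactly as shown below. Two details are assumptions made for the sketch because the text does not specify them: a p value of exactly 0.05 is scored as 4, and values above the 95th percentile are capped at that percentile rather than discarded. The function names are hypothetical.

```python
def score_from_p(p_value):
    """Metric score (1 to 4) from the p value of a comparison with reference."""
    if p_value < 0.001:
        return 1
    if p_value < 0.01:
        return 2
    if p_value < 0.05:
        return 3
    return 4  # assumption: p >= 0.05 is treated as no difference


def score_from_percent(value, p95):
    """Metric score (1 to 4) from a value standardized to the 95th percentile.

    Assumption: values above the 95th percentile are capped at it.
    """
    percent = 100.0 * min(value, p95) / p95
    if percent >= 75:
        return 4
    if percent >= 50:
        return 3
    if percent >= 25:
        return 2
    return 1
```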
Tejerina-Garro et al. based their designations of candidate scores for a multimetric index on distributions of the Student's t values for comparisons with reference sites. In the Canadian EEM program, the 75th percentile p value for cycles 2 and 3 gonad size data was 0.009, and the 80th percentile p value was 0.001 (T. Barrett and K.R. Munkittrick, unpublished data).
Pooled effect size
Bailer et al. recommended an approach that converts the differences between sites based on the pooled standard deviation, similar to the one currently conducted for meta-analysis of Canadian EEM fish endpoint data [37,49]. The suggested standardized difference is used to compare the p values from the statistical testing at each site with the pooled effect size values [(mean_exp − mean_ref)/(pooled standard deviation)], but care should be taken to examine the potential influence of unequal variance between sites. The regression between the two values is used to find the effect sizes (in units of standard deviation) at which the regression curve crosses p = 0.05 and p = 0.10; these two intersections bound a range of important effect sizes. This pooled effect range could be used to set the CES and could be described by the range or the median of the range.
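The calculation can be sketched in a few lines. The pooled standard deviation below is the usual two-sample form. The choice to regress log10(p) linearly on the absolute effect size is an assumption made for the example, because the text does not state the form of the regression curve. All function names are hypothetical.

```python
import math


def pooled_sd(sd_exp, n_exp, sd_ref, n_ref):
    """Two-sample pooled standard deviation."""
    numerator = (n_exp - 1) * sd_exp ** 2 + (n_ref - 1) * sd_ref ** 2
    return math.sqrt(numerator / (n_exp + n_ref - 2))


def standardized_difference(mean_exp, mean_ref, sd_pooled):
    """Effect size in units of the pooled standard deviation."""
    return (mean_exp - mean_ref) / sd_pooled


def effect_size_at_p(effect_sizes, p_values, target_p):
    """Effect size at which a fitted line crosses target_p.

    Fits log10(p) = a + b * |effect size| by least squares across
    studies (assumed functional form) and solves for the effect size.
    """
    xs = [abs(e) for e in effect_sizes]
    ys = [math.log10(p) for p in p_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return (math.log10(target_p) - a) / b


def pooled_effect_range(effect_sizes, p_values):
    """Range of effect sizes bounded by the p = 0.10 and p = 0.05 crossings."""
    return (
        effect_size_at_p(effect_sizes, p_values, 0.10),
        effect_size_at_p(effect_sizes, p_values, 0.05),
    )
```

Because the pooled standard deviation assumes similar variances at the two sites, the caution in the text about unequal variances applies to this sketch as well.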
Considerable interest exists in using multivariate approaches to examine the data from monitoring studies within a study, between sites, and between programs, and it generally is recognized that multivariate techniques can be more sensitive than univariate techniques. Various multivariate approaches were reviewed for their relevance to setting CES, including multidimensional scaling and other ordination techniques to examine differences among reference sites, distributions and magnitudes of effect sizes, a combination of approaches, and the potential use of spatially based mapping.
Multivariate analysis of reference conditions
The reference condition approach involves testing an ecosystem exposed to a potential stressor against a reference condition that is little impaired [46,79,80]. It has been applied primarily to benthic communities (see below for fish-based applications) and has been applied in the United Kingdom, Canada, United States, and Australia (for reviews, see [46,81]). It usually requires a large number of reference sites (≫100) and can use the presence or absence of taxa or additive indices using multiple metrics [79,82]. Expert judgment or public workshops are used to identify regional (or ecoregional) reference sites, and a large number of community metrics (55 in Reynoldson et al.) are assessed by ordination to reduce redundancy. Environmental variables that can be affected by disturbance are removed from the analysis. The reference condition approach clusters communities based on similarity of community structure, and it correlates the biological data with environmental attributes to define an optimal set of environmental variables that can be used to predict group membership. Each test (exposure) station is assessed relative to the group to which it is predicted to belong to determine whether it is different, either by the range of variation observed at reference sites (two standard deviations) or by the use of ordination methods and determining whether the test site is within the 95% probability ellipse of the matched reference sites.
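Both decision rules mentioned above can be illustrated with a short sketch. The ellipse test assumes a two-dimensional ordination, approximately bivariate normal reference scores, and a reference set large enough that the estimated mean and covariance can be treated as known; under those assumptions the squared Mahalanobis distance is compared with the chi-square quantile for two degrees of freedom, which equals -2 ln(1 - probability). The function names are hypothetical.

```python
import math
from statistics import mean, stdev


def outside_two_sd(test_value, reference_values):
    """True if a test value lies more than two SD from the reference mean."""
    centre = mean(reference_values)
    spread = stdev(reference_values)
    return abs(test_value - centre) > 2.0 * spread


def inside_probability_ellipse(test_point, reference_points, prob=0.95):
    """True if a 2-D test point lies inside the reference probability ellipse."""
    xs = [p[0] for p in reference_points]
    ys = [p[1] for p in reference_points]
    n = len(reference_points)
    mean_x, mean_y = mean(xs), mean(ys)
    # Sample variances and covariance of the reference ordination scores.
    sxx = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    syy = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx = test_point[0] - mean_x
    dy = test_point[1] - mean_y
    # Squared Mahalanobis distance using the inverse of the 2 x 2 covariance.
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    threshold = -2.0 * math.log(1.0 - prob)  # chi-square quantile, 2 df
    return d2 <= threshold
```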
Tonn et al. have developed a reference condition approach for fish that uses canonical correspondence analysis to identify reference community characteristics for environmental variables at reference sites, followed by discriminant function analysis with cross-validation and best-model fits to determine whether disturbed sites had predicted communities.
In terms of defining CES, the effect size could be determined from the distributions of the reference site data once the analysis has been completed. Some limitations exist: Because sites are randomly selected and usually visited once to define reference status, it does not encompass natural temporal variability; it is assumed that the large number of sites sampled encompasses the variability issue. These approaches basically define sites that are furthest from the expected reference norms but differ from the other approaches in that they describe the difference from a range of variability between reference sites. These approaches are not unlike attempts to define the maximum differences and to rank the maximum differences as the most important.
Combination of multimetric and reference condition approaches
The European fish-based index combines the multimetric Index of Biotic Integrity approach with the reference condition approach, but it requires a large number of sites (>5,000 were used by Pont et al.). It depends on best professional judgment to classify the sites and to decide on the reference sites. The approach is semiquantitative and uses four measures of human impact (modification of morphology, hydrology, presence of toxic substances or acidification, and nutrient loading) that are ranked from 1 to 5 (no pressure to severe impact) and summed for a rating of 4 to 20. These rankings are re-ranked from 1 to 5 (based on groupings of four) to represent no impact to very heavily impacted. Fish community data were modeled statistically to define the reference sites representing the least disturbed using the relationship of functional ecological characteristics relative to 13 local and regional environmental variables, and metrics were developed based on the deviation of the site from 0 (i.e., the most different sites were rated as the worst). The final index sums all retained variables, rated on a scale of 0 to 10 each, and then recalculates the index to get a value between 0 and 1. These classifications can be analyzed to determine an effect size that will place a response outside of the reference data.
Relative magnitude of effect
It is possible to set CES based on the multivariate magnitude of effect relative to other facilities (e.g., other pulp mills) based on a multidimensional scaling that measures the multivariate distance a site lies from the zero-effect condition. The approach effectively would set a CES through identifying sites that show larger effects based on considering several response variables simultaneously. Figure 3 shows cycle 2 fish data for the Canadian pulp and paper EEM program plotted in a multidimensional scaling ordination. Points drawn to the bottom right of the plot represent sites where the collected fish had larger gonads, larger livers, and larger condition factor; sites drawn to the upper left represent sites where collected fish had smaller livers, smaller gonads, and smaller condition factor. The circle encompasses the 90% of sites closest to the origin of the plot. Under this approach, sites outside the circle would represent the 10% of mills with the worst impacts in the country. The magnitude of these differences could be used to define a CES for a subsequent cycle. This approach has a number of disadvantages. For example, a mill's location inside or outside the circle may vary depending on the endpoints included in the analysis, and the magnitude of impact determined to be important can vary from cycle to cycle and with which mills were incorporated in the analysis.
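Given ordination coordinates for each site, the circle and the sites outside it can be obtained as sketched below. The input layout (a mapping from site name to two ordination coordinates), the nearest-rank rule for the radius, and the function name are assumptions made for the example.

```python
import math


def sites_outside_circle(coords, percent=90):
    """Radius enclosing `percent` of sites, and the sites beyond it.

    coords maps a site name to its (x, y) ordination coordinates,
    with the zero-effect condition at the origin.
    """
    if not coords:
        raise ValueError("coords must not be empty")
    distances = {site: math.hypot(x, y) for site, (x, y) in coords.items()}
    ordered = sorted(distances.values())
    # Nearest-rank rule: the k-th smallest distance encloses `percent` of sites.
    k = max(1, math.ceil(len(ordered) * percent / 100))
    radius = ordered[k - 1]
    outside = sorted(site for site, d in distances.items() if d > radius)
    return radius, outside
```

As the text notes, both the radius and the list of flagged sites depend on which endpoints and which mills enter the ordination, so neither is stable from cycle to cycle.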
Spatially based modeling
A second European approach to examining impacts on fish uses spatially based modeling, which copes with the problem of natural variability by predicting reference conditions of distinct fish assemblage types. An individual method is developed for each fish assemblage type, and discriminant function analyses are used to model and predict status class membership of any given site. Metrics are weighted in the discriminant function according to their individual contribution to the overall pressure-index relationship. The disadvantage of discriminant function analysis is that because of its multidimensional nature, the contribution of individual metrics is hidden in the discriminant functions of the model.
A BAYESIAN PERSPECTIVE
The use of Bayesian statistics and decision theory has been increasing in ecology, conservation biology, and fisheries management since the mid-1990s (see, e.g., [86–88]). Use of Bayesian statistics and decision theory has been advocated to improve the quality of environmental monitoring programs by taking uncertainties into account quantitatively when evaluating management options [86,88,89]. In traditional frequentist inference, tests of significance are performed by supposing that a hypothesis is true (the null hypothesis) and then computing the probability of observing a statistic at least as extreme as the one actually observed during hypothetical future repeated trials. In contrast, Bayesian statistical inference requires the explicit assignment of prior probabilities, based on expert judgements elicited using existing and additional data, to the outcomes of experiments. The results of these experiments, regardless of sample size, then can be used to compute a posteriori probabilities of the hypotheses given the available data. In other words, frequentist statistics examine the probability of the data given a model (hypothesis), whereas Bayesian statistics examine the probability of a model given the data. Proponents of Bayesian statistics have argued that this approach makes better use of existing data, allows stronger conclusions to be drawn from large-scale experiments with few replicates, and is a more relevant approach to environmental decision making.
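The contrast between the probability of the data given a hypothesis and the probability of a hypothesis given the data can be made concrete with Bayes' rule for a discrete set of hypotheses. The sketch below is illustrative only; the hypothesis labels and numerical values in the usage note are invented for the example and do not come from any monitoring program.

```python
def posterior_probabilities(priors, likelihoods):
    """Posterior probability of each hypothesis by Bayes' rule.

    priors: hypothesis -> prior probability (should sum to 1).
    likelihoods: hypothesis -> probability of the observed data
    under that hypothesis.
    """
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())
    if total == 0:
        raise ValueError("the data are impossible under every hypothesis")
    return {h: value / total for h, value in joint.items()}
```

For instance, with equal prior probabilities for "effect" and "no effect" and data that are four times as likely under "effect" (likelihoods of 0.8 and 0.2), the posterior probability of "effect" is 0.8; a more skeptical prior would lower that value for the same data, which is the source of the subjectivity concern that arises with this approach.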
Although this approach is gaining momentum in the ecological and environmental literature, it has not been widely adopted, and the theoretical framework generally is not well understood by biologists or managers. An additional concern exists regarding subjectivity in subsequent analyses, particularly subjectivity arising from the assignment of prior probabilities. Proponents, however, argue that these concerns are largely superficial and originate from poor understanding of the concept. Although the Bayesian approach is conceptually a viable option for field-monitoring programs, the predominant method currently in practice uses traditional hypothesis-driven experimental design.
Despite the widespread appeals for using CES to guide environmental management decisions, they are rarely used in practice. This is particularly true for CES developed directly for fish endpoints, such as those used in the Canadian EEM program, in part because of the popularity of multimetric endpoints for fish assessments in the United States and Europe. Biologically relevant effect sizes should be defined a priori and, in applied studies, take into consideration the type and magnitude of change that is likely to be of concern [78,89], which requires that the sampling designs be understood in advance [90–92].
A number of potential methods for determining CES were identified and reviewed, and several of these approaches have the potential to be suitable for determining CES in a monitoring program like Canada's EEM program, based on existing EEM data collected in cycles 1 to 4. Furthermore, a number of attempts to define CES are relevant in terms of the justifications they have used and the values employed for decision making.
Values based on extreme effect sizes and on stakeholder negotiations were each identified as potential alternatives for determining CES. Numbers based on observed regulatory thresholds and published CES, however, do not appear to be viable options given the lack of information available at this time. Universal effect sizes also were determined to be unsuitable, both because single values often are not applicable across disciplines and because of the lack of biological and/or ecological basis in their assignment.
Several of the data-defined methods reviewed may be applicable for evaluating CES, and further effort should be placed on the subsequent evaluation of these approaches using available data, with additional credibility potentially achieved by obtaining agreement among various methods. These would include examinations of data distributions and the distributions of statistical differences as well as evaluations of applying the various benchmark approaches and thresholds to examine differences. Dose-response curves commonly used for determining CES-type effect sizes in medical science may be a potential option for determining CES; however, available data will have to be mined to determine whether a suitable dose-response curve could be developed. An additional advantage of data-defined methods is that they offer the potential for adaptive management, thus resetting the bar between monitoring cycles and allowing continuous movement toward improvement. Suitability would depend on the motivation for the monitoring program.
Numerous examples of alternative approaches were found that would be applicable to monitoring programs. Recommended approaches include basing CES on natural variability within and among comparable reference areas (e.g., two standard deviations), an approach commonly accepted for benthic community analyses, as well as deriving fixed percentage CES (e.g., 25%) for fish endpoints (Table 2). Further analyses of the existing data should be done to evaluate these alternative approaches for setting CES.
We would like to thank the researchers we corresponded with for their thoughts and suggestions regarding the use of CES in environmental monitoring, particularly Tony Underwood and three anonymous reviewers for their insightful comments. Tim Barrett is acknowledged for the statistical plots of the cycle 2 and 3 EEM data.