Abstract
Field data relating aquatic ecosystem responses with water quality constituents that are potential ecosystem stressors are being used increasingly in the United States in the derivation of water quality criteria to protect aquatic life. In light of this trend, there is a need for transparent quantitative methods to assess the performance of models that predict ecological conditions using a stressor–response relationship, a response variable threshold, and a stressor variable criterion. Analysis of receiver operating characteristics (ROC analysis) has a considerable history of successful use in medical diagnostic, industrial, and other fields for similarly structured decision problems, but its use for informing water quality management decisions involving risk-based environmental criteria is less common. In this article, ROC analysis is used to evaluate predictions of ecological response variable status for 3 water quality stressor–response data sets. Information on error rates is emphasized due in part to their common use in environmental studies to describe uncertainty. One data set comprises simulated data, and 2 involve field measurements described previously in the literature. These data sets are also analyzed using linear regression and conditional probability analysis for comparison. Results indicate that of the methods studied, ROC analysis provides the most comprehensive characterization of prediction error rates, including false positive, false negative, positive predictive, and negative predictive errors. This information may be used along with other data analysis procedures to set quality objectives for, and assess the predictive performance of, risk-based criteria to support water quality management decisions. Integr Environ Assess Manag 2012; 8: 674–684. © 2012 SETAC
INTRODUCTION
Increased attention is being given to the development of water quality benchmarks and criteria using weight of evidence and field data that express relationships between ecological responses of aquatic ecosystems and stressor variables that cause those responses (Paul and McDonald 2005; USEPA 2006a, 2010a, 2010b, 2011; Cormier et al. 2008; Hollister et al. 2008; Suter and Cormier 2008). Efforts by a number of state environmental agencies to develop numeric nutrient criteria have involved evaluations of relationships between nutrient concentrations and measures of ecological and biological responses to excess nutrients such as increases in algal growth. In addition, several documents published by the US Environmental Protection Agency (USEPA) are intended to address how field data on potential stressors and related responses may be evaluated to develop numeric water quality criteria (WQC) or benchmarks that support the attainment of designated uses as required by the Clean Water Act (CWA). Examples include the derivation of criteria for suspended and bedded sediment (USEPA 2006a), numeric nutrient criteria (USEPA 2010b), and benchmarks for specific conductance in central Appalachian streams (USEPA 2011).
Suter and Cormier (2008) provide a useful risk-based framework for the development and evaluation of environmental quality criteria derived from field data. The authors compared conventional risk assessment with a process termed “criterion assessment.” Whereas conventional risk assessment seeks to define human health or ecological risks associated with a range of exposures to 1 or more stressors, the criterion assessment process seeks to define the level of exposure needed to achieve a specific environmental goal, e.g., an ecosystem that is “healthy” with respect to 1 or more ecological attributes. The authors define 3 phases of criterion assessment: planning, analysis, and synthesis. Cormier et al. (2008) describe the criterion assessment process in greater detail and provide a hypothetical example using field data on stream macroinvertebrates and deposited and bedded sediments. The example is based on information in the USEPA Framework for Developing Suspended and Bedded Sediments Water Quality Criteria (USEPA 2006a).
In general, such criterion assessment is a multistep process that involves identifying numeric thresholds or ranges of 1 or more response variables that define attainment and nonattainment of designated uses, a stressor variable that causes changes in the response variables, and a model that describes the relationship between the stressor and response variables. The modeled relationship may then be used to identify levels of the stressor (i.e., a numeric criterion or benchmark) that minimize the likelihood of occurrence of the unwanted condition (for clarity and brevity throughout this article, the term “criterion” refers to either WQC or benchmarks for a stressor variable, and the term “threshold” refers to a response variable threshold).
Conceptually, the process is relatively straightforward. In practice, as other authors have described (Barbour et al. 2004; Cormier et al. 2008), several challenges may need to be addressed. For example, it may be difficult to identify 1 or more response variables and associated thresholds that appropriately define ecosystem management goals and reflect the attainment of designated uses. Ultimately, selection of responses and response thresholds may be based on a combination of policy choices and information from scientific analyses, including studies of the impacts of varying levels of ecosystem responses on designated use attainment. Next, 1 or more stressor variables that are causally related to the selected responses must be identified and the nature of their relationship to selected responses assessed. A valid causal analysis developed through a weight of evidence approach goes beyond statistical modeling and can help to demonstrate that relationships observed in field data are not simply associations with no direct causal links to a stressor variable. Without a causal connection, management of the stressor may yield no improvement in the targeted ecological response (USEPA SAB 2010). Despite these challenges, the examples cited previously indicate that decisions on appropriate response variables and thresholds are made as a part of contemporary environmental management.
An important additional challenge is to identify an appropriate value for the stressor variable, i.e., a numeric criterion, that supports attainment of the desired response condition. Contributing to this challenge is the uncertainty that exists in stressor–response relationships described using field data. Combined with a causal stressor–response relationship and an established response threshold, selection of a numeric stressor criterion completes a model framework that may be used to predict the status of the response. Uncertainty in the stressor–response relationship dictates that these predictions will also be uncertain and may make the appropriate choice of a numeric criterion unclear. Attention to the characterization of uncertainty and its effects on water quality predictions is important for improved water quality management (Borsuk et al. 2002; DiToro et al. 2005; Reckhow et al. 2005; Gronewold et al. 2008; Stevenson et al. 2008) and is the focus of this article for the model framework described above.
There are 2 important questions for WQC development and implementation based on field-derived stressor–response relationships. For a given water body, how accurately does nonattainment of a stressor criterion indicate a nonattaining response condition, and conversely, how accurately does attainment of a stressor criterion indicate attainment of a desired response condition? These questions may be thought of in the context of statistical decision errors in hypothesis testing (Barbour et al. 2004). Smith et al. (2001, 2003) address the probability of statistical decision errors when comparing measures of a single variable to a numeric criterion for that variable to assess violations of US water quality standards. Assuming a null hypothesis that a standard is being attained, a Type I (also called a false rejection or false positive) error is defined as a case where a site may be classified as nonattaining when in fact designated uses are attained. A Type II (a false acceptance or false negative) error is defined as a case where a site is classified as attaining when it truly is nonattaining.
Smith et al. (2001) state that the choice of acceptable error rates should be a risk management decision, and achieving a balance of these errors may be appropriate when agreement on acceptable false positive and false negative error rates is not possible. Furthermore, considering decision error rates quantitatively is important because of uncertainty that exists due to natural variation and measurement and sampling errors, and because policy determinations may allow occasional violations of a standard (Smith et al. 2001). Reasons for minimizing false positive errors include a need for wise use of limited regulatory agency and other stakeholder resources and the application of remedial activities to truly impaired sites so that water quality goals can be achieved effectively (Smith et al. 2001, 2003; Llanso et al. 2009; Paul and Munns 2011). Minimizing false negative errors is important to minimize water quality risks to aquatic life and human health. Characterizing and selecting appropriate levels of both error types is also a central goal in the development of project specific data quality objectives (DQO) as discussed in USEPA (2002, 2006a, b).
For predictive models involving field-based stressor–response relationships, response thresholds, and stressor criteria, the consideration of decision errors can be extended to inferences about the response using information about the stressor variable. The general diagnostic nature of this prediction problem exists in a number of other fields such as medicine, meteorology, and machine learning (Swets et al. 2000). Often, such diagnostic models are evaluated with a receiver operating characteristic (ROC) approach in which the status of an indicator variable (e.g., exceedance or nonexceedance of an indicator threshold) is used to predict the status of the primary variable of interest (e.g., the presence or absence of disease). However, ROC analysis seems to be used less commonly to evaluate model performance in water quality management (Hale and Heltshe 2008). Several examples from the peer-reviewed environmental literature are given in Table 1.
Table 1. Examples from peer-reviewed literature of ROC analysis used in environmental research and management

Author  Uses of ROC analysis
Benyi et al. 2009  Evaluating the extent of agreement between 2 benthic macroinvertebrate indices and associations between an index and environmental metrics 
Efstratiou et al. 2009  Comparing bacterial indicators to predict the presence of Salmonella sp. in sewage-polluted marine waters using different indicator thresholds 
Hale and Heltshe 2008  Developing a benthic index for nearshore waters in the Gulf of Maine 
Hale et al. 2004  Comparing logistic regression models developed to estimate the probability of degraded benthic conditions 
Long et al. 2011  Studying factors affecting the occurrence of terrestrial carnivores within a Vermont study area 
Mason and Graham 1999  Evaluating the quality of a meteorological forecast system 
McLellan et al. 2008  Characterizing the predictive ability of competing candidate regression models used to predict the probability of return by anglers in a coastal rainbow trout fishery 
Morrison et al. 2003  Evaluating the ability of indicator variables to correctly classify water as suitable or unsuitable for swimming by comparing the mean density of Enterococcus sp. with a threshold used to protect public health 
Murtaugh 1996  Evaluating ecological indicators to identify useful surrogates or indicators for ecological response variables 
Murtaugh and Pooler 2006  Studying lake condition indicators in the northeastern United States 
Nevers and Whitman 2011  Comparing measured and predicted Escherichia coli concentrations relative to a human health standard used to decide whether beaches should be closed to swimming 
Shine et al. 2003  Comparing percent mortality in sediment bioassays with toxicity predicted from the ratio of SEM–AVS 
The basis for ROC analysis is commonly a 2 × 2 contingency table (also called a “confusion” or error matrix) representing 2 states of actual condition (e.g., a reference group and a diseased group) and 2 states of the predicted condition using results from a diagnostic test involving an indicator variable (see figure 1 in Fawcett 2006 and Table 2). The true condition is represented by 1 of 2 states, i.e., either the condition is present or it is absent. Likewise, the prediction is that the condition is either present or absent.
Table 2. Performance metrics (after Linnet 1988) and example calculations for a 2 × 2 contingency table (i.e., error matrix) for ROC analysis, with quadrant counts and ROC terms calculated for a hypothetical situation involving uncorrelated stressor and response variables, total n = 1001 data pairs, and Y_{thr} and X_{c} set at the median values for each variable.

                           Indicator (stressor)
Actual (response)          Attaining          Nonattaining       Total count
Nonattaining               n(FN) = 250        n(TP) = 251        250 + 251 = 501
Attaining                  n(TN) = 252        n(FP) = 248        252 + 248 = 500
Total count                250 + 252 = 502    251 + 248 = 499    1001
Prevalence = (n(TP) + n(FN))/(n(TP) + n(FP) + n(FN) + n(TN)) = 501/1001 ≈ 0.5 
Nonerror rates: 
Sp = n(TN)/[n(TN) + n(FP)] = 252/500 ≈ 0.5. 
Se = n(TP)/[n(FN) + n(TP)] = 251/501 ≈ 0.5. 
PPV = n(TP)/[n(TP) + n(FP)] = 251/499 ≈ 0.5. 
NPV = n(TN)/[n(TN) + n(FN)] = 252/502 ≈ 0.5. 
Accuracy = ½(Sp + Se); here this equals (n(TP) + n(TN))/(n(TP) + n(FP) + n(FN) + n(TN)) = (251 + 252)/1001 ≈ 0.5, because the 2 expressions coincide when prevalence is 0.5. 
Error rates: 
FPE = n(FP)/[n(TN) + n(FP)] = 1 − Sp = 248/500 ≈ 0.5. 
FNE = n(FN)/[n(FN) + n(TP)] = 1 − Se = 250/501 ≈ 0.5. 
PPE = n(FP)/[n(FP) + n(TP)] = 1 − PPV = 248/499 ≈ 0.5. 
NPE = n(FN)/[n(FN) + n(TN)] = 1 − NPV = 250/502 ≈ 0.5. 
Group classifications may be based on categorical data or continuous data in which category membership is determined using previously established thresholds and/or criteria (Linnet 1988, Murtaugh 1996). In ROC analysis, counts from the 2 × 2 error matrix can then be used to derive several metrics of the predictive performance of the overall prediction model, including estimation of error rates and their complementary nonerror rates (Table 2). Error rates include false positive error (FPE), false negative error (FNE), positive predictive error (PPE), and negative predictive error (NPE). As shown in Table 2, FPE represents the proportion of all observations actually attaining the desired response condition that are indicated as nonattaining, whereas PPE represents the proportion of all observations that are indicated as nonattaining that actually attain the desired response. FNE represents the proportion of all observations that are actually not attaining the response but are indicated as attaining, whereas NPE represents the proportion of all observations that are indicated as attaining that actually do not attain the desired response. PPE and NPE may be most relevant when new information is available only for the stressor and/or indicator, and inferences about the likelihood of observing one or the other actual response condition are desired.
Nonerror rates include specificity (Sp), sensitivity (Se), positive predictive value (PPV), and negative predictive value (NPV). Sp represents the proportion of true negatives among all cases in which the desired response is actually attained. Se represents the proportion of true positives among all cases in which the desired response is not attained; thus, high sensitivity indicates good model performance for identifying truly nonattaining cases. PPV and NPV are, respectively, the proportion of true positives among all observations that are indicated as nonattaining, and the proportion of true negatives among all observations that are indicated as attaining the desired response. In addition, the overall accuracy of the predictive model may be estimated as ½ (Sp + Se). Finally, prevalence estimates the rate of the true nonattaining condition in the population.
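These definitions can be collected into a short computation. The sketch below, in Python rather than the pROC software used later in this article, reproduces the metrics for the uncorrelated reference example in Table 2; the function name and dictionary layout are illustrative choices, not part of any cited tool.

```python
def roc_metrics(tp, fp, fn, tn):
    """ROC contingency-table metrics, following the definitions in Table 2."""
    total = tp + fp + fn + tn
    sp = tn / (tn + fp)    # specificity: true negatives among actually attaining
    se = tp / (tp + fn)    # sensitivity: true positives among actually nonattaining
    ppv = tp / (tp + fp)   # positive predictive value
    npv = tn / (tn + fn)   # negative predictive value
    return {
        "prevalence": (tp + fn) / total,
        "Sp": sp, "Se": se, "PPV": ppv, "NPV": npv,
        "FPE": 1 - sp, "FNE": 1 - se, "PPE": 1 - ppv, "NPE": 1 - npv,
        "accuracy": 0.5 * (sp + se),
    }

# Quadrant counts from the uncorrelated reference example in Table 2;
# every metric comes out at approximately 0.5.
m = roc_metrics(tp=251, fp=248, fn=250, tn=252)
```

For 2 perfectly correlated variables the same function returns nonerror rates of 1 and error rates of 0, as noted for Table 2.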
Literature on ROC analysis often emphasizes the generation and interpretation of the ROC curve, in which Se is plotted against 1 − Sp as a function of a range of possible cutoff or criterion values, X_{c}, for the variable used to predict the presence or absence of the actual condition. An alternative to this presentation of ROC data is described by Linnet (1988) in which both Se and Sp are plotted against X_{c}. Using the equations in Table 2, it can be shown that FNE is equal to 1 − Se and FPE is equal to 1 − Sp. Thus, FPE and FNE also can be easily plotted as a function of X_{c}. For WQC derivation, this approach provides a useful way to characterize the influence of choices of X_{c} directly on decision error rates associated with classification predictions derived from possible stressor criteria.
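This presentation amounts to sweeping candidate criteria and recomputing the error matrix at each one. A minimal Python sketch of the idea, assuming higher values of both variables indicate nonattainment; the function name and toy data are hypothetical:

```python
def error_rates_vs_criterion(x, y, y_thr, candidates):
    """FPE and FNE as functions of candidate stressor criteria X_c.
    Assumes yi > y_thr marks the nonattaining response condition and
    xi > X_c is a prediction of nonattainment (a "positive")."""
    actual = [yi > y_thr for yi in y]
    rates = {}
    for xc in candidates:
        pred = [xi > xc for xi in x]
        tp = sum(p and a for p, a in zip(pred, actual))
        fp = sum(p and not a for p, a in zip(pred, actual))
        fn = sum(not p and a for p, a in zip(pred, actual))
        tn = sum(not p and not a for p, a in zip(pred, actual))
        fpe = fp / (tn + fp) if (tn + fp) else float("nan")  # 1 - Sp
        fne = fn / (tp + fn) if (tp + fn) else float("nan")  # 1 - Se
        rates[xc] = (fpe, fne)
    return rates

# Toy, perfectly correlated data: both error rates drop to 0 at X_c = 3.5
r = error_rates_vs_criterion(
    [1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60],
    y_thr=35, candidates=[2.5, 3.5, 4.5])
```

Plotting the resulting FPE and FNE values against X_{c} yields the Linnet-style display described above.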
This study explores the use of decision error rates estimated from ROC analysis to inform the criteria assessment process described by Suter and Cormier (2008) by providing metrics of predictive performance for models based on stressor–response data. The method is applied to 1 simulated data set and 2 data sets from published literature, along with associated response thresholds for each. It is assumed that the stressor variables are causally related to responses to focus the study on methods for quantitative evaluation of this type of criterion assessment model. Error rate estimates, rather than nonerror rates or other ROC results, are emphasized because they are often used in environmental research and management to characterize and control uncertainty (Smith et al. 2001; USEPA 2006a) and may be more commonly understood within the environmental community than Se and Sp.
The results from ROC analysis are compared to those obtained from 2 other approaches for evaluating stressor–response data: linear regression and conditional probability analysis (CPA). The simulated data set provides a hypothetical example of a simple linear relationship with relatively low variability that is useful for illustrating typical output from all 3 procedures. The published data sets are chlorophyll a (chl a) and total P (tp) concentration measurements used as the basis for proposed numeric nutrient criteria for Florida colored lakes (USEPA 2010b), and Ephemeroptera/Plecoptera/Trichoptera (EPT) taxa richness and percent sediment fine material (percent fines) data published by Paul and McDonald (2005) and Hollister et al. (2008).
METHODS
Data Set 1 was generated with Microsoft Excel using a simple linear regression model to yield simulated response data from a set of randomly generated stressor variable observations. The stressor variable values were generated from a normal distribution. The slope, intercept, and variance used to generate response variable values were selected to yield a statistically significant positive relationship with a relatively high degree of correlation. The median value of the simulated response variable observations is used as the response variable threshold, Y_{thr}. This establishes a prevalence of 50% for the purpose of this study. Values greater than Y_{thr} are defined as representing a nonattaining response condition.
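A simulation of this kind can be sketched in a few lines of Python (in place of the Excel procedure used here). The slope, intercept, variance, and sample size below are hypothetical stand-ins, since the values actually used for Data Set 1 are not restated in this article:

```python
import random
import statistics

random.seed(42)  # for reproducibility
n = 200

# Hypothetical parameters chosen to give a strong positive relationship;
# the actual Data Set 1 parameters are not reported here.
intercept, slope, sigma = 1.0, 2.0, 1.5

x = [random.gauss(10.0, 2.0) for _ in range(n)]                 # stressor values
y = [intercept + slope * xi + random.gauss(0.0, sigma) for xi in x]  # responses

y_thr = statistics.median(y)              # response threshold at the median
nonattaining = [yi > y_thr for yi in y]   # values above Y_thr are nonattaining
prevalence = sum(nonattaining) / n        # 0.5 by construction
```

Setting Y_{thr} at the median of a continuous response guarantees that exactly half of the simulated observations are classified as nonattaining, matching the 50% prevalence described above.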
Data Set 2 consists of data on annual geometric mean chlorophyll a (chl a), a response variable, and annual geometric mean total P concentrations (abbreviated “tp” to avoid confusion with the abbreviation for true positives, TP, used in Table 2), as a stressor variable. These data were used to derive proposed WQC for Florida colored lakes (USEPA 2010b) and were previously evaluated by McLaughlin (2012). The previously selected annual geometric mean chl a concentration of 20 µg/L was used as Y_{thr}, with higher values indicating nonattainment of designated uses. The proposed baseline and modified tp criteria, 0.05 mg/L and 0.157 mg/L tp, respectively, were evaluated among other possible X_{c} values.
Data Set 3 consists of paired observations of EPT taxa richness and percent fine grain sediment in bottom substrate (percent fines). These data were previously published by Paul and McDonald (2005) and Hollister et al. (2008) to evaluate the use of CPA and to illustrate applications of the CProb software to conduct CPA. The EPT taxa richness is negatively correlated with percent fines (high EPT taxa richness and low percent fines representing higher quality conditions). EPT taxa richness less than 9 is used to indicate nonattaining conditions, consistent with the previous publications.
Summary statistics for all 3 pairs of stressor and/or indicator and response variables, including means, standard deviations, medians, minimums, and maximums were calculated using Minitab®, Version 16. Minitab also was used for correlation and regression analyses. The strength of correlation for all 3 relationships is compared using the nonparametric Spearman's rank correlation coefficient. The relationships in Data Sets 1 and 2 were analyzed using linear regression (the logarithms of chl a and tp were used for regression analysis of Data Set 2 following USEPA 2010b). Linear regression models are characterized using the slope and intercept of the regression line, the coefficient of determination (R^{2}), residual plots, and the statistical significance of the regression line, slope, and intercept parameters. Upper and lower 50% prediction limits were evaluated, consistent with the approach used for the proposed USEPA Florida lakes criteria. No linear regression model was previously described for the EPT taxa richness data by Paul and McDonald (2005) or Hollister et al. (2008), nor is one developed here. Instead, the nature of the stressor–response relationship is characterized using Spearman's ρ and locally weighted scatterplot smoothing (LOWESS). The CProb procedure referenced in Hollister et al. (2008), developed for use within the R computing environment, was used to derive all CPA results.
For ROC analysis, the pROC software (Robin et al. 2011) was used to calculate prevalence, error rates, nonerror rates, and accuracy. In addition to providing definitions of these terms, Table 2 contains a set of example values for each box of the error matrix. The example values were selected to illustrate the calculation of each term, and to provide a reference example for comparison with ROC results obtained from each data set. Because the values in each box are nearly equal, the example represents a case in which the stressor and response variables are effectively uncorrelated with the response threshold and stressor criterion set at their respective medians. In this case, all terms defined in Table 2 have calculated values of approximately 0.5. This table also can be used to show that 2 perfectly correlated variables would have nonerror rates equal to 1 and error rates equal to 0. Se and Sp for all 3 data sets including the 95% confidence interval from 2000 bootstrap resampling events, were obtained using the “ci.thresholds(rocobj)” command in pROC. Se and Sp (i.e., their median estimates at each X_{c} value) were used along with the formulas in Table 2 and raw counts for the 2 × 2 matrix to estimate median rates of all 4 error types, FPE, FNE, PPE, and NPE. Results are compared with the hypothetical uncorrelated reference example provided in Table 2.
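The bootstrap confidence intervals reported by pROC's ci.thresholds() can be approximated with a percentile bootstrap. The simplified Python sketch below is not the pROC implementation; it assumes both variables are oriented so that higher values indicate nonattainment, and the function name is illustrative:

```python
import random

def bootstrap_se_sp(x, y, y_thr, xc, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CIs for Se and Sp at a fixed criterion X_c.
    A sketch of the kind of interval ci.thresholds() reports; assumes
    xi > xc predicts, and yi > y_thr defines, nonattainment."""
    rng = random.Random(seed)
    pairs = list(zip(x, y))
    se_vals, sp_vals = [], []
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]  # resample with replacement
        tp = sum(xi > xc and yi > y_thr for xi, yi in sample)
        fn = sum(xi <= xc and yi > y_thr for xi, yi in sample)
        tn = sum(xi <= xc and yi <= y_thr for xi, yi in sample)
        fp = sum(xi > xc and yi <= y_thr for xi, yi in sample)
        if tp + fn:
            se_vals.append(tp / (tp + fn))
        if tn + fp:
            sp_vals.append(tn / (tn + fp))
    def ci(vals):
        vals = sorted(vals)
        lo = vals[int(alpha / 2 * len(vals))]
        hi = vals[int((1 - alpha / 2) * len(vals)) - 1]
        return lo, hi
    return ci(se_vals), ci(sp_vals)
```

For perfectly separated data the intervals collapse to (1, 1); for noisier stressor–response relationships they widen, conveying the sampling uncertainty in Se and Sp at each candidate X_{c}.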
DISCUSSION
The objective of this study is to evaluate the use of ROC analysis to inform the criterion assessment process described by Suter and Cormier (2008) and Cormier et al. (2008). Many factors, both scientific and policy driven, may contribute to the choices of response variables, response thresholds, stressor variables, and stressor criteria that may comprise a predictive model of the type discussed in this article. In addition, multiple models derived from other sets of stressors and responses may be needed to adequately address broad ecosystem management goals. Nonetheless, an important consideration in selecting numeric criteria may be the tolerable level of uncertainty in predictions of designated use attainment status using such models. This study shows that ROC analysis can be used to estimate FPE, FNE, PPE, and NPE, as a function of X_{c} (Figure 1C, F, I). These estimates quantify, in terms of several types of prediction error rates, the potential consequences of selecting various candidate criteria.
Applying ROC analysis, CPA, and linear regression approaches to 3 different data sets illustrates differences among the types of uncertainty characterizations possible using each technique and shows how those characterizations may change as a function of specific data set attributes. Regression models can be used to describe the nature and extent of the relationship between stressor and response variables, yielding information on the form of the relationship (linear, curvilinear, nonlinear) and goodness of fit (Draper and Smith 1998). Where a valid regression model can be developed, prediction limits can provide estimates of the proportion of nonattaining responses at specific stressor levels, i.e., at points where prediction limits intersect a response threshold (USEPA 2010b). However, no guidelines currently exist for selecting appropriate prediction limits (e.g., 50%, 80%, or some other percentage) to inform water quality management decisions. Furthermore, as illustrated in this study, prediction limits do not easily yield comprehensive information on the performance of a criterionbased water quality prediction model in terms of the probability of misclassifying the attainment status of surface waters. As shown here, CPA can partially meet this latter objective; however, compared to ROC analysis, CPA provides a limited assessment of error and nonerror rates for a given prediction model as described further below.
As illustrated by the example in Table 2, when 2 variables are uncorrelated and the response variable threshold yields a prevalence of a nonattaining condition equal to 50%, error and nonerror rates from ROC analysis are expected to be 0.5 (equal to the prevalence) because the indicator/stressor variable provides no information on the level of the response. As shown by the strongly correlated linear relationship between the variables in Data Set 1 where the prevalence is also 50%, much higher nonerror rates and much lower error rates may be achieved through careful selection of X_{c}. This example also shows that choosing X_{c} based on the intersection of regression prediction limits with response thresholds can yield relatively high rates of certain decision error types depending on the strength of the stressor–response relationship. The uncorrelated reference example and the example provided by Data Set 1 show that by combining information from regression and ROC analyses, the nature and extent of the stressor–response relationship, as well as the accuracy of response condition predictions based on selected stressor criteria, can be described.
The analysis of Data Set 3, the EPT taxa richness and percent sediment fines data of Hollister et al. (2008), provides a comparison of CPA and ROC analysis using field data where the stressor and response variables are negatively correlated to a moderate extent. This example is also useful because the data are not as easily modeled using linear regression as Data Set 1, highlighting the value of the nonparametric nature of both CPA and ROC analysis. As shown by the definitions provided in Table 2, the CPA plot is equivalent to a plot of PPV as a function of X_{c}. Thus, CPA also can be used to obtain 1 − PPV = PPE. CPA does not provide an estimate of either FPE or FNE rates, however. In contrast, ROC analysis readily shows that at X_{c} = 15, 22% of all attaining waters would be incorrectly classified as nonattaining (FPE = 0.22), and 26% of all nonattaining waters would be incorrectly classified as attaining (FNE = 0.26). In addition, 30% of waters having greater than 15% fines (therefore indicating nonattainment) would be attaining (PPE = 0.3) and 16% of waters with less than 15% fines (therefore indicating attainment) would actually be nonattaining (NPE = 0.16). In addition, ROC analysis provides information on the accuracy of the prediction model for a selected X_{c}. For Data Set 3, the highest overall accuracy is estimated to be just below 80% at X_{c} = 12, and is slightly less (∼75%) at X_{c} = 15. These error and nonerror rates may or may not be acceptable to water quality managers and stakeholders; however, the salient point is that ROC analysis provides more complete information than CPA on the type and magnitude of errors associated with predictions of response variable condition based on exceedances of a stressor variable criterion.
ROC analysis of Data Set 3 also shows that the lowest balanced FPE and FNE rate achievable at any single X_{c} (i.e., the balance point) is 0.24, which occurs at X_{c} = 13.6% fines. This value is similar to, though slightly less than, the X_{c} = 15 determined by Paul and McDonald (2005). This suggests that where it is a management goal to balance FPE and FNE rates using a single criterion X_{c}, ROC analysis can be used to identify the appropriate value. Comparing error rate results from Data Sets 1 and 3 suggests that it may not be possible to balance all 4 error rate types with a single criterion except in the most linear stressor–response relationships. Furthermore, the magnitude of the balance point is likely to reflect the amount of variation in the stressor–response relationship. Data Set 1 has the lowest balance point at 0.14 and the highest degree of correlation among all 3 data sets.
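A balance point of this kind can be located by scanning candidate criteria for the smallest gap between FPE and FNE. A Python sketch of the search, again assuming higher values of both variables indicate nonattainment (for EPT richness, which decreases with impairment, the response test would be reversed); the function and toy data are hypothetical:

```python
def balance_point(x, y, y_thr, candidates):
    """Return (X_c, FPE, FNE) for the candidate criterion where FPE and
    FNE are most nearly equal. Assumes xi > X_c predicts, and
    yi > y_thr defines, nonattainment."""
    actual = [yi > y_thr for yi in y]
    best = None
    for xc in candidates:
        pred = [xi > xc for xi in x]
        tp = sum(p and a for p, a in zip(pred, actual))
        fp = sum(p and not a for p, a in zip(pred, actual))
        fn = sum(not p and a for p, a in zip(pred, actual))
        tn = sum(not p and not a for p, a in zip(pred, actual))
        if not (tn + fp) or not (tp + fn):
            continue  # skip degenerate criteria with an undefined rate
        fpe, fne = fp / (tn + fp), fn / (tp + fn)
        if best is None or abs(fpe - fne) < abs(best[1] - best[2]):
            best = (xc, fpe, fne)
    return best

# Toy data with one discordant pair: the balance falls at X_c = 3.5
bp = balance_point([1, 2, 3, 4, 5, 6], [1, 2, 4, 3, 5, 6],
                   y_thr=3.5, candidates=[2.5, 3.5, 4.5])
```

The magnitude of FPE and FNE at the balance point then summarizes, in a single number, how much residual scatter the stressor–response relationship carries.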
Using data shown in Figure 1I, the errors associated with other X_{c} values may also be easily determined. X_{c} values could be selected based on preferred FPE and FNE rates. For example, to reduce FNE to 10%, the corresponding X_{c} is 6% fines. This lower value reflects a preference for minimizing false negative errors, and could be considered more protective of high EPT taxa richness scores. However, at a percent fines value of 6%, the FPE rate is estimated to be nearly 50%. That is, nearly half of “healthy” cases (EPT ≥ 9) can be expected to have percent fines greater than 6%. This proportion is roughly the same as the FPE rate for the uncorrelated reference example given in Table 2. Conversely, choosing an FPE rate of 0.1 yields X_{c} = 26% fines. However, here the FNE rate increases to more than 60%. These results illustrate that reliance on a single stressor variable criterion may not provide adequate control of errors in predicting the condition of a response variable.
ROC analysis of the Florida colored lakes data (Data Set 2) illustrates the error implications of selecting a single X_{c} criterion based on 50% prediction limits in a current regulatory application. In the proposed and final criteria developed by USEPA, the baseline tp criterion of 0.05 mg/L, established at the intersection of Y_{thr} with the upper 50% prediction limit of the regression model, is the applicable criterion for an individual lake if sufficient chl a data are not available. In this case, exceedance of 0.05 mg/L tp may be used to list a lake as not meeting water quality standards based on nonattainment of the chl a threshold. ROC results show that although this tp criterion limits FNE and NPE rates to less than 0.05, PPE is greater than 30% and FPE is nearly 40%. Thus, more than 30% of colored lakes exceeding the tp criterion would actually attain the chl a criterion (PPE), and nearly 40% of colored lakes that actually attain the chl a criterion would be declared nonattaining by the baseline tp criterion (FPE).
High misclassification rates can have negative consequences for water quality management. Although minimizing FNE and NPE is clearly an important environmental management objective, minimizing FPE is also relevant to maximize effective use of limited environmental management resources. If sufficient chl a data are available, as defined in the regulation, a tp criterion that is higher than the baseline criterion is allowed up to the modified tp criterion. Thus, the regulation appears structured in a way that can make use of direct measurements of the response, in this case chl a, to limit potentially high FPE and PPE rates when a single tp concentration criterion is established that minimizes FNE and NPE rates. ROC analysis provides a means to estimate the associated reduction in prediction errors.
These examples show that although many factors may affect the selection of an indicator or stressor criterion, ROC analysis can be used with stressor–response data to provide important information about potential decision error rates to decision makers and stakeholders. This information may be useful in all 3 phases of criterion assessment described by Suter and Cormier (2008) and Cormier et al. (2008), i.e., planning, analysis, and synthesis. In the planning phase, ROC-derived error rate results could be used to evaluate available data and preliminary response thresholds against predefined tolerable limits on decision errors. Results could also guide additional studies designed to reduce uncertainty in the stressor–response relationship. In the analysis phase, in which goals include modeling the stressor–response relationship and identifying an appropriate response threshold or “benchmark effect” (Cormier et al. 2008), ROC error rates could be used to quantify uncertainties associated with alternative responses and thresholds. In the synthesis phase, in which evaluation of candidate stressor criteria is the primary goal, ROC error rates characterize the uncertainties associated with specific criteria selections given the model and response thresholds established in the analysis phase. Thus, decision error rate estimates obtained using ROC analysis could support a criterion assessment process in which all 3 phases are addressed in an iterative manner, from preliminary to final stages of criterion selection. Other aspects of ROC analysis not emphasized in this article, such as characterizations of nonerror rates and analysis of the ROC curve, could also contribute useful information to criterion assessment.
Paul and McDonald (2005) list 5 necessary conditions for the appropriate application of CPA, and these also may apply to ROC analysis: 1) a probability-based sampling design, 2) a metric that quantifies the pollutant, 3) a response metric that responds to the pollutant at present (observed) levels, 4) known characteristics of an impacted response, and 5) a pollution parameter that can exert a “strong effect” on the response metric. When these conditions exist as part of a predictive model relating a causal variable to attainment status of a response variable, ROC analysis can provide a useful characterization of the reliability of such predictions.