Hazard quotients based on a point-estimate comparison of exposure to a toxicity reference value (TRV) are commonly used to characterize risks for wildlife. Quotients may be appropriate for screening-level assessments but should be avoided in detailed assessments, because they provide little insight regarding the likely magnitude of effects and associated uncertainty. To better characterize risks to wildlife and support more informed decision making, practitioners should make full use of available dose–response data. First, relevant studies should be compiled and data extracted. Data extractions are not trivial—practitioners must evaluate the potential use of each study or its components, extract numerous variables, and in some cases, calculate variables of interest. Second, plots should be used to thoroughly explore the data, especially in the range of doses relevant to a given risk assessment. Plots should be used to understand variation in dose–response among studies, species, and other factors. Finally, quantitative dose–response models should be considered if they are likely to provide an improved basis for decision making. The most common dose–response models are simple models for data from a particular study for a particular species, using generalized linear models or other models appropriate for a given endpoint. Although simple models work well in some instances, they generally do not reflect the full breadth of information in a dose–response data set, because they apply only for particular studies, species, and endpoints. More advanced models are available that explicitly account for variation among studies and species, or that standardize multiple endpoints to a common response variable. Application of these models may be useful in some cases when data are abundant, but there are challenges to implementing and interpreting such models when data are sparse. Integr Environ Assess Manag 2014;10:3–11. © 2013 SETAC
Hazard quotients (HQs) are commonly used to identify risks of contaminants to wildlife. A typical HQ compares a point-estimate total oral dose of a chemical (the numerator) to a point estimate toxicity reference value (TRV; the denominator), where the TRV is assumed to represent a safe dose. In some instances, the concentration of the chemical in biological media (e.g., egg concentration, dietary concentration, tissue body burden) can be substituted for the ingested dose, but the concept remains the same. In most cases the TRVs are conservatively derived (e.g., using uncertainty factors) and are based on organism-level responses, and therefore whenever HQ < 1 it is generally assumed that unacceptable adverse effects to local populations are highly unlikely. Conversely, if HQ > 1, the magnitude of potential effects is unknown and further evaluation may be warranted by examining toxicity information in a more realistic manner, or by using other lines of evidence relevant to wildlife at the organism or population level (Barnthouse et al. 2008). Accordingly, guidance in Canada (CCME 1996) envisioned that HQs would be generated only as part of a “screening” risk assessment, and that additional and detailed “quantitative” risk assessments would provide more meaningful estimates of risk in cases where HQ > 1. Similarly, guidance in the United States (USEPA 2005) states that ecological soil screening levels (that are dose-based for wildlife) are intended for screening-level risk calculations and should not be adopted or modified to drive remediation.
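The HQ calculation described above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the function names and the numeric inputs are hypothetical, and treating HQ = 1 the same as HQ > 1 is an assumption made here because the text addresses only HQ < 1 and HQ > 1.

```python
def hazard_quotient(dose, trv):
    """Return the hazard quotient: point-estimate exposure divided by the TRV.

    Both arguments must share the same units (e.g., mg per kg body weight per day).
    """
    if trv <= 0:
        raise ValueError("TRV must be positive")
    return dose / trv


def screening_outcome(hq):
    """Interpret an HQ as a screening-level assessment would.

    Assumption: HQ exactly equal to 1 is grouped with HQ > 1.
    """
    if hq < 1:
        return "unacceptable effects highly unlikely"
    return "magnitude of effects unknown; further evaluation may be warranted"


# Hypothetical exposure estimate and TRV, for illustration only.
hq = hazard_quotient(dose=2.0, trv=0.5)
print(hq, "->", screening_outcome(hq))
```

Note that the result is a single number with a dichotomous interpretation, which is the limitation discussed in the remainder of this section.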
Unfortunately, in practice, HQs are often the only line of evidence used for evaluating risks for wildlife and have been relied on directly to make decisions in cases where HQ > 1. Many practitioners have applied more sophisticated approaches based on the HQ, such as using simulation models to generate probabilistic exposure estimates for comparison to TRVs (Tannenbaum et al. 2003). The reliance on HQ > 1 to rationalize risk management is fundamentally flawed in most cases—HQs estimated in Ecological Risk Assessment (ERA) are often overly conservative and can exceed 1 even for background exposures (Tannenbaum et al. 2003), due to the sparseness of available effects data, lack of consideration of bioavailability, conservatism often applied in TRV derivation (McDonald and Wilcockson 2003; Allard et al. 2010), as well as complexities in exposure estimation (Tannenbaum et al. 2003).
The greatest limitation of a result of HQ > 1 is that it usually provides limited insight regarding the actual probability and magnitude of potential effects. TRVs used in the HQ denominator are often poorly linked to an effect magnitude (Kapustka 2008; Allard et al. 2010). Even if the effect associated with a TRV is known, the magnitude of effects at a particular site can only be understood for the rare case where HQ = 1 and the receptor of concern was the same as used in the TRV derivation. Otherwise, the HQ is simply a dichotomous measure (Tannenbaum et al. 2003), often with high uncertainty. Furthermore, because TRVs are usually derived from studies assessing organism-level effects, the linkage of an HQ to potential effects on local populations, which are typically the entities to be protected in an ERA, is tenuous (Tannenbaum et al. 2003; Suter et al. 2005; Salice et al. 2011). Equally important, because studies showing effects at the lowest doses tend to be used as the basis for derivation of TRVs (when multiple studies are assessed), the degree of conservatism in TRVs relative to the entire dose–response data set is often overlooked.
Practitioners can meaningfully address some of the limitations of HQs by explicitly evaluating available dose–response data. In a review of the development and application of wildlife TRVs, Allard et al. (2010) recommended that risk assessors should compile and integrate available data and attempt to understand underlying dose–response relationships. They highlighted general options for using dose–response information, ranging from simple plots of multispecies data sets (when data are sparse) to species-specific dose–response modeling (when data are abundant).
In practice, explicit evaluation of dose–response data is complicated. Data are usually sparse; that is, the data are spread across many species and endpoints such that definitive characterization of a dose–response relationship for each species and endpoint is not usually possible. Consequently, in most cases it is necessary to draw on data for multiple species (i.e., even those not directly included as receptors of interest in an ERA) and to consider variation in endpoints among studies. For reproductive endpoints in particular, individual studies rarely measure exactly the same endpoint. A sparse data set spread out over numerous species and endpoints is fundamentally difficult to evaluate. Typically, among the studies relevant to a wildlife receptor, a few may support quantitative dose–response modeling. Although models can be fit to data for those studies, the models do not account for the data from the other studies that may target different species and endpoints. Given the sparse and complex nature of wildlife dose–response data sets, what options are available for rigorously assessing risks to wildlife while taking advantage of all of the relevant data? This article aims to address this fundamental question by:
- Describing considerations relevant to compiling dose–response data in a way that is relevant for wildlife risk assessment
- Exploring options for graphical evaluation of data, with a focus on understanding variability among studies, endpoints, and species
- Exploring options for standardizing or combining data across studies, species, or even endpoints, to enhance understanding of the data as appropriate
- Reviewing simplified options for quantitative dose–response modeling, and discussing challenges associated with their application to sparse wildlife data sets
COMPILING DOSE–RESPONSE DATA
The first step in understanding potential dose–response relationships is to compile appropriate data for the contaminant of interest. Each available study must be evaluated carefully for quality, relevance, and appropriate experimental design before extraction of data relating dose to response. Importantly, the data extraction should include the full range of exposures, including control data. Considerations for data compilation include:
- Taxonomic group: In most cases, the receptors of concern in wildlife risk assessment are not the same species for which dose–response data exist. Consequently, and consistent with current practice in wildlife risk assessment (Allard et al. 2010), we recommend that for most applications, data should be gathered for all birds (for application to avian receptors) or all mammals (for application to mammalian receptors). The same principles would apply for reptiles and amphibians. Once data are gathered, it may be possible to rationalize narrower taxonomic ranges for particular receptors. For example, an assessment for deer may be based on dose–response data for ruminants only. However, it is more common that screening of data proceeds only to a coarse level of ecological or taxonomic resolution. For example, common avian subgroupings for which toxicity data are available are raptors (e.g., falcon, kestrel), passerines (e.g., swallow, robin), galliformes (e.g., quail, pheasant, chicken), and waterfowl (e.g., mallard duck, black duck).
- Duration of exposure: In general, studies with long dose durations may be expected to produce adverse effects of larger magnitude, or effects at a higher frequency in a test population, than otherwise equivalent studies of short duration. However, this may not always be the case, and screening of studies based on test duration can cause sensitive short-term studies to be overlooked. The magnitude of response is often linked more closely to the sensitivity of the life stage and the nature of the evaluated endpoint than to the temporal duration of the test. Consequently, and in light of the sparseness of many wildlife dose–response data sets, we usually include data (at least initially) for all study durations. The effect of study duration can later be evaluated graphically or with statistical tests, which may lead to exclusion of data for shorter-term studies if appropriate. To aid in this process, criteria for differentiating acute, subchronic, and chronic study durations for birds and mammals are useful, such as those provided by the US Environmental Protection Agency (USEPA 2007) and Sample et al. (1996). Consistent with recent guidance for wildlife TRV development (Allard et al. 2010), we do not recommend application of arbitrary uncertainty factors to extrapolate from acute or subchronic endpoints to chronic endpoints.
- Study quality: The quality of a study can be evaluated using numerous criteria (USACHPPM 2000; USEPA 2005). Each study must have at least one control treatment for a given endpoint, and control performance must be adequate (e.g., low survival in the control would not be acceptable). Replication is also important as it affects precision of estimates of effect size. For example, a survival study with 20 organisms per treatment can differentiate effect sizes to the nearest 5%. The minimum number of replicates per treatment (and for the control) that is considered useful may depend in part on consideration of critical effect sizes (e.g., a minimum of 5 organisms per treatment would be needed to distinguish an effect size of 20%). Finally, we generally recommend exclusion of studies where confounding contaminants are administered, or where the nutrient status of the feed or other conditions related to baseline stress or health are atypical, because it may not be possible in such cases to distinguish the effects of the contaminant of interest.
- Statistical significance: When compiling dose–response data, we are interested in data showing both effects and lack of effects. Where there are actual responses, the data may or may not be statistically significant relative to control (based on hypothesis tests), dependent in part on study power. Because we are interested in the evidence provided by all of the data, statistical significance is not a criterion that should be used to screen data. Furthermore, where the experimental treatment outperforms the control (i.e., negative effect size) such data should be retained because they convey information on the variation in response (or may indicate hormesis—see discussion of hormesis later in this article).
- Method of dose administration: Studies where dose is administered through diet or drinking water are most relevant to wildlife risk assessment. The potential for confounding effects related to energy intake in ad lib studies should not be overlooked as a source of uncertainty (Keenan et al. 1996). Other oral exposure types (e.g., capsule or force-feeding by oral gavage or oral intubation) are less ecologically relevant (USEPA 2003) but have the advantage of more certainty regarding dose. We recommend that all oral dose data be compiled for initial evaluation. Injection studies should generally be excluded because they do not account for gastric bioavailability.
- Dose estimation: Many studies do not report dose but instead report concentrations of contaminants in food or drinking water. If food or water ingestion rates are not provided in the study, dose can be estimated using ingestion rates that are derived from other studies on the same species and life stage or calculated using allometric equations (Calder and Braun 1983; Nagy 1987, 2001). Importantly, dietary concentrations and food ingestion rates must be consistent with respect to moisture (i.e., both based on wet weights, or both based on dry weights) to estimate dose correctly. An additional aspect of quality control pertains to the consistency of chemical units in dose and effects values. Most studies report the dose of the individual toxic element, while excluding other components of the administered compound. For example, aluminum lactate (C9H15AlO9) contains approximately 9.2% Al by weight, so failure to standardize data across studies based on the weight of elemental Al would create large error.
- Body weight in dose estimation: Body weight information is used for estimating dose, as discussed above. For studies on juvenile organisms, body weight (and hence food ingestion rates) can change substantially during an exposure period. Because body weight does not increase linearly over time, and because different studies have different durations, use of initial body weight, average body weight, or final body weight in dose estimation can affect comparisons across studies. We generally use initial body weight (which usually results in the most conservative estimate of dose), but this is not always reported. Practitioners should consider what has been reported in the available studies and make a decision that is appropriate for a particular data set. In addition, caution should be exercised in the application of body weight data or food ingestion rates to estimate dose if exposures induced appetite suppression, as this can bias the dose estimate for the onset of responses.
- Dietary concentration as a measure of exposure: Although dose is most commonly used as the measure of exposure in wildlife risk assessment, dietary concentration can also be useful as a measure of exposure (Depew et al. 2012). As discussed above, there are significant challenges to estimation of dose using food ingestion rates and body weight, particularly in studies of juvenile organisms whose body weight (and therefore dose) changes over the course of the study. Variation in assumptions and methods for dose estimation are a major source of uncertainty in ERA (Mayfield and Fairbrother 2013). These particular uncertainties in estimated dose do not apply if dietary concentration is used as the measure of exposure, because it is usually measured with relatively little error. On the other hand, the major disadvantage of representing exposure as dietary concentration is that it must assume that food ingestion rates are comparable among studies. This assumption will generally be invalid for a diverse data set, but may be reasonable for groups of similar studies. A second disadvantage of dietary concentration as the measure of exposure is that exposures from different sources and/or routes (e.g., diet vs water) cannot be compared. Furthermore, for cases of multiple food types there will be uncertainty in estimating average dietary concentration. In summary, use of dietary concentration as a measure of exposure instead of dose eliminates some uncertainties but introduces others. We use dose as our primary measure of exposure, but recommend that risk assessors consider dietary concentration as an option for particular sets of data.
- Body weight as an endpoint: In addition to its use in estimating dose, body weight is also measured directly as an endpoint in developmental studies. For long-term studies of body weight, experimental mortality can bias estimates of mean body weight for longer time periods (i.e., the organisms that die before the end of an experiment may be excluded from growth endpoint calculations). Another issue with body weight is that studies may report body weight increment rather than absolute body weight as the endpoint. In such cases, careful consideration should be given to how to record the data. Body weight gain without consideration of initial body weight can be very misleading. For example, if a treatment group animal initially weighs 50 g and gains 1 g whereas a control group animal initially weighs 50 g and gains 10 g, there is a 10-fold difference in response for body weight increment, but not much difference in final body weight. Body weight gain is probably most informative from studies conducted on juveniles during their main growth stage when initial body weight is low relative to potential gain. If growth data are compared among studies, practitioners need to explore the data thoroughly with plots, and think carefully about how to interpret the data for ecological relevance.
- Sex: Sexual dimorphism is important in dose–response data sets in at least 2 ways. First, males and females may be affected differently by the same dose (Smits et al. 2002). A study may report responses for males, responses for females, and average responses across both sexes. Caution is therefore needed if comparing multiple studies where the sex of affected organisms is not the same. Second, differential exposure of male and female adults can occur when reproductive endpoints are measured. For example, a study may measure offspring production under various treatments where only the female parent is dosed, only the male parent, or both parents to different degrees.
- Chemical form: The degree to which chemical form and speciation are considered depends on the stage of ecological risk assessment. For some chemicals, distinguishing between organic and inorganic forms of the substance (such as with methylmercury) is essential to the derivation of a meaningful dose–response profile. The chemical speciation of a metal or metalloid can have a complex effect on observed toxicity, particularly when the environmental exposure derives from a combination of inorganic metal, methylated metal, or more complex organometal components. Investigators should clearly distinguish between major forms of chemicals administered in the literature studies and screen studies for relevance to the natural environment of the receptor species (e.g., pH, redox, water hardness, and ambient chemical mixture). Bioaccessibility (i.e., the fraction of the dose ingested that becomes available for absorption) of highly soluble metal salts administered in many laboratory studies is much higher than bioaccessibility of metals in the field, leading to potential overestimation of field toxicity of these substances. In advanced stages of risk assessment, it can be helpful to refine this uncertainty through evaluation of chemical speciation or relative bioaccessibility between laboratory and field conditions (Ollson et al. 2009; Saunders et al. 2011).
- Types of endpoints: Organism-level endpoints that are most directly linked to population-level effects, such as mortality, reproduction, and to a lesser extent growth, are usually most relevant to wildlife risk assessment. Gross malformations, developmental anomalies, or behavioral responses may also be relevant, particularly if they can be linked to the ability of the animal to survive and reproduce. Other types of endpoints (e.g., enzyme activities, subcellular responses) may be good indicators of exposure but are not considered as relevant for risk assessments in many jurisdictions (USEPA 2003) due to questionable linkage to organism- or population-level effects. Practitioners should decide which endpoints to include in a given data set based on relevant policy and the ability to relate endpoints to ecologically relevant effects on organisms or populations.
- Redundant endpoints: Studies often report endpoints that are clearly redundant. For example, some studies report results for males, for females, and for both sexes combined based on the data for each sex separately. Similarly, growth measurements are often made at multiple time increments, with later measurements integrating the responses observed at earlier time periods. In such cases, it is appropriate to record all of the data, but to recognize that for any given application only a subset of the data are likely to be appropriate. If most studies measure both sexes combined, we could drop the specific data for males and females where reported. In other cases the data for one sex or the other may be relevant.
- Correlated endpoints: Many toxicity studies measure multiple response variables that are different but closely related. This is common for behavioral measures aimed at evaluating potential neurological effects of a contaminant. Methods to jointly analyze multiple correlated endpoints have been applied (Krewski et al. 2002). In our experience, these methods are not often needed because of the types of endpoints typically considered in wildlife risk assessments (survival, growth, and reproduction). Growth endpoints that are closely related and correlated can often be considered redundant for particular applications (see previous point), and reproductive endpoints that are correlated often have sequential or hierarchical relationships that can be addressed differently (see next point).
- Sequential or hierarchical endpoints: Many endpoints, particularly reproductive endpoints, have sequential or hierarchical relationships. For example, a study may report egg production per breeding pair as well as the proportion of eggs that hatch; these are sequential in terms of their contribution to reproductive output. We may be able to use the raw data in the study to calculate a single combined endpoint, which is the number of hatchlings per breeding pair. The opposite may also occur. If a study measures the number of hatchlings and the number of eggs, we could calculate the proportion of eggs that hatch. If a study reports all 3 metrics, or if we have 2 and calculate the third, then the relationship among the 3 must be kept in mind during later analyses. The preferred method, depending on data availability, is to use the combined endpoint as it integrates the 2 contributing components to assess cumulative effects on reproductive output. However, if there are numerous studies measuring only a particular endpoint such as egg production, specific analyses for that endpoint may be informative when aggregated across multiple studies. Our usual approach is to initially compile all of the reported data from a study and then make decisions on coalescing or dividing the data based on consideration of the entire data set.
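Two of the calculations described in the considerations above lend themselves to a short sketch: estimating an elemental dose from a dietary concentration, and combining sequential reproductive endpoints. The function names, the atomic masses typed in below, and all study values are illustrative assumptions rather than data from any cited study.

```python
def elemental_fraction(element_mass, compound_mass):
    """Mass fraction of the toxic element in the administered compound."""
    return element_mass / compound_mass


def estimate_dose(diet_conc, ingestion_rate, body_weight,
                  conc_basis, ingestion_basis, element_fraction=1.0):
    """Estimate dose in mg element per kg body weight per day.

    diet_conc       : mg compound per kg food
    ingestion_rate  : kg food per day
    body_weight     : kg
    conc_basis, ingestion_basis : "wet" or "dry"; they must match
    """
    if conc_basis != ingestion_basis:
        raise ValueError("concentration and ingestion rate must share a moisture basis")
    return diet_conc * element_fraction * ingestion_rate / body_weight


def hatchlings_per_pair(eggs_per_pair, proportion_hatched):
    """Combine two sequential endpoints into one integrated endpoint."""
    return eggs_per_pair * proportion_hatched


# Aluminum lactate, C9H15AlO9, using standard atomic masses.
compound = 9 * 12.011 + 15 * 1.008 + 26.982 + 9 * 15.999
al_fraction = elemental_fraction(26.982, compound)  # roughly 0.092

# Hypothetical study: 1000 mg aluminum lactate per kg food (dry weight),
# 0.02 kg dry food per day, 0.2 kg initial body weight.
dose = estimate_dose(1000.0, 0.02, 0.2, "dry", "dry", al_fraction)
print(round(al_fraction, 3), round(dose, 2))
```

Raising an error on a wet/dry mismatch is a deliberate choice in this sketch: a silent mismatch would bias every dose in a compiled data set by the moisture content of the feed.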
Following the above guidance, we compiled a data set for reproductive effects of polychlorinated biphenyls (PCBs) in the 50% to 60% chlorination range on birds, a subset of which is used below to explore the use of dose–response data. The data set is typical in our experience, with a reasonable number of studies but spread over numerous species and endpoints. The data set is sufficient to illustrate methods but sparse enough to demonstrate the typical challenges in evaluating wildlife dose–response data. The data set is provided in Supplemental Data Table S1, and the studies from which the data were derived are listed in the supplemental reference list.
EVALUATING DATA STRUCTURE WITH PLOTS
When dose–response data are compiled from multiple studies, every study will have strengths and weaknesses in terms of informing the assessment of toxicity to the species of interest. Across studies, there may be variation in endpoints, species, sex, life stage or age of organisms tested, the frequency and duration of exposure, duration of follow-up, route of exposure and method of administration, performance of study controls, and other factors. Consequently, data from 2 separate studies are rarely comparable in all aspects. For example, our example data set for effects of PCBs on bird reproductive endpoints covers 21 specific endpoints and 6 species across 14 studies.
Consideration of multiple studies marks a departure from the approaches used in the development of many published TRVs. Point estimate TRVs are often derived from single studies that are selected as the most appropriate among a set of studies. As an example relevant to this data set, the California Department of Toxic Substances Control uses low and high avian TRVs for PCBs (recommended by USEPA in 2002) that are based on single studies (DTSC 2009). Variation in the selection of studies by risk assessors is a key source of variation among TRVs used for risk assessments (Mayfield and Fairbrother 2013). Although sound rationale is typically used to derive such TRVs, and they may be useful in initial screening assessments, it is difficult to convey the degree of conservatism (if any) without the context of the broader data set. Furthermore, reliance on a single study in general is questionable because of the likelihood that the particular results of that study have occurred by chance (Ioannidis 2005). Other approaches to TRV development, such as the Eco-SSL approach, incorporate results from multiple studies but do not consider response size or associated ecological significance.
Plots are essential for understanding dose–response data. A single plot of the avian reproductive endpoints (a subset is shown in Figure 1) can capture much of the data structure; in this case, endpoints, species, and exposure duration are all portrayed in the plot. Each data point represents the response measured for a single treatment group from a single study. To allow data from different studies to be considered together, responses are normalized to study-specific control performance as $\text{normalized response} = (\text{treatment response} / \text{control response}) \times 100\%$, where all metrics refer to positive response. For example, if a study has 20 organisms in the control group and 19 survive, survival for the control is 95%. If the same study has a treatment group with 20 organisms and 16 survive, the survival in the treatment group is 80%. Then, using the equation above, the control-normalized response would be 0.80/0.95 = 0.842, or 84.2%. That value would correspond to a particular dose (estimated or measured) and would be plotted as a single data point on a figure such as Figure 1. For a continuous endpoint such as growth, the raw response for a treatment or control group is the average across the individual organisms. It is important to normalize each data point to study-specific control performance, otherwise the response for treatment groups cannot be compared across studies. The normalized response can be interpreted as the expected treatment response if the control response was 100%. In the example above, raw survival in the treatment group was 80%, but that was associated with control survival of 95%. If control survival had been 100%, we would expect the treatment response to be slightly higher (i.e., 84.2%). Using this approach, normalized positive response can be >100% if a treatment group outperforms the control. Similar conversions can be made for negative response metrics such as mortality (Wayland et al. 2007).
For endpoints that have clear bounds (e.g., survival, hatching success) these conversions are simple and can be thought of as positive or negative responses (e.g., hatching success or failure). However, for endpoints such as body weight there is no obvious negative endpoint to be normalized, so care should be taken in normalizing such endpoints to control performance.
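The control normalization can be sketched as follows, reproducing the survival example from the text. The function names are hypothetical, and converting a bounded negative metric to its positive complement before normalizing is an assumption of this sketch, not necessarily the conversion used by Wayland et al. (2007).

```python
def normalized_response(treatment, control):
    """Treatment response as a fraction of study-specific control response.

    Both inputs are positive-response metrics (e.g., proportion surviving).
    Values above 1.0 mean the treatment group outperformed the control.
    """
    if control <= 0:
        raise ValueError("control response must be positive")
    return treatment / control


def positive_from_negative(negative_proportion):
    """Convert a bounded negative metric (e.g., mortality) to its positive
    complement (e.g., survival). Assumption: the metric is a proportion."""
    return 1.0 - negative_proportion


# Worked example from the text: 16 of 20 survive in the treatment group,
# 19 of 20 survive in the control group.
r = normalized_response(16 / 20, 19 / 20)
print(f"{100 * r:.1f}%")

# The same study expressed as mortality (20% vs 5%) gives the same result
# once converted to survival.
r_from_mortality = normalized_response(positive_from_negative(0.20),
                                       positive_from_negative(0.05))
```

This conversion applies only to bounded endpoints; as noted above, endpoints such as body weight have no obvious negative counterpart.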
Depending on the data set, similar plots could be used to convey other aspects of data structure related to factors such as variation among studies, variation related to chemical form (e.g., different PCB Aroclor mixtures), or variation between sexes (e.g., male parent dosing vs female parent dosing). To effectively convey the complexity in a data set, several sets of plots may be appropriate for initial evaluation of the structure of the data.
Plots can be tailored to facilitate interpretation for a particular site. The vertical dashed line in Figure 1 is a hypothetical estimated dose of 2 mg · kg−1 · d−1 at a site, which can be used to provide insight into the potential magnitude of risks expected for a particular receptor at that site. A vertical line corresponding to a published dose-based TRV could also be included to provide insight into the degree of conservatism (or lack thereof) associated with use of that published TRV. The horizontal dashed line represents a hypothetical acceptable effects level (AEL) with the region below that line representing an unacceptable response (i.e., in this case the AEL is a 25% adverse effect relative to control, which in terms of normalized positive response is 75%). Data falling in the lower left quadrant of each panel represent potential concern. The intercept of the horizontal line is the same for each endpoint, which may be appropriate as a starting point or if relevant policy specifies a particular AEL that applies to all endpoints. Alternatively, the line could be developed separately for each endpoint to reflect judgment about the relative importance of each endpoint to organisms or populations.
General observations in this particular case might be:
- Few data points fall in the lower left portion of concern.
- Although data are variable, there is some indication of a dose–response relationship for some endpoints (e.g., number eggs per hen per day, proportion fertile eggs hatched).
- There are not enough data points to draw meaningful inferences for most specific endpoints, and data would not support endpoint-specific quantitative dose–response modeling for most endpoints.
- There is no evidence to suggest that duration of the study is important—in this case, such differences were not expected because all of the studies had durations of at least 50 days.
- Several studies measured multiple endpoints, some of which are hierarchically related. For example, the kestrel studies included measures of number of eggs per clutch, number of fertile eggs, number of hatchlings, and number of fledglings per breeding pair. The number of fledglings integrates biological factors, including clutch size, egg fertility, and hatching success, and is therefore useful as a single, integrated endpoint. The 3 more specific endpoints could probably be ignored in this case because there are virtually no data for these endpoints from any other studies.
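The site-specific reading of Figure 1 (a vertical line at the estimated site dose and a horizontal line at the AEL) amounts to a simple filter on the compiled data. The sketch below assumes normalized positive responses expressed as fractions; the data points are hypothetical, and the handling of points lying exactly on either line is an arbitrary choice.

```python
def in_concern_quadrant(dose, response, site_dose, ael_response):
    """True if a data point shows an unacceptable response at or below the site dose.

    dose, site_dose : same units (e.g., mg per kg body weight per day)
    response        : normalized positive response (1.0 = equal to control)
    ael_response    : lowest acceptable normalized response (e.g., 0.75)
    """
    return dose <= site_dose and response < ael_response


# Hypothetical (dose, normalized response) pairs for one endpoint.
points = [(0.5, 0.95), (1.5, 0.70), (2.0, 0.80), (5.0, 0.40)]

site_dose = 2.0      # estimated dose at the site
ael_response = 0.75  # a 25% adverse effect relative to control

flagged = [p for p in points if in_concern_quadrant(*p, site_dose, ael_response)]
print(flagged)
```

A count of flagged points is not a risk estimate; it is only a way of directing attention to the studies that warrant closer scrutiny.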
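At this stage, the practitioner may decide to investigate particular endpoints in more detail. As an example, we examine the proportion of eggs hatched (Figure 2). The relative precision of each data point is conveyed with error bars denoting the standard error of each normalized response, which can be computed using the approximate variance of a ratio. Specifically, for a given treatment group t or control group c, there are a total of n eggs laid of which x successfully hatch. From the binomial distribution (Mood et al. 1974), we obtain estimates for the proportion of eggs hatched, $\hat{p}_i = x_i / n_i$, and its variance, $\widehat{\mathrm{Var}}(\hat{p}_i) = \hat{p}_i (1 - \hat{p}_i) / n_i$, for $i \in \{t, c\}$. In turn, the estimate of normalized response is given by $\hat{R} = \hat{p}_t / \hat{p}_c$, which has an approximate variance for a ratio (Mood et al. 1974) given by $\widehat{\mathrm{Var}}(\hat{R}) \approx \hat{R}^2 \left[ \widehat{\mathrm{Var}}(\hat{p}_t) / \hat{p}_t^2 + \widehat{\mathrm{Var}}(\hat{p}_c) / \hat{p}_c^2 \right]$ when the treatment and control estimates are independent.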
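The standard error calculation can be sketched as below. It assumes the usual first-order (delta method) approximation for the variance of a ratio of two independent binomial proportions; the function names and the counts are hypothetical.

```python
import math


def binomial_estimate(x, n):
    """Estimated proportion and its variance for x successes out of n trials."""
    p = x / n
    return p, p * (1.0 - p) / n


def normalized_response_se(x_t, n_t, x_c, n_c):
    """Normalized response (treatment / control) and its approximate standard error.

    Assumes independent treatment and control groups. The variance is written as
    (Var(p_t) + R^2 * Var(p_c)) / p_c^2, which is algebraically the same as
    R^2 * (Var(p_t) / p_t^2 + Var(p_c) / p_c^2) but stays defined when p_t = 0.
    """
    p_t, v_t = binomial_estimate(x_t, n_t)
    p_c, v_c = binomial_estimate(x_c, n_c)
    if p_c == 0:
        raise ValueError("control response must be positive")
    r = p_t / p_c
    var_r = (v_t + r ** 2 * v_c) / p_c ** 2
    return r, math.sqrt(var_r)


# Hypothetical counts: 16 of 20 eggs hatch in the treatment group,
# 19 of 20 hatch in the control group.
r, se = normalized_response_se(16, 20, 19, 20)
print(f"normalized response = {r:.3f}, SE = {se:.3f}")
```

With only 20 eggs per group, the standard error is roughly 0.10 on a normalized response of 0.84, which illustrates why error bars matter when comparing single data points across studies.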
From Figure 2 we might make the following preliminary observations:
- The mallard data point and one of the pheasant data points are cases where treatments exceeded control performance. Considerable variability in the data is expected (e.g., due to small sample sizes), but these data points may nevertheless warrant additional scrutiny.
- The single data point for the ring dove is relatively precise and may indicate that the ring dove is sensitive to PCBs for this endpoint. However, it is difficult to draw firm conclusions from a single data point. Further scrutiny of this data point may be warranted.
- The chicken data (combined) suggest that hatching may decline in a dose-dependent manner with PCB exposure. More detailed evaluation of these data would be warranted.
As another interpretation, the last 3 endpoints in Figure 1 (proportion of chicks surviving to 3 weeks, to 6 weeks, and to fledge) could be integrated as they are all essentially the same endpoint (i.e., chick survival). These data could then be plotted together, using symbols to distinguish specific endpoints and species (Figure 3). From this figure, a key observation would be the disparate responses associated with the 2 data points for pheasant at high dose. Detailed evaluation of those data points would be warranted.
FITTING DOSE–RESPONSE MODELS TO SIMPLE DATA SETS
Graphical exploration of dose–response data may be sufficient to draw conclusions for many risk assessments, particularly where measured doses are either clearly associated with negligible effects or clearly associated with unacceptably large effects. In other cases, risk assessors may seek more quantitative approaches. The simplest models can be fit to data from a particular study. For many endpoints, including all of the endpoints shown in Figure 1, generalized linear models (GLMs) are, in theory, appropriate when working with raw data. GLMs can accommodate different error structures (e.g., binomial in the case of dichotomous data, and Poisson in the case of count data) and are easily implemented using modern statistical packages (Kerr and Meador 1996; Bailer and Oris 1997). For example, Moore et al. (1997) estimated risks to mink from hexachlorobenzene by fitting a model to data from a single study that measured total kit biomass at 6 weeks of age per female as an endpoint. Moore et al. (1999) estimated risks to kingfishers by fitting data from single, unrelated studies on pheasants to evaluate effects of methylmercury (the endpoint was the number of chicks surviving to 14 days per female) and PCBs (the endpoint was the number of chicks surviving to 6 weeks per female). In these cases, the functional form of the dose–response model was appropriate for each endpoint.
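To make the single-study case concrete, the following sketch fits a GLM with binomial error to dichotomous data (eggs hatched out of eggs laid) from one hypothetical study. The logit link, the linear dose term, the function names, and the data are all assumptions for illustration; none of these come from the studies cited above. The fit uses Newton-Raphson on the binomial log-likelihood, written out in plain Python so the steps are visible. In practice a statistical package would normally be used.

```python
import math


def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def fit_binomial_logit(doses, successes, trials, n_iter=50, tol=1e-10):
    """Maximum-likelihood fit of logit(p) = b0 + b1 * dose."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for d, x, n in zip(doses, successes, trials):
            p = inv_logit(b0 + b1 * d)
            w = n * p * (1.0 - p)   # binomial variance weight
            g0 += x - n * p         # score for the intercept
            g1 += (x - n * p) * d   # score for the slope
            h00 += w
            h01 += w * d
            h11 += w * d * d
        # Newton step: solve the 2x2 information matrix against the score.
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1


# Hypothetical single-study data: dose (mg/kg/d), eggs hatched, eggs laid.
doses = [0.0, 1.0, 2.0, 4.0, 8.0]
hatched = [42, 40, 35, 25, 10]
laid = [50, 50, 50, 50, 50]

b0, b1 = fit_binomial_logit(doses, hatched, laid)
for d in doses:
    print(f"dose {d:4.1f}: fitted proportion hatched = "
          f"{inv_logit(b0 + b1 * d):.3f}")
```

Count endpoints (e.g., number of chicks per female) would call for a Poisson error structure instead, which changes the likelihood and the weights but not the overall approach.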
Alternative model forms that reflect an understanding of the mechanism of action for a particular contaminant and endpoint may also be relevant when fitting dose–response data. For example, some chemicals may exhibit hormesis, causing adverse effects at high doses but beneficial effects at low doses (Schabenberger and Birch 2001; Calabrese and Baldwin 2003). However, for sparse wildlife data sets, detecting subtleties in dose–response such as hormesis is challenging, and model overspecification is likely if such models are applied without justification. Another factor influencing the selection of dose–response model form is the range of magnitudes of effect that are of interest. The primary objective of dose–response modeling in ecological risk assessment is to characterize dose–response relationships at intermediate levels of response to individual organisms, such as 20% inhibition. As such, precise estimation of dose associated with low incidence rates (as required in human health risk assessment, often necessitating extrapolation of the model to low doses) is usually not required. Alternative model forms can be compared quantitatively and, if appropriate, results from more than one model can be used to make inferences (Link and Albers 2007).
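As an illustration of working at an intermediate level of response, the sketch below solves a fitted logistic dose–response model for the dose associated with 20% inhibition relative to the modeled control response. The model form (logit link, linear in dose), the coefficient values, and the function names are hypothetical, chosen only to show the arithmetic.

```python
import math


def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def logit(p):
    return math.log(p / (1.0 - p))


def dose_at_inhibition(b0, b1, inhibition=0.20):
    """Dose at which the modeled response falls to (1 - inhibition) times
    the modeled response at zero dose, for logit(p) = b0 + b1 * dose."""
    if b1 >= 0:
        raise ValueError("model must describe a declining response (b1 < 0)")
    p_control = inv_logit(b0)                  # modeled response at dose 0
    p_target = (1.0 - inhibition) * p_control  # e.g., 80% of control
    return (logit(p_target) - b0) / b1


# Hypothetical fitted coefficients, for illustration only.
b0, b1 = 1.6, -0.4
ed20 = dose_at_inhibition(b0, b1)
print(f"dose at 20% inhibition = {ed20:.2f} mg/kg/d")
```

Because the target lies within the range of intermediate responses, the estimate depends mainly on how well the model describes the middle of the curve and does not require extrapolation to very low doses.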
DOSE–RESPONSE MODELING OPTIONS FOR COMPLEX DATA SETS
Model fits to data from single studies can have limited use in ecological risk assessment. They do not take advantage of a complete dose–response data set and they may not be reflective of general relationships beyond the particular study. In some cases a single study may be purposely selected because of its sensitivity, but otherwise risk assessors may seek to combine data from multiple studies. For example, Wayland et al. (2007) combined data from several studies to estimate hatch failure rate in American Dipper and Harlequin Duck resulting from dietary exposure to Se. They control-normalized the data to account for different control performance among the studies. Moore et al. (1999) combined data from several studies to model effects of methylmercury on mink mortality, and effects of PCBs on reproductive output.
Although combining data across studies is tempting, strong caution is warranted when drawing inferences from model fits because we expect there to be fundamental differences among studies. If we had enough data from each study, we may see a different dose–response relationship for each study (Wheeler and Bailer 2009). In this way, a data set that is compiled from several studies is considered to have multiple “levels”—data points are not independent, rather they are grouped by study, and variation among studies may be important. As an extreme example, we fit a GLM with binomial error to the pooled data from 3 chicken studies reporting proportion of eggs hatched (Figure 4). For simplicity we ignored the minor differences in control performance among studies—in reality data should be control-normalized, although in that case model fitting may not be so straightforward (control-normalized response is a ratio of 2 binomial quantities, the numerator as the treatment response and the denominator as the control response, each with its own sample size). In any case, the model fit is tenuous at best, as it is driven almost entirely by 2 data points that do not seem to support a common model. It turns out that those 2 data points are from different studies. This adds even more uncertainty to our interpretation of the model fit, because we have no insight into whether observed differences between data points are due to differences among studies or random variation that would apply even within studies. In short, there is no rigorous statistical basis for fitting a model to these pooled data, and the example fit is potentially, if not dangerously, misleading. Although this case is extreme, it highlights the additional statistical and interpretive complexity required when fitting dose–response models to data combined across studies. Unfortunately, the data set shown in Figure 4 is typical of what is available for many endpoints (see Figure 1).
Options for dealing with study-related variation in the data will vary depending on the quality and quantity of the data. For example, if we have a good data set from one study, we could model only that data set and then make qualitative inferences about data from other studies by comparing them to the fitted model. Ideally, if we have multiple good data sets (e.g., from several studies), we can consider fitting a single model that explicitly incorporates the grouped nature of the data by using a mixed-effects model (Pinheiro and Bates 2000; Gelman and Hill 2007) or a Bayesian hierarchical model (Gelman and Hill 2007; Wheeler and Bailer 2009). Such “multilevel” models have been used to account for variation among studies (Coull et al. 2003; Corrao et al. 1999; Wheeler and Bailer 2009), variation among litters for developmental studies with young arising from multiple litters (Krewski et al. 2002; Hunt and Rai 2008), and variation among individuals in studies of human patients (Lalonde et al. 1999). The result of a multilevel model applied to data for several studies would be characterization of the mean relationship across studies, and variation in the relationships among studies, although the latter may be poorly characterized unless there are many studies. Either of these may be of interest for a risk assessment. When there are true differences among studies (or any other factor that would make the data multilevel), failure to account for those differences can lead to bias in the estimate of the mean curve and overestimates of confidence (Pinheiro and Bates 2000). Multilevel models are attractive because they explicitly address differences among studies (including, for example, differences in control performance). Unfortunately, although multilevel models are seemingly a logical progression in the quantitative analysis of dose–response data, our experience is that they are seldom supported by sparse wildlife toxicology data sets.
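For readers unfamiliar with multilevel models, one common formulation for dichotomous data grouped by study is sketched below. This is an illustrative assumption rather than a model fit in this article: the logit link, the linear dose term, and the normally distributed study effects are choices made for the sketch. Here $x_{ij}$ is the number of positive responses out of $n_{ij}$ in dose group $i$ of study $j$, and $d_{ij}$ is the dose.

```latex
\begin{aligned}
x_{ij} &\sim \mathrm{Binomial}(n_{ij},\, p_{ij}) \\
\mathrm{logit}(p_{ij}) &= (\beta_0 + u_{0j}) + (\beta_1 + u_{1j})\, d_{ij} \\
(u_{0j},\, u_{1j})^{\top} &\sim \mathcal{N}(\mathbf{0},\, \Sigma)
\end{aligned}
```

In this form, $\beta_0$ and $\beta_1$ describe the mean relationship across studies, the study-specific deviations $u_{0j}$ and $u_{1j}$ absorb differences such as control performance and sensitivity, and $\Sigma$ characterizes variation among studies. With few studies, $\Sigma$ is estimated from very few groups, which is why among-study variation may be poorly characterized.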
When we extend the discussion from variation among studies to variation among species, the application of multilevel models may be different. If the species of interest is in the data set and the species-specific data are good, a multilevel model is not needed because analysis can focus on the species of interest. If the data for the species of interest are poor, a multilevel model may be useful because the model can use information from other species to help in fitting a relationship for the species of interest. If the species of interest is not in the data set, the mean dose–response relationship across species may be of interest, but variation among species will also be important for understanding uncertainty about using the relationship for a species that is not in the data set. Many potential nuances are associated with modeling variation among species. For example, variation among species is likely to be confounded to some extent by variation among studies, particularly in a sparse data set where for some species there may only be 1 or 2 studies. Another nuance relates to similarities and differences among species. For example, if data are not available for a species of interest, but are available for a closely related species, data for the related species could be weighted more heavily than data for other species or used to the exclusion of other data.
Although rational arguments can be made for modeling multilevel data across studies or species, much greater caution is required for modeling data across endpoints. For example, we would expect very different dose–response relationships for growth endpoints versus reproductive endpoints. However, there may be opportunities to convert subsets of endpoints to some kind of meaningful common metric. Several studies have standardized data across endpoints and then fit models to those data. Dillon et al. (2010), in examining effects of Hg on fish, converted data on mortality, severe developmental abnormalities, fry hatch failure, and spawning failure to a common metric “percent injury.” The various endpoints were considered severe enough to be potentially equivalent to mortality. A nonlinear model was then used to relate percent injury (as a continuous variable) to Hg concentrations in fish tissue. Other examples have used ordinal approaches. Krewski, Chambers et al. (2010), in examining effects of Cu deficiency and excess on human health, converted data for numerous endpoints into 5 ordinal severity scores ranging from no effects to gross toxicity. Krewski, Chambers, and Birkett (2010) subsequently used categorical regression (USEPA 2000) to evaluate the ordinal response data against Cu intake, focusing on the most sensitive endpoint (i.e., highest ordinal score) for each dose group. Caux et al. (1997) characterized effects of suspended sediment on freshwater fish and invertebrates using 14 ordinal categories representing severity of ill effects, and then fit a nonlinear model to relate the ordinal data to suspended sediment concentrations. These methods offer some options that may be useful for modeling pooled wildlife dose–response data.
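The data preparation step behind the ordinal approaches can be sketched as follows: each endpoint observation in a dose group is assigned a severity score, and the most sensitive endpoint (highest score) is retained for each dose group. The numeric coding (0 for no effects through 4 for gross toxicity), the record layout, the function name, and the data are hypothetical. The sketch stops short of the categorical regression itself.

```python
from collections import defaultdict


def most_sensitive_score(records):
    """Map each dose group to its highest ordinal severity score.

    records: iterable of (dose, endpoint, score) tuples, where score is an
    integer from 0 (no effects) to 4 (gross toxicity).
    """
    worst = defaultdict(int)  # default 0 corresponds to "no effects"
    for dose, _endpoint, score in records:
        worst[dose] = max(worst[dose], score)
    return dict(sorted(worst.items()))


# Hypothetical scored observations, for illustration only.
records = [
    (0.5, "growth", 0),
    (0.5, "hatching success", 1),
    (2.0, "growth", 1),
    (2.0, "hatching success", 3),
    (8.0, "chick survival", 4),
    (8.0, "growth", 2),
]
by_dose = most_sensitive_score(records)
print(by_dose)
```

Collapsing to the highest score per dose group discards information on the less sensitive endpoints, so the original scored records should be retained for later scrutiny.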
In our view, the overreliance on published TRVs for detailed wildlife risk assessments stems in part from the perceived effort required to better characterize dose–response data. However, compendia of wildlife toxicology studies are increasingly prevalent, and the costs of analyzing and filtering data are not necessarily prohibitive. For large or complex sites where screening-level assessments fail to provide clear answers concerning wildlife risks, the cost of collecting and rigorously evaluating dose–response data is likely to be small in comparison to the economic and environmental costs of rendering a poorly informed decision.
The authors thank Transport Canada, which contributed to the compilation of PCB data used as the example in this article. All views and opinions expressed herein are those of the authors.
Table S1. Dose-response data