Evaluation of criteria used to assess the quality of aquatic toxicity data

Authors


Abstract

Good-quality toxicity data underpin robust hazard and risk assessments in aquatic systems and the derivation of water quality guidelines for ecosystems. An objective scheme for assessing the quality of toxicity data is therefore an important part of this process. Twenty-three ecotoxicologists used the Australasian ecotoxicity database (AED) quality assessment scheme to score 2 research papers, and the variation in their scores was evaluated. For each paper, fewer than 10% of assessors assigned a quality class that differed from the class implied by the quality score agreed a priori among the authors of this study. The majority of the variation in each assessment was attributable to ambiguous or poorly written assessment criteria, information that was difficult to find, or information in the paper that was overlooked by the assessor. This led to refinements of the assessment criteria in the AED, which resulted in a 16% improvement (i.e., reduction) in the mean variation of scores for the 2 papers when compared with the a priori scores. The improvement in consensus among different assessors evaluating the same research papers suggests that the data quality assessment scheme proposed in this article provides a more robust scheme for assessing the quality of aquatic toxicity data than methods currently available.

INTRODUCTION

Aquatic toxicity data form an important component of hazard and risk assessments and of the derivation of water quality guidelines, all of which are ultimately used to manage and protect aquatic ecosystems. The quality and reliability of these processes, however, depend largely on the quality of the toxicity data used. Therefore, schemes that can assess the quality of toxicity data, such as those used in the U.S. Environmental Protection Agency (USEPA) ECOTOX database (USEPA 2002) and the Australasian ecotoxicity database (AED) (Warne et al. 1998; Warne and Westbury 1999; Markich et al. 2002), were used to develop the current Australian and New Zealand water quality guidelines (ANZECC and ARMCANZ 2000). In these data-assessment schemes, the quality of data presented in published or unpublished research papers is assessed by awarding scores based on a series of criteria or questions designed to ascertain the scientific rigor of the testing reported in the paper. Table 1 shows the data-assessment scheme used for the AED (Warne et al. 1998; Warne and Westbury 1999; Markich et al. 2002). In this scheme, the scores awarded for each question are summed to obtain the “total score,” expressed as a percentage of the “total possible score” for that type of data (e.g., metals and freshwater biota). The data are classed as being of unacceptable, acceptable, or high quality, depending on whether the quality score is ≤50%, 51 to 79%, or ≥80%, respectively (Warne et al. 1998).
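To make the scoring arithmetic concrete, the following minimal sketch (in Python; illustrative only, not part of the AED software) computes a quality score from a total score and the total possible score for a data type, and assigns the corresponding quality class:

```python
def quality_class(total_score: float, total_possible: float) -> str:
    """Classify aquatic toxicity data quality, AED style.

    The quality score is the total score expressed as a percentage of the
    total possible score for that data type; data are classed as
    unacceptable (<=50%), acceptable (51-79%), or high quality (>=80%).
    """
    quality_score = 100.0 * total_score / total_possible
    if quality_score >= 80:
        return "high"
    if quality_score > 50:
        return "acceptable"
    return "unacceptable"

# Example: a freshwater/metal/nonplant test (total possible score = 100)
# awarded 65 marks receives a quality score of 65%, i.e., acceptable.
print(quality_class(65, 100))  # -> "acceptable"
```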

The quality and reliability of hazard and risk assessments and water quality guidelines can be improved by determining which data are of unacceptable quality and precluding their use. No matter how carefully a data quality assessment scheme is designed, however, there always remains the potential for different users to assign different quality scores or quality classes to the same data. This can occur in a number of ways: (1) assessors may have different interpretations of the information provided in the paper; (2) assessors may fail to find information provided in the paper; and (3) assessors may have different interpretations of the question and scoring scheme. The aim of this study, therefore, was to investigate whether the quality assessment of toxicity data varies among assessors and, if so, to quantify that variation, analyze its sources, and, if necessary, modify the AED data quality assessment scheme. This was done by asking scientists working in the field of ecotoxicology to assess the quality of data in 2 different research papers describing the toxicity of metals to aquatic organisms.

Although the weighted scoring used in the data assessment scheme presented in this study may be subjective and arguable, it was used only as a benchmark for addressing the aims stated above and was not itself the focus of this work. The approach was modified (Markich et al. 2002) from the scoring scheme used for the USEPA ECOTOX database (USEPA 2002).

METHODS

The survey asked participants to use the AED scoring scheme (Table 1) to assess the quality of data in 2 peer-reviewed journal articles that reported the toxic effects of metals on aquatic organisms: Buhl (1997) and Cheung and Lam (1998). Both articles were chosen randomly for the purpose of this study. They had previously been assessed as part of another project and classed as being of acceptable quality (i.e., a score of 51–79%; Hobbs et al. 2004). The survey was sent to 55 scientists, all of whom had some experience in the field of ecotoxicology. The participants included postgraduate students and scientists with varying degrees of ecotoxicological experience (2–25 y) from governmental and nongovernmental organizations.

Table 1. Original scheme for assessing the quality of aquatic toxicity data used in the Australasian ecotoxicity database. Taken from Markich et al. (2002)a

Question | Mark
1. Was the duration of the exposure stated (e.g., 48 or 96 h)? | 10 or 0
2. Was the biological endpoint (e.g., immobilization or population growth) defined? | 10 or 0
3. Was the biological effect stated (e.g., LC or NOEC)? | 5 or 0
4. Was the biological effect quantified (e.g., 50% effect, 25% effect)? The effect for NOEC and LOEC data must be quantified | 5 or 0
5. Were appropriate controls (e.g., a no-toxicant control and/or solvent control) used? | 5 or 0
6. Was each control and chemical concentration at least duplicated? | 5 or 0
7. Were test acceptability criteria stated (e.g., mortality in controls must not exceed a certain percentage)? Invalid data must not be included in the database | 5 or 0
8. Were the characteristics of the test organism (e.g., length, mass, age) stated? | 5 or 0
9. Was the type of test media used stated? | 5 or 0
10. Was the type of exposure (e.g., static, flow through) stated? | 4 or 0
11. Were the chemical concentrations measured? | 4 or 0
12. Were parallel reference toxicant toxicity tests conducted? | 4 or 0
13. Was there a concentration-response relationship either observable or stated? | 4 or 0
14. Was an appropriate statistical method or model used to determine the toxicity? | 4 or 0
15. For NOEC/LOEC/MDEC/MATC data, was the significance level 0.05 or less? OR For LC/EC/BEC data, was an estimate of variability provided? | 4 or 0
16. For metals tested in freshwater (FW), were the pH, hardness, alkalinity, and organic carbon content measured during the test and stated (3 marks each)? Award 1 mark if a parameter was measured but not stated, or if the dilution water only was measured and stated. OR For all other chemicals, was the pH measured and stated (3 marks)? Award 1 mark if it was measured but not stated, or if the dilution water only was measured and stated | 3, 1 or 0
17. For marine and estuarine water (MEW), was the salinity/conductivity measured and stated? | 3 or 0
18. For tests not using aquatic macrophytes and algae, was the dissolved oxygen content of the test water measured during the test? | 3 or 0
19. Was the temperature measured and stated? | 3 or 0
20. Was the grade or purity of the test chemical stated? | 3 or 0
Total score
Quality score = (Total score ÷ Total possible score) × 100
Quality class: H (high) ≥ 80%; A (acceptable) 51–79%; U (unacceptable) ≤ 50%

a LC = lethal concentration; NOEC = no observed effect concentration; LOEC = lowest observed effect concentration; MDEC = minimum detectable effect concentration; MATC = maximum acceptable toxicant concentration; EC = effective concentration; BEC = bounded effect concentration.

Total possible score for the various types of data and chemicals: FW/metal/nonplant = 100; FW/nonmetal/nonplant = 91; FW/metal/plant = 97; FW/nonmetal/plant = 88; MEW/nonplant = 91; MEW/plant = 88.
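The total possible scores listed above can be reproduced from the per-question marks. One consistent reading, assumed here, is that question 16 applies only to freshwater tests (12 marks for metals, 3 marks otherwise), question 17 only to marine and estuarine water, and question 18 only to tests not using macrophytes or algae. The sketch below verifies that arithmetic; it is illustrative, not AED code.

```python
# Maximum marks for each question in Table 1 (Q16 is handled separately
# because its maximum depends on the chemical and water type).
MAX_MARKS = {1: 10, 2: 10, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5, 8: 5, 9: 5,
             10: 4, 11: 4, 12: 4, 13: 4, 14: 4, 15: 4,
             17: 3, 18: 3, 19: 3, 20: 3}

def total_possible(water: str, metal: bool, plant: bool) -> int:
    """Total possible score for a data type (assumed applicability rules)."""
    total = sum(m for q, m in MAX_MARKS.items() if q not in (17, 18))
    if water == "FW":
        total += 12 if metal else 3   # Q16: 4 parameters x 3 marks, or pH only
    else:
        total += MAX_MARKS[17]        # Q17: salinity/conductivity (MEW only)
    if not plant:
        total += MAX_MARKS[18]        # Q18: dissolved oxygen (nonplants only)
    return total

assert total_possible("FW", metal=True, plant=False) == 100
assert total_possible("FW", metal=False, plant=False) == 91
assert total_possible("FW", metal=True, plant=True) == 97
assert total_possible("FW", metal=False, plant=True) == 88
assert total_possible("MEW", metal=False, plant=False) == 91
assert total_possible("MEW", metal=False, plant=True) == 88
```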

The authors of this article also participated individually in the survey. In addition, the authors collectively scored each article to establish a benchmark against which to evaluate the scores of the assessors participating in the survey. The answers to each question from the collective assessment of the 3 authors became the “agreed responses,” and the sum of the numerical responses became the “agreed quality score” (AQS). The agreed responses and AQS were used as the answers and quality scores against which all other responses were compared in all subsequent analyses. An attempt was made to remove all subjectivity from the agreed responses and AQS by critically assessing each paper jointly, with the answer to each question proven to the satisfaction of all 3 authors. The authors have collective experience of assessing about 2,000 research articles using the present scoring scheme.

Figure 1. Quality score for each respondent assessing the Buhl (1997) study. The agreed quality score (AQS) is shown as the solid horizontal line, and the quality classes are indicated by the broken lines.

The quality score and classification for each paper were determined by each assessor using the scheme provided in Table 1. The variability of the quality scores for each paper was examined by plotting the quality scores against respondent number and comparing the scores with the AQS. Simple linear regression analysis was used to determine whether a relationship existed between the assessors' years of ecotoxicological experience and the absolute value of the deviation from the AQS for each paper. The number of assessors who gave an answer different from the agreed response was determined for each question (Table 1). The responses and questions were then examined to determine why different answers were given, and the questions in the data assessment scheme were modified to improve the usability of the scheme and to reduce assessor variation. A follow-up survey using the revised assessment scheme was conducted with the same 2 articles (Buhl 1997; Cheung and Lam 1998) 6 months after the original survey. The survey was distributed to 13 of the original assessors, and responses were received from 7 of the 13. Each assessor was asked to review each paper independently and not to refer to the original surveys.
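For readers who wish to reproduce this style of analysis, a minimal sketch of the regression step is given below; the data values are placeholders, not the actual survey responses.

```python
import numpy as np
from scipy import stats

# Placeholder data: each assessor's years of ecotoxicological experience
# and their quality score for one article, with the AQS taken as 65.
experience = np.array([2, 5, 8, 12, 15, 20, 25])   # years (illustrative)
scores = np.array([70, 66, 72, 61, 68, 75, 64])    # quality scores (illustrative)
abs_deviation = np.abs(scores - 65)

# Simple linear regression of |score - AQS| on years of experience;
# p > 0.05 indicates no significant relationship.
result = stats.linregress(experience, abs_deviation)
print(f"slope = {result.slope:.3f}, p = {result.pvalue:.3f}")
```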

Figure 2. Quality score for each respondent assessing the Cheung and Lam (1998) study. The agreed quality score (AQS) is shown as the solid horizontal line, and the quality classes are indicated by the broken lines.

RESULTS AND DISCUSSION

Twenty-three of a possible 55 surveys were returned (i.e., a 42% return rate). The 23 surveys included the individual responses of the 3 authors of this study (i.e., there were 20 other respondents). The AQS was 65 for Buhl (1997) and 70 for Cheung and Lam (1998). For the Buhl (1997) study, the median quality score awarded by the assessors was 70, and the mean ± 95% confidence interval was 70 ± 3.1; scores ranged from 57 to 83. Figure 1 shows that 16 assessors gave the paper a higher score than the AQS, 6 scored it lower, and 1 gave the same score. For the Cheung and Lam (1998) study, the median quality score was 71, and the mean ± 95% confidence interval was 69 ± 3.5; scores ranged from 54 to 82. Figure 2 shows that 12 assessors scored the paper higher than the AQS, 10 awarded a lower score, and 1 gave the AQS score.
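As an aside on how such summaries are derived, the sketch below computes a median, mean, and 95% confidence interval for a set of scores using the t distribution; the scores shown are placeholders, not the survey data.

```python
import numpy as np
from scipy import stats

scores = np.array([57, 62, 65, 68, 70, 70, 71, 72, 74, 76, 83])  # placeholders

# Half-width of the 95% confidence interval on the mean.
half_width = stats.t.ppf(0.975, df=len(scores) - 1) * stats.sem(scores)
print(f"median = {np.median(scores):.0f}, "
      f"mean = {scores.mean():.0f} +/- {half_width:.1f} (95% CI)")
```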

Figure 3. Absolute deviation of individual quality scores from the agreed quality score (AQS) for both studies as a function of the number of years of experience in ecotoxicology of the assessors.

Despite the variation in scores awarded (Figures 1 and 2), only 2 assessors for each article (8.7%) gave a score that would result in a different data quality classification. In all 4 instances, the quality of the data was overestimated, with the classification given as “high quality” (≥80%). This low level of misclassification suggests that the data quality assessment scheme is robust overall and that the assessors shared a common basis for judging the data.

The relationship between the number of years of experience in ecotoxicology and the absolute deviation of the quality score from the AQS (Figure 3) was not significant (p > 0.05) for either article (p = 0.76 for Buhl 1997 and p = 0.051 for Cheung and Lam 1998). There was also no significant relationship (p = 0.21) when the deviations for both articles were combined. Therefore, the degree of experience of the assessors did not affect their ability to assess the quality of the data.

The percentage of assessors who gave answers that differed from the agreed response for each question is presented in Table 2. The authors arbitrarily determined that questions for which >20% of assessors gave “incorrect” answers (i.e., answers different from the agreed response) would be investigated further to determine the cause of the error: the question may have been ambiguous or poorly written or, alternatively, the assessors may have overlooked the information. Questions 2, 7, 13, and 16 all had >20% of assessors giving incorrect answers for the Buhl (1997) article; for the Cheung and Lam (1998) article, questions 7, 10, 11, 16, and 19 exceeded this threshold.
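The screening rule just described can be expressed in a few lines; the percentages below are those reported in Table 2 for Buhl (1997).

```python
# Percentage of assessors whose answer differed from the agreed response
# for each question on Buhl (1997), as reported in Table 2.
disagreement = {1: 0, 2: 78, 3: 0, 4: 13, 5: 4.3, 6: 13, 7: 26, 8: 0,
                9: 0, 10: 4.3, 11: 8.7, 12: 0, 13: 30, 14: 4.3, 15: 4.3,
                16: 87, 17: 0, 18: 4.3, 19: 8.7, 20: 8.7}

# Flag questions where >20% of assessors disagreed with the agreed
# response; these are the ones investigated for ambiguous wording or
# overlooked information.
flagged = sorted(q for q, pct in disagreement.items() if pct > 20)
print(flagged)  # -> [2, 7, 13, 16]
```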

In some cases, the wording of the question appears to have led to the difference in responses. For example, question 2 asks whether the biological endpoint is defined. Almost 80% of assessors answered that the biological endpoint had been defined in the Buhl (1997) paper, even though the endpoint was simply stated as “mortality” and no definition of mortality was supplied; the agreed response to this question for Buhl (1997) was therefore a value of 0. Similarly, 61% of assessors answered question 19 for the Cheung and Lam (1998) article by indicating that temperature had been measured during testing and stated. Cheung and Lam (1998) stated, however, only that the environmental chamber was maintained at 25°C, not whether the temperature was measured in the test chamber or in the test solutions.

Table 2. Percentage (%) of the 23 ecotoxicology assessors whose answers to each question differed from the agreed response established by the authors for the 2 randomly selected ecotoxicity studies used to evaluate the data quality assessment process

Question no. | Buhl (1997) | Cheung and Lam (1998)
1 | 0 | 13
2 | 78* | 0
3 | 0 | 0
4 | 13 | 4.3
5 | 4.3 | 0
6 | 13 | 0
7 | 26* | 78*
8 | 0 | 4.3
9 | 0 | 4.3
10 | 4.3 | 43*
11 | 8.7 | 48*
12 | 0 | 0
13 | 30* | 4.3
14 | 4.3 | 4.3
15 | 4.3 | 8.7
16 | 87* | 100*
17 | 0 | 0
18 | 4.3 | 4.3
19 | 8.7 | 61*
20 | 8.7 | 4.3

* Deviation >20% from the agreed response, which the authors consider to signify a lack of consensus among experts.

Other differences from the agreed responses may be due to assessors not reading the paper carefully and, therefore, missing the relevant information or data. This was evident in question 16 for both articles and in questions 10 and 11 for Cheung and Lam (1998). Of all the questions in the assessment scheme, question 16 showed the greatest variation in the answers provided by the assessors. This may be, in part, because the question has multiple parts and multiple scores depending on the answers. Both papers provided at least some of the appropriate information: the Buhl (1997) paper clearly provided the physicochemical parameters of the test solution (Buhl 1997, table 2), whereas the Cheung and Lam (1998) paper reported the physicochemical parameters for the dilution water but not for the test solutions.

Table 3. Revised scheme for assessing the quality of aquatic toxicity data, with the modified questions marked “(revised)”a

Note: To determine the quality of data, the entire article should be read.

Question | Mark
1. Was the duration of the exposure stated (e.g., 48 or 96 h)? | 10 or 0
2. (revised) Was the biological endpoint (e.g., immobilization or population growth) stated and defined (10 marks)? Award 5 marks if the biological endpoint is only stated | 10, 5 or 0
3. Was the biological effect stated (e.g., LC or NOEC)? | 5 or 0
4. Was the biological effect quantified (e.g., 50% effect, 25% effect)? The effect for NOEC and LOEC data must be quantified | 5 or 0
5. Were appropriate controls (e.g., a no-toxicant control and/or solvent control) used? | 5 or 0
6. Was each control and chemical concentration at least duplicated? | 5 or 0
7. (revised) Were test acceptability criteria stated (e.g., mortality in controls must not exceed a certain percentage)? OR Could test acceptability criteria be inferred because the test method used (e.g., USEPA, OECD, ASTM) contains validation criteria (award 2 marks)? Note: invalid data must not be included in the database | 5, 2 or 0
8. Were the characteristics of the test organism (e.g., length, mass, age) stated? | 5 or 0
9. Was the type of test media used stated? | 5 or 0
10. Was the type of exposure (e.g., static, flow through) stated? | 4 or 0
11. Were the chemical concentrations measured? | 4 or 0
12. Were parallel reference toxicant toxicity tests conducted? | 4 or 0
13. Was there a concentration-response relationship either observable or stated? | 4 or 0
14. Was an appropriate statistical method or model used to determine the toxicity? | 4 or 0
15a. (revised) For NOEC/LOEC/MDEC/MATC data, was the significance level 0.05 or less? OR | 4 or 0
15b. For LC/EC/BEC data, was an estimate of variability provided? | 4 or 0
16a. (revised) For metals tested in freshwater (FW), were the following parameters measured: (i) pH, (ii) hardness, (iii) alkalinity, and (iv) organic carbon concentration? Award 3 marks for each parameter that was measured during the test with values stated; award 1 mark for each parameter that was measured but not stated, or measured and stated for the dilution water only. OR | 3, 1 or 0 (each)
16b. For all other chemicals, was the pH measured and values stated? Award 1 mark if it was measured but not stated, or if the pH of the dilution water only was measured and stated | 3, 1 or 0
17. For marine and estuarine water (MEW), was the salinity/conductivity measured and stated? | 3 or 0
18. For tests not using aquatic macrophytes and algae, was the dissolved oxygen content of the test water measured during the test? | 3 or 0
19. (revised) Was the temperature measured and stated (3 marks)? Award 1 mark if only the temperature settings of the room or chamber are stated | 3, 1 or 0
20. (revised) Were analytical reagent grade chemicals or chemicals of the highest possible purity used for the experiment? | 3 or 0
Total score
Quality score = (Total score ÷ Total possible score) × 100
Quality class: H (high) ≥ 80%; A (acceptable) 51–79%; U (unacceptable) ≤ 50%

a LC = lethal concentration; NOEC = no observed effect concentration; LOEC = lowest observed effect concentration; OECD = Organization for Economic Cooperation and Development; MDEC = minimum detectable effect concentration; MATC = maximum acceptable toxicant concentration; EC = effective concentration; BEC = bounded effect concentration.

Total possible score for the various types of data and chemicals: FW/metal/nonplant = 100; FW/nonmetal/nonplant = 91; FW/metal/plant = 97; FW/nonmetal/plant = 88; MEW/nonplant = 91; MEW/plant = 88.
Figure 4. Absolute deviation of individual quality scores from the original and revised agreed quality scores (AQS) for the Buhl (1997) study.

For question 10, 43% of the assessors missed that the acute tests were conducted under static conditions in the Cheung and Lam (1998) article. Additional confusion may have arisen when answering question 11 (i.e., whether the chemical concentrations had been measured) for the Cheung and Lam (1998) paper: the article states that concentrations were measured “at the time of solution renewal,” which occurred only in the chronic tests. The concentrations were not measured for the acute tests.

For other questions, the differences between survey respondents and the agreed response could only be attributed to assessor error. Question 7 asks whether test acceptability criteria were stated. For the Buhl (1997) paper, 26% of assessors declared that test acceptability criteria had been stated even though this was not the case. Assessors may have assumed that the first sentence of the results section, stating that “there were no mortalities in any of the control treatments,” was a test acceptability criterion. Technically, it is not, and a score of 0 was therefore the appropriate response to question 7. For the Cheung and Lam (1998) article, 78% of assessors stated incorrectly that test acceptability criteria had been provided. Similarly, for question 13, 30% of assessors stated that the Buhl (1997) article provided a concentration-response relationship, either observable or stated, but neither was evident in the paper.

After examining the feedback from the assessors, it was clear that the AED quality assessment scheme was a useful and robust method for assessing the quality of ecotoxicity data included in a database or used in a risk or hazard assessment. It was also clear that, in some cases, the assessment questions themselves contributed to the variation in data quality scores. As a result, questions 2, 7, 15, 16, 19, and 20 were revised to improve their clarity (Table 3). For example, question 2 was reworded to include the option of awarding 5 marks if the biological endpoint was stated but not defined (Table 3). This reduces the error of awarding full marks to studies in which the biological endpoint is stated but not defined, while not penalizing a paper for only stating the endpoint. A 2nd option was included for question 7 that allows 2 marks to be awarded to a study that did not state the test acceptability criteria but used a test method from which acceptability criteria can be inferred (e.g., USEPA, Organization for Economic Cooperation and Development, and ASTM test methods contain test acceptability criteria).

Figure 5. Absolute deviation of individual quality scores from the original and revised agreed quality scores (AQS) for the Cheung and Lam (1998) study.

Question 16 was rewritten as a 2-part question so that each physicochemical parameter for which a study should report data is clearly indicated, making the question less confusing for assessors (Table 3). Question 19 was modified to remove any confusion about whether the question referred to measuring the temperature of the test media or of the test chamber (Table 3). Question 20 was reworded so that full marks are awarded to experiments that used chemicals of the highest available purity, irrespective of the actual level of purity (Table 3). To address the issue of assessors overlooking information that was stated, but not necessarily in the Methods section, a note was added to the data assessment scheme instructing users to read the entire paper (Table 3).

The results of the follow-up survey using the revised data assessment scheme showed that the mean variation of scores (and their range) from the AQS was reduced from 6.2 (original) to 4.2 (revised), that is, by 32%, for Buhl (1997) (Figure 4), but showed no improvement (4.3 to 4.3; 0%) for Cheung and Lam (1998) (Figure 5). The lower original mean variation of scores for the Cheung and Lam (1998) paper may reflect that paper's high quality of data and presentation of test information, which made it more difficult to improve upon the original data quality score.
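The improvement percentages quoted above, and the 16% figure in the Abstract, follow directly from the mean absolute deviations, as this small check shows:

```python
def percent_reduction(original: float, revised: float) -> float:
    """Percentage reduction in mean absolute deviation from the AQS."""
    return 100.0 * (original - revised) / original

buhl = percent_reduction(6.2, 4.2)        # ~32% for Buhl (1997)
cheung = percent_reduction(4.3, 4.3)      # 0% for Cheung and Lam (1998)
print(buhl, cheung, (buhl + cheung) / 2)  # mean ~16%, as in the Abstract
```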

CONCLUSIONS

The AED quality assessment scheme should be an effective tool for assessors to confidently assess and classify the quality of aquatic toxicity data reported in published or unpublished research papers, regardless of their degree of ecotoxicological experience. In this study, the majority of the variation in the 2 assessment case studies occurred in responses to only 5 of the 20 questions used to evaluate data quality. For these 5 questions, the criteria appear to have been ambiguous or poorly written, the information was difficult to find or was overlooked by the assessor, or the assessors interpreted the criteria differently. Revision of these questions in the AED quality assessment scheme improved the consensus among assessors.

Acknowledgements

We are grateful to Meg Burchett for discussions that inspired this study and to our colleagues in the field of ecotoxicology who participated in the survey: M. Aistrope, G. Batley, D. Bellifemine, M. Binet, J. Chapman, A. Colville, N. Cooper, H. Doan, C. Doyle, A. El Merhibi, S. Gale, C. King, K. Leung, M. Mortimer, P. Ralph, K. Ross, R. Smith, J. Stauber, R.M. Sunderam, and M. Woods. We also acknowledge the New South Wales Department of Environment and Conservation (formerly NSW EPA) for funding. We would also like to gratefully acknowledge Buhl (1997) and Cheung and Lam (1998) for being unwitting victims of this study.
