Reliability and accuracy of visual methods to quantify severity of foliar bacterial spot symptoms on peach and nectarine


E-mail: hngugi@yahoo.com

Abstract

The objectives of this study were to assess the reliability and accuracy of visual methods used to quantify the severity of bacterial spot (Xanthomonas arboricola pv. pruni) symptoms and evaluate the effects of rater experience on the quality of disease estimates. Three cohorts of raters differing in experience with disease assessment rated three sets of peach or nectarine leaves (n ≥ 103; disease severity levels from 0% to 100%) by direct estimation of percentage leaf area with symptoms. Four of the experienced raters also rated the leaves using a 1–7 interval scale. Actual disease severity on the leaves was obtained with the APS assess image analysis software. Equivalence tests based on a bootstrap analysis were used to compare the rating scale and direct estimation methods, and to evaluate the effects of rater experience, computer training and human instruction on accuracy and reliability of disease estimates. In concordance analysis of continuous variables, with data from the scale converted to percentages, the direct estimation method resulted in more accurate and reliable estimates than the interval scale. Analysing the scale data without conversion to percentage improved the concordance statistics for the scale, but not sufficiently to match the direct estimation method. Accuracy was affected more by rater experience and intrinsic ability than was reliability. Instruction on disease symptoms resulted in the largest improvement in estimates from inexperienced raters. Accurate and reliable direct estimation of bacterial spot severity on peach and nectarine can be made by raters with varying levels of experience provided they receive sufficient instruction.

Introduction

Bacterial spot of stone fruits is caused by the Gram-negative bacterium Xanthomonas arboricola pv. pruni. Considered the most important bacterial disease of peach and nectarine, bacterial spot can be devastating, especially when highly susceptible cultivars are grown in regions with warm, wet temperate climates. Symptoms on peach and nectarine occur on leaves, twigs and fruit. Foliar lesions are angular in shape, delimited by the veins, and often surrounded by a chlorotic halo (Zehr et al., 1996). As lesions age, the centres abscise from the leaf, developing a ‘shot-hole’ appearance (Ritchie, 1995). Yellowing of all or part of the leaf is also common (Ritchie, 1995). Severe foliar infections ultimately reduce yield due to reduced photosynthetic competence and carbohydrate uptake (Crisosto et al., 1995).

Evaluating bacterial spot intensity (incidence and severity) is a challenging task. Estimates of foliar disease incidence may not be very informative because nearly all plants within a plot will have at least one infected leaf, thereby necessitating the determination of ‘the degree of infection’ (i.e. disease severity; Madden et al., 2007). As with most leaf spot diseases, bacterial spot severity can be quantified by counting the number of lesions or spots per leaf. However, bacterial spot lesions often coalesce, making counts difficult and prone to errors during visual estimation. Moreover, in severe epidemics, symptoms are often characterized by chlorosis and leaf yellowing, which cannot be quantified by counting lesions. The gradations of yellow and light green lack clearly demarcated boundaries against visually healthy leaf tissue, a situation that complicates the estimation of disease severity.

Common methods of visually estimating disease severity include direct estimation with or without the aid of standard area diagrams and the use of interval scales (Nita et al., 2003; Madden et al., 2007; Bock et al., 2008, 2010b). Interval scales have normally been used to quantify foliar disease severity in epidemiological studies on bacterial spot (Zehr et al., 1996; Battilani et al., 1999). Preliminary studies in this laboratory suggested that using a rating scale was a faster method of obtaining estimates of disease severity. However, recent studies with other bacterial diseases showed that direct estimation often resulted in more reliable and accurate estimates of disease severity and that poorly designed interval scales can bias data and lead to false conclusions (Bock et al., 2009b, 2010a). With the exception of the study by Citadin et al. (2008), bacterial spot interval scales have not been assessed for reliability (i.e. reproducibility of an estimate) and accuracy (i.e. the closeness of an estimate to the actual value), despite the importance of data obtained with these scales in making disease management decisions, determining cultivar susceptibility, and comparing chemical treatments.

No matter what method of visual estimation is used, or the experience of the rater, disease severity assessments should be reliable and accurate (Campbell & Madden, 1990). Reliability is the degree to which the measurements of the same diseased individuals obtained under different conditions produce similar results (Everitt, 1999). Intra-rater reliability is the agreement between measurements (e.g. disease severity) taken repeatedly by the same raters (Madden et al., 2007), while inter-rater reliability is the agreement between measurements of the same diseased specimen as rated by multiple raters. Accuracy is the closeness of an estimate to the actual value (Everitt, 1999). However, actual disease severity is difficult to measure and often a ‘gold standard’, such as computer image analysis, is used as an accepted measurement of actual disease severity (Lin, 1989; Bland & Altman, 1999; Shoukri, 2004; Bock et al., 2009a).

The task of collecting field data on bacterial spot severity may be split among multiple raters with different ability and experience. Personnel may include experienced research scientists or persons with limited or no experience such as temporary student workers. Numerous phytopathological studies have documented large differences among raters evaluating the same set of diseased samples (Bock et al., 2010a,b). Reduced accuracy and reliability have been associated with: (i) rater intrinsic ability (Nutter & Littwiller, 1993; Nita et al., 2003; Bock et al., 2009b); (ii) time interval between disease assessments (Parker et al., 1995); (iii) complexity of symptoms assessed (Bock et al., 2008, 2009b); and (iv) interactions among multiple factors (Bock et al., 2010b). However, these associations were made without explicit statistical tests to evaluate their hypotheses. Knowledge of inter-rater reliability and accuracy when assessing bacterial spot symptoms is required to assess the quality of the data obtained from multiple raters. The objectives of this study were to: (i) assess and compare the reliability and accuracy of visual methods used to quantify bacterial spot severity by comparing results of visual estimates to actual values obtained with computer image analysis; and (ii) evaluate the effects of rater experience with bacterial spot and training on reliability and accuracy of visual estimates of bacterial spot severity.

Materials and methods

Leaves used for assessments

Leaves were obtained from non-treated plots in a peach and nectarine planting at the Penn State Fruit Research and Extension Center in Biglerville, PA, USA. Leaf samples were taken from nectarine cv. Easternglo and the peach cultivars Beekman and Sweet Dream. Before collection, the leaves were rated based on a 1–7 disease severity interval scale (1 = 0%, 2 = 1–3%, 3 = 4–8%, 4 = 9–15%, 5 = 16–25%, 6 = 26–45%, 7 = >45% lesion area) by an individual familiar with bacterial spot symptoms. Experience with bacterial spot in the eastern USA indicated that on most cultivars, leaves with about 50% diseased area become chlorotic and abscise, hence the choice of 45% as the maximum disease severity. A set of 105 leaves, comprising 15 leaves of each scale category, was collected from each of the three cultivars to establish three separate sets used throughout the study. Leaves were pressed under large books to ensure a completely flat surface for digital imaging, scanned at 300 dpi and saved as Tagged Image File Format (tiff) files.
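For concreteness, the class limits of this scale can be encoded as a simple lookup. The following sketch is ours, for illustration only; the function name and implementation are not from the study.

```python
# Encode the 1-7 severity scale used to pre-classify leaves:
# 1 = 0%, 2 = 1-3%, 3 = 4-8%, 4 = 9-15%, 5 = 16-25%, 6 = 26-45%, 7 = >45%.
def severity_to_scale(pct: float) -> int:
    """Assign a percentage leaf area with symptoms to the 1-7 interval scale."""
    if pct == 0:
        return 1
    for score, upper in ((2, 3), (3, 8), (4, 15), (5, 25), (6, 45)):
        if pct <= upper:       # upper class limit, inclusive
            return score
    return 7                   # open-ended top class (>45%)
```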

Image analysis

The APS assess 2.0 image analysis software for plant disease quantification (L. Lamari; American Phytopathological Society, St Paul, MN, USA) was used to determine the actual percentage of diseased area of each leaf. The program uses the Hue-Saturation-Intensity (HSI) colour model, mathematical transformations of the RGB model and algorithms to delineate diseased area. The HSI model was used to separate the leaf from the background, and to subsequently separate the diseased area from the leaf. Threshold levels for leaf and lesion were set accordingly for each individually scanned leaf and recorded for purposes of reproducibility. The scanned images of the three sets of leaves (total 315 leaves) were uploaded to a Microsoft PowerPoint presentation set to display a randomly selected image of a single leaf on each slide.
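APS assess is proprietary and its algorithms are not reproduced here. The following sketch only illustrates the kind of threshold-based leaf/lesion segmentation described above: scikit-image's HSV transform stands in for the HSI model, and the threshold values are illustrative assumptions that, as in the study, would need tuning for each scanned leaf.

```python
# Schematic leaf/lesion segmentation sketch (NOT the APS assess implementation).
import numpy as np
from skimage import io
from skimage.color import rgb2hsv

def percent_diseased(path, leaf_sat_min=0.15, green_hue=(0.17, 0.45)):
    """Return the percentage of leaf pixels classified as diseased."""
    rgb = io.imread(path)[:, :, :3]          # drop alpha channel if present
    hsv = rgb2hsv(rgb)
    hue, sat = hsv[..., 0], hsv[..., 1]
    leaf = sat > leaf_sat_min                # separate leaf from pale background
    healthy = leaf & (hue >= green_hue[0]) & (hue <= green_hue[1])  # green tissue
    diseased = leaf & ~healthy               # lesions and chlorotic (yellow) areas
    return 100.0 * diseased.sum() / leaf.sum()
```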

Visual assessment

Three cohorts of raters were defined by their experience and extent of training and instruction on bacterial spot symptoms and disease severity assessment. Group I: six raters familiar with bacterial spot symptoms with some training in plant pathology. Group II: six inexperienced raters who received a detailed 15 min explanation of foliar bacterial spot symptoms from an experienced rater as well as a 15 min period to practise disease severity assessment with the computer program distrain (Tomerlin & Howell, 1988) prior to rating the samples. Group III: a different cohort of seven inexperienced raters who rated the leaves a year after groups I and II following a step-by-step training and instruction process; they first rated the three sets of leaves without any instruction or practice on the computer program (before training, BT). Two days after rating the leaves without training or instruction, group III undertook another rating exercise, but this time after a 15 min training on distrain (after training, AT). One month after the AT rating, members of group III were given 15 min of instruction about bacterial spot symptoms from an experienced rater followed by a further 15 min of practice on distrain before completing another rating exercise (AI). Each rater worked alone with images displayed on a computer screen and assessed each set of leaves twice using the direct estimation (DE) method. A 10–15 min break was given to raters between each set of leaves. In addition, four of the experienced raters evaluated the same leaves twice using the 1–7 interval scale for comparison of the two visual disease assessment methods.

Data analysis

Two approaches were used for the analysis of reliability and accuracy of data obtained with the interval scale. First, the data were converted from the 1–7 scale to percentage disease severity and subjected to the analyses for continuous measurements used for the direct estimation data. However, several studies have shown that transforming data from interval scales to percentages introduces an error, as the midpoint of each percentage severity class must be used (Bock et al., 2009b). Therefore, non-transformed data obtained with the interval scale were also assessed for accuracy using concordance statistics for categorical data (Svensson, 2000). To facilitate the assessment of accuracy in data obtained with the interval scale, the actual disease severity data were converted to the 1–7 interval scale with a minor modification: values from APS assess of ≤0·25% were considered to represent leaves with no disease and assigned a score of 1 on the interval scale. The analysis thus assesses the ability of the raters to classify leaves into the 'correct' severity category.
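A sketch of these two conversions is given below (our illustration, not the authors' code). The class midpoints follow from the published class limits; the midpoint for the open-ended top class (>45%) is not stated in the paper, so 72·5 (the midpoint of 45–100%) is an assumption here.

```python
# Class midpoints for the 1-7 scale (midpoint of 45-100% assumed for class 7).
MIDPOINTS = {1: 0.0, 2: 2.0, 3: 6.0, 4: 12.0, 5: 20.5, 6: 35.5, 7: 72.5}

def scale_to_percent(score: int) -> float:
    """Convert a 1-7 interval-scale rating to percentage via its class midpoint."""
    return MIDPOINTS[score]

def actual_to_scale(pct: float) -> int:
    """Convert an image-analysis severity (%) to the 1-7 scale; values
    <= 0.25% are treated as disease-free (score 1), as in the paper."""
    if pct <= 0.25:
        return 1
    for score, upper in ((2, 3), (3, 8), (4, 15), (5, 25), (6, 45)):
        if pct <= upper:
            return score
    return 7
```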

For the comparison of DE and rating scale methods using continuous data, intra-rater reliability was assessed based on Pearson's product-moment correlation coefficient (r), a measure of precision which quantified the variability between severity estimates from the first and second assessments (Rousson et al., 2002; Nita et al., 2003). Intra-rater reliability and accuracy of the two rating methods, as well as the accuracy of assessments made by all raters using the DE method, were determined by Lin's concordance correlation coefficient (ρc), which measures the extent to which two sets of observations align on the line of concordance (i.e. the line y = x; Lin, 1989). Lin's concordance correlation coefficient is the product of r (as defined above) and Cb, the bias coefficient, which measures how much the best-fit line deviates from the concordance line; a Cb value of 1 represents a relationship with no bias (Nita et al., 2003; Bock et al., 2008). The bias coefficient was calculated as:

$$C_b = \frac{2}{v + 1/v + u^{2}}$$

where

$$v = \frac{\sigma_1}{\sigma_2} \qquad \text{and} \qquad u = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1 \sigma_2}}$$

The terms μ1 and μ2, and σ1 and σ2 are the means and the standard deviations for the estimated values and the actual values, respectively. A perfect agreement between the actual and estimated severity is represented by ρc = 1. The actual relationship, or the best fitting line, and the Pearson’s correlation coefficients were determined with linear regression of the estimated severity values against the actual values. For analysis of data obtained with the interval scale, the midpoint values of each class were used. All concordance analyses were computed using GenStat v. 12 (VSN International) while graphs were plotted using SigmaPlot v. 10 software (Systat Software).
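The following sketch illustrates how these quantities can be computed from paired estimated and actual severities. It is ours, for illustration; the study computed the concordance analyses in GenStat.

```python
# Lin's concordance statistics from the definitions above (illustrative sketch).
import numpy as np

def lin_concordance(est, act):
    """Return (rho_c, r, C_b) for paired estimated and actual severities (%)."""
    est, act = np.asarray(est, float), np.asarray(act, float)
    r = np.corrcoef(est, act)[0, 1]          # Pearson correlation (precision)
    v = est.std() / act.std()                # scale shift between the two series
    u = (est.mean() - act.mean()) / np.sqrt(est.std() * act.std())  # location shift
    c_b = 2.0 / (v + 1.0 / v + u ** 2)       # bias coefficient; 1 = no bias
    return r * c_b, r, c_b

# Example with made-up severities (not data from the study):
rho_c, r, c_b = lin_concordance([5, 12, 30, 50], [4, 15, 28, 47])
```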

Inter-rater reliability was determined in three ways. First, it was assessed based on the mean pairwise correlation coefficient (r), calculated as the average of the pairwise correlation coefficients between raters, determined separately for each of the two ratings of the same set of leaves. Secondly, the intra-class correlation coefficient (ρ) was calculated for all raters in each cohort combined, based on the variance components formulated from a two-way random effects anova model (Nita et al., 2003). Variance components were calculated using the varcomp procedure in sas v. 9.2 (SAS Institute Inc.) and values of ρ were determined as follows:

$$\rho = \frac{\sigma^{2}_{L}}{\sigma^{2}_{L} + \sigma^{2}_{R} + \sigma^{2}_{E}}$$

where $\sigma^{2}_{L}$, $\sigma^{2}_{R}$ and $\sigma^{2}_{E}$ are the variances for leaf, rater and error, respectively. The values were calculated for each repetition of each set of leaves for the combined data from each cohort of raters. Thirdly, Kendall's coefficient of concordance (W) (Kendall & Babington Smith, 1939) was computed for each cohort of raters. Kendall's W is based on the sums of ranks $R_i$ of the n individuals (i.e. the ranked values of the rating assigned to individual leaves in this study), and is computed by first calculating the sum-of-squares statistic $S = \sum_{i=1}^{n} (R_i - \bar{R})^{2}$, where $\bar{R}$ is the mean of the $R_i$ values. Kendall's coefficient was then obtained from the formula:

$$W = \frac{12S}{m^{2}(n^{3} - n) - mT}$$

where n is the number of leaves and m is the number of raters. T is a correction factor for tied observations.
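Both inter-rater statistics can be sketched as follows for a complete leaves × raters matrix of severity estimates (one rating per cell). This is our illustration: the method-of-moments variance components stand in for SAS PROC VARCOMP, and W uses the tie-corrected formula above.

```python
import numpy as np
from scipy.stats import rankdata

def icc(x):
    """Intra-class correlation from a two-way random-effects ANOVA on an
    n-leaves x m-raters matrix x (no replication)."""
    n, m = x.shape
    grand = x.mean()
    ms_leaf = m * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)
    ms_rater = n * np.sum((x.mean(axis=0) - grand) ** 2) / (m - 1)
    resid = x - x.mean(axis=1, keepdims=True) - x.mean(axis=0, keepdims=True) + grand
    ms_err = np.sum(resid ** 2) / ((n - 1) * (m - 1))
    v_leaf = max((ms_leaf - ms_err) / m, 0.0)    # method-of-moments estimators
    v_rater = max((ms_rater - ms_err) / n, 0.0)
    return v_leaf / (v_leaf + v_rater + ms_err)

def kendall_w(x):
    """Kendall's W with tie correction for an n-leaves x m-raters matrix x."""
    n, m = x.shape
    ranks = np.column_stack([rankdata(x[:, j]) for j in range(m)])  # ranks per rater
    R = ranks.sum(axis=1)                        # sum of ranks for each leaf
    S = np.sum((R - R.mean()) ** 2)
    T = 0.0                                      # tie correction: sum of (t^3 - t)
    for j in range(m):
        _, counts = np.unique(x[:, j], return_counts=True)
        T += np.sum(counts ** 3 - counts)
    return 12.0 * S / (m ** 2 * (n ** 3 - n) - m * T)
```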

For analysis of data obtained with the interval scale, accuracy was also assessed by computing Spearman's rank correlation coefficient (rs) and Kendall's correlation coefficient (τ), also referred to as Kendall's tau-a, without converting the data to percentage. Kendall's τ is a correlation coefficient for rank-ordered categorical data that may be equal to or less than 1, and was computed as:

$$\tau = \frac{n_c - n_d}{n(n-1)/2}$$

where nc and nd refer to the number of concordant and discordant pairs without adjustment for ties, respectively, and n is the number of paired observations (Kendall, 1945). The paired observations in this case refer to the two scores assigned to a single leaf: one the estimate assigned by a rater, the other the actual disease severity for that leaf. Lin's concordance correlation coefficient (ρc), the bias coefficient (Cb) and the Pearson correlation (r) were also calculated as previously described for continuous variables.
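A transparent (not fast) sketch of tau-a from this definition, alongside scipy's Spearman rank correlation, is shown below; the example numbers are illustrative, not data from the study.

```python
from itertools import combinations
from scipy.stats import spearmanr

def kendall_tau_a(rated, actual):
    """Kendall's tau-a: no tie adjustment in the denominator."""
    nc = nd = 0
    for (r1, a1), (r2, a2) in combinations(zip(rated, actual), 2):
        prod = (r1 - r2) * (a1 - a2)
        if prod > 0:
            nc += 1            # concordant pair
        elif prod < 0:
            nd += 1            # discordant pair
    n = len(rated)
    return (nc - nd) / (n * (n - 1) / 2)

# Scale scores from one hypothetical rater vs. actual scores for five leaves:
rater, actual = [1, 3, 3, 5, 7], [1, 2, 3, 5, 6]
print(kendall_tau_a(rater, actual), spearmanr(rater, actual).correlation)
```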

A bootstrap analysis of the reliability and accuracy statistics (ρc, r, Cb, ρ and W) was performed to compare data obtained with the interval scale and the DE method. The analysis provided the 95% confidence intervals needed to determine statistically whether accuracy and intra- and inter-rater reliability were higher for bacterial spot severity estimates based on DE or on the interval scale, and to discern the effects of rater experience and training. In each analysis, a total of 2000 balanced bootstrap samples were created in GenStat v. 12. Comparisons between the inter-rater reliability of estimates made with DE and the interval scale, as well as between the accuracy of estimates made with DE and non-transformed scale data from individual experienced raters (i.e. comparisons involving >10 individual values of each statistic for each rating method), were also carried out with the non-parametric Kolmogorov–Smirnov (K-S) test and a t-test statistic based on permutations of all possible combinations. Plots of the difference in rater statistics before and after training were used to examine the relationship between intrinsic rater ability and expected improvement in performance (C. Bock, USDA-ARS-SEFTRNL, Byron, GA, USA, personal communication).
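A simplified sketch of the bootstrap comparison follows. The study used 2000 balanced bootstrap samples in GenStat, in which each observation appears equally often across samples; plain random resampling, the function names and the example values below are our assumptions for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_ci_diff(stat_de, stat_scale, n_boot=2000, alpha=0.05):
    """Percentile CI for the mean difference of paired per-rater statistics."""
    d = np.asarray(stat_de) - np.asarray(stat_scale)   # paired differences
    means = [rng.choice(d, size=d.size, replace=True).mean() for _ in range(n_boot)]
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return d.mean(), (lo, hi)   # a CI excluding zero: methods differ at alpha

# Illustrative rho_c values only (12 per method: 3 leaf sets x 4 raters):
de = [0.99, 0.96, 0.99, 0.98, 0.97, 0.95, 0.95, 0.99, 0.98, 0.96, 0.97, 0.98]
scale = [0.85, 0.94, 0.96, 0.97, 0.93, 0.91, 0.97, 0.95, 0.94, 0.92, 0.90, 0.93]
print(bootstrap_ci_diff(de, scale))
```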

Results

Estimates of disease severity ranged from 0% to 100% in all three sets of leaves, and average values ranged from 15·0% to 36·6% across repetitions of multiple raters. Overall, the estimated maximum disease severity was always higher for direct estimation than for the scale, because scale data were summarized using class midpoints, which cap the maximum value that can be recorded. Images of two leaves in sets 1 and 2, and one of the leaves in set 3, did not readily display on the computer screen, so the actual number of leaves evaluated was 103, 103 and 104 for sets 1, 2 and 3, respectively. In the following sections, data for set 1 are presented in figures and data from sets 2 and 3 are presented in tables.

Comparison of visual estimation methods

Reliability

Intra-rater reliability determined by Lin's concordance coefficient, ρc, and the correlation coefficient, r, was high for all three sets of leaves and for both the DE and interval scale methods (Table 1; Fig. 1). DE r-values ranged from 0·95 to 0·99, while interval scale r-values ranged from 0·85 to 0·97. Lin's concordance correlation coefficient (ρc) ranged from 0·95 to 0·99 and from 0·85 to 0·97 for DE and the interval scale, respectively. There was little or no bias, as Cb values approached 1 (Table 1; Fig. 1). Intra-rater reliability was higher using DE compared to the interval scale based on a bootstrap analysis of r and ρc values (Table 1). Results of the non-parametric K-S test (χ² = 13·5; P = 0·001) and the permutation t-test (t = 3·43; P = 0·003 for both r and ρc) based on the 12 values of the reliability statistics for each method (i.e. three sets of leaves and four raters) were congruent with those of the bootstrap analysis (data not shown).

Table 1. Concordance statistics comparing the intra-rater reliability of direct estimation (DE) and use of a 1–7 interval scale for evaluating the severity of bacterial spot on peach and nectarine leaves

Second set of leaves (n = 103)

Rater | ρc^a DE | ρc Scale^b | r^a DE | r Scale^b | Cb^a DE | Cb Scale^b
G1A | 0·99 | 0·85 | 0·99 | 0·85 | 0·99 | 0·99
G1B | 0·96 | 0·94 | 0·97 | 0·95 | 0·99 | 0·99
G1C | 0·99 | 0·96 | 0·99 | 0·97 | 0·99 | 0·99
G1D | 0·98 | 0·97 | 0·98 | 0·97 | 0·99 | 0·99
G1E^c | 0·99 | — | 0·99 | — | 1·00 | —
Mean | 0·98 | 0·93 | 0·98 | 0·93 | 0·99 | 0·99
Difference (DE − Scale) | 0·059 | 0·058 | 0·001
95% CI^d | 0·021 to 0·117 | 0·022 to 0·115 | −0·002 to 0·003

Third set of leaves (n = 104)

Rater | ρc^a DE | ρc Scale^b | r^a DE | r Scale^b | Cb^a DE | Cb Scale^b
G1A | 0·99 | 0·93 | 0·99 | 0·94 | 0·99 | 0·99
G1B | 0·97 | 0·95 | 0·97 | 0·96 | 1·00 | 0·99
G1C | 0·95 | 0·91 | 0·95 | 0·92 | 0·99 | 0·99
G1D | 0·95 | 0·97 | 0·95 | 0·96 | 0·99 | 0·99
G1E^c | 0·99 | — | 0·99 | — | 1·00 | —
Mean | 0·97 | 0·94 | 0·97 | 0·95 | 0·99 | 0·99
Difference (DE − Scale) | 0·029 | 0·029 | 0·0029
95% CI^d | 0·005 to 0·059 | 0·005 to 0·055 | −0·0002 to 0·007

^a ρc = Lin's concordance coefficient (Lin, 1989), r = Pearson's correlation coefficient and Cb = bias coefficient.
^b Data obtained with the interval scale were transformed to a percentage before computing the concordance statistics.
^c This rater did not use the interval scale.
^d 95% confidence interval of the mean difference based on 2000 random bootstrap samples. Intervals including zero indicate the two statistics are equivalent at α = 0·05.
Figure 1.

 Estimated disease severity for the first and second assessments of the first set of leaves (n = 103) rated for bacterial spot severity using the direct estimation (DE) method and the interval scale (scale) by the four experienced raters. The line of concordance is indicated by the solid line, while the best-fit line of the actual relationship between the two assessments is represented by the broken line. Points in each graph overlap. The correlation coefficient, r, is a measure of precision, and Lin's concordance coefficient, ρc, measures accuracy by combining the effects of precision (r) and bias, Cb.

Inter-rater reliability was also higher for DE compared to the interval scale. Based on data from the four experienced raters who used the two methods, the intra-class correlation coefficient (ρ) ranged from 0·92 to 0·96 for DE, while estimates obtained using the interval scale after transformation to a percentage ranged from 0·84 to 0·92. Mean values of Kendall's coefficient of concordance (W) were 0·98 and 0·95 for DE and the interval scale, respectively. Both the non-parametric K-S test (P = 0·002) and the bootstrap analysis confirmed that the estimates of both ρ and W for the DE method were significantly greater than those for the interval scale (Table 2).

Table 2. Analysis of the intra-class correlation coefficient and Kendall's coefficient of concordance comparing inter-rater reliability among four experienced raters using direct estimation (DE) or the 1–7 interval scale (SCALE) to estimate severity of bacterial spot on peach and nectarine leaves

Statistic | Intra-class correlation coefficient (ρ)^a | Kendall's coefficient of concordance (W)^b
Mean, DE | 0·943 | 0·981
Mean, SCALE | 0·870 | 0·952
Mean difference | 0·073 | 0·028
SED | 0·016 | 0·007
95% CI^c | 0·043–0·104 | 0·016–0·043
K-S test (χ²)^d | 12·0 | 12·0
P-value for K-S test | 0·002 | 0·002

^a Based on variance components obtained with the varcomp procedure of sas; scale data were transformed to percentages before computing the variance components.
^b Kendall's coefficient of concordance (W) computed with adjustment for ties (Kendall & Babington Smith, 1939). Data obtained with the scale were analysed without transformation to percentages.
^c 95% confidence interval for the mean difference based on 2000 random bootstrap samples of six values corresponding to two ratings of each of the three sets of leaves. Intervals including zero indicate no difference at α = 0·05.
^d Kolmogorov–Smirnov two-sample test χ² based on concordance values computed for each of two ratings of each set of leaves.

Accuracy

Based on Lin's concordance analysis, the ρc values measuring accuracy were generally high for all six assessments (two assessments for each set of leaves, made by the experienced raters) and for both methods of assessment (Table 3; Fig. 2). However, accuracy was generally lower than reliability regardless of the estimation method (Tables 1 & 3). For all six assessments, the ρc values for DE, ranging from 0·83 to 0·98, were greater than those for the interval scale, which ranged from 0·65 to 0·92. The precision (r) and bias (Cb) parameters were also higher for DE than for the interval scale, although r was significantly greater only for the first and third leaf sets, not the second. Evaluating data obtained with the scale without transformation (i.e. comparing scale estimates to the actual category for that disease severity) improved the accuracy statistics for the scale, but not sufficiently to equate them to those of DE based on comparable statistics (ρc, r and Cb; Table 4). For example, values of ρc and τ for the DE method were not statistically greater than those for the non-transformed interval scale data (P = 0·053 and P = 0·111 for ρc and τ, respectively; Table 4). However, the mean Pearson correlation coefficient (r) was 0·94, significantly higher than the 0·92 calculated for the same data following transformation to a percentage scale, but still significantly lower than the 0·97 obtained for DE (Tables 3 & 4). Similarly, the average bias coefficient (Cb) of interval scale estimates increased from 0·95 to 0·98 in the categorical data analysis but remained significantly less than that of DE (Tables 3 & 4).

Table 3. Concordance statistics comparing the accuracy of direct estimation (DE) and the 1–7 interval scale^a (SCALE) methods for estimating severity of bacterial spot of peach and nectarine based on assessments carried out by six experienced raters

Second set of leaves (n = 103)

Rater | ρc 1st DE | ρc 1st SCALE | ρc 2nd DE | ρc 2nd SCALE | r 1st DE | r 1st SCALE | r 2nd DE | r 2nd SCALE | Cb 1st DE | Cb 1st SCALE | Cb 2nd DE | Cb 2nd SCALE
G1A | 0·98 | 0·65 | 0·98 | 0·76 | 0·98 | 0·71 | 0·98 | 0·81 | 1·00 | 0·91 | 1·00 | 0·93
G1B | 0·92 | 0·92 | 0·96 | 0·91 | 0·93 | 0·98 | 0·96 | 0·95 | 0·98 | 0·94 | 1·00 | 0·96
G1C | 0·96 | 0·85 | 0·96 | 0·92 | 0·97 | 0·95 | 0·96 | 0·97 | 0·99 | 0·89 | 0·99 | 0·95
G1D | 0·93 | 0·87 | 0·92 | 0·86 | 0·95 | 0·90 | 0·95 | 0·89 | 0·98 | 0·96 | 0·98 | 0·97
G1E^b | 0·89 | — | 0·89 | — | 0·92 | — | 0·92 | — | 0·97 | — | 0·97 | —
G1F^b | 0·83 | — | 0·89 | — | 0·86 | — | 0·90 | — | 0·97 | — | 0·98 | —
Mean | 0·92 | 0·82 | 0·93 | 0·86 | 0·93 | 0·88 | 0·95 | 0·91 | 0·98 | 0·93 | 0·98 | 0·95
Difference^c (DE − SCALE): ρc 0·097 (1st rating), 0·068 (2nd); r 0·050 (1st), 0·040 (2nd); Cb 0·054 (1st), 0·032 (2nd)
95% CI^d: ρc 0·004 to 0·208 (1st), 0·007 to 0·141 (2nd); r −0·041 to 0·169 (1st), −0·018 to 0·108 (2nd); Cb 0·027 to 0·081 (1st), 0·015 to 0·049 (2nd)

Third set of leaves (n = 104)

Rater | ρc 1st DE | ρc 1st SCALE | ρc 2nd DE | ρc 2nd SCALE | r 1st DE | r 1st SCALE | r 2nd DE | r 2nd SCALE | Cb 1st DE | Cb 1st SCALE | Cb 2nd DE | Cb 2nd SCALE
G1A | 0·98 | 0·88 | 0·98 | 0·86 | 0·99 | 0·91 | 0·98 | 0·90 | 1·00 | 0·98 | 0·99 | 0·96
G1B | 0·95 | 0·89 | 0·96 | 0·88 | 0·95 | 0·96 | 0·97 | 0·93 | 0·99 | 0·93 | 0·99 | 0·94
G1C | 0·91 | 0·82 | 0·94 | 0·88 | 0·93 | 0·92 | 0·96 | 0·96 | 0·98 | 0·89 | 0·98 | 0·92
G1D | 0·97 | 0·87 | 0·93 | 0·88 | 0·97 | 0·90 | 0·94 | 0·90 | 1·00 | 0·97 | 0·99 | 0·98
G1E^b | 0·91 | — | 0·93 | — | 0·94 | — | 0·95 | — | 0·97 | — | 0·98 | —
G1F^b | 0·95 | — | 0·96 | — | 0·95 | — | 0·96 | — | 1·00 | — | 0·99 | —
Mean | 0·94 | 0·87 | 0·95 | 0·87 | 0·96 | 0·92 | 0·96 | 0·92 | 0·99 | 0·94 | 0·99 | 0·95
Difference^c (DE − SCALE): ρc 0·080 (1st), 0·075 (2nd); r 0·035 (1st), 0·036 (2nd); Cb 0·049 (1st), 0·039 (2nd)
95% CI^d: ρc 0·047 to 0·119 (1st), 0·058 to 0·092 (2nd); r 0·007 to 0·060 (1st), 0·007 to 0·062 (2nd); Cb 0·016 to 0·086 (1st), 0·017 to 0·060 (2nd)

^a Data obtained with the interval scale were transformed to percentage before computing the concordance statistics.
^b This rater did not use the 1–7 interval scale.
^c Mean difference between the estimates for DE and the interval scale based on 2000 random bootstrap samples.
^d 95% confidence interval for the mean difference of DE and scale estimates within a rating based on 2000 random bootstrap samples. Intervals including zero indicate no difference at α = 0·05.
Figure 2.

 Relationship between estimated bacterial spot severity based on direct visual estimation or the rating scale and actual severity obtained with APS assess software for the first (●) and second (○) assessments of the first set of leaves by the experienced raters. The solid line indicates the line of concordance and the broken line is the best-fit line of the relationship between the two variables. Points in each graph overlap. The correlation coefficient, r, is a measure of precision, and Lin's concordance coefficient, ρc, measures accuracy by combining the effects of precision, r, and bias, Cb.

Table 4. Statistical comparisons of concordance statistics measuring rater accuracy when using direct estimation (DE)^a or the 1–7 interval scale (SCALE) where data obtained with the scale were analysed without transformation to percentage

Test statistic | ρc^b | Cb^b | r^b | rS^b | τ^b
Mean, DE | 0·95 | 0·99 | 0·97 | 0·98 | 0·89
Mean, SCALE | 0·92 | 0·98 | 0·94 | 0·94 | 0·88
Mean difference | 0·031 | 0·0119 | 0·033 | 0·037 | 0·013
SED^c | 0·016 | 0·003 | 0·006 | 0·006 | 0·008
Bootstrap 95% CI^c | −0·003 to 0·057 | 0·006 to 0·018 | 0·023 to 0·044 | 0·026 to 0·050 | −0·001 to 0·031
Permutation test
 95% CI^d | −0·0004 to 0·063 | 0·006 to 0·018 | 0·022 to 0·044 | 0·024 to 0·050 | −0·003 to 0·030
 T (df)^d | 2·02 (29·1) | 3·91 (34·5) | 5·91 (46·0) | 5·75 (25·6) | 1·64 (29·8)
 P^d | 0·053 | <0·001 | <0·001 | <0·001 | 0·111
K-S test^e
 χ² (2 df) | 21·33 | 14·08 | 21·33 | 40·33 | 4·08
 P | <0·001 | <0·001 | <0·001 | <0·001 | 0·130

^a Percentage data obtained with the direct estimation method were converted to the 1–7 interval scale before the analysis.
^b ρc = Lin's concordance coefficient (Lin, 1989), r = Pearson's correlation coefficient, Cb = bias coefficient, rS = Spearman's rank correlation, and τ = Kendall's rank correlation coefficient without adjustment for ties.
^c Standard error of difference (SED), and 95% confidence interval (95% CI) of the difference between the mean of DE and SCALE statistics based on 2000 random bootstrap samples. Intervals including zero indicate no difference at α = 0·05.
^d 95% confidence interval (CI), Student's t-statistic (T) and its approximate degrees of freedom (df) based on a permutation test of all possible combinations, and probability (P) for the value of T. Intervals including zero indicate no difference at α = 0·05.
^e Kolmogorov–Smirnov two-sample test and χ² statistic for the hypothesis that the DE and SCALE values are drawn from the same distribution.

Effects of rater experience, instruction and training

Experienced raters had the highest levels of both intra- and inter-rater reliability among the three cohorts (Table 5). Intra- and inter-rater reliability among the experienced raters was not statistically higher than that among inexperienced raters in group II, although it was greater than that obtained for group III, even after the latter raters had received both training and instruction on how to assess bacterial spot symptoms. Intra-rater reliability among inexperienced raters in group II, while generally higher, was not statistically greater than that in group III after instruction (AI). Effects of rater training and instruction are illustrated as changes in intra-rater reliability and accuracy for members of group III as they progressed through the training and instruction steps, each followed by a rating exercise (Tables 5 & 6). Both estimates of intra-rater reliability (ρc and r) were lowest for disease estimates obtained before training (BT), increased significantly after training on distrain (AT), and increased again following instruction (AI) on how to rate disease symptoms (Table 5). Likewise, inter-rater reliability measured with Kendall's coefficient of concordance (W) increased significantly from 0·73 before training to 0·77 after training and 0·89 after instruction (Table 5). In contrast, values of ρ, which were much lower than those of the inexperienced group II raters, decreased slightly from 0·29 to 0·25 after training but increased dramatically to 0·85 after instruction. Across all statistics used, the largest improvement in inter- and intra-rater reliability occurred following instruction on disease symptoms (Table 5).

Table 5. Effect of rater experience, instruction on disease symptoms, and training with a computer simulated program on intra- and inter-rater reliability of visual estimates of bacterial spot severity on peach and nectarine leaves obtained with the direct estimation method

Bacterial spot rating experience^a | ρc^b | 95% CI | r^b | 95% CI | ρ^c | 95% CI | W^c | 95% CI
Experienced (I) | 0·98 a | 0·954–0·985 | 0·98 a | 0·969–0·993 | 0·94 a | 0·933–0·953 | 0·98 a | 0·974–0·986
Inexperienced (II) | 0·97 ab | 0·954–0·985 | 0·97 a | 0·962–0·987 | 0·95 a | 0·939–0·958 | 0·97 b | 0·967–0·978
Group III:
 After instruction | 0·96 b | 0·942–0·973 | 0·96 ab | 0·949–0·973 | 0·85 b | 0·843–0·863 | 0·89 c | 0·886–0·897
 After training | 0·88 c | 0·867–0·898 | 0·90 c | 0·886–0·910 | 0·25 d | 0·237–0·257 | 0·77 d | 0·761–0·772
 Before training | 0·76 d | 0·743–0·774 | 0·81 d | 0·795–0·820 | 0·29 c | 0·281–0·301 | 0·73 e | 0·727–0·738

^a There were three cohorts of raters (groups I, II and III) and group III carried out three independent ratings of the same sets of leaves (before any training, following a 15 min training on distrain, and finally following a 15 min instruction about bacterial spot symptoms from an experienced rater).
^b Intra-rater reliability: ρc = Lin's concordance coefficient (Lin, 1989), r = Pearson's correlation coefficient, and 95% confidence intervals (95% CI) based on 2000 random bootstrap samples from values obtained from statistics of three sets of leaves each rated independently by individual raters. Means within a column followed by the same letters are not significantly different (α = 0·05).
^c Inter-rater reliability: ρ = intra-class correlation coefficient based on the variance components method, and W = Kendall's coefficient of concordance (Kendall & Babington Smith, 1939) and their respective 95% confidence intervals (95% CI). Values within a column followed by the same letters are not significantly different (α = 0·05).
Table 6. Effect of rater instruction on disease symptoms and training with a computer simulated program on Lin's concordance statistics assessing the accuracy of visual estimates of bacterial spot severity on peach and nectarine leaves obtained with the direct estimation method

Bacterial spot rating experience^a | ρc^b | 95% CI | r^b | 95% CI | Cb^b | 95% CI
Experienced (group I) | 0·94 a | 0·907–0·975 | 0·95 a | 0·923–0·983 | 0·99 a | 0·957–1·017
Inexperienced 2010 (group II) | 0·91 ab | 0·876–0·945 | 0·93 ab | 0·903–0·963 | 0·98 a | 0·946–1·005
Inexperienced 2011 (group III):
 After instruction | 0·85 b | 0·816–0·884 | 0·89 b | 0·858–0·918 | 0·92 b | 0·915–0·921
 After training | 0·32 d | 0·282–0·351 | 0·49 d | 0·464–0·524 | 0·51 d | 0·503–0·535
 Before training | 0·55 c | 0·532–0·600 | 0·625 c | 0·609–0·669 | 0·78 c | 0·772–0·797

^a There were three cohorts of raters (groups I, II and III) and group III carried out three independent ratings of the same sets of leaves (before any training, following a 15 min training on distrain, and finally following a 15 min instruction about bacterial spot symptoms from an experienced rater).
^b ρc = Lin's concordance coefficient (Lin, 1989), r = Pearson's correlation coefficient, and Cb = bias coefficient. Values are means and 95% confidence intervals (95% CI) based on 2000 random bootstrap samples from values obtained from statistics of three sets of leaves each rated independently by individual raters with the direct estimation method compared with actual disease estimates obtained with APS assess software. Means within a column followed by the same letters are not significantly different (α = 0·05).

Regardless of rater experience, instruction or training, accuracy statistics were generally lower than intra-rater reliability across all the cohorts, with mean ρc values for accuracy ranging from 0·32 to 0·94 (Table 6; Figs 3–5) compared to 0·76 to 0·98 for reliability (Table 5). As with reliability, the experienced raters were also the most accurate, although mean estimates of ρc, r and Cb for this group were not statistically greater than those for inexperienced raters in group II (Table 6; Figs 3 & 4). Estimates of disease severity for individual leaves by group I and group II raters were randomly and closely scattered about the line of actual values, even though the group II raters tended to overestimate disease in the mid-severity ranges (40–80%; Figs 3 & 4). In contrast, group III raters were highly inaccurate before they received instruction from an experienced rater (Fig. 5). Mean values of ρc, r and Cb measuring rater accuracy among group III raters declined after training with the computer simulated program, before increasing substantially following instruction on how to rate bacterial spot from an experienced rater (Table 6). For example, ρc decreased from 0·55 before training to 0·32 after training, and then increased to 0·85 following instruction (Table 6; Fig. 5). Plots of disease severity estimates alongside actual severity values for group III revealed that low accuracy resulted from the raters underestimating disease severity on leaves with high levels of disease (Fig. 5). Upon receiving instruction, members of group III improved their accuracy but also tended to overestimate disease in the mid-severity ranges, akin to members of group II (Fig. 5). Moreover, plots of the difference in values of ρc or r obtained after instruction against those obtained before instruction revealed a strong negative relationship, indicating that the increase in ρc and r was largest for those raters with the lowest values before training (Fig. 6). That is, improvement in rater accuracy was most notable for the raters that performed worst prior to receiving instruction.

Figure 3.

 Estimated and actual disease severity values for bacterial spot of peach and nectarine for the first (●) and second (○) assessments of the first set of leaves (n = 103) rated by four experienced raters using the direct estimation method. Each point represents the estimated severity for a single leaf, while the line represents the actual disease severity for the same leaf based on APS assess computer image analysis. Lin's concordance coefficient (ρc), a measure of accuracy, and r, a measure of precision, are included for each assessment.

Figure 4.

 Estimated and actual disease severity values for bacterial spot of peach and nectarine for the first (●) and second (○) assessments of the first set of leaves (n = 103) rated by six inexperienced group II raters using the direct estimation method. Each point represents the estimated severity for a single leaf, while the line represents the actual disease severity for the same leaf based on APS assess computer image analysis. Lin's concordance coefficient (ρc), a measure of accuracy, and r, a measure of precision, are included for each assessment.

Figure 5.

 Estimated and actual disease severity values for bacterial spot of peach and nectarine for the first (●) and second (○) assessments of the first set of leaves (n = 103) rated by four inexperienced (group III) raters before and after training and instruction on bacterial spot symptoms. Each point represents the estimated severity of a single leaf based on direct estimation, while the line represents the actual disease severity for the same leaf based on APS assess image analysis. Many points on each graph overlap. Lin's concordance coefficient (ρc), a measure of accuracy, and r, a measure of precision, are included for each assessment.

Figure 6.

 Relationship between Lin's concordance and Pearson's correlation coefficients measuring intra-rater reliability (●) and accuracy (○) of four inexperienced (group III) raters before (BT) and after training and instruction (AI). Each point represents the difference in value of the coefficient obtained before and after instruction (i.e. AI − BT) plotted against the value obtained before instruction (BT) for each of the three sets of leaves evaluated by four raters; values may overlap.

Discussion

Estimates of severity for bacterial leaf spot diseases are commonly obtained with disease rating scales, presumably because lesions coalesce, making counts impractical. For bacterial spot of stone fruits, leaves with >50% diseased area are generally uncommon (Battilani et al., 1999) or often abscise, making impractical the use of the Horsfall–Barratt (H-B) scale, which spans 0% to 100% (Horsfall & Barratt, 1945). Thus, scales with fewer categories and lower maximum disease levels have been used (Zehr et al., 1996; Battilani et al., 1999; Citadin et al., 2008). The 1–7 interval scale used here was created to allow for smaller intervals at severity levels <45% and to terminate at the point where leaves would abscise. The resulting scale was similar to those of Battilani et al. (1999) and Zehr et al. (1996) and, while non-linear, could be linearized with a natural logarithm transformation.

Any new method of disease measurement should be evaluated for accuracy and reliability, and for compliance with a 'standard' (Bland & Altman, 1999). However, previous phytopathological studies comparing different disease assessment methods or disease assessments from multiple raters did not provide equivalence tests, a key requirement for agreement studies (Yi et al., 2008). No statistical tests were used to evaluate the hypothesis of agreement in the coefficients estimated for the different rating methods or groups of raters. Instead, comparisons were based on subjective descriptions of the magnitude of coefficients (r and ρc; Nita et al., 2003; Bock et al., 2008, 2009b). In this study, two approaches were adopted to allow statistical evaluation of equivalency hypotheses: bootstrap and permutation-based confidence intervals, and non-parametric rank tests. In equivalence testing for agreement studies, the goal is to provide evidence that assessments obtained, for example, by the same group of assessors using two different rating methods are statistically the same. The null hypothesis, therefore, was that there was a difference in the concordance statistics (r, ρc, Cb, etc.) obtained with data from the two rating methods or the different groups of raters that lay outside the zone of scientific indifference. This is best demonstrated with the use of confidence intervals. The use of bootstrap analysis and permutation tests allowed computation of 95% confidence intervals around the mean differences of the concordance statistics for the different sets of data being compared, without making assumptions about the unknown distributions of the concordance statistics. In general, the results of hypothesis tests based on the bootstrap and permutation tests were congruent with those of the non-parametric K-S test when used for the comparison of two sets of parameters. However, the bootstrap analysis could easily be extended to compare multiple groups, as illustrated in the comparison of different cohorts of raters.

Based on bootstrap and permutation t-test analyses, the results showed that the Pearson's correlation coefficient (r) and Lin's concordance coefficient (ρc) statistics assessing intra-rater reliability were higher for the data obtained with the DE method than for the interval scale. Comparisons of Kendall's coefficient of concordance (W) and the intra-class correlation coefficient (ρ) showed that inter-rater reliability was also higher for data obtained with DE than with the interval scale. A bootstrap analysis also confirmed that ρc and r values measuring accuracy were significantly greater for the DE method than for the interval scale. The DE method was, therefore, judged to be more reliable and more accurate than the interval scale as a means of measuring bacterial spot disease severity based on data analysed as continuous variables.

To facilitate comparison of the interval scale with the DE method, the commonly used procedure of midpoints conversion was used (Campbell & Madden, 1990; Nita et al., 2003; Madden et al., 2007; Bock et al., 2008). When actual disease severity is similar to or the same as the scale midpoint, the scale is accurate; if actual disease is closer to the scale interval boundaries, accuracy is substantially reduced (Bock et al., 2009b, 2010b). Thus the bias will be greater for scales with fewer intervals, so it was no surprise that the DE method was consistently more accurate and reliable compared to the rating scale, in agreement with other studies (e.g. Nita et al., 2003; Bock et al., 2009b, 2010a,b).

Once the introduction of bias through the use of midpoints is acknowledged, the question arises of whether interval scales should be judged against actual values on a continuous scale. This question was examined by evaluating the accuracy and reliability of the interval scale against the 'actual' scale values converted from the image-analysis percentage severity for each leaf, using methods suitable for categorical concordance analysis (Spearman's and Kendall's rank correlation coefficients, rs and τ) that do not require conversion of the scale data to a percentage. The τ statistic has a direct interpretation in that it assesses the probability that a single observational unit (in this case a diseased leaf) is assigned the same score in two assessments (Newson, 2002). When assessing accuracy in the data here, this means the rater assigns the leaf to the correct severity category relative to actual values converted to the interval scale. The analysis of agreement for categorical variables resulted in improved levels of reliability and accuracy for the non-transformed data obtained with the interval scale (Table 4). Indeed, for the four experienced raters who used both methods, both the bootstrap analysis and a permutation t-test indicated no differences in the ρc and τ statistics for accuracy between the methods, confirming the loss of accuracy for the scale during the midpoint transformation (Table 4). Presumably, this improvement results from the rater having fewer categories or values to choose from, which is an advantage of interval scales over DE based on a continuous variable (Campbell & Madden, 1990). However, the bootstrap analysis comparing accuracy statistics also showed that precision (r and rs) was higher for the DE method than the interval scale, with or without conversion to a percentage, indicating that overall DE had significantly greater accuracy than the interval scale. Forbes & Korva (1994) arrived at a similar conclusion after converting DE data on potato late blight severity to the H-B scale in order to compare the two rating methods.

The DE method is a more suitable method to assess disease because many plant disease models assume disease severity data are continuous variables, despite some statistical methods for categorical data analysis being available (e.g. Stokes et al., 2000; Agresti, 2002). Data obtained from interval scales must be converted into percentages for use in these models, thereby introducing systematic error (Forbes & Jeger, 1987). Thus, because comparable levels of reliability and accuracy cannot be retained in data obtained with interval scales after midpoint transformations, estimates based on DE methods are ultimately superior to those obtained with the scale. Nevertheless, where the objective for disease assessments precludes analysis techniques that necessitate conversion of data from an interval to continuous scale, estimates based on the interval scale may be a sufficient and faster method for obtaining data. An example of this would be data intended to rank treatments such as germplasm being evaluated for susceptibility to bacterial spot in breeding programmes.

In numerous studies, rater experience, familiarity with symptoms and training in disease assessment were found to influence the reliability and accuracy of disease estimates. In general, estimates made by experienced raters were more accurate and reliable than those made by inexperienced raters (Sherwood et al., 1983; Newton & Hackett, 1994; Nita et al., 2003; Bock et al., 2009b). For example, Bock et al. (2009b) found that experienced plant pathologists were more precise than people with little or no prior experience in disease severity assessment. The results here concur with these studies. More importantly, however, the data also indicate that: (i) accuracy was affected more by rater experience and intrinsic ability than was reliability; and (ii) instruction about how to evaluate disease symptoms resulted in the largest improvement in performance of inexperienced raters. Indeed, such was the improvement in rater performance following instruction on how to assess disease symptoms that inexperienced raters in group II performed as well as the experienced raters (group I) in all comparisons except inter-rater reliability evaluated with Kendall's coefficient of concordance (W) (Table 5; Fig. 4).

The lack of differences between experienced (group I) and inexperienced (group II) raters in the accuracy and reliability of disease severity estimates was intriguing, given the evidence from previous studies that rater experience is an important factor in the quality and consistency of disease assessments (Sherwood et al., 1983; Newton & Hackett, 1994; Parker et al., 1995; Nita et al., 2003; Bock et al., 2009b). A potential explanation for this observation is that inexperienced raters in this study were given training on visual disease assessment using the computer program distrain (Tomerlin & Howell, 1988), as well as a detailed explanation of bacterial spot symptoms, prior to commencing disease assessments. Indeed, analysis of data from the different stages of training group III raters revealed that both computer training and instruction on disease symptoms contributed significantly to the improvement in rater accuracy and reliability, with inter-rater reliability benefiting most from the instruction (Table 5). There was a surprising reduction in accuracy for group III raters following training with the computer simulated program, probably because the lesions simulated by distrain do not sufficiently mimic those of bacterial spot. This result suggests that training on computer programs and instruction from experienced raters can affect rater performance and the quality of severity estimates differently, which warrants further investigation. The data also suggest that intrinsically poor raters gained most from computer training and instruction on disease symptoms, presumably because they had the greatest room for improvement (Fig. 6).

With or without experience, raters commonly vary in their ability to assess disease severity (Sherwood et al., 1983; Newton & Hackett, 1994; Nita et al., 2003; Bock et al., 2008). This might explain why group III raters never quite attained the level of accuracy and reliability observed for group II raters, even after instruction. Besides rater ability and experience, variability may be due to several factors including, but not limited to, value preference (Bock et al., 2008) and gender, but these factors were not investigated. Taken together, the results indicate that estimates of bacterial spot severity on peach and nectarine leaves may be made by multiple raters with varying levels of experience without loss of accuracy and reliability when using the DE method, provided that the raters receive sufficient instruction and/or training on disease symptoms before commencing the assessment.

Acknowledgements

This work was partially supported by the USDA through the South-Eastern Regional IPM Grant Program, and by Pennsylvania fruit growers. The authors thank B. L. Lehman for technical support, and all the raters who participated in the study.
