Development and validation of standard area diagrams to aid assessment of pecan scab symptoms on fruit

Authors


E-mail: clive.bock@ars.usda.gov

Abstract

Pecan scab (Fusicladium effusum) causes losses of pecan nutmeat yield and quality in the southeastern United States. Disease assessment relies on visual rating, which can be inaccurate and imprecise, with poor inter-rater reliability. A standard area diagram (SAD) set for pecan scab on fruit valves was developed. A set of 40 images of diseased fruit valves with known severity was assessed twice by 23 raters. The first assessment was conducted without SADs, and the second assessment was made using the SADs as an aid. SADs improved rater accuracy (correction factor, Cb = 0·86 and 0·97, without and with SADs, respectively) and agreement (Lin’s concordance correlation coefficient, ρc = 0·79 and 0·89 without and with SADs, respectively) with true values. SADs improved inter-rater reliability (intra-class correlation coefficient, ρ = 0·77 and 0·96 without and with SADs, respectively). The least accurate and precise raters without SADs improved more using SADs compared to the most accurate and precise raters. Experienced raters had significantly higher accuracy and precision compared to inexperienced raters, but only when unaided by the SAD set. There was no significant difference in time to assess images without SADs, but experienced raters using SADs were faster compared to inexperienced raters. There was a slight tendency for faster raters to assess more slowly, and slower raters to assess faster when using SADs. SADs improve rater estimates of pecan scab severity on fruit, and this SAD set should be useful for assessment where greater precision, accuracy and inter-rater reliability are required.

Introduction

Pecan (Carya illinoinensis) is grown commercially throughout the southern United States (Wood, 2001, 2003). The pecan nut is nutritious and of high value, and is gaining popularity (Lin et al., 2001; O’Neil et al., 2010). Pecan scab, caused by the pathogenic fungus Fusicladium effusum, is the most serious disease reducing yield of pecan in the southeastern and south-central US. Infection causes discrete dark grey to black lesions that can develop on shoots, leaves or fruit. Lesions are initially small, but can expand and coalesce to cover large surface areas of an organ. The most economically damaging effect is typically on fruit, where disease can interfere with assimilate movement to the fruit, resulting in reduced nut size, poor kernel filling, or even total crop loss due to fruit abortion (Gottwald & Bertrand, 1983; Stevenson & Bertrand, 2001).

Estimates of disease severity on a percentage scale from 0 to 100% are needed for several purposes, including estimating crop loss relative to disease severity, comparing treatments in experiments, comparing relative susceptibility of cultivars, studying pathogen epidemiology, predicting disease development, and in surveys. Accuracy (closeness of estimated values to the true values; Nutter et al., 1991; Everitt, 1998) and reliability (extent to which the same estimate obtained under different conditions yields similar results; Nutter et al., 1991; Everitt, 1998) of disease severity estimates are needed if data are to provide realistic representation of true disease value. When comparing estimates to true values, the concept of agreement is defined as the product of precision (defined as variability in the estimates) and accuracy (Madden et al., 2007).

Visual estimation is probably the most widely used method to estimate disease severity and is recommended for estimating severity on pecan fruit (Bertrand et al., 1999). While technologically sophisticated methods are available, they are not as widely used (Nilsson, 1995; Bock et al., 2010a). Disease assessment by individual raters can vary greatly (Nutter et al., 1993; Nita et al., 2003; Bock et al., 2009a), and several sources of error are known to exist among raters (Sherwood et al., 1983; Forbes & Jeger, 1987; Hock et al., 1992; Nita et al., 2003; Bock et al., 2010a). Thus, raters should be aware of these sources of error and take appropriate measures to maximize accuracy and reliability. Furthermore, raters should be thoroughly trained in recognizing and delineating diseased areas from healthy tissue. To this end, various methods are used to improve agreement of visual assessments of disease with true values. These include use of computer programs for assessment training, leaf grids and standard area diagrams (SADs; James, 1971; Nutter & Schultz, 1995; Parker et al., 1995; Nutter & Litwiller, 1998).

Standard area diagrams were initially developed to improve the accuracy and reliability of disease assessment (Dixon & Dodson, 1971; James, 1971; Bock et al., 2010a), and have been developed and demonstrated to help with assessment of several diseases on different crops (Stonehouse, 1994; Godoy et al., 1997; Nutter et al., 1998; Leite & Amorin, 2002; Vereijssen et al., 2003; Gomes et al., 2004; Pethybridge et al., 2004; Belasque et al., 2005; Michereff et al., 2006; Correa et al., 2009; Spolti et al., 2011). A SAD consists of an image depicting a unit of assessment, usually a plant part such as a leaf, leaflet or fruit. Images are most often black and white, so that the healthy and diseased tissues are represented by two contrasting colours (James, 1971; Stonehouse, 1994; Godoy et al., 1997; Leite & Amorin, 2002; Michereff et al., 2006), although Nutter et al. (2006) also describe using the disease severity program, severity.pro, to print out colour images as SADs. The percentage area that is diseased is accurately determined for each diagram in a SAD set and displayed as the actual value for reference (Stonehouse, 1994; Godoy et al., 1997; Leite & Amorin, 2002; Belasque et al., 2005; Michereff et al., 2006; Spolti et al., 2011). Raters are guided to use the SADs as a means to estimate disease severity on sample leaves to the nearest percent (not to categorize disease at the severities indicated by the SADs, which has been done in some studies). The diagrams are generally designed such that they represent the range of commonly observed disease severities, although the number of diagrams between SAD sets varies (Stonehouse, 1994; Godoy et al., 1997; Vereijssen et al., 2003; Gomes et al., 2004; Michereff et al., 2006; Spolti et al., 2011). Correa et al. (2009) compared three different SADs for assessing late blight and found that a SAD set with six diagrams was as good as one with eight; however, raters in that study rated severity on each leaf as the severity of the perceived closest diagram, and did not estimate disease severity to the nearest percent.

Currently, no SADs with specific disease severities have been developed for pecan scab, although interval-based logarithmic scales with examples of severity for each category do exist (Hunter & Roberts, 1978; Bertrand et al., 1999). Logarithmic-based category scales have been shown to be of dubious value as assessment aids (Nita et al., 2003; Nutter & Esker, 2006; Bock et al., 2010b). A SAD set for pecan scab on fruit valves illustrating specific disease severities will be useful as an assessment aid, but as with other SAD sets, will require validation and testing.

Experiments typically require timely assessment of hundreds or even thousands of samples, so even if SADs caused a slight time loss during assessment, this might be perceived as a significant drawback to their use. There is no information on the time taken to assess disease without or with SADs for any disease, although time has been compared for other disease assessment techniques (Nutter et al., 1993; Martin & Rybicki, 1998; Bock et al., 2009b). There are claims that SADs speed up assessments (Stonehouse, 1994; Godoy et al., 2006; Correa et al., 2009), but as no timed studies were done, these claims are not substantiated. If there were to be a time loss due to the use of SADs, they might be less attractive for certain purposes due to this inefficiency, despite previously demonstrated improvements in agreement and reliability.

Furthermore, very few studies have addressed the effect of rater experience (Sherwood et al., 1983; Newton & Hackett, 1994; Nita et al., 2003; Bock et al., 2009a), although generally differences were found to exist, with more experienced raters being more precise. Comparing experienced and inexperienced raters with and without SADs will be useful for providing further information on whether there is a difference in rater ability with experience, what form this takes (bias or precision, or both) and whether any difference is retained even when SADs are used.

The objectives of this study were to: (i) develop a SAD set for assessing pecan scab severity on pecan fruit valves; (ii) compare the accuracy, precision, agreement and reliability of assessments of pecan scab without and with the SADs; (iii) compare time taken to assess disease without and with the SAD set; and (iv) compare accuracy, precision, agreement and reliability for experienced and inexperienced raters.

Materials and methods

Pecan fruit

Fruit was collected in October 2010 from cv. Desirable, from an orchard of mature trees at the United States Department of Agriculture, Agricultural Research Service farm in Byron, GA, USA. Pecan fruit typically comprise four valves separated by sutures running along ridges from the peduncle to the flower scar (Fig. 1); hence, fruits were carefully separated into individual valves, using a scalpel to tease the valves apart. All fruit were diseased with lesions of pecan scab, with individual fruit valves exhibiting <1 to >90% symptoms. Each valve was placed on a blue background and photographed with a digital camera (Sony Cybershot DSC-S700) mounted on a tripod, so that all images were taken from the same height and had a uniform scale.

Figure 1.

 Pecan fruit illustrating the valves separated by sutures (sutures run along the ridges seen in the fruit image). In this study fruits were carefully separated into individual valves with the aid of a scalpel.

Image analysis and estimation of true values

A total of 40 images were used in the analysis. Each image was rated using ASSESS v. 2.0 Image Analysis Software (APS Press) to determine the true percentage disease for each fruit valve. ASSESS calculates the percentage diseased area by allowing the user to threshold an area for the entire fruit valve, and then an area of disease lesions based on differences in hue, saturation and colour. A macro was created to convert the background of all images to blue (Lamari, 2008). The range and frequency of true disease severities are shown in Figure 2. The images were arranged randomly and compiled into individual booklets with eight images per sheet, each image representing the true size of the valves.
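The percentage calculation described above can be sketched in a few lines. This is only an illustration of the arithmetic (lesion pixels divided by valve pixels), not the ASSESS software itself: the function name `percent_diseased`, the label encoding (0, 1, 2) and the toy image are all hypothetical, and the thresholding step that would produce such labels is assumed to have been done already.

```python
def percent_diseased(labels):
    """Return lesion area as a percentage of the total valve area.

    `labels` is a 2-D list of per-pixel classes (hypothetical encoding):
    0 = background, 1 = healthy valve tissue, 2 = scab lesion.
    """
    valve_pixels = 0
    lesion_pixels = 0
    for row in labels:
        for value in row:
            if value in (1, 2):   # pixel belongs to the valve
                valve_pixels += 1
            if value == 2:        # pixel belongs to a lesion
                lesion_pixels += 1
    if valve_pixels == 0:
        raise ValueError("no valve pixels found")
    return 100.0 * lesion_pixels / valve_pixels


# Toy 'image': 12 valve pixels, 3 of which are lesion pixels.
toy_image = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 2, 2, 0],
    [0, 1, 1, 1, 2, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
severity = percent_diseased(toy_image)
print(f"{severity:.1f}% of the valve area is diseased")
```

Background pixels are excluded from the denominator, which is why the valve must be delineated before the lesions are.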

Figure 2.

 The frequency and range of disease severity in a set of 40 images of pecan valves used to compare agreement with true values and to compare inter- and intra-rater reliability of assessments by experienced and inexperienced raters without or with the use of standard area diagrams as an assessment aid.

Construction of the SAD set

The SAD set was constructed using 10 additional images (Fig. 3). Images were selected to represent a typical distribution of disease on the valve surface. Pecan scab can cover 100% of the area on a fruit valve, although as with the majority of diseases, scab is most frequently encountered at lower severities (Kranz, 1977; Correa et al., 2009). Thus the SAD set was constructed to reflect this range of disease, and eight images had disease in the range c. 1–50%, and two images had disease >55%. These images were processed with assess to delineate diseased and non-diseased areas. The images were then transferred to paint.net image editing software (http://www.getpaint.net/) and converted to black and white diagrams, with the black and white areas representing diseased and healthy tissue, respectively. The images were arranged on a single page from lowest percentage disease to highest.

Figure 3.

 Pecan fruit valve standard area diagram set to aid assessment of pecan scab. Percentage area diseased indicated by the black areas.

Validation of the SAD set

A group of 23 raters (comprising 13 ‘experienced’ and 10 ‘inexperienced’ raters) assessed the same set of images of scab-diseased pecan fruit valves twice. Raters were considered experienced if they had received previous formal training and practice in disease severity or pest damage assessment, and were familiar with the symptoms of pecan scab. Inexperienced raters had no formal training or familiarity with plant disease symptoms. The first assessment was done without the use of the SAD set. The disease symptoms of pecan scab were described to raters, who were instructed to provide an estimate of percentage diseased area for each image. From 1 to 6 months later, the same 23 raters were asked to assess the same set of images again. The baseline instructions remained the same. However, during this assessment, raters were also provided with the SAD set and instructed to use the SADs as a visual aid by comparing the unknown image to the diagram in the SAD set that most closely represented it, and then estimating the severity to the nearest percent (i.e. to use the diagram values as reference points to help with their estimation). Six of the most experienced raters repeated the assessment of the valve images, both with and without the use of SADs. Time taken to complete the assessments, demographic information such as age and gender, and assessment experience (described as having any prior experience in estimating areas of plant parts diseased or damaged by pests) were also recorded.

Data analysis

All analyses were performed either in SAS v. 9.2 (SAS Systems) or Excel 2007 (Microsoft Corp.). Precision, bias, agreement and inter- and intra-rater reliability were compared without and with the use of SADs. Lin’s concordance correlation coefficient (LCCC; Lin, 1989; Nita et al., 2003) provides a method to judge precision, bias and agreement with a true value. Lin’s concordance correlation (LCC) calculates and evaluates the degree to which pairs of observations fall on the concordance line of 45° (slope = 1, intercept = 0), and the LCCC (ρc) combines measures of bias and precision to assess fit to the line of concordance (45° = perfect concordance):

$$\rho_c = r \cdot C_b \qquad (1)$$

where Cb is a bias correction factor that measures how far the best-fit line deviates from 45° (a measure of accuracy), and r is the correlation coefficient between x and y, which measures the precision of the best fit line. Cb is derived from measures of bias:

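$$C_b = \frac{2}{\upsilon + 1/\upsilon + \mu^2} \qquad (2)$$

where $\upsilon = \sigma_x/\sigma_y$ and $\mu = (\bar{x} - \bar{y})/\sqrt{\sigma_x \sigma_y}$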

The coefficient υ defines the scale bias (or slope shift where 1 = perfect relation between x and y), and μ is the location bias (height shift) relative to the perfect relationship (0 = perfect relation between x and y), and σ is the standard deviation of x or y. Each of these statistics describing precision, bias and agreement was used in subsequent analyses. The frequency of each statistic was calculated with and without SADs. Absolute error (estimate minus true disease severity) and relative error (absolute error/true severity ×100) with and without SADs were calculated for all rater and pecan valve assessments.

For all 23 raters, regression analysis (PROC REG) was used to estimate inter-rater reliability without and with SADs based on the coefficient of determination (R2). Regression analysis was also used to measure intra-rater reliability for the subgroup of six of the experienced raters who repeated the assessment of the images with and without the use of SADs (the same inexperienced raters were not available to perform the repeat). In addition the intra-class correlation coefficient was calculated to measure inter-rater reliability without and with SADs by analysing each data set with a two-way random effects ANOVA (PROC ANOVA) as described by Nita et al. (2003). The intra-class correlation coefficient (ρ) was calculated using the variance components: $\rho = \sigma^2_{leaf}/(\sigma^2_{leaf} + \sigma^2_{rater} + \sigma^2_{error})$.
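As an illustration of how an intra-class correlation coefficient can be obtained from a two-way layout of valves by raters, the sketch below estimates variance components from the ANOVA mean squares. It assumes one estimate per valve per rater and the single-rater, absolute-agreement form of the coefficient (valve variance divided by the sum of valve, rater and error variances); whether this matches the exact form used in the study is an assumption. The function name and the toy ratings are hypothetical.

```python
def intraclass_correlation(ratings):
    """ratings[i][j] = severity estimate for valve i by rater j."""
    n = len(ratings)          # number of valves
    k = len(ratings[0])       # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]

    # Sums of squares for the two-way ANOVA without replication.
    ss_valve = k * sum((m - grand) ** 2 for m in row_means)
    ss_rater = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_valve - ss_rater

    ms_valve = ss_valve / (n - 1)
    ms_rater = ss_rater / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Variance components from expected mean squares (negative values set to 0).
    var_valve = max((ms_valve - ms_error) / k, 0.0)
    var_rater = max((ms_rater - ms_error) / n, 0.0)
    var_error = ms_error
    return var_valve / (var_valve + var_rater + var_error)


# Two hypothetical raters; the second always scores 10 points higher.
toy_ratings = [[0, 10], [10, 20], [20, 30], [30, 40]]
print(round(intraclass_correlation(toy_ratings), 3))
```

In the toy data the two raters rank the valves identically, yet the coefficient is well below 1 because the constant offset between raters counts as disagreement under this form of the statistic.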

Prior to analysis of precision, bias, agreement, time taken and intra-rater reliability without and with SADs, and for rater experience, a test of normality (PROC UNIVARIATE) was applied to these data. Based on the quantile–quantile plots and the value of the Shapiro–Wilk (SW) statistic, the null hypothesis of a normal distribution was not rejected for υ (SW = 0·93, P < WS = 0·1), ρc (SW = 0·94, P < WS = 0·2) or intra-rater reliability (SW = 0·87, P < WS = 0·2). However, the SW statistic did reject the null hypothesis of a normal distribution for μ (SW = 0·71, P < WS = 0·0001), r (SW = 0·84, P < WS = 0·0018), Cb (SW = 0·88, P < WS = 0·0124) and time (SW = 0·81, P < WS = 0·0006). Thus depending on the variable distribution, either a non-parametric or parametric test was used to explore differences in measures of precision, bias, agreement, inter-rater reliability and time taken without and with SADs, and for rater experience (H. K. Ngugi, personal communication). For the non-parametric analysis, a Kolmogorov–Smirnov (K-S) test was used to compare the two underlying distributions from the rater groups (null hypothesis, H0, being that the sample distributions are drawn from the same population). The K-S statistic was calculated using PROC NPAR1WAY specifying the empirical distribution function (EDF) option, and the probability (P) of a greater KSa was calculated (if the probability is ≤0·05, the distributions are considered different). For the parametric analysis (the normally distributed variables) a t-test (PROC TTEST) was used to compare the means of the two groups for each statistic. For all variables including inter-rater reliability, the difference between the means was calculated (with – without SADs) and an equivalence test (Yi et al., 2008; Bardsley & Ngugi, 2012) was used to calculate 95% confidence intervals (CIs) for each statistic by bootstrapping using the percentile method (with an equivalence test, the null hypothesis is the converse of H0, i.e. the null hypothesis is non-equivalence). A test of normality (PROC UNIVARIATE) for the difference between the means for each variable demonstrated that all were normally distributed (SW = 0·87–0·93, P < WS = 0·08–0·4) except for inter-rater reliability (SW = 0·93, P < WS = 0·0001). In all analyses 2000 balanced bootstrap samples were taken using PROC SURVEYSELECT/PROC UNIVARIATE. The 95% CIs were calculated on the difference between the groups, so if the CIs spanned zero, there was no significant difference (P = 0·05).
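The two resampling and distribution-based comparisons described above can be sketched as follows. This is an illustration under stated assumptions rather than the SAS procedures used in the study: the asymptotic statistic is taken as KSa = D·√(n1n2/(n1 + n2)) with its P-value from the standard Kolmogorov series, and the bootstrap is an ordinary (not balanced) percentile bootstrap on paired with-minus-without differences. Function names, the seed and the example values are hypothetical.

```python
import math
import random


def ks_asymptotic(sample_a, sample_b):
    """Return (D, KSa, P) for a two-sample Kolmogorov-Smirnov test."""
    n_a, n_b = len(sample_a), len(sample_b)
    d = 0.0
    for v in sorted(set(sample_a) | set(sample_b)):
        cdf_a = sum(1 for x in sample_a if x <= v) / n_a
        cdf_b = sum(1 for x in sample_b if x <= v) / n_b
        d = max(d, abs(cdf_a - cdf_b))     # largest gap between the EDFs
    ksa = d * math.sqrt(n_a * n_b / (n_a + n_b))
    if ksa < 0.2:
        return d, ksa, 1.0                 # series converges slowly; P is ~1 here
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * ksa * ksa)
                  for k in range(1, 101))
    return d, ksa, min(max(p, 0.0), 1.0)


def bootstrap_ci_mean_difference(with_aid, without_aid, n_boot=2000, seed=1):
    """Percentile 95% CI for the mean paired difference (with - without)."""
    rng = random.Random(seed)
    diffs = [w - wo for w, wo in zip(with_aid, without_aid)]
    n = len(diffs)
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_boot)
    )
    lower = means[int(0.025 * n_boot)]
    upper = means[int(0.975 * n_boot) - 1]
    return sum(diffs) / n, (lower, upper)


# Two groups of 23 values whose distributions differ by at most 12/23.
group_a = list(range(23))
group_b = list(range(12, 35))
d, ksa, p = ks_asymptotic(group_a, group_b)
print(round(d, 3), round(ksa, 2), round(p, 3))

# Hypothetical agreement values for six raters, without and with an aid.
without_aid = [0.70, 0.85, 0.60, 0.80, 0.90, 0.75]
with_aid = [0.90, 0.95, 0.88, 0.92, 0.97, 0.91]
mean_diff, (lo, hi) = bootstrap_ci_mean_difference(with_aid, without_aid)
print(round(mean_diff, 3), round(lo, 3), round(hi, 3))
```

With two groups of 23 raters, a maximum gap of 12/23 between the empirical distribution functions gives KSa of about 1·77 and P of about 0·004, the same magnitude as the values listed for Cb in Table 1. For the bootstrap, an interval that excludes zero corresponds to a significant difference at P = 0·05, following the rule stated in the text.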

The relationship between time taken for assessments and each LCC statistic for the 23 raters both without and with SADs was explored using regression analysis. To investigate the value of using SADs in relation to ability to assess disease without SADs (i.e. the effect of SADs on inherently accurate and precise raters compared to raters who performed poorly without SADs), a linear regression analysis was performed on the relationship between the LCCC value for the assessment without SADs and the difference between the value from the assessment with SADs minus the value from the assessment without SADs (i.e. with SADs – without SADs). The effect of SADs on time taken was analysed in the same way.
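The regression of the change due to SADs on the unaided value can be sketched with ordinary least squares. The helper `fit_line` and the five pairs of LCCC values are hypothetical and serve only to show the calculation; a negative slope indicates that raters with lower unaided values gained more from the SADs.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y on x: (slope, intercept, R squared)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical LCCC values for five raters.
without_sads = [0.55, 0.70, 0.80, 0.90, 0.95]
with_sads = [0.85, 0.88, 0.90, 0.93, 0.94]

# Response variable: change attributable to the aid (with SADs - without SADs).
gain = [w - wo for w, wo in zip(with_sads, without_sads)]
slope, intercept, r2 = fit_line(without_sads, gain)
print(round(slope, 3), round(intercept, 3), round(r2, 3))
```

Because the unaided value appears on both axes (as x and inside the difference), some negative association is expected even from measurement noise alone, so the slope is best read alongside the frequency distributions rather than on its own.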

Results

Effect of SADs

All bias and accuracy component statistics of Lin’s concordance correlation coefficient improved significantly as a result of using SADs to assess symptom severity of scab on pecan valves (Table 1). Although there was a slight numerical improvement in precision (r = 0·91 and 0·92 without and with SADs, respectively), it was not significant. Scale and location bias were both improved, and the combined estimate of accuracy (Cb) was closer to 1 when using SADs. Overall agreement with the true values measured by image analysis was greatly enhanced (ρc without SADs = 0·79, with SADs = 0·89). The bootstrap analysis, and the K-S test and t-test where applied, all indicated a significant improvement in each of these statistics except r. However, time taken to assess disease was not significantly different without or with SADs.

Table 1. Effect of standard area diagram (SAD) assessment aids on the bias, precision, agreement of, and time taken for assessments of pecan scab severity on 40 diseased valves of pecan fruit by 23 raters

| Variable | Mean^a, no SADs | Mean^a, with SADs | Difference^b between means | 95% CIs^c of the difference | KSa value (P)^d | t-value (P)^e |
|---|---|---|---|---|---|---|
| Scale (υ)^f | 1·70 (0·15) | 1·02 (0·10) | 0·149 (0·0008) | 0·10–0·20 | | 3·92 (0·0003) |
| Location (μ)^g | 0·45 (0·42) | 0·21 (0·14) | 0·242 (0·0026) | 0·08–0·42 | 1·62 (0·01) | |
| Coefficient of bias (Cb)^h | 0·86 (0·14) | 0·97 (0·02) | −0·11 (0·001) | −0·17–−0·05 | 1·77 (0·004) | |
| Correlation coefficient (r)^i | 0·91 (0·06) | 0·92 (0·03) | −0·011 (0·0003) | −0·03–0·01 | 0·59 (0·9) | |
| LCCC^j | 0·79 (0·17) | 0·89 (0·04) | −0·098 (0·001) | −0·17–−0·04 | | −2·69 (0·01) |
| Time (min)^k | 6·78 (3·02) | 7·91 (3·72) | 1·16 (0·030) | −0·70–3·00 | 0·74 (0·65) | |

a Standard deviation in parentheses.
b Mean of the difference between each rating, with standard errors in parentheses (bootstrap calculated value).
c Confidence intervals (CIs) were based on 2000 bootstrap samples. If the CIs embrace zero, the difference is not significant (P = 0·05).
d The Kolmogorov–Smirnov asymptotic test statistic (KSa) measures the difference in the cumulative distribution functions of the pairs of assessments, and the value in parentheses indicates the probability (P) of a greater KSa. If the probability is ≤0·05, the distributions can be considered different (non-parametric test for non-normal data).
e Student’s t-test statistic and P-value (parametric test for normally distributed data).
f Scale bias, or slope shift (υ, 1 = no bias relative to the concordance line).
g Location bias, or height shift (μ, 0 = no bias relative to the concordance line).
h The correction factor (Cb) measures how far the best-fit line deviates from 45° and is thus a measure of accuracy.
i The correlation coefficient (r) measures precision.
j Lin’s concordance correlation coefficient (ρc) combines both measures of precision (r) and accuracy (Cb) to measure agreement with the true value.
k Mean time taken to assess the 40 images.

Use of SADs increased the frequency of lower bias values (Fig. 4a,b). The frequency of improved accuracy, precision and agreement statistics increased with the use of SADs (Fig. 4c–e). Furthermore, the data demonstrate that raters with the poorest ability tended to improve the most, based on measures of bias, accuracy and precision (Fig. 5a–e). As would be expected, those raters who were characterized by low bias, high levels of accuracy, precision and agreement without SADs did not respond as much to using SADs as the raters who had the poorest assessment ability. Indeed, some of these better raters experienced a slight increase of bias, and slight loss of accuracy, precision and agreement. This demonstrates that SADs help standardize assessments from diverse raters. Overall there was a reduction in both absolute and relative error when using SADs (Fig. 6). Relative error was greatest at <20% scab severity.

Figure 4.

 The frequency of bias, precision and agreement values without and with use of standard area diagram (SAD) assessment aids by 23 raters who assessed 40 images of scab-diseased valves of pecan fruit. (a) Scale bias, (b) location bias, (c) correction factor, (d) correlation coefficient, (e) Lin’s concordance correlation coefficient.

Figure 5.

 The relationship between bias, precision and agreement without the use of standard area diagram (SAD) assessment aids and the difference (with SADs – no SADs) demonstrating raters with the least good scores benefitted the most for all variables. (a) Scale bias; (b) location bias; (c) bias correction factor; (d) correlation coefficient; (e) Lin’s concordance correlation coefficient. Disease was assessed on a set of 40 images of pecan scab-diseased valves of pecan fruit by 23 raters.

Figure 6.

 (a,b) The absolute error (estimate minus true disease) and (c,d) relative error (absolute error/true severity ×100) of assessments of a set of 40 images of scab-diseased valves of pecan fruit by 23 raters without (a,c) and with (b,d) the use of standard area diagram (SAD) assessment aids.

Rater experience

Without SADs, experienced raters demonstrated significantly less bias compared to inexperienced raters (Table 2; Cb = 0·92 and 0·78, respectively). Scale bias (υ) was marginally improved (t-test = 2·11, P = 0·05; difference between the means = 0·108, 95% CI = 0·007–0·225), but location bias (μ) was improved substantially (K-S value = 1·83, P = 0·003; difference between the means = 0·462, 95% CI = 0·219–0·716). Furthermore, without SADs experienced raters had greater precision and agreement compared to inexperienced raters. The bootstrap analysis, K-S test and t-test indicate significant improvement in these statistics, but there was no difference in the time taken to assess the valve images. However, with SADs, there was no significant difference in bias, accuracy, precision or agreement according to the bootstrap analysis, K-S and t-tests. With SADs, the bootstrap analysis showed experienced raters were significantly faster compared to inexperienced raters (difference between the means = 3·17, 95% CI = 0·67–5·50), although the non-parametric K-S test was not significant.

Table 2. Effect of rater experience on the bias, precision, agreement of, and time taken for assessments of pecan scab severity on 40 diseased valves of pecan fruit by 23 raters both without and with the use of standard area diagram (SAD) assessment aids

| SADs | Variable | Mean^a, experience: none (n = 13) | Mean^a, experience: yes (n = 10) | Difference^b between means | 95% CIs^c of the difference | KSa value (P)^d | t-value (P)^e |
|---|---|---|---|---|---|---|---|
| No SADs | Scale (υ)^f | 1·24 (1·16) | 1·12 (0·12) | 0·108 (0·002) | 0·007–0·225 | | 2·11 (0·05) |
| | Location (μ)^g | 0·70 (0·38) | 0·26 (0·34) | 0·462 (0·004) | 0·219–0·716 | 1·83 (0·003) | |
| | Coefficient of bias (Cb)^h | 0·78 (0·15) | 0·92 (0·10) | −0·155 (0·001) | −0·251–−0·074 | 1·83 (0·003) | |
| | Correlation coefficient (r)^i | 0·88 (0·06) | 0·93 (0·05) | −0·061 (0·0006) | −0·099–−0·028 | 1·54 (0·02) | |
| | LCCC^j | 0·69 (0·17) | 0·86 (0·13) | −0·19 (0·002) | −0·296–−0·098 | | −2·72 (0·01) |
| | Time (min)^k | 5·50 (1·18) | 7·77 (3·63) | −1·34 (0·03) | −3·63–0·52 | 0·92 (0·4) | |
| With SADs | Scale (υ) | 1·04 (0·09) | 1·00 (0·11) | 0·042 (0·001) | −0·029–0·123 | | 0·89 (0·4) |
| | Location (μ) | 0·21 (0·08) | 0·21 (0·18) | 0·026 (0·002) | −0·106–0·186 | 1·17 (0·1) | |
| | Coefficient of bias (Cb) | 0·97 (0·02) | 0·96 (0·03) | 0·009 (0·0003) | −0·011–0·032 | 0·93 (0·4) | |
| | Correlation coefficient (r) | 0·90 (0·03) | 0·93 (0·03) | −0·025 (0·0004) | −0·053–0·002 | 0·86 (0·5) | |
| | LCCC | 0·88 (0·03) | 0·89 (0·05) | −0·016 (0·0006) | −0·052–0·023 | | −0·8 (0·4) |
| | Time (min) | 9·10 (4·53) | 7·00 (2·80) | 3·17 (0·039) | 0·67–5·50 | 1·12 (0·2) | |

a Standard deviation in parentheses.
b Mean of the difference between each rating, with standard errors in parentheses (bootstrap calculated value).
c Confidence intervals (CIs) were based on 2000 bootstrap samples. If the CIs embrace zero, the difference is not significant (P = 0·05).
d The Kolmogorov–Smirnov asymptotic test statistic (KSa) measures the difference in the cumulative distribution functions of the pairs of assessments, and the value in parentheses indicates the probability (P) of a greater KSa. If the probability is ≤0·05, the distributions can be considered different (non-parametric test for non-normal data).
e Student’s t-test statistic and P-value (parametric test for normally distributed data).
f Scale bias, or slope shift (υ, 1 = no bias relative to the concordance line).
g Location bias, or height shift (μ, 0 = no bias relative to the concordance line).
h The correction factor (Cb) measures how far the best-fit line deviates from 45° and is thus a measure of accuracy.
i The correlation coefficient (r) measures precision.
j Lin’s concordance correlation coefficient (ρc) combines both measures of precision (r) and accuracy (Cb) to measure the degree of agreement with the true value.
k Mean time taken to assess the 40 images.

Inter-rater reliability

Inter-rater reliability was improved by using SADs (Table 3). Both the intra-class correlation coefficient (by ANOVA) and the mean pairwise coefficient of determination (by bootstrapping) demonstrated a significant effect of SADs, which thus provided raters with a way to assess disease more uniformly. The frequency of improved inter-rater reliability increased (Fig. 7a), and the regression analysis showed that the improvement was consistently greatest for those pairs of raters who were least reliable without SADs (Fig. 7b).

Table 3. Inter-rater reliability of visual assessments by 23 raters of pecan scab on 40^a valves of pecan fruit both without and with the use of standard area diagram (SAD) assessment aids. Inter-rater reliability is measured by the intra-class correlation coefficient (ρ) and coefficient of determination (R2)

| Statistics | No SADs | With SADs |
|---|---|---|
| Intra-class correlation coefficient (ρ) | 0·77 | 0·96 |
| F (P > F), LN^b | 158 (<0·0001) | 191 (<0·0001) |
| F (P > F), A | 46 (<0·0001) | 7 (<0·0001) |
| Mean inter-rater coefficient of determination (R2)^c | 0·79 (0·46–0·97) | 0·82 (0·64–0·94) |
| Mean difference^d | 0·03 (0·0002), 95% CIs 0·013–0·043 | |

a The same set of 40 valve images was used in the two assessments.
b F-value for LN = leaf, A = rater.
c Mean coefficients of determination estimated from pairwise comparisons of assessments by all visual assessors.
d Mean of the difference between each rating, with standard errors in parentheses (bootstrap calculated value), confidence intervals (CIs) were based on 2000 bootstrap samples. If the CIs embrace zero, the difference is not significant (P = 0·05).

Figure 7.

 The (a) frequency of the inter-rater reliability measured by the coefficient of determination (R2) without and with use of standard area diagram (SAD) assessment aids by 23 raters who assessed a set of 40 images of scab-diseased valves of pecan fruit and (b) the relationship between the coefficient of determination without SADs and the difference between the coefficients of determination (with SADs – no SADs) demonstrating that raters with initially poorer inter-rater reliability benefitted the most from using SADs.

Intra-rater reliability

With the subgroup of experienced, trained raters there was no difference in intra-rater reliability (Table 4; mean R2 = 0·940 and 0·935 without and with SADs, respectively). As the previous data suggest, raters who are familiar with the disease symptoms and the process of estimation of disease severity are less likely to improve as a response to using SADs. This is congruent with the fact that these six raters had consistently close agreement with the true values, obtaining LCCC values of 0·77–0·96 and 0·88–0·97 in the first and second assessments without SADs, and 0·89–0·96 and 0·85–0·97 in the first and second assessments with SADs, respectively.

Table 4. Intra-rater reliability of visual assessments by six trained and experienced raters of pecan scab on 40a valves of pecan fruit compared to the true values measured by image analysis both without and with the use of standard area diagram (SAD) assessment aids. Intra-rater reliability was measured by the coefficient of determination (R2)
Rater                           Intra-rater reliability (R2)a
                                No SADs     With SADs
1                               0·98        0·97
2                               0·90        0·90
3                               0·92        0·93
4                               0·94        0·96
5                               0·95        0·95
6                               0·95        0·90
Mean                            0·940       0·935
Differenceb between means       −0·006 (0·0003)
95% CIsc of the difference      −0·025–0·01
t-value (P)d                    0·30 (0·8)

a Coefficient of determination measures the reliability of repeat assessments of the same leaf images by the same rater.
b Difference between the means for assessments (with SADs – no SADs). Standard error of the mean in parentheses (bootstrap calculated values).
c Confidence intervals (CIs) were based on 2000 bootstrap samples. If the CIs embrace zero, the difference is not significant (P = 0·05).
d Student’s t-test statistic and P-value (parametric test for normally distributed data).
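As a rough check on the summary rows of Table 4, the following Python sketch recomputes the difference between means, a t statistic and a bootstrap confidence interval from the rounded R2 values listed in the table. The paper does not state how the t-test or the bootstrap was set up, so the unpaired (unpooled-variance) t statistic, the percentile bootstrap that resamples raters, the seed and all function names are illustrative assumptions, not the authors' procedure. Because the inputs are rounded to two decimals, the results can only approximate the tabulated values.

```python
import random
import statistics

# R2 values per rater, transcribed from Table 4 (rounded to two decimals).
no_sads = [0.98, 0.90, 0.92, 0.94, 0.95, 0.95]
with_sads = [0.97, 0.90, 0.93, 0.96, 0.95, 0.90]


def mean_difference(a, b):
    """Difference between means, expressed as (with SADs - no SADs)."""
    return statistics.mean(b) - statistics.mean(a)


def two_sample_t(a, b):
    """Unpaired t statistic with unpooled variances (an assumed test form)."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se


def bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap CI of the mean difference, resampling raters.

    Resampling raters (keeping each rater's two R2 values together) is an
    assumption; the source only states that 2000 bootstrap samples were used.
    """
    rng = random.Random(seed)
    n = len(a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_a = [a[i] for i in idx]
        resampled_b = [b[i] for i in idx]
        diffs.append(mean_difference(resampled_a, resampled_b))
    diffs.sort()
    lower = diffs[int(n_boot * alpha / 2)]
    upper = diffs[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper


if __name__ == "__main__":
    print("difference between means:", round(mean_difference(no_sads, with_sads), 4))
    print("t statistic:", round(two_sample_t(no_sads, with_sads), 2))
    print("95% bootstrap CI:", bootstrap_ci(no_sads, with_sads))
```

With these rounded inputs the t statistic comes out at about 0·30, and a confidence interval that spans zero is read, as in the table footnote, as no significant difference.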

Effect of SADs on time taken

Regression analysis showed no relationship between time taken to assess the images and bias, accuracy, precision or agreement statistics (data not shown), and time taken to complete the assessments was not significantly different between raters either with or without SADs. However, there was a slight tendency (R2 = 0·38) for faster raters to assess more slowly when using SADs, and for slower raters to assess faster when using SADs (Fig. 8; the range in time taken with no SADs was 3–15 min, and with SADs was 2–17 min).

Figure 8.

 The relationship between time taken to assess severity of pecan scab on 40 images of diseased pecan valves without a standard area diagram (SAD) assessment aid and the difference between the time taken with and without using SADs (SADs – no SADs), showing a slight tendency for faster raters to assess more slowly when using SADs, and for slower raters to assess faster when using SADs. The range in time taken with no SADs was 3–15 min, and with SADs was 2–17 min.
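The kind of regression summarized in Fig. 8 can be sketched as an ordinary least-squares fit of the change in assessment time (SADs – no SADs) on the time taken without SADs. The minutes below are hypothetical values chosen only to fall inside the reported ranges; they are not the study data, and the function name is an assumption made for illustration.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit of y on x; returns (slope, intercept, r2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = sxy ** 2 / (sxx * syy)  # coefficient of determination
    return slope, intercept, r2


# Hypothetical assessment times in minutes (NOT the study data).
time_no_sads = [3, 5, 6, 8, 10, 12, 15]
time_with_sads = [5, 6, 5, 8, 9, 9, 12]

# Change in time per rater, expressed as (SADs - no SADs).
change = [w - n for w, n in zip(time_with_sads, time_no_sads)]

slope, intercept, r2 = linear_fit(time_no_sads, change)

if __name__ == "__main__":
    print("slope:", round(slope, 3), "intercept:", round(intercept, 3), "R2:", round(r2, 3))
```

A negative slope in such a fit corresponds to the pattern described in the caption: raters who were fast without SADs tend to take longer with them, and raters who were slow tend to speed up.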

Discussion

This SAD set of pecan fruit valves, showing a spectrum in severity of symptoms of scab, improved the ability of raters to accurately and precisely assess disease severity. Several studies have previously validated the improvement in assessor ability when using SADs (Stonehouse, 1994; Godoy et al., 1997; Nutter et al., 1998; Leite & Amorin, 2002; Vereijssen et al., 2003; Gomes et al., 2004; Pethybridge et al., 2004; Belasque et al., 2005; Michereff et al., 2006; Correa et al., 2009; Spolti et al., 2011) and these studies are in agreement with the results presented here. However, this is the first time that LCC analysis has been applied to analyse data using SADs, and this is also the first time that statistical tests (95% CIs, K-S and t-tests) have been done between the different LCC statistics to provide a level of significance and thus validity to these SAD comparisons (Bardsley & Ngugi, 2012), which demonstrated significant improvements in agreement with the use of SADs.
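For readers unfamiliar with the agreement statistics used here, the sketch below shows how Lin's concordance correlation coefficient (ρc) is conventionally decomposed into a precision component (Pearson's r) and an accuracy component (the bias correction factor, Cb), with ρc = r × Cb. The severity values are hypothetical and the function name is an assumption; this illustrates the standard formula and is not the authors' analysis code.

```python
from math import sqrt


def lins_concordance(true_values, estimates):
    """Return (rho_c, r, c_b) for paired true and estimated severities.

    rho_c : Lin's concordance correlation coefficient (agreement)
    r     : Pearson correlation coefficient (precision)
    c_b   : bias correction factor (accuracy), so that rho_c = r * c_b
    """
    n = len(true_values)
    mean_x = sum(true_values) / n
    mean_y = sum(estimates) / n
    # Variances and covariance with divisor n, as in Lin's formulation.
    var_x = sum((x - mean_x) ** 2 for x in true_values) / n
    var_y = sum((y - mean_y) ** 2 for y in estimates) / n
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(true_values, estimates)) / n
    r = cov_xy / sqrt(var_x * var_y)
    rho_c = 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)
    c_b = rho_c / r
    return rho_c, r, c_b


# Hypothetical example: a rater who overestimates every valve by 10 points.
true_severity = [10, 20, 30, 40]
rater_estimate = [20, 30, 40, 50]

rho_c, r, c_b = lins_concordance(true_severity, rater_estimate)
# Here r = 1 (perfect precision) but rho_c = 250/350, about 0.714,
# because the constant overestimation lowers accuracy (c_b).

if __name__ == "__main__":
    print(round(rho_c, 3), round(r, 3), round(c_b, 3))
```

The example illustrates why precision alone is not sufficient: a rater with a constant bias is perfectly correlated with the true values yet still shows reduced agreement.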

The amount of time taken to assess disease was recorded. There was no significant effect of rater experience on time taken to assess disease without SADs, and no relationship between time taken by a rater and bias, accuracy, precision or agreement without or with SADs. However, experienced raters using SADs were faster compared to inexperienced raters. It is likely that experienced raters are more confident using assessment aids like SADs. The observation that there was a slight tendency for faster raters to assess more slowly, and slower raters to assess faster when using SADs suggests that some individual raters might be impeded by using SADs, which would only be a disadvantage if these raters were also inherently accurate and precise without SADs. Although time taken to assess images has been recorded before to demonstrate differences between assessment methods (Nutter & Schultz, 1995; Martin & Rybicki, 1998; Bock et al., 2009b), no comparison had been made previously without and with the use of SADs.

There was a significant effect of rater experience on rater accuracy and precision without SADs – experienced raters were more accurate and precise. However, when using SADs, there was no significant difference between experienced and inexperienced raters. Raters have been compared on several other occasions, and there is diversity in the ability of individuals to assess disease severity, and most often some difference has been found between experienced and inexperienced groups (Sherwood et al., 1983; Newton & Hackett, 1994; Nita et al., 2003; Bock et al., 2009a). Indeed, the study of Nita et al. (2003) found that precision was different between the two groups. Use of SADs improves the accuracy and precision of assessments (Godoy et al., 1997; Nutter et al., 1998; Leite & Amorin, 2002; Gomes et al., 2004; Belasque et al., 2005; Michereff et al., 2006; Spolti et al., 2011). Although an inexperienced rater might be inherently capable of assessing disease with a high degree of accuracy and precision, inexperienced raters can also be the worst raters, and it is those who had the greatest response to using the SADs in this study, thus making assessments more uniform. As a result, inter-rater reliability was significantly improved with use of the SAD set, as the raters' individual assessments were more often closer to the true values. No effect of SADs on intra-rater reliability was observed, but this is most likely attributable to the fact that the subgroup of six raters included in the study of intra-rater reliability were all well trained and familiar with pecan scab symptoms and the use of SADs.

Standard area diagrams are most often computer-generated images or drawings of a plant part with diseased area imposed by hand or by copying disease symptoms (James, 1971; Godoy et al., 1997; Nutter et al., 1998; Leite & Amorin, 2002; Gomes et al., 2004; Belasque et al., 2005; Michereff et al., 2006; Spolti et al., 2011). This leads to a great variety in SAD quality, with some being somewhat stylized and others a close mimic of an actual assessment unit. Converting an actual image, as done in the current study, gives a more accurate portrayal of disease morphology and the unique patterns of a specific disease on a particular plant part.

There are 10 diagrams in this SAD set that cover the range 1–100%. Precision will be directly impacted by the number of SADs, so a SAD set with fewer diagrams will doubtless give different results, and at some point the number of diagrams may be too few to allow raters to estimate disease severity with sufficient precision compared to not having a SAD set. Correa et al. (2009) found that diagram number affected precision and accuracy when using the SADs like an interval scale rather than estimates to the nearest percentage by interpolation (six diagrams were found to be sufficient). Conversely, more than 10 diagrams might be excessive and result in loss of time. Furthermore, the distribution of severities in a SAD set has not been studied, but it should probably relate to the frequency of disease severity likely to be anticipated in the field (Kranz, 1977; Correa et al., 2009). Most often this will result in more diagrams at the lower end of the percentage scale to differentiate disease at a point where overestimation is known to be an issue (Sherwood et al., 1983; Bock et al., 2008). There are advantages to basing a SAD set on a linear series (Pethybridge et al., 2004) rather than a logarithmic one (which many are), as logarithmic scales have been demonstrated to detract from the quality of data obtained (Nita et al., 2003; Nutter & Esker, 2006; Bock et al., 2010b). SADs with at least 10 diagrams which cover the range of disease can provide a basis for more precise and accurate assessments.

Use of the SAD set for pecan scab on valves of pecan fruit is straightforward. The diagrams reduced bias, improved the accuracy and agreement of raters assessing symptoms of pecan scab on fruit, and increased the inter-rater reliability. This SAD set should be a useful tool for aiding assessment where greater accuracy, precision and repeatability of pecan scab rating is required, including comparing disease management treatments and studies of disease epidemiology.

Acknowledgements

The authors appreciate the time of colleagues at the USDA-ARS-SEFTNRL and colleagues and students at Fort Valley State University who assessed the diseased leaf images. The image of pecan fruit was kindly provided by Dr Mike Hotchkiss, USDA-ARS-SEFTNRL. The authors also acknowledge several useful discussions with Dr Henry K. Ngugi of Penn State University.
