Improving visual estimation through active feedback


  • Bonnie C. Wintle,

    Corresponding author
    • Environmental Science, School of Botany, University of Melbourne, Parkville, Victoria 3010, Australia
  • Fiona Fidler,

    • Australian Centre of Excellence for Risk Analysis (ACERA), School of Botany, University of Melbourne, Parkville, Victoria 3010, Australia
  • Peter A. Vesk,

    • Centre of Excellence for Environmental Decisions (CEED), School of Botany, University of Melbourne, Parkville, Victoria 3010, Australia
  • Joslin L. Moore

    • Australian Research Centre for Urban Ecology (ARCUE), Royal Botanic Gardens Melbourne, c/o School of Botany, University of Melbourne, Parkville, Victoria 3010, Australia


  1. In field surveys, ecological researchers and practitioners routinely make quantitative judgements that are known to vary in quality. Feedback about judgement accuracy is crucial for improving estimation performance yet is not usually afforded to fieldworkers. One reason it is rare lies in the difficulty of obtaining ‘true values’ (e.g. percentage cover) to learn from. Often, the only information we can access is other people's estimates of the same thing. Group average estimates tend to be remarkably accurate. By extension, receiving feedback about group averages may improve the estimation performance of individuals, dispensing with the need for ‘true values’ to learn from.
  2. In experiment 1, we tested whether feedback using group averages might improve estimates of species abundance as much as feedback using true values. However, not all feedback approaches are effective. In experiment 2, we compared two feedback formats for presenting information about group estimates of percentage cover. In both experiments, we used a novel 4-point interval estimation approach to quantify uncertainty that is known to reduce overconfidence but is yet to be applied in ecology.
  3. Results from experiment 1 show that feedback about group averages improved performance (calibration and accuracy) almost as much as feedback about the truth, despite the fact that group averages were generally not close to true values. In contrast, group averages in experiment 2 were remarkably close to true values, but the only participants who improved their estimates were those who evaluated their own performance during the feedback session, using active feedback with a calibration component.
  4. Feedback reminds surveyors not to give over-precise estimates and to appropriately reflect uncertainty. It improves calibration and accuracy of abundance estimates and could reasonably improve estimates of other quantities. Drawing on the wisdom of crowds, group averages could be used as a proxy for true values in feedback procedures. However, the format for delivering feedback matters. Actively engaging participants by having them evaluate their own estimation performance appears critical to improving their subsequent judgements, compared with passive feedback. We advocate the introduction of feedback into the training of ecologists.


Quantitative estimation in ecology

Estimation plays a major role in ecology. Investigations often require a quantitative vegetation description to monitor status, trends and dynamics. Researchers and practitioners routinely approximate species abundance in a variety of ways, including counts, percentage cover (projected foliar, canopy or basal), density and biomass (see Mueller-Dombois & Ellenberg 1974). Visual cover estimates are common because they are rapid and not labour intensive but are known to vary in quality. In some cases, this variation is tolerable, but in others, it affects the conclusions we can draw from a study. For example, Vittoz et al. (2010) determined that observer variation was such that changes in alpine vegetation would only be detected for abundant species (> 10% cover) or if relative changes were large (> 50% cover).

Measurement error should be disentangled from vegetation variation by determining the source and size of the error (Kennedy & Addison 1987). The size of the error is investigated either by (i) evaluating the repeatability of estimates between observers, assuming that high variability indicates high error (deviation from the true value), or (ii) comparing observer estimates with more precise objective measurements resembling ‘true values’, obtained from, for example, point quadrat sampling. Past studies show cover estimates to vary substantially between observers (Sykes, Horrill & Mountford 1983; Bråkenhielm & Liu 1995; Helm & Mead 2004; Cheal 2008) and also within observers over repeated judgements (Hees & Mead 2000). While the degree of variability is highly inconsistent between studies, average cover estimates tend to reflect 10–20% error (Sykes, Horrill & Mountford 1983; Kennedy & Addison 1987). Furthermore, error is not confined to novices. In Cheal's (2008) study, a one-off cover estimate of the same Triodia field from 16 experienced observers ranged from 20% to 60%, even though three observers thought they could reliably discriminate to within 5% cover intervals. Estimation errors are unpredictable and vary across environments and scales (Klimeš 2003), indicating that the source and magnitude of the error depend on vegetation type, sampling area, species characteristics (e.g. morphology, misidentification), total cover, time pressure, assessment scale and recording methods (e.g. Hope-Simpson 1940; Daubenmire 1959; Sykes, Horrill & Mountford 1983).

Management decisions can only be made confidently if methods provide reliable estimates that appropriately reflect uncertainty, but most methods do not. Common methods of abundance estimation either promote false precision (in point estimates or too-narrowly defined classes) or sacrifice information (in large, inflexible classes). Classification boundaries can be arbitrary and lead to large boundary errors (Helm & Mead 2004) that undermine decisions. For example, boundary errors have implications for threshold-based weed management programmes (Andujar et al. 2010). The use of classes has also been criticised for overestimating cover of rare species (Floyd & Anderson 1987).

In the present study, we used a 4-point interval estimation approach that allows the estimator to quantify their own uncertainty (see 'Materials and methods'). Tested in epidemiology, marine biology and biosecurity, it reduced overconfidence from around 40–50%, typical of interval judgements (e.g. Teigen & Jørgensen 2005), to 5–12% (e.g. Speirs-Bridge et al. 2010). It has not yet been applied in ecology. We propose that this technique avoids some of the issues accompanying arbitrary cover interval classes, allowing the observer to reflect the different levels of uncertainty associated with different species morphologies, detectability and total cover.

Feedback for improving estimation

Beyond quantifying estimation error, we need procedures to reduce it. Previous research shows that feedback is important for learning and generally improves estimation (Kopelman 1986); expertise accumulates slowly when systematic feedback is not provided to fieldworkers. However, not all types of feedback are equally effective. Indeed, some approaches to providing feedback may be detrimental, prompting a distinction between two main types: outcome feedback and cognitive feedback (Todd & Hammond 1965; Balzer, Doherty & O'Connor 1989). Outcome feedback simply refers to learning the result or true value (e.g. ‘actual’ species abundance in a quadrat). Cognitive feedback focuses on the relational aspects of the results, such as the relationship between the outcome or truth and the judgement (estimation error), or the features of the task (e.g. trends and variability) (Bolger & Önkal-Atay 2004).

Outcome feedback is common, but tests have shown it to be ineffective in improving probability forecasts (Fischer 1982), as it does not provide the information forecasters need to understand environmental relationships (Brehmer 1980), nor a series of long run outcomes for the forecaster to better calibrate their probability forecasts with relative frequencies of occurrence (Benson & Önkal 1992). Outcome-only feedback is less structured and can be ignored (e.g. Jacoby et al. 1984). Worse, it may even detract from learning under uncertainty (Brehmer 1980), because people's biases prevent them from interpreting results objectively.

Cognitive feedback, on the other hand, has seen much more success (Balzer, Doherty & O'Connor 1989; Newell et al. 2009), perhaps because engaging with the task can accelerate learning, as demonstrated in education research. We test a specific form of cognitive feedback called calibration feedback (e.g. Lichtenstein & Fischhoff 1980). It involves comparing a person's overall proportion of correct answers (known as percentage ‘hits’) with their confidence levels. If a person is 80% confident in their judgements and they answer correctly 80% of the time, they are well calibrated. If they answer less than 80% correctly, they are overconfident. Note that for interval judgements (which is what we use in this study), correct answers, or hits, are those where the interval captures the truth.
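As a minimal sketch of this comparison (function and variable names are ours, not from the study), calibration for a set of interval judgements reduces to contrasting mean stated confidence with the observed hit-rate:

```python
def calibration_summary(confidences, hits):
    """Compare mean stated confidence with the observed hit-rate.

    confidences: stated confidence per judgement, as fractions (e.g. 0.8)
    hits: True where the elicited interval captured the true value
    """
    mean_confidence = sum(confidences) / len(confidences)
    hit_rate = sum(hits) / len(hits)
    # Positive values indicate overconfidence, negative underconfidence.
    overconfidence = mean_confidence - hit_rate
    return mean_confidence, hit_rate, overconfidence
```

For example, a person who is 80% confident on average but whose intervals capture the truth only half the time is 30 percentage points overconfident and should widen their intervals.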

Researchers assert that ‘calibration feedback appears to be a promising means of improving the performance of probability forecasters’ (Benson & Önkal 1992, p. 560). While much of the literature introduced earlier has been examined in forecasting and general knowledge tasks, we believe the benefits of calibration feedback would also translate to quantitative estimates in the field.

We are not aware of any research that has experimentally tested cognitive feedback, particularly calibration feedback, in ecology and environmental science. On-ground training in vegetation condition assessment protocols such as the ‘Habitat Hectares’ approach (Parkes, Newell & Cheal 2003) is routinely conducted in government agencies, but training benefits are not tested. Studies suggest that experience and training can reduce error within individuals (Smith 1944; Kennedy & Addison 1987; Cropper 2009), but general field experience does not necessarily correlate with performance (Gorrod & Keith 2009) or consistency of observer biases (Sykes, Horrill & Mountford 1983). The high observer repeatability reported by Symstad, Wienk & Thorstenson (2008) was considered a product of rigorous training, although this claim was not specifically tested within the study.

Learning from the crowd

One reason why feedback is rare in reality lies in the difficulty of obtaining ‘true values’ (such as percentage cover) that can be used to learn from. As feedback about the truth is essential for building expertise, it would be useful to know whether the type of feedback that we usually have access to (other people's estimates of the same thing) functions in the same way. Fortunately, we know that the group average of multiple judgements tends to be very close to the truth, because random and systematic errors of individuals tend to cancel each other out. This statistical sampling phenomenon is remarkably robust. On examining 800 estimates of the weight of a fat ox at a country fair in England, Francis Galton (1907) marvelled that the median (and mean) was within 1% of the true value, outperforming most participants and even the best cattle experts in the crowd, a phenomenon known as the ‘Wisdom of Crowds’ (Surowiecki 2005).

Fortunately, we do not require 800 people at a country fair to see an improvement in judgement. The average judgement from two people is better than one (Soll & Larrick 2009), and even the average of two judgements from a single person tends to be closer to the truth over the long run than adopting a single estimate (Herzog & Hertwig 2009). Sykes, Horrill & Mountford (1983) found mean cover values from ten observers to correspond closely with measured point quadrat values in 4-m2 quadrats. By extension, we suggest that the group average could be substituted for the true value in feedback to improve cover estimates.
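The averaging effect is easy to demonstrate. In the toy sketch below (all numbers hypothetical), the group mean lands far closer to the truth than the typical individual because opposite errors partly cancel:

```python
true_cover = 40  # hypothetical % cover, known here only for scoring

# Ten hypothetical observers, scattered around the truth
estimates = [25, 55, 30, 48, 62, 35, 44, 28, 51, 38]

group_average = sum(estimates) / len(estimates)
mean_individual_error = sum(abs(e - true_cover) for e in estimates) / len(estimates)
group_error = abs(group_average - true_cover)

# group_error (about 1.6) is far smaller than mean_individual_error (about 10.4)
```

The cancellation is only partial when errors share a systematic bias, which is why group averages can still drift from the truth, as in experiment 1.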

In experiment 1, we test whether feedback using different information (true values or group average estimates from participants) similarly improves estimation performance. Presumably, the effectiveness of group average feedback in improving accuracy depends on how close other people's estimates are to the truth. However, feedback about inaccurate group estimates might still improve judgements if it illustrates variability and prompts participants to adjust their interval widths and so reduce overconfidence. In experiment 2, we wished to identify the components of feedback that are critical for improving individual estimation, to explore how best to structure a feedback session when we do not have true values. We compared two formats for feeding back information about other people's judgements to participants, and hypothesised that participants would respond most positively if they actively evaluated their own performance (hereafter active feedback, based on calibration feedback), rather than simply observing other people's estimates (hereafter passive feedback, analogous to outcome feedback).

Materials and methods

Participants and study area

Thirty-seven volunteers from Australian Conservation Volunteers (ACV) participated in experiment 1. Responses for two participants were incomplete and were omitted from the analysis, leaving a final n = 35. Experience varied between participants, but all had some interest in and/or familiarity with alpine vegetation (Table 1). The participant group was gender-balanced, and many had backgrounds in natural sciences (38% of participants described botany, natural resource management or outdoor education as their primary discipline). The study was conducted as part of a weekend survey of an invasive species (Grey Sallow willow, Salix cinerea) throughout alpine bogs in the Bogong High Plains (Australia), in collaboration with Parks Victoria and the Victorian National Parks Association (VNPA).

Table 1. Demographic summary of participants in both experiments(a)

Exp 1 (n = 35): Female 54%; Age 55·6 [11]; Years experience 19·9 [13·4]; Bachelor/Diploma 49%; PhD/Masters 22%
  Experience rating(b): Alpine veg 5·2 [2·5]; Alpine willow 5·8 [3·6]; Plant ID 4·8 [2·6]

Exp 2 (n = 37): Female 65%; Age 21·5 [1·4]; Years experience N/A (third-year undergraduates); Bachelor/Diploma N/A; PhD/Masters N/A
  Experience rating(b): Coastal veg 4 [2·2]; % cover estimation 5·8 [1·7]; Plant ID 5·5 [2]

(a) Data presented as percentage of participants or mean [standard deviation].
(b) Self-rated familiarity with three skill areas, measured on a scale of 0–10.

Thirty-seven final year botany students from the University of Melbourne participated in experiment 2. All students had some experience with field surveys from other course field trips, but less experience with the vegetation at the site (Table 1). The study was integrated into an annual field trip to Altona Coastal Park. While attendance was a course requirement, students were aware that their responses were to be de-identified and would not contribute to assessment.


In experiment 1, ten plots, each a circle of 10 m radius, were marked out with flags. They were split across two willowed areas (five in each) for a counterbalanced 2 × 2 Before-After design. Prior to the experiment, each plot was independently surveyed by three field ecologists to obtain ‘true values’; they counted individuals and measured heights for allocation into each of four size classes: seedlings (single stem), small shrubs (< 0·5 m), medium shrubs (0·5–1·5 m) and large shrubs (> 1·5 m). The average of the three counts was taken if they differed (the possibility of double-counting prevented us from using the maximum).

Participants were randomly allocated into two groups. Group 1 commenced estimations at Area 1 (with two facilitators), and Group 2 simultaneously started at Area 2 (with another two facilitators). Abundance was estimated by each participant using a 4-point technique (Speirs-Bridge et al. 2010) that is designed to mitigate two of the most pervasive and influential sources of estimation bias: anchoring and overconfidence (Soll & Klayman 2004; Teigen & Jørgensen 2005). The technique elicits an interval in four stages: (1) lowest plausible number of willows (a), (2) highest plausible number of willows (b), (3) best estimate of number of willows (r), (4) confidence that the interval contains the actual number of willows (50–100%) (c). Participants were asked to spend only 1 min on each 4-point estimate, imposing a time constraint to resemble a ‘rapid assessment’. At each plot, participants formulated abundance interval estimates for each of the four willow size classes.

On completion of the first five plots, the two groups were randomly split into two sub-groups, which independently received feedback about either (1) the actual abundances (true values) or (2) the average best estimates of the other participants in their group for each of the five plots they had just estimated (Fig. 1). During the feedback session, participants calculated their hit-rates for the two most abundant size classes (small and medium shrubs) by counting how many of their interval estimates contained either the ‘true value’ or ‘group average estimate’, depending on which treatment group they were assigned to. Participants then compared their hit-rates with their average confidence, and assessed whether their interval widths (ranges) were appropriate and levels of confidence warranted. This process will hereafter be called ‘calibration feedback’, and it underpins one of the feedback formats we will test in experiment 2. Groups then switched areas and completed the remaining five plots.

Figure 1.

Experimental design (n = 37 initially in both experiments, although different participants were used in each).

The same basic procedure was followed for experiment 2, except the feedback conditions were different. Here, we compared two feedback formats for presenting information about group estimates. Both formats had a graphical component (Fig. 2), and the second format had an additional ‘calibration feedback’ component.

Figure 2.

Two graphical formats were compared for displaying participants’ estimates during feedback in experiment 2 (intervals adjusted to 80% confidence).

All participants' data for the first five plots were entered into laptops, transformed to an 80% confidence level using a linear extrapolation, and anonymously displayed back to the groups on butcher's paper, in one of the two formats. The first format graphically displayed other people's anonymous estimates as a series of individual intervals. Facilitators pointed out the variability and some characteristics of different intervals (e.g. ‘the narrow interval here means that this person is relatively confident’). In this condition, participants listened and watched, but there was no active engagement in the feedback session. We call this passive feedback (PFB). We liken PFB to outcome feedback (which is also passive), but they are not strictly the same, as here we provided information about the group estimates, not the true values.

The second format was to display the group estimates as a single group average interval (first component), and in addition, to have the participants calculate and evaluate their own ‘hit-rates’ during the feedback session using the group average best estimate (second component, see ‘calibration feedback’ procedure for experiment 1). We call this active feedback (AFB).

There were some other minor differences in the design of experiment 2 compared to experiment 1. Rather than estimating abundance via counts, participants made rapid estimates of projected foliage cover (percentage cover) of three target species, also using the 4-point interval elicitation technique. The species were sufficiently common that they would occur in most of the plots, but in different abundances. Also, they represented three distinct morphologies: species 1 was a sprawling succulent, Beaded glasswort (Sarcocornia quinqueflora), species 2 was an erect succulent, Shrubby glasswort (Sclerostegia arbuscula), and species 3 was a prostrate shrub, Southern sea-heath (Frankenia pauciflora). Ten plots were split across two areas, as per experiment 1, but they were rectangular quadrats (10 × 3 m). For consistency, we will call them all plots. Prior to the task, participants were taught to identify each of the three target species, and were given printouts with further identification guidelines and specimen photos. Unlike experiment 1, participants did not receive ‘true values’ as part of the feedback session (we only used group averages). However, it was still necessary to obtain ‘true values’ to calculate estimation performance. We obtained these using high-resolution (n = 200) point quadrat sampling (Elzinga et al. 1999), prior to the experiment.

In both experiments, a 20-min introduction to the task instructed participants about the process; the morphological and height distinctions between size classes (experiment 1) and species (experiment 2); and guidelines for visual measurement and plot inclusion. Participants were instructed not to share or discuss their estimates with each other. After the introduction, participants estimated abundance in two additional plots to control for practice effects. Data from these plots were not used. This was to ensure that the ‘before’ performance had reached a stable baseline, so we could be confident that improvement after feedback was due to the feedback intervention itself [as we did not have enough participants for a Before-After-Control-Intervention (BACI) design]. Facilitators of the feedback sessions were trained and used a script to ensure structured, consistent interventions across groups. Participants consented to their data being anonymously used in our research and were debriefed about study findings.

Statistical analysis

To enable comparison of estimates with different confidences, all intervals were adjusted to an 80% confidence level using a simple linear extrapolation (Bedford & Cooke 2001; McBride et al. 2012). Using the participants' elicited lower bound (a), upper bound (b) and best estimate (r), we extrapolated to adjusted lower (aadj) and upper (badj) bounds within which 80% of all estimates might be expected to fall, such that,

$$a_{\mathrm{adj}} = r - (r - a)\,\frac{c_{\mathrm{adj}}}{c} \qquad \text{(eqn 1)}$$

$$b_{\mathrm{adj}} = r + (b - r)\,\frac{c_{\mathrm{adj}}}{c} \qquad \text{(eqn 2)}$$

where cadj is the required probability level (80%), and c is the participants' stated confidence. We believe the linear extrapolation to be the most sensible approach for these data in terms of minimising assumptions about the participants' underlying distribution, after comparing it with log normal, beta and arcsine transformations.
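A minimal sketch of this adjustment (function name ours), following eqns 1 and 2:

```python
def adjust_interval(a, b, r, c, c_adj=0.8):
    """Linearly extrapolate an elicited interval to a standard confidence level.

    a, b  : elicited lowest and highest plausible values
    r     : best estimate
    c     : stated confidence that [a, b] contains the truth (0.5-1.0)
    c_adj : required probability level (here 80%)
    """
    scale = c_adj / c
    a_adj = r - (r - a) * scale  # eqn 1
    b_adj = r + (b - r) * scale  # eqn 2
    return a_adj, b_adj

# A 100%-confident interval shrinks toward the best estimate at the 80% level:
adjust_interval(a=10, b=30, r=18, c=1.0)  # ≈ (11.6, 27.6)
```

Conversely, an interval stated with less than 80% confidence is widened about the best estimate, so that all participants' intervals are comparable at a common level.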

Hit-rates (the number of intervals per participant that contained the true value) were calculated and compared before and after feedback. Calibration indicates how well the intervals specified by a participant reflect their uncertainty. If the average percentage of hits from their transformed intervals is substantially below 80% (the standardised confidence level), they are considered overconfident and would need to widen their intervals to become better calibrated. If it is greater than 80%, they are underconfident. Accuracy scores were also calculated before and after feedback following the standardisation methods described by Burgman et al. (2011). Accuracy was measured as the distance between each ‘best estimate’ and ‘true value’, averaged over all Before estimates and all After estimates. Scoring rules that measure ‘distance from truth’ usually require standardisation to account for different response units but also different response ranges. All estimates elicited in this task were on the same scale (experiment 1: abundance counts; experiment 2: percentage cover), but the range of responses was relatively narrow when the true value was close to the lowest possible bound (zero) or, in experiment 2, close to the highest possible bound (100%). Conversely, the response range tended to be wide when the true value was more centralised. To ensure that performance on each judgement (e.g. small shrubs, Quadrat 1) contributed equally to the overall accuracy measure, we first range-coded the best estimates (r) by each participant for each judgement. That is, we expressed each estimate as,

$$\hat{r}_{p,i} = \frac{r_{p,i} - \min_i}{\max_i - \min_i} \qquad \text{(eqn 3)}$$

where $r_{p,i}$ is the estimate from participant p for judgement i, and $\min_i$ and $\max_i$ are the group minimum and maximum of the participants' best estimates for judgement i (including the true value). We then rescaled the answers, expressing each as the average log-ratio error (ALRE),

$$\mathrm{ALRE}_p = \frac{1}{N}\sum_{i=1}^{N}\left|\log_{10}\!\left(\frac{\hat{r}_{p,i} + 1}{\hat{x}_i + 1}\right)\right| \qquad \text{(eqn 4)}$$

where N is the number of judgements, $\hat{r}_{p,i}$ is the range-coded estimate and $\hat{x}_i$ is the observed (true) value, also range-coded by the group minimum and maximum. The error of a participant's estimates is measured as a log-ratio, which is not dominated by a single estimate that is far from the truth. Scores closer to zero indicate greater accuracy. The log-ratio score for any given question has a maximum possible value of 0·31 (= log(2)), indicating that the true answer has coincided with the group minimum or maximum (Burgman et al. 2011).
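The range-coding and scoring steps can be sketched as below (function names ours). The ‘+ 1’ offset inside the log ratio is our assumption, chosen because it reproduces the stated per-judgement maximum of log(2) ≈ 0·301 when the estimate and the truth fall at opposite ends of the group range:

```python
import math

def range_code(x, lo, hi):
    """Range-code a value to [0, 1] using the group minimum and maximum for
    that judgement (the true value is included when finding lo and hi)."""
    return (x - lo) / (hi - lo)

def alre(judgements):
    """Average log-ratio error for one participant (lower is more accurate).

    judgements: list of (best_estimate, true_value, group_min, group_max)
    tuples, one per judgement.
    """
    errors = []
    for r, x, lo, hi in judgements:
        r_hat = range_code(r, lo, hi)   # eqn 3
        x_hat = range_code(x, lo, hi)
        errors.append(abs(math.log10((r_hat + 1) / (x_hat + 1))))
    return sum(errors) / len(errors)   # eqn 4

# An estimate at the group maximum when the truth sits at the group minimum
# scores the worst possible value for that judgement, log10(2):
alre([(50, 10, 10, 50)])
```

Because the ratio is logged and each judgement is range-coded, one wildly wrong estimate cannot dominate the participant's overall score.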

Gender and education differences in hit-rate and accuracy improvement were explored using contrasts of means and confidence intervals. Experience measures and age were compared with performance variables using Pearson correlations.


Results

Group averages (overall)

We anticipated that group averages would be close to true values. In experiment 1, however, most people underestimated abundance, so group averages were consistently lower than the true values overall. On average, small and medium shrub estimates were 1·5 times lower than true values (53% underestimation), with the exception of small shrubs at the densely populated site (Area B), which were 1·4 times higher (overestimated by 40%), reflecting poor discrimination between height divisions in plots with many willows. In experiment 2, group averages of percentage cover best estimates were very close to true values, falling within 3% of the true value in almost half (43%) of the 30 estimates made across the 10 plots. The deviation of the group average percentage cover from the true value was especially low for species 2 (mean = 1·0% points, SD = 2·5, range = −4·5 to 4·2) and species 3 (mean = 2·2% points, SD = 4·8, range = −4·6 to 9·6). Percentage cover for species 1 was systematically underestimated (mean = 7·4% points, SD = 6·5, range = −4·3 to 18·1). The average interval for each of the 30 estimates contained the true value 80%, 100% and 90% of the time for species 1, 2 and 3, respectively; this mean hit-rate of 90% for the group average interval (all species) was matched by only one of the 37 participants. Given the consistently narrower (more informative) interval width of the group average, it outperformed the best performing individual in the group over the series of estimates.

General performance (overall)

In experiment 1, people with PhDs had better hit-rates (58% hits; CI95 47, 68) than those with other or no qualifications (44% hits; CI95 40, 49), but not better accuracy. There was little correlation (r = −0·153 to 0·181) between the performance measures (accuracy, hit-rate, improvement) and any of the self-assessed experience measures (alpine vegetation/alpine willow/plant identification) or the other demographic variables (years of experience, age), reflecting the group's homogeneity on these variables. In experiment 2, there were weak negative correlations between hit-rate and coastal vegetation experience (r = −0·37), and between accuracy and coastal vegetation experience (r = −0·28). Correlations between performance and the other self-assessed experience measures (percentage cover estimation/plant identification) were below 0·1.

Feedback effects

Overall, 69·1% of participants in experiment 1 had a better hit-rate after feedback (Table 2; Fig. 4). That is, more of their intervals captured the true value (e.g. Fig. 3), because they were both wider and closer to the truth. For those whose performance improved, hit-rate increased on average by 4·8 of 20 possible ‘hits’ (five plots, four interval estimates per plot), or 24%. In contrast, only 25·6% of participants had a worse hit-rate after feedback, which declined on average by 2·4 of 20. Two of 35 participants (5·7%) had no hit-rate change after feedback.

Figure 3.

Participants' estimates of small willow shrubs from Plot 1 before feedback (a) Group 1, = 15 and after feedback (b) Group 2, = 20 (unpaired). The true value line marks the actual number of small willow shrubs in that plot; intervals are adjusted to 80% confidence.

Figure 4.

Mean change in estimation performance after feedback (95% Confidence Intervals) for experiment 1 (a and c) and experiment 2 (b and d). Above zero indicates a performance improvement for hit-rate (a and b) and accuracy (c and d).

Table 2. Summary of results by feedback treatment group for experiment 1. Values show before feedback → after feedback, with the change(a) [95% confidence intervals]

Hit-rate (%)
  Small/med shrubs:  Truth 28·1 → 43·1, +15·0 [2·7, 27·3];  Group average 24·2 → 37·9, +13·7 [2·8, 24·6];  Combined 26·1 → 40·3, +14·3 [6·2, 22·4]
  All size classes:  Truth 43·1 → 59·1, +16·0 [7·1, 24·9];  Group average 41·6 → 52·6, +11·1 [4·3, 18·3];  Combined 42·3 → 55·6, +13·3 [7·8, 19·0]

Overconfidence (%)
  Small/med shrubs:  Truth 51·9 → 36·9, −15·0 [2·7, 27·3];  Group average 55·8 → 42·1, −13·7 [2·8, 24·6];  Combined 53·9 → 39·7, −14·3 [6·2, 22·4]
  All size classes:  Truth 36·9 → 20·9, −16·0 [7·1, 24·9];  Group average 38·4 → 27·4, −11·1 [4·3, 18·3];  Combined 37·7 → 24·4, −13·3 [7·8, 19·0]

Accuracy (ALRE)(b)
  Small/med shrubs:  Truth 0·1240 → 0·1067, 14·0%;  Group average 0·1152 → 0·1023, 11·2%;  Combined 0·1196 → 0·1045, 12·6%
  All size classes:  Truth 0·0809 → 0·0824, 1·8%;   Group average 0·0823 → 0·0765, 7·0%;   Combined 0·0816 → 0·0795, 2·6%

Change break-down                               Truth      Group average   Combined
  Hit-rate improvement (% participants)         75%        63·2%           69·1%
  Improvement magnitude (of 20 possible hits)   4·9        4·7             4·8
  Hit-rate decline (% participants)             25%        26·3%           25·6%
  Decline magnitude (of 20 possible hits)       2·0        2·8             2·4
  No change (% participants)                    0%         10·5%           5·7%

(a) Measured as mean change in percentage points for hit-rate and overconfidence, and relative % difference for accuracy.
(b) Measured as the average log-ratio error, so a lower value after feedback denotes improved accuracy.

Across all willow size classes, respondents were on average 37·7% overconfident prior to feedback and 24·4% overconfident after feedback, a calibration improvement of 13·3 percentage points. The average calibration improvement was comparable between the size classes on which feedback was received (14·3%) and all data (13·3%), indicating that the feedback effect transferred to the other estimates made by participants (Table 2).

Improvement was similar for those who received feedback about the Truth (75% of participants improved their hit-rate) and those who received feedback about the Group Average (63·2% improved their hit-rate) (Fig. 4). The average improvement was also similar for both groups, at approximately 5 of 20 possible hits.

There was a strong gender effect associated with hit-rate improvement after feedback: 70% (16 of 23) of those who improved were female. On average, females improved by 24·7% (CI95 16·2, 33·3), while males on average showed no (0%) improvement (CI95 −7·5, 8·1). No other statistically significant differences were detected between improvement after feedback and the other demographic variables (experience measures, age, education).

In addition to hit-rate, another measure of estimation performance is accuracy, or ALRE. For small and medium shrubs, this improved on average by 12·6% after feedback (measured as relative % difference, whereas hit-rate improvement was measured as a shift in percentage points). The improvement in the Truth group (14%) was similar to that in the Group average group (11·2%). When seedlings and large shrubs were included in the analysis, the overall improvement was diluted (Table 2; Fig. 4). Seedlings were generally absent, so there was no room for improvement (floor effect).

Table 3 summarises the effect of feedback for both treatment groups in experiment 2 (see also Fig. 4). Unlike experiment 1, it was not appropriate to combine the data for an overall feedback effect, owing to the substantial difference between groups. The results for hit-rate and accuracy are separated into two further groups: the species on which feedback was directly received (species 1 and species 2) and all data, which include estimates on which feedback was not received (species 3). Overall, 35% of participants had a better hit-rate after passive feedback (PFB), compared with 52·9% given active feedback (AFB). For those whose performance improved, hit-rate increased by 3·1 of 15 possible hits in PFB, and 3·7 of 15 in AFB. In contrast, 65% of participants had a worse hit-rate after PFB, compared with only 29·4% in AFB. The average decline after PFB (2·7 of 15) was almost twice that after AFB (1·6 of 15). Finally, 18% of participants showed no change in hit-rate after AFB.

Table 3. Summary of results by feedback format for experiment 2 [95% confidence intervals]

                          Passive feedback (PFB)                 Active feedback (AFB)
Performance measure       Before   After    Change(a)            Before   After    Change(a)
Hit-rate (%)
 Species 1 & 2            61·5%    57%      4·5% [−10·5, 1·5]    58·5%    72·9%    14·4% [2·4, 26·4]
 All three species        62·6%    58·3%    4·3% [−9·3, 0·7]     58·4%    68·3%    9·8% [1·8, 17·8]
Overconfidence (%)
 Species 1 & 2            18·5%    23%      4·5% [−10·5, 1·5]    21·5%    7·1%     14·4% [2·4, 26·4]
 All three species        17·4%    21·7%    4·3% [−9·3, 0·7]     21·6%    11·7%    9·8% [1·8, 17·8]
Accuracy (ALRE)(b)
 Species 1 & 2            0·0902   0·0814   9·8%                 0·0924   0·0741   19·8%
 All three species        0·1221   0·1253   2·6%                 0·1347   0·1256   6·8%

Change break-down                                  PFB       AFB
Hit-rate improvement (% participants)              35%       52·9%
Improvement magnitude (# out of 15 possible hits)  3·1       3·7
Hit-rate decline (% participants)                  65%       29·4%
Decline magnitude (# out of 15 possible hits)      2·7       1·6
No change (% participants)                         0%        18%

(a) Measured as mean change in percentage points for hit-rate and overconfidence, and relative % difference for accuracy.
(b) Measured as the average log-ratio error, so a lower value after feedback denotes improved accuracy.

For species 1 and 2, respondents who received PFB were on average 18·5% overconfident prior to feedback, and slightly worse (23%) after feedback. For those who received AFB, overconfidence before feedback was 21·5% compared with 7·1% after feedback, a calibration improvement of 14·4 percentage points. The average calibration improvement for AFB was greater for the two species on which feedback was received (14·4%) than for all species (9·8%), indicating that the feedback effect was not transferable to the other estimates made by participants, as it was in experiment 1.

For species 1 and 2, accuracy improved on average by 9·8% after PFB, and 19·8% after AFB. When species 3 was included in the analysis, the overall improvement was diluted (Table 3; Fig. 4). In experiment 2, no statistically significant differences were detected between improvement after feedback and the demographic variables (gender, experience measures, age, education).


Experiment 1 showed that feedback using group average estimates of species abundance improves participants' judgements as much as feedback about the truth, even when averages deviate somewhat from the true values. Usefully, group averages could therefore be used for feedback when obtaining the truth is impractical, time-consuming, or even impossible. Experiment 2 distinguished between an effective and an ineffective feedback format: self-calculating individual feedback about performance, rather than simply seeing outcomes from others, is essential. While we cannot isolate which of the two components of AFB most affected performance (seeing the group average intervals or calculating hit-rates), our results, supported by experiment 1 and previous research (e.g. Benson & Önkal 1992), suggest that the active calibration feedback was key. Taken together, both components of AFB led to a superior procedure compared with PFB.

Studies in statistical cognition reveal widespread confusion about the relationship between confidence level and interval width, or precision (Fidler et al. 2005). The idea that a higher level of confidence is reflected in wider intervals is counter-intuitive to many people, because they think about confidence in an everyday sense, in which people who are more confident are more precise. For example, if someone asks ‘how long does it take to walk to the train station?’, Person 1 might say ‘10–15 min’, and Person 2 might say ‘somewhere between 5 and 20 min’. Person 2 has more reason to be confident that their interval contains the actual time it will take to walk to the station, but Person 1 appears more confident. We contend that evaluating personal calibration is an efficient way to develop better intuition for setting interval widths and confidence appropriately, and to update beliefs about epistemic uncertainty. Results from experiment 2 supported this hypothesis. Both groups received information about the estimates of others in their group. Plausibly, viewing this information in the form of individual intervals (PFB) could illustrate group variability sufficiently to prompt individuals to widen their intervals after feedback, improving their hit-rates. However, only the group that received calibration feedback (AFB) improved their hit-rates in subsequent judgements, consistent with the research of Benson & Önkal (1992) for probability forecasts. Importantly, our feedback procedure did not require a complex statistical understanding of calibration to improve it (González-Vallejo & Bonham 2007).
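The train-station example can be made concrete with a toy simulation. Assuming, purely for illustration, that the true walking time is roughly normal with a mean of 12 min and SD of 3 min, the wide interval is far better calibrated than the narrow one:

```python
import random

def coverage(lo, hi, mean=12.0, sd=3.0, trials=100_000, seed=0):
    """Fraction of simulated walking times that fall inside [lo, hi]."""
    rng = random.Random(seed)
    return sum(lo <= rng.gauss(mean, sd) <= hi for _ in range(trials)) / trials

# Person 1's narrow 10-15 min interval captures the outcome far less often
# than Person 2's wide 5-20 min interval, despite sounding more confident.
print(coverage(10, 15), coverage(5, 20))
```

Under this (assumed) distribution, the narrow interval captures the true time only about 60% of the time, while the wide one captures it nearly always, which is exactly the intuition that calibration feedback is meant to build.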

We believe that simply viewing other people's judgements without highlighting the relationship between individual judgements and the environment (as done in the PFB condition) is akin to providing ‘outcome feedback’ (Brehmer 1980), as distinct from the structured cognitive feedback people received in the AFB condition. We found calibration in the PFB group did not improve after feedback. In fact, it worsened slightly, supporting assertions that simply discovering the answer or outcome can detract from learning under uncertainty (Brehmer 1980; Balzer, Doherty & O'Connor 1989). There are a few possible reasons for this. First, Brehmer (1980) argues that people's biases prevent them from interpreting results objectively. For example, they overweight evidence that confirms what they believed to be the true answer (in this case, they may seek out the intervals that agree with theirs; recall Fig. 2), construct causal narratives that may be inaccurate, and disregard answers that counter their beliefs (see also Kahneman, Slovic & Tversky 1982). In PFB, participants may have insufficiently considered their intervals in the context of other people's intervals, or in the context of other plots. Second, participants receiving PFB were not goal-driven to improve their performance (Latham & Locke 1991) as they were when calculating their hit-rates. A third possibility is that the cognitive load of seeing so many intervals for each plot may have been too great for participants to assimilate and learn from the information (e.g. Sweller 1988). If people are already using a rough averaging strategy when shown individual intervals, showing them the group average interval (the first component of AFB) may be a superior feedback format because it reduces the cognitive load and error.

Implications for management

We have shown that calibration feedback is just as effective when using group averages as true values. As such, group averages could be used for feedback in the absence of true values, which is typical of ecological estimation. This could be especially useful when training and calibrating fieldworkers, where there are likely to be multiple people and judgements. Even the average of two independent estimates tends to outperform a single estimate (Soll & Larrick 2009), so the power of averaging can be seen with even a small number of participants.
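The value of averaging even two judges can be checked with a small Monte Carlo sketch (the error model here is an assumption for illustration: independent, unbiased Gaussian noise around the true value; function names are hypothetical):

```python
import random

def mean_abs_error(n_judges, truth=50.0, sd=10.0, trials=20_000, seed=1):
    """Mean absolute error of the average of n independent noisy estimates
    of a single true value, estimated by simulation."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        estimate = sum(rng.gauss(truth, sd) for _ in range(n_judges)) / n_judges
        total += abs(estimate - truth)
    return total / trials

# Averaging two judges shrinks the expected error by roughly 1/sqrt(2)
# relative to a single judge under this model.
print(mean_abs_error(1), mean_abs_error(2))
```

Real judges are correlated and biased, so gains in practice are smaller than this idealised 1/√n shrinkage, but the qualitative advantage of the two-judge average reported by Soll & Larrick (2009) is consistent with it.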

A second consideration is whether an average from a group of varying expertise is preferable to an expert judgement. Our results show that group averages perform better than the best-performing member of the group over the long run, and in experiment 2, averages were remarkably close to true values. Nonetheless, agencies may be reluctant to use an aggregated judgement over an expert, even though identifying the ‘best expert’ can be a considerable challenge. Typical measures of expertise, such as years of experience, publication record, or our perception of other people's expertise, do not tend to correlate with performance (Burgman et al. 2011). In fact, we found a weak negative correlation between accuracy, hit-rate and coastal vegetation experience in experiment 2. We suggest that improving the judgements of the people you have is more reliable than seeking the best judge.

Future research could explore structured approaches for improving the estimation of lone fieldworkers. Where possible, this would involve intermittent measurement of ‘true values’ to calibrate oneself against. Important considerations would then be: how frequently do we need to recalibrate, and does improved self-calibration on one set of judgements transfer to another (e.g. from one species to another)? We are not aware of past literature on structured self-calibration in ecology, but this would be a useful extension of the present study.


Our results support claims that quantitative estimation in field ecology can vary substantially between observers. Some observer error is acceptable, and indeed, expected. Yet, if conclusions of a field-based study are sensitive to inaccurate judgements, too much error might undermine the recommendations the study underpins. Objective measurement techniques are often unrealistic for a field ecologist with limited time and many plots to cover. In such cases, investing in training and feedback to improve estimation is worthwhile. However, the format for delivering this information matters. People respond positively to active calibration feedback, not simply a passive display of outcomes. Given the constraints to obtaining ‘true values’ to compare with field estimates, feedback about group averages could be a useful proxy for training and improving calibration of field workers, as long as they are actively engaged in an individualised, systematic feedback process.


This research was funded by the Australian Research Council and the Australian Centre of Excellence in Risk Analysis (B.W. and F.F.), the Applied Environmental Decision Analysis group (J.M.) and the Centre of Excellence for Environmental Decisions (P.V.). We gratefully acknowledge Australian Conservation Volunteers, Parks Victoria, assisting facilitators, Victoria Hemming, Kate Giljohann, Chris Jones, Marissa McBride and Prof. Mark Burgman for their support. Suggestions from three anonymous reviewers greatly improved the manuscript. This research was approved by the Human Ethics Committee of The University of Melbourne (Application No. 0709557).