Participants and study area
Thirty-seven volunteers from Australian Conservation Volunteers (ACV) participated in experiment 1. Responses for two participants were incomplete and omitted from the analysis, leaving a final n = 35. Experience varied between participants, but all had some interest in and/or familiarity with alpine vegetation (Table 1). The participant group was gender-balanced, and many had backgrounds in natural sciences (38% of participants described botany, natural resource management or outdoor education as their primary discipline). The study was conducted as part of a weekend survey of an invasive species (Grey Sallow willow, Salix cinerea) throughout alpine bogs in the Bogong High Plains (Australia), in collaboration with Parks Victoria and the Victorian National Parks Association (VNPA).
Table 1. Demographic summary of participants in both experiments^a

| | Female | Age | Years Exp. | Bachelor/Diploma | PhD/Masters | Experience rating^b | | |
|---|---|---|---|---|---|---|---|---|
| | | | | | | Alpine veg | Alpine willow | Plant ID |
| Exp 1 (n = 35) | 54% | 55·6 [11] | 19·9 [13·4] | 49% | 22% | 5·2 [2·5] | 5·8 [3·6] | 4·8 [2·6] |
| | | | | | | Coastal veg | % cover estimation | Plant ID |
| Exp 2 (n = 37) | 65% | 21·5 [1·4] | Third year undergrad | N/A | N/A | 4 [2·2] | 5·8 [1·7] | 5·5 [2] |

Thirty-seven final-year botany students from the University of Melbourne participated in experiment 2. All students had some experience with field surveys from other course field trips, but less experience with the vegetation at the site (Table 1). The study was integrated into an annual field trip to Altona Coastal Park. Although attendance was a course requirement, students were aware that their responses would be de-identified and would not contribute to assessment.
Procedure
In experiment 1, ten circular plots (10 m radius) were marked out with flags. They were split across two willowed areas (five in each) for a counterbalanced 2 × 2 Before-After design. Prior to the experiment, three field ecologists independently surveyed each plot to obtain ‘true values’, counting individuals and measuring heights for allocation into one of four size classes: seedlings (single stem), small shrubs (< 0·5 m), medium shrubs (0·5–1·5 m) and large shrubs (> 1·5 m). Where the three counts differed, their average was taken (the possibility of double-counting prevented us from using the maximum).
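The allocation and averaging steps can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and it assumes that any single-stemmed plant is a seedling regardless of height, which the text does not state explicitly.

```python
from statistics import mean


def size_class(height_m, single_stem):
    """Allocate one willow to a size class using the thresholds in the text.

    Assumption: a single-stemmed plant is a seedling whatever its height.
    """
    if single_stem:
        return "seedling"
    if height_m < 0.5:
        return "small"
    if height_m <= 1.5:
        return "medium"
    return "large"


def true_value(surveyor_counts):
    """Average the independent surveyors' counts for one plot and size class.

    When all counts agree, the mean equals that shared count.
    """
    return mean(surveyor_counts)


# Hypothetical counts of medium shrubs in one plot from three ecologists.
medium_counts = [12, 14, 13]
plot_true_value = true_value(medium_counts)
```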
Participants were randomly allocated into two groups. Group 1 commenced estimations at Area 1 (with two facilitators), and Group 2 simultaneously started at Area 2 (with another two facilitators). Each participant estimated abundance using a 4-point technique (Speirs-Bridge et al. 2010) that is designed to mitigate two of the most pervasive and influential sources of estimation bias: anchoring and overconfidence (Soll & Klayman 2004; Teigen & Jørgensen 2005). The technique elicits an interval in four stages: (1) lowest plausible number of willows (a), (2) highest plausible number of willows (b), (3) best estimate of number of willows (r), (4) confidence that the interval contains the actual number of willows (50–100%) (c). Participants were asked to spend only 1 min on each 4-point estimate, a time constraint imposed to resemble a ‘rapid assessment’. At each plot, participants formulated abundance interval estimates for each of the four willow size classes.
On completion of the first five plots, each of the two groups was randomly split into two sub-groups, which independently received feedback about either (1) the actual abundances (true values) or (2) the average best estimates of the other participants in their group for each of the five plots they had just estimated (Fig. 1). During the feedback session, participants calculated their hit-rates for the two most abundant size classes (small and medium shrubs) by counting how many of their interval estimates contained either the ‘true value’ or the ‘group average estimate’, depending on the treatment group to which they were assigned. Participants then compared their hit-rates with their average confidence, and assessed whether their interval widths (ranges) were appropriate and their levels of confidence warranted. This process will hereafter be called ‘calibration feedback’, and it underpins one of the feedback formats we will test in experiment 2. Groups then switched areas and completed the remaining five plots.
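The hit-rate comparison that participants carried out by hand can be sketched as below. The function names and the example numbers are hypothetical; the reference values stand for either the ‘true values’ or the group average best estimates, depending on the treatment.

```python
def hit_rate(intervals, references):
    """Fraction of (low, high) intervals that contain their reference value."""
    hits = sum(1 for (low, high), ref in zip(intervals, references)
               if low <= ref <= high)
    return hits / len(intervals)


def calibration_gap(intervals, confidences, references):
    """Mean stated confidence minus hit-rate.

    A positive gap suggests overconfidence (intervals too narrow for the
    confidence claimed); a negative gap suggests underconfidence.
    """
    mean_confidence = sum(confidences) / len(confidences)
    return mean_confidence - hit_rate(intervals, references)


# Hypothetical estimates of small shrubs across five plots.
intervals = [(5, 10), (2, 6), (10, 20), (0, 3), (8, 12)]
confidences = [0.8, 0.7, 0.9, 0.8, 0.8]
references = [12, 4, 15, 5, 9]

rate = hit_rate(intervals, references)
gap = calibration_gap(intervals, confidences, references)
```

In this made-up example three of five intervals contain the reference value, so a participant who stated about 80% confidence on average would be advised to widen their intervals.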
The same basic procedure was followed for experiment 2, except the feedback conditions were different. Here, we compared two feedback formats for presenting information about group estimates. Both formats had a graphical component (Fig. 2), and the second format had an additional ‘calibration feedback’ component.
All participants' data for the first five plots were entered into laptops, transformed to an 80% confidence level for consistency using a linear extrapolation, and anonymously displayed back to the groups on butcher's paper, in one of the two formats. The first format graphically displayed other participants' estimates as a series of individual intervals. Facilitators pointed out the variability and some characteristics of different intervals (e.g. ‘the narrow interval here means that this person is relatively confident’). In this condition, participants listened and watched, but there was no active engagement in the feedback session. We call this passive feedback (PFB). We liken PFB to outcome feedback (which is also passive), but they are not strictly the same, as here we provided information about the group estimates, not the true values.
The second format was to display the group estimates as a single group average interval (first component), and in addition, to have the participants calculate and evaluate their own ‘hit-rates’ during the feedback session using the group average best estimate (second component, see ‘calibration feedback’ procedure for experiment 1). We call this active feedback (AFB).
There were some other minor differences in the design of experiment 2 compared to experiment 1. Rather than estimating abundance via counts, participants made rapid estimates of projected foliage cover (percentage cover) of three target species, also using the 4-point interval elicitation technique. The species were sufficiently common that they would occur in most of the plots, but in different abundances. Also, they represented three distinct morphologies: species 1 was a sprawling succulent, Beaded glasswort (Sarcocornia quinqueflora), species 2 was an erect succulent, Shrubby glasswort (Sclerostegia arbuscula), and species 3 was a prostrate shrub, Southern sea-heath (Frankenia pauciflora). Ten plots were split across two areas, as per experiment 1, but they were rectangular quadrats (10 × 3 m). For consistency, we will call them all plots. Prior to the task, participants were taught to identify each of the three target species, and were given printouts with further identification guidelines and specimen photos. Unlike experiment 1, participants did not receive ‘true values’ as part of the feedback session (we only used group averages). However, it was still necessary to obtain ‘true values’ to calculate estimation performance. We obtained these using high-resolution (n = 200) point quadrat sampling (Elzinga et al. 1999), prior to the experiment.
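Percentage cover from point quadrat sampling is the share of sampling points at which a species is intercepted. The sketch below illustrates that calculation. It assumes that the n = 200 in the text is the number of points per plot and that each point records the set of species it touches; the function name and the data are hypothetical.

```python
def percent_cover(points, species):
    """Percentage of sampling points at which `species` was intercepted.

    `points` is a list with one set of species names per sampling point.
    """
    touched = sum(1 for hit in points if species in hit)
    return 100.0 * touched / len(points)


# Hypothetical plot: 200 points, 50 touching Sarcocornia quinqueflora,
# 20 of which also touch Frankenia pauciflora; the rest are bare ground.
points = (
    [{"Sarcocornia quinqueflora", "Frankenia pauciflora"}] * 20
    + [{"Sarcocornia quinqueflora"}] * 30
    + [set()] * 150
)

glasswort_cover = percent_cover(points, "Sarcocornia quinqueflora")
sea_heath_cover = percent_cover(points, "Frankenia pauciflora")
```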
In both experiments, a 20-min introduction to the task instructed participants about the process, the morphological and height distinctions between size classes (experiment 1) and species (experiment 2), and the guidelines for visual measurement and plot inclusion. Participants were instructed not to share or discuss their estimates with each other. After the introduction, participants estimated abundance in two additional plots to control for practice effects; data from these plots were not used. This was to ensure that the ‘before’ performance had reached a stable baseline, so we could be confident that improvement after feedback was due to the feedback intervention itself [as we did not have enough participants for a Before-After-Control-Intervention (BACI) design]. Facilitators of the feedback sessions were trained and used a script to ensure structured, consistent interventions across groups. Participants consented to their data being anonymously used in our research and were debriefed about study findings.
Statistical analysis
To enable comparison of estimates with different confidences, all intervals were adjusted to an 80% confidence level using a simple linear extrapolation (Bedford & Cooke 2001; McBride et al. 2012). Using the participants' elicited lower bound (a), upper bound (b) and best estimate (r), we extrapolated to adjusted lower ($a_{adj}$) and upper ($b_{adj}$) bounds within which 80% of all estimates might be expected to fall, such that,

$$a_{adj} = r - (r - a)\,\frac{c_{adj}}{c} \qquad \text{(eqn 1)}$$

$$b_{adj} = r + (b - r)\,\frac{c_{adj}}{c} \qquad \text{(eqn 2)}$$

where $c_{adj}$ is the required probability level (80%), and c is the participants' stated confidence. We believe the linear extrapolation to be the most sensible approach for these data in terms of minimising assumptions about the participants' underlying distribution, after comparing it with log normal, beta and arcsine transformations.
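A minimal sketch of the adjustment follows. It assumes the usual form of linear extrapolation, in which the distance from the best estimate to each bound is scaled by the ratio of required to stated confidence; the function name is hypothetical and no clipping to the possible response range is applied.

```python
def adjust_interval(a, r, b, c, c_adj=0.80):
    """Linearly rescale an elicited interval to the confidence level c_adj.

    a, r, b: elicited lower bound, best estimate and upper bound.
    c: stated confidence as a proportion (0.5 to 1.0 in this study).
    Assumption: each bound's distance from r is multiplied by c_adj / c.
    """
    scale = c_adj / c
    a_adj = r - (r - a) * scale
    b_adj = r + (b - r) * scale
    return a_adj, b_adj


# A 50% interval of 10 to 40 around a best estimate of 20 widens at 80%.
widened = adjust_interval(a=10, r=20, b=40, c=0.5)

# The same interval stated with 100% confidence narrows at 80%.
narrowed = adjust_interval(a=10, r=20, b=40, c=1.0)
```

An interval already stated at 80% confidence is returned unchanged, because the scaling ratio is then 1.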
Hit-rates (number of intervals per participant that contained the true value) were calculated and compared before and after feedback. Calibration indicates how well the intervals specified by a participant reflect their uncertainty. If the average percentage of hits from a participant's transformed intervals is substantially below 80% (the standardised confidence level), they are considered overconfident and would need to widen their intervals to become better calibrated. If it is greater than 80%, they are underconfident. Accuracy scores were also calculated before and after feedback following the standardisation methods described by Burgman et al. (2011). Accuracy was measured in terms of the distance between each ‘best estimate’ and ‘true value’, averaged over all Before estimates and all After estimates. Scoring rules that measure ‘distance from truth’ usually require standardisation to account for different response units but also different response ranges. All estimates elicited in this task were on the same scale (experiment 1: abundance counts, experiment 2: percentage cover), but the range of responses was relatively narrow when the true value was close to the lowest possible bound (zero) or, in experiment 2, close to the highest possible bound (100%). Conversely, the response range tended to be wide when the true value was more centralised. To ensure performance on each judgement (e.g. small shrubs, Quadrat 1) contributed equally to the overall accuracy measure, we first range-coded the best estimates (r) by each participant for each judgement. That is, we expressed each estimate as,
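$$\tilde{r}_n = \frac{r_n - r_n^{min}}{r_n^{max} - r_n^{min}}$$

where $r_n$ is the participant's best estimate for judgement n, and $r_n^{min}$ and $r_n^{max}$ are the group minimum and maximum for that judgement. The accuracy score was then calculated as,

$$\text{score} = \frac{1}{N}\sum_{n=1}^{N}\left|\log_{10}\!\left(\frac{\tilde{x}_n + 1}{\tilde{r}_n + 1}\right)\right| \qquad \text{(eqn 3)}$$

where N is the number of judgements, $\tilde{r}_n$ is the range-coded estimate and $\tilde{x}_n$ is the observed (true) value, also range-coded by the group minimum and maximum. The error of participants' estimates is measured as a log-ratio, which is not dominated by a single estimate that is far from the truth. As scores approach zero, they indicate greater accuracy. The log-ratio scores for any given question have a maximum possible value of approximately 0·30 (= log10(2)), indicating that the true answer has coincided with the group minimum or maximum (Burgman et al. 2011).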
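The scoring can be sketched as follows. This assumes the log-ratio form attributed to Burgman et al. (2011), with 1 added to both range-coded terms, which is consistent with the stated maximum of log10(2). It also assumes the group range for each judgement spans the true value and has non-zero width. The function names are hypothetical.

```python
import math


def range_code(value, lowest, highest):
    """Rescale a value to [0, 1] using the group minimum and maximum."""
    return (value - lowest) / (highest - lowest)


def average_log_ratio_error(best_estimates, true_values, group_ranges):
    """Mean absolute log10 ratio of range-coded truth to range-coded estimate.

    Each element of group_ranges is a (minimum, maximum) pair for one
    judgement, assumed to include the true value.
    """
    total = 0.0
    for r, x, (lowest, highest) in zip(best_estimates, true_values, group_ranges):
        r_coded = range_code(r, lowest, highest)
        x_coded = range_code(x, lowest, highest)
        total += abs(math.log10((x_coded + 1) / (r_coded + 1)))
    return total / len(best_estimates)


# Worst case for one judgement: estimate at the group minimum while the
# truth sits at the group maximum, giving a score of log10(2).
worst = average_log_ratio_error([0], [10], [(0, 10)])

# A best estimate equal to the truth scores zero.
perfect = average_log_ratio_error([5], [5], [(0, 10)])
```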
Gender and education differences in hit-rate and accuracy improvement were explored using contrasts of means and confidence intervals. Experience measures and age were compared with performance variables using Pearson correlations.