Conservation funders require methods by which to evaluate the relative success of projects within their portfolios. One approach is to develop robust indices of success that are consistent between projects and evaluators. We used three contrasting indices to evaluate factors contributing to success of projects funded by the UK Government's Darwin Initiative. The indices were: Darwin Outputs (related to the Darwin Initiative's own method of evaluating the success of the projects it supports), Impact Assessment (based on the method developed by Mace et al. for evaluating the success of projects supported by zoos) and Ranked Outcomes (a qualitatively ranked outcome index). We evaluated the internal consistency of the indices by comparing the assessments of multiple independent scorers. We assessed their robustness by checking for differences between indices and assessors in the success level assigned to a given project. We then used mixed effects models to analyse the factors contributing to project success, as expressed by each index, and compared the factors highlighted as important by each index. Although there were systematic differences between scorers, relative rankings between scorers were consistent. The indices were in fair agreement as to project success ranks, although the success ranks assigned by the subjective ranked outcome- and output-based indices were more consistent between assessors than those assigned by the impact assessment index. Higher levels of funding led to projects receiving consistently higher success scores. Other variables varied in their importance between indices, although metrics of education were consistently important. This study shows that it is possible to develop robust outcome-based indices of conservation success for comparison of projects within a funder's portfolio, although the nuances picked up by different indices suggest a need for multiple indices to capture different facets of success.
We also highlight the need for thorough testing of the robustness of success indices before widespread adoption.
Continual and independent evaluation of conservation interventions is a prerequisite to ensuring that conservation is appropriately targeted and effective (Saterson et al., 2004; Sutherland et al., 2004). It is often difficult, slow and costly to measure actual changes in biodiversity or human well-being as a result of conservation actions. Many project implementers still do not report outcomes consistently and as a result, there have been few quantitative comparative evaluations of the outcomes of particular conservation approaches (Brooks et al., 2006; Waylen et al., 2010). Funders wishing to evaluate the outcomes of their portfolio of projects require indices of relative conservation success that are straightforward to apply, broad enough to cover the range of activities they fund and robust to subjectivity among evaluators (Mace et al., 2007). Currently there are few such indices available within conservation, despite their potential usefulness as a complement to rigorous in-depth evaluations of the factors affecting the success of individual projects.
Evaluative frameworks such as the ‘Pressure-State-Response’ framework (Tunstall, Hammond & Henniger, 1994) require indicators in order to track and quantify change, where an indicator is a measurable metric used to represent the status of the system being monitored (Yoccoz, Nichols & Boulinier, 2001; Jones et al., 2011). An index is distinguished from an indicator by being a composite of a number of different metrics, which represent the system state in aggregate terms. Calculation of an index involves combining and weighting multiple values of a metric, or metrics, into a single value for use in decision-making (Mace et al., 2007). Indices are criticized for being broad-brush in their approach and there are practical and conceptual difficulties in their application, such as the definition of reference conditions (Martinez-Crego, Alcoverro & Javier, 2010). However, there are a number of conservation evaluation approaches in successful use that rely on indices. For example, ‘Threat Reduction Assessment’ (TRA) appraises the importance of different threats affecting a system and measures the effectiveness of different interventions in reducing those threats by creating a TRA Index (Salafsky & Margoluis, 1999).
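The combining-and-weighting step that defines an index can be sketched as a weighted mean of normalized metrics. This is a minimal illustration only: the metric names and weights below are hypothetical and are not drawn from any of the frameworks cited above.

```python
def composite_index(metrics, weights):
    """Combine several 0-1 metric values into a single index.

    metrics and weights are dicts keyed by metric name; the result is a
    weighted mean, so the index itself also lies on a 0-1 scale.
    """
    total_weight = sum(weights[name] for name in metrics)
    return sum(metrics[name] * weights[name] for name in metrics) / total_weight

# Hypothetical metrics, already normalized to a 0-1 scale
metrics = {"threat_reduction": 0.6, "habitat_condition": 0.3}
weights = {"threat_reduction": 2.0, "habitat_condition": 1.0}
print(composite_index(metrics, weights))  # (1.2 + 0.3) / 3 = 0.5
```

In practice the difficult part is not the arithmetic but choosing the weights and the reference conditions against which each metric is normalized, which is precisely where indices attract criticism.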
One approach to measuring success involves distinguishing outputs from outcomes. Outputs are the amount of some specific activity delivered by a project, and can indicate whether a project has met its specific objectives. Outcomes are the consequences of the project, which are better indicators of long-term success or failure. Outputs can provide a useful starting point in cases where project activities are well focused, as the data are often relatively straightforward to collate and outputs should be achievable within the project time frame. Outcomes (e.g. improved status of target populations) are generally harder to measure, to achieve in the short time frames of project managers, and may not yield meaningful comparisons across sites (Salafsky & Margoluis, 1998).
The Darwin Initiative (DI) was established in 1992 by the British Government to assist countries rich in biodiversity but poor in resources to fulfil their obligations with regard to the Convention on Biological Diversity (CBD, 1992; DEFRA, 2009). It was chosen for this study because of its international reputation, promoting biodiversity conservation and sustainable resource use worldwide, and because it is both long running and well documented, providing the opportunity to compile a substantial dataset of project reports giving details of outputs and outcomes. Confounding variables are reduced as all projects have the same duration (3 years), same broad goal (set by the funder), comparable size, similar implementer backgrounds (UK-based institutions) and work with overseas counterparts. All projects must report against quantitative and consistent measures of inputs and outputs, that is, the Darwin Standard Output Measures (DEFRA, 1996). As a result of the diversity of ecosystems and species the scheme encompasses, and the range of conservation approaches used, the DI also provides an unrivalled variety and scope of conservation practice within a common framework.
The main limitation of the dataset is that information on project outputs and outcomes is self-reported by project leaders (PLs). Brooks et al. (2006) were hampered in their evaluation of conservation outcomes by lack of rigorous and quantitative reporting in the literature they reviewed. Independent evaluation can produce precise and reliable results at the level of individual projects (Gardner et al., 2007), and can be used to compare between projects through the use of indices (Mace et al., 2007). However, the indices in this approach will inevitably have been developed on the basis of a series of assumptions as to what constitutes success. In this study, both methodologies are combined by undertaking an analysis of project success that is independent of the funder or implementer, but which is based on self-reporting through DI PLs' final reports. This is similar to the approach used by Mace et al. (2007) and Salafsky & Margoluis (1999).
In this study, we used the DI project database to evaluate the potential of three indices of conservation success (one based on outputs, the other two on outcomes) to provide consistent and internally robust measures of relative success between projects. We: (1) evaluated indices' internal consistency and robustness through a comparison of the scores of multiple assessors; (2) evaluated the consistency between assessors and indices in their rating of projects' success; and (3) developed multivariate models with the different indices as dependent variables and a range of potential explanatory variables, in order to compare the determinants of project success, as measured by the different indices. Based on these analyses, we evaluated the potential of the three indices as tools for the rapid assessment of project success within a funding portfolio such as that of the DI.
The project dataset
Since its establishment in 1992, the DI has funded 674 projects in 143 countries, working with 213 UK and 862 host country organizations. Up to 2009, the scheme had invested £72 602 461, on average £110 000 per project (DEFRA, 2009). Permission to undertake the study was granted by DEFRA in January 2007.
All PLs are required by contract to complete a final report (based around a standard template) to the DI at the end of their project; these formed the basis of the database. DI funding tended to be received at either the beginning or middle of longer-term projects, or formed the entirety of shorter projects, introducing potential bias depending on which of these types a project was, which was not always easy to determine. However, collecting data on this scale is difficult and the DI currently provides one of the best available options for carrying out a global study. We explored standard 3-year projects starting between 1997 and 2004, a period with relative stasis in the aims of the DI, and for which the reporting process was complete. We approached PLs directly for permission to use their data and received 230 positive responses (66% of the total). A subset of 100 projects was chosen at random for detailed analysis, representing the minimum sample size to ensure adequate power. As a rule of thumb, we used power as defined by Kirk (1995), whereby the appropriate sample size is dictated by the strata one is trying to distinguish between. As some PLs did not reply, there is an unavoidable potential for bias in that all 100 projects came from PLs who were prepared to engage with the study. It is also possible that PLs of successful projects were more likely to respond; however, the personal opinion of a PL with respect to the success (or failure) of their project will not necessarily be the same as independent reviewers, and therefore projects deemed successful by their PLs may not be ranked as such by external reviewers. PLs on expensive projects might also feel under greater pressure to put a positive spin on their results, but utilizing a suite of indices coupled with independent evaluation should limit this bias.
In order to gain a greater insight into how ‘conservation success’ is perceived by practitioners, and as part of index development, we carried out interviews with 10 PLs chosen for their expertise in carrying out DI projects (each had completed at least two projects). PLs came from a range of UK organizations including non-governmental organizations, universities, museums and botanical gardens, and had carried out a range of project types from research to species management and alternative livelihoods. PLs were asked a series of qualitative questions on ‘conservation success’ (Supporting Information Appendix S1). Caroline Howe (C. H.) carried out the interviews between October and December 2008.
Indices of conservation success
Three indices of success were developed: ‘Darwin Outputs’ based on a subjective scoring of the standard outputs provided in DI final reports (DEFRA, 1996); ‘Impact Assessment’ based on a method used to explore the success of conservation projects run by zoos (Mace et al., 2007); and ‘Ranked Outcomes’, created by ranking project outcomes stated in the text of DI final reports.
The Darwin Outputs index relates to the DI's own method of evaluating the success of the projects it supports. An output is defined as the amount of some specific activity delivered by a project (e.g. number of field stations built). Some Darwin outputs are unambiguously measures of input (e.g. number of expatriate weeks in-country); others are ambiguous in status, and we classified these as inputs in order to avoid endogeneity in our models. For example, we were interested in quantifying the relationship between the quantity of education and project success. Hence, training and dissemination outputs were classified as inputs (Table 1). Many DI outputs were not straightforward to compare between projects because different levels of outputs were envisaged in projects with different focuses (e.g. the number of conferences or workshops attended). In these cases, we converted the raw numbers into scores on the basis of the distribution of values for that output, to ensure none had a disproportionate influence on the index. We based our scoring on opinions concerning the relative scaling of each output expressed by DI PLs during interviews. We combined the scores for each output to give a project-level index of success (Supporting Information Appendix S1).
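The conversion from raw output counts to distribution-based scores can be sketched as quartile binning. This is a simplified illustration: the actual score bands were informed by the PL interviews, and the output name and counts below are hypothetical.

```python
def quartile_cuts(values):
    """Quartile cut points of one output's distribution across all projects."""
    s = sorted(values)
    n = len(s)
    return [s[n // 4], s[n // 2], s[(3 * n) // 4]]

def quartile_score(value, cuts):
    """Score 1-4 depending on which quartile band the raw count falls in."""
    return 1 + sum(value > c for c in cuts)

# Hypothetical raw counts of one output (e.g. workshops run) for eight projects
workshops = [1, 2, 3, 4, 5, 6, 7, 8]
cuts = quartile_cuts(workshops)                       # [3, 5, 7]
scores = [quartile_score(w, cuts) for w in workshops]
print(scores)                                         # [1, 1, 1, 2, 2, 3, 3, 4]
```

A project's overall index is then the sum of its scores across all outputs, so an output with a heavy-tailed distribution (e.g. leaflets printed) cannot swamp the total.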
Table 1. Details of explanatory variables used in final models. See supplementary material for further information on variable calculation
aCalculation refers to how the variable was defined or determined.
bN = nominal; O = ordinal.
cThe Consumer Price Index was used to convert prices to 2008 pounds sterling in order to correct for inflation http://www.statistics.gov.uk/statbase. Although many of the costs were incurred in-country and therefore affected by exchange rates and varying purchasing power between countries, the questions being answered concerned expenditure from the point of view of the funder (Darwin Initiative) and therefore the cost in £ sterling was the appropriate metric.
The Mace et al. (2007) method for Impact Assessment is neither species nor project specific, and can be carried out post hoc. This index is outcome based, where an outcome is defined as the consequence of a project. The overall score for a project is a function of the project's importance (the project target's conservation status); volume or scale; and effect (level of success in terms of meeting its objectives). The Mace et al. (2007) method provides a ‘score’ chart and this was used to create an overall index for each of the 100 projects.
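The way such a score chart combines its components can be sketched as follows. The multiplicative rule and the 1-5 scales here are illustrative assumptions only, not the actual Mace et al. (2007) chart, which should be consulted for the real scoring rules.

```python
def impact_score(importance, volume, effect):
    """Combine three component scores into an overall project score.

    Each component is assumed (hypothetically) to be an integer on a 1-5 scale:
      importance -- conservation status of the project target
      volume     -- scale of the project
      effect     -- level of success in meeting its objectives
    """
    for name, value in (("importance", importance), ("volume", volume), ("effect", effect)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be on a 1-5 scale, got {value}")
    return importance * volume * effect

print(impact_score(3, 2, 4))  # 24
```

A multiplicative combination means a project scoring zero benefit on any one axis cannot be rescued by the others, which is one plausible design choice for such a chart; an additive chart would trade off the components instead.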
We developed a Ranked Outcomes index that aims to capture the textual statements that PLs are encouraged to make in their final reports, concerning the outcomes that their projects have achieved. Hence, this index captures additional, more qualitative information that can potentially illuminate the PL's view of the true project legacy. Statements referring to both positive and negative outcomes were extracted from the final reports, and were categorized into education and training, research and infrastructure, species and habitat, and legacy outcomes. Statements were ranked by C. H. according to their importance for conservation success (Supporting Information Appendix S1). Ranking was carried out within but not between categories, as it was felt that it was impossible to meaningfully compare the importance of outcomes in different categories. Each project was then given a score on the basis of the summed rankings from all the categories.
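The scoring step can be sketched as a lookup of within-category ranks. The categories follow the text above, but the individual outcome statements and rank values are hypothetical examples; in the study, the ranks came from C. H.'s ranking exercise.

```python
# Ranks are assigned within each category only (higher = more important);
# the outcome statements and rank values here are hypothetical examples.
OUTCOME_RANKS = {
    "education_training": {"national curriculum adopted": 3,
                           "workshop series delivered": 2,
                           "leaflets distributed": 1},
    "species_habitat": {"population increase recorded": 3,
                        "habitat restored": 2,
                        "baseline survey completed": 1},
}

def ranked_outcomes_score(achieved):
    """Sum the within-category ranks of the outcomes a project achieved.

    achieved is a list of (category, outcome-statement) pairs extracted
    from a project's final report.
    """
    return sum(OUTCOME_RANKS[category][outcome] for category, outcome in achieved)

project = [("education_training", "workshop series delivered"),
           ("species_habitat", "population increase recorded")]
print(ranked_outcomes_score(project))  # 2 + 3 = 5
```

Because ranks are never compared across categories, a project's total rewards breadth of outcomes as well as depth within any one category.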
The Darwin Outputs index was not validated as the method is mechanistic enough that differences between scorers are likely to be due to mistakes rather than differences in interpretation. In order to validate the Impact Assessment index, seven students from Imperial College's MSc in Conservation Science were given a short workshop and then asked to score 10 projects each. The scores were compared with C. H.'s score using graphical methods (plotting and inspecting by eye) and Spearman's rank correlation. There is potential bias in terms of non-independence of outlook; however, this mimics the common situation in which members of a given conservation organization evaluate projects, with the same general outlook but different individual opinions.
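The rank correlation used in this validation can be computed as below: a plain-Python sketch using average ranks for ties, equivalent to the standard Spearman formulation (in the study itself a statistics package in R would serve the same purpose).

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(a, b):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

print(spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # 0.8
```

Because it operates on ranks rather than raw scores, this statistic is insensitive to one assessor marking systematically higher or lower than another, which is exactly the property needed when comparing scorers with different levels of generosity.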
For the Ranked Outcomes index, five conservation professionals and one professional in a related field (pest management), based at Imperial College London, were asked independently to rank the statements in the same way as C. H. Their overall scores for each project were compared with that of C. H.'s with a combination of graphical methods and the kappa statistic. There are differences of opinion in the literature as to what constitutes a reasonable level of agreement for a kappa statistic. Here, interpretation was on the basis of the usage of kappa in the medical field (McGinn et al., 2004; Viera & Garrett, 2005). Once again, the choice of experts may have potentially affected the results obtained, as they came from a relatively narrow pool; however again, this is not unrepresentative of the reality of conservation evaluation.
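The kappa statistic used here, Cohen's kappa, can be computed directly from two lists of categorical scores; the plain-Python sketch below shows the chance-correction that distinguishes it from raw percentage agreement. The agreement bands quoted above are those from the cited medical literature.

```python
def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' categorical scores."""
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    # Observed proportion of projects on which the raters agree
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal proportions
    expected = sum((rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories)
    return (observed - expected) / (1 - expected)

a = ["high", "high", "low", "low"]
b = ["high", "low", "low", "low"]
print(cohens_kappa(a, b))  # 0.5
```

In this example the raters agree on 75% of projects, but half of that agreement is expected by chance given how often each rater uses each category, so kappa is only 0.5 ('moderate' under the bands cited above).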
The three success indices were compared using the kappa statistic and Spearman's rank correlations.
A preliminary database was created, with 46 explanatory variables that might explain relative project success. Spearman's rank correlation, Mann–Whitney and chi-squared tests were used to remove variables that lacked substantive explanatory power, had too many missing data points, or duplicated information that another variable provided more reliably. The final database consisted of 15 explanatory variables, divided into ‘project background’, ‘project type’ and ‘project resource’ (Table 1).
The three indices were modeled against the explanatory variables to explore factors predicting project success and to elucidate the differences between the indices. As all projects included some educational elements, we included educational inputs (broadly defined to include formal education, training, dissemination and community engagement) into all the models. There are issues with using quantity of education as an explanatory variable for project success, as quantity does not necessarily imply quality if educational programs are badly targeted. However, accounting for targeting and quality was beyond the scope of an approach such as this. Training PhD or MSc students is expensive and there may have been a correlation between level of funding and number of students/quantity of education; however, this was accounted for by testing for inter-variable correlation before developing the models, and no such correlation was found.
In order to control for the effect of project initiation date, linear mixed effects (LMEs) models were used, with date as a random effect. Coding projects according to the Impact Assessment method required separating projects according to their target in order not to compare incommensurable projects (Mace et al., 2007), consequently project target was considered as a random effect for Impact Assessment.
Explanatory variables for each index were chosen from the 15 variables using tree models (Crawley, 2007). Two-way interactions, which a priori could be of interest, were included, as well as the squares and cubes of explanatory variables as necessary. Stepwise deletion was carried out, with variables with the largest P values and interactions removed first. Main effects were retained if they were involved in significant interactions. After each variable removal, the model was checked with an analysis of variance (anova) or F-test (where overdispersion occurred) to assess the significance of the subsequent increase in deviance (Crawley, 2007). Fixed effects were analysed with maximum likelihood and random effects with restricted maximum likelihood. When the random effect explained little or no variation, a generalized linear model was tested against the simplified LME with anova and accepted as the minimum adequate model if there was no significant difference between the two models (Crawley, 2007). Residuals versus fitted values plots were used for informal exploration and the Breusch–Pagan test was used to test for heteroscedasticity. R: A Language and Environment for Statistical Computing was used for all statistical analyses (R Foundation for Statistical Computing, 2007).
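The backward-deletion loop described above can be sketched independently of the fitting machinery. This is a control-flow illustration only: model fitting is abstracted to a function returning residual deviance, a fixed F criterion stands in for the anova comparisons, and the retention of main effects involved in significant interactions is omitted. The term names and deviance values in the toy example are hypothetical.

```python
def f_statistic(rss_reduced, rss_full, df_dropped, df_resid_full):
    """Nested-model F statistic for the increase in residual deviance."""
    return ((rss_reduced - rss_full) / df_dropped) / (rss_full / df_resid_full)

def backward_eliminate(terms, fit, f_crit=4.0):
    """Repeatedly drop the least significant term until all survive f_crit.

    fit(terms) must return (residual_sum_of_squares, residual_df) for a
    model containing those terms.
    """
    terms = list(terms)
    while len(terms) > 1:
        rss_full, df_full = fit(terms)
        f_by_term = {t: f_statistic(fit([u for u in terms if u != t])[0],
                                    rss_full, 1, df_full)
                     for t in terms}
        weakest = min(f_by_term, key=f_by_term.get)
        if f_by_term[weakest] >= f_crit:
            break  # every remaining term significantly increases deviance if dropped
        terms.remove(weakest)
    return terms

# Toy fits: precomputed (RSS, residual df) for each candidate model
RSS = {
    frozenset({"funding", "education", "noise"}): (10.0, 16),
    frozenset({"funding", "education"}): (10.5, 17),
    frozenset({"funding", "noise"}): (30.0, 17),
    frozenset({"education", "noise"}): (50.0, 17),
    frozenset({"funding"}): (31.0, 18),
    frozenset({"education"}): (52.0, 18),
}
kept = backward_eliminate(["funding", "education", "noise"], lambda ts: RSS[frozenset(ts)])
print(kept)  # ['funding', 'education']
```

In the toy example, dropping "noise" barely increases the residual deviance (F ≈ 0.8), so it is deleted, while dropping either remaining term inflates the deviance well past the criterion, so the loop stops with both retained.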
Projects were ordered from low to high levels of success on the basis of C. H.'s scoring (assessor A). Scores provided by other assessors (B–H) were plotted against A (Fig. 1a). The results indicate general agreement, with all scores following a similar trend (ρ = 0.825, P ≤ 0.001, n = 50). However, in general assessors B–H marked lower than A (A had direct experience of DI projects and therefore understanding of the relative difficulty in achieving different outcomes). Given this consistency in trend, A's scores were taken as the dependent variable.
Projects were ordered from low to high levels of success on the basis of A's scores (Fig. 1b). Kappa tests indicated ‘fair’ (0.21 ≤ k ≤ 0.40) to ‘substantial’ (0.61 ≤ k ≤ 0.80) agreement between A and other assessors (McGinn et al., 2004; Viera & Garrett, 2005). Considering the range of outcomes that assessors had to rank and the influence of personal opinion on what constitutes conservation success, this was considered reasonable given that the observed trend was consistent. The graph indicates a difference between E and the other assessors, with E generally being a more generous marker. E had a background in pest management in developing countries rather than conservation. The validation exercise was considered as adequate evidence to allow A's scores to be used as the Ranked Outcomes index.
Comparison of indices of success
The difference between assessors in the ranks assigned to a given project was relatively consistent for Ranked Outcomes, at around 1–15%, with 80% of projects having a difference of ≤20%. However, the Impact Assessment index was less consistent, at 10–40% difference between assessors (Fig. 2a). There was a ‘fair’ agreement between the three indices, with the marks from the same assessor for the same project, but using different indices, differing by ≤30% for 50% of the projects (Table 2). However, the Impact Assessment was less robust than the other two indices: there was ≤20% difference between the Ranked Outcomes and Darwin Outputs for 34% of projects, but the difference was ≤20% for only 16% of projects when comparing Impact Assessment and Darwin Outputs, and for 22% for Impact Assessment and Ranked Outcomes (Fig. 2b). However, it is not possible to separate out the effect of genuine differences between the indices in the aspects of project success that they capture from the potential effects of reviewer inconsistency.
Table 2. Cross-comparison of indices of success. Kappa statistic [agreement levels taken from McGinn et al. (2004) and Viera & Garrett (2005)] and Spearman rank correlations (n = 100)
Factors influencing project success
For all three indices, amount of DI funding provided was positively correlated with success (Table 3), but for Impact Assessment the significance was lower (0.05 < P < 0.1). For Darwin Outputs and Impact Assessment, additional non-DI funding was also positively correlated with success at a lower significance (0.05 < P < 0.1); however, for Ranked Outcomes higher levels of external funding appear to have had a negative effect. The number of weeks spent by the UK PL and the number of conservation actions undertaken had a positive influence on Darwin Outputs and Impact Assessment, respectively. For Darwin Outputs, where external funding was < £24 999, the number of weeks spent in-country had a positive influence on success. However, this relationship did not hold at higher levels of funding. Educational variables were significant predictors for all three indices, but the effect varied by index. Quantity of education, but not educational type, was significant for Darwin Outputs and Impact Assessment, while type was important for Ranked Outcomes (through an interaction with external funding). For Impact Assessment, the quantity of education provided interacted with the Human Development Index (HDI) of the host country. In countries with low (< 0.600) and high (0.800+) HDIs, quantity of education was positively related to project success, while for mid-development countries there was no effect.
Table 3. Minimum adequate models for indices of conservation success
95% confidence interval
aModel for Darwin Outputs fitted with a generalized linear model with quasipoisson errors.
bModel for Impact Assessment fitted with a linear mixed effects (LME) model with Gaussian errors. The random effect of project type explains 10.17% of variation.
cModel for Ranked Outcomes fitted with an LME model with Gaussian errors. The random effect of date explains 5.94% of the variance.
dThe type of education provided is a nominal factor. All other factors are continuous.
One of the most important properties of a reliable and consistent index of conservation success is that a project's rank does not vary wildly between assessors. The Ranked Outcomes index was much more likely to consistently produce a similar rank for a given project than the Impact Assessment index, which is perhaps surprising given the Ranked Outcome's reliance on subjective judgments about textual material, compared with the Impact Assessment's simpler and more mechanistic method. Potentially, the Impact Assessment is too broad-brush to capture the nuanced differences that are necessary for consistent ranking. Both the Impact Assessment and Ranked Outcomes methods illustrated systematic biases between assessors: for Impact Assessment, less experienced assessors marked more harshly, whereas for Ranked Outcomes, a specialist in a related field marked differently to conservation specialists. These observed differences were drawn from a small sample size and it is difficult to draw a strong conclusion; however, as conservationists need to be accountable for their investments, these potential differences in opinion as to what constitutes conservation success should be borne in mind.
The models highlighted similar variables contributing to success, regardless of whether the index was output or outcome based. The consistencies between the indices indicate the strong contribution made by certain factors, such as funding levels, to conservation success. However, the differences between indices highlight the more nuanced aspects of success that they capture: Darwin Outputs is strongly affected by PL effort, as proxied by the number of weeks spent in-country and amount of education provided. The relationship between education and HDI in affecting success as measured by the Impact Assessment requires further exploration which may involve the collection of data regarding other confounding factors that influence success in mid-development countries. The Ranked Outcomes index, being more subjectively based, was less well predicted by the explanatory variables, but highlighted the importance of employing a range of educational types; this potentially reflects the need for projects to engage a range of different stakeholders.
Subjective indices such as Ranked Outcomes may be difficult to develop, as a rank for each outcome must be agreed. However, once developed, our results suggest that the Ranked Outcomes index performs relatively reliably, with comparatively low levels of disagreement between assessors. The Impact Assessment index retains a degree of subjectivity each time it is applied, and the results bear this out, with more disagreement between assessors. Utilizing a suite of indices rather than a single index may ensure that nuances between different project types are properly captured, allowing for a more robust comparative analysis of project success. The Ranked Outcomes method, although promising, is still only at the pilot stage; consequently, further exploration of how different assessors evaluate success, and how this affects the index, is required. The outcomes used in the Ranked Outcomes ranking exercise were drawn exclusively from DI projects and this list could be expanded by drawing on other projects not funded by the DI and/or through running an expert workshop. Likewise, the ranking and validation exercises were performed by only a small group; by scaling up and including a range of professionals from different fields, a list of outcomes and associated ranks could be developed that could then be applied independently to a number of different organizations and projects. This list of outcomes may also be useful as a means to better define project objectives and to collate more informative data.
Despite the success indices and explanatory variables encompassing the full range of project activities, the level of education in a project, both in terms of quantity and quality, was a particularly strong predictor of its success. Education is one of a number of tools that contribute to the overall outcome of a conservation project. However, our analysis suggests that it may be a particularly potent tool in the conservationist's toolbox. An unexpected result was that flagship species was not a significant predictor of project success. However, this may be because the flagships had not been correctly identified. A recent paper by Verissimo, MacMillan & Smith (2011) proposes that conservationists should specify the purpose of a campaign before working with the potential target audience to identify the most suitable species. This would be an interesting line of further enquiry.
Carrying out a comparative evaluation of this scale is fraught with complexities and requires many assumptions. The evaluation was based on DI final reports, which give a static insight; further data are required to assess conservation success over time. Not all PL reports were completed to the same high standard of rigour and depth. However, by utilizing self-reports, a huge range of conservation projects can be evaluated and the DI currently provides one of the best available opportunities for studying the problem of how to evaluate conservation success at the global scale. Independent evaluation, utilizing established and novel indices of success, may have potential to provide relatively objective and consistent interpretation of the PLs' self-evaluations. It is important to remember Goodhart's Law throughout, which states that once an indicator is made into a policy target it will lose the information content that qualifies it to play its role as an indicator. The use of targets in conservation policy and the associated development of indicators or indices should therefore be undertaken with caution to ensure that the information they provide is objective and reliable (Newton, 2011).
Throughout this study, we ran into a number of obstacles related to the reporting of the DI, and therefore we propose a number of suggestions that we feel would aid the future evaluation of the programme. A reporting framework that included the variables used in the analysis would allow for rapid monitoring of the success of the initiative at regular intervals. This would be further improved by the addition of other variables that we were not able to test for given data limitations, including the habitat, religion and development level of the focal region (in contrast to national level), the background level of awareness or knowledge of conservation issues, and the reporting of actions and threats in line with IUCN guidelines (IUCN-CMP, 2006a,b). Coupled with this is the need for more rigorous and structured reporting. At the end of each report, PLs are asked whether or not they feel the project has been a success. On its own, this question does not provide much meaningful information; however, if this were expanded to include how PLs define success, it would provide more information on how PLs define outcomes rather than outputs and aid in the further development of the Ranked Outcomes index. Finally, PLs need to be provided with a clear definition and understanding of the difference between outputs and outcomes, a confusion perhaps resulting from the Darwin outputs reporting framework, which may encourage the belief that achievement of outputs is equivalent to conservation success. Implementation of these suggestions would also improve the reliability and strength of the Darwin Outputs as a performance index. Although these suggestions have been provided with the DI in mind, the DI is not alone in its aim to ensure that its investments are both effective and long lasting, and therefore these proposals are relevant to other organizations and projects with similar remits to the DI.
We suggest, as do Mace et al. (2007), Salafsky & Margoluis (1999) and Brooks et al. (2006), that it is both vital and possible to develop methods for evaluating conservation success using a range of criteria. Our Ranked Outcomes index was surprisingly robust and correlated well with the more quantitative output-based index, suggesting it is a worthy candidate for further investigation. This study suggests that it is possible for conservation funders to develop indices that are broadly useful in evaluating the relative success of projects within their portfolios. Although detailed evaluations on a project-by-project basis are required in order to capture the nuances of influences on project success, there is a place for these broad indices that, with further validation and development, could become useful additions to the conservation evaluation toolbox.
This study was financially supported by ESRC and made possible by the assistance of the UK Government's Darwin Initiative and the Edinburgh Centre for Tropical Forests (ECTF). E.J.M-G. acknowledges the support of a Royal Society Wolfson Research Merit award. We are indebted to the 2008–2009 Imperial College MSc in Conservation Science group and to Dr J. Knight, Dr A. Kuhl, Ms A. Wallace, Dr A. Keane, Dr M. Sommerville and Dr G. Wallace. We gratefully acknowledge the advice of Dr C. Clubbe, Ms J. Willison, Prof. M. Bruford, Dr D. Minter, Dr J. Mair, Dr N. Maxted, Dr S. Tilling, Mr B. Press, Dr P. Donald and Mr B. Cooper.