Improving scientific rigour in conservation evaluations and a plea deal for transparency on potential biases

The delivery of rigorous and unbiased evidence on the effects of interventions lies at the heart of the scientific method. Here we examine scientific papers evaluating agri-environment schemes, the principal instrument to mitigate farmland biodiversity declines worldwide. Despite previous warnings about rudimentary study designs in this field, we found that the majority of studies published between 2008 and 2017 still lack robust study designs to strictly evaluate intervention effects. Potential sources of bias that arise from the correlative nature of the data are rarely mentioned, and results are still promoted using causal language. This lack of robust study designs likely results from poor integration of research and policy, while the erroneous use of causal language and an unwillingness to discuss bias may stem from publication pressures. We conclude that scientific reporting and discussion of study limitations in intervention research must improve and propose some practices toward this goal.

Naturally, the value of synthesis research relies on the quality of the underlying evidence. In conservation research, the scarcity of experimental and longitudinal studies (Ferraro, 2009; Ferraro & Pattanayak, 2006) translates into correlative and bias-prone evidence, which is then being fed into systematic reviews and syntheses (Haddaway & Bilotta, 2016). With this in mind, it is important that systematic reviews provide a critical appraisal of internal (bias susceptibility) and external (study relevance) validity of included studies (Collaboration for Environmental Evidence, 2013). However, failing to report limitations complicates such assessments.
Concerns about the lack of disclosure of bias and other limitations in original studies have been expressed previously in the fields of epidemiology and public health sciences (where evidence-synthesis methods originated), with calls for transparent and systematic reporting of study limitations (Puhan et al., 2012; ter Riet et al., 2013). The use of observational methodologies also constrains causal inferences, but misuse of causal claims is still common across disciplines (Cofield, Corona, & Allison, 2010; Robinson, Levin, Thomas, Pituch, & Vaughn, 2007). In this paper, we use the example of agri-environmental schemes (AES) to demonstrate that these problems are also widespread in environmental sciences. AES are the primary policy instruments used to safeguard biodiversity and ecosystem services in agricultural landscapes worldwide, including North America (Stubbs, 2013), Australia (Burns, Zammit, Attwood, & Lindenmayer, 2016), Africa (Kehinde & Samways, 2014), and Asia (Nomura et al., 2013). In Europe, more than €20 billion was spent on such schemes between 2007 and 2013 (Science for Environment Policy, 2017). Considering the importance of the matter, and the high costs involved, well-designed evaluations are central to understanding the mechanisms and impacts of different conservation interventions under diverse agricultural contexts.
What are the caveats that we, as researchers in environmental sciences, must acknowledge and discuss? First, nonrandom patterns of implementation of conservation programs preclude effective evaluation of their success. This situation arises from large-scale conservation programs typically being implemented before dedicated evaluations are outlined. While seldom considered, this is critical when evaluating interventions as it precludes the use of randomized experimental designs and sampling before and after an intervention. Use of designs that allow stronger causal inference, including randomized controlled trials or observational before-after-control-impact (BACI) designs (Figure 1), is therefore often not possible. Instead, researchers are constrained to adopt weaker observational designs (Christie et al., 2019). These study designs, which include control-impact (CI) studies, are highly susceptible to bias from the selection of intervention areas, where selection probability correlates with conditions that themselves affect biodiversity baselines and responses (Ferraro, 2009; Ferraro & Pattanayak, 2006). For example, a conservation action is more likely to be implemented at a location where it is expected to work or where original biodiversity is high. An example of this is the targeting of biodiversity-rich areas for protection and management in conservation planning (Brooks et al., 2006; Eken et al., 2004; Groves et al., 2002; Myers, Mittermeier, Mittermeier, da Fonseca, & Kent, 2000). In agricultural and forested areas, where participation in environmental schemes is often encouraged by financial compensation, this effect may be less obvious. The landowners or managers that are more likely to participate in such incentive schemes may differ from nonparticipants across key variables that, in themselves, are important drivers of biodiversity patterns, such as management intensity, soil fertility, landscape complexity, and microclimate (Gabriel et al., 2009).
Even when methods are used to adjust for any such known differences across sites, important and unknown confounding effects may still be left unaccounted for (Little & Rubin, 2000). These features of conservation programs mean that impact assessments using observational methods are at best uncertain, at worst apparently flawed, especially when there are no data recorded before the intervention occurred.
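The selection problem described above can be illustrated with a small simulation. The sketch below is not taken from any of the reviewed studies; the function name `simulate`, the enrolment rule, and all numeric values (baseline richness, regional decline, effect size) are hypothetical assumptions chosen only to show how an Obs-CI contrast and a BACI contrast can diverge when sites with richer baselines are more likely to enrol.

```python
import math
import random
from statistics import mean


def simulate(n_sites=2000, true_effect=1.0, seed=1):
    """Return (Obs-CI estimate, BACI estimate) for one simulated landscape.

    All parameter values are illustrative assumptions, not empirical data.
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n_sites):
        baseline = rng.gauss(10.0, 2.0)  # richness before the intervention
        # Self-selection: sites with richer baselines are more likely to enrol.
        p_enrol = 1.0 / (1.0 + math.exp(-1.5 * (baseline - 10.0) / 2.0))
        enrolled = rng.random() < p_enrol
        # A regional decline of 0.5 affects every site; only enrolled sites
        # receive the true intervention effect.
        after = baseline - 0.5 + (true_effect if enrolled else 0.0) + rng.gauss(0.0, 0.5)
        (treated if enrolled else control).append((baseline, after))

    # Obs-CI: compare impact and control sites after the intervention only.
    obs_ci = mean(a for _, a in treated) - mean(a for _, a in control)
    # BACI: compare the before-after change at impact sites with that at controls.
    baci = mean(a - b for b, a in treated) - mean(a - b for b, a in control)
    return obs_ci, baci


obs_ci, baci = simulate()
print(f"true effect 1.00 | Obs-CI estimate {obs_ci:.2f} | BACI estimate {baci:.2f}")
```

Under these assumptions the Obs-CI contrast absorbs the baseline difference between enrolled and non-enrolled sites and so overstates the effect, whereas the BACI contrast removes differences that are constant over time. The BACI contrast would still be biased if enrolled and non-enrolled sites followed different trends, which this sketch deliberately does not model.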
The second caveat is the potential misuse of causal language in observational studies. Observational CI studies produce potentially biased data in terms of what is driving observed effects, and the initial selection of "impact" sites is a central problem for making causal inferences (Elwert & Winship, 2014). This raises the question: Is it right to infer causal effects of interventions, whether in primary studies or in reviews, when the underlying data are typically of an observational and bias-prone nature? As mentioned in any book on study design and scientific methods, observational study designs are generally restricted in terms of their capacity for causal inference (Underwood, 1997). The main problem of implying causation from correlative observations is that it may divert attention from the real reasons for any observed effect, promoting false confidence in the drivers of the observed pattern.
More than a decade after the widespread implementation of AES across Europe, Kleijn and Sutherland highlighted the need for improved study design in conservation evaluations in their seminal review on the effectiveness of AES for the conservation of biodiversity published in 2003 (Kleijn & Sutherland, 2003). In a comprehensive search of the scientific literature, they found that inadequate research designs prevented a reliable assessment of measures that had been implemented. Since then, the number of scientific evaluations of AES has grown considerably (see Ansell, Freundenberger, Munro, & Gibbons, 2016) and includes several reviews and meta-analyses. Given the vast extent of these policy instruments (in terms of geographic spread, financial investment, and public interest), quite some trust is placed on how we scientists evaluate these interventions. Allowing a grace period of 5 years for new studies to be carried out since their publication, we examined scientific evaluations of the effects of AES on biodiversity published over the following 10 years (2008-2017) to investigate (i) if more recent evaluations have improved in terms of study design and the extent to which potential limitations associated with selection bias are acknowledged and (ii) the prevalence of causal statements, particularly in studies with observational data. As the benefits of organic farming are regularly debated in scholarly journals (Balmford et al., 2019; Eyhorn et al., 2019) and in news media (Reganold, 2016; Savage, 2015), we were specifically interested in this policy option and therefore chose to separate studies into evaluations of organic farming and other AES, respectively. Such interventions are wide-ranging, but generally include support for extensive farming practices such as low-intensity grazing and management of landscape features of high natural or historical value. While we focus on AES, these concerns are common to environmental policies and their evaluation in other human-impacted environments, including forests (França et al., 2016; Wikberg et al., 2009) and marine systems (de Loma et al., 2008; Osenberg, Shima, Miller, & Stier, 2011).

FIGURE 1 Study designs used to evaluate effects of conservation interventions and ecosystem services. Figure labels: Environmental impact assessment designs; Spatial/temporal replication; Random placement preferred; Intervention; Impacted site; Control site; Before; After.
Before-After, BA: Compares one or several treated sites with sampling before and after intervention. No spatial control.
Before-After-Control-Impact, BACI: Compares impacted and control sites with sampling both before and after implementation. No random assignment of sites to treatments.
Experimental Control-Impact, Exp-CI: Compares impacted and control sites in a randomized experimental design with sampling only after implementation.
Observational Control-Impact, Obs-CI: Compares impacted and control sites with sampling after implementation of interventions. No random assignment of sites to treatments.

METHODS
We searched for original research papers published from 2008 to 2017 in peer-reviewed scientific journals using a predefined search and screening protocol (see the Supporting Information for details). From the 215 resulting studies, we extracted information on (1) intervention type (organic farming or other AES), (2) study design (observational control-impact [Obs-CI], before-after [BA], experimental and randomized control-impact [Exp-CI], BACI; see Figure 1), (3) acknowledgement of, and accounting for, the potential for baseline biases in biodiversity between impact and control sites (paired design, use of covariates, or other ways of reducing baseline biases; Supplementary Appendix), and (4) causal terminology used by authors to describe results. Although the term BACI is reserved for observational studies (Underwood, 1992), we also included in this category two experimental studies that used before-after data, to highlight how rarely data are collected before interventions. We also searched for literature syntheses published during the same time period and collected similar data as for the original studies (n = 22 reviews). For details about data coding, see the Supporting Information. For the "causal statement coding", we searched for sentences containing definitive causal language or hedged versions (e.g., "can", "may") in the title, abstract, and discussion sections. Similarly, we searched the abstract, methods, and discussion sections to determine the rate at which studies reported on study limitations relating to study design and implications for internal validity (for details, see the Supporting Information).
The full set of coded papers and the codes are available online (Supplementary Appendix).
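The distinction between definitive and hedged causal statements can be made concrete with a short sketch. The coding in this study followed the protocol in the Supporting Information; the code below is not that protocol. The word lists, the function name `code_sentence`, and the example sentences are hypothetical and serve only to show the logic of separating the two categories.

```python
import re

# Hypothetical word stems and hedges, for illustration only. The study's actual
# coding rules are described in its Supporting Information.
CAUSAL = re.compile(r"\b(increas|decreas|enhanc|reduc|benefit|promot|improv)\w*", re.IGNORECASE)
HEDGE = re.compile(r"\b(may|might|can|could|appears? to|indicates?|suggests?)\b", re.IGNORECASE)


def code_sentence(sentence):
    """Classify one sentence as 'definitive', 'hedged', or 'none'."""
    if not CAUSAL.search(sentence):
        return "none"  # no causal wording detected
    return "hedged" if HEDGE.search(sentence) else "definitive"


examples = [
    "Organic farming increased bird species richness.",
    "Organic farming may enhance pollinator abundance.",
    "We surveyed 40 farms in 2012.",
]
for sentence in examples:
    print(code_sentence(sentence), "|", sentence)
```

A keyword filter of this kind would at most flag candidate sentences for a human coder; it cannot judge context, negation, or whether a causal claim refers to the study's own results.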

FIGURE 2 Scientific reporting of results and bias-related limitations in (a) primary evaluations and (b) reviews of the value of AES, including organic farming, for biodiversity (axis label: percentage of studies). In observational control-impact studies (Obs-CI), incorrect causal inference was prevailing and study limitations seldom discussed. (c) Studies were distributed globally, but with a concentration of studies to Europe, North America, and Asia. Study designs also included before-after (BA), experimental control-impact (Exp-CI), and before-after-control-impact (BACI) designs.

RESULTS
Of the 215 reviewed studies, 123 evaluated the biodiversity effects of organic farming, while 92 described the effects of other AES measures. A majority (74%) of the evaluations used observational control-impact designs (80% and 67% of organic farming and AES studies, respectively), while only 19% used an experimental control-impact design and 3% a BACI design (2% on observational data, 1% on experimental before-after data) (Figure 2a; Table S1 in the Supporting Information). Of the observational CI studies, only 14% explicitly mentioned the risk of unaccounted initial bias in biodiversity between control and impact sites in the abstract, methods, or discussion sections (6% among organic farming and 27% among other AES studies; Table S2 in the Supporting Information). On the other hand, many observational CI studies did at least use a paired design (48% of the Obs-CI studies; cf. 16% in Kleijn & Sutherland 2003), and/or included covariates in their statistical models (68%) to account for possible effects of landscape and other environmental variables on local biodiversity (Table S3 in the Supporting Information). Three additional Obs-CI studies mentioned that the selection of sites was made to keep environmental variables as similar as possible between control and impact plots. This gives a grand total of 84% of the studies (i.e., combining all possible bias reduction strategies) that potentially reduced or accounted for biases in initial conditions between control and impact sites. Even when bias reduction methods are used, baseline differences may still exist, but this was acknowledged in only 16% of the studies.
Importantly, despite the correlative nature of the data in the Obs-CI studies, definitive (i.e., without hedging) causal wording was common (66%). Hedged causal statements, using words such as "may," "appears to," and "indicates" to soften causal terminology, were used in another 16% of the studies (Figure 2a; Table S4 in the Supporting Information). Here, the use of definitive causal wording was highest for studies having both a paired design and model covariates (77% out of 53 studies), and lowest for those studies not accounting for any possible baseline bias (52% out of 25 studies). Last, studies covered all continents (except Antarctica), but there was a dominance of European studies (Figure 2c). Short study lengths were common (64% of the studies were only 1 year in duration; Figure S1 in the Supporting Information).
Limitations in study design and the dominance of causal language in AES studies also spilled over into reviews and meta-analyses. Only three of 22 reviews (14%; Figure 2b and Table S5 in the Supporting Information) published between 2008 and 2017 mentioned selection bias as a potential source of uncertainty in the interpretation of effects. As these reviews largely cite the same publications as here or in Kleijn and Sutherland (2003), we know that they were generally dominated by observational studies. Still, causal language was highly prevalent also in reviews when discussing any general biodiversity effects of organic farming and AES (65%, 82% including hedged statements; Figure 2b). While some of the reviews mentioned the utility of paired designs when contrasting impacted to control sites or including covariates in analyses, these approaches were generally not discussed in relation to the risk of selection bias but were mentioned in relation to the investigation of landscape dependency of effects.

DISCUSSION
Using the example of AES, our study clearly shows that impact evaluations mostly use bias-prone correlative study designs, while simultaneously failing to fully acknowledge this potential source of bias and erroneously using causal language to convey study findings. It is therefore clear that problems still remain in terms of study design, and that calls from the scientific community for the integration of impact evaluation into environmental policies have not materialised (see, e.g., Baylis et al., 2015; Ferraro, 2009; Fisher et al., 2013). A major obstacle for the development of robust evaluation studies is the lack of researcher influence in the design and implementation stages of conservation interventions (Margoluis, Stem, Salafsky, & Brown, 2009). We are aware that the execution of randomly distributed treatment and control sites is difficult considering logistic constraints and the limited funds available for conservation. It may also be untenable, as it would reduce the delivery of direct common goods, as funding would be needed to pay for randomly assigned controls that deliver no clear benefit. Whether it is more costly in the long-term to fund large-scale experiments evaluating the effectiveness of a full range of AES under different contexts that may have few direct benefits for biodiversity, or on the other hand, to implement poorly evaluated and thus possibly ineffective interventions is, however, debatable.
What are the ways forward to circumvent or solve these problems and deliver scientifically sound impact evaluations? Recent initiatives of collaborative networks including policymakers, farmers, and researchers (Berthet et al., 2018) could open the way for integrating evaluation design into the implementation process. Although the problems of self-selection (vs. randomized selection) may remain, such studies can at least be designed to collect before-after data. Another route to improve evaluation designs where an experimental approach is not feasible is to combine before-after data on impact sites with data on background trends collected from national monitoring schemes and citizen science data (i.e., a BACI design; Underwood, 1992). Including original differences in biodiversity between control and impact sites in the analysis then makes it possible to detect and categorize the effects of an intervention even when they are hidden by a general negative trend in focal species at regional scales (i.e., at scales larger than covered by the study; Bull, Gordon, Law, Suttle, & Milner-Gulland, 2014), or by original differences in biodiversity (Chevalier, Russell, & Knape, 2019). In Box 1, we outline three scenarios of improving evaluations of conservation actions in the future.
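As a rough illustration of combining before-after data with a background trend, the sketch below contrasts the relative change at impact sites with the relative change in a background series. It is a minimal sketch under stated assumptions, not the method of Underwood (1992) or Chevalier et al. (2019): the function names, the use of log ratios (chosen here so that series counted at different spatial scales can be compared), and all counts are hypothetical.

```python
import math
from statistics import mean


def log_ratio(before, after):
    """Log of the relative change in mean count from 'before' to 'after'."""
    return math.log(mean(after) / mean(before))


def background_adjusted_effect(site_before, site_after, bg_before, bg_after):
    """Change at impact sites beyond the change seen in the background series."""
    return log_ratio(site_before, site_after) - log_ratio(bg_before, bg_after)


# Hypothetical counts: impact sites decline slightly, the region declines more.
site_before, site_after = [20, 18, 22], [19, 17, 21]
bg_before, bg_after = [100, 110, 90], [80, 85, 75]

raw = log_ratio(site_before, site_after)
adjusted = background_adjusted_effect(site_before, site_after, bg_before, bg_after)
print(f"raw relative change at impact sites: {math.exp(raw):.3f}")
print(f"change relative to background trend: {math.exp(adjusted):.3f}")
```

In this made-up example the impact sites decline in absolute terms, yet they decline less than the background series, so the adjusted contrast is positive. A real analysis would also need to propagate sampling uncertainty and check that the background series is a credible counterfactual for the impact sites.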
Short of adopting these or other more-or-less causally valid study designs, the scientific community, as well as other users of conservation research, would undoubtedly benefit from an open discussion of the limitations of current evaluation methodologies. Worryingly, our findings suggest that authors are generally either unaware of the limitations related to observational approaches or unwilling to discuss them. Although the use of pair-matching methods or covariates for reducing bias could in part explain why the explicit acknowledgment of selection bias is poor, it does not justify the erroneous use of causal language. It has been suggested that competition among researchers and journals for high-impact publications may foster a culture of neglecting inherent and fundamental flaws related to study design, or of falsely making causal claims, in order to increase the seeming significance of research findings (Cofield, Corona, & Allison, 2010; Lipton & Ødegaard, 2005; Puhan et al., 2012; Robinson et al., 2007). This is something that many of us have, at one time or another, probably been guilty of. A culture of letting study design limitations go unremarked may also be fostered at the interface between applied sciences and policy when policymakers provide funding and expect clear answers to research questions. Similarly, editors of applied journals may ask authors to provide clear directives to practitioners (Robinson et al., 2007; but see Cofield et al., 2010, who found no link between funding source and causal language).
At this point, we want to encourage the multiple actors involved in conservation biology and similar disciplines working with impact evaluations of environmental interventions to improve the scientific rigour with which studies are reported and discussed. While it may seem an intimidating challenge to get authors to openly discuss limitations to their studies, the recognition and discussion of potentially important limitations by authors represent a crucial part of the scientific discourse and will benefit the scientific community and other users of the evidence. Here, a great deal of responsibility lies with the editors of scientific journals to make certain that peer-reviewers also review papers in terms of their internal validity. As an example, research articles in social sciences frequently include a dedicated, and mandatory, limitations section as part of the general discussion. As suggested by Puhan et al. (2012) in the field of biomedicine, we highlight the importance of discussing limitations of impact evaluations more transparently, including different sources of bias and the type of information that would be important to provide for reasons of the scientific method. Further, to increase the legitimacy and quality of systematic reviews, environmental systematic reviews should pay more attention to the internal validity of evidence used, especially relating to unaccounted selection bias. This is something that the research community must do together, in collaboration and with the support of funding agencies, policymakers, and the editors of scientific journals.

Box 1. IMPROVING EVALUATIONS OF AES CONSERVATION ACTIONS
We use the case of organic farming to envision a way toward better evaluations of AES effects on biodiversity. We outline three potential scenarios, starting with the most robust. Implementation of organic farming is usually administered at the national level, with governmental funding supporting the conversion from conventional to organic farming. When farmers apply for financial help to convert, we suggest this should be linked to a governmentally funded before-after (BA) inventory of biodiversity (or other target of interest) at converting farms, and, preferably, at a nearby and otherwise similar conventional farm. Selection of farms for the BA evaluation should be made in close cooperation between the responsible authorities and researchers. All scenarios require tight links between policy makers and practitioners, researchers, and national environmental protection agencies for implementation of inventories.
Scenario 1: BA evaluation among farms that apply to convert to organic farming, with random selection of some farms as "organic" and others as "control." The "organic" farms proceed with the conversion process, while the "control" farms stay conventional for a limited period. Baseline biodiversity data will be gathered before the conversion process, and farms will be resurveyed after a number of years. After the second round of surveys, the control farms can proceed with conversion to organic farming. Scenario 1 ensures an experimental design that minimizes potential biases of self-selection. All applicant farms get similar subsidies for their farm; that is, those assigned to initially remain conventional will be reimbursed for their delay in converting to organic farming.
Scenario 2: BA evaluation among farms that apply to convert to organic farming and at selected existing conventional farms. All selected farms will be subjected to BA inventories at the same time points. This design does not preclude possible biases due to self-selection of organic farming practices, but potential original differences between organic and conventional farms can be handled within a BACI framework (Underwood, 1992) to evaluate the effect at impact sites (see Chevalier et al., 2019).
Scenario 3: BA evaluation among farms that apply to convert to organic farming, no controls. Instead of controls, national monitoring data (standardized inventories at a landscape scale) or opportunistic citizen science data (should such data exist at these localities) can be used as background time series. Although background data and BA-inventory data may be collected at different spatial scales, this approach can still be useful to contrast changes at organic farms against large-scale population changes of species at the landscape level.

AUTHOR CONTRIBUTIONS
TP suggested the subject area, and TP, JJ, and MH designed the data collection protocol and cross-validated the data collected. JJ compiled and summarized all data. JJ, TP, and MH wrote the first drafts of the manuscript. All authors read their share of original papers to be included in the study, discussed the design of the study, and compiled the data to be included. The penultimate version of the manuscript was written by JJ and TP while all other authors commented on that version, and JJ and TP finalized the manuscript. The revised manuscript was written largely by TP and JJ with important contributions by JKn, DA, and AA.