Within-study comparisons and risk of bias in international development: Systematic review and critical appraisal

Executive Summary

Background: Many systematic reviews incorporate nonrandomised studies of effects, sometimes called quasi-experiments or natural experiments. However, the extent to which nonrandomised studies produce unbiased effect estimates is unclear, both in expectation and in practice. The usual way that systematic reviews quantify bias is through "risk of bias assessment" and indirect comparison of findings across studies using meta-analysis. A more direct, practical way to quantify the bias in nonrandomised studies is through "internal replication research", which compares the findings from nonrandomised studies with estimates from a benchmark randomised controlled trial conducted in the same population. Despite the existence of many risk of bias tools, none are conceptualised to assess comprehensively nonrandomised approaches with selection on unobservables, such as regression discontinuity designs (RDDs). The few that are conceptualised with these studies in mind do not draw on the extensive literature on internal replications (within-study comparisons) of randomised trials.

Objectives: Our research objectives were as follows.
Objective 1: to undertake a systematic review of nonrandomised internal study replications of international development interventions.
Objective 2: to develop a risk of bias tool for RDDs, an increasingly common method used in social and economic programme evaluation.

Methods: We used the following methods to achieve our objectives.
Objective 1: we searched systematically for nonrandomised internal study replications of benchmark randomised experiments of social and economic interventions in low- and middle-income countries (L&MICs). We assessed the risk of bias in benchmark randomised experiments and synthesised evidence on the relative bias effect sizes produced by benchmark and nonrandomised comparison arms.
Objective 2: we used document review and expert consultation to develop further a risk of bias tool for quasi-experimental studies of interventions (ROBINS-I) for RDDs.

Results:
Objective 1: we located 10 nonrandomised internal study replications of randomised trials in L&MICs, six of which are of RDDs; the remainder use a combination of statistical matching and regression techniques. We found that benchmark experiments used in internal replications in international development are in the main well-conducted, but have "some concerns" about threats to validity, usually arising from the methods of outcomes data collection. Most internal replication studies report a range of different specifications for both the benchmark estimate and the nonrandomised replication estimate. We extracted and standardised 604 bias coefficient effect sizes from these studies, and present average results narratively.
Objective 2: RDDs are characterised by prospective assignment of participants based on a threshold variable. Our review of the literature indicated there are two main types of RDD. The most common type is designed retrospectively, in which the researcher identifies post hoc the relationship between outcomes and a threshold variable which determined assignment to the intervention at pretest. These designs usually draw on routine data collection such as administrative records or household surveys. The other, less common, type is a prospective design in which the researcher is also involved in allocating participants to treatment groups from the outset. We developed a risk of bias tool for RDDs.
Conclusions: Internal study replications provide the grounds on which bias assessment tools can be evidenced. We conclude that existing risk of bias tools need to be further developed for use by Campbell Collaboration authors, and there is a wide range of risk of bias tools and internal study replications to draw on in better designing these tools. We have suggested the development of a promising approach for RDDs. Further work is needed on common methodologies in programme evaluation, for example on statistical matching approaches. We also highlight that broader efforts to identify all existing internal replication studies should consider more specialised systematic search strategies within particular literatures, so as to overcome the lack of systematic indexing of this evidence.


| INTRODUCTION
Many systematic reviews include studies that use nonrandomised causal inference, hereafter called nonrandomised studies, and sometimes called quasi-experiments (QEs; e.g., Shadish, Cook, & Campbell, 2002) or natural experiments (Dunning, 2012). 1 For example, Konnerup and Kongsted (2012) found that half of the systematic reviews published in the Campbell Library up to 2012 included nonrandomised studies.
The inclusion of nonrandomised studies in Campbell reviews is increasing: 81% of reviews published between 2012 and 2018 included such studies. The inclusion of nonrandomised studies in reviews is justified by the lack of randomised study evidence for specific interventions, for example where randomisation is not considered feasible (Wilson, Gill, Olaghere, & McClure, 2016) or ethical (e.g., for mortality outcomes), or to improve external validity, such as in measuring long-term effects (Welch et al., 2016). 2 Occasionally it is stated that these studies might produce unbiased estimates (e.g., De La Rue, Polanin, Espelage, & Piggot, 2014). 3 However, it is not clear whether nonrandomised studies typically produce treatment effect estimates comparable to the unbiased estimates produced by well-conducted randomised controlled trials (RCTs), either in expectation or in practice.
There are two main types of study for quantitative causal inference (Imbens & Wooldridge, 2009): (a) Those which account for unobservable confounding by design, either through knowledge about the method of allocation or in the methods of analysis used, referred to as "selection on unobservables". These include RCTs and nonrandomised approaches such as difference studies (e.g., the difference in differences and fixed effects analysis), instrumental variables (IVs) estimation, interrupted time series (ITS) and regression discontinuity designs (RDDs).
(b) Those with selection on observables only, including nonrandomised studies that control directly for confounding in adjusted analysis (e.g., statistical matching, analysis of covariance (ANCOVA), multivariate regression).
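The practical difference between these two families can be illustrated with a small simulation, shown below. This sketch is illustrative only: the variable names and data-generating process are our own assumptions, not drawn from any study reviewed here. A difference-in-differences estimator, which exploits knowledge of the assignment process to difference out a time-invariant unobservable, recovers the treatment effect that a cross-sectional regression adjusting only for observables does not.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000
u = rng.normal(size=n)                       # unobserved confounder (time-invariant)
x = rng.normal(size=n)                       # observed covariate
d = (0.8 * u + 0.8 * x + rng.normal(size=n) > 0).astype(float)  # selection on u and x

tau = 1.0                                    # true treatment effect
y_pre = 2.0 * u + x + rng.normal(size=n)
y_post = 2.0 * u + x + 0.5 + tau * d + rng.normal(size=n)  # 0.5 is a common time shock

# Selection on observables only: adjust the post-period outcome for x; u is
# omitted, so the estimate is biased
X = np.column_stack([np.ones(n), d, x])
b_adj = np.linalg.lstsq(X, y_post, rcond=None)[0][1]

# Selection on unobservables by design: differencing removes time-invariant u
dy = y_post - y_pre
b_did = dy[d == 1].mean() - dy[d == 0].mean()

print(f"adjusted regression: {b_adj:.2f} (biased); DiD: {b_did:.2f} (true {tau})")
```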
Nonrandomised studies modelling selection on unobservables are considered more credible in theory (Dunning, 2012; Imbens & Wooldridge, 2009; Shadish et al., 2002). But many design and analysis factors determine the extent to which nonrandomised studies (with selection on unobservables or observables) are biased in practice, and by how much.
There are two main ways to empirically measure the magnitude of bias in nonrandomised studies (Bloom et al., 2002). One is in "cross-study comparison" of groups of randomised and nonrandomised studies, usually done in systematic review and meta-analysis.
For example, evidence from meta-analyses of programmes in low- and middle-income countries (L&MICs) suggests nonrandomised studies with credible means of controlling for confounding (including difference in differences, IVs and statistical matching) can produce the same pooled effects as RCTs, although potentially with less precision (Waddington et al., 2017). Lipsey and Wilson (1993), in a meta-analysis of meta-analyses containing a very broad range of study designs, found the point estimates calculated from meta-analyses of nonrandomised studies were on average virtually identical to those from RCTs. However, there are doubts about the validity of cross-study comparisons in quantifying bias, even when these studies find zero differences in treatment effects across randomised and nonrandomised studies on average. They are usually based on indirect comparisons from different underlying populations, and it is argued that there is no theoretical reason why one should expect any differences to cancel out on average (Cook, Shadish, & Wong, 2008). 4 The second, and conceptually preferred, approach is the "internal replication study", which assesses the validity of nonrandomised comparison group designs with reference to a "benchmark" study that is thought to be unbiased. The most rigorous designs use data from the same underlying treatment population, hence they are also referred to as "within-study comparisons" (Bloom et al., 2002; Glazerman, Levy, & Myers, 2003). These studies benchmark the effect sizes obtained using nonrandomised comparison group designs against estimates from designs that are in expectation unbiased, usually RCTs. It is important that the treatment samples used in the benchmark and replication studies overlap, because of potential differences in the treatment effect parameter (for example, the average treatment effect (ATE) causal estimand from an RCT versus the local average treatment effect (LATE) estimand from an RDD) over and above errors due to sampling or bias (Duvendack et al., 2012).
Evidence from internal replication studies suggests that nonrandomised studies in which the method of treatment assignment is known or credibly modelled at baseline can produce very similar findings in direct comparisons with RCTs (Hansen, Klejnstrup, & Andersen, 2013). However, when inappropriately designed or executed, they are likely to yield biased effect size estimates (Glazerman et al., 2003; Pirog, Buffardi, Chrisinger, Singh, & Briney, 2009). The extent of bias is likely to depend on the design of the evaluation, how the evaluation design is implemented, and the quality of analysis and reporting.
Work is therefore needed to quantify the biases arising in different nonrandomised studies and to assess the extent to which these relate to estimates of bias produced in critical appraisal. This includes validating risk of bias tools for studies included in systematic reviews.
We have attempted to address this research gap by systematically reviewing internal replication studies of benchmark randomised experiments in international development, and extending a risk of bias tool for regression discontinuity (RD), a popular nonrandomised study design used in international development programme evaluation and increasingly incorporated in systematic reviews of that evidence.
The remainder of the document is structured as follows. Section 2 presents the study objectives. Section 3 presents the results of the systematic review. Section 4 presents our proposed approach to assessing risk of bias in RDDs. The final section presents implications for systematic review practice and research.

| RESEARCH OBJECTIVES AND APPROACH
Our objectives were to conduct a systematic review of internal replication studies in international development, and further develop and pilot a tool to assess risk of bias for RDDs, an increasingly popular method of causal inference in international development research.
• Research objective 1: a systematic review of internal replication studies in international development. This included:
a. Review of existing reviews of internal replication studies.
b. Systematic electronic and hand-searches for internal replication studies in international development.
c. Critical appraisal (risk of bias assessment) of benchmark trials.
d. Calculation of standardised bias estimates and narrative analysis of differences in effect sizes between the benchmark and nonrandomised QE study arms.
4 Where the nonrandomised study uses a credible design or method to control for confounding, thus yielding an unbiased estimate for the underlying sample, any difference in effect size with the randomised study will be due to sampling error, which would tend to cancel out on average in expectation in meta-analysis.
• Research objective 2: development of a tool for assessing risk of bias in RDDs, based on the risk of bias in nonrandomised studies of interventions (ROBINS-I) tool. This included:
a. Review of methods used to assess bias in nonrandomised studies in Campbell systematic reviews.
b. Reviewing literature on RDDs and developing the tool.

| SYSTEMATIC REVIEW OF INTERNAL REPLICATION STUDIES IN INTERNATIONAL DEVELOPMENT
Internal replication studies, also called "within-study comparisons", are studies which compare a nonrandomised comparison group with an unbiased "causal benchmark" study. They have been conducted in the social sciences since the 1980s, following an internal replication of the randomised evaluation of the National Supported Work Demonstration programme in the United States (Lalonde, 1986). We aimed to identify the universe of internal replication within-study comparisons of social and economic programmes in the social sciences in L&MICs. In this section, we present a review of existing literature reviews, including categories of, and sources of bias in, internal replication studies, and results of systematic searches and data collection from internal replication studies in L&MICs.

Table 1 presents a list of known existing reviews of internal replication studies. Some are of particular literatures, for example, studies of labour market programmes (Glazerman et al., 2003) and education (Wong et al., 2017). Others cover particular methodological designs, such as RDD (e.g., Chaplin et al., 2018) and propensity score matching (PSM; Shadish, 2013), or map internal replication designs (Wong & Steiner, 2016). Only one known review is dedicated to evidence from social and economic development programmes in L&MICs (Hansen et al., 2013).

TABLE 1 Existing reviews of within-study comparisons by publication date

Authors | Title | Publisher
Bloom, Michalopoulos, Hill and Lei (2002) | Can nonexperimental comparison group methods match the findings from a random assignment evaluation of mandatory welfare-to-work programs? | MDRC
Glazerman, Levy and Myers (2002) | Nonexperimental replications of social experiments: A systematic review | Mathematica Policy Research, Inc.
Steiner and Wong (2016) | Assessing correspondence between experimental and nonexperimental results in within-study comparisons | EdPolicyWorks Working Paper
Wong and Steiner (2016) | Designs of empirical evaluations of nonexperimental methods in field settings | EdPolicyWorks Working Paper
Jaciw (2016) | Assessing the accuracy of generalized inferences from comparison group studies using a within-study comparison approach: The methodology | Evaluation Review
Wong et al. (2017) | Empirical performance of covariates in education observational studies | Methodological Studies
Chaplin et al. (2018) | The internal and external validity of the regression discontinuity design: A meta-analysis of 15 within-study comparisons | Policy Analysis and Management

Hansen et al. (2013) surveyed four studies, involving two cluster-randomised conditional cash transfer programmes in Mexico and Nicaragua 5 and an individually randomised lottery balloting permanent migration visas in Tonga. 6 One study in Mexico examined the correspondence of estimates from an RDD analysis with estimates from a cluster-randomised controlled trial (Buddelmeyer & Skoufias, 2004). Few reviews report systematic methods of study identification; exceptions include Wong et al. (2017), who report a systematic search strategy, and the meta-analyses by Glazerman et al. (2003) and Chaplin et al. (2018), although concerns regarding the completeness of their search strategies are noted by the authors themselves. Glazerman et al. (2003) indicate that electronic searches failed to comprehensively identify many known studies. This was due to the lack of a common language to define an internal replication study. Furthermore, it is not uncommon that such studies feature as undefined empirical demonstrations in new methods papers or as a secondary piece of analysis in a broader study. Similar problems were also noted by Chaplin et al. (2018), who state that despite having searched broadly "we cannot even be sure of having found all past relevant studies" (p. 424).

5 The programmes included Mexico's PROGRESA and Nicaragua's Red de Proteccion Social.
6 The visas enabled Tongans permanent residency in New Zealand under the Pacific Access Category (PAC; New Zealand's immigration policy which allows an annual quota of Tongans to migrate).
Nevertheless, as the first meta-analysis synthesising this body of evidence, Glazerman et al. (2003) found that nonrandomised estimators often differed from experimental benchmarks, sometimes substantially. 7 8 Non-systematic qualitative updates of this review, including Pirog et al. (2009), later highlighted that with "careful execution" nonrandomised estimators can recreate randomised estimates, and that nonrandomised estimates based on inappropriately designed estimation procedures were associated with larger bias coefficients.

7 Four studies addressed the same intervention, the US National Supported Work Demonstration.
8 In some instances, approximations calculated differences equivalent to approximately 10% of annual earnings for a typical population of disadvantaged United States workers.
Subsequent reviews, including Pirog et al. (2009), expanded the scope of nonrandomised estimators examined to include ITS and RDDs. Drawing from a limited base of evidence, they suggested that an ITS study could also produce results similar to an RCT. Similarly, they concluded that studies using RDDs provide estimates that are comparable to an RCT estimate when the comparison is made for observations close to the discontinuity (or "cut-off") in the assignment (or "forcing") variable.
Building on these findings, Chaplin et al. (2018) further assessed the statistical correspondence of 15 internal replication studies with an RDD approach (including two studies based on data collected on programmes in L&MICs) using meta-analysis. They reported that the average of the difference between RCT and RDD estimates around the discontinuity is close to zero (approximately 0.1 standard deviations) and that the variability of results was also generally quite low.
However, they warned that researchers should not assume, based on these findings, that the bias in individual RDD estimates will necessarily be near zero. They suggested that factors such as larger samples, the use of nonparametric estimation and the choice of bandwidth may prove important in determining the degree of bias in an individual RDD estimate.
Further reviews have included mapping internal replication designs and describing different measures of bias (or correspondence; e.g., see Jaciw, 2016; Steiner & Wong, 2016). Wong and Steiner (2016) describe broad categories of internal replication studies, including independent, synthetic, simultaneous and multisite simultaneous designs (Table 2).
However, authors such as Smith and Todd (2005, p. 306) warn against "searching for 'the' nonexperimental estimator that will always solve the selection bias problem inherent in nonexperimental evaluations". Instead they argue research should seek to map and understand the contexts that may influence studies' degrees of bias.
For instance, Chaplin et al. (2018) consider that their review says little about instances when an experiment is logistically very difficult to implement, or where noncompliance is likely to be large. Hansen et al. (2013) note the potential importance of the type of dependent variable examined in studies, suggesting simple variables (such as binary indicators of school attendance) may be easier to model than more complex outcome variables (such as consumption expenditure); in the latter cases, nonrandomised studies will rarely replicate similar estimates to RCTs. In Table 3 we summarise the list of factors which internal replication studies hypothesise may be associated with bias in nonrandomised studies.

Beyond bias being determined by a particular method, or magnified by a characteristic of a study, another potential source of discrepancy between randomised and nonrandomised designs concerns whether they provide the same causal quantity. In other words, a factor explaining differences in findings across randomised and nonrandomised designs, over and above bias and sampling error, is that they may provide different causal estimands for different treatment populations. For example, an RCT typically estimates the ATE for the experimental sample, whereas an RDD estimates a LATE for observations at the assignment threshold.

| Systematic review approach
We used the following approaches for study inclusion criteria, searching and data collection.
Study design: Glazerman et al. (2003, p. 65) define a replication study as follows: "researchers estimate a program's impact by using a randomised control group and then re-estimate the impact by using one or more nonrandomised comparison groups". We included studies that report treatment effects for a benchmark randomised causal study, alongside treatment effects for a nonrandomly assigned comparison replication. The replicated comparisons could be constructed using any quasi-experimental method (e.g., DIDs, IVs, statistical matching, RDD).
Population: populations in L&MICs were eligible, whether general programme participants or students in lab studies.
The causal benchmark and comparison study needed to derive from the same study population so as to minimise the possibility of a factor other than study design confounding estimates of bias.
Intervention and comparator: studies could be of any social or economic intervention and of any comparison condition (e.g., no intervention, wait-list, alternate intervention). Where relevant, the causal benchmark and comparison study also needed to derive from the same intervention or comparison condition, and time period.
Outcome: studies could be of any outcome variable, provided the outcome was the same for the benchmark control and replication comparison groups.
Studies or study arms were excluded that:
• made between-study comparisons (with no overlap in treatment group samples for causal benchmark and comparison);
• were based on clinical, biomedical or health care interventions, or on populations in high-income country contexts (e.g., Fretheim et al., 2015; Fretheim, Soumerai, Zhang, Oxman, & Ross-Degnan, 2013);
• did not use as a causal benchmark a study with randomised or quasi-randomised allocation, or did not use "real world" data collected from participants; studies conducting analysis of an artificial or synthetic population (such as Schafer & Kang, 2008) were therefore excluded;
• did not construct the nonrandomised comparison group using quasi-experimental methods or a natural experiment (such as a retrospective RDD) to account for confounding; a typical example of an excluded comparison would be the use of adjusted regression (ordinary least squares [OLS] or limited dependent variable analysis) applied to observational cross-section data 9 ; and
• used a nonrandomised technique to adjust for circumstances where the randomisation process was compromised (e.g., Borland, Tseng, & Wilkins, 2013) or for study attrition (e.g., Grasdal, 2001).

9 The rationale for including quasi-experimental approaches with selection on observables, but excluding regression with adjustment for observable confounding, is that the former methods at least balance covariates and therefore account for biases arising from comparing observationally dissimilar groups (Heckman, Ichimura, & Todd, 1997).

TABLE 2 Definitions of within-study comparison designs

Within-study comparison type | Description
1. Independent design | Also known as the "four-arm" design. Participants are randomly assigned into benchmark and nonrandomised arms. Participants in the benchmark arm are randomly assigned again into treatment/control conditions. Participants in the nonrandomised arm self-select, or are selected by a third party, into a preferred treatment option.
2. Synthetic design | The researcher begins with data from an RCT and then constructs a nonrandomised study by simulating a selection process and removing information from the RCT treatment and/or control group to create nonequivalent groups.
3. Simultaneous design | Observations from an overall population select (or are selected) to participate in the benchmark study. In the benchmark study, participating observations are randomly assigned into treatment conditions, and the estimated treatment effect serves as the causal benchmark result for evaluating the nonrandomised approach. For the nonrandomised arm of the comparison, the researcher compares the RCT treatment units with comparisons from a sample of the population that did not participate in the RCT. A special type of this design is also referred to as a "tie-breaker" design: as further described by Chaplin et al. (2018), initial selection into the benchmark study is determined by an eligibility criterion, which commonly enables researchers to form a comparison between the RCT and regression discontinuity design estimates.
4. Multisite simultaneous design | Beginning with a multisite RCT in which randomisation occurs within sites (sites are purposefully selected but randomisation to a treatment condition occurs within each site), the nonrandomised arm is constructed by comparing average outcomes from an RCT treatment group in one site to an RCT control case from another site. Here, the composition of units will differ from site to site and will not be random, due to the nonrandom selection of sites.

TABLE 3 Factors associated with bias in internal replication studies

Study characteristic | Description
Methods and procedures | Most common is the use of internal replication studies to examine whether a method or a new technique can reduce the bias of nonrandomised studies (e.g., Hainmueller, 2012; Diamond and Sekhon, 2012). This also includes more nuanced assessments examining the effects of specific procedures, for example bandwidth size or use of calipers in RDD (e.g., Green et al., 2009; Gleason et al., 2012; Wing and Cook, 2013; Moss et al., 2014).
Preintervention outcome data | Reasons for assuming that the inclusion of preintervention outcomes in nonrandomised analysis can reduce bias include that preintervention outcomes will likely be highly correlated with unobserved determinants of the outcome variable. Internal study replications have examined whether including preintervention outcome data in model specifications may help to reduce and minimise the confounding effects of at least some of the unobserved factors determining observations' outcomes post-intervention (e.g., Lalonde, 1986; Smith and Todd, 2005; Zhao, 2006; Steiner et al., 2010; Bifulco, 2012; Wong et al., 2017; Fortson et al., 2014; Calónico and Smith, 2017).
Length of preintervention history | In an extension to studies' attempts to understand the extent to which the inclusion of preintervention outcome data may reduce bias (see above), some research explores the effects of having preintervention data from numerous data points (e.g., Michalopoulos et al., 2004; Hallberg et al., 2016).
Types of outcomes | Some research now exists comparing the efficacy of quasi-experimental methods to replicate RCT estimates when using different types of outcome variables. For example, Diaz and Handa (2006) and Handa and Maluccio (2010) contrast levels of bias following applications of these methods to a broad range of common development outcomes, including student enrolment, child labour and food expenditure. Liebman et al. (2004) also compare findings using education, behavioural and physical health outcomes.
Richness and types of variable data | A number of internal study replications distinguish the effects that data availability may have on bias in nonrandomised studies. This includes comparing bias in parsimonious specifications of models (e.g., Lalonde, 1986; Smith and Todd, 2005; Gordon et al., 2016), as well as examining constructs of data types. For example, Cook et al. (2009), Cook and Steiner (2010) and Steiner et al. (2010) examine the effects of including or excluding variables related to areas such as psychological predisposition, demographics and topic preferences, and Hallberg et al. (2013) examined applying student and school level characteristics to their specifications. Calónico and Smith (2017) and Anderson and Wolf (2017) also examine whether the inclusion of common demographic data in model specifications is important in reducing bias.
Geographic markets and aggregate conditions | Characteristics of local environments, such as economies and labour markets, may be pivotal in explaining the differences in the development of particular areas and their constituents. Here studies have examined the effects of adjusting model specifications for aggregate conditions (such as the local area's unemployment rate; see Hill et al., 2004). Alternatively, others have considered limiting the sample that the comparison group is drawn from to the local area (such as neighbouring counties; see Lee, 2006).
Higher order terms and interacted variables | Increasingly, internal replication studies have begun to experiment using higher order terms (such as cubed or squared variables) and interaction terms in model specifications. This reflects that some bias may derive from mis-specification of the model's functional form (e.g., Dehejia and Wahba, 1999; Green et al., 2009; Fortson et al., 2014). Extending this point, other research has sought to establish and test a means for handling specification uncertainty (e.g., Kitagawa and Muris, 2016).
Use of no-shows as a comparison | No-shows are individuals who enrol for a programme and are accepted onto the programme, but for some reason do not participate in it. Debate exists over whether this group hosts attractive qualities useful for identifying valid comparisons. For example, they derive from the same geographic locations as programme participants and have undergone the same questioning process during data collection. However, their no-show status reflects something inherent that distinguishes the individual and may bias an estimator (e.g., Heckman et al., 1997).
Origin of survey data | Another factor that may influence bias is the origin of the data set from which the nonrandomised comparison group is derived (Glazerman et al., 2003). This could follow from differences in the way that a survey is administered, the quality of data collection, and the consistency of the measurement of variables (see Heckman et al., 1997, 1998; Agodini and Dynarski, 2004).
Differences in measurement | One of the possible reasons why using survey data from different origins could increase bias in the estimates is that the measurement of variables could be inconsistent across surveys. Handa and Maluccio (2010) provide an example, using adjusted measures of total expenditure as an outcome variable, to show that the reported level of bias may be sensitive even to subtle differences in measurement. Fortson et al. (2014) further investigate the magnitude of bias reduction that could be achieved from using errors-in-variables models to correct for measurement errors.
Management of missing data | A common empirical issue in the social sciences concerns the problem of missing data, for example where answers or responses to survey questions have not been recorded, were not provided by the participant, or could not be obtained for a given time period. The way that researchers manage this problem may be another factor that affects bias.
Grouped versus individual data | Authors such as Fraker and Maynard (1987) have compared whether compiling nonrandomised estimates on grouped data (e.g., social security cells), as opposed to individual-level data, results in greater levels of bias.
Target populations | A number of authors have investigated whether the degree of bias in quasi-experimental estimates varies across different target populations or subgroups, for example youth vs. adult and male vs. female (see Heckman et al., 1997; Gritz and Johnson, 2001; Liebman et al., 2004).
Time | Another factor that may affect bias is the length of time after the start of the intervention at which the dependent variable is modelled (see Michalopoulos et al., 2004; Hämäläinen et al., 2008).
Matching sample size | Some analysis also considers the size of the nonrandomised sample required to reproduce RCT results. Lee (2006) tests the sensitivity of estimates of bias to the number of observations available for matching in the comparison group (or the ratio of treatment group observations to comparison group observations). Deke and Dragoset (2012) investigate the sample size required for an RDD to replicate the statistical precision of an RCT.
Types of discontinuities | Specific to RDD studies, some evidence exists distinguishing bias observed in nonrandomised estimates following the application of different types of discontinuities. For example, Black et al. (2007) consider time-, geography- and marginality-scoring-based discontinuities.

Abbreviations: RCT, randomised controlled trial; RDD, regression discontinuity design.
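The synthetic design described in Table 2 is straightforward to sketch in code. The following illustration is our own construction; the selection rule and effect size are assumptions, not taken from any included study. It starts from simulated RCT data and selectively deletes control observations as a function of a covariate to manufacture a nonequivalent comparison group:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 8_000
x = rng.normal(size=n)                     # observed covariate in the RCT data
z = rng.integers(0, 2, size=n)             # original random assignment
y = x + 0.5 * z + rng.normal(size=n)       # true effect = 0.5

benchmark = y[z == 1].mean() - y[z == 0].mean()   # experimental benchmark (~0.5)

# Synthetic step: selectively discard RCT controls as a function of x to mimic
# self-selection, creating a nonequivalent comparison group
keep = (z == 1) | ((z == 0) & (rng.uniform(size=n) < 1 / (1 + np.exp(2 * x))))
y_k, z_k = y[keep], z[keep]
nonrandomised = y_k[z_k == 1].mean() - y_k[z_k == 0].mean()

print(f"benchmark {benchmark:.2f} vs synthetic nonrandomised contrast {nonrandomised:.2f}")
```

The gap between the two contrasts is then attributable to the simulated selection process, which is exactly the quantity a within-study comparison is designed to measure.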

| Search methods
Existing reviews of internal replication studies, such as Glazerman et al. (2003) and Chaplin et al. (2018), note a number of issues related to study identification, in particular a lack of common language with which to systematically index this evidence. To identify internal replication studies in this review, we first used a combination of conventional methods: searching electronic academic databases and the bibliographies of identified studies and literature reviews. To further identify studies that may not be well indexed in this literature, we supplemented this search process using citation tracking software to identify studies citing well-known reviews of internal replication studies. We also searched the repositories of known institutional producers of internal replication studies and 3ie's Impact Evaluation Repository (a specialised database of impact evaluations in international development).
Electronic searches: with the assistance of an information specialist, we searched the Research Papers in Economics (RePEc) database via EBSCO using the following search string:

(nonexperiment* OR non-experiment* OR "non experiment*" OR quasi-experiment* OR "quasi experiment*" OR observational OR non-random* OR nonrandom* OR "non random*" OR within-study OR "within study" OR replicat* OR "propensity score" OR PSM OR discontinuity OR RDD) AND ("experiment*" OR random*)

Snowball searches: we applied forwards citation tracking and bibliographic back-referencing. We compiled a list of well-known reviews of internal replication studies (Table 1). We then used three electronic tracking systems (Google Scholar, Web of Science and Scopus) to identify and screen articles that cite these reviews (forward citation tracking). We hand-searched the reference lists of all primary studies in order to further identify studies that had been cited in the existing literature (bibliographic back-referencing).
Institutional website repository searches: we extended our search strategy using findings from a unique project spanning nearly 5 years of systematic searching, screening and indexing of impact evaluations across the field of international development. Further described by Cameron et al. (2016) and Sabet and Brown (2018), the 3ie Impact Evaluation Repository provides an index of more than 4,000 impact evaluations, populated through systematic screening of more than 35 databases, search engines and websites. It also reports descriptive information on studies' key characteristics, including study design, country of origin and sectoral focus. We used this database to identify evidence from studies in international development that is not yet recorded in the broader internal replication literature. We screened all studies in the repository recorded as using both a randomised and a nonrandomised design.
We also sought to identify literature by searching the web repository of a known producer of internal replications. Preliminary searches suggested that Mathematica Policy Research Inc. had published several internal replication studies. We therefore hand-searched Mathematica's website using the search function to identify pages, documents and articles featuring the term "within-study".

| Data collection and analysis
We collected summary information from eligible studies on the populations, interventions, comparisons, outcomes and study designs.
All studies reported treatment effects for a causal benchmark study (a sample randomly assigned in an experimental or natural experimental context), and for a nonrandomly assigned comparison replication. The replicated comparisons could be constructed using any method. We also collected outcomes data and effect sizes for the benchmark and nonrandomised replication arms, study design characteristics, and treatment compliance (see coding tool in Appendix A). We tabulated this information in a database format using Microsoft Excel. Finally, we used Cochrane's revised risk of bias tool ROB2.0 for RCTs (Higgins, Savović, Page, & Sterne, 2016) or cluster-RCTs (Eldridge et al., 2016) to assess biases in the benchmark RCTs. 10 Contacting authors of existing studies, and hand searches of our own personal repositories of known studies, identified 13 additional references. After removal of duplicates, a total of 3,904 records were included for screening at title and abstract.

| Search results
During screening at title and abstract, 3,328 records were excluded. The remaining 576 records were screened at full text. A further 443 records were then excluded during full-text screening, leaving 133 records which were assessed for eligibility. Of these 133 studies, 21 were removed due to being working papers of now-published articles, and 102 internal replication studies were excluded because the benchmark RCT was not conducted in an L&MIC context. We eventually included 10 studies of social and economic programmes in L&MICs (Figure 1). Among excluded studies, Cintina and Love (2014) created nonrandomised treatment and control groups from an RCT by Banerjee et al. (2015), and as such did not provide an estimate of the effect of the same intervention using randomised and nonrandomised groups. Similarly, Glewwe et al. (2004) was excluded because it formed a between-study comparison, examining differences in effects of similar but different interventions.

10 We considered the possibility of blinding coders by removing the numeric results and the descriptions of results (including relevant text from abstract and conclusion), as well as any identifying items such as authors' names, study titles, year of study and details of publication. We were not able to undertake this blinding under the project due to the team's existing knowledge about the included studies.

| Study descriptive information
Included studies are summarised in Table 4. Four of these studies featured in the previous review of internal replication studies in international development (Hansen et al., 2013). These are based on data from conditional cash transfer schemes, PROGRESA in Mexico (Buddelmeyer & Skoufias, 2004; Diaz & Handa, 2006) and Red de Proteccion Social in Nicaragua (Handa & Maluccio, 2010), and a randomised lottery balloting permanent migration visas in Tonga (McKenzie et al., 2010). 11 One study on Mexico's cash transfer programme examines the correspondence of estimates from an RDD analysis with estimates from an RCT (Buddelmeyer & Skoufias, 2004), and we were also able to locate an additional six replications of RCTs.

FIGURE 1 Study search flow for nonrandomised internal replication studies

One replication (Lamadrid-Figueroa et al., 2010) uses data from the Oportunidades programme to construct a "simultaneous design" internal replication study of an RDD. To provide some context, at its inception in 1997, when it was known as PROGRESA, the evaluated programme contained multiple components, including conditional cash transfers, a nutrition programme and a free essential healthcare package. The distribution of the programme's resources was determined by an eligibility criterion. First, a community was assessed as to whether it had sufficient healthcare facilities and schools to host the programme. Using a cluster-randomised phase-in design, eligible communities were then randomly assigned to begin the programme in an earlier or a later phase.

The study's results show a lack of statistical correspondence between the RCT and RDD estimators. While the randomised experiment indicated that the prevalence of contraceptive use significantly increased among eligible treatment communities compared with control communities, the RDD analysis conversely estimated a large, significant negative effect comparing eligible and noneligible observations. For example, the results of the randomised experiment implied the programme caused a 5% increase, on average, in contraceptive use among eligible households (p < .1). 12

11 The visas enabled Tongans permanent residency in New Zealand under the PAC (New Zealand's immigration policy which allows an annual quota of Tongans to migrate).
Meanwhile, the RDD estimates using the OLS model with the smallest window around the threshold (50 points) estimated an average decrease of approximately 22%.
The magnitude of the difference between the randomised estimate and the RDD estimate of the programme's effects decreased as the window around the threshold widened in the RDD analysis. However, the RDD estimate nevertheless remained negative, though not statistically significant, even in the specification with the largest window around the threshold (estimating a 9% decrease). The RDD 2SLS estimates provided qualitatively the same conclusions as the OLS estimates, and quantitatively they were very similar in magnitude (with less than a 2% difference in estimated effects across specifications).
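To make the mechanics of these window-based comparisons concrete, the following sketch uses simulated data; the compliance rate, effect size and outcome model are our own assumptions, not the studies' data. It estimates an RDD by OLS within successively wider windows around the threshold, and forms a 2SLS-style estimate as the ratio of the intention-to-treat jump to the first-stage jump in take-up:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
score = rng.uniform(-120, 120, size=n)           # poverty score, centred on threshold
eligible = (score <= 0).astype(float)            # eligible below the threshold
takeup = eligible * (rng.uniform(size=n) < 0.9)  # imperfect compliance among eligibles
y = 0.01 * score + 0.05 * takeup + rng.normal(scale=0.5, size=n)

for h in (20, 30, 50, 75, 100, 120):             # windows as reported in these studies
    w = np.abs(score) < h
    X = np.column_stack([np.ones(w.sum()), eligible[w], score[w], score[w] * eligible[w]])
    itt = np.linalg.lstsq(X, y[w], rcond=None)[0][1]      # OLS jump at the threshold
    fs = np.linalg.lstsq(X, takeup[w], rcond=None)[0][1]  # first-stage jump in take-up
    print(f"window {h:>3}: OLS/ITT {itt:+.3f}  2SLS-style (ITT/first stage) {itt/fs:+.3f}")
```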
Urquieta et al. (2009) also estimate the effects of the programme on eligible households. 13 However, this study limits the sample to households with at least one woman reporting a birth between 1997 and 2000 and, rather than report RDD estimates using a 2SLS specification, it controls for the potential endogeneity of household eligibility status close to the threshold using a DID model on a balanced panel (i.e., among a sample of women who had births in both the baseline and the follow-up periods). The results contrast findings from both cross-sectional and panel data.

12 Lamadrid-Figueroa et al. (2010) also provide randomised estimates of Oportunidades effects in a broader community sample (including noneligible households). This estimate was found to be positive but statistically insignificant. We focus on comparing the experimental effects estimated for eligible households.
13 Again, RCT results are reported for communities (including noneligible households). Here we focus on randomised results of the effects of the programme on eligible households.
Urquieta et al. (2009) report estimates using windows of 20, 30, 50, 75, 100 and 120 points around the poverty threshold.

A further replication is based on a randomised scholarship experiment studied by Barrera-Osorio et al. (2014). Here, schools were randomly assigned to either Phase 1 (starting in 2008/09) or Phase 2 (starting in 2009/10). The 103 schools randomly assigned to Phase 1 (the treatment group) were further randomised to either a poverty-based scholarship system or a merit-based scholarship system. In the poverty-based system, the household characteristics of fourth-grade students (third grade at the time of the baseline assessment) were scored by a centrally contracted firm using a poverty index to ascertain the student's poverty status. The half of students deemed most impoverished in each school were eligible to receive the scholarship. For schools in the merit-based system, the students with the highest baseline test scores were eligible to receive the scholarship. All scholarships equated to a value of approximately US$20 per annum and were conditional on students maintaining a certain level of attendance and grades. The remaining 104 schools randomly assigned to Phase 2 of the experiment formed the randomised control group. Barrera-Osorio et al. (2014) estimate the effects of the scholarships on mathematics scores and the average highest grade completed 2 years after the start of the programme. They follow the outcomes of the Grade 3 cohort assessed at baseline, reflecting that the randomised control group for this cohort (the children in Phase 2 schools) remained intact over time and did not receive the scholarships (despite their younger peers in their schools becoming eligible for the scholarship in future years). In contrast to the two internal replication studies described above, they also report randomised estimates local to the threshold, in addition to the randomised estimates from the broader distribution of observations.

In the conditional cash transfer experiment in Honduras studied by Galiani et al. (2017), households were eligible for an annual per-child cash transfer of approximately US$50 for up to three children aged between 6 and 12 and enrolled in primary school grades 1-4.
To develop another example of a "simultaneous design" internal replication study, they contrast the differences between children's outcomes from eligible households in the randomised treatment and control municipalities 14 with estimates obtained from a geographical discontinuity design (GDD). Here, nonrandomised comparisons are drawn by comparing the outcomes of observations near to either side of a geographic boundary (which acts as a geographic discontinuity).

14 Randomised experimental estimates are also provided for broader samples that do not share a municipal border with untreated nonexperimental caserios and that are located >2 km from a municipal border. Here we simply focus our attention on the experimental estimates using the sample most comparable to that used in the GDD analysis (i.e., those that do share a border and are <2 km away).
Given that the Honduran census does not record the precise location of dwellings, they use the latitudinal and longitudinal coordinates of caserios (hamlets) to determine whether a household rests within 2 km of municipal borders and larger department borders. The latter reflects that the management and financing of public schools may vary between departments, given the centralisation of some public administrative and governance functions in Honduras. Results report robustness tests excluding households near department borders to control for this potentially confounding factor. The study also contrasts findings from the full sample of eligible households with two limited samples: the first limited to the two "poorest" strata of households, and the second limited to the third to fifth strata of households.
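A sketch of the kind of geographic sample restriction described here is shown below, using invented coordinates; the caserio and border points are hypothetical and the study's actual border geometry is not reproduced:

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Hypothetical inputs: caserio centroids and vertices tracing a municipal border
caserio_lat = np.array([14.10, 14.13, 14.25])
caserio_lon = np.array([-87.20, -87.22, -87.40])
border_lat = np.linspace(14.09, 14.15, 200)
border_lon = np.linspace(-87.25, -87.18, 200)

# Minimum distance from each caserio to the border polyline's vertices
dist = np.min(
    haversine_km(caserio_lat[:, None], caserio_lon[:, None],
                 border_lat[None, :], border_lon[None, :]),
    axis=1,
)
in_gdd_sample = dist <= 2.0   # keep caserios within 2 km of the border
print(dist.round(2), in_gdd_sample)
```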
Overall, the results suggest that the randomised and nonrandomised GDD estimates statistically correspond, generally offering similar conclusions regarding the statistical significance of estimated effects across different samples and outcomes. More broadly, it was also concluded that the magnitudes of the effects were relatively similar, although the GDD estimates were mildly smaller (downward biased). A further placebo analysis compiled a GDD analysis between the untreated nonrandomised caserios and randomised control caserios within 2 km of a municipal border. Coefficients in the placebo analysis were both very small and statistically insignificant.

| Galiani and McEwan (2013): Impact of conditional cash transfers on child school and labour outcomes estimated by RDD

The experiment used by Galiani et al. (2017) was first analysed by Galiani and McEwan (2013), who also created an RDD analysis to compare with their experimental estimates. They used the nutrition-based eligibility criteria initially imposed to select municipalities as the discontinuity in this analysis, examining the schooling outcomes of children aged 6-12 years who had not completed fourth grade and resided in municipalities within (+/−) half a standard deviation of the cut-off score imposed during municipalities' selection into the RCT.
The results show that the experimental estimates statistically correspond with the RDD analysis when the comparison is made for observations close to the cut-off. Notably, neither the RDD nor the RCT estimates in the vicinity of the cut-off indicate significant changes in school enrolment, work outside the home or work inside the home. However, despite their statistical insignificance, the estimated coefficients correspond less closely, given their opposite signs.
The authors also highlight that the study offers another example where localised treatment effect estimates taken from a nonrandomised study do not approximate ATEs estimated from a randomised experiment.

The analysis in Chaplin et al. (2017) also contrasts the correspondence of an unmatched comparison approach with statistical matching.
The latter uses nearest-neighbour PSM to match a nonrandomised comparison group to the randomised control group. It also compares the correspondence of statistical matching estimates having controlled for pretest outcomes, a rich set of covariates (covering individual, household and community characteristics), and a geographically local comparison group (a community located within 40 km of the control communities). This includes comparing matching estimates having used any combination of these design elements.
The study finds that the differences between the randomised control group and matched comparison groups are, on average, statistically significant across outcome domains. The correspondence of the matching estimator also decreases when using a low-quality comparison sample without electricity lines (e.g., the coefficient of the average magnitude of the bias increased from 0.086 to 0.120 using the approach without statistical matching).
Statistical matching generally improved the degree of correspondence between groups across outcomes and samples. In particular, statistical designs using a rich list of covariates generally increased the degree of correspondence, as did using either local geographic matching or pretest outcomes. However, matching estimators using combinations of these elements generally performed better than those using a single element, and those using all three elements nearly always increased the correspondence of the matching estimator by about as much as any other combination. Statistical matching did not, however, eliminate the difference between the groups, and differences were still larger when matching on communities without electricity. The latter further highlights the limitations of such statistical techniques in accounting for unobserved differences between comparison groups.
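The matching design elements compared in this study can be illustrated with a minimal nearest-neighbour PSM sketch. The simulated data, covariate structure and effect size below are our own assumptions; this is not the study's estimator or data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def nn_psm_estimate(y, treated, X):
    """1-nearest-neighbour propensity score matching (with replacement)."""
    ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]
    t, c = np.where(treated == 1)[0], np.where(treated == 0)[0]
    # For each treated unit, find the comparison unit with the closest score
    match = c[np.argmin(np.abs(ps[t][:, None] - ps[c][None, :]), axis=1)]
    return (y[t] - y[match]).mean()

rng = np.random.default_rng(3)
n = 5_000
pretest = rng.normal(size=n)
covs = rng.normal(size=(n, 4))
treated = (0.7 * pretest + covs[:, 0] + rng.normal(size=n) > 0).astype(int)
y = pretest + covs[:, 0] + 0.3 * treated + rng.normal(size=n)

# Richer conditioning sets, echoing the design elements compared in the study
sparse = covs[:, :1]
rich = np.column_stack([pretest, covs])
for name, X in [("sparse covariates", sparse), ("pretest + rich covariates", rich)]:
    print(f"{name}: effect estimate = {nn_psm_estimate(y, treated, X):.2f} (true = 0.30)")
```

As in the study's findings, the estimator conditioning on the richer set (including the pretest outcome) should land closer to the true effect, while the sparse specification leaves residual confounding.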

| Risk of bias assessment in benchmark studies
In other instances, concerns may be more difficult to address. For example, none of the studies address the issue of blinding outcome assessors, and it is unknown to what extent this could influence assessments of outcomes and participant selection. Furthermore, it is challenging and rare that widely implemented social programmes, such as those that feature in many of the cluster-randomised trials assessed here, can sufficiently capture data on confounding issues such as migration between clusters. This latter point may give rise to the argument that future within-study comparisons in L&MICs would benefit from making use of smaller and more controlled environments.

Finally, a caveat of the published risk of bias tool for RCTs is that it does not include questions to discern the sufficiency of IV estimation as a correction for noncompliance; it merely provides a decision score of "some concerns". This means that the IV results provided by McKenzie et al. (2010) would not be able to achieve a "low risk" assessment in relation to bias arising from deviations from intended interventions, regardless of the rigour of the analysis done.
The rest of this section states the key factors, uncertainties and decision points influencing the scores associated with each domain of bias for each experiment assessed (Table 5). It proceeds by discussing each bias domain in turn.
| Bias arising from the randomisation process

There were some concerns in Chaplin et al. (2017), where statistical tests of the equality of means between treatment and control group baseline characteristics indicated that differences arose more frequently than would be expected by chance. Nevertheless, it is notable that even small differences may appear significant in relatively large samples (for more detailed discussion of such issues, see Bruhn & McKenzie, 2009). For this reason, we consider that it would be more appropriate for the authors in these instances to analyse treatment and control group similarity using distance metrics. However, insufficient presentation of such evidence leaves us unable to conclude confidently whether the study has a low risk of bias with regard to the randomisation process (since the standard, albeit erroneous, approach is to present statistical significance testing). These findings would typically be of concern if our purpose was to generalise a statement about the effectiveness of these interventions.

For McKenzie et al. (2010), the estimate correcting for noncompliance with the treatment of interest (the migration decision) is the one that is incorporated in subsequent analysis and hence is presented in Table 5. An appropriate method of analysis using approaches such as IV to correct for noncompliance is scored as of "some concerns" according to the tool's decision matrix.

| Missing outcome data and measurement of the outcome
With regard to bias due to missing outcome data, studies were assessed as "low risk" of bias where missing outcomes data were similar across treatment arms, and of "some concerns" where information was not available. We score the experiment in Galiani and McEwan (2013) and Galiani et al. (2017) as "low risk", reflecting that the analysis was based on census data. The study by McKenzie et al. (2010) performed purposeful sampling of the control group during the follow-up survey because of concerns that the method of follow-up (using a telephone directory) might lead to bias in selection into the study (for those that do not have telephones). They elected to deliberately include a sample of participants from the outer islands of Tonga in order to correct for a possible bias this may introduce. However, we remain unclear as to the effect this purposeful sampling may have had on the composition of the control group and their outcome data during follow-up. Robustness checks and further details are not available, and the study is therefore considered to be of "some concerns".
In most experiments assessed, bias in outcomes measurement was considered to be of "some concerns". This is largely due to the lack of blinding of assessors. There is also insufficient evidence to confidently state whether outcomes were likely to be influenced by knowledge of the intervention received, since outcomes data were usually collected at the household level through self-reported respondent surveys, rather than more rigorous methods such as formal tests. 15 One exception is Barrera-Osorio et al. (2014), which administered mathematics and working memory tests and hence was classified as being at low risk of bias. Evidence from meta-epidemiological studies suggests that biases in nonblinded studies are problematic when outcomes are self-reported (Savović et al., 2012).

Another exception related to the way the experiment in Galiani and McEwan (2013) and Galiani et al. (2017) was conducted. Since this experiment drew retrospectively on census data, participants and assessors would not be expected to associate the data collection with household treatment status. In all of these cases, we assigned "low risk of bias" in outcomes measurement.

| Selection of the reported results
Here we consider that the reporting quality generally offers low risk of bias across the studies assessed, due to the large number of effects usually reported for different outcomes and samples. For example, all studies reported results of RCTs across multiple outcome domains, which they then used to compare with nonrandomised replications. Where particular subgroups were reported, for example by sex in Buddelmeyer and Skoufias (2004), they were justified as common practice.

15 In some instances, outcomes were collected at community level, for example, Chaplin et al. (2017) household electricity grid connections, but these were not used in the within-study comparison.

| Quantitative estimates of bias in nonrandomised within study replications
We collected data on treatment effects for the benchmark study and each corresponding nonrandomised replication presented, from 604 specifications. These data included outcome means in treatment and control/comparison groups (or treatment effect estimates from an analysis), outcome variances, sample sizes, tests of statistical significance, types of variables used in adjusted analysis and available measures of treatment compliance for RCTs (see the coding tool in the Appendix). We used the estimate of effect from the RCT which most closely corresponded with the population for the nonrandomised arm (e.g., the bandwidth around the treatment threshold in the case of Buddelmeyer & Skoufias, 2004). Where there was nonadherence, we used the CACE,16 as in the case of McKenzie et al. (2010).

| Quantitative measures of bias
There are two main types of measures of difference between the benchmark and nonrandomised replication study arms: distance metrics, which quantify the difference between the effect size estimates from the benchmark and the nonexperimental replication; and conclusion-based measures, which use information on sign, statistical significance or an effect threshold.
We calculated the standardised absolute difference between treatment effects in experimental and nonrandomised replication samples. We define D as the primary distance metric measuring the size of the bias between the nonexperimental and experimental results.17 D_s is a standardised measure of D, as defined in recent reviews by Wong et al. (2017) and Steiner and Wong (2016). We used the absolute difference in D to ensure consistency across studies' reported effects; for example, Chaplin et al. (2017) only reported absolute standardised difference measures. In addition, in the subsequent results we report averages over the large number of values of D collected in each study; we want a measure of the deviation of randomised and nonrandomised estimators, and not one that on average "cancels out" positive and negative deviations, hence potentially obscuring important differences. Formally, D_s was computed as follows:

D_s = |τ_nx − τ_re| / σ_re,

where τ_nx and τ_re are the nonexperimental and experimental treatment effect estimates, and σ_re is the standard deviation of the outcome in the experimental group (or the pooled standard deviation of the experimental treatment and control groups). Where an appropriate standard deviation was not reported, the standardised mean difference (d) for each estimator was calculated using the following formula, and the two were then subtracted from one another to compute D_s:

d = t √(1/n_g + 1/n_c),  D_s = |d_nx − d_re|,

where n denotes the sample size of the treatment group (g) or control/comparison group (c) and t is the t-statistic for the difference in group means in either the nonexperimental (nx) or experimental (re) samples.
Where the nonexperimental replication compares two untreated groups (e.g., the randomised control group and a nonexperimental comparison group), the true ATE is theoretically zero given that the control group is not exposed to the treatment. Any differences that then arise between the two matched groups are attributed to an inconsistency between estimators.
We extracted data relating to estimates using such strategies to minimise the differences in causal quantities between experimental and quasi-experimental strategies. However, such estimates are not …

16 CACE may be estimated using IVs (Angrist, Imbens, & Rubin, 1996) or calculated by deflating the ITT estimate using available information on adherence (Bloom, 2006). For example, if nonadherence (C) in the treatment group is observed at a rate of 10%, for a homogeneous effect of an intervention that equates to a particular increase (say 5%) in a given outcome of interest, we would expect the ITT estimator to report a smaller average effect, calculated as follows: E_ITT = E_TOT(1 − C) = 5 × 0.9 = 4.5%.

17 A second type of distance metric is a simpler measure of the percent difference (as also reported in McKenzie et al., 2010), which is not synthesised across studies in this report. This expresses the difference between experimental and nonexperimental effect estimates as a percent of the absolute value of the experimental estimate. However, as highlighted by Steiner and Wong (2016) …
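As a minimal sketch of the D_s computation described above (the numbers are hypothetical, and the function and variable names are ours rather than the authors'):

```python
import numpy as np

def smd_from_t(t: float, n_g: int, n_c: int) -> float:
    """Standardised mean difference recovered from a reported t-statistic
    and group sample sizes: d = t * sqrt(1/n_g + 1/n_c)."""
    return t * np.sqrt(1.0 / n_g + 1.0 / n_c)

def standardised_bias(t_re, n_g_re, n_c_re, t_nx, n_g_nx, n_c_nx) -> float:
    """Absolute standardised difference D_s between the experimental (re)
    and nonexperimental (nx) effect estimates."""
    d_re = smd_from_t(t_re, n_g_re, n_c_re)
    d_nx = smd_from_t(t_nx, n_g_nx, n_c_nx)
    return abs(d_nx - d_re)

# Example: benchmark RCT with t = 2.1 and 400/400 units; matched
# replication with t = 3.0 and 350/500 units (hypothetical numbers).
print(standardised_bias(2.1, 400, 400, 3.0, 350, 500))
```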

| Quantitative estimates of bias
We calculated bias estimates for all included studies and report the mean standardised bias in Table 6. Three of the four studies using nonrandomised matching estimators report statistically significant average differences (Chaplin et al., 2017; Diaz & Handa, 2006; Handa & Maluccio, 2010). One set of estimates from Handa and Maluccio (2010) is relatively large on average (0.43 standard deviations), but within this study the bias coefficients greater than one standard deviation reflect estimates based on weaker matching strategies, and Hansen et al. (2013) highlight some concerns about that study's potential risk of bias.
A final point of note is warranted regarding the discontinuity designs. All of the studies examining discontinuity designs that use equivalent causal estimands (i.e., with the exception of Lamadrid-Figueroa et al., 2010) produce distance metrics that are small on average (<0.06 standard deviations). None are significantly different from the benchmark estimate. However, because discontinuity designs are estimated over the local population around the cut-off, they are also of low power, which would account for the statistically insignificant findings; for example, Goldberger (1972) originally estimated sampling variances for an early conception of RDD as being 2.75 times larger than those of an RCT of equivalent sample size. Estimates from other designs that produce global (rather than local) causal estimands presumably do not suffer this power limitation, which would explain why some of the findings from matching estimators are of similarly small magnitude to the RDD estimates, but significantly different from zero at standard significance levels.
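Goldberger's factor can be illustrated by simulation. In a sharp RDD the treatment indicator is a deterministic function of the assignment variable, which must be controlled for in the regression; the resulting collinearity inflates the sampling variance of the treatment coefficient by approximately 1/(1 − R²), where R² is the squared correlation between the treatment indicator and the assignment variable. A sketch under the assumption of a standard normal assignment variable with the cut-off at its mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Standard normal assignment variable with the cut-off at its mean.
z = rng.normal(size=1_000_000)
treat = (z >= 0).astype(float)

# Squared correlation between treatment indicator and assignment variable.
r2 = np.corrcoef(treat, z)[0, 1] ** 2

# Variance inflation of the treatment coefficient relative to an RCT of
# the same size, when the (linear) assignment variable is controlled for.
print(r2, 1.0 / (1.0 - r2))   # approx. 0.64 and 2.75
```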

| RISK OF BIAS IN RDD
High quality systematic reviews set explicit study design inclusion criteria (Campbell Collaboration, 2014; Waddington et al., 2012). In systematic reviews examining questions about the effects of programmes on outcomes, assessment of internal validity is done in "risk of bias" assessment. Risk of bias tools provide the criteria to enable reviewers to evaluate transparently the likelihood of bias, for particular bias domains (e.g., confounding, selection bias, deviations from intended interventions, bias in outcomes data collection and reporting). A recent review paper (Waddington et al., 2017) … (Chaplin et al., 2018, p. 7).
In the following section, we present results of a review of approaches used in risk of bias assessment in Campbell reviews, including RDDs. Subsequently, we present an approach to conducting risk of bias assessment in RDDs.

| Risk of bias in Campbell systematic reviews
We downloaded records from the Campbell Library and collected the following data:

• Study information: Lead coordinating group and study identifiers (lead author and year).
• Study inclusion criterion: Whether the review eligibility criteria included nonrandomised studies of effects.
• Incorporation of RDD: Whether the criteria for the review included RDDs, whether any were found and included, and whether any were excluded. In one review, RDDs were included as eligible but none were found (Petkovic et al., 2018).
Authors used different tools to assess risk of bias for included nonrandomised studies18 (Table 7), roughly corresponding to the group coordinating the registration process. SWG authors mainly used either an early version of Cochrane's risk of bias tool for nonrandomised studies of interventions (Reeves, Deeks, Higgins, & Wells, 2011) or Cochrane's risk of bias tool for RCTs, as did half of ECG reviews. IDCG authors largely used the tool developed by 3ie (Hombrados & Waddington, 2012). We also collected data on the domains of bias contained in the tools used to assess studies incorporated in Campbell reviews. We …

18 Nonrandomised studies are defined here as approaches in which assignment to intervention is based on some method other than randomised or quasi-randomised (e.g., alternation) assignment.

… confounders (e.g., RCTs and RDDs). We found that 94 percent of Campbell reviews incorporating nonrandomised studies assessed risk of bias due to confounding. 'Bias in selection of the reported result' may refer to outcomes or analysis, and may be addressed through reporting of pre-analysis plans, transparent reporting of any analyses that were determined post hoc, and reporting on all outcomes and participant subgroups measured irrespective of findings. Over 80 percent (83%) of Campbell reviews incorporating nonrandomised studies specifically assessed bias in selection of the reported result. Other sources of bias defined by Sterne et al. (2014) which are relevant for RDDs are not covered here because the signalling questions required to answer them are unlikely to differ between RDDs and other designs (e.g., 'performance bias', where participants receive a different intervention to the one intended, and 'bias in outcomes data collection' due to measurement error in outcomes reported by participants or measured by outcomes assessors). One source of bias, 'bias in classification of interventions', was deemed irrelevant for RDDs, because assignment to intervention is based on a score measured at pretest.

RDDs are sometimes eligible for, and included in, Campbell systematic reviews. In 2012-2018, 35 reviews explicitly allowed for the inclusion of RDD (51% of reviews that incorporate nonrandomised studies, or 35% of all reviews); that is, they mention "regression discontinuity" explicitly in the study inclusion criteria, or define inclusion criteria that would be consistent with RDD (e.g., allocation by "time differences, location differences, decision-makers, or policy rules"; Filges, Jonassen, & Jørgensen, 2018, p. 19). Twelve reviews were able to include at least one RDD in analysis (18% of reviews incorporating nonrandomised studies; Table 8). These include eight reviews in international development, which draws on a cross-disciplinary literature including econometrics where the method has become popular, plus three reviews in social welfare and one in education. In a few instances, authors differentiate types of discontinuity design (such as Lawry …). One review explained that, although RDDs "generate data that can be used to make causal inferences, they were excluded from this review because statistical methods for incorporating RDD… data into meta-analyses are, to the best of our knowledge, not well-established".
Previous research has found that existing risk of bias tools are not operationalised to assess nonrandomised studies with selection on unobservables (Waddington et al., 2017). This is partly because most tools were not developed to address particular designs like RDDs. For example, we are aware of four tools that present signalling questions for RDDs (Chief Evaluation Office, undated; Schochet et al., 2010; Valentine & Cooper, 2008).
The tools on which RDDs have been critically appraised in Campbell reviews are Hombrados and Waddington (2012), Reeves et al. (2011) and Higgins et al. (2011).

| Definitions of RDDs
Given the increased interest in RDD as a method of programme evaluation, the incorporation of these studies in systematic reviews and concerns about methods for their synthesis, we elected to focus our work here on RDD. Using a mixture of literature review (of existing risk of bias tools, textbooks and recent journal articles) and expert consultation, we developed signalling questions under seven bias domains.
RDD, also called RD and "cut-off based design", was conceived in educational psychology by Thistlethwaite and Campbell (1960) (cited in Shadish et al., 2002) and in economics by Goldberger (1972). In RDD, assignment to treatment is based on a specific score on an ordinal or continuous measure, such as a diagnostic test, that is given prior to treatment. For example, an educational authority might choose to administer a math test to all 10-year-old students, and provide extra support for students scoring below some threshold of proficiency. Students just on either side of the cut-off threshold should be highly similar to one another, and therefore if there is a treatment effect this should be noted in a discontinuity (or break) in outcomes between the treated and untreated groups at the point of intervention, which may be a change in intercept or slope, or a combination. Figure 2 presents two simple examples of the relationships between the assignment variable (pretest score with cut-off set at 50) and outcomes; there are many more possibilities allowing for nonlinear relationships between assignment variable and outcome (see, e.g., Shadish et al., 2002). RDD has become a popular approach to treatment effect estimation,19 particularly in economics. Key review articles and practice guidelines are available in econometrics (Imbens & Lemieux, 2007; Lee & Lemieux, 2010), educational psychology (see, e.g., Cook, 2008) and, latterly, epidemiology.

T A B L E 8 RDDs in Campbell systematic reviews, by coordinating group

| Group | RDD eligible | RDD included | RDD excluded |
| CJG   | 3            | 0            | 0            |
| ECG   | 9            | 1            | 1            |
| IDCG  | 20           | 8            | 2            |
| SWG   | 3            | 3            | 0            |
| Total | 35           | 12           | 3            |

Abbreviations: CJG, Crime and Justice Coordinating Group; ECG, Education Coordinating Group; IDCG, International Development Coordinating Group; RDD, regression discontinuity design; SWG, Social Welfare Group.

19 Hits on "regression discontinuity" in Google Scholar (run December 21, 2017): up to 1999, …; 2000-2010, …; 2010 to present, over 17,000.
We give several definitions of the RDD approach by leading authors in Table 9. The definition by Shadish et al. (2002) implies that it is the experimenter who designs the study prospectively. However, discontinuity assignment may also arise from natural processes of policy and practice, for example, age requirements for pensions and biometric tests used in medicine. Where outcomes data are available, or can be collected, this opens up the approach to retrospective analysis, provided data are also available on the assignment variable originally used to allocate units; in other words, RDD as a natural experiment (Dunning, 2012). Cook's (2014) definition as presented here suggests a deterministic relationship between the assignment variable and treatment status, but Cook also refers to "fuzzy" RDD, which allows for a probabilistic relationship between assignment and treatment (in other words, there are additional factors, which may or may not be measured, affecting assignment status). The quote from Dunning (2012) also indicates that, at least in social science research and econometrics, the causal relationship is usually estimated close to the cut-off threshold. However, the global relationship may also be estimable under stronger assumptions (Angrist & Rokkanen, 2015; Bor, Moscoe, Mutevedzi, Newell, & Bärnighausen, 2014; Rubin, 1977).20 Finally, Schochet et al. (2010) give guidance on the nature of the assignment variable, suggesting scope for ordinal scales to be included with sufficient units either side of the threshold.23
In theory, RDD generates an unbiased treatment effect. This is because the assignment rule is known, hence the relationship between treatment and outcomes can be modelled with trivially small confounding, at least close to the assignment threshold (Cook, 2008). Its closest design relative is the RCT.21 In economics, political science and many health applications (see Dunning, 2012; Moscoe et al., 2015), RDD has frequently been used retrospectively to evaluate existing threshold rules. Because RDD needs large samples, it is well suited to retrospective evaluation using existing data sources (e.g., administrative data).

T A B L E 9 Definitions of RDD by leading authors (excerpt)

Lee and Lemieux (2010): …

Dunning (2012): "Individuals or other units are sometimes assigned to the treatment or control groups according to whether they are above or below a given threshold on some covariate or pretest. For units very near the threshold, the process that determines treatment assignment may be as good as random, ensuring that these units will be similar with respect to potential confounders" (pp. 63-64)

Cook (2014): "The regression discontinuity design (RDD) occurs when assignment to treatment depends deterministically on a quantified score on some continuous assignment variable. The score is then used as a covariate in a regression of outcome. When RDD is perfectly implemented, the selection process is fully observed and so can be modelled to produce an unbiased causal inference" (p. 1)

20 Angrist and Rokkanen (2015) focus on reweighting internally valid local effects to other values of the assignment variable based on the distribution of other pretreatment characteristics.

21 The combined RDD and RCT design assigns units to intervention randomly when the units fall within a range of scores on an eligibility variable (Shadish et al., 2002).

A wide variety of assignment variables have been used in RDDs (Dunning, 2012; Hahn, Todd, & van der Klaauw, 2001; Moscoe et al., 2015), including but not limited to:
• Test scores: for example, continuous biomarkers in medicine (e.g., cholesterol, blood glucose, birth weight, CD4 count); entrance exams in education.
• Programme eligibility criteria: for example, poverty index; criminality index.
• Age-based thresholds: for example, voting age; pension age; birth date/quarter.
• Size-based thresholds: for example, hospital size; school size.
• Time (RD in time [RDiT]): for example, date of a policy or practice change.
In the geographic discontinuity design (GDD), exposure to the treatment depends on the position of observations with respect to an administrative or territorial boundary (e.g., Galiani et al., 2017). In the particular case where the assignment variable is time, RDiT is similar to controlled ITS (Hausman & Rapson, 2018; Shadish et al., 2002). One difference between ITS and RDiT is that the treated and untreated samples differ, which affects the length of the follow-up period over which treatment effects can be credibly estimated. In ITS, the same participating units are followed up over time, and the treatment effect is identified through variation in exposure to treatment over time, sometimes with respect to an untreated comparison (Somers, Zhu, Jacob, & Bloom, 2013). ITS is most credible in estimating treatment effects for observations immediately after the time of intervention in comparison with their values immediately before (i.e., the short-term effect). In contrast, in RDiT, the treatment effect is estimated by comparing observations from different units measured at the same time (or follow-up period); the comparison is made up of units who were eligible immediately before or after a threshold date on which a policy or practice change occurs.
However, the outcome for those units could be assessed many years later.
In the basic design, assignment to treatment and comparison is based on the observational unit's pretest score on the continuum, relative to the assignment threshold (Figure 3). Estimating the correct functional form of the relationship between assignment variable and outcome is the main estimation concern. Researchers must correctly establish the functional form to avoid a potentially confounded relationship, for example, where a linear regression line is fitted to a nonlinear relationship (Angrist & Pischke, 2009, p. 254; Shadish et al., 2002). Variants on the estimation approach include allowing the treatment effect to manifest as a change in slope as well as, or instead of, a change in intercept (by incorporating interaction terms in regression estimation). For example, in the "regression-kink design" the treatment effect is estimated as the change in slope of the outcome variable at the threshold (Figure 2).
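To make the interacted specification concrete, the following minimal sketch (simulated data with hypothetical parameter values, not drawn from any study in this review) estimates both an intercept shift and a slope change at the cut-off by centring the assignment variable and interacting it with treatment status:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Simulated sharp RDD: extra support assigned below a pretest cut-off of
# 50; treatment shifts both the intercept (+4) and the slope (+0.1).
n = 2_000
score = rng.uniform(0, 100, n)
treat = (score < 50).astype(int)
centred = score - 50                        # centre at the cut-off
y = 20 + 0.3 * centred + treat * (4 + 0.1 * centred) + rng.normal(0, 3, n)

df = pd.DataFrame({"y": y, "treat": treat, "centred": centred})

# Interacted specification: 'treat' is the intercept shift at the cut-off;
# 'treat:centred' is the change in slope (the regression-kink component).
fit = smf.ols("y ~ treat * centred", data=df).fit()
print(fit.params[["treat", "treat:centred"]])
```

Because the assignment variable is centred at the cut-off, the coefficient on the treatment indicator is the estimated effect exactly at the threshold.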
The counterfactual approach, and the modelling of functional form, can be strengthened by the availability of pretest outcomes data. The clear advantage of prospectively designed RDDs, over and above the advantage they have over retrospective studies in controlling for selection bias into the study, is that outcomes can be measured before assignment across participants, allowing estimation of the relationship between assignment variable and outcomes among different units at pretest. This can verify the counterfactual relationship at pretest for observations before they are assigned, in order to confirm the functional form.
F I G U R E 3 Examples of assignment rules used in RDD. RDD, regression discontinuity design

T A B L E 10 Proposed signalling questions for risk of bias in regression discontinuity design: Confounding and selection of the reported result (modifications highlighted in bold font)

Definition of study designs/analysis types to be addressed

RDD: A study that estimates an intervention effect by allocating participants to treatment and comparison groups on the basis of a threshold on a scale variable measured at pretest, and comparing outcomes among units either side of the threshold at post-test. Examples of threshold variables include: test scores such as continuous biomarkers (e.g., cholesterol, blood glucose, birth weight, CD4 count) or entrance exams in education; age-based thresholds (e.g., pension age); size-based thresholds (e.g., class size); or time (e.g., date of birth, date of policy or practice change). RDD may be designed prospectively, where outcomes data are collected by experimenters at baseline and endline specifically for the purposes of evaluating the intervention. However, it is common for RDD studies to be conceived and analysed retrospectively, for example by using routine data collection. In both instances, participants are allocated to intervention at pretest.

Notes on RDD approaches

• RDD shares many similarities with RCTs. While assignment to intervention is not random (an observation's location on a scale variable either side of a threshold is used to determine assignment), estimation methods are used to identify "as-if random" allocation at the threshold.
• Similar to RCTs, the main threat to internal validity for RDD is manipulation of the value of the threshold variable by participants and practitioners: that is, nonrule-based discretionary assignment by practitioners rather than rule-based assignment determined centrally, or public knowledge of the assignment mechanism enabling participants to manipulate their assignment variable score to determine access. However, in the case of RDD, causal identification can also rest on random error in measurement of the assignment variable at the threshold.
• The main difference between RDD and RCT is in the treatment assignment mechanism, which then implies different analytic techniques to estimate the causal effect. Shadish et al. (2012) list the following main threats to estimation of the causal effect in an RDD: a nonlinear relationship between assignment variable and outcome (requiring use of correct polynomial functions of assignment variables); interactions with a third variable causing the relationship between assignment and outcome to be probabilistic rather than deterministic (so-called "fuzzy RDD"); or variables that are non-normally distributed.
• A threat to internal validity arises if the same threshold value is used for assignment to interventions other than the intervention of interest which may affect the outcomes of interest ("treatment confounding"). In such a case, the threshold variable cannot be used to estimate the causal effect of the intervention of interest, but it can be used to estimate the effect of the combined interventions.
• RDDs can estimate the LATE directly around the threshold and, under much stronger assumptions, the (global) ATE that would be estimated in an RCT for the sample population.
• In RDD, the relationship between assignment variable and treatment status may be deterministic (sometimes known as "sharp RDD") or probabilistic (sometimes known as "fuzzy RDD"), in which case factors other than the threshold rule influence treatment assignment. Under "fuzzy RDD", IV estimation should be used to estimate the treatment effect (known as the complier average causal effect, CACE).

The effect of interest

The treatment effect may be estimated as a change in intercept and/or slope, and is estimated as either:

• The difference in outcomes (intent-to-treat effect) at the threshold, also called the local average treatment effect (LATE or, alternatively, the average treatment effect at the cut-off);
• The difference in outcomes (intent-to-treat effect) across the entire distribution, also called the (global) average treatment effect (ATE), which can be obtained via extrapolation in RDD under additional assumptions; or
• The difference in outcomes among "compliers", that is, units induced to take up the treatment because of the threshold rule (complier average causal effect, CACE, at the threshold), estimated by instrumental variables regression.

Researchers should analyse the change in slope and/or level using different bandwidths around the threshold and different functional forms. The following should be prespecified as far as possible, and reported in sensitivity analysis (Y/PY/PN/N/NI):

• Selection of optimal bandwidth using existing data-driven routines;
• Selection of appropriate functional form for the relationship between assignment and outcome variables; and
• Robustness checks to other bandwidths and functional form specifications.

2.3 Is the reported effect estimate likely to be selected, on the basis of the results, from different subgroups? (Y/PY/PN/N)

Particularly with large cohorts often available from routine data sources, it is possible to generate multiple effect estimates for different subgroups or simply to omit varying proportions of the original cohort. If multiple estimates are generated but only one or a subset is reported, there is a risk of selective reporting on the basis of results.

Note: Suggested changes for RDD indicated in bold font. All other text taken from Sterne et al. (2014). Only bias domains 1 (confounding) and 7 (bias in selection of the reported result) are shown here. Other bias domains are relevant for RDD but are not shown, as changes to existing ROBINS-I signalling questions are not being suggested at this point.

Abbreviations: ATE, average treatment effect; LATE, local average treatment effect; RCT, randomised controlled trial; RDD, regression discontinuity design.
Where treatment occurs among units in time-dependent batches, there may be scope to estimate the counterfactual functional form relationship at the same time among different groups (Cook, 2008).22 Recruitment into the study of a "pure control" group that is subjected to the same informed consent and data collection also has the advantage of enabling measurement of motivation biases in prospective studies (see below). A nonequivalent dependent variable ("placebo outcome") can also be added to provide a "no-cause counterfactual" (Trochim, 1984; cited in Cook, 2008). Tests for "placebo discontinuity" at different thresholds of the assignment variable can help rule out the existence of a chance relationship with the outcome of interest.
Most RDDs are analysed retrospectively using routine data collection. This has implications for the analysis that can be done to verify the underlying assumptions of the approach (e.g., limits to testing the preintervention relationship between assignment variable and outcomes where baseline outcomes data are not available). Other concerns, such as outcomes data that are self-reported rather than directly observed (Schmidt, 2014), may be less relevant. A disadvantage of retrospective RDD is that one really needs to know a lot about the situation in which the RDD is implemented to judge its validity, for example, whether placement scores could be manipulated (by selectively allowing some people to retake the placement test).

| Internal validity in RDD
The main threats to internal validity in RDD concern inappropriate characterisation of the assignment variable or of the relationship between assignment variable and outcome, and manipulation by participants or implementers of the assignment variable. In order for RDD to produce internally valid estimates, the minimum criterion is exchangeability at the threshold; that is, the potential outcomes would be the same on average if treated units had been untreated and untreated units had been treated, as would be the case in a well-conducted RCT. One common way this is violated is if the assignment variable itself is precisely manipulable by participants or implementers, at least over the subsample of observations around the cut-off threshold. Threats to validity may arise where there is public knowledge among programme participants of a manipulable assignment variable, or where practitioners are able to assign to treatment on a discretionary basis. This is equivalent to assessing subversion of randomisation when random allocation is not concealed until after recruitment in RCTs. Hence, participants should either be blinded to the value of their assignment variable or unable to manipulate it (and practitioners should either not be involved in assignment or be unable to manipulate it). Assignment variables which participants have manipulated include reported income, which may be incorrectly reported or manipulated to gain eligibility to programmes (Buddelmeyer & Skoufias, 2004).
In addition, the relationship between the assignment variable and outcome must be unconfounded over the sample in which the treatment effect is estimated. The most important cause of confounding is manipulation by participants or planners. This assumption is easiest to establish for the sample of observations closest to the threshold where there is random error in measurement of the assignment variable (Goldberger, 1972). Where an assignment rule determines eligibility for more than one intervention affecting the outcome of interest (treatment confounding), it will not be suitable for RDD analysis of the effects of the intervention of interest (Schochet et al., 2010). An example is evaluations of school lunch programmes in which family income is used for programme eligibility at the same cut-off as is used for other state welfare benefits. This is the same as a violation of the exclusion restriction in an IV set-up, or the selection-by-history effect that could also occur in an RCT.
However, it still offers a valid opportunity to assess the overall ITT effect of assignment to the full package of interventions.
Hence three causal estimands can be distinguished, relating to assumptions about the homogeneity of the treatment effect across the full sample distribution:

1. Where treatment effects are variable across the distribution, RDD is able to estimate the LATE (also called the ATE at the cut-off; Hahn et al., 2001). In the simplest case, where there is random noise in the measurement of a (nonmanipulable) assignment variable, units immediately around the threshold can be considered as randomly allocated to treatment and control (Goldberger, 1972). Where the assignment variable is measured without noise, a stronger assumption is needed, that is, that the relationship between the assignment variable and outcome is unconfounded in the region of the threshold. Another way of saying this is that the potential outcomes (expected outcomes conditional on the assignment variable) are assumed to be continuous at the threshold (Hahn et al., 2001).

22 These methods may also be available in retrospectively designed RDDs, where baseline and/or control group data are available.

23 Peter Schochet in personal communication stated: "the issue about the scoring variable having 'at least four unique values below the cutoff and four unique values above the cutoff' was determined by an expert panel as a reasonable approximation for defining a 'continuous' scoring variable… What is 'continuous' will clearly depend on the context". Cochrane's Effective Practice and Organisation of Care (EPOC) group also used a minimum approach to define ITS studies as having at least three observations before and after intervention, on the intuition that a minimum of three observations is required to identify a trend.

2. Where treatment effects are constant across the assignment variable distribution (i.e., the slope of the assignment-outcome relationship is independent of the assignment variable), RDD can estimate the ATE. In this case, the functional form of the relationship between assignment variable and outcome must be modelled appropriately over the full distribution of the assignment variable. Strong assumptions are needed for identification, such as the relationship between assignment and outcome being unconfounded over the entire distribution of the assignment variable (potential outcomes are continuous across the full distribution; Bor et al., 2015).

3. Where the relationship between assignment variable and treatment status is probabilistic rather than deterministic ("fuzzy RDD"), IV estimation is used to estimate the CACE.

| Statistical conclusion validity in RDD
The main factors affecting the validity of a statistical test of the treatment effect in RDD arise from: choice of bandwidth around the threshold, estimation of the functional form, and adherence to the assignment rule.
Statistical modelling to determine the sample over which to estimate the treatment effect is crucial in RDD. Under the weaker validity assumption for LATE, this means determining the appropriate bandwidth around the threshold for local linear regression, which is usually done using covariate balance tests to examine the assumption that groups on either side of the threshold are roughly equivalent in their baseline characteristics (Schochet et al., 2010).
Estimating the correct functional form of the relationship between the assignment (forcing) variable and the outcome is the main estimation concern. As noted above, the relationship between outcome and assignment variable may be nonlinear and confound the treatment effect estimate. There are two ways to estimate the relationship:

1. Parametric approaches: The use of parametric approaches does not necessarily imply that the relationship is linear. Ideally, one should report different polynomial orders for the assignment variable and then identify the preferred specification under different bandwidths of the assignment variable (using the Akaike Information Criterion and visual inspection).
2. Nonparametric approaches: These do not make any assumptions about functional form. For this reason, Imbens and Kalyanaraman (2012) argue that this technique is preferable. The key issue when using this approach is defining the optimal bandwidth. The state-of-the-art option used by econometricians is the optimal bandwidth procedure defined by Calónico et al. (2014).
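As an illustration of both strategies (simulated data; in applied work one would use the data-driven bandwidth selector of Calónico et al., as implemented in the rdrobust packages, rather than the ad hoc grid below), the following sketch compares polynomial orders by AIC and then reports a local linear estimate under several bandwidths with a triangular kernel:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 5_000
x = rng.uniform(-50, 50, n)                  # centred assignment variable
treat = (x >= 0).astype(int)
y = 10 + 0.05 * x + 0.002 * x**2 + 2.0 * treat + rng.normal(0, 3, n)
df = pd.DataFrame({"y": y, "x": x, "treat": treat})

# 1. Parametric: compare polynomial orders by AIC over the full sample.
for order in (1, 2, 3):
    terms = " + ".join(f"I(x**{k})" for k in range(1, order + 1))
    fit = smf.ols(f"y ~ treat * ({terms})", data=df).fit()
    print(order, round(fit.aic, 1), round(fit.params["treat"], 3))

# 2. Nonparametric: local linear regression with triangular kernel
#    weights, reported across several bandwidths as a robustness check.
for h in (5, 10, 20):
    local = df[df.x.abs() < h].copy()
    w = 1 - local.x.abs() / h                # triangular kernel weights
    fit = smf.wls("y ~ treat * x", data=local, weights=w).fit()
    print(h, round(fit.params["treat"], 3))
```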
We can distinguish two forms of nonadherence in RDD analysis, which affect the estimation methods used: nonadherence to the assignment rule; and fuzziness (cf. misclassification) in the relationship between assignment variable and assignment status. We note here that these are conceptually different from manipulation of the forcing variable, which is a major threat to validity in RDD, as noted above. Imbens and Lemieux (2007) and Cook (2008), citing Trochim (1984), distinguish "sharp" and "fuzzy" RDD. In the case of a "sharp" RDD, the relationship between assignment variable and treatment status is known and deterministic. In analysis, this means there will be no value of the assignment variable at which different units may be both treated and untreated. There may still be overrides to the assignment due to nonadherence, which can be addressed in ITT analysis. This is a qualitatively different case from fuzzy RDD, where the assignment scale only determines the probability of treatment status, for example because additional factors are taken into account to determine assignment. The distinction is between fuzziness in the assignment threshold value, as in the case of overrides to the assignment rule in sharp RDD, versus allowing for other factors to determine treatment uptake, with the assignment cut-off fixed, in fuzzy RDD. In fuzzy RDD there are elements of the rule determining assignment which may be "unknown", and in analysis there may be values of assignment at which different units may be both treated and untreated, hence the need for two-stage least squares estimation, where RDD assignment is used as an IV.
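The two-stage logic under fuzzy RDD can be sketched as follows (simulated data with hypothetical take-up rates): the threshold indicator instruments for realised treatment status, and the second stage recovers the CACE.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 5_000
x = rng.uniform(-50, 50, n)                  # centred assignment variable
above = (x >= 0).astype(int)                 # threshold rule (instrument)

# Fuzzy RDD: crossing the threshold raises the probability of treatment
# from 0.2 to 0.8 rather than determining it outright.
treat = rng.binomial(1, 0.2 + 0.6 * above)
y = 5 + 0.04 * x + 3.0 * treat + rng.normal(0, 2, n)
df = pd.DataFrame({"y": y, "x": x, "above": above, "treat": treat})

# Two-stage least squares by hand: first stage predicts treatment from
# the threshold indicator; second stage uses the fitted values.
df["treat_hat"] = smf.ols("treat ~ above + x", data=df).fit().fittedvalues
second = smf.ols("y ~ treat_hat + x", data=df).fit()
print(second.params["treat_hat"])            # CACE estimate (approx. 3)

# Note: manual second-stage standard errors are not valid; in practice a
# dedicated IV routine (e.g., IV2SLS in the linearmodels package) or the
# rdrobust fuzzy-RDD option should be used.
```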
Other implementation problems, such as nonrandom attrition and other sources of incomplete data, and biases in outcome measurement, can be assessed using standard approaches.

| Reporting standards and a preliminary risk of bias tool
There is potential for bias if the assumptions for RDD are not met, or are not demonstrated and reported to be met. For example, adequate reporting of the assignment mechanism is vital in RDD. Lee and Lemieux (2010) and Moscoe et al. (2015) discuss reporting criteria for RDD. We list the following as being important in justifying the approach:

1. Clear presentation of the assignment-outcome relationship in a graph showing the discontinuity. Appropriate functional form may include local linear regression at the assignment threshold, or an ordered polynomial. The treatment effect may be measured as a change in intercept and/or a change in slope (regression-kink design).
2. Discussion of the validity conditions in the context of the study, particularly around manipulation of assignment variable score, demonstrating that: the assignment decision rule was adequately concealed from participants; the assignment variable was nonmanipulable by participants, practitioners or other decision makers; or the assignment variable was measured with random error.

3. Confirmation tests:
• Reporting the distributions of baseline characteristics above and below the cut-off.
• Histogram of the assignment variable demonstrating no data bunching around the threshold, hence no manipulation of treatment status (a minimal bunching check is sketched after this list).
• Addition of a phase in which intervention is not present (e.g., by estimating the preintervention relationship between assignment variable and outcomes at baseline), or among a "pure control" group that is not offered treatment assignment, to verify functional form and to adjust for nonlinearities in the relationship.
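The bunching check mentioned above can be approximated very simply, as in the following sketch (simulated data; a formal application would use a density test such as McCrary's, for which dedicated routines exist):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Assignment variable scores, with artificial bunching just below a
# cut-off of 0 (e.g., income reports manipulated to gain eligibility).
z = rng.normal(0, 10, 10_000)
z[:400] = rng.uniform(-1, 0, 400)            # manipulated observations

# Crude bunching check: compare counts in narrow bins either side of the
# cut-off; with no manipulation the split should be close to 50:50.
width = 1.0
below = np.sum((z >= -width) & (z < 0))
above = np.sum((z >= 0) & (z < width))
p = stats.binomtest(below, below + above, 0.5).pvalue
print(below, above, p)                       # a small p suggests bunching
```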

4. Falsification tests:
• Analysis of "placebo discontinuities" at different thresholds of the forcing variable, showing no other discontinuities in the assignment variable within the window of interest (a sketch follows this list).
• Addition of a nonequivalent outcome, or "placebo outcome". That is, assessing the effect on a second outcome variable that the intervention should not influence, as a falsification exercise.
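A placebo-discontinuity check can be sketched as follows (simulated data): the RDD specification is re-estimated at fake cut-offs away from the true threshold, where the estimated "effect" should be indistinguishable from zero.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 5_000
x = rng.uniform(-50, 50, n)
y = 10 + 0.05 * x + 2.0 * (x >= 0) + rng.normal(0, 3, n)
df = pd.DataFrame({"y": y, "x": x})

# Placebo discontinuities: refit the RDD at fake cut-offs, using only
# observations on one side of the true threshold so no effect is expected.
for c in (-30, -15, 15, 30):
    side = df[(df.x < 0) if c < 0 else (df.x >= 0)].copy()
    side["xc"] = side.x - c                  # recentre at the fake cut-off
    side["fake"] = (side.xc >= 0).astype(int)
    fit = smf.ols("y ~ fake * xc", data=side).fit()
    print(c, round(fit.params["fake"], 3))   # should be close to zero
```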

5. Reporting multiple specifications to check robustness. This includes testing the robustness of the results to the use of: nonparametric methods using different bandwidths; and parametric methods using different windows for the assignment variable and different polynomial orders when modelling the relation between assignment and outcome variables.

6. Demonstration that there is no "treatment confounding": that is, the allocation rule is only used to assign interventions of interest, and not additional interventions which may affect the outcomes of interest in the study.

7. In the case of prospectively designed RDDs (QEs), bias in selection of participants into the study can be addressed by controlling for baseline imbalances, similar to RCTs. Bias in outcomes measurement may be minimised by use of methods to account for biases, such as blinding to intervention status or measurement of outcomes using direct observation rather than participant self-report.

8. In the case of retrospectively designed RDDs drawing on observational datasets (natural experiments), bias in selection of participants into the study may be problematic, but bias in outcomes measurement (e.g., due to self-reported outcomes) less so.
On the basis of this review, we propose signalling questions within relevant bias domains of ROBINS-I. ROBINS-I is grouped around seven bias domains, as explained further in Sterne et al. (2016). Here we propose modifications (highlighted in bold in Table 10) for two domains: bias due to confounding, and bias in selection of the reported result.
Risk of bias due to confounding includes questions about the definition of the assignment scale (continuous or discrete), the specification of the relationship between assignment and outcome, treatment confounding and the assessment of balance. Thus we might expect credible RDDs to: use a continuous variable for assignment; use an appropriate method to examine the relationship with outcomes (e.g., nonparametric kernel regression) and report sensitivity analysis; report a graph of the discontinuity showing no other discontinuities in the assignment variable within the window of interest; report a histogram (kernel density plot) of the assignment variable to spot bunching around the threshold, which might be indicative of manipulation; and report baseline data to assess the preintervention relationship. Some of these reporting requirements, such as graphing the discontinuity, have only latterly become common in RDD reporting. For reporting bias, we may expect authors to present findings for all prespecified outcomes, including across multiple bandwidths.

| DISCUSSION AND OPPORTUNITIES FOR FURTHER RESEARCH
The main results of this paper have shown that a wide range of risk of bias tools and internal study replications exist and may be drawn on in designing future internal study replications and risk of bias tools. In particular, drawing on a specialised and more in-depth systematic search of impact evaluations from international development, it highlights that a much larger body of evidence exists in this field than was previously known. Here the review identifies that at least 10 within study comparisons from international development exist, more than doubling the number previously reviewed. We consider that these studies can be used in further developing the accuracy of bias assessments and provide the grounds on which they can be evidenced.
A synthesis of findings from primary studies and existing reviews of internal study replications from international development has highlighted that nonrandomised estimators can provide estimates very similar to randomised estimates, but that they do not always eliminate bias. In particular, evidence suggests that reductions in bias …

Further research would also benefit from exploring whether specification tests can be used to "rule out" biased estimates. For instance, assessments by Heckman and Hotz (1989) of the predictive accuracy of diagnostic tests of pre-programme alignment of participants' characteristics first showed promise in helping to improve our ability to detect a biased estimator. Findings from Glazerman et al. (2003) further indicated that specification tests can help to eliminate the poorest performing nonrandomised estimates. To date, the internal replication literature from L&MICs provides relatively little analysis to contribute to the understanding of the efficacy of diagnostic tests. In some cases, the L&MIC studies do use diagnostic tests; for example, Chaplin et al. (2017) assess the balance of variables in their final propensity score models. However, these studies do not evidence whether such diagnostics can be used to predict consistently when biased estimators will occur.
Examples of more contemporary diagnostic tests that could also be considered include whether researchers can accurately detect sensitivity of nonrandomised estimators to unobserved biases (e.g., see Arceneaux, Gerber, & Green, 2010).
Finally, this review has further highlighted the need for risk of bias approaches to be developed for other credible nonrandomised methods, given the increasing use of these approaches, including QEs and natural experiments, in programme evaluation, as well as their incorporation in systematic reviews. Here we have suggested the development of an approach for RDD. Further work should take place to pilot this approach and, importantly, to develop algorithms for reaching transparent risk of bias assessments for particular domains. Further work is also needed on other approaches, for example statistical matching approaches and DIDs, popular methods of programme evaluation for which careful risk of bias assessment is not commonly undertaken in systematic reviews.
Correspondingly, future research related to reviews of internal replication studies could consider an updated method-orientated review of nonrandomised matching studies in this literature. Chaplin et al. (2018) serve as an example of such a study on RDD. Other methods that may warrant reviews include panel regressions (such as DID), selection models, nonrandomised IVs and ITS analysis. We also note that the results of this study highlight that broader efforts to identify all existing internal replication studies should consider more specialised systematic search strategies within particular literatures (so as to overcome the lack of systematic indexing of this evidence). With respect to international development, future updates of this review would be particularly warranted as systematic searches on databases such as the 3ie impact evaluation repository continue to be updated and expanded.

DECLARATIONS OF INTEREST
The authors are not aware of any financial or other interests which might have influenced the work undertaken.

APPENDIX: SYSTEMATIC REVIEW CODING TOOL

Section 1 (S1): Study, Intervention And Estimate