In our experience, the process of selecting and recruiting sites for multisite impact evaluations has four key steps. First, the evaluation may explicitly specify its population of interest; in other cases, the population is left unspecified, either because it is obvious (e.g., all program participants, in evaluations of ongoing programs) or because there are multiple populations of interest to the evaluation sponsor (e.g., in evaluations of interventions that may be adopted voluntarily by local entities). Second, the evaluation defines what a site is and may specify eligibility criteria for inclusion, or at least describe the types of sites it aims to recruit. Third, the evaluation selects a sample of sites and invites them to participate. Fourth, the invited sites must decide whether to participate. The selection and recruiting process typically continues until the evaluation meets its sample size requirements (or decides to conduct the evaluation with a reduced sample).
In our conceptual model, we treat purposive site selection as a process. Instead of focusing on the outcome—the sample actually selected—we conceptualize the selection process as a random process with well-defined but unknown probabilities. More specifically, we assume that for any evaluation, for each site in the population of interest, there exists a well-defined probability of inclusion in the evaluation. Like all probabilities, the probability for each site falls between 0 and 1, inclusive. However, unlike formal probability sampling, the probabilities are unknown even to the researchers who selected or recruited the sample.
In this model, we define a site's probability of inclusion as the proportion of replications of the site inclusion process in which the particular site would be included in the evaluation sample (i.e., in which the site would both be chosen and agree to participate). We define a replication as a hypothetical execution of a site inclusion process, which is defined by certain fixed parameters, but also includes some variable or random elements. The fixed parameters of the inclusion process may include the universe of potential sites and the target number of sites to be included; the variable elements of the inclusion process may include the procedures used to recruit eligible sites and time-varying factors that influence sites’ willingness to participate, including the personality traits of site-level decisionmakers and political factors that influence their decisions. Under this conceptual model, the particular sites included in the evaluation can vary across replications.
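The notion of an inclusion probability as the long-run proportion of replications that include a site can be made concrete with a small simulation. The recruiting mechanism sketched below (a random contact order and a site-specific "willingness" to agree) is purely illustrative, not a claim about any actual evaluation; the fixed parameters are the universe of K sites and the target of J sites, while the contact order and agreement decisions are the variable elements.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 1000          # sites in the population of interest (fixed parameter)
J = 60            # target number of sites to include (fixed parameter)
# Hypothetical site-level "willingness" to participate (illustrative only).
willingness = rng.uniform(0.05, 0.95, size=K)

def run_inclusion_process(rng):
    """One hypothetical replication: contact sites in random order until J agree."""
    included = np.zeros(K, dtype=bool)
    for s in rng.permutation(K):
        if rng.random() < willingness[s]:
            included[s] = True
            if included.sum() == J:
                break
    return included

reps = 2000
counts = np.zeros(K)
for _ in range(reps):
    counts += run_inclusion_process(rng)

# Each site's inclusion probability, estimated as the proportion of
# replications in which that site ends up in the evaluation sample.
p_hat = counts / reps
```

Because every replication includes exactly J of the K sites, the estimated probabilities average to J/K across the population, a fact used later in the bias derivation.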
Although this conceptual model may seem restrictive, it is in fact sufficiently general to allow for any kind of inclusion process. At one extreme, it allows for a perfectly deterministic site inclusion process (e.g., 60 eligible schools with an inclusion probability of 1 and 940 eligible schools with an inclusion probability of 0). At the other extreme, it allows for a perfectly random process (e.g., all 1,000 eligible schools with a 6 percent chance of inclusion in the sample).
Most importantly, our model allows for more realistic situations in which some sites in the population of interest have zero probabilities of inclusion and other sites have positive but varying inclusion probabilities. For example, for a hypothetical random assignment evaluation of after-school programs, the probability of inclusion may be zero for sites that lack enough excess demand to conduct random assignment, small but positive for oversubscribed sites that serve a small number of children (e.g., those located in rural areas), and larger for oversubscribed sites that serve a large number of children (e.g., those located in urban areas).
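The after-school example can be made numerical. All figures below are invented for illustration; the calculation uses the fact, derived formally in the next subsection, that the expected pooled estimate weights each site's impact by its inclusion probability.

```python
import numpy as np

# Invented after-school population (numbers are illustrative only):
# 400 undersubscribed sites: cannot randomize -> inclusion probability 0
# 400 small oversubscribed (rural) sites: probability 0.02
# 200 large oversubscribed (urban) sites: probability 0.20
p      = np.array([0.00] * 400 + [0.02] * 400 + [0.20] * 200)
impact = np.array([0.05] * 400 + [0.10] * 400 + [0.25] * 200)

J = p.sum()                       # expected number of included sites (= 48)
pop_avg = impact.mean()           # parameter of interest: average impact over all sites
est_avg = (p * impact).sum() / J  # expected value of the pooled impact estimate
# The recruited sample over-represents the large urban sites, so the
# expected pooled estimate exceeds the population average impact.
```

Here the population average impact is 0.11, but the expected pooled estimate is 0.225, because sites with large impacts are far more likely to be recruited.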
One way of understanding our conceptual model is by analogy. Our conceptual model for purposive sampling is analogous to the conceptual models behind Donald Rubin's theory of missing data (e.g., Rubin, 1976) and James Heckman's theory of sample selection bias (e.g., Heckman, 1976). Both of these models consider the absence of particular units from the analysis sample as having a probabilistic component. Our model can thus be thought of as a special case of more general models that have played a prominent role in evaluation research.
External Validity Bias
In this subsection, we derive a mathematical expression for the bias that results from selecting sites purposively and then using standard methods to obtain a pooled impact estimate. First, let us formally establish the parameter of interest in multisite impact evaluations as the average impact in the population of interest. We assert that in most evaluations, the main parameter of interest is either the average impact across all sites in the population of interest or the average impact across all individuals in this population (where the latter is simply a weighted average of the former). To derive a formal expression for the bias, we focus on the former parameter. This would be a key parameter of policy interest if individual sites can choose whether to adopt the intervention, or if policy decisions are made at a higher level for all sites in the population and the number of program participants per site is the same across all sites in the population. Equation (1) defines the parameter of interest as Δ:
$$\Delta = \frac{1}{K}\sum_{s=1}^{K}\Delta_s \qquad (1)$$

where $K$ equals the number of sites in the population and $\Delta_s$ is the impact in site $s$ for $s = 1, \ldots, K$.⁶
Suppose that J sites are included in the evaluation, where $J < K$, and the J sites included in the evaluation are a subset of the K sites in the population of interest. Equation (2) defines the pooled impact estimator that is often computed in multisite evaluations based on purposive site selection, which is just a simple average of the site-level impact estimates from the sites included in the evaluation:
$$\hat{\Delta} = \frac{1}{J}\sum_{j=1}^{J}\hat{\Delta}_j \qquad (2)$$

where $j$ subscripts the $J$ sites included in the evaluation sample and $\hat{\Delta}_j$ is the impact estimate in site $j$.
An alternative way of expressing this estimator is the following:
$$\hat{\Delta} = \frac{1}{J}\sum_{s=1}^{K} I_s \hat{\Delta}_s \qquad (3)$$

where $I_s$ equals 1 if site $s$ from the population was included in the evaluation and equals 0 otherwise.
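A quick numerical check, with made-up impact estimates, confirms that the two expressions of the estimator are algebraically identical: averaging over the included sites is the same as taking the indicator-weighted sum over the whole population and dividing by J.

```python
import numpy as np

rng = np.random.default_rng(1)

K = 10
# Hypothetical site-level impact estimates (illustrative values only).
impact_est = rng.normal(0.2, 0.1, size=K)

# Mark 4 of the K sites as included in the evaluation.
included = np.zeros(K, dtype=bool)
included[rng.choice(K, size=4, replace=False)] = True
J = included.sum()

eq2 = impact_est[included].mean()        # equation (2): average over included sites
eq3 = (included * impact_est).sum() / J  # equation (3): indicator-weighted sum over all K
# eq2 and eq3 are the same number
```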
The bias of the estimator in equation (3) equals the expected difference between this estimator and the average impact shown in equation (1):

$$\mathrm{Bias} = E\!\left[\hat{\Delta} - \Delta\right] = E\!\left[\hat{\Delta}\right] - \Delta \qquad (4)$$
The expectation in equation (4) is defined across replications of a given evaluation design. The evaluation design to be replicated includes both a specific process for selecting sites and a specific methodology for estimating impacts in each site that could potentially be included in the evaluation. The methodology for estimating impacts includes both the process for selecting the study sample in each included site and, for evaluations based on random assignment and many quasi-experimental methods, a process for assigning sample members to groups. The pooled impact estimate will vary across replications for two reasons: (1) the sites selected for the evaluation will vary across replications, and (2) for each site, the individuals included in the treatment and control or comparison groups will vary across replications. The expected value of the impact estimate, $E[\hat{\Delta}]$, is defined as the limit of the simple average of the pooled impact estimates across replications of the evaluation as the number of replications approaches infinity.
Substituting equation (3) into equation (4), and moving the expectation inside the summation, yields equation (5):

$$\mathrm{Bias} = \frac{1}{J}\sum_{s=1}^{K} E\!\left[I_s \hat{\Delta}_s\right] - \Delta \qquad (5)$$
It is important to recognize that (a) the population mean of the site-level impacts is, by definition, the parameter of interest established in equation (1) (i.e., $\frac{1}{K}\sum_{s=1}^{K}\Delta_s = \Delta$), and (b) the population mean of the site-level inclusion probabilities $p_s$ is, by construction, equal to the fraction of all sites in the population to be included in the evaluation (i.e., $\bar{p} = \frac{1}{K}\sum_{s=1}^{K} p_s = J/K$).⁸
$$\mathrm{Bias} = \rho_{\Delta,p}\,\sigma_{\Delta}\!\left(\frac{\sigma_{p}}{\bar{p}}\right) \qquad (12)$$

Equation (12) shows that the external validity bias from purposive site selection depends on three factors: the standard deviation of impacts across sites in the population of interest ($\sigma_{\Delta}$), the coefficient of variation of the inclusion probabilities across sites in the population ($\sigma_{p}/\bar{p}$), and the correlation between site-level impacts and site inclusion probabilities in the population ($\rho_{\Delta,p}$). If all three of these factors are nonzero, then the external validity bias from purposive site selection will be nonzero, and the magnitude of the bias will depend on the magnitude of the three factors. However, if any of the three factors equals zero, the bias will be zero. In other words, the external validity bias from purposive site selection will be zero if (1) the impact is the same in all sites, (2) the probability of being included in the sample is the same in all sites (i.e., as if the sample were a simple random sample), or (3) impacts and site inclusion probabilities vary across sites in the population but are uncorrelated with each other; that is, the site inclusion process does not favor sites with particularly large or small impacts.
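The bias expression can be verified by Monte Carlo simulation. The population below is invented, and the inclusion process is one simple mechanism consistent with the conceptual model: each site is included independently with its own probability, and the pooled estimate divides the sum of included sites' impacts by the expected sample size J.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented population of K sites: impacts vary, and sites with larger
# impacts are (by construction) somewhat easier to recruit.
K = 500
impacts = rng.normal(0.2, 0.1, size=K)
scaled = (impacts - impacts.min()) / (impacts.max() - impacts.min())
probs = np.clip(0.05 + 0.4 * scaled + rng.normal(0, 0.05, size=K), 0.01, 0.99)

J = probs.sum()          # expected number of included sites
delta = impacts.mean()   # parameter of interest, as in equation (1)

# Bias formula: correlation x SD of impacts x CV of inclusion probabilities.
rho = np.corrcoef(impacts, probs)[0, 1]
formula_bias = rho * impacts.std() * (probs.std() / probs.mean())

# Monte Carlo: include each site independently with its own probability,
# then form the pooled estimate (1/J) * sum of included sites' impacts.
reps = 20000
included = rng.random((reps, K)) < probs
estimates = (included * impacts).sum(axis=1) / J
mc_bias = estimates.mean() - delta
# formula_bias and mc_bias agree up to simulation error
```

Because the recruiting mechanism favors high-impact sites, both the formula and the simulation show a positive external validity bias.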
Interestingly, a parallel expression for bias has been derived in the survey nonresponse context, where the set of respondents may not be representative of the full population. Brick and Jones (2008) express the bias in the mean of an outcome Y as the product of the coefficient of variation of the response probabilities, the standard deviation of Y, and the correlation between the response probability and Y (see equation 2 in their paper).
One factor that does not affect the external validity bias is the average impact across sites in the population of interest. Although the variability of the site-level impacts appears in equation (12), the mean of the site-level impacts does not.
In addition, contrary to what one might expect, increasing the number of sites in the evaluation does not necessarily reduce the bias. At the extreme, the external validity bias equals 0 when all sites in the population are included in the sample. However, when the study includes a small share of all sites in the population and all of the site inclusion probabilities are less than 1, increasing the number of sites in the sample will not necessarily reduce the external validity bias. For example, if all of the site inclusion probabilities are increased by a constant multiplicative factor without pushing any probability above its limit of 1, it can be shown that the expected sample size will increase by the same factor while the external validity bias is unaffected (proof available upon request). However, there is no guarantee that the site inclusion probabilities will increase by a constant multiplicative factor when an evaluation increases the number of sites to be included: it will depend on how the site recruiting process is changed to generate a larger sample of sites, and how those changes affect the terms in equation (12). Therefore, there is no necessary relationship between the number of sites included in the evaluation and the magnitude of the external validity bias.
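The multiplicative-scaling claim is easy to verify numerically: scaling every inclusion probability by the same factor changes neither the coefficient of variation of the probabilities nor their correlation with impacts, so the bias in equation (12) is unchanged even though the expected sample size grows. The impacts and probabilities below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented site-level impacts and inclusion probabilities.
K = 400
impacts = rng.normal(0.15, 0.08, size=K)
probs = rng.uniform(0.02, 0.30, size=K)

def bias(p, d):
    """Equation (12): correlation x SD of impacts x CV of inclusion probs."""
    rho = np.corrcoef(d, p)[0, 1]
    return rho * d.std() * (p.std() / p.mean())

c = 2.5              # scale every inclusion probability by the same factor
scaled = c * probs   # at most 0.75 here, so no probability is capped at 1

growth = scaled.sum() / probs.sum()   # expected sample size grows by the factor c...
same_bias = np.isclose(bias(probs, impacts), bias(scaled, impacts))  # ...bias is unchanged
```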
Magnitude of the Bias
While investigating the components of the external validity bias is helpful, knowing the formula for the bias does not by itself indicate how large the bias from purposive site selection is likely to be, either in the average study or in particular studies. Many papers have provided empirical evidence on the magnitude of a different type of bias: internal validity bias (selection bias) resulting from study designs based on nonexperimental comparison groups (e.g., Bloom, Michalopoulos, & Hill, 2005; Cook, Shadish, & Wong, 2008; Fraker & Maynard, 1987; Glazerman, Levy, & Myers, 2003; LaLonde, 1986). However, to the best of our knowledge, no published papers have provided empirical evidence on the magnitude of external validity bias resulting from purposive site selection.
In summary, the amount of external validity bias that results from purposive site selection is an empirical question for which we currently lack evidence. Just as researchers 25 years ago had no evidence on the magnitude of the internal validity bias that would result from a nonexperimental comparison group design, researchers today have no evidence on the consequences of beginning their next multisite impact evaluation by selecting a purposive or convenience sample of sites.