Mostly harmless simulations? Using Monte Carlo studies for estimator selection


An important question is whether either type of EMCS can help applied researchers in choosing which estimator(s) to prefer in a given context. Busso et al. (2014) indicate this might be possible, noting that their results "suggest the wisdom of conducting a small-scale simulation study tailored to the features of the data at hand." Similarly, Huber et al. (2013) suggest that "the advantage [of an EMCS] is that it is valid in at least one relevant environment"; that is, that it is informative at least about the performance of estimators in the data set on which it was conducted.
In this paper we evaluate the premise that EMCS can be informative about the performance of estimators in the particular data that are the basis for the EMCS. We first show theoretically that these approaches are expected to be informative only under very restrictive conditions. These conditions are unlikely to hold in many practical examples faced by a researcher. We then test EMCS performance in a real-world case where we know the actual behavior of estimators. We find that in terms of selecting estimators on absolute bias they are often worse than choosing randomly. On mean squared error (MSE) they perform better than random, but no better than selecting an estimator based on simple bootstrap estimates of MSEs. Their performance in absolute terms may also still be poor.
The first type of EMCS we consider is the placebo EMCS (Huber et al., 2013).[3] This proposes a way to assign "realistic placebo treatments among the non-treated," using information about the predictors of treatment status in the original data. It then tests how well estimators can recover the zero effect of the placebo treatment. The performance of estimators in this exercise is hypothesized to be informative about their performance in the original data.
The second type we describe as the structured EMCS. An exercise of this type is undertaken by Busso et al. (2014).[4] Here a parametrized approximation of the original DGP is created, using functional form assumptions about the distributions of observed covariates. Parameters of their marginal (or conditional) distributions are estimated from the original data. Samples can be drawn from this approximate DGP, to which the estimators can then be applied. Since the treatment effect in this DGP can be calculated directly from knowledge of the parameters, the performance of the estimators in these samples can be measured. The performance of estimators in this exercise is also hypothesized to be informative about their performance in the original data.
To examine whether or not EMCS can correctly choose a best-performing estimator, for various definitions of performance, we first focus on a simple example with two estimators that have Gaussian sampling distributions. We show analytically that both approaches are guaranteed to correctly select the preferred estimator only if they correctly reproduce both the biases and the ordering of the variances of the estimators. These are restrictive conditions that we show can easily fail in practical applications, such as when the EMCS procedures fail to recover heteroskedastic errors or misspecify the regression equations or propensity scores. In two sets of simulations based on a stylized DGP, both approaches select the better estimator less than 3% of the time, much worse than the 50% achievable by selecting randomly.
To study the extent of the problem in a real-world circumstance, we apply both methods to the National Supported Work (NSW) Demonstration data on men, previously analyzed by LaLonde (1986), Heckman and Hotz (1989), Dehejia and Wahba (1999, 2002), Smith and Todd (2001, 2005), and many others. In these data participation in a job training program was randomly assigned, so the treatment effect of the program can be estimated by comparing sample means. LaLonde (1986) used these data to test the performance of estimators at reproducing this treatment effect when an artificial comparison group (rather than the experimental controls) was used. We instead use the data to test how well the two EMCS procedures can inform us about the performance of the estimators: Can EMCS tell us which estimator to use? On average, how much worse than the optimal estimator is the one chosen by EMCS? How well can EMCS reproduce the ranking of performance across all estimators?
Applying the two EMCS procedures we find three main results. First, in terms of absolute bias, the EMCS procedures are no better, and often noticeably worse, than selecting an estimator at random. In two out of three cases we study, the rankings produced are negatively correlated with the true ranking. In one case the preferred estimator selected by EMCS is, on average, 30-37 times worse than the actual best estimator.
Second, EMCS does better at reproducing the performance of estimators in terms of MSE. This is because the MSEs of the estimators are mostly driven by their variances, and EMCS appears more effective at capturing variances. The rankings of estimators are consistently positively correlated with the true rankings, although the estimator preferred by EMCS has an MSE up to twice as high as the best estimator.
Third, given the variance result, we also compare EMCS procedures to choosing estimators based on performance criteria estimated from a simple nonparametric bootstrap. We find that the bootstrap is as good, and often much better, than either of the EMCS procedures. Hence even when the EMCS procedures are somewhat informative, they are not superior to a procedure that relies on fewer design choices.

[3] It is also applied by Lechner and Wunsch (2013), Huber et al. (2016), Frölich et al. (2017), Lechner and Strittmatter (2019), and Bodory et al. (2018). A related approach is proposed by Schuler et al. (2017).
[4] A similar approach is also used by Abadie and Imbens (2011), Lee (2013), and Díaz et al. (2015).
These results are unfortunate, but nevertheless important. They caution against treating either of these approaches as general solutions to the problem of estimator choice. There remains no silver bullet that can assist empirical researchers with the "right" or "best" estimator for a particular context. In the absence of a clear choice driven by research design, the best advice at this stage is likely to be implementing a number of estimators, and then considering the range of estimates provided, as Busso et al. (2014) also suggest.
Our results also have implications for researchers studying the small-sample properties of treatment effect estimators. It has been argued that "it is preferable to study DGPs that are empirically relevant" (Busso et al., 2014). Our theoretical and empirical results suggest there is little support for this claim. We show theoretically that misspecification in the construction of the DGP can lead the ranking of estimators to be incorrect for the original data set. In our empirical example, we see that EMCS is not better than using a bootstrap (and sometimes not better than random) to predict performance in the data on which the EMCS was performed. There seems to be little reason, then, to think it is particularly informative about performance in other unrelated real data sets; that is, that testing small-sample properties of estimators in "real data" is necessarily better than in completely artificial data. A more fruitful path might be to test the sensitivity of estimator performance to parameters of the simulation, such as sample size and the degree of heteroskedasticity. This approach is also taken by Huber et al. (2013), and might be more helpful in understanding what characteristics of samples most affect the performance of particular estimators.

EMCS DESIGNS
We first describe the two main approaches to conducting an EMCS, namely the placebo design of Huber et al. (2013) and the structured design of Busso et al. (2014). In either EMCS design, one simulates many "empirical Monte Carlo" replication samples from a known DGP. By implementing the estimators on the simulated replications, one obtains estimates of the sampling distributions and performance criteria (e.g., MSEs) of the estimators, according to which one ranks the candidate estimators. Note that the researcher needs to make a choice of what criteria to use to rank estimators.
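To fix ideas, the generic loop can be sketched as follows. This is a minimal Python illustration, not the implementation of either paper; the function names and the choice of MSE as the ranking criterion are ours.

```python
import numpy as np

def emcs_rank(simulate_sample, estimators, true_effect, n_reps=1_000):
    """Generic EMCS loop: draw replications from a known DGP, apply each
    candidate estimator, and rank estimators by their estimated MSE.

    simulate_sample: callable returning one simulated data set
    estimators: dict mapping names to callables data -> point estimate
    true_effect: effect implied by the simulated DGP (known by construction)
    """
    errors = {name: [] for name in estimators}
    for _ in range(n_reps):
        data = simulate_sample()
        for name, est in estimators.items():
            errors[name].append(est(data) - true_effect)
    mse = {name: float(np.mean(np.square(e))) for name, e in errors.items()}
    ranking = sorted(mse, key=mse.get)  # best (lowest estimated MSE) first
    return ranking, mse
```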

The placebo design
The idea of the placebo design is to assign placebo treatments to some control observations, so that by construction the treatment effect is zero, and then to attempt to recover this effect.[5] In particular, covariates and outcomes $(X_i, Y_i)$ are first drawn jointly by sampling (with replacement) from the empirical distribution of control observations. Using the original data set, the propensity score is estimated (e.g., using a logit model). The estimated parameters $\hat{\beta}$ of this model are then used to assign placebo treatments to the generated sample in the following way:
\[
D_i^{placebo} = \mathbb{1}\{\alpha + \lambda X_i'\hat{\beta} + \varepsilon_i > 0\},
\]
where $\varepsilon_i$ is an i.i.d. error, and both $\alpha$ and $\lambda$ are additional parameters to be selected. While $\alpha$ shifts the proportion of observations that are treated, $\lambda$ controls the extent of selection: with $\lambda = 1$ selection on observables takes the same form in the Monte Carlo samples as in the original data set.
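As a concrete sketch of the assignment step, the following minimal Python function assumes a logistic error to match the logit choice model; the function name and defaults are ours.

```python
import numpy as np

def assign_placebo(X, beta_hat, alpha=0.0, lam=1.0, rng=None):
    """Assign placebo treatments to resampled control observations.

    X: (n, k) covariates (including an intercept column if the original
       logit had one); beta_hat: logit coefficients estimated on the
       original data; alpha shifts the share of placebo-treated; lam
       scales selection on observables (lam = 1 mimics the original data).
    """
    rng = np.random.default_rng(rng)
    eps = rng.logistic(size=X.shape[0])         # i.i.d. logistic errors
    return (alpha + lam * (X @ beta_hat) + eps > 0).astype(int)
```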

The structured design
The idea of the structured design is instead to create a parametrized approximation to the original (unknown) DGP, and then draw samples from the approximated process. To begin, a fixed number of treated and control observations are created, to match the number of each in the original data set. Covariates and outcome variables are then drawn from parametrized distributions where the parameters are estimated from the original data set. For example, the variable black might come from a Bernoulli with mean estimated from the data, and the variable earnings from a log-normal distribution with mean and variance estimated from the data. The parameters of these distributions are typically estimated conditional on treatment status. Parameters of some distributions might also be conditional on the value of other variables; for example, earnings might be conditional on race as well as treatment status. More conditioning will improve the match of the joint distribution of simulated data to the joint distribution of the original data, but will increase the number of parameters that need to be estimated.
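For illustration, one replication of the covariate step might look like the following sketch, using the black/earnings example from the text; the exact distributions and conditioning sets are design choices, not a fixed recipe.

```python
import numpy as np

def draw_structured(params, n_treated, n_control, rng=None):
    """Draw one structured-EMCS sample: covariates come from parametrized
    distributions whose parameters were estimated from the original data,
    conditional on treatment status (and, for earnings, on race)."""
    rng = np.random.default_rng(rng)
    D, black, earnings = [], [], []
    for d, n in ((1, n_treated), (0, n_control)):
        p = params[d]  # e.g. {"p_black": ..., "mu": {...}, "sd": ...}
        b = rng.binomial(1, p["p_black"], size=n)
        mu = np.where(b == 1, p["mu"]["black"], p["mu"]["nonblack"])
        e = np.exp(rng.normal(mu, p["sd"], size=n))   # log-normal earnings
        D.append(np.full(n, d)); black.append(b); earnings.append(e)
    return (np.concatenate(D), np.concatenate(black), np.concatenate(earnings))
```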

THEORY
To understand the conditions under which an EMCS might be informative about the preferred estimator in some particular data set, we first construct a simple example. Here we have only two estimators, with a straightforward and restricted joint sampling distribution (bivariate Gaussian). This bivariate Gaussian setting mimics an ideal situation in which the finite-sample distribution of the estimators is well approximated by their asymptotic distribution. We show that even in such an ideal, large-sample situation, EMCS can fail to select the best estimator if the bias in any one of the estimators or the ranking of variances is not correctly replicated in the simulated samples. We provide simple common cases for treatment effect estimation in which failure to capture the biases and heteroskedasticity contaminates EMCS, and provide results from a simple simulation illustrating this. We then extend the example to the case of more than two estimators.

Simple example: Two-estimator case
Suppose the researcher wants to rank two estimators $\hat{\theta}_1$ and $\hat{\theta}_2$ according to their statistical performance under repeated sampling. These estimators estimate the same object of interest $\theta \in \mathbb{R}$, but their constructions are different. For simplicity of the illustration, assume that the joint sampling distribution of the two estimators is bivariate Gaussian:
\[
(\hat{\theta}_1, \hat{\theta}_2)' \sim \mathcal{N}\left((\theta_1, \theta_2)', \Sigma_n\right), \qquad (3)
\]
where $\Sigma_n = n^{-1}\Sigma$, $\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}$, and $n$ is the sample size. Here, our implicit assumption is that the estimators $(\hat{\theta}_1, \hat{\theta}_2)$ converge to $(\theta_1, \theta_2)$ at $\sqrt{n}$-rate. Let $\theta_0$ be the true value of the parameter of interest. We allow $\hat{\theta}_1$ and/or $\hat{\theta}_2$ to be biased, so that $\theta_1$ and/or $\theta_2$ can differ from $\theta_0$.
We rank these estimators according to their statistical performance. Given that we often assess the performance of an estimator by its mean squared error (MSE) or mean absolute error (MAE), we may, for instance, rank the estimators according to their MSEs or MAEs.[6] Given the Gaussian assumption, the MSE of each estimator is
\[
\mathrm{MSE}(\hat{\theta}_j) = (\theta_j - \theta_0)^2 + n^{-1}\sigma_j^2, \qquad j = 1, 2.
\]
We denote by $j_0 \in \{1, 2\}$ the index of the strictly preferred estimator, assuming it exists. Ranking the estimators is difficult in practice, since we know neither the means and variances of the estimators nor the true value of $\theta$. Proposals in the EMCS literature aim to infer a best-performing estimator $j_0$ by estimating the sampling distribution of $\hat{\theta}_1$ and $\hat{\theta}_2$ via some Monte Carlo studies. For simplicity, we assume that the estimators simulated in EMCS also follow a bivariate Gaussian:
\[
(\hat{\theta}_1^*, \hat{\theta}_2^*)' \sim \mathcal{N}\left((\tilde{\theta}_1, \tilde{\theta}_2)', a_n^{-1}\tilde{\Sigma}\right), \qquad (4)
\]
where $\tilde{\Sigma} = \begin{pmatrix} \tilde{\sigma}_1^2 & \tilde{\sigma}_{12} \\ \tilde{\sigma}_{12} & \tilde{\sigma}_2^2 \end{pmatrix}$, and $a_n$ is the size of a simulated sample, which may differ from the size $n$ of the original sample. The underlying parameters in EMCS, $(\tilde{\theta}_1, \tilde{\theta}_2, \tilde{\Sigma})$, generally depend on the original sample, but we assume for simplicity that the dependence is negligible and they can be treated as constants. EMCS computes $\hat{\theta}_1^*$ and $\hat{\theta}_2^*$ repeatedly using simulated samples of size $a_n$ drawn from a DGP with the parameter value set at a known value $\tilde{\theta}_0$. For instance, the placebo EMCS approach of Huber et al. (2013) sets $\tilde{\theta}_0 = 0$ and $a_n \le n_0$, the size of the control group in the original data. The structured EMCS approach sets $\tilde{\theta}_0$ at an estimate of $\theta_0$ constructed from the original sample. In implementing EMCS, we do not have to know the mean and variance parameters of $(\hat{\theta}_1^*, \hat{\theta}_2^*)$; they can be estimated with arbitrary accuracy from the simulated estimators. EMCS accordingly obtains the MSE of each estimator as
\[
\widehat{\mathrm{MSE}}(\hat{\theta}_j^*) = (\tilde{\theta}_j - \tilde{\theta}_0)^2 + a_n^{-1}\tilde{\sigma}_j^2, \qquad j = 1, 2.
\]
We denote by $\hat{j}_0$ the index of a best-performing estimator estimated from EMCS, $\hat{j}_0 \equiv \arg\min_{j \in \{1, 2\}} \widehat{\mathrm{MSE}}(\hat{\theta}_j^*)$. To assess the validity of EMCS, we define a criterion of EMCS-validity as the probability that $\hat{j}_0$ coincides with $j_0$, $\Pr(\hat{j}_0 = j_0)$, where the probability is evaluated under repeated sampling of the original samples.
In the examples to follow, we investigate how this criterion of EMCS-validity becomes one or zero depending on the parameter values in the bivariate Gaussian distributions of Equations (3) and (4). We assume away the dependence of the parameters in Equation (4) on the original sample for simplicity of illustration. In such a case the MSE estimates in EMCS and the resulting selection of a best estimator $\hat{j}_0$ are nonrandom when the number of Monte Carlo iterations is large enough. The criterion of EMCS-validity in this case is either 1 or 0.
We can also consider an average regret type criterion, $E[\mathrm{MSE}(\hat{\theta}_{\hat{j}_0}) - \mathrm{MSE}(\hat{\theta}_{j_0})] \ge 0$, to quantify EMCS-validity. Here, the expectation concerns the sampling distribution of EMCS's selection $\hat{j}_0$ of an optimal estimator. This average regret criterion quantifies the severity of a wrong choice of estimator in terms of how much MSE is, on average, sacrificed relative to the true best-performing estimator.

Scenario 1

Denote the biases in the original DGP and in EMCS by $b_j \equiv \theta_j - \theta_0$ and $\tilde{b}_j \equiv \tilde{\theta}_j - \tilde{\theta}_0$, $j = 1, 2$.
We start with a scenario in which $(\hat{\theta}_1, \hat{\theta}_2)$ are unbiased and the distribution of $(\hat{\theta}_1^*, \hat{\theta}_2^*)$ replicates the distribution of $(\hat{\theta}_1, \hat{\theta}_2)$ well in the following sense:
\[
b_1 = \tilde{b}_1 = b_2 = \tilde{b}_2 = 0 \quad \text{and} \quad \tilde{\Sigma} = \Sigma. \qquad (5)
\]
Here, the biases and the sample-size-adjusted variances of the estimators simulated in EMCS coincide with those of the estimators in the original DGP. Note that the true parameter value assumed in EMCS, $\tilde{\theta}_0$, does not have to agree with the true parameter value in the original sampling process, $\theta_0$.
In the current scenario, the ranking of the true MSEs clearly coincides with the ranking of the MSE estimates in EMCS, implying $\Pr(\hat{j}_0 = j_0) = 1$. This is a benchmark case in which EMCS works. The next two scenarios show that once we depart from the assumptions in Equation (5), EMCS may no longer be valid.

Scenario 2
Assume that the estimators are free from biases both in the original DGP and in EMCS, $b_1 = b_2 = \tilde{b}_1 = \tilde{b}_2 = 0$, but EMCS fails to replicate the normalized covariance matrix of the estimators, $\tilde{\Sigma} \neq \Sigma$. In this case, the MSE estimates in EMCS correctly rank the true MSEs of the estimators (assuming $\sigma_1^2 \neq \sigma_2^2$) if and only if the ordering of the variances of the two estimators agrees between the original sampling process and the simulated sampling process; that is,
\[
(\sigma_1^2 - \sigma_2^2)(\tilde{\sigma}_1^2 - \tilde{\sigma}_2^2) > 0.
\]
Otherwise, EMCS reverses the ranking of the estimators and incorrectly selects a suboptimal estimator as optimal, $\Pr(\hat{j}_0 = j_0) = 0$.
Hence, even when EMCS replicates the biases of the estimators well, it can fail to select a best-performing estimator due to an incorrect variance ordering.
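For concreteness, here is a minimal numerical illustration of Scenario 2, with values chosen purely for exposition:
\[
\sigma_1^2 = 1,\quad \sigma_2^2 = 2, \qquad \tilde{\sigma}_1^2 = 2,\quad \tilde{\sigma}_2^2 = 1.
\]
\[
\text{True: } \mathrm{MSE}(\hat{\theta}_1) = n^{-1} < 2n^{-1} = \mathrm{MSE}(\hat{\theta}_2);
\qquad
\text{EMCS: } \widehat{\mathrm{MSE}}(\hat{\theta}_1^*) = 2a_n^{-1} > a_n^{-1} = \widehat{\mathrm{MSE}}(\hat{\theta}_2^*).
\]
EMCS thus selects $\hat{\theta}_2$ although $\hat{\theta}_1$ is strictly better, so $\Pr(\hat{j}_0 = j_0) = 0$.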

Scenario 3
In the third scenario, we assume that EMCS correctly replicates the variance ordering of the estimators; that is,
\[
(\sigma_1^2 - \sigma_2^2)(\tilde{\sigma}_1^2 - \tilde{\sigma}_2^2) > 0,
\]
but estimator 2 is biased in the original DGP while EMCS fails to replicate this bias: $b_1 = \tilde{b}_1 = \tilde{b}_2 = 0$ and $b_2 \neq 0$. This can correspond to a situation in which estimator 1 is correctly specified and has no bias, whereas estimator 2 is misspecified and subject to bias in the original DGP; EMCS, however, fails to capture the misspecification bias in estimator 2. Suppose $\sigma_1^2 > \sigma_2^2$ holds. The true MSEs are $\mathrm{MSE}(\hat{\theta}_1) = n^{-1}\sigma_1^2$ and $\mathrm{MSE}(\hat{\theta}_2) = b_2^2 + n^{-1}\sigma_2^2$. Since we assumed that EMCS correctly replicates the variance ordering of the estimators, EMCS selects $j = 2$ as the best estimator. This selection is indeed misleading if $b_2$ is far from zero: whenever $b_2^2 > n^{-1}(\sigma_1^2 - \sigma_2^2)$, $\hat{\theta}_1$ outperforms $\hat{\theta}_2$ in terms of MSE. This scenario highlights that EMCS-based selection of estimators can fail if any one of the estimators is misspecified and the simulation design in EMCS does not replicate the misspecification bias.

Are Scenarios 2 and 3 relevant in treatment effect estimation?
We next provide simple but empirically relevant examples where we focus on the estimation of treatment effects, and show that both types of EMCS may yield misleading choices of estimators for the reasons illustrated in Scenarios 2 and 3 above.
Data are given by a random sample of $n$ units, $\{(Y_i, D_i, X_i) : i = 1, \ldots, n\}$, where $Y_i \in \mathbb{R}$ is unit $i$'s observed outcome, $D_i \in \{0, 1\}$ is her treatment status, and $X_i \in \mathbb{R}^{d_x}$ is a vector of her pretreatment characteristics whose support is assumed to be bounded. We denote unit $i$'s potential outcomes by $(Y_i(0), Y_i(1))$, so that $Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0)$.

An example for Scenario 2
To keep our example as simple as possible, consider the following DGPs:
\[
Y_i(0) = \beta_0 + X_i'\beta + \varepsilon_i, \qquad Y_i(1) = \beta_0 + \beta_1 + X_i'\beta + c\,\varepsilon_i, \qquad e(X_i) \equiv \Pr(D_i = 1 \mid X_i) = X_i'\delta, \qquad (6)
\]
where $\varepsilon_i$ is mean zero with variance $\sigma^2$, independent of $(D_i, X_i)$. The specified mean equations for both potential outcomes imply that the conditional average treatment effects are homogeneous over observed characteristics and equal to $\beta_1$. The potential outcomes are heteroskedastic if $c \neq 1$. We assume a linear probability model for the propensity score in order to simplify analytical comparisons of the variances of the estimators we introduce below. Suppose that the parameter of interest is the population average treatment effect on the treated (ATT), $\theta_0 = E[Y(1) - Y(0) \mid D = 1]$. We consider two different estimators of the population ATT. The first estimator $\hat{\theta}_1$ is a semiparametric estimator of the ATT, which is consistent without assuming functional forms for the outcome and propensity score equations, and asymptotically attains the semiparametric efficiency bound (SEB) for ATT derived by Hahn (1998). Estimators with this property include the inverse probability weighting (IPW) estimator with nonparametrically estimated propensity scores (Hirano et al., 2003), the doubly robust estimators of Hahn (1998), covariate or propensity score matching estimators with a single covariate (Abadie & Imbens, 2006), and the covariate balancing estimators of Chen et al. (2008) and Graham et al. (2012, 2016). We can set any one of these estimators as our first estimator without affecting the analysis below. We specify the second estimator $\hat{\theta}_2$ as the ordinary least squares (OLS) estimator of $\beta_1$ in the following regression equation:
\[
Y_i = \beta_0 + \beta_1 D_i + X_i'\beta + u_i. \qquad (7)
\]
In other words, $\hat{\theta}_2 = \hat{\beta}_{1,\mathrm{OLS}}$. The specification of Equation (6) implies that $\hat{\theta}_2$ is unbiased and consistent for the population ATT, $\theta_0$. We consider a situation in which the finite-sample distribution of $(\hat{\theta}_1, \hat{\theta}_2)$ is well approximated by its large-sample normal approximation:
\[
\sqrt{n}\left((\hat{\theta}_1, \hat{\theta}_2)' - (\theta_0, \theta_0)'\right) \overset{d}{\to} \mathcal{N}(0, \Sigma),
\]
where $\sigma_1^2$ is the asymptotic variance of $\sqrt{n}(\hat{\theta}_1 - \theta_0)$, given by the SEB for ATT without knowledge of the propensity scores, and $\sigma_2^2$ is the asymptotic variance of $\sqrt{n}(\hat{\theta}_2 - \theta_0)$. Under the current specification, explicit expressions for these variances, Equations (8) and (9), are derived in Appendix A. When $Y(1)$ and $Y(0)$ share the same variance ($c = 1$), it can be shown that the OLS estimator is more efficient than the semiparametric estimator, $\sigma_2^2 < \sigma_1^2$, due to its exploitation of the correct functional form of the regression equation. In contrast, if the variance of the treated outcome is higher than the variance of the control outcome ($c > 1$), the simple OLS estimator, which does not take the heteroskedastic errors into account, can become less efficient than the semiparametric estimator. Specifically, we show in Appendix A that $\sigma_1^2 < \sigma_2^2$ once $c$ exceeds a threshold that depends on the distribution of the propensity scores; this is the condition in Equation (10). Hence, if the degree of heteroskedasticity satisfies the condition in Equation (10), the semiparametric estimator $\hat{\theta}_1$ is strictly preferred to the OLS estimator $\hat{\theta}_2$. Given that $c$ meets Equation (10), consider applying the placebo EMCS proposed by Huber et al. (2013). We assume that the two estimators are centered at zero and that their simulated distributions are well approximated by a bivariate Gaussian:
\[
(\hat{\theta}_1^*, \hat{\theta}_2^*)' \sim \mathcal{N}\left(0, n_0^{-1}\tilde{\Sigma}\right),
\]
where $n_0$ is the sample size of the control group in the original sample. Suppose also that the propensity scores used to generate the placebo treatments coincide with the true propensity scores in the original data. Since the placebo-treated group is generated from the original control group, it fails to replicate the variance of the treated outcomes in the original data.
As a result, the variances of $\sqrt{n_0}\,\hat{\theta}_1^*$ and $\sqrt{n_0}\,\hat{\theta}_2^*$ are given by the homoskedastic version ($c = 1$) of Equations (8) and (9), which implies the variance ordering
\[
\tilde{\sigma}_2^2 \le \tilde{\sigma}_1^2, \qquad (11)
\]
where the variances are evaluated under the probability $\widetilde{\Pr}$ and expectation $\widetilde{E}$ with respect to the sampling distribution specified in the placebo EMCS. This inequality is strict if $e(X) \mid D = 1$ is nondegenerate. EMCS therefore incorrectly selects the OLS estimator $\hat{\theta}_2$ as the preferred estimator. The underlying mechanism for why EMCS goes wrong is in line with Scenario 2 in the previous subsection: even in a rather ideal situation where EMCS replicates the unbiasedness of the estimators well, artificially creating a placebo-treated group from the control group in the original sample distorts the variance ordering of the estimators.
Exactly the same reasoning can also invalidate structured EMCS designs if the estimated DGP from which the data are to be simulated ignores or fails to replicate the underlying heteroskedasticity of the potential outcome distributions.
This problem can be seen in a simple simulation study. We draw 1,000 samples from a DGP of the form given by Equation (6), with 1,000 observations per sample.[7] For each sample we run 1,000 replications of the placebo and structured EMCS procedures, considering IPW and OLS as our two estimators. This gives us "the true MSE" for each estimator (based on the original samples) as well as 1,000 estimates of the MSE for each combination of an estimator and an EMCS design. Looking at a simple count of how many times each procedure selects the right estimator, we see that the placebo approach selects the superior estimator only 19 times (1.9% of the time) and the structured approach is little better at 30 times (3.0%). This compares with 97.6% and 100% for the placebo and structured procedures, respectively, when there is no heteroskedasticity. Of course, this is a single example, and in a very stylized context; in Section 4 we will see that the performance of these methods is also poor in a "real-world" example.
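The following self-contained Python sketch illustrates the mechanics of this exercise. It is not the exact DGP of our simulations (whose details are in the Supporting Information): it uses known propensity scores, sets $\alpha = 0$ in the placebo step, and chooses parameter values purely for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c, theta = 1_000, 4.0, 1.0   # sample size, treated error scale, true ATT

def draw_sample(n):
    """One sample in the spirit of Equation (6): common slopes, treated
    error scaled by c, linear propensity score e(X) = X."""
    X = rng.uniform(0.1, 0.9, size=n)
    D = (rng.uniform(size=n) < X).astype(int)
    Y = X + theta * D + (1.0 + (c - 1.0) * D) * rng.normal(size=n)
    return Y, D, X

def ols_att(Y, D, X):
    """Coefficient on D from OLS of Y on (1, D, X), as in Equation (7)."""
    Z = np.column_stack([np.ones_like(X), D, X])
    return np.linalg.lstsq(Z, Y, rcond=None)[0][1]

def ipw_att(Y, D, X):
    """IPW estimator of the ATT, using the known score e(X) = X."""
    w = (1 - D) * X / (1 - X)
    return Y[D == 1].mean() - (w * Y).sum() / w.sum()

# "True" MSEs under repeated sampling from the original DGP
draws = [draw_sample(n) for _ in range(1_000)]
true_mse = {f.__name__: np.mean([(f(*s) - theta) ** 2 for s in draws])
            for f in (ols_att, ipw_att)}

# Placebo EMCS on one sample: resample controls, assign placebo treatment
Y, D, X = draws[0]
Yc, Xc = Y[D == 0], X[D == 0]
emcs_est = {"ols_att": [], "ipw_att": []}
for _ in range(1_000):
    i = rng.integers(0, len(Yc), size=len(Yc))
    y, x = Yc[i], Xc[i]
    d = (rng.uniform(size=len(y)) < x).astype(int)  # lambda = 1, alpha = 0
    emcs_est["ols_att"].append(ols_att(y, d, x))    # true placebo effect: 0
    emcs_est["ipw_att"].append(ipw_att(y, d, x))
emcs_mse = {k: float(np.mean(np.square(v))) for k, v in emcs_est.items()}
print(true_mse, emcs_mse)  # compare which estimator each ranking prefers
```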

An example for Scenario 3
We shift our focus to Scenario 3 and now introduce a bias in one of the estimators in the original DGP. For this purpose, we maintain the two estimators as in the previous example, but alter the potential outcome equations of Equation (6) to have distinct slopes, $\beta_t \neq \beta_c$:
\[
Y_i(0) = \beta_0 + X_i'\beta_c + \varepsilon_i, \qquad Y_i(1) = \beta_0 + \beta_1 + X_i'\beta_t + c\,\varepsilon_i. \qquad (12)
\]
This causes the regression specification of Equation (7) to be misspecified, so that $\hat{\theta}_2$ is no longer consistent for the population ATT: $\mathrm{plim}_{n \to \infty}\,\hat{\theta}_2 = \theta_2 \neq \theta_0 = \beta_1 + E(X' \mid D = 1)(\beta_t - \beta_c)$. See, for example, Słoczyński (2018) for analytical characterizations of the bias. On the other hand, the semiparametric estimator $\hat{\theta}_1$ remains consistent and semiparametrically efficient (asymptotically attaining the SEB). Hence, assuming that the finite-sample distribution of $(\hat{\theta}_1, \hat{\theta}_2)$ is well approximated by its asymptotic normal approximation, we have, approximately,
\[
(\hat{\theta}_1, \hat{\theta}_2)' \sim \mathcal{N}\left((\theta_0, \theta_2)', n^{-1}\Sigma\right).
\]
As we argued in Scenario 3 above, the bias in $\hat{\theta}_2$ makes $\hat{\theta}_2$ inferior to the unbiased estimator $\hat{\theta}_1$, even when $\sigma_2^2 < \sigma_1^2$, if $b_2$ or the sample size is sufficiently large.
In the placebo EMCS procedure of Huber et al. (2013), the fact that the placebo-treated group is generated from the original control group removes the misspecification issue of the OLS estimator caused by the nonparallel potential outcome equations. Hence $\hat{\theta}_2^*$ behaves as a correctly specified OLS estimator with homoskedastic errors, and the simulated distribution of $\hat{\theta}_2^*$ fails to replicate the bias in $\hat{\theta}_2$. Since the variance ordering in EMCS obtained in Equation (11) is preserved in the current example, EMCS erroneously concludes that the OLS estimator $\hat{\theta}_2$ dominates the semiparametric estimator $\hat{\theta}_1$.
In the case of structured EMCS procedures, if the DGP from which Monte Carlo samples are drawn is estimated under misspecification, the structured EMCS misleads estimator selection for exactly the same reason. For example, if one were to construct the Monte Carlo DGP using linear regressions additive in $D_i$, structured EMCS will wrongly conclude that the OLS estimator $\hat{\theta}_2$ outperforms the semiparametric estimator $\hat{\theta}_1$.
Again we perform a simple simulation, analogous to the previous subsection but modifying the potential outcome equations as given by Equation (12). We perform 1,000 replications of each EMCS procedure using the same estimators, and then compare in how many cases the EMCS correctly selects the estimator with the lower MSE. Again the performance of EMCS is rather poor: placebo EMCS correctly selects IPW 2.3% of the time, and structured EMCS is correct only 0.2% of the time. See the Supporting Information Appendix for further details.

More than two estimators
Applications of EMCS often compare more than two estimators. The fragility of EMCS-based estimator selection highlighted in the two-estimator examples above naturally carries over to settings with more than two estimators, since a ranking over multiple estimators consists of transitive pairwise rankings of any two candidate estimators.
The Monte Carlo exercises and the empirical application below consider a setting with seven estimators in the context of program evaluation with observational data. Let $(\hat{\theta}_1, \ldots, \hat{\theta}_J)$ be the pool of $J$ candidate estimators, and let the purpose of EMCS be to obtain a complete ordering of these $J$ estimators according to the MSE criterion.
The EMCS-validity criteria introduced above, $\Pr(\hat{j}_0 = j_0)$ and $E[\mathrm{MSE}(\hat{\theta}_{\hat{j}_0}) - \mathrm{MSE}(\hat{\theta}_{j_0})]$, can be straightforwardly extended to the case with several estimators. In addition, to measure the similarity or dissimilarity between the true ranking and the rankings estimated in EMCS, it can be of interest to look at the distribution of Kendall's tau:
\[
\hat{\tau} = \frac{2}{J(J-1)} \sum_{j < j'} \mathrm{sign}\left[\left(r(j) - r(j')\right)\left(\hat{r}(j) - \hat{r}(j')\right)\right],
\]
where $r(j)$ and $\hat{r}(j)$, $j \in \{1, \ldots, J\}$, are the ranks of estimator $j$ with respect to the true MSE and the MSE estimated in EMCS, respectively. Noting that $\hat{\tau} \in [-1, 1]$ has a distribution under repeated sampling, its mean or other location parameters can summarize how well EMCS can assess the relative performance of the candidate estimators.
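A minimal illustration of these two summary measures, using made-up MSE vectors for $J = 7$ estimators (scipy's kendalltau computes the rank correlation directly):

```python
import numpy as np
from scipy.stats import kendalltau

true_mse = np.array([1.0, 1.3, 0.9, 2.1, 1.7, 1.1, 1.5])  # hypothetical
emcs_mse = np.array([1.2, 1.1, 1.0, 2.5, 1.4, 1.3, 1.6])  # hypothetical

tau, _ = kendalltau(true_mse, emcs_mse)     # rank correlation in [-1, 1]
j_hat = int(emcs_mse.argmin())              # estimator EMCS would select
regret = true_mse[j_hat] - true_mse.min()   # extra MSE from that choice
print(f"tau = {tau:.2f}, regret = {regret:.2f}")
```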

APPLICATION
To demonstrate the empirical relevance of the theoretical results discussed above, and to consider the extent to which they might be a problem in practice, we provide an application of EMCS procedures to a real-world data set. In these data we have an experimental estimate of the treatment effect. By (initially) treating the experimental estimate as the true treatment effect, the aim is to show whether (or not) EMCS procedures can accurately recover the ranking of estimators that we see from the experiment. We first discuss the data used, then our approach, the estimators, and finally the details of how the EMCS procedures were conducted.

Data and context
We focus on the data for men from LaLonde (1986), used also by Heckman and Hotz (1989), Dehejia and Wahba (1999, 2002), and Smith and Todd (2001, 2005).[8] A subset of these data comes from the National Supported Work (NSW) Demonstration, which was a work experience program that operated in the mid-1970s at 15 locations in the USA (for a detailed description of the program, see Smith & Todd, 2005). This program served several groups of disadvantaged workers, such as women with dependent children receiving welfare, former drug addicts, ex-convicts, and school dropouts. Unlike many social programs, the NSW implemented random assignment among eligible participants. This random selection allowed for straightforward evaluation of the program via a comparison of mean outcomes in the treatment and control groups.
In an influential paper, LaLonde (1986) used the design of this program to assess the performance of a large number of nonexperimental estimators of average treatment effects, many of which were based on the assumption of unconfoundedness. He set aside the original control group from the NSW data and created several alternative comparison groups using data from the Current Population Survey (CPS) and the Panel Study of Income Dynamics (PSID), two standard data sets on the US population. His key insight was that a "good" estimator should be able to closely replicate the experimental estimate of the effect of NSW using nonexperimental data. He found that very few of the estimates were close to this benchmark. This result motivated a large number of replications and follow-ups, and established a testbed for estimators of average treatment effects under unconfoundedness (see, e.g., Heckman & Hotz, 1989; Dehejia & Wahba, 1999, 2002; Smith & Todd, 2001, 2005; Abadie & Imbens, 2011; Diamond & Sekhon, 2013). Like many other papers, we use the largest of the six nonexperimental comparison groups constructed by LaLonde (1986), which he refers to as CPS-1.

Approach
In this paper we take the key insight of LaLonde (1986) one step further. We treat the NSW-CPS data from LaLonde (1986) as a finite population, with 185 treated observations and 7,660 comparison observations in our main example. This comes from taking the treated sample used by Dehejia and Wahba (1999) and a trimmed version of the CPS-1 data set, where the literature suggests conditional independence might hold using the available conditioning variables.[9] From this we draw 1,000 samples, each composed of 100 treated observations and 1,900 comparison observations. We then implement the estimators described below. For each sample and each estimator we compute the difference between the estimate and the "true effect" ($1,794), which comes from the experimental estimate of the impact of NSW on earnings. With 1,000 such differences for each estimator, we can compute the MSE and other performance measures for that estimator in these data. Then, on each of the 1,000 samples, we implement the two EMCS procedures described in Section 2, and compare their performances in terms of the criteria introduced in Section 3.
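In outline, the evaluation loop looks like the following sketch. The function names and the sampling-without-replacement choice are ours; the estimator implementations are described below and in Appendix B.

```python
import numpy as np

TRUE_EFFECT = 1_794.0  # experimental benchmark for the NSW effect on earnings

def true_performance(treated, comparison, estimators, n_samples=1_000, rng=None):
    """Repeatedly draw 100 treated and 1,900 comparison units from the
    finite 'population' and record each estimator's error against the
    experimental benchmark; summarize by absolute bias and MSE."""
    rng = np.random.default_rng(rng)
    errors = {name: [] for name in estimators}
    for _ in range(n_samples):
        t = treated[rng.choice(len(treated), size=100, replace=False)]
        c = comparison[rng.choice(len(comparison), size=1_900, replace=False)]
        for name, est in estimators.items():
            errors[name].append(est(t, c) - TRUE_EFFECT)
    return {name: {"abs_bias": abs(float(np.mean(e))),
                   "mse": float(np.mean(np.square(e)))}
            for name, e in errors.items()}
```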
One limitation of this approach is that the "true effect" we calculate is subject to sampling error. We therefore consider a second case, applying the insight of Smith and Todd (2005) that the control sample from the NSW can be compared to the same nonexperimental comparison group. The NSW control sample includes people who were selected in the same way as those actually treated, but who were randomized out of treatment. Now we know that the "true effect" is a precise zero, since the control sample did not actually receive treatment. Thus we have an original data set of 142 "treated" observations (who in reality received no treatment) and 7,467 comparison units. This comes from taking the "early random assignment" control sample from Smith and Todd (2005) and a version of the CPS-1 data set trimmed to overlap with these controls. Again, we draw samples by selecting 100 treated observations and 1,900 comparison observations from this population, with the true effect being precisely zero in each sample, and then perform EMCS on these samples.

[8] Recent work by Calónico and Smith (2017) highlighted the effects of the NSW program for women. Prior to this, women were largely ignored in the NSW literature subsequent to LaLonde (1986) because the analysis data file for women was not preserved.
[9] We use a logit model to predict the propensity to be in the experimental data (either as treatment or control) versus being in the CPS-1 data. We then drop all CPS-1 observations with propensity scores below the minimum or above the maximum in the experimental data. This is the trimmed CPS-1 data set, which we then combine with the NSW-treated observations from Dehejia and Wahba (1999).
Another possible worry might be that our example applies estimators that are suitable under unconfoundedness, that is, when potential outcomes are independent of treatment assignment conditional on observed covariates. Smith and Todd (2005) question this assumption in the context of the NSW-CPS data, and especially in the context of their "early random assignment" samples. To address this concern, we take a third approach. The basic idea is to construct a population similar to the NSW-CPS data where unconfoundedness holds by construction, and then draw samples from this. We begin with a trimmed version of the Dehejia and Wahba (1999) data set used in the first case. Next, we perform four-nearest-neighbor matching (with replacement) to impute the "missing" potential outcome for each observation. This is our new population, in which we have complete knowledge of both potential outcomes, as well as a propensity score for each observation estimated from the NSW-CPS data. We then draw random subsamples of 2,000 observations (covariates, potential outcomes, and propensity scores) from this artificially created population. For each observation we create a perturbed propensity score by adding a logistic error to the NSW-CPS estimated propensity score. We assign to treatment the individuals in the top quarter of the perturbed propensity score distribution (giving 500 treated and 1,500 nontreated in each sample). By construction, treatment is therefore ensured to be independent of potential outcomes in this subsample. The overlap assumption is also satisfied, since the inclusion of a logistic error ensures that no individual is guaranteed to be treated. The true value of the ATT in this sampling process is given by $\theta_0 = E[e(X)(Y(1) - Y(0))] / \Pr(D = 1)$, where $e(x) = \Pr(D = 1 \mid X = x)$ is the probability of being treated under the assignment rule based on the ranking of the perturbed propensity scores. By design we know $Y(1)$, $Y(0)$, and $\Pr(D = 1) = 0.25$, and we can approximate $e(x)$ by the empirical frequencies in the simulations. Finally, we implement EMCS on the samples drawn in this way.
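A sketch of this construction follows. Matching on covariates in the imputation step is our simplifying assumption (the matching variable is a design detail), as is the exact placement of the logistic noise; both are illustrative rather than a record of our implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def impute_potential_outcomes(X, Y, D):
    """Impute each unit's missing potential outcome by four-nearest-neighbor
    matching (with replacement) within the opposite treatment arm, yielding
    a population where both Y(0) and Y(1) are 'known'."""
    Y0, Y1 = Y.astype(float).copy(), Y.astype(float).copy()
    for d, Yd in ((0, Y1), (1, Y0)):        # units with D == d lack arm 1 - d
        donors = np.flatnonzero(D != d)
        nn = NearestNeighbors(n_neighbors=4).fit(X[donors])
        _, idx = nn.kneighbors(X[D == d])
        Yd[D == d] = Y[donors][idx].mean(axis=1)
    return Y0, Y1

def draw_unconfounded_sample(Y0, Y1, X, pscore, rng, n=2_000):
    """Subsample, perturb the estimated score with logistic noise, and treat
    the top quarter, so treatment is independent of (Y(0), Y(1))."""
    i = rng.choice(len(Y0), size=n, replace=False)
    noisy = pscore[i] + rng.logistic(size=n)              # perturbed score
    D = (noisy >= np.quantile(noisy, 0.75)).astype(int)   # 500 treated
    Y = D * Y1[i] + (1 - D) * Y0[i]
    return Y, D, X[i]
```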

Estimators
In all our simulations we study the impact of the NSW program on earnings in 1978. We consider seven nonexperimental estimators: linear regression, Oaxaca-Blinder, inverse probability weighting (IPW), doubly-robust regression, uniform kernel matching, nearest neighbor matching, and bias-adjusted nearest neighbor matching. For details see Appendix B.
In each case we focus on the average treatment effect on the treated (ATT), unless a given method does not allow for heterogeneity in effects (in which case we estimate the overall effect of treatment). As noted above, all of these estimators are based on the assumption of unconfoundedness.
We use a single set of control variables in all our simulations. Following Dehejia and Wahba (1999) and Smith and Todd (2005), we control for age, age squared, age cubed, education, education squared, whether a high school dropout, whether married, whether black, whether Hispanic, earnings in months 13-24 prior to randomization, earnings in 1975, nonemployment in months 13-24 prior to randomization, nonemployment in 1975, and the interaction of education and earnings in months 13-24 prior to randomization.

Procedures
In Section 2 we noted that for the placebo design we require choices of $\lambda$ and $\alpha$, where $\lambda$ determines the degree of covariate overlap between the "placebo-treated" and "placebo-control" observations and $\alpha$ determines the proportion of "placebo-treated." We choose $\alpha$ to ensure that the proportion of "placebo-treated" observations in each placebo EMCS replication is equal to the proportion of treated units in the sample.[10] We also follow Huber et al. (2013) in choosing $\lambda = 1$, as well as in using a logit model to estimate the propensity score.
The structured design requires more choices, in particular how we specify the joint probability distribution as the product of the marginal distribution for treatment status and a sequence of conditional distributions. As discussed in Section 2, we begin each structured EMCS replication by generating a fixed number of treated and nontreated observations to match the numbers in the sample. We then order the covariates, regress each covariate on the preceding covariates (using logistic regression for binary covariates), and use this to define the conditional distribution for that covariate. In EMCS replications the covariates are then drawn in the same order, from the appropriate conditional distribution. Full details of the procedure are provided in Appendix C.
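A minimal sketch of this sequential step follows (using statsmodels; the choice of Gaussian errors for continuous covariates is our simplification here, and Appendix C gives the actual distributions used):

```python
import numpy as np
import statsmodels.api as sm

def draw_covariates_sequentially(df, ordered_cols, binary_cols, n, rng):
    """Fit each covariate on all covariates preceding it in the ordering
    (logit for binary covariates, OLS otherwise), then draw a new sample
    of size n in the same order from the fitted conditionals."""
    out = np.empty((n, len(ordered_cols)))
    # first covariate: resample from its empirical marginal distribution
    out[:, 0] = rng.choice(df[ordered_cols[0]].to_numpy(), size=n)
    for k, col in enumerate(ordered_cols[1:], start=1):
        X_fit = sm.add_constant(df[ordered_cols[:k]].to_numpy())
        X_new = sm.add_constant(out[:, :k])
        if col in binary_cols:
            m = sm.Logit(df[col].to_numpy(), X_fit).fit(disp=0)
            out[:, k] = rng.uniform(size=n) < m.predict(X_new)
        else:
            m = sm.OLS(df[col].to_numpy(), X_fit).fit()
            # predicted mean plus Gaussian noise with the residual std. dev.
            out[:, k] = m.predict(X_new) + rng.normal(0, np.sqrt(m.scale), size=n)
    return out
```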

RESULTS
We now describe the results of our tests of the two EMCS procedures, placebo and structured, in the context of our real-world data. As described in Section 4.2, we perform three sets of tests. First, we apply the two procedures to the NSW treatment sample, combined with the CPS-1 comparison data set. We find the performance of the procedures to be poor when it comes to finding the estimator with the lowest bias. When we study MSE (i.e., account also for variance), performance is better. This is because the rankings of estimators are mainly driven by the variance, and both EMCS methods do well at replicating the variance components. Given this, however, we also test a simple bootstrap procedure and find that it is more effective at picking the best estimator. Then, we follow Smith and Todd (2005) in using the NSW controls as our "treated" sample instead: now the effect we intend to estimate must be zero for sure, removing worries that poor performance might be an artifact caused by sampling uncertainty around the true effect. We find that the previous results are maintained. Finally, we use an adjusted version of the original data, constructed so that conditional independence necessarily holds, to allay concerns that poor performance is driven by a context in which unconfoundedness may not hold. Again, we find that the EMCS procedures do not perform well on bias, and are better on MSE, although here the bootstrap does not clearly dominate.

Testing EMCS in the NSW data
Our first results using "real-world" data focus on the variant of the original NSW treatment sample constructed by Dehejia and Wahba (1999), combined with a trimmed version of the CPS-1 comparison data set. We create 1,000 samples from the original data set by sampling 100 treated and 1,900 nontreated observations from the 185 possible treated and 7,660 comparison units in the original data. We implement the two EMCS procedures 1,000 times on each of the 1,000 samples, giving a total of 1,000,000 replications for each EMCS procedure. In each replication we implement the seven estimators described earlier, and measure how well the two EMCS procedures help us assess the relative performance of the estimators. We might measure the performance of an estimator in terms of absolute bias or MSE (which also takes into account its variance). Performance of EMCS ("EMCS-validity") is then measured by how well the EMCS procedure replicates these features of an estimator in the original samples. In Section 3, we described two measures of EMCS performance suitable for when we have many estimators:

1. the average regret, that is, the average difference in absolute bias/MSE between the estimator selected by EMCS and the estimator with the actual minimum absolute bias/MSE; and
2. the average Kendall's tau (Kendall's rank correlation coefficient), which measures the similarity between the ranking of estimators suggested by EMCS and the "true" ranking from the original samples.
For ease of interpretation, it is also useful to normalize the values of average regret. Our discussion below focuses on the average regret as a percentage of the minimum value of absolute bias/MSE. However, we also consider an alternative normalization, where we divide the average regret for a given EMCS procedure by the average regret for random selection of estimators (which we discuss further below). Finally, we also consider an additional measure, which is straightforward to interpret, namely:

3. the average correlation in absolute bias/MSE (rather than in the rankings, as given by Kendall's tau).
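In the notation of Section 3, the first normalization reports
\[
R = 100 \times \frac{E\left[\mathrm{MSE}(\hat{\theta}_{\hat{j}_0})\right] - \mathrm{MSE}(\hat{\theta}_{j_0})}{\mathrm{MSE}(\hat{\theta}_{j_0})},
\]
and analogously with absolute bias in place of MSE; the alternative normalization replaces the denominator with the average regret of a uniformly random choice of estimator.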
In each case the comparison is between what the EMCS procedure suggests and the results from taking the "true effect" in the original data, and then calculating the absolute bias/MSE of each estimator across the 1,000 samples.
To provide a benchmark for the performance of EMCS, we also include results from two other procedures. In the first we simply apply a nonparametric bootstrap to the same samples used for the EMCS procedures.[11] We can then compare estimators on absolute bias, variance, or MSE, and also see how the resulting rankings compare to those from the original samples. Our estimates of absolute bias and MSE are centered around the point estimates in each original sample. In the second we do not create any samples, but simply rank estimators randomly. This provides a "worst-case" benchmark: suppose a researcher knows nothing at all about performance and just picks an estimator blindly; how would they do? Here we cannot compute a result for the correlation, but we can for average regret and Kendall's tau.

Notes to Table 1: "EMCS approach" denotes the way in which the empirical Monte Carlo samples were generated. "Placebo" and "Structured" generate samples using the placebo and structured approaches described in Section 2. "Bootstrap" generates nonparametric bootstrap samples by sampling with replacement the same number of observations as the original data. "Random" does not generate samples but instead randomly assigns rankings to the estimators (hence statistics are only available for the performance metrics based on rankings). The absolute bias, mean squared error, and variance are features of estimators. The "minimum" value for each feature is its lowest value among our estimators in the original data-generating process (i.e., we have one value of each feature for each estimator in the "original samples," and we report the lowest of these values). See Supporting Information Appendix F for more details. Four performance measures are used for each of these statistics. "Average regret" measures the average increase in the statistic from choosing the estimator actually selected by the EMCS approach rather than the estimator with the minimum value of this statistic, as a percentage of (i) that minimum value or (ii) the average regret for random selection of estimators. "Average Kendall's tau" measures the average correlation in the ranking of estimators provided by the EMCS approach relative to the ranking in the original samples. "Average correlation" measures the average correlation in the actual values of the statistic (rather than the ranking) provided by the EMCS approach relative to the values in the original samples. All averages are taken with respect to 1,000 original samples; for each sample, a separate simulation study was conducted. The results for random selection of estimators are analytical; instead of actually generating random rankings, we report the known values of expected Kendall's tau (zero) and expected regret with random rankings. The latter value is equal to the average regret across estimators, with an equal probability of each estimator being selected as "best."
Table 1 shows the results from these simulations; Supporting Information Appendix F provides further details. The first result is that the performance of both EMCS procedures in terms of bias is very poor. The average regret in terms of absolute bias, as a percentage of the absolute bias for the best estimator, is 3,067% (3,766%) for placebo (structured), an order of magnitude larger than the minimum value. It is worse than choosing completely randomly, which would be 1,184% worse than the best estimator, and worse than the bootstrap, at 1,000%. Looking at the ranking across estimators, the average Kendall's tau is −0.21 (−0.37) for placebo (structured). Thus the rankings produced by EMCS are, on average, negatively correlated with the ranking in the original samples. This is worse than random, which gives 0.00, and the bootstrap, at 0.02. The same pattern is seen in the average correlation coefficients for absolute bias, which are −0.44 (−0.51).
A researcher might be interested in the performance of estimators in terms of MSE rather than bias alone. Here EMCS performs much better. The average regret for placebo (structured) is now only 18% (16%), much better than random (142%). Similarly, average Kendall's tau is now 0.60 and 0.64 for placebo and structured, respectively, much better than 0.00 for random. The bottom panel of Table 1 shows that this is driven by the much better performance in replicating the variances. Since the rankings here are mostly determined by the variance, being able to reproduce variances substantially improves the measures of performance relative to the metrics based on absolute bias.
However, looking at our other benchmark case-the bootstrap-we see that it outperforms both EMCS methods in terms of MSE. Average regret is lower at 7.9%, and the average Kendall's tau is much higher, at 0.83. Given that MSE performance for EMCS is driven by the variance components, this does not seem surprising. The bootstrap is a simpler procedure than the two EMCS methods, and its ability to help us understand the variability of estimators is well known. It therefore seems like a potentially valuable path with fewer design choices than EMCS.
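As a sketch, our bootstrap benchmark amounts to the following: resample the treated and comparison groups separately and center the criteria at the original point estimate. The function name and the separate resampling of the two groups are our illustrative choices.

```python
import numpy as np

def bootstrap_criteria(treated, comparison, estimator, n_boot=1_000, rng=None):
    """Nonparametric bootstrap estimates of an estimator's absolute bias,
    variance, and MSE, centered at the original-sample point estimate."""
    rng = np.random.default_rng(rng)
    center = estimator(treated, comparison)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        t = treated[rng.integers(0, len(treated), size=len(treated))]
        c = comparison[rng.integers(0, len(comparison), size=len(comparison))]
        reps[b] = estimator(t, c)
    return {"abs_bias": abs(float(reps.mean() - center)),
            "variance": float(reps.var()),
            "mse": float(np.mean((reps - center) ** 2))}
```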

Removing sampling error from the "true effect"
The previous subsection calculated the MSE for each estimator by comparing the value of the estimate in each sample to a "true effect" measured using the experiment. One concern might be that the estimate from the experiment is subject to sampling error, and this might somehow negatively affect our performance measures for EMCS. To test this, we now use as our "treated" observations the "early random assignment" NSW control sample from Smith and Todd (2005). Since these individuals were selected for the program in the same way as those actually treated, but were then randomized out, the actual treatment effect for them is precisely zero. We therefore repeat the exercise on these data, again implementing the two EMCS procedures 1,000 times on each of the 1,000 original samples. Table 2 documents the results; Supporting Information Appendix F provides further details. Our conclusions are similar to those in the previous subsection. In terms of absolute bias, the average regret is much lower than previously, at 30% (42%) for placebo (structured). However, this is mostly driven by a large increase in the minimum value of absolute bias, since it is much more difficult to recover the true effect of NSW in these data (Smith & Todd, 2005). This can alternatively be seen by normalizing values by the average regret for random selection of estimators. In the first simulation study, the average regret for placebo (structured) is 2.6 (3.2) times larger than for random; in the second, it is 1.6 (2.2) times larger. While these values continue to be smaller in the second simulation study, their overall magnitudes are similar in both cases. This also makes it clear that EMCS is still worse than choosing at random (average regret of 19%) and the bootstrap (10%). As before, the average Kendall's tau is negative for placebo (structured), at −0.27 (−0.47), which is worse than random (0.00) and bootstrap (0.32) as well. On MSE, performance is better, with average regret of 23% (32%) and average Kendall's tau of 0.65 (0.55). These are much better than random (94% and 0.00), but worse than bootstrap (17% and 0.81).

Ensuring unconfoundedness holds
Another potential concern is whether the conditional independence assumption holds. Here we take the approach described in Section 4.2 to generate 1,000 samples in which conditional independence holds by construction. Then, we implement the two EMCS procedures 500 times on each of these samples. Table 3 displays the results. Supporting Information Appendix F provides further details.
The previous results are broadly maintained even after ensuring conditional independence. In terms of absolute bias the performance of both EMCS approaches is similar to random, though now slightly better than bootstrap. In terms of MSE both procedures perform better than random selection of estimators and also marginally better than bootstrap. Average regret in terms of MSE is worse than in the first case, though average Kendall's tau is a little higher, so it is also not obvious that contexts where conditional independence holds should necessarily see better performance of EMCS procedures.

DISCUSSION
Advances in econometrics have left the empirical researcher blessed with a wealth of possible treatment effect estimators from which to choose. They have not yet provided clear guidance on which of these estimators should be preferred in which context. In this paper we studied two proposals that suggest an approach to choosing an appropriate estimator for a given context. The first approach (placebo) suggests a way to introduce placebo treatments to some control observations in a data set, and studies how well estimators can pick up the true zero effect. The second approach (structured) creates data from a known DGP whose parameters are estimated from features of the original data, and studies how well estimators can pick up the implied true effect in the DGP. We showed theoretically that both approaches can only be guaranteed to work under rather restrictive conditions: specifically, when they can correctly reproduce both the biases and the ordering of the variances of estimators. We showed simple practical cases where one or other of these might fail, and gave an example of the consequences based on simulations from an artificial DGP. To provide a real-world example, we also implemented the EMCS procedures in the NSW-CPS data, where we know the "true effect" of the program. This allowed us to compute actual performance of the estimators in samples from the original data, and compare this to what EMCS would suggest if applied to these samples. We showed that in this example EMCS performs badly on ordering estimators in terms of absolute bias, and the estimator it suggests is often many times worse than the best (or even than selecting randomly). In this example both EMCS procedures perform much better in terms of MSE because reproducing the variance term turns out to drive the MSE in these data. But this leads the methods to be no better (and sometimes substantially worse) than a simple bootstrap procedure.
These results are unfortunate, but nevertheless important. There remains no silver bullet that can assist empirical researchers with the "right" or "best" estimator for a particular context. In the absence of a clear choice driven by research design, the best advice at this stage is likely to be implementing a number of estimators, and then considering the range of estimates provided, as Busso et al. (2014) also suggest.
One possible future alternative, recently proposed, is synth-validation (Schuler et al., 2017). This approach is related to cross-validation and is based on estimating "the estimation error of causal inference methods applied to a given data set." The authors provide simulations which suggest that this "lowers the expected estimation error relative to consistently using any single method." Further work is needed to test how general this approach is, and whether it can reliably guide researchers in selecting estimators.
ACKNOWLEDGMENTS

Wooldridge, Tiemen Woutersen, and numerous seminar and conference participants. We also thank Michael Lechner and Blaise Melly for providing us with their codes, as well as Steven Karel and Francesco Pontiggia for assistance with the Brandeis HPC cluster. This research was supported by a grant from the CERGE-EI Foundation under a program of the Global Development Network (Grant No. RRC12+09). All opinions expressed are those of the authors and have not been endorsed by CERGE-EI or the GDN. Advani also acknowledges support from Programme Evaluation for Policy Analysis, a node of the National Centre for Research Methods, supported by the ESRC. Kitagawa also acknowledges support from the ESRC through the ESRC Centre for Microdata Methods and Practice (cemmap) (Grant No. RES-589-28-0001) and from the ERC through an ERC starting grant (Grant No. EPP-715940). Słoczyński also acknowledges support from the Foundation for Polish Science (FNP) through a START scholarship and from the Theodore and Jane Norman Fund. No authors are aware of any conflict of interest.

OPEN RESEARCH BADGES
This article has earned an Open Data Badge for making publicly available the digitally-shareable data necessary to reproduce the reported results. The data is available at http://qed.econ.queensu.ca/jae/datasets/advani001/.
6. Nearest neighbor matching: matching on propensity scores, which are first estimated from a logit regression, with matching on the single nearest neighbor. We match with replacement; if there are ties, all of the tied observations are used.

7. Bias-adjusted nearest neighbor matching: as above, but correcting the bias as in Abadie and Imbens (2011), since nearest neighbor matching is not $\sqrt{n}$-consistent.

APPENDIX C: EMPIRICAL APPLICATION-STRUCTURED EMCS PROCEDURE
Here we detail precisely the procedure followed to implement the structured EMCS in our empirical application. As noted previously, we begin each structured EMCS replication by generating a fixed number of treated and nontreated observations to match the numbers in the sample. We then draw an employment status pair of u74 and u75 (nonemployment in months 13-24 prior to randomization and nonemployment in 1975), conditional on treatment status, to match the observed conditional joint probability. For individuals who are employed in only one period, an income is drawn from a log-normal distribution conditional on treatment and employment statuses, with mean and variance calibrated to the respective conditional moments in the data. Where individuals are employed in both periods, a joint log-normal distribution is used, again conditioning on treatment status. In all cases, whenever the income draw in a particular year lies outside the relevant support observed in the data, conditional on treatment status, the observation is replaced with the limit point of the empirical support, as also suggested by Busso et al. (2014). We model the joint distribution of the remaining control variables as a particular tree-structured conditional probability distribution, so that we can better fit the correlation structure in the data. The process for generating these covariates is as follows: