Using Monte Carlo experiments to select meta‐analytic estimators

The purpose of this study is to show how Monte Carlo analysis of meta-analytic estimators can be used to select estimators for specific research situations. Our analysis conducts 1620 individual experiments, where each experiment is defined by a unique combination of sample size, effect size, effect size heterogeneity, publication selection mechanism, and other research characteristics. We compare 11 estimators commonly used in medicine, psychology, and the social sciences. These are evaluated on the basis of bias, mean squared error (MSE), and coverage rates. For our experimental design, we reproduce simulation environments from four recent studies. We demonstrate that relative estimator performance differs across performance measures: an estimator that is especially good with respect to MSE may perform relatively poorly with respect to coverage rates. Estimator performance thus depends on a complex interaction between the performance indicator and the characteristics of the application. We also show that the size of the meta-analyst's sample and effect heterogeneity are important determinants of relative estimator performance. We use these results to demonstrate how these observable characteristics can guide the meta-analyst to choose the most appropriate estimator for their research circumstances.


| INTRODUCTION
Meta-analysts have an embarrassment of riches when it comes to choosing an estimator for measuring mean effects. The list of potential estimators is long and growing. Accordingly, a literature has arisen that attempts to provide guidance to those seeking a "best" estimator. The purpose of this study is not to produce yet another attempt at recommending estimators. Instead, this study lays out a procedure for how one can identify a best estimator for a given research application. While we provide an example of how such a procedure could work, the purpose of the example is to demonstrate the feasibility and practicality of our approach.
Selecting a best estimator for meta-analysis (MA) is complicated. "Best" depends on the meta-analyst's goals. Different meta-analysts can have different goals. Further, no estimator performs well in every situation. Yet relatively little is known about the circumstances which would cause a given estimator to perform better than others. In an ideal world, there would exist a flow-chart that would guide researchers toward the estimator that was best for their research application. Given the current state of knowledge, it is not even clear what factors should be included in such a flow-chart. This study attempts to make progress on this issue. In the remainder of this introduction, we elaborate on what this study does and the results that we obtain.
Our study performs the largest, most extensive Monte Carlo analysis of MA estimators to date. We conduct 1620 individual experiments, where each experiment is defined by a specific combination of sample size (ie, number of estimates in the meta-analyst's sample), effect size, effect size heterogeneity, publication selection mechanism, and other research characteristics. We compare 11 estimators commonly used in medicine, psychology, and the social sciences. We assess these estimators on the basis of bias, mean squared error (MSE), coverage rates, and Type I error rates.
Our Monte Carlo experiments reproduce experimental designs from four previous studies: Stanley et al 1 ; Alinaghi and Reed 2 ; Bom and Rachinger 3 ; and Carter et al. 4 We do this rather than design our own experiments for two reasons. First, Monte Carlo experiments are by definition artificial representations of a complex reality and involve a large number of subjective judgments; we therefore wanted to select designs that had, to some extent, been vetted by the peer review process. Second, we wanted to use multiple experimental designs to see if results would differ across simulation environments.
Our research produces four major findings. First, estimators that rank relatively high in terms of average performance on one criterion frequently do not perform as well on other criteria. From this we conclude that meta-analysts need to prioritize which criteria (Bias, MSE, etc.) are most important to them. Second, estimators that perform relatively well in one experimental design often do not perform as well in others. We identify two possible reasons for this difference across experimental designs, though more research needs to be done to better understand why this is so.
Third, we show that effect size heterogeneity and the number of estimates in the meta-analyst's sample ("sample size") are important determinants of estimator performance. Both these characteristics are observable to the meta-analyst. As such, they can serve as elements of a "flow-chart" that allows the meta-analyst to match experimental results to their own research situation, and thus guide them to the best estimator for the problem at hand.
Lastly, we give a specific example of how this would work. Our example assumes that the meta-analyst wants an estimator that minimizes MSE. The meta-analyst's sample consists of 100 estimates characterized by a high degree of effect heterogeneity. Further, they believe that the Monte Carlo design of Carter et al 4 most closely matches their own research situation. The meta-analyst gathers all the experimental results associated with the Carter et al 4 experimental design having sample sizes of 100 and a high degree of heterogeneity. They then compare MSE performance across all 11 estimators for this set of experiments and select a "best" estimator, which can then be used for their own research.
Our study proceeds as follows. Section 2 describes the estimators that we study. Section 3 highlights the main characteristics of the different simulation environments used for our analysis. Section 4 defines the performance measures. Section 5 presents our results. Section 6 gives an example of how Monte Carlo experimental results can be used to guide the selection of a "best" estimator for a given research situation. Section 7 concludes with a summary of our main results and suggestions for future research. All of the programming code and output files associated with this project are available at https://osf.io/pr4mb/. We note that our code borrows considerably from Carter et al. 5

Highlights

• Despite much previous research, meta-analysts do not have much guidance when it comes to selecting a "best" estimator
• This study shows how Monte Carlo experiments can be used to select the "best" estimator for a given research situation
• We compare 11 estimators commonly used in medicine, psychology, and the social sciences
• The estimators are evaluated on four performance measures: bias, mean squared error (MSE), coverage rates, and Type I error rates
• We conduct 1620 individual experiments, where an experiment is defined by a unique combination of sample size, effect size, effect size heterogeneity, publication selection mechanism, and other research characteristics
• Estimators that are relatively good on one performance measure may perform relatively poorly on another
• The size of the meta-analyst's sample and effect heterogeneity are important determinants of relative estimator performance
• We demonstrate how the observable characteristics of sample size and effect heterogeneity can guide the meta-analyst to select the most appropriate estimator for their research circumstances

| THE ESTIMATORS
As noted by Carter et al, 4(p117) while many studies have analyzed the performance of meta-analytic estimators, "… there is very little overlap among these studies in either the methods they have examined or the simulated conditions they have explored." Table 1 summarizes a selection of previous Monte Carlo studies and compares them in terms of the number of experiments and estimators studied. Our study analyses and compares the performance of 11 estimators. This compares favorably with previous studies both in terms of number of estimators and variety in the types of estimators. We chose our estimators because they either are widely used in the MA literature, or have recently appeared in prominent publications.

| The context
The estimators are best described within a research context. The following example focuses on a linear regression model, but is easily extended to analyses involving Cohen's d and log-odds/logistic regression. Suppose a researcher is interested in synthesizing the results of an empirical literature. The literature consists of studies that estimate the effect of X on Y using a linear regression model of Y on X, where i identifies a given regression having T_i observations. The true effect of X on Y in any given regression is given by β_i. β_i can differ across regressions for many reasons that are unobservable to the meta-analyst. The distribution of the population effect β_i across regressions is represented by β_i ~ N(μ, τ²), τ² ≥ 0. Let β̂_i be the estimated effect from regression i. The meta-analyst collects a sample of estimates, β̂_i, i = 1, 2, …, N, and wants to estimate μ, the population mean effect of X on Y. They know that publication selection may distort their sample of estimates. They have the following estimators available to them: Trim-and-Fill, p-curve, p-uniform, Random Effects, Three-Parameter and Four-Parameter Selection Models, Andrews and Kasy's 16 "symmetric selection" and "asymmetric selection" models, the Weighted Average of the Adequately Powered-WLS hybrid estimator, PET-PEESE, and Bom and Rachinger's 3 Endogenous Kink estimator. Each of these is briefly described below.

| Trim and Fill (TF)
Trim and Fill (Duval and Tweedie 17 ) is a method that assumes that any asymmetry in the distribution of effect sizes and SEs is due to publication selection. The method works by iteratively removing individual observations until symmetry in the distribution of effect sizes and SEs is achieved. The removed observations are then added back into the sample, along with artificially generated effect/SE observations that are the mirror images of the removed observations. This ensures that the reconstructed MA sample achieves symmetry. Our estimates are obtained using the metafor package in R.
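For concreteness, a minimal sketch of such a call (dat, yi, and vi are placeholders for the meta-analyst's data frame, effect sizes, and sampling variances):

library(metafor)
# Baseline random-effects fit, then trim-and-fill applied to that fit
re_fit <- rma(yi = yi, vi = vi, data = dat, method = "REML")
tf_fit <- trimfill(re_fit)   # trims asymmetric estimates, then "fills" with mirror images
summary(tf_fit)              # TF estimate of the mean effect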

| p-curve (pC)/p-uniform (pU)
The p-curve (Simonsohn et al 11 ) and p-uniform (van Assen et al 15 ) methods are conceptually identical and similar in implementation. Both estimate the mean true effect from the sample of MA estimates that are statistically significant; that is, have P-values less than 5%. Both assume that estimates with P-values less than .05 are equally likely to be published, and that the respective P-values are independently distributed. Both methods work from the starting point that the distribution of P-values (the "p-curve") will be uniformly distributed between 0 and .05 if the null hypothesis is true. Larger, positive effects produce a right skewness to the shape of the "p-curve." Neither is recommended in the presence of effect size heterogeneity (van Aert et al 14 ).
Conceptually, both methods estimate the value of the true (unobserved) effect that would produce a "p-curve" closest to the observed "p-curve." Both define a loss function that measures the distance between the (transformed) expected and observed p-curves and choose a mean true effect that minimizes that loss function. The two methods differ in the metric they use to measure distance. p-curve uses the Kolmogorov-Smirnov test statistic as a distance metric, while p-uniform's metric is based on the Irwin-Hall distribution. They also differ in that the p-curve estimator does not produce a SE. We follow standard practice and only include significant estimates that are same-signed (positive in our case). Our p-curve estimates are obtained from the programming code in Carter et al. 5 Our p-uniform estimates use method one in the puniform package in R.
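A hedged sketch of a p-uniform call with the puniform package (yi and vi are illustrative effect sizes and sampling variances; the estimation method shown is the package default, which may differ from the exact option we use):

library(puniform)
# p-uniform on the observed effects; side = "right" retains positive, significant estimates
pu_fit <- puniform(yi = dat$yi, vi = dat$vi, side = "right", method = "P")
pu_fit   # estimate of the mean true effect based on significant estimates only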

| Random effects (RE)
The random effects estimator is arguably the most commonly used meta-analytic estimator. It does not explicitly correct for publication selection other than giving greater weight to more precise estimates of β_i. It estimates the population mean effect μ assuming the following specification: β̂_i = μ + η_i + ε_i, where η_i ~ N(0, τ²) and ε_i ~ N(0, σ²_β̂i). σ²_β̂i is the variance in β̂_i due to sampling error, and τ² is the variance of true effects across studies. σ²_β̂i is estimated by SE²_i, where SE_i is the (estimated) SE of the estimated effect, β̂_i. A variety of procedures have been developed to estimate τ². Our RE estimates are obtained using the R package metafor, where τ̂² is calculated using the restricted maximum likelihood method.
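A minimal sketch of the corresponding metafor call (yi and vi again denote illustrative effect-size and sampling-variance columns):

library(metafor)
# Random-effects estimate of mu; tau^2 estimated by restricted maximum likelihood
re_fit <- rma(yi = yi, vi = vi, data = dat, method = "REML")
coef(re_fit)    # estimate of mu
re_fit$tau2     # REML estimate of tau^2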

| Three-parameter and four-parameter selection models (3PSM and 4PSM)
A variety of selection models have been proposed in the literature to correct for publication bias (Iyengar and Greenhouse 18 ; Vevea and Hedges 19 ; Vevea and Woods 20 ). A common model is the Three-Parameter Selection Model (3PSM). 3PSM assumes that standardized effect sizes (β̂_i/SE_i) are distributed normally in the population. Publication selection induces differential probabilities of being published, with publication probabilities following a step function. The general method allows researchers to set the values of the steps. For our 3PSM analysis, we follow Carter, Schönbrodt, Gervais, and Hilgard 4 in allocating estimates to two categories depending on whether the estimates are (a) correctly signed (positive) and statistically significant, (β̂_i/SE_i) ≥ 1.96; or (b) not correctly signed and significant, (β̂_i/SE_i) < 1.96. These have relative publication probabilities equal to 1 and p_1, respectively (see Panel A of Figure 1). The "three parameters" correspond to the mean true effect (μ), the extent of effect heterogeneity (τ²), and p_1.
We also consider a Four-Parameter Selection Model. Our 4PSM adds another category to the 3PSM model: positive and insignificant estimates. The respective categories then become (a) (β̂_i/SE_i) ≥ 1.96; (b) 0 ≤ (β̂_i/SE_i) < 1.96; and (c) (β̂_i/SE_i) < 0. 1 The associated relative publication probabilities are equal to 1, p_1, and p_2 (see Panel B of Figure 1); with μ, τ², p_1, and p_2 accounting for the four parameters. We use R's weightfunct package to estimate 3PSM and 4PSM. When the relative probabilities of being published are all equal to one (ie, no publication selection), these models collapse to the RE model.
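A hedged sketch of how these models can be fit with the weightfunct() function (distributed in the weightr package); the step locations are one-tailed P-value cutpoints reflecting our reading of the categories above:

library(weightr)
# 3PSM: one cutpoint at one-tailed p = .025 (positive and significant vs everything else)
fit_3psm <- weightfunct(effect = dat$yi, v = dat$vi, steps = c(0.025, 1))
# 4PSM: an additional cutpoint at p = .50 separates positive-insignificant from negative estimates
fit_4psm <- weightfunct(effect = dat$yi, v = dat$vi, steps = c(0.025, 0.50, 1))
fit_3psm
fit_4psm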

| AK1 and AK2
Similar to 3PSM and 4PSM are two new estimators from Andrews and Kasy. 16 Both model publication selection as a step function in which the relative probability that an estimate is published depends on its statistical significance. In AK1, the relative probability that a significant estimate is published does not depend on its sign (see Figure 2); Andrews and Kasy 16 call this the "symmetric selection" case. In AK2, the relative publication probabilities are allowed to differ for positive and negative significant estimates (see Figure 2); Andrews and Kasy 16 call this the "asymmetric selection" case. Because the P-values produced by AK1 and AK2 are based on t-statistics, they require four and six observations, respectively, in order to obtain P-values for all the parameter estimates. This can be a problem for meta-analyses with very small samples, as is common in medicine. We use the programming code that accompanies Andrews and Kasy 16 to obtain our AK1 and AK2 estimates. 2

| Weighted average of the adequately powered-weighted least squares hybrid estimator (WAAP)
The Weighted Average of the Adequately Powered-Weighted Least Squares hybrid estimator was introduced in Stanley et al. 1 Conceptually, this estimator chooses a subset of the N estimates β̂_i that are "adequately powered," defined as coming from regression equations having a power of at least 80%. Weighted Least Squares (weights = 1/SE²_i) is used to estimate Equation (2) in order to obtain an initial estimate of μ.
To determine whether a particular estimate comes from an "adequately powered" regression equation, the WAAP estimator determines a threshold value for the effect SE, δ = |μ̂|/2.8, where μ̂ is the WLS estimate of μ in Equation (2) based on the full sample of N estimated effects. Note that this initial estimate of μ does not correct for publication bias. WAAP then selects all the β̂_i's for which SE_i < δ. Let M ≤ N of the β̂_i's satisfy this criterion. It then uses WLS to re-estimate Equation (2) using only the M estimates (the "adequately powered" estimates) to obtain a revised estimate of μ. A problem can arise when there are too few effect estimates that are adequately powered. Following Stanley et al, 1 if there are fewer than two adequately powered effect estimates, the WAAP estimator uses the WLS estimate from the full sample of N estimated effects.
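For concreteness, a minimal sketch of this two-step logic (the |μ̂|/2.8 power threshold and the fall-back rule follow the description above; bhat and se are illustrative vectors of estimates and SEs):

# WAAP sketch: WLS on all estimates, then WLS on the "adequately powered" subset
waap <- function(bhat, se) {
  w       <- 1 / se^2
  mu_wls  <- sum(w * bhat) / sum(w)    # unrestricted WLS estimate of mu (no bias correction)
  delta   <- abs(mu_wls) / 2.8         # SE threshold implying roughly 80% power
  powered <- se < delta
  if (sum(powered) < 2) return(mu_wls) # fall back to WLS when fewer than two estimates qualify
  sum(w[powered] * bhat[powered]) / sum(w[powered])  # WAAP estimate of mu
}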

| PET-PEESE (PP)

PET-PEESE stands for Precision Effect Test and Precision Effect Estimate with SE (Stanley and Doucouliagos 22 ).
The PP estimator proceeds in two steps. The first step (PET) estimates a publication-corrected variant of Equation (2) using WLS, regressing β̂_i on a constant and SE_i with weights equal to 1/SE²_i (Equation 4a). It then tests whether μ = 0. If it fails to reject this hypothesis, PP takes μ̂ from Equation (4a) as its estimate of μ. If it rejects μ = 0, it estimates Equation (4b), which replaces SE_i with SE²_i (PEESE). The estimate of μ from Equation (4b) then becomes the updated PP estimate of μ. 3 Following Stanley's 12 recommendation, we use a one-tailed test when testing μ = 0. While endogeneity lies outside the purview of our study, it should be noted that PET-PEESE, unlike the other estimators, easily accommodates IV methods.
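A minimal sketch of the two-step procedure as described above (bhat and se are illustrative vectors; the one-tailed test of μ = 0 follows Stanley's 12 recommendation):

# PET-PEESE sketch
pet_peese <- function(bhat, se, alpha = 0.05) {
  w   <- 1 / se^2
  pet <- lm(bhat ~ se, weights = w)                 # PET: intercept is the corrected mean effect
  t_mu  <- coef(summary(pet))["(Intercept)", "t value"]
  p_one <- pt(t_mu, df = pet$df.residual, lower.tail = FALSE)  # one-tailed test of mu = 0
  if (p_one >= alpha) return(coef(pet)["(Intercept)"])         # fail to reject: keep PET estimate
  peese <- lm(bhat ~ I(se^2), weights = w)          # PEESE: SE replaced by SE^2
  coef(peese)["(Intercept)"]
}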

| Endogenous Kink (EK)
Bom and Rachinger 3 recently proposed a modification to the PET-PEESE model. The modification concerns a nonlinearity between the size of the bias due to publication selection and the SE. When μ is nonzero, there is essentially no publication selection bias when SE is very small, because all or virtually all estimates are statistically significant. As SE increases, the degree of publication selection increases. This induces a nonlinearity in the relationship between bias and SE. This nonlinearity is the reason why Stanley and Doucouliagos 22 propose including SE² in Equation (4b).
As an alternative, Bom and Rachinger 3 propose a kinked regression specification in which the publication-bias term enters only through I_{SE ≥ a}, a dummy variable that takes the value 1 whenever SE is larger than a cut-off point a. This induces a kink at SE = a. To determine a, Bom and Rachinger 3 follow a two-step procedure. They first estimate μ as if one were implementing the first stage of the PET-PEESE procedure.
Assuming the estimated effect is positive, they then calculate the lower bound of a 95% confidence interval around μ̂, where the SE is derived from a RE model (to accommodate effect heterogeneity), and use it to set the cut-off a. Below a, most estimates of μ are likely to be statistically significant and thus unaffected by publication selection. Beyond a, publication selection is likely to become an increasing problem, causing the bias to be linearly related to SE. To estimate the EK model, we use programming code provided by Bom and Rachinger.
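Taking the cut-off a as given, the following is a minimal sketch of our reading of the kinked specification, in which the publication-bias term enters only for SE_i ≥ a; in practice we use Bom and Rachinger's own code:

# Endogenous Kink sketch: bias term (SE - a) switched on only beyond the cut-off a
ek_fit <- function(bhat, se, a) {
  kink <- pmax(se - a, 0)               # equals (se - a) * I(se >= a)
  lm(bhat ~ kink, weights = 1 / se^2)   # intercept is the EK estimate of mu
}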

| THE SIMULATION ENVIRONMENTS
To assess the 11 estimators above, we reproduce the simulation designs from four recently published studies: Stanley et al, 1 Alinaghi and Reed, 2 Bom and Rachinger, 3 and Carter et al. 4 We chose to work with multiple simulation environments in light of Carter et al's 4(p117) assessment of previous research: "Different simulation studies have implemented bias differently, have drawn sample sizes from different distributions, and have varied widely in the value and form of the simulated true underlying effects. This lack of overlap is not surprising given that there is an effectively infinite number of possible combinations of different conditions to explore and no way of determining which conditions actually underlie real-world data. In other words, not only is there an inherent dimensionality problem in these simulation studies, but there is also no ground truth. These problems are often not discussed in reports of simulation studies, and indeed, many of the reports just cited - explicitly or implicitly - recommended the use of a single method, despite the fact that each study examined performance of only a handful of correction methods in only a limited subset of possible conditions." Working with multiple simulation environments allows us to determine the sensitivity of our results to alternative experimental designs.
Our choice of simulation environments was made to ensure that we covered scenarios of interest to multiple disciplines. Stanley et al 1 was published in Statistics in Medicine. Carter et al 4 was published in Advances in Methods and Practices in Psychological Science. Alinaghi and Reed 2 and Bom and Rachinger 3 were recently published in Research Synthesis Methods. Each of the simulation designs is briefly described below. More extensive discussions can be found in the original articles. While the simulation designs cover many different scenarios, they do not cover every scenario of potential relevance.

| Stanley, Doucouliagos, and Ioannidis
SD&I 1 consider two scenarios where researchers are interested in determining the effect of a given treatment, treat = {0, 1}. In the "Log Odds Ratio" scenario, primary studies track the effect of a treatment on a binary indicator of "success." Individual observations are simulated such that the probability of "success" (Y = 1) is 10% for the control group, and (10% + a fixed effect + a mean zero, random component) for the treatment group. Effect heterogeneity is regulated by the variance of the random component, σ²_h. Primary studies estimate a logistic regression to determine the effect of the treatment on Prob(Y = 1). The parameter of interest is the coefficient on treat, α_1. Each study produces a single estimated effect. Variation in the SE of the estimated effects across studies is generated by allowing the primary studies to have different numbers of observations. The mean effect of treatment across all studies, α_1, equals 0.0, 0.30, or 0.54, depending on the experiment. Sample sizes for the simulated meta-analyses vary across experiments and are pre-determined to consist of 5, 10, 20, 40, or 80 estimated effects. In the absence of publication selection, a regression of the estimated treatment effects on a constant should produce an unbiased estimate of α_1 in any given MA sample.
Publication selection consists of two regimes: no publication selection, or 50% publication selection. Under 50% publication selection, estimates are sequentially evaluated for inclusion in the meta-analyst's sample. Each estimate has a 50% chance of being "selected." If it avoids selection, the estimate is "published" without consideration of its sign and statistical significance. If selected, the estimate is "published" only if it is positive and significant. If not, new estimates are generated until a positive and significant estimate is found. This continues until the meta-analyst's sample attains its pre-determined size (see Panel A of Appendix 1 in Data S1). 4

In the second scenario, "Cohen's d," primary studies estimate the effect of a treatment, but this time the dependent variable is continuous. The difference in outcomes between the treatment and control groups is equal to a fixed effect, α_1, plus a random component that differs across studies. Effect heterogeneity is introduced through this random component, which is regulated by the parameter σ²_h. Each primary study calculates Cohen's d, which is the standardized difference in the mean outcome values across the two groups. The mean value of d across studies is set equal to either 0 or 0.5, depending on the experiment. Differences in the SEs of d are generated by allowing the simulated primary studies to have different sample sizes. In the absence of publication selection, a regression of the estimated treatment effects on a constant will produce an unbiased estimate of the population mean of d. Sample sizes for the simulated meta-analyses are pre-determined to consist of 5, 10, 20, 40, or 80 estimated effects, depending on the experiment. The Cohen's d experiments include the no publication selection and 50% publication selection scenarios used for the Log OR scenario, plus one more: 75/100% publication selection. Under 75/100% publication selection, positive and statistically significant estimates are selected with probability 75%, but 100% of the estimates are restricted to be positive (see Panel B of Appendix 1 in Data S1).
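For concreteness, a minimal sketch of the 50% publication selection regime described above (draw_estimate() is a placeholder for the primary-study data generating process; it returns an estimate and its SE):

# Build a meta-analysis sample of n_meta "published" estimates under 50% selection
select_sample <- function(n_meta, draw_estimate, crit = 1.96) {
  out <- vector("list", n_meta)
  for (k in seq_len(n_meta)) {
    d <- draw_estimate()
    if (runif(1) < 0.5) {                               # estimate is "selected" for screening
      while (!(d$est > 0 && d$est / d$se >= crit)) {    # redraw until positive and significant
        d <- draw_estimate()
      }
    }
    out[[k]] <- d
  }
  do.call(rbind, lapply(out, as.data.frame))
}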

| Alinaghi and Reed
A&R 2 study univariate regression models where a variable X affects a continuous variable Y. The parameter of interest is the coefficient on X. In the "Random Effects" data environment, each study produces one estimate and the population effect differs across studies. The coefficient on X equals a fixed component, α_1, plus a random component that is fixed within a study but varies across studies. The overall mean effect of X on Y is given by α_1. To estimate α_1, the meta-analyst regresses the study-specific estimates on a constant. In the absence of publication selection, the resulting estimate will be unbiased.
A distinctive feature of A&R's experiments is that they fix the size of the sample of estimated effects before publication selection, rather than after. The size of the meta-analyst's sample is thus determined endogenously, and is affected by the size of the effect. For example, very large population effects will be subject to relatively little publication selection as most estimates will satisfy the selection criteria, whether it be statistical significance or correct sign.
Another distinctive feature of A&R's experiments is that they separate statistical significance from the sign of the estimated effect as criteria for selection. Other studies commonly combine these two, assuming a mechanism that selects estimates that are both positive and statistically significant. A&R's experiments accommodate the fact that these two criteria have different, sometimes conflicting, consequences for estimator performance. All significant/correctly-signed estimates are "published," while insignificant/wrong-signed estimates only have a 10% chance of getting published.
A&R design their simulations to be representative of meta-analyses in economics and business. These typically have samples of several hundred estimates and substantial effect heterogeneity. In addition to the "Random Effects" data environment described above, A&R also construct a "Panel Random Effects" data environment, where each study has 10 estimates. This models the fact that the overwhelming share of meta-analyses in economics and business have multiple estimates per study. Effect estimates and SEs are simulated to be more similar within studies than across studies. Publication selection targets the study rather than individual estimates. To be included in the meta-analyst's sample, a study must have at least 7 out of the 10 estimates be significant/correctly signed. 5

| Bom and Rachinger
B&R 3 consider univariate regression environments where researchers are interested in estimating the effect of a variable X_1 on a dependent variable Y, represented by the parameter α_1. Variation in the SEs of estimated effects is accomplished by allowing sample sizes to differ across primary studies. Effect heterogeneity is introduced via an omitted variable (X_2) that is correlated with X_1. The coefficient on the omitted variable, α_2, is randomly distributed across studies with mean zero and variance σ²_h. Individual estimates of α_1 will be biased for nonzero values of α_2. In the population of all studies, the omitted variable bias averages out. However, publication selection induces a bias in the meta-analyst's sample when selection depends on the sign and significance of α̂_1.
The experiments are designed to produce 5, 10, 20, 40, or 80 "studies" for a given simulated MA, with each study consisting of one estimated effect. In the absence of publication selection, a regression of the estimated effects on a constant produces an unbiased estimate of α_1, where α_1 equals either 0 or 1 depending on the experiment. Publication selection consists of four regimes: no publication selection, 25%, 50%, and 75% publication selection. The publication selection algorithm is modeled after SD&I's 50% publication selection algorithm (see Panel A of Appendix 1 in Data S1).

| Carter, Schönbrodt, Gervais, and Hilgard
In the simulation environment of CSG&H 4 (for Carter, Schönbrodt, Gervais, and Hilgard), primary studies estimate the effect of a treatment using Cohen's d as their measure of effect. The difference in outcomes for the treatment and control groups is equal to a fixed effect, α_1, plus a random component that differs across studies. Effect heterogeneity is introduced through this random component, which is regulated by the parameter σ²_h. The mean value of d takes on four values depending on the experiment: 0, 0.2, 0.5, and 0.8. Differences in the SEs of d for a given experiment are generated by allowing the simulated primary studies to have different sample sizes.
CSG&H introduce two types of distortions in the research environment. They employ a publication selection algorithm in which the probability of estimates being "published" depends nonlinearly on both the sign of the estimated effect and its P-value. They construct three different publication selection regimes which they call "No Publication Bias," "Medium Publication Bias," and "Strong Publication Bias." These are obtained by altering the parameters of the publication selection algorithm. They also simulate four different types of "questionable research practices" (QRPs): (a) optional removal of outliers, (b) optional selection between two dependent variables, (c) optional use of moderators, and (d) optional stopping. Finally, CSG&H also construct experiments in which the simulated MA samples take on four different sizes: 10, 30, 60, and 100.

Table 2 reports the number of experiments for each of the four simulation environments, categorized by number of estimates included in the meta-analyst's sample ("Sample Size") and the extent of measured effect heterogeneity ("I²"). We calculate I² as I² = 100% × τ̂²/(τ̂² + s²), where s² = (N − 1)Σw_i/[(Σw_i)² − Σw_i²], τ̂² is the estimate of effect heterogeneity using the restricted maximum likelihood method, w_i = 1/SE²_i, and N is the number of estimates in the meta-analyst's sample. I² takes values between 0 and 100%. I² is often interpreted as a measure of the share of effect size variance that is due to heterogeneity in true effects in the population. However, Augusteijn et al 24 demonstrate that it is affected by publication selection. The effect of publication selection can be large, and can either increase or decrease the value of I². Our simulations calculate I² post-publication selection. Whether that vitiates the usefulness of I² in the selection of estimators is an empirical question.
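For concreteness, a small sketch of this I² calculation (yi and vi denote illustrative effect-size and sampling-variance columns; τ̂² is taken from metafor's REML fit):

library(metafor)
# I^2 from the REML estimate of tau^2 and the inverse-variance weights
re_fit <- rma(yi = yi, vi = vi, data = dat, method = "REML")
w  <- 1 / dat$vi
s2 <- (length(w) - 1) * sum(w) / (sum(w)^2 - sum(w^2))  # "typical" sampling variance
I2 <- 100 * re_fit$tau2 / (re_fit$tau2 + s2)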
In order to induce greater overlap in the simulation environments, we added simulations to the SD&I, 1 B&R, 3 and CSG&H 4 experimental designs that allow for larger sample sizes. These are yellow-highlighted in the table. This resulted in a total of 1620 experiments, where an experiment is defined as a unique set of parameters determining (a) effect size, (b) effect heterogeneity, (c) publication selection, (d) sample size, and (e) (for CSG&H 4 ) questionable research practices. This compares favorably with previous studies (see Table 1). Details about the experiments are reported in Appendix 2 in Data S1.

| THE PERFORMANCE MEASURES
We evaluate estimators on three performance measures: (a) Bias, (b) Mean Squared Error (MSE), and (c) 95% Coverage Rates. With respect to bias, the average bias for any given experiment k is calculated by Bias_k = (1/R_k) Σ_{r=1}^{R_k} (μ̂_r − μ), where μ̂_r is the estimate produced in iteration r, μ is the true mean effect, and R_k is the total number of iterations for that experiment (typically 3000). Note that Bias_k can be positive or negative. When aggregating over experiments to obtain a summary measure of performance, we calculate the average of absolute values, |Bias| = (1/R) Σ_{k=1}^{R} |Bias_k|, where R is the total number of experiments included in the evaluation. "Best estimator" with respect to bias is defined as the estimator with the smallest value of average |Bias|. MSE for a given experiment k is calculated by MSE_k = (1/R_k) Σ_{r=1}^{R_k} (μ̂_r − μ)². When used as a summary measure of performance, it is calculated by MSE = (1/R) Σ_{k=1}^{R} MSE_k. "Best estimator" with respect to MSE is defined as the estimator with the smallest value of MSE.
|Coverage-0.95| calculates the absolute value of the difference between the coverage rate (the percent of times the 95% confidence interval covers the true effect) and 95%. For example, if the coverage rate in one simulation was 97%, and in another it was 93%, the mean absolute value of the difference would be 2%. "Best estimator" with respect to |Coverage-0.95| is the estimator that produces values closest to 0.
The previous performance measures apply to all 1620 experiments. The last performance measure, Type I Error Rate, only applies to experiments where the true value of the mean effect equals 0. It measures how often the estimator finds a statistically significant effect when in fact there is no effect. Good performance on this criterion is represented by values close to 5%.
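A minimal sketch of how these measures can be computed from the iteration-level output of a single experiment (est and se are illustrative vectors of per-iteration estimates and SEs, mu is the true mean effect, and a normal-approximation confidence interval is used for illustration):

# Performance measures for one experiment
performance <- function(est, se, mu, crit = 1.96) {
  bias    <- mean(est - mu)                                    # Bias_k (can be + or -)
  mse     <- mean((est - mu)^2)                                # MSE_k
  covered <- mean(est - crit * se <= mu & mu <= est + crit * se)
  cov_dev <- abs(covered - 0.95)                               # |Coverage - 0.95|
  type1   <- if (mu == 0) mean(abs(est / se) >= crit) else NA  # Type I error rate (mu = 0 only)
  c(bias = bias, mse = mse, cov_dev = cov_dev, type1 = type1)
}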

| A caveat about using average performance to assess estimators
A number of the estimators (3PSM, 4PSM, AK1, AK2, pC, pU, and TF) use maximization procedures to produce their estimates. In some cases, the algorithms do not converge and no estimate is produced. This can cause comparisons of average performance to be misleading. Consider the extreme case where all estimators perform well in simulation environment A but poorly in simulation environment B, but one of the estimators always fails to converge in B. That estimator will appear to be superior based on its average performance. Its superior average performance would reflect differences in experiments, and not differences in actual performance. As an aggregate measure, average performance is suggestive, but conclusions regarding performance should always be based on an inspection of performance at the level of individual experiments. 6 Section 6 provides a demonstration of this approach.

| RESULTS

Table 3 ranks average performance of the estimators for all 1620 experiments. 7 Results are separated by performance measure (Bias, MSE, etc.). The purpose of this table is not to demonstrate overall superiority for any given estimator. In addition to the problem of convergence rates discussed above, there is too much heterogeneity in these average numbers for them to be very useful. The main purpose of this table is to note that estimators that dominate on one criterion may perform relatively poorly on another.

| Relative performance differs across criteria
For example, on the dimension of bias, Bom and Rachinger's 3 Endogenous Kink (EK) estimator produces the lowest overall mean absolute bias ("|Bias|"). However, it is dominated by Stanley, Doucouliagos, and Ioannidis's 1 WAAP estimator when it comes to mean squared error ("MSE"), and by Andrews and Kasy's 16 "asymmetric selection" estimator (AK2) with respect to |Coverage-0.95| and Type I error rates. The table color-codes the three estimators with the best average performance on the respective criteria to facilitate comparison. It highlights that superior performance on one dimension does not guarantee superior performance on another. As a result, when choosing an estimator, meta-analysts should prioritize which performance measure is of most importance. Table 3 also highlights the poor performance of all the estimators when it comes to coverage and Type I error rates. While AK2 may be the "best" performing estimator on these dimensions, its mean absolute deviation between the coverage rate and 95% is 17%, and its mean Type I error rate is 11%, compared to an expectation of 5%. The other estimators perform much worse. Most of the estimators have coverage rates below 70% and Type I error rates larger than 25%. This should cause researchers to question the reliability of any hypothesis testing about effect sizes that is performed in meta-analyses that use these estimators.

| Relative performance differs across simulation environments
We next focus on the sensitivity of results to the simulation environment. Table 4 collects the results from all 1620 experiments and breaks them out according to each of the four simulation environments and each of the four performance criteria. In each panel, we are looking for consistency in relative performance across simulation environments.
In Panel A, which reports performance with respect to bias, simulations for three of the four simulation environments show that the AK2 estimator has the best average performance, as measured by smallest mean absolute bias. However, in the CSG&H simulations, the AK2 estimator ranks 9th of 11. The 3PSM estimator ranks second in the SD&I simulations with respect to mean absolute bias, but 8th in A&R's, 5th in B&R's, and 5th in CSG&H's simulations. These inconsistencies are not unusual. Panel B ranks average performance of the estimators with respect to MSE. The AK2 estimator ranks 1st in the SD&I and A&R simulations, but 6th in B&R's simulations, and 9th in CSG&H's. Other inconsistencies across simulation environments are easily found in Panels C and D.

Table 1 demonstrated that it was difficult to draw guidance from previous studies about which estimator to use because there was little overlap in the estimators being compared. Table 4 identifies a more critical issue. Even when the same estimators are being compared, one can obtain different results depending on which simulation design is used. This raises the question: what factors are responsible for these differences? A full treatment of the question lies beyond the scope of this study. However, we undertake an initial effort at answering it by focusing on two features of the simulation designs: the number of estimates in the simulated MA samples ("sample size"), and the extent of effect size heterogeneity, as measured by I². Table 2 highlights that the different simulation environments differ on these dimensions. If these two features systematically affect estimator performance, then differences in the combinations of sample size and effect heterogeneity would provide at least a partial explanation for the differences in average performance across simulation environments.

| The influence of sample size and effect heterogeneity on relative estimator performance
It is well-known that estimator performance generally declines as effect heterogeneity increases and improves as sample size increases (eg, Stanley 12 ). Less well-known is that relative estimator performance is also affected by these factors. In this section we demonstrate both phenomena. We use the results from the CSG&H simulations to estimate the following regressions for each of the 11 estimators (j): experiment-level Bias and MSE are each regressed on the experiment's sample size and I². Regressions were estimated using OLS with bootstrapped t-statistics to obtain P-values. Each regression used the Bias/MSE results for a given estimator j. The respective samples were constructed from the individual results of the 756 experiments in the CSG&H simulations (see Panel D of Table 2).

Table 5 presents the results. They provide strong evidence that Bias and MSE increase as effect heterogeneity (I²) increases. With only one exception, the coefficient on the I² term is positive and significant in both the Bias and MSE regressions for each of the 11 estimators. The exception is the coefficient for I² in the MSE regression for the p-curve estimator (pC). Sample size is also strongly associated with MSE performance. Sample size is negatively and significantly associated with MSE for each of the 11 estimators. The evidence for sample size affecting bias is not as strong. Still, 9 of the 11 estimated coefficients are negative, with 5 of 11 negative and significant at the 5% level.
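A minimal sketch of the kind of regression estimated for each estimator j (the data frame res and its column names are illustrative; the bootstrapped t-statistics we use to obtain P-values are omitted for brevity):

# Experiment-level Bias and MSE regressed on sample size and I^2 for one estimator
res_j    <- subset(res, estimator == "AK1")
fit_bias <- lm(bias ~ sample_size + I2, data = res_j)
fit_mse  <- lm(mse  ~ sample_size + I2, data = res_j)
summary(fit_bias)
summary(fit_mse)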
While Table 5 documents changes in absolute estimator performance, Table 6 presents evidence of changes in relative performance. Once again we use the CSG&H simulation results and focus on Bias and MSE. We divide the 756 CSG&H experiments into 21 separate cells depending on sample size (10, 30, 60, 100, 200, 400, 800) and effect heterogeneity (I² ≤ 0.25, 0.25 < I² ≤ 0.75, 0.75 < I²). Panel D of Table 2 reports the number of experiments for each sample size/I² cell.
For both Bias and MSE, we identify the top two estimators in the cell for smallest sample size (10) and effect heterogeneity (low I²). For Bias, these are the AK1 and 4PSM estimators. For MSE, they are AK1 and 3PSM. We then track the relative position of these estimators as sample size and effect heterogeneity increase. The respective estimators are color-coded to facilitate tracking across cells. Table 6 clearly reveals that there is substantial movement in the relative rankings of average estimator performance as sample size and effect heterogeneity change. For the sake of brevity, we only report results for Bias and MSE. 8 In some cases, the change in relative ranking is dramatic. When sample size = 10, the 4PSM estimator ranks 2nd and 1st on Bias, respectively, for low and moderate effect heterogeneity. It falls to 9th when effect heterogeneity is high. In other cases, relative performance is relatively stable: across all sample sizes, AK1 is either ranked 1st or 2nd in terms of smallest average MSE.
The table demonstrates two things. It underscores a point made previously that no estimator dominates in all research settings. However, it also suggests that there may be circumstances where one estimator is generally preferred. For example, if a researcher is interested in estimator efficiency and works in an area where effect heterogeneity is expected to be high, and if the researcher is convinced that the CSG&H simulation environment captures the key elements of their research situation, then Table 6 suggests that AK1 may be the best estimator for their analysis. However, the Table 6 results are based on average performance within a given {sample size, effect heterogeneity} cell. As demonstrated previously, averages can conceal much variation. The next section illustrates how further investigation can lead to a more definitive conclusion regarding "best" estimator.

| AN EXAMPLE OF HOW SIMULATION EXPERIMENTS CAN GUIDE THE SELECTION OF A "BEST" ESTIMATOR
Previous sections demonstrated that there is no superior estimator for all research situations. "Best" is conditional on performance measure, and depends on observable characteristics of the meta-analyst's sample such as sample size and effect heterogeneity. It can also depend on unobservable characteristics such as the type of publication selection (statistical significance, correct sign, both), the extent of publication selection, and other factors such as assorted questionable research practices (QRPs). By conditioning on observables and investigating performance over unobservables, one can study the relative performance of estimators and use the results to guide estimator selection for use in a given research situation. This section demonstrates how this can be done.

Suppose a meta-analyst is studying the empirical literature on a given "effect," measured by Cohen's d. They collect a sample of 100 estimates. Initial analysis indicates a high degree of effect heterogeneity (I² > 0.75). While they are unsure whether publication selection is a problem, if it does exist, they believe selection would depend on both correct sign and statistical significance. Looking over the alternatives, it is their experienced judgment that the CSG&H simulation environment best captures the salient aspects of their research situation. However, they do not have strong priors about the size of the effect, the severity of publication selection, or the extent of QRPs. While they would like to have an estimator that minimized bias, produced accurate coverage rates, and provided reliable tests of significance, their main priority is choosing an estimator that is efficient. We show how simulation results can be used to guide that selection.

Table 7 reports the individual experimental results for sample size = 100/high I². There are a total of 30 experimental results, corresponding to combinations of effect sizes {0, 0.2, 0.5, 0.8}, severities of publication selection {No, Medium, Strong}, and QRP behaviors {None, Medium, High} (see Appendix 2 in Data S1). We suppose the meta-analyst is interested in not just average MSE performance, but also the variation of MSE across situations. Since they do not know which of the respective experiments best represents their research situation, they want to avoid an estimator that occasionally produces a bad result, even if it does well on average. The top part of the table reports the individual MSE experimental results. We yellow-highlight the minimum MSE value in each experiment. Of the 11 estimators, all but two of them (WAAP and PP) are "best" in at least one experiment. This again highlights the fact that no estimator is best in all research situations. To assist in processing the large amount of information in the table, we report average performance for the 30 experiments, along with minimum and maximum MSE values, at the bottom of the table.
Given that the researcher does not know which simulated situation best represents their actual research situation, they first consider the estimator with the lowest overall average MSE. That is the AK1 estimator. It has an overall average value of 0.020. The next best estimator is the WAAP, with an overall average of 0.027. AK1 also takes on a relatively narrow range of values across the 30 experiments. Its minimum value is 0.001, and its maximum value is 0.081. This compares favorably with most of the other estimators, but not all. For example, Bom & Rachinger's EK estimator, while producing a slightly larger overall average value of MSE (0.028), takes on a narrower set of values (minimum = 0.008, maximum = 0.052). The WAAP and PP estimators have similar characteristics.
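As an illustration of how such a comparison can be assembled, the following sketch summarizes MSE across the relevant experiments, assuming a results file with illustrative column names (design, sample_size, I2_band, estimator, mse):

# Average, minimum, and maximum MSE by estimator for the relevant set of experiments
relevant <- subset(results, design == "CSGH" & sample_size == 100 & I2_band == "high")
summ <- aggregate(mse ~ estimator, data = relevant,
                  FUN = function(x) c(mean = mean(x), min = min(x), max = max(x)))
summ[order(summ$mse[, "mean"]), ]   # rank estimators by average MSE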
With respect to AK1, it is worth noting that simulations will tend to be biased toward selection models, because selection models have been designed to capture the very kinds of behaviors built into selection algorithms. This is not necessarily a bad thing. However, to the extent that actual publication selection behavior differs from simulated selection behavior, results may overstate the performance of selection models in real world datasets.
The researcher's choice comes down to a trade-off between mean and dispersion, a choice that is complicated by the fact that randomness in the simulation process cautions against attaching too much significance to small numerical differences. One possible resolution is for the researcher to choose the AK1 estimator as best (yellow-highlighted), while also choosing one or two other estimators (WAAP, PP, EK; highlighted in blue) for robustness checking.
A final contribution of our study is that we have made all of our experimental results publicly accessible via a ShinyApp at https://hong-reed.shinyapps.io/HongReedInteractiveTables. Table 7 presented the results of 30 Monte Carlo experiments from the Carter, Schönbrodt, Gervais, and Hilgard 4 simulation environment for sample sizes of 100 and high effect heterogeneity. The online results allow researchers to explore other scenarios that may be more relevant for their particular research situations.