Accuracy and precision of fixed and random effects in meta‐analyses of randomized control trials for continuous outcomes

Meta‐analyses of treatment effects in randomized control trials are often faced with the problem of missing information required to calculate effect sizes and their sampling variances. In particular, correlations between pre‐ and posttest scores are frequently not available. As an ad‐hoc solution, researchers impute a constant value for the missing correlation. As an alternative, we propose adopting a multivariate meta‐regression approach that models independent group effect sizes and accounts for the dependency structure using robust variance estimation or three‐level modeling. A comprehensive simulation study mimicking realistic conditions of meta‐analyses in clinical and educational psychology suggested that imputing a fixed correlation of 0.8 or adopting a multivariate meta‐regression with robust variance estimation works well for estimating the pooled effect but leads to slightly distorted between‐study heterogeneity estimates. In contrast, three‐level meta‐regressions resulted in largely unbiased fixed effects but more inconsistent prediction intervals. Based on these results, recommendations for meta‐analytic practice and future meta‐analytic developments are provided.

What is already known
• Randomized control trials (RCT) estimate treatment effects by comparing the change between pre- and posttest in an intervention group to the change in a control group.
• For the calculation of RCT effects, a constant value is often imputed for missing pre-post correlations.

What is new
• Meta-analyses with imputed pre-post correlations and multivariate approaches that allow pooling RCT effects with missing pre-post correlations result in largely unbiased point and interval estimates of fixed effects, although three-level meta-analyses exhibit a slight overcoverage of the confidence intervals and substantial undercoverage of the prediction intervals.
• As compared to univariate meta-analyses of posttest effect sizes, meta-analyses of RCT effects were not affected by posttest variance heterogeneity or attrition bias.

Potential impact
• The choice of the meta-analytic model has a negligible impact on the pooled effects in RCTs when pre-post correlations are missing.
• For practical applications, it is recommended to conduct univariate meta-analyses that impute a fixed value of 0.8 for the missing correlation or, alternatively, to adopt a multivariate meta-regression model with robust variance estimation.

| INTRODUCTION
Randomized control trials (RCT) are often considered the gold standard to infer scientific evidence from empirical data, 1,2 thus informing decisions in health care and education policy. The focus of RCTs is on treatment or intervention effects that compare the difference in the average change of an outcome between two measurement occasions (pretest vs. posttest) for two randomly assembled groups (treatment vs. control group). Treatment effects are inferred if the average change in the treatment group that has received the intervention of interest (e.g., a novel therapy) between the two measurements is significantly larger (or smaller) as compared to the average change in the control group that received no intervention (e.g., a placebo) or an alternative intervention (e.g., an established therapy). Properly designed RCTs strengthen causal attributions of observed changes to intervention effects because they can account for three potential sources of bias, that is, time effects, selection effects, and time-selection interaction effects. 2,3

Thus, RCTs allow controlling for natural changes taking place between a pretest and posttest (e.g., maturation, fatigue) that are not caused by the intervention. If time effects are ignored, natural changes might be erroneously interpreted as treatment effects, despite the treatment having no relevant effect on the outcome. Simpler designs such as pre-post designs without a control group typically cannot control for these time effects. Selection effects can occur if non-random groups are compared because treatment and control groups exhibit important differences in the outcome at the pretest. If selection effects are ignored, posttest differences between groups might be erroneously interpreted as treatment effects although they merely reflect preexisting differences between groups. In RCTs, effective randomization to treatment and control groups typically controls for known and unknown confounders and, thus, ensures that groups at the pretest are comparable. Still, differential attrition between pre- and posttest can lead to nonequivalent groups at posttest, simply because response rates depend differentially on the measured outcome for the two groups. Simpler designs such as posttest comparisons between treatment and control groups that do not acknowledge pretest information often cannot control for the effects of these time-selection interactions. Finally, in within-subjects designs such as RCTs, each individual can be considered their own control, which increases the power of statistical tests and the precision of inferences. 3 Therefore, RCTs are often considered the best practice for studying causal relationships in prevention and intervention research. 1,2

Both in clinical and educational research, meta-analyses of RCTs are often considered the most reliable evidence for intervention efficacy, particularly in areas with a limited number of participants per trial or conflicting evidence. Consequently, these meta-analyses not only receive a lot of attention from the scientific community but are also used by stakeholders that base their decisions on scientific evidence. A prominent example is a recent discussion on the efficacy and safety of umifenovir for the treatment of the coronavirus disease, which had initially been advocated as an effective treatment but turned out ineffective in a quantitative meta-analysis of the available RCTs. 4 Although combining the raw data of multiple studies in individual participant meta-analyses is preferable from a methodological point of view, 5 most psychological studies do not provide the respective raw data. 6,7 Particularly, in clinical research, legal restrictions or ethical considerations often prevent sharing the raw data (see Reference [8] for a potential remedy). Therefore, meta-analyses of summary statistics are the only viable solution in many situations.
Because reporting practices in psychology and other behavioral sciences often do not follow prevalent recommendations, 9 necessary information to adequately aggregate meta-analytic results is often missing. The current manuscript evaluates different strategies for meta-analyses of RCTs with a focus on situations when information to calculate the sampling variances of the effect sizes is missing. To this end, we propose to respecify the traditional univariate meta-analysis as a multivariate model that acknowledges dependent effects using robust variance estimation 10,11 or a three-level meta-analysis. 12,13 We present a comprehensive simulation study that contrasts these approaches under different realistic conditions to derive recommendations for future meta-analytic practice.

| META-ANALYSES OF STANDARDIZED MEAN DIFFERENCES IN RCTs
In the following, we summarize the prevalent method of synthesizing RCT effect sizes in meta-analytic research. Moreover, we emphasize shortcomings of this approach that render its application infeasible in many situations.

| The RCT effect size
The conventional effect size for RCTs with continuous outcomes is the difference in the standardized mean change between the pretest and posttest for the treatment and control groups. Let us assume that the pretest and posttest scores for the metric outcome in both groups (T = treatment group, C = control group) follow a bivariate normal distribution in the population with means μ_T,pre and μ_C,pre at the pretest and μ_T,post and μ_C,post at the posttest. If we further assume a common variance σ² for both groups at both time points and a common correlation ρ between pre- and posttest scores, then the standardized mean change in the population is given by δ_g = (μ_g,post − μ_g,pre)/σ in each group g ∈ {T, C}. The RCT effect size for the difference in the standardized mean change is

Δ = δ_T − δ_C, (1)

with the corresponding sample estimate for Δ as

Δ̂ = c(df) · [(M_T,post − M_T,pre) − (M_C,post − M_C,pre)] / SD_pre. (2)

In (2), M_pre and M_post are the pretest and posttest means in the two groups, while SD_pre is the pooled pretest standard deviation given the pretest standard deviations SD_T,pre and SD_C,pre and respective sample sizes n_T and n_C:

SD_pre = √[((n_T − 1)·SD²_T,pre + (n_C − 1)·SD²_C,pre) / (n_T + n_C − 2)]. (3)

Although different estimators for σ̂ have been proposed that either use independent estimates σ̂_g for both groups 14 or also incorporate the posttest variance, 15 simulation research suggests that the pooled pretest SD results in the most precise estimates of the sampling variances of Δ̂. 16 Finally, c(df) is a bias adjustment function to correct for a small-sample bias, with degrees of freedom df = n_T + n_C − 2 and the gamma function Γ(x) 17,18:

c(df) = Γ(df/2) / [√(df/2) · Γ((df − 1)/2)]. (4)

The asymptotic sampling variance of Δ̂ in (2) has been derived as 16

Var(Δ̂) = c(df)² · 2(1 − ρ)·(n_T + n_C)/(n_T·n_C) · (n_T + n_C − 2)/(n_T + n_C − 4) · [1 + Δ²/(2(1 − ρ)·(n_T + n_C)/(n_T·n_C))] − Δ². (5)
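For illustration, the sample estimate in (2) together with the pooled pretest SD in (3) and the bias adjustment in (4) can be computed in a few lines. This is a minimal Python sketch; the function and argument names are our own, not part of the manuscript's R pipeline:

```python
import math

def rct_effect_size(m_t_pre, m_t_post, m_c_pre, m_c_post,
                    sd_t_pre, sd_c_pre, n_t, n_c):
    """Difference in standardized mean change, scaled by the pooled pretest SD."""
    df = n_t + n_c - 2
    # Small-sample bias adjustment c(df); the Gamma-function form is
    # numerically close to the common approximation 1 - 3 / (4*df - 1).
    c = math.gamma(df / 2) / (math.sqrt(df / 2) * math.gamma((df - 1) / 2))
    # Pooled pretest standard deviation as in (3).
    sd_pre = math.sqrt(((n_t - 1) * sd_t_pre**2 + (n_c - 1) * sd_c_pre**2) / df)
    # Difference in mean changes, standardized and bias-adjusted as in (2).
    return c * ((m_t_post - m_t_pre) - (m_c_post - m_c_pre)) / sd_pre
```

With, say, n_T = n_C = 50 the adjustment c(df) shrinks the raw standardized difference by roughly 1%.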

| The random-effect meta-analytic model
If RCT effect size estimates are available from multiple samples, meta-analytic methods can be used to combine them to infer an average true effect Δ across samples. Consider K samples to contribute effect sizes for the meta-analysis. Let Δ̂_k denote the effect size estimate of Δ_k from the kth sample with k ∈ {1, …, K} and Var(Δ̂_k) = v_k as the corresponding sampling variance. Then, the univariate random-effects model can be written as a multilevel model, such that 13

Δ̂_k = Δ + u_k + e_k, with u_k ~ N(0, τ²), (6)

where the sampling errors e_k = Δ̂_k − Δ_k are assumed to be uncorrelated with known variance v_k. u_k represents the deviation of a sample-specific true effect from the average true effect, and τ² gives the heterogeneity of the distribution of the true effects. If u_k = 0 for all samples, (6) simplifies to a fixed-effect model. Because the assumption of no between-sample heterogeneity is rarely tenable in practice, 19 we will focus on the random-effects model. Although different estimators have been suggested for the random-effects variance τ², restricted maximum likelihood (REML) has shown the most promising results for continuous outcomes under different conditions (see Reference [20] for a review). The average true effect Δ in (6) is typically derived as a weighted least squares estimate given by

Δ̂ = Σ_k w_k·Δ̂_k / Σ_k w_k, with weights w_k = 1/(v_k + τ̂²).
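To make the weighting scheme concrete, here is a minimal sketch of random-effects pooling. For brevity it uses the DerSimonian-Laird moment estimator for τ² rather than the REML estimator the manuscript relies on; the function name is our own:

```python
import numpy as np

def pool_random_effects(d, v):
    """Inverse-variance pooled effect under a univariate random-effects model.
    Uses the DerSimonian-Laird moment estimator for tau^2 (illustrative only)."""
    d, v = np.asarray(d, float), np.asarray(v, float)
    w = 1 / v                                    # fixed-effect weights
    fe_mean = np.sum(w * d) / w.sum()
    q = np.sum(w * (d - fe_mean)**2)             # Cochran's Q
    c = w.sum() - np.sum(w**2) / w.sum()
    tau2 = max(0.0, (q - (len(d) - 1)) / c)      # truncated moment estimate
    w_star = 1 / (v + tau2)                      # random-effects weights
    pooled = np.sum(w_star * d) / w_star.sum()
    se = np.sqrt(1 / w_star.sum())
    return pooled, se, tau2
```

For perfectly homogeneous inputs the heterogeneity estimate is truncated to zero and the pooled effect reduces to the fixed-effect mean.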

| Unresolved challenges in RCT meta-analyses
Morris 16 advocated the use of an effect size in RCT meta-analyses which is based on the pooled pretest standard deviation (see Reference [3]) because it "provides an unbiased estimate of the population effect size" (p. 24) and has a smaller sampling variance than competing estimates. However, the sampling variance estimator in (5) relies on the pre-post correlation. While means and standard deviations of the pretest and posttest scores and sample sizes are routinely reported in scientific publications, the correlation between pretest and posttest scores is seldom found. In fact, it is not uncommon that not a single primary study included in a meta-analysis informs about the respective correlation. 22 Therefore, these correlations are often imputed by a constant value such as 0.70, 23 0.60, 24 or 0.50, thus mimicking an independent groups design. 3 However, empirical effect size distributions of pre-post correlations in different fields highlight that these correlations can vary substantially depending on the domain and the studied effect. 25,26 For example, Taylor and colleagues 26 found pooled pre-post correlations for different types of training effects that varied between 0.43 and 0.82. Thus, using an arbitrary fixed value for the unknown pre-post correlation might be misleading and reduce the efficiency of the effect size estimator in (6) (see Reference [27] for similar concerns in the context of multivariate meta-analyses). Even if pre-post correlations are available from primary studies, it might not be advisable to use sample estimates for the population correlation ρ required in (5), because especially in small samples with less than 250 participants that dominate RCT research, 28 sample correlations are highly variable and a poor estimate of the population value. 29
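The sensitivity of the imputed sampling variance to the assumed pre-post correlation can be illustrated with a simplified large-sample approximation. Note that this approximation and the function name are ours; it is a stand-in for, not the exact form of, the Morris-type estimator in (5):

```python
def approx_var(d, rho, n_t, n_c):
    """Large-sample approximation to the sampling variance of the RCT effect
    size: the pre-post correlation rho enters through the variance of each
    standardized mean change (simplified illustrative formula)."""
    return 2 * (1 - rho) * (1 / n_t + 1 / n_c) + d**2 / (2 * (n_t + n_c))

# Sensitivity of the variance to the imputed pre-post correlation:
for rho in (0.2, 0.5, 0.8):
    print(rho, round(approx_var(0.4, rho, 25, 25), 4))
```

With n_T = n_C = 25 and d = 0.4, moving the imputed correlation from 0.2 to 0.8 shrinks the implied variance by roughly a factor of four, which directly changes the inverse-variance weights in (6).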

| A MULTIVARIATE META-REGRESSION APPROACH FOR RCTs
To overcome the problem of missing pre-post correlations, we propose modeling the RCT effect as independent group effect sizes in a meta-analytic regression framework. To do so, the RCT effect size in (2) is restructured as

Δ̂ = δ̂_post − δ̂_pre (7)

and, thus, expressed as the difference between two independent group effect sizes for the pretest and the posttest. Then, the sampling variances of the effect size δ̂_t at each measurement occasion t (0 = pretest, 1 = posttest) do not rely on the pre-post correlation but correspond to (5) when setting ρ to 0.5. 17 The difference in (7) can be formalized in a multivariate meta-regression model 30 where each sample contributes two effect sizes (δ̂_pre and δ̂_post), as

δ̂_kt = β_0 + β_1·t + u_kt + e_kt, (8)

with δ̂_kt as the independent group effect size in sample k at measurement occasion t (0 = pretest, 1 = posttest) and e_kt as the sampling error residual with known variance v_kt = Var(δ̂_kt). The regression coefficient β_1 represents the difference in effect sizes between the pretest and posttest and, thus, estimates the average true effect Δ across samples as in (6), while the intercept β_0 represents the mean difference at the pretest (i.e., a selection effect). The time-specific effects u_kt within a sample are jointly distributed, in the most general case, with an unstructured variance-covariance matrix

T² = [τ²_0  τ_01; τ_01  τ²_1],

thus allowing different between-study heterogeneities at pre- and posttest. Then, τ²_0 + τ²_1 − 2·τ_01 corresponds to the total between-study heterogeneity for the RCT effect size, that is, τ² in (6). 1 However, because the variance-covariance structure of the time-specific effects is rather complex, more restrictive specifications might be more appropriate in practice. For example, properly designed RCTs should result in no group differences at the pretest because respondents are randomized to the treatment and control groups (see Reference [31] for an overview of different approaches). Therefore, if the assumption of no (or negligible) pretest imbalance seems justified (which is often the case; see References [28,32]), β_0 as well as τ²_0 and τ_01 in T² could be constrained to 0; this would reduce the total between-study heterogeneity to τ² = τ²_1. Moreover, because RCT meta-analyses often include rather few primary studies with small samples, 33 the covariance in T² might sometimes be practically non-identifiable, 34,35 thus requiring modeling independent variances at pre- and posttest. In practice, proper constraints on the variance structure can be identified by comparing models with different random-effects structures, for example, using likelihood ratio tests. 36

Because each sample contributes two effect sizes to the multivariate meta-analytic model in (8), the δ̂_k· are no longer independent but exhibit an unknown within-sample correlation ρ_k. However, information on the exact value of ρ_k is not essential because dependent effects can be acknowledged using robust variance estimation (RVE 10,11) or by extending (8) to a three-level model (TLM 12,13), neither of which relies on a known correlation.
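Restructuring the data for the model in (8) amounts to stacking two rows per study with a time dummy. The sketch below uses hypothetical numbers and a crude weighted least squares fit that ignores the within-study dependency (so it illustrates the point estimate only, not valid standard errors):

```python
import numpy as np

# Hypothetical summary data: per study, pre/post effect sizes and variances.
studies = [
    {"d_pre": 0.02,  "v_pre": 0.05, "d_post": 0.45, "v_post": 0.06},
    {"d_pre": -0.05, "v_pre": 0.04, "d_post": 0.30, "v_post": 0.04},
]

# Long format: one row per effect size, a time dummy (0 = pre, 1 = post),
# and a study id so that RVE or a TLM can model the within-study dependency.
rows = []
for k, s in enumerate(studies):
    rows.append({"study": k, "time": 0, "d": s["d_pre"],  "v": s["v_pre"]})
    rows.append({"study": k, "time": 1, "d": s["d_post"], "v": s["v_post"]})

# beta[1], the coefficient on the time dummy, estimates the pooled RCT effect;
# beta[0] estimates the (ideally negligible) pretest imbalance.
X = np.array([[1, r["time"]] for r in rows], float)
y = np.array([r["d"] for r in rows])
W = np.diag([1 / r["v"] for r in rows])
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
```

Because the time dummy separates the two occasions, beta[0] equals the weighted mean of the pretest effects and beta[0] + beta[1] the weighted mean of the posttest effects.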

| Robust variance estimation
RVE corrects the standard errors of the model parameters estimated in (8) without requiring information on the exact variance-covariance structure for the dependent effect sizes within a sample. This is achieved by specifying a "working model" for the dependence structure such as using a common correlation ρ_k of 0.8 within samples. 10 Then, estimates of the study-specific covariance matrices are empirically derived using weighted least squares as the products of the regression residuals (see Reference [11] and Appendix A). Although each study-specific estimate might be rather imprecise, their average across multiple samples is sufficiently precise when the number of samples is large. However, small-sample corrections can be incorporated into the variance-covariance estimator to yield approximately unbiased standard errors when the number of studies is small. 37,38 Importantly, although the robust standard errors are unbiased even if the "working model" is not correctly specified, the precision of the resulting estimates increases the closer the true dependency structure is approximated.
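The sandwich logic behind RVE can be sketched in its simplest (CR0) form, aggregating residual score contributions by cluster. This sketch omits the bias-reduced linearization (CR2) small-sample correction that the simulation uses, and all names are ours:

```python
import numpy as np

def cluster_robust_se(X, y, W, cluster):
    """WLS fit with CR0 cluster-robust ("sandwich") standard errors.
    Minimal sketch without the CR2 small-sample adjustment."""
    bread = np.linalg.inv(X.T @ W @ X)
    beta = bread @ X.T @ W @ y
    resid = y - X @ beta
    meat = np.zeros_like(bread)
    for g in np.unique(cluster):
        i = cluster == g
        u = X[i].T @ W[np.ix_(i, i)] @ resid[i]   # score contribution of cluster g
        meat += np.outer(u, u)
    vcov = bread @ meat @ bread                   # the "sandwich"
    return beta, np.sqrt(np.diag(vcov))
```

The averaging of cluster-level outer products is why the estimator is consistent even under a misspecified working model, and why it is imprecise with few clusters.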

| Three-level meta-analysis
TLM extends the model in (8) by an additional random effect u_k. Thus, the total random variance is split into two components, the between-sample variation u_k ~ N(0, τ²_b) and the residual within-sample variation u_kt ~ N(0, τ²_w) in true effects:

δ̂_kt = β_0 + β_1·t + u_k + u_kt + e_kt. (9)

Then, the total between-study heterogeneity for the RCT effect size, that is, τ² in (6), corresponds to 2·τ²_w. 2 Moreover, the TLM in (9) assumes independent sampling errors e_kt. Thus, instead of specifying correlated errors, the TLM models correlated true effects within samples. This assumption is clearly violated when multiple effect sizes are based on the same sample. Additionally, the TLM implies a similar degree of between-sample heterogeneity for all effect sizes. 39 Although these assumptions might not be tenable in empirical applications, Van den Noortgate and colleagues 13,40 showed that the three-level random-effects structure can account for within-sample dependencies reasonably well when the number of effect sizes per sample is small and the random variances are large. Therefore, models with multiple random effects are increasingly used in applied meta-analytic research (see Reference [41] for a review). However, TLMs tend to suffer from convergence issues when the number of samples is small and show increased parameter bias when pooling outcome-specific effect sizes. 42

| OBJECTIVES AND RESEARCH QUESTIONS
As outlined above, different approaches are currently in use to pool RCT effect sizes across multiple samples. Hitherto, little is known to what degree and under which conditions the use of an ad-hoc substitute for the unknown pre-post correlation might distort the resulting meta-analytic estimates. Additionally, the proposed multivariate approach could improve current practice by modeling the RCT effect as dependent effect sizes in line with current state-of-the-art approaches to modeling dependencies in meta-analytic research. 11,42 Therefore, we present a comprehensive Monte Carlo simulation that evaluated the precision of different meta-analytic methods for pooling RCT effect sizes under various realistic conditions. Based on these results, we provide recommendations for future meta-analytic practice.

| Meta-analytic models for RCT effects
The simulation compared three univariate random-effects meta-analyses of RCT effect sizes and two multivariate random-effects meta-analyses of independent group effect sizes. As a point of comparison, we also included a univariate meta-analysis of standardized differences in posttest means that ignored any pretest information. All models used a REML estimator with a maximum of 1000 iterations for the optimizer to converge. Although REML results in slightly negatively biased heterogeneity estimates in univariate meta-analyses, it is less biased than maximum likelihood estimation 43 and also compares favorably to alternative estimators as reviewed by Veroniki and colleagues. 20 More importantly, REML is applicable to univariate as well as multivariate meta-analyses.

| Univariate meta-analyses of RCT effect sizes
The reference approach (UMA-S, with S for sample) consisted of inverse-variance weighted random-effects meta-analyses for which all primary studies reported the required sample statistics to calculate the effect size in (2) and its sampling variance in (5). In (5), the sample effect size d was used for Δ and the sample pre-post correlation r was used for ρ. The second approach (UMA-P, with P for population) also assumed that all primary studies reported the required sample statistics, but since (5) requires the population value for the pre-post correlation, a two-step approach was adopted. First, a random-effects meta-analysis (REML estimation) pooled the inverse variance-weighted Fisher's Z transformed pre-post correlations that were reported in the primary studies to derive a pooled pre-post correlation ρ̂. Then, the pooled pre-post correlation was used in (5) for the calculation of the sampling variances of the RCT effect sizes. It was assumed that the pooled pre-post correlation would represent the population correlation ρ required in (5) more precisely, particularly in small samples. 29 For the third approach (UMA-I, with I for imputation), we simulated meta-analyses for which no primary study reported the necessary pre-post correlations. Therefore, the sampling variances in (5) were calculated by imputing a constant value of either 0.5 or 0.8 for the unknown population correlation ρ. For all univariate meta-analyses, standard errors and 95% confidence intervals were adjusted following Knapp and Hartung 44 for better control of type I error rates in small samples. Prediction intervals (PI) were calculated as

PI = Δ̂ ± t_(0.975, k−2) · √(SE(Δ̂)² + τ̂²), (10)

where SE(Δ̂) is the standard error of the estimated pooled effect Δ̂, τ̂² is the estimated between-sample variance, t_(0.975, k−2) is the 97.5th percentile of the t-distribution with k − 2 degrees of freedom, and k gives the number of samples. 45
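The prediction interval in (10) is straightforward to compute. A small sketch, assuming SciPy is available for the t quantile (the function name is ours):

```python
import math
from scipy.stats import t  # SciPy assumed available for the t quantile

def prediction_interval(delta, se, tau2, k, level=0.95):
    """Prediction interval for the effect in a new study: combines the
    uncertainty of the pooled effect (se^2) with the between-study
    heterogeneity (tau2), using a t quantile with k - 2 df."""
    crit = t.ppf(1 - (1 - level) / 2, df=k - 2)
    half = crit * math.sqrt(se**2 + tau2)
    return delta - half, delta + half
```

With a pooled effect of 0.4, SE = 0.05, τ̂² = 0.04, and k = 10 studies, the interval spans roughly −0.08 to 0.88, far wider than the confidence interval because τ̂² dominates.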

| Multivariate meta-analyses of pre- and posttest effect sizes
Two independent group effect sizes (d_pre and d_post) with their sampling variances were calculated for each simulated sample. Then, the pooled RCT effect was estimated using the regression framework in (8). The first approach adopted a robust meta-analytic model (i.e., RVE) using (8) as a working model and assuming correlated errors of either 0.5 or 0.8. Then, cluster-robust variances with a bias-reduced linearization correction were calculated for the regression parameters to account for heteroscedasticity and unmodeled errors. 37,46 Moreover, confidence intervals incorporated the Satterthwaite correction for the degrees of freedom, which has been shown to lead to more appropriate coverage rates of confidence intervals. 37 The second approach also pooled two independent group effect sizes but acknowledged the dependencies between effects by estimating a three-level meta-analytic model (TLM) as given in (9). Thus, the random variance terms modeled effect sizes nested within studies. 12,13 For all multivariate meta-analyses, prediction intervals were calculated following (10), replacing Δ̂ with β̂_1 from (8) or (9) and τ̂² with the total between-study heterogeneity.

| Univariate meta-analyses of posttest effect sizes
As the most basic strategy for analyzing treatment effects, inverse-variance weighted random-effects meta-analyses pooled the posttest effect sizes without considering the pretest (UMA-B, with B for basic). The effect size for the standardized mean difference between independent groups was calculated as

d̂ = c(df) · (M_T,post − M_C,post) / SD_post,

with SD_post given by (3) using the posttest standard deviations. 17,18 The sampling variance followed (5) when setting ρ to 0.5. 17 Again, standard errors and confidence intervals were adjusted following Knapp and Hartung. 44 Prediction intervals were calculated as in (10).

| Experimental design
The simulation aimed to mimic typical conditions of meta-analyses of pre-post intervention studies with continuous outcomes that are often encountered in evaluation studies across many disciplines such as clinical (e.g., psychotherapy) and educational research (e.g., teaching) or personnel psychology (e.g., employee training). The present study manipulated seven design factors to evaluate their impact on the simulation results (see Table 1). These included the number of effect sizes in a meta-analysis (K), the average sample size per effect size (n), the true change in the treatment group (Δ), the true pre-post correlation (ρ), the true posttest variance in the treatment group (Φ), the between-study heterogeneity (τ²), and the presence of attrition bias. This resulted in a 5 (effect sizes) × 3 (sample sizes) × 3 (true changes) × 3 (true correlations) × 2 (true posttest variances) × 2 (between-sample heterogeneities) × 2 (attrition biases) fully crossed simulation design.

| Number of effect sizes per meta-analysis
Meta-analyses in psychology and education often combine between 10 and 200 effect sizes. 47,48 Meta-analyses of training studies that often adopt RCT designs are usually located at the lower end of this distribution, regardless of the discipline. For example, Collins and Holton 49 reported meta-analytic effects on the effectiveness of managerial leadership development programs that included between 6 and 23 samples. Similarly, various meta-analyses on behavior modeling training effects for different outcomes were based on 14 to 66 effect sizes, with most of them including less than 40. 26 In meta-analyses of clinical trials, the number of pooled effect sizes is even substantially smaller. In psychology, meta-analyses on the effectiveness of clinical psychology treatments include a median of 18 studies. 50 For medical trials, a review of the Cochrane Database of Systematic Reviews found that of nearly 3000 meta-analyses on mental health, 90% pooled results from up to 10 studies, while half of them included no more than three studies. 28 Therefore, the number of effect sizes per meta-analysis was set to either 3, 5, 10, 20, or 40.
T A B L E 1 Experimental conditions and constant settings for simulation.

| Average sample size per effect size

Sample sizes of primary studies in meta-analyses differ largely depending on the setting (e.g., educational, clinical, personnel) and the specificity of the group (e.g., students with learning disabilities, patients with Parkinson's disease, team leaders). For example, Taylor and colleagues 26 reported a mean sample size (including control and treatment groups) in meta-analyses of training studies of 37 (Min = 5, Max = 271), which was similar to meta-analyses in clinical psychology. 50 In contrast, the above-mentioned review of the Cochrane database 28 (…). These vectors were replicated K/5 times in a given meta-analysis to meet the total number of simulated samples. The sample sizes in the treatment and control groups were equal.

| True change in the treatment group
Lipsey and Wilson 48 found a median standardized difference (Cohen's d) in meta-analyses of psychological treatment effects of about 0.47. However, meta-analyses of training studies also reported, depending on the observed outcome, pooled effects that reached up to 1.00. 26 In contrast, clinical studies often observe more modest effect sizes. A review of more than 100,000 clinical trials conducted between 1975 and 2014 showed that, independent of the year of study, the average effect is about 0.20. 53 Educational studies with randomized designs even produce average effects of only about 0.10 to 0.16. 54,55 Therefore, the standardized mean change in the treatment group was set to either 0.20, 0.40, or 0.80, representing small, medium, and large effect sizes. In the control group, a standardized mean change of 0.00 was assumed.

| True pre-post correlation
Pooled pre-post correlations in training studies typically fall between 0.43 and 0.82. 26 In various meta-analyses of psychiatric RCTs, the median of the pooled pre-post correlations was 0.36, with the 25th and 75th percentiles amounting to 0.22 and 0.58. Negative correlations were uncommon. 24 Therefore, we used pre-post correlations of either 0.20, 0.50, or 0.80 that were identical in the treatment and control groups.

| True posttest variance in treatment group
The effect size for RCT designs assumes homogeneous variances at pre- and posttest as well as for treatment and control conditions. 16 However, if participants are differently affected by the treatment, some of them will improve more strongly while others will improve less. Consequently, the posttest scores will exhibit a larger variance as compared to the control group or the pretest. For example, in a meta-analysis of training studies, the posttest standard deviations increased by about 7.6% in the treatment group. 56 Similarly, a review of meta-analyses of clinical trials reported that, on average, empirical pre-post correlations for treatment groups were about Δr = 0.20 smaller than the respective pre-post correlations in the control groups, thus reflecting larger posttest variances in the treatment groups. 24 On the other hand, meta-analyses often do not identify pronounced treatment heterogeneity, thus making the assumption of homogeneous variances plausible for many applications. 57,58 To study potential effects of heterogeneous variances, we set the posttest variance in the treatment group to either 1.0 or 1.5 times the population variance. Although a variance increase by 50% seems unrealistic in most cases, it was chosen as a worst-case scenario (see also Morris 16 for a similar condition).

| Between-study variances
Van Erp and colleagues 59 reviewed heterogeneity estimates in over 700 psychological meta-analyses and found a median between-study heterogeneity of τ_Δ = 0.20 (IQR = [0.10, 0.33]) for meta-analyses of standardized mean differences. A similar review by Linden and Hönekopp 60 identified a slightly larger mean between-study heterogeneity across 150 meta-analyses from cognitive, organizational, and social psychology of τ_Δ = 0.30, whereas multiple close replications were less variable with a mean τ_Δ of 0.09. Therefore, we used a between-study heterogeneity in (6) of either 0.10 or 0.30, thus reflecting small and large heterogeneity, respectively.

| Attrition bias
Longitudinal studies often suffer from sample attrition because not all participants randomized to the control and treatment groups at the pretest also participate in the posttest measurement. In the past, average attrition rates between 13% and 19% have been reported for medical trials and educational interventions, respectively. 61,62 Differential attrition for treatment and control groups was typically small. 61,63 Importantly, attrition is primarily a concern if it introduces bias because the likelihood of nonparticipation is associated with pre- and posttest scores.

Bias is sometimes considered problematic if it exceeds a threshold of |d| = 0.05. 64 However, a recent examination of attrition bias in 10 educational RCTs found only a mean absolute bias of 0.026. 65 Similarly, the average bias in medical RCTs was about 0.02, albeit attrition increased the between-study variance. 66 Thus, attrition bias might not be a widespread threat to the validity of RCTs. Nevertheless, we considered a situation where RCTs exhibited an average attrition bias of d = 0.05 and compared the respective results to a condition without bias.

| Data simulation and model estimation
For each experimental condition outlined above, the pooled treatment effect was calculated using different random-effects meta-analyses from randomly generated samples.The entire simulation procedure for a given condition followed seven steps: 1.For a given sample k included in a meta-analysis, the true change in the treatment group Δ k was randomly drawn from a normal distribution N Δ, τ 2 Δ ð Þwith Δ and τ 2 Δ representing the true change and between-sample heterogeneity depending on the experimental condition.2. For a given sample k included in a meta-analysis, the true pre-post correlation ρ k was randomly drawn from a normal distribution as tanh ð Þ representing the inverse hyperbolic tangent function, thus, giving the Fisher's Z transformed true pre-post correlation ρ depending on the experimental condition, and τ 2 ρ giving the between-sample heterogeneity.For a given meta-analysis, τ ρ was derived by a random draw from a half-normal distribution min{half-N(0, 0.25), 0.5}.This closely reproduced the empirical distribution of between-sample heterogeneities identified in over 700 psychological meta-analyses of correlation coefficients that gave a median of τ ρ = 0.16 (IQR = [0.08,0.22]). 59and Φ representing the total sample size and posttest variance depending on the experimental condition.The covariances for the attrition indicator were identified by trial and error to produce an average attrition bias of about 0.05 for an attrition rate of 20%.In the condition without attrition bias, the first n/2 simulated rows were retained, whereas in the condition with attrition bias, the n/2 rows with the lowest values on the attrition indicator were retained.In this way, attrition bias did not affect the manipulated sample size.
4. For the control group in sample k, n/2 data points representing the pre- and posttest scores were randomly drawn from a multivariate normal distribution, with n representing the total sample size depending on the experimental condition.

5. In each sample, one RCT effect size and two independent group effect sizes with their sampling variances were calculated according to Equations (2) and (5).

6. Steps 1 to 5 were repeated to generate K samples for a given meta-analysis according to the experimental condition.

7. The different meta-analytic models were applied to the simulated samples to derive the pooled effect Δ̂, the heterogeneity estimate τ̂², a 95% confidence interval for Δ̂, and a 95% prediction interval for Δ̂.
These steps were replicated 1000 times for each experimental condition. By default, multivariate meta-analyses used a bound-constrained quasi-Newton optimizer (nlminb),⁶⁷ but in case of a convergence failure resorted to the Nelder-Mead⁶⁸ method. Replications for which a meta-analytic model still failed to converge were discarded and replaced with a valid case. All analyses were conducted in R version 4.2.2 with the packages metafor version 3.8.1 and clubSandwich version 0.5.8.⁶⁹,⁷⁰
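For illustration, the sampling scheme in steps 1 to 4 can be sketched as follows. This is a simplified Python sketch rather than the authors' original R code: pretest variances are fixed to 1, the attrition indicator is omitted, and all function and parameter names are ours.

```python
import numpy as np

rng = np.random.default_rng(2024)

def draw_sample(delta, tau_delta, rho, tau_rho, n, phi=1.0):
    """Steps 1-4 (simplified): generate raw pre-/posttest scores for one primary study."""
    # Step 1: true change in the treatment group, Delta_k ~ N(Delta, tau_Delta^2)
    delta_k = rng.normal(delta, tau_delta)
    # Step 2: true pre-post correlation via Fisher's Z, back-transformed with tanh
    rho_k = np.tanh(rng.normal(np.arctanh(rho), tau_rho))
    # Steps 3-4: bivariate normal pre-/posttest scores; posttest variance phi
    cov = [[1.0, rho_k * np.sqrt(phi)], [rho_k * np.sqrt(phi), phi]]
    treatment = rng.multivariate_normal([0.0, delta_k], cov, size=n // 2)
    control = rng.multivariate_normal([0.0, 0.0], cov, size=n // 2)
    return treatment, control

# Heterogeneity of pre-post correlations per meta-analysis: min{half-N(0, 0.25), 0.5}
tau_rho = min(abs(rng.normal(0.0, 0.25)), 0.5)
treat, ctrl = draw_sample(delta=0.4, tau_delta=0.2, rho=0.7, tau_rho=tau_rho, n=80)
```

In the full procedure, effect sizes and sampling variances would then be computed from these raw scores and passed to the competing meta-analytic models.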

| Performance criteria
The accuracy of an estimator θ̂ was compared using the average parameter bias and the root mean squared error (RMSE), which were evaluated for the mean RCT effect Δ̂ and the heterogeneity estimate τ̂². For both criteria, values close to 0 indicate preferable estimators. However, because the squared RMSE can be decomposed as RMSE² = Bias² + Var(θ̂), a more biased estimator might still be more efficient if it yields a considerably smaller variance. Moreover, the coverage rates of the 95% confidence intervals were calculated as the percentage of replications for which the true effect Δ fell within the confidence interval. Similarly, the coverage rates of the 95% prediction intervals were compared for the different estimators to study the precision with which treatment effects in hypothetical future samples could be predicted. These coverage rates were calculated by randomly drawing a value from N(Δ, τ²_Δ) and determining the percentage of replications for which this true effect fell within the prediction interval. Accurate confidence and prediction interval estimators should exhibit a nominal coverage probability of 95%. Finally, we also calculated the widths of the 95% confidence and prediction intervals.
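In code, these criteria reduce to a few lines. The following Python sketch (our own illustration, not the study's evaluation script) computes bias, RMSE, and interval coverage from vectors of replicated estimates:

```python
import numpy as np

def evaluate(estimates, ci_lower, ci_upper, true_value):
    """Bias, RMSE, and interval coverage across simulation replications."""
    est = np.asarray(estimates, dtype=float)
    bias = est.mean() - true_value                      # average parameter bias
    rmse = np.sqrt(np.mean((est - true_value) ** 2))    # RMSE^2 = bias^2 + variance
    covered = (np.asarray(ci_lower) <= true_value) & (true_value <= np.asarray(ci_upper))
    return {"bias": bias, "rmse": rmse, "coverage": covered.mean()}
```

An unbiased estimator has bias 0 but may still have a sizable RMSE through its variance, which is exactly the bias-efficiency trade-off noted above.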
The Monte Carlo error (MCE) for each performance criterion was estimated using the jackknife method.⁷¹ Let R represent the number of replications, with X = {X₁, X₂, …, X_R} giving the estimated replicates from which the performance criterion θ(X) (e.g., bias, RMSE, coverage rate) is calculated. If X₋ᵣ with r ∈ {1, …, R} represents the subset of X without the rth replicate, then the MCE for θ is given as

MCE(θ) = sqrt( ((R − 1)/R) · Σᵣ (θ(X₋ᵣ) − θ̄)² ),

where θ̄ denotes the mean of the R leave-one-out estimates θ(X₋ᵣ). The MCE allows quantifying the precision of each performance criterion to compare differences across simulation conditions. However, because of the large number of replications used in our simulation, the obtained MCEs were rather small. Thus, we refrain from reporting confidence intervals but provide the median and maximum MCE for each performance criterion.
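The jackknife MCE can be sketched in a few lines of Python (again our own illustration; the function names are ours):

```python
import numpy as np

def jackknife_mce(replicates, criterion):
    """Monte Carlo error of a performance criterion via leave-one-out jackknife."""
    x = np.asarray(replicates, dtype=float)
    r = len(x)
    # Criterion recomputed on each subset X_{-r} without the r-th replicate
    loo = np.array([criterion(np.delete(x, i)) for i in range(r)])
    return np.sqrt((r - 1) / r * np.sum((loo - loo.mean()) ** 2))
```

As a sanity check, with the mean as criterion this reproduces the classical standard error of the mean, s/√R.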

| RESULTS
The simulation results are summarized separately for the different performance criteria. Given the large number of experimental conditions that resulted in 1080 unique cells, the results presented in the following tables and figures refer to the condition for a medium treatment effect (Δ = 0.4) with homogenous posttest variances (Φ = 1.0). Moreover, we will focus on the conditions without attrition bias for the different meta-analyses of pre-post effects; specific results for UMA-B or attrition bias will be selectively pointed out in the text (full results are available in the online material). In addition, factorial analyses of variance (ANOVA) evaluated the sources of variability in the performance criteria to determine which combinations of experimental conditions produced stronger effects (in terms of η²) and warranted detailed scrutiny (see Table 2). Again, these analyses were limited to the different estimators of pre-post effects and did not include the meta-analyses of posttest effect sizes.

| Convergence rates
For all estimators and experimental conditions, the estimated models converged successfully after 1000 iterations. However, for about 1.7% of the TLMs the default optimizer (nlminb)⁶⁷ failed to converge, requiring the use of an alternative optimization algorithm.⁶⁸ RVEs did not exhibit similar convergence problems.

| Average bias and root mean squared error of fixed-effect estimators
The Monte Carlo errors for the average bias in Δ and the RMSE were negligible in all conditions (Mdn = 0.003/0.002, Max = 0.012/0.009), thus allowing for valid comparisons of the respective point estimates. Factorial analyses of variance for the simulation conditions showed that the main effect of the meta-analytic method explained about 16.7% of the variance in the average bias (see Table 2). Moreover, this effect was qualified by small two-way interactions with the average sample size (η² = 5.3%) and the true change in the treatment group (η² = 3.5%). Figure 1 summarizes the average bias by average sample size, number of studies, and true pre-post correlation for the conditions with a small and a large between-sample heterogeneity. These results show that all estimators were slightly negatively biased at smaller average sample sizes. Regarding the meta-analytic method, the largest bias was observed for UMA-S, which used sample-specific pre-post correlations for the calculation of the RCT sampling variances. In contrast, UMA-P, which pooled the pre-post correlations before calculating the sampling variances of the RCT effects, exhibited a smaller bias and performed comparably to the different approaches that did not make use of the sample pre-post correlations (UMA-I, RVE, TLM). The negative bias was more pronounced for smaller average sample sizes or larger true effects (see Figure S1 in the supplement material), whereas the other factors had no substantial impact. In these situations, RVE (r = 0.8) exhibited a slightly smaller bias as compared to the other multivariate estimators.
UMA-B, which ignored the pretest information, exhibited accuracies that were comparable to the meta-analyses of pre-post effect sizes (see Figure 1), at least as long as the posttest statistics were not systematically distorted. Attrition bias led to noticeably larger biases that reached up to −0.08 for larger true effects. Moreover, posttest variance heterogeneity amplified this effect and resulted in biases of up to −0.15 and −0.11 for the conditions with and without attrition bias, respectively (see Figure 2). In contrast, meta-analyses using the pretest information were not affected by attrition bias.
The RMSE of the different estimators was not affected by the meta-analytic method (see Table 2). Although the RMSE was strongly affected by the number of included primary studies and grew larger for meta-analyses with a smaller number of samples or larger between-sample heterogeneity (see Figure S2 in the supplement material), the meta-analytic method did not produce different effects. This suggests that despite its larger bias, UMA-S seems to exhibit a smaller variance. As a consequence, it performed rather comparably to the other estimators in terms of efficiency.

| Average bias and root mean squared error of random-effect estimators
The average bias for τ̂² exhibited negligible Monte Carlo error in all conditions (Mdn < 0.002, Max = 0.014). However, the bias was affected by the chosen meta-analytic method (η² = 34.4%), including its two-way interactions with the true pre-post correlation (η² = 6.7%) and the average sample size (η² = 6.6%). Figure 3 highlights that these effects were primarily driven by UMA-I and RVE. Imputing a constant pre-post correlation of 0.8 led to an overestimation of the between-study heterogeneity in situations where the true correlation was substantially smaller, particularly at small sample sizes when few primary studies were available. In contrast, for UMA-I (r = 0.5) a mismatch between the imputed and true correlation was less severe. A highly similar pattern was observed for RVE, which resulted in a positive bias when using a correlation that was larger than the true pre-post correlation. In contrast, TLM was less biased, except in conditions with large between-sample heterogeneity. In these instances, TLM resulted in slightly negative biases, particularly when the true pre-post correlation was large. Similarly, UMA-B was largely unbiased across most conditions. Posttest heterogeneity or attrition bias did not affect UMA-B or any of the other estimators (see Figure S3). The results for the RMSE of the examined estimators (MCE: Mdn = 0.002, Max = 0.009) mirrored those for the bias. Again, UMA-I (r = 0.8) and RVE (r = 0.8) showed larger RMSEs at small sample sizes and small pre-post correlations, while UMA-I (r = 0.5) and RVE (r = 0.5) were more efficient across all conditions (see Figure S4). In contrast, TLM seemed as efficient as the univariate meta-analyses. Again, posttest variance heterogeneity or attrition bias did not affect these results.

| Coverage rates of confidence intervals
The coverage rates of the 95% confidence intervals (MCE: Mdn = 0.690, Max = 1.436) were substantially affected by the meta-analytic method (see Table 2). The respective main effect (η² = 38.1%) was additionally qualified by two-way interactions with the between-sample heterogeneity (η² = 10.8%) and, to a lesser degree, also the pre-post correlation (η² = 5.2%), the number of samples (η² = 5.9%), and the average sample size (η² = 4.2%). The results in Figure 4 indicate that TLMs exhibited overcoverage, particularly when the between-sample heterogeneity was small or the true pre-post correlations were large. In contrast, the other estimators achieved coverage rates close to the nominal 95%. Only at small sample sizes did meta-analyses using the known pre-post correlations (UMA-S) show undercoverage in a few conditions. Again, UMA-B, which ignored the pretest information, exhibited substantially lower coverage rates when attrition bias was present or posttest variances were larger as compared to the pretest (see Figure S5). In the most extreme cases, for example, for a large true effect, the respective coverage rate fell as low as 1.3%.

| Coverage rates of prediction intervals
The coverage rates of the 95% prediction intervals (MCE: Mdn = 0.774, Max = 1.582) were substantially affected by the meta-analytic method (η² = 15.4%), which was qualified by a two-way interaction with the number of samples (η² = 14.8%). Across all conditions, the best coverage rates were observed for UMA-I (r = 0.8) and RVE (r = 0.8), with median coverage rates falling between 92% and 97% (Min = 78%; see Figure 5). However, at small between-sample heterogeneity and large pre-post correlations these estimators resulted in substantial undercoverage, while they were close to the nominal level in the remaining conditions. Using r = 0.5 for the unknown pre-post correlation led to more heterogenous results, with coverage rates falling as low as 39% and 49% for UMA-I (r = 0.5) and RVE (r = 0.5), respectively. TLM exhibited severe undercoverage in most conditions with a median coverage rate of 90%. Particularly at large between-sample heterogeneity and large pre-post correlations, coverage rates for TLM were rather low (Min = 42%). Univariate meta-analyses using the known pre-post correlations (UMA-S or UMA-P) showed rather severe undercoverage in most conditions, falling as low as 73% (Mdn = 90%). UMA-B exhibited coverage rates comparable to the other univariate estimators. Attrition bias or posttest variance heterogeneity amplified the undercoverage for UMA-B but had negligible effects on the other estimators (see Figure S6).

[Figure 4: Coverage rates of 95% confidence intervals for a medium treatment effect, homogenous posttest variances, and no attrition bias. Detailed results are given in Tables S5 and S6 of the supplement material.]

| Widths of 95% confidence and prediction intervals
The widths of the 95% confidence intervals (MCE: Mdn = 0.003, Max = 0.049) and the 95% prediction intervals (MCE: Mdn = 0.014, Max = 0.304) were hardly affected by the meta-analytic method (see Table 2). The estimators and their interactions explained between 0.0% and 0.8% of the variance in these performance indicators. Thus, the examined meta-analytic estimators did not substantially affect the widths of these intervals.

| ILLUSTRATIVE DATA EXAMPLE
To demonstrate the effect of the different approaches for the calculation of pooled RCT effects, let us consider a reanalysis of an existing RCT meta-analysis. This example aims to demonstrate that the choice of the meta-analytic method can matter and yield non-negligible variations in pooled RCT effects depending on the modeling approach. Carl and colleagues⁷² evaluated the efficacy of virtual reality exposure therapy for the treatment of various anxiety-related disorders. The 14 RCTs on specific phobias and social anxieties included in the reanalysis provided mean pre- and posttest scores for a treatment group and an untreated control group (waitlist) with the respective standard deviations. As is common in clinical research reports, none of the primary studies provided the pre-post correlation. Therefore, we estimated the pooled effects with two univariate meta-analyses of RCT effect sizes that imputed a constant value of either 0.5 or 0.8 for the missing pre-post correlation. In addition, three multivariate meta-analyses of independent group effect sizes were conducted that either used RVE with working models assuming correlations of either 0.5 or 0.8 between the dependent effects or a TLM that accounted for dependencies with an additional random effect. As a point of comparison, we also report the results of a meta-analysis of posttest effect sizes that ignores any pretest information (Table 3).

[Figure 5: Coverage rates of 95% prediction intervals for a medium treatment effect, homogenous posttest variances, and no attrition bias. Results for values falling below 0.80 are not presented. Full results are given in Tables S7 and S8 of the supplement material.]
As summarized in Table 3, the univariate meta-analyses of RCT effect sizes resulted in pooled point estimates Δ̂ between −1.06 and −1.02, depending on the size of the imputed pre-post correlation. Similarly, the multivariate meta-analyses exhibited effect estimates around −1.06 to −1.04. In contrast, the meta-analysis of posttest effect sizes identified a slightly smaller effect of −0.96, thus suggesting the presence of potential selection or time-selection interaction biases. The precision of the RCT effects was similar for all approaches except TLM and resulted in 95% confidence intervals of comparable widths. In contrast, the univariate meta-analysis of posttest effects exhibited a substantially larger interval.
The between-sample heterogeneity was more strongly affected by the meta-analytic estimator. The univariate meta-analyses showed, on average, slightly smaller between-sample heterogeneities as compared to the multivariate methods. Generally, the heterogeneity estimates slightly increased when imputing larger pre-post correlations or using larger correlations in the RVEs. As a result, the prediction intervals varied to some degree between the examined approaches, leading to different conclusions about hypothetical effects predicted for future studies. For some estimators, the prediction intervals included 0, whereas for others they did not. In line with the simulation results, the TLM estimated a substantially smaller random effect and, consequently, a narrower prediction interval. Thus, the choice of the meta-analytic method affected the fixed effect estimate only modestly, but more so the heterogeneity estimates.
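The link between the heterogeneity estimate and the width of the prediction interval can be made concrete with a generic random-effects model. The sketch below uses a DerSimonian-Laird estimator and a normal-quantile prediction interval for simplicity, whereas the paper fitted its models with metafor in R; all names and the quantile choice are our simplifications.

```python
import numpy as np
from statistics import NormalDist

def random_effects_meta(y, v):
    """DerSimonian-Laird pooling with 95% confidence and prediction intervals."""
    y = np.asarray(y, dtype=float)
    v = np.asarray(v, dtype=float)
    k = len(y)
    w = 1.0 / v
    mu_fixed = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - mu_fixed) ** 2)               # Cochran's Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                # DL heterogeneity estimate
    w_star = 1.0 / (v + tau2)                         # random-effects weights
    mu = np.sum(w_star * y) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    z = NormalDist().inv_cdf(0.975)
    ci = (mu - z * se, mu + z * se)
    pi = (mu - z * np.sqrt(tau2 + se ** 2), mu + z * np.sqrt(tau2 + se ** 2))
    return mu, tau2, ci, pi
```

Because the prediction interval adds τ̂² under the square root, a smaller random-effect estimate, as observed for the TLM, directly yields a narrower prediction interval even when the pooled effect barely changes.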

| DISCUSSION
In many disciplines such as clinical, psychological, and educational research, treatment or intervention effects are of primary interest, for example, to evaluate the effectiveness of novel therapies or training programs.⁷²,⁷³ Because individual studies might be affected by a multitude of factors, meta-analyses try to consolidate the available evidence of multiple studies on a common topic by estimating whether an intervention yields robust effects in different settings, in order to better understand the conditions under which an intervention might be more or less effective. Meta-analyses rely on sample statistics to calculate effect sizes in each sample. Unfortunately, relevant information for these calculations is frequently unavailable due to poor reporting practices in primary studies. Particularly the correlation between pre- and posttest scores, which is required for RCT meta-analyses, is often missing. As an ad-hoc solution, applied researchers frequently impute a constant value for the missing correlation without knowing how this might affect the pooled estimates. Therefore, the present study evaluated different meta-analytic estimators for the RCT effect. In addition to univariate meta-analyses of RCT effect sizes with known or imputed pre-post correlations, we also proposed two new multivariate meta-regression approaches which capitalize on recent advancements for the analysis of dependent effects in meta-analyses.¹¹,¹³

A comprehensive simulation study evaluated the different analytic approaches under realistic conditions that are typically encountered in applied research. These analyses provided five major results: First, traditional univariate meta-analyses of RCT effect sizes resulted in more biased point estimates as compared to the other estimators and tended to underestimate the true effect, particularly when sample sizes were small and the true pre-post correlation was large. In contrast, a substantial improvement was observed when pooling the pre-post correlations and using the pooled estimates for the calculation of the sampling variances of the RCT effect sizes. The respective univariate meta-analyses were largely unbiased in most conditions. However, the confidence intervals for both estimators held the nominal error level in most of the examined conditions and, thus, did not indicate substantially different interval estimates. Second, univariate meta-analyses of RCT effect sizes with imputed pre-post correlations also yielded largely unbiased estimates of the true effect and appropriate coverage rates of the respective confidence intervals. However, estimates of the between-study heterogeneity were substantially biased when there was a mismatch between the imputed and the true pre-post correlation. This bias was larger for small sample sizes and when imputing a correlation that was too large as compared to imputing one that was too small.
Third, multivariate meta-analyses exhibited largely unbiased estimates of the true effect in most conditions, albeit at smaller average sample sizes or larger true effects RVE (r = 0.8) was slightly less biased. However, TLM often resulted in overcoverage of the confidence intervals, while RVE held the nominal error rates in most conditions. Importantly, little difference was observed between RVEs that assumed a correlation of 0.5 or 0.8 in the working model. However, the random effects were slightly overestimated at small sample sizes when few primary studies were available, somewhat more strongly so for RVEs assuming larger correlations in the working model. In contrast, TLM resulted in more biased random effects across most conditions.
Fourth, all examined estimators had difficulties holding the nominal error rates for the prediction intervals, thus mirroring previous results for simpler meta-analytic designs that also revealed far too low coverage rates for prediction intervals in most studied conditions.⁷⁴ Although prediction intervals reached close to nominal levels for RVE (r = 0.8) and UMA-I (r = 0.8) in many conditions, they tended to exhibit undercoverage at small between-sample heterogeneity with large pre-post correlations. In contrast, TLM showed substantial undercoverage across most conditions, particularly in the presence of large between-sample heterogeneity. Also, the univariate meta-analyses of individual or pooled correlations did not hold the coverage probabilities. Thus, the generalization of effects is seriously hampered because the estimation of reasonably expected effects in future RCT studies is subject to substantial imprecision.
Finally, a rather robust finding pertained to the effects of posttest variance heterogeneity and the presence of attrition bias. Neither attrition bias nor a rather large variance heterogeneity in the treatment group at the posttest affected the point estimates of fixed and random effects or the respective interval estimates, as long as the meta-analytic estimator incorporated the pretest information.³ In contrast, meta-analyses of posttest effect sizes that ignored the pretest information were affected by both sources of error. Consequently, this estimator yielded substantially biased point estimates and also distorted confidence intervals.
Revisiting the illustrative example on the efficacy of virtual reality exposure therapy for the treatment of anxiety-related disorders,⁷² the simulation results might inform the trustworthiness of the results juxtaposed in Table 3. Given that the empirical meta-analysis approximates the simulation condition with a small average sample size, a medium number of samples, a large true effect, and large between-sample heterogeneity, the larger fixed effect reported by RVE (r = 0.8) seems more plausible than the smaller effects. Moreover, the heterogeneity estimate identified by TLM seems less trustworthy because, in contrast to the other estimators, it systematically underestimates the between-study variance. For RVE or UMA-I, the respective prediction interval might be too small or too wide depending on the unknown pre-post correlation. However, given the poor coverage rates of the prediction intervals for all estimators, these generally need to be interpreted cautiously.

| Recommendations for meta-analytic practice
Even though the true RCT effect and the true pre-post correlation are unknown in practice, a consideration of the simulation results under the examined conditions led us to put forward the following recommendations. If most of the primary studies report sample-specific pre-post correlations (and there are no obvious systematic omissions), a univariate meta-analysis of RCT effect sizes following the two-step approach is recommended: first, a meta-analysis of the available pre-post correlations is conducted, and then the pooled correlation is used for the calculation of the sampling variances of the RCT effect sizes. This approach yielded largely unbiased point estimates of the fixed effect and held the nominal coverage rate for the 95% confidence interval. If only a few or no pre-post correlations are available, we recommend using multivariate meta-analyses of independent group effect sizes with RVE that adopt a large correlation (e.g., r = 0.8) in the working model. In our simulation, this approach yielded largely comparable results to the univariate approach in most conditions and also yielded more precise estimates of the between-sample variance.
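The recommended two-step approach can be sketched as follows. The pooling step uses a simple fixed-effect Fisher's Z average, and the sampling-variance formula is a common large-sample approximation for the pre-post-control design (in the spirit of Morris, 2008); the paper's exact Equations (2) and (5) may differ in detail, so treat this as an illustration only.

```python
import numpy as np

def pool_prepost_correlations(rs, ns):
    """Step 1: inverse-variance pooling of reported pre-post correlations via Fisher's Z."""
    z = np.arctanh(np.asarray(rs, dtype=float))
    w = np.asarray(ns, dtype=float) - 3.0     # var(z) ~ 1 / (n - 3)
    return float(np.tanh(np.sum(w * z) / np.sum(w)))

def rct_sampling_variance(d, n_t, n_c, r):
    """Step 2: approximate sampling variance of the pre-post-control effect size,
    plugging in the pooled correlation r for studies that did not report one."""
    return 2.0 * (1.0 - r) * (n_t + n_c) / (n_t * n_c) + d ** 2 / (2.0 * (n_t + n_c))

r_pooled = pool_prepost_correlations([0.55, 0.70, 0.62], [40, 80, 120])
v = rct_sampling_variance(d=0.4, n_t=30, n_c=30, r=r_pooled)
```

The larger the pooled pre-post correlation, the smaller the first variance component, which illustrates why a mismatched imputed correlation mainly distorts the study weights and the heterogeneity estimate rather than the pooled effect itself.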
Alternatively, a univariate meta-analysis of RCT effects with an imputed pre-post correlation of r = 0.8 might be used, which fared comparably to RVE (r = 0.8). Although none of the estimators resulted in trustworthy prediction intervals, RVE (r = 0.8) or UMA-I (r = 0.8) might be preferred because they resulted in more consistent coverage rates that were less affected by the unknown pre-post correlations. In contrast, univariate meta-analytic approaches with known pre-post correlations or those assuming small pre-post correlations such as UMA-I (r = 0.5) or RVE (r = 0.5) yielded slightly worse coverage rates and, thus, are not recommended. However, the differences between the studied estimators were generally rather small in most conditions. Therefore, the choice of estimator likely has only minor implications for meta-analytic results in practice. Finally, we want to caution against the imprudent use of meta-analyses using posttest effect sizes that ignore pretest information. These can result in substantially biased estimates of fixed effects unless negligible attrition bias and variance homogeneity can be guaranteed.

| Limitations and future directions
Although the aim of the study was the formulation of clear-cut recommendations for the analysis of RCT effect sizes based on a comprehensive simulation study, some weaknesses limit the generalizability of the presented findings and open avenues for future research. First of all, our results only pertain to meta-analyses of standardized mean differences for metric outcomes. Although continuous variables dominate psychological research, clinical studies in particular often also employ dichotomous (or, less frequently, multinomial) outcomes that classify individuals into different groups such as improved versus not improved, recidivistic versus not recidivistic, or symptomatic versus asymptomatic. These results might be similarly synthesized across multiple samples but require different effect sizes (e.g., odds ratios, risk ratios). However, the choice of the effect size might alter recommendations for a meta-analytic estimator.²⁰ Until such recommendations are available, applied researchers are encouraged to conduct sensitivity analyses with different estimators to compare the robustness of the meta-analytic results.
Second, in line with prevalent practice, the compared estimators assumed normally distributed effects, which might not be tenable in some situations, particularly when the number of studies is small. Although point and interval estimates of meta-analyses invoking this normality assumption are quite robust, even when the true effects are severely skewed,⁵⁰,⁷⁵ alternative parametric or mixture distributions might improve the accuracy of the heterogeneity estimates and prediction intervals (see Higgins and colleagues⁷⁶ for a review). Therefore, future research could evaluate the precision of RCT meta-analyses for different distributional assumptions of the between-study effect.
Third, as has been previously shown in the context of univariate meta-analysis and RVE, small-sample corrections are important to estimate precise standard errors and confidence intervals.³⁷,⁷⁷ However, for TLM respective adjustments such as the Kenward-Roger⁷⁸ correction have not yet been thoroughly evaluated and, thus, are hardly used.⁴² To overcome the problematic coverage rates of TLM observed in the present study, we strongly encourage further research on the development of small-sample adjustments for these settings.
Fourth, the precise estimation of between-study heterogeneity is an unresolved challenge in meta-analytic research. Simulation studies showed that in many scenarios the coverage rates of prediction intervals are far too low, particularly for heterogeneous study sample sizes.⁷⁴ Therefore, future research is encouraged to improve prediction intervals for RCT meta-analyses, for example, using a bootstrap approach.⁷⁹

Fifth, although the simulation studies tried to cover a broad range of realistic conditions, empirical data are typically noisier and might not fully match the simulated conditions. For example, we did not specifically evaluate how outliers (i.e., extreme effect sizes), sample attrition, or publication bias might have affected the meta-analytic results. A fruitful extension could also focus on differences between the proposed estimators for the identification of moderating effects.
Finally, our simulation was limited to estimators commonly implemented in standard software that is used by applied researchers. We readily acknowledge that alternative approaches can also account for missing correlations in multivariate meta-analyses. For example, Hong and colleagues⁸⁰ proposed a multivariate RVE model that specifies an overall marginal correlation between dependent outcomes, thus not requiring within-study correlations. However, the reported simulations indicated an unacceptable precision of this approach for meta-analyses with few primary studies (i.e., less than 50), which dominate RCT research. Alternatively, Bayesian methods could be adapted by assuming a distribution for the missing pre-post correlations rather than imputing a constant value.⁸¹

Frequently, ad-hoc solutions are adopted, such as imputing a constant value for missing pre-post correlations, without knowing the consequences for the meta-analytic results. The presented simulation study suggested that imputing a constant correlation of 0.8 might work well for estimating the pooled effect but slightly distorts the between-study heterogeneity. Alternatively, we recommend a multivariate meta-regression approach with RVE that estimates the difference in independent group effect sizes without relying on known pre-post correlations.

Note for Table 2: Presented are values of η² for main effects and two-way interactions of the method factor based on analyses of variance including all possible higher-order interactions up to the order of 3. Univariate meta-analyses of posttest effects were not included. Method = meta-analytic method for the RCT effect, K = number of samples, n = average sample size, ρ = true pre-post correlation, Δ = true change in treatment group, τ_Δ = true between-sample heterogeneity, Φ = true posttest variance in treatment group, Bias = presence of attrition bias, RMSE = root mean squared error, CI = 95% confidence interval, PI = 95% prediction interval.

[Figure 1: Average bias of fixed-effect estimators for a medium treatment effect, homogenous posttest variances, and no attrition bias. Detailed results are given in Tables S1 and S2 of the supplement material.]

[Figure 2: Average bias of fixed-effect estimators for a medium treatment effect, heterogeneous posttest variances, and attrition bias.]

[Figure 3: Average bias of random-effect estimators for a medium treatment effect, homogenous posttest variances, and no attrition bias. Detailed results are given in Tables S3 and S4 of the supplement material.]

Previous research found a median sample size of RCTs on mental health of Mdn = 63, with the 25th and 75th percentiles at 36 and 165. Therefore, the average sample size per effect size (including control and treatment group) was set to either 40, 80, or 120. However, sample sizes in psychological meta-analyses are often positively skewed.
Note for Table 3: Reanalysis of Carl et al.⁷² Based on 14 independent effect sizes with a median sample size of 32. Δ = pooled effect; τ² = between-sample heterogeneity; LB = lower bound; UB = upper bound; width = interval width as UB − LB.