Consequences of multiple imputation of missing standard deviations and sample sizes in meta‐analysis

Abstract Meta-analyses often encounter studies with incompletely reported variance measures (e.g., standard deviations; SDs) or sample sizes (SSs), both of which are needed to conduct weighted meta-analyses. Here, we first present a systematic literature survey on the frequency and treatment of missing data in published ecological meta-analyses, showing that the majority of meta-analyses encountered incompletely reported studies. We then simulated meta-analysis data sets to investigate the performance of 14 options to treat or impute missing SDs and/or SSs. Performance was thereby assessed against results from fully informed weighted analyses on (hypothetically) complete data sets. We show that the omission of incompletely reported studies is not a viable solution. Unweighted and sample-size-based variance approximation can yield unbiased grand means if effect sizes are independent of their corresponding SDs and SSs. The performance of different imputation methods depends on the structure of the meta-analysis data set, especially in the case of correlated effect sizes and standard deviations or sample sizes. In a best-case scenario, which assumes that SDs and/or SSs are both missing at random and are unrelated to effect sizes, our simulations show that the imputation of up to 90% of missing data still yields grand means and confidence intervals that are similar to those obtained with fully informed weighted analyses. We conclude that multiple imputation of missing variance measures and sample sizes could help overcome the problem of incompletely reported primary studies, not only in the field of ecological meta-analyses. Still, caution must be exercised in consideration of potential correlations and patterns of missingness.


KEYWORDS
effect sizes, missing not at random, recommendations, research synthesis, simulated data sets, variance measures

| INTRODUCTION
Research synthesis aims at combining the available evidence on a research question to reach unbiased conclusions. In meta-analyses, individual effect sizes from different studies are summarized in order to obtain a grand mean effect size (hereafter "grand mean") and its corresponding confidence interval. Most of the analyses carried out in meta-analysis and meta-regression depend on inverse-variance weighting, in which individual effect sizes are weighted by the sampling variance of the effect size metric in order to accommodate differences in their precision and to separate within-study sampling error from among-study variation. Unfortunately, meta-analyses in ecology and many other disciplines commonly encounter missing and incompletely reported data in original publications (Parker, Nakagawa, et al., 2016), especially for variance measures. Despite recent calls toward meta-analytical thinking and comprehensive reporting (Gerstner et al., 2017; Hillebrand & Gurevitch, 2013; Zuur & Ieno, 2016), ecological meta-analyses continue to face the issue of unreported variances, especially when older publications are incorporated in the synthesis.
To obtain an overview of the prevalence of missing data in meta-analyses, and to identify how authors of meta-analyses have dealt with it, we first carried out a systematic survey of the ecological literature.
We thereby focused on the most common effect sizes: the standardized mean difference, the logarithm of the ratio of means (hereafter termed log response ratio), and the correlation coefficient (Koricheva & Gurevitch, 2014). Meta-analysts have essentially four options to deal with missing standard deviations (SDs) or sample sizes (SSs).
The first option is to restrict the meta-analysis to only those effect sizes that were reported with all the necessary information and thereby exclude all incompletely reported studies. This option ("complete-case analysis") is the most often applied treatment of missing data in published ecological meta-analyses (see Figure 1). However, at the very least, excluding effect sizes always means losing potentially valuable data. Moreover, if significant findings have a higher chance of being reported completely than nonsignificant results, complete-case analysis would lead to an estimated grand mean that is biased toward significance (i.e., reporting bias or the "file-drawer problem"; Idris & Robertson, 2009; Møller & Jennions, 2001; Parker, Forstmeier, et al., 2016; Rosenthal, 1979). The second option is to disregard the differences in effect size precision and thereby assign equal weights to all effect sizes. This option ("unweighted analysis") has also been frequently applied in meta-analyses of log response ratios (see Figure 1). In the case that no SDs are available but SSs are reported, a third option is to estimate effect size weights from the SS information alone (see Equation 1, where n_t and n_c denote the sample sizes of the treatment and control group, respectively). This "sample-size-weighted analysis" rests on the assumption that effects obtained with larger sample sizes are more precise than those obtained from a low number of replicates. This weighting scheme has only rarely been applied (see Figure 1).
The fourth option is to estimate, that is, impute, missing values on the basis of the reported ones. In order to incorporate the uncertainty of the estimates, those imputations should be repeated multiple times. When each of the imputed data sets is analyzed separately, the obtained results can then be averaged ("pooled") to obtain grand mean estimates and confidence intervals that incorporate the heterogeneity in the imputed values.

$$\mathrm{var}_{\mathrm{approx}} = \frac{n_t + n_c}{n_t \times n_c} \qquad (1)$$

FIGURE 1 Results of our systematic review on ecological meta-analyses and their treatment of missing variances and sample sizes in primary studies, summarized over 505 ecological meta-analyses that were published until 23 March 2018 (cf. Data S1 and Appendix S1)
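As a minimal illustration, Equation 1 translates directly into a small helper for the sample-size-weighted analysis (a sketch; the function name is ours):

```r
# Sketch: sample-size-based variance approximation (Equation 1), used for
# the "sample-size-weighted analysis" when SDs are missing but SSs are
# reported; n_t and n_c are the treatment and control sample sizes.
var_approx <- function(n_t, n_c) {
  (n_t + n_c) / (n_t * n_c)
}

# Inverse-variance weights then follow as 1 / var_approx(n_t, n_c), so
# effect sizes from larger samples receive larger weights.
var_approx(n_t = 20, n_c = 10)  # 0.15, i.e., a weight of ~6.7
```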
Various previous studies have suggested that multiple imputation can yield grand mean estimates that are less biased than those obtained from complete-case analyses (Ellington et al., 2015; Furukawa et al., 2006; Idris et al., 2013; Nakagawa, 2015; Nakagawa & Hauber, 2011). Multiple imputation of missing data can also increase the number of synthesized effect sizes and thereby the precision of the grand mean estimate (Idris & Robertson, 2009). Its performance, however, depends on the pattern of missingness: when the probability that values are unreported depends on the values themselves (e.g., on particularly small SS and large SD values), this is denoted as missing not at random.
Consequently, our second goal was to conduct an evaluation of imputation methods for missing SDs or SSs for the most common effect sizes in ecological meta-analyses (standardized mean differences, log response ratios, and correlation coefficients; Koricheva & Gurevitch, 2014). Previous studies that compared the effects of different imputation methods focused on a limited number of imputation methods and were conducted on published data sets (Ellington et al., 2015; Furukawa et al., 2006; Idris et al., 2013; Idris & Robertson, 2009; Thiessen Philbrook et al., 2007; Wiebe et al., 2006). In order to systematically determine the effects of correlation structures and patterns of missingness on the performance of different imputation methods, we here simulated data sets that harbored four different correlation structures. This allowed us to compare the performance of the 14 options to treat missing SDs and SSs (cf. Table 1). We assessed the performance of those 14 options by comparing the resulting grand means and confidence intervals against the estimates obtained from a fully informed weighted meta-analysis of the very same data sets. With this approach, we provide the currently most complete overview of the most common and easy-to-apply options to treat missing values in meta-analysis data sets.
We aim to show how the treatment, proportion and correlation structure of missing SDs and SSs can drive grand means and their confidence intervals to deviate from the results of fully informed weighted meta-analyses.

| Simulation of missing SDs and/or SSs in meta-analysis data sets
We assessed the effects of 14 options to treat increasing proportions of missing SDs and/or SSs on the grand mean and the corresponding confidence interval.

| Data-generating mechanism
We created two types of meta-analysis data sets. The first type was created to calculate effect sizes that summarize mean differences between control and treatment groups. The second type was created to analyze effect sizes that summarize mean correlation coefficients. Each data set consisted of 100 rows representing 100 hypothetical studies with separate means, SDs, and SSs for the control and treatment groups (for the mean difference data sets) or separate correlation coefficients and SSs (for the correlation coefficient data sets). To reduce random noise and obtain more stable results, we created ten separate mean difference data sets and ten separate correlation coefficient data sets.
TABLE 1 Description of 14 different options to treat missing standard deviations (SDs) and/or sample sizes (SSs) in meta-analysis data sets and the conditions under which we expected those options to yield grand means that differ from the results that would be obtained with fully informed weighted meta-analyses (MCAR: missing completely at random; MAR: missing at random; MNAR+C: missing not at random and SDs/SSs correlated to effect sizes)

(1) Complete-case meta-analysis. Omits incompletely reported effect sizes, due to which grand mean estimates are expected to exhibit lower precision, that is, larger confidence intervals. Expected to deviate from fully informed weighted meta-analyses when missing values are not MCAR.

(2) Unweighted meta-analysis (Pinheiro et al., 2018). Assigns equal weights to all effect sizes (with reported SSs), disregarding the differences in their precision. Expected to deviate when effect sizes are related to effect size precision.

(3)-(14) Multiple imputation methods. The following imputation techniques are applied multiple times to yield separate imputed data sets with separate grand mean estimates, which are pooled to obtain meta-analysis estimates that incorporate the uncertainty in the imputed values (illustrated in Figure 2). Thereby, SDs and SSs with missing values were treated as dependent variables; SDs and SSs with complete data, as well as mean values and correlation coefficients, were treated as predictor variables.

Mean difference data sets were created with the following data-generating mechanism. Mean values for the control groups were randomly drawn from a truncated normal distribution with mean = 1, SD = 0.25, and lower limit = 0.001. Mean values for the treatment groups were randomly drawn from a truncated normal distribution with mean = 2, SD = 0.5, and lower limit = 0.001. SD values for the control groups were randomly drawn from a truncated normal distribution with mean = 0.25, SD = 0.125, lower limit = 0.01, and upper limit = 1. SD values for the treatment groups were randomly drawn from a truncated normal distribution with mean = 0.5, SD = 0.25, lower limit = 0.01, and upper limit = 1. SS values for the control and the treatment groups were both drawn from a truncated Poisson distribution with λ = 10 and lower limit = 5. Correlation coefficient data sets were created analogously: correlation coefficient values were drawn from a truncated normal distribution with mean = 0.5, SD = 0.125, lower limit = −1, and upper limit = 1, and SS values were drawn from a truncated Poisson distribution with λ = 10 and lower limit = 5.
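As an illustration, the following sketch generates one mean-difference data set under these distributions; it assumes the truncnorm package for the truncated normal draws, and the truncated-Poisson helper is our own (the authors' original simulation code is available from the GitHub repository linked in the Discussion).

```r
# Sketch: generate one mean-difference data set of 100 hypothetical studies.
library(truncnorm)  # provides rtruncnorm(n, a, b, mean, sd)

# Rejection sampler for a Poisson distribution truncated from below.
rtruncpois <- function(n, lambda, lower) {
  out <- numeric(0)
  while (length(out) < n) {
    x <- rpois(n, lambda)
    out <- c(out, x[x >= lower])
  }
  out[seq_len(n)]
}

n_studies <- 100
dat <- data.frame(
  mean_c = rtruncnorm(n_studies, a = 0.001, mean = 1, sd = 0.25),
  mean_t = rtruncnorm(n_studies, a = 0.001, mean = 2, sd = 0.5),
  sd_c   = rtruncnorm(n_studies, a = 0.01, b = 1, mean = 0.25, sd = 0.125),
  sd_t   = rtruncnorm(n_studies, a = 0.01, b = 1, mean = 0.5,  sd = 0.25),
  n_c    = rtruncpois(n_studies, lambda = 10, lower = 5),
  n_t    = rtruncpois(n_studies, lambda = 10, lower = 5)
)
```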
In all data sets, we simulated missing data by either randomly or nonrandomly deleting between 10% and 90% of the SDs, SSs, or both in the mean difference data sets and between 10% and 90% of the SSs in the correlation coefficient data sets (in steps of 5%). Within each data set row, we thereby deleted the SDs in both the control and the treatment group, and we independently deleted the SSs in both the control and the treatment group. With these deletions, we constructed the following four deletion/correlation scenarios, visualized in Appendix S2 (a code sketch of the deletion step follows this list):

a. SDs and/or SSs were deleted completely at random (MCAR, missing completely at random), and there were no correlations in the data sets.
b. The chance of deleting SDs and/or SSs depended on other, completely reported variables in the data sets (MAR, missing at random).
c. The chance of deleting SDs and/or SSs increased with increasing SDs and decreasing SSs (MNAR, missing not at random). For this purpose, we ranked the summed SDs (sd_t + sd_c) in increasing order (corresponding to a lower precision) and ranked the summed SSs (n_t + n_c) in decreasing order.
d. Effect size values were paired with effect size precision (i.e., sorted so that larger effect sizes had smaller SDs and larger SSs), while SDs and/or SSs were deleted completely at random (corMCAR). This hypothetical scenario might occur in meta-analyses across different study designs that impact both the obtained effect size and its precision (e.g., due to the different possibilities to account for additional drivers of effect sizes in experimental versus observational studies).
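As a minimal illustration of the deletion step, the following sketch implements scenario (a) for SDs; scenarios (b)-(d) additionally rank or sort the rows before deletion. Here, dat is a mean-difference data set as generated above.

```r
# Sketch: delete a given proportion of SD pairs completely at random (MCAR);
# within each selected row, both the control and the treatment SD are removed.
delete_sds_mcar <- function(dat, prop) {
  drop <- sample(nrow(dat), size = round(prop * nrow(dat)))
  dat$sd_t[drop] <- NA
  dat$sd_c[drop] <- NA
  dat
}

dat_miss <- delete_sds_mcar(dat, prop = 0.5)  # e.g., 50% of SDs missing
```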
In total, we created 2,560 data sets: four deletion/correlation scenarios, four types of deleted data (SDs, SSs, or both for mean difference data sets and only SSs for correlation coefficient data sets), 10 randomly generated data sets and 16 deletion steps (10%-90% of values deleted).


| Handling of missing data
To each of the 2,560 data sets, we separately applied one of the outlined 14 options to handle missing SDs and/or SSs in meta-analysis data sets (Table 1). For the sample-size-weighted meta-analysis, we assigned approximate variance measures to each effect size according to Equation 1. Our general workflow to fill missing values via multiple imputation is illustrated in Figure 2. Following Ellington et al. (2015), we repeated all imputation methods 100 times (thus "multiple imputation") to obtain 100 imputed data sets (a minimal sketch of this loop is given below). We generally restricted imputed SDs to range between 0.01 and 1 and imputed SSs to be ≥5. Those restrictions were applied to prevent implausible (e.g., negative) imputations and to guarantee convergence of the subsequent linear mixed-effects models. Data were imputed in the following order: SDs of the treatment group, SDs of the control group, SSs of the treatment group, and SSs of the control group. Changing this imputation sequence had virtually no effect on the results. The resulting 100 grand means and confidence intervals were averaged according to Rubin's rules (Rubin, 1987) in order to obtain single estimates, which were then compared with the results of an analysis of the complete data set (i.e., without missing values).
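The sketch below uses random-sample imputation as a stand-in for any one of the imputation options (3)-(14), with dat_miss from above; the clipping mirrors the restrictions described in the text.

```r
# Sketch: create m = 100 imputed copies of a data set with missing SDs.
impute_once <- function(dat) {
  for (col in c("sd_t", "sd_c")) {  # imputation order: treatment, then control
    miss  <- is.na(dat[[col]])
    draws <- sample(dat[[col]][!miss], sum(miss), replace = TRUE)
    dat[[col]][miss] <- pmin(pmax(draws, 0.01), 1)  # restrict SDs to [0.01, 1]
  }
  dat
}

imputed_sets <- lapply(1:100, function(i) impute_once(dat_miss))
```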

| Effect sizes
After applying the outlined 14 options to handle missing SDs and/or SSs (Table 1), we calculated the three most prominent effect size measures in ecological meta-analyses, together with their respective variance estimates where possible and necessary. With the mean difference data sets, we calculated the small-sample bias-corrected log response ratio (Lajeunesse, 2015; hereafter log response ratio) and Hedges' d. With the correlation coefficient data sets, we calculated Fisher's z (see Appendix S2 for the equations applied).
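For reference, the standard forms of these effect sizes and their sampling variances are (a reconstruction from the cited literature; the exact equations applied are given in Appendix S2):

$$
\begin{aligned}
\mathrm{lnRR}^{\Delta} &= \ln\!\frac{\bar{x}_t}{\bar{x}_c} + \frac{1}{2}\left(\frac{sd_t^2}{n_t\,\bar{x}_t^2} - \frac{sd_c^2}{n_c\,\bar{x}_c^2}\right),
& \mathrm{var} &\approx \frac{sd_t^2}{n_t\,\bar{x}_t^2} + \frac{sd_c^2}{n_c\,\bar{x}_c^2},\\
d &= J\,\frac{\bar{x}_t - \bar{x}_c}{s_{\mathrm{pooled}}}, \quad J = 1 - \frac{3}{4(n_t + n_c - 2) - 1},
& \mathrm{var} &\approx \frac{n_t + n_c}{n_t\,n_c} + \frac{d^2}{2(n_t + n_c)},\\
z &= \frac{1}{2}\ln\!\frac{1 + r}{1 - r},
& \mathrm{var} &= \frac{1}{n - 3}.
\end{aligned}
$$

Note that the variances of the log response ratio and of Hedges' d require both SDs and SSs, whereas the variance of Fisher's z requires only the SS; accordingly, only SSs could be missing in the correlation coefficient data sets.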

| Grand mean estimates
For every data set (including complete, unweighted, approximately weighted, and imputed data sets), we calculated the grand mean effect size and its corresponding approximated 95% confidence interval with a linear intercept-only mixed-effects model. Thereby, the effect size from each data set row was assigned a random effect and weighted by the inverse of its corresponding or approximated variance estimate (rma function in the metafor package; Viechtbauer, 2010). For every imputation method and every percentage of missing SDs and/or SSs, the resulting 100 grand mean and 95% confidence interval estimates were averaged under consideration of the uncertainty that arose from the multiple imputations (using Rubin's rules; Rubin, 1987), as sketched below.
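A minimal sketch of this step, fitting a random-effects model to each imputed data set and pooling the grand means via Rubin's rules (the log response ratio is computed here without the small-sample correction for brevity):

```r
# Sketch: fit an intercept-only random-effects model per imputed data set
# and pool the 100 grand means and their variances via Rubin's rules.
library(metafor)

fits <- lapply(imputed_sets, function(d) {
  es <- escalc(measure = "ROM",  # log ratio of means (log response ratio)
               m1i = mean_t, m2i = mean_c, sd1i = sd_t, sd2i = sd_c,
               n1i = n_t, n2i = n_c, data = d)
  rma(yi, vi, data = es)         # inverse-variance weighted random-effects model
})

m    <- length(fits)
est  <- sapply(fits, coef)                      # grand mean per imputation
qbar <- mean(est)                               # pooled grand mean
wvar <- mean(sapply(fits, function(f) f$se^2))  # within-imputation variance
bvar <- var(est)                                # between-imputation variance
tvar <- wvar + (1 + 1/m) * bvar                 # total variance (Rubin, 1987)
ci   <- qbar + c(-1.96, 1.96) * sqrt(tvar)      # normal-approximation 95% CI
```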

| Performance measures
We evaluated the effects of the different options to handle missing SDs and/or SSs in terms of the obtained grand mean and the width of the corresponding 95% confidence interval against reference values obtained with a weighted meta-analysis on the complete data sets (hereafter fully informed weighted meta-analysis). Deviation in the grand mean was quantified as the obtained grand mean estimate minus the estimate from the fully informed weighted analysis.
Deviation in the confidence interval was quantified as the obtained width of the confidence interval minus the width from a fully informed weighted analysis. We then graphically summarized the trends in the grand mean and confidence interval from using different options to handle increasing proportions of missing SDs and/or SSs. We refrained from using summary performance measures, such as the root-mean-square error, to compare the different options because we aimed at demonstrating general and nonlinear trends; moreover, since some of the imputation models failed to converge above a threshold of ca. 60% of missing data, such measures would have been infeasible above this threshold.
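These deviations could be computed as in the following sketch, where fit_full denotes the fully informed weighted model of the complete data set, and qbar and ci are the pooled estimates from above (the object names are ours):

```r
# Sketch: deviations from the fully informed weighted meta-analysis.
dev_grand_mean <- qbar - coef(fit_full)                         # grand mean deviation
dev_ci_width   <- diff(ci) - (fit_full$ci.ub - fit_full$ci.lb)  # CI width deviation
```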
All analyses were conducted in R (R Core Team, 2018) using ggplot2 for graphical representations (Wickham, 2009).

| Systematic literature survey
In the compiled data set of 505 published ecological meta-analyses, 35% used log response ratios, 36% used standardized mean differences, 24% used correlation coefficients, and 5% used a combination of these effect size measures.

| Visualization of the simulation results
Figures 3-6 visualize the simulation results for the four deletion/correlation scenarios (MCAR, MAR, MNAR, and corMCAR, respectively).

| Exploration of simulation results
A summary of the findings regarding the effects of the different options to handle missing SDs and/or SSs in meta-analysis data sets is given in Table 2. Compared to all other imputation methods, mean, median, and random sample imputation yielded the largest deviation in the grand mean estimate, and Bayes predictive mean matching yielded the largest increase in the confidence interval. Imputation via bootstrap expectation maximization and via additive regression and bootstrap predictive mean matching frequently failed above a threshold of ca. 60% of missing data.

| DISCUSSION
Missing variance measures are a prevalent problem in research synthesis (Gurevitch et al., 2018). Yet, few ecological meta-analyses have adopted imputation algorithms to handle missing values (Figure 1).

FIGURE 3 Effects of imputing SDs and SSs that are missing completely at random (MCAR) on the grand mean (colored line) and confidence interval (shaded area) with respect to the results of fully informed weighted meta-analyses. Rows show results for the 14 methods to treat missing values (cf. Table 1). Columns show results for the log response ratio, Hedges' d, and Fisher's z effect sizes with 10% (top) up to 90% (bottom) of standard deviations (SDs) and/or sample sizes (SSs) removed. Each panel shows the deviation of the grand mean and its approximated 95% confidence interval (divided by two for better visibility) from the results obtained with a fully informed weighted meta-analysis. Deviations to the left indicate lower values and deviations to the right indicate higher values.

Our study demonstrates how the omission of incompletely reported studies (complete-case analysis) generally widens the confidence intervals and how it results in deviating (potentially even biased) grand mean estimates if SDs/SSs are not missing completely at random. The R code used to simulate and compare the effects of different meta-analysis data set structures, patterns of missingness, and options to handle missing data is freely available at github.com/StephanKambach/SimulateMissingDataInMeta-Analyses. Although our number of ten replicates is at the lower end of the desired number of replications in simulation studies (Morris et al., 2019), it was sufficient to show the general effects of treating missing SDs and SSs in meta-analysis data sets.
In accordance with previous publications (Morrissey, 2016; Nakagawa & Lagisz, 2016), we found that unweighted analyses yielded grand mean estimates that were unbiased with regard to fully informed weighted analyses as long as effect sizes and their corresponding variance estimates were normally and independently distributed. The same holds for sample-size-approximated effect size variances. In the case of a potential relationship between effect sizes and effect size precision (e.g., due to different study designs), we advise applying imputation methods to fill missing SDs and/or SSs.
If SDs and/or SSs are both MCAR and unrelated to effect sizes, the imputation of up to 90% of missing data yielded grand means similar to those obtained from fully informed weighted meta-analyses. Below a threshold of ca. 50%-60% of missing SDs and/or SSs, imputation methods performed equally well as, or outperformed, complete-case, unweighted, and sample-size-weighted analyses. Yet, our results also demonstrated that different imputation methods can accommodate different data set structures regarding missingness and correlation patterns. Mean, median, and random sample imputations are easy to implement but biased in the case of a relationship between effect sizes and effect size precision. Methods applying predictive mean matching tend to suit such relationships but tend to yield larger confidence intervals of the grand mean.
Thus, for any meta-analysis, the method used to deal with missing SDs and/or SSs should be chosen under the following considerations.

| The effect size measure
The calculation of the small-sample bias-corrected log response ratio

| Relationships between effect sizes and SDs
Imputation methods that applied a predictive model (i.e., all except mean, median, and random sample value imputation) could account for a relationship between effect sizes and effect size precision. In the case of such a relationship, the algorithms that used predictive mean matching tended to yield grand means that were most similar to the results from fully informed weighted analyses. In the case of correlated effect sizes and SSs in the Fisher's z data sets, the imputation of missing data via mean, median, random sample, and nonparametric random forest imputation introduced a stronger deviation of the grand mean than the omission of those incompletely reported studies.

| Summary
Multiple imputation of missing variance measures can be expected to become a standard feature to increase the quality and trustworthiness of future meta-analyses, as advocated by Gurevitch et al. (2018) and Nakagawa et al. (2017). Our results clearly show that complete-case and unweighted analyses, although frequently applied, can potentially lead to deviations in the grand means and thus biased conclusions; they should therefore be replaced with, or at least compared to, the results of multiple imputation analyses.
The same imputation methods might also be applied to re-evaluate the robustness of already published meta-analyses.
With our simulation study, we aim to raise awareness of the problem of incompletely reported study results (Gerstner et al., 2017; Parker, Nakagawa, et al., 2016) and their frequent omission in ecological meta-analyses. Our results discourage the use of complete-case, unweighted, and sample-size-weighted meta-analyses since all three options can result in deviations of the grand means and confidence intervals. Even in the absence of valid predictors for the imputation of missing SDs or SSs, their imputation has the advantage of including all incompletely reported effect sizes while at the same time preserving the weights of the completely reported ones.
In summary, our study provides compelling evidence that future meta-analyses would benefit from a routine application of imputation algorithms to fill unreported SDs and SSs in order to increase both the number of synthesized effect sizes and the validity of the derived grand mean estimates. The provided R-script number three

FIGURE 6 Effects of imputing SDs and SSs that are correlated with effect sizes and missing completely at random (corMCAR) on the grand mean (colored line) and confidence interval (shaded area) with respect to the results of fully informed weighted meta-analyses. Rows show results for the 14 methods to treat missing values (cf. Table 1). Columns show results for the log response ratio, Hedges' d, and Fisher's z effect sizes with 10% (top) up to 90% (bottom) of standard deviations (SDs) and/or sample sizes (SSs) removed. Each panel shows the deviation of the grand mean and its approximated 95% confidence interval (divided by two for better visibility) from the results obtained with a fully informed weighted meta-analysis. Deviations to the left indicate lower values and deviations to the right indicate higher values.

TABLE 2 Summary of the observed effects of the outlined 14 options to treat missing standard deviations (SDs) and/or sample sizes (SSs) on the estimated grand means and confidence intervals in comparison to the results from fully informed weighted meta-analyses in four simulated data sets with different patterns of missingness and correlation structures (MCAR: missing completely at random; MAR: missing at random; MNAR: missing not at random; corMCAR: SDs/SSs correlated to effect sizes and missing completely at random)