Addressing the estimation of standard errors in fixed effects meta‐analysis

Standard methods for fixed effects meta‐analysis assume that standard errors for study‐specific estimates are known, not estimated. While the impact of this simplifying assumption has been shown in a few special cases, its general impact is not well understood, nor are general‐purpose tools available for inference under more realistic assumptions. In this paper, we aim to elucidate the impact of using estimated standard errors in fixed effects meta‐analysis, showing why it does not go away in large samples and quantifying how badly miscalibrated standard inference will be if it is ignored. We also show the important role of a particular measure of heterogeneity in this miscalibration. These developments lead to confidence intervals for fixed effects meta‐analysis with improved performance for both location and scale parameters.

heterogeneity around this parameter. In Section 4, we present intuitive and formal arguments for why the impact of standard error estimation does not go away with larger sample sizes and how this impact depends on the underlying heterogeneity. We present simulation results comparing several confidence intervals for the precision weighted average that allow for estimation of standard errors, and for a related measure of heterogeneity. Finally, in Section 5, we give an applied example to illustrate and compare the different approaches to meta-analysis and conclude with a discussion in Section 6.

REVIEW OF APPROACHES TO META-ANALYSIS
In this section, we describe 3 different approaches to meta-analysis, in which different assumptions are made about the underlying true effect size parameters in the studies. Table 1 provides a summary of these approaches, and we subsequently present further details on the precision weighted average (Section 2.1), testing and quantifying heterogeneity (Section 2.2), and random effects analysis (Section 2.3).
The first approach is the fixed effect (singular) meta-analysis, also called the common effect meta-analysis. 4 This approach is based on the assumption of a single, common effect underlying all studies. 6 Under this simplifying assumption that all study effects are identical, the average effect is equivalent to the common effect size estimated in each study. Although commonly used, this method has often been judged inadequate in practice, as effects from different studies are expected to differ given the variability in study design, population, interventions, etc. [7][8][9] A second approach is the fixed effects (plural) meta-analysis, based on the assumption that the effects underlying the studies at hand are unknown, but fixed, and not necessarily identical. 10,11 Using the fixed effects approach, it is common to estimate the inverse-variance weighted average of the studies' effect sizes, 4 but estimation of other weighted averages is also possible. 11,12 As recently discussed by Rice et al, 4 the inverse-variance weighted average estimates a reasonable and interpretable parameter, even when the effect sizes are assumed to be different, but it may be a somewhat incomplete summary of the effect sizes if they are too heterogeneous. 13 The third approach is the random effects meta-analysis, where the effect-size parameters are considered to be a random sample from a population, ie, they follow a probability distribution. 14 By using random effects as a sampling model, this analysis allows the estimation of the average effect size in the population of effect sizes one might ever have observed. 14 (Details are given in Section 2.3.) This method not only takes into account the heterogeneity between studies but also provides a natural way of quantifying it, 15 making it a more attractive choice over the common and fixed effects approaches. 
7,8 On the other hand, as pointed out by Higgins et al, 15 this approach is based on the construct of a hypothetical population of studies or study effects, so the interpretation of the analysis is potentially unclear and confusing. The relevance of random effects analyses that focus on the mean of a population of study effects has been questioned. 16 An alternative derivation for the random effects approach motivates the distribution of effect sizes not as a sampling distribution, but as arising from a priori exchangeability in a Bayesian analysis, or an approximately Bayesian analysis, as noted in Higgins et al. 15
In each of the 3 approaches described, appeals to some form of frequentist optimality can be made. In the common effect approach, when the study-specific standard errors are known precisely, the optimality is straightforward; without any distributional assumptions, the inverse-variance weighted estimator provides the best linear unbiased estimator of the common effect, or the unique minimum variance unbiased estimator under the further assumption of normality of effects estimates. 6 When the study-specific standard errors must be estimated, a normal approximation based on the asymptotic distribution of the estimator is commonly used 17, chapter 6; in the common situation where all the studies are large, the standard errors are known with great accuracy, and any nonasymptotic inefficiency is extremely minor. For the fixed effects approach, it has been shown by Lin and Zeng 18 that the analysis provides, in many situations, a statistically efficient estimate of the parameter that would be estimated, were it possible to pool the data across studies and to perform a single regression analysis that adjusts for study. However, this pooling is inherently somewhat hypothetical; were it possible to do it, there would often be little motivation for use of meta-analysis, and so it may not always be obvious that this parameter is of direct interest.
TABLE 1 Summary of the 3 approaches to meta-analysis (table body lost in extraction; surviving entries of the "Heterogeneity" row: "Not present", "Hypothesis test based on Q", "Not evaluated", and an estimate of the form $\hat\tau^2 = \max\{\ldots\}$)
The random effects approach has perhaps the least direct connection to optimality. While likelihood-based and fully Bayesian methods in general have guarantees of good large-sample properties under correct model assumptions, 19,20 there are no such guarantees in finite samples or when the model is misspecified. Indeed, the finite-sample sensitivity of Bayesian random effects meta-analysis to choice of priors is well documented [21][22][23][24] and is a cause for concern in practice. 25, chapter 5

The precision weighted average
Let $\beta_1, \beta_2, \ldots, \beta_k$ be the true effect sizes from $k$ different studies and let $\hat\beta_i$ be the estimate of the true effect $\beta_i$, with corresponding standard error $\sigma_i$, which we assume known for now. The precision weighted average or inverse-variance weighted average of the true effect sizes is
$$\beta_F = \frac{\sum_{i=1}^k \sigma_i^{-2}\,\beta_i}{\sum_{i=1}^k \sigma_i^{-2}}. \tag{1}$$
This parameter is a quantity of interest in either common effect or fixed effects meta-analysis; under the common effect model, $\beta_F$ reduces to the common effect $\beta_0$ seen in Table 1; under the fixed effects model, $\beta_F$ is a weighted average of the effect sizes $\beta_i$, where the weight is proportional to the precision with which each effect size can be estimated, giving more weight to those that can be estimated more precisely. If the study-specific standard errors $\sigma_i$ are assumed to be known, a natural estimator of $\beta_F$ is given by
$$\hat\beta_F = \frac{\sum_{i=1}^k \sigma_i^{-2}\,\hat\beta_i}{\sum_{i=1}^k \sigma_i^{-2}}, \qquad \mathrm{Var}[\hat\beta_F] = \Big(\sum_{i=1}^k \sigma_i^{-2}\Big)^{-1}, \tag{2}$$
where the variance expression holds for independent studies. Optimality of $\hat\beta_F$ under a common effect approach has already been mentioned. In fixed effects meta-analysis with known standard errors $\sigma_i$, $\hat\beta_F$ directly inherits any efficiency properties from the $\hat\beta_i$'s, as would any linear combination of the effect-size estimates. This means that $\hat\beta_F$ is an unbiased, efficient, and/or normally distributed estimator of $\beta_F$, if within each study, the estimator $\hat\beta_i$ can be assumed to be an unbiased, efficient, and/or normally distributed estimator of $\beta_i$. Confidence intervals for $\beta_F$ are usually derived from a normal approximation, appealing to the large sample properties of the study estimators. Transformations of the outcome measure have also been recommended, such as normalizations, log transformations, bias corrections, 17 and/or variance stabilizing transformations. 26,27 The small sample properties of the normal approximation and sensitivity to the assumption of known variances have been studied through simulation studies, 17,28,29 and some corrections and tests based on more robust test statistics have been proposed. 5,29
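As a small numerical illustration of the estimator with known standard errors, the following sketch computes the precision weighted average, its standard error, and a normal-approximation 95% interval. The function name and the three-study data are hypothetical, chosen only for illustration; independence of the studies is assumed.

```python
import math

def precision_weighted_average(estimates, std_errors):
    """Inverse-variance weighted average and its standard error.

    Assumes independent studies whose standard errors are treated as known.
    """
    weights = [1.0 / se ** 2 for se in std_errors]  # precisions
    total = sum(weights)
    estimate = sum(w * b for w, b in zip(weights, estimates)) / total
    return estimate, math.sqrt(1.0 / total)

# Hypothetical effect estimates and standard errors from three studies.
beta_hat = [0.30, -0.10, 0.75]
se = [0.10, 0.20, 0.25]

est, est_se = precision_weighted_average(beta_hat, se)
lower, upper = est - 1.96 * est_se, est + 1.96 * est_se
print(f"estimate = {est:.4f}, SE = {est_se:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

The most precise study (standard error 0.10) receives 100/141 of the total weight, so the pooled estimate sits close to its value of 0.30.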

Testing homogeneity and quantifying heterogeneity
In both fixed effects and common effect work, it is common to test homogeneity of the study effects, that is, to test the null hypothesis $H_0: \beta_1 = \beta_2 = \cdots = \beta_k$, against the general alternative of heterogeneity, where some $\beta_i$ are not equal. In common-effect meta-analysis, this test assesses a key modeling assumption, while in the fixed effects analysis, the test simply gives a statistical measure of how much heterogeneity is present.
When assessing homogeneity, a commonly used test statistic is
$$Q = \sum_{i=1}^k \frac{(\hat\beta_i - \hat\beta_F)^2}{\sigma_i^2}.$$
Under normality of the effect estimates ($\hat\beta_i \sim N(\beta_i, \sigma_i^2)$), $Q$ is distributed noncentral chi-squared with $k - 1$ degrees of freedom and noncentrality parameter
$$\lambda = \sum_{i=1}^k \frac{(\beta_i - \beta_F)^2}{\sigma_i^2},$$
with $\beta_F$ as in (1). $Q$ is independent of the $\hat\beta_F$ statistic, 4 which may simplify its interpretation.
Under the null hypothesis that all the effects are identical, the $Q$ statistic is distributed central chi-squared with $k - 1$ degrees of freedom, thus providing a reference distribution to perform a test of homogeneity of effects. However, it has also been found that this test of homogeneity has low power when there are few studies 30 and is not adequate to summarize the extent of the heterogeneity present. 31 Other statistics have been proposed to not only test but also evaluate the impact of the observed heterogeneity, and thus provide a better measure of the consistency between trials. 31 Although these measures have been motivated and derived from a random effects framework, they still have a valid interpretation under a fixed effects framework. 4 The most frequently used of these quantities is $I^2$, which can be calculated as
$$I^2 = \max\Big\{0,\; \frac{Q - (k-1)}{Q}\Big\} \times 100\%,$$
and is interpreted as "the percentage of total variation across studies that is due to heterogeneity rather than chance." 32
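The homogeneity statistic and the derived percentage measure can be computed in a few lines. The sketch below is illustrative only: the function name and the three-study data are hypothetical, and the standard errors are treated as known.

```python
def q_and_i2(estimates, std_errors):
    """Cochran-type Q statistic, its degrees of freedom, and I^2 (in percent)."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    # Precision-weighted squared deviations from the pooled estimate.
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, estimates))
    df = len(estimates) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, df, i2

# Hypothetical effect estimates and standard errors from three studies.
beta_hat = [0.30, -0.10, 0.75]
se = [0.10, 0.20, 0.25]

q, df, i2 = q_and_i2(beta_hat, se)
print(f"Q = {q:.3f} on {df} df, I^2 = {i2:.1f}%")
```

Under homogeneity Q would be compared with a central chi-squared distribution on `df` degrees of freedom; truncating at zero keeps the percentage measure non-negative when Q falls below its null expectation.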

Inference in random effects meta-analysis
Random effects meta-analysis is based on the assumption that the true study effects $\beta_1, \beta_2, \ldots, \beta_k$ are an independent and identically distributed sample from some distribution. The inference is then focused on the parameters of this distribution, typically its mean ($\mu$) and variance ($\tau^2$). With no further assumptions on the distribution of the random effects, an inverse-variance weighted average estimate of $\mu$ can be obtained, 14,15 along with an estimate of its standard error:
$$\hat\mu = \frac{\sum_{i=1}^k \hat w_i\,\hat\beta_i}{\sum_{i=1}^k \hat w_i}, \qquad \widehat{\mathrm{SE}}[\hat\mu] = \Big(\sum_{i=1}^k \hat w_i\Big)^{-1/2}, \qquad \hat w_i = \frac{1}{\sigma_i^2 + \hat\tau^2}. \tag{3}$$
The weights here involve both the within-study variance $\sigma_i^2$ and the heterogeneity (or between-studies) variance $\tau^2$, for which a moment-based estimator is
$$\hat\tau^2 = \max\Big\{0,\; \frac{Q - (k-1)}{\sum_i \sigma_i^{-2} - \sum_i \sigma_i^{-4}\big/\sum_i \sigma_i^{-2}}\Big\}.$$
As given here, $\hat\mu$ and $\hat\tau^2$ are known as the DerSimonian-Laird estimator for random effects meta-analysis. 14 Other similar moment-based estimators have been proposed. 33,34 Under the further assumption that the study effects follow a normal distribution, maximum likelihood 35,36 and restricted maximum likelihood (REML) 37,38 methods can be used to obtain estimates of $\tau^2$ and $\mu$. Although these methods are iterative and do not provide closed form estimates, it should be noted that both the maximum likelihood and REML estimators of $\mu$ take the same form as in (3). A simpler, noniterative method for estimating $\tau^2$ has recently been proposed 39 and is also based on the assumption of a normal distribution of the study effects. The performance of the different estimation methods has been evaluated and compared, in terms of bias and efficiency, 34 as well as coverage probability. 40
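For comparison with the fixed effects calculations, the moment-based random effects estimate can be sketched as follows. The function name and data are hypothetical; the code implements the usual DerSimonian-Laird moment estimator with the between-studies variance truncated at zero.

```python
import math

def dersimonian_laird(estimates, std_errors):
    """DerSimonian-Laird estimate of the mean effect, its SE, and tau^2."""
    k = len(estimates)
    w = [1.0 / se ** 2 for se in std_errors]
    sw = sum(w)
    fixed = sum(wi * b for wi, b in zip(w, estimates)) / sw
    q = sum(wi * (b - fixed) ** 2 for wi, b in zip(w, estimates))
    # Moment estimator of the between-studies variance, truncated at zero.
    denom = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / denom)
    # Random effects weights add tau^2 to each within-study variance.
    w_re = [1.0 / (se ** 2 + tau2) for se in std_errors]
    mu = sum(wi * b for wi, b in zip(w_re, estimates)) / sum(w_re)
    return mu, math.sqrt(1.0 / sum(w_re)), tau2

# Hypothetical effect estimates and standard errors from three studies.
beta_hat = [0.30, -0.10, 0.75]
se = [0.10, 0.20, 0.25]

mu_hat, mu_se, tau2_hat = dersimonian_laird(beta_hat, se)
print(f"mu = {mu_hat:.4f}, SE = {mu_se:.4f}, tau^2 = {tau2_hat:.4f}")
```

Whenever the estimated between-studies variance is positive, the random effects weights are more nearly equal than the fixed effects weights and the reported standard error is larger.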

UNDERSTANDING HOW ESTIMATION OF STANDARD ERRORS AFFECTS FIXED EFFECTS META-ANALYSIS
In Equation 1 of Section 2.1, we saw how the underlying parameter estimated by fixed effects meta-analysis is typically defined, in terms of standard errors. When the standard errors are not known but only estimated, this leaves the target of this analysis without a full definition. In Section 3.1, we provide a more concrete motivation, showing how the parameter estimated is optimal for inference, in a certain sense. The impact of estimated standard errors on this inference is explored in Section 3.2, and we see how this motivates the study of a particular scale parameter, describing heterogeneity, in Section 3.3.

A location parameter for optimal estimation
Ideally, the parameters to which inference is targeted should be determined entirely by scientific criteria, ie, by research goals. But in practice these goals may not be known precisely enough to determine a single parameter for inference. In this situation, it makes sense to use statistical criteria to choose from among parameters that meet general research goals. In meta-analysis, where the general goal is to summarize study effects $\beta_i$ by some form of average, we choose to pick the affine combination (ie, the weighted average) of the $\beta_i$ that can be most precisely estimated. This can also be stated as selecting the parameter for which the data provide the most information.
The main result here follows from a more general lemma (Lemma 1), proved in Appendix A. In fixed effects meta-analysis, where the studies are independent, the covariance matrix of $\hat\beta = (\hat\beta_1, \ldots, \hat\beta_k)$ reduces to a diagonal matrix: $\Sigma = \mathrm{diag}\{\sigma_i^2\}$. From Lemma 1, and assuming that $\sigma_i^2$ is known exactly from each study, the best affine combination of the effect-size parameters is the precision weighted average of the effect-size parameters.
In the situation where the $\sigma_i^2$ are assumed known, the corresponding estimate $\hat\beta_F$ can be easily constructed, and used as described in Section 2.1.
But the same optimality of $\beta_F$ holds even when the $\sigma_i$ are not known. To show this formally, we express $\sigma_i^2$ as
$$\sigma_i^2 = \frac{1}{n_i\phi_i} = \frac{1}{N\eta_i\phi_i},$$
where $n_i$ and $\phi_i$ are the sample size and the Fisher information from each subject on $\beta_i$, respectively, in the $i$th study, $N$ is the total sample size across all studies, and $\eta_i = n_i/N$ is the proportion of the total sample drawn from study $i$. Then formally, under the asymptotic regime where the $\eta_i$ are fixed when we consider larger $N$ (ie, the same assumptions as in the earlier work of Lin and Zeng, 18 and indeed most asymptotic work), the limiting value of the covariance matrix is $\Sigma = N^{-1}\mathrm{diag}\{(\eta_i\phi_i)^{-1}\}$, and canceling terms in $N$, we find
$$\beta_F = \frac{\sum_{i=1}^k \eta_i\phi_i\,\beta_i}{\sum_{i=1}^k \eta_i\phi_i}. \tag{4}$$
This shows that, without further assumptions, in large samples, $\beta_F$ is the weighted average of the $\beta_i$ parameters that can be most precisely estimated. When the true standard errors $\sigma_i$ are not known but instead estimated by $s_i$, $\beta_F$ can be consistently estimated by a "plug-in" version of $\hat\beta_F$ from Equation 2. We denote this estimate as
$$\tilde\beta_F = \frac{\sum_{i=1}^k s_i^{-2}\,\hat\beta_i}{\sum_{i=1}^k s_i^{-2}}.$$

Impact of estimated standard errors on the plug-in estimate $\tilde\beta_F$
When using the precision weighted average from Equation 2, it is common to assume that the sample size in each study is large enough for the variance of the effect estimate ($\sigma_i^2$) to be approximated with negligible error by its estimate ($s_i^2$), 41 basing test statistics and confidence intervals on the following plug-in estimator and standard error:
$$\tilde\beta_F = \frac{\sum_{i=1}^k s_i^{-2}\,\hat\beta_i}{\sum_{i=1}^k s_i^{-2}}, \qquad \widehat{\mathrm{SE}}[\tilde\beta_F] = \Big(\sum_{i=1}^k s_i^{-2}\Big)^{-1/2}. \tag{5}$$
The properties of Equation 5 in small sample size settings have been studied via simulation, with inflated type I error rates observed for the test of the null hypothesis $H_0: \beta_F = 0$, due to underestimation of the standard error of $\tilde\beta_F$. 17,28,29 Corrected and alternative test statistics have been proposed, 5,6,29 but all of them are based on the assumption of a common effect.
FIGURE 1 Comparison of the distributions of $\hat\beta_F$ (known standard errors) and $\tilde\beta_F$ (estimated standard errors) in simple meta-analyses of 2 homogeneous studies with effect sizes $\beta_1 = \beta_2 = 0$ (left column) and 2 heterogeneous studies with effect sizes $\beta_1 = 1.5$ and $\beta_2 = -1.5$ (right column). We consider medium size studies with N = 100 (top row) and very large studies with N = 10 000 (bottom row). The y-axis and the vertical violin plots show the distributions of the estimates $\hat\beta_F$ with no uncertainty in the study weights (in red) and $\tilde\beta_F$ with estimated study weights (in gray). The x-axis and the horizontal violin plot show the distribution of the estimated weight given to study 1 in $\tilde\beta_F$. Notice that this same x-coordinate is used for both the gray and red diamonds, to illustrate their variability and their overlap (in the homogeneous case), but the weight for study 1 in $\hat\beta_F$ is always exactly 0.5, as indicated by the red line in the horizontal violin plot.
However, in our experience, many investigators expect that the effect of plugging in $s_i$ for $\sigma_i$ should, in large samples, be negligible for inference on $\beta_F$, regardless of the underlying $\beta_i$, and so the simulation results can be ignored when studies have large sample sizes. This intuition appears to be based on experience with other small-sample corrections that change standard error estimates by factors of $n/(n-1)$ or $n/(n-p)$, which can be ignored with large $n$. However, this intuition does not apply to $\tilde\beta_F$; not only does the effect of plugging in $s_i$ remain at any sample size, its impact depends importantly on the heterogeneity between the various $\beta_i$.
To better understand how the potential heterogeneity affects the estimation of $\mathrm{Var}[\tilde\beta_F]$, where $\tilde\beta_F$ is the plug-in estimate with estimated standard errors $s_i$, we decompose the variance of $\tilde\beta_F$ by conditioning on $s = (s_1, \ldots, s_k)$:
$$\mathrm{Var}[\tilde\beta_F] = E\big[\mathrm{Var}(\tilde\beta_F \mid s)\big] + \mathrm{Var}\big(E[\tilde\beta_F \mid s]\big)$$
$$\phantom{\mathrm{Var}[\tilde\beta_F]} = E\Bigg[\frac{\sum_{i=1}^k s_i^{-4}\sigma_i^2}{\big(\sum_{i=1}^k s_i^{-2}\big)^2}\Bigg] + \mathrm{Var}\Bigg[\frac{\sum_{i=1}^k s_i^{-2}\beta_i}{\sum_{i=1}^k s_i^{-2}}\Bigg], \tag{6}$$
where the second line follows from assumptions that each $\hat\beta_i$ is unbiased, is independent of its corresponding standard error estimate $s_i$, and has variance $\sigma_i^2$. Under exact homogeneity, the second term in Equation 6 simplifies to zero, but is otherwise strictly positive. Moreover, this second term does not become small compared with the first term at larger sample sizes. Before showing this phenomenon formally, we first illustrate it in Figure 1. It shows a simple fixed effects meta-analysis of just 2 studies, of equal sample size and precision, but potentially with unequal $\beta_i$. Comparing behavior of the fixed effects estimate with known standard errors ($\hat\beta_F$, in red) and estimated standard errors ($\tilde\beta_F$, in gray), we see that for heterogeneous data, regardless of sample size, the estimated standard errors give a more variable estimate. This is because $\tilde\beta_F$ is "tilted" closer to $\beta_1$ or $\beta_2$ when, by chance alone, study 1 or 2 receives greater weight. This pattern persists at larger sample sizes, so while the absolute amount of extra noise induced is reduced, the relative variabilities remain essentially unchanged. For the homogeneous settings, the $\sigma_i^2$ are equal, so no "tilting" occurs, but for the heterogeneous settings, the precisions differ by a factor of more than 5.
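The persistence of the extra variability can be checked with a short Monte Carlo experiment in the spirit of the two-study illustration. The sketch below is not a reproduction of Figure 1: it assumes (as labeled in the code) that each study compares two balanced groups of unit-variance normal outcomes, so that the true variance of each effect estimate is 4/n and its estimate is a scaled chi-squared draw on n - 2 degrees of freedom, independent of the effect estimate. The function name, sample sizes, number of replicates, and seed are all hypothetical choices.

```python
import random
import statistics

def simulate_variance_ratio(beta, n_per_study, reps=20000, seed=1):
    """Monte Carlo ratio Var[plug-in estimate] / Var[known-weights estimate].

    Assumed setup: every study has true variance sigma2 = 4 / n (balanced
    two-group mean difference, unit outcome variance) and
    s2 = sigma2 * chi2(n - 2) / (n - 2), drawn independently of the estimate.
    """
    rng = random.Random(seed)
    sigma2 = 4.0 / n_per_study
    df = n_per_study - 2
    known, plug_in = [], []
    for _ in range(reps):
        b_hat = [rng.gauss(b, sigma2 ** 0.5) for b in beta]
        # A chi-squared(df) variate is Gamma(shape=df/2, scale=2).
        s2 = [sigma2 * rng.gammavariate(df / 2.0, 2.0) / df for _ in beta]
        # Equal true precisions, so the known-weights estimate is a simple mean.
        known.append(sum(b_hat) / len(b_hat))
        w = [1.0 / v for v in s2]
        plug_in.append(sum(wi * bi for wi, bi in zip(w, b_hat)) / sum(w))
    return statistics.variance(plug_in) / statistics.variance(known)

ratio_homog = simulate_variance_ratio([0.0, 0.0], n_per_study=100)
ratio_heter_100 = simulate_variance_ratio([1.5, -1.5], n_per_study=100)
ratio_heter_1000 = simulate_variance_ratio([1.5, -1.5], n_per_study=1000)
print(ratio_homog, ratio_heter_100, ratio_heter_1000)
```

Under these assumptions the ratio stays near 1 in the homogeneous case, while in the heterogeneous case it is well above 1 and does not shrink toward 1 when the per-study sample size increases tenfold.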
To build further intuition about the extra variability induced by using estimated standard errors, we now provide an analytic version of the results illustrated in Figure 1. To do this, we write the variance of each $\hat\beta_i$ as $\mathrm{Var}(\hat\beta_i) = \sigma_i^2 = (n_i\phi_i)^{-1}$, and the estimator $s_i^2$ of $\sigma_i^2$ as $s_i^2 = (n_i\hat\phi_i)^{-1}$, so that the plug-in estimate with estimated standard errors is
$$\tilde\beta_F = \frac{\sum_{i=1}^k n_i\hat\phi_i\,\hat\beta_i}{\sum_{i=1}^k n_i\hat\phi_i}.$$
Additionally, we make the large-sample approximation that each $\hat\phi_i$ is asymptotically normal, with asymptotic variance given by some function of the distributional moments of the population(s) in study $i$, that we write as $f_i(\phi_i)$,* so that $\mathrm{Var}[\hat\phi_i] \approx f_i(\phi_i)/n_i$. Using the usual assumptions of normality of $\hat\beta_i$ and independence of $\hat\beta_i$ and $s_i$, then by the delta method, we obtain
$$\mathrm{Var}[\tilde\beta_F] \approx \frac{1}{\sum_{i=1}^k n_i\phi_i}\Bigg[1 + \frac{\sum_{i=1}^k n_i f_i(\phi_i)(\beta_i - \beta_F)^2}{\sum_{i=1}^k n_i\phi_i}\Bigg]. \tag{8}$$
Details are provided in Appendix B. Comparing Equation 8 to the standard error in Equation 5, we see that the asymptotic variance of $\tilde\beta_F$ is the asymptotic variance when the variances are known multiplied by an inflation factor, given in square brackets. This inflation factor, which accounts for the uncertainty in the estimation of the standard errors, depends on the squared deviations of $\beta_i$ from $\beta_F$, and thus, it will reduce to 1 under homogeneity but will increase as the dispersion of the effect sizes increases. We also notice that the squared deviations are multiplied by $f_i(\phi_i)$, the asymptotic variance of the information $\hat\phi_i$, implying that the inflation factor increases when the studies are less informative about $\phi_i$.
*The specific form of $f_i(\phi_i)$ will depend on the type of estimator used, the study's randomization ratio, as well as the variances and kurtoses of the treatment and control subpopulations. For this reason, we have decided to use this generic expression but have also provided detailed case-specific derivations in Appendix E.
FIGURE 3 Coverage probabilities of 95% confidence intervals for $\beta_F = 0$ from 10 000 simulations, using the fixed effects approach large sample size approximation (LSSA) estimator for the variance and bootstrap percentiles, compared to the "naive" estimator from a common effect approach and the DerSimonian-Laird estimator from a random effects approach
Figure 2 illustrates, for a simple case, the nontrivial impact of inflation on type I error rates when testing a point null hypothesis for $\beta_F$, even in large samples. The overstatements of statistical significance depend on the heterogeneity present, but also on the nominal level $\alpha$. (Full details are given in Appendices B and E1.) This theoretical result underpinning Figure 2 has been confirmed empirically in a simulation study (Figure 3), described in Section 4.3.
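To make the link between variance inflation and type I error concrete, the following sketch evaluates a delta-method inflation factor and the large-sample rejection rate of a nominal z-test that ignores it. Several things here are assumptions rather than results taken from the text: the inflation factor is taken to have the form 1 + sum(n_i f_i (beta_i - beta_F)^2) / sum(n_i phi_i), with the variance of each estimated information approximately f_i / n_i; and in the example f_i is set to 2 * phi_i^2, which is what a normal-outcome, equal-variance mean difference would give to first order. All function and variable names are hypothetical.

```python
from statistics import NormalDist

def inflation_factor(n, phi, f, beta):
    """Assumed delta-method inflation of Var[plug-in estimate] over 1 / sum(n_i * phi_i)."""
    info = [ni * pi for ni, pi in zip(n, phi)]
    total = sum(info)
    beta_f = sum(w * b for w, b in zip(info, beta)) / total
    extra = sum(ni * fi * (bi - beta_f) ** 2 for ni, fi, bi in zip(n, f, beta))
    return 1.0 + extra / total

def actual_type1_error(inflation, alpha=0.05):
    """Large-sample rejection rate of a nominal-alpha z-test using the uninflated SE."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * (1.0 - NormalDist().cdf(z / inflation ** 0.5))

# Two equally sized studies with opposite effects (hypothetical values).
n = [100, 100]
phi = [0.25, 0.25]                # per-subject information, ie, sigma_i^2 = 4 / n_i
f = [2.0 * p ** 2 for p in phi]   # assumed asymptotic variance of the estimated information
beta = [1.5, -1.5]

infl = inflation_factor(n, phi, f, beta)
err = actual_type1_error(infl)
print(f"inflation factor = {infl:.3f}, actual type I error = {err:.3f}")
```

Because the sample sizes cancel in the ratio, doubling every n_i leaves the inflation factor, and hence the miscalibration, unchanged in this sketch.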

A parameter to quantify heterogeneity
While quantifying heterogeneity in meta-analyses has an obvious scientific appeal, describing how effects differ across study populations, the results of Section 3.2 also suggest a statistical role for consideration of heterogeneity. Bridging these 2 goals, we now propose a parameter to quantify the heterogeneity of a group of effect-size parameters.
As a natural extension of the location summary $\beta_F$, we define
$$\zeta^2 = \frac{\sum_{i=1}^k \eta_i\phi_i(\beta_i - \beta_F)^2}{\sum_{i=1}^k \eta_i\phi_i},$$
where $\beta_F$ is as in Equation 4. For fixed sample size proportions $\eta_1, \ldots, \eta_k$, we can see that $\zeta^2$ is also a population parameter, just like $\beta_F$. We can interpret $\zeta^2$ as a weighted average of the squared deviations of each study effect size from the weighted average effect $\beta_F$, where the weights are proportional to the precision (or the proportion of information) associated with each study effect. Consequently, deviations from more precisely estimable study effects are upweighted. This parameter $\zeta^2$ is a weighted average squared deviation and quantifies the heterogeneity of the effect sizes.
As shown in Appendix C, $\zeta^2$ can also be defined without regard to $\beta_F$, as a summary of pairwise comparisons of the $\beta_i$, by writing it in the form
$$\zeta^2 = \frac{\sum_{i<j} \eta_i\phi_i\,\eta_j\phi_j(\beta_i - \beta_j)^2}{\big(\sum_{i=1}^k \eta_i\phi_i\big)^2}.$$
Specifically, $\zeta^2$ is a weighted sum of the squared pairwise differences of the effect sizes, weighting each pair by the product of their corresponding precisions. Unlike the between-studies variance $\tau^2$ used in random effects approaches, $\zeta^2$ is defined on just the studies at hand, not a hypothetical population of potential studies and some scheme for sampling from this population.
Although the definition of $\zeta^2$ is free of distributional assumptions, it can be further justified if we assume normality of the effect-size estimators (see, eg, Rice et al 4). Under this assumption, the $Q$ statistic is distributed noncentral $\chi^2$ with $k - 1$ degrees of freedom and noncentrality parameter given by
$$\lambda = \sum_{i=1}^k n_i\phi_i(\beta_i - \beta_F)^2 = \Phi\,\zeta^2, \tag{10}$$
where we have used $\Phi = \sum_i n_i\phi_i = N\sum_i \eta_i\phi_i$ to denote the total amount of information. This expression means that $\lambda$, and thus the power of the test of homogeneity based on $Q$, depends on 2 components: one is the total amount of information, which in turn depends on the total sample size, and the other is the heterogeneity between effect sizes, as given by $\zeta^2$, which is independent of the total sample size. In other words, $\zeta^2$ provides a measure of the distance from the null hypothesis of homogeneity.
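The equivalence between the deviation-based and pairwise forms of the heterogeneity parameter, and its link to the noncentrality parameter of Q, can be verified numerically. In the sketch below the weights are taken proportional to the product of each study's sample proportion and per-subject information; the function names and all numerical values are hypothetical.

```python
from itertools import combinations

def heterogeneity_by_deviations(eta, phi, beta):
    """Weighted average squared deviation from the precision weighted average."""
    w = [e * p for e, p in zip(eta, phi)]
    total = sum(w)
    beta_f = sum(wi * b for wi, b in zip(w, beta)) / total
    return sum(wi * (b - beta_f) ** 2 for wi, b in zip(w, beta)) / total

def heterogeneity_by_pairs(eta, phi, beta):
    """Same quantity from squared pairwise differences, without the weighted average."""
    w = [e * p for e, p in zip(eta, phi)]
    total = sum(w)
    pairs = combinations(range(len(beta)), 2)
    return sum(w[i] * w[j] * (beta[i] - beta[j]) ** 2 for i, j in pairs) / total ** 2

# Hypothetical sample proportions, per-subject information, and true effects.
eta = [0.5, 0.3, 0.2]
phi = [0.25, 0.5, 1.0]
beta = [0.30, -0.10, 0.75]
N = 400

zeta2 = heterogeneity_by_deviations(eta, phi, beta)
zeta2_pairs = heterogeneity_by_pairs(eta, phi, beta)
total_info = N * sum(e * p for e, p in zip(eta, phi))
noncentrality = total_info * zeta2  # grows with N while zeta2 stays fixed
print(zeta2, zeta2_pairs, noncentrality)
```

The two functions agree to rounding error, and only the noncentrality parameter, not the heterogeneity parameter itself, changes when the total sample size N is varied.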

Inference for $\beta_F$ and $\zeta^2$ with known standard errors
Inference for $\beta_F$ with known standard errors was described in Equation 2; confidence intervals for $\beta_F$ are usually built from a normal approximation, appealing to the large sample properties of $\hat\beta_i$. For a full description, see, eg, Hartung and Knapp. 6 For the estimation of the heterogeneity parameter $\zeta^2$, with known standard errors and efficient $\hat\beta_i$, we write $\sigma_i^2 = (n_i\phi_i)^{-1}$ for $i = 1, \ldots, k$ and also define $\Phi = \sum_{i=1}^k n_i\phi_i$ as the "total information." Then with no further distributional assumptions, a simple moment-based point estimate of $\zeta^2$ is given by
$$\hat\zeta^2 = \frac{Q - (k-1)}{\Phi},$$
with details given in Appendix D. To give a non-negative estimator of $\zeta^2$, we can report $\max\{0, \hat\zeta^2\}$.
To obtain approximate confidence intervals for $\zeta^2$, we assume normality of the effect-size estimators and exploit the relationship between $\zeta^2$ and the noncentrality parameter $\lambda$ as given in Equation 10. We propose using methods for constructing exact confidence intervals for the noncentrality parameter of a chi-square distribution that have been proposed and evaluated previously. 42,43 Basically, these methods consist of inverting a probability interval of the noncentral $\chi^2$ distribution. For example, for given $\Phi$ and $Q$, a $(1-\alpha)\times 100\%$ confidence interval for $\zeta^2$ is given by all the values $\zeta^2$ for which
$$\alpha/2 \;\le\; P\big(\chi^2_{k-1}(\Phi\zeta^2) \le Q\big) \;\le\; 1 - \alpha/2.$$
Solutions can be obtained numerically, and code for this and other types of confidence intervals (for the noncentrality parameter) is available. 43
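The inversion of the noncentral chi-squared probability interval can be sketched numerically using only the Python standard library. This is an illustration under stated assumptions, not the code referred to in the text: the noncentral CDF is evaluated as a Poisson mixture of central chi-squared CDFs, the limits are found by bisection, and a limit is set to zero when no positive noncentrality attains the target probability. The function names and the example values of Q, k, and total information are hypothetical.

```python
import math

def gamma_p(a, x):
    """Regularized lower incomplete gamma function P(a, x), by power series."""
    if x <= 0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-16 * total and n < 10000:
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def ncx2_cdf(x, df, lam):
    """CDF of a noncentral chi-squared as a Poisson mixture of central ones."""
    if lam <= 0:
        return gamma_p(df / 2.0, x / 2.0)
    half = lam / 2.0
    total, j = 0.0, 0
    while j < 5000:
        w = math.exp(-half + j * math.log(half) - math.lgamma(j + 1))
        total += w * gamma_p(df / 2.0 + j, x / 2.0)
        if j > half and w < 1e-14:
            break
        j += 1
    return total

def heterogeneity_interval(q, k, total_info, alpha=0.05):
    """Point estimate and interval for the heterogeneity parameter, given Q."""
    df = k - 1

    def solve(target):
        # The CDF at q decreases in the noncentrality; bisect for CDF == target.
        if ncx2_cdf(q, df, 0.0) <= target:
            return 0.0
        lo, hi = 0.0, 1.0
        while ncx2_cdf(q, df, hi) > target:
            hi *= 2.0
        for _ in range(60):
            mid = (lo + hi) / 2.0
            if ncx2_cdf(q, df, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    point = max(0.0, (q - df) / total_info)
    return point, solve(1.0 - alpha / 2.0) / total_info, solve(alpha / 2.0) / total_info

point, lower, upper = heterogeneity_interval(q=20.0, k=5, total_info=100.0)
print(f"estimate = {point:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Dividing the limits for the noncentrality parameter by the total information converts them to the scale of the heterogeneity parameter, mirroring the relationship between the two quantities.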

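Inference for $\beta_F$ and $\zeta^2$ with estimated standard errors
Large sample size approximation
Based on Equation 8, a large sample size approximation (LSSA) of the variance of the plug-in estimate $\tilde\beta_F$ is given by
$$\mathrm{Var}_{\mathrm{LSSA}}[\tilde\beta_F] = \frac{1}{\sum_{i=1}^k n_i\phi_i}\Bigg[1 + \frac{\sum_{i=1}^k n_i f_i(\phi_i)(\beta_i - \beta_F)^2}{\sum_{i=1}^k n_i\phi_i}\Bigg]. \tag{12}$$
Further details on the specific form of $f_i(\phi_i)$ in (12) for some common effect-size estimators are provided in Appendix E. In situations where the function $f_i$ is known or can be estimated, tests of hypothesis and confidence intervals can be based on a normal approximation using a plug-in estimator of (12), in which the unknown study-level quantities for $1 \le i \le k$, and $\beta_F$, are replaced by their estimates. The $(1-\alpha)\times 100\%$ LSSA interval then takes the form
$$\tilde\beta_F \pm z_{1-\alpha/2}\,\Big\{\widehat{\mathrm{Var}}_{\mathrm{LSSA}}[\tilde\beta_F]\Big\}^{1/2}.$$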

Quasi-F approach
We next propose an interval based on inverting a test of the null hypothesis of homogeneity, similarly to Hartung and Knapp. 5 It is based on a "quasi-F" test statistic, a statistic that approximates an F-distributed random variable. 44 To construct it, we use normality of the $\hat\beta_i$ to provide
$$\frac{(\hat\beta_F - \beta_{F0})^2}{\mathrm{Var}[\hat\beta_F]} \sim \chi^2_1$$
for null value $\beta_{F0}$. Approximating the noncentral $\chi^2$ distribution by matching its moments to a central $\chi^2$, 45,46 we can approximate the distribution of $Q$ as a $\kappa$-scaled central $\chi^2$ distribution with $\nu$ degrees of freedom ($\kappa\chi^2_\nu$), where
$$\kappa = \frac{k - 1 + 2\lambda}{k - 1 + \lambda}, \qquad \nu = \frac{(k - 1 + \lambda)^2}{k - 1 + 2\lambda}, \qquad \lambda = \Phi\,\zeta^2.$$
Under the assumptions above, $Q$ and $\hat\beta_F$ are independent, 4,5 so
$$F = \frac{(\hat\beta_F - \beta_{F0})^2 / \mathrm{Var}[\hat\beta_F]}{Q/(\kappa\nu)}$$
has an approximate $F_{1,\nu}$ distribution, and its signed square root has an approximate Student t distribution with $\nu$ degrees of freedom.
To use these results with unknown $\sigma_i$, a "quasi-F" statistic can be constructed by plugging in estimators of all those quantities: $\tilde\beta_F$ as in Equation 5, $\widehat{\mathrm{Var}}[\tilde\beta_F]$ the LSSA given in Equation 12, along with plug-in estimates of $Q$, $\zeta^2$, and $\Phi$, used in turn to estimate $\kappa$ and $\nu$. Taking square roots, the test statistic
$$T = \frac{\tilde\beta_F - \beta_{F0}}{\Big\{\widehat{\mathrm{Var}}[\tilde\beta_F]\;\hat Q/(\hat\kappa\hat\nu)\Big\}^{1/2}}$$
has an approximate Student t distribution with $\hat\nu$ degrees of freedom under the null hypothesis $H_0: \beta_F = \beta_{F0}$. (This reference distribution would differ importantly from a standard normal for small values of $\hat\nu$, which would be expected when the meta-analysis includes few studies (small $k$) and the total amount of information times the amount of heterogeneity is small, ie, approaching the limit where $\Phi\zeta^2 \to 0$.) Inverting this test, we obtain an approximate confidence interval for $\beta_F$.
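The moment-matching step can be made explicit with a few lines of code. The sketch below matches the mean and variance of a noncentral chi-squared distribution on k - 1 degrees of freedom to those of a scaled central chi-squared; the function name is hypothetical, and the closed forms follow directly from equating the first two moments.

```python
def match_noncentral_chi2(k, noncentrality):
    """Scale and degrees of freedom of the moment-matched scaled central chi-squared.

    The noncentral chi-squared on k - 1 degrees of freedom has mean
    (k - 1) + lambda and variance 2 * ((k - 1) + 2 * lambda); a scaled central
    chi-squared (scale * chi2_nu) has mean scale * nu and variance 2 * scale^2 * nu.
    """
    df = k - 1
    mean = df + noncentrality
    var = 2.0 * (df + 2.0 * noncentrality)
    scale = var / (2.0 * mean)
    nu = 2.0 * mean ** 2 / var
    return scale, nu

scale, nu = match_noncentral_chi2(k=5, noncentrality=16.0)
print(f"scale = {scale:.3f}, degrees of freedom = {nu:.3f}")
```

With no heterogeneity the match returns a scale of 1 and k - 1 degrees of freedom, the smallest value the approximating degrees of freedom can take, which is where a Student t reference distribution differs most from the standard normal.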

Parametric bootstrap
The alternative estimators described in Sections 4.2.1 and 4.2.2, which take into account the potential heterogeneity of the effect-size parameters, are based on approximations that would be expected to work in large sample settings, but would probably perform poorly in settings with very small samples. An alternative method that could perform better in small sample size settings is bootstrap resampling. As individual-level observations are typically not available, we consider using parametric bootstrap sampling. (For a full review of this approach, see chapter 6 of Efron and Tibshirani. 47) Estimates of the variance of the plug-in estimate $\tilde\beta_F$, as well as 95% confidence intervals and/or P values for tests of hypothesis, can all be obtained from parametric sampling, based on the estimates $\hat\beta_1, \hat\beta_2, \ldots, \hat\beta_k$ and $s_1^2, s_2^2, \ldots, s_k^2$. Assuming a normal distribution of the effect size estimates, a parametric bootstrap sample of size $B$ for each of the effect-size parameters $\beta_i$ can be obtained:
$$\hat\beta_i^{*b} \sim N(\hat\beta_i, s_i^2), \qquad b = 1, \ldots, B.$$
However, parametric sampling for the variances of the effect estimates depends on the specific variance estimator used in each study. For example, for the variance of the difference in means of independent groups where equal variances are assumed, a bootstrap sample of $\hat\sigma_i^2$ can be obtained as
$$\hat\sigma_i^{2*} = \Big(\frac{1}{n_{i,X}} + \frac{1}{n_{i,Y}}\Big)\hat\varsigma_i^{2*}, \qquad \hat\varsigma_i^{2*} \sim \frac{\hat\varsigma_i^2}{n_i - 2}\,\chi^2_{n_i - 2},$$
where $n_{i,X}$ and $n_{i,Y}$ are the group sizes ($n_i = n_{i,X} + n_{i,Y}$) and $\hat\varsigma_i^2$ is the pooled estimate of the common variance $\varsigma_i^2$. 48 More generally, for estimates from linear regression (where normality and constant variance are assumed), the sampling can be done from a $\chi^2$ distribution with $(n_i - p_i)$ degrees of freedom, where $p_i$ denotes the number of predictors in the regression (including the intercept). In contrast, when $\beta_i$ is estimated as the difference in means of independent groups with the variances not assumed to be equal, the parametric sampling of $\varsigma_{i,X}^2$ and $\varsigma_{i,Y}^2$ should be done separately and then combined to obtain the value of $\hat\sigma_i^2$. Further details on the specific form of some of these estimators can be found in Appendix E.
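From the parametric bootstrap samples of effect size and variance estimators, different estimates and/or test statistics can be obtained. We propose (and evaluate) the following:
1. A pivotal $(1-\alpha)\times 100\%$ confidence interval based on a normal approximation and using an estimate of the variance of $\tilde\beta_F$ from the bootstrap sample (see chapter 6 of Efron and Tibshirani 47): $\tilde\beta_F \pm z_{1-\alpha/2}\{\widehat{\mathrm{Var}}_B[\tilde\beta_F]\}^{1/2}$, where $\widehat{\mathrm{Var}}_B[\tilde\beta_F]$ is the sample variance of the $B$ bootstrap replicates $\tilde\beta_F^{*b}$.
2. A percentile confidence interval, based on the $\alpha/2$ and $1-\alpha/2$ percentiles of the empirical distribution of the bootstrap replicates $\tilde\beta_F^{*b}$.
3. A bootstrap-t confidence interval (see chapter 12 of Efron and Tibshirani 47), based on the percentiles from the distribution of a test statistic constructed using a "naive" estimator of the variance of $\tilde\beta_F$: $t^{*b} = (\tilde\beta_F^{*b} - \tilde\beta_F)\big/\{\sum_{i=1}^k (s_i^{*b})^{-2}\}^{-1/2}$.
4. A bootstrap-t confidence interval, based on the percentiles from the distribution of a test statistic constructed using the LSSA estimator of the variance of $\tilde\beta_F$, as given in (12): $t^{*b} = (\tilde\beta_F^{*b} - \tilde\beta_F)\big/\{\widehat{\mathrm{Var}}^{*b}_{\mathrm{LSSA}}[\tilde\beta_F]\}^{1/2}$.
Similar approaches are proposed for the heterogeneity parameter $\zeta^2$, based on a bootstrap sample of the estimator proposed in (14) and (15). We present evaluations of the coverage of confidence intervals using 2 approaches (some other alternatives were attempted, but did not show important improvement):
1. A pivotal $(1-\alpha)\times 100\%$ confidence interval based on a normal approximation and using an estimate of the variance of $\hat\zeta^2$ from the bootstrap sample: $\hat\zeta^2 \pm z_{1-\alpha/2}\{\widehat{\mathrm{Var}}_B[\hat\zeta^2]\}^{1/2}$, where $\hat\zeta^2$ is defined as above.
2. A percentile confidence interval, based on the percentiles of the empirical distribution of the bootstrap replicates of $\hat\zeta^2$.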
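The parametric bootstrap for the location parameter can be sketched as follows for the pivotal and percentile intervals. The sketch assumes, as labeled in the code, that every study reports a mean difference with equal group variances, so that each variance estimate is resampled as a scaled chi-squared on n_i - 2 degrees of freedom; the function names, data, number of replicates, and seed are hypothetical.

```python
import random
import statistics
from statistics import NormalDist

def plug_in_estimate(b_hat, s2):
    """Precision weighted average using estimated variances as weights."""
    w = [1.0 / v for v in s2]
    return sum(wi * bi for wi, bi in zip(w, b_hat)) / sum(w)

def parametric_bootstrap(b_hat, s2, n, n_boot=2000, seed=1):
    """Bootstrap replicates of the plug-in estimate.

    Assumes normal effect estimates and s2* = s2 * chi2(n_i - 2) / (n_i - 2),
    ie, mean differences with a pooled variance estimate in every study.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        b_star = [rng.gauss(b, v ** 0.5) for b, v in zip(b_hat, s2)]
        s2_star = [v * rng.gammavariate((ni - 2) / 2.0, 2.0) / (ni - 2)
                   for v, ni in zip(s2, n)]
        draws.append(plug_in_estimate(b_star, s2_star))
    return draws

def bootstrap_intervals(b_hat, s2, n, alpha=0.05, n_boot=2000, seed=1):
    """Pivotal (normal approximation) and percentile intervals."""
    est = plug_in_estimate(b_hat, s2)
    draws = sorted(parametric_bootstrap(b_hat, s2, n, n_boot, seed))
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    se = statistics.stdev(draws)
    pivotal = (est - z * se, est + z * se)
    lo_idx = int(alpha / 2.0 * len(draws))
    hi_idx = int((1.0 - alpha / 2.0) * len(draws)) - 1
    percentile = (draws[lo_idx], draws[hi_idx])
    return est, pivotal, percentile

# Hypothetical study-level summaries: estimates, estimated variances, sample sizes.
b_hat = [0.30, -0.10, 0.75]
s2 = [0.01, 0.04, 0.0625]
n = [60, 40, 30]

est, pivotal, percentile = bootstrap_intervals(b_hat, s2, n)
print(est, pivotal, percentile)
```

Because both the effect estimates and the weights are resampled, the bootstrap replicates reflect the extra variability from estimating the standard errors, which an interval built on the naive standard error would omit.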

Simulation study
We conducted a simulation study to evaluate and compare the different estimation methods proposed for $\beta_F$ and $\zeta^2$. For our simulations, we considered fixed effect sizes ($\beta_1, \ldots, \beta_k$), uniformly spaced and centered around zero ($\beta_F = 0$), with the spacing in between given by fixed values of $\zeta^2$. We assumed continuous normal outcomes and the effect size $\beta_i$ given by the mean difference between 2 groups, assuming equal population variances and balanced designs. We took random draws of the effect estimates ($\hat\beta_1, \ldots, \hat\beta_k$) from normal distributions centered around the fixed effects ($\beta_1, \ldots, \beta_k$) along with random draws of their variances taken from scaled $\chi^2$ distributions with $n_i - 2$ degrees of freedom. Various scenarios were considered, varying the number of studies, sample sizes, and amount of heterogeneity. In addition to the various confidence intervals proposed here for $\beta_F$ and $\zeta^2$, we also compared their performance with methods typically used in meta-analysis, ie, the common effect and random effects approaches. To aid the comparisons, we chose a setup in which all these approaches estimate location parameters with the same numerical value. Further details on the settings and complete results from the simulation study can be found in the supporting information for the online article.
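A data-generating step of this kind can be sketched as below. The details are assumptions made for illustration rather than the settings of the reported study: all studies are given the same sample size and unit outcome variance (so every study carries equal weight and the weighted mean squared effect is a simple mean), and the function names are hypothetical.

```python
import random

def spaced_effects(k, heterogeneity):
    """k uniformly spaced effects centered at zero with mean square `heterogeneity`.

    With equal weights, k equally spaced values with spacing d have mean
    square d^2 * (k^2 - 1) / 12, which is solved here for d.
    """
    spacing = (12.0 * heterogeneity / (k ** 2 - 1)) ** 0.5
    return [spacing * (i - (k - 1) / 2.0) for i in range(k)]

def draw_meta_analysis(beta, n, rng, sd=1.0):
    """One simulated set of effect estimates and estimated variances.

    Assumes balanced two-group mean differences: true variance 4 * sd^2 / n and
    estimated variance scaled by chi2(n - 2) / (n - 2).
    """
    sigma2 = 4.0 * sd ** 2 / n
    df = n - 2
    b_hat = [rng.gauss(b, sigma2 ** 0.5) for b in beta]
    s2 = [sigma2 * rng.gammavariate(df / 2.0, 2.0) / df for _ in beta]
    return b_hat, s2

beta = spaced_effects(k=5, heterogeneity=0.5)
b_hat, s2 = draw_meta_analysis(beta, n=60, rng=random.Random(2024))
print(beta, b_hat, s2)
```

Repeating the draw many times and applying each interval method to every simulated data set gives Monte Carlo estimates of coverage.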
Representative results are shown in Figure 3. For the estimation of the location parameter F , we observed a better performance of parametric bootstrap methods over those based on asymptotic approximations, especially with small sample sizes. Among these, the confidence interval based on the percentiles of the empirical distribution of the parametric sample would be recommended, because it is simple and performed well, providing coverage close to nominal level. However, we also notice that the LSSA method performed reasonably well for large sample sizes (at least 60 subjects per study) and note that it can be used if the parametric bootstrap could not be implemented.
Turning to existing methods, the random effects approach (using the DerSimonian-Laird estimator of $\tau^2$) provided, as expected, overconservative inference, as a result of wide confidence intervals that account for random sampling of effect sizes, which is not present in our simulation settings. For the common effect estimator, which is equivalent to using a naive estimate of the variance of $\hat\beta_F$ as given in Equation 7, the coverage probability approaches the nominal level as the sample size increases but never reaches it in the presence of heterogeneity. The asymptotic coverage of this naive estimator has been calculated using (12) and is shown as dotted horizontal lines in Figure 3 (see details in Appendix E1).

FIGURE 4
Coverage probabilities of 95% confidence intervals for $\zeta^2$ from 10 000 simulations, using an inverted probability interval from a noncentral $\chi^2$ distribution, a normal approximation with bootstrap estimate of the standard error, and a bootstrap estimate based on the quantiles of the empirical distribution

For the heterogeneity parameter $\zeta^2$, although all the proposed methods seemed to asymptotically achieve the nominal coverage probability, none of them performed uniformly better for small sample size settings in all scenarios (Figure 4). The normal approximation with a moment based estimate of the standard error showed both significant overcoverage and undercoverage in different scenarios (not shown). The normal approximation using a bootstrap estimate of the standard error seemed to correct the undercoverage in some scenarios, but not when the number of studies was small (k = 3), while the bootstrap confidence intervals based on the percentiles showed important undercoverage for low values of heterogeneity and large numbers of studies (k = 7, 15). This is consistent with previous results, in which the consistency of bootstrap estimation is related to the asymptotic normality of the statistic, 49,50 whereas in our case, the distribution of the statistic is far from normal for small sample sizes and low levels of heterogeneity.
On the other hand, given the more consistent performance of the inverted probability interval from a noncentral $\chi^2$ distribution, we would recommend its use when the sample sizes are large enough (at least 40 observations per study) and the studies are not strongly heterogeneous.

EXAMPLE
In this section, we apply the estimation methods discussed in Section 4 to an example from a systematic review of studies that evaluate the efficacy of zinc in reducing the incidence, severity, and duration of common cold symptoms. 51 In this particular meta-analysis, the authors included studies that compare zinc acetate lozenges with placebo, with the outcome being the duration of cold symptoms (in days) and the treatment effect measured by the mean difference. A forest plot is shown in Figure 5.
In Table 2, we summarize the results of meta-analyses on the 6 studies comparing zinc lozenges to placebo, using 3 different approaches. We observe that the point estimates of $\beta_0$ and $\beta_F$ from the common effect and fixed effects approaches, respectively, although numerically the same ($\hat\beta_0 = \hat\beta_F = -2.04$ days), estimate different parameters. The first estimates a common effect underlying all 6 studies, but given the evident heterogeneity between studies, this inference does not seem to be adequate, or even valid. On the other hand, $\hat\beta_F$ estimates a weighted average of the mean differences from the 6 studies, for which a significant amount of heterogeneity is observed, as reflected by the estimate of $\zeta^2$. More specifically, $\hat\beta_F$ estimates the mean difference in duration of common cold averaged in a meta-population composed of the populations from which the samples of these 6 studies were drawn, in proportions given by the studies' precision weights. Similarly, $\hat\zeta^2$ can be thought of as estimating how far apart the mean differences in 2 of these populations are, averaged over the same meta-population. We also observe that the results from different estimation methods, although not exactly the same, do not differ importantly, with a difference in length of 0.13 days between the 95% confidence intervals using the LSSA and the parametric bootstrap. On the other hand, random effects meta-analysis estimates the mean and variance of a population from which the effects in the 6 studies are thought to have been drawn ($\mu$ and $\tau^2$). The inference now is not made for the population of subjects (on whom we wish to estimate an average effect of a treatment) but for a population of potential treatment effects. As shown in Table 2, different methods for estimating the between-studies variance give notably different results, with larger estimates of $\tau^2$ yielding estimates of $\mu$ that are closer to the unweighted simple average of the study effects (−0.56).
Moreover, the precision with which these parameters are estimated is much lower than the precision with which $\beta_F$ and $\zeta^2$ are estimated, even after taking into account the uncertainty in the estimation of the variances. This gain in precision, it should be noted, is not a result of a particular choice of estimation technique; it is instead the result of targeting our inference to a parameter that is easier to estimate, ie, one for which the data provide the most information.
To further illustrate the properties of the estimators of $\beta_F$ and $\zeta^2$ in a fixed effects meta-analysis, we have modified the example into 3 different versions, as shown in Figure 6. First, we increased the precision of the estimates in the meta-analysis, by artificially growing the sample sizes by a factor of 10 (panel B). This results in a greater precision for the estimates of $\beta_F$ and $\zeta^2$. However, this same increase in information does not translate into an increased precision for estimating $\mu$ or $\tau^2$ in a random-effects model (for which more studies, rather than larger sample sizes, would be needed). In another version of the meta-analysis, we have kept the same precision but shrunk (shifted) the estimates toward $\hat\beta_F$, so that the squared deviations have been reduced by a factor of 10 (panel C). Reflecting this relative homogeneity, the estimate of $\zeta^2$ is much lower and close to zero. We also notice that the estimate of $\beta_F$ remains practically unchanged, ie, it is mostly independent of $\hat\zeta^2$ (except for the variance inflation effect described in Section 3.2, which is not substantial in this case).

FIGURE 6
Location and scale parameter estimates for 4 different versions of the meta-analysis in Figure 5: A, original estimates; B, more homogeneous estimates with squared deviation from $\hat\beta_F$ reduced by a factor of 10; C, more precise estimates with sample sizes 10 times those of the original estimates; D, more precise and more homogeneous estimates, with sample sizes 10 times larger and squared deviations from $\hat\beta_F$ reduced by a factor of 10, relative to the original estimates. Rhomboids are used to represent point estimates and 95% confidence intervals of location parameters $\beta_F$ (in gray) and $\mu$ (in red). The vertical dashed lines represent the square root of the estimated averaged squared deviations from $\beta_F$, as given by $\hat\zeta = \sqrt{\hat\zeta^2}$

In contrast, the estimate of $\mu$, both in terms of its location and precision, is highly dependent on the between-study variance, as estimated by $\hat\tau^2$. Lastly, we artificially reduced the between-study heterogeneity and the within-study variance by the same factor (panel D). As a result, the value of the Q statistic is exactly the same as in the original version of the meta-analysis (Q = 53.8, 5 degrees of freedom, P value < .0001), and so is the value of $I^2$. This makes sense, as in both meta-analyses, heterogeneity accounts for the same proportion of the total variation. However, in absolute terms, the estimates in the modified version are much closer to each other than in the original meta-analysis, and this is picked up by the estimates of $\zeta^2$ and $\tau^2$, as they both quantify "absolute" heterogeneity. Their confidence intervals in both cases exclude zero, rejecting the null hypothesis of homogeneous effects. However, as pointed out before, we can estimate $\zeta^2$ with higher precision, even with few studies.
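The invariance of Q and $I^2$ under a joint rescaling, and the sensitivity of an absolute heterogeneity measure to it, can be checked numerically. The sketch below is only an illustration: the study estimates and variances are made up (they are not the zinc data), the function names are hypothetical, and $\zeta^2$ is estimated with a simple moment estimator that treats the variances as known and is truncated at zero, which is an assumption made here for simplicity.

```python
def heterogeneity_summary(estimates, variances):
    """Precision weighted average, Cochran's Q, I^2 and a moment estimate of zeta^2."""
    weights = [1 / v for v in variances]
    total = sum(weights)
    beta_f = sum(w * b for w, b in zip(weights, estimates)) / total
    q = sum(w * (b - beta_f) ** 2 for w, b in zip(weights, estimates))
    k = len(estimates)
    i2 = max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0
    zeta2 = max(0.0, (q - (k - 1)) / total)  # variances treated as known
    return {"beta_F": beta_f, "Q": q, "I2": i2, "zeta2": zeta2}


def shrink_and_sharpen(estimates, variances, factor):
    """Divide squared deviations from the weighted average and the variances by `factor`."""
    beta_f = heterogeneity_summary(estimates, variances)["beta_F"]
    root = factor ** 0.5
    new_estimates = [beta_f + (b - beta_f) / root for b in estimates]
    new_variances = [v / factor for v in variances]
    return new_estimates, new_variances


if __name__ == "__main__":
    # Made-up study results, for illustration only.
    est = [-4.0, -0.5, -3.0, 0.2, -2.5, -1.0]
    var = [0.30, 0.25, 0.40, 0.35, 0.20, 0.45]
    original = heterogeneity_summary(est, var)
    modified = heterogeneity_summary(*shrink_and_sharpen(est, var, 10))
    print(original)
    print(modified)  # same Q and I2, zeta2 divided by 10
```

Q and $I^2$ depend only on the ratio of squared deviations to variances, so dividing both by the same factor leaves them unchanged, while the moment estimate of $\zeta^2$ is on the scale of the squared effects and shrinks by that factor.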

DISCUSSION
In this paper, we have addressed several aspects of the fixed effects meta-analysis with within-study estimates of the standard errors. To formally motivate its precision weighting, we described the optimality of the corresponding parameter $\beta_F$, and by studying the behavior of the precision-weighted estimate in detail, we showed the important role of a particular measure of heterogeneity, $\zeta^2$.
Frequentist methods for the estimation of both the location parameter F and the heterogeneity parameter 2 were proposed, including corrected estimators that take into account the uncertainty in the estimation of the within study variances. Estimation methods based on asymptotic approximations, as well as methods based on parametric bootstrap, were implemented and have been evaluated in a simulation study.
In the results of our simulation study, we observed a better performance of parametric bootstrap methods over those based on asymptotic approximations for the estimation of the location parameter $\beta_F$, especially in small sample size settings. Among these, we would recommend the confidence interval based on the percentiles of the empirical distribution of the parametric sample, because of its simplicity and good performance. However, the LSSA method also performed reasonably well for large sample sizes (n ≥ 60 per study) and could be used if the parametric bootstrap cannot be implemented.
For the heterogeneity parameter $\zeta^2$, although no method performed uniformly better, the construction of 95% confidence intervals by inverting the probability interval from a noncentral $\chi^2$ distribution seems to provide close to nominal coverage when the sample size is large enough (around 40 observations per study).
The main limitation of our simulation study is that the proposed methods were implemented with knowledge of how the study estimates (including standard errors) were generated. The independence of the point estimates and standard errors (plausible in most uses of linear regression) may not be as realistic if the study-specific analyses use logistic regression, or other forms of analysis under strong mean-variance relationships. The normality of the $\hat\beta_i$ may also be considered a limitation, but unless the outcome variable is very heavy-tailed and/or sensitive to a few observations, standard central limit theorem arguments suggest that this will only be an issue in extremely small samples.
We also illustrated the results of different estimation methods, as well as different approaches, with a previously published meta-analysis. This example, along with the results of our simulation study, supports the idea of approaching meta-analysis under a fixed effects framework, as a valid alternative to the typically used common effect and random effects approaches. Our approach, based on the estimation of both a location and a heterogeneity parameter, is more flexible than the restrictive common effect approach while allowing inference on the population of interest. Our approach also makes it unnecessary to choose between statistical models based on their adequacy rather than the target inference.
Finally, although we believe that estimation of both F and 2 is useful for describing and combining in a meaningful way the effects of studies included in a meta-analysis, we propose their estimation only as part of a full battery of qualitative and quantitative tools that should be used to review, summarize, and synthesize a group of studies. No single parameter or estimator can always appropriately summarize all there is to say in a systematic review of medical studies, and practitioners should be encouraged and helped to understand the measures they choose to provide.
To minimize this expression, we use Lagrange multipliers:

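APPENDIX B: DERIVATION OF THE ASYMPTOTIC VARIANCE OF $\hat\beta_F$ (SECTION 3.2)
First, we write $s_i^2 = (n_i \hat\phi_i)^{-1}$ as the estimator of $\sigma_i^2 = (n_i \phi_i)^{-1}$, the variance of $\hat\beta_i$ in the $i$th study. We start by assuming that $\hat\beta_i$ and $\hat\phi_i$ are independent and have an asymptotic normal distribution:

$$\sqrt{n_i}\begin{pmatrix}\hat\beta_i - \beta_i\\ \hat\phi_i - \phi_i\end{pmatrix} \xrightarrow{d} N\left(\begin{pmatrix}0\\ 0\end{pmatrix}, \begin{pmatrix}1/\phi_i & 0\\ 0 & f_i(\phi_i)\end{pmatrix}\right), \quad \text{(B1)}$$

where $f_i(\phi_i)$ is some function of the distributional moments of the population(s) in study $i$. For a meta-analysis of studies with different sample sizes, we define $\eta_i = n_i/N$, with $N = \sum_{i=1}^{k} n_i$. Then, dividing (B1) by $\sqrt{\eta_i}$,

$$\sqrt{N}\begin{pmatrix}\hat\beta_i - \beta_i\\ \hat\phi_i - \phi_i\end{pmatrix} \xrightarrow{d} N\left(\begin{pmatrix}0\\ 0\end{pmatrix}, \begin{pmatrix}(\eta_i\phi_i)^{-1} & 0\\ 0 & f_i(\phi_i)/\eta_i\end{pmatrix}\right).$$

Assuming that the study estimates are all independent, we can write

$$\sqrt{N}\left[(\hat\beta_1, \ldots, \hat\beta_k, \hat\phi_1, \ldots, \hat\phi_k)^T - (\beta_1, \ldots, \beta_k, \phi_1, \ldots, \phi_k)^T\right] \xrightarrow{d} N(0, \Sigma), \quad \text{(B2)}$$

with $\Sigma = \mathrm{diag}\left((\eta_1\phi_1)^{-1}, \ldots, (\eta_k\phi_k)^{-1}, f_1(\phi_1)/\eta_1, \ldots, f_k(\phi_k)/\eta_k\right)$.
Recalling the definition of $\beta_F$:

$$\beta_F = \frac{\sum_{i=1}^{k}\eta_i\phi_i\beta_i}{\sum_{i=1}^{k}\eta_i\phi_i},$$

we obtain the following derivatives:

$$\frac{\partial\beta_F}{\partial\beta_i} = \frac{\eta_i\phi_i}{\sum_{j=1}^{k}\eta_j\phi_j}, \qquad \frac{\partial\beta_F}{\partial\phi_i} = \frac{\eta_i(\beta_i - \beta_F)}{\sum_{j=1}^{k}\eta_j\phi_j}.$$

Then, as long as $f_i(\phi_i) < \infty$ for $i = 1, \ldots, k$, we can apply the delta method to (B2), obtaining

$$\sqrt{N}(\hat\beta_F - \beta_F) \xrightarrow{d} N\left(0, \frac{1}{\sum_{i=1}^{k}\eta_i\phi_i}\left[1 + \frac{\sum_{i=1}^{k}\eta_i f_i(\phi_i)(\beta_i - \beta_F)^2}{\sum_{i=1}^{k}\eta_i\phi_i}\right]\right).$$

Here, we notice that for some special cases when $f_i(\phi_i)/\phi_i = c$, a constant, for $i = 1, \ldots, k$, this constant can be factorized out and the inflation factor can then be expressed as $(1 + c\zeta^2)$, a function of the heterogeneity parameter $\zeta^2$ defined in Section 3.3. The specific forms of $f_i(\phi_i)$ for some common effect estimators of continuous outcomes are provided in Appendix E.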

APPENDIX C: ALTERNATIVE EXPRESSION FOR $\zeta^2$ (SECTION 3.3)
To simplify calculations, we write i . Using this notation, we notice that The parameter 2 can then be written as

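APPENDIX D: MOMENT BASED ESTIMATOR OF $\zeta^2$ (SECTION 4.1)
Assuming known variances $\sigma_1^2, \ldots, \sigma_k^2$, with $\sigma_i^2 = (n_i\phi_i)^{-1}$ for $i = 1, \ldots, k$, we start by calculating the expected value of a plug-in estimate of $\zeta^2$:

$$E\left[\frac{\sum_{i=1}^{k}\sigma_i^{-2}(\hat\beta_i - \hat\beta_F)^2}{\sum_{i=1}^{k}\sigma_i^{-2}}\right] = \zeta^2 + \frac{k - 1}{\sum_{i=1}^{k}\sigma_i^{-2}}.$$

Thus, an unbiased estimator of $\zeta^2$ is then given by

$$\hat\zeta^2 = \frac{\sum_{i=1}^{k}\sigma_i^{-2}(\hat\beta_i - \hat\beta_F)^2 - (k - 1)}{\sum_{i=1}^{k}\sigma_i^{-2}}.$$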
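The bias of a plug-in estimate and its moment correction can be illustrated with a short Monte Carlo check. This is a sketch under assumptions made here, not code from the paper: the variances are treated as known, the plug-in estimate is taken to be the precision weighted average of squared deviations from the precision weighted mean, the correction subtracts $(k - 1)/\sum_i \sigma_i^{-2}$, and the function names and numbers are hypothetical.

```python
import random


def plug_in_zeta2(estimates, variances):
    """Precision weighted average of squared deviations from the weighted mean."""
    weights = [1 / v for v in variances]
    total = sum(weights)
    beta_f = sum(w * b for w, b in zip(weights, estimates)) / total
    return sum(w * (b - beta_f) ** 2 for w, b in zip(weights, estimates)) / total


def moment_zeta2(estimates, variances):
    """Plug-in estimate minus its bias (k - 1) / sum of precisions."""
    k = len(estimates)
    total = sum(1 / v for v in variances)
    return plug_in_zeta2(estimates, variances) - (k - 1) / total


def average_over_draws(betas, variances, reps=20000, seed=0):
    """Monte Carlo means of both estimators when the variances are known."""
    rng = random.Random(seed)
    plug = mom = 0.0
    for _ in range(reps):
        draws = [rng.gauss(b, v ** 0.5) for b, v in zip(betas, variances)]
        plug += plug_in_zeta2(draws, variances)
        mom += moment_zeta2(draws, variances)
    return plug / reps, mom / reps


if __name__ == "__main__":
    betas = [-1.0, 0.0, 1.0]      # hypothetical true effects
    variances = [0.1, 0.2, 0.1]   # hypothetical known variances
    print(plug_in_zeta2(betas, variances))       # true zeta^2 for these inputs
    print(average_over_draws(betas, variances))  # (plug-in mean, corrected mean)
```

Applying `plug_in_zeta2` to the true effects returns the target value of $\zeta^2$, so the two Monte Carlo means can be compared against it directly; the corrected estimator can be negative in individual draws, which is the price of unbiasedness.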

APPENDIX E: ASYMPTOTIC VARIANCE OF $\hat\phi_i$ FOR SOME COMMON EFFECT ESTIMATORS
In this section, we derive the asymptotic variance of the estimated information parameter $\hat\phi_i$ for some common estimators of treatment effect for continuous outcomes. This asymptotic variance, denoted $f(\phi_i)$, can then be plugged into the estimators for the variance of $\hat\beta_F$ described in Section 4.2 of the main paper.

E.1 Asymptotic variance of $\hat\phi_i$ for the mean difference of independent groups
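Following Borenstein, 48 we first look at meta-analyses of studies that compare the means of 2 independent groups, when an assumption of equal variances is made. Here, the effect size in the $i$th study, $\beta_i = \Delta_i = \mu_{i,X} - \mu_{i,Y}$, is estimated by $\hat\beta_i = \bar X_i - \bar Y_i$, with $\mathrm{Var}(\hat\beta_i) = \sigma_i^2 = (1/n_{i,X} + 1/n_{i,Y})\varsigma_i^2$, where $\varsigma_i^2$ is the population variance, assumed to be the same for the 2 groups in study $i$, and $n_{i,X}$ and $n_{i,Y}$ are the respective sample sizes (with $n_i = n_{i,X} + n_{i,Y}$). Here, we can write $\sigma_i^2 = (n_i\phi_i)^{-1}$ and $s_i^2 = (n_i\hat\phi_i)^{-1}$, with

$$\phi_i = g(\varsigma_i^2) = \left[n_i\left(\frac{1}{n_{i,X}} + \frac{1}{n_{i,Y}}\right)\varsigma_i^2\right]^{-1} \quad \text{(E1)}$$

and $\hat\phi_i = g(\hat\varsigma_i^2)$, where $\hat\varsigma_i^2$ is the pooled estimator of the variance, given by

$$\hat\varsigma_i^2 = \frac{(n_{i,X} - 1)\hat\varsigma_{i,X}^2 + (n_{i,Y} - 1)\hat\varsigma_{i,Y}^2}{n_i - 2},$$

and $\hat\varsigma_{i,X}^2$ and $\hat\varsigma_{i,Y}^2$ are the sample variances of the 2 groups in study $i$. To obtain an asymptotic distribution for $\hat\phi_i$, we start from a standard result for the sample variance:

$$\sqrt{n_i}(\hat\varsigma_i^2 - \varsigma_i^2) \xrightarrow{d} N\left(0, (\kappa_i - 1)\varsigma_i^4\right),$$

where $\kappa_i$ denotes the kurtosis of the population distribution (which we will also assume to be the same in the 2 groups for now). Then, keeping the sample size proportions ($n_{i,X}/n_i$ and $n_{i,Y}/n_i$) fixed, the derivative of (E1) is

$$g'(\varsigma_i^2) = -\frac{\phi_i}{\varsigma_i^2}.$$

Then, applying the delta method, we conclude that

$$\sqrt{n_i}(\hat\phi_i - \phi_i) \xrightarrow{d} N\left(0, (\kappa_i - 1)\phi_i^2\right).$$

We notice that for studies with a balanced design ($n_{i,X} = n_{i,Y}$), the information is $\phi_i = 1/(4\varsigma_i^2)$ and the asymptotic variance in the last expression reduces to $(\kappa_i - 1)/(4^2\varsigma_i^4)$. If all studies in a meta-analysis are balanced and the population variance and kurtosis can be assumed constant across all studies, then the inflation factor in (E2) is given by $\left(1 + \frac{\kappa - 1}{4\varsigma^2}\zeta^2\right)$. This is the expression used to estimate the inflation in type I error rate, produced when the estimation of standard errors is not taken into account, illustrated in Figure 2 of the main paper.

Now, when the assumption of equal variances is not made, the variance of $\hat\beta_i = \bar X_i - \bar Y_i$ is given by

$$\mathrm{Var}(\hat\beta_i) = \sigma_i^2 = \frac{\varsigma_{i,X}^2}{n_{i,X}} + \frac{\varsigma_{i,Y}^2}{n_{i,Y}}.$$

For the group-specific sample variances, we have

$$\sqrt{n_i}(\hat\varsigma_{i,X}^2 - \varsigma_{i,X}^2) \xrightarrow{d} N\left(0, \frac{n_i}{n_{i,X}}(\kappa_{i,X} - 1)\varsigma_{i,X}^4\right),$$

and similarly for $\sqrt{n_i}(\hat\varsigma_{i,Y}^2 - \varsigma_{i,Y}^2)$. Keeping the sample size proportions within each study ($n_{i,X}/n_i$ and $n_{i,Y}/n_i$) fixed, we can write

$$\phi_i = g(\varsigma_{i,X}^2, \varsigma_{i,Y}^2) = \left[\frac{n_i}{n_{i,X}}\varsigma_{i,X}^2 + \frac{n_i}{n_{i,Y}}\varsigma_{i,Y}^2\right]^{-1}.$$

Taking derivatives, we find that

$$\frac{\partial g}{\partial\varsigma_{i,X}^2} = -\frac{n_i}{n_{i,X}}\phi_i^2, \qquad \frac{\partial g}{\partial\varsigma_{i,Y}^2} = -\frac{n_i}{n_{i,Y}}\phi_i^2,$$

and applying the delta method, we obtain

$$\sqrt{n_i}(\hat\phi_i - \phi_i) \xrightarrow{d} N\left(0, \phi_i^4\left[\left(\frac{n_i}{n_{i,X}}\right)^3(\kappa_{i,X} - 1)\varsigma_{i,X}^4 + \left(\frac{n_i}{n_{i,Y}}\right)^3(\kappa_{i,Y} - 1)\varsigma_{i,Y}^4\right]\right).$$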
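The size of the inflation factor for balanced studies can be checked by simulation. The sketch below is an illustration under assumptions made here: normal outcomes (kurtosis 3), the same population variance and sample size in every study, and hypothetical effect sizes and function names. It compares the empirical variance of the precision weighted average, computed with estimated variances, against the naive known-variance formula, and sets the ratio beside $1 + \frac{\kappa - 1}{4\varsigma^2}\zeta^2$.

```python
import random


def simulate_inflation(betas, n, pop_var=1.0, reps=20000, seed=1):
    """Ratio of the empirical variance of the weighted average to its naive variance.

    Each study is a balanced two-group comparison with n subjects in total,
    normal outcomes and common population variance pop_var (assumptions).
    """
    rng = random.Random(seed)
    k = len(betas)
    true_var = 4 * pop_var / n  # variance of each study's mean difference
    sd = true_var ** 0.5
    df = n - 2
    draws = []
    for _ in range(reps):
        num = den = 0.0
        for beta in betas:
            b_hat = rng.gauss(beta, sd)
            s2 = true_var * rng.gammavariate(df / 2, 2) / df  # estimated variance
            num += b_hat / s2
            den += 1 / s2
        draws.append(num / den)
    mean = sum(draws) / reps
    emp_var = sum((d - mean) ** 2 for d in draws) / (reps - 1)
    naive_var = true_var / k  # 1 / sum of true precisions
    return emp_var / naive_var


def theoretical_inflation(betas, pop_var=1.0, kurtosis=3.0):
    """1 + (kurtosis - 1) / (4 * pop_var) * zeta^2, with equal study weights."""
    k = len(betas)
    beta_f = sum(betas) / k
    zeta2 = sum((b - beta_f) ** 2 for b in betas) / k
    return 1 + (kurtosis - 1) / (4 * pop_var) * zeta2


if __name__ == "__main__":
    effects = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical, strongly heterogeneous
    print(theoretical_inflation(effects))
    print(simulate_inflation(effects, n=200))
```

With homogeneous effects the same function returns a ratio close to 1, which matches the observation that the naive interval is only miscalibrated in large samples when heterogeneity is present.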

E.2 LSSA for the mean difference between 2 matched samples
When the effect size in each study is the mean difference between 2 matched samples, we have that $\hat\beta_i = \bar X_i - \bar Y_i$, with $\mathrm{Var}(\hat\beta_i) = \sigma_i^2 = (\varsigma_{i,X}^2 + \varsigma_{i,Y}^2 - 2\rho\,\varsigma_{i,X}\varsigma_{i,Y})/m_i$, where $\rho$ denotes the population correlation between any 2 matched observations $X_{ij}$ and $Y_{ij}$ in study $i$ and $m_i = n_i/2$ is the number of paired observations. 48 Assuming a bivariate normal distribution for the observations $(X_{ij}, Y_{ij})$, the following asymptotic distribution can be obtained for the sample variances ($\hat\varsigma_{i,X}^2$, $\hat\varsigma_{i,Y}^2$) and sample covariance (ς i,XY = 1 .