Cumulative meta-analysis: What works

To present time-varying evidence, cumulative meta-analysis (CMA) updates results of previous meta-analyses to incorporate new study results. We investigate the properties of CMA, suggest possible improvements and provide the first in-depth simulation study of the use of CMA and CUSUM methods for detection of temporal trends in random-effects meta-analysis. We use the standardized mean difference (SMD) as an effect measure of interest. For CMA, we compare the standard inverse-variance-weighted estimation of the overall effect using REML-based estimation of between-study variance τ 2 with the sample-size-weighted estimation of the effect accompanied by Kulinskaya – Dollinger – Bjørkestøl ( Biometrics . 2011; 67:203 – 212) (KDB) estimation of τ 2 . For all methods, we consider Type 1 error under no shift and power under a shift in the mean in the random-effects model. To ameliorate the lack of power in CMA, we introduce two-stage CMA, in which τ 2 is estimated at Stage 1 (from the first 5 – 10 studies), and further CMA monitors a target value of effect, keeping the τ 2 value fixed. We recommend this two-stage CMA combined with cumulative testing for positive shift in τ 2 . In practice, use of CMA requires at least 15 – 20 studies.

What is already known • Cumulative meta-analysis is a popular method of evaluating and monitoring temporal changes in accumulating evidence. Statistical evaluation of CMA is especially relevant because of the growing popularity of living systematic reviews. • Repeated testing of the overall effect in CMA results in inflation of the Type 1 error rate. However, the amount of this inflation was not yet sufficiently quantified; it depends on the particular effect measure and on the choice of estimators.
What is new • In extensive simulations for CMA of SMD δ, we observed considerable, and continuously increasing with addition of further studies, negative biases of the inverse-variance-weighed overall effect. In contrast, the sample-sizeweighted overall effect is almost unbiased. • This bias produces much larger inflation of the Type 1 error rate for IV REML in comparison with SSW KDB. The use of inverse-variance-weighted estimation of SMD is not recommended in CMA. • A shift in the mean effect results in a very slow change of the overall effect in CMA, but in a considerable and rapid change in the estimated betweenstudy variance τ 2 . This change reduces the power of CMA. • To increase the power of CMA, we introduced a two-stage CMA, in which τ 2 is estimated in Stage 1 without further re-estimation in Stage 2. This twostage CMA with the sample-size-weighted overall effect performs better than the standard CMA and is recommended. • A new CMA method based on re-estimation of τ 2 using the QP-based test is also useful when τ 2 is small. We recommend use of both new methods simultaneously. • We also provide tables enabling choice of significance levels.

Potential impact for RSM readers outside the authors' field
• For CMA of SMD, we recommend the two-stage analysis based on SSW, used simultaneously with a QP-based test for shift in between-study variance. • CMA of SMD has rather low power, which is reduced by inertia of the cumulative estimation and by heterogeneity of the data. At least 15-20 studies are required for CMA to be useful in practice.

| INTRODUCTION
Meta-analysis, a statistical methodology that combines estimated effects from multiple studies on the same topic, to arrive at an evidence-based overall effect, had revolutionized many scientific fields, helping to establish evidence-based practices and to resolve seemingly contradictory research outcomes. 1 The conduct of metaanalysis provides a snapshot of evidence at one time point, but the evidence is not static: as it accumulates, new studies often challenge the results of previous studies. If the evidence changes over time, the conclusions of meta-analysis will strongly depend on when the review was conducted, and any policy-relevant recommendations derived from it will quickly go out of date. 1,2 Numerous studies have shown that substantial changes over time in the magnitude, the statistical significance, and even the sign of the reported effects are common in numerous disciplines, from biomedical research [2][3][4][5] to social sciences, [6][7][8][9] ecology and evolutionary biology, [10][11][12][13] and information systems. 14 These temporal trends could be caused by changes in the true effect, changes in study characteristics that influence the effect (known as moderators in meta-analysis) or biases (e.g., delay in the publication of studies with nonsignificant results). 10,15 Temporal trends are typically visually explored and often formally detected through cumulative meta-analysis (CMA), introduced by Lau et al. 16 CMA is a process of updating the results of an existing meta-analysis to incorporate new study results. It is one of the most popular ways to present timevarying evidence. 4,17,18 Other methods for detecting temporal trends are reviewed in Trikalinos and Ioannidis, 19 Kulinskaya and Koricheva, 20 and Koricheva et al. 21 Until recently, CMA was applied to systematic reviews identified to be in need of updating. 22,23 But the use of CMA has grown substantially because of the growing popularity of living systematic reviews, online summaries updated as new research becomes available. 24 CMA also provides a graphical representation of shifts in evidence associated with other factors such as sample size, precision, or study quality. 25,26 It is well understood that the repeated testing inherent in CMA inflates Type 1 error. This inflation was studied by simulation by Whitehead, 27 Hu et al., 28 and Thorlund et al. 29 among others; but, to our knowledge, it was not systematically quantified. A number of methods addressing this issue are based on methodology originally developed for sequential clinical trials. [30][31][32][33] Another approach uses quality control methods, in particular CUSUM charts. 20 However, the use of the random-effects model in CMA requires consecutive re-estimation of the between-study variance τ 2 as well as the overall effect δ, and makes both classes of methods problematic. 34,35 In this paper, we investigate the properties of CMA, suggest possible improvements, and provide an in-depth simulation study of the use of CMA and CUSUM methods for detecting temporal trends in random-effects meta-analysis. We use the standardized mean difference (SMD) as the effect measure. For CMA, we consider both the standard inverse-variance-weighted estimator of the overall effect (δ) with REML-based estimation of the between-studies variance (τ 2 ) and a sample-size-weighted (SSW) estimator of δ combined with the Kulinskaya-Dollinger-Bjørkestøl 36 (KDB) estimator of τ 2 , recommended by Bakbergenuly et al. 37 For all methods, we consider Type 1 error under no shift and power under a shift in the mean.
The requisite statistical methods are described in Section 2. Section 3 examines the properties of CMA and identifies the problems caused by gross overestimation of τ 2 resulting from a shift in the mean. Therefore, we suggest cumulative testing of τ 2 . Also, we modify CMA in the spirit of quality control methods. At Stage 1 (the first 5-10 studies), we estimate both δ and τ 2 , and then we 'monitor' known or estimated in Stage 1 value of effect, keeping the estimated value of τ 2 fixed. Section 4 presents the design and results of our simulations for standard and two-stage CMA methods and for CUSUM charts. Section 5 applies the methods to data on species richness, and Section 6 concludes with discussion. Our full simulation results are available as e-print. 38 R procedures implementing the proposed methods of CMA are available in the file CMA for SMD.txt and examples and R scripts of their use are provided in Appendices S3 and S4.

| Study-level estimation of SMD
Consider a meta-analysis of k comparative studies, each consisting of two arms, treatment (T) and control (C), with sample sizes n iT and n iC . The total sample size in study i is n i ¼ n iT þ n iC , and the ratio of the control sample size to the total is denoted by q i ¼ n iC =n i . The subject-level data in each arm are assumed to be normally distributed with means μ iT and μ iC and equal variances σ 2 i . The sample means are x ij , and the sample variances are s 2 ij , for i ¼ 1, …, k and j ¼ C or T.
The SMD effect measure is The unbiased estimator of δ i , sometimes called Hedges's g or Hedges's d, is where m i ¼ n iT þ n iC À 2, and J m ð Þ¼Γ m À Á , often approximated by 1 À 3= 4m À 1 ð Þ, corrects for bias. 39 The standard deviation, σ i is estimated by the square root of the pooled sample variance Þis the effective sample size in study i. An unbiased estimator of this variance is 39 The sample SMD g i has a scaled non-central t-distribution on m i d.f. and non-centrality parameter n i q i 1 À q i ð Þ ½ 1=2 δ i :

| Standard random-effects model
In meta-analysis of k studies, the standard random-effects model assumes approximately normal distributions of within-and between-study effects. For a generic measure of effect,δ resulting in the marginal distributionδ i $ N δ, σ 2 i þ τ 2 À Á .δ i is the estimate of the effect in Study i, and its withinstudy variance is σ 2 i , estimated byσ 2 i , i ¼ 1, …, k. τ 2 is the between-study variance, estimated (from k studies) byτ 2 k . The overall effect δ ¼ δ k ð Þ can be estimated from k studies by the weighted mean where theŵ iτ weights. The fixed-effect (FE) model assumes that τ 2 0, and the estimatorδ FE k ð Þ uses the IV weightsŵ i ¼ŵ i 0 ð Þ. The variance ofδ IV k ð Þ is routinely estimated bŷ The standard confidence interval for the overall effect δ k ð Þ uses the IV point estimator as its centre, and its halfwidth equals the estimated standard deviation (square root of the variance (6)) times the critical value from the normal distribution.

| Point and interval estimation of τ 2
Because theŵ iτ 2 k À Á in Equation (5) involveτ 2 k , we need to choose an estimator of τ 2 . It is well known that the maximum likelihood estimator of τ 2 is biased, and the restricted maximum-likelihood (REML) estimator is a good choice. 40 We useτ 2 REML in the IV estimator of δ and the Q-profile (QP) method 41 for the accompanying interval estimator of τ 2 . For SMD, this method is one of the best traditional methods. 37 The Q-profile confidence interval can be obtained from the lower and upper quantiles of F Q , the approximate cumulative distribution function of Cochran's Q statistic 42 The lower and upper confidence limits, τ 2 L and τ 2 U , can be calculated iteratively. This Q statistic is often assumed to have a Chi-square distribution with k À 1 degrees of freedom when τ 2 ¼ 0; F Q ¼ χ 2 kÀ1 in the original Q-profile method. 41 However, χ 2 kÀ1 is an adequate approximation only for very large sample sizes.
For SMD, Kulinskaya et al. 36 derived O 1=n ð Þ corrections to moments of Q and suggested using the Chisquare distribution with estimated degrees of freedom based on the corrected first moment of Q to approximate its distribution. Bakbergenuly et al. 37 proposed an estimator of τ 2 ,τ 2 KDB , based on this improved approximation.
An accompanying KDB confidence interval for τ 2 combines the Q-profile approach and the improved approximation by Kulinskaya et al. 36 Bakbergenuly et al. 37 demonstrated by simulation thatτ 2 KDB is less biased than REML for small sample sizes (n < 100), especially for k ≥ 10, and both estimators become almost unbiased from n ¼ 100. Both the QP and the KDB confidence intervals for τ 2 have too high coverage for τ 2 near zero. For larger τ 2 , QP performs well, and the KDB interval may be somewhat too liberal for small n.
Confidence intervals for τ 2 are related to tests of the null hypotheses τ 2 ¼ τ 2 0 . Values of τ 2 0 beyond the estimated confidence limitsτ 2 L ,τ 2 U À Á are rejected by a twosided α-level test, and values below the lower bound of the one-sided intervalτ 2 L , ∞ Â Á are rejected by a one-sided α=2-level test in favour of τ 2 >τ 2 L . We refer to these tests as QP and KDB.

| Point and interval estimators of δ
Traditional CMA based on REML usesτ 2 REML inδ IV k ð Þ (5) and in its estimated variance (6), in combination with the critical values from the normal distribution. We refer to this method as IV REML.
For SMD, the estimated effectsδ i (1) and their estimated variances v 2 i (2) are not independent. Because of this, the IV estimates of the overall effectδ IV k ð Þ are biased. Use of non-random weights eliminates this bias. The use of effective-sample-size weights (SSW) for estimation of δ, suggested by Hedges and Olkin, 43,p.110 was explored and found superior to IV weights in a comprehensive simulation study by Bakbergenuly et al. 37 These weights depend only on the studies' arm-level sample sizes: w i ¼ e n i ¼ n iT n iC = n iT þ n iC ð Þ ; e n i is the effective sample size in Study i. They coincide with the inverse-variance weights when g i ¼ 0 in (2). We refer to the estimator of δ of the form (5) with these weights as SSW and denote it byδ SSW . The interval estimator corresponding to SSW (SSW KDB) uses the SSW point estimator as its centre, and its half-width equals the estimated standard deviation of SSW under the randomeffects model times the critical value from the t-distribution on k À 1 degrees of freedom. The estimator of the variance ofδ SSW iŝ in which v 2 i comes from Equation (2). Once more, we refer to the tests of the hypothesis δ ¼ δ 0 based on the respective confidence intervals as IV REML and SSW KDB. In particular, the α-level twosided IV REML test compares its statistic d Varδ IV REML À Á À1=2 jδ IV REML À δ 0 j against the normal critical value z 1Àα=2 , and the SSW KDB test compares its sta-tisticVarδ SSW À Á À1=2 jδ SSW À δ 0 j against the critical value from the t-distribution on k À 1 degrees of freedom.

| CUSUM charts and their use in sequential meta-analysis
Methods of statistical quality control (QC) were initially developed in industrial applications of statistics, and are now commonly used in medicine, epidemiology and public health (e.g., to detect a start of an epidemic or to monitor quality of hip implants). Their use in meta-analysis for detection of temporal trends was suggested by Kulinskaya and Koricheva. 20 A process that is operating with only random causes of variation is said to be in statistical control; otherwise, the process is out of control. 44 The standard QC application includes two stages: a setup stage where the parameters of a process in control are estimated, and the subsequent monitoring stage. At the monitoring stage, the samples (of size n) are collected on a regular basis; and if the values of interest (say, the sample mean x n ) fall within the control limits on the control chart and do not exhibit any systematic pattern, the process is considered to be in control.
The cumulative sum or CUSUM chart 45 is one of the most efficient QC tools. It is equivalent to a sequential likelihood ratio test of the null hypothesis H 0 (the process is in control) against an alternative H 1 (the process is out of control). The cumulative log likelihood ratio (LLR) of H 1 versus H 0 is plotted at every step i, i ¼ 1, 2, …, and the test stops in favour of H 1 when the LLR is large. 46 In standard applications, the null hypothesis H 0 : Þ of a practically relevant amount of shift Δ (times the standard deviation) in the underlying mean. Usually, two one-sided CUSUM charts (for positive and negative deviations) are plotted simultaneously. An upper one-sided CUSUM testing the hypothesis H 0 : Δ ¼ 0 versus H 1 : Δ > 0 can be written as a cumulative LLR: Þ . In meta-analysis, when there is no temporal shift, the process is in control, and all effect estimates are approximately normally distributed with the same mean, If a shift of size Δ occurs at some time point, the mean of the process deviates from δ, so that , and the process can be considered out of control. Applying the CUSUM chart to meta-analysis, the LLR for study i is 20 where the weights are the inverse variances of theδ i . Thus, the upper CUSUM accumulates weighted deviations from the target value that are greater than Δ=2. Under positive shift, the expected path of a CUSUM increases linearly with the number of positive deviations and with their average weight, until it crosses the chosen control limit.
An important notion associated with the chosen significance level in sequential testing is the average run length (ARL) of the control chart. The ARL is the expected number of points plotted before a point falls outside the control limits. The CUSUM signals as soon as x i > h. The value of the control limit, h, is chosen to provide good ARL values. The standard choices are h ¼ 4 or h ¼ 5 standard deviations. These values correspond to an ARL of 168 and 465, respectively, when the process is in control, and to an ARL of 4.75 and 5.75, respectively, for a shift of 1:5σ . 44 We use the R package qcc 47 for CUSUM analysis.
The use of QC charts in meta-analysis is justified in the fixed-effect model; but in the random-effects model, re-estimation of τ 2 introduces dependence between the sequential effect estimates, and hence their distribution is not consistent with the standard assumption of independent increments in the QC charts. 35

| ESTIMATION AND TESTING IN CMA
Consider K consecutive studies and a simple randomeffects model for study-level effects, with a simple shift in the mean (at study k 1 þ 1): where G Á ð Þ is an arbitrary distribution with mean δ i þ ΔI i > k 1 ð Þand variance σ 2 i , K is the maximum number of studies, I i > k 1 ð Þ is the 0/1 indicator and 1 < k 1 < K. In the shift-in-the-mean model (9), the true SMD shifts from δ in the first k 1 studies to δ þ Δ in studies k 1 þ 1, …, K. The standard REM (4) holds for the first k 1 studies and, separately, for the studies from k 1 þ 1 to K when the distribution G Á ð Þ is normal. For SMD, the distribution G Á ð Þ is given by Equation (3).
In this section, we consider some general CMA patterns under this model and suggest several approaches to testing.

| Estimation of δ
Define cumulative overall mean effect δ k ð Þ of k ≤ K studies to be a weighted mean of study-level effects δ i , whenever the number of studies k increases.
If there is no shift in δ, the cumulative mean δ k ð Þ δ. However, if there is a shift in δ at study k 1 þ 1, δ k ð Þ becomes a weighted average of δ and δ þ Δ values.
Given a set of weights w i , i ¼ 1, ÁÁÁ, K, and denoting the running sum of weights W k ¼ P k 1 w i , the cumulative mean effect for k ≤ k 1 studies is δ, and for k > k 1 studies it is Consider, for simplicity, that the sample sizes are equal between arms and among studies, n iC ¼ n iT n=2, so that effective sample sizes are e n i n=4. Thus, the SSW estimator uses equal weights w i ¼ n=4, and the cumulative mean Under the scenario of no shift in δ or τ 2 , the population IV weights should be equal. Given a positive shift in the mean from δ to δ þ Δ, the variances of the g i , i ≥ k 1 þ 1 also increase, and the weights decrease accordingly. Therefore, with IV weights the cumulative mean effect (10) will be reduced, compared with SSW. This may reduce the power of the CMA based on the IV weights. Figure S1 in Appendix S2 illustrates changes in δ SSW k ð Þ and in δ IV k ð Þ under the scenario of equal sample sizes. The difference between the two cumulative mean effects is the largest when τ 2 ¼ 0; but it decreases in τ 2 , and for τ 2 > 0 it also decreases rapidly with n and is negligible at n ¼ 100.
However, the estimated variances v 2 i given by Equation (2) and, therefore, the IV weightsŵ i are random variables and are not exactly equal. Additionally, these random weights are not independent from the estimated effects g i , resulting in order 1=n bias of the estimated cumulative mean effectδ IV k ð Þ , even under the no-shift scenario, as we demonstrate in Section 4.4.

| Estimation of τ 2
To understand the pattern of changes in the estimated value of τ 2 in CMA, denoted byτ 2 A number of estimators of τ 2 , including KDB, are based on the first moment of Q. Consider the same simple scenario of equal sample sizes and a shift in the mean from δ to δ þ Δ at study k 1 þ 1 given by (9).
Let ξ i ¼δ i À ΔI i > k 1 ð Þ. The variables ξ i coincide witĥ δ i under the hypothesis of no shift. Writê It follows that The first term of Q k ð Þ accounts for between-study heterogeneity of ξ i . The mean of the second term is zero, and we expect it to be reasonably small. And the third term reflects the contribution of the shift Δ. Under no shift, only the first term is nonzero.
The moment-based estimators of τ 2 described in Der-Simonian and Kacker, 48 which include the popular Der-Simonian-Laird estimator, 49 and the recent SDL estimator 37 where E 0 Q ð Þ is the expected value of Q under homogeneity (i.e., when τ 2 ¼ 0), the denominator is the multiplier at τ 2 in the Taylor-series expansion of whereτ 2 ξ is the estimator of the heterogeneity variance under no shift. For the case of equal sample sizes, the weights are and then it increases in k almost linearly. This increase in estimated τ 2 does not depend on the sample size.
In our simulation study, we do not useτ 2 M , but both the KDB and the popular Paule-Mandel estimator 50 are also based on the first moment of Q k ð Þ , and we expect similar behaviour from these estimators. REML is not a moment estimator, but our simulation results, discussed in Section 4.3, demonstrate this behaviour for both KDB and REML (Figure 1).

| Testing for a shift in the mean in CMA
Consider consecutive testing for a shift at k ¼ 1, …,K in the model (9) based on the cumulative estimateŝ δ k ð Þ ,τ 2 k ð Þ from K studies ordered over time. At each k, the standard CMA tests H 0 : δ k ð Þ ¼ δ 0 against δ k ð Þ ≠ δ 0 for a known value of δ 0 , at the same level α. We consider two types of tests, described in Section 2.4, IV REML and SSW KDB. These test statistics have the form Þ is the total sample size in the k studies. This is easier to see from Equation (8) for the variance ofδ SSW , but it is also true forδ IV REML . Therefore, for sufficiently large sample sizes, these statistics are approximately equivalent to the ratios ffiffiffi k pδ k ð Þ À δ 0 À Á =τ k ð Þ . The power of these tests depends primarily on the value of ffiffi ffi k p δ k ð Þ À δ 0 À Á =τ k ð Þ , but it may be distorted by biases ofδ k ð Þ andτ k ð Þ . values with their confidence intervals for increasing k, and a δ 0 outside the kth confidence interval is rejected by the kth test. An example of these CMA plots is provided in Row 2 of Figure 7; see Section 5 for further details. As discussed in Section 3.2, the cumulative estimatê τ 2 k ð Þ increases with the shift in the mean. Because of this, we also consider two tests, QP and KDB, of τ 2 k ð Þ ≤ τ 2 0 versus τ 2 k ð Þ > τ 2 0 for a known value of τ 2 0 as possible tests for shift. These tests are also performed consecutively for increasing k. However, based on (11) these tests, described in Section 2.3, are likely to be more powerful for early shifts (small k). Similar to the standard CMA tests forδ k ð Þ , these tests are easily performed visually on a forest plot of cumulativeτ 2 k ð Þ values. Examples of these plots appear in the first row of Figure 7.
From the findings in Section 3.2, it also follows that for a test of a shift based onδ k ð Þ , the accompanying increase inτ 2 k ð Þ may decrease its power. It would make sense to use a sufficiently well-estimated value of τ 2 before this increase occurs. Because of this, we consider the following modification to traditional CMA, which we refer to as the two-stage CMA.
Borrowing the quality control concept of two-stage monitoring, in Stage 1 we aim to estimate τ 2 . Therefore, we test for δ k ð Þ ¼ δ 0 for k ¼ 5, …,10 using IV REML or SSW KDB. If that test rejects at k 1 þ 1, we takeτ 2 k 1 (estimated by REML or KDB, respectively) as the 'true' value τ 2 0 ; otherwise we take τ 2 0 ¼τ 2 10 ð Þ . The rationale for this choice of 5 ≤ k ≤ 10 studies for estimating τ 2 in Stage 1 is that an estimate of τ 2 is not sufficiently precise for k < 5, and is reasonably accurate by k ¼ 10 under no shift (see Section 4.3). However, we do not want to use an inflated value of τ 2 if an early shift does occur. In Stage 2, we monitor the (known) mean effect δ 0 , using the τ 2 0 value estimated in Stage 1 as the known between-study variance τ 2 k ð Þ without re-estimating it. We refer to the respective tests and estimators as IV REML (τ 2 0 ) and SSW KDB (τ 2 0 ). We also consider a more realistic scenario of monitoring the estimated value of δ (estimated in Stage 1 as described above at k 1 þ 1 or at k ¼ 10) instead of the 'known' δ 0 . We refer to these tests as IV REML (δ 0 ) and SSW KDB (δ 0 ) when further testing proceeds as in the standard CMA (i.e., with τ 2 re-estimated each time), and IV REML (δ 0 , τ 2 0 ) and SSW KDB (δ 0 , τ 2 0 ) when also using the value of τ 2 0 estimated in Stage 1. For comparison, we also use two-stage CUSUM charts with h ¼ 4, 5 and 6 to monitor the known mean δ 0 for the effectsδ i with variances v 2 i þ τ 2 0 . Table 1 gives a comprehensive summary of methods and abbreviations.

| Simulation design
In the majority of our simulations we use K ¼ 50 as the maximum number of studies, though in some scenarios we also use K ¼ 100 and 1000. For each combination of parameters, we use equal sample sizes from n ¼ 20 to 1000 in all K studies. The sample sizes in the treatment and control arms are equal.
We use a total of 10, 000 repetitions for each combination of parameters. Thus, the simulation standard error for estimated coverage of δ or τ 2 at the 95% confidence level, or testing at 0:05 level is roughly ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 0:95 Â 0:05=10, 000 p ¼ 0:00218. The simulations were programmed in R version 3.3.2.
As the number of studies increases in CMA, with k ≤ K, we examine bias and coverage in estimation of δ and τ 2 , and the accumulating rejection rates (Type 1 error or power) of tests for the shift in the mean effect and in the between-study variance τ 2 . We also consider the cumulative signalling rate and the ARL in CUSUM analysis. 44 We consider both a constant value of δ (the null hypothesis of no shift in the mean) and a shift from δ ¼ 1 to δ ¼ 2 at 0:25, 0:5 and 0:75 of the total number of studies. Our summaries of results in Sections 4.3-4.7 include illustrative figures and are based on examination of the figures in our ePrint. 38 We vary five parameters: the overall true SMD (δ) and the between-studies variance (τ 2 ), in addition to the maximum number of studies (K), the point of shift (if any) ðk 1 Þ, and the studies' total sample size (n). Table 1 lists the values of each parameter.
We generate the true effect sizes δ i from a normal distribution: δ i $ N δ, τ 2 ð Þ. We generate the values of Hedges's estimator g i directly from the appropriately scaled non-central t-distribution, given by Equation (3), and obtain their estimated within-study variances from Equation (2).
We study two approaches to point and interval estimation and testing of τ 2 (REML/QP and KDB) and two resulting approaches to point and interval estimation and testing of δ (IV REML and SSW KDB). Each of these two approaches was studied in the four CMA settings listed in Table 1: traditional CMA setting with known δ 0 value, CMA with estimated in Stage 1 value of δ 0 , the two-stage CMA with known value of δ 0 , and the two-stage CMA with the estimated in Stage 1 value of δ 0 ( Table 2).

| Outcomes recorded in simulations
In all simulations, we assumed the shift-in-the-mean model (9) and, for the CMA methods of interest, we studied the bias of the point estimatorsδ k ð Þ of the cumulative mean δ k ð Þ (10) and (for the standard CMA) the bias ofτ 2 k ð Þ in estimating τ 2 for 5 ≤ k ≤ K. We also investigated coverage of the δ k ð Þ and of τ 2 by the relevant interval estimators and empirical levels of the accompanying twosided tests for the null hypothesis of no shift in δ and (separately) of one-sided tests of no shift in τ 2 . We also investigated cumulative Type I errors of these tests at the 0.05, 0.01 and 0.005 levels for δ and at the 0.025, 0.005 and 0.0025 levels for one-sided tests for τ 2 .

| Bias ofτ 2 k ð Þ
When there is no shift in δ, both estimators of τ 2 (REML and KDB) have non-negligible positive bias for small k, especially for small sample sizes (n ≤ 50), Figure 1. KDB retains small positive bias for larger values of k, whereas the bias of REML becomes negative when τ 2 > 0. Biases do not depend visibly on the value of δ, but they increase in k and increase considerably in τ 2 . The bias of REML is about À0:04, and the bias of KDB is þ0:04 when n ¼ 20, τ 2 ¼ 0:25 and k ¼ 10, compared with À0:10 and þ0:07 when τ 2 ¼ 1. The biases decrease in n; when n ¼ 100, KDB is practically unbiased, and REML has small negative bias of about 2%.
From the point of a shift, both estimators of τ 2 increase rapidly, KDB somewhat faster than REML. However, for larger n, the behaviour of REML converges to that of KDB, and the difference between the estimators is negligible at n ¼ 100.

| Bias and coverage ofδ k ð Þ
The cumulative effect estimated by SSW is almost unbiased under all simulated conditions, regardless of the value or a shift in δ. In Figure 2, SSW coincides with its expected value, given by Equation (10). IV REML is also unbiased when δ ¼ 0 (not shown). However, IV REML has a small negative bias, up to about 5%-7%, when n ¼ 20 and δ ¼ 1, Figure 2. The bias increases in δ and in k. It also increases, though rather slowly, in τ 2 . The bias decreases for larger sample sizes; when n ¼ 100 and δ ¼ 1, the bias is about 1.5%. After the shift in δ, IV REML is somewhat lower than SSW, and it deviates from its nominal mean (10), but these differences decrease in sample size and are practically eliminated by n ¼ 100.
As illustrated by Figure S2 in Appendix S2, coverage of SSW KDB is rather conservative (i.e. above nominal) for small numbers of studies, but it improves for larger values of k and τ 2 . When n ¼ 20 and k ≤ 5, coverage of IV REML is somewhat conservative when τ 2 ¼ 0 and δ ¼ 0, but it drops below nominal for larger k when δ ¼ 1. For larger sample sizes, IV REML provides stable, if somewhat conservative, coverage when τ 2 ¼ 0. When τ 2 > 0, IV REML has very low coverage for k ≤ 10, and it does not improve much in n. Coverage at the nominal 95% level is about 85%-90% when k ¼ 20 and τ 2 ≥ 0:25, and it remains below nominal when n ¼ 100. Coverage is visibly reduced for δ > 0. As we shall see in the next section, this liberal coverage translates into higher Type 1 error in CMA.

| Level and power of tests for δ in CMA
Because of multiple testing over the increasing number of studies k, the empirical levels of SSW KDB and IV REML at the same nominal level are increasing in k, but the empirical levels of SSW KDB are considerably lower. The difference between the two methods is more pronounced for larger values of δ; see Appendix S2, Figures  S3 and S4 for δ ¼ 0 and δ ¼ 0:5 up to K ¼ 50 and Figure  S5 for δ ¼ 1 and K up to 1000. Tables S1-S3 in Appendix S1 provide empirical levels at selected values of k at the nominal 0.05, 0.01 and 0.005 levels for δ ¼ 0, 0:5 and 1. As an example, at the nominal 0.05 level, these levels for δ ¼ 1 are 0.048 for SSW KDB versus 0.118 for IV REML at k ¼ 12 and 0.089 versus 0.187 at k ¼ 25 for n ¼ 20 and τ 2 ¼ 0. These levels increase further in τ 2 (0.079 vs. 0.177 at k ¼ 12 for n ¼ 20 and τ 2 ¼ 0:1); and they increase somewhat in n, in this example to 0.150 versus 0.253 at k ¼ 12 for n ¼ 1000 and τ 2 ¼ 0:1. Testing at the lower nominal levels makes sense for larger values of k, τ 2 and/ or n, see Tables S1-S3 for some guidance.
The power of SSW KDB and IV REML is comparatively low. Figure 3 shows empirical levels for shift from 1 to 2 at Study 26. For both methods, the power is highest when τ 2 ¼ 0 and deteriorates considerably in τ 2 . Taking into account its lower level, SSW KDB is more powerful than IV REML. The power increases in n, and by n ¼ 100 both methods reach power 80% at 31 studies when τ 2 ¼ 0 and at 36 studies for τ 2 ¼ 0:25. Choosing the nominal level of 0.01 safeguards the empirical levels about 0.05 at k ¼ 25 and reduces the power of CMA accordingly. Table S6 provides the number of studies needed to reach power of 80% and 90% for a shift from δ ¼ 1 to δ ¼ 2 at k ¼ 13, 26 and 38. As expected, the power of all tests is lower at smaller shifts, but the direction of shift does not seem to matter ( Figures S9-S20 in Appendix S2).

| Level and power of tests for τ 2 in CMA
We studied one-sided tests for τ 2 k ð Þ > τ 2 0 , and typical results for Δ ¼ 1 at k ¼ 26 are depicted in Figure 4 for nominal levels 0.025, 0.005 and 0.0025. Multiple testing inflates empirical levels, more so for KDB than for QP. Table S4 in Appendix S1 provides empirical levels at selected values of k at the nominal 0.025, 0.005 and 0.0025 levels for δ ¼ 1. The power increases in n and decreases in τ 2 . When τ 2 ¼ 0, the power is quite high from n ¼ 50; but when τ 2 ¼ 0:25, the power reaches 80% for both tests only at k ¼ 41 for n ¼ 100. Power is extremely low when τ 2 ¼ 1, even for very large sample sizes; for shift at k ¼ 26, power barely reaches 30% at k ¼ 50 when n ¼ 1000 (not shown).

| Comparing tests for shift in δ in one-and two-stage CMA
When there is no shift in δ, two-stage CMA and the standard one-stage CMA have very similar inflation of the empirical levels. However, two-stage CMA is somewhat more powerful under the shift. This difference in power is clear for KDB SSW (τ 2 0 ) from n ¼ 20, and for IV REML (τ 2 0 ) from n ¼ 50, Figure 5 and Figure S6. This difference in power is explained by inflation in the estimated τ 2 k ð Þ in the standard CMA, as discussed in Section 3.3.
For comparison, Figure 5 and Figures S6-S20 in Appendix S2 also include CUSUM-based CMA with h ¼ 4 and 5 along with the CMA tests at the 0.05 and 0.01 F I G U R E 2 Weighted cumulative effects δ SSW k ð Þ and δ IV k ð Þ given by Equation (10)  nominal levels, respectively. Under no shift, CUSUMbased analysis results in greater inflation of the empirical levels, but equally, it has more power under shift. Similarly to other tests, its power increases in n and decreases in τ 2 . Figure 5 and Figures S6-S20 also include the twostage methods using an estimated in Stage 1 mean effect δ 0 . In this scenario, the 'standard' CMA methods, which re-estimate τ 2 k ð Þ , have especially inflated levels, and the two-stage methods, which use the estimated τ 2 0 , are clearly the better choice. Unexpectedly, methods that use two parameters estimated at Stage 1 (δ 0 , τ 2 0 ) have somewhat lower Type 1 error and somewhat more power than the comparative methods using known δ 0 , especially for IV REML. Table S5 in Appendix S1 provides empirical levels of two-stage CUSUM analysis of k studies with h ¼ 4, 5 and 6 when δ ¼ 1. Table S7 provides empirical levels of twostage CMA of k studies at nominal levels 0.05, 0.01 and 0.005 when δ ¼ 1, and Table S8 gives the number of studies (k) required for 80%/90% power for detecting a shift from δ ¼ 1 to δ ¼ 2.

| EXAMPLE
As an example, we use data by Bat ary et al. 51 on the role of agri-environment management schemes in conservation and environmental management. We use the data on species richness from 39 studies published from 1992 to 2010, with mostly small to medium sample sizes, ranging from 2 to 37 per arm, though one study has 152 observations in each arm. The effect measure is SMD, and positive values correspond to higher species richness in the extensive (organic) than in the intensive (conventional) fields. The majority of the studies originated from European countries and compared conventional with organic management. The original meta-analysis did not take chronology into account. Observations of multiple taxa and/or of different geographical regions in an individual study were included separately in the dataset, resulting in 109 records in total. We chose a single sub-study with median value ofδ from each study. Figure 6 provides the raw data and forest plot, depicting the 39 sub-study effects with corresponding 95% confidence intervals.
Visual examination of the forest plot shows that the effects in the first 18 studies seem to hover somewhat above zero, and Study 19 is a high outlier. The next subset of effects (Studies 20-33) is somewhat more positive, and Study 33 is another high outlier. The last subset of effects (Studies 34 to 39) seems to drift back toward zero. These observations are clearly confirmed by the plots of cumulative τ 2 values in the top row of Figure 7, using both QP and KDB confidence intervals. Heterogeneity is very high in the first eight studies, but then it settles down and is relatively low up to Study 18. It jumps at Study 19, but then decreases from Study 20-33, indicating that Study 19 is just an outlier and not the start of a shift. The same happens at Study 33. In this example, the KDB values ofτ 2 are higher than the REML values.
In the second row of Figure 7, IV REML provides higher estimates ofδ k ð Þ and narrower confidence intervals than KDB SSW. As we expected from the simulations, IV REML is less conservative and shows a significantly positive effectδ ð8Þ; IV REML ¼ 0: In this example, which involves no significant shifts in effects, SSW KDB CMA appears to be too conservative. However, as our simulations demonstrate, this method would result is a much lower false-positive rate. Different methods provide complementary information, and in practice we therefore recommend the use of multiple plots, including the forest plot, the CMA plot for τ 2 and the two-stage CMA plot for δ. publications on the same topic are available over a number of years. The multiplicity problems inherent in CMA are well known, and a number of alternative statistical methods aimed at resolving these problems are available. However, this does not seem to hinder the popularity of CMA in applied research.
Therefore we investigated, theoretically and by simulation, the level and power of CMA and how to improve both. For the popular effect measure SMD, we compared two approaches for CMA: the first (IV REML) is based on the popular REML estimation of the between-study variance τ 2 and the inverse-variance method for combining the evidence, whereas the second (SSW KDB) is based on the effective-sample-size weights and the KDB estimator of τ 2 . 36 Our simulations clearly demonstrate that the SSW KDB analysis is a much better option when δ ≠ 0.
From theoretical consideration of CMA in Section 3, we recognized the issues with variance inflation in CMA when a shift in the mean occurs and suggested therefore a two-stage approach to CMA, as well as testing for a shift in τ 2 . Our simulations show that the two-stage CMA performs better than the standard one-stage CMA on both Type 1 error and power. Testing for τ 2 also works well for small-to-moderate values of τ 2 ≤ 0:25; the Q-profile method 41 is the preferred option. However, this method has very low power for larger τ 2 . For all studied methods, smaller shifts or higher heterogeneity results in lower power, but the direction of shift does not seem to matter. Power increases considerably in sample size n. However, even for n ¼ 1000 and a large shift in δ from 1 to 2, at least three to five studies after the shift are needed to achieve 80% power at the 0:01 level when τ 2 ¼ 0:1, and seven to nine studies are needed when τ 2 ¼ 0:25. Overall, at least 15-20 studies are required to use any version of CMA.
In our simulations, we also considered CUSUM charts, suggested by Kulinskaya and Koricheva,20 and modified them for random-effects MA by adding estimated at Stage 1 between-study variance τ 2 0 . However, this resulted in too high a Type 1 error rate, and we do not recommend this method. A practical recommendation is to run simultaneously two analyses: testing for cumulative τ 2 at the 0.005 level using the Q-profile method, and the two-stage testing for shift in the mean effect at the 0.01 level using SSW KDB, with either known or estimated in Stage 1 target value of δ, and using the constant value of τ 2 0 estimated in Stage 1 by KDB. The suggested levels guarantee an overall level close to 0.05 for k ≤ 26 studies, as the two tests at levels α 1 and α 2 , with rejection if at least one of them rejects the null hypothesis, result in approximately an α 1 þ α 2 level. Somewhat higher levels would be possible for lower numbers of studies and/or lower between-study variances.
We studied only simple scenarios of equal sample sizes within and between studies, but we anticipate CMA to have even lower power in more realistic unbalanced settings. A study of the use of CMA for other effect measures would also be of interest for practitioners, though we do not believe that the power would improve. We considered only simple alternatives of a shift in δ at one point. Other realistic alternatives may include linear or nonlinear trends in effects and other more complicated options. Lai 52 provides a comprehensive review of the use of sequential methods for a wide class of alternatives. A critical review of these methods for applications in meta-analysis would be very useful.
In the case of high heterogeneity, the power of all CMA methods is very low. It would be of interest to consider the use of runs tests, which are routinely used in a similar quality control context, 53 and which may increase the power of CMA. Another important extension of CMA would be methodology for cumulative analysis when the heterogeneity is reduced through meta-regression. We shall address these possible improvements in further research. 53. Koutras M, Bersimis S, Maravelakis P. Statistical process control using Shewhart control charts with supplementary runs rules. Meth Comput Appl Prob. 2007;9:207-224.

SUPPORTING INFORMATION
Additional supporting information may be found in the online version of the article at the publisher's website.