Evaluation of statistical methods used to meta‐analyse results from interrupted time series studies: A simulation study

Abstract Interrupted time series (ITS) are often meta‐analysed to inform public health and policy decisions but examination of the statistical methods for ITS analysis and meta‐analysis in this context is limited. We simulated meta‐analyses of ITS studies with continuous outcome data, analysed the studies using segmented linear regression with two estimation methods [ordinary least squares (OLS) and restricted maximum likelihood (REML)], and meta‐analysed the immediate level‐ and slope‐change effect estimates using fixed‐effect and (multiple) random‐effects meta‐analysis methods. Simulation design parameters included varying series length; magnitude of lag‐1 autocorrelation; magnitude of level‐ and slope‐changes; number of included studies; and, effect size heterogeneity. All meta‐analysis methods yielded unbiased estimates of the interruption effects. All random effects meta‐analysis methods yielded coverage close to the nominal level, irrespective of the ITS analysis method used and other design parameters. However, heterogeneity was frequently overestimated in scenarios where the ITS study standard errors were underestimated, which occurred for short series or when the ITS analysis method did not appropriately account for autocorrelation. The performance of meta‐analysis methods depends on the design and analysis of the included ITS studies. Although all random effects methods performed well in terms of coverage, irrespective of the ITS analysis method, we recommend the use of effect estimates calculated from ITS methods that adjust for autocorrelation when possible. Doing so will likely to lead to more accurate estimates of the heterogeneity variance.


Highlights
What is already known An interrupted time series (ITS) study is a non-randomised design in which data are collected repeatedly over time before and after an interruption (such as the introduction of a bicycle helmet law).The results from multiple ITS studies may be statistically combined using meta-analysis methods; the findings of which underpin conclusions informing public health or policy decisions.The performance of the statistical methods for analysing single ITS studies has been shown to depend on the length of the series and the underlying correlation between consecutive data points (i.e., autocorrelation).As well, the performance of meta-analysis methods is known to depend on the number of included studies and the underlying variability in the study intervention effects.

What is new
We undertook a numerical simulation study to examine the performance of meta-analysis methods in the context of multiple ITS studies.We found that all meta-analysis methods yielded unbiased estimates of the interruption effects.Furthermore, we found that all random effects methods yielded coverage close to the nominal level, irrespective of the ITS analysis method used and other design features (e.g., the magnitude of heterogeneity).However, heterogeneity was frequently overestimated in scenarios where the ITS study standard errors were underestimated, which is more likely to arise when ITS analysis methods do not appropriately account for autocorrelation.We therefore recommend that metaanalysts should strive to use effect estimates and standard errors that have been calculated from ITS methods that adjust for autocorrelation.Potential impact for RSM readers outside the authors' field ITS studies and systematic reviews of ITS studies are used across disciplines and topics (e.g., public health, crime, economics, war and psychology) to investigate the impact of interruptions.Our findings and recommendations are therefore likely to apply across disciplines.

| INTRODUCTION
Healthcare policy decision-making is often informed by systematic reviews examining the impact of policy or public health interventions, or the impact from exposures such as natural disasters (both referred to as 'interruptions' hereafter).These reviews may need to consider evidence beyond randomised trials, as it is not always possible to randomise interventions targeted at populations (e.g., when evaluating the impact of a media campaign broadcast to an entire country). 1,27][8] The results from multiple ITS studies within systematic reviews may be statistically combined using meta-analysis methods; the findings of which underpin review conclusions. 9,10efore proceeding with meta-analysis of ITS studies, there is a range of issues for analysis of a single ITS study that requires consideration.In a single ITS study, data are often collected continuously over time pre-and postinterruption.Commonly the data are aggregated using summary statistics (such as means or proportions) over regular time intervals (e.g., weekly or monthly) for analysis. 11 commonly fitted model structure is a segmented linear model, [12][13][14] which allows estimation of separate underlying time trends in the pre-interruption period and the postinterruption period.The estimated time trend in the preinterruption period can be used to predict what would have occurred in the absence of the interruption, thus providing a counterfactual for comparison with what was observed, using the estimated post-interruption time trend.Several effect metrics can then be calculated to quantify the impact of the interruption; commonly these include an immediate level change, and a change in slope from pre-interruption period to post-interruption period 15 (see, e.g., Figure 1a).Researchers aiming to include ITS in a meta-analysis may need to re-analyse the original data (which is often possible when data are presented in figures in primary publications 12,16 ) to calculate interruption effects using desired effect metrics, and appropriate statistical methods. 14,17range of statistical methods are available for estimating the regression parameters and effect estimates from a segmented linear model. 12,18,19While Ordinary Least Squares (OLS) is often used, it fails to account for the potential correlation of consecutive time points (known as autocorrelation or serial correlation), which is a key characteristic of time series data. 20,21Failing to account for autocorrelation may lead to incorrect estimates of the standard errors of the regression parameters. 22,23Several methods that attempt to account for potential autocorrelation include, for example, generalised least squares methods (e.g., Prais-Winsten (PW) 24 ), and Restricted Maximum Likelihood (REML). 25A numerical simulation study has provided insight on the performance of these statistical methods (and others) for analysing single ITS studies with continuous outcomes using segmented linear models. 17The authors found that performance differed across the methods, but that REML was often preferable to the other methods; however, its performance was dependent on the length of the series and the underlying magnitude of autocorrelation.
Univariate meta-analysis may be used to estimate a combined effect across ITS studies 26,27 (see, e.g., Figure 1b).Commonly a two-stage meta-analysis approach is used, 8 whereby the interruption effect estimates (e.g., level-change or slope-change) are calculated for each ITS study, and then statistically combined. 27,28The effects are commonly combined assuming the fixed (or common) effect model, or the random effects model. 8,29,30The fixed-effect approach requires estimates of the interruption effect and its standard error only, while the random-effects approach additionally requires the estimation of the between-study variance. 10,30umerous between-study variance estimators exist (e.g., the DerSimonian and Laird (DL) and REML estimators) 31 and, in addition, numerous methods are available for calculating the confidence interval of the combined effect (e.g., Waldtype (WT) and Hartung-Knapp/Sidik-Jonkman (HKSJ) methods). 322][33] The DL estimator, commonly used to estimate the between-study variance, is well known to have suboptimal statistical properties in circumstances where there are few studies and the underlying statistical heterogeneity is large. 21,34he REML estimator has been proposed as an alternative because it has been shown to yield less biased estimates of the between-study variance compared to DL. 12 The WT method, a commonly used method for calculating confidence intervals, has been shown to yield less than nominal coverage levels when there are few studies or when the underlying between-study variance is large, or both.32 The HKSJ method has been shown to yield wider confidence intervals than the WT, although may yield narrower confidence intervals when the number of included studies is small or true between-study variance is small.35 While ITS analysis methods have been evaluated at the individual study level, 17,22 and the meta-analysis methods have been evaluated generally, 31,32,36,37 neither has been evaluated in the context of multiple ITS studies.This context necessitates consideration of both individual study (re-)analysis and meta-analysis simultaneously.Hence, in this simulation study, we aimed to examine the performance of different univariate meta-analysis methods to combine results from ITS studies, and how characteristics of the meta-analysis, ITS design, and ITS analysis methods, modify the performance.Specifically, we examined how the performance was altered when the magnitude of first order [also known as lag-1 or AR (1)] autocorrelation, series length, degree of heterogeneity in the interruption effects and number of included ITS studies were varied.We limited the meta-analyses examined to those that included ITS studies with continuous outcomes, a fixed number of data points, an equal number of data points pre-and post-interruption, and the same pre-interruption level and slope.We did not consider scenarios or statistical methods that include control series. Webegin by describing an illustrative example, to which we later return to demonstrate the impact of applying the methods evaluated in the simulation study.

| Illustrative example
Lane et al undertook a study to examine the effects of cannabis legislation on traffic fatalities. 38Using routinely collected traffic fatality data from 11 states in the United States of America, they examined whether legalising recreational cannabis had an impact on traffic fatality rates in legalising and neighbouring states.The outcome was monthly traffic fatality rates per million residents.The study also included 19 location-based control states which had not legalised recreational cannabis and were not neighbours of the legalising states; these are ignored for this illustrative example.The series had 96 monthly datapoints per series and there were 11 series available for metaanalysis.Fitting segmented linear regression models to each of the state's time series allows estimation of the impact of the legislation immediately (level-change) as well as any change in trend (slope-change; Figure 1a).These effect estimates can then be combined using meta-analysis methods, which also allows investigation of the consistency of the effects across the series (in this example, states).In Section 2, we describe several statistical methods available for estimating the interruption effects for ITS studies and for metaanalysing the resulting estimates.In Section 3.8, we apply these methods to this example and compare the results.background and Aims were described above, while in subsequent sections we outline the Data generation mechanisms (Sections 2.1 and 2.2.1), the Estimands (and their estimation procedures of interest, Section 2.2.2),Methods and Performance measures (Sections 2.2.3-2.2.6).

| Statistical model for an ITS study
An ITS with a single interruption is commonly modelled using segmented linear regression as follows 1 : Y t is a continuous outcome at time t t ¼ 1, …,T ð Þand the interruption time is indicated by T I .D t is an indicator variable that represents the post-interruption period (D t ¼ 1 t ≥ T I ð Þ ): β 0 represents the intercept in the pre-interruption period, β 1 the pre-interruption slope, and β 2 and β 3 represent the interruption effects; respectively, change in level and change in slope.The error term, e t , is constructed from two components (ρe tÀ1 þ w t ).The first, ρ (À1 ≤ ρ ≤ 1), represents the degree of the correlation between the error at time t and the error of the previous time point t À 1 ð Þ, and the second represents 'white noise', which is assumed to be normally distributed (w t $ N 0, 1 ð Þ).Here, the error term accommodates lag-1 (AR (1)) autocorrelation, but can be extended to accommodate longer lags.

| Estimation methods for ITS analysis
There are several estimation methods that can be used to estimate the parameters of the segmented linear regression model.Here, we focus on three estimation methods-OLS, PW and REML.
Ordinary least squares (OLS) estimators, commonly used in practice, 8,12 can be used to estimate the regression parameters and their standard errors. 21A key assumption of OLS is that the model errors are uncorrelated between observations, which may be violated with time series data.In the presence of autocorrelation, estimates of the regression parameters will be unbiased, however, their standard errors may be biased.In the presence of (likely 22 ) positive autocorrelation, they will be too small. 18,40rais-Winsten (PW) a generalised least-squares approach, provides an extension of OLS that allows for lag-1 autocorrelation (AR (1)). 41In brief, the estimation procedure involves first fitting the segmented linear regression model (Equation 1) using OLS, from which an estimate of autocorrelation is calculated from the residuals.The data are then transformed using the estimated autocorrelation, aiming to remove the autocorrelation from the errors.The regression parameters are then reestimated using OLS.Further iteration of these steps may be required until the estimated autocorrelation converges. 42estricted Maximum Likelihood (REML) estimators can be used to estimate the regression parameters and their standard errors.REML is a form of maximum likelihood estimation in which the log-likelihood is partitioned into two terms.The first term, comprised of only variance components, is first maximised to obtain estimates of the error variance and correlation parameters, accounting for the appropriate degrees of freedom.The second term, comprised of both regression and error variance parameters, is then maximised using estimates from the first term.Maximum likelihood variance estimators do not appropriately account for the loss in degrees of freedom that result from estimating the regression parameters, which leads to negatively biased variance components for small samples. 25

| Statistical models for meta-analysis
Meta-analysis may be used to estimate a combined effect from at least two ITS studies. 9,10Here, we focus on the two-stage meta-analysis approach.The two most common meta-analysis models include the fixed-effect (also known as common-effect) and random-effects models.
In a fixed-effect meta-analysis model, it is assumed that the included ITS studies estimate a single true (common) interruption effect, and any variability in the observed effects is only due to sampling variability.The model can be specified by: where β m represents the underlying true interruption effect of the m th regression parameter from Equation 1, and of interest here is β 2 (immediate level-change) and β 3 (slope-change); b β mk is an estimate of the mth regression parameter from the k th ITS study (k ¼ 1,…,K and K ≥ 2), and the error in estimating a particular ITS study k's true effect from a sample of participants, assumed to be normally distributed, is represented by ).In a random-effects meta-analysis model, it is assumed that the true interruption effects follow a (conventionally assumed normal) distribution, and the observed ITS study effects are a random sample from this distribution. 10The random-effects model can be specified by: where β Ã m represents the mean of the distribution of true interruption effects (where, as above, m represents the regression parameter (and effect measure) of interest); δ mk represents a random effect that allows a separate interruption effect in the k th ITS study, where these effects are assumed to be normally distributed about the average interruption effect (β Ã m ), with between-study variance τ 2 m (i.e., δ mk $ N 0, τ 2 m À Á ); and within-study À .These univariate meta-analysis models can be extended to allow joint modelling of the regression parameters, known as bivariate meta-analysis. 43In fitting separate univariate meta-analysis models, any withinstudy correlation between regression parameters is ignored.

| Estimation methods for meta-analysis
For a given effect measure, m, the meta-analytic effect is estimated as the weighted average of the K ITS study effect estimates.For a fixed-effect model, the estimator for the meta-analytic effect is b ), where the weight given to the k th ITS study is simply the reciprocal of the within-study variance, W mkFE ¼ 1 . For a random-effects model, the same estimator is used, but the weights are modified to incorporate the additional source of between-study variation, . A common assumption is that the within-study variances are known, when in practice they are estimated from the observed study data.For large studies, this assumption is generally reasonable, however, for small studies, this can bias the model parameters. 31ifferent between-study variance estimators are available, 31 as well as methods to calculate the confidence interval for the meta-analytic effect. 32Here we consider two between-study variance estimators and confidence interval methods.

Between-study variance estimators
DerSimonian and Laird (DL) is a moment-based betweenstudy variance estimator derived from Cochrane's Q-statistic, 44 chosen for inclusion in this study as it is commonly used 8 and is implemented as the default estimator in many software packages (e.g., RevMan, 45 metan in Stata 46 ).The estimator is given by: where the weights are from a fixed-effect meta-analysis model, and Q is calculated based on the fixed-effect meta-analysis estimate, An alternative between-study variance estimator can be derived using REML, 31 chosen for inclusion in this study as it has been recommended as a preferable estimator compared with DL. 31,47,48 The estimator is given by: The estimate of the between-study variance is calculated through a process of iteration, whereby the initial value of b β
and the process repeated until convergence.However, the algorithm can occasionally fail to converge.

Confidence interval calculation
A range of confidence interval methods for the metaanalytic (summary) estimate are available. 32The two outlined here can be used with both the DL and REML between-study variance estimators.
The method chosen for inclusion in this study for its wide use in practice, 8,34,44 the Wald-type normal distribution (WT) confidence interval, 49 is calculated as: where z 1Àα=2 is the 1 À α 2 À Á th quantile of the standard normal distribution (note that b β Ã m is replaced with b β m for a fixed-effect meta-analysis).This method assumes b β Ã m is normally distributed, despite the within-study and betweenstudy variances not being known and estimated. 32,50artung and Knapp, and independently, Sidik and Jonkman, [henceforth referred to as the Hartung-Knapp/ Sidik-Jonkman (HKSJ)] 51 derived an alternative confidence interval method in an attempt to deal with meta-analyses with few studies, selected here for its better performance when there are few included studies.33,[52][53][54] Rather than assuming normality of b β Ã m , the method assumes the t-distribution (with K-1 degrees of freedom), and includes a small sample standard error adjustment, q, and is calculated as: where and

| Simulation methods
Before providing full details, we briefly outline our simulation approach.We generated ITS studies, analysed these using segmented linear regression using two estimation methods (Section 2.1.2) and metaanalysed the resulting level-change and slope-change effect estimates using a fixed-effect and multiple random-effects meta-analysis methods (Section 2.1.4).
The ITS studies and meta-analyses were generated using a range of design parameters (e.g., varying levels of autocorrelation, varying number of studies per meta-analysis).These design parameters were combined using a fully factorial approach (1620 simulation scenarios), with 1000 replicate meta-analyses generated per scenario.Various criteria (e.g., bias, 95% confidence interval coverage, Appendix 2) were used to assess the performance of the meta-analysis methods.

| Data generation
The design parameters, which were informed by reviews of ITS studies 12,17,22 and of meta-analyses of ITS studies, 8 are provided in Table 1.For each combination of these parameters, ITS studies with continuous outcomes were generated by randomly sampling from the model in Equation 1.We limited our focus to continuous outcomes only, as more research is required to understand the statistical performance of ITS analysis methods for binary (proportion), count or rate outcomes (which are common in ITS studies 8,12 ) prior to their assessment in combination with meta-analysis methods.
Models were constructed assuming level-changes (β 2 ) of 0 and 1, and slope-changes (β 3 ) of 0 and 0.1.We limited the number of level-and slope-changes investigated based on findings of a simulation study 17 that demonstrate that the magnitude of these parameters did not impact the performance (across a range of metrics, excluding power) for the ITS estimation methods considered (Section 2.1.2).Lag-1 autocorrelation was varied between fixed values of ρ ¼0 and 0.6 in increments of 0.2, and values drawn from a distribution ρ Ã $ N 0:4,0:15 2 ð Þ .The values of autocorrelation were selected to reflect magnitudes of autocorrelation seen in real-world data sets (a median of 0.21 across 181 ITS studies, IQR: À0.01 to 0.56). 22We did not select negative values of autocorrelation, because although estimated in real-world datasets, this is more likely to be due to sampling error rather than truly negative autocorrelation.Sampling autocorrelation from a distribution aimed to reflect the likely scenario that autocorrelation will vary across the ITS studies.
Three different series lengths were generated (12, 48 and 100 datapoints) to reflect lengths seen in realworld data sets (median of 48 datapoints per series, IQR: 23-157 12 ), and for which the performance of the ITS analysis methods (OLS and REML) has been shown to differ. 17,22We restricted our examination to an equal number of datapoints as the impact of unequal numbers on the ITS analysis results is not well understood, and to maintain a manageable number of simulation scenarios.
Each meta-analysis was comprised of multiple ITS studies, and therefore the data generation process involved generating data for each of the multiple ITS studies within each meta-analysis.Meta-analyses were generated with 3, 5 or 20 included ITS studies, to reflect the size of meta-analyses seen in real-world datasets [a median of 7 ITS (IQR: 6-10)], 55 and to include scenarios where estimation of between-study variance is likely to be suboptimal (few studies, 3 ITS) and likely to be more accurate (many studies 20 ITS).Furthermore, to incorporate heterogeneity, level changes of the ITS studies within each meta-analysis were generated with a between-study variance of τ 2 2 ¼ 0, 0.1, 2 0.3 2 and the slope changes with a between-study variance of τ 2 3 ¼ 0, 0.01, 2 0.05. 28,17

| Estimands and other targets
The estimands of interest in this simulation study were the meta-analytic effects for the immediate levelchange and slope-change parameters from Equation 1(fixed-effect β 2 and β 3 , and random-effects β Ã 2 and β Ã 3 ), and the between-study variance for both parameters (τ 2 2 ,τ 2 3 ).

| Statistical methods to analyse ITS studies
The ITS studies were analysed with both OLS and REML (Figure 2).When REML failed to converge, we used PW, and if this failed, we used OLS.The choice of ITS estimation methods was based on methods that are commonly used and those shown to have better performance. 8,12,17,22e did not include autoregressive integrated moving average (ARIMA), despite being commonly used, 12,56 as ARIMA and REML have been shown to perform similarly (in terms of confidence interval coverage) for analysing single ITS studies. 17Furthermore, REML yielded less biased estimates of lag-1 autocorrelation.

| Statistical methods for meta-analysis
Level-change and slope-change effect estimates were combined assuming both fixed-effect and random-effects univariate models (Figure 2).We did not examine the performance of bivariate meta-analysis models because our analysis of real-world ITS datasets yielded negligible correlation between estimates of immediate level-and slope-change (n = 473, correlation = À0.01,95% CI: À0.03 to 0.01; further details available from the corresponding author).In this circumstance, bivariate metaanalysis would not lead to improved precision of the regression parameters.For random-effects meta-analysis, we examined two between-study variance estimators (DL and REML) and two confidence interval methods (WT and HKSJ) (Section 2.1.4).All meta-analyses were implemented using meta in Stata version 16.1. 57

| Performance measures
The performance of the meta-analysis methods was assessed by examining bias, 95% confidence interval coverage, empirical standard error, model-based standard error, type I error rate, and statistical power (see Table S1 for definitions).For each of the 1620 combinations of design parameters (scenarios), we used 1000 replicates to keep the Monte Carlo Standard Error (MCSE) for confidence interval coverage and type I error rate below 0.7%. 39he non-convergence rate of the REML and PW estimation methods were tabulated.

| Simulation procedures
Prior to running the simulations, data generation mechanisms were checked by initially simulating series with 100,000 datapoints and meta-analyses with 50 included studies, to ensure the estimated level-change, slope-change, autocorrelation and estimates of heterogeneity matched their input values, that is, that these estimators were all consistent in the statistical sense in large samples.Scatter and box plots were used to visualise the performance for all metrics.
The simulation was conducted using Stata version 16.1 57 and results were visualised using R version 4.1.0(dplyr, 58 foreign, 59 ggplot2 60 ).All code for generating and analysing the simulated data are available in the Monash University repository known as Bridges. 61

| RESULTS
For simplicity of presentation, we restrict the descriptions of our findings to a limited number of the simulation scenarios, meta-analysis methods and ITS effect measures.We focus on scenarios in which the data were generated with an immediate level-change (β 2 ) of 1, a slope-change (β 3 ) of 0.1, and where autocorrelation was fixed at specific magnitudes; autocorrelation variability had no impact on performance (Appendix 3.7).The simulation performance measures (aside from power) were not impacted by combinations of level-and slopechanges.Furthermore, given minimal difference in performance between the random-effects methods with DL F I G U R E 3 Plots of 95% confidence interval coverage of immediate level-change (y-axis) when the data were generated under a fixedeffect model and the ITS studies were analysed with OLS (A) and REML (B) using fixed-effect (red circles), DL + WT (green diamonds) and REML+ HKSJ (blue crosses) meta-analysis methods versus autocorrelation (x-axis).Plots are presented separately by combinations of the number of included studies (rows) and the number of datapoints (columns).The solid red line depicts the nominal 95% coverage level.Simulation scenarios presented include a level-change of 1, level-change between-study heterogeneity of 0, slope-change of 0.1, slope-change between-study heterogeneity of 0, and autocorrelation constant across included ITS studies.CI, confidence interval; DL, DerSimonian and Laird; dps, datapoints; HKSJ, Hartung-Knapp/Sidik-Jonkman; ITS, interrupted time series; OLS, ordinary least squares; REML, restricted maximum likelihood; WT, Wald-type.[Colour figure can be viewed at wileyonlinelibrary.com] and REML between-study variance estimators, we restrict presentation of findings to (i) DL between-study variance estimator with WT confidence intervals (DL + WT) and (ii) REML between-study variance estimator with HKSJ confidence intervals (REML+HKSJ), with the former representing the most common method combination.Results from all random-effects methods are presented in Appendix 3. Finally, we focus on the performance of the meta-analysis methods for combining immediate levelchange estimates, given the patterns for slope-change were similar across the scenarios (see Appendix 4 for all findings).

| Bias of meta-analytic level-change
All meta-analysis method and ITS analysis combinations yielded approximately unbiased estimates of metaanalytic level-change across all simulated scenarios (Appendix 3.2).

| Confidence interval coverage of meta-analytic level-change
When data were generated under a fixed-effect model (i.e., no underlying level-change heterogeneity), fixed-effect meta-analysis of level-change effects estimated from OLS ITS yielded coverage less than the nominal 95% level, except when autocorrelation was zero and the number of datapoints was 48 or greater (Figure 3a).The less than nominal coverage decreased further with increasing autocorrelation.When the number of ITS datapoints was 12 (Figure 3b), fixed-effect meta-analysis of level-change effects estimated from REML ITS yielded coverage less than the nominal 95% level, and was importantly less than when the ITS were analysed using OLS.However, when the number of datapoints was greater than 12, fixed-effect meta-analysis of REML ITS level-change estimates reached coverage close to the nominal 95% level.Random-effects meta-analysis (irrespective of the method) of level-change effects estimated from OLS ITS or REML ITS yielded coverage close to the nominal 95% level.Exceptions to this were when the number of included ITS studies was small (i.e., ≤5) and meta-analysed with DL + WT meta-analysis; in this circumstance, coverage, for some scenarios (e.g., increasing magnitude of autocorrelation and OLS ITS), was less than the nominal 95% level (but still at least 83%).
When data were generated under a random-effects model (Figure 4), random-effects meta-analysis of levelchange effects estimated from OLS ITS or REML ITS yielded coverage close to the nominal 95% level, irrespective of the meta-analysis method used and the magnitude of heterogeneity.The only exceptions to this were DL + WT meta-analysis of REML ITS when the number of included ITS studies was small (i.e., ≤5) and the number of ITS datapoints was 12.

| Standard errors of meta-analytic level-change
When data were generated under a fixed-effect model, fixed-effect meta-analysis of level-change effects estimated from OLS ITS yielded model-based standard errors that were smaller than empirical standard errors (i.e., the ratio was less than one), except when autocorrelation was zero and the number of datapoints was 48 or greater (Figure 5a).The underestimation was exacerbated by increasing autocorrelation.Fixed-effect meta-analysis of level-change effects estimated from REML ITS yielded model-based standard errors that were smaller than the empirical standard errors when the number of ITS datapoints was 12 (Figure 5b) (ratios ranging from 0.37 to 0.64), and the ratio was importantly less than when the ITS were analysed using OLS (ratios ranging from 0.54 to 0.83).However, when the number of datapoints was greater than 12, fixed-effect meta-analysis of REML ITS level-change estimates yielded model-based standard errors that were generally similar to the empirical standard errors (ratios ranging from 0.77 to 0.97).Random-effects meta-analysis (irrespective of the method) of level-change effects estimated from OLS ITS or REML ITS yielded model-based standard errors similar to the empirical standard errors (i.e., all ratios were generally close to one, with ratios ranging from 0.83 to 1.14 for OLS ITS and 0.81 to 1.13 for REML ITS).
When data were generated under a random-effects model (Figure 6), random-effects meta-analysis (irrespective of the method) of level-change effects estimated from OLS ITS or REML ITS yielded ratios close to one (ratios ranging from 0.82 to 1.14 for OLS ITS and 0.79 to 1.13 for REML ITS) (i.e., appropriate estimation of standard errors in the presence of heterogeneity).

| Statistical power to detect a levelchange
To avoid misleading interpretations of statistical power, we limit presentation of results to only scenarios in which coverage was at least 90%; acknowledging that with this threshold, there will be some artificial inflation of power when coverage is less than the nominal 95% level (due to the inflated type I error rate, i.e., 100-coverage%).When the number of ITS was large (i.e., 20), power was reasonable, irrespective of the number of time points, ITS analysis method, or meta-analysis method (Figure 7).When there were few ITS studies per meta-analysis (i.e., 5 or fewer), power importantly reduced with a decreasing number of datapoints and with increasing autocorrelation.Furthermore, power was affected by the meta-analysis method used, with REML +HKSJ yielding less power than DL + WT.

| Convergence of estimation methods
When analysing the ITS datasets using REML, if the analysis failed to converge, we used PW, followed by OLS.Of the 15,120,000 ITS studies analysed using REML, 6% (899,970) failed to converge.When analysing these 899,970 ITS studies using PW, all converged, precluding the need to use OLS.Non-convergence of REML was more common for short series (17.21% for 12 datapoints, 0.57% for 48 datapoints, and 0.08% for 100 datapoints).However, the performance across all measures when comparing ITS studies analysed using REML with those analysed with PW were similar (results shown for some simulation scenarios for coverage, Appendix 3.9).Among the 3,240,000 meta-analyses of level-change performed using REML, 2161 (0.0006%) failed to converge.

| Estimation of heterogeneity
The magnitude of heterogeneity was overestimated in most scenarios by both REML and DL estimators.The DL estimates of heterogeneity were comparable with those from REML (Appendix 3.8).In scenarios where the ITS study standard errors were underestimated [i.e., when there were a small number of datapoints (i.e., 12 datapoints), or when autocorrelation was present but not accounted for in the analysis (i.e., OLS)], the between-study variance was overestimated (Figure 8).As autocorrelation increased, the overestimation of heterogeneity when analysed with OLS increased, and this relationship was not modified by the number of included ITS studies.When there was no underlying heterogeneity, often the estimated between-study variance was greater than zero (in 25,850/45,000, 57%, in scenarios with 3 studies, 31,055/45,000, 69%, with 5 studies and 39,859/45,000, 89%, with 20 studies, when analysed with OLS).

| Performance in estimating the meta-analytic slope-change
All meta-analysis and ITS analysis method combinations yielded approximately unbiased estimates of meta-analytic slope-change in all simulated scenarios.The patterns observed for slope-change reflected those of level-change, although the specific performance values differed for coverage, empirical standard error, and power (Appendix 3).Furthermore, the patterns of between-study variance overestimation were also observed when estimating between-study variance in meta-analyses of slope-change (Appendix 3.8).

| Analysis of illustrative example
The series in the example (introduced in Section 1.1) were relatively long, with 96 monthly datapoints per series, and the number of series available for the metaanalysis (11) was more than seen on average. 8We analysed the ITS data from each of the 11 States using two ITS estimation methods.The level-and slope-change effect estimates (and standard errors) were calculated and combined using fixed-effect and four random-effects meta-analysis methods.The point estimates for both level-and slope-changes were similar when using OLS and REML ITS analysis methods, however, the standard errors for the level-change effects were always smaller when using OLS compared with REML (Table 2).The magnitude of autocorrelation ranged from 0.16 to 0.63 (estimated using REML).The meta-analytic point estimates for levelchange did not vary by meta-analysis method (Table 3).This was a result of the heterogeneity variance being estimated as zero, or close to, for both DL and REML (irrespective of the ITS analysis method).The confidence intervals for level-change were wider when HKSJ confidence intervals were used, compared with WT, when the ITS studies were analysed using OLS.The reverse pattern was observed when the ITS studies were analysed using REML.The same patterns were observed for meta-analysis of slope-change.

| Summary and discussion of key findings
Systematic reviews including meta-analyses of results from ITS studies are important for examining the effects of population-level interventions. 8To date, there has F I G U R E 7 Plots of statistical power (the percentage of simulations that have a 95% confidence interval that did not include zero) of immediate level-change (y-axis) when the data was generated under a fixed-effect model and the ITS studies were analysed with OLS (A) and REML (B) using fixed-effect (red circles), DL + WT (green diamonds) and REML+HKSJ (blue crosses) meta-analysis methods versus autocorrelation (x-axis).Plots are presented separately by the number of included studies (rows) and number of datapoints (columns).been limited evaluation of the performance of metaanalysis methods when combining results from ITS studies, which often have characteristics that might compromise performance (e.g., short series 8,12,18,62 ).Our simulation study provides insight on the performance of meta-analysis methods in conjunction with ITS analysis methods, and factors that impact their performance, (e.g., series length, magnitude of autocorrelation and between-study variance).52][53][54] All meta-analysis methods yielded unbiased estimates of level-change and slope-change effects.This was unsurprising given the ITS analysis methods we examined have all been shown to yield unbiased estimates of level-and slope-change. 17However, the choice of meta-analysis method did impact the 95% confidence interval coverage, standard error, and power.We discuss these findings firstly in the context of scenarios with no underlying heterogeneity (generated under a fixed-effect model), followed by scenarios in which heterogeneity was present (generated under a random-effects model).
In scenarios with no true underlying heterogeneity, fixed effect meta-analysis (of level-and slope-change estimates) yielded coverage below the nominal 95% level for short series (i.e., 12 data points) or when the ITS method did not account for autocorrelation (when autocorrelation was present).In a numerical simulation study examining the performance of statistical methods for single ITS', Turner et al. 17 found that both OLS and REML analysis methods underestimated the effect estimate standard errors for short series (i.e., 12 data points).As the number of data points increased, REML ITS analysis yielded estimates of standard errors closer to the true values, even in the presence of autocorrelation.However, as expected, improvements in the estimates of standard errors were not observed when an OLS ITS analysis was used.Using OLS in the presence of autocorrelation yielded greater underestimation of the standard errors as the number of data points increased.Given that the standard error of a meta-analytic effect estimate in a fixed-effect model is a function of the within-study standard errors only, the patterns observed in the present simulation study directly reflect the patterns of Turner et al. 17 In the same scenarios as above, with no true underlying heterogeneity, random-effects meta-analysis generally yielded coverage that was close to the nominal 95% level.An artefact of underestimating the within-study standard errors was that this induced observed between-study heterogeneity, even in the presence of no true underlying heterogeneity.Greater underestimation of the withinstudy standard errors led to greater overestimation of the between-study variance.For example, greater underestimation occurs when there are many, long series (e.g., 20 ITS studies with 100 datapoints) and autocorrelation is present (0.6) but unaccounted for in the ITS analysis (i.e., OLS is used), while no underestimation of withinstudy standard errors occurs when there are many, long series and autocorrelation is present but accounted for (i.e., REML is used for ITS analysis).Fortuitously, this under and over estimation of variances counterbalance one another when combined in the calculation of the meta-analytic effect estimate standard errors in a T A B L E 3 Meta-analytic level-and slope-change effect estimates (95% confidence intervals), and estimate of between-study variance from the meta-analysis of 11 States of traffic fatality data using two ITS analysis and five meta-analysis methods combinations.random-effects meta-analysis, yielding generally unbiased standard errors.

Level
In scenarios where true underlying heterogeneity was present, random-effects meta-analysis generally yielded coverage close to the nominal 95% level, irrespective of the ITS analysis method used.This occurred for the same reason as above; scenarios in which the within-study standard errors were underestimated, the between-study variance was overestimated, which resulted in generally unbiased standard errors of the meta-analytic effect estimates.Because the resulting meta-analysis standard errors were generally unbiased, even with few studies (i.e., ≤ 5 ITS), the HKSJ confidence interval method offered no (or limited) advantage compared with the WT method.The HKSJ method has been shown to yield better coverage than the WT confidence interval in other simulation studies. 31,33Our findings may have differed if the scenarios we investigated resulted in greater underestimation of the within-study standard errors, thus affecting the estimated between-study variances, and in turn, the standard errors of the meta-analytic effect.
While the parameter of interest when fitting a random-effects meta-analysis is often the average interruption effect (i.e., the meta-analytic level-or slopechange), reporting the average alone provides an incomplete and potentially misleading summary of the impact of the interruption. 29Understanding the consistency of the interruption effects (e.g., through the calculation of a prediction interval), and the factors that may explain observed heterogeneity should be of equal importance. 63Crucially, however, this relies on accurate estimation of the betweenstudy variance, which was most accurately estimated when REML was used for the ITS analysis, as opposed to OLS.

| Strengths and limitations
Strengths of this numerical simulation study include the large number of design parameters (and their factorial combination) examined, and the use of a wide range of performance metrics.This allowed us to understand how the design parameters and their interactions affected key parameters required for interpreting meta-analysis results.Our design parameters were informed by those observed in practice 8,12,18,22,40 in an attempt to create scenarios reflective of practice.We also included scenarios that would test the meta-analysis methods when the underlying assumptions were unlikely to be met.
While our simulation scenarios were extensive, there are many other outcome types (e.g., proportion, count, rate), statistical methods (e.g., ITS analysis methods such as ARIMA, between-study variance estimators and metaanalytic effect confidence interval methods), design parameters (e.g., proportion of datapoints pre-and postinterruption) and their combinations that could be investigated.One particular example, pertinent to simulation studies of meta-analysis methods, is to vary the design parameters across the individual ITS' within the metaanalyses. 36For example, by assuming a varying number of datapoints, pre-interruption levels and/or slopes, and methods used for ITS analysis across the ITS studies.Although we caution against generalising our findings beyond the outcome type, statistical methods and design factor configurations examined in the present study, our study provides a broad understanding of the factors that affect performance, which may be helpful for informing the choice of statistical methods in scenarios beyond the configurations examined here.

| Implications for practice
For meta-analysts, our findings suggest that fitting a random-effects model generally yields coverage close to the nominal 95% level, in scenarios with and without underlying heterogeneity, and irrespective of the number of ITS studies or the method used for their analysis.Random-effects models (as opposed to fixed-effect models) may be a more appropriate model choice in the context of systematic reviews including ITS studies, as these study designs are likely to have more diversity in their characteristics (compared with randomised trials), potentially inducing statistical heterogeneity.While use of random-effects meta-analysis may mitigate some of the consequences of suboptimal ITS analysis methods in the estimation of the average effect, wherever possible, we recommend meta-analysts use effect estimates calculated from ITS methods that attempt to adjust for autocorrelation (e.g., REML).As noted above, this will lead to more accurate estimation of the between-study variance (see Section 4.1).This may require re-analysis of the ITS studies prior to their inclusion in meta-analysis.ITS data are often available in figures in publications, 12 from which data can be digitally extracted and effect estimates accurately calculated. 64However, caution is required in relying on the estimated heterogeneity when there are few ITS studies and the ITS have few datapoints.
For researchers undertaking the analysis of primary ITS studies, our results suggest the length of the time series and method used to analyse ITS studies have important implications for meta-analysis.To facilitate inclusion of eligible ITS studies in potential future systematic reviews, it is critical that their design and analysis methods are completely and accurately reported.Reporting should include a clear description of the ITS design (e.g., number of datapoints in the series), the model and statistical estimation method, including any adjustments made for autocorrelation, the interruption effect measure (e.g., immediate level-change), and the estimate and measure of precision. 8Further, provision of the aggregate-level time series data (e.g., in tables or figures) would be beneficial as it would allow systematic reviewers to re-analyse time series data across the studies using their preferred method, consistently across studies within a meta-analysis and to calculate the impact of the interruption using the desired effect measure. 14,16,22,62,65,66

| Implications for future research
We examined the performance of meta-analysis methods using a two-stage meta-analysis approach; however, in certain circumstances it is possible to fit a single model that includes all the ITS to estimate the parameters in Equation 1, known as the one-stage approach. 26Gebski et al. 26 demonstrated this with a single model fit to ITS from three hospital units and allowing the level-and slope-changes to vary via the addition of fixed effect interaction terms between the interruption effects and the hospital units.Other one-stage models could incorporate random effects for level-and slope-change, to parallel the two-stage random-effects approach.Examination of whether there are scenarios in which the one-stage approach may offer improved efficiency would be of value. 26Further avenues for research include examining, the impact of the different analysis methods on prediction intervals, as well as more complex scenarios such as where the ITS analysis methods differed between studies in the meta-analysis, the included ITS studies have lags of greater than 1, or exhibit seasonal patterns.

| Conclusions
Systematic reviews including meta-analyses of results from ITS studies are important for informing public health policy.Our simulation study provides evidence on the performance of meta-analysis methods when combining results from ITS studies.We found that all metaanalysis methods yielded unbiased estimates of the interruption effects.All random effects meta-analysis methods yielded coverage close to the nominal level, irrespective of the ITS analysis method used.However, the betweenstudy heterogeneity variance was overestimated in scenarios where the ITS study standard errors were underestimated.Therefore, meta-analysts should strive to use effect estimates and standard errors that have been calculated from ITS methods that attempt to adjust for autocorrelation (such as REML).

T A B L E 1
Design parameters used in the simulation study.

F I G U R E 2
Simulation procedure and analysis methods.*The estimation methods for ITS analysis are listed in order of preference, that is, REML is used whenever it converges, while PW followed by OLS are used in the case of non-convergence.DL, DerSimonian and Laird; HKSJ, Hartung-Knapp/Sidik-Jonkman; ITS, interrupted time series; OLS, ordinary least squares; PW, Prais-Winsten; REML, restricted maximum likelihood; WT, Wald-type.[Colour figure can be viewed at wileyonlinelibrary.com]

F
I G U R E 4 Plots of 95% confidence interval coverage of immediate level-change (y-axis) when the ITS studies were analysed with OLS (A) and REML (B) using fixed-effect (red circles), DL + WT (green diamonds) and REML+HKSJ (blue crosses) meta-analysis methods versus level-change heterogeneity (x-axis).Plots are presented separately by combinations of the number of included studies (rows) and number of datapoints (columns).The solid red line depicts the nominal 95% coverage level.Simulation scenarios presented include a levelchange of 1, slope-change of 0.1, slope-change between-study heterogeneity of 0, and autocorrelation of 0. CI, confidence interval; DL, DerSimonian and Laird; dps, datapoints; HKSJ, Hartung-Knapp/Sidik-Jonkman; ITS, interrupted time series; OLS, ordinary least squares; REML, restricted maximum likelihood; WT, Wald-type.[Colour figure can be viewed at wileyonlinelibrary.com]

F
I G U R E 5 Plots of the ratio of model-based standard error (modSE) to the empirical standard error (empSE) of immediate level-change (y-axis) when the data was generated under a fixed-effect model and the ITS studies were analysed with OLS (A) and REML (B) using fixedeffect (red circles), DL + WT (green diamonds) and REML+HKSJ (blue crosses) meta-analysis methods versus autocorrelation (x-axis).Plots are presented separately by the number of included studies (rows) and number of datapoints (columns).The solid red line depicts a ratio of one, where the model-based standard error and empirical standard error are equal and thus that the model-based standard error accurately estimates the true standard error.Simulation scenarios include a level-change of 1, level-change heterogeneity of 0, slope-change of 0.1, slope-change heterogeneity of 0, and fixed autocorrelation.CI, confidence interval; DL, DerSimonian and Laird; dps, datapoints.empSE, empirical standard error; HKSJ, Hartung-Knapp/Sidik-Jonkman; ITS, interrupted time series; modSE, model-based standard error; OLS, ordinary least squares; REML, restricted maximum likelihood; WT, Wald-type.[Colour figure can be viewed at wileyonlinelibrary.com]

F
I G U R E 6 Plots of the ratio of model based standard error (modSE) to the empirical standard error (empSE) of immediate level-change (y-axis) when the ITS studies were analysed with OLS (A) and REML (B) using fixed-effect (red circles), DL + WT (green diamonds) and REML+HKSJ (blue crosses) meta-analysis methods versus level-change heterogeneity (x-axis).Plots are presented separately by combinations of the number of included studies (rows) and number of datapoints (columns).The solid red line depicts a ratio of one, where the model-based standard error and empirical standard error are equal and thus that the model-based standard error accurately estimates the true standard error.Simulation scenarios include a level-change of 1, slope-change of 0.1, slope-change heterogeneity of 0, and autocorrelation of 0. CI, confidence interval; DL, DerSimonian and Laird; dps, datapoints.empSE, empirical standard error; HKSJ, Hartung-Knapp/Sidik-Jonkman; ITS, interrupted time series; modSE, model-based standard error; OLS, ordinary least squares; REML, restricted maximum likelihood; WT, Wald-type.[Colour figure can be viewed at wileyonlinelibrary.com] Simulation scenarios include a level-change of 1, level-change heterogeneity of 0, slope-change of 0.1, slope-change heterogeneity of 0, and fixed autocorrelation.Only scenarios with a confidence interval coverage greater than 90% have been plotted.CI, confidence interval; DL, DerSimonian and Laird; dps, datapoints; HKSJ, Hartung-Knapp/Sidik-Jonkman; ITS, interrupted time series; OLS, ordinary least squares; REML, restricted maximum likelihood; WT, Wald-type.[Colour figure can be viewed at wileyonlinelibrary.com]F I G U R E 8 Plots of level-change heterogeneity estimated using random-effects meta-analysis with the REML between-study variance estimator (y-axis) when the (A) 3 ITS, (B) 5 ITS and (C) 20 ITS studies were analysed with OLS (orange) and REML (purple) ITS analysis methods.Plots are presented separately by combinations of the true level-change heterogeneity (rows) and the number of datapoints (columns).The solid red lines indicate the true level-change heterogeneity.Simulation scenarios presented include a levelchange of 1, slope-change of 0.1, slope-change heterogeneity of 0, and fixed levels of autocorrelation.CI, confidence interval; DL, DerSimonian and Laird; dps, datapoints; HKSJ, Hartung-Knapp/Sidik-Jonkman; ITS, interrupted time series; OLS, ordinary least squares; REML, restricted maximum likelihood; WT, Wald-type.[Colour figure can be viewed at wileyonlinelibrary.com]