A comparison of methods for health policy evaluation with controlled pre‐post designs

Abstract
Objective: To compare interactive fixed effects (IFE) and generalized synthetic control (GSC) methods to methods prevalent in health policy evaluation and re‐evaluate the impact of the hip fracture best practice tariffs introduced for hospitals in England in 2010.
Data Sources: Simulations and Hospital Episode Statistics.
Study Design: Best practice tariffs aimed to incentivize providers to deliver care in line with guidelines. Under the scheme, 62 providers received an additional payment for each hip fracture admission, while 49 providers did not. We estimate the impact using difference‐in‐differences (DiD), synthetic control (SC), IFE, and GSC methods. We contrast the estimation methods' performance in a Monte Carlo simulation study.
Principal Findings: Unlike DiD, SC, and IFE methods, the GSC method provided reliable estimates across a range of simulation scenarios and was preferred for this case study. The introduction of best practice tariffs led to a 5.9 (confidence interval: 2.0 to 9.9) percentage point increase in the proportion of patients having surgery within 48 hours and a statistically insignificant 0.6 (confidence interval: −1.4 to 0.4) percentage point reduction in 30‐day mortality.
Conclusions: The GSC approach is an attractive method for health policy evaluation. We cannot be confident that best practice tariffs were effective.

a pay-for-performance (P4P) scheme, 3 a study's policy conclusions can rest on the approach taken to causal inference. 3 The synthetic control (SC) method 16,17 has been viewed as an attractive alternative to DiD as it avoids the parallel trends assumption. In essence, the SC method constructs a comparator for the intervention group, the synthetic control, as a weighted average of the available control units. Each unit is weighted to ensure that the mean outcomes of the synthetic control track those of the treated unit(s) prior to the intervention. 3,17-24 However, despite its wide use, critics have shown that the SC approach may provide biased estimates when few pre-intervention periods are available 2,25 ; when treatment assignment is correlated with time-varying unobserved confounders 26 ; or when the outcomes of the treated units cannot be obtained by weighting the control units' outcomes by values between 0 and 1 (ie, the treated units are not within the "convex hull"), leading to poor overlap. 17,25,27 Statistical inference is also somewhat problematic under the SC approach. 28 Concerns about the DiD and SC approaches have encouraged recent methodological advances. 29-35 However, these methods have not been considered in the health policy evaluation domain, which is characterized by particular challenges, notably the (im)plausibility of the parallel trends assumption, the possibility of heterogeneous treatment effects, and the availability of only a few pretreatment periods. Here, we consider two of these approaches, both of which are novel to this context: (a) interactive fixed effects (IFE) models, and (b) the generalized synthetic control (GSC) method. IFE models are flexible regression approaches that allow for multiple time-constant unobserved covariates, each of which may have effects that vary across time, 36-39 relaxing the parallel trends assumption. 40 IFE models nest the fixed effects models routinely used within DiD estimation, but may produce biased estimates when policy effects are modified by unobserved covariates, that is, when effects are heterogeneous. 41 For instance, hospital quality, which is generally unobserved, may moderate the effect that a new health policy has on outcomes.
The GSC method 41 seeks to overcome this limitation by combining insights from the SC literature with the efficiency gains of IFE models. The GSC approach allows a separate (counterfactual) potential outcome to be estimated for each treated unit, allowing heterogeneous treatment effects to be consistently estimated. It has been argued that the GSC method maintains the approximate unbiasedness property of the SC estimator but offers improved efficiency. Despite these desirable features, the GSC method has not been considered in a published health policy evaluation. a We contrast the IFE and GSC methods with DiD and SC methods in a case study and in Monte Carlo simulations. We revisit an evaluation of a pay-for-performance scheme, best practice tariffs (BPT) for hip fractures, introduced for hospitals in the English NHS. 2,42 The incidence of hip fractures in the UK is rising annually and is currently estimated at 10.2 per 10 000 per year. 43 The costs to hospital services of hip fracture are substantial, and have been estimated at £1,131 million in the year of the fracture. 44

| MOTIVATING EXAMPLE: EVALUATION OF A BEST PRACTICE TARIFFS SCHEME (BPT)
Hospital pay-for-performance (P4P) schemes link a portion of provider income to achieving predefined quality targets. These schemes intend to encourage providers to engage in "desirable" behaviors.
However, P4P schemes may shift resources toward rewarded and away from unrewarded dimensions of care quality, and so have negative spill-over effects. 45 A number of studies have concluded that hospital pay-for-performance schemes have not had the desired impact. 14,46-51 The international evidence on P4P has been criticized for failing to provide reliable estimates of these schemes' relative effectiveness. 52-54 The particular P4P scheme considered here, the BPT for hip fractures, was introduced from April 2010 for participating English NHS hospitals, 2,42 which were paid a fixed sum, set at £445 in the 2010/11 financial year, 55 for each hip fracture admission if certain conditions

What this study adds
• Health policy evaluations with pre-post designs are challenging as the parallel trends assumption underlying difference-in-differences estimation often does not hold for all outcomes.
• This was the case for the evaluation of the best practice tariffs (BPT) for hip fractures, a pay-for-performance scheme, introduced for hospitals in the English NHS.
• Alternative estimation methods have yielded contrasting estimates of the impacts of this BPT.
• In our simulations, the generalized synthetic control approach outperformed more commonly used methods (difference-in-differences and synthetic control methods) and hence was the preferred approach for the case study.
• The generalized synthetic control approach suggests that the BPT for hip fractures increased the proportion of patients who had surgery within 48 hours of admission, but did not statistically significantly reduce 30-day mortality.
representing "best practice" were met. b The BPT payments represented a considerable share of the total payment to providers for hip fracture care, 14% in 2011/12, 55 so one might anticipate that providers would respond to these altered incentives to provide best practice care.
A published survey and qualitative interviews suggested that BPT participation was influenced by factors unobserved by researchers 42c , such as the resources required for this scheme, the quality of facilities available, and the expected benefits from participation.
These may have had time-varying effects on the outcomes. Hence, a priori, it was unclear whether the parallel trends assumption held for each outcome. For one outcome, the proportion of patients who had surgery within 48 hours, the parallel trends assumption appeared plausible (Figure 1), and tests suggested this assumption could not be rejected (P = .9255). d However, for the primary outcome, mortality within 30 days, the parallel trends assumption appeared less plausible (Figure 2) and the null hypothesis of parallel trends was rejected (P = .039).
Previous analyses, using DiD and SC methods, found that conclusions regarding the effects of the BPT differed by method. 2 Estimates based on DiD reported that the introduction of BPTs led to a statistically significant reduction in mortality, whereas the SC method failed to reject the null of no effect across all outcomes and indicated a smaller impact on mortality compared to DiD. However, the authors raised concerns regarding the efficiency of the SC estimates, motivating this re-analysis using alternative methods.
We re-analyze the data used in a previously published study, 2 consisting of hospital admissions from 62 hospital trusts that reported receiving at least some BPT payments (treated group) and 49 trusts that reported receiving no payments under the scheme (control group). Panel data were available for twelve quarters before, and four after, the scheme's introduction. All analyses were conducted at the level of the hospital-quarter.
The outcomes considered are the proportion of patients receiving surgery within 48 hours of an emergency admission and the proportion of patients who die within 30 days of admission. We adjust for baseline covariates according to age group, gender, and source of admission.

| METHODS
Suppose there are i = 1, …, n units and T time periods, where t = 1, …, t′ are pretreatment and t = t′ + 1, …, T are post-treatment. The potential outcomes 56 for unit i in period t in the presence and absence of treatment are denoted by $Y^1_{it}$ and $Y^0_{it}$, respectively. Let $D_{it}$ be an indicator equal to one if unit i is treated (exposed to the policy) in period t and zero otherwise. The observed outcome can be written as:

$$Y_{it} = D_{it} Y^1_{it} + (1 - D_{it}) Y^0_{it}$$

We assume the following factor model for the potential outcome in the absence of treatment:

$$Y^0_{it} = X_{it}\beta + \lambda_t' \mu_i + \varepsilon_{it}, \qquad \lambda_t' \mu_i = \sum_{r=1}^{R} \lambda_{rt}\,\mu_{ir}$$

where $X_{it}$ is a (1 × k) vector of observed time-varying covariates, $\beta$ is the (k × 1) vector of their coefficients, assumed to be the same for both groups, $\mu_{ir}$ (r = 1, …, R) represents an unobserved time-invariant variable with $\lambda_{rt}$ capturing the effect of that unobserved variable in period t, and $\varepsilon_{it}$ represents exogenous, unobserved idiosyncratic shocks.
Allowing for an additive treatment effect that may differ by individual, $\tau_{it} = Y^1_{it} - Y^0_{it}$, the observed outcome becomes:

$$Y_{it} = \tau_{it} D_{it} + X_{it}\beta + \lambda_t' \mu_i + \varepsilon_{it} \qquad (1)$$

| Difference in Differences (DiD)
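Note that if $\mu_i = [1, \mu_i]'$ and $\lambda_t = [\lambda_t, 1]'$, equation 1 would correspond to a two-way fixed effects model:

$$Y_{it} = \tau_{it} D_{it} + X_{it}\beta + \mu_i + \lambda_t + \varepsilon_{it}$$

In this case, the parallel trends assumption will hold 57,58 :

$$E[Y^0_{it} - Y^0_{it'} \mid X_{it}, D_{it} = 1] = E[Y^0_{it} - Y^0_{it'} \mid X_{it}, D_{it} = 0], \qquad t > t'$$

where t′ represents the final pretreatment period, and the conditional ATT can be estimated using DiD with two-way fixed effects regression. 24,59-61 e,f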
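As an illustration of the two-way fixed effects regression referred to above, the following is a minimal sketch for a balanced panel with no covariates. The function name `twfe_did` and the toy data are hypothetical and are not taken from the study; the sketch relies on the fact that, in a balanced panel, removing unit means and period means is equivalent to including unit and period dummies.

```python
import numpy as np

def twfe_did(Y, D):
    """Two-way fixed effects DiD estimate for a balanced panel without covariates.

    Y, D: arrays of shape (n_units, n_periods); D is the 0/1 treatment indicator.
    """
    def double_demean(A):
        # Remove unit means and period means, then add back the grand mean.
        return (A - A.mean(axis=1, keepdims=True)
                  - A.mean(axis=0, keepdims=True) + A.mean())

    y_tilde = double_demean(Y).ravel()
    d_tilde = double_demean(D.astype(float)).ravel()
    # OLS slope of the demeaned outcome on the demeaned treatment indicator.
    return float(d_tilde @ y_tilde / (d_tilde @ d_tilde))

# Toy data (assumed values): additive unit and period effects, constant effect of 2.
n, T, n_treated, n_post = 6, 5, 3, 2
rng = np.random.default_rng(0)
unit_fe = rng.normal(size=(n, 1))
time_fe = rng.normal(size=(1, T))
D = np.zeros((n, T))
D[:n_treated, T - n_post:] = 1.0
Y = unit_fe + time_fe + 2.0 * D
att_hat = twfe_did(Y, D)
```

Because the toy outcome contains only additive unit and period effects, the parallel trends assumption holds by construction and the estimator recovers the effect; this would not be expected once interactive factors are present.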

| Interactive fixed effects
Interactive fixed effects models rely on an alternative set of estimation approaches for the common factor structure $\lambda_t' \mu_i$. 37 Here, we estimate the IFE model using the iterative principal component estimator. 37 This approach consists of iterating between (a) estimating $\lambda_t$ and $\mu_i$ using principal components while holding $\hat{\beta}$ constant, and (b) estimating $\beta$ by regressing $(Y - \hat{\lambda}_t' \hat{\mu}_i)$ on X, until convergence is achieved. The number of factors to include can be chosen according to cross-validation as described in Algorithm 1 in Xu. 41 It is preferable to include too many rather than too few factors. 62 One limitation of the IFE approach is that when treatment effects are moderated by the unobserved factors, the estimated average treatment effect may be biased, since the heterogeneity in treatment effects leads to biased estimates of the common factors and hence the implied treatment-free potential outcome.
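The two alternating steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the estimator as implemented for the case study: the number of factors is fixed rather than chosen by cross-validation, additive fixed effects are omitted, and the name `ife_fit` and the simulated data are hypothetical. In a policy evaluation, the treatment indicator would typically enter as one of the columns of X.

```python
import numpy as np

def ife_fit(Y, X, n_factors, n_iter=500, tol=1e-9):
    """Iterative principal component sketch for an interactive fixed effects model.

    Y: (n, T) outcomes; X: (n, T, k) covariates.
    Returns beta (k,), time-specific components F (T, R), unit-specific components L (n, R).
    """
    n, T, k = X.shape
    X2 = X.reshape(n * T, k)
    # Starting value: pooled OLS ignoring the factor structure.
    beta = np.linalg.lstsq(X2, Y.ravel(), rcond=None)[0]
    for _ in range(n_iter):
        # Step (a): principal components of the residuals, holding beta fixed.
        resid = Y - X @ beta
        U, s, Vt = np.linalg.svd(resid, full_matrices=False)
        L = U[:, :n_factors] * s[:n_factors]
        F = Vt[:n_factors].T
        # Step (b): regress the outcome net of the factor structure on X.
        beta_new = np.linalg.lstsq(X2, (Y - L @ F.T).ravel(), rcond=None)[0]
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta, F, L

# Simulated example (assumed values) with two factors and two covariates.
rng = np.random.default_rng(0)
n, T, R = 60, 20, 2
X = rng.normal(size=(n, T, 2))
beta_true = np.array([1.0, -0.5])
load = rng.normal(size=(n, R))
fac = rng.normal(size=(T, R))
Y = X @ beta_true + load @ fac.T + 0.1 * rng.normal(size=(n, T))
beta_hat, F_hat, L_hat = ife_fit(Y, X, n_factors=R)
```

The estimated components are only identified up to rotation, so `F_hat` and `L_hat` should be compared with the truth through their product rather than element by element.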

| Synthetic control (SC) method
The synthetic control method has been shown to provide an approximately unbiased estimator of the ATT for a treated unit 17 when outcomes are determined by a linear factor model with time-invariant covariates ($Z_i$), such as:

$$Y^0_{it} = Z_i \theta_t + \lambda_t' \mu_i + \varepsilon_{it}$$

where $\theta_t$ denotes the period-specific coefficients on $Z_i$. The SC method aims to estimate the unit level causal effect $\tau_{it}$ for the treated unit, by constructing a "synthetic control," or a weighted average of the control units that has similar outcomes and observed covariates to the treated unit over the pre-intervention period:

$$\hat{Y}^0_{1t} = \sum_{j} w_j Y_{jt}$$

where $w_j$ is an element of W representing the weight for control j, with $0 \le w_j \le 1$. The synthetic control is formed by finding the vector of weights W that minimizes

$$\sqrt{(X_1 - X_0 W)' V (X_1 - X_0 W)}$$

subject to the weights in W being non-negative and summing to 1, where $X_1$ and $X_0$ contain the pretreatment outcomes and covariates for the treated unit and control units, respectively, 17 and V captures the relative importance of these variables as predictors of the outcome of interest. When $X_1$ and $X_0$ include all of the pre-intervention outcomes, other covariates do not influence the weights and hence can be excluded, as is done in our analysis below. If the synthetic control and treated unit have similar outcomes over an extended pre-intervention period, it is plausible that they have similar observed and unobserved predictors of the outcome. 25 Hence, the postintervention outcome for the synthetic control represents the counterfactual treatment-free potential outcome for the treated unit ($\hat{Y}^0_{1t}$). The SC method assumes independence conditional on past outcomes (A2):

$$Y^0_{it} \perp D_{it} \mid Y^0_{ih}$$

where $Y^0_{ih}$ is a vector of potential outcomes in the h time periods prior to treatment.
Since the weights are restricted to be between 0 and 1, the treated unit must lie within the "convex hull" of the control units to avoid bias. 17 The treatment effect for the treated unit (i = 1), $\tau_{1t}$, can be estimated by $(Y_{1t} - \hat{Y}^0_{1t})$ for each postintervention period separately, and these can be averaged over time to obtain an ATT over the postintervention period.
The SC approach can be applied to multiple treated units by applying the method to each treated unit or, as we do here, averaging across the sample of treated units to obtain a single treated unit. 18,20
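The weight-finding step can be sketched as a constrained least squares problem. This is an illustrative simplification, not the Stata synth implementation used in the analysis: V is assumed to be the identity matrix (all pre-intervention outcomes weighted equally), and the function name `sc_weights` and the toy data are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def sc_weights(x1, X0):
    """Synthetic control weights with V set to the identity (an assumption).

    x1: (m,) pre-intervention outcomes of the (averaged) treated unit.
    X0: (m, J) pre-intervention outcomes of the J control units.
    Returns weights in [0, 1] that sum to 1 and minimise ||x1 - X0 w||^2.
    """
    J = X0.shape[1]
    loss = lambda w: float(np.sum((x1 - X0 @ w) ** 2))
    grad = lambda w: -2.0 * X0.T @ (x1 - X0 @ w)
    res = minimize(
        loss, np.full(J, 1.0 / J), jac=grad, method="SLSQP",
        bounds=[(0.0, 1.0)] * J,
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
        options={"ftol": 1e-12, "maxiter": 500},
    )
    return res.x

# Toy data (assumed values): the treated unit lies inside the convex hull of the controls.
rng = np.random.default_rng(0)
n_pre, n_post, J = 12, 4, 5
Y0_pre = rng.normal(size=(n_pre, J))
Y0_post = rng.normal(size=(n_post, J))
w_true = np.array([0.5, 0.3, 0.2, 0.0, 0.0])
y1_pre = Y0_pre @ w_true
y1_post = Y0_post @ w_true + 1.0          # true effect of 1 in every post period

w_hat = sc_weights(y1_pre, Y0_pre)
att_hat = float(np.mean(y1_post - Y0_post @ w_hat))
```

If the treated unit were shifted outside the range spanned by the controls, no set of weights in [0, 1] could reproduce its pre-intervention outcomes, which is the convex hull problem described above.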

| Generalized synthetic control (GSC) method
The GSC approach 41 assumes that treatment assignment is independent of potential outcomes conditional on the observed covariates, and R orthogonal, unobserved latent factors ($\lambda_t = \lambda_{t1}, \ldots, \lambda_{tR}$) and their factor loadings ($\mu_i = \mu_{i1}, \ldots, \mu_{iR}$) 41 :

$$Y^0_{it} \perp D_{it} \mid X_{it}, \lambda_t, \mu_i$$

which implies that

$$E[Y^0_{it} \mid X_{it}, \lambda_t, \mu_i, D_{it} = 1] = E[Y^0_{it} \mid X_{it}, \lambda_t, \mu_i, D_{it} = 0]$$

This will hold true if the same IFE data generating process, such as equation 1 above, underlies outcomes for the treated and the control units. The key difficulty in estimating the unobserved treatment-free potential outcome of the treated units in the post-treatment periods is estimating $\lambda_t$ for the post-treatment period and $\mu_i$ for each treated unit. The GSC approach tackles these difficulties as follows: g First, an IFE model, $Y^0_{it} = X_{it}\beta + \lambda_t' \mu_i + \varepsilon_{it}$, is estimated for the control units only, for the entire sample period, yielding estimates $(\hat{\beta}, \hat{\lambda}_t)$ for the control units. Since $\tau_{it} D_{it}$ is zero in equation 1 for the control units, $(\hat{\beta}, \hat{\lambda}_t)$ are consistent estimates of $(\beta, \lambda_t)$, which are assumed to be the same for the treated and control units. If we knew $\mu_i$ for the treated units, we could use our estimates from the control group $(\hat{\beta}, \hat{\lambda}_t)$ to predict the post-treatment treatment-free potential outcome for the treated unit using:

$$\hat{Y}^0_{it} = X_{it}\hat{\beta} + \hat{\lambda}_t' \mu_i \qquad (4)$$

Since we do not know $\mu_i$ for each treated unit, the GSC method finds the value, $\hat{\mu}_i$, that minimizes the pretreatment discrepancy between the observed outcome and the predicted outcome for a given treated unit, based on [4]. 41 Using the estimates for $\hat{\beta}$ and $\hat{\lambda}_t$ from the control units and the resulting prediction $\hat{\mu}_i$ for the treated unit, we can estimate the treatment-free potential outcome for the treated units as:

$$\hat{Y}^0_{it} = X_{it}\hat{\beta} + \hat{\lambda}_t' \hat{\mu}_i$$

The estimated treatment-free potential outcomes after the program starts can be compared to the actual outcomes for the treated units to obtain an estimated treatment effect $\hat{\tau}_{it} = (Y_{it} - \hat{Y}^0_{it})$ for each unit in each period.
Since, unlike the IFE approach, estimates of $\hat{\beta}$, $\hat{\lambda}_t$ and $\hat{\mu}_i$ do not depend on post-treatment information for the treated units, $\hat{\tau}_{it}$ is not biased by heterogeneous treatment effects.
As with the SC method, when the number of pretreatment periods is small, it becomes harder to distinguish between $\mu_i$ and $\varepsilon_{it}$, which can lead to biased estimates of the treatment effect. This bias shrinks to zero as both the number of pretreatment periods and the size of the control group grow. 41 Unlike the SC method, the GSC method conveniently allows for time-varying observed covariates.
The GSC approach requires data be available for R + 1 pre-intervention periods. h
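The three GSC steps (time-specific components from the controls, unit-specific components for each treated unit from the pre-intervention periods only, then prediction of the treatment-free outcome) can be sketched as below. This is a stripped-down illustration under stated assumptions: no covariates or additive fixed effects, a known number of factors, and noise-free toy data so that the result is exact. The name `gsc_effects` is hypothetical, and the sketch does not reproduce the cross-validation or bootstrap steps of the published method.

```python
import numpy as np

def gsc_effects(Y_tr, Y_co, n_pre, n_factors):
    """Generalized synthetic control sketch without covariates.

    Y_tr: (n_tr, T) outcomes of treated units; Y_co: (n_co, T) outcomes of controls.
    Returns estimated effects of shape (n_tr, T - n_pre) for the post periods.
    """
    # Step 1: time-specific components estimated from control units, all periods.
    _, _, Vt = np.linalg.svd(Y_co, full_matrices=False)
    F = Vt[:n_factors].T                                   # (T, R)
    # Step 2: unit-specific components for each treated unit, using only
    # pre-intervention outcomes (least squares on the pre-period rows of F).
    coef = np.linalg.lstsq(F[:n_pre], Y_tr[:, :n_pre].T, rcond=None)[0].T  # (n_tr, R)
    # Step 3: predicted treatment-free outcomes and unit-period effects.
    Y0_hat = coef @ F.T
    return Y_tr[:, n_pre:] - Y0_hat[:, n_pre:]

# Toy data (assumed values): treated units differ systematically from controls
# and have heterogeneous effects that depend on an unobserved characteristic.
rng = np.random.default_rng(1)
T, n_pre, R, n_co, n_tr = 16, 12, 2, 49, 62
F_true = rng.normal(size=(T, R))
L_co = rng.normal(size=(n_co, R))
L_tr = rng.normal(size=(n_tr, R)) + 2.0
tau_true = 1.0 + (L_tr[:, [0]] - 2.0) * np.ones((1, T - n_pre))
Y_co = L_co @ F_true.T
Y_tr = L_tr @ F_true.T
Y_tr[:, n_pre:] += tau_true

tau_hat = gsc_effects(Y_tr, Y_co, n_pre, R)
att_hat = float(tau_hat.mean())
```

Step 2 needs at least as many pre-intervention periods as factors to be solvable, which is in line with the data requirement stated above; with noisy data, more periods would be needed for stable estimates.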

| IMPLEMENTING THE METHODS IN THE RE-ANALYSIS OF BPT FOR HIP FRACTURES
We replicated the DiD and SC estimations reported in a previously published study. 2 The DiD estimation was undertaken at the hospital level and controlled for covariates (age, gender, source of admission), together with two-way fixed effects for time periods and hospitals.
The SC method averaged the treated units to define a single treated unit, and a synthetic control was formed from the control units. In our implementation of the SC method, we included all of the pre-intervention outcomes as separate variables in the $X_0$ and $X_1$ matrices. The variable weights were determined simultaneously with the synthetic control weights 17 as implemented in the Stata package synth.
The IFE model was estimated using the iterative principal component estimator. 37 In our implementations of IFE and GSC, we included the time-varying covariates in the IFE model, two-way fixed effects, and up to five interactive fixed effects, with the number chosen by cross-validation, following Algorithm 1 in Xu. 41 For inference, we used a parametric bootstrap with 500 replications.
For each method, we report p-values from the approach to inference most commonly used with that method, recognizing that differences in these approaches limit the comparability of the resulting p-values. i For the SC method, we use placebo tests for inference 2,17 ; for the GSC method, we use a bootstrap approach 41 ; and for the DiD and IFE methods, we report p-values based on cluster-robust standard errors.

| SIMULATION STUDY
We compare the methods in a Monte Carlo simulation study where the true ATT is known and contrast the approaches according to mean bias (%) and RMSE. Building from the case study, we create 500 datasets of 111 units, of which 62 (49) were assigned to treatment (control) as in the case study j and simulate data for up to 22 periods, with four of these assigned to be post-treatment. The data generating process (DGP) includes one observed covariate ($X_{it}$), 2-way additive fixed effects ($\mu_{i1}$ and $\lambda_{1t}$), and a further two interacted factors and an additive treatment effect:

$$Y_{it} = \tau_{it} D_{it} + X_{it}\beta + \mu_{i1} + \lambda_{1t} + \lambda_{2t}\mu_{i2} + \lambda_{3t}\mu_{i3} + \varepsilon_{it}$$

We draw $X_i$, $\mu_{i1}$, $\mu_{i2}$, and $\mu_{i3}$ from a standard multivariate normal distribution and $\lambda_{1t}$ from a uniform(0,5) distribution. k To create a time-varying $X_{it}$, we then define $X_{it} = 0.5X_i + 0.5 \cdot N(0,1)$. Here, $\varepsilon_{it}$ is a standard normally distributed idiosyncratic error term. To introduce imbalance between the treated and control groups, the means of $\mu_{i1}$, $\mu_{i2}$, and $\mu_{i3}$ are set two standard deviations higher for the treated units than for the controls. In scenario A, we ensure the parallel trends assumption holds by setting $\lambda_{2t} = \lambda_{3t} = 0$, so the DGP becomes a standard two-way fixed effects model. In scenario B, we allow for monotonically increasing nonparallel trends by setting $\lambda_{2t} = 0.2t$ and $\lambda_{3t} = 0.1t$.
The performance of the SC method in scenario B may be negatively affected by our inclusion of time-varying covariates ($X_{it}$) since the SC weights are time-invariant, and by the imbalance in $\mu$ leading to treated units that lie outside of the convex hull of the controls. Scenario C represents a setting without these specific challenges. Here, we use $X_i$ in place of $X_{it}$ so that we have time-invariant covariates, and to ensure that the average treated unit lies in the convex hull of the controls, for 25% of the control units we increase $\mu_{i2}$ and $\mu_{i3}$ by 4 standard deviations so that these units' outcomes are likely to lie above those of the average treated unit, while the remaining 75% of controls tend to lie below. In scenario D, we include an additional postintervention shock, $\Delta_{it} = 2$, that only affects the treated group.
We consider scenarios (A1, B1, C1, and D1) where the treatment effect is homogeneous ($\tau_{it} = 1$), and otherwise identical scenarios (A2, B2, C2, and D2) with a heterogeneous treatment effect, in which we define $\tau_{it} = 1 + \mu_{i1} - 2$. l We then apply each method to estimate the average treatment effect for the treated group as a whole over the postintervention period. We consider the methods' performance across pretreatment periods of different lengths (6, 9, 12, and 18 periods). Finally, we assess the impact of imbalance in the numbers of treated (n = 10) vs control (n = 100) units (scenario E; Appendix S1).
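To make the data generating process concrete, the sketch below simulates a single dataset for scenarios A and B and applies a simple two-group DiD contrast. It is an illustration under stated assumptions rather than the simulation code used in the study: the covariate coefficient is set to 1 (its value is not given in the text), the correlation matrix is the one reported in the footnote to this section, scenarios C to E are not implemented, and the function names `simulate` and `did` are hypothetical.

```python
import numpy as np

# Correlation matrix for (X_i, mu_i1, mu_i2, mu_i3), as reported in the footnote.
C = np.array([[1.0, 0.5, 0.5, 0.3],
              [0.5, 1.0, 0.5, 0.3],
              [0.5, 0.5, 1.0, 0.5],
              [0.3, 0.3, 0.5, 1.0]])

def simulate(scenario, heterogeneous=False, n_tr=62, n_co=49,
             n_pre=12, n_post=4, beta=1.0, seed=0):
    """One simulated panel for scenario 'A' or 'B'. beta = 1 is an assumption."""
    rng = np.random.default_rng(seed)
    n, T = n_tr + n_co, n_pre + n_post
    treated = np.arange(n) < n_tr
    t = np.arange(1, T + 1)
    draws = rng.multivariate_normal(np.zeros(4), C, size=n)
    X_i, mu = draws[:, 0], draws[:, 1:].copy()
    mu[treated] += 2.0                       # means two SDs higher for treated units
    lam1 = rng.uniform(0, 5, size=T)
    if scenario == "A":                      # parallel trends
        lam2, lam3 = np.zeros(T), np.zeros(T)
    elif scenario == "B":                    # nonparallel trends
        lam2, lam3 = 0.2 * t, 0.1 * t
    else:
        raise ValueError("only scenarios 'A' and 'B' are sketched here")
    X = 0.5 * X_i[:, None] + 0.5 * rng.normal(size=(n, T))
    D = (treated[:, None] * (t > n_pre)[None, :]).astype(float)
    if heterogeneous:
        tau = (1.0 + mu[:, [0]] - 2.0) * np.ones((1, T))
    else:
        tau = np.ones((n, T))
    Y = (tau * D + beta * X + mu[:, [0]] + lam1
         + np.outer(mu[:, 1], lam2) + np.outer(mu[:, 2], lam3)
         + rng.normal(size=(n, T)))
    att = float((tau * D).sum() / D.sum())   # sample ATT over treated post periods
    return Y, D, X, att

def did(Y, D):
    """Simple DiD: change in mean outcome, treated minus control (no covariates)."""
    treated = D.max(axis=1) == 1
    post = D.max(axis=0) == 1
    change = Y[:, post].mean(axis=1) - Y[:, ~post].mean(axis=1)
    return float(change[treated].mean() - change[~treated].mean())

Y_a, D_a, X_a, att_a = simulate("A", seed=1)
Y_b, D_b, X_b, att_b = simulate("B", seed=1)
Y_h, D_h, X_h, att_h = simulate("B", heterogeneous=True, seed=1)
did_a, did_b = did(Y_a, D_a), did(Y_b, D_b)
```

Under these assumptions, with 12 pre-intervention and 4 postintervention periods, the differential trend in scenario B implies an expected bias for this simple DiD contrast of about 2 × 0.2 × 8 + 2 × 0.1 × 8 = 4.8, since the mean postintervention period index exceeds the mean pre-intervention index by 8.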

| Case study results
The estimated effects of the introduction of the BPT for hip fractures according to method are reported in Table 1. For both endpoints, the IFE method reports that the magnitude of the effect of BPT is larger than for the other methods. However, since differences in unobserved covariates, such as hospital quality, are likely to modify the effects of the policy, this may reflect bias due to heterogeneous treatment effects.
The DiD, SC, and GSC methods provide similar point estimates. The p-values do differ somewhat across the approaches, but the interpretation of these differences must recognize that the SC approach to inference differs from that of the other methods. The GSC method reports that the introduction of BPT increases the proportion of patients who have surgery within 48 hours, and suggests that the scheme leads to a reduction in mortality, although this difference is not statistically significant.

| Simulation results
Table 2 reports the corresponding mean bias (%) and root mean squared error (RMSE). We begin by considering the scenarios where effects are homogeneous (scenarios A1, B1, C1, and D1, panel (a) of Figure 3). As expected if the parallel trends assumption holds, DiD performs best (scenario A1), although IFE and GSC perform almost as well (Table 2, Figure 3(i)). By contrast, SC performs poorly, providing biased estimates attributable to the average treated unit tending to lie outside the convex hull of controls. Where the parallel trends assumption fails (scenario B1), DiD provides biased estimates, whereas IFE and GSC report minimal bias (Table 2, Figure 3(ii)). The SC method again provides biased estimates. In scenario C1, the performance of the SC method improves markedly (Table 2, Figure 3(iii)) since here the treated units tend to lie inside the convex hull of the controls.
When a shock has a differential effect for the treated vs control group in the postintervention period (scenario D1), all methods provide biased estimates (Table 2, Figure 3(iv)).
In those scenarios with heterogeneous treatment effects (scenarios A2, B2, and C2, panel (b) of Figure 3), the GSC method continues to perform well, providing estimates with low bias and low RMSE (

| DISCUSSION
This paper critically assesses two causal inference approaches, IFE and GSC methods, new to health policy evaluation, and contrasts them with DiD estimation and the SC method. The paper extends previous papers in the health policy and political science literatures 4-15,41,74 by contrasting not only IFE and GSC, but also approaches often considered in the HSR literature (DiD and SC). Rather than focus solely on simple scenarios, 41 the paper considers a range of settings relevant to the HSR context, including homogeneous and heterogeneous treatment effects, parallel trends and nonparallel trends, highly imbalanced numbers of treatment and control units, serial correlation, and idiosyncratic shocks. While our paper underscores the main finding from Xu's early simulation study, 41 that GSC performs better than IFE when there is treatment effect heterogeneity, it offers a wider set of insights into the relative performance of GSC vs alternative methods in settings of direct relevance to the HSR context.
Our re-evaluation of the BPT scheme exemplifies many critical issues faced in health policy evaluations. Here, there are multiple outcomes with the parallel trends assumption plausible for some but not others; the effects of the policy are anticipated to differ across hospitals; and data are only available for relatively few periods pre-intervention. An attractive feature of the IFE and GSC methods is that they allow the analyst to adopt a consistent analytical approach across all outcomes, as their factor structure allows greater flexibility in controlling for unobserved confounders. However, the IFE estimator assumes homogeneous treatment effects, which is unlikely in this study. Here, the GSC method is preferred in light of its robustness to the assumption of parallel/nonparallel trends and homogeneous/heterogeneous effects.
It reported that BPT led to a large m and statistically significant increase in the proportion of patients who had surgery within 48 hours of admission, together with a small, but not statistically significant, reduction in 30-day mortality.
The simulation study found that the GSC approach performed better than the alternatives considered across a range of challenging settings typically faced in health economic and policy evaluations that use routine data, namely nonparallel trends, heterogeneous treatment effects, and few (6) pre-intervention periods. However, when deciding which methods to apply to a particular setting, it is important to consider the underlying theory and requirements of the method. In particular, GSC and IFE approaches both require repeated observations of the same units over time (ie, panel data) and also require data for multiple pre-intervention periods (one more than the specified number of interactive fixed effects to include).
Generalized synthetic control reports relatively precise estimates across all these challenging settings. We find the method performs well even if there is limited support for particular underlying causal assumptions (eg, parallel trends). In light of this, for the case study, which has some of these features, we emphasize the policy conclusions from the GSC approach, namely that the BPT intervention increased the probability of surgery within 48 hours, but did not lead to a change in 30-day mortality. We also con- Second, the limitations of the originally proposed SC method 16,17 have led to recent modifications. The augmented SC approach 71 addresses the bias due to non-exact balance on pretreatment outcomes.
The imperfect SC 35 reduces the sensitivity of estimates to idiosyncratic errors by applying SC to predicted rather than actual outcomes.
A number of approaches relax the overlap requirement by allowing for negative weights. 29,35,71 Extensions of the SC method using machine learning methods such as ridge regression 71 and the matrix completion approach 31 appear promising. Inference for SC type methods is an area of active research, with several authors proposing extensions to the originally proposed placebo tests. 29,32,70 Future work is required that considers the relative performance of these methods and reports the coverage of alternative inferential procedures.
k We allow for correlation between $X_{it}$, $\mu_{i1}$, $\mu_{i2}$, and $\mu_{i3}$ with the correlation matrix $C = \begin{pmatrix} 1 & 0.5 & 0.5 & 0.3 \\ 0.5 & 1 & 0.5 & 0.3 \\ 0.5 & 0.5 & 1 & 0.5 \\ 0.3 & 0.3 & 0.5 & 1 \end{pmatrix}$.
l Note that since $E(\mu_{i1} \mid D_{it} = 1) = 2$ here, the true ATT is 1 in all scenarios.

ACKNOWLEDGMENTS
m The average rate of surgery within 48 hours was 58.3%.
n Where the treated unit's outcomes are very different to those of the controls, the most similar control will receive a weight of 1 and be used as the counterfactual for the treated unit, even though it may be very dissimilar.