Approximate balancing weights for clustered observational study designs

In a clustered observational study, a treatment is assigned to groups and all units within the group are exposed to the treatment. We develop a new method for statistical adjustment in clustered observational studies using approximate balancing weights, a generalization of inverse propensity score weights that solve a convex optimization problem to find a set of weights that directly minimize a measure of covariate imbalance, subject to an additional penalty on the variance of the weights. We tailor the approximate balancing weights optimization problem to the clustered observational study setting by deriving an upper bound on the mean square error and finding weights that minimize this upper bound, linking the level of covariate balance to a bound on the bias. We implement the procedure by specializing the bound to a random cluster-level effects model, leading to a variance penalty that incorporates the noise-to-signal ratio and penalizes the weight on individual units and the total weight on clusters differently according to the intra-class correlation.


Introduction
In a study of comparative effectiveness, researchers seek to understand whether a treatment has a causal effect on an outcome of interest for a set of study units. In the causal inference literature, treatment assignment is a critical element of the study design (Rubin, 2007, 2008). Two key components of treatment assignment are whether the intervention is randomized and whether it is grouped. First, when interventions are randomly assigned, differences between treated and control groups can be interpreted as causal effects.
In contrast, when subjects select their own treatments, stronger assumptions are needed to identify causal effects. Second, interventions may be assigned to individual units or to intact groups. For example, given that students are grouped in schools, a treatment may be assigned to all students in some schools and withheld from all students in other schools.
When treatment is randomly assigned and grouped, the design is often called a clustered randomized trial (CRT) (Raudenbush, 1997; Hedges and Hedberg, 2007). In many cases, however, treatments are grouped but non-randomly assigned. This design is referred to as a clustered observational study (COS) (Pimentel et al., 2018; Page et al., 2020). COS designs differ from non-clustered observational studies, both because they rely on different identification assumptions and because they require different methods for analysis (Ye et al., 2022).
In a COS design, as in any observational study, differences between treated and control outcomes may reflect initial differences between the treated and control groups rather than treatment effects (Cochran, 1965; Rubin, 1974). As such, analysis of COS designs requires statistical adjustment methods to account for observed confounders. Adjustment methods for COS designs, however, may need to remove treated-control differences in the distributions of covariates at the cluster level, the unit level, or both. Recent work has also shown that statistical adjustment in the COS design needs to reflect key substantive knowledge of how the clustered treatment is assigned and whether differential selection of clusters is present (Ye et al., 2022). In short, analysts require a statistical adjustment strategy that takes into account key aspects of the COS design.
In this article, we develop an approximate balancing weights estimator tailored to the COS design. Approximate balancing weights are a generalization of the standard inverse propensity score estimator that solve a convex optimization problem to find a set of weights that directly minimize a measure of covariate imbalance, subject to an additional constraint or penalty on the complexity of the weights (Hainmueller, 2011; Zubizarreta, 2015; Ben-Michael et al., 2021). The methodology, however, requires a number of key innovations to be used in the COS context. First, we write separate objective functions, with different balance measures and variance penalties, for the two possible estimands in a COS: the first adjusts for both cluster- and unit-level covariates, while the second adjusts for cluster-level covariates only. For both estimands, we derive a bound on the mean square error of a general weighting estimator and find weights that minimize this upper bound, showing that the level of balance between the treated and re-weighted control groups bounds the bias. Next, we specify the variance penalty via a cluster-level random effects model and show that it comprises two parts. The first component is the noise-to-signal ratio, which measures the overall impact of the variance relative to the bias. The second component is the intra-class correlation. These components correspond to hyperparameters in the balancing optimization problem, and we develop a data-driven approach for selecting them. For cases where it is sufficient to condition on cluster-level covariates only, we show that it can be more efficient to further adjust for unit-level covariates when they are strongly predictive of the outcome or when the intra-class correlation is moderate to high. In applications with poor overlap or many covariates, it can be difficult to find weights that achieve good balance. For these cases, we develop two separate extensions: using an outcome model to perform additional bias correction, and adjusting the balancing optimization problem to find the maximally overlapping set between the treated and control groups. Finally, we derive methods for variance estimation and show how to conduct inference based on asymptotic normality. In a series of simulations, we find that balancing weights are superior to multilevel matching, which is also tailored to the COS design, in terms of bias and variance reduction.
We provide further comparisons in two empirical applications, estimating the effects of Catholic schools and of surgical training. Overall, we find that balancing weights produce superior balance relative to extant matching methods and have larger effective sample sizes.
In Section 2, we review the COS design. In Section 3, we develop approximate balancing weights for the COS design, and we discuss extensions and inference in Section 4. In Section 5, we evaluate our methods in a simulation study. In Section 6, we analyze data from two COS designs: one from education and one from health services research. In Section 7, we conclude and discuss directions for future work.

The COS Design
First, we review the formal aspects of the COS design and how the process of differential selection leads to different estimands and adjustment strategies.See Ye et al. (2022) for a detailed treatment of identification issues in the COS design.

Notation
We consider a setup with $m$ clusters, with $n_\ell$ units in cluster $\ell$, and $n = \sum_{\ell=1}^m n_\ell$ total units. We denote the treatment status of cluster $\ell$ as $A_\ell \in \{0, 1\}$, $n_1 \equiv \sum_{\ell=1}^m A_\ell n_\ell$ as the total number of treated units, and $n_0 \equiv n - n_1$ as the total number of control units.
Each collection of treatment status vectors $a = (a_1, \ldots, a_m) \in \{0, 1\}^m$ is associated with a vector of potential cluster assignments, $J(a) = (J_1(a), \ldots, J_n(a)) \in \{1, \ldots, m\}^n$, where $J_i(a) \in \{1, \ldots, m\}$ denotes the cluster that unit $i$ would belong to under overall treatment allocation $a$. We denote the observed cluster assignments as $J = J(A)$. Each unit $i$ has a potential outcome $Y_i(a_{J_i(a)}, J(a))$, corresponding to the outcome that would be observed if its associated cluster has treatment status $a_{J_i(a)}$ and the overall cluster assignment is $J(a)$.
Note that here we have followed Ye et al. (2022) and assumed that a unit's potential outcome only depends on its own cluster's treatment status and not the treatment status of other clusters, except through the potential cluster assignments. We further assume that potential outcomes are independent across clusters, i.e., they are independent for units $i$ and $i'$ if $J_i(a) \neq J_{i'}(a)$, but may be dependent within clusters. We denote the observed outcomes as $Y_i = Y_i(A_{J_i}, J)$. We also assume that we observe unit-level covariates $X_i \in \mathcal{X}$ and that the cluster assignments lead to potential cluster-level covariates, $W_\ell(J(a)) \in \mathcal{W}$, which may include summaries of the unit-level covariates and so can depend on the potential cluster assignments. We let $W_\ell = W_\ell(J(A))$ denote the observed cluster-level covariates for cluster $\ell$, and $W$ denote the collection of observed cluster-level covariates for all clusters. Taken together, our observed data consist of tuples $(X_i, J_i, W_{J_i}, A_{J_i}, Y_i)$.
Next, we define several conditional expectations of the observed outcomes that we will use throughout. First, we denote $m_w(a, w) \equiv E[Y_i \mid A_{J_i} = a, W_{J_i} = w]$ as the expected outcome for unit $i$, conditioned on its cluster having treatment status $a$ and covariates $w$. Next, we use $m_{wx}(a, w, x) \equiv E[Y_i \mid A_{J_i} = a, W_{J_i} = w, X_i = x]$ to denote the expected outcome if we add more information and additionally condition on unit-level covariates $x$. Finally, let $e(w) = P(A_\ell = 1 \mid W_\ell = w)$ denote the propensity score, the probability that a cluster is treated conditional on cluster-level covariates, and $e(w, x)$ denote the propensity score conditioning on both cluster- and unit-level covariates.
2.2 Estimand, Designs, and Assumptions

Ye et al. (2022) delineate two primary COS designs. The first we denote as the Cluster-Unit Design (CUD), and the second we denote as the Cluster-Only Design (COD). Both designs focus on a common target causal estimand, the average treatment effect for the treated units under the observed cluster assignments $J$,
$$\tau \equiv E\left[ Y_i(1, J) - Y_i(0, J) \mid A_{J_i} = 1 \right] \equiv \mu_1 - \mu_0. \quad (1)$$
This estimand measures the effect of treatment for units in treated clusters, keeping the cluster assignments fixed. The first term, $\mu_1$, can be written as the expectation of the observed outcomes among units in the treated clusters. However, identifying and estimating $\mu_0$, the mean counterfactual outcome if those clusters had in fact been assigned to control, is more difficult. Next, we review the key assumptions required for identification of this target causal estimand under the two designs.

Cluster-Unit Design
The key distinction between the identification strategies for these two designs is whether the units within the clusters respond to treatment assignment to clusters. That is, under the CUD, the mix of units within clusters is changed by the fact that clusters are treated. Ye et al. (2022) refer to such bias as differential selection. Differential selection could occur if the student mix in a school changes in response to the school being treated. For example, parents may opt to enroll their child in a school implementing a whole-school curricular reform. For the CUD, we need to account for unit selection into the clusters, which can depend on the treatment assignments. Identification of $\mu_0$ under the CUD requires the following set of assumptions:

• For two cluster assignment vectors $j$ and $j'$ such that the treatment assignments are the same, $a_{j_i} = a_{j'_i}$, and the cluster-level covariates are unchanged, the potential outcomes are the same: $Y_i(a_{j_i}, j) = Y_i(a_{j'_i}, j')$.

• For every unit $i$, cluster $\ell$, treatment value $a$, and cluster-level covariate value $w$, treatment assignment is ignorable given the cluster- and unit-level covariates.

• $e(w, x) > 0$ for all $w \in \mathcal{W}$, $x \in \mathcal{X}$.
Taken together, these assumptions form an ignorability assumption that includes both the cluster-level and unit-level covariates that drive treatment selection. Here, we identify $\mu_0$ as
$$\mu_0 = E\left[ m_{wx}(0, W_{J_i}, X_i) \mid A_{J_i} = 1 \right].$$

Cluster-Only Design
For the COD, treatment assignment is restricted to cases where either treatment assignment occurs after the unit-cluster pairing, or units are blinded to the cluster assignments before pairing. As such, treatment assignment only depends on cluster-level covariates. Identification of $\mu_0$ under the COD requires the following set of assumptions:

• Cluster assignments are not affected by treatment: $J(a) = J$ for all $a \in \{0, 1\}^m$.
• For every unit $i$, cluster $\ell$, cluster assignment vector $j$, and treatment value $a$, treatment assignment is ignorable given the cluster-level covariates.

• $e(w) > 0$ for all $w \in \mathcal{W}$.
Here, conditioning on the cluster-level covariates is sufficient to remove confounding. As such, under the COD, the expected outcome conditional on the cluster being assigned to control and the cluster-level covariates identifies $\mu_0$ as
$$\mu_0 = E\left[ m_w(0, W_{J_i}) \mid A_{J_i} = 1 \right].$$
The CUD requires a stronger assumption about the form of the potential outcomes but a weaker assumption on the treatment selection process than the COD. However, the key distinction between these two identification strategies is the conditioning sets: for the CUD, we must condition on both unit- and cluster-level covariates, while for the COD, we only need to condition on cluster-level covariates. As such, the COD is a special case of the CUD without unit-level covariates. Below, we consider the implications of conditioning on unit-level covariates when the assumptions of the COD hold, and we show that while unit-level covariates are unnecessary for identification, they may be beneficial in terms of increased precision. Next, we tailor approximate balancing weight methods to each design.
3 Approximate Balancing Weights for the COS Design

Recent work has developed matching methods, known as multilevel matching, that are specifically tailored to the COS design (Keele and Zubizarreta, 2017; Pimentel et al., 2018; Keele et al., 2021). Here, we outline approximate balancing weights as an alternative to multilevel matching. Approximate balancing weights are a generalization of the standard inverse propensity score (IPW) estimator that solve a convex optimization problem to find a set of weights that directly minimize a measure of covariate imbalance, subject to an additional constraint or penalty on the complexity of the weights (Hainmueller, 2011; Zubizarreta, 2015). Like matching, and unlike IPW, balancing weights directly target covariate balance in the estimation process, rather than relying on the estimated probability of being selected for treatment. Like weighting, and unlike matching, balancing weights use more of the available data and so often have larger effective sample sizes.
Recent theoretical innovations have bolstered support for balancing weights. For example, it has been shown that balancing weights are implicit estimates of the inverse propensity score, fit via a loss function that guarantees covariate balance (Zhao and Percival, 2016; Zhao, 2019; Wang and Zubizarreta, 2019; Chattopadhyay et al., 2020). Researchers have proposed non-parametric extensions that allow for flexible specifications of the outcome model or propensity score (Hirshberg et al., 2019; Hazlett, 2019). Finally, recent literature on weighting allows these estimators to target different quantities, such as sample overlap (Li et al., 2018). See Ben-Michael et al. (2021) for a general overview of balancing weights.
Approximate balancing weights for the COS design, however, require the development of specific objective functions that are tailored to the identification conditions outlined above.
Next, we develop objective functions that target measures of covariate imbalance and variance penalties that are specific to COS designs. As we will see, one key challenge in developing objective functions for COS designs is accounting for how clustering affects the variance of the weights. Below, we consider a generic weighting estimator of the form
$$\hat{\mu}_0(\gamma) = \frac{1}{n_1} \sum_{i : A_{J_i} = 0} \gamma_i Y_i,$$
where the weights $\gamma_i$ are independent of the outcomes. We will first consider conditioning on cluster- and unit-level covariates to target $\mu_{0wx}$. Then we will specialize this estimator to only condition on cluster-level covariates and target $\mu_{0w}$. For both cases, we inspect the mean square error of the generic weighted average of control units' outcomes, then find weights that minimize an upper bound. Under both settings, we will consider the finite-sample estimands
$$\mu_{0wx} = \frac{1}{n_1} \sum_{i : A_{J_i} = 1} m_{wx}(0, W_{J_i}, X_i), \qquad \mu_{0w} = \frac{1}{n_1} \sum_{i : A_{J_i} = 1} m_w(0, W_{J_i}),$$
which converge to our main estimands.
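As a concrete illustration, the generic weighting estimator is just a weighted sum of control outcomes scaled by the number of treated units. A minimal sketch in Python (the function name and array layout are our own):

```python
import numpy as np

def weighting_estimator(y_control, gamma, n_treated):
    """Generic weighting estimator: a gamma-weighted sum of control
    outcomes, scaled by the number of treated units n_1."""
    y_control = np.asarray(y_control, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    return np.sum(gamma * y_control) / n_treated
```

With uniform weights that sum to the number of treated units, this reduces to the simple control-group mean; non-uniform weights re-weight the control units to resemble the treated clusters.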

Conditioning on cluster-and unit-level covariates
To target $\mu_{0wx}$, we use the following decomposition for the estimation error of the weighting estimator:
$$\hat{\mu}_0(\gamma) - \mu_{0wx} = \underbrace{\frac{1}{n_1} \sum_{i : A_{J_i} = 0} \gamma_i \, m_{wx}(0, W_{J_i}, X_i) - \frac{1}{n_1} \sum_{i : A_{J_i} = 1} m_{wx}(0, W_{J_i}, X_i)}_{\text{imbalance}} \; + \; \underbrace{\frac{1}{n_1} \sum_{i : A_{J_i} = 0} \gamma_i \, \varepsilon_i}_{\text{noise}}, \quad (2)$$
where $\varepsilon_i \equiv Y_i - m_{wx}(0, W_{J_i}, X_i)$. To understand this decomposition, we compute the design-conditional bias and variance.
First, since the weights are independent of the outcomes, under the ignorability condition for the CUD, the conditional bias corresponds to the first term in the decomposition. This bias is the post-weighting imbalance in a particular function of the cluster- and unit-level covariates: the conditional expectation function $m_{wx}(0, \cdot, \cdot)$. If we knew $m_{wx}$, then we could remove all bias by ensuring that this function is balanced. Unfortunately, we do not know it; otherwise we would be able to perfectly impute the missing potential outcomes.
Instead, following Hirshberg et al. (2019) and Ben-Michael et al. (2021), we consider a class of potential models $\mathcal{M}_{wx}$ and consider the worst-case bias over this class:
$$\max_{m \in \mathcal{M}_{wx}} \left| \frac{1}{n_1} \sum_{i : A_{J_i} = 0} \gamma_i \, m(W_{J_i}, X_i) - \frac{1}{n_1} \sum_{i : A_{J_i} = 1} m(W_{J_i}, X_i) \right|.$$
Here, since we are conditioning on both unit- and cluster-level covariates, the model class $\mathcal{M}_{wx}$ can be relatively complex; for example, it might include interactions between $W_{J_i}$ and $X_i$. Next, the design-conditional variance corresponds to the second term:
$$V^{\mathrm{unit}}(\gamma) = \frac{1}{n_1^2} \left[ \sum_{i : A_{J_i} = 0} \gamma_i^2 \operatorname{Var}(\varepsilon_i) + \sum_{\ell : A_\ell = 0} \; \sum_{\substack{i \neq i' \\ J_i = J_{i'} = \ell}} \gamma_i \gamma_{i'} \operatorname{Cov}(\varepsilon_i, \varepsilon_{i'}) \right],$$
where $\varepsilon_i = Y_i - m_{wx}(0, W_{J_i}, X_i)$ is the residual in the observed outcomes after conditioning on cluster- and unit-level covariates. Note that, due to potential correlation in the outcomes within clusters, the variance term includes a cross term involving the product of pairs of weights within each cluster, in addition to the typical squared-weight term used in settings where the outcomes are independent. Taken together, we try to find weights that minimize an upper bound on the design-conditional mean square error:
$$\min_{\gamma} \; \max_{m \in \mathcal{M}_{wx}} \left( \frac{1}{n_1} \sum_{i : A_{J_i} = 0} \gamma_i \, m(W_{J_i}, X_i) - \frac{1}{n_1} \sum_{i : A_{J_i} = 1} m(W_{J_i}, X_i) \right)^2 + V^{\mathrm{unit}}(\gamma). \quad (5)$$
Implementing the optimization in Equation (5) requires making several modeling choices.
First, we must choose the model class for the conditional expectation function, $\mathcal{M}_{wx}$. Second, we must specify the variances and covariances of the residuals. In making these choices, we will attempt to strike a balance between flexibility and practicality.
We begin by selecting the model class $\mathcal{M}_{wx}$. In the non-clustered treatment assignment setting, many model classes have been considered, from sparse models (Zubizarreta, 2015) to models with expanding basis functions (Wang and Zubizarreta, 2020) to reproducing kernel Hilbert spaces (Hirshberg et al., 2019). See Ben-Michael et al. (2021) for a recent review and a discussion of the difficulty of balancing these model classes. To target $\mu_{0wx}$ in the COS design, we consider a model class that is linear in a transformation $\psi(w, x)$ of the covariates, incorporating both cluster- and unit-level covariates, with $L^2$-bounded coefficients and an unbounded intercept, i.e.:
$$\Psi_{wx} = \left\{ m(w, x) = \alpha + \beta \cdot \psi(w, x) \; : \; \alpha \in \mathbb{R}, \; \|\beta\|_2 \leq C_{wx} \right\}.$$
These model classes naturally allow for a non-parametric extension to reproducing kernel Hilbert spaces with infinite-dimensional transformations via the "kernel trick", e.g., defining a kernel $k((w_1, x_1), (w_2, x_2)) = \psi(w_1, x_1) \cdot \psi(w_2, x_2)$. Notably, these transformations $\psi(w, x)$ can include interactions between the two levels of variables, allowing the cluster context to affect the relationship between the outcome and the unit-level covariates.
To specify the variances and covariances of the residuals, we use a random effects model.
Momentarily abusing notation, we write the residual between unit $i$'s outcome and its expected control outcome conditional on cluster- and unit-level covariates as
$$\varepsilon_i = \delta_{J_i} + \nu_i,$$
with a cluster-level random effect $\delta_{J_i}$ and an independent unit-level residual $\nu_i$. We parameterize the variance with two terms: $\operatorname{Var}(\delta_\ell) = \sigma^2 \rho$ and $\operatorname{Var}(\nu_i) = \sigma^2 (1 - \rho)$. In the random-effects model, $\sigma^2$ represents the total residual variance, and $\rho$ is the intra-class correlation (ICC), a well-known measure of relatedness within clusters. Under this random effects model, the variance is
$$V^{\mathrm{unit}}(\gamma) = \frac{\sigma^2}{n_1^2} \left[ (1 - \rho) \sum_{i : A_{J_i} = 0} \gamma_i^2 + \rho \sum_{\ell : A_\ell = 0} \Big( \sum_{i : J_i = \ell} \gamma_i \Big)^2 \right].$$
We then find the weights $\gamma_i$ that solve the following optimization problem:
$$\min_{\gamma} \; \left\| \frac{1}{n_1} \sum_{i : A_{J_i} = 0} \gamma_i \, \psi(W_{J_i}, X_i) - \frac{1}{n_1} \sum_{i : A_{J_i} = 1} \psi(W_{J_i}, X_i) \right\|_2^2 + \frac{\sigma^2}{C_{wx}^2 \, n_1^2} \left[ (1 - \rho) \sum_{i : A_{J_i} = 0} \gamma_i^2 + \rho \sum_{\ell : A_\ell = 0} \Big( \sum_{i : J_i = \ell} \gamma_i \Big)^2 \right]$$
$$\text{subject to} \quad \sum_{i : A_{J_i} = 0} \gamma_i = n_1, \qquad L \leq \gamma_i \leq U, \quad (6)$$
where we have included some additional optional upper and lower bounds on the weights.
This objective function implements the general balancing weights problem in Equation (5) with the model class $\Psi_{wx}$ and a random-effects model. The variance penalty includes two hyperparameters. First, there is the noise-to-signal ratio, $\sigma^2 / C_{wx}^2$, which measures the overall impact of the variance relative to the bias in the MSE. If this ratio is small, then better balance will be prioritized over lower variance; if the ratio is large, then the opposite is true.
The second hyperparameter is the ICC, which determines the level of penalization for each cluster. If the ICC is small and units' outcomes are nearly uncorrelated, then the weight on each unit is penalized equally. Conversely, if the ICC is large and the outcomes are highly correlated within clusters, the penalization instead focuses on the total weight assigned to each cluster. We discuss setting these hyperparameters below. Note that this specification of the variance penalty is equivalent to that of Rubinstein et al. (2022), who consider region-level policy analysis via weighting.
The weights are also constrained to sum to the total number of treated units, and to be bounded between a lower bound $L$ and an upper bound $U$. The former constraint comes from the possibility of an unbounded intercept in the model class $\Psi_{wx}$; by enforcing the sum constraint, we can be sure that the estimator is invariant to constant shifts in the outcome. The latter constraint acts as a form of regularization, and allows us to perform typically post hoc adjustments, such as weight trimming, directly when finding our weights.
If $L = 0$ and $U = \infty$, the estimator will be restricted from extrapolating away from the support of the control data (see Ben-Michael et al., 2021, for further discussion of the role of extrapolation). If we additionally set $U$ to be finite, we can prevent any weights from becoming extremely large, at the cost of some balance.
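The pieces above, a squared imbalance term in the transformed covariates, a random-effects variance penalty governed by the noise-to-signal ratio and the ICC, a sum constraint, and non-negativity bounds, can be combined into a small numerical sketch. This is our own illustrative implementation using a general-purpose solver, not the authors' software; a dedicated convex solver would be preferable in practice:

```python
import numpy as np
from scipy.optimize import minimize

def balancing_weights(psi_c, psi_t, cluster_c, noise_to_signal, icc):
    """Sketch of the unit-level balancing problem: minimize squared
    imbalance in the transformed covariates plus the random-effects
    variance penalty, subject to the weights summing to n_1 and being
    non-negative (L = 0, U = infinity)."""
    n0, _ = psi_c.shape
    n1 = psi_t.shape[0]
    target = psi_t.mean(axis=0)                 # treated covariate means
    masks = [cluster_c == c for c in np.unique(cluster_c)]

    def objective(gamma):
        imbalance = psi_c.T @ gamma / n1 - target
        unit_pen = (1 - icc) * np.sum(gamma ** 2)            # per-unit term
        clus_pen = icc * sum(gamma[m].sum() ** 2 for m in masks)  # cluster term
        return imbalance @ imbalance + noise_to_signal * (unit_pen + clus_pen) / n1 ** 2

    res = minimize(objective, x0=np.full(n0, n1 / n0), method="SLSQP",
                   bounds=[(0.0, None)] * n0,
                   constraints=[{"type": "eq", "fun": lambda g: g.sum() - n1}])
    return res.x
```

The returned weights sum to the number of treated units, so the weighted control average is directly comparable to the treated mean.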

Conditioning on cluster-level covariates only
Next, we consider estimating $\mu_{0w}$, which does not condition on unit-level covariates. This weighting estimator ignores the unit-level information, with weights that are constant within clusters, i.e., $\gamma_i^{\mathrm{clus}} = \bar{\gamma}_{J_i}$ for all units $i$. Here, the results are a special case of the CUD, removing the unit-level covariates from the optimization procedure. The design-conditional bias and variance in this special case are:
$$\frac{1}{n_1} \sum_{i : A_{J_i} = 0} \bar{\gamma}_{J_i} \, m_w(0, W_{J_i}) - \frac{1}{n_1} \sum_{i : A_{J_i} = 1} m_w(0, W_{J_i}), \qquad V^{\mathrm{clus}}(\bar{\gamma}) = \frac{1}{n_1^2} \sum_{\ell : A_\ell = 0} \bar{\gamma}_\ell^2 \operatorname{Var}\Big( \sum_{i : J_i = \ell} \varepsilon_i \Big). \quad (7)$$
Here, the bias only depends on imbalance in a function of the cluster-level covariates, $m_w$.
As above, we consider a model class $\mathcal{M}_w$ and upper bound the bias by the worst-case imbalance:
$$\max_{m \in \mathcal{M}_w} \left| \frac{1}{n_1} \sum_{i : A_{J_i} = 0} \bar{\gamma}_{J_i} \, m(W_{J_i}) - \frac{1}{n_1} \sum_{i : A_{J_i} = 1} m(W_{J_i}) \right|.$$
Comparing the variances $V^{\mathrm{clus}}$ and $V^{\mathrm{unit}}$, we see that restricting to cluster-level covariates only removes the cross term between weights in the same cluster, so $V^{\mathrm{clus}}$ depends principally on the sum of the squared weights, weighted by the total variance of the outcomes in each cluster. Tailoring Equation (5) yields the following optimization problem to control the MSE. To implement this optimization procedure, we use the same model class and variance model as above, removing the unit-level covariates. For the model class, we use the set of models that are linear in transformations $\phi(w)$ of the cluster-level covariates:
$$\Psi_w = \left\{ m(w) = \alpha + \beta \cdot \phi(w) \; : \; \alpha \in \mathbb{R}, \; \|\beta\|_2 \leq C_w \right\}.$$
For the variance model, we again assume that the residual decomposes into a cluster-level residual $d_{J_i}$ and an independent unit-level residual $u_i$, with $\operatorname{Var}(u_i) = s^2 (1 - r)$ and $\operatorname{Var}(d_\ell) = s^2 r$. Here, $s^2$ represents the total variance in the residual, and $r$ is the ICC. Now, the variance can be written as
$$V^{\mathrm{clus}}(\bar{\gamma}) = \frac{s^2}{n_1^2} \sum_{\ell : A_\ell = 0} \bar{\gamma}_\ell^2 \, n_\ell \left[ 1 + (n_\ell - 1) r \right].$$
Putting together the pieces, we find weights $\bar{\gamma}$ that solve the following optimization problem:
$$\min_{\bar{\gamma}} \; \left\| \frac{1}{n_1} \sum_{i : A_{J_i} = 0} \bar{\gamma}_{J_i} \, \phi(W_{J_i}) - \frac{1}{n_1} \sum_{i : A_{J_i} = 1} \phi(W_{J_i}) \right\|_2^2 + \frac{s^2}{C_w^2 \, n_1^2} \sum_{\ell : A_\ell = 0} \bar{\gamma}_\ell^2 \, n_\ell \left[ 1 + (n_\ell - 1) r \right]$$
$$\text{subject to} \quad \sum_{i : A_{J_i} = 0} \bar{\gamma}_{J_i} = n_1, \qquad L \leq \bar{\gamma}_\ell \leq U. \quad (10)$$
The cluster-level balancing problem in Equation (10) and the unit-level problem in Equation (6) share many similarities, including the constraints on the weights and the noise-to-signal ratio. However, the key difference is whether the weights can differ within clusters in order to balance the additional unit-level covariates. This is most apparent in how the ICC affects the variance penalty. When only cluster-level covariates are included, the ICC only appears through the role that the cluster size $n_\ell$ plays in the variance penalty. When weights are allowed to differ by unit, however, this creates more options in how the weights are penalized.
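Under the cluster-level random-effects decomposition described above, each control cluster's summed residual has variance $s^2 n_\ell (1 + (n_\ell - 1) r)$, so the cluster-only variance term can be computed directly. A small sketch (function name and argument layout are our own):

```python
import numpy as np

def v_clus(gamma_bar, n_l, s2, r, n1):
    """Cluster-only variance: each control cluster contributes its squared
    weight times the variance of its summed residuals under the
    random-effects model, Var(sum_i eps_i) = s2 * n_l * (1 + (n_l - 1) * r)."""
    gamma_bar = np.asarray(gamma_bar, dtype=float)
    n_l = np.asarray(n_l, dtype=float)
    return s2 * np.sum(gamma_bar ** 2 * n_l * (1 + (n_l - 1) * r)) / n1 ** 2
```

When the ICC is zero this reduces to an independence-style penalty on cluster sizes; as the ICC grows, larger clusters are penalized roughly in proportion to the square of their size.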

Hyperparameter selection
Both of the objective functions specified above include two terms that we consider hyperparameters: the overall noise-to-signal ratio ($s^2/C_w^2$ or $\sigma^2/C_{wx}^2$) and the ICC. The noise-to-signal ratio governs the bias-variance tradeoff. As it approaches zero, more and more emphasis is placed on achieving covariate balance; in the (unreasonable) noiseless limit, we would not need or want to penalize the variance. On the other hand, as the noise-to-signal ratio increases and the covariates become less predictive of the outcome, the optimization will prioritize variance more. In contrast, the ICC term determines the form of the variance penalty. In both cases, when the ICC is 0, the weights on each unit are penalized separately, since units' outcomes are independent. When the ICC is 1, units' outcomes move together completely within a group, in which case the total weight placed on the group is penalized.
Intermediate values of the ICC correspond to a mixture of these penalization schemes.
Note that the particular values of the noise-to-signal ratio and the ICC only enter the estimator through the variance penalties on the weights in Equations (6) and (10). This is why we view them as hyperparameters in the optimization problem rather than parameters to be estimated. However, we can use the data at hand to guide our choice of these hyperparameters. We do so by regressing the outcome on the covariates with a random intercept model. The random intercept model estimates cluster- and unit-level variance components that can be used to estimate the ICC. To estimate the noise-to-signal ratio, we take the ratio of the estimated residual variance and the squared sum of the estimated regression coefficients from the fitted model. Below, we use this heuristic to choose the hyperparameters in our simulation studies and empirical analyses. Note that this procedure can induce a dependence between the outcomes and the weights through the hyperparameters, which can be avoided via sample splitting. Other forms of hyperparameter selection are possible, including cross-validation style approaches that evaluate balance on a held-out sample (Wang and Zubizarreta, 2020).
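This heuristic can be sketched with ordinary least squares plus a moment-based (ANOVA-type) split of the residual variance into between- and within-cluster components, in place of a full random-intercept fit. The function name, the moment estimator, and the use of the squared norm of the slope coefficients as the "signal" are our own simplifications:

```python
import numpy as np

def select_hyperparameters(y, x, cluster):
    """Heuristic hyperparameter choice: regress the outcome on the
    covariates, decompose the residual variance into between- and
    within-cluster components, and return (noise-to-signal ratio, ICC)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    cluster = np.asarray(cluster)

    design = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta

    labels = np.unique(cluster)
    within = np.mean([resid[cluster == c].var(ddof=1) for c in labels])
    between = np.var([resid[cluster == c].mean() for c in labels], ddof=1)
    nbar = len(y) / len(labels)
    tau2 = max(between - within / nbar, 0.0)   # between-cluster component
    sigma2 = within + tau2                     # total residual variance
    icc = tau2 / sigma2 if sigma2 > 0 else 0.0

    signal = np.sum(beta[1:] ** 2)             # squared norm of the slopes
    noise_to_signal = sigma2 / signal if signal > 0 else np.inf
    return noise_to_signal, icc
```

With sample splitting, the regression would be fit on a held-out fold so the hyperparameters do not induce dependence between the outcomes and the weights.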
3.4 When should we include unit-level covariates in the Cluster-Only Design?
Given that under the COD it is sufficient to include cluster-level covariates only, one natural question for this design is whether it is also useful to include unit-level covariates. Including unit-level covariates may decrease the variance: balancing these covariates is akin to adjusting for baseline covariates that predict the outcome in randomized experiments. On the other hand, balancing unit-level covariates may lead to more extreme weights and, depending on the ICC, this could lead to higher variance.
To characterize this trade-off, we consider the ratio of the design-conditional variance with and without unit-level covariates, $V^{\mathrm{unit}}$ and $V^{\mathrm{clus}}$, respectively. In cases with excellent balance in both cluster- and unit-level covariates, this gives a measure of the expected difference in the overall mean squared error. To simplify the problem, we restrict our attention to the cluster-level random effects model and assume that the ICC is unchanged when conditioning on unit-level covariates (i.e., $\rho = r$). Under these assumptions, the variance ratio is
$$\frac{V^{\mathrm{unit}}(\gamma)}{V^{\mathrm{clus}}(\bar{\gamma})} = \frac{\sigma^2}{s^2} \, d_{\mathrm{eff}}(\rho), \quad (11)$$
where
$$d_{\mathrm{eff}}(\rho) = \frac{(1 - \rho) \sum_{i : A_{J_i} = 0} \gamma_i^2 + \rho \sum_{\ell : A_\ell = 0} \big( \sum_{i : J_i = \ell} \gamma_i \big)^2}{(1 - \rho) \sum_{i : A_{J_i} = 0} \bar{\gamma}_{J_i}^2 + \rho \sum_{\ell : A_\ell = 0} \big( n_\ell \bar{\gamma}_\ell \big)^2}$$
is the design effect of including unit-level covariates.
The first term, $\sigma^2 / s^2 < 1$, represents the reduction in the total residual variance from conditioning on unit-level covariates, and serves to reduce the variance ratio by incorporating additional information. The more predictive the unit-level variables are, the lower we expect this ratio to be. Counteracting this, we have the design effect $d_{\mathrm{eff}}(\rho)$. Rearranging Equation (11), we see that for it to be beneficial to include unit-level covariates, the variance after conditioning on unit-level covariates must decrease by at least the design effect: $\sigma^2 < s^2 / d_{\mathrm{eff}}(\rho)$.
For any given value of $\rho$, we expect the cluster-level weights to be less extreme, because they do not have to additionally balance the unit-level covariates, so typically $d_{\mathrm{eff}}(\rho) \geq 1$.
Whether the sum of the squared weights on the units or on the clusters matters more to the design effect depends on the ICC. When $\rho$ is small, there is little effect of clustering on the variance, and so we have a standard tradeoff: including unit-level covariates will increase efficiency as long as the effective sample size does not decrease by more than a fraction $1 - \sigma^2/s^2$.
Conversely, when $\rho$ is large, the sum of the squared weights on the clusters dominates, in which case the unit-level weights can differ substantially from their average cluster-level weight without incurring much extra variance. In this case, including unit-level covariates can improve efficiency even if they are only somewhat predictive. Hedges and Hedberg (2007) report that, across 41 clustered randomized experiments in education, ICCs range from 0.07 to 0.31, with an average value of 0.17. Small et al. (2008) report ICCs in the range of 0.002 to 0.03 in public health interventions that target clusters such as hospitals or clinics. Thus, in education settings, where ICCs are larger and we typically have strongly predictive covariates such as prior test scores, it will likely be beneficial to include unit-level covariates. In contrast, for public health settings, the decision may depend much more on the particular context and the set of unit-level covariates available.
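The design effect compares the random-effects weight penalty for unit-varying weights to the penalty for weights held constant within clusters. A small sketch that evaluates it at a given ICC (names and argument layout are our own):

```python
import numpy as np

def design_effect(gamma_unit, gamma_clus, cluster, rho):
    """d_eff(rho): ratio of the random-effects penalty for unit-level
    weights to the penalty for constant-within-cluster weights.
    Both weight vectors are given per unit; gamma_clus should be
    constant within each cluster."""
    cluster = np.asarray(cluster)
    labels = np.unique(cluster)

    def penalty(g):
        g = np.asarray(g, dtype=float)
        unit_term = np.sum(g ** 2)                                   # independent part
        clus_term = sum(g[cluster == c].sum() ** 2 for c in labels)  # correlated part
        return (1 - rho) * unit_term + rho * clus_term

    return penalty(gamma_unit) / penalty(gamma_clus)
```

When the unit-level weights merely redistribute weight within clusters (so cluster totals match), the design effect shrinks toward 1 as the ICC approaches 1, consistent with the discussion above.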
4 Extensions and Inference

Dealing with poor balance
When there are a large number of cluster-level covariates relative to the total number of clusters, or conversely, when there are few clusters overall, it may be difficult to find weights that achieve good balance. Often this occurs when the overlap between the treated and control cluster covariate distributions is limited (Keele et al., 2022); this is true for the application in Section 6.1 below. We consider two different approaches to account for this. First, we outline using an outcome model to correct for the bias due to remaining imbalance. Second, we consider changing the estimand by also weighting the treated units to find a maximally sized overlapping set between the treated and control units.

Bias correction with an outcome estimator
Bias correction, sometimes also called augmentation, is a popular approach to reducing bias due to imbalance. We describe the bias-correction procedure when including both cluster- and unit-level covariates; restricting to cluster-level covariates alone is analogous. First, we estimate the conditional expectation function to get estimates $\hat{m}_{wx}(0, W_{J_i}, X_i)$.
There are many possible estimation strategies, one choice being regularized regression. Then we perform bias correction by estimating $\mu_0$ as
$$\hat{\mu}_0^{\mathrm{bc}}(\gamma) = \frac{1}{n_1} \sum_{i : A_{J_i} = 1} \hat{m}_{wx}(0, W_{J_i}, X_i) + \frac{1}{n_1} \sum_{i : A_{J_i} = 0} \gamma_i \left( Y_i - \hat{m}_{wx}(0, W_{J_i}, X_i) \right).$$
If we compare this to the bias in Equation (2), we see that $\hat{\mu}_0^{\mathrm{bc}}(\gamma)$ uses the estimated model to estimate the bias due to imbalance after weighting, then attempts to remove it by subtracting the estimated bias from the weighting estimate. In the bias-variance decomposition above, bias correction changes the imbalance measure to be over the worst-case model error $m - \hat{m}$, which will generally be smaller than the imbalance in the worst-case model (Hirshberg and Wager, 2021). This bias correction is akin to what has been proposed for matching estimators (Abadie and Imbens, 2011) and augmented IPW (Robins et al., 1994), and has been used with a variety of balancing weights estimators (e.g., Athey et al., 2018; Ben-Michael et al., 2021).
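The bias-corrected estimator can be sketched as the model's average prediction over the treated units plus the weighted control residuals; this follows the augmented form described in the text, with array names of our own choosing:

```python
import numpy as np

def bias_corrected_mu0(y_control, mhat_control, mhat_treated, gamma):
    """Bias-corrected estimate of mu_0: start from the model's average
    prediction of the control outcome over the treated units, then add
    the gamma-weighted control residuals to correct remaining model error."""
    n1 = len(mhat_treated)
    y_control = np.asarray(y_control, dtype=float)
    mhat_control = np.asarray(mhat_control, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    correction = np.sum(gamma * (y_control - mhat_control)) / n1
    return np.mean(mhat_treated) + correction
```

If the outcome model is exact on the control units, the correction term vanishes and the estimate is purely model-based; if the model is constant, the estimate reverts to the pure weighting estimator up to a level shift.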

Subset Weights: Finding a maximally overlapping set
One issue with bias correction based on an outcome model is that the additional use of modeling can lead to extrapolation away from the control units' data (Ben-Michael et al., 2021). One alternative to outcome modeling that can be attractive, especially if overlap is limited, is to focus on a more limited estimand than the ATT. Specifically, one such estimand is the average treatment effect for the overlap population (ATO). The ATO corresponds to the treatment effect for the marginal population that might or might not receive the treatment of interest, rather than a known, a priori well-defined population such as the treated group. Li et al. (2018) developed model-based overlap weights for the ATO that continuously down-weight the units in the tails of the propensity score distribution. We develop an analog with balancing weights for the COS setting.
To do so, we will weight the treated units as well as the control units. By weighting the treated units as well as the control units, we are no longer estimating the expected effect among the treated units; we are instead trimming the estimand. To understand what the new estimand is, consider the design-conditional expectation of the estimator. As before, we want to find weights that balance the conditional expected control outcome. Then τ̂(γ) will be an unbiased estimator of a weighted average of conditional treatment effects for the treated group. We will once again try to find weights that minimize the worst-case balance across a model class M_wx, now also weighting the treated units. The design-conditional variance of τ̂(γ) takes the same form as before, where we generalize the definition of the residual to cover the treated units as well. With this, we once again try to find weights that minimize the imbalance and the variance. Focusing on the specialization to the constrained linear transformation model class Ψ_wx and the random effects variance model, we find weights γ_i that solve the analogous optimization problem in Equation (14). This optimization problem tries to find weights on both treated and control units such that the weighted averages of their transformed covariates are similar. Hereafter, we refer to these weights as subset weights, since they weight the subset of the data for which the covariate distributions overlap. The variance penalty in Equation (14) serves to ensure that this weighted subset is pushed towards having a larger effective number of units. Finally, Equation (14) is also related to the Lagrangian dual of the SVM problem. Tarr and Imai (2021) show that the SVM dual minimizes the same measure of imbalance, but includes a penalty to ensure that the sum of the weights is not small, leading to a different measure of the "size" of the overlapping set.
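The subset-weights idea can be sketched as a small quadratic program: minimize the squared difference between the weighted treated and weighted control covariate means, plus a penalty on the weights. This is a simplified stand-in for Equation (14); the function and variable names are ours, and the penalty here is a plain ridge term rather than the paper's random-effects variance penalty.

```python
import numpy as np
from scipy.optimize import minimize

def subset_weights(X_t, X_c, lam=1.0):
    """Illustrative subset-weights problem: weight treated AND control units
    so their weighted covariate means match, with a ridge penalty."""
    n_t, n_c = len(X_t), len(X_c)

    def objective(g):
        gt, gc = g[:n_t], g[n_t:]
        imbalance = X_t.T @ gt - X_c.T @ gc  # gap in weighted means
        return imbalance @ imbalance + lam * (g @ g)

    # Each group's weights are nonnegative and sum to one.
    cons = [{"type": "eq", "fun": lambda g: g[:n_t].sum() - 1.0},
            {"type": "eq", "fun": lambda g: g[n_t:].sum() - 1.0}]
    g0 = np.concatenate([np.full(n_t, 1 / n_t), np.full(n_c, 1 / n_c)])
    res = minimize(objective, g0, bounds=[(0, None)] * (n_t + n_c),
                   constraints=cons, method="SLSQP",
                   options={"maxiter": 500})
    return res.x[:n_t], res.x[n_t:]
```

Units in regions with little overlap receive weights near zero, which is what makes this a weighted analog of trimming; the ridge penalty keeps the weights from concentrating on a handful of units.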

Variance Estimation and Uncertainty Quantification
We now turn to constructing asymptotically valid confidence intervals for µ_0 by estimating the variance of μ̂_0(γ) − µ_0 and relying on asymptotic normality. When estimating the variance, it is important to account for dependence within clusters. One method for variance estimation in this context is the cluster-robust sandwich estimator (Huber, 1967; White, 1980), which we adapt here. Focusing on the CUD first, we estimate the conditional expectation function m̂_wx(0, w, x), then compute unit-level residuals ε̂_i. This leads to a plug-in estimator for the variance. Assumption 1 in the Supplementary Materials lists regularity conditions for asymptotic inference, based on conditions from Hansen and Lee (2019). Importantly, these regularity conditions allow the cluster sizes to grow, but limit how large any single cluster can be relative to the overall sample.
Here we highlight the dependence of the variances on the sample size by indexing them by n. To state the asymptotic normality result below, define, for each cluster j, the covariance between the true residuals and the estimated model predictions in that cluster.
Theorem 1. Under the regularity conditions in Assumption 1, with constant lower and upper bounds L and U in Equation (6), the variance estimator is consistent in probability and the studentized estimator converges in distribution to a standard normal. Theorem 1 implies that we can construct approximate 1 − α confidence intervals as μ̂_0(γ) ± z_{1−α/2} √V̂_unit, where z_{1−α/2} is the 1 − α/2 quantile of the standard normal distribution.
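The resulting interval is a standard Wald-type construction. A minimal sketch (the function name is ours):

```python
import numpy as np
from scipy.stats import norm

def normal_ci(mu_hat, v_hat, alpha=0.05):
    # Wald interval implied by Theorem 1: mu_hat ± z_{1-alpha/2} * sqrt(V_hat)
    z = norm.ppf(1 - alpha / 2)
    half = z * np.sqrt(v_hat)
    return mu_hat - half, mu_hat + half
```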
The key assumption is that the weights γ can achieve good balance, in the sense that the worst-case imbalance M̂_wx(γ) converges to zero faster than 1/√V_unit. This rate depends on the correlation within clusters. If units are uncorrelated, it will scale with the total number of units; if units are perfectly correlated, it will scale with the number of clusters; see Hansen and Lee (2019) for further discussion of rates of convergence with clustered data. Whether the weights can actually achieve this level of balance depends on the model class. Hirshberg et al. (2019) show that for Reproducing Kernel Hilbert Spaces with i.i.d. data, the imbalance will be small enough, while Hirshberg and Wager (2021) show that the bias-corrected estimator will have small enough bias in more general settings.
See Ben-Michael et al. (2021) for further discussion. Finally, Theorem 1 assumes that the estimated model m̂ is consistent and that the covariance between the true residuals and the estimated model predictions converges to zero, ensuring that the variance estimator is consistent. The latter can be guaranteed by using sample splitting or cross-fitting, or by using models with restricted complexity.
Below, we estimate m̂ via regularized weighted least squares on the control units, using the weights from Equation (10) above. If we only include an intercept and exclude all of the covariates in the regression, then these variance estimates are equivalent to the cluster-robust sandwich estimates of a weighted mean. However, this approach may be conservative, as it disregards the variance reduction due to balancing the covariates.
Finally, by including a separate regression for the treated units, we can also compute a plug-in estimate of the variance for the maximal overlap estimator τ̂(γ) above. In addition, to construct a variance estimate for the COD we can follow the same procedure: (i) estimate the conditional expectation function m̂_w(0, w); (ii) estimate the residuals ε̂_i ≡ Y_i − m̂_w(0, W_{J_i}); and (iii) create a plug-in estimate for the variance as before.
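The core of such a plug-in cluster-robust variance calculation is to square the within-cluster sums of weight-times-residual and add them across clusters. A sketch in our own notation (the function name and any normalization are ours, not the paper's):

```python
import numpy as np

def plugin_cluster_variance(gamma, resid, cluster_ids):
    """Plug-in cluster-robust variance for a weighted estimator:
    sum over clusters of the squared within-cluster total of
    weight * residual (illustrative sketch)."""
    v = 0.0
    for c in np.unique(cluster_ids):
        mask = cluster_ids == c
        v += (gamma[mask] @ resid[mask]) ** 2
    return v
```

When every unit is its own cluster, this reduces to the usual heteroskedasticity-robust sum of squared weighted residuals; positively correlated residuals within a cluster inflate the variance, as they should.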

Simulation Study
Next, we conduct a simulation study to understand the performance of balancing weights, using a simulation design from Keele et al. (2021) developed to evaluate multilevel matching.
First, we describe the data generating process (DGP), which we alter slightly to manipulate the level of overlap between the treated and control distributions. The DGP partially depends on empirical data from a summer school reading intervention that follows the COS template. In the data, there are 18 treated schools with 1,367 students, and 26 control schools with 2,060 students. There are 5 student-level variables and 9 school-level variables.
The student-level variables are reading and math test scores and indicator variables for race, ethnicity, and sex. The school-level covariates include the percentage of students who receive free/reduced price lunch, who are English language learners, and who are proficient in math and reading based on state standardized tests. The school-level covariates also include the share of teachers who are novice (i.e., in their first year), the rate of year-to-year staff turnover, and student average daily attendance.
For this DGP, we first fit a school-level propensity score model where we regressed the observed treatment indicator on the following set of school-level variables: the percentage of students receiving free/reduced price lunch, the percentage of students who are English language learners, the percentage of teachers who are novice, and student average daily attendance. We denote this estimated propensity score as ê(w). We define the latent probability of treatment as a function of ê(w) and a constant c that controls the level of overlap between treated and control clusters. Observed treatment status is generated via the following model: Z_j = 1(Z*_j > 0.25).
Next, we fit an outcome model for the observed data. Here, we regressed reading scores on the student-level covariates, with the basis expanded to include interactions between race and test scores. After model fitting, we save β̂_0, the intercept from this regression. We use τ to denote the true treatment effect in the simulation, and set it to 0.3 of a standard deviation of the raw outcome measure. We denote student-level reading scores with R_ij, student-level math scores with M_ij, and the percentage of students proficient in math and reading in each school with P_j. Next, we generate potential outcomes under control as a function of these variables and a school-level error term v_1, where v_1 is a draw from a mean-zero normal distribution with a standard deviation of 12. We selected these parameter values so that misspecification would produce a bias of approximately 0.3 standard deviations on a standardized scale. A bias of this magnitude is large enough to completely obscure the true treatment effect. Next, we generated potential outcomes under treatment as y_1 = y_0 + τ, and we generated simulated outcomes as Ỹ_ij = Z_j y_1 + (1 − Z_j) y_0. Note that in this DGP, selection into treatment at the school level is only a function of school-level covariates, but the potential outcomes under control are a function of the student-level test scores and, to a lesser extent, the overall quality of the school as measured by the percentage of proficient students. This DGP also induces correlation within clusters: the average ICC across simulations was 0.29, with a standard deviation of 0.04. We judge this to be a common data structure in educational COS designs. Using this DGP, we conduct two different simulation studies.
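The within-cluster correlation reported above can be estimated with a standard one-way ANOVA estimator of the ICC. The paper fits a random effects model instead, so the sketch below (names ours) is only illustrative of the quantity being reported:

```python
import numpy as np

def estimate_icc(y, cluster_ids):
    """One-way ANOVA estimator of the intra-class correlation:
    between-cluster variance relative to total variance (sketch)."""
    clusters, counts = np.unique(cluster_ids, return_counts=True)
    k, n = len(clusters), len(y)
    means = np.array([y[cluster_ids == c].mean() for c in clusters])
    ms_between = (counts * (means - y.mean()) ** 2).sum() / (k - 1)
    ms_within = sum(((y[cluster_ids == c] - m) ** 2).sum()
                    for c, m in zip(clusters, means)) / (n - k)
    # Adjusted average cluster size for unbalanced designs.
    n0 = (n - (counts ** 2).sum() / n) / (k - 1)
    return (ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within)
```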

Simulation study 1
In the first study, we conduct a comparative analysis of adjustment methods for COS designs.
First, we produce a naive estimate of the treatment effect as the difference in means without any covariate adjustment, and an estimate using multilevel matching as implemented in Pimentel et al. (2018). Next, we implement our balancing weights in two ways: first, as balancing weights by solving Equation (6), and second, as subset balancing weights by solving Equation (14). We set the hyperparameters via the heuristic in Section 3.3, using the estimated coefficients and variance components from a multilevel regression model. In this simulation, we focus on how the performance changes as we vary the level of overlap by setting c to values of 1, 2.5, 7.5, and 10. When c = 1 overlap is poor, when c = 10 overlap is excellent, and intermediate values interpolate between the two. For each scenario, we repeated the simulation 1,000 times, and we report the bias and root mean squared error (RMSE). In our results, we standardize the bias, dividing it by the standard deviation of the control group's outcomes from the original data.
In the first panel of Figure 1, we plot the bias as a function of overlap for all four estimation methods. We observe that matching and both types of weights reduce bias compared to an unadjusted estimate. However, when overlap is poor, both sets of weights remove substantially more bias than multilevel matching. The performance of matching could be improved by trimming treated observations; with the balancing weights, however, we can reduce the bias relative to matching without having to change the estimand. The subset weights do alter the estimand, but allow for an unbiased treatment effect estimate. We next compare the methods in terms of RMSE. Both matching and weighting discard data: matching discards some of the control schools and units, while weighting gives some of the control schools and students zero weight. One open question is whether one method does so in a more efficient fashion than the other. In the second panel of Figure 1, we plot the RMSE as a function of overlap for all the estimation methods. Here, the difference in performance between matching and balancing weights is clear. While the subset weights have the lowest RMSE, the difference between the two weighting methods is minor. However, both forms of balancing weights outperform matching across all overlap scenarios by nearly a factor of 10. Balancing weights thus manage to reduce bias while also retaining a much larger effective sample size, which improves efficiency. Overall, we find that balancing weights are a clear improvement over matching: they have lower bias than matching when overlap is poor and are also considerably more efficient.

Simulation study 2
In the second simulation, we focus on the performance of the proposed plug-in variance estimator relative to the standard weighted cluster-robust sandwich estimator, for the balancing weights solving Equation (6). In this simulation, we fixed the overlap parameter at 10 and varied the number of clusters. We control the number of clusters by resampling clusters with replacement from the original data, and then generate outcomes and treatments following the DGP above. We used samples of 50, 100, 150, 200, and 250 clusters. For each scenario, we repeated the simulation 1,000 times, and we report the average standard error estimate and the length of the 95% confidence interval.
Figure 2 shows the results for the good overlap scenario. In the first panel, we compare the magnitude of the two variance estimates. As we expect, our proposed plug-in method produces smaller standard errors by removing variance due to the covariates. The difference between the two methods is largest when the number of clusters is small. These smaller standard errors also produce narrower confidence intervals. We also measured the coverage of the nominal 95% confidence intervals for both methods and found that both led to over-coverage of the true effect. This result is consistent with the general finding that the sandwich variance estimator for weighting methods is conservative. The pattern from the good overlap scenario is very similar in the other two overlap settings, so we report those results in the supplementary materials.

Applications
Next, we present results from two different empirical applications.The first, from education, compares the performance of Catholic and public schools.The overlap between Catholic and public schools is known to be poor (Keele et al., 2022), which allows us to study the performance of the subset weights in a context for which they were designed.The second application, from health services research, compares different residency programs for surgeons.
Here, overlap is much better, but the sample sizes may prove computationally challenging for multilevel matching.

Catholic Schools
Our analysis is a replication of Keele et al. (2022), which used multilevel matching and highlighted the limited amount of overlap between Catholic and public schools. Here, we compare results based on our proposed weighting methods to those based on multilevel matching. The data are a public release of the 1982 High School and Beyond survey and include records for 7,185 high school students from 160 schools. Of these schools, 70 are Catholic schools and are thus considered treated in this application, while the remainder are public high schools and thus serve as a reservoir of controls. The average number of students sampled per Catholic school is approximately 50, with a range of 20 to 67, while the average number of students sampled per public school is 41, with a range of 14 to 61. The data contain covariates on both students and schools. Student-level covariates are: an indicator for whether or not the student is female; an indicator for whether a student belongs to a particular racial/ethnic group; and a scale for socioeconomic status (SES). Three of the school-level measures are school-level averages of these student-level measures. Three additional school-level covariates are: total enrollment; the percentage of students on an academic track; and a measure of disciplinary climate. The disciplinary climate variable is a composite measure created from a factor score on measures of the number of attacks on teachers, fights, and other disciplinary incidents. This variable ranges from -1.7 to 2.7.
In our analysis, we computed COS balancing weights for the ATT and subset weights for the overlapping set.Using a random effects model, we estimated the two components of the hyperparameter: the estimated ICC is 0.036, and the estimated signal-to-noise ratio is 1.2.
We also include two of the multilevel matches implemented in Keele et al. (2022). The first of these matches retains all the Catholic schools, and the second trimmed 10 Catholic schools to improve the balance and increase overlap. Table 1 contains a comparison of how well each method balanced the baseline covariates, as measured by the standardized difference: the difference in weighted/matched Catholic and public school means divided by the pooled standard deviation before adjustment. The lack of overlap is apparent in the size of the standardized differences in the school-level covariates; several of the standardized differences are larger than 0.50, and three exceed 1. We observe that balancing weights outperform matching in terms of balance. While the standardized differences for the COS balancing weights are still fairly large for two covariates, they are less than half of those obtained via multilevel matching. Moreover, if we are willing to alter the estimand, the subset weights are able to nearly exactly balance the Catholic and public school distributions. Note: cell entries in Table 1 are standardized differences, the difference in means divided by the pooled standard deviation.
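The standardized differences reported in Table 1 follow a standard balance-table recipe: a (weighted) mean difference divided by the pooled pre-adjustment standard deviation. A minimal sketch, using one common convention for the pooled standard deviation (names are ours):

```python
import numpy as np

def standardized_difference(x_t, x_c, w_t=None, w_c=None):
    """Weighted mean difference divided by the pooled standard deviation
    of the unadjusted groups (illustrative sketch)."""
    w_t = np.ones(len(x_t)) if w_t is None else w_t
    w_c = np.ones(len(x_c)) if w_c is None else w_c
    diff = np.average(x_t, weights=w_t) - np.average(x_c, weights=w_c)
    # Pooled SD computed on the raw (pre-weighting) samples.
    pooled_sd = np.sqrt((x_t.var(ddof=1) + x_c.var(ddof=1)) / 2)
    return diff / pooled_sd
```

Because the denominator is fixed at its pre-adjustment value, the statistic is comparable before and after weighting or matching.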
To better understand the differences between the COS weights and the COS subset weights, we provide descriptive statistics on the largest weights. For the balancing weights, the vast majority of the weights are between 0 and 1, but 63 weights are larger than 10, and the largest weight has a value of 92. For the COS subset weights, the largest weight is 14. This should reduce the likelihood of unstable behavior in the treatment effect estimates due to large weights. Next, we measure the effective sample sizes to understand the loss of information due to weighting. In our data, before adjustment there are 5,273 students. The effective sample size for the balancing weights is 1,710, and 1,036 for the subset weights. For comparison, the effective sample size for the match with all treated schools is 570, and 392 for the match that trimmed treated schools.
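A common way to compute such effective sample sizes is Kish's formula, the squared sum of the weights divided by the sum of squared weights; the paper may use a variant, so treat this sketch as illustrative:

```python
import numpy as np

def effective_sample_size(w):
    # Kish effective sample size: (sum w)^2 / sum(w^2).
    # Equals n for uniform weights; shrinks as weights concentrate.
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()
```

For uniform weights this returns the raw sample size; placing all the weight on a single unit returns 1, which is why a few very large weights (like the 92 above) sharply reduce the effective sample size.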
Next, we review the point estimates for the Catholic school effect. Note that the outcome is a standardized test score, so estimates are measured in standard deviations. First, the unadjusted effect is 0.43 with a 95% CI of (0.31, 0.55). Next, the Catholic school effect estimated by balancing weights is -0.08. The 95% confidence interval using the sandwich variance estimator is (-0.76, 0.59), and the confidence interval based on our proposed plug-in estimator is (-0.47, 0.31). The Catholic school estimate based on subset weights is 0.07.
The 95% confidence interval using the sandwich variance estimator is (-0.29, 0.44), and the confidence interval based on our proposed plug-in estimator is (-0.08, 0.24). Note that these estimates are not directly comparable, as they target different estimands, but in both cases we observe small point estimates with confidence intervals that include zero.
Notably, the plug-in variance estimator produces shorter confidence intervals, consistent with the simulation results. For comparison, we next report the estimates based on multilevel matching. The estimate based on multilevel matching is 0.26 (95% CI: 0.12, 0.40), and the estimate based on matching with trimming is 0.13 (95% CI: -0.06, 0.31). The estimates based on matching are closer to those based on weighting if we include additional bias reduction via outcome modeling. This suggests that weighting was much more successful than matching at removing bias without reference to outcomes.
Finally, while the subset balancing weights were able to balance the data better even with poor overlap, we had to change the estimand. To aid in the interpretation of this estimand, we plot the covariate means for Catholic and public schools before and after subset weighting in Figure 3. That is, we compare the raw Catholic school means to the weighted Catholic school means, and likewise for public schools. We observe that Catholic and public schools do not differ much in terms of gender and racial mix. While Catholic and public schools differ somewhat in terms of SES, the key difference is in terms of disciplinary climate. For this covariate, we observe the largest difference between the weighted and unweighted estimates for both Catholic and public schools. In the overlap population of schools, Catholic schools have a much stricter disciplinary climate and public schools have a more permissive disciplinary climate.

Surgical Training
One important question in health services research is whether certain aspects of surgical training have an effect on patient outcomes (Asch et al., 2009; Bansal et al., 2016; Zaheer et al., 2017; Sullivan et al., 2012). Sellers et al. (2018) studied whether surgeons from university-based residency (UBR) programs produce better patient outcomes than other surgeons. We set the hyperparameter values based on estimates from a random effects model. In these data, the estimated ICC is 0.016, and the noise-to-signal ratio is 0.221. Figure 4 contains a balance plot for the subset of covariates with the largest imbalances. For each of these covariates, weighting reduces imbalance relative to the unadjusted data, giving close to exact balance.
In the full data, there are 279,611 patients. After weighting, the effective sample size is 262,672; here, the loss of sample size is relatively small. We also attempted to implement a multilevel match on a desktop computer with 32 GB of RAM. However, we received an error that R was unable to allocate enough memory to complete the match. For the balancing weights, we are able to estimate the weights in less than a minute. As such, balancing weights have clear computational advantages for COS applications with larger sample sizes.
Next, we estimate treatment effects. First, we estimate the unadjusted treatment effect via a regression model with clustering at the surgeon level. We find that for UBR surgeons, the estimated percentage of cases with a complication is 1.35 percentage points lower, and the 95% confidence interval does not contain zero (95% CI: -2.14, -0.55). Once we account for confounding via weighting, the estimated treatment effect is -0.85 percentage points. We estimated 95% confidence intervals using both the cluster-robust sandwich variance estimator (95% CI: -1.62, -0.09) and our proposed plug-in variance estimator (95% CI: -1.53, -0.18).
Because the balance is excellent, further bias reduction via an outcome model does not meaningfully change the estimate. In sum, we find that UBR surgeons do indeed appear to cause fewer complications, even when accounting for observed confounding. However, the magnitude of the treatment effect is smaller once we control for observed covariates.

Conclusion
We introduced an approximate balancing weights estimator for designs where treatments are administered to entire clusters, such as schools or hospitals. To do so, we considered two potential estimands, one adjusting for both unit-level and cluster-level covariates and another adjusting for cluster-level covariates alone, that have different identification assumptions associated with them. For both of these estimands, we find weights that minimize an upper bound on the mean square error of the resulting weighting estimator. When adjusting for cluster-level covariates alone, the weights are constant within clusters; when including unit-level covariates, the weights vary across units within clusters. This affects the overall variance, as units' outcomes can be correlated within clusters. We showed that even when it is sufficient to adjust only for cluster-level covariates in order to estimate the ATT, there can be efficiency gains to including unit-level covariates as well, depending on the predictive strength of those covariates and the level of correlation between units' outcomes within clusters. We also considered two adaptations to deal with cases where it is impossible to find weights that achieve good covariate balance: (i) bias correction via an outcome model and (ii) changing the estimand and finding an overlapping weighted subset of the data. We then showed how to construct confidence intervals that are asymptotically valid under certain conditions. In a simulation study, we demonstrated that our proposed weighting estimator outperformed multilevel matching, especially in retaining more of the sample, which resulted in higher efficiency and lower RMSE. In two empirical applications, we found that balancing weights also had several practical advantages over multilevel matching.
There are several avenues for future work.First, while we propose a heuristic for choosing the hyperparameters from the observed data, an important question is how to choose these hyperparameters in a rigorous, data-driven way that keeps weights independent of outcomes.
Second, we can consider resampling approaches to uncertainty quantification, such as the block weighted bootstrap proposed by Cui et al. (2022). Finally, many COS designs have a longitudinal structure, where data are available at multiple granularities over time. For example, many policy changes happen at the state level, but county- and municipality-level data series exist for the outcome of interest. Exploring these cases and extending our analysis to such settings will be important areas for future research.

Figure 1 :
Figure 1: Bias and RMSE for matching and two different weighting estimators by overlap condition.

Figure 2 :
Figure 2: Comparative performance for two different variance estimators.

Figure 6 :
Figure 6: Comparative performance for two different variance estimators, poor overlap scenario.

Table 1 :
Balance Table Comparing Catholic and Public Schools on Baseline Covariate Distributions.