Leveraging baseline covariates to analyze small cluster-randomized trials with a rare binary outcome

Cluster-randomized trials (CRTs) involve randomizing entire groups of participants -- called clusters -- to treatment arms, but often comprise a limited or fixed number of available clusters. While covariate adjustment can account for chance imbalances between treatment arms and increase statistical efficiency in individually-randomized trials, analytical methods for individual-level covariate adjustment in small CRTs have received little attention to date. In this paper, we systematically investigate, through extensive simulations, the operating characteristics of propensity score weighting and multivariable regression as two individual-level covariate adjustment strategies for estimating the participant-average causal effect in small CRTs with a rare binary outcome, and we identify scenarios where each adjustment strategy has a relative efficiency advantage over the other in order to make practical recommendations. We also examine the finite-sample performance of the bias-corrected sandwich variance estimators associated with propensity score weighting and multivariable regression for quantifying the uncertainty in estimating the participant-average treatment effect. To illustrate the methods for individual-level covariate adjustment, we reanalyze a recent CRT testing a sedation protocol in 31 pediatric intensive care units.


Introduction
In cluster-randomized trials (CRTs), entire clusters of subjects, rather than the individuals themselves, are randomly allocated to treatment and control arms (Turner et al., 2017a). Examples of clusters include hospitals, schools, and residential care homes (Murray, 1998). CRTs are essential when an intervention is delivered at the cluster level, and are often also used when there are concerns over contamination or because of their logistical appeal. Due to cluster randomization and the fact that individuals within each cluster share the same physical or social factors, the intracluster correlation coefficient (ICC) that describes the similarity between outcomes within the same cluster plays a central role in the design and analysis of CRTs. This ICC must be accounted for in analysis to avoid inflated type I error rates (Turner et al., 2017b).
Methods to analyze CRTs include cluster-level analyses that use summary measures for each cluster and individual-level analyses that employ generalized linear mixed models or marginal models estimated with generalized estimating equations (GEEs) (Liang and Zeger, 1986). Marginal models are sometimes preferred because the interpretation of the intervention effect parameter does not depend on the assumed correlation structure (Preisser et al., 2003). In addition, marginal models coupled with the sandwich variance estimator have been shown to be robust to misspecification of the working covariance structure and to provide asymptotically valid confidence intervals as long as the marginal mean structure is correctly specified (Liang and Zeger, 1986). However, when the number of clusters is small (typically not exceeding 30), the sandwich variance estimator can exhibit negative bias and may lead to under-coverage of the associated confidence interval estimator. Indeed, Ivers et al. (2011) reviewed 300 CRTs published between 2000 and 2008 and found the median number of clusters to be 21. Given that small CRTs are common, there remains considerable interest in improving finite-sample inference based on the sandwich variance estimator for GEE analyses of such trials (Leyrat et al., 2018).
Previous simulation studies have compared the performances of several bias-corrected sandwich variance estimators under certain settings. Specifically, Lu et al. (2007) compared the bias-corrected sandwich variance estimators due to Kauermann and Carroll (2001) (KC) and Mancl and DeRouen (2001) (MD). They found that the normal confidence interval estimator with the MD variance formula often yields coverage of 95% confidence intervals near the nominal level. Li and Redden (2015) recommended the corrections proposed by KC and Fay and Graubard (2001) (FG), depending on the degree of cluster size variation, when analyzing binary outcomes. Ford and Westgate (2017) indicated that the KC bias-corrected sandwich variance estimator may still give downwardly biased estimates of the standard error and that the FG bias-corrected sandwich variance estimator performs similarly to that of KC. Further, they noted that the MD bias-corrected sandwich variance estimator tends to over-correct, resulting in conservative inference, and thus recommended the average of the MD and KC standard error estimators as the top performer in CRTs with continuous and binary outcomes. Similar observations and recommendations were discussed in Li and Tong (2021a) for CRTs with count outcomes subject to truncation. It is evident that the performance of these bias-corrected sandwich variance estimators can depend on the settings of interest.
One limitation of existing comparative studies of analytical methods for CRTs is that they have primarily focused on unadjusted analyses without covariates or adjusted analyses with cluster-level covariates. For example, Westgate (2013) and Ford and Westgate (2017) have investigated the small-sample corrections in the presence of cluster-level covariates. They pointed out that the performance of the MD and KC bias-corrected sandwich variance estimators may deteriorate with cluster-level covariates, while the average of the MD and KC standard error estimators was relatively more robust. However, the performance of small-sample corrections remains to be explored in settings where adjustment for individual-level covariates is being considered. In CRTs, individual-level covariates are frequently collected at baseline, and the need for covariate adjustment can fall into one of the following categories. First, in CRTs where cluster-level aggregates of individual-level covariates are utilized during restricted randomization, adjusting for such individual-level covariates in the analysis model can adequately control the type I error rate (Li et al., 2017; Watson et al., 2021; Zhou et al., 2022). Second, adjusting for individual-level covariates can be based on precision considerations, analogous to the justifications provided in individually-randomized trials (Williamson et al., 2014; Zeng et al., 2021). Third, individual-level covariate adjustment may reduce selection or recruitment bias compared to the unadjusted analysis in CRTs (Leyrat et al., 2013, 2014; Li et al., 2021).
In this article, we design a neutral simulation study to compare methods for analyzing small CRTs in the presence of individual-level covariate adjustment. We focus on a rare binary outcome (event rate ≤ 10%, as in our motivating study in Section 5). Besides considering individual-level covariate adjustment, the unique innovations of our simulation study are three-fold. First, we recognize the potential ambiguity in the treatment effect parameter under individual-level covariate adjustment for binary outcomes (due to non-collapsibility) and so employ the potential outcomes framework to define a participant-average treatment effect on the odds ratio scale as a common target estimand. Fixing a common target estimand has been recommended for individually-randomized trials (Benkeser et al., 2021) and can facilitate an objective comparison across different methods whose associated regression coefficients may have different interpretations. This work thus represents a first effort to operationalize this important consideration for CRTs. Second, we focus on a rare outcome where adjusting for multiple individual-level covariates may lead to separation issues and non-convergence when fitting multivariable regression models. Complete separation occurs when the outcome can be perfectly predicted from certain combinations of covariate values; that is, there is no observed variability in the outcome within those combinations. This convergence problem, however, may be ameliorated by propensity score weighting (Williamson et al., 2014; Zeng et al., 2021), which has been demonstrated to be a valid covariate adjustment approach in individually-randomized trials, and we adapt this approach to estimate causal effects in CRTs.
Finally, under a GEE modeling framework, we contribute new evidence on whether the bias-corrected sandwich variance estimators can provide valid inference for the participant-average treatment effect under individual-level covariate adjustment when the number of clusters is small.
2 Methods for Individual-Level Covariate Adjustment in CRTs with a Rare Binary Outcome

2.1 Notation and estimand
Suppose we have a two-arm parallel CRT with N total clusters and 1:1 randomization, and let Z denote the treatment indicator, with Z = 1 corresponding to inclusion in the treatment group and Z = 0 corresponding to the control group. Let Y_ij denote the binary outcome of the jth subject in cluster i with P-dimensional covariate vector X_ij, i = 1, . . . , N and j = 1, . . . , m_i. We assume the outcomes are correlated within the same cluster but independent across clusters. Specifically, for the ith cluster, Y_i = [Y_i1, . . . , Y_im_i]' is the outcome vector and X_i = [X_i1, . . . , X_im_i]' is the covariate matrix. In this article, we focus on estimating an average treatment effect on the ratio scale. Briefly, as the outcome of interest is binary, we describe a target estimand as the log causal odds ratio of the treatment to the control. We proceed under the potential outcomes framework and assume {Y_ij(1), Y_ij(0)} ∈ {0, 1}^⊗2 as the pair of potential (counterfactual) binary outcomes under the treatment and control conditions; for example, Y_ij(1) is the binary potential outcome of the jth subject in cluster i had cluster i been randomized to treatment, and Y_ij(0) is the binary potential outcome of the jth subject in cluster i had cluster i been randomized to control. We consider a sampling-based framework where the observed clusters are assumed to be a random sample from a population of clusters with randomly varying cluster sizes. This framework allows us to define the expectation over the distribution of clusters and permits us to formalize the target estimand -- the participant-average treatment effect on the odds ratio scale -- as

OR = {μ_1/(1 − μ_1)}/{μ_0/(1 − μ_0)}, with μ_z = E{Σ_{j=1}^{m_i} Y_ij(z)}/E(m_i), z ∈ {0, 1},   (1)

and we further define ∆ = log(OR) as our target parameter in the ensuing simulations. While our focus is on estimand (1), we refer to Kahan et al. (2022), Wang et al. (2022a), and Su and Ding (2021) for alternative estimands, such as the cluster-average treatment effect or other weighted cluster-average treatment effects, that may also be of interest in CRTs.
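To make the participant-average estimand concrete, the following sketch (with entirely hypothetical incidences and cluster-size distribution, not trial data) computes ∆ = log(OR) from fully observed potential outcomes by pooling participants across clusters, so that each cluster contributes in proportion to its size:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical simulated population: 200 clusters with Poisson(50) sizes.
cluster_sizes = rng.poisson(50, size=200) + 1

# Pairs of potential outcomes for every participant (toy incidences 5% vs 10%).
y1 = [rng.binomial(1, 0.05, size=m) for m in cluster_sizes]
y0 = [rng.binomial(1, 0.10, size=m) for m in cluster_sizes]

# Participant-average incidences: pooling participants across clusters
# weights each cluster by its size, matching the estimand in (1).
mu1 = np.concatenate(y1).mean()
mu0 = np.concatenate(y0).mean()

# Participant-average treatment effect on the log odds ratio scale.
delta = np.log(mu1 / (1 - mu1)) - np.log(mu0 / (1 - mu0))
```

With roughly 10,000 participants, delta lands near log(0.05/0.95) − log(0.10/0.90) ≈ −0.75.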

2.2 Overview of generalized estimating equations (GEE) analyses of CRTs
To estimate the parameter defined in (1), we primarily consider the GEE approach, under which the relationship between the marginal mean E[Y_ij|X_ij] = μ_ij and the covariates X_ij may be represented by a generalized linear model, g(μ_ij) = X_ij'β. With a binary outcome, logistic regression is often used, where g is taken as the canonical logit link function, which we use in this article. Let V_i denote the working covariance structure for Y_i. The parameter estimator β̂ in the marginal model is consistent and asymptotically normal, and, even if the working correlation structure is misspecified, its variance-covariance can be consistently estimated by

V = A^{-1} B A^{-1}, A = Σ_{i=1}^{N} D_i' V_i^{-1} D_i, B = Σ_{i=1}^{N} D_i' V_i^{-1} (Y_i − μ_i)(Y_i − μ_i)' V_i^{-1} D_i,

where D_i = ∂μ_i/∂β'. The variance V is often referred to as the robust sandwich estimator or the empirical sandwich estimator (Liang and Zeger, 1986). Frequently, the primary analysis of CRTs proceeds with the marginal model without any baseline covariates -- the so-called unadjusted analysis. In this approach, X_ij = Z_i in the marginal mean model, and the mean model can be explicitly written as

logit{P(Y_ij = 1|Z_i)} = β_0 + β_Z Z_i,   (2)

and β = (β_0, β_Z)'. To estimate β, one may choose either the independence or the exchangeable working correlation structure. When the true outcome data generating model is indeed the unadjusted model without any further covariates (2), β_Z directly corresponds to our target estimand ∆, and the choice of the exchangeable working correlation structure corresponds to modeling the ICC and can often provide a more efficient average causal effect estimator when the cluster sizes are variable (Li and Tong, 2021a,b). In this case, even when the exchangeable working structure is not the true correlation structure, GEE models are robust to such misspecification.
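Under the independence working correlation with the canonical logit link, D_i'V_i^{-1} reduces to X_i', so the point estimate and the robust sandwich variance can be sketched in a few lines. The following is a toy illustration with deterministic data (hypothetical values, not the paper's implementation):

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson for pooled logistic regression (= GEE under independence)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1 / (1 + np.exp(-X @ beta))
        w = mu * (1 - mu)
        beta += np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - mu))
    return beta

def robust_sandwich(X, y, beta, cluster):
    """V = A^{-1} B A^{-1}: bread A = sum_i D_i'V_i^{-1}D_i, meat B built from
    cluster-level score residuals D_i'V_i^{-1}(Y_i - mu_i) = X_i'(Y_i - mu_i)."""
    mu = 1 / (1 + np.exp(-X @ beta))
    A = X.T @ ((mu * (1 - mu))[:, None] * X)
    B = np.zeros_like(A)
    for g in np.unique(cluster):
        idx = cluster == g
        s = X[idx].T @ (y[idx] - mu[idx])   # cluster score residual
        B += np.outer(s, s)
    Ainv = np.linalg.inv(A)
    return Ainv @ B @ Ainv

# Deterministic toy CRT: 10 clusters of 20; first five clusters treated.
cluster = np.repeat(np.arange(10), 20)
z = (cluster < 5).astype(float)
X = np.column_stack([np.ones(200), z])
y = np.zeros(200)
y[::13] = 1                     # 8 events per arm -> no separation
beta = fit_logistic(X, y)
V = robust_sandwich(X, y, beta, cluster)
```

Because the two arms have identical 8% event rates in this toy data, the fitted treatment coefficient is zero, and the lower-right element of V is the robust variance for β̂_Z.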
However, when the true data generating process involves additional individual-level covariates or when the cluster size is predictive of the treatment effect, we show in Supplementary File Section 1 that β̂_Z is a consistent estimator of ∆ under the independence working correlation structure, but often not otherwise. This argument extends the ones provided in Wang et al. (2022a) and Kahan et al. (2022) to ratio effect measures. Because of this rationale, we primarily focus on the case with an independence working correlation model. Beyond the unadjusted analysis, the GEE approach can be extended to leverage baseline individual-level covariates to potentially increase the efficiency for estimating ∆. In Section 2.3, we consider using propensity score weighting for covariate adjustment. In addition, model (2) can be expanded to include additional baseline covariates, in which case a population standardization procedure is required to estimate ∆, because the regression coefficient does not directly correspond to ∆ due to non-collapsibility. We describe this multivariable regression approach in Section 2.4. As CRTs often include a limited number of clusters (typically not exceeding 30), a set of bias-corrected sandwich variance estimators to address the finite-sample bias in inference is also described below.
2.3 Propensity score weighting for individual-level covariate adjustment

Rosenbaum and Rubin (1983) developed the propensity score, which is defined as the probability of treatment conditional on observed covariates; that is, e(X) = P(Z = 1|X). Approaches based on the propensity score, such as matching, weighting, and stratification, are commonly employed in the design and analysis of observational studies to control for confounding, since it has been shown that, conditional on the propensity score, treatment assignment is independent of the measured covariates. In randomized trials, the true propensity score is known by design and there is no need to model the propensity score for unbiased estimation of the treatment effect. However, propensity score weighting has been shown to provide efficiency gains by addressing chance imbalance of baseline covariates. Williamson et al. (2014) have shown in individually-randomized trials that (i) inversely weighting by the estimated propensity score with prognostic covariates can reduce the variance of the unadjusted treatment effect estimator without changing the population-level causal estimand, and (ii) with a rare binary outcome, the propensity score weighting approach often circumvents the non-convergence issues to which multivariable regression is vulnerable. In addition, Zeng et al. (2021) demonstrated that in individually-randomized trials, overlap weighting almost always leads to smaller variance than inverse propensity score weighting. Compared to multivariable regression adjustment for covariates, propensity score weighting has the additional benefit of preserving the population-level causal estimand, but it has not yet been explored in CRTs as a potentially effective tool for individual-level covariate adjustment. Here, we explore the use of these two propensity score weighting approaches as individual-level covariate adjustment strategies for CRTs, without compromising the clarity of the target estimand.
Inverse probability of treatment weighting (IPW) seeks to construct a sample in which the distributions of observed baseline variables are similar between treatment and control groups. The weights are defined as the reciprocal of the conditional probability of being assigned to the treatment group that the subject was observed to be in. In CRTs, for subject j in cluster i, the weight is given by

w_ij = Z_i/e(X_ij) + (1 − Z_i)/{1 − e(X_ij)}.

On the other hand, overlap weighting (OW) was proposed to overcome possible limitations of IPW when there is limited overlap in covariate distributions between treatment arms in observational studies (Li et al., 2018a) and to also improve upon IPW in individually-randomized trials. Specifically, the overlap weights are defined to be the probability of being in the opposite treatment group (the one the subject was not observed to be in) given baseline covariates:

w_ij = Z_i{1 − e(X_ij)} + (1 − Z_i)e(X_ij).

For individually-randomized trials, the true propensity score is usually a constant and does not depend on individual-level covariates; therefore, IPW and OW correspond to the same population estimand (Zeng et al., 2021), but previous simulations indicate that OW provides better finite-sample performance in that it is more efficient at smaller sample sizes. Under cluster randomization, the true propensity score is likewise determined by the study design and does not depend on individual-level covariates. Therefore, OW and IPW can both target the participant-average treatment effect (1) as a common estimand. The propensity score is often estimated using parametric logistic regression. Alternative models for propensity score estimation that have been considered include neural nets, decision trees, Bayesian additive regression trees (BART), and ensemble learners (Westreich et al., 2010; Zhu et al., 2021).
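The two weighting schemes above differ only in how the estimated propensity score enters the weight; a minimal sketch with hypothetical propensity score values:

```python
import numpy as np

def ipw_weights(z, e):
    """IPW: reciprocal of the probability of the arm actually assigned."""
    return z / e + (1 - z) / (1 - e)

def overlap_weights(z, e):
    """OW: probability of the opposite arm, 1 - e for treated, e for controls."""
    return z * (1 - e) + (1 - z) * e

# Toy estimated propensity scores for four subjects (hypothetical values).
z = np.array([1.0, 1.0, 0.0, 0.0])
e = np.array([0.5, 0.6, 0.5, 0.4])
w_ipw = ipw_weights(z, e)       # [2, 5/3, 2, 5/3]
w_ow = overlap_weights(z, e)    # [0.5, 0.4, 0.5, 0.4]
```

Note how OW downweights subjects whose covariates make their observed arm relatively likely, keeping all weights bounded in (0, 1), whereas IPW weights can become large when the estimated propensity score is near 0 or 1.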
In observational studies, these nonparametric machine learning approaches can provide greater flexibility and more accurate estimates when there are complex relationships among the variables; however, their role in the analysis of CRTs remains under-explored. In particular, BART is a Bayesian nonparametric sum-of-trees model that involves binary splits in the predictor space and a regularization prior to avoid overfitting (Chipman et al., 2010). This model has become popular for estimating treatment effects in observational studies due to its computational efficiency and flexibility in modeling complex relationships between variables (Hill, 2011). In settings where there are nonlinear relationships or interactions among the covariates in the propensity score or outcome models, there may be difficulty in specifying those patterns through a simple parametric model. BART models can handle the inclusion of several predictors, allow for nonlinear relationships and interactions between predictors, and have been demonstrated to be effective in past simulations for observational studies (Dorie et al., 2019). In the ensuing simulations, we explore whether using BART to estimate propensity scores will impact results in the CRT context.
Next, we describe the point estimator and the bias-corrected sandwich variance estimators under propensity score weighted GEE. The propensity score weighted GEE is based on the same marginal mean model as in (2), but includes the propensity score weights in the estimating equations for β. Specifically, once the propensity scores are estimated for each subject, the weight matrix W_i is formed for each cluster i with w_ij on the diagonal and 0 for the off-diagonal elements. The regression parameters β = (β_0, β_Z)' are estimated by solving the weighted GEE

Σ_{i=1}^{N} D_i' V_i^{-1} W_i (Y_i − μ_i) = 0,

where V_i is the working variance under the independence working correlation model.
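Because the mean model contains only an intercept and the treatment indicator, it can be verified that solving the weighted GEE under independence reduces to computing weighted event proportions in each arm. A small sketch with toy data:

```python
import numpy as np

def weighted_log_or(y, z, w):
    """Weighted GEE solution under independence for the treatment-only model:
    beta_Z-hat equals the log odds ratio of the weighted event proportions."""
    p1 = np.sum(w * z * y) / np.sum(w * z)
    p0 = np.sum(w * (1 - z) * y) / np.sum(w * (1 - z))
    return np.log(p1 / (1 - p1)) - np.log(p0 / (1 - p0))

# Toy data: treated arm has 2/4 events, control arm has 1/4 events.
y = np.array([1, 0, 0, 1, 1, 0, 0, 0])
z = np.array([1, 1, 1, 1, 0, 0, 0, 0])
w = np.ones(8)                   # all-ones weights recover the crude estimator
est = weighted_log_or(y, z, w)   # log{(0.5/0.5)/(0.25/0.75)} = log 3
```

Plugging in the IPW or OW weights in place of the all-ones vector yields the corresponding covariate-adjusted point estimates of ∆.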

Under the independence working correlation assumption, since the weighted GEE model only contains the treatment indicator as a covariate and the weights are only used to control for chance imbalance, the treatment coefficient estimator β̂_Z = log(ÔR) is a consistent estimator for the participant-average treatment effect ∆. For convenience, we ignore the variability due to the estimation of the propensity scores; the corresponding bias-corrected sandwich variance estimators incorporating these weights are developed in Table 1(a), where H_i = D_i(Σ_{j=1}^{N} D_j' V_j^{-1} W_j D_j)^{-1} D_i' V_i^{-1} W_i is the propensity score weighted leverage matrix for cluster i. Although we focus on the MD and KC bias-corrected variance estimators in our simulations, prior studies have indicated that the FG bias-corrected variance estimator has very similar performance to the KC bias-corrected variance estimator (Li et al., 2018b). In fact, without weighting, Scott et al. (2017) and Wang et al. (2022b) have shown that the KC bias-corrected variance is equivalent to a modified version of the FG bias-corrected variance. This analytical insight also holds in the presence of weighting. We therefore do not further consider the FG bias-correction due to its similarity to the KC bias-correction, but still provide code to implement the FG bias-correction in the Supplementary Material.

Table 1: Bias-corrected sandwich variance estimators for propensity score weighted GEE and multivariable adjusted GEE estimators of the participant-average treatment effect. Notice that under propensity score weighting, Cov(β̂) is a 2 × 2 matrix and the variance estimator for β̂_Z = log(ÔR) is the lower-right element of Cov(β̂). The matrix M is defined in equation (3).
(a) Variance estimation under propensity score weighting
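The MD and KC corrections can be illustrated for the simpler unweighted, independence-working logistic GEE. This is only a toy sketch: the weighted versions in Table 1 replace the score contributions with their weighted analogues, and the one-sided-then-symmetrize form of KC used here is one common implementation choice, not necessarily the paper's:

```python
import numpy as np

def corrected_sandwich(X, y, beta, cluster, correction="none"):
    """Sandwich estimators for independence-working logistic GEE.

    "none": uncorrected robust estimator.
    "MD":   Mancl-DeRouen, (I - H_i)^{-1} applied to both sides of the residual.
    "KC":   Kauermann-Carroll, (I - H_i)^{-1} applied to one side, then
            symmetrized (a common implementation of this correction)."""
    mu = 1 / (1 + np.exp(-X @ beta))
    w = mu * (1 - mu)
    A = X.T @ (w[:, None] * X)
    Ainv = np.linalg.inv(A)
    B = np.zeros_like(A)
    for g in np.unique(cluster):
        idx = cluster == g
        Xi, ei, wi = X[idx], y[idx] - mu[idx], w[idx]
        Hi = (wi[:, None] * Xi) @ Ainv @ Xi.T            # cluster leverage
        adj = np.linalg.solve(np.eye(len(ei)) - Hi, ei)  # inflated residual
        if correction == "MD":
            s = Xi.T @ adj
            B += np.outer(s, s)
        elif correction == "KC":
            Ci = np.outer(Xi.T @ adj, Xi.T @ ei)
            B += (Ci + Ci.T) / 2
        else:
            s = Xi.T @ ei
            B += np.outer(s, s)
    return Ainv @ B @ Ainv

# Deterministic toy CRT: 10 clusters of 20, first five treated, 8% events per arm.
cluster = np.repeat(np.arange(10), 20)
z = (cluster < 5).astype(float)
X = np.column_stack([np.ones(200), z])
y = np.zeros(200)
y[::13] = 1                                   # 8 events in each arm
beta = np.array([np.log(0.08 / 0.92), 0.0])   # exact GEE solution here
V_rob = corrected_sandwich(X, y, beta, cluster, "none")
V_md = corrected_sandwich(X, y, beta, cluster, "MD")
V_kc = corrected_sandwich(X, y, beta, cluster, "KC")
```

In this balanced example the three variance estimates for β̂_Z order as the literature reviewed above suggests: MD exceeds KC, which exceeds the uncorrected robust estimator.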

2.4 Direct multivariable regression for individual-level covariate adjustment
As an alternative to propensity score weighting, we also consider the direct multivariable regression approach for covariate adjustment in CRTs. Due to non-collapsibility of the logit link function, the estimator of the coefficient for the treatment variable (when there are other individual-level covariates in the model) only reflects a conditional treatment effect. To ensure we are targeting the correct estimand ∆, we use the model fit to obtain estimates of the probabilities of the potential outcomes averaged, or standardized, over the covariate distribution. Suppose we fit the multivariable model

logit{P(Y_ij = 1|Z_i, X_ij)} = β_0 + β_1 X_ij^(1) + · · · + β_P X_ij^(P) + δZ_i,

and let β̂ = [β̂_0, β̂_1, . . . , β̂_P, δ̂]' denote the estimates of the coefficients from the GEE model fit, where δ̂ represents the log odds ratio estimator conditional on all covariates. For estimating the multivariable adjusted GEE model, one could choose the working correlation to be either the independence correlation or the exchangeable correlation, and we focus on the former in this paper for an objective comparison with the propensity score approaches. Based on the GEE model fit, the predicted risk for subject j in cluster i under treatment is P̂_1,ij = exp(X_1,ij β̂)/{1 + exp(X_1,ij β̂)}, where X_1,ij = [1, X_ij', 1]. Similarly, the predicted risk for subject j in cluster i under the control is P̂_0,ij = exp(X_0,ij β̂)/{1 + exp(X_0,ij β̂)}, where X_0,ij = [1, X_ij', 0]. Then a standardization estimator (also referred to as the g-computation formula estimator) of the participant-average treatment effect is

log(ÔR) = log{P̄_1/(1 − P̄_1)} − log{P̄_0/(1 − P̄_0)}, where P̄_z = Σ_{i=1}^{N} Σ_{j=1}^{m_i} P̂_z,ij / Σ_{i=1}^{N} m_i.

For variance estimation, using the delta method, we arrive at V̂ar(log(ÔR)) = M' Cov(β̂) M, where Cov(β̂) may be either the robust sandwich estimator or the bias-corrected estimators in the existing literature (Ford and Westgate, 2017), and M = ∂log(ÔR)/∂β. We provide detailed steps of the derivation in Supplementary File Section 2, where we show the explicit form that M takes in equation (3). The analogous robust sandwich estimator and bias-corrected sandwich estimators are also summarized in Table 1(b) for quick reference.
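The standardization (g-computation) step can be sketched as follows, with hypothetical coefficient values standing in for a fitted multivariable GEE (this is not the paper's code):

```python
import numpy as np

def expit(x):
    return 1 / (1 + np.exp(-x))

# Hypothetical fitted coefficients: intercept, two covariate effects, and the
# conditional treatment effect delta-hat from the multivariable model.
b0, b_x, d = -2.5, np.array([0.4, 0.8]), -0.7

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))        # pooled baseline covariates (toy data)

# Predicted risks for every participant under treatment and under control.
p1 = expit(b0 + X @ b_x + d)
p0 = expit(b0 + X @ b_x)

# Standardize over the empirical covariate distribution, then form the odds ratio.
P1, P0 = p1.mean(), p0.mean()
log_or = np.log(P1 / (1 - P1)) - np.log(P0 / (1 - P0))
```

This also illustrates non-collapsibility: with prognostic covariates, the standardized participant-average log odds ratio is attenuated relative to the conditional coefficient, so |log_or| is smaller than |δ̂| = 0.7 here.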

Simulation Studies
We employ the ADEMP (aims, data-generating mechanisms, estimands, methods, and performance measures) structured approach proposed by Morris et al. (2019) to describe the details of our simulation studies.

Aims
The motivation of our simulation studies is to inform practical choices for individual-level covariate-adjusted analysis of CRTs with small numbers of clusters and rare binary outcomes. Our goals are two-fold. First, we aim to demonstrate the utility of individual-level covariate adjustment in small CRTs when the binary outcomes are rare, hoping to provide some justification for incorporating baseline individual-level covariates in this particularly challenging scenario. Second, we aim to assess and compare the performances of propensity score weighting and multivariable adjustment models, as well as the associated bias-corrected sandwich estimators in this context. These two aims collectively address an important gap in the existing literature, which has primarily focused on evaluating similar methods assuming common binary outcomes and without sufficient attention to individual-level covariate adjustment.

Data-generating mechanism
We generate CRT data with two parallel arms. Suppose N total clusters are randomized to the two arms under 1:1 randomization. The cluster size for each cluster is drawn from a Poisson(m) distribution, where m is the mean cluster size; the exact number of subjects in cluster i is denoted m_i. The outcome ICC under the latent response formulation, ρ_Logit, reflects similarity among people in the same cluster (Li et al., 2017). This parameter will be set at values according to the range reported in practice (Murray et al., 2004). Defining P_1 to be the population incidence under treatment and P_0 to be the population incidence under the control, we consider two levels of the outcome incidence: low incidence (P_1 ≈ 0.05, P_0 ≈ 0.10) and very low incidence (P_1 ≈ 0.025, P_0 ≈ 0.05).
For the outcome generating mechanism, we consider four parametric models. First, we simulate covariates from a standard normal distribution, X^(p) ∼ N(0, 1), p = 1, . . . , P. The first two outcome models assume a constant additive treatment effect with constant covariate effects on the logistic scale -- that is, there are no interactions among variables and no treatment effect heterogeneity explained by covariates. Specifically, the latent continuous outcome for the jth subject in the ith cluster is

Y_ij^c = β_0 + β_Z Z_i + β_1 X_ij^(1) + · · · + β_P X_ij^(P) + u_i + ε_ij,

where u_i ∼ N(0, σ_u^2) is the random effect and ε_ij is assumed to follow the standard logistic distribution with mean 0 and variance σ^2 = π^2/3. Then σ_u^2 = ρ_Logit/(1 − ρ_Logit) × (π^2/3) according to the latent response definition of the binary ICC. The binary outcome is obtained by dichotomizing Y_ij^c: Y_ij = I(Y_ij^c > 0). We consider two sets of fixed coefficients β_1, β_2, β_3 for the covariates X^(1), . . . , X^(P), which correspond to varying strengths of covariate-outcome associations, and β_Z is the main effect of the treatment.
(i) Outcome Generating Model 1: β_1 = 0, β_2 = 0.4, β_3 = 0.8. In this model, some of the covariates are not related to the outcome while the others are weakly correlated with the outcome.
Next, we consider more complex outcome generating models that incorporate nonlinearity and treatment effect heterogeneity (interaction between treatment and covariates). For these models, we only consider six (P = 6) covariates, which are simulated from a multivariate normal distribution with mean vector 0 and a covariance matrix whose diagonal elements are 1 and off-diagonal elements are 0.1, reflecting weak correlations among the covariates.
These models include nonlinear transformations of the covariates and treatment-by-covariate interaction terms in the latent outcome. Further, the parameters β_0 and β_Z for each combination of generating model, incidence level, and number of covariates are set at values that give the desired incidences. These more granular considerations on parameter specification are summarized in Table 2.
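The data-generating steps for the additive-effect models can be sketched as follows. This simplified version uses the Model 1 coefficients; the intercept β_0 here is a placeholder, not a value calibrated to the target incidences as in Table 2:

```python
import numpy as np

def simulate_crt(N=30, mbar=100, icc=0.01, beta0=-3.0, betaz=-0.7,
                 beta_x=(0.0, 0.4, 0.8), seed=0):
    """One simulated CRT under the latent-response logistic outcome model."""
    rng = np.random.default_rng(seed)
    # Latent-response definition of the binary ICC.
    sigma_u = np.sqrt(icc / (1 - icc) * np.pi**2 / 3)
    data = []
    for i in range(N):
        m = max(int(rng.poisson(mbar)), 1)       # cluster size m_i ~ Poisson(mbar)
        z = int(i < N // 2)                      # 1:1 cluster-level randomization
        u = rng.normal(0.0, sigma_u)             # cluster random effect u_i
        X = rng.normal(size=(m, len(beta_x)))    # standard normal covariates
        latent = beta0 + betaz * z + X @ np.array(beta_x) + u + rng.logistic(size=m)
        y = (latent > 0).astype(int)             # dichotomize the latent outcome
        data.append({"z": z, "X": X, "y": y})
    return data

trial = simulate_crt()
```

With these placeholder values the overall event rate lands in the rare-outcome range considered in the simulations.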

Estimands
To address the issue that regression coefficients of different modeling strategies can correspond to different parameters, we have articulated a common, nonparametric causal estimand of interest in Section 2.1 -- the participant-average treatment effect on the log odds ratio scale. Note that the main effect of treatment in the data generating model has a conditional interpretation due to the other covariates. The "true" participant-average treatment effect ∆ for each setting is based on the log odds ratio calculated from a simulated population with N = 5000, m = 100. Specifically, each individual has two potential outcomes, which are obtained by taking draws from the Bernoulli distribution with probability given by the outcome generating model under Z = 1 and Z = 0, respectively. From these, the population incidences under treatment and under control are obtained, so that the true parameter value is ∆ ≈ log{P_1(1 − P_0)} − log{P_0(1 − P_1)}, where P_1 is the large-sample (simulated population) incidence under treatment and P_0 is the large-sample (simulated population) incidence under control.

Methods: analytical strategies with and without baseline individual-level covariates
We consider two modeling approaches for obtaining propensity scores, i.e., the estimated probability of being in the treatment group conditional on the subject's baseline covariates. The first is the multivariable logistic model that regresses the cluster-level treatment variable on the main effects of the covariates. We next employ a more flexible BART model with a probit link (Chipman et al., 2010) to test whether a more flexible propensity score model can potentially adjust for chance imbalances in higher moments of the covariates beyond the mean. Specifically, we intend to test whether using BART for estimating propensity scores provides improvements over traditional parametric propensity scores for individual-level covariate adjustment in CRTs and to identify settings in which that is the case. Once the individual propensity score values are estimated, we construct two types of weight matrices based on IPW and OW. We consider six different models and evaluate their respective performances.

Crude: The crude model is our reference model in which there is no adjustment for individual-level covariates, and only the treatment indicator is included. The model is logit{P(Y_ij = 1)} = β_0 + β_Z Z_i.

Multi: The multivariable model involves individual-level covariate adjustment and includes the main effects of the covariates along with the treatment indicator, given by logit{P(Y_ij = 1)} = β_0 + β_1 X_ij^(1) + · · · + β_P X_ij^(P) + β_Z Z_i.

IPW-Logit, IPW-BART, OW-Logit, OW-BART: The individual-level propensity scores are estimated with logistic regression or BART, inverse probability or overlap weights are calculated from the estimated propensity scores, and the resulting weight matrix W_i is included in the GEE model for estimation and inference.
For the Crude approach, we consider the robust sandwich variance estimator and its bias-corrected versions as in Li and Redden (2015). For the Multi approach, we consider the sandwich variance estimators described in Table 1(b), and for the propensity score weighting approaches, we consider the sandwich variance estimators defined in Table 1(a). We use the R statistical software to estimate propensity scores and perform regression analysis. Specifically, the dbarts package is used to implement BART, and the geepack package is used to fit the GEE models. We developed our own code for computing the sandwich variance estimators in Table 1, and the R code is available in the Supplementary Material.

Performance measures
We report five performance metrics for each combination of simulation setting and analytic approach to compare the relative performances of the methods considered. At each replication r we obtain an estimate of the replication-specific participant-average treatment effect, ∆̂_r. Then, over the 1000 replications, we can obtain the average estimate ∆̄ = Σ_{r=1}^{1000} ∆̂_r/1000. Bias is calculated as the mean difference between each estimate and the true effect value, Bias = Σ_{r=1}^{1000} (∆̂_r − ∆)/1000. To determine whether individual-level covariate adjustment provides efficiency gains over the unadjusted model, we present the relative efficiency (RE), which is the ratio of the empirical variance of the crude model estimator to the empirical variance of the regression or propensity score weighting approaches. Further, for each replication, we construct a 95% normality-based confidence interval. Then the coverage (CVG) using standard error estimates from the robust and bias-corrected estimators is obtained as the proportion of intervals across the replications that include the true estimand value, ∆. Based on the binomial model with 1000 replications, we consider coverage between [93.6%, 96.4%] to be nominal; in general, higher coverage represents conservative performance, which is typically more tolerable than lower coverage (the latter reflecting an anti-conservative sandwich variance estimator). Here, we consider 1000 simulations sufficient for evaluating coverage because the associated Monte Carlo standard error ranges from 0.7% (when the empirical coverage is 95%) to 1.3% (when the empirical coverage is as low as 80%) (Morris et al., 2019).
Lastly, to get a sense of how often separation issues arise for each model, we report the nonconvergence proportion (Non-Con), which refers to the proportion of replications that resulted in an error when fitting the model; this metric is particularly relevant as we assume a low incidence binary outcome, and success in model fitting represents a common practical issue for analyzing such data.
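Separation is the typical cause of such fitting errors: with a rare outcome, some covariate pattern may perfectly predict the outcome, and the logistic likelihood then has no finite maximizer. A minimal toy illustration follows (our own made-up data; simple gradient ascent stands in for the iteratively reweighted least squares used by packaged software):

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def slope_after(xs, ys, steps, lr=0.5):
    """Fit y ~ expit(b * x) by gradient ascent on the log-likelihood."""
    b = 0.0
    for _ in range(steps):
        b += lr * sum((y - expit(b * x)) * x for x, y in zip(xs, ys))
    return b

xs = [-2.0, -1.0, 1.0, 2.0]
separated = [0, 0, 1, 1]    # every event lies above x = 0: complete separation
overlapping = [0, 1, 0, 1]  # events on both sides: a finite MLE exists

# With separation, the estimate drifts upward forever; otherwise it stabilizes
print(slope_after(xs, separated, 200), slope_after(xs, separated, 2000))
print(slope_after(xs, overlapping, 200), slope_after(xs, overlapping, 2000))
```

The ever-growing slope under separation is exactly what surfaces as a non-convergence error in packaged GLM or GEE software, whereas the estimate for the overlapping data settles at a finite value.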

Simulation Results
Performance measures for each combination of GEE model, variance estimator, and trial parameters are provided in detail in the supplementary tables. To keep the main illustration simple and concise, we focus on the settings with an average cluster size of 100 and ICC of 0.01 since the patterns and results do not vary much for the remaining combinations of simulation parameters.

Outcome Generating Model 1: Additive effect model that includes weakly prognostic and non-prognostic variables
Efficiency: The relative efficiency of the covariate adjustment methods compared with the crude model for data generated under Outcome Generating Model 1 is close to 1 across the numbers of clusters considered (Figure 1 and Supplementary Figure 1). This is also the case for the scenarios with other values of average cluster size (m), outcome ICC (ρ_Logit), and total number of covariates (P). Thus, in this setting, there is limited efficiency gain, and in some cases a slight efficiency loss, from covariate adjustment when the included variables are unrelated or weakly related to the outcome and the number of clusters is limited.
Coverage of confidence interval estimator: Figure 2 and Supplementary Figure 2 present the coverage rates of 95% confidence intervals for a varying total number of clusters under Outcome Generating Model 1, at low and very low incidences, respectively. For the crude and multivariable models, most of the estimators result in under-coverage with fewer than 30 clusters, with the MD bias-corrected sandwich variance estimator giving coverage closest to nominal. Under IPW and OW, the MD bias-corrected sandwich variance estimator generally provides nominal coverage, except for the slight under-coverage observed when the number of clusters is N = 6 with logistic propensity score models. Overall, even with N = 30 clusters, the uncorrected robust sandwich estimator may still result in under-coverage, while both bias-corrected estimators give similar coverage rates at the nominal level, regardless of whether individual-level covariate adjustment is considered.
Convergence: Non-convergence tends not to be much of an issue when outcomes are simulated under Outcome Generating Model 1 with 6 covariates, except at very low incidences with N = 6 and m = 30. There, the non-convergence proportion is 14.7% and 16.0% for the multivariable model and 13.6% and 15.1% for the crude and propensity score weighted models at ICCs of 0.001 and 0.01, respectively (Supplementary Table 2). If 15 covariates are adjusted for, approximately half of the replications do not converge for the multivariable model at 6 clusters with 30 subjects per cluster on average (Supplementary Table 4). However, weighting-based analysis encounters much less non-convergence (Supplementary Tables 3 and 4) and is a practical solution for individual-level covariate adjustment when multivariable regression fails to provide a point estimate. Throughout these settings, IPW and OW perform similarly, in contrast to small individually-randomized trials, where OW dominated IPW in terms of performance (Zeng et al., 2021).
Coverage of confidence interval estimator: Supplementary Figures 3 and 4 present the coverage of 95% confidence intervals by the number of clusters under Outcome Generating Model 2, at low and very low incidences, respectively. When the crude and multivariable models are used, all variance estimators result in under-coverage for N = 6 and N = 10 clusters, although the MD bias-corrected sandwich variance estimator is closest to nominal. Under IPW and OW, the MD bias-corrected sandwich variance estimator consistently leads to nominal coverage. Once N = 20 clusters are reached, coverage becomes close to the nominal 0.95 for all estimators based on propensity score weighting.
Convergence: Even at low incidences, the multivariable analysis tends to have larger non-convergence proportions than the crude and propensity score weighted models. Specifically, with 6 covariates under N = 6 and m = 30 at very low incidences, the multivariable model does not converge for 33.3% and 30.9% of the replications when the ICC is 0.001 and 0.01, respectively, almost tripling that of the other models (Supplementary Table 6). The problem becomes quite serious when 15 covariates are adjusted for: the analogous non-convergence proportions are 77.4% and 76.8% when incidences are low (Supplementary Table 7) and 97.6% and 98.6% when incidences are very low (Supplementary Table 8). When N = 10 and m = 30, the non-convergence proportions are 79.3% and 82.6% when the ICC is 0.001 and 0.01, respectively (Supplementary Table 8). Further, the multivariable model with 15 covariates exhibits at least one non-converging replication even with as many as N = 30 clusters. These findings show that, although the multivariable adjustment GEE often provides the largest efficiency gain, it can exhibit serious non-convergence issues when covariates are strongly prognostic and the number of clusters is limited. In those scenarios, propensity score weighting, with either IPW or OW, becomes a practical solution that offers a moderate precision gain over the unadjusted analysis. When the outcome generating model is relatively simple and additive, machine learning propensity score models do not exhibit any apparent advantage over logistic propensity scores.

Outcome Generating Models 3 and 4: Nonlinear covariate-outcome associations with treatment effect heterogeneity explained by individual-level covariates
Efficiency: Under Outcome Generating Model 3, the multivariable analysis gives the largest RE. Further, weighting using BART-estimated propensity scores provides a larger RE than weighting based on logistic-estimated propensity scores, demonstrating that machine learning estimated propensity scores lead to an efficiency advantage over their parametric counterparts when the true outcome model includes complex covariate-outcome associations. Under Outcome Generating Model 4, weighting by BART-estimated propensity scores results in higher efficiency than the multivariable analysis. This is somewhat expected because the multivariable analysis does not include correctly specified functional forms of the individual-level covariates. However, this result represents an important piece of evidence that machine learning propensity scores can confer an efficiency advantage in analyzing small CRTs. Further, weighting using BART-estimated propensity scores gives higher RE as the number of clusters increases.

Coverage of confidence interval estimator: Under Outcome Generating Model 3 (Supplementary Figures 5 and 6), with propensity score weighting, all the sandwich variance estimators, including the robust estimator, give coverage of at least 0.90 starting from N = 10 clusters. IPW with BART-estimated propensity scores, and OW with either logistic or BART-estimated propensity scores, provide coverage close to 0.95 regardless of the sandwich variance estimator, as long as the number of clusters is at least 10. This contrasts with the multivariable model, which still results in under-coverage at N = 10 for the KC bias-corrected sandwich variance estimator. This set of results highlights the important benefit of propensity score weighting as a simple and effective covariate adjustment strategy when the true data generating model is complex. Under Outcome Generating Model 4 (Supplementary Figures 7 and 8), patterns in coverage are generally similar to those observed under Outcome Generating Model 3.
Convergence: For these more complex outcome generating models, the non-convergence proportions from the multivariable analyses remain worse than those from the weighting approaches at a very small number of clusters (N = 6). The results are summarized in Supplementary Tables 9-12.

Application to the RESTORE Cluster Randomized Trial
To illustrate the methods for analyzing CRTs with a low incidence outcome, we apply the propensity score weighting and multivariable regression methods to the Randomized Evaluation of Sedation Titration for Respiratory Failure (RESTORE) trial, which was a CRT that took place in N = 31 U.S. pediatric intensive care units (PICUs). RESTORE compared a nurse-implemented, goal-directed sedation protocol against usual care. The intervention was introduced to 17 PICUs from the Pediatric Acute Lung Injury and Sepsis Investigators (PALISI) Network, while 14 others in this network comprised the control clusters and were given usual care. The total sample size was 2,449 children. Additional details about this study may be found in Curley et al. (2015).
We consider a secondary binary outcome that appeared to show a clinically relevant effect: postextubation stridor (P̂_1 = 0.072, P̂_0 = 0.045). For each of the individual-level covariate adjustment methods, we consider two sets of individual-level covariates corresponding to a small and a large number of variables, all of which are expected to be prognostic, though to varying degrees. The two sets are: (1) 3 individual-level covariates: age, PRISM III-12 score, and whether the baseline POPC score is equal to 1; and (2) 11 individual-level covariates: age, PRISM III-12 score, whether the baseline POPC score is equal to 1, pneumonia as primary diagnosis, bronchiolitis as primary diagnosis, acute respiratory failure related to sepsis as primary diagnosis, oxygenation index, prematurity, asthma, cancer (current or previous diagnosis), and intubation at another hospital with transfer to a participating PICU.
The baseline characteristics by treatment arm are summarized in Table 3. The absolute standardized difference (ASD) for each covariate is reported as a measure of imbalance between the intervention and control arms. The values for age, PRISM score, bronchiolitis, acute respiratory failure, and asthma are greater than 0.10, a common threshold for assessing balance (Austin and Stuart, 2015), suggesting residual imbalance between the treatment arms for these variables. In Supplementary Table 13, we present the propensity score weighted covariate distributions based on the two adjustment sets of interest. With either IPW or OW, covariates included in the propensity score model had reduced baseline imbalance based on the weighted ASD, with all values less than 0.10. With logistic-estimated propensity scores, OW completely removes the baseline chance imbalance for the included covariates, due to its exact mean balance property (Li et al., 2018a; Zeng et al., 2021). Figure 3 presents the data analysis results under both sets of adjustment variables, on both the log OR and OR scales. An overall pattern is that the unadjusted analysis appears to provide a larger participant-average treatment effect than any covariate-adjusted analysis. This is possibly due to the moderate chance imbalance in baseline covariates with a limited number of clusters, which may exaggerate the treatment benefit. In our simulation results, the MD bias-corrected sandwich variance estimator frequently leads to nominal coverage for the participant-average treatment effect across methods for individual-level covariate adjustment; therefore, we only present confidence intervals based on the MD bias-corrected sandwich variance estimator and the original robust sandwich variance estimator (as a benchmark).
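The absolute standardized differences reported in Table 3 and Supplementary Table 13 follow the usual pooled-standard-deviation formula, which, with optional weights such as IPW or OW, can be sketched as below. The data and weights here are made up purely for illustration.

```python
import math

def asd(x1, x0, w1=None, w0=None):
    """Absolute standardized difference between arms, optionally weighted.
    Pooled-SD denominator as in Austin and Stuart (2015)."""
    w1 = w1 or [1.0] * len(x1)
    w0 = w0 or [1.0] * len(x0)
    def wmean(x, w):
        return sum(xi * wi for xi, wi in zip(x, w)) / sum(w)
    def wvar(x, w):
        m = wmean(x, w)
        return sum(wi * (xi - m) ** 2 for xi, wi in zip(x, w)) / sum(w)
    m1, m0 = wmean(x1, w1), wmean(x0, w0)
    return abs(m1 - m0) / math.sqrt((wvar(x1, w1) + wvar(x0, w0)) / 2)

# Made-up covariate values (e.g., age) by arm: imbalanced before weighting
treated = [2.0, 4.0, 6.0, 9.0]
control = [1.0, 2.0, 3.0, 4.0]
print(round(asd(treated, control), 2))                    # unweighted ASD
# Down-weighting the observation driving the imbalance shrinks the ASD
print(round(asd(treated, control, w1=[1, 1, 1, 0.2]), 2))
```

Down-weighting observations that drive the imbalance is the mechanism by which IPW and OW pull the weighted ASDs in Supplementary Table 13 below the 0.10 threshold.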
On the log OR scale (which corresponds to the estimand in our simulations), the RE values (defined based on the MD bias-corrected sandwich variance estimates) between the covariate-adjusted estimators and the crude estimator indicate that covariate adjustment brings a substantial efficiency gain, as shown by the narrower confidence intervals compared with those obtained from the crude model. Furthermore, while the new sedation protocol is shown to significantly increase the odds of postextubation stridor in the unadjusted analysis, this effect is no longer statistically significant after the covariates are adjusted for.

Discussion
In this article, we have offered a balanced discussion of the benefits and limitations of propensity score weighting and multivariable adjustment methods, and we have identified scenarios under which one of these methods may provide relatively higher statistical efficiency for treatment effect estimation as well as better convergence properties. Furthermore, we have investigated the small-sample inference properties of these estimators with bias-corrected sandwich variance estimators. In what follows, we organize the discussion around three main properties of the competing estimators: efficiency, coverage, and convergence. First, regarding efficiency, our results suggest that individual-level covariate adjustment tends to provide better performance than an unadjusted model, especially when the covariates are at least moderately prognostic. This finding parallels those for individually-randomized trials and reinforces the need for individual-level covariate adjustment in CRTs. However, the optimal individual-level covariate adjustment method can differ based on the underlying data generating process (Figure 1). When it is successfully fit, the multivariable model provides the largest efficiency gain when covariates are strongly prognostic of the outcome and the model is approximately correctly specified. On the other hand, propensity score weighting may be preferred when data complexities, such as interactions and/or a nonlinear response surface, are expected. In some cases, the efficiency gain from propensity score weighting may exceed that from multivariable regression with linear-term adjustment. For weighting estimators, the choice of weights (IPW versus OW) did not have much effect on performance. Rather, efficiency gains from propensity score weighting varied with the type of model used to estimate the propensity score in some settings. This is in contrast to Zeng et al.
(2021), who found OW often dominates IPW in small individually-randomized trials. The likely reason is that even when the number of clusters is small, the total sample size remains at least moderate in CRTs. As for the choice of propensity score model, BART-estimated propensity scores may provide a greater efficiency gain than parametric propensity scores, such as logistic regression, when the true outcome data generating model includes nonlinear interactions between covariates and treatment. Even though machine learning propensity score models have been shown to provide benefit in analyzing observational studies (Lee et al., 2010), to the best of our knowledge, this is the first study that demonstrates the utility of a machine learning propensity score model for covariate adjustment in CRTs. Of note, even though BART is inherently Bayesian, we have integrated its posterior mean estimates of the propensity scores into the GEE estimators to obtain a final frequentist estimator. This choice is based on practical utility, and a more formal Bayesian approach is left for future work.
Figure 3: Estimates of participant-average treatment effects and 95% confidence intervals on the log odds ratio scale (upper panels) and the odds ratio scale (lower panels). Dotted lines at log(OR) = 0 and OR = 1 indicate no treatment effect.
Second, regarding coverage, we found that the MD bias-corrected sandwich variance estimator frequently provides nominal coverage for individual-level covariate-adjusted estimation of the participant-average treatment effect in CRTs. Although the MD bias-corrected sandwich variance estimator has been shown to be overly conservative in previous comparative studies with no or few covariates, it gave approximately 95% coverage in our case, especially when there are very few clusters. We acknowledge, however, that our findings are based on the z-confidence interval rather than the t-confidence interval, whereas previous studies have shown, for example, that the t-test coupled with the KC bias-corrected sandwich variance estimator preserves the nominal test size when the cluster size is not extremely variable (Li and Redden, 2015). Our basis for choosing the z-confidence interval is twofold. On one hand, it may not be straightforward to determine the appropriate degrees of freedom for the t-confidence interval with propensity score weighting, as technically the marginal mean model only includes a treatment indicator, yet potentially many covariates can be adjusted for in the propensity score weights. On the other hand, introducing the t-confidence interval with a conservative choice of the degrees of freedom (such as the number of clusters minus the number of covariates adjusted for, as in Mancl and DeRouen (2001)) may lead to lower power compared to the z-confidence interval, which is unfavorable in small CRTs. The optimal degrees of freedom under the t-distribution for different individual-level covariate-adjusted estimators, and the extent to which the t-confidence interval would change our recommendations, require further research.
Finally, regarding convergence, we find that if the outcome incidences are under 0.05, propensity score weighting may be preferred over multivariable regression, which is very likely to have separation and convergence issues. This is especially the case when a large number of individual-level covariates are adjusted for, but the number of clusters is limited. Despite the fact that multivariable regression may lead to the largest efficiency gain when the model is close to correctly specified, the choice between multivariable regression and propensity score weighting for individual-level covariate adjustment in small CRTs should also take into account convergence as a practical challenge. Overall, propensity score weighting represents a useful and feasible approach for individual-level covariate adjustment in CRTs with a rare binary outcome and when the number of clusters is small.
There are three limitations that we plan to address in future work. First, we have only considered the causal odds ratio as a target estimand. However, the methods are readily applicable to estimating the causal risk difference or relative risk, for example, by choosing the log link (Li and Tong, 2021b) for unadjusted and weighting-based estimators. It would be valuable to compare individual-level covariate adjustment strategies for estimating these two alternative effect measures. Second, we have restricted attention to rare binary outcomes with an incidence rate ranging from around 2% to 10%, similar to that in our motivating study; in this range, we have already observed non-trivial non-convergence issues from fitting multivariable regression models with individual-level covariates. There are other cases with extremely rare outcomes, such as those with an incidence rate below 1% (Westgate et al., 2022). In that case, while we suspect the non-convergence issue for multivariable regression models can only be more frequent, whether individual-level covariate adjustment can lead to efficiency gain is an open question. Third, we have considered the independence working correlation model, and therefore our recommendations are limited to independence GEE. As we show in Supplementary File Section 1, the choice of the independence correlation structure allows us to interpret the treatment effect coefficient as our target causal estimand by giving each individual equal weight (Kahan et al., 2022). Since the exchangeable working correlation model is another popular choice in analyzing CRTs, it would be interesting to investigate to what extent the current findings generalize to the exchangeable working correlation model.
The data are available from the NHLBI BioLINCC repository (https://biolincc.nhlbi.nih.gov/home/). The authors do not have permission to share the data directly; however, the data are available to researchers upon request and approval from BioLINCC.
Supplementary File to "Leveraging baseline covariates to analyze small cluster-randomized trials with a rare binary outcome"
Here we briefly outline the key steps to demonstrate that the GEE estimator adjusting for the cluster-level treatment indicator under the independence working correlation provides an (asymptotically) unbiased quantification of the participant-average log odds ratio. Resuming the notation in the main paper, Z_i denotes the treatment status of cluster i and Y_ij is the outcome for subject j in cluster i, for a CRT with N total clusters, each containing m_i subjects, i = 1, ..., N. Let β_0 denote the intercept and β_Z the coefficient of the treatment variable in the GEE model, and write expit(x) = exp(x)/{1 + exp(x)}. Assuming an independence working correlation matrix, the estimating equations to be solved are
Σ_{i=1}^{N} Σ_{j=1}^{m_i} (1, Z_i)^T {Y_ij − expit(β_0 + β_Z Z_i)} = (0, 0)^T.
Since the right-hand side is the zero vector, the elements on the left-hand side must all equal zero. Thus, the solution to the above equations must satisfy
expit(β̂_0 + β̂_Z) = Σ_{i: Z_i=1} Σ_{j=1}^{m_i} Y_ij / Σ_{i: Z_i=1} m_i ≡ P̂_1,  expit(β̂_0) = Σ_{i: Z_i=0} Σ_{j=1}^{m_i} Y_ij / Σ_{i: Z_i=0} m_i ≡ P̂_0.
Rearranging terms, the GEE estimators can be expressed as β̂_0 = logit(P̂_0) and β̂_Z = logit(P̂_1) − logit(P̂_0), so that the independence GEE estimator satisfies exp(β̂_Z) = {P̂_1(1 − P̂_0)}/{(1 − P̂_1)P̂_0}. Furthermore, we can show that
P̂_1 →_p E{Σ_{j=1}^{m_i} Y_ij | Z_i = 1} / E{m_i | Z_i = 1} = E{Σ_{j=1}^{m_i} Y_ij(1) | Z_i = 1} / E{m_i | Z_i = 1} = E{Σ_{j=1}^{m_i} Y_ij(1)} / E{m_i} ≡ P_1,
where the convergence in probability statement results from an application of the Weak Law of Large Numbers for independent but non-identically distributed data, the first equality is due to the consistency assumption that Y_ij = Z_i Y_ij(1) + (1 − Z_i) Y_ij(0), and the second equality is due to cluster-level randomization. Following the exact same steps, we observe that P̂_0 →_p P_0 ≡ E{Σ_{j=1}^{m_i} Y_ij(0)} / E{m_i}, and therefore exp(β̂_Z) →_p {P_1(1 − P_0)}/{(1 − P_1)P_0}, which is the participant-average treatment effect on the odds ratio scale; it is then immediate that β̂_Z →_p ∆, the participant-average treatment effect on the log odds ratio scale. Finally, note that P̂_1 is the outcome incidence in the treated group and P̂_0 is the outcome incidence in the control group.
Adding the propensity score weights to the independence GEE estimator does not change the limit of β̂_Z due to cluster randomization (as the estimated propensity scores converge to a constant), and therefore β̂_Z in a propensity score weighted GEE is also a valid estimator for ∆. In contrast, if the working correlation model is exchangeable, β̂_Z may converge to a target parameter other than ∆. For example, the probability limit of the exchangeable correlation estimator β̂_Z will involve an unknown intracluster correlation coefficient and will often differ from ∆ unless the cluster sizes m_i are all equal.
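The closed form exp(β̂_Z) = {P̂_1(1 − P̂_0)}/{(1 − P̂_1)P̂_0} can be checked numerically. The toy simulation below (our own illustrative data, in Python rather than the paper's R) confirms that pooled arm-specific incidences reproduce the independence GEE log odds ratio even with variable cluster sizes:

```python
import math
import random

random.seed(42)

def logit(p):
    return math.log(p / (1 - p))

# Toy CRT: clusters of varying size, cluster-level randomization
clusters = []
for i in range(20):
    z = i % 2                                  # alternate arms
    m = random.randint(20, 120)                # variable cluster sizes
    risk = 0.08 if z == 1 else 0.05            # rare outcome incidences
    clusters.append((z, [int(random.random() < risk) for _ in range(m)]))

# Pooled incidences weight every participant equally (participant-average)
p1 = sum(sum(y) for z, y in clusters if z == 1) / sum(len(y) for z, y in clusters if z == 1)
p0 = sum(sum(y) for z, y in clusters if z == 0) / sum(len(y) for z, y in clusters if z == 0)

beta_z = logit(p1) - logit(p0)                 # independence GEE solution
odds_ratio = (p1 * (1 - p0)) / ((1 - p1) * p0) # same quantity on the OR scale
print(abs(math.exp(beta_z) - odds_ratio) < 1e-12)
```

Because the pooled incidences weight every participant equally, this estimator targets the participant-average effect, matching the derivation in this supplement.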
3 Additional simulation results
3.1 Relative efficiency under very low incidence outcome