## 1 Introduction

The use of propensity scores to control for pretreatment imbalances on observed variables in non-randomized or observational studies examining the causal effects of treatments or interventions has become widespread over the past decade. Authors have used propensity scores to match [1, 2], stratify (subclassify) [3, 4], or weight [5, 6] the samples from the treatment and control groups so that the distributions (or features of the distributions such as the means) of observed pretreatment characteristics are similar across the treatment and control groups, thereby reducing or eliminating confounding.

Propensity score techniques are advantageous compared with regression-based, covariate-adjustment techniques – which correct for imbalances between groups on pretreatment covariates by controlling for them in regression models for the outcomes – for at least five reasons. First, by summarizing all pretreatment variables in a single score, propensity scores are an important dimension reduction tool for evaluating treatment effects. This characteristic is particularly advantageous over standard adjustment methods when there is a potentially large number of pretreatment covariates [3]. Second, propensity score methods derive from a formal model for causal inference, the potential outcomes framework, so that causal questions can be well defined and explicitly specified rather than conflated with the modeling approach, as they are with traditional regression approaches. Third, propensity score methods do not require modeling the mean of the outcome, which can help avoid bias from misspecification of that model [7]. Fourth, propensity score methods avoid extrapolating beyond the observed data, unlike parametric regression modeling for outcomes, which extrapolates whenever the treatment and control groups are disparate on pretreatment variables [1]. Lastly, propensity score adjustments can be implemented using only the pretreatment covariates and treatment assignments of study participants, without any use of the outcomes. This feature is valuable because it eliminates the potential for the choice of model specification for pretreatment variables to be influenced by its impact on the estimated treatment effect [8].

Most studies that use propensity scores to control for imbalances compare just two treatment groups of interest (e.g., treatment and control). Nonetheless, a number of papers have shown that propensity score methods can be extended to the multiple treatment case with three or more conditions of interest (e.g., treatment A, treatment B, and control; [9-11]). Theoretical work of Imbens [9] and Imai and van Dyk [10] developed the causal models and justification for the use of propensity scores to remove bias in cases with multiple treatment conditions, whereas the work of Robins and colleagues [12] developed the use of marginal structural models and inverse probability of treatment weighting (IPTW) for modeling causal effects from multiple treatments. Other authors have provided more specific guidance on implementing the approaches from Imbens [9] and Imai and van Dyk [10] in practice. For example, Lechner [13] provided a step-by-step matching protocol for multiple treatments, which has been utilized numerous times in the economics literature. A citation search found 76 papers that cited Lechner's paper, most involving economic evaluations. Zanutto *et al.* [14] described how to use stratification (subclassification) on the propensity score in the multiple treatment case, and most recently, Spreeuwenberg *et al.* [15] presented a tutorial for using the multinomial propensity scores as controls in the outcome regression model.

Despite these developments on the use of propensity score matching and stratification for more than two treatments, propensity score weighting with multiple treatment conditions has received very little practical attention. In particular, there is little guidance on how to estimate the propensity scores or the subsequent weights. Moreover, the existing applications have generally relied on parametric estimation of the propensity score via the multinomial, nested, or ordinal logistic regression model for multiple treatments [12-16]. Spreeuwenberg *et al.* [15] provided a step-by-step guide to causal modeling with multiple treatments that suggests using multinomial logistic or probit models or ordinal logistic models to estimate the propensity scores and gives guidance on when one model may be preferable to another. However, the paper offers no guidance on variable selection or propensity score model tuning within this parametric framework. Zanutto and colleagues [14] also suggested using ordinal logistic regression to estimate the propensity scores for multiple doses of treatment. They recommended the iterative approach of Rosenbaum and Rubin [3] and offered the modifications necessary to apply it to multiple treatments.

Recent studies of propensity score estimation in the binary case of two treatments show that, in terms of bias reduction and mean squared error (MSE), machine learning methods outperform simple logistic regression models with iterative variable selection [17-19]. By extension, machine learning methods may also be advantageous in the multiple treatments setting. One such machine learning technique that has been frequently utilized in the two-treatment case [5, 18, 20] is the generalized boosted model (GBM). GBM estimates the propensity score for the binary treatment indicator using a flexible estimation method that can adjust for a large number of pretreatment covariates. GBM estimation involves an iterative process with multiple regression trees to capture complex and nonlinear relationships between treatment assignment and the pretreatment covariates without over-fitting the data [5, 21-24]. It works with continuous and discrete pretreatment variables and is invariant to monotonic transformations of them. Further, one of the most useful features of GBM for estimating the propensity score is that its iterative estimation procedure can be tuned to find the propensity score model leading to the best balance between treated and control groups, where balance refers to the similarity between groups in their propensity score weighted distributions of pretreatment covariates.
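To make the notion of balance concrete, the following is a minimal sketch, not the procedure used in this paper. It assumes the propensity scores are already known (here they are set by hand rather than estimated with GBM), uses inverse probability of treatment weights, and summarizes balance on a single covariate with a standardized difference in weighted means. The function names (`iptw_weights`, `weighted_mean`, `standardized_difference`) and the toy data are hypothetical and chosen only for illustration.

```python
from statistics import pstdev


def iptw_weights(treatment, propensity):
    """Weight each unit by 1 / P(treatment actually received | covariates).

    `propensity[i]` is a list of probabilities indexed by treatment label,
    so the same function covers two or more treatment conditions.
    """
    return [1.0 / probs[t] for t, probs in zip(treatment, propensity)]


def weighted_mean(values, weights):
    """Weighted average of `values`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def standardized_difference(x, treatment, weights, group_a, group_b):
    """Absolute difference in weighted means of x between two groups,
    scaled by the unweighted standard deviation of x in the full sample
    (one of several possible scaling choices; an assumption of this sketch).
    """
    def group_mean(group):
        vals = [xi for xi, t in zip(x, treatment) if t == group]
        wts = [wi for wi, t in zip(weights, treatment) if t == group]
        return weighted_mean(vals, wts)

    return abs(group_mean(group_a) - group_mean(group_b)) / pstdev(x)


# Toy data (assumed): one binary covariate x and a binary treatment.
# Units with x = 0 are treated with probability 0.25, units with x = 1
# with probability 0.75, so treatment is confounded with x.
x = [0, 0, 0, 0, 1, 1, 1, 1]
treatment = [1, 0, 0, 0, 1, 1, 1, 0]
propensity = [[0.75, 0.25]] * 4 + [[0.25, 0.75]] * 4  # [P(T=0), P(T=1)]

# Balance before weighting: every unit gets weight 1.
unweighted = standardized_difference(x, treatment, [1.0] * len(x), 0, 1)

# Balance after inverse probability of treatment weighting.
weights = iptw_weights(treatment, propensity)
weighted = standardized_difference(x, treatment, weights, 0, 1)

print(f"standardized difference before weighting: {unweighted:.3f}")
print(f"standardized difference after weighting:  {weighted:.3f}")
```

In this constructed example the weighted group means of `x` coincide, so the standardized difference shrinks to essentially zero after weighting. With estimated rather than known propensity scores, a diagnostic of this kind is what a tuning procedure would try to make small across all pretreatment covariates.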

In light of the potential advantages of boosting in the case of three or more treatment conditions, this paper provides researchers with a tutorial on implementing propensity score weighting using GBM when examining multiple treatments. Building on Frölich [11], we begin by describing a variety of causal effect estimands of potential interest when examining more than two treatment conditions (Section 2). Then, in Section 3, we describe how to estimate the multiple treatment propensity scores using GBMs, and we introduce useful diagnostic criteria for assessing balance. In Section 4, we use data from a large, observational study of adolescent substance users to illustrate GBM-based propensity score estimation and evaluation of diagnostic criteria for assessing balance in the context of an outcomes analysis of the relative effectiveness of three different outpatient substance abuse programs for adolescents.