On variance estimation of the inverse probability‐of‐treatment weighting estimator: A tutorial for different types of propensity score weights

Propensity score methods, such as inverse probability-of-treatment weighting (IPTW), have been increasingly used for covariate balancing in both observational studies and randomized trials, allowing the control of both systematic and chance imbalances. Approaches using IPTW are based on two steps: (i) estimation of the individual propensity scores (PS), and (ii) estimation of the treatment effect by applying PS weights. Thus, a variance estimator that accounts for both steps is crucial for correct inference. Using a variance estimator that ignores the first step leads to overestimated variance when the estimand is the average treatment effect (ATE), and to under- or overestimated variance when targeting the average treatment effect on the treated (ATT). In this article, we emphasize the importance of using an IPTW variance estimator that correctly accounts for the uncertainty in PS estimation. We present a comprehensive tutorial for obtaining unbiased variance estimates, by proposing and applying a unifying formula for different types of PS weights (ATE, ATT, matching, and overlap weights). This formula can be derived either via the linearization approach or via M-estimation. Extensive R code is provided along with the corresponding large-sample theory. We perform simulation studies to illustrate the behavior of the estimators under different treatment and outcome prevalences and demonstrate the appropriate behavior of the analytical variance estimator. We also use a reproducible analysis of observational lung cancer data as an illustrative example, estimating the effect of receiving a PET-CT scan on the receipt of surgery.


INTRODUCTION
Propensity score (PS) methods were originally introduced by Rosenbaum and Rubin for the estimation of the causal effect of a treatment (or exposure) on an outcome in observational studies prone to confounding.1 The PS for an individual is defined as the probability of receiving treatment conditional on a set of measured pre-treatment covariates. Once the PS is estimated (under the assumptions of consistency, positivity, no unmeasured confounding, and no model misspecification), it can be used to balance individuals' characteristics between treatment groups, through matching, stratification, weighting, or covariate adjustment, and therefore remove confounding bias.2-5 After applying one of these methods, patients are on average expected to have similar characteristics across treatment groups. Thus, unlike multivariable outcome regression followed by some form of standardization, as in G-computation, PS approaches5,6 focus on modeling the treatment allocation process instead of the outcome. This is one of the assets of propensity scores in cases of rare outcomes or when the specification of the outcome model is very complex. Among PS methods, IPTW has gained popularity and is widely implemented in observational studies and, more recently, also in randomized controlled trials (RCTs). The main difference between the two settings is that in observational studies the imbalance in patient baseline characteristics is most of the time systematic, whereas in RCTs any imbalance is due to chance. In RCTs, IPTW always increases precision in the effect estimate.7,8 Briefly, in randomized settings there is no need to include variables that are predictive of the treatment (as treatment is randomized by design). Hence, one may include all the variables predictive of the outcome in the analysis; these are proven to reduce the variance while not biasing the effect estimate.9 In this way, using PS modeling in the randomized setting always adds to the precision of the point estimate. IPTW has the advantage of targeting the same marginal estimand, no matter which covariates are adjusted for. We focus on IPTW particularly because it has more attractive mathematical properties than alternative PS methods. Propensity score matching often does not make use of the full sample and is harder to implement when the estimand of interest is the average treatment effect (ATE). Stratification leads to more biased effect estimates when compared to various weighted estimators.10 Lastly, covariate adjustment using the PS does not separate the design from the analysis stage.11
IPTW easily accommodates different measures of association (eg, risk difference, marginal risk ratio, and marginal odds ratio) and is widely used to estimate the ATE or the average treatment effect in the treated (ATT). Nonetheless, a problem frequently encountered is large weights, which lead to very imprecise effect estimates. There are a few solutions to this: (1) trimming the tails of the PS distribution, that is, removing individuals whose PS values fall outside a specified range of the PS distribution; this, however, shifts the target population to a new one, which is no longer clearly defined12,13; (2) truncating the weights,14 that is, setting the value of weights that are greater than percentile p to the value of percentile p; or (3) using an estimand that focuses on a population for which extreme weights do not occur. The first two are simple and easy, and therefore popular. The third, while more complex, has an explicitly defined interpretation.16,17 PS methods can be viewed as two-step estimators: in the first step, we estimate the PS value for each individual. In the second step, the point estimate of the treatment effect and its corresponding standard error (SE) are obtained (eg, by applying a weighted regression model and a robust sandwich variance estimator). This robust estimator usually takes into account the dependence in the data introduced by the weighting procedure, while it may or may not correct for the PS estimation. In practice, unfortunately, the robust estimator applied18 incorrectly assumes that the individual propensity scores estimated in the first step are quantities known with certainty. Hereafter, we refer to this estimator as the uncorrected estimator. When targeting the ATE, this always results in an erroneously inflated variance and conservative confidence intervals.19 When targeting the ATT, not accounting for this source of variability may lead to either over- or underestimation of the variance, as shown by Reifeis and Hudgens.20 Using an uncorrected variance estimator could lead to confidence intervals that are invalid, in the sense that the type I error rate may not equal the advertised one. The non-parametric bootstrap can be used to obtain valid SEs, but it may be computationally intensive for large databases, which motivated the development of analytic approaches.
Several authors have proposed closed-form estimators that provide unbiased IPTW variance estimates, mostly focusing on the ATE weights. This article is mainly driven by the work of Williamson et al in the context of randomized trials,8 as well as by the results of Lunceford and Davidian.10 Williamson et al suggested an analytic variance estimator for the traditional ATE weights when the treatment effect of interest is the risk difference, the marginal risk ratio, or the marginal odds ratio. We extend this formula to a unifying variance estimator for different PS weighting schemes (ATE and ATT weights, overlap weights [OW], and matching weights [MW]). We illustrate the case of a binary treatment and a binary outcome, but the formulas generalize to continuous outcomes as well. Published papers rarely use the correct way to estimate the variance, and one reason may be the absence of a step-by-step guide for doing so. Hence we aim to: (i) offer a unifying formula to calculate the 95% confidence intervals that correspond to the IPTW estimates, applicable to different types of PS weights, and (ii) provide a comprehensive step-by-step implementation in R software. In this way, we hope to contribute to disseminating the statistically reliable approach to estimating the SEs when using IPTW, as obtaining point estimates is only half the inferential problem.21 The article is organized as follows: Section 2 presents a brief introduction to the potential outcomes framework and the assumptions under which causal effects are identifiable. Section 3 briefly describes the lung cancer study that is used as an illustrative example in Section 5. Section 4 offers an intuition for the derived variance estimation formula for different PS weighting schemes, via the linearization method. Section 5 demonstrates the analytic calculation of the marginal large-sample IPTW variance estimator for different types of weights. Section 6 presents a Monte Carlo simulation investigating the behavior of the estimators under different treatment and outcome prevalence values. Finally, we close in Section 7 with a discussion.

CAUSAL FRAMEWORK AND GENERAL ASSUMPTIONS
We briefly describe how the IPTW estimator is derived and under which assumptions it is an unbiased estimator of the causal effect of a binary treatment on an outcome in observational studies.

Notation, definition and assumptions
Consider an observational study of $i = 1, \ldots, n$ individuals. We denote the observed individual outcome as $Y_i$ (binary or continuous), the observed treatment as $Z_i$ (1: treatment, 0: no treatment), and a set of measured baseline characteristics acting as potential confounders as $\mathbf{X}$, where $\mathbf{X}$ is an $n \times (p+1)$ design matrix, with $p$ the total number of covariates. Herein, vectors and matrices are presented in bold, and by default a vector is a column vector. Now, let $Y_{1i}$ be the potential outcome of individual $i$ if they were to receive treatment level 1 and, correspondingly, $Y_{0i}$ the potential outcome of individual $i$ if they were to receive treatment level 0. It is not feasible to observe both of these outcomes simultaneously, so only one of them is observed for each individual (the other being counterfactual). We can describe the treatment effect by contrasts of $\mu_1 = E[Y_1]$ and $\mu_0 = E[Y_0]$, such as risk differences. For example, for the ATE, we would have $\delta = \mu_1 - \mu_0 = E[Y_1] - E[Y_0]$. In RCTs, when targeting the overall population, we can write $E[Y_1] = E[Y \mid Z = 1]$ and $E[Y_0] = E[Y \mid Z = 0]$, thanks to randomization. By contrast, in observational studies treatment is not randomly allocated between treatment groups, creating systematic imbalances of individual characteristics; thus, these equalities no longer hold. If these imbalances are not accounted for in the analysis, the treatment effect estimates will be biased. To remedy this, we define a propensity score estimator, named the inverse probability-of-treatment weighted (IPTW) estimator, which estimates the causal treatment effect in two steps: 1. Estimation of the PS. 2. Derivation of the estimates for the estimand of interest (eg, the ATE).
For the causal effect of a treatment on an outcome to be identifiable, the following assumptions must be met: (i) consistency22; (ii) positivity23; and (iii) no unmeasured confounding (or conditional exchangeability).24 The above assumptions are strong and untestable (however, positivity and no unmeasured confounding hold by design in RCTs). Causal consistency involves both the assumptions of (i) treatment variation irrelevance and (ii) no interference.25 We note that identification, although required, is not sufficient for estimation.26,27 Hence, these assumptions are essential for the causal interpretation to hold, along with correct specification of the PS model. The former are vital for identifiability, whereas the latter is needed for the estimation step.

Inverse probability-of-treatment weighted estimator of treatment effect

First, we define the propensity score $e_i$ for individual $i$ as the probability of receiving treatment $Z_i$ conditional on pre-treatment characteristics $\mathbf{x}_i$. Suppose that $Z_i \mid \mathbf{X}_i$ follows a generalized linear model:

$$g(e_i) = \mathbf{x}_i^\top \boldsymbol{\alpha}, \quad (1)$$

where $g$ is a specific link function. In our demonstration, we derive the formulas for any possible link function. The most widely applied regression model is logistic regression, which uses the logit link function $g(x) = \ln\{x/(1-x)\}$, but other link functions may be applied as well (see Section 4).29,30 The PS model should include at least all confounders and, potentially, predictors of the outcome. Once the chosen model is estimated, the estimated PS is obtained for each individual as

$$\hat{e}_i = g^{-1}(\mathbf{x}_i^\top \hat{\boldsymbol{\alpha}}), \quad (2)$$

where $\mathbf{x}_i$ is the $(p+1)$ column vector of pre-treatment covariate values and $\hat{\boldsymbol{\alpha}}$ is the vector of estimated coefficients (including the intercept). For example, for logistic regression, $\hat{e}_i = 1/\{1 + \exp(-\mathbf{x}_i^\top \hat{\boldsymbol{\alpha}})\}$. Second, we can estimate the population means under treatment and under no treatment using the weights

$$w_{1i} = \frac{f(\mathbf{x}_i)}{e_i} \quad \text{and} \quad w_{0i} = \frac{f(\mathbf{x}_i)}{1 - e_i},$$

with the target population depending on $f(\mathbf{x})$. When the function $f(\mathbf{x})$ is equal to 1, we target the mean of the full population. When it is equal to the PS, we target the mean of the population of the treated. Finally, for values equal to $e(\mathbf{x})\{1 - e(\mathbf{x})\}$ or to $\min\{e(\mathbf{x}), 1 - e(\mathbf{x})\}$, we target the overlap population via the overlap and matching weights, respectively. The choice of the target population determines the type of weights $w_{1i}, w_{0i}$, leading to the resulting estimators, motivated by the Hájek estimator31:

$$\hat{\mu}_1 = \frac{\sum_{i=1}^n w_{1i} Z_i Y_i}{\sum_{i=1}^n w_{1i} Z_i} \quad (3) \qquad \text{and} \qquad \hat{\mu}_0 = \frac{\sum_{i=1}^n w_{0i} (1 - Z_i) Y_i}{\sum_{i=1}^n w_{0i} (1 - Z_i)}. \quad (4)$$

How to choose the appropriate weights is the topic of the next section.
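To make the correspondence between $f(\mathbf{x})$ and the weights concrete, the sketch below computes all four sets of weights of Table 1 from an estimated PS vector. This is a minimal illustration under our own naming conventions (`ps`, `z`, and `make_weights` are ours, not from the article's appendices):

```r
## Minimal sketch: the four weighting schemes, given an estimated
## propensity score vector `ps` and a binary treatment vector `z`.
make_weights <- function(ps, z, scheme = c("ATE", "ATT", "OW", "MW")) {
  scheme <- match.arg(scheme)
  f <- switch(scheme,
              ATE = rep(1, length(ps)),   # f(x) = 1: overall population
              ATT = ps,                   # f(x) = e(x): the treated
              OW  = ps * (1 - ps),        # f(x) = e(x){1 - e(x)}: overlap
              MW  = pmin(ps, 1 - ps))     # f(x) = min{e(x), 1 - e(x)}: matching
  ifelse(z == 1, f / ps, f / (1 - ps))    # w_1i if treated, w_0i otherwise
}
```

The Hájek-type estimators (3) and (4) are then weighted means, for example `weighted.mean(y[z == 1], make_weights(ps, z, "OW")[z == 1])` for $\hat{\mu}_1$ under overlap weights.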

Types of propensity score weights
Usually, the research question defines the target population, that is, the group of individuals for whom inference is to be made. The estimand (the true value one aims to estimate) is in turn determined by the target population (see Table 1). In this setting, weighting achieves two things: (i) it re-balances individuals in the data so that those less represented in one treatment group are given more weight to compensate; and (ii) it re-balances characteristics overall to reflect the target population. Consequently, depending on the target population, we choose the respective type of weight. We now discuss which populations are considered by the different weighting schemes: ATE weights, ATT weights, MW, and OW.32

ATE weights
To estimate the average effect of a treatment on an outcome if the whole population were to receive treatment vs if the whole population were not to, we apply the corresponding ATE weights (see Table 1). The ATE is useful when one intends to implement a treatment or policy in the whole population and is interested in the effect of that treatment on the whole population. For the ATE to be identifiable, the assumption of conditional exchangeability needs to hold for the whole population, a rather strong assumption. Positivity needs to hold for both the whole population and the sample (the two observed treatment groups). In practice, it is often the latter that is the problem; this, however, may be relaxed by the following weighting schemes.

ATT weights
For the average effect on the outcome if the population of the treated were to be treated vs if the population of the treated were not to be treated, we define the ATT weights. An important setting in which the ATT is applied is when investigating the effect of withholding a harmful treatment from a population with characteristics similar to those who are being treated (eg, the effect of preventing smoking among pregnant women on the birthweight of infants; in this setting, there is no interest in estimating the effect of withholding smoking from every pregnant woman vs not withholding it, as the public health question concerns the impact of preventing smoking among those who currently do smoke). For the ATT, the assumptions of positivity and conditional exchangeability are relaxed: conditional exchangeability needs to hold for the potential outcomes under no treatment only, while positivity requires patients to have a non-zero propensity not to receive treatment; we do not need everyone to have a non-zero propensity to receive treatment.

Matching (MW) and overlap (OW) weights
In some cases, extreme weights arise due to positivity violations and/or strong confounding, which lead to biased estimates and excessive variance.13,33,34 These are more likely when targeting the ATE rather than the ATT. Extreme weights may indicate that a new population is of interest; this is because extreme weights occur when certain types of people are very likely to receive one treatment and are thus not necessarily amenable to intervention. This new population of interest can be defined as those who achieve the "most overlap in observed characteristics between treatment groups," to cite Li et al.15 In this situation, one targets the ATE in the overlap population, often described as the equipoise population (ATO).32 The ATO can easily be estimated by applying either the so-called overlap weights (OW) or the matching weights (MW), the latter being a weighting analogue to 1:1 caliper matching without replacement or pair matching.16 The ATO answers questions regarding the subset of patients who have approximately equal chances of either receiving the treatment or not. This is a population very similar to the patients who would be enrolled in a clinical trial. In practice, targeting the ATO is typically a consequence of having used MW or OW, rather than a conscious decision to target inference at the equipoise population. Despite that, it is important to keep in mind the correspondence between the weighting scheme of choice and the population of interest. Regarding the identifiability assumptions, both the conditional exchangeability and the positivity assumptions are weaker than for the ATE, as these need to hold only within the overlap population.

Target population
Weighting techniques targeting the ATO are relatively recent compared to those for the ATE or ATT; however, ATO methods are seeing growing use. This reinforces the importance of having at hand a correct variance estimator that can be applied universally to any of the aforementioned settings.

ILLUSTRATIVE EXAMPLE: NON-SMALL CELL LUNG CANCER (NSCLC) DATA
To illustrate the use of the IPTW variance estimator for various weighting schemes, we analyse a population-based dataset of patients diagnosed in England with non-small cell lung cancer (NSCLC). Details on the main aims of this study can be found in the original article.35 For the purposes of this tutorial, we estimate the effect of PET-CT scan receipt on the probability of receiving surgery when targeting (i) the overall population of patients (ATE), (ii) those who actually receive a PET-CT scan (ATT), and (iii) those with equal chances of either receiving a PET-CT scan or not (ATO).
The ATE answers the question of how the probability of receiving surgery would have been affected if all NSCLC patients (of stage I to III) at cancer diagnosis had received a PET-CT scan, a practice that has not yet been implemented everywhere in the UK health system. A PET-CT scan is usually recommended to cancer patients before undertaking surgery, as it offers reliable knowledge about the localization and spread of the tumor. Therefore, the treatment of interest is a binary variable indicating whether a PET-CT scan was undertaken, while the outcome is surgery with curative intent (no: 0; yes: 1; n = 10 398 NSCLC patients diagnosed in England in 2012; 6898 received a PET-CT scan). Both risk differences and risk ratios are appropriate measures for causal interpretations. When it comes to odds ratios, the fact that they are not collapsible may make the comparison of results from different models harder; they are also more difficult to interpret (because of the concept of odds). We emphasize that the subsequent analyses are mostly for educational purposes; thus we did not account for other complexities in the data that would otherwise need to be addressed, such as missing values (we conducted complete-case analyses) or the competing risk of death. Now, we consider the validity of the required assumptions. The potential confounders accounted for are: sex, age at diagnosis, deprivation score, performance status, and stage of cancer at the time of diagnosis; these have been extracted from an assumed causal diagram (Figure 1). However, these may not constitute a sufficient adjustment set (eg, variables such as type of hospital could be considered as well). Regarding consistency and no interference, we assume these to be valid in practice. Regarding positivity, when empirically examining the distributions of the propensity scores among those having received a PET-CT scan vs those having not, we notice sufficient overlap between the two treatment groups, especially for PS values greater than 0.5, for the chosen PS model (see Figure 2, where the PS model included the main effects of age, sex, deprivation score, stage, and performance status, as presented in Table 2).

Causal effect measures of interest
Williamson et al8 applied M-estimation to derive a variance estimator that accounts for the two IPTW estimation steps (see Section 2.1). M-estimation is an estimation method that can be applied for variance estimation of a target parameter, among many other settings.36,37 For more details, the reader may consult Sections 1 and 5 in Appendix I. In the following subsections we present an alternative approach, based on linearization, a more intuitive method that is applicable to estimating the variance of a target parameter under broad assumptions.38 Section 1.5 of Appendix I demonstrates that the formulas for variance estimation derived from linearization coincide with those derived from M-estimation when applying the logit link to estimate the PS. To begin with, the parameter of interest $\delta$ is a function of the two marginal means (ie, $\delta = f(\mu_1, \mu_0)$). We shall denote by

$$\hat{\delta}_1 = \hat{\mu}_1 - \hat{\mu}_0 \quad (5)$$

the marginal risk difference estimator (or mean difference estimator if the outcome is quantitative rather than binary), by

$$\hat{\delta}_2 = \log(\hat{\mu}_1 / \hat{\mu}_0) \quad (6)$$

the log marginal risk ratio estimator, and by

$$\hat{\delta}_3 = \log\left\{ \frac{\hat{\mu}_1 / (1 - \hat{\mu}_1)}{\hat{\mu}_0 / (1 - \hat{\mu}_0)} \right\} \quad (7)$$

the log marginal odds ratio estimator.
For the last two, researchers typically target the marginal risk ratio and the marginal odds ratio. Nonetheless, we estimate the natural logarithms of these quantities, as the logarithms are more likely to be approximately normally distributed, making the construction of 95% confidence intervals straightforward. One may then exponentiate the interval limits to obtain approximate 95% confidence intervals for the actual quantities rather than their logarithms.
In the following subsection, we present the main principles of linearization of a parameter estimator. Then, we derive each element used in the final variance estimator from each of the estimation stages described in Section 2.1: (1) the propensity model and the estimated weights, and (2) the population means under treatment or under no treatment, and the marginal treatment effect. Details of intermediate steps in the derivations via linearization that are not shown in the main manuscript are presented in Appendix I.

Introduction to linearization
Linearization is a method for estimating the variance of an estimator $\hat{\theta}$ of a parameter $\theta$. The parameter may be a population mean $\mu$ or a treatment effect $\delta$.38 In linearization, we seek an artificial variable $l_i(\hat{\theta})$, which we call the linearized variable of $\hat{\theta}$, that expresses the difference between the estimator $\hat{\theta}$ and the parameter $\theta$ as a mean of the $l_i(\hat{\theta})$ plus a negligible quantity $o_p(n^{-1/2})$, which approaches zero at rate $n^{-1/2}$ as the sample size $n$ increases:

$$\hat{\theta} - \theta = \frac{1}{n} \sum_{i=1}^n l_i(\hat{\theta}) + o_p(n^{-1/2}). \quad (8)$$

Equation (8) ensures that $\text{Var}(\hat{\theta}) \simeq n^{-1} \text{Var}\{l(\hat{\theta})\}$ (note that in our setting the $l_i(\hat{\theta})$ are assumed i.i.d.). Hereafter, a statistic or parameter estimator $\hat{\theta}$ for which we can derive a linearized variable $l_i(\hat{\theta})$ is called linearizable.
In our setting, we aim to estimate the variance of the causal treatment effect estimator of interest (ie, $\hat{\delta}_1$, $\hat{\delta}_2$, or $\hat{\delta}_3$). This is not a straightforward task, as we need to account both for the correlation of the individual outcomes after weighting and for the uncertainty in the estimation of the PS weights. This motivates proving that the corresponding estimator is linearizable, that is, can be written in a form such as Equation (8). For example, for the variance of $\hat{\delta}_1$, we must identify $l_i(\hat{\delta}_1)$ such that

$$\hat{\delta}_1 - \delta_1 = \frac{1}{n} \sum_{i=1}^n l_i(\hat{\delta}_1) + o_p(n^{-1/2}). \quad (9)$$

From Equation (5),

$$\hat{\delta}_1 - \delta_1 = (\hat{\mu}_1 - \mu_1) - (\hat{\mu}_0 - \mu_0). \quad (10)$$

By rearranging terms in (10),

$$\hat{\delta}_1 - \delta_1 = \frac{1}{n} \sum_{i=1}^n l_i(\hat{\mu}_1) - \frac{1}{n} \sum_{i=1}^n l_i(\hat{\mu}_0) + o_p(n^{-1/2}) \quad (11)$$

$$= \frac{1}{n} \sum_{i=1}^n \left\{ l_i(\hat{\mu}_1) - l_i(\hat{\mu}_0) \right\} + o_p(n^{-1/2}), \quad (12)$$

where, to derive line (12) from (11), we apply the so-called linearization rules presented in Reference 38; specifically, the second rule states that the linearized statistic of a sum of estimators is equal to the sum of the linearized statistics of the estimators. As a result, to estimate the variance of $\hat{\delta}_1$, one needs to derive a linearized statistic for each of the two population mean estimators, $\hat{\mu}_1$ and $\hat{\mu}_0$. For the variances of $\hat{\delta}_2$ and $\hat{\delta}_3$, since these are not linear functions of the estimators of the population means, the first rule of linearization is applied, which states that the linearized statistic of a differentiable function $f$ of a vector of estimators is the product of the linearized statistics of the estimators and the matrix of partial derivatives of the function $f$. More on this is demonstrated in Section 4.5. Now, to linearize $\hat{\mu}_1$, we may write it, from Equation (3), as

$$\hat{\mu}_1 = \frac{\sum_{i=1}^n \hat{w}_{1i} Z_i Y_i}{\sum_{i=1}^n \hat{w}_{1i} Z_i}, \quad (16)$$

and likewise for $\hat{\mu}_0$:

$$\hat{\mu}_0 = \frac{\sum_{i=1}^n \hat{w}_{0i} (1 - Z_i) Y_i}{\sum_{i=1}^n \hat{w}_{0i} (1 - Z_i)}. \quad (17)$$

From expression (16), we see that $\hat{\mu}_1$ is a function of $Y_i$, $Z_i$, and $\hat{w}_{1i}$. The first two are observed quantities, while the third is a parameter estimated from the sample. Hence, to linearize the population mean under treatment, we must linearize $\hat{w}_{1i}$. The same reasoning applies to the linearization of $\hat{\mu}_0$. However, both $\hat{w}_{1i}$ and $\hat{w}_{0i}$ are functions of the estimated PS (see Table 1). Consequently, we must start with the linearization of the PS model and the respective weights before moving on to linearizing the population means. This is presented in the following two subsections.
Of note, $l_i(\hat{\theta})$ cannot be applied directly for variance estimation, since it usually depends on unknown parameters. Replacing these unknown quantities with their estimators leads to the estimated linearized variable $\hat{l}_i(\hat{\theta})$ and to the variance estimator38

$$\widehat{\text{Var}}(\hat{\theta}) = \frac{1}{n} \widehat{\text{Var}}\{\hat{l}_i(\hat{\theta})\}. \quad (19)$$

Lastly, we mention that linearization is similar to the delta method and most of the time leads to identical variance estimators. In the following subsections, we exploit the fact that Taylor expansion is applicable in our setting, which greatly simplifies the variance estimation procedure. A brief explanation of Taylor expansion and the delta method can be found in the Glossary section of Appendix I.
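In code, Equation (19) is essentially one line: given the vector of estimated linearized variables, the variance estimate is its empirical variance divided by $n$. A minimal sketch, with our own naming:

```r
## Equation (19) in code: variance of an estimator from its estimated
## linearized variable `l_hat` (one value per individual).
var_linearized <- function(l_hat) var(l_hat) / length(l_hat)

## Example use, once l_hat_mu1 and l_hat_mu0 are available (Section 4.4):
## se_delta1 <- sqrt(var_linearized(l_hat_mu1 - l_hat_mu0))
```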

Linearization of the propensity model and propensity score weights
Assume that $Z_i \mid \mathbf{X}_i$ follows the generalized linear model of Equation (1) and that the estimator $\hat{\boldsymbol{\alpha}}$ is obtained by solving the generalized estimating equation

$$\bar{U}(\hat{\boldsymbol{\alpha}}) = \frac{1}{n} \sum_{i=1}^n U_i(\hat{\boldsymbol{\alpha}}) = \mathbf{0}, \quad (20)$$

with

$$U_i(\boldsymbol{\alpha}) = \frac{Z_i - e_i}{e_i (1 - e_i)\, g'(e_i)}\, \mathbf{x}_i \quad (21)$$

and $g'(\cdot)$ the first derivative of the link function $g(\cdot)$. We note that Equation (21) is the score contribution for $\boldsymbol{\alpha}$ and can be obtained directly from generalized estimating equations theory.39 The estimator $\hat{\boldsymbol{\alpha}}$ in Equation (20) can alternatively be seen as an M-estimator.37 Since $\bar{U}(\hat{\boldsymbol{\alpha}}) = \mathbf{0}$, we have $\bar{U}(\hat{\boldsymbol{\alpha}}) - \bar{U}(\boldsymbol{\alpha}) = -\bar{U}(\boldsymbol{\alpha})$, and under mild conditions a first-order Taylor expansion gives

$$\hat{\boldsymbol{\alpha}} - \boldsymbol{\alpha} \simeq -\left[ E\left\{ \frac{\partial U_i(\boldsymbol{\alpha})}{\partial \boldsymbol{\alpha}^\top} \right\} \right]^{-1} \bar{U}(\boldsymbol{\alpha}). \quad (22)$$

More details on the derivation of Equation (22) can be found in Appendix I.
We have

$$E\left\{ \frac{\partial U_i(\boldsymbol{\alpha})}{\partial \boldsymbol{\alpha}^\top} \right\} = -E\left[ \frac{\mathbf{x}_i \mathbf{x}_i^\top}{e_i (1 - e_i) \{g'(e_i)\}^2} \right] \equiv -C. \quad (23)$$

From Equations (21)-(23), we obtain

$$\hat{\boldsymbol{\alpha}} - \boldsymbol{\alpha} \simeq C^{-1} \left[ \frac{1}{n} \sum_{i=1}^n \frac{Z_i - e_i}{e_i (1 - e_i)\, g'(e_i)}\, \mathbf{x}_i \right]. \quad (24)$$

This general formulation allows logistic regression to be considered as one method among others for estimating the propensity score, and we provide several examples of generalized linear models. Under a logistic model, we have $g'(e_i) = 1/\{e_i(1 - e_i)\}$, and Equation (24) can be rewritten as

$$\hat{\boldsymbol{\alpha}} - \boldsymbol{\alpha} \simeq \left( E\left[ e_i (1 - e_i)\, \mathbf{x}_i \mathbf{x}_i^\top \right] \right)^{-1} \frac{1}{n} \sum_{i=1}^n (Z_i - e_i)\, \mathbf{x}_i. \quad (25)$$

Under a probit model, we have $e_i = \Phi(\mathbf{x}_i^\top \boldsymbol{\alpha})$ and $g'(e_i) = 1/\phi(\mathbf{x}_i^\top \boldsymbol{\alpha})$, where $\Phi(\cdot)$ and $\phi(\cdot)$ are the cumulative distribution and probability density functions of a standard normal distribution, respectively, and Equation (24) can be rewritten by plugging in these quantities. Under a complementary log-log (cloglog) model, we have $e_i = 1 - \exp\{-\exp(\mathbf{x}_i^\top \boldsymbol{\alpha})\}$ and $g'(e_i) = 1/[\exp(\mathbf{x}_i^\top \boldsymbol{\alpha})(1 - e_i)]$, and Equation (24) can again be rewritten accordingly. Similar developments can be obtained for other modeling techniques, provided that an explicit estimating equation is available (which excludes tree-based regression approaches, for example, generalized boosted models29).
Whatever the underlying regression model, Equation (24) gives a linear approximation of $\hat{\boldsymbol{\alpha}} - \boldsymbol{\alpha}$, which captures the uncertainty in the estimation of the propensity model coefficients. Estimating the variance of these coefficients is not the objective pursued here, but this approximation leads to one of the components in the linearized variable for the propensity score weight, and to the first element of the linearized variable $l_i(\hat{\theta})$. For this purpose, we first obtain a Taylor expansion of the estimated weights. Denoting $\hat{w}_{1i} \equiv w_{1i}(\hat{\boldsymbol{\alpha}})$ and $\hat{w}_{0i} \equiv w_{0i}(\hat{\boldsymbol{\alpha}})$, we have

$$\hat{w}_{1i} \simeq w_{1i} + \frac{\partial w_{1i}}{\partial e_i}\, \frac{\mathbf{x}_i^\top}{g'(e_i)}\, (\hat{\boldsymbol{\alpha}} - \boldsymbol{\alpha}) \quad (26)$$

and

$$\hat{w}_{0i} \simeq w_{0i} + \frac{\partial w_{0i}}{\partial e_i}\, \frac{\mathbf{x}_i^\top}{g'(e_i)}\, (\hat{\boldsymbol{\alpha}} - \boldsymbol{\alpha}). \quad (27)$$

Therefore, for each possible weighting scheme, we need the corresponding derivative. For the weights summarized in Table 1, these derivatives are functions of $e_i$ (and hence of $\mathbf{x}_i$) and are shown in Table 3.

TABLE 3 Derivatives of the weight functions with respect to the propensity score $e_i$.

Target population (weights): $\partial w_{1i}/\partial e_i$; $\partial w_{0i}/\partial e_i$
- Overall population (ATE): $-1/e_i^2$; $1/(1-e_i)^2$
- Treated (ATT): $0$; $1/(1-e_i)^2$
- Overlap or clinical equipoise (ATO), OW: $-1$; $1$
- Overlap or clinical equipoise (ATO), MW: $0$ if $e_i < 0.5$ and $-1/e_i^2$ if $e_i > 0.5$; $1/(1-e_i)^2$ if $e_i < 0.5$ and $0$ if $e_i > 0.5$
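For implementation, the derivatives in Table 3 translate directly into code. The sketch below (our own naming, for the schemes of Table 1) returns the derivative vectors $\partial w_{1i}/\partial e_i$ and $\partial w_{0i}/\partial e_i$ used in the correction term:

```r
## Derivatives of the weight functions (Table 3) with respect to the PS.
weight_derivs <- function(ps, scheme = c("ATE", "ATT", "OW", "MW")) {
  scheme <- match.arg(scheme)
  switch(scheme,
         ATE = list(d1 = -1 / ps^2,           d0 = 1 / (1 - ps)^2),
         ATT = list(d1 = rep(0, length(ps)),  d0 = 1 / (1 - ps)^2),
         OW  = list(d1 = rep(-1, length(ps)), d0 = rep(1, length(ps))),
         MW  = list(d1 = ifelse(ps < 0.5, 0, -1 / ps^2),
                    d0 = ifelse(ps < 0.5, 1 / (1 - ps)^2, 0)))
}
```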

Linearization of the population means
From the definition of $\hat{\mu}_1$ in Equation (3), we have

$$\hat{\mu}_1 - \mu_1 \simeq \frac{1}{n} \sum_{i=1}^n \frac{\hat{w}_{1i} Z_i (Y_i - \mu_1)}{E(w_{1i} Z_i)}. \quad (33)$$

By plugging (26) and (24) into (33), we obtain (after some algebra) the linearized variable

$$l_i(\hat{\mu}_1) = \frac{w_{1i} Z_i (Y_i - \mu_1)}{E(w_{1i} Z_i)} + \frac{E\left[ \frac{\partial w_{1j}}{\partial e_j} \frac{Z_j (Y_j - \mu_1)}{g'(e_j)}\, \mathbf{x}_j^\top \right]}{E(w_{1j} Z_j)}\, C^{-1} U_i(\boldsymbol{\alpha}), \quad (34)$$

where $C$ is given in Equation (24). The linearized variable $l_i(\hat{\mu}_1)$ contains two components. The first is the usual linearized component associated with the estimating equation for $\mu_1$ and incorporates the corresponding variability, while the second incorporates the variability associated with the estimation of $\boldsymbol{\alpha}$.

The estimated linearized variable of $\hat{\mu}_1$ is obtained by replacing all unknown quantities in (34) with their estimators. Finally, we obtain

$$\hat{l}_i(\hat{\mu}_1) = l^{uncor}_{1i} + l^{cor}_{1i}, \quad (35)$$

with

$$l^{uncor}_{1i} = \frac{\hat{w}_{1i} Z_i (Y_i - \hat{\mu}_1)}{n^{-1} \sum_{j=1}^n \hat{w}_{1j} Z_j} \quad \text{and} \quad l^{cor}_{1i} = \frac{n^{-1} \sum_{j=1}^n \frac{\partial w_{1j}}{\partial e_j}\Big|_{\hat{e}_j} \frac{Z_j (Y_j - \hat{\mu}_1)}{g'(\hat{e}_j)}\, \mathbf{x}_j^\top}{n^{-1} \sum_{j=1}^n \hat{w}_{1j} Z_j}\, \hat{C}^{-1} U_i(\hat{\boldsymbol{\alpha}}).$$

By a similar argument, we obtain the estimated linearized variable for $\hat{\mu}_0$:

$$\hat{l}_i(\hat{\mu}_0) = l^{uncor}_{0i} + l^{cor}_{0i}, \quad (37)$$

with

$$l^{uncor}_{0i} = \frac{\hat{w}_{0i} (1 - Z_i)(Y_i - \hat{\mu}_0)}{n^{-1} \sum_{j=1}^n \hat{w}_{0j} (1 - Z_j)} \quad \text{and} \quad l^{cor}_{0i} = \frac{n^{-1} \sum_{j=1}^n \frac{\partial w_{0j}}{\partial e_j}\Big|_{\hat{e}_j} \frac{(1 - Z_j)(Y_j - \hat{\mu}_0)}{g'(\hat{e}_j)}\, \mathbf{x}_j^\top}{n^{-1} \sum_{j=1}^n \hat{w}_{0j} (1 - Z_j)}\, \hat{C}^{-1} U_i(\hat{\boldsymbol{\alpha}}).$$

As mentioned in the introductory paragraph of this section, these two estimated linearized variables can be applied to estimate the variance of the estimated population means: $\widehat{\text{Var}}(\hat{\mu}_1) = \frac{1}{n} \widehat{\text{Var}}\{\hat{l}_i(\hat{\mu}_1)\}$ and $\widehat{\text{Var}}(\hat{\mu}_0) = \frac{1}{n} \widehat{\text{Var}}\{\hat{l}_i(\hat{\mu}_0)\}$. The estimated linearized variables also have the advantage of explicitly highlighting the different components used to calculate the counterfactual means $\hat{\mu}_1$ and $\hat{\mu}_0$. If we limit the estimated linearized variables to their first terms in Equations (35) and (37), that is, $l^{uncor}_{1i}$ and $l^{uncor}_{0i}$ respectively, then $\frac{1}{n} \widehat{\text{Var}}\{\hat{l}_i(\hat{\mu}_1)\}$ and $\frac{1}{n} \widehat{\text{Var}}\{\hat{l}_i(\hat{\mu}_0)\}$ estimate the variances of $\hat{\mu}_1$ and $\hat{\mu}_0$ as if the individual propensity score values were known; this corresponds to the uncorrected variance estimator. This estimator directly depends on the type of weights used. The uncorrected variance estimates are equal to the sandwich robust variance estimates when the latter do not account for the PS estimation. This estimator accounts for the correlation between individual outcomes, since weighting generates a new pseudo-population in which the same individual may effectively appear more than once whenever their weight is greater than 1. However, it treats the propensity score predictions as if they were the true propensities of treatment assignment, rather than estimates. An equivalent formula can be derived via M-estimation (see Appendix I).
On the other hand, including the second term, $l^{cor}_{1i}$ in Equation (35) or $l^{cor}_{0i}$ in (37), yields the corrected variance estimator, which additionally takes into account the fact that the propensity score values (and the respective weights) are estimated from the predictions of a regression model. This correction can be broken down into two parts: the first depends directly on the derivative of the chosen weighting scheme, while the second depends on the regression model used to estimate the propensity score. Therefore, unlike some other estimators previously proposed in the literature, this expression supports the generalization of the variance estimator to any type of weighting scheme and to a wider variety of PS models.
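To fix ideas, the following sketch assembles the estimated linearized variables of Equations (35) and (37) for the overlap weights with a logistic PS model (for which $\partial e_i/\partial \boldsymbol{\alpha} = e_i(1 - e_i)\mathbf{x}_i$ and $U_i(\hat{\boldsymbol{\alpha}}) = \mathbf{x}_i(Z_i - \hat{e}_i)$). The function name and inputs are our own; Appendix II contains the article's full code:

```r
## Estimated linearized variables for mu1-hat and mu0-hat under overlap
## weights (w1 = 1 - e, w0 = e; derivatives d1 = -1, d0 = 1) and a logit PS.
## y: outcome; z: treatment; X: n x (p+1) design matrix (incl. intercept);
## ps: fitted propensity scores from glm(z ~ ., family = binomial).
linearize_means_ow <- function(y, z, X, ps) {
  n <- length(y)
  w1 <- 1 - ps; w0 <- ps
  mu1 <- sum(w1 * z * y) / sum(w1 * z)
  mu0 <- sum(w0 * (1 - z) * y) / sum(w0 * (1 - z))
  m1 <- mean(w1 * z); m0 <- mean(w0 * (1 - z))
  ## uncorrected components: PS treated as known
  l1_unc <- w1 * z * (y - mu1) / m1
  l0_unc <- w0 * (1 - z) * (y - mu0) / m0
  ## correction components: de/dalpha = e(1 - e)x under the logit link
  U <- X * (z - ps)                            # score contributions U_i
  C <- crossprod(X * (ps * (1 - ps)), X) / n   # C-hat of Equation (23)
  B1 <- colMeans(X * (-1 * ps * (1 - ps) * z * (y - mu1))) / m1
  B0 <- colMeans(X * ( 1 * ps * (1 - ps) * (1 - z) * (y - mu0))) / m0
  list(mu1 = mu1, mu0 = mu0,
       l1 = l1_unc + drop(U %*% solve(C, B1)),    # Equation (35)
       l0 = l0_unc + drop(U %*% solve(C, B0)))    # Equation (37)
}
```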

Linearization of causal treatment effect
Finally, we derive the estimated linearized variables for the estimators of the causal treatment effect measures. For the estimator of the risk difference (or mean difference for a quantitative outcome) $\hat{\delta}_1$, we have

$$\hat{l}_i(\hat{\delta}_1) = \hat{l}_i(\hat{\mu}_1) - \hat{l}_i(\hat{\mu}_0), \quad (39)$$

and the respective estimated variance is obtained from Equation (19):

$$\widehat{\text{Var}}(\hat{\delta}_1) = \frac{1}{n} \widehat{\text{Var}}\{\hat{l}_i(\hat{\delta}_1)\}. \quad (40)$$

By using the computation rules for linearization given in Deville,38 we obtain for the estimator of the log marginal risk ratio $\hat{\delta}_2$ the estimated linearized variable

$$\hat{l}_i(\hat{\delta}_2) = \frac{\hat{l}_i(\hat{\mu}_1)}{\hat{\mu}_1} - \frac{\hat{l}_i(\hat{\mu}_0)}{\hat{\mu}_0}, \quad (41)$$

and for the estimator of the log marginal odds ratio $\hat{\delta}_3$ the estimated linearized variable

$$\hat{l}_i(\hat{\delta}_3) = \frac{\hat{l}_i(\hat{\mu}_1)}{\hat{\mu}_1 (1 - \hat{\mu}_1)} - \frac{\hat{l}_i(\hat{\mu}_0)}{\hat{\mu}_0 (1 - \hat{\mu}_0)}, \quad (42)$$

and, correspondingly, their variance estimators $\widehat{\text{Var}}(\hat{\delta}_2) = \frac{1}{n} \widehat{\text{Var}}\{\hat{l}_i(\hat{\delta}_2)\}$ and $\widehat{\text{Var}}(\hat{\delta}_3) = \frac{1}{n} \widehat{\text{Var}}\{\hat{l}_i(\hat{\delta}_3)\}$. In the next section we provide the corresponding R code to generate the variance estimates. For the sake of brevity, the code focuses on the risk difference and the overlap weights (targeting the ATO), with a propensity model estimated using logistic regression. We used non-stabilized weights throughout, which for the case of a binary treatment at a single time point should provide the same results as the stabilized ones.26 For time-varying or continuous treatments, however, stabilized weights are not expected to provide the same results. All the remaining calculations for the different weights and effect measures follow in a similar manner and can be found in Appendices I and II (corresponding to the theoretical derivations and the R code application, respectively). Results for all the different combinations of weights and effect measures for the illustrative example are also provided in the following section.
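Putting the pieces together, a sketch of Equations (39)-(42) for the overlap weights follows; it reuses the hypothetical `linearize_means_ow()` from the previous subsection, and the objects `y`, `z`, `X`, and `ps` are assumed to exist:

```r
## Variance estimates and 95% CIs for the three effect measures.
lin <- linearize_means_ow(y, z, X, ps)
mu1 <- lin$mu1; mu0 <- lin$mu0; n <- length(y)
l_rd    <- lin$l1 - lin$l0                          # Equation (39)
l_logrr <- lin$l1 / mu1 - lin$l0 / mu0              # Equation (41)
l_logor <- lin$l1 / (mu1 * (1 - mu1)) -
           lin$l0 / (mu0 * (1 - mu0))               # Equation (42)
est <- c(RD    = mu1 - mu0,
         logRR = log(mu1 / mu0),
         logOR = log(mu1 / (1 - mu1)) - log(mu0 / (1 - mu0)))
se <- sqrt(c(var(l_rd), var(l_logrr), var(l_logor)) / n)   # Equation (19)
ci <- cbind(lower = est - qnorm(0.975) * se,
            upper = est + qnorm(0.975) * se)
exp(ci["logRR", ])  # back-transform to obtain the CI for the RR itself
```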

GUIDED IMPLEMENTATION
In Sections 5.1-5.3 we show a step-by-step implementation of the corrected variance estimator using R version 4.0.3. In Section 5.4, we present the results of applying this to the NSCLC data set. We maintain the notation introduced in Section 2.1. The covariates are denoted as X1, X2, X3, X4, X5, X6 (following the order presented in Section 3) throughout the implementation. A description of the characteristics of patients diagnosed with NSCLC in England in 2012, stratified by receipt of PET-CT scan, is presented in Table 2.

Estimation of marginal means
First step: prediction of the propensity scores $\hat{e}_i$. The first required step of IPTW estimation is to fit the propensity score model.26 Individual propensity scores are typically estimated using the predictions of a logistic regression model of Z on X. The predicted probabilities of receiving a PET-CT scan conditional on the measured confounders (Equation 2) are obtained in the box below.
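A minimal sketch of this step, assuming the analysis data are held in a data frame `dat` (our assumed name) with treatment Z and covariates X1-X6 as named above:

```r
## First step: fit the logistic PS model and predict each individual's PS.
ps_model <- glm(Z ~ X1 + X2 + X3 + X4 + X5 + X6,
                family = binomial(link = "logit"), data = dat)
dat$ps <- predict(ps_model, type = "response")  # e_i-hat of Equation (2)
```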

Results
Standard errors obtained from the uncorrected estimator, the corrected large-sample variance estimator, and the bootstrap appear in Table 4, along with the IPTW estimates for the effect of PET-CT scan on the receipt of surgery and their corresponding 95% confidence intervals.
In this particular example, all uncorrected variance estimates are larger than the corrected ones (Table 4). As anticipated, bootstrap estimates are almost identical to those derived analytically. The bootstrap was performed non-parametrically over the entire procedure (ie, including PS estimation within each bootstrap sample). We provide the normal-based 95% CIs for 1000 bootstrap repetitions in Table 4. As for the point estimates, we interpret the risk difference when targeting the ATO: had the population of those with an equal chance of either receiving a PET-CT scan or not received a PET-CT scan, the probability of undergoing surgery would have been 19.3 percentage points higher than had the same population not received one. Lastly, across the different association measures we notice a difference between the ATE/ATT effect estimates and the ATO ones.

SIMULATION STUDY

Aim
We performed a simulation study to assess the behavior of the analytic corrected variance estimator, as compared to the uncorrected variance estimator, in finite samples. By doing so, we highlight scenarios where the variance may be either over- or underestimated when using the uncorrected variance estimator, and illustrate the importance of applying a corrected variance estimator for valid inference.

Data-generating mechanisms and estimands
We randomly generated 10 independent and identically distributed baseline characteristics $X_1, X_2, \ldots, X_{10} \sim N(0, 1)$ for $n = 10\,000$ individuals. Values for the binary treatment variable $Z$ were drawn from a Bernoulli($p_Z$) for each individual (we suppress the individual index $i$ for brevity), with

$$p_Z = \text{expit}\left( \alpha_{0,Z} + \sum_{j=1}^{10} \beta_j X_j \right), \quad (45)$$

where $\text{expit}(x) = e^x / (1 + e^x)$. A binary outcome was then generated for each individual from a Bernoulli($p_Y$), with

$$p_Y = \text{expit}\left( \alpha_{0,Y} + \sum_{j=1}^{10} \beta_j X_j + \gamma Z \right). \quad (46)$$

In Equation (46), $\gamma$ denotes the conditional log(OR) relating the treatment $Z$ to the outcome $Y$. Lastly, $\alpha_{0,Z}$, $\alpha_{0,Y}$, and $\gamma$ were set to values that induce the desired treatment prevalence $\pi_Z$, event rate $\pi_Y$, and marginal effect $\Gamma$ (RD, RR, or OR in combination with the ATE, ATT, MW, or OW estimator) in the simulated sample. So, for each paired combination of effect measure (RD, RR, OR) and estimand (ATE, ATT, MW, OW), there is a different set of parameter values. The process used to identify these parameter values relied on a minimization approach, which is described in detail in Appendix I.
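A sketch of one replicate of this data-generating mechanism is given below; the intercepts and coefficients (`a0_z`, `a0_y`, `beta`, `gamma`) are illustrative placeholders, not the calibrated values obtained from the minimization described in Appendix I:

```r
## One simulated data set from Equations (45)-(46) (illustrative parameters).
set.seed(2024)
n <- 10000
expit <- function(x) 1 / (1 + exp(-x))
X <- matrix(rnorm(n * 10), nrow = n)             # X1, ..., X10 ~ N(0, 1)
a0_z <- -2; a0_y <- -1; beta <- log(1.25); gamma <- log(1.5)  # placeholders
p_z <- expit(a0_z + drop(X %*% rep(beta, 10)))   # Equation (45)
Z <- rbinom(n, 1, p_z)
p_y <- expit(a0_y + drop(X %*% rep(beta, 10)) + gamma * Z)  # Equation (46)
Y <- rbinom(n, 1, p_y)
```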
We allowed the following parameters to vary across simulations: the treatment prevalence $\pi_Z$, the event rate $\pi_Y$, and the strength $\beta$ of the covariate effects (see the footnote of Figure 3). Hence, for each paired combination of effect measure and estimand, we applied a full factorial design with a total of $3 \times 3 \times 5 = 45$ data-generating mechanisms.

Methods of analysis
For the analysis, we applied correctly specified propensity score models, and for the variance of the estimated treatment effect we compared performance when (i) the PS estimation step is not accounted for (uncorrected variance estimator) and (ii) the PS estimation step is accounted for (corrected variance estimator).

Performance measures
For each scenario, we used $n_{sim} = 10\,000$ replicates to calculate the following performance criteria41:
1. Bias: $\frac{1}{n_{sim}} \sum_{i=1}^{n_{sim}} (\hat{\Gamma}_i - \Gamma)$, where $\hat{\Gamma}_i$ is the treatment effect estimate in replicate $i$ and $\Gamma$ the true value.
2. Empirical standard error (EmpSE): the standard deviation of the $\hat{\Gamma}_i$ across replicates.
3. Average model-based standard error (ModSE), $\sqrt{\frac{1}{n_{sim}} \sum_{i=1}^{n_{sim}} \hat{V}(\hat{\Gamma}_i)}$, where $\widehat{SE}(\hat{\Gamma}_i) = \sqrt{\hat{V}(\hat{\Gamma}_i)}$ is the estimated standard error of the treatment effect $\hat{\Gamma}_i$ (obtained using the corrected or the uncorrected estimator), together with the relative (%) error in ModSE, $100 \times (\text{ModSE}/\text{EmpSE} - 1)$.
4. Coverage of the 95% confidence intervals.

The relative (%) error in ModSE allows evaluation of the performance of the variance estimators: a value > 0 (or < 0) indicates that model-based standard errors overestimate (respectively, underestimate) the variability of the treatment effect estimate. Coverage evaluates whether the procedure for constructing the confidence interval achieves the advertised nominal level (95% here).
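Given vectors of point estimates, model-based SEs, and the true effect across replicates, these criteria take a few lines of R (a sketch with our own naming; `est`, `se`, and `truth` are assumed to exist):

```r
## Performance measures over n_sim replicates.
bias     <- mean(est - truth)
emp_se   <- sd(est)                               # empirical SE
mod_se   <- sqrt(mean(se^2))                      # average model-based SE
rel_err  <- 100 * (mod_se / emp_se - 1)           # relative (%) error in ModSE
coverage <- mean(abs(est - truth) <= qnorm(0.975) * se)  # 95% CI coverage
```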

Simulation results
Bias is reported separately in Appendix I (Figures 1-3) for the risk difference, the logarithm of the risk ratio, and the logarithm of the odds ratio estimates, respectively; it was nearly zero across scenarios, as anticipated. When using the corrected variance estimator, the model-based standard error was always very close to the empirical standard error; the relative (%) error in the average model-based standard error was always very close to zero across scenarios (see Figures 6-8 in Appendix I). Hereafter, we report the coverage rate of the resulting 95% confidence intervals for the logarithm of the marginal risk ratio (for the risk difference and the logarithm of the marginal odds ratio, see Figures 4 and 5, respectively, in Appendix I).
The left-hand side of Figure 3 presents the coverage values for the uncorrected variance estimator, whereas the right-hand side displays those for the corrected variance estimator. Looking at the log risk ratio (Figure 3), overcoverage for the uncorrected estimator was prevalent in the majority of the examined scenarios, while the corrected estimator achieved the advertised nominal level of 95%. A similar pattern was observed for both the risk difference and the log odds ratio. However, for an assumed log risk ratio equal to 1.50, a treatment prevalence of 10%, and an event rate of 50%, we observed undercoverage for the uncorrected estimator when targeting (i) the ATT and (ii) the ATO (either via overlap or matching weights). For the remaining scenarios, the uncorrected variance estimator overestimated the variance. The relative (%) model-based standard error results align with the coverage results and are not shown here.

DISCUSSION
We have provided a unifying formula for the corrected variance estimator, which treats IPTW weights as a special case of a wider class often referred to as balancing weights.17 In addition, we have highlighted the risk of invalid inference when applying the uncorrected variance estimator with a PS modeled via logistic regression. The risks are especially prominent when one targets effects in the overall population, where overcoverage and conservative confidence intervals occur (Figure 3). Via an illustrative example examining the effect of PET-CT scans on receipt of surgery in NSCLC patients, we have shown that the corrected variance estimator can be applied for multiple estimands and effect measures in R, either manually or via the packages PSweight and geex. In our simulations, we have illustrated that in scenarios with large samples and very small treatment prevalence, the ATT and ATO variances may be underestimated by the uncorrected estimator, which leads to incorrect inference. When targeting the ATE, the difference between the corrected and uncorrected variance estimators (ie, the term accounting for PS estimation) is the negative of a quadratic form around a positive definite matrix (see Appendix I)8; therefore, the corrected variance estimate will always be lower than the uncorrected one. As our simulations demonstrate, this result does not extend to the other propensity score weighting schemes: for these, the correction term may be either positive or negative, leading to either under- or overestimation, respectively, when it is omitted.
We note that the proposed unifying formulas (Equations 35 and 37) should be applicable to any type of propensity score weighting scheme, as long as the weights are once-differentiable functions of the propensity score. While the matching weights are not differentiable at a PS value of 0.5, it can be proven that the matching weights estimator follows an asymptotically normal distribution, an assumption required for M-estimation to be applicable, but not necessary for linearization (see Appendix I and Reference 42). When applying techniques that ameliorate problems caused by extreme weights, such as trimming, this formula is no longer valid, so the use of weights that target the ATO is recommended as an alternative.

This work adds to the existing research on variance estimation when weighting by functions of the estimated propensity score. Lunceford and Davidian10 applied M-estimation37 to estimate the IPTW large-sample marginal variance of mean (or risk) differences for continuous (or binary) outcomes. Williamson et al expanded the closed-form variance estimator for ATE weights to the natural logarithms of the marginal risk ratio and the marginal odds ratio.8 Lastly, Hajage et al, motivated by Austin,43 proposed an estimator for the marginal hazard ratio, along with an R package for its implementation.44 The resulting formula (see Equation (11) in Appendix I) is an extension of Williamson et al's and coincides with the ATE-specific derivation. To our knowledge, the calculations may also be performed via the R packages geex45 or PSweight,46 as shown in Section 5. Briefly, geex automates the procedure of variance estimation via M-estimation, provided the user stacks the estimating equations required to estimate the corresponding quantity of interest (eg, for any weighting scheme, the estimating equation for the logistic PS model has the same general form). This package can handle any procedure that applies M-estimation, and hence may be used within a much wider scope than variance estimation for IPTW. PSweight also offers a wide range of options to the user, as it supports a variety of balancing weights, multiple treatments, augmented weighting estimators, and so forth.46 It also allows the application of data-adaptive methods to estimate the PS, such as generalized boosted models (GBM).29 However, our tutorial adds to the current functionality of PSweight, as the former allows the application of any link function, whereas the latter provides only the logit link. At the same time, the linearization method does not require explicit estimating equations, as geex does, which are not so simple with non-canonical links such as the probit.
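For orientation, a call along the following lines fits the overlap-weighted analysis of Section 5 with PSweight; treat this as a sketch based on our reading of the package interface (argument names and options should be checked against the package documentation), with `dat`, Z, Y, and X1-X6 as in Section 5:

```r
## Corrected variance via the PSweight package (logit link only).
library(PSweight)
fit <- PSweight(ps.formula = Z ~ X1 + X2 + X3 + X4 + X5 + X6,
                yname = "Y", data = dat, weight = "overlap")
summary(fit, type = "DIF")  # risk difference; "RR" and "OR" also available
```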
In our illustrative example, we used the non-parametric bootstrap47 as the "gold standard" approach for variance estimation. Bootstrap SEs were almost identical to the corrected ones across estimands and effect measures, as expected given the sufficiently large sample size. However, a recent study by Austin,48 comparing the performance of the bootstrap vs the corrected asymptotic variance estimator, showed that the bootstrap estimator gave more accurate estimates when the sample size was ≤1000, under several treatment prevalences for the ATE, and when the treatment prevalence was moderate to high for the ATT. For the overlap and matching weights, both methods performed equally well.
Comparing the corrected analytic estimator to the bootstrap was not the aim of this article; nonetheless, it is important to mention that, depending on the underlying scenario, the bootstrap may sometimes perform slightly better. The analytic estimator, like the bootstrap, is asymptotic, which means that its performance improves as the sample size increases (recall that in our simulations the sample size was 10 000, which allows the closed-form estimator to show its full potential). The analytic estimator may nonetheless be an alternative to the bootstrap in computationally demanding situations, such as very large databases. This last statement requires further research on how large is large enough for the closed-form estimator to achieve optimal performance (especially under extreme scenarios of very low or very high treatment prevalence). If multiple imputation (MI) is used to handle missingness in the data, combining MI with the bootstrap raises additional problems.49 Another situation that inflates both computational intensity and time is numerical simulation, making the use of the bootstrap within that framework less appealing. Lastly, ease of reproducibility is another aspect that makes the analytic estimator more attractive than the bootstrap, for which reproducibility requires not only the exact same data set but also the same seed applied at the start of the bootstrap procedure.44 The formulas derived here are based on large-sample theory; thus, results are valid if the sample size at each treatment level is sufficiently large. However, even for point estimation IPTW requires a large enough sample size, so if a researcher is working with small samples, IPTW may not be the appropriate method to begin with. The uncorrected variance estimator is invalid when fitting a logistic regression model for the PS, as demonstrated in our simulations. A topic for further research would be the evaluation of the performance of the corrected vs the uncorrected estimators across weighting schemes and different link functions for the PS. We have provided variance estimators for IPTW assuming the PS is modeled via a GLM; however, recently developed machine learning techniques are also available for PS prediction, and for these even the bootstrap may no longer be valid.50,51 Moreover, modified, smoothed versions of the IPTW estimators have recently been proposed and have been suggested to mitigate the large-variance problem, which can be even greater in the presence of positivity violations, extreme weight values, and so forth.52 Additionally, another category of weights reported relatively recently53 are the entropy weights. Although we have not applied the proposed formula to these, doing so should be straightforward. Lastly, regarding the complex settings mentioned in Section 3: for missing data problems (in confounders or in the outcome), IPTW can be used after multiple imputation. Assuming a correctly specified imputation model, which must include the outcome, the treatment effect estimates and variance estimates (applying the proposed variance estimator to account for the uncertainty in PS estimation in each imputed data set) can be combined into an overall estimate via Rubin's rules.54 For cases with competing risks, we must highlight that the target of inference is different and outside the scope of this study; however, an introduction to the relevant setting, estimands, and variance estimation can be found in Reference 55.
To summarize, the proposed analytic estimator is less computationally intensive than the bootstrap and has better statistical properties than the uncorrected estimator, with the latter likely to provide misleading inference. This tutorial is oriented toward applied researchers, and we have provided details on each step of the process of estimating the variance of the causal effect under study. Finally, we hope this work helps disseminate the use of the corrected variance estimator and improves statistical inference when applying IPTW.

FIGURE 1 Assumed directed acyclic graph (DAG): pre-treatment covariates acting as potential confounders.

FIGURE 2 Histograms of the estimated propensity scores (before weighting) among those not having (pink) and having (cyan) received a PET-CT scan in the overall population, n = 10 398.

TABLE 4 Point estimates and standard errors (SEs) for the different effect measures of PET-CT scan on the probability of receiving surgery.a
a 10 398 patients diagnosed with non-small cell lung cancer in England in 2012; 6898 received a PET-CT scan. b Does not account for the uncertainty in PS estimates. c Accounts for the uncertainty in PS estimates. d Accounts for the uncertainty in PS estimates; 1000 non-parametric bootstrap replications. e Average treatment effect. f Average treatment effect in the treated. g Average treatment effect in the overlap.
$\beta_L = \log(1.1)$, $\beta_M = \log(1.25)$, $\beta_H = \log(1.5)$, and $\beta_{VH} = \log(2)$ represent low, medium, high, and very high effects of covariates on treatment. Each characteristic was assumed to have the same effect on the outcome as on treatment. Therefore, a binary outcome was generated for each individual from a Bernoulli($p_Y$), with $p_Y$ as in Equation (46).

FIGURE 3 Coverage rate of 95% confidence intervals in simulation studies ($n_{sim}$ = 10 000) comparing the corrected vs the uncorrected variance estimator for the log risk ratio (RR) for four different propensity score (PS) weighting schemes: average treatment effect (ATE), average treatment effect in the treated (ATT), matching weights (MW), and overlap weights (OW). Coverage: proportion of times $\Gamma$ is enclosed in the confidence interval, calculated as $\hat{\Gamma}_i \pm Z_{1-\alpha/2} \times \widehat{SE}(\hat{\Gamma}_i)$, where $1 - \alpha$ is the confidence level and $Z_{1-\alpha/2}$ the corresponding quantile of the standard normal distribution.
a Matching weights; note that matching weights have a non-differentiable point at 0.5 (see "A weighting analogue to pair matching in propensity score analysis" and "Comments on 'A weighting analogue to pair matching in propensity score analysis' by L. Li and T. Greene"). b Average treatment effect in the treated. c Average treatment effect in the overlap. d Overlap weights.