Propensity score matching for estimating a marginal hazard ratio

Propensity score matching is commonly used to draw causal inference from observational survival data. However, its asymptotic properties have yet to be established, and variance estimation is still open to debate. We derive the statistical properties of the propensity score matching estimator of the marginal causal hazard ratio based on matching with replacement and a fixed number of matches. We also propose a double-resampling technique for variance estimation that takes into account the uncertainty due to propensity score estimation prior to matching.


Introduction
Survival analysis plays an increasingly important role in treatment effect estimation due to the frequent occurrence of time-to-event outcomes in biomedical studies. By comparing the hazard functions of survival times between treated and untreated individuals, the marginal hazard ratio is commonly used to measure the effect of treatment on a time-to-event outcome for a particular population of interest. The log of the marginal hazard ratio corresponds to the coefficient indexing a univariate Cox proportional hazards model, in which the hazard of the outcome is regressed on an indicator of treatment status (Cox, 1972).
In observational studies, propensity score (PS) methods are standard approaches for reducing the effect of confounding. However, depending on the specific type or implementation of the propensity score method, the population parameters or treatment effects being estimated may differ. In general, a measure of treatment effect can be classified as conditional or marginal. Conditional effects correspond to an average effect at the individual level, whereas marginal effects denote an average effect at the population level. Hazard ratios are non-collapsible measures, which means that the marginal and conditional hazard ratios generally do not coincide even in the absence of confounding (Martinussen and Vansteelandt, 2013). The non-collapsibility of hazard ratios therefore renders methods that estimate conditional treatment effects biased for estimating the marginal hazard ratio (Austin et al., 2007; Austin, 2013). For instance, stratification on the propensity score and covariate adjustment using the propensity score estimate conditional treatment effects, so they generally yield biased estimates of the marginal hazard ratio (Williamson et al., 2012; Vansteelandt and Daniel, 2014). The inverse probability weighted (IPW) estimator fits a marginal Cox model with each observation weighted by the reciprocal of its estimated probability of receiving the observed treatment. This estimator is consistent for the marginal hazard ratio if the propensity score model is correctly specified and the overlap assumption is satisfied. However, it is sensitive to slight misspecification of the PS model and can yield large variance when the estimated PS is close to 0 (Kang and Schafer, 2007; Busso et al., 2014; Austin and Stuart, 2017). To protect against misspecification of the propensity score model, the doubly robust augmented inverse probability weighted (AIPW) estimator has been proposed, which combines IPW with an outcome regression model (Tchetgen Tchetgen and Robins, 2012). It is consistent as long as either the outcome model or the propensity score model is correctly specified. However, the non-collapsibility of hazard ratios makes specifying a correct outcome model difficult, since it must marginalize to a survival curve that satisfies the marginal proportional hazards assumption. The literature also suggests that the finite-sample performance of AIPW can be unstable when both the PS and outcome models are misspecified, and that AIPW is also sensitive to extreme values of the estimated PS (Kang and Schafer, 2007; Waernbaum, 2012).
Alternatively, matching methods enjoy multiple desirable features. First, matching is transparent and has great intuitive appeal, as it seeks to emulate a completely randomized experiment using observational data (Dehejia and Wahba, 2002; Rubin, 2006; Stuart, 2010). Second, empirical evidence suggests that while matching on the PS and IPW do not uniformly outperform one another, PS matching tends to be more robust to misspecification of the PS model and to extreme values of the estimated PS (practical violation of the overlap assumption) (Frölich, 2004; Waernbaum, 2012; Busso et al., 2014; Austin and Cafri, 2020; Greifer and Stuart, 2021). Moreover, matching does not rely on an outcome model and thus avoids the aforementioned congeniality issue between the outcome model and the proportional hazards assumption that arises when applying AIPW. Simulation results have shown that greedy nearest-neighbor matching on the propensity score without replacement yields unbiased estimation of the marginal hazard ratio over the subpopulation of treated individuals (Austin, 2013). However, when the numbers of treated and control units are comparable, bias can arise due to incomplete matches. PS matching with replacement not only circumvents this issue but also permits estimation of the marginal treatment effect for the overall population containing both treated and untreated subjects (Austin and Cafri, 2020; Abadie and Imbens, 2006; Williamson et al., 2012).
While PS matching with replacement is potentially attractive for estimating the marginal hazard ratio, no formal asymptotic results have been established.For variance estimation, most existing approaches do not take into account the uncertainty of parametrically estimating the propensity score prior to matching and restrict inference to the matched sample (Gayat et al., 2012;Austin, 2013;Austin and Cafri, 2020).Moreover, although Austin and Small (2014) demonstrated that bootstrap is valid for matching without replacement, it is inappropriate for matching with replacement because the bootstrap sample fails to preserve the distribution for the number of times each individual is used as a match (Abadie and Imbens, 2008).Thus, it is important to develop a variance estimation procedure for the matching with replacement estimator of the marginal hazard ratio.
This article concerns propensity score matching with replacement and with a fixed number of matches for estimating the marginal hazard ratio for a point binary treatment given pre-treatment covariates. We will simply refer to this procedure as PSM or the PSM estimator. We note that PSM involves imputing the missing potential outcome processes, and is distinct from paired (one-to-one) matching without replacement, which could result in dropping units from the analysis (Abadie and Imbens, 2006).
In this article, we first derive the large-sample distribution of the PSM estimator based on known propensity scores. Second, because the propensity score function is often estimated prior to matching, we also derive the large-sample distribution of PSM accounting for the uncertainty due to nuisance parameter estimation. Since PSM is a nonsmooth functional of the distribution of the data, our derivation is based on the technique developed by Andreou and Werker (2012), which offers a general recipe for deriving the limiting distribution of nonsmooth statistics that involve estimated nuisance parameters. This technique was successfully applied by Abadie and Imbens (2016) to estimation of average treatment effects for a continuous outcome. Our derivation extends their results to the survival context. This extension is not trivial because the survival outcome is often right-censored. We utilize the martingale theory of counting processes to establish the asymptotic distributions of the PSM estimator of the marginal hazard ratio. In addition, we propose a replication-based variance estimator for PSM that accounts for the uncertainty of nuisance parameter estimation. For a continuous outcome, Adusumilli (2022) proposed a bootstrap inference procedure that improves upon the asymptotic variance estimator of Abadie and Imbens (2016). Our proposed method extends Adusumilli's technique to the survival context. Simulation results suggest that it generally outperforms Wald-type inference based on the asymptotic theory in finite samples.
The rest of this paper proceeds as follows. Section 2 introduces the notation, model, and assumptions for identification. Section 3 presents the PSM estimator, including the matching procedure and estimating equations. Section 4 establishes the main asymptotic properties of the PSM estimator. Section 5 proposes a resampling-based inference procedure. Section 6 uses simulation to evaluate the finite-sample properties of the estimators. Section 7 applies the new estimator to the IMS Health Oncology electronic medical record data to evaluate the effects of two treatments for non-small cell lung cancer. Section 8 concludes with potential extensions. The supporting information contains the technical details, proofs, and additional simulation results.

Potential outcomes and the causal PH model
We use the potential outcomes framework under the Stable Unit Treatment Value Assumption, and let T^(ω) and C^(ω) be the potential values of the survival time and censoring time had the individual received treatment ω (ω = 0, 1). We assume non-informative censoring, under which T^(ω) ⊥⊥ C^(ω), where ⊥⊥ means "independent of". This assumption is reasonable if the censoring times occur at the end of the study follow-up τ, the so-called administrative censoring.
We define λ_ω(t) = lim_{δ→0} δ^{-1} P(t ≤ T^(ω) < t + δ | T^(ω) ≥ t) as the causal hazard rate of failing at time t for a population of patients had they received treatment ω. Adopting the notation of Cox (1972) in the potential outcomes framework, we define U^(ω) = min(T^(ω), C^(ω)) as the time to a clinical event or censoring, N^(ω)(t) = I(U^(ω) ≤ t, Δ^(ω) = 1), with Δ^(ω) = I(T^(ω) ≤ C^(ω)), the counting process of the observed event, and Y^(ω)(t) = I(U^(ω) ≥ t) the at-risk process for a population of patients had they received treatment ω. Following Tchetgen Tchetgen and Robins (2012), we assume a causal PH model.

Definition (Causal PH Model). The structural model for comparing treatment ω and treatment 0 is

λ_ω(t) = λ_0(t) exp(β_0 ω),   (1)

where λ_0(t) is the population hazard rate if all individuals had treatment ω = 0.
The parameter β_0 is the log of the relative hazard of a clinical event if all individuals received treatment ω = 1 compared with all individuals receiving treatment ω = 0. Although focusing on the marginal hazard ratio is controversial in the literature, it remains a useful estimand that summarizes the overall average of the hazard ratios over a certain time period (Hernán et al., 2000). This explains its wide acceptance among clinical trial statisticians and regulatory agencies such as the U.S. Food and Drug Administration.

Observed data and identification assumptions
Let W_i be the binary treatment, and let T_i and C_i be the times to a clinical event and censoring, respectively, for individual i = 1, ..., n. We define U_i = min(T_i, C_i) as the time to a clinical event or censoring, N_i(t) = I(U_i ≤ t, Δ_i = 1), with Δ_i = I(T_i ≤ C_i), the observed-data counting process, and Y_i(t) = I(U_i ≥ t) the at-risk process. Suppose we observe a set of pre-treatment baseline covariates X_i ∈ R^d. Let e(X_i) = P(W_i = 1 | X_i) be the propensity score. The observed data are summarized as {O_i = (X_i, W_i, U_i, Δ_i) : i = 1, ..., n}. We assume that {O_i : i = 1, ..., n} are independent and identically distributed.
We make the consistency assumption that links the observed data processes with the potential outcome processes. In order to use the observed data to estimate the parameters in Model (1), we require the assumptions of unconfoundedness and positivity (Robins, 2004): W_i ⊥⊥ {T_i^(0), T_i^(1), C_i^(0), C_i^(1)} | X_i, and 0 < e(X_i) < 1 almost surely.

The PSM estimator of the marginal hazard ratio
Although it is evident that paired matching (propensity score matching without replacement) is unbiased for the marginal hazard ratio over the treated population, this approach does not target estimation of β_0, the marginal hazard ratio over the population containing both treated and untreated individuals. Unlike paired matching, the PSM (with replacement) estimator permits estimation of β_0 and is the focus of this article (Abadie and Imbens, 2006). Previous simulation studies have provided evidence of the unbiasedness of PSM for estimating β_0 (Austin and Cafri, 2020). In this section, we describe the PSM estimator to set the stage for our main results.
For the units in the sample, only one of the potential counting processes, N_i^(0)(t) and N_i^(1)(t), is observed; the other is unobserved or missing. The same is true for the potential at-risk processes. The construction of the PSM estimator first involves imputing all the missing potential outcome processes. To do so, for each unit i, it finds the M closest units in the opposite treatment group based on the Euclidean distance between the propensity scores. Note that M is fixed and does not vary between units. Define J_{M,i} as the set of indices of the M matches for unit i. For all i and ω, the PSM estimator imputes the missing counting processes and at-risk processes respectively as

N*_i^(ω)(t) = N_i(t) if W_i = ω, and M^{-1} Σ_{j ∈ J_{M,i}} N_j(t) otherwise,   (2)

Y*_i^(ω)(t) = Y_i(t) if W_i = ω, and M^{-1} Σ_{j ∈ J_{M,i}} Y_j(t) otherwise.   (3)

Let k_i = Σ_{j=1}^n I(i ∈ J_{M,j}) denote the number of times unit i is used as a match. Note that each unit can be used as a match more than once, so k_i can be larger than 1. In practice, the true propensity score is unknown. Following most of the empirical literature, we assume that the propensity score follows a generalized linear model, e(X_i^T θ_0) (Imbens and Rubin, 2015; Li et al., 2017; Austin and Stuart, 2017). The matching procedure can be carried out with the estimated propensity score e(X_i^T θ̂), where θ̂ is a consistent estimator of θ_0. We now write k_i as k_{θ̂,i} to reflect its dependence on θ̂.
Once the missing potential outcome processes are imputed, PSM fits a marginal structural model to the imputed dataset. Define Λ_0(t) = ∫_0^t λ_0(v) dv as the cumulative hazard function for ω = 0 at time t. The estimating functions for Λ_0(t), t ≥ 0, and β_0 are

Σ_{i=1}^n Σ_{ω=0}^1 {dN*_i^(ω)(t) − e^{βω} Y*_i^(ω)(t) dΛ_0(t)},   (4)

Σ_{i=1}^n Σ_{ω=0}^1 ∫_0^τ ω {dN*_i^(ω)(t) − e^{βω} Y*_i^(ω)(t) dΛ_0(t)},   (5)

where N*_i^(ω)(t) and Y*_i^(ω)(t) are defined in (2) and (3). We can equivalently write (4) and (5) using the observed outcome processes respectively as

Σ_{i=1}^n (1 + k_{θ̂,i}/M) {dN_i(t) − e^{βW_i} Y_i(t) dΛ_0(t)},   (6)

Σ_{i=1}^n (1 + k_{θ̂,i}/M) ∫_0^τ W_i {dN_i(t) − e^{βW_i} Y_i(t) dΛ_0(t)}.   (7)

Setting (6) equal to zero, we obtain the estimator of dΛ_0(t) for fixed β as

dΛ̂_0(t; β) = Σ_{i=1}^n (1 + k_{θ̂,i}/M) dN_i(t) / Σ_{i=1}^n (1 + k_{θ̂,i}/M) e^{βW_i} Y_i(t).   (8)

Substituting (8) into (7), we obtain the estimating equation to solve for β,

G_n(β) = Σ_{i=1}^n (1 + k_{θ̂,i}/M) ∫_0^τ {W_i − W̄(t; β)} dN_i(t) = 0,   (9)

where W̄(t; β) = Σ_{j=1}^n (1 + k_{θ̂,j}/M) W_j e^{βW_j} Y_j(t) / Σ_{j=1}^n (1 + k_{θ̂,j}/M) e^{βW_j} Y_j(t). Equation (9) is the partial score equation for a Cox PH model with W_i as the covariate and weights 1 + k_{θ̂,i}/M. Thus, the PSM estimator β̂ can be calculated by standard software. We conclude this section by summarizing the steps for calculating the PSM estimator of β_0 as follows.
Step 1. Fit a propensity score model e(X_i^T θ), and obtain an estimate θ̂.
Step 2. Based on the estimated propensity scores e(X_i^T θ̂), carry out the matching procedure described above. Record the number of times each unit is used as a match, k_{θ̂,i}.
Step 3. Obtain the PSM estimator β̂ of β_0 by solving (9) using standard software, i.e., by fitting a Cox PH model to the observed data with covariate W_i and weights 1 + k_{θ̂,i}/M.
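The three steps above can be sketched end to end in a short, self-contained function. The function below is a hypothetical illustration, not the paper's software: it fits a logistic propensity model by Newton-Raphson, matches each unit with replacement to its M nearest opposite-treatment neighbors on the estimated propensity score, and solves the weighted partial score equation with weights 1 + k_i/M by Newton's method. A production analysis would instead pass these weights to a standard Cox routine, as the text recommends.

```python
import numpy as np

def psm_marginal_log_hr(X, W, U, delta, M=1, iters=25):
    """Sketch of the PSM estimator of the marginal log hazard ratio."""
    n = len(W)
    # Step 1: logistic propensity model with intercept, fit by Newton-Raphson.
    Z = np.column_stack([np.ones(n), X])
    theta = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Z @ theta))
        theta += np.linalg.solve((Z * (p * (1 - p))[:, None]).T @ Z,
                                 Z.T @ (W - p))
    ps = 1.0 / (1.0 + np.exp(-Z @ theta))
    # Step 2: M nearest opposite-treatment matches (with replacement);
    # k[i] counts how many times unit i serves as a match.
    k = np.zeros(n)
    for i in range(n):
        opp = np.flatnonzero(W != W[i])
        k[opp[np.argsort(np.abs(ps[opp] - ps[i]))[:M]]] += 1.0
    w = 1.0 + k / M
    # Step 3: weighted Cox partial score with the single binary covariate W,
    # solved for beta by Newton's method (equivalent to a weighted Cox fit).
    beta = 0.0
    for _ in range(iters):
        score, info = 0.0, 0.0
        for i in np.flatnonzero(delta == 1):
            risk = U >= U[i]
            s0 = np.sum(w[risk] * np.exp(beta * W[risk]))
            s1 = np.sum(w[risk] * W[risk] * np.exp(beta * W[risk]))
            score += w[i] * (W[i] - s1 / s0)
            info += w[i] * (s1 / s0) * (1.0 - s1 / s0)  # W is 0/1, so S2 = S1
        beta += score / info
    return beta
```

The O(n^2) risk-set loop is kept for transparency; any weighted Cox implementation gives the same point estimate.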

Asymptotic results with known propensity scores
In this section, we establish the asymptotic normality of the PSM estimator of the marginal HR assuming the propensity scores are known. The results in this subsection apply to the rare situations in which the propensity scores are known, and they are useful for understanding the effect of estimating the propensity scores on the PSM estimator when compared with the results in the next subsection. Under the regularity conditions in Assumption S1, we show that there exists a neighborhood B of β_0 and a function Q(β, t) such that for all (β, t) ∈ B × [0, τ], Q̂(β, t) →_p Q(β, t) as n → ∞. We then define A(β), the derivative of the probability limit of n^{-1} G_n(β), in terms of Q(β, t). Because β̂ is the solution to the estimating equation G_n(β) = 0 in (9), the key step is to characterize the asymptotic properties of G_n(β_0). With a known θ_0, we can show that n^{-1/2} G_n(β_0) admits the asymptotically normal expansion in (12). Based on (12) and M-estimation theory, we derive the asymptotic results for β̂ as follows.
Theorem 1. Under Assumptions 1-3 and the regularity conditions in Assumption S1 presented in the supplementary material, with the known propensity score, n^{-1/2} G_n(β_0) →_d N(0, V_G), with V_G given in (14), and n^{1/2}(β̂ − β_0) →_d N(0, V_1), where V_1 = {A(β_0)}^{-1} V_G {A(β_0)}^{-1}.

Recall that the estimating function of the PSM estimator is n^{-1/2} G_n(β_0) as in (12), which targets estimation of E{Σ_{ω=0}^1 H_i(ω)}. Theorem 1 shows that this estimating function has the asymptotic variance V_G as in (14). From the standard semiparametric estimation literature, the semiparametric efficiency bound for the target parameter E{Σ_{ω=0}^1 H_i(ω)} is known (... and Robins, 2005). Thus, the PSM estimator does not attain the semiparametric efficiency bound. The asymptotic variance V_G in (14) approaches the efficiency bound as the number of matches M grows and when e(X) explains all the variation of H_i(ω) given X, i.e., µ_H(ω, X) = µ_H{ω, e(X)}.

Asymptotic results with estimated propensity scores
We now study the asymptotic properties of the PSM estimator with the estimated propensity score. The technique we use is based on Andreou and Werker (2012). The main idea is to apply Le Cam's third lemma to locally asymptotically normal models (Le Cam and Yang, 1990). Let P_θ be the distribution of {(X_i, W_i, U_i, Δ_i) : i = 1, ..., n} indexed by the parameter θ of the propensity score model. Let θ_n be contiguous to θ_0, and let P_{θ_n} be the distribution indexed by the local parameter θ_n. Under P_{θ_n}, denote the true parameter value by β_0(θ_n), the PSM estimator based on the true parameter θ_n by β̂(θ_n), and the log likelihood ratio of P_{θ_0} with respect to P_{θ_n} by Λ_n(θ_0 | θ_n). Assume that, under P_{θ_n}, the joint vector [n^{1/2}{β̂(θ_n) − β_0(θ_n)}, Λ_n(θ_0 | θ_n)]^T has a limiting normal distribution. Le Cam's third lemma then states that under P_{θ_0}, n^{1/2}{β̂(θ_n) − β_0(θ_n)} has a limiting normal distribution. Heuristically, replacing θ_n with θ̂, one can then approximate the asymptotic distribution of n^{1/2}(β̂ − β_0), as shown in Theorem 2.
Theorem 2. Under Assumptions 1-3 and the regularity conditions in Assumption S1 presented in the supplementary material, with a correctly specified propensity score model,

n^{1/2}(β̂ − β_0) →_d N(0, V_2), where V_2 = V_1 − {A(β_0)}^{-1} c^T I_{θ_0}^{-1} c {A(β_0)}^{-1},

where V_G is defined in Theorem 1, V_1 = {A(β_0)}^{-1} V_G {A(β_0)}^{-1}, I_{θ_0} is the Fisher information of θ_0, ė(X^T θ_0) = ∂e(X^T θ_0)/∂θ, and c is the adjustment vector defined through ė(X^T θ_0).

Comparing V_2 with V_1, the change in asymptotic variance after adjusting for estimation of the propensity score function is −{A(β_0)}^{-1} c^T I_{θ_0}^{-1} c {A(β_0)}^{-1}, which is either negative or zero. This reduction results from utilizing the available treatment assignment information in the data, which improves the efficiency of the propensity score matching estimator. Therefore, Theorem 2 shows that matching on the estimated propensity score generally improves the efficiency of the matching estimator relative to matching on the true propensity score, even when the latter is known. This phenomenon is in line with the setting of a continuous outcome (Abadie and Imbens, 2016). Theorem 2 motivates a variance estimator based on an approximation of the asymptotic variance formula (see the supplementary information).

Resampling-based inference
We propose a resampling variance estimator with better finite-sample performance. Abadie and Imbens (2008) demonstrated that the nonparametric bootstrap is invalid for the matching estimator based on matching with replacement and a fixed number of matches. Otsu and Rai (2017) suggested resampling the matching estimator directly based on its martingale residual terms, which only works for matching on the full vector of covariates. In order to reflect the uncertainty in the estimation of the propensity score, Adusumilli (2022) proposed a hybrid bootstrap procedure that re-assigns new values for the treatments and resamples the martingale residuals under both treatment conditions. This necessitates imputation of the unobserved martingale residuals under the opposite treatment. Extending this procedure to survival outcomes, we define the martingale residuals as

R̂_i = H_i − µ̂_H(W_i, e(X_i^T θ̂)),

where µ̂_H(ω, e(X_i^T θ̂)) is obtained by a nonparametric regression of H_i on e(X_i^T θ̂) among individuals with W_i = ω. For individual i, define the secondary nearest neighbor matching function m(ω, X_i), which returns the index of the nearest neighbor of unit i among units with treatment ω, where the nearest neighbor is determined based on the full X rather than on the propensity score. The pair of potential residuals for individual i is defined as R̂_i(ω) = R̂_i if W_i = ω, and R̂_i(ω) = R̂_{m(ω,X_i)} otherwise. We propose a double-resampling procedure as follows.
Step 0. Complete Steps 1-3 for obtaining the PSM estimator. For each individual i, compute the secondary nearest neighbor matching function m(ω, X_i) for ω = 0, 1. To impute k_i(ω), the number of times individual i would have been used as a match had it received treatment ω, we use the following strategy: create q_N quantile partitions based on e(X_i^T θ̂), identify the quantile partition into which individual i falls, randomly sample one individual, say l, from that partition with W_l = ω, and let k_i(ω) = k_{θ̂,l}.
Step 1. Generate new treatment values W*_i, i = 1, ..., n, independently from Bernoulli{e(X_i^T θ̂)}.
Step 2. Based on (W * i , X i ) n i=1 , re-fit the propensity score model and obtain the replicate θ * .
Step 3. Generate a sequence of independent and identically distributed random variables ξ_1, ..., ξ_n and form the bootstrap residuals ξ_i {1 + k_{θ̂,i}(W*_i)/M} R̂_i(W*_i), where k_{θ̂,i}(ω) is imputed at Step 0. Re-center the bootstrap residuals.
Step 4. Compute the replicate β̂* of β̂ from the re-centered bootstrap residuals.
Step 5. Repeat Steps 1-4 a large number B of times and denote the bth replicate by β̂*_b. We construct the 100(1 − α)% confidence interval for β_0 from the empirical quantiles of {β̂*_b : b = 1, ..., B}, which has the nominal coverage level asymptotically. The double-resampling procedure parallels the weighted bootstrap procedure of Adusumilli (2022). Thus, the validity of the double-resampling procedure is a straightforward adaptation of the proofs of Adusumilli (2022) to our setting under the regularity conditions given in Assumption S2. The interested reader is encouraged to consult the original article for details.
Remark 1. We provide some discussion of Step 0. Note that the secondary matching procedure matches on the full set of covariates rather than on the propensity scores. Doing so preserves the conditional covariance in (16) between X and the error terms r_{2i}, given the propensity scores. This term reflects the improvement from using the estimated propensity score; see Adusumilli (2022). We impute k_i(ω) by drawing a value from the empirical distribution of k_i(W_i) within the propensity score stratum, among units with W_i = ω, to re-create the distribution of k_i(W_i). The number of propensity score strata q_N is required to go to infinity as the sample size increases. In finite samples, we suggest using quintiles (q_N = 5) and recommend conducting sensitivity analysis with different choices of q_N. Our simulation study shows that the performance of the proposed double-resampling procedure is not sensitive to this choice.
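The quantile-stratum imputation of k_i(ω) described in Step 0 and this remark can be sketched as follows. The function name is our own, and the fallback to the whole treatment arm when a stratum contains no unit with the requested treatment is an illustrative choice, not part of the paper's procedure.

```python
import numpy as np

def impute_k(ps, W, k, qN=5, rng=None):
    """Impute k_i(omega), the number of times unit i would have been used
    as a match had it received treatment omega: form qN quantile strata of
    the estimated propensity score, then for each unit draw a k value from
    a same-stratum unit with the requested treatment."""
    rng = np.random.default_rng(rng)
    n = len(ps)
    edges = np.quantile(ps, np.linspace(0.0, 1.0, qN + 1))
    stratum = np.clip(np.searchsorted(edges, ps, side="right") - 1, 0, qN - 1)
    k_imp = np.zeros((n, 2))
    for i in range(n):
        for omega in (0, 1):
            if W[i] == omega:
                k_imp[i, omega] = k[i]  # observed arm: keep the actual k_i
            else:
                pool = np.flatnonzero((stratum == stratum[i]) & (W == omega))
                if pool.size == 0:  # assumed fallback: any unit with W = omega
                    pool = np.flatnonzero(W == omega)
                k_imp[i, omega] = k[rng.choice(pool)]
    return k_imp
```

In the double-resampling loop, k_imp[i, W*_i] would then supply the weight 1 + k_i(W*_i)/M for the bootstrap residuals.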
Simulation study

Overview
We conduct a simulation study to compare the performance of the PSM estimator with existing approaches for estimating the log marginal hazard ratio and to assess the performance of the proposed resampling-based variance estimator. Motivated by previous findings, we consider data-generating mechanisms that simulate varying levels of covariate overlap under the marginal proportional hazards assumption. This simulation design attempts to highlight the differences between the PSM estimator and other existing approaches by investigating the following hypotheses: (1) When overlap is poor or when the propensity score model is misspecified, the PSM estimator is anticipated to outperform the IPW and AIPW estimators because PSM is more robust to extreme values of the propensity scores (Kang and Schafer, 2007; Waernbaum, 2012; Busso et al., 2014). See the supplementary information for a discussion comparing PSM and AIPW from theoretical and practical points of view.
(2) Matching on high-dimensional covariates will result in noticeable bias because the search for close matches becomes increasingly difficult in higher dimensions (Abadie and Imbens, 2006; Yang et al., 2016; Zhao et al., 2022; Zhao, 2023).
Bias, empirical variance, the average of the variance estimates, and the coverage rate of nominal 95% confidence intervals are reported. The performance of the point estimators is evaluated using bias and empirical variance. Any deviation of the average variance estimates from the empirical variance reflects bias in variance estimation. The coverage rate of nominal 95% confidence intervals for the estimated log marginal hazard ratios measures the overall performance of the estimation procedures.

Data generating mechanism
We consider a sample size of n = 1000 throughout. For individuals i = 1, ..., n, we simulate the following data:
Step 1: a vector of baseline covariates X = (X_1, ..., X_6)^T, each independently generated from an exponential distribution with mean 1.
Step 2: the indicator of assignment to treatment W generated from the propensity score model logit{e(X)} = θ_0 + θ_1^T X, where we let (θ_0, θ_1^T) be (−3.0, 0.5 × 1^T_{6×1}), (−4.5, 0.75 × 1^T_{6×1}), and (−5.0, 1^T_{6×1}) to represent weak, moderate, and strong levels of confounding, leading to strong, moderate, and weak covariate overlap. The distribution of the true propensity scores is shown in Figure 1. We also examine a scenario with no confounding, i.e., each subject has a true propensity score of 0.5 (see the supplementary information).
Step 5: observed time U = min(T, C) and event indicator Δ = I(T < C), where C is generated from a uniform distribution with support [0, a] and a is set so that between 20% and 30% of individuals are censored.
In sum, we vary two factors across the simulation designs: covariate overlap (weak, moderate, strong, and very strong) and the true log marginal hazard ratio (0, −0.5, 0.5). This gives a total of 12 scenarios. For each scenario, we simulate 1000 datasets.
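Steps 1-2 of this design can be sketched in a few lines; this is an illustrative reproduction of the covariate and treatment steps only (the event-time and censoring steps are omitted), with the function name our own.

```python
import numpy as np

def simulate_design(n, theta0, theta1, rng=None):
    """Steps 1-2: six i.i.d. Exp(1) covariates and treatment assignment
    from the logistic propensity model logit{e(X)} = theta0 + theta1 * 1'X."""
    rng = np.random.default_rng(rng)
    X = rng.exponential(1.0, size=(n, 6))      # Step 1: baseline covariates
    lin = theta0 + theta1 * X.sum(axis=1)      # Step 2: linear predictor
    e = 1.0 / (1.0 + np.exp(-lin))             # true propensity score
    W = (rng.uniform(size=n) < e).astype(int)  # treatment indicator
    return X, e, W

# The three (theta0, theta1) settings from the text; heavier right tails of
# e(X) correspond to weaker covariate overlap between the treatment arms.
for t0, t1 in [(-3.0, 0.5), (-4.5, 0.75), (-5.0, 1.0)]:
    X, e, W = simulate_design(100_000, t0, t1, rng=0)
    print(f"theta0={t0}: treated fraction {W.mean():.2f}, "
          f"99th percentile of e(X) {np.quantile(e, 0.99):.3f}")
```

Inspecting the upper quantiles of e(X) across the three settings reproduces the overlap gradient that Figure 1 displays.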

Methods
We compare the following estimators of the log marginal hazard ratio: (i) β̂_nai, the unadjusted estimator obtained by fitting the Cox PH model with the treatment status as the only covariate, which serves as a benchmark for demonstrating the level of confounding bias; (ii) β̂_ipw, the IPW estimator obtained by fitting a weighted Cox PH model with the treatment status as the only covariate, where each observation is weighted by the inverse of the probability of receiving the actual treatment; (iii) β̂_aipw, the AIPW estimator, where the working outcome model is the Cox PH model with all baseline covariates and treatment status as covariates (Tchetgen Tchetgen and Robins, 2012); (iv) β̂_m.x, the estimator based on matching on all covariates, which works the same way as the PSM estimator, except that during the matching stage nearest neighbors are determined based on the Euclidean distance between vectors of covariates rather than between propensity scores; the number of matches is set to M = 1; (v) β̂_psm,0, the PSM estimator based on the true propensity score; the number of matches is set to M = 1; (vi) β̂_psm, the PSM estimator based on the estimated propensity score; two versions of the PSM estimator corresponding to two different choices for the number of matches (M = 1 and M = 5) are considered.
For variance estimation of β̂_nai, β̂_ipw, and β̂_m.x, we use the robust output from standard software. The nonparametric bootstrap is used for estimating the variance of β̂_aipw. For β̂_psm,0, we use the proposed asymptotic variance estimator without the adjustment term. We compare four variance estimators for β̂_psm: (i) the robust output from standard software, (ii) the proposed asymptotic variance estimator, (iii) the nonparametric bootstrap, and (iv) the proposed double-resampling procedure with q_N = 5.

Sensitivity analysis
For each simulation scenario above, we examine the effect of misspecifying the propensity score model on the three propensity-score-based approaches (β̂_ipw, β̂_aipw, and β̂_psm). We also investigate the sensitivity of the double-resampling approach to a different choice of the number of strata (q_N = 10).

Simulation study results
Figure 1 and Table 1 show the results when the true β_0 = 0 and the propensity score model is correctly specified. The results for the misspecified propensity score model and for β_0 = −0.5 and β_0 = 0.5 are presented in the supporting information. The unadjusted estimator β̂_nai has severe bias and thus fails to capture β_0; the bias worsens as the covariate overlap between the treatment groups becomes weaker. The matching estimator based on all covariates, β̂_m.x, has the second-largest bias after the naive estimator β̂_nai, confirming the theoretical result of Abadie and Imbens (2006) that matching on more than one covariate may lead to a biased matching estimator. The IPW estimator β̂_ipw and the AIPW estimator β̂_aipw greatly reduce the bias; however, as shown in Figure 1, they are unstable when the two treatment groups have weak overlap in propensity scores. Moreover, β̂_aipw as proposed by Tchetgen Tchetgen and Robins (2012) is doubly robust in the sense that its consistency relies only on the correct specification of either the propensity score model or a working outcome mean model given the covariates. In practice, however, specifying a correct outcome mean model that is compatible with the marginal structural model is difficult. The posited Cox regression model in the AIPW estimator is a convenient choice but is misspecified; thus, the AIPW estimator is no longer doubly robust. For all investigated simulation scenarios, β̂_psm significantly reduces the bias and is more stable than β̂_ipw and β̂_aipw when the two treatment groups have weak overlap in propensity scores. The PSM estimator with M = 5 yields a smaller variance than its counterpart with M = 1, because increasing the number of matches improves the efficiency of the PSM estimator. As a trade-off, when the sample size is small, a large number of matches can lead to a slight increase in bias. Moreover, the PSM estimators attain coverage rates around 95% using the proposed asymptotic variance estimation and double-resampling. When the two treatment groups have weak overlap, the asymptotic variance estimator tends to underestimate the variance and may result in negative values. In contrast, the double-resampling method recovers the 95% confidence level better than the asymptotic method in such cases. Compared with the proposed inference approaches, the robust method from standard software and the naive bootstrap method consistently overestimate the variances of β̂_psm,0 and β̂_psm. Thus, our proposed variance estimation approach is clearly beneficial for making reliable inference.

Note: "Var" is the variance of point estimates of β_0 across 1000 simulated datasets; "VE" is the average variance estimate for the point estimators over simulations, so VE minus Var reflects the bias in estimated variance; "CR" is the empirical coverage rate of 95% confidence intervals. The proposed PSM estimator β̂_psm is calculated with number of matches M = 1 and 5. Four types of variance estimates for β̂_psm were compared: "software", output from standard software; "asymp", the proposed asymptotic variance estimation; "naiveboot", the naive nonparametric bootstrap; "double-rsp", the proposed double-resampling method.
In the extended simulations (see the supporting information), we conduct a sensitivity analysis on a different choice of the number of strata, q_N, for the double-resampling approach. The results show that the estimated variances are similar, with differences of less than 10^{-3} across all scenarios, indicating that the variance estimator is insensitive to the choice of q_N. Moreover, when the propensity score model is misspecified, β̂_ipw, β̂_aipw, and β̂_psm become biased. Nonetheless, β̂_psm is still more robust than β̂_ipw and β̂_aipw. The AIPW estimator performs the worst among all estimators when the level of covariate overlap is medium or weak, consistent with the results in Kang and Schafer (2007) that the bias and variance of the AIPW estimator can increase dramatically when both the propensity score and outcome models are misspecified. In the scenario with no confounding, while β̂_psm with M = 5 has bias similar to β̂_ipw and β̂_aipw, the PSM estimators are generally less efficient than the weighting approaches. Among all the variance estimators of β̂_psm, the asymptotic variance estimator shows the smallest bias.

Figure 1 note: Naive is the β̂_nai estimator; IPW is the β̂_ipw estimator; AIPW is the β̂_aipw estimator; M.X is the β̂_m.x estimator; tPSM is the β̂_psm,0 estimator; ePSM1 is the β̂_psm estimator with M = 1; ePSM5 is the β̂_psm estimator with M = 5. Some AIPW point estimates with absolute value greater than 1.5 are excluded from the plot in the scenario where the level of covariate overlap is strong.

An application
Non-small cell lung cancer (NSCLC) is the most commonly diagnosed lung cancer. Typically, around half of the NSCLC patients receiving chemotherapy go on to receive additional treatment in the post-progression, i.e., second-line, setting, where "carboplatin + paclitaxel" and "erlotinib" are two historically common treatments; see Cui et al. (2018). In this section, we use the IMS Health Oncology electronic medical record (EMR) data to conduct a comparative effectiveness analysis of the two treatments in the second-line setting, applying the proposed PSM-based causal inference for the hazard ratio alongside the other existing estimators discussed in the previous section.
The EMR data are deidentified observational patient-level clinical data, with demographic and baseline clinical characteristics collected from medium and large community-based oncology practices across all 50 states of the USA. The dataset contains a retrospective cohort of 10,634 eligible patients, at least 18 years old, who received at least two lines of therapy between 1 January 2007 and 31 December 2014; see Cui et al. (2018) for details.
Overall survival was defined as the time from the start date of second-line therapy to the death date. A patient's death date was assumed to be the date of the last visit if a sufficient period had elapsed between the last visit and the end of the EMR data; patients for whom the time between the last visit and the end of the dataset was shorter than a sufficient period were censored at the date of the last visit. Here we define a sufficient period as at least twice the average visit interval in the 3 months prior to the last visit, given that patients normally make multiple visits to the clinic. Patients alive at the end of the study period were censored at the end date of the dataset. Missing values were assigned their own category for each categorical variable; see de Rooij (2018). Among the eligible patients in the dataset, 1241 were treated with "carboplatin + paclitaxel" and 895 received single-agent "erlotinib" as second-line therapy, while the remaining patients received one of three other therapies. To illustrate our approach, we subsetted the data to patients receiving "carboplatin + paclitaxel" or "erlotinib" and compared the treatment effects of the two therapies.
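The censoring rule above can be sketched as follows. The function and its inputs are hypothetical illustrations; the 3-month window and the factor of two follow the definition of a "sufficient period" in the text.

```python
from datetime import date

def classify_follow_up(visit_dates, data_end, window_days=90):
    """Classify a patient's last visit as a presumed death date or a
    censoring date: death is assumed if the gap between the last visit
    and the end of the EMR data is at least twice the average visit
    interval observed in the 3 months before the last visit."""
    visit_dates = sorted(visit_dates)
    last_visit = visit_dates[-1]
    # Visits within the 3-month window ending at the last visit.
    recent = [d for d in visit_dates if (last_visit - d).days <= window_days]
    if len(recent) < 2:
        # Too few visits to estimate an interval; censor conservatively.
        return "censored", last_visit
    gaps = [(b - a).days for a, b in zip(recent, recent[1:])]
    sufficient_period = 2 * sum(gaps) / len(gaps)
    if (data_end - last_visit).days >= sufficient_period:
        return "death", last_visit   # assume death at the last visit
    return "censored", last_visit    # censor at the last visit
```

For example, a patient with monthly visits through June 2014 who then disappears from a dataset ending in December 2014 would be classified as a presumed death at the June visit.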
The propensity scores were estimated using a logistic regression model with the following predictors: age at initiation of second-line therapy, gender, race, region, disease stage at initial diagnosis, Eastern Cooperative Oncology Group performance status score at initiation of second-line therapy, facility type (academic or community cancer center), year of index diagnosis, and days from index diagnosis to initiation of second-line therapy; see Cui et al. (2018) for details.
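A minimal sketch of this pipeline follows: fit the logistic propensity score model, then perform 1:M nearest-neighbor matching with replacement on the estimated scores. All names here are illustrative, and the Newton-Raphson fit stands in for whatever standard logistic regression routine is used in practice.

```python
import numpy as np

def estimate_ps(X, W, iters=50):
    """Estimate propensity scores by fitting a logistic regression
    model with Newton-Raphson (intercept included)."""
    Z = np.column_stack([np.ones(len(X)), X])
    theta = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-Z @ theta))
        grad = Z.T @ (W - p)                       # score vector
        hess = Z.T @ (Z * (p * (1 - p))[:, None])  # observed information
        theta += np.linalg.solve(hess, grad)
    return 1 / (1 + np.exp(-Z @ theta))

def match_with_replacement(ps, W, M=1):
    """For each unit, return the indices of its M nearest neighbors
    (by estimated propensity score) in the opposite treatment group.
    Matching is with replacement: a unit may serve as a match for
    several others."""
    matches = {}
    for i in range(len(ps)):
        pool = np.flatnonzero(W != W[i])           # opposite group
        order = np.argsort(np.abs(ps[pool] - ps[i]))
        matches[i] = pool[order[:M]]
    return matches
```

The matching step uses only the estimated scores and treatment labels, never the outcomes, mirroring the separation of design and analysis discussed later.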
Table 2 shows the estimated log hazard ratio β, together with point estimates and 95% confidence intervals for the hazard ratio of "carboplatin + paclitaxel" relative to "erlotinib", based on the unadjusted estimator β_nai, the IPW estimator β_ipw, and the matching estimator based on the full set of covariates β_m.x, with robust variance estimates from the standard software used to construct Wald confidence intervals. We also include the AIPW estimator proposed by Tchetgen Tchetgen and Robins (2012), where the working outcome model is a Cox PH model on the treatment indicator and the same set of covariates as the propensity score model. The variance of β_aipw is estimated by a naive bootstrap procedure. For the proposed inference based on the PSM estimator β_psm with M = 1 and 5 matches, we constructed both Wald confidence intervals based on the empirical asymptotic variance estimator and bootstrap percentile confidence intervals based on the proposed double-resampling method. Note: The proposed PSM estimator is calculated with M = 1 and 5 matches. For each PSM estimator, the variance is estimated by the proposed asymptotic variance estimator, "asymp", and the proposed double-resampling method, "double-rsp".
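The two types of interval reported in Table 2 can be assembled as below. Here `beta_hat`, its estimated asymptotic variance, and the bootstrap replicates are assumed to come from the fitting and double-resampling steps described above; this is a sketch of the interval construction only.

```python
import numpy as np

def wald_ci_hr(beta_hat, var_hat, level=0.95):
    """Wald CI for the hazard ratio: build the interval on the log
    scale, then exponentiate the endpoints."""
    z = 1.959963984540054  # 97.5% standard normal quantile
    half = z * np.sqrt(var_hat)
    return np.exp(beta_hat - half), np.exp(beta_hat + half)

def percentile_ci_hr(beta_boot, level=0.95):
    """Bootstrap percentile CI for the hazard ratio, using replicates
    of beta from the resampling procedure."""
    alpha = 1 - level
    lo, hi = np.quantile(beta_boot, [alpha / 2, 1 - alpha / 2])
    return np.exp(lo), np.exp(hi)
```

Exponentiating the endpoints keeps the interval valid for the hazard ratio because the exponential map is monotone.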
All adjusted methods give larger hazard ratio estimates than the unadjusted naive method. Although the point estimates of the PSM approach with M = 1 and M = 5 indicate that "carboplatin + paclitaxel" might be slightly more advantageous, the 95% confidence intervals from both inference approaches include 1, implying insufficient evidence at the 0.05 level for a difference in effectiveness. As in the simulation results, the PSM estimator with M = 5 provides a narrower interval than the one with M = 1 under the double-resampling approach. The 95% confidence intervals of the other adjusted approaches in Table 2, including IPW, AIPW, and matching on covariates, also contain 1, leading to the same conclusion as the PSM estimators.
8 Discussion and future studies
PSM is widely used in practice to handle confounding in observational studies. We establish the statistical properties of the PSM estimator of the marginal causal hazard ratio based on matching with replacement and a fixed number of matches. We also propose a double-resampling technique for variance estimation that accounts for the uncertainty due to propensity score estimation prior to matching. Existing simulation studies have indicated that, for estimation of the average treatment effect, IPW and AIPW can be unstable when extreme values of the estimated propensity scores are present, and that PSM or alternatives such as overlap weights should be considered instead. Our simulation results echo these findings in the survival context for estimation of the marginal hazard ratio.
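For reference, the overlap weights mentioned above take a simple closed form (Li, Morgan and Zaslavsky, 2018, a fact not derived in this article): each unit is weighted by the probability of receiving the opposite treatment, so extreme scores are smoothly down-weighted rather than inverted.

```python
import numpy as np

def overlap_weights(ps, W):
    """Overlap weights: treated units get 1 - e_hat, controls get
    e_hat. Units with scores near 0 or 1 receive small weights, so no
    inverse weight can explode."""
    return np.where(W == 1, 1.0 - ps, ps)
```

For instance, a treated unit with score 0.1 and a control with score 0.9 both receive weight 0.9, whereas their IPW weights would be 10 each.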
The theoretical results established in this article hold for any fixed number of matches M. We illustrated two different choices of M in the simulation study and real data analysis to show the validity of the resampling-based variance estimator. Intuitively, the choice of M affects the PSM estimator through a bias-variance trade-off. In practice, a small M is often recommended, since the increase in bias typically outweighs the reduction in variance (Abadie and Imbens, 2006; Austin, 2010). Following the cross-validation idea for the matching estimator with continuous outcomes proposed in Xu (2023), we leave the development of an optimal way of choosing M for the PSM estimator as a topic for future research.
In this work, we derived the asymptotic distribution of the PSM estimator of the marginal hazard ratio assuming a generalized linear model for the propensity score. Notably, the same proof technique can potentially be extended to cases where the propensity scores are estimated under semiparametric or nonparametric models (e.g., single- or multiple-index models; Huang and Chan (2017)). Take the semiparametric single-index model as an example. The key insight is that PSM does not rely on the exact functional form of the propensity score model, but only on a sufficient dimension reduction of the mean space of A_i given X_i. We have e(X_i) = e(X_i^T θ_0), where θ_0 ∈ R^p is a vector of unknown parameters and e(·) is left unspecified, avoiding a restrictive parametric model assumption. Although the link function does not permit a root-n consistent estimator, the index X^T θ does admit a root-n consistent estimator, denoted X^T θ̂ (Huang and Chan, 2017). It then suffices to implement PSM based on X^T θ̂. More specifically, in the proof of Theorem 2, under a parametric propensity score model, P_θ0 in (15) is naturally the probability measure governed by the likelihood function of θ_0. Under the single- or multiple-index propensity score model, following Andreou and Werker (2012), we can formulate a working model for P_θ0 based on the asymptotic distribution of X^T θ̂. The remaining steps of the proof would stay the same, and the inferential framework carries over. When other nonparametric machine learning models are assumed for the propensity score, the same procedure for implementing the PSM estimator can still be applied, and we leave the development of inferential frameworks for those cases to future work.
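Under the single-index model, only the score used in the nearest-neighbor search changes: one matches on the estimated index X^T θ̂ rather than on a fitted link. A minimal sketch, assuming θ̂ is supplied by a single-index estimator such as that of Huang and Chan (2017):

```python
import numpy as np

def match_on_index(X, W, theta_hat, M=1):
    """Nearest-neighbor matching with replacement on the estimated
    single index X @ theta_hat; the link e(.) is never evaluated."""
    index = X @ theta_hat
    matches = {}
    for i in range(len(index)):
        pool = np.flatnonzero(W != W[i])   # opposite treatment group
        order = np.argsort(np.abs(index[pool] - index[i]))
        matches[i] = pool[order[:M]]
    return matches
```

Because any strictly monotone link preserves the ordering of distances along the index, matching on X^T θ̂ recovers the same neighbors as matching on the (unknown) propensity scores would, up to estimation error in θ̂.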
It is important to draw connections between clinical trials and observational studies from both the design and the analysis perspectives, which highlights the advantages of PSM and motivates several future research directions. As in clinical trial designs, the matching step uses only the covariate and treatment information and does not touch the outcome data; it therefore mitigates the possibility of data snooping and dredging. As discussed in the introduction, PSM alone emulates a completely randomized trial, which may not be the most efficient design. Stratified block randomization is often used to improve on complete randomization in clinical trials, so we can expect that combining stratification with PSM would likewise improve on PSM alone in observational studies. Moreover, in trial data analysis, instead of a simple analysis of the outcome data, ANCOVA can be used to borrow strength from auxiliary covariates. It is thus important to continue developing more efficient analysis methods, such as general M- and Z-estimators (van der Vaart, 1998), for matched observational data. We leave these topics to future research.
This work focuses on estimating the causal hazard ratio, a scalar estimand that summarizes the treatment effect over a certain period of time. There are many other treatment effect estimands for survival outcomes, such as the restricted mean survival time, the restricted mean time lost, the difference in median survival times, and so on (Yang et al., 2020). In the context of survival analyses, Chen and Tsiatis (2001) proposed a regression model approach to estimate the average causal effect on restricted mean survival times. Xie and Liu (2005) developed adjusted Kaplan-Meier estimators of treatment-specific survival functions using IPW. Zhang and Schaubel (2012) combined the regression method of Chen and Tsiatis (2001) with the inverse probability weighted Nelson-Aalen estimator, yielding a doubly robust estimator of the average causal effect on restricted mean survival times. Díaz (2019) proposed data-adaptive doubly robust estimators of treatment-specific survival functions. In future work, we will develop matching estimators of general causal estimands for survival outcomes and compare them with existing approaches.

APPENDIX A Comparison between AIPW and PSM for estimating the marginal hazard ratio
The AIPW estimator is also a popular approach for estimating the average treatment effect for a continuous outcome and has been adapted to estimate the marginal hazard ratio in the survival context (Tchetgen Tchetgen and Robins, 2012). The AIPW estimator is consistent as long as either the propensity score model or the outcome model is correctly specified. In comparison, PSM does not rely on the outcome model and is consistent if the propensity score model is correctly specified. The AIPW estimator is semiparametrically efficient when both models are correctly specified; by Theorem 1, the PSM estimator based on a correctly specified propensity score model is asymptotically less efficient than AIPW, since it does not in general attain the semiparametric efficiency bound. However, as shown in our simulation study, when the overlap of the covariate distributions between the two treatment groups is poor, PSM can yield more stable marginal hazard ratio estimates than AIPW in finite samples. This phenomenon is in line with findings from earlier numerical studies involving survival or continuous outcomes (Busso et al., 2014; Austin and Stuart, 2017). It arises in part because AIPW weights the outcomes by the inverse of the estimated propensity scores, whereas PSM reuses the original outcome values regardless of how extreme the propensity scores are.
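The mechanism can be seen in a toy numeric comparison (the scores below are purely illustrative): a single estimated propensity score near 0 inflates an inverse-probability weight without bound, while matching simply reuses a nearby observed outcome.

```python
import numpy as np

# Hypothetical estimated propensity scores for five treated units;
# the last one is extreme.
ps_treated = np.array([0.45, 0.50, 0.40, 0.55, 0.002])

# Inverse-probability weighting assigns each treated unit weight
# 1 / e_hat: the extreme unit receives weight 500 and dominates the
# weighted analysis, inflating the variance.
ipw_weights = 1.0 / ps_treated
imbalance = ipw_weights.max() / np.median(ipw_weights)

# Matching, by contrast, pairs the extreme unit with the control whose
# score is closest; every unit's contribution stays on the original
# outcome scale, however extreme its score.
```

Here `imbalance` is in the hundreds, so a single unit carries more weight than the rest of the sample combined.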
Trimming offers a general remedy for violation of the overlap assumption by excluding units with extreme propensity scores from the analysis (Crump et al., 2009; Yang and Ding, 2018). However, trimming methods necessarily alter the target estimand and restrict inference to regions of the covariate space with sufficient overlap. PSM is more resistant to violation of the overlap assumption because, even for units with extreme propensity scores, the missing potential outcomes can still be imputed with minimal bias as long as close matches exist in the opposite treatment group. Therefore, trimming is not needed for PSM except perhaps when the matching discrepancies become noticeably large.
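A common trimming rule keeps only units whose estimated scores lie inside a fixed band; the default cutoff of 0.1 below follows the rule of thumb associated with Crump et al. (2009), and any choice of cutoff changes the estimand accordingly.

```python
import numpy as np

def trim_sample(ps, alpha=0.1):
    """Return the indices of units retained after trimming: those with
    alpha <= e_hat <= 1 - alpha. Inference then applies only to this
    overlap region of the covariate space."""
    return np.flatnonzero((ps >= alpha) & (ps <= 1 - alpha))
```

The retained index set defines the new target population, which is exactly the estimand-altering effect noted above.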

B Asymptotic variance estimation
In this section, we discuss estimation of the large-sample variance of β̂ adjusting for the first-step estimation of the propensity score. Recall the decomposition of the asymptotic variance given above; we estimate each component on the right-hand side separately. We first estimate the Fisher information I_θ0. Let m_k{ω, e(X_i^T θ̂)} denote the index of the k-th nearest neighbor matched to unit i based on the estimated propensity scores. For estimation of V_G, we first create an imputed dataset for ω = 0, 1. Next, we construct an estimator of c by averaging over the sample. For estimation of the conditional covariance, we follow the same matching procedure to create an imputed dataset, where {t_1, ..., t_K} are the distinct observed time points. Putting everything together yields our final estimator of the asymptotic variance.

C Regularity conditions and Lemmas
In this section, we provide the regularity conditions and lemmas. For simplicity, we introduce additional notation. Let N ≡ {θ : ||θ − θ_0|| < ϵ} be a neighborhood of θ_0 for a given ϵ > 0. We use a generalized linear specification for the propensity score, e(x) = e(X^T θ), where e(·) is a link function. Assumption 4. The following regularity conditions hold: (i) θ_0 ∈ int(Θ) with Θ compact, X has bounded support, and E[X^T X] is non-singular; (ii) e(·) : R → (0, 1) is twice continuously differentiable with strictly bounded first and second derivatives ė(·) and ë(·), where ė(·) is strictly positive; (iii) for all θ ∈ N, the random variable e(X^T θ) is continuously distributed with interval support, and its pdf g_θ(·) is uniformly Lipschitz continuous over θ; (iv) there exists a component of X that is continuously distributed, has a nonzero coefficient in θ_0, and has a continuous density function conditional on the rest of X; (v) for all θ ∈ N and ω = 0, 1, μ_H(ω, X) and σ²_H(ω, X) are Lipschitz continuous in p with Lipschitz constants independent of θ; (viii) there exists τ > 0 such that ∫_0^τ λ_0(t)dt < ∞; (ix) there exists a neighborhood B of β_0 and functions s_0(β, t) and s_1(β, t) such that (x) the functions s_0(β, t) and s_1(β, t) are bounded away from 0 on B × [0, τ]. The regularity conditions (i)-(vi) are adopted from Abadie and Imbens (2016) and Adusumilli (2022), modified for survival outcomes. The regularity conditions (vii)-(xi) are standard in the survival analysis literature and are often assumed for technical convenience. We note that since β is a scalar and ω only takes values in {0, 1}, the first and second derivatives of S_n(β, t) with respect to β coincide; therefore, it suffices to impose the regularity conditions on the first derivative in (vii)-(xi).
Assumption 5 is adopted from Adusumilli (2022) to ensure the validity of the double-resampling bootstrap estimator.
Lemma 1 below is Lemma S.11 in Abadie and Imbens (2016) and is useful in the proofs of our results.
Lemma 1. Suppose that (W_1, X_1), ..., (W_n, X_n) are independent and identically distributed, where X_i is a scalar continuous variable with bounded support. Suppose also that σ²_H(ω, x) is uniformly bounded over the support of X. Let n_ω = Σ_{i=1}^n I(W_i = ω) be the number of individuals who received treatment ω, let p* = pr(W = 1) > 0, and let f_ω(X) be the density function of the scalar continuous variable X when W = ω. Then, for a given ω,

D Proof of the asymptotic unbiasedness of Equation (12)
This section includes three parts that follow similar lines of proof. The first and second parts provide results that are useful later. The proof of the asymptotic unbiasedness of n^{-1} G_n(β_0) is given in the third part. From standard counting process theory, dM^(ω)(t) is a martingale with respect to the population, and its cumulative baseline hazard is Λ_0(t). Next we prove the martingale property for the imputed pseudo-population, which means that the imputed pseudo-population has a covariate distribution similar to that of the target population. First, we show that (18) holds for ω = 0, 1 as n → ∞. We show (18) for ω = 1; the proof for ω = 0 is similar and therefore omitted. We express (18) for ω = 1 as a sum of three terms, where the second line follows from the consistency assumption. Under Assumption 4 (i)-(vi), Abadie and Imbens (2006) showed that k_i^δ is bounded almost surely for any δ > 0, and the matching discrepancy satisfies |M^{-1} Σ_{j∈J_M(1, e(X_i))} e(X_j) − e(X_i)| = O_p(n^{-1}) for a scalar e(X). It follows that T_1 converges to zero in probability. Moreover, under Assumption 4 (vi), T_2 converges to zero in probability. Lastly, by the strong law of large numbers, T_3 converges to zero in probability. Therefore, (18) follows.

E Proof for Theorem 1
A Taylor expansion of G_n(β̂) = 0 around β_0 leads to an expansion in which β̃ lies on the line segment between β̂ and β_0. Under Assumption 3 (ix), n^{-1} dG_n(β)/dβ > 0 for β ∈ B. It then remains to derive the asymptotic distribution of n^{-1/2} G_n(β_0). Theorem 3. Suppose Assumptions 1-3 and Assumption 4 hold and X is a continuous scalar variable. Then, the stated convergence in distribution holds as n → ∞. Proof. We show that n^{-1/2} G_n(β_0) can be expressed as a sum of n independent and identically distributed random vectors plus a term that converges in probability to zero. By some algebra, and continuing with (20), we obtain (27), which follows from (25). Under Assumptions S1 (viii), (ix) and (xi), there exists a function Q(β_0, t) such that the required convergence holds in probability as n → ∞. Therefore, (28) follows, where H_i(ω) is defined in (11). Moreover, the last step follows from the martingale property of the potential outcome process.
We write the decomposition accordingly. Similar to (19), we have the corresponding expression for ω = 0, 1. Consider the σ-fields defined above; based on Lemma 1, as n → ∞, and applying the central limit theorem for martingale arrays, (23) follows.

F Proof for Theorem 2
Theorem 4. Suppose that Assumptions 1-3 and Assumption 4 hold. Suppose that e(X) follows a logistic regression model e(X^T θ) with true parameter value θ_0. Let θ̂ be the maximum likelihood estimator of θ, and let I_θ0 be the Fisher information matrix. Then, matching on the estimated propensity score e(X^T θ̂), the stated result holds. Let L(θ | Z_1, ..., Z_n) denote the log-likelihood function of θ. Following Abadie and Imbens (2016), we use the local experiment argument. Let θ_n = θ_0 + n^{-1/2} h, and let P_θn be the data distribution under e(X^T θ_n). We shall show the stated convergence in distribution under P_θn as n → ∞. Then, by Le Cam's third lemma (Le Cam and Yang, 1990), the corresponding convergence holds under P_θn as n → ∞. We also note the joint convergence in distribution under P_θn as n → ∞. Thus, under P_θn, (34) follows. This completes the proof of Theorem 2.

G.1 A data-generating algorithm
We describe the algorithm for generating T(0) and T(1) congenial with Model (1) as follows.
By Algorithm 1, T(1) given X follows the distribution displayed above; under our parameter specification, this takes an explicit form.
G.2 Illustration of propensity score distributions with weak, medium, and strong covariate overlap
This section illustrates the simulation settings with weak, medium, and strong covariate overlap between the two treatment groups. Figure 2 shows density curves for the true propensity scores of the two treatment groups, presented as dashed lines for W = 0 and solid lines for W = 1.

Figure 1 :
Figure 1: Simulation results of various point estimators of β_0 from 1000 Monte Carlo simulated datasets. Naive is the β_nai estimator; IPW is the β_ipw estimator; AIPW is the β_aipw estimator; M.X is the β_m.x estimator; tPSM is the β_psm,0 estimator; ePSM1 is the β_psm estimator with M = 1; ePSM5 is the β_psm estimator with M = 5; some AIPW point estimates with absolute value greater than 1.5 are excluded from the plot in the scenario where the level of covariate overlap is strong.

Figure 2 :
Figure 2: Plots of the true propensity score distributions for treatment groups W = 0 and W = 1 under strong, medium, and weak covariate overlap.