Greedy caliper propensity score matching can yield variable estimates of the treatment‐outcome association—A simulation study

Abstract Purpose Greedy caliper propensity score (PS) matching is dependent on randomness, which can ultimately affect causal estimates. We sought to investigate the variation introduced by this randomness. Methods Based on a literature search to define the simulation parameters, we simulated 36 cohorts of different sizes, treatment prevalence, outcome prevalence, treatment‐outcome‐association. We performed 1:1 caliper and nearest neighbor (NN) caliper PS‐matching and repeated this 1000 times in the same cohort, before calculating the treatment‐outcome association. Results Repeating caliper and NN caliper matching in the same cohort yielded large variations in effect estimates, in all 36 scenarios, with both types of matching. The largest variation was found in smaller cohorts, where the odds ratio (OR) ranged from 0.53 to 10.00 (IQR of ORs: 1.11‐1.67). The 95% confidence interval was not consistently overlapping a neutral association after repeating the matching with both algorithms. We confirmed these findings in a noninterventional example study. Conclusion Caliper PS‐matching can yield highly variable estimates of the treatment‐outcome association if the analysis is repeated.


| BACKGROUND
In observational research, treatment allocation is not random, but allocated by the treating physician. Therefore, patient characteristics will likely influence the physician's decision to give a patient a certain treatment, or not 1 . Adjusting for these characteristics can decrease this bias, and the propensity score (PS) is often used for this purpose 2 .
Besides using the PS for adjustment, weighting, or stratification 3 , using the PS for matching is a popular way to achieve cohorts with comparable baseline characteristics 4,5 .
Greedy caliper matching is a popular method used in PS matching 6 .
This method orders the treated subjects, and the first treated subject is randomly matched to an untreated (or alternatively treated) subject with a PS that is within a predefined caliper width. The initial ordering of subjects is often done randomly but may also be based on a subject's PS or other parameters. In addition to caliper matching, nearest neighbor (NN) caliper matching is often used, where the treated subject is matched to an untreated subject that has the closest propensity score within the caliper. Both methods do not consider that the untreated subject can potentially form a better pair with another treated subject that is further down the line; hence they are "greedy" algorithms. Because of this, both methods are dependent on the random order in which the treated subjects are placed, if patients are not ordered based on their PS. In addition, caliper matching is also dependent on which untreated patient within the caliper is randomly matched.
Most statistical programs use a pseudo-random ordering, which can allow for the random ordering to be reproduced if the same random seed is used. However, how much the matching, and ultimately the estimated treatment effect, can differ with a different random seed, is unknown. We evaluated the extent to which observational studies analyzed using greedy caliper PS matching with random ordering and greedy NN caliper PS matching are susceptible to variable results due to the randomness in the matching.

| METHODS
The study consisted of three parts. First, we conducted a review of matching procedures used in epidemiologic studies to identify realistic scenarios for a simulation study. Second, we repeatedly applied PS matching in several simulated cohorts. Third, we sought to replicate the findings in a real observational study of drug effectiveness.

| Literature search
We performed a literature search to find realistic parameters for our simulation. In PubMed, we searched for "propensity score" AND (([match] OR matched) OR matching), filtering core clinical journals as defined by PubMed. The search was performed on August 22, 2019. We selected the 50 most recently published pharmacoepidemiology studies using PS matching, and 50 studies that were not pharmacoepidemiology, as defined by two independent reviewers (J.K. and A.T.). From these articles, we identified the matching algorithm that was used, which statistical program was used, the sample size, the treatment prevalence, the outcome prevalence, and the strength of the association between treatment and outcome. These parameters were used to determine the parameters of the simulation study.
For the simulation of the cohorts, we used a 2-step process to define covariates. First, we created 8 variables (X 1 Á Á ÁX 8 ): 6 binary variables (X 1 Á Á ÁX 6 ) and 2 continuous variables (X 7 , X 8 ). X 1 through X 6 were randomly drawn from a binomial distribution and had a prevalence of 0.2 and both X 7 and X 8 were drawn from a normal distribution and had a mean of 0 and a variance of 0.5 unit. All covariates were independent of each other. Based on these variables, we defined the probability of treatment T using a logistic model, and then simulated T from these probabilities: Finally, we simulated outcome Y based on the probability of Y given all eight variables and the treatment T, using a logistic model: The range of values used in the models in different scenarios is presented in Table 1. The parameter values α 0 and β 0 were chosen to result in the desired prevalence for T of 0.2 and 0.5 and for Y of 0.1 and 0.5.

| Propensity score matching
In all 36 generated cohorts, we applied greedy caliper matching, with and without using NN. First, in all 36 cohorts, we used logistic regression to calculate the probability for the treatment based on the

Key Points
• Greedy caliper propensity score matching is the most frequently used matching algorithm.
• Repeating greedy caliper propensity score matching in the same cohort yielded high variation in the treatmentoutcome association.
• The confidence interval of the treatment-outcome association did not consistently overlap a neutral association after repeating the matching procedure in the same cohort.
• This variation was largest in cohorts with lower outcome prevalence, in which propensity score matching is frequently used.

| Real-life dataset
We used the Stockholm Healthcare database for confirmation of our findings from the simulation dataset in a real-life setting. The database has been described elsewhere 8 . In short, the database contains demographic information for all Stockholm residents (n = 2.3 million), ATCcodes for dispensed drugs, and ICD-10 codes for inpatient and outpatient diagnoses from primary and secondary care. Patients were censored when they emigrated, died, or suffered from an outcome. The outcome of interest was a composite of an ischemic stroke, unspecified stroke or transient ischemic attack (TIA), registered as an ICD-10 code in a hospital setting and requiring acute care, as was done previously 9 .
We used a PS matched intention-to-treat analysis to assess the association of NOACs versus VKA and the risk for the composite endpoint. The propensity score was the probability of receiving a NOAC compared to a VKA, calculated using logistic regression. In the logistic regression model, we used the components of the CHA 2 Ds 2 -VASC score (age, sex, heart failure, hypertension, prior stroke/TIA/ embolism, vascular disease, and diabetes), registered in the 5 years prior to index date 10 .
We used both 1:1 caliper matching and 1:1 NN caliper matching with a caliper width of 0.2 SD ps or 0.01 SD ps without replacement.
We replicated the matching procedure 1000 times with a different random seed for each repetition. In each matched set, we used a stratified Cox proportional hazards model for matched pairs to assess the association of NOAC vs VKA with the risk for the composite outcome.

| Literature search
We assessed 100 articles. Of the 72 articles mentioning the kind of matching algorithm used, 51 used nearest neighbor matching (32 with a caliper), 17 used caliper matching, two used 5:1-digit matching, one used optimal matching, and one used kernel matching. SAS was mentioned in 32 articles, R in 25, SPSS in 17, and STATA in 14. The MatchIt package in R was the most frequently mentioned package (n = 13) but most often no package, macro, or program was mentioned at all (n = 79).

| Simulation study
Repeating the PS matching 1000 times with a different random seed yielded wide variation in the OR for the association of treatment and outcome, especially in caliper matching and less in NN caliper matching (see Figure 1 and Table 2

| Real life dataset
In line with the simulations, the largest variation was visible in the smallest cohort (n = 1594) after caliper matching with a median HR of 0.94 (IQR: 0.82-1.06), ranging from 0.52 (CI: 0.27-1.01) to 2.43 (CI: 1.01-5.86). The variation was smaller in the large cohort and after NN caliper matching (see Figure 2 and Table 5). Again, the 95% confidence was nonconsistently overlapping 1 after repeating the matches. Note: OR = odds ratio; T 50% = treatment prevalence 50%; O 50% = outcome prevalence 50%; % sign low/high risk = percentage of the 1000 matched sets with a significantly increased or decreased risk. and it could be those papers actually performed caliper matching without NN, with the risk of high variability. As the statistical package is not mentioned in most articles (n = 79), it is not possible to determine how the matching procedure took place. We recommend better reporting of matching procedures used, including which statistical software, as this can ultimately affect the results of a study, and is necessary for study replication.

| DISCUSSION
In addition, we found that approximately 50% of the studies were conducted in a cohort with a sample size of 2500 or less. In our simulation study, we showed that in these sample sizes the treatment effects are largely influenced by the selected random seed, which in practice is not often specified or reported. In addition, whether the 95% CI overlapped 1 was inconsistent with different random seeds.

DATA AVAILABILITY STATEMENT
Data from the Stockholm Healthcare Database is not allowed to be transferred.
The code for the simulation study and for the modifications in the MatchIt package is available with the authors on request.

ETHICS STATEMENT
The study was approved by the Regional Ethical Review Board in Note: The calliper was set at 0.01 of the SD of the propensity score. The % sign low and sign high risk state in how many of the successfully matched cohorts the treatment-outcome associations was statistically different from 1. The % unsuccessful matches is in how many instances there was at least one of the SMDs above 0.1. Abbreviation: OR, odds ratio. Note: The calliper was set at 0.01 of the SD of the propensity score. The % sign low and sign high risk state in how many of the successfully matched cohorts the treatment-outcome associations was statistically different from 1. The % unsuccessful matches is in how many instances there was at least one of the SMDs above 0.1. Abbreviation: OR, odds ratio. T 20%, O 10% 1.14 (1.14-1.14) (1.14-1.14) 0.0% 0.0% 125 100.0% Note: The calliper was set at 0.01 of the SD of the propensity score. The % sign low and sign high risk state in how many of the successfully matched cohorts the treatment-outcome associations was statistically different from 1. The % unsuccessful matches is in how many instances there was at least one of the SMDs above 0.1. Abbreviation: OR, odds ratio. Note: The calliper was set at 0.2 of the SD of the propensity score. The % sign low and sign high risk state in how many of the successfully matched cohorts the treatment-outcome associations was statistically different from 1. Abbreviation: OR, odds ratio.
T A B L E A 2 b Results of repeated calliper and nearest neighbor calliper propensity score matching in cohorts with a simulated odds ratio of 1.0 where the SMD for all covariates was ≤0. Note: The calliper was set at 0.2 of the SD of the propensity score. The % sign low and sign high risk state in how many of the successfully matched cohorts the treatment-outcome associations was statistically different from 1. Abbreviation: OR, odds ratio.