#### Monte Carlo simulations–methods

We based the design of our Monte Carlo simulations on a prior study that examined the performance of different caliper widths for use with greedy nearest neighbor caliper matching [11]. As in the prior study, we assumed that there were 10 covariates (X_{1} to X_{10}) that affected either treatment selection or the outcome. The treatment-selection model was logit(*p*_{i,treat}) = *α*_{0,treat} + *α*_{L}*x*_{1,i} + *α*_{L}*x*_{2,i} + *α*_{M}*x*_{4,i} + *α*_{M}*x*_{5,i} + *α*_{H}*x*_{7,i} + *α*_{H}*x*_{8,i} + *α*_{VH}*x*_{10,i}. For each subject, we generated treatment status (denoted by *z*) from a Bernoulli distribution with parameter p_{i,treat}. For each subject, we generated both a continuous and a dichotomous outcome. We generated the continuous outcome using the following model: *y*_{i} = *β*_{0} + *z*_{i} + *α*_{L}*x*_{2,i} + *α*_{L}*x*_{3,i} + *α*_{M}*x*_{5,i} + *α*_{M}*x*_{6,i} + *α*_{H}*x*_{8,i} + *α*_{H}*x*_{9,i} + *α*_{VH}*x*_{10,i} + *ϵ*_{i}, where *ϵ*_{i} ∼ *N*(0,*σ* = 3). Thus, treatment increased the mean response by one unit (i.e., the ATT was 1). For each subject, we also generated a dichotomous outcome using the following logistic model: logit(*p*_{i,outcome}) = *β*_{0,outcome} + *β*_{treat}*z*_{i} + *α*_{L}*x*_{2,i} + *α*_{L}*x*_{3,i} + *α*_{M}*x*_{5,i} + *α*_{M}*x*_{6,i} + *α*_{H}*x*_{8,i} + *α*_{H}*x*_{9,i} + *α*_{VH}*x*_{10,i}. We then generated a binary outcome for each subject from a Bernoulli distribution with parameter p_{i,outcome}. We selected the intercept, β_{0,outcome}, in the logistic outcome model so that the incidence of the outcome would be approximately 0.10 if all subjects in the population were untreated. This is approximately equal to the proportion of patients hospitalized with an AMI who are readmitted within 1 year (11%) and to the 30-day mortality rate for hospitalized AMI patients (12%) [13, 14].
In a given simulated dataset, we simulated a binary outcome for each subject, under the assumption that all subjects were untreated (z = 0). We then calculated the incidence of the outcome in the simulated dataset. We used a bisection approach to determine the value of β_{0,outcome} that would result in an incidence of 0.10. We set the regression coefficients α_{L}, α_{M}, α_{H}, and α_{VH} to log(1.25), log(1.5), log(1.75), and log(2), respectively. Thus, there were two covariates that had a weak effect on each of treatment selection and outcomes, two covariates that had a moderate effect on each of treatment selection and outcomes, two covariates that had a strong effect on each of treatment selection and outcomes, and one covariate that had a very strong effect on both treatment selection and outcomes. We selected the conditional log-odds ratio, β_{treat}, using methods described elsewhere so that the average absolute risk reduction in treated subjects due to treatment would be 0.02 [15] (i.e., the true ATT was −0.02). Briefly, for a given value of β_{treat}, we determined the probability of the occurrence of the outcome for each subject twice: first, under the assumption that the subject was untreated; second, under the assumption that the subject was treated. The subject-specific risk difference was the difference between these two probabilities. We then determined the average subject-specific risk difference across all subjects who ultimately received the treatment (because we are focusing on the ATT). We used an iterative process to determine the value of β_{treat} that would result in the desired risk difference (−0.02). We performed this iterative process prior to the main body of simulations. Thus, we used the same value of β_{treat} in all 1000 simulated datasets in a given scenario. Because we were simulating data with a desired ATT, the value of β_{treat} would depend on the proportion of subjects that were treated.
Note that this approach allows for variation in subject-specific treatment effects (or differences in risk). We used a logistic model to simulate data with an underlying average treatment effect in the treated because such an approach will guarantee that individual probabilities of the occurrence of the outcome will lie within the unit interval. Although the use of a linear model to generate probabilities would result in a linear treatment effect on the probability scale, such an approach could result in subjects whose probabilities of the occurrence of the outcome lie outside of the unit interval (and thus violate the definition of a probability). The use of an appropriate link function will constrain predicted probabilities to lie within the unit interval, regardless of the values taken by the baseline covariates. The adopted approach results in a uniform effect of treatment on the relative odds scale, whereas the absolute effect of treatment may vary across subjects. However, this may be reflective of many clinical scenarios, because the absolute risk reduction may be greater for subjects at a greater baseline risk of the event compared with subjects with a lower baseline risk of the event. Despite potential non-uniformity of the absolute risk reduction, clinical investigators are still interested in estimating the average risk difference over the population (or over the treated subjects).
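The data-generating process described above can be sketched in a few lines of Python. This is an illustrative sketch, not the code used in the study. The following are assumptions that do not come from the text: independent standard normal covariates, a calibration sample of 20,000 subjects, a treatment prevalence of 0.20, *β*_{0} = 0 in the continuous outcome model, the seed, and all function names. As a simplification, the intercepts are calibrated on mean model probabilities rather than on simulated binary outcomes, and bisection is also used for α_{0,treat} and β_{treat}, for which the text does not name a specific root-finding method.

```python
import math
import random

# Regression coefficients as stated in the text.
A_L, A_M, A_H, A_VH = (math.log(v) for v in (1.25, 1.5, 1.75, 2.0))

def expit(v):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-v))

def treatment_lp(x):
    """Covariate part of the treatment-selection model (x is 0-indexed)."""
    return (A_L * (x[0] + x[1]) + A_M * (x[3] + x[4])
            + A_H * (x[6] + x[7]) + A_VH * x[9])

def outcome_lp(x):
    """Covariate part shared by the continuous and binary outcome models."""
    return (A_L * (x[1] + x[2]) + A_M * (x[4] + x[5])
            + A_H * (x[7] + x[8]) + A_VH * x[9])

def bisect(f, lo, hi, tol=1e-8):
    """Root of an increasing function f that changes sign on [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def mean(values):
    return sum(values) / len(values)

rng = random.Random(2014)  # arbitrary seed (assumption)
X = [[rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(20000)]
lp_t = [treatment_lp(x) for x in X]
lp_y = [outcome_lp(x) for x in X]

# Intercepts: 20% treated (assumed level) and 10% incidence if nobody is treated.
alpha0_treat = bisect(lambda a: mean([expit(a + v) for v in lp_t]) - 0.20, -10.0, 10.0)
beta0_outcome = bisect(lambda b: mean([expit(b + v) for v in lp_y]) - 0.10, -10.0, 10.0)

# Treatment status from a Bernoulli distribution with parameter p_{i,treat}.
z = [1 if rng.random() < expit(alpha0_treat + v) else 0 for v in lp_t]
lp_y_treated = [v for v, zi in zip(lp_y, z) if zi == 1]

def att(beta_treat):
    """Average subject-specific risk difference among treated subjects."""
    return mean([expit(beta0_outcome + beta_treat + v) - expit(beta0_outcome + v)
                 for v in lp_y_treated])

# Conditional log-odds ratio that yields a true ATT of -0.02.
beta_treat = bisect(lambda b: att(b) + 0.02, -5.0, 0.0)

# Outcomes: continuous (beta_0 = 0 assumed, sigma = 3) and binary.
y_cont = [zi + v + rng.gauss(0.0, 3.0) for zi, v in zip(z, lp_y)]
y_bin = [1 if rng.random() < expit(beta0_outcome + beta_treat * zi + v) else 0
         for zi, v in zip(z, lp_y)]
```

Because `att` averages over treated subjects only, rerunning the calibration with a different treatment prevalence gives a different `beta_treat`, which is the dependence noted in the text.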

Our Monte Carlo simulations had a complete factorial design in which the following two factors were allowed to vary: (i) the distribution of the 10 baseline covariates; and (ii) the proportion of subjects who received the treatment. We considered five different distributions for the 10 baseline covariates: (a) the 10 covariates had independent standard normal distributions; (b) the 10 covariates were from a multivariate normal distribution in which each variable had mean zero and unit variance and the pair-wise correlation between variables was 0.25; (c) the first five variables were independent Bernoulli random variables, each with parameter 0.5, whereas the second five variables were independent standard normal random variables; (d) the 10 random variables were independent Bernoulli random variables, each with parameter 0.5; and (e) the 10 random variables were correlated Bernoulli random variables, obtained by generating 10 continuous variables as in scenario (b) and then dichotomizing each continuous variable at the population mean (zero). In a prior study, Austin [11] used the first four of these scenarios (a–d), whereas the fifth scenario was added in the current study. In a clinical context, the continuous variables can represent variables such as demographic characteristics (age, years of education, or income), vital signs (systolic and diastolic blood pressure, heart rate, and respiratory rate), and results of laboratory testing (e.g., hemoglobin, lipid levels, and creatinine). The dichotomous variables can represent demographic characteristics (sex) or the presence or absence of risk factors and co-existing illnesses (e.g., diabetes, hypertension, and kidney disease). For the second factor, we considered five different levels for the proportion of subjects that were treated: 0.05, 0.10, 0.20, 0.25, and 0.33. We modified the value of *α*_{0,treat} in the treatment-selection model to obtain the desired prevalence of treatment in the simulated datasets.
We thus considered 25 different scenarios: five different distributions for the baseline covariates times five levels of the proportion of subjects who were treated (0.05, 0.10, 0.20, 0.25, and 0.33).
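The five covariate distributions can be generated along the following lines. This is a hedged sketch: the function name `draw_covariates` and the scenario labels passed as strings are hypothetical, and the correlated normal variables are produced with a shared-factor construction, which gives unit variances and a common pairwise correlation of 0.25 but is not necessarily how the covariates were generated in the study.

```python
import math
import random

def draw_covariates(scenario, n, rng, rho=0.25):
    """Return n rows of 10 covariates for scenario 'a', 'b', 'c', 'd' or 'e'."""

    def normal_row(k):
        return [rng.gauss(0.0, 1.0) for _ in range(k)]

    def correlated_normal_row():
        # Shared-factor construction (assumption): each variable is
        # sqrt(rho) * W + sqrt(1 - rho) * E_j, so Var = 1 and Corr = rho.
        shared = rng.gauss(0.0, 1.0)
        return [math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
                for _ in range(10)]

    def bernoulli_row(k):
        return [1.0 if rng.random() < 0.5 else 0.0 for _ in range(k)]

    rows = []
    for _ in range(n):
        if scenario == "a":      # independent standard normal
            row = normal_row(10)
        elif scenario == "b":    # correlated normal, pairwise correlation rho
            row = correlated_normal_row()
        elif scenario == "c":    # five Bernoulli(0.5), then five standard normal
            row = bernoulli_row(5) + normal_row(5)
        elif scenario == "d":    # independent Bernoulli(0.5)
            row = bernoulli_row(10)
        elif scenario == "e":    # scenario (b) dichotomized at zero
            row = [1.0 if v > 0.0 else 0.0 for v in correlated_normal_row()]
        else:
            raise ValueError(f"unknown scenario: {scenario!r}")
        rows.append(row)
    return rows
```

Crossing the five scenario labels with the five treatment prevalences (0.05, 0.10, 0.20, 0.25, and 0.33) reproduces the 25 cells of the factorial design.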

In each of the 25 scenarios, we simulated 1000 datasets, each consisting of 1000 subjects. The decision to use simulated datasets of size 1000 was made for two reasons. First, matching algorithms can be computationally intensive. Because 12 different matching algorithms were being examined, the decision was made to use datasets of moderate size. Second, in a systematic review of the use of propensity score methods in the medical literature, we observed that these methods have been used in datasets of size less than 1000 [16]. In each simulated dataset, we estimated the propensity score using a logistic regression model to regress treatment assignment on the seven variables that affect the outcome. We selected this approach as it has been shown to result in superior performance compared with including all measured covariates or those variables that affect treatment selection [17]. In practice, one can use the existing literature and clinical or subject-matter knowledge and expertise to identify those variables that affect the outcome. It is likely that this set of variables will be relatively consistent across regions or jurisdictions, whereas the set of variables that affect treatment selection may vary between regions or jurisdictions. In each simulated dataset, we used 12 different matching algorithms to match treated subjects to untreated subjects: optimal matching (on the propensity score and on the logit of the propensity score), greedy nearest neighbor matching (high to low), greedy nearest neighbor matching (low to high), greedy nearest neighbor matching (closest distance), greedy nearest neighbor matching (random order), caliper matching (low to high), caliper matching (high to low), caliper matching (closest distance), caliper matching (random order), nearest neighbor matching (with replacement), and caliper matching (with replacement). 
For the nearest neighbor matching algorithms, we matched subjects on the propensity score, whereas in the caliper matching algorithms, we matched subjects on the logit of the propensity score using a caliper of width equal to 0.2 of the standard deviation of the logit of the propensity score [11]. Thus, in each simulated dataset, we formed 12 matched sets.
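To make the greedy algorithms concrete, the following is a minimal sketch of 1:1 greedy nearest neighbor matching without replacement, with an optional caliper. The function name `greedy_match`, its arguments, and the tie-breaking rule (lowest control index) are assumptions made for illustration. Only three of the four orderings are covered (low to high, high to low, and random); the closest-distance ordering, optimal matching, and matching with replacement are not shown.

```python
import random

def greedy_match(score_treated, score_control, order="random", caliper=None, rng=None):
    """Greedy 1:1 nearest neighbor matching without replacement.

    Returns a list of (treated_index, control_index) pairs in the order in
    which they were formed. With a caliper, a treated subject whose nearest
    available control is farther away than the caliper is left unmatched.
    """
    sequence = list(range(len(score_treated)))
    if order == "low_to_high":
        sequence.sort(key=lambda t: score_treated[t])
    elif order == "high_to_low":
        sequence.sort(key=lambda t: -score_treated[t])
    elif order == "random":
        (rng or random).shuffle(sequence)
    else:
        raise ValueError(f"unknown order: {order!r}")

    available = set(range(len(score_control)))
    pairs = []
    for t in sequence:
        if not available:
            break
        # Nearest available control; ties broken by index (assumption).
        c = min(available, key=lambda j: (abs(score_control[j] - score_treated[t]), j))
        if caliper is not None and abs(score_control[c] - score_treated[t]) > caliper:
            continue
        pairs.append((t, c))
        available.remove(c)
    return pairs

# Toy scores (made up for illustration).
treated = [0.30, 0.50, 0.80]
control = [0.10, 0.32, 0.45, 0.52]
pairs_low = greedy_match(treated, control, order="low_to_high")
pairs_high = greedy_match(treated, control, order="high_to_low")
pairs_caliper = greedy_match(treated, control, order="low_to_high", caliper=0.1)
```

In this toy example the two orderings produce different pairs for the second and third treated subjects, and the caliper leaves the treated subject with score 0.80 unmatched. For caliper matching as described in the text, the scores passed in would be logits of the propensity score and `caliper` would be 0.2 times the standard deviation of those logits.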

In each matched set, we estimated the treatment effect as the difference between the mean of the observed outcome in treated subjects in the matched sample and the mean of the observed outcome in untreated subjects in the matched sample: $\frac{1}{N}\sum_{i=1}^{N} Y_{1,i} - \frac{1}{N}\sum_{i=1}^{N} Y_{0,i}$, where Y_{1,i} and Y_{0,i} denote the outcome for the *i*th treated subject and *i*th untreated subject in the matched sample, respectively (and where the matched sample consists of N matched pairs). Thus, we estimated both a difference in means (continuous outcome) and a risk difference (binary outcome) in the propensity-score matched sample. This estimator removes the effect of confounding because the distribution of baseline covariates is expected to be the same in treated and untreated subjects in the matched sample [1]. We also computed the absolute standardized difference comparing the distribution of each of the 10 baseline covariates between treatment groups in each of the matched samples [18, 19]. For continuous variables, the standardized difference is defined as $d = \frac{\bar{x}_{\mathrm{treated}} - \bar{x}_{\mathrm{untreated}}}{\sqrt{(s^{2}_{\mathrm{treated}} + s^{2}_{\mathrm{untreated}})/2}}$, where $\bar{x}_{\mathrm{treated}}$ and $\bar{x}_{\mathrm{untreated}}$ denote the sample mean of the covariate in treated and untreated subjects, respectively, whereas $s^{2}_{\mathrm{treated}}$ and $s^{2}_{\mathrm{untreated}}$ denote the sample variance of the covariate in treated and untreated subjects, respectively. For dichotomous variables, the standardized difference is defined as $d = \frac{\hat{p}_{\mathrm{treated}} - \hat{p}_{\mathrm{untreated}}}{\sqrt{(\hat{p}_{\mathrm{treated}}(1 - \hat{p}_{\mathrm{treated}}) + \hat{p}_{\mathrm{untreated}}(1 - \hat{p}_{\mathrm{untreated}}))/2}}$, where $\hat{p}_{\mathrm{treated}}$ and $\hat{p}_{\mathrm{untreated}}$ denote the prevalence or mean of the dichotomous variable in treated and untreated subjects, respectively. For caliper matching, we determined the mean percentage of treated subjects that were matched to an untreated subject (for the other matching algorithms, 100% of treated subjects will be matched to an untreated subject because there is no restriction on the maximum difference in the propensity scores for a matched pair).
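The matched-sample estimator and the balance diagnostics can be written compactly. The sketch below assumes the conventional forms of the standardized difference, with the average of the two group variances in the denominator; the function names are hypothetical and the toy inputs are made up.

```python
import math
from statistics import mean, variance

def matched_effect(y_treated, y_control):
    """Difference in mean outcome between matched treated and untreated subjects."""
    return mean(y_treated) - mean(y_control)

def std_diff_continuous(x_treated, x_control):
    """Standardized difference for a continuous covariate (sample variances)."""
    pooled = (variance(x_treated) + variance(x_control)) / 2.0
    return (mean(x_treated) - mean(x_control)) / math.sqrt(pooled)

def std_diff_binary(x_treated, x_control):
    """Standardized difference for a 0/1 covariate, based on prevalences."""
    p1, p0 = mean(x_treated), mean(x_control)
    pooled = (p1 * (1.0 - p1) + p0 * (1.0 - p0)) / 2.0
    return (p1 - p0) / math.sqrt(pooled)

# Toy matched sample (made-up values).
effect = matched_effect([3.0, 5.0], [2.0, 2.0])
d_cont = abs(std_diff_continuous([1.0, 2.0, 3.0], [0.0, 1.0, 2.0]))
d_bin = abs(std_diff_binary([1, 1, 0, 0], [1, 0, 0, 0]))
```

The same `matched_effect` function yields a difference in means when given the continuous outcome and a risk difference when given the 0/1 outcome.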

Let θ denote the true treatment effect (1 and −0.02 for continuous and binary outcomes, respectively), and let θ_{i} denote the estimated treatment effect in the *i*th simulated sample (*i* = 1, … ,1000). We estimated the mean estimated treatment effect as $\frac{1}{1000}\sum_{i=1}^{1000} \theta_{i}$ and the MSE as $\frac{1}{1000}\sum_{i=1}^{1000} (\theta_{i} - \theta)^{2}$. For each of the 10 baseline covariates, we estimated the mean absolute standardized difference across the 1000 simulated datasets.
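These two performance measures amount to a mean and a mean squared deviation from the true effect over the simulated datasets. A minimal sketch follows, with a hypothetical function name and made-up estimates:

```python
from statistics import mean

def summarize(estimates, truth):
    """Mean estimated effect and MSE across simulated datasets."""
    average = mean(estimates)
    mse = mean((e - truth) ** 2 for e in estimates)
    return average, mse

# Three made-up estimates of a true effect of 1.
average, mse = summarize([0.9, 1.1, 1.3], truth=1.0)
```

Here `average` is about 1.1 and `mse` is about 0.0367, that is, (0.01 + 0.01 + 0.09) / 3.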

#### Monte Carlo simulations–results

To provide a context for the estimated treatment effects obtained using different matching algorithms, we examined the mean estimated crude or unadjusted treatment effects across the 25 different scenarios. The minimum and maximum crude treatment effects for the continuous outcome were 1.20 (percent bias: 20%) and 3.46 (percent bias: 246%). The first and third quartiles were 1.53 (53%) and 1.88 (88%), respectively. The minimum and maximum crude treatment effects for the binary outcome were 0 (percent bias: −100%) and 0.238 (percent bias: −1292%). The first and third quartiles were 0.026 (−232%) and 0.062 (−412%), respectively. These summary statistics allow one to examine the degree to which bias was reduced by using different matching algorithms.
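The percent bias figures quoted above are consistent with the usual definition, 100 × (estimate − truth) / truth. This definition is inferred from the reported numbers rather than stated in the text, and the function name is hypothetical:

```python
def percent_bias(estimate, truth):
    """Relative bias of an estimate, in percent of the true effect."""
    return 100.0 * (estimate - truth) / truth

bias_cont_min = percent_bias(1.20, 1.0)    # continuous outcome, minimum crude effect
bias_cont_max = percent_bias(3.46, 1.0)    # continuous outcome, maximum crude effect
bias_bin_min = percent_bias(0.0, -0.02)    # binary outcome, minimum crude effect
bias_bin_max = percent_bias(0.238, -0.02)  # binary outcome, maximum crude effect
```

With the rounded inputs shown, the last value comes out at about −1290% rather than the reported −1292%, presumably because the reported figure was computed from unrounded estimates. The negative sign arises because the true risk difference is negative while the crude estimates are zero or positive.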

The results for optimal matching on the propensity score were the same as those for optimal matching on the logit of the propensity score. In order to simplify the presentation of our results, we do not present results for the latter algorithm. Optimal matching, greedy nearest neighbor matching without replacement, and greedy nearest neighbor matching with replacement result, by design, in 100% of treated subjects being included in the matched sample. For the different caliper matching algorithms, the average percentage of treated subjects matched to an untreated subject in each of the 25 scenarios is reported in Figure A1 in the Supporting information. For each of the five sets of distributions for the baseline covariates, the percentage of treated subjects successfully matched to an untreated subject was highest when caliper matching with replacement was used. The rank ordering of the four caliper methods that used matching without replacement was (from highest percentage of matched subjects to lowest percentage) selecting treated subjects from highest to lowest propensity score, selecting treated subjects at random, sequentially selecting treated subjects from best to worst match, and sequentially selecting treated subjects from lowest to highest propensity score. The differences between caliper matching (highest to lowest) without replacement and the three other methods that used matching without replacement tended to increase as the prevalence of treatment increased (i.e., these differences were smallest when 5% of subjects were treated and greatest when 33% of subjects were treated).

The mean within-pair difference in the propensity score for the different matching algorithms is reported in Figure A2 in the Supporting information. Greedy nearest neighbor matching (lowest to highest) tended to result in mean differences in the propensity score that were greater than those from all other matching methods. Optimal matching tended to result in performance similar to that of three of the methods based on nearest neighbor matching without replacement (high to low, random, and closest distance). As would be expected, matching with replacement resulted in matched samples with the lowest mean within-pair difference in the propensity score. Caliper matching without replacement tended to have a performance between that of the nearest neighbor matching without replacement algorithms and the methods that used matching with replacement.

We report the mean estimated linear treatment effects in Figure 1 (continuous outcome) and Figure 2 (binary outcome). Recall that the true treatment effects were 1 and − 0.02, respectively. A horizontal line has been added to each panel denoting the magnitude of the true treatment effect. In general, optimal matching and nearest neighbor matching without replacement tended to have similar performance. Bias tended to be less with nearest neighbor caliper matching and matching with replacement. Amongst the methods that used caliper matching without replacement, bias tended to be less when treated subjects were selected in a random order or sequentially in the order of best match first.

We report the standard deviations of the estimated treatment effects across the 1000 simulated datasets for each scenario in Figure 3 (continuous outcome) and Figure A3 in the Supporting information (binary outcome). Matching with replacement tended to result in estimates that displayed greater variability than the methods based on matching without replacement. When at least some of the covariates were normally distributed and the outcome was continuous, optimal matching and the four implementations of nearest neighbor matching without replacement tended to result in estimates that displayed slightly less variability than the methods based on caliper matching without replacement.

We report the MSE of the estimated treatment effects in Figure 4 (continuous outcome) and Figure 5 (binary outcome). Matching with replacement tended to result in estimates with greater MSE compared with methods based on matching without replacement. The four different nearest neighbor caliper matching algorithms that used matching without replacement tended to have very similar performance to one another. In some settings, they had very similar performance to optimal matching and to nearest neighbor matching without replacement. However, in a minority of scenarios, they performed substantially better (i.e., had lower MSE) than optimal matching and nearest neighbor matching without replacement.

We report the mean absolute standardized differences for the 10 covariates under the different matching algorithms in Figures A4–A8 in the Supporting information for the five different sets of distributions for the baseline covariates. Several observations merit comment. First, in a few of the scenarios, matching with replacement resulted in slightly greater imbalance in measured baseline covariates compared with matching without replacement. This finding at first appears counterintuitive, because matching with replacement matches each treated subject to the nearest untreated subject. Thus, one would expect better balance on baseline covariates compared with the competing approaches. However, this finding is a result of how balance is assessed. Matching with replacement will most likely result in the same untreated subject being included multiple times in the matched sample. Thus, when the variance of the baseline covariate is estimated in treated and untreated subjects, the inclusion of the same untreated subject in multiple matched pairs will reduce the variability of the baseline covariate in the matched untreated subjects. This will result in an inflation of the standardized difference, because the pooled variance of the baseline covariate forms the denominator of the standardized difference. Second, when at least some of the baseline covariates were normally distributed, caliper matching without replacement tended to result in improved balance compared with methods based on nearest neighbor matching without replacement. The differences between these two sets of algorithms increased as the proportion of subjects who were treated increased. Third, optimal matching and the different implementations of nearest neighbor matching without replacement tended to result in the same balance in baseline covariates across the different scenarios.
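The first observation, that reusing untreated subjects can inflate the standardized difference, can be illustrated with a small made-up example. All numbers and names below are hypothetical. The two untreated samples have the same mean, so the numerator of the standardized difference is identical; only the variance of the untreated sample differs.

```python
import math
from statistics import mean, variance

def std_diff(x_treated, x_control):
    """Standardized difference with the averaged variance in the denominator."""
    pooled = (variance(x_treated) + variance(x_control)) / 2.0
    return (mean(x_treated) - mean(x_control)) / math.sqrt(pooled)

treated = [1.0, 2.0, 3.0, 4.0]
controls_distinct = [0.5, 1.5, 2.5, 3.5]  # four different untreated subjects
controls_reused = [1.5, 1.5, 2.5, 2.5]    # two untreated subjects, each used twice

d_distinct = std_diff(treated, controls_distinct)
d_reused = std_diff(treated, controls_reused)
```

In this example the difference in means is 0.5 in both cases, but the standardized difference rises from about 0.39 to 0.50 when the same untreated subjects are reused, purely because the variance in the untreated sample shrinks.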

We conducted a brief, post-hoc analysis to examine the stability of our findings given the use of 1000 simulated datasets. We restricted our attention to one scenario (multivariate normal covariates and a prevalence of exposure of 5%) and one method (caliper matching, random order). In this sensitivity analysis, we replicated our simulations 10 times; that is, we created 1000 simulated datasets 10 times. We constructed each of the 10,000 simulated datasets using a different random number seed. We then determined the mean estimated treatment effect and the MSE of the estimated treatment effect in each of the 10 sets of 1000 simulated datasets. When using caliper matching (random order), the estimated difference in means ranged from 0.980 to 1.033 across the 10 sets of simulated datasets, whereas the MSE of the estimate ranged from 0.384 to 0.440. Similarly, the estimated mean risk difference ranged from −0.022 to −0.013, whereas the MSE of the estimate ranged from 0.0067 to 0.0074.