Comparing paired vs non-paired statistical methods of analyses when making inferences about absolute risk reductions in propensity-score matched samples

Propensity-score matching allows one to reduce the effects of treatment-selection bias or confounding when estimating the effects of treatments when using observational data. Some authors have suggested that methods of inference appropriate for independent samples can be used for assessing the statistical significance of treatment effects when using propensity-score matching. Indeed, many authors in the applied medical literature use methods for independent samples when making inferences about treatment effects using propensity-score matched samples. Dichotomous outcomes are common in healthcare research. In this study, we used Monte Carlo simulations to examine the effect on inferences about risk differences (or absolute risk reductions) when statistical methods for independent samples are used compared with when statistical methods for paired samples are used in propensity-score matched samples. We found that compared with using methods for independent samples, the use of methods for paired samples resulted in: (i) empirical type I error rates that were closer to the advertised rate; (ii) empirical coverage rates of 95 per cent confidence intervals that were closer to the advertised rate; (iii) narrower 95 per cent confidence intervals; and (iv) estimated standard errors that more closely reflected the sampling variability of the estimated risk difference. Differences between the empirical and advertised performance of methods for independent samples were greater when the treatment-selection process was stronger compared with when treatment-selection process was weaker. We recommend using statistical methods for paired samples when using propensity-score matched samples for making inferences on the effect of treatment on the reduction in the probability of an event occurring. Copyright © 2011 John Wiley & Sons, Ltd.


Introduction
Propensity-score matching is increasingly being used to estimate the effects of treatments, exposures and interventions on outcomes in observational studies. The propensity score is the probability of treatment assignment conditional on the observed baseline covariates [1]. If treatment assignment is strongly ignorable, then conditioning on the propensity score allows one to obtain an unbiased estimate of average treatment effects [1].
Matching on the propensity score allows one to construct a matched sample in which systematic differences in observed baseline covariates are reduced or eliminated between treatment groups [2]. Outcomes often can be directly compared between treatment groups in the propensity-score matched sample. There is a lack of consensus in the literature as to the appropriate statistical methods to use when assessing the statistical significance of the estimated treatment effect in a propensity-score matched sample. Schafer and Kang suggest that, within the matched sample, the treated and untreated subjects should be regarded as independent [3]. Thus, if outcomes were continuous, a two-sample t-test could be used for testing the statistical significance of a difference in means between treatment groups, while the Pearson Chi-squared test could be used to test the statistical significance of the difference in proportions between treatment groups. Recent systematic reviews of propensity-score matching in the medical literature have found that authors frequently use statistical methods for independent samples when assessing the statistical significance of the estimated treatment effect in the propensity-score matched sample [4--6].
In contrast to this, it has been argued elsewhere that a propensity-score matched sample does not consist of independent observations [5]. Treated and untreated subjects matched on the propensity score will have observed baseline covariates that come from the same multivariate distribution [1]. Thus, on average, the baseline characteristics of matched treated and untreated subjects will be more similar than the baseline characteristics of randomly selected treated and untreated subjects in the propensity-score matched sample. When confounding is present, baseline covariates are related to the outcome. Therefore, the outcomes of matched treated and untreated subjects are likely to be more similar to one another compared to the outcomes of randomly selected treated and untreated subjects in the propensity-score matched sample.
In medical and clinical research, dichotomous outcomes are common [7,8]. Propensity score methods allow for direct estimation of risk differences (or absolute risk reductions), in which the proportion of subjects in whom the event occurs can be directly compared between treatment groups in the propensity-score matched sample [9]. The objective of the current study was to compare the effect on statistical inference when statistical methods for independent samples are used compared with when statistical methods for paired samples are used for comparing differences in proportions between treatment groups in the propensity-score matched sample. The study had four specific objectives: First, to compare the empirical type I error rate when using McNemar's test compared with when using the standard Pearson Chi-squared test for comparing the proportion of subjects in whom the event occurred between treatment groups; second, to compare the empirical coverage rates of 95 per cent confidence intervals when standard errors are estimated using methods for paired data compared with when methods for independent samples are used; third, to compare the precision of estimated 95 per cent confidence intervals when standard errors are estimated using methods for paired data compared with when methods for independent samples are used; and fourth, to compare the variance of the empirical sampling distribution of the risk difference with the estimated variance of the risk difference when using methods for independent samples and methods for paired samples. These four objectives will be addressed using Monte Carlo simulations.
subjects. We then assumed that the following logistic regression model related the probability of treatment to these 10 baseline covariates: 10,i We then generated a treatment status indicator (Z i ) for each subject from a Bernoulli distribution with subject-specific probability equal to p i,treat . Those subjects with Z i = 1 denote the treated subjects in whom the ATT is defined. We assumed that the following logistic regression model related the probability of the outcome to these covariates and an indicator variable (Z) denoting treatment: 10,i In the above regression model, p i,outcome denotes the probability of the outcome for the ith subject and denotes the log-odds ratio relating treatment to the outcome. We then generated subject-specific outcomes from a Bernoulli distribution with probability p i,outcome . The regression coefficients for the baseline covariates in the above two regression models were set as follows: L = log(1.1), M = log(1.25), H = log(1.5) and VH = log(2). These are intended to reflect low, medium, high and very high effect sizes. We fixed the value of 0,treat = −1.344090 so that approximately 25 per cent of the subjects would be treated. We fixed the value of 0,outcome = −1.098537 so that the probability of the event occurring in the population if all subjects were untreated would be approximately 0.29. To induce a risk difference of 0, was set to be 0. For risk differences of −0.02, −0.05, −0.10 and −0.15, the required value of equaled log(0.90619), log(0.7795362), log(0.6001387) and log(0.45292), respectively. The reader is referred elsewhere for a more detailed explanation of how these values of were determined [12].
The above scenario assumed that the 10 covariates (X 1 − X 10 ) were all independently distributed standard normal random variables. We also examined four additional covariate scenarios. In the second covariate scenario, the 10 covariates were from a multivariate normal distribution such that the mean and variance of each random variable were equal to 0 and 1, respectively, while the correlation between pairs of random variables was equal to 0.25. In the third covariate scenario, the first five covariates (X 1 − X 5 ) were assumed to be independent Bernoulli random variables with parameter 0.5, while the last five covariates (X 6 − X 10 ) were assumed to be independent standard normal random variables. In the fourth covariate scenario, the first nine covariates were assumed to be independent Bernoulli random variables with parameter 0.5, while the tenth covariate was a standard normal random variable. In the fifth covariate scenario, all the 10 covariates (X 1 − X 10 ) were all independent Bernoulli random variables with parameter 0.5. The values of 0,treat , 0,outcome and were modified in order to preserve the proportion of treated subjects, the marginal probability of the outcome, and the required treatment effect. We refer to the five covariate scenarios as the independent normal covariates scenario, the correlated normal covariates scenario, the first mixed covariates scenario, the second mixed covariates and the independent Bernoulli covariates scenario, respectively. Within each of the five covariate scenarios and for each of the five absolute risk reductions, we randomly generated 1825 data sets, each consisting of 10 000 subjects. We refer to the above set of 25 scenarios as the scenarios with a 0.29 outcome probability and a weak treatment-selection model.
We also examined three additional sets of five covariate scenarios. In the next set of five scenarios, the data-generating process was modified so that the probability of the outcome if all subjects were untreated was 0.15. We refer to this set of five 25 scenarios as the scenarios with a 0.15 outcome probability and a weak treatment-selection model. We then modified these two sets of 25 scenarios by changing the weak treatment-selection model to a strong treatment-selection model. A strong treatmentselection process will induce greater differences in baseline covariates between treated and untreated subjects in the unmatched sample. In these two sets of 25 scenarios, the coefficients for the treatmentselection model and the outcomes model were changed to: L = log(1.5), M = log(1.75), H = log(2), and VH = log(2.5). In the two sets of simulations in which there was a strong treatment-selection model, we observed low percentages of treated subjects successfully matched to untreated subjects in some of the covariate scenarios. Therefore, in these two sets of scenarios, minor modifications were made to the data-generating process by adding additional untreated subjects to the sample. Ten additional copies of each untreated subject were created within each replication of the Monte Carlo simulations. For these 10 additional subjects, outcomes were generated independently using the same subject-specific probability of an outcome. In these last two sets of simulations, the initially generated data set was of size 1000 (rather than of size 10 000). Then 10 copies of each untreated subject were added to the simulated sample so as to increase the number of potential control subjects.

Statistical analyses
In each simulated data set, we estimated the propensity score using a logistic regression model to regress treatment status on the 10 baseline covariates. Propensity-score matching was used to construct a matched sample consisting of pairs of treated and untreated subjects. We used greedy nearest neighbor matching on the logit of the propensity score using a caliper of width equal to 0.2 ( 2 1 + 2 2 )/2, where 2 i is the variance of the logit of the propensity score in the ith treatment group. This caliper width was used as it has been shown to result in optimal estimation of risk differences in a variety of settings [10].
In the propensity-score matched sample, the absolute risk reduction was estimated as the sample difference of the proportion of treated subjects in whom the outcome occurred and the proportion of untreated subjects in whom the outcome occurred in the propensity-score matched sample. When the true absolute risk reduction was 0 (the null hypothesis), the statistical significance of the estimated risk difference was assessed using two different methods. First, using methods for independent samples, the Pearson Chi-squared was used to assess the statistical significance of the difference in the probability of the outcome occurring between treatment groups [13]. Second, using methods for paired samples, McNemar's test was used for this comparison.
The variance of the difference in proportions was estimated in two different methods. First, using methods for independent samples, let p T and p C denote the observed probability of the outcome occurring in treated and untreated subjects, respectively, in the propensity-score matched sample. Furthermore, assume that there are N propensity score matched pairs. Then the standard error of the estimated risk difference is given by [13]. Second, using methods for paired samples, we assume that in the matched sample there were a pairs in which both the treated and untreated subjects experienced the event; b pairs in which the treated subject experienced the event while the untreated subject does not; and c pairs in which the untreated subject experienced the event while the treated subject did not. Then, the variance of the difference in proportions was estimated by ((b +c)−(c −b) 2 /n)/n 2 [14]. In both cases, 95 per cent confidence intervals were estimated as p T − p C ±1.96×se( p T − p C ), where se( p T − p C ) denotes the estimated standard error of the risk difference.
For each of the 100 scenarios (2 treatment-selection models×2 probabilities of outcomes×5 covariate scenarios×5 absolute risk reductions), we simulated 1825 data sets. The above analyses were conducted using each of the 1825 simulated data sets. In the 20 scenarios in which the true risk difference was 0, we estimated the empirical type I error rate as the proportion of simulated data sets in which the null hypothesis of no-treatment effect was rejected with a significance level of less than 0.05. Owing to our use of 1825 simulated data sets, an empirical type I error rate that was less than 0.04 or greater than 0.06 would be classified as being statistically significantly different from 0.05. For each of the 100 scenarios, we determined the proportion of estimated 95 per cent confidence intervals that contained the true risk difference. As above, due to the use of 1825 simulated data sets, empirical coverage rates that are less than 0.94 or that exceed 0.96 are statistically significantly different from the advertised coverage rate of 0.95. We also determined the mean width of the estimated 95 per cent confidence intervals across the 1825 simulated data sets. Finally, we compared the standard deviation of the empirical sampling distribution of the estimated treatment effects (i.e. the standard deviation of the 1825 estimated risk differences across the simulated data sets) with the mean of the estimated standard errors of the estimated treatment effect.

Monte Carlo simulations-results
The empirical type I error rates for the 20 different scenarios in which there was a true null treatment effect are reported in Table I. We also report the mean percentage of treated subjects that were successfully matched to an untreated subject across the 1825 simulated samples. When the Pearson Chisquared test was used to test the statistical significance of the risk difference, the empirical type I error rates were less than 0.04 in 14 (70 per cent) of the 20 different scenarios. In contrast, when McNemar's test was used, the empirical type I error rate were never less than 0.04. In one (5 per cent) of the 20 scenarios, the empirical type I error rate exceeded 0.06 (the empirical type I error rate was 0.0614 in this scenario). Thus, in the majority of covariate scenarios, the use of a method for independent samples (i.e. the Chi-squared test) resulted in a type I error rate that was statistically significantly different from the advertised rejection rate. When the results were stratified by the strength of the treatmentselection process, when methods for independent samples were used, the empirical type I error rate was statistically significantly different from 0.05 in 40 per cent of the 10 scenarios when there was a weak treatment-selection process; however, the empirical type I error rate was statistically significantly different from 0.05 in 100 per cent of the scenarios when there was a strong treatment-selection process.
The empirical coverage rates of 95 per cent confidence intervals, the mean length of 95 per cent confidence intervals, and the ratio of the mean length of estimated 95 per cent confidence intervals obtained using methods for independent samples to the mean length of estimated 95 per cent confidence intervals obtained using methods for paired samples are reported in Tables II-V for the four different sets of five covariate scenarios. In 71 of the 100 scenarios, the empirical coverage rates of 95 per cent confidence intervals obtained using a method for independent samples were statistically significantly different from 0.95 (i.e. empirical coverage rates exceeded 0.96 or were less than 0.94). The median empirical coverage rate was 0.964 (25th and 75th percentiles: 0.955 and 0.971) across the 100 scenarios. Thus, in over half of the 100 scenarios, the empirical coverage rates were significantly different from the advertised rate of 0.95. In contrast, in 15 of the 100 scenarios, the empirical coverage rates of 95 per cent confidence intervals obtained using methods for paired samples were statistically significantly different from 0.95. The median empirical coverage rate was 0.949 (25th and 75th percentiles: 0.944 and 0.951). As above, we examined the results for the independent method of analysis separately in scenarios with a weak treatment selection-process and in scenarios with a strong treatment selectionprocess. In 24 (48 per cent) of the 50 scenarios with a weak treatment-selection process, methods for independent samples resulted in 95 per cent confidence intervals whose empirical coverage rates were not statistically significantly different from 0.95. However, in only 5 (10 per cent) of the 50 scenarios with a strong treatment-selection process, did methods for independent samples result in 95 per cent confidence intervals whose empirical coverage rates were not statistically significantly different from 0.95. The ratio between the mean length of confidence intervals obtained using methods for independent samples and the mean length of confidence intervals obtained using methods for paired samples ranged from a low of 1 to a high of 1.197; the median ratio was 1.074 (25th and 75th percentiles: 1.045 and 1.113) across the 100 scenarios. Thus, in half of the 100 scenarios, the estimated confidence intervals were at least 7.4 per cent wider when methods for independent samples were used compared with when methods for paired samples were used. As above, the relative difference between the widths of the confidence intervals was greater when there was a strong treatment-selection process compared with when there was a weak treatment-selection process.
The analyses reported in the above two paragraphs suggest that confidence intervals constructed using methods for paired samples tend to have coverage rates that were closer to the advertised rates compared with when methods for independent samples were used. Furthermore, methods for paired samples resulted in estimates with greater precision, since the estimated confidence intervals are narrower compared with when methods for independent samples were used.
The square of the ratio between the mean estimated standard error when methods for independent samples were used and the standard deviation of the empirical sampling distribution of the estimated risk differences across the 1825 simulated data sets is reported in the second rightmost column of Tables II-V. A similar ratio obtained when methods for paired samples was used is reported in the rightmost column of Tables II-V. When methods for independent samples were used, this ratio ranged from a low of 0.950 to a high of 1.508; the median ratio was 1.149 (25th and 75th percentiles: 1.085 and 1.250, respectively) across the 100 scenarios. Thus, in 50 per cent of the scenarios, variance estimates obtained using methods for independent samples overestimated the sampling variance of the estimated  risk difference by at least 14.9 per cent. In 25 per cent of the scenarios, these methods overestimated the sampling variance of the estimated risk difference by at least 25.0 per cent. Furthermore, the estimated standard error overestimated the empirical standard deviation of the sampling distribution to a greater extent when there was a strong treatment-selection process compared with when there was a weak treatment-selection process. When methods for paired samples were used, the ratio ranged from a low of 0.906 to a high of 1.071; the median ratio was 1.003 (25th and 75th percentiles: 0.985 and 1.023, respectively) across the 100 scenarios.

Discussion
We compared statistical inference when methods for independent samples were used compared with when methods for paired samples were used for significance testing and for variance estimation when estimating risk differences in propensity-score matched samples. We found that compared with using methods for independent samples, the use of methods for paired samples resulted in: (i) empirical type I error rates that were closer to the advertised rate; (ii) empirical coverage rates of 95 per cent confidence intervals that were closer to the advertised rate; (iii) narrower 95 per cent confidence intervals; and (iv) estimated standard errors that were more closely reflected the sampling variability of the estimated risk difference.
As noted in the Introduction, applied researchers using propensity-score matching frequently use statistical methods for independent samples when estimating the statistical significance of estimated  [5]. The results of our series of Monte Carlo simulations suggest than when outcomes are dichotomous and the risk difference (or the absolute risk reduction) is used as the measure of treatment effect, then statistical methods of inference that account for the matched nature of the propensity-score matched sample are preferable to methods for the analysis of independent samples. The use of methods for independent samples will result in conservative confidence intervals -that is, confidence intervals whose coverage rates exceed the advertised rate. Furthermore, the estimated confidence intervals will tend to be wider, with an associated loss in precision, when methods for independent samples are used compared with when methods for matched samples are used. However, one should note that the estimated risk difference does not depend on whether one assumes that methods for matched samples or methods for independent samples should be used. The current study complements prior published research. An earlier study found that in many settings, methods for paired samples tended to result in improved inference compared with when methods for independent samples were used for the analysis of propensity-score matched samples [15]. In the prior study, empirical coverage rates of confidence intervals and variance estimation was studied for differences in means, rate ratios and relative risks. Furthermore, empirical type I error rates were studied for these three measures of effects as well as for odds ratios and hazard ratios. This prior study did not examine inferences about risk differences. The recent description of a data-generating process for simulating data in which treatment induces a specified absolute risk reduction [12] permitted the examination of inferences for risk differences that was conducted in the current study.
It is important to examine the effect of different methods of analysis on inference when estimating risk differences or absolute risk reductions. Binary outcomes are common in healthcare research [8]. Copyright  When outcomes are binary, the effect of treatment on outcomes can be described using four different metrics: the risk difference, the relative risk, the number needed to treat (NNT) and the odds ratio. Propensity-score matching has been shown to result in biased estimation of both conditional and marginal odds ratios [16,17]. The NNT is the reciprocal of the absolute risk reduction. Comparisons of different propensity score methods for estimating absolute risk reductions and relative risks are described in greater detail elsewhere [9,18].
In conclusion, we recommend that when propensity-score matching is used to reduce or eliminate the effects of treatment selection bias or confounding, that statistical methods for paired samples be used when estimating the effect of treatment or exposure on absolute risk reductions or risk differences.