Abstract
Often the literature makes assertions of medical product effects on the basis of ‘p < 0.05’. The underlying premise is that at this threshold, there is only a 5% probability that the observed effect would be seen by chance when in reality there is no effect. In observational studies, much more than in randomized trials, bias and confounding may undermine this premise. To test this premise, we selected three exemplar drug safety studies from the literature, representing a case–control, a cohort, and a self-controlled case series design. We replicated these studies as closely as we could for the drugs studied in the original articles. Next, we applied the same three designs to sets of negative controls: drugs that are not believed to cause the outcome of interest. We observed how often p < 0.05 when the null hypothesis is true, and we fitted distributions to the effect estimates. Using these distributions, we computed calibrated p-values that reflect the probability of observing the effect estimate under the null hypothesis, taking both random and systematic error into account. We also performed an automated analysis of the scientific literature to evaluate the potential impact of such a calibration. Our experiment provides evidence that the majority of observational studies would declare statistical significance when no effect is present. Empirical calibration was found to reduce spurious results to the desired 5% level. Applying these adjustments to the literature suggests that at least 54% of findings with p < 0.05 are not actually statistically significant and should be reevaluated. © 2013 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.
Introduction
Observational studies deliver an increasingly important component of the evidence base concerning the effects of medical products. Payers, regulators, providers, and patients actively employ observational studies in therapeutic decision making. While randomized controlled trials (RCTs) are regarded as the gold standard of evidence when measuring treatment effects, conducting these experiments is resource intensive and time consuming, can suffer from limitations in sample size and generalizability, and may be infeasible or unethical to implement. In contrast, the non-interventional secondary use of observational data, collected within the healthcare system for purposes such as reimbursement or clinical care, can yield insights about real-world populations and current treatment behaviors at a small fraction of the cost and in days instead of years.

With these potential advantages comes the recognition that observational studies can suffer from various biases and that results might not always be reliable. Results from observational studies often cannot be replicated [1, 2]. For example, two recent independent observational studies investigating oral bisphosphonates and the risk of esophageal cancer produced conflicting conclusions [3, 4], despite the fact that the two studies analyzed the same database over approximately the same period. A systematic analysis has suggested that the majority of observational studies return erroneous results [5]. The main source of these problems is that observational studies are more vulnerable than RCTs to systematic error such as bias and confounding. In RCTs, randomization of the exposure helps to ensure that exposed and unexposed populations are comparable. Observational studies, by definition, merely observe clinical practice, and exposure is no longer the only potential explanation for observed differences in outcomes. Many statistical methods exist that aim to reduce this systematic error, including self-controlled designs [6] and propensity score adjustment [7], but it is unclear to what extent these solve the problem.
Despite the fact that the problem of residual systematic error is widely acknowledged (often in the discussion section of articles), results are sometimes misinterpreted as if this error did not exist. Most importantly, statistical significance testing, which accounts only for random error, is widely used in observational studies. Significance tests compute the probability that a study finding at least as extreme as the one reported could have arisen under the null hypothesis (usually the hypothesis of no effect). This probability, called the p-value, is compared against a predefined threshold α (usually 0.05), and if p < α, the finding is deemed ‘statistically significant’.
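By way of illustration (this sketch is ours, not part of the original studies), the traditional two-sided p-value for a reported relative-risk-type estimate is typically obtained from the log of the estimate and its standard error, assuming asymptotic normality:

```python
import numpy as np
from scipy.stats import norm

def traditional_p_value(rr: float, se_log_rr: float) -> float:
    """Two-sided p-value for a relative-risk-type estimate, assuming the log
    estimate is normally distributed with the given standard error.
    Only random error is taken into account."""
    z = np.log(rr) / se_log_rr
    return 2 * norm.sf(abs(z))

# Example: an odds ratio of 2.2 with standard error 0.07 on the log scale
# (roughly what a 95% CI of 1.9-2.5 implies) gives p far below 0.05.
print(traditional_p_value(2.2, 0.07))
```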
Although we believe that most researchers are aware that traditional p-value calculations do not adequately take systematic error into account, the notion of statistical significance based on the traditional p-value remains widespread in the medical literature, likely because of the lack of a better alternative. Using PubMed, we conducted a systematic review of the literature of the last 10 years and identified 4970 observational studies exploring medical treatments with effect estimates in their abstracts. Of these, 1362 provided p-values in the abstract, and 83% of these p-values indicated statistical significance. The remaining 3608 articles provided 95% confidence intervals instead. Of these confidence intervals, 82% excluded 1 and therefore also indicated statistical significance. The details of this analysis are provided in the Supporting information, Appendix F.
In this research, we focus on the fundamental notion of statistical significance in observational studies, testing the degree to which observational analyses generate significant findings in situations where no association exists. To accomplish this, we selected two publications: one investigating the relationship between isoniazid and acute liver injury [8] and one investigating sertraline and upper gastrointestinal (GI) bleeding [9]. Together, these two publications represent three popular study designs in observational research: the first publication used a cohort design, and the second used both a case–control design and a self-controlled case series (SCCS). We replicated these studies as best we could, closely following the specific design choices. However, because we did not have access to the same data, we used suitable substitute databases. For each study, we identified a set of negative controls (drug–outcome pairs for which we have confidence that there is no causal relationship) and explored the performance of the study designs. We show that the designs frequently yield biased estimates, misrepresent the p-value, and lead to incorrect inferences about rejecting the null hypothesis. We introduce a new empirical framework, based on modeling the observed null distribution of the negative controls, that yields properly calibrated p-values for observational studies. Using this approach, we observe that about 5% of drug–outcome negative controls have p < 0.05, as is expected and desired. By applying this framework to a large set of historical effect estimates under various assumptions of bias, we show that for the majority of estimates currently considered statistically significant, we would not be able to reject the null hypothesis after calibration. Our framework provides an explicit formula for estimating the calibrated p-value from the traditionally estimated relative risk and standard error. As such, all stakeholders can easily employ this decision tool as an aid for minimizing the potential effects of bias when interpreting observational study results.
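To make this concrete, here is a minimal sketch of such a formula, under the assumption (used throughout our approach) that the empirical null distribution of log effect estimates is Gaussian; the function name and parameterization are illustrative rather than a verbatim reproduction of our implementation:

```python
import numpy as np
from scipy.stats import norm

def calibrated_p_value(rr: float, se_log_rr: float,
                       null_mean: float, null_var: float) -> float:
    """Sketch of a calibrated two-sided p-value: the probability of observing
    an estimate at least as extreme under a Gaussian empirical null
    (mean null_mean, variance null_var on the log scale), with the study's
    own sampling variance added. null_mean and null_var are estimated from
    negative controls."""
    z = (np.log(rr) - null_mean) / np.sqrt(null_var + se_log_rr**2)
    return 2 * norm.sf(abs(z))
```

When the empirical null is unbiased with zero variance (null_mean = 0, null_var = 0), this reduces to the traditional p-value.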
Results
Our replication of the published studies produced similar results. The original cohort study reported an odds ratio of 6.4 (95% CI 2.2–18.3, p < 0.001) for isoniazid and acute liver injury, compared with an odds ratio of 4.0 (95% CI 2.7–6.0, p < 0.001) found in our reproduction. In total, we identified 2807 subjects who were exposed to the drug, for a total of 384,659 days. In the whole population covered by the database, 264,122 subjects met our criteria for acute liver injury.
The original case–control study reported an odds ratio of 2.4 (95% CI 2.1–2.7, p < 0.001) for sertraline and upper GI bleeding, while our reproduction yielded an odds ratio of 2.2 (95% CI 1.9–2.5, p < 0.001). The original self-controlled case series analysis of the same relationship reported an incidence rate ratio of 1.7 (95% CI 1.5–2.0, p < 0.001), compared with an incidence rate ratio of 2.1 (95% CI 1.8–2.4, p < 0.001) found in our reproduction. In total, 441,340 subjects were exposed to sertraline for a total duration of 108,759,375 days. In the entire database, 108,882 subjects met our criteria for upper GI bleeding.
Distribution of negative controls
We applied the three study designs to the negative controls for the respective health outcomes of interest.
Figure 1 shows the estimated odds ratios and incidence rate ratios, which can also be found in tabular form in the Supporting information, Appendix C. For the case–control and SCCS designs, applying the same study design to other drugs was straightforward. For the cohort method, most drugs had different comparator groups of patients and required recomputing propensity scores. For three of the negative controls for acute liver injury and 23 of the negative controls for upper GI bleeding, there were not enough data to compute an estimate, for instance, because none of the cases and none of the controls were exposed to the drug. However, an initial investigation into the minimum number of required negative controls showed that the remaining number sufficed (Supporting information, Appendix E). Note that the number of exposed subjects varies greatly from drug to drug, from 67 subjects exposed to neostigmine to 884,644 individuals exposed to fluticasone (Supporting information, Appendix C). These differences account for most of the variation in the widths of the confidence intervals.
From Figure 1, it is clear that traditional significance testing fails to capture the diversity in estimates that exists when the null hypothesis is true. Despite the fact that all the featured drug–outcome pairs are negative controls, a large fraction of the null hypotheses are rejected. We would expect only 5% of negative controls to have p < 0.05. However, in Figure 1A (cohort method), 17 of the 34 negative controls (50%) are either significantly protective or harmful. In Figure 1B (case–control), 33 of 46 negative controls (72%) are significantly harmful. Similarly, in Figure 1C (SCCS), 33 of 46 negative controls (72%) are significantly harmful, although not the same 33 as in Figure 1B.
These numbers cast doubt on any observational study that claims statistical significance using traditional p-value calculations. Consider, for example, the odds ratio of 2.2 that we found for sertraline using the case–control method: Figure 1B shows that many of the negative controls have similar or even higher odds ratios. The estimate for sertraline was highly significant (p < 0.001), meaning the null hypothesis can be rejected on the basis of the theoretical model. However, on the basis of the empirical distribution of negative controls, we argue that we cannot reject the null hypothesis so readily.
Calibration of p-values
Using the empirical distributions of negative controls, we can compute a better estimate of the probability that a value at least as extreme as a certain effect estimate could have been observed under the null hypothesis. For the three designs we considered, Table 1 provides the maximum likelihood estimates of the means and variances of the empirical null distributions. Interestingly, in our study, while the cohort method has nearly zero bias on average, the case–control and SCCS methods are positively biased on average. It is important to note that for all three designs the variance is not equal to zero, meaning that the bias in an individual study may deviate considerably from the average.
Table 1. Estimated mean and variance of the empirical null distribution for the three study designs.

| Design | Mean | Variance |
|---|---|---|
| Cohort | −0.05 | 0.54 |
| Case–control | 0.90 | 0.35 |
| SCCS | 0.79 | 0.28 |
When eliminating the six drugs where we expect confounding by indication, the estimated mean and variance for the case–control design change only slightly.
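As an illustration of how such maximum likelihood estimates could be obtained (a sketch under the same Gaussian-null assumption; the function and the simulated data are ours, not the original code):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_empirical_null(log_rr, se_log_rr):
    """Fit the mean and variance of a Gaussian empirical null by maximum
    likelihood. Each negative-control estimate is treated as a draw from
    N(mean, variance + se_i^2), separating the systematic error shared
    across controls from each estimate's own sampling error."""
    def neg_log_likelihood(params):
        mean, log_var = params  # optimize the log of the variance to keep it positive
        total_sd = np.sqrt(np.exp(log_var) + se_log_rr ** 2)
        return -np.sum(norm.logpdf(log_rr, loc=mean, scale=total_sd))

    result = minimize(neg_log_likelihood, x0=[0.0, 0.0], method="Nelder-Mead")
    mean, log_var = result.x
    return mean, np.exp(log_var)

# Illustrative check on simulated negative controls:
rng = np.random.default_rng(0)
se = rng.uniform(0.05, 0.5, size=40)
log_rr = rng.normal(0.9, 0.6, size=40) + rng.normal(0.0, se)
print(fit_empirical_null(log_rr, se))  # should roughly recover mean 0.9, variance 0.36
```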
Figure 2 shows for every level of α the fraction of negative controls for which the p-value is below α, for both the traditional p-value calculation and the calibrated p-value using the empirically established null distribution. For the calibrated p-value, a leave-one-out design was used: for each negative control, the null distribution was estimated using all other negative controls. A well-calibrated p-value calculation should follow the diagonal: for negative controls, the proportion of estimates with p < α should be approximately equal to α. Most significance testing uses an α of 0.05, and we see in Figure 2 that the calibrated p-value leads to the desired level of rejection of the null hypothesis. For the cohort method, case–control, and SCCS, the number of significant negative controls after calibration is 2 of 34 (6%), 5 of 46 (11%), and 3 of 46 (5%), respectively.
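A minimal, self-contained sketch of this leave-one-out check under the same Gaussian-null assumption (illustrative only; comparing the resulting fractions with α for a range of α values reproduces a curve like Figure 2):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_null(log_rr, se):
    """Gaussian empirical null fitted by maximum likelihood (see earlier sketch)."""
    nll = lambda p: -np.sum(norm.logpdf(log_rr, p[0], np.sqrt(np.exp(p[1]) + se ** 2)))
    mean, log_var = minimize(nll, x0=[0.0, 0.0], method="Nelder-Mead").x
    return mean, np.exp(log_var)

def calibrated_p(log_rr, se, null_mean, null_var):
    return 2 * norm.sf(abs(log_rr - null_mean) / np.sqrt(null_var + se ** 2))

def leave_one_out_fraction(log_rr, se, alpha=0.05):
    """Fraction of negative controls whose calibrated p-value, computed with a
    null distribution fitted on all *other* negative controls, is below alpha."""
    p_values = []
    for i in range(len(log_rr)):
        mask = np.arange(len(log_rr)) != i
        mean, var = fit_null(log_rr[mask], se[mask])
        p_values.append(calibrated_p(log_rr[i], se[i], mean, var))
    return np.mean(np.array(p_values) < alpha)
```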
Applying the calibration to our three example studies, we find that only the cohort study of isoniazid reaches statistical significance: p = 0.01. The case–control and SCCS analysis of sertraline produced p-values of 0.71 and 0.84, respectively.
Visualization of the calibration
A graphical representation of the calibration is shown in Figure 3. By plotting the effect estimate on the x-axis and the standard error of the estimate on the y-axis, we can visualize the area where the traditional p-value is smaller than 0.05 (the gray area below the dashed line) and where the calibrated p-value is smaller than 0.05 (orange area). Many of the negative controls (blue dots) fall within the gray area indicating traditional p < 0.05, but only a few fall within the orange area indicating a calibrated p < 0.05.
In Figure 3A, the drug of interest isoniazid (yellow diamond) is clearly separated from the negative controls, and this is the reason we feel confident we can reject the null hypothesis of no effect. In Figure 3B and C, the drug of interest sertraline is indistinguishable from the negative controls. These studies provide little evidence for rejecting the null hypothesis.
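A sketch of how the shaded regions of such a plot could be constructed (matplotlib; the Gaussian-null parameters and axis ranges are illustrative, not the figure's actual settings):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def plot_calibration_regions(null_mean, null_var, max_se=1.0):
    """Shade where the traditional (gray) and calibrated (orange) two-sided
    p-values fall below 0.05, with the log effect estimate on the x-axis and
    its standard error on the y-axis."""
    z = norm.ppf(0.975)  # about 1.96
    se = np.linspace(0.001, max_se, 200)
    fig, ax = plt.subplots()
    # Traditional p < 0.05 wherever |log(estimate)| > 1.96 * se.
    ax.fill_betweenx(se, -4, -z * se, color="lightgray")
    ax.fill_betweenx(se, z * se, 4, color="lightgray")
    ax.plot(-z * se, se, "k--")
    ax.plot(z * se, se, "k--")
    # Calibrated p < 0.05 wherever |log(estimate) - mean| > 1.96 * sqrt(var + se^2).
    half_width = z * np.sqrt(null_var + se ** 2)
    ax.fill_betweenx(se, -4, null_mean - half_width, color="orange", alpha=0.5)
    ax.fill_betweenx(se, null_mean + half_width, 4, color="orange", alpha=0.5)
    ax.set(xlim=(-4, 4), ylim=(0, max_se),
           xlabel="log(effect estimate)", ylabel="standard error")
    return ax

# Negative controls and the drug of interest would be added as scatter points.
plot_calibration_regions(null_mean=0.9, null_var=0.35)
plt.show()
```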
Literature analysis
The medical literature features many observational studies that use traditional significance testing to assert whether an effect was observed. Assuming that these studies have null distributions similar to those of our three example studies, we can test whether, for historical significant findings, the null hypothesis can still be rejected after calibration. Using an elaborate PubMed query (Supporting information, Appendix F), we identified 31,386 papers published in the last 10 years that applied a cohort, case–control, or SCCS design in a study using observational healthcare data. Through an automated text-mining procedure, we extracted 4970 articles where a relative risk, hazard ratio, odds ratio, or incidence rate ratio estimate was mentioned in the abstract. These estimates were accompanied by either a p-value or a confidence interval, which we used to back-calculate the standard error, allowing us to recompute the calibrated p-value under various assumptions of bias. The full list of estimates and recomputed p-values can be found in the Supporting information, Appendix G.
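The back-calculation of the standard error is straightforward under the usual asymptotic-normality assumption; a sketch:

```python
import numpy as np
from scipy.stats import norm

def se_from_ci(lower, upper, level=0.95):
    """Standard error of the log effect estimate from a reported confidence interval."""
    z = norm.ppf(0.5 + level / 2)  # 1.96 for a 95% interval
    return (np.log(upper) - np.log(lower)) / (2 * z)

def se_from_p(estimate, p):
    """Standard error of the log effect estimate from a reported two-sided p-value."""
    z = norm.isf(p / 2)
    return abs(np.log(estimate)) / z

print(se_from_ci(1.9, 2.5))    # about 0.07
print(se_from_p(2.2, 0.001))   # about 0.24
```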
Figure 4 shows the number of estimates per publication year. The vast majority of these estimates (82% of all estimates) are statistically significant under the traditional assumption of no bias. But even with the most modest assumption of bias (mean = 0, SD = 0.25), this number dwindles to less than half (38% of all estimates). This suggests that at least 54% of significant findings would be deemed non-significant after calibration. With an assumption of medium-sized bias (mean = 0.25, SD = 0.25), the number of significant findings decreases further (33% of all estimates), and assuming a larger but still realistic level of bias leaves only a small fraction of estimates with p < 0.05 (14% of all estimates).
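A sketch of that recalculation (the bias scenarios follow the mean/SD pairs named above; the example estimates are made up, not taken from the extracted set):

```python
import numpy as np
from scipy.stats import norm

def calibrated_p(log_rr, se, bias_mean, bias_sd):
    """Two-sided p-value after adding an assumed Gaussian bias distribution
    (mean bias_mean, standard deviation bias_sd) to the sampling error."""
    return 2 * norm.sf(np.abs(log_rr - bias_mean) / np.sqrt(bias_sd ** 2 + se ** 2))

def fraction_significant(log_rr, se, bias_mean, bias_sd, alpha=0.05):
    """Fraction of extracted literature estimates still significant under the assumed bias."""
    return np.mean(calibrated_p(log_rr, se, bias_mean, bias_sd) < alpha)

# Made-up example estimates (log scale) and back-calculated standard errors:
log_rr = np.array([0.79, 0.26, 1.10])
se = np.array([0.10, 0.05, 0.30])
for mean, sd in [(0.0, 0.0), (0.0, 0.25), (0.25, 0.25)]:
    print(mean, sd, fraction_significant(log_rr, se, mean, sd))
```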
Discussion
Our reproduction of three observational studies published in the literature produced similar odds and rate ratios, giving confidence that our studies are representative of real-world studies. Applying the same study designs to sets of negative controls showed that all three studies were plagued by residual systematic error that had not been corrected for by the various study designs and data analyses. We do not believe that this problem is unique to our three studies, nor do we believe that these study designs are particularly bad designs. The papers from which the designs were borrowed represent excellent scientific studies, and Tata et al. [9] even went as far as including an SCCS in their analysis, a method that is believed to be less vulnerable to systematic error [6]. We therefore must conclude that this problem is intrinsic to observational studies in general.
The notion of residual systematic error has already found some acceptance among methodologists, and several approaches exist for computing the potential impact of systematic error using a priori assumptions on potential sources of error (see [19] for an excellent review). However, very few observational studies have actually applied these techniques. Reasons for this include the need to make various subjective assumptions about the nature and magnitude of the systematic error, which are themselves subject to uncertainty, and the fact that some of these methods are highly complex. Using negative controls to empirically estimate the bias in a study provides a straightforward approach to interpreting the outcome of a study. The observed null distribution incorporates most forms of bias, including residual confounding, misclassification, and selection bias. The error distribution resulting from this bias (which does not depend on sample size) can be added to the random error distribution (which is based on sample size) to produce a single intuitive value: the calibrated p-value.
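Under the Gaussian assumptions we use, this combination can be written compactly; as a sketch, with β̂ the log effect estimate, τ̂ its standard error, and μ̂, σ̂² the fitted mean and variance of the empirical null, the two error distributions add on the variance scale:

```latex
\hat{\beta} \mid H_0 \;\sim\; \mathcal{N}\!\left(\hat{\mu},\, \hat{\sigma}^2 + \hat{\tau}^2\right),
\qquad
p_{\text{cal}} \;=\; 2\,\Phi\!\left(-\frac{\lvert \hat{\beta} - \hat{\mu} \rvert}{\sqrt{\hat{\sigma}^2 + \hat{\tau}^2}}\right)
```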
Our research is strongly related to previous research on estimating the false discovery rate (FDR) [20-22], where an empirical null distribution is computed for either z-values or p-values [22]. However, FDR methods were developed for analyzing high-throughput data representing many similar hypothesis tests. Most importantly, these tests typically have the same sample size and corresponding standard error. In the observational studies we investigated, we found widely differing standard errors even when using the same outcome, method, and database, primarily because of differences in drug prevalence. When applying FDR methods based on z-value or p-value modeling, we found that these methods have a counterintuitive property: a large sample size (low standard error) can compensate for bias. For example, even when it was clear that a method was highly positively biased, highly prevalent drugs with effect estimates barely above one were still deemed statistically significant by these methods because of the large original z-value or small original p-value. Our intuition is that bias is independent of sample size and would remain present even in an infinitely large sample. We have therefore chosen to model our null distribution on the basis of the effect estimate, taking the standard error into account as a measure of uncertainty.
We have demonstrated our approach in the field of drug safety but expect it could be applicable in other types of observational studies as well, as long as suitable negative controls can be defined. The most important characteristic of negative controls, apart from the fact that they are known not to cause the outcome, is that they somehow represent a sample of the bias that could be present for the exposure of interest. A completely random variable would make a poor negative control, because the bias will be zero, which is not what we would expect for any meaningful exposure of interest. For example, in nutritional epidemiology, other food types are most likely good negative controls, but the last digit of someone's zip code is not.
One of the limitations of our study is the assumption that our negative controls truly represent drug–outcome pairs with no causal relationship. However, a few erroneously selected negative controls should not change the findings much, and we find it hard to believe that a large number of our negative controls are wrong. In FDR methods [20], where there is no information on the presence of causal relationships, the majority of relationships ( > 90%) are simply assumed to be negative. Furthermore, we cannot say with certainty to what extent the results presented here generalize beyond the two databases (GE and Thomson MarketScan Medicare Supplemental Beneficiaries) used in this study. We would in no way suggest that the null distributions in other databases are comparable, even when using the same study design and analysis. For every database, the calibration process described here will have to be repeated. Another limitation is the notion that the same data, study design, and analysis can be used for different drugs. Although studies often already include more than one drug (for example, Tata et al. [9] studied both SSRIs and non-steroidal anti-inflammatory drugs), for some drugs a given study design would be deemed less appropriate because of known bias that would not be corrected for. For example, we identified some negative controls that might be confounded by indication, which might preclude the use of a case–control design. Removing those controls changed the fitted distribution only slightly. Furthermore, because we cannot pretend to know all bias that is present for the drug of interest, we would argue that such negative controls should be included to account for potential gaps in our knowledge.
For two of our three examples, we could not reject the null hypothesis of no effect after calibration, even though originally all three were considered highly statistically significant. The analysis of the effect estimates found in the literature showed that for the majority of significant results, the null hypothesis can no longer be rejected even under the most modest assumptions of bias. This is in line with earlier estimations that most published research findings are wrong [5], although in this previous work, the main focus was on selective reporting bias (e.g., publication bias), which we have not even taken into consideration here. Reality may therefore be even grimmer than our findings suggest, which is troubling because evidence from these observational studies is widely used in medical decision making.
The method proposed here aims to correct the type I error rate (erroneously rejecting the null hypothesis), most likely at the cost of substantially increasing the number of type II errors (failing to reject the null hypothesis when a true effect exists). Ideally, we would improve our study designs to better control for bias, so that both the mean and the variance of the empirical null distribution approach zero, thereby maximizing statistical power after calibration. In that case, our approach would no longer be needed for calibration, only to show that bias has been dealt with. However, as shown here, the study designs currently pervading the literature fall short of this goal, and more work is needed to reach it, if it can be reached at all.
We recommend that observational studies always include negative controls to derive an empirical null distribution and use these to compute calibrated p-values.
Acknowledgements
The Observational Medical Outcomes Partnership (OMOP) was funded by the Foundation for the National Institutes of Health (FNIH) through generous contributions from the following: Abbott, Amgen Inc., AstraZeneca, Bayer Healthcare Pharmaceuticals, Inc., Bristol-Myers Squibb, Eli Lilly & Company, GlaxoSmithKline, Janssen Research and Development LLC, Lundbeck, Inc., Merck & Co., Inc., Novartis Pharmaceuticals Corporation, Pfizer Inc., Pharmaceutical Research Manufacturers of America (PhRMA), Roche, Sanofi, Schering-Plough Corporation, and Takeda. At the time of publication of this paper, OMOP has been transitioned from FNIH into the Innovation in Medical Evidence Development and Surveillance (IMEDS) program at the Reagan-Udall Foundation for the Food and Drug Administration. Dr. Ryan is an employee of Janssen Research and Development LLC. Dr. DuMouchel is an employee of Oracle Health Sciences. Dr. Schuemie received a fellowship from the Office of Medical Policy, Center for Drug Evaluation and Research, US Food and Drug Administration, and has become an employee of Janssen Research and Development LLC since completing the work described here.