Abstract

Observational analogs of randomized clinical trials (RCTs) are well accepted in the study of disease risk factors, diagnosis, and prognosis. Observational studies remain controversial, however, when the focus is on the intended benefit of a therapy, because of the lack of blinding and the imperfect control of unmeasured confounding. Well-designed randomized clinical trials are costly in both time and money. Therefore, existing databases are used increasingly and are often the only feasible source with which to examine delayed health effects. We reviewed the reasons for possible discrepancies between RCTs and observational studies. Discrepancies can arise from differences in patient populations, therapeutic regimens, control of confounding, follow-up, and outcome measurement, and from intention-to-treat analysis. Observational studies cannot replace trials, nor do trials make observational studies unnecessary. Both designs are susceptible to particular biases, so neither provides perfect information. (HEPATOLOGY 2006;44:1075–1082.)

Clinical research includes research studies in which interventions are assigned randomly [randomized clinical trials (RCTs)] and studies that are observational (nonexperimental) analogs of RCTs. These observational studies have a well-accepted role in medical research, especially in the study of disease risk factors, diagnosis, and prognosis.1–3 In addition, because the benefit/risk ratio for many therapies is high, most RCTs of therapies are too small or too brief to evaluate the small risks associated with them. Large nonexperimental observational studies are useful to assess these risks.4

Nevertheless, controversy sometimes surrounds observational studies, especially if they focus on the intended benefit of a therapy.5 RCTs have typically been regarded as the reference standard for evaluating the efficacy of a therapy or other intervention intended to improve the outcome of disease, and some consider RCTs to be the only valid design for evaluating therapeutic efficacy. The few reviews that have compared results of experimental and nonexperimental studies of the same question have shown that results from nonexperimental studies of treatment generally correspond with results from trials.6–12 For example, in an analysis of 240 trials and 168 nonrandomized studies within 45 diverse topics of medical interventions, a correlation of r = 0.75 was observed between the summary effect measures obtained from the 2 sets of studies. This level of agreement is far from perfect, however. The 2 groups of studies often disagreed about how well the intervention worked,6 with nonrandomized studies tending to find larger effects than randomized studies. A recent example is postmenopausal hormone replacement therapy (HRT) and risk of coronary artery disease, where the different magnitudes of effect estimated from experimental and nonexperimental studies were the source of considerable controversy.13–16

Unfortunately, well-designed and adequately powered RCTs are often costly in both time and money. Many trials are funded by pharmaceutical or medical device manufacturers who aim to demonstrate the effect of their new products, but therapies that lack a commercial interest may never be evaluated in an RCT. In addition, many trials are too small, use surrogate markers, or underrepresent disadvantaged patient groups that may be most likely to have the disease. Instead, trials often focus on short-term efficacy and safety in a controlled clinical environment among well-educated, affluent patients.17 One motivation for this focus is the difficulty of following patients over a long period and ascertaining that they actually received the assigned treatment, even among patients who adhere to therapy.17 Furthermore, not all trials take account of cointervention with therapies given outside the study protocol. All of these issues reduce the ability of RCT data to inform clinical decision-making in the practice setting.

Studies conducted from existing databases offer an alternative to RCTs with the advantages of nearly ready-made large-scale population-based investigations, making it possible to study rare exposures, diseases, and outcomes inexpensively and rapidly.18–21 If the data sources comprise complete population data with respect to people who might be in the target population for therapy, the study results may be more likely to reflect daily clinical practice. Because these existing databases collect data for administrative or other purposes unrelated to research objectives, certain biases are reduced or eliminated. For example, nonresponse bias, recall bias, the ‘Hawthorne’ effect, and bias from losses to follow-up may be reduced or eliminated when data have been collected for administrative purposes.19–21 One also avoids any effect on the diagnostic process sparked by the research.19 In addition, many health effects first appear years after exposure. Existing databases are often the only feasible source with which to examine delayed health effects.19–21

Database studies, however, also have important limitations, which are too often ignored.19, 20, 22 The limitations relate to the selection and quality of the data, and thus to the ability to control for confounding, because the methods of data collection are not controlled by the researcher. Some data required for the research may not have been collected at all. For available data, inaccuracies or incompleteness can pose serious obstacles.19, 20, 22

In RCTs, the efficacy of a new or existing therapy is measured by comparing it with a placebo or an older therapy. Because treatment choices are assigned randomly, the assigned therapy for a patient in a trial is independent of the usual collaborative patient-physician decision process. As explained further in Table 1, this independence provides the theoretical grounding for the statistical methods that aid in drawing conclusions from the study about differences in the efficacy of the compared treatments.23 In contrast, when treatment choices are made in the usual setting and observed by researchers, differences in the outcomes of the treatment groups may be due to a difference in the efficacy of the treatments, but might also be due to differences between the patients themselves that influence treatment decision-making. These differences may be difficult or impossible to measure. Examples of such factors include the severity of the underlying disease, comorbidity, accuracy and utility of diagnostic tests, clinical quality, and patient adherence.24, 25

Table 1. Notable Differences Between Randomized Clinical Trials and Observational Studies

Setting
  Clinical trials: Standardized approach to treating patients may differ from common practice
  Observational studies: Usual clinical practice

Ethics
  Clinical trials: Must meet ethical standards of human experimentation
  Observational studies: Researcher does not offer the intervention, which limits ethical concerns mainly to privacy issues

Cost of each study subject
  Clinical trials: High
  Observational studies: Low

Subjects
  Clinical trials: Selection of patients based on strict inclusion and exclusion criteria that depend on ethics and feasibility
  Observational studies: Can readily include all patients, a broad range of patients, or apply specific inclusion or exclusion criteria

Exposure
  Clinical trials: Usually 1 or 2 interventions
  Observational studies: No limit to the number of interventions or comparisons

Compliance
  Clinical trials: Can often be measured
  Observational studies: More difficult to quantify directly

Confounding control
  Clinical trials: Randomization addresses known and unknown confounding
  Observational studies: Known factors, if measured, can be controlled, but it is very difficult to control adequately for unmeasured factors

Outcome
  Clinical trials: Standardized measurement of surrogate, soft, and hard endpoints defined by the researcher; blinding is possible
  Observational studies: Based on routinely recorded data, mainly hard endpoints; no blinding

Rare outcomes
  Clinical trials: Cost is too high for rare outcomes
  Observational studies: Much more feasible for rare outcomes

Type of Associations

The interpretation of epidemiologic studies requires disentangling causal associations from associations that stem from bias or random error.26, 27 Random error is the component of overall error that cannot be predicted, but can be quantified using statistical distributions. Bias, or systematic error, is the component of overall error that tends to underestimate or overestimate an association.26, 27 Bias is often described in 3 categories: 1) selection bias, which derives from including or selectively following subsets of the population in the study in a way that distorts the relation between the exposure and the outcome; 2) information bias, which derives from measurement errors; and 3) confounding, which derives from noncomparability of the intervention groups in the study.27

Reasons for Discrepancies Between RCTs and Observational Studies

Different Patient Populations.

As noted above, RCT participants often do not reflect the distribution of sex, age, race, and ethnicity of the target patient population.28–30 Many trials exclude patients who are older or who have other diseases that complicate their treatment.17 In addition, patients must usually consent to participate in a trial, and patients who are older, sicker, less well educated, or less socioeconomically advantaged are less likely to participate than their younger, healthier, better educated, or socioeconomically advantaged counterparts. Trial results may exaggerate the difference between new and standard therapies because they enroll primarily patients who are healthier and otherwise more advantaged.18 For example, as noted above, these patients may be more likely to adhere to the new treatment than would disadvantaged patients.

Although a representative study population is usually not a scientific requirement, it is important to enroll patient populations that include the segments of the population at high risk for the disease under study.28–30 When treatment efficacy is modified by sex, age, race, ethnicity, or other factors, and the study population differs from the population that would be receiving the treatment with respect to these variables, the average study effect will differ from the average effect among those who would receive treatment. In these circumstances, extrapolation of the study results is tenuous or unwarranted, and one may have to restrict the inferences to specific subgroups, even though these subgroups may not bear the majority of the disease burden.

Differences in Therapeutic Regimen.

In a trial, the patient receives treatment according to a regimen that follows a detailed protocol. This treatment regimen standardizes the intervention and may lead to assigned doses and schedules that differ from what the same patient might have received had she been treated outside the randomized trial. The rigidity of the treatment regimen assists in characterizing the intervention, but its inflexibility may increase the likelihood of nonadherence. Usually, an RCT will attempt to assess nonadherence, which is one of the major sources of error regarding exposure assessment in randomized studies. In contrast, in a nonexperimental study, the therapy may be tailored to the needs of the patient, increasing the variability of the exposure itself. It may also be more difficult in nonexperimental studies to quantify nonadherence, although it is possible indirectly, for example, by looking for information that suggests a patient has refilled a prescription. In some databases, it may be difficult to distinguish between new and prevalent users of drugs or other ongoing therapies.13 Prevalent users have by definition survived the early period of therapy,13 which introduces the possibility of survivor bias, and leads to problems if the risk or effect varies with time. Such bias can be eliminated in studies that focus on new users.13

Differences in Control of Confounding.

When an investigation of an intervention's effectiveness is undertaken by a randomized design, the randomized assignment of eligible and consenting patients to treatment groups creates an expectation that the proportion of patients with factors related to the outcome will be in balance across the intervention groups. This balance holds regardless of whether these factors are known and measured by the investigator or unknown to the investigator. For example, breast cancer prognosis is strongly related to node status at diagnosis. Women with micrometastases in regional lymph nodes at diagnosis have a higher risk of recurrence and death from their breast cancer. Randomization of breast cancer patients to 2 intervention groups creates an expectation that the proportion of patients with micrometastases in their lymph nodes will be the same in the 2 intervention groups. Any observed difference in the efficacy of the interventions to prevent recurrence or death should then be due to a true difference in the effectiveness of the interventions, and should not be attributed to baseline differences between the intervention groups in the risk of recurrence or death. The expectation of balance is not always realized; imbalances arise by chance, just as the expectation of 2 heads out of 4 flips of a fair coin is not always realized. Nonetheless, the probability of large imbalances can be reduced by increasing the study size: the probability of 3 or more heads out of 4 coin flips is fairly substantial (P = .31), but the probability of 300 or more heads out of 400 coin flips is vanishingly small (about 1 × 10−24). The confidence interval that accompanies a trial's result is an attempt to quantify the imprecision in measuring the study effect, reflecting errors that are largely due to chance imbalances in the other factors that predict the outcome. The frequentist theory that underlies interpretation of the confidence interval therefore applies much more readily to randomized trials than to nonexperimental studies.31 For both designs, the confidence interval is a much better aid to inference than statistical significance testing or its core statistic, the P value.
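The coin-flip arithmetic above can be checked directly. As a minimal sketch in Python (standard library only), the following snippet computes the exact binomial tail probabilities quoted in the text:

    from math import comb

    def tail_prob(n, k):
        # Exact probability of k or more heads in n fair coin flips.
        return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

    print(tail_prob(4, 3))      # 0.3125, i.e., P = .31
    print(tail_prob(400, 300))  # roughly 1e-24: vanishingly small

Running the same calculation for intermediate study sizes shows how quickly the probability of a given proportional imbalance shrinks as the number of randomized patients grows.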

When an investigation of an intervention's effectiveness is undertaken by a nonexperimental design, there is no expectation that the proportion of patients with factors related to the outcome will be in balance across the intervention groups. These factors can confound the estimate of the difference in effectiveness between the compared interventions. Well-measured factors can be controlled in the study design (e.g., by restriction or matching) or analysis (e.g., by stratification or modeling).27 Confounder summary scores, such as the propensity score, may be helpful in some circumstances but seldom yield advantages over conventional multivariate methods.32, 33 Factors that have not been well measured or that are unknown to the investigator cannot be controlled in the design or analysis by any method, except indirectly to the extent they are associated with the factors that are controlled.27 One often intractable problem is confounding by indication, which arises when treatment interventions are assigned by patient-physician decision making (the usual model for patient care outside of a trial setting). The problem occurs because the treatment course is planned conditional on a broad spectrum of patient and physician preferences related to the disease severity, underlying health of the patient, expertise and experience of the physician and the physician's institution, and the interaction of all of these factors.24 It is usually impossible to measure and control all these factors well enough to avert confounding by indication. For example, in a nonexperimental study of the effectiveness of tobramycin solution for inhalation (TSI) to treat patients with cystic fibrosis who are infected with Pseudomonas aeruginosa, the risk of death was higher among those who received TSI therapy than among those who did not, even with control for all of the well-understood prognostic indicators (such as P. aeruginosa infection).34 TSI is, of course, offered to reduce the risk of death, not to raise it. The positive association likely arose because patients with indications for TSI were at highest risk for death, even after adjustment for the accepted prognostic indicators. These indications were apparent to the patients and physicians, but not completely reflected in the measured variables.
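To make confounding by indication concrete, the following Python sketch simulates a purely hypothetical cohort (the numbers are invented and do not represent the TSI study just described) in which sicker patients are both more likely to be treated and more likely to die. The crude comparison makes a genuinely protective treatment look harmful; stratifying on the severity marker, when it happens to be measured, recovers the true effect:

    import random

    random.seed(1)

    # Hypothetical cohort: severity drives both treatment choice and death risk.
    patients = []
    for _ in range(100_000):
        severity = random.random()                          # prognostic factor
        treated = random.random() < 0.05 + 0.90 * severity  # sicker -> more often treated
        base_risk = 0.02 + 0.40 * severity                  # risk of death if untreated
        risk = base_risk * 0.7 if treated else base_risk    # true effect: RR = 0.7
        died = random.random() < risk
        patients.append((severity, treated, died))

    def risk_ratio(rows):
        treated_group = [d for s, t, d in rows if t]
        untreated_group = [d for s, t, d in rows if not t]
        return ((sum(treated_group) / len(treated_group))
                / (sum(untreated_group) / len(untreated_group)))

    # Crude comparison: treatment looks harmful (RR > 1) despite the true RR of 0.7.
    print("crude RR:", round(risk_ratio(patients), 2))

    # Stratifying on measured severity recovers roughly RR = 0.7 within each stratum.
    for lo in [i / 10 for i in range(10)]:
        stratum = [p for p in patients if lo <= p[0] < lo + 0.1]
        print(f"severity {lo:.1f}-{lo + 0.1:.1f}: RR =", round(risk_ratio(stratum), 2))

If severity were unmeasured, no analysis of these data could remove the bias, which is the predicament of the TSI example just described.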

Confounding by factors that are difficult to measure also influenced the nonexperimental studies of the relation between hormone replacement therapy (HRT) and coronary heart disease.13 The “healthy user effect” has been proposed as the principal explanation for the discrepancy between results from randomized and nonexperimental studies of the effects of HRT. Women whose physicians prescribed HRT and who chose to use it were better educated and had better cardiovascular risk profiles than non-HRT users.35–37 Likewise, HRT users who adhered to the therapy may have had a lower cardiovascular risk than patients who did not adhere to the treatment.38 These differences between users and nonusers may have confounded the nonexperimental studies, whereas such factors would be expected to be in balance between users and nonusers if the use of HRT were assigned by randomization. Similarly, selection of patients for liver transplantation is based on a set of diagnostic criteria strongly associated with the outcome.39

The potential for confounding by indication should not, however, automatically dissuade investigators from nonexperimental studies of therapeutic effectiveness. A recent trial showed that breast cancer patients 70 years old or older with small, estrogen receptor–positive (or unknown), node-negative tumors treated with breast-conserving surgery and tamoxifen could forgo radiation therapy with only a small absolute increase in the rate of recurrence (3 per 100 patients over 5 years of follow-up).40 As noted above, older women eligible for and consenting to participate in trials likely differ from the vast majority diagnosed in the community setting, and these differences may reduce the validity of generalizing trial results to the community.41 To investigate whether the trial result applied also to the larger community, Smith et al. examined the same relation using a nationally representative group of breast cancer patients restricted to those who met the trial eligibility criteria and adjusted for tumor, treatment, and other variables (such as comorbidity).42 They observed a difference in the rate of recurrence similar to that observed in the trial (4 per 100 patients over 5 years of follow-up), suggesting that the trial result applies also in this community setting.

Differences in Follow-Up.

An advantage of RCTs is that patients can be followed in a standardized way, and strategies can be selected to facilitate both adherence to the therapy and good follow-up. Nevertheless, discontinuation of the trial intervention by the patient and other violations of the protocol can result in biased estimates. The direction of these biases can be unpredictable. In contrast, many database nonexperimental studies have nearly complete follow-up, and consequently, little bias from losses to follow-up.20, 21

In any type of study, patients should be followed long enough for the outcome to occur or be prevented.43 Therefore, the length of follow-up should correspond to the study hypothesis. For example, evidence suggests that long-term use of nonsteroidal anti-inflammatory drugs, such as aspirin, might reduce the risk of colorectal cancer. In the randomized Physicians' Health Study, in which physicians took low-dose aspirin for 5 years, little association between use of aspirin and reduced risk of colorectal cancer was evident after the first 5 years of intervention.44 In contrast, cohort and case-control studies found a reduction of approximately 50% in colorectal cancer incidence and mortality.45, 46 Could either the intervention or the follow-up have been too short in the Physicians' Health Study? When the study stopped, 29% of the study participants chose not to take daily aspirin regularly. A 12-year follow-up still found little association,47 whereas the protective effect against colorectal cancer in the observational studies increased as duration of regular aspirin use increased, with the relative risk still declining after 5 to 15 years of use. It was therefore concluded that the short intervention period or the low dose in the Physicians' Health Study was responsible for the near-null findings.47 Sporadic adenomas precede most cases of colorectal cancer, and patients with colorectal cancer are at increased risk of developing adenomas, cancer recurrence, and a new primary colorectal cancer. Because adenomas are more prevalent than colorectal cancer and have a shorter induction period, trials designed to prevent adenomas can be completed more quickly and with fewer subjects than primary chemoprevention trials of colorectal cancer.48 Two trials reported that aspirin protected against recurrence of adenomas and against adenomas in patients who had previously had colorectal cancer.49, 50

Differences in Measuring Outcome.

An RCT defines the outcomes using well-defined criteria that are relevant to patients and caretakers. These criteria might include death, quality of life, symptoms, or laboratory results. In contrast, most databases have few data on symptoms and functional status, and the potential endpoint information must typically be measured as death, readmission, complications, or discharge diagnoses.20 Hospital discharge diagnoses are not entirely accurate, with 20% misclassification not uncommon.19 For example, in a Medicaid study of drug-induced acute liver disease, no liver diagnosis was found in 10.6% of the cases.51 Mortality data are usually recorded accurately, but often they do not include cause of death or they record it inaccurately. Because cause of death is more subjective than the fact of death, there has been debate about whether cause-specific mortality should be the primary outcome or whether total mortality, which avoids the subjectivity of classifying the reason for death, is preferable.52 For example, in the randomized Women's Health Initiative Trial, concordance of results with the Nurses' Health Study (an observational study) was evident for all major clinical outcomes except cardiovascular endpoints.53 If the cause of death is assigned by a caretaker aware of the deceased woman's HRT history, and the caretaker believed that HRT prevented cardiovascular death, the information recorded for cause of death could have differed for HRT users and nonusers with the same clinical picture.38, 53, 54 This miscoding would be an example of information bias, which results from errors in measuring exposure, outcome, or other study variables. If patient assignment to intervention is not blinded, errors in assessing the outcome can be related to the intervention, leading to bias that may exaggerate or underestimate the treatment effect. If measurement errors of the outcome variable are independent of the treatment assignment, a situation fostered by blinding, bias still results; it is, however, typically in the direction of underestimating the treatment effect. Often the terms “single-blinded” (patients) and “double-blinded” (patients and doctors) are used to describe the blinding process. In nonexperimental studies, blinding does not occur, and information bias that exaggerates effects is a greater concern than in blinded experiments.
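The claim that outcome measurement errors unrelated to treatment tend to bias results toward the null can be illustrated with a small worked example; the risks, sensitivity, and specificity below are invented for illustration:

    # Sketch: nondifferential outcome misclassification dilutes a true effect.
    def observed_risk(true_risk, sensitivity, specificity):
        # Risk recorded in the data after imperfect outcome ascertainment.
        return true_risk * sensitivity + (1 - true_risk) * (1 - specificity)

    risk_exposed, risk_unexposed = 0.10, 0.05   # true risks, so true RR = 2.0
    se, sp = 0.80, 0.95                         # identical in both groups (nondifferential)

    obs_exposed = observed_risk(risk_exposed, se, sp)       # 0.125
    obs_unexposed = observed_risk(risk_unexposed, se, sp)   # 0.0875
    print("true RR: 2.0, observed RR:", round(obs_exposed / obs_unexposed, 2))  # 1.43

Because the same error rates apply in both groups, the observed relative risk (1.43) is pulled toward 1.0 from the true value (2.0). If the error rates differed by treatment group, as can happen without blinding, the bias could instead exaggerate the effect.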

Differences Arising from Intention-to-Treat Analysis.

The principle of “intention-to-treat” (ITT) dictates that the groups compared in an analysis of an RCT ought to be those who were assigned to the different intervention groups, regardless of their actual experience in getting or adhering to the intervention.27, 43 With an ITT comparison, the compared groups are interchangeable as a result of the random assignment, which is the condition from which the theoretical advantages of randomization derive. While this advantage is an important strength of randomized trials, it has a drawback. Because of protocol errors, patient preferences, and other circumstances leading to nonadherence, those who follow the correct treatment protocol will usually be only a subset of those who are assigned the treatment. Furthermore, in some studies, those assigned one treatment may receive the alternative treatment outside of the study protocol (a phenomenon described as “crossover”). Nonadherence and crossover imply that the groups in an ITT comparison do not contrast one intervention with the alternative as sharply as one might wish. The result of this imperfect contrast is bias in estimating the effect. This bias is well known and is often considered tolerable because it usually leads to an expected underestimation of the treatment effect when only 2 treatments are compared. Thus, if the study finds an advantage for a new treatment after an ITT analysis, one can infer that the actual effect of treatment, if the treatment regimen is followed faithfully, will be greater than that estimated from the RCT. This bias is considered “conservative” in the sense that the bias from an ITT analysis of an RCT poses a higher hurdle for a new intervention.

Nevertheless, underestimation of treatment effects is undesirable. Some treatment effects may be underestimated substantially, and the treatment may be judged insufficiently effective to be adopted because an ITT analysis underestimated the treatment benefit. This result is arguably too “conservative.” Furthermore, an RCT may be used to study adverse effects of an intervention. These effects will also be underestimated in an ITT analysis. Underestimating adverse effects cannot be considered “conservative.” Underestimating either risks or benefits of a new intervention is problematic and should be weighed against the advantages that an ITT analysis offers. Theoretical purists often state that an ITT analysis is the only formally acceptable analysis of an RCT, but this analytic approach, like all others, has both advantages and disadvantages. This issue also underscores the fact that RCTs are not perfect scientific instruments.
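A back-of-the-envelope sketch shows the direction and magnitude of this ITT dilution. Assume, hypothetically, that nonadherers receive no benefit and that there is no crossover into the treated arm:

    # Sketch of ITT dilution under nonadherence (hypothetical numbers only).
    control_risk = 0.10      # outcome risk with the comparator
    full_effect_rr = 0.50    # true RR if the regimen is followed faithfully
    adherence = 0.70         # 70% of the assigned arm actually takes the treatment

    # The assigned arm mixes adherers (full effect) with nonadherers (control risk).
    treated_arm_risk = (adherence * full_effect_rr * control_risk
                        + (1 - adherence) * control_risk)
    print("per-protocol RR:", full_effect_rr)                    # 0.50
    print("ITT RR:", round(treated_arm_risk / control_risk, 2))  # 0.65

Under these assumptions, a treatment that truly halves the risk appears to reduce it by only 35% in the ITT comparison; the same dilution applies when the outcome of interest is an adverse effect.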

For example, in the Physicians' Health Study several cardiovascular risk factors measured at baseline were related to poor adherence. When the investigators adjusted for these baseline differences, those in the aspirin group with excellent adherence had a 51% reduction in the risk of acute myocardial infarction compared with those in the placebo group with excellent adherence. Among those with poor adherence, the reduction in risk was only 17%.55

Conclusion

Nonexperimental studies have many strengths, but also important methodological limitations. Compared with RCTs, the main limitations are:

  1. Observational studies lack the benefit of random assignment, which is useful for controlling for both measured and unmeasured confounding variables. Uncontrolled confounding, particularly confounding by indication, is often a much more serious concern in nonexperimental studies than in RCTs.
  2. Observational studies, especially those using existing data that have been collected for another purpose, typically have suboptimal measurement of treatment and outcome variables, resulting in bias in measuring the treatment effect. When the measurement errors are unrelated to other study variables, this bias will tend toward underestimation of the treatment effect.
  3. When attempting to reconcile the results of experimental and nonexperimental epidemiologic studies, it is important to consider, and if possible to adjust for, differences in the patient population, treatment regimen, data analysis, and other study features that may vary between the studies under consideration.

Table 2 shows factors affecting the validity of an observational study.

Table 2. Factors to Consider When Planning or Interpreting the Results of Observational Studies
1. Does the database fit the research question?
2. Does the study population fit the research hypothesis and the clinical decision at hand?
3. Is the size of the study population adequate to answer the research question?
4. Does the study design fit the research and clinical questions? Only cohort studies provide direct risk or rate estimates and allow estimation of differences in risks or rates (most relevant for clinical decision making), whereas both cohort and case-control studies provide relative risk or rate estimates (more relevant for biological disease mechanisms).
5. Is the exposure determined accurately? Was the exposure assessed before the outcome occurred? Can duration of exposure be quantified and a dose-response relation evaluated?
6. Is the outcome measured accurately, and is it relevant for clinical practice?
7. Are confounding factors measured accurately enough to make it possible to control for confounding? Are there known confounders that were not measured?
8. Are the patients followed for a long enough period for the outcome to occur? The length of follow-up should correspond to the study hypothesis. Is there any loss to follow-up?
9. Are the statistical methods and their assumptions suitable for the research question?

Nevertheless, experimental and nonexperimental studies of the same question often agree, and observational studies have a record of successful contributions to clinical medicine.23 Well-known examples are estrogens for prevention of fractures and aspirin for myocardial infarction.23, 56

Observational studies cannot replace trials, nor do trials make nonexperimental studies unnecessary or undesirable. When both experimental and nonexperimental studies address the same question, both can contribute usefully to answering the question. The methods of meta-analysis are often used to combine information from several studies.43, 57 If both RCTs and nonexperimental studies have been conducted on the same question, both should be included in the meta-analysis, after using appropriate methods to adjust for the biases specific to the designs.43, 57 There is some evidence that observational studies may yield less heterogeneity than RCTs even on the same therapy.58 Both designs are susceptible to particular biases, so neither provides perfect information. Thorough consideration of their specific problems will enhance the interpretation of the totality of the evidence and lead to stronger overall inferences about clinical questions.
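As a minimal sketch of the pooling arithmetic only, the following Python snippet performs a fixed-effect, inverse-variance meta-analysis on invented log relative risks; the study labels, estimates, and standard errors are hypothetical, and a real synthesis would add the random-effects modeling and design-specific bias adjustments discussed above:

    import math

    # (label, log relative risk, standard error) -- all values invented
    studies = [
        ("RCT A",          math.log(0.80), 0.15),
        ("RCT B",          math.log(0.90), 0.20),
        ("Cohort C",       math.log(0.70), 0.10),
        ("Case-control D", math.log(0.75), 0.12),
    ]

    weights = [1 / se ** 2 for _, _, se in studies]
    pooled = sum(w * est for (_, est, _), w in zip(studies, weights)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    lo, hi = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
    print(f"pooled RR {math.exp(pooled):.2f} "
          f"(95% CI {math.exp(lo):.2f}-{math.exp(hi):.2f})")

Weighting each study by the inverse of its variance lets the more precise studies, of either design, dominate the pooled estimate.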

References
