To propose methods for the quantitative assessment of the applicability of evidence from a trial to a target sample using individual data.
The methods were demonstrated with a trial of drug therapy to prevent mortality and an accompanying registry of people with heart failure. Principal components analysis with biplots did not identify measurement discrepancies. Multiple imputation with chained equations addressed missing predictor values. A proportional hazards model with an interaction term, including graphical interpretation and a multivariate interaction test, identified heterogeneity of treatment effect. An interval of homogeneity of treatment effect was defined as the interval of the baseline risk of outcome in which no two treatment effects were statistically significantly different. Absolute risk reduction for individuals was estimated for both benefit and harm outcomes and presented in a bivariate treatment effects scatterplot.
Overall, the trial evidence applied to most of the registry according to overlapping distributions of estimated benefit and harm. However, 52% of trial and 33% of registry participants were estimated to have net benefit, and 14% of trial and 36% of registry participants were estimated to have strong net harmful treatment effect, that is, the individual estimate of harm was more than twice the estimate of benefit.
The proposed methods provide quantitative assessment of the applicability of trial evidence to a target sample. They combine the strengths of different study designs, namely, unbiased effects estimation from trials and representation in observational studies, while addressing the practical challenges of combining information, namely, measurement discrepancies and missing data. Copyright © 2012 John Wiley & Sons, Ltd.
Randomized controlled trials generally provide reliable evidence for approval of new treatments, informing clinical practice, and coverage decisions by virtue of their focused experimental design. Participants in trials form a select group rather than a representative sample of the larger at-risk population. Informed consent alone ensures selection, and there are usually many selection pressures. As a result, some consider the external validity of trial evidence to be very limited. We recognize two types of external validity: (i) the generalizability of trial evidence to the larger at-risk population and (ii) the applicability of trial evidence to a specific target sample that is a subset of the at-risk population. Concerns about the generalizability or the applicability of trial evidence arise primarily because of suspicion of heterogeneity of treatment effect (HTE). If the treatment effect were homogeneous, it would be applicable to everyone with an indication for treatment. Instead, patient, provider, and/or environmental characteristics can modify or mediate a treatment's effect to create heterogeneity.[2, 3] The question of a trial's applicability must be informed by a careful examination of its HTE. With significant HTE, the average treatment effect from an RCT with a representative sample may not be generalizable because the treatment effect varies substantially. Instead, the focus becomes the domain of applicability of HTE found in the trial. In other words, to complete the cycle of evidence generation and translation, we must be able to answer the question “does the evidence apply to a target sample?” For example, when a randomized controlled trial demonstrates that the benefit of a treatment depends on the patient's baseline risk (e.g. a greater benefit for patients at higher risk) but patients in community practice present for treatment with baseline risks outside the range captured by the trial, decision makers need a way to visualize the uncertainty with which the trial evidence applies to the community patients.
Current approaches to assess applicability include judgment based on the examination of a study's selection procedures and univariate patient characteristics (typically presented as in Table 1 of study results), which can be misleading because of unplanned selection and relationships between patient characteristics. Other approaches have studied the results of selection into trials with counterfactual frameworks that require strong assumptions,[4, 5] or have simply compared summary statistics.[6, 7] Cole and Stuart presented an approach based on standardization to generalize the evidence from a trial to the larger at-risk population. However, their approach does not examine HTE. We present methods to address the applicability of trial evidence to a specific target sample. We are unaware of previously proposed methods to assess applicability that compare benefit and harm and examine the heterogeneity of treatment effects in relation to both beneficial and harmful outcomes.
|Characteristic||SOLVD-R (n = 5100)||SOLVD-T (n = 2569)|
|Categorical characteristics (%)|
|Age ≥75 years*, †||16.1||5.8|
|History of diabetes mellitus||24.6||25.8|
|History of myocardial infarction||76.0||65.8|
|History of atrial fibrillation||15.1||10.8|
|History of chronic obstructive pulmonary disease*||17.7||10.0|
|History of stroke*||8.9||7.7|
|Hospitalization within last year||82.4||43.4|
|ACE inhibitor exposure*|
|Continuous characteristics, mean (SD)|
|Age (years)*, †||62.8 (12.2)||60.4 (9.9)|
|Left ventricular ejection fraction (%)*||31.9 (9.7)||24.9 (6.7)|
|Cardiothoracic ratio||0.7 (0.5)||0.6 (0.5)|
|Systolic blood pressure (mm Hg)*||128.2 (22.0)||124.9 (17.5)|
|Diastolic blood pressure (mm Hg)*||77.0 (13.0)||76.8 (10.2)|
|Resting pulse (b min−1)||82.1 (17.2)||79.9 (13.2)|
|Serum creatinine (mg/dl)*||1.3 (0.5)||1.3 (0.4)|
|Serum sodium (mEq/l)||138.1 (4.2)||139.7 (3.2)|
|Body weight (kg)||78.2 (17.7)||79.7 (16.6)|
Here we propose methods for the quantitative assessment of applicability. These methods use individual-level data from both a trial and a target population, focus on HTE, present an assessment of applicability through graphical presentation of the joint distribution of both beneficial and harmful treatment effects, and address the practical issues of measurement discrepancies and missing data that can be important for patient-centered outcomes research.
The proposed methods are demonstrated through the example of a pivotal drug trial of angiotensin-converting enzyme (ACE) inhibitors to prevent mortality in people with heart failure (HF).
The Studies of Left Ventricular Dysfunction (SOLVD) trials were designed to determine whether the ACE inhibitor enalapril reduced mortality in people with HF and a low left ventricular ejection fraction. Designed in tandem, the treatment trial (SOLVD-T, n = 2569) recruited patients with an overt history of HF, and the prevention trial (SOLVD-P, n = 4228) recruited patients with no such history. We used data from only the SOLVD-T, which recruited 6.4% of the people screened and eligible from 1986 to 1990 at 23 centers in Belgium, Canada, and the USA. As the target sample, we used a related registry (SOLVD-R, n = 5100 not in an SOLVD trial) created at 18 of the 23 centers. The SOLVD-R was created to understand the natural history of patients with a low left ventricular ejection fraction (≤0.35) or an overt history of HF and thus represents a clinical target sample. There were 26 exclusion criteria for the trials and 5 exclusion criteria for the registry, which were both approved by institutional review boards.
The outcomes followed definitions prespecified by the study designers. The benefit outcome was death or hospitalization for HF during a 1-year follow-up. The harm outcome was severe hyperkalemia (serum potassium >5.4 mEq/l) or symptomatic hypotension (unexplained syncope, dizziness, or lightheadedness). Each of the various combined outcomes provided more power than individual outcomes to examine HTE. Further, it was reasonable to combine individual outcomes because the direction of treatment effect was the same for each component. A time horizon of 1 year allowed synchronization across the trial and registry for the benefit outcome. The harm outcome events were not available in SOLVD-R (see the next section for the estimation of the risk of harm in SOLVD-R).
To identify HTE, we studied predictors of the primary event outcomes to be prevented, not one predictor at a time but rather on the basis of a multivariable risk score. We focused on validated predictors of mortality in community-dwelling adults who were older adults[13-15] or had heart failure.[16-19] Predictors of the benefit outcome were age (years), gender, smoking status (never, ever, and current), history of diabetes mellitus, left ventricular ejection fraction (continuous), cardiothoracic ratio (continuous), pulse (b min−1), diuretic use, serum creatinine (mg/dl), serum sodium (mEq/l), history of chronic obstructive pulmonary disease, and history of stroke. Additional predictors of the harm outcome were presence of dependent edema, pulmonary edema, or crackles; weight (kg); systolic blood pressure (mm Hg); and diastolic blood pressure (mm Hg). Details concerning measurement for predictors and outcomes have been published elsewhere.[9-11, 20]
When combining data from different studies, an important preliminary step is to ensure that the data are compatible (e.g. similar definitions, measurement techniques, quality). Multivariate exploratory techniques can be valuable aids here. We used principal components analysis with biplots of the minor components, which facilitates the detection of differences between the two studies that are likely due to different measurement methods. This method provides a graphical summary of correlations through the overlaid display of two plots: one for correlations among subjects and another for correlations among variables.
We considered the mechanism behind missing values for each variable and then examined the proportion of values missing for each variable as well as patterns of missing values across variables. Multiple imputation with chained equations was used with 10 imputed data sets when a desired predictor had <25% of values missing in both trial and registry. This method has the advantage of accurate representation of the variable distributions, which were allowed to differ by performing separate imputation within trial and registry.
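The measurement check described above can be sketched in a few lines. The following is a minimal illustration, assuming pooled standardized predictors and simulated data (not SOLVD data); the function name `minor_component_scores`, the study labels, and the size of the simulated shift are hypothetical, and the biplot drawing itself is omitted. The idea is that a measurement discrepancy in one variable breaks the usual correlation structure and therefore tends to separate the two studies on the smallest-variance components.

```python
import numpy as np

def minor_component_scores(X, n_minor=2):
    """Scores and loadings for the n_minor smallest-variance principal components.

    X: (n_subjects, n_variables) array of pooled trial + registry predictors.
    """
    # Standardize so the analysis works on the correlation scale.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of Vt are principal axes, ordered from largest to smallest variance.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt[-n_minor:].T      # (n_variables, n_minor)
    scores = Z @ loadings           # (n_subjects, n_minor)
    return scores, loadings

# Simulated illustration: three highly correlated variables in two studies.
rng = np.random.default_rng(0)
n = 500
base = rng.normal(size=(2 * n, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(2 * n, 1)) for _ in range(3)])
study = np.repeat([0, 1], n)        # 0 = trial, 1 = registry (hypothetical labels)
X[study == 1, 2] += 0.5             # assumed measurement shift in one variable, registry only

scores, loadings = minor_component_scores(X)
# Difference in mean minor-component scores between the two studies.
gap = np.abs(scores[study == 0].mean(axis=0) - scores[study == 1].mean(axis=0))
print("between-study gap on minor components:", gap)
```

In a biplot, the same scores would be plotted by study with the loadings overlaid as arrows, so that the variable responsible for any separation can be read off the display.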
The baseline risk of outcome has been shown in many past studies to be a meaningful and statistically powerful way to identify HTE.[26-28] The 1-year risk of outcome that is independent of treatment was estimated for both benefit and harm outcomes for each subject with a Cox proportional hazards model that included the predictors described earlier and a treatment indicator. The validity of the risk estimation model was examined through calibration plots of expected versus observed outcomes across risk quantiles. Because the harm outcome was not available in the registry, the risk model for harm from the trial was applied to estimate risk in the registry. The individual baseline risk was then used as the potential treatment effect modifier to test for HTE in the trial. Let λ refer to the hazard of outcome at time t, r the baseline risk, and T the treatment assignment in the Cox model:
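A Cox model with a risk-by-treatment interaction that is consistent with these definitions can be written as follows. The baseline hazard λ0(t) and the coefficient names β1, β2, and β3 are notational assumptions, since they are not defined in the text.

```latex
\lambda(t \mid r, T) = \lambda_0(t)\,\exp\!\left(\beta_1 r + \beta_2 T + \beta_3\, r\,T\right)
```

In this form, a test of HTE by baseline risk corresponds to a test of the null hypothesis β3 = 0.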
When there is significant HTE, we can define a heterogeneous treatment effect as a linear function of baseline risk r:
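Assuming a Cox linear predictor of the form β1 r + β2 T + β3 rT (the coefficient names are notational assumptions, not defined in the text), the treatment effect on the log hazard ratio scale is linear in r:

```latex
\log \mathrm{HR}(r)
  = \log \frac{\lambda(t \mid r, T = 1)}{\lambda(t \mid r, T = 0)}
  = \beta_2 + \beta_3\, r
```

With β3 < 0, the hazard ratio decreases as baseline risk increases, that is, the relative benefit grows with risk.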
We can visualize this linear interaction model graphically. To confirm the Cox results, we used the Follmann multivariate test of interaction, which performs a likelihood ratio test for HTE in one step without an explicit estimation of the baseline risk. An interval of homogeneity of treatment effect, C, is the interval of baseline risk in which no two treatment effects are statistically significantly different.
By design, this interval has the following desirable properties: it will always include the center of the data, where information for estimating treatment effect is maximal; it decreases as the size of interaction treatment effect increases; and it increases with decreasing significance level α. We take α = 0.05. The homogeneity interval C can be expected to have the highest applicability to a clinically exchangeable target population.
We then estimated the absolute risk reduction due to treatment for each individual using methods described by Austin and by Katki and Mark. To compare individual treatment effects across the trial and registry, we used a bivariate treatment effects scatterplot. Because this plot presents both benefit and harm outcomes with homologous scales and an aspect ratio of 1, equal values are on a 45° line. This presentation allows for an examination of the dependence of benefit and harm treatment effects for subjects in the trial and registry. R[32-34] and Stata statistical software were used for analyses.
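One way to turn a baseline risk and a risk-dependent hazard ratio into an individual absolute risk reduction is the proportional hazards identity that treated survival equals untreated survival raised to the power of the hazard ratio. The sketch below uses that identity as an assumption; it is not necessarily the exact procedure of the cited methods, and the function name and the coefficient values in the example are hypothetical.

```python
import numpy as np

def absolute_risk_reduction(r, log_hr_intercept, log_hr_slope):
    """Absolute risk reduction at baseline (untreated) 1-year risk r.

    Assumes log HR(r) = log_hr_intercept + log_hr_slope * r and
    proportional hazards, so that S_treated = S_untreated ** HR.
    """
    r = np.asarray(r, dtype=float)
    hr = np.exp(log_hr_intercept + log_hr_slope * r)
    risk_treated = 1.0 - (1.0 - r) ** hr
    # Positive values mean treatment lowers risk; negative values mean it raises risk.
    return r - risk_treated

# Hypothetical coefficients, chosen only to illustrate the calculation.
baseline_risk = np.array([0.1, 0.3, 0.5, 0.7])
arr = absolute_risk_reduction(baseline_risk, log_hr_intercept=0.0, log_hr_slope=-1.0)
print(arr)
```

For a harm outcome, where treatment raises the hazard, the same function returns negative values, and their magnitude can be read as an absolute risk increase when plotting benefit against harm on a common scale.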
Planned selection criteria for the trial and registry are presented in Table A1.
Comparing subject characteristics one at a time (Table 1), we found that the prevalence of risk factors for mortality with HF was lower in the trial than in the registry. For example, SOLVD-T had a lower proportion of older participants (5.8% vs 16.1%). There were fewer women in SOLVD-T than in SOLVD-R (19.6% vs 28.8%). ACE therapy was used by 37.9% of people in the registry.
We found no evidence of discrepancy in the measurement of predictors between the two studies. There were six variables with 1% to 5% missing values, six variables with 6% to 15% missing values, and no variables with 16% to 25% missing values in the trial or registry, although some desired variables were completely missing in the trial or registry. Imputed values were inspected, and between-imputation variability was acceptable. The calibration plot displayed apparent validity for the baseline risk models (Figure A1).
Figure 1 presents the estimated treatment effects on benefit and harm outcomes. For the treatment benefit outcome, there was significant HTE, as treatment effect varied significantly according to baseline risk. There was increasing benefit with increasing risk (Figure 1; interaction term p-value 0.001). The hazard ratio for treatment effect for benefit ranged from 0.85 to 0.26. For treatment harm outcome, there was no significant HTE (Figure 1). The Follmann procedure agreed closely with the standard interaction models (interaction p-value 0.001 for the benefit outcome).
For the benefit outcome, the homogeneity interval comprised baseline risk values between 0.35 and 0.50, and the treatment effect (hazard ratio) in this interval ranged from 0.59 to 0.48 (Figure 2). In the target sample, 16.4% had a value of baseline risk in the homogeneity interval for the benefit outcome. For the harm outcome, because there was no HTE (at α = 0.05), the homogeneity interval spanned the whole range of observed baseline risk (0.35–0.91). In the target sample, 99.8% had a value of baseline risk in the homogeneity interval for the harm outcome.
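As a rough consistency check, the hazard ratios reported at the two ends of the benefit homogeneity interval (0.59 at a baseline risk of 0.35 and 0.48 at 0.50) imply an intercept and slope if the log hazard ratio is assumed to be linear in baseline risk. The sketch below back-calculates them. The results are only approximate because the published values are rounded, and the function names are illustrative assumptions.

```python
import math

def log_hr_line(r_low, hr_low, r_high, hr_high):
    """Intercept and slope of log HR(r) through two (risk, hazard ratio) points."""
    slope = (math.log(hr_high) - math.log(hr_low)) / (r_high - r_low)
    intercept = math.log(hr_low) - slope * r_low
    return intercept, slope

# Endpoints of the benefit homogeneity interval as reported in the text.
intercept, slope = log_hr_line(0.35, 0.59, 0.50, 0.48)

def hazard_ratio(r):
    """Hazard ratio implied by the fitted line at baseline risk r."""
    return math.exp(intercept + slope * r)

def risk_at_hazard_ratio(hr):
    """Baseline risk at which the fitted line gives hazard ratio hr."""
    return (math.log(hr) - intercept) / slope

print("intercept:", intercept, "slope:", slope)
print("risk where HR = 0.85:", risk_at_hazard_ratio(0.85))
print("risk where HR = 0.26:", risk_at_hazard_ratio(0.26))
```

Under this assumed line, the reported extremes of the hazard ratio (0.85 and 0.26) map to baseline risks inside the unit interval, which is what a linear model on the log hazard ratio scale would require.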
The overlap of estimated individual treatment effects for the trial and registry is displayed in the bivariate treatment effects scatterplot (Figure 3). Overall, the trial evidence seemed to apply to most people in the registry, as was suggested by the overlapping distributions of estimated benefit and harm. However, there were people in the registry estimated to have lower benefit than anyone in the trial, suggesting low applicability for people with a low likelihood of benefit. Interestingly, there were a few people in the registry estimated to have higher harm than anyone in the trial, but the lack of overlap occurred to a lesser degree than for benefit, and there was overall a similar distribution of estimated harm in the trial compared with the registry. More trial than registry participants (52% vs 33%, respectively) were estimated to have a net benefit (i.e. the individual benefit estimate is greater than the harm estimate) on the basis of risks alone, i.e. without incorporating patient preferences. Similar proportions of trial and registry participants (8% and 11%, respectively) were estimated to have strong net beneficial treatment effect (i.e. the individual estimate of benefit was more than twice the estimate of harm). However, 14% of trial and 36% of registry participants were estimated to have strong net harmful treatment effect (i.e. the individual estimate of harm was more than twice the estimate of benefit).
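The three categories used above follow directly from the paired individual estimates. A minimal sketch, assuming that benefit is expressed as a positive absolute reduction in risk of the benefit outcome and harm as a positive absolute increase in risk of the harm outcome; the function names and the example values are hypothetical, not SOLVD estimates.

```python
import numpy as np

def classify_net_effect(benefit, harm):
    """Boolean flags per individual, using the definitions given in the text."""
    benefit = np.asarray(benefit, dtype=float)
    harm = np.asarray(harm, dtype=float)
    return {
        "net_benefit": benefit > harm,             # benefit exceeds harm
        "strong_net_benefit": benefit > 2 * harm,  # benefit more than twice harm
        "strong_net_harm": harm > 2 * benefit,     # harm more than twice benefit
    }

def net_effect_proportions(benefit, harm):
    """Proportion of individuals falling in each category."""
    flags = classify_net_effect(benefit, harm)
    return {name: float(flag.mean()) for name, flag in flags.items()}

# Hypothetical individual estimates for four people.
benefit = [0.10, 0.05, 0.02, 0.08]
harm = [0.04, 0.04, 0.06, 0.08]
props = net_effect_proportions(benefit, harm)
print(props)
```

Applying the same function separately to trial and registry participants gives the paired percentages that are compared in the text. Note that the categories overlap: everyone with a strong net benefit also has a net benefit.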
We have proposed quantitative methods for assessing the applicability of trial evidence to a target group in the presence of HTE. These methods may guide inferences about the probability of seeing the same effects under the same treatment in the target population as were seen in the trial. These methods focus on clinical exchangeability and HTE to guide judgment concerning applicability inferentially. To this end, we present estimates of treatment benefit and harm as effects on an absolute scale (e.g. probability of event over 1 year). We propose the joint visual presentation of the distribution of clinically interpretable benefit and harm outcomes as a fundamental and flexible basis for considering applicability.
Quantitative methods to assess applicability should ideally use information from both a trial and a target population. This feature can raise concerns regarding measurement discrepancies and differences in missing values between the two studies. We also address these practical challenges. Because trials can never be large enough to represent all important target populations, methods to assess applicability are of great importance in both policy making and clinical applications.
The greatest threat to the applicability of a trial occurs when participants are selected according to characteristics that create HTE.[36, 37] Notably, a few approaches have wrestled with the challenges of pooling individual-level data from more than one study design to assess clinical exchangeability and HTE. A proposal for cross-design synthesis attempted to draw on the respective strengths of trial and observational data by examining them side by side; however, it did not put forth statistical methods to estimate treatment effects in the observational study. The confidence profile method uses a Bayesian framework to combine evidence with special attention to study designs, but it uses only summary data.
The proposed approach examines the applicability of trial evidence according to both benefit and harm dimensions. In the proposed approach, benefit and harm are presented on the same fundamental scale (risk, or probability of event over 1 year). In addition, the correlation between the risks of benefit and harm is examined. In this example, risks of benefit and harm are positively and moderately correlated, but in another example, they could be negatively and more weakly or strongly correlated. In the presence of HTE, it is particularly important to examine clinical effects in individuals who can experience benefit and no harm, benefit and harm, no benefit and harm, and no benefit and no harm.
We focus on applicability with the goal of answering two questions fundamental for clinical decision making. First, did the trial include people who would be estimated to have the same benefit or harm as the target group or person? Second, what are accurate estimates of how treatment would change the risks of the benefit and harm outcomes? In addition, we identify a homogeneity interval with the highest degree of applicability, corresponding to values of treatment effect modifier in which no one in the trial was observed to have a different treatment effect. When there is HTE, this interval will not include the entire trial. We have not attempted to place weights on benefit and harm outcomes because studies have shown that these weights vary greatly within a patient group, and the value a person places on a given outcome is sensitive to the way information is presented.[41, 42] However, when a person's preferences are known, especially when many outcomes are being considered simultaneously, a variety of quantitative approaches could be used to summarize across outcomes for both the care provider and the patient.
In our example, we found no evidence of measurement discrepancy. This is likely because the SOLVD studies were designed together and with shared protocols, which will not be the case for many comparative effectiveness research studies. To fulfill the promise of cross-design synthetic methods, it is necessary to develop techniques to project valid treatment effects onto people who are in the target population but not in the trial. To our knowledge, ours is the first report on HTE according to the baseline risk of outcome in SOLVD-T. The original trial report conducted four pre-specified subgroup analyses: tertiles of serum sodium levels and ejection fraction, etiology of heart failure and NYHA classes. There was a suggestion of interaction by ejection fraction such that the tertile with the highest ejection fraction did not benefit (interaction p-value = 0.03).
There are limitations to these findings. The feasibility of the proposed methods for a given example will depend on the type of data available for both the trial and the target population. The methods proposed are data hungry and require individual-level data for both the trial and the target individual or target group. In most cases, some important information will not be measured equivalently in both trial and target population. “Sharp” description using many patient characteristics (e.g. a multivariate risk score) is needed to strongly support clinical care for individuals or a well-defined group because the potential for misclassification vis-à-vis clinical applicability is great when only aggregate data are available for “blunt” comparisons using one or two characteristics (e.g. older women). With this in mind, we believe the proposed methods can be used in a variety of scenarios such as comparing one or more trials to one or more registries or observational studies. There can also be substantial uncertainty about individual risk estimates because of methodological differences in addition to sampling uncertainty. There is a large body of literature demonstrating that the baseline risk of the outcome is often a potent treatment effect modifier on the relative risk scale.[26-28, 46-50] For this reason, we propose the baseline risk of the outcome as a strong candidate for examining applicability when a small number of treatment effect modifiers are prespecified. However, any variable found to be a strong treatment effect modifier could be used in place of baseline risk for the proposed method. Finally, we have not examined other factors that can affect applicability more broadly, such as provider characteristics and healthcare environment.
In summary, we have proposed quantitative methods for assessing the applicability of trial evidence to a target sample. The proposed methods exploit the advantages of a trial with respect to valid treatment effects and the advantages of observational data with respect to broad representation of the target population. Our methods also attempt to directly address the practical challenges of using information from more than one study design.
The authors declare no conflict of interest.
This project was funded under contract no. HHSA29020050034-I-TO5 from the Agency for Healthcare Research and Quality, US Department of Health and Human Services, as part of the Developing Evidence to Inform Decisions about Effectiveness (DEcIDE) program. The authors of this report are responsible for its content. Statements in the report should not be construed as endorsement by the Agency for Healthcare Research and Quality or the US Department of Health and Human Services. This manuscript was prepared using research materials from the SOLVD obtained from the NHLBI Biologic Specimen and Data Repository Information Coordinating Center and does not necessarily reflect the opinions or views of the SOLVD or the NHLBI. Dr Weiss was also supported by the Robert Wood Johnson Foundation's Harold Amos Medical Faculty Development Program.
|SOLVD-R, n = 6273; 5100 not in a trial||SOLVD-T, n = 2569|
|Age ≥ 21 years (max = 95 years)||21 ≤ Age ≤ 80 years|
|(≥ 65, n = 2783)||(≥ 65, n = 936)|
|“Screened from the following two types of sources: (i) echocardiographic, radionuclear, and cardiac catheterization laboratories to identify those with an ejection fraction ≤ 45% and/or (ii) hospital discharge records to identify those with a clinical diagnosis of heart failure (confirmed by radiologic evidence or signs of pulmonary venous congestion (basal or perihilar vascular blurring, Kerley B lines, alveolar or pulmonary edema, or pleural effusion secondary to congestive heart failure).”||LVEF ≤ 0.35 and overt heart failure, that is, have currently or have had in the past clear clinical evidence of congestive heart failure and who currently require treatment with diuretics or inotropic drugs or vasodilators, or a combination of these, for symptomatic relief|
|“Most patients (78.9%) were eligible on the basis of ejection fraction only, 13.8% on the basis of heart failure, and 7.3% on both.”|
|Compliance with run-in period of taking enalapril for 2–7 days then placebo for 2 weeks|
|History of intolerance to enalapril|
|Already receiving an ACE-I and unwilling to discontinue|
|Myocardial infarction, cardiac surgery, percutaneous transluminal coronary angioplasty or balloon valvuloplasty within 7 days (17.1%)||Myocardial infarction in the last 30 days|
|Hemodynamically significant primary valvular or outflow tract obstruction (mitral valve stenosis, aortic valve stenosis, asymmetric septal hypertrophy, malfunctioning prosthetic valve)|
|Nonvalvular congenital heart disease (1.3%)||Complex congenital heart disease|
|Syncopal episodes presumed to be due to life-threatening arrhythmias (asymptomatic cardiac arrhythmia including ventricular tachycardia are not an exclusion criterion)|
|Any prospective participant in whom cardiac surgery including transplantation is likely in the near future (e.g. participant's name is on cardiac transplant list). In particular, if a potential participant is likely to need CABG surgery in the immediate future, he or she should be excluded but can be reassessed for eligibility after surgery|
|Unstable angina pectoris (defined as angina at rest) or severe stable angina (more than an average of two attacks per day) despite treatment|
|Uncontrolled hypertension at the time of randomization (uncontrolled blood pressure is defined as systolic blood pressure >14 C1 mm Hg and diastolic blood pressure >95 mm Hg)|
|Cor pulmonale (right ventricular failure secondary to pulmonary disease).|
|Advanced pulmonary disease (forced expiratory volume in first second (FEV1)/forced vital capacity (FVC) 550, peak expiratory flow rate <200 ml/s, FVC 60% of predicted)|
|Major neurologic diseases that could lead to early death (i.e. Alzheimer's disease, advanced Parkinson's disease)|
|Cerebrovascular disease (e.g. significant carotid-artery stenosis) that could potentially be complicated or rendered unstable by administration of an ACE inhibitor. (Prospective participants who may be at increased risk for stroke should their blood pressure decrease excessively. The mere presence of a carotid bruit need not in itself exclude participants.)|
|Collagen vascular disease other than rheumatoid arthritis (i.e. systemic lupus erythematosus, polyarteritis nodosa, scleroderma)|
|Suspected significant renal artery stenosis|
|Renal failure (i.e. creatinine >2.5 mg/dl or dialysis patients)|
|Malignancies, except for surgically cured skin cancer, carcinoma in situ, or 5 years free of disease after the diagnosis of solid tumors|
|Requirement for immunosuppressive therapy (the use of steroids for non-life-threatening diseases such as arthritis is not an exclusion)|
|Significant primary liver disease|
|Lack of reliable means to contact patient for follow-up, or lack of adequate medical records (24.6%)||Likelihood of a prospective participant being non-adherent due to chronic alcoholism, lack of a fixed address, drug addiction, etc.|
|Any non-cardiac life-threatening disease likely to significantly shorten survival (36.8%)||Other life-threatening disease or prospective participant who is not realistically expected to be discharged alive from the hospital|
|Pregnant woman or woman of child-bearing potential who is not protected from pregnancy by any method|
|Participant who is simultaneously receiving other investigational drug protocols (other than for compassionate use)|
|Inability to give informed consent (patients who died in the hospital during the qualifying visit could be enrolled posthumously, if allowed by the local review board) (20.2%)||Failure to give consent|
This calibration plot shows observed risks (from Kaplan–Meier analyses within quartiles of predicted risk) compared with predicted risks (from the Cox models used to estimate individual risk). Geometric means within each bin, that is, quartile of predicted risk, are plotted. Perfect agreement occurs on the 45° line. The locations of risk distribution can also be compared.
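The binning behind such a calibration plot can be sketched as follows. This is a simplified illustration on simulated data: it assumes complete 1-year follow-up, so the observed risk in each quartile is a simple event proportion rather than a Kaplan–Meier estimate, and the function name is hypothetical.

```python
import numpy as np

def calibration_by_quartile(predicted, event):
    """Geometric mean predicted risk and observed event proportion per quartile.

    Simplification: assumes no censoring, so observed risk is a plain proportion
    instead of a Kaplan-Meier estimate.
    """
    predicted = np.asarray(predicted, dtype=float)
    event = np.asarray(event, dtype=float)
    edges = np.quantile(predicted, [0.25, 0.5, 0.75])
    bins = np.digitize(predicted, edges)    # bin index 0..3 for each subject
    rows = []
    for b in range(4):
        in_bin = bins == b
        geo_mean_pred = np.exp(np.log(predicted[in_bin]).mean())
        observed = event[in_bin].mean()
        rows.append((geo_mean_pred, observed))
    return np.array(rows)                   # columns: predicted, observed

# Simulated, perfectly calibrated risks (not SOLVD data).
rng = np.random.default_rng(1)
predicted = rng.uniform(0.05, 0.9, size=4000)
event = rng.random(4000) < predicted
table = calibration_by_quartile(predicted, event)
print(table)
```

Plotting the second column against the first, with a 45° reference line, gives the kind of display described in the caption. With censored follow-up, the observed column would be replaced by one minus the Kaplan–Meier survival estimate at 1 year within each quartile.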