Although randomized controlled trials (RCTs) require a great deal of time, money, and effort, the majority of them have failed to verify their a priori hypotheses. Therefore, the intention in the current study was to clarify the elements that differentiate studies with ‘positive’ outcomes from those with ‘negative’ outcomes.
The authors performed a comprehensive search of RCT reports on treatments for hematologic malignancies published between 1995 and 2004, with 264 reports eventually identified. The expected rate and the observed rate for the primary endpoint were compared for 70 studies with all relevant information available.
Of all the superiority trials (n = 256), positive studies accounted for 33%. Most of the major study characteristics were not found to be associated with the study outcome except for the primary endpoint. Studies evaluating event-free survival were more likely to report positive results than were those evaluating overall survival (P = .061). For the experimental treatment arm, the mean difference between the expected and observed rates was −10.1% (standard deviation [SD], 10.1%) in the negative studies, which indicates a rate lower than expected, and was 1.3% (SD, 9.2%) in the positive studies (P < .0001). In contrast, the difference was not statistically significant for the standard treatment arm, for which the mean difference was 6.3% (SD, 10.7%) in the negative studies and 3.0% (SD, 9.0%) in the positive studies (P = .1885). The journal impact factor was statistically significantly higher for the positive than for the negative reports (P < .0001).
In cancer therapeutics, the established method for the development of new agents or regimens is to proceed in a stepwise fashion, through phase 1, phase 2, and phase 3 studies.1 In a phase 3 study, the final stage of this process, a new treatment is compared with the existing standard treatment in the form of a randomized controlled trial (RCT). Because of their ability to minimize bias, RCTs are considered the most reliable method for assessing a new treatment, and it is widely accepted that data obtained from well-designed RCTs are the most definitive. Encouraging results of a new treatment in phase 1 and 2 trials do not necessarily guarantee its superiority over the standard treatment, and whether it is better or not will only be known after completion of the relevant RCT.
On the other hand, an RCT generally requires a large number of patients.2 Because a relatively small sample size would allow for the detection of only major differences, hundreds or even thousands of patients may need to be recruited to ensure adequate statistical power to detect minor but important differences between experimental and standard treatments. In addition, several years of work, huge financial costs, and strenuous efforts are also required.3, 4 However, despite such major investments, a substantial proportion of RCTs have produced ‘negative’ results,5–7 which means that it has not been possible to demonstrate statistically significant superiority of new treatments over standard treatments.
Hematologic malignancies are unique types of cancer in that chemotherapy with or without radiotherapy is the main treatment for nearly all patients because surgical therapy is not indicated, with a few exceptions. Over the years, the development of new agents or therapeutic regimens has thus been vigorously pursued in this field. We therefore decided to perform a retrospective cohort study of RCT reports evaluating treatments for leukemia, lymphoma, myeloma, and their related diseases published during the decade between 1995 and 2004. The major objective of this study was to describe factors associated with the study outcomes.
MATERIALS AND METHODS
Selection of Studies
RCT reports on treatments for hematologic malignancies published between 1995 and 2004 as original articles and written in English were eligible for inclusion. When multiple reports were published for a single trial, the report that addressed the primary endpoint of the study was selected. If the primary endpoint was not specified explicitly, overall survival (OS) was substituted for it. Trials focusing on biologic agents such as interferon were excluded.
The selection process was independently implemented by 2 hematologists (M.Y. and H.N.), and disagreements were resolved by discussion after each step. The initial literature search was conducted through MEDLINE by using the free text search term: (leukemia OR lymphoma OR myeloma), with the publication date limited to the period between 1995 and 2004, the type of article to Randomized Controlled Trial, the language to English, and the subjects to Humans. The search yielded 1238 reports, 625 of which were excluded by screening their titles. Both reviewers reviewed the abstracts of the remaining 613 articles, and 290 articles were retrieved in full for further consideration. To reach a final decision regarding which articles to include in the analysis, we examined all the candidate articles in detail, which resulted in the further exclusion of 26 articles because they were not phase 3 studies, evaluated biologic agents, were duplicate publications, or for other reasons, so that 264 reports were finally selected for this study.
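The selection counts reported above can be checked with simple arithmetic. The following is only a bookkeeping sketch in Python; the variable names are illustrative, and the number of articles excluded at the abstract stage is derived here rather than stated in the text.

```python
# Counts reported in the selection paragraph above.
identified = 1238                # reports returned by the MEDLINE search
excluded_by_title = 625          # excluded by screening titles
retrieved_full_text = 290        # retrieved in full after abstract review
excluded_after_full_review = 26  # excluded after detailed examination

# Derived quantities (the abstract-stage exclusion count is not
# stated in the text; it follows from the other numbers).
abstracts_screened = identified - excluded_by_title
excluded_by_abstract = abstracts_screened - retrieved_full_text
included = retrieved_full_text - excluded_after_full_review

print(abstracts_screened, excluded_by_abstract, included)
```

The derived totals agree with the 613 abstracts and 264 included reports given in the text.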
To avoid errors in the data abstraction process, the 2 reviewers (M.Y. and H.N.) independently abstracted the data from the articles and subsequently compared the results. All data were checked for internal consistency and consensus was achieved for any discrepancies. Data abstracted from the articles included the disease name, type of subjects (children, adults, or both), study location, accrual period, year of publication, number of randomized subjects, primary endpoint, study design (superiority trial or noninferiority trial), study outcome, and whether the study was discontinued prematurely. Information regarding a priori sample size calculation was supplied in 117 reports (44%), from which the following data were also abstracted: expected and observed rates in terms of the primary endpoint for the experimental and control arms, and planned sample size. The rates of agreement in abstracted data between reviewers were in the range of 89% to 100% (median, 96%).
For each trial the experimental and the control arms were identified on the basis of information provided by the background section of the articles. A study designed for difference was classified as positive or negative according to whether the primary endpoint was met. The study outcome was considered positive if a statistically significant difference in favor of the experimental arm was shown for the primary endpoint. If the primary endpoint of a study was not explicitly specified, the study was considered positive if the experimental arm was significantly superior to the standard arm in terms of OS. With regard to the primary endpoint, disease-free survival and event-free survival (EFS) were grouped together as EFS. If >1 randomization was included in a study, the study outcome was determined by the comparison on which the sample size calculation was based. The journal impact factor was determined as that of the publishing journal in the year of publication according to Journal Citation Reports by the Institute for Scientific Information (Thomson Scientific, Philadelphia, Penn). The study location was assumed to be the same as the address of the first author.
To evaluate the association between a given study characteristic and the study outcome, logistic regression analysis was performed and the odds ratio (OR) was calculated in conjunction with the 95% confidence interval (95% CI). Factors analyzed for the association were disease, study location, type of subjects, number of randomized patients, duration of enrollment, and primary endpoint. Differences in rate in terms of the primary endpoint were calculated by subtracting the expected rate from the observed rate, and comparisons were made between the positive and negative studies by using 2-sided Student t tests. The impact of differences in rate on the study outcome was analyzed in 2 separate univariate logistic regression models, in which the difference in rate was treated either as a continuous variable or as a categorical variable (less than −5%, −5% to 5%, and >5%). The journal impact factors for the positive and negative reports were also compared by 2-sided Student t tests. All statistical analyses were conducted using Stata software (version 8; StataCorp, College Station, Tex).
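As an illustration of how the difference in rate is defined and grouped, the following Python sketch computes the observed rate minus the expected rate and assigns it to 1 of the 3 categories. The function names and the example rates are hypothetical, and the handling of values exactly equal to −5% or 5% is an assumption, because the text does not state how boundary values were treated.

```python
def rate_difference(observed: float, expected: float) -> float:
    """Observed minus expected rate, in percentage points.

    A negative value means the observed rate fell short of the
    rate assumed at the design stage (ie, overestimation).
    """
    return observed - expected


def categorize(diff: float) -> str:
    """Assign a difference in rate to one of the 3 analysis categories.

    Assumption: the middle category includes both boundaries.
    """
    if diff < -5:
        return "< -5%"
    if diff <= 5:
        return "-5% to 5%"
    return "> 5%"


# Hypothetical trial: 60% expected, 48% observed in the experimental arm.
diff = rate_difference(observed=48.0, expected=60.0)
print(diff, categorize(diff))
```

In this hypothetical case the experimental arm was overestimated by 12 percentage points, which places the trial in the lowest category.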
Trends and Characteristics of the RCT Reports
Characteristics of the 264 reports and annual trends are shown in Table 1 and Figure 1. The number of reports per year did not increase during the decade (Fig. 1A). Reports from the U.S. accounted for 26% (n = 68) of the total, followed by France (n = 36), Italy (n = 30), Germany (n = 26), and the U.K. (n = 25). A significant longitudinal trend was not evident with regard to the disease, type of subjects, or study location during the decade (Fig. 1B-D). The number of randomized patients was <100 in 49 reports, 100 to 199 in 83 reports, 200 to 499 in 86 reports, 500 to 999 in 34 reports, and >1000 in 12 reports. The primary endpoint of a study was not specified in 110 reports (42%), and information regarding sample size calculation was not provided in 147 reports (56%). Three studies were identified as noninferiority studies, and an additional 5 studies in which the study design was not explicitly stated in the text were also considered to be noninferiority studies. After the exclusion of these 8 studies, 84 of the remaining 256 reports (33%) were determined to be positive. A total of 36 studies (14%) were discontinued earlier than planned because of early results in favor of 1 treatment over another (n = 24), slow accrual (n = 6), safety issues (n = 2), withdrawal of the sponsor (n = 1), and unspecified reasons (n = 3).
Table 1. Characteristics of the Randomized Controlled Trial Reports Included in the Analysis
n = 264
OS indicates overall survival; EFS, event-free survival.
Studies designed to test noninferiority were excluded.
Study characteristics listed in Table 1 were examined for their association with study outcomes (Table 2). Neither disease, study location, type of subjects, number of randomized patients, nor duration of enrollment was found to be associated with the study outcome, although for the primary endpoint, studies evaluating EFS were more likely to report positive results than were those evaluating OS (P = .061). Full information regarding the rates in terms of the primary endpoint (ie, the expected rate and the observed rate in the experimental and the control arms) was available for 70 studies. Among 115 superiority studies in which a priori sample size calculation was stated, 43 did not provide expected rates, and 2 did not provide observed rates. Figure 2 compares the expected and observed rates according to the study outcome. For the experimental arm, the observed rates fell below the expected rates in 91% of the negative studies and 41% of the positive studies (Fig. 2A). For the control arm, conversely, the observed rates were lower than expected in only 28% of the negative studies and in 31% of the positive studies (Fig. 2B). The mean difference between the observed and expected rates for the experimental arm was −10.1% (standard deviation [SD], 10.1%) in the negative studies, which indicates rates lower than expected, and 1.3% (SD, 9.2%) in the positive studies (P < .0001). In contrast, the difference was not statistically significant for the control arm, for which the mean difference was 6.3% (SD, 10.7%) in the negative studies and 3.0% (SD, 9.0%) in the positive studies (P = .1885). Table 3 shows the impact of differences between expected and observed rates on the study outcome. When the rate in the experimental arm was overestimated by >5%, the study outcome was 7.2 times more likely to be negative, whereas no such association was observed for the control arm.
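For a single binary factor, the OR from a univariate logistic regression equals the cross-product ratio of the corresponding 2 × 2 table, and the Wald 95% CI equals the interval obtained from the standard error of the log OR. The Python sketch below shows that calculation. The function name and the cell counts are hypothetical and are not taken from this study; they are chosen only to show how an OR of >1 for a negative result would be obtained.

```python
import math


def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Odds ratio and Wald 95% CI from a 2 x 2 table.

    a: factor present, negative study   b: factor present, positive study
    c: factor absent,  negative study   d: factor absent,  positive study
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR); requires all 4 cells to be nonzero.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: experimental-arm rate overestimated by >5% (yes/no)
# versus study outcome (negative/positive).
or_, lo, hi = odds_ratio_ci(a=20, b=5, c=10, d=15)
print(f"OR = {or_:.1f}, 95% CI {lo:.2f} to {hi:.2f}")
```

An interval whose lower bound exceeds 1 would indicate that the factor is associated with a negative result at the 5% level.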
Table 2. Correlation Between Study Characteristics and the Results
OR indicates odds ratio; 95% CI, 95% confidence interval.
Difference in rate was calculated by subtracting the expected rate from the observed rate, so that the low value (eg, <−5%) suggests overestimation, and the high value suggests underestimation.
An OR of >1 indicates that the factor is associated with a negative result.
The journal impact factor was statistically significantly higher for the positive reports (mean, 10.8; SD, 9.8) than for the negative reports (mean, 6.3; SD, 5.5) (P < .0001).
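The impact factor comparison can be approximately reproduced from the reported means and SDs with a pooled-variance (Student) t statistic. The Python sketch below assumes that impact factors were available for all 84 positive and 172 negative superiority reports; these group sizes are an assumption, because the text does not state them for this comparison. The P value is approximated with the standard normal distribution, which is reasonable only because the degrees of freedom are large.

```python
import math
from statistics import NormalDist


def student_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled-variance (Student) t statistic from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df


# Means and SDs from the text; group sizes are ASSUMED (84 positive,
# 172 negative superiority reports).
t, df = student_t_from_summary(10.8, 9.8, 84, 6.3, 5.5, 172)

# Two-sided P value by normal approximation (df is large).
p_approx = 2 * NormalDist().cdf(-abs(t))
print(f"t = {t:.2f}, df = {df}, approximate P = {p_approx:.1e}")
```

Under these assumptions the statistic is about 4.7 with 254 degrees of freedom, which is consistent with the reported P < .0001.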
The RCT is undoubtedly 1 of the most reliable methods to evaluate the effectiveness of a new treatment; however, it requires a great deal of time, money, and effort, as well as participation of large numbers of patients.3, 4 The majority of RCTs are designed to detect the difference between a standard and an experimental treatment on the basis of an a priori hypothesis that the latter produces an outcome superior to that of the former. Nevertheless, the majority of RCTs fail to verify the hypothesis.5–7 This prompted us to investigate which elements differentiate studies with positive results from those with negative results. We attempted to address this issue by analyzing RCTs of treatments for hematologic malignancies published during the decade between 1995 and 2004.
Of all the superiority trials we identified, positive studies accounted for 33%. Given that the new treatments compared with the standard treatments in these RCTs must have been selected to proceed to phase 3 on the basis of promising results, this rate is unexpectedly low. Most of the major study characteristics including disease, study location, type of subjects, and sample size were not associated with the study outcome. Although the number of comparisons performed might not be large enough to draw a definitive conclusion, studies for which the primary endpoint was EFS appeared to report positive results more frequently. This finding is in accordance with that reported by Kumar et al.8 They performed a meta-analysis of 126 RCTs conducted by the Children's Oncology Group and indicated that new treatments were deemed to be superior to standard treatments in studies assessing EFS as the primary endpoint, but this was not the case with studies assessing OS. A possible explanation for these findings could be treatment after recurrence. If effective salvage therapy is administered after recurrence or primary treatment failure, the difference in OS between the treatments compared is partly diluted, which would reduce the observed difference between the 2 treatment protocols.
Figure 2 clearly demonstrates that the difference between the expected and observed rates in terms of the primary endpoint was particularly notable in the experimental arm of the negative studies, in which the observed rate was lower than expected by an average of 10.1%. It can be assumed that this overestimation contributed to many studies eventually producing negative results. Although this should be regarded as a preliminary finding because of the small sample size (n = 70), it is in a sense consistent with the report by Zia et al.9 To investigate whether results of phase 2 studies could be reproduced in subsequent phase 3 studies, they used data from 43 phase 3 studies and 49 preceding phase 2 studies and compared each of the response rates attained with identical chemotherapeutic regimens. They demonstrated that 81% of phase 3 studies had lower response rates than their preceding phase 2 studies, with a mean difference of 12.9%. Their findings and ours suggest that the design of a phase 3 RCT requires careful attention to estimating the effect of the experimental therapy and to deciding which new treatments should proceed to RCT. Recently, attempts have been made to construct a statistical model to provide assistance in selecting lung cancer chemotherapy regimens suitable for phase 3 RCTs.10, 11 Because development of an effective screening process would result in considerable savings of resources, it deserves further investigation.
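To illustrate why overestimating the effect of the experimental therapy matters at the design stage, the following Python sketch sizes a hypothetical 2-arm trial and then recomputes its power when the true experimental rate is lower than assumed. All rates here are hypothetical and are not taken from any trial in this study. The endpoint is simplified to a binomial proportion at a fixed time point with a normal approximation, whereas trials with time-to-event endpoints would ordinarily use a calculation based on the log-rank test.

```python
import math
from statistics import NormalDist

norm = NormalDist()


def n_per_arm(p_control, p_experimental, alpha=0.05, power=0.80):
    """Patients per arm to compare 2 proportions (normal approximation)."""
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_beta = norm.inv_cdf(power)
    variance = p_control * (1 - p_control) + p_experimental * (1 - p_experimental)
    delta = p_experimental - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta**2)


def achieved_power(p_control, p_experimental, n, alpha=0.05):
    """Approximate power of a 2-sided test with n patients per arm."""
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    variance = p_control * (1 - p_control) + p_experimental * (1 - p_experimental)
    se = math.sqrt(variance / n)
    return norm.cdf(abs(p_experimental - p_control) / se - z_alpha)


# Hypothetical design: 50% in the control arm, 60% expected with the
# experimental treatment.
n = n_per_arm(0.50, 0.60)
design_power = achieved_power(0.50, 0.60, n)

# Same trial if the true experimental rate is only 55%.
actual_power = achieved_power(0.50, 0.55, n)
print(n, round(design_power, 2), round(actual_power, 2))
```

In this hypothetical example, a trial sized for 80% power retains well under half of that power when the experimental rate is overestimated by 5 percentage points, so a negative result becomes the most likely outcome even though the new treatment is somewhat better.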
Another limitation of this study is a possible bias in selecting trials, because only articles published in English were included in the analysis. It has been suggested that trials with positive results are more likely to be published than those with negative results,12–16 or more likely to be published in the English language.17 For this reason, the proportion of positive trials determined in this study could be higher than the true proportion among all trials conducted. It cannot be denied that the same kinds of bias may also have affected our findings when comparing the positive and negative studies, although to a lesser extent, if at all. For many of the problems associated with publication bias, prospective registration of clinical trials is expected to provide a solution.18, 19
To summarize, we reviewed RCT reports concerning treatments for hematologic malignancies published in English between 1995 and 2004. New treatments were found to be superior to standard treatments in only one-third of all the superiority trials we identified, and studies assessing EFS as the primary endpoint tended to report a positive result. Despite several limitations mentioned above, the finding that negative studies were characterized by a lower than expected observed rate in terms of the primary endpoint in the experimental arm may indicate that giving adequate consideration to the effect of an experimental therapy could be critical when planning an RCT.
We thank Dr. Elihu H. Estey (The University of Texas M. D. Anderson Cancer Center) for critical review of the article.