In recent years, remarkable advances in the treatment of rheumatic diseases have occurred. For diseases for which efficacious treatments already existed, new therapies that were either more effective or safer were tested and introduced. For diseases or disease manifestations for which there was no efficacious therapy, new treatments have been developed and their efficacy demonstrated. In the area of rheumatic and musculoskeletal diseases, examples of such advances include the emergence of tumor necrosis factor α (TNFα) inhibitors for the treatment of rheumatoid arthritis (RA) and spondylarthropathies and of bisphosphonates for the treatment of osteoporosis. Convincing evidence of the efficacy of these agents came from high-quality and generally large-scale randomized trials.
Although other breakthrough therapies will emerge, new treatments for rheumatic diseases are more likely to offer only modest advantages in efficacy or safety over those already available. Questions loom about where a new treatment fits into disease management. Major questions include the following: 1) If the treatment is efficacious, how much more efficacious is it than placebo? 2) How do this treatment's efficacy and side effect profile compare with those of other treatments? (Because efficacy may be thought of as a treatment's benefit and side effects as its risk, this comparison may be considered the risk/benefit ratio of a new treatment.) 3) When combined with current therapy, does the new treatment add to overall treatment efficacy?
Differences in efficacy between treatments are generally smaller than differences in efficacy between treatment and placebo, and adverse events are rare in short-term trials. Thus, answers to questions about the comparative efficacy and safety of treatments are often unknown at the time of treatment introduction and remain uncertain thereafter.
To answer most, but not all, of these questions, small differences between treatments in either efficacy or safety must be detected and accurately measured (1). Small differences between 2 treatments in the ability to improve joint pain in osteoarthritis (OA) are noticeable to many patients, leading most to prefer the modestly stronger treatment (2). Although quantifying small effects may seem unsatisfying, cumulative small effects produce large ones. The cumulative small effects of multiple efficacious treatments may constitute one explanation for the general improvement over time in RA treatments. Small effects can also be at issue in distinguishing potent treatments from those that are moderately effective (e.g., in RA, the difference in efficacy between TNFα inhibitors and methotrexate might be regarded as a small effect).
Contemporaneous with the increasing need to assess small treatment effects has come a blizzard of published data from randomized controlled trials, only some of which address these needs. The increased number of trials being published as well as the increased number of therapies for most diseases have posed new dilemmas about how to use trial information to make accurate choices among therapies.
Assessment of treatment depends critically on data from published trials and on reviews that summarize the results of these trials. The goal of this commentary is to propose a set of explanations regarding why the published clinical trial literature often does not provide accurate and comprehensive information on treatment efficacy and safety. Such omissions occur despite real and important improvements in the methodologic quality of trials in rheumatology and their reporting over the past 15 years (3). The framework of the commentary is necessarily broad, and I include suggested solutions to the problems identified.
Identification of the problems
In an ideal world (Figure 1), all trials evaluating a novel therapy for a disease would be performed using the same measures of that disease's activity. They would have sufficient power to be likely to detect an effect of therapy, and they would test these new therapies against both placebo and other available therapies so that the medical community could determine not only whether these therapies are efficacious, but also how they compare with currently available treatments. These trials would be published a limited number of times, perhaps only once, and prominently enough to allow the medical community to evaluate the results. All of the reports would use the same presentation format, so that information from trials could be compared.
Number of trials needed is prohibitively large, and trials are too small.
In the real world (Figure 1), trials more commonly test new therapies against placebo rather than against other active therapies (often placebo or new treatment is added to a background of active treatments that are not optimally effective). New therapeutic regimens are tested against placebo in part to obtain regulatory approval and in part because the difference in efficacy between a treatment and placebo is likely to be larger than the difference between two active treatments.
Furthermore, too many therapies (e.g., in RA) exist to permit testing of active treatments against each other. To test a new treatment against each of 5 existing therapies would necessitate 5 trials, each of which would likely be larger in size and more expensive than a placebo-controlled trial. To test all possible comparisons of those 5 therapies would require 10 trials, each of which would probably be large and expensive, or trials designed with multiple treatment groups (to allow such comparisons to occur within trials). Thus, with the proliferation of new therapies for specific diseases, it is unlikely that definitive evidence evaluating the comparative efficacy of these therapies will ever be acquired.
The problem with numbers described above is simple compared with the formidable task of assessing combination therapy, which is becoming increasingly popular for most rheumatic diseases. Comparing different combinations of available therapies against other combinations could generate an almost infinite number of trials. Since the differences in efficacy between combinations are likely to be small, each trial, to be adequately powered, might be prohibitively large. Although such comparisons are certainly within the purview of information that can be provided by clinical trials, these constraints of numbers and expense suggest that we will likely never know the comparative efficacy of newly available combinations and must be careful when allocating resources to compare commonly used regimens or to perform studies that provide generalizable information about combination approaches.
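The arithmetic behind these counts is simple combinatorics; a minimal sketch (the function name is illustrative, not from the original, and the figures for 5 therapies match those in the text):

```python
from math import comb

def pairwise_trials(k: int) -> int:
    """Two-arm trials needed to compare every pair among k therapies."""
    return comb(k, 2)

# One new treatment vs. each of 5 existing therapies: 5 head-to-head trials.
# All pairwise comparisons among those 5 therapies: C(5, 2) = 10 trials.
print(pairwise_trials(5))                    # 10

# Combination therapy makes the count explode: among 5 drugs there are
# C(5, 2) = 10 possible two-drug combinations, and comparing every pair
# of those combinations would require C(10, 2) = 45 trials.
print(pairwise_trials(pairwise_trials(5)))   # 45
```

With only a handful of approved agents, the number of adequately powered comparative trials quickly outstrips any realistic research capacity, which is the point the paragraph above makes in words.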
Added to these limitations is the reality that trials for many chronic diseases remain small. Indeed, for null trials in major medical journals (4), for trials in rheumatic diseases (3), and for trials in other fields such as depression (5) and head injury (6), sample sizes have been too small, in general, to detect likely treatment differences.
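The claim that small trials cannot detect likely treatment differences can be illustrated with a standard normal-approximation sample-size calculation for comparing two response rates. The specific rates below are hypothetical, chosen only to contrast a treatment-versus-placebo difference with a smaller treatment-versus-treatment difference:

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate patients per arm needed to detect response rates p1 vs. p2
    (two-sided alpha, normal approximation for two proportions)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for two-sided test
    z_beta = z.inv_cdf(power)            # quantile corresponding to desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# A large treatment-vs-placebo difference needs only a modest trial...
print(n_per_arm(0.60, 0.40))   # 95 per arm
# ...but a small between-treatment difference needs a far larger one.
print(n_per_arm(0.60, 0.55))   # 1531 per arm
```

A 20-percentage-point difference requires fewer than 100 patients per arm, whereas a 5-point difference requires more than 1,500, which is why trials sized for placebo comparisons are routinely underpowered for active comparisons.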
Is multiplicity duplicity?
Multiplicity (see Table 1) refers to testing or presenting a trial's results multiple times; it vitiates the principle that a treatment should be tested and reported once per trial, with only a 5% chance (the Type I error rate) that an inefficacious treatment will be found to be better (or worse) than placebo.
Table 1. Different types of multiplicity in trial presentation and publication
1. Single trial, multiple publications
2. Multiple outcomes measured in a trial, outcome(s) showing null results ignored
3. Multiple analyses performed (e.g., intent-to-treat, completers, subgroup), only positive one highlighted
4. Multiple trials done, only positive ones published
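The cost of multiplicity can be quantified. If each comparison is run at the conventional 5% Type I error rate, the chance of at least one spurious "significant" finding grows rapidly with the number of tests. This back-of-the-envelope sketch assumes independent tests; real trial outcomes are usually correlated, so it somewhat overstates the inflation:

```python
def familywise_error(k: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across k independent tests,
    each conducted at significance level alpha."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 10, 20):
    print(f"{k:2d} tests -> {familywise_error(k):.3f}")
# -> 0.050, 0.226, 0.401, 0.642
```

With 10 unadjusted outcomes or subgroup analyses, an entirely inefficacious treatment has roughly a 40% chance of producing at least one "positive" result to report, which is exactly the selective-reporting hazard the table enumerates.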
It has been well documented that single trials with positive results are published repeatedly, often in a variety of guises, a phenomenon documented in rheumatology trials and in trials in other fields (7, 8). Attempts to identify which publications emanate from which trial have not been rewarding (8). Often, no one has kept records, or sponsors have been unwilling to provide clarifications. Thus, the ideal of a single trial published once prominently is often violated, and the weight of evidence that a given trial provides is difficult to determine.
Conversely, other trials that show no effect of a therapy are never published; this so-called publication bias is a phenomenon that has been reported in many medical fields (9, 10), including trials of glucosamine and chondroitin for the treatment of OA (11). Small null trials are especially unlikely to be published. One suggested solution to the publication bias problem has been to record the existence of a trial in a registry at the time of trial inception, but this has been variably successful (12). The result of publication bias is a subjective view of the efficacy of a therapy, with trials showing no effects being invisible to the medical community.
Other forms of multiplicity interfere with the clear translation of trial results. In chronic diseases in which disease effects may be multidimensional, many outcomes may be evaluated in a trial, yet only those showing favorable therapeutic results may be presented in the publication (7). Thus, in a trial, many analyses are done, selected positive results will pop out, and these will be reported, often in different publications. Those results that were not impressive or failed to reach significance will not emerge.
Last, one common form of analytic multiplicity occurs when a trial's main result is null but analyses of one or two subgroups show impressive, statistically significant results, and those subgroup results are trumpeted as suggesting efficacy (13–15).
Intent-to-treat analysis is the only form of analysis that preserves balanced randomization in the evaluation of trial results, and any exclusion of persons after randomization violates this principle. However, among the trial reports in major medical journals (4), roughly one-half either provide no intent-to-treat analysis or are unclear regarding whether the analysis presented is intent to treat. Of rheumatology trials published between 1997 and 1998, only 29.8% presented an analysis that was reported as intent to treat (although this represented an improvement over 19.8% for the years 1987–1988) (3). Also, a modified intent-to-treat analysis may be presented as a rigorous analysis (16). Although failure to present an intent-to-treat analysis does not, per se, represent a problem of multiplicity, in practice, the availability of a variety of intent-to-treat options makes it possible to analyze trial data in multiple ways, so that the advocate of a specific therapy can choose the option that allows data to be presented in the most favorable light.
While industry-sponsored trials currently constitute the majority of those conducted, the problems cited above are common to all published trials, including those that are not industry-sponsored.
Summary of problems.
Although the ideal world would present transparent results of single trials, the real world provides multiple publications of each trial with positive results, while the results of null trials go unpublished. Furthermore, data from trials with positive results are frequently presented in ways that obscure their unvarnished results. Added to this are trials that are underpowered to detect modest effects. The global picture produces understandable confusion about the efficacy of therapies. Because the problems are multifaceted, solutions to them are not simple.
Because information about clinical trials serves as the basis for most judgments regarding treatment efficacy (and possibly safety) and as the source for clinical reviews on treatment, I shall concentrate on the design and presentation of clinical trial data. This ignores the synthesis of these data both in impressionistic reviews and meta-analyses, where problems of bias (17–20) and lack of comprehensiveness have been noted (21).
I shall assume that redundancy of publication and citation of positive trials is uncontrollable in this information age. Perhaps there are positive consequences to repeated citation of a prominent trial's results—that all physicians interested in using a treatment can become familiar with evidence regarding its efficacy. What can and should be controlled is the accuracy with which trial reports reflect actual trial results. In addition, readers and reviewers should be able to identify the source of trial data and determine whether an article contains data from a unique trial or rather from a trial whose results have already been published in other contexts.
As the filter through which potential clinical trial publications must pass, medical editors exert considerable influence on how trials are published. The improvements suggested here rely heavily on concerted action by medical editors. I recognize that editors and journals represent a “back end” solution, and that many of the problems identified above arise in the study design, analysis, and reporting phases. Nonetheless, the improvement in trial information instigated by widespread adoption of the Consolidated Standards for Reporting of Trials (CONSORT) guidelines (22) suggests that editorial policies encouraging open and standard presentation of trial data will, in turn, motivate “front end” improvements in the performance of trials.
Standardization of trial design and reporting.
In the mid-1990s, experts in clinical trial methodology, the CONSORT group, developed guidelines for the reporting of clinical trial results (22), which were endorsed by many major medical journals. The CONSORT guidelines called for diagrams of participant flow, descriptions of eligibility criteria, and clearly defined primary and secondary outcomes.
Later evaluations (23) suggested that trials published in journals that adopted CONSORT guidelines showed a temporal improvement in reporting of trial methods, whereas one journal that did not adopt these guidelines showed no measurable improvement. The guidelines have made clinical trial methods more transparent and their results easier to understand; they have also standardized trial reporting. In 2001, a revised set of guidelines was introduced (24) and adopted by the International Committee of Medical Journal Editors and many medical journals, although not necessarily journals focusing on rheumatic diseases.
Although the CONSORT prescriptions were sanctioned by medical editors, they were not enforced or required. Even trials published in journals that endorse CONSORT often do not meet its guidelines (23). There is no reason why elements of CONSORT should not be required for trial reports in major medical journals. Such a requirement would make trial reports uniform and thus would assist the medical community in determining whether therapies are efficacious and how they compare with each other. Some journals currently require that a CONSORT checklist be submitted with clinical trial reports (25). All journals could do so, and these checklists could be cross-referenced to locations in the text.
Although the CONSORT recommendations address trial reports, they mandate neither standardization of trial design nor intent-to-treat analyses. Editors should consider requiring that a trial's primary analysis be by intent to treat whenever possible. The CONSORT influence has been favorable, but higher standards and more uniformity of trial reports are needed.
Groups studying different diseases have begun to develop their own disease-specific guidelines, encompassing both the design of trials and the reporting of results. This process has taken place most prominently in rheumatic diseases. Until the early 1990s, published trials in RA used 10–15 diverse outcome measures, and individual studies often used several different measures, with no primary outcome. Trial measures were not necessarily selected based on performance characteristics and were often redundant. In an effort to standardize outcome measures in RA, committees, using data from trials, culled redundant measures, identified those that were sensitive to change, and selected outcome measures that sampled from different domains of disease activity (e.g., acute-phase reaction, pain, joint count). The result, the core set of disease activity measures, was adopted internationally (26, 27). This core set still included too many outcome measures (n = 7), and subsequent efforts boiled it down to a single outcome measure, producing a definition of response (28). Acceptance of this measure of response by regulatory agencies helped ensure its widespread adoption. Since then, trialists have successfully developed standardized response criteria for juvenile RA (29), OA (30), and ankylosing spondylitis (31). Yet other efforts are underway.
What are the benefits of trial standardization? Adopting a widely accepted set of outcome measures has permitted the rheumatology community to use the same language when evaluating therapies. As in oncology, this permits comparisons of treatment efficacy, sometimes even when the treatments are not directly compared. Although this type of across-trial comparison is problematic when patient populations and trial structures vary, it is nonetheless more informative than having no reasonable comparative information. It encourages investigations of what factors affect response (32), allowing high-risk subjects to be targeted for trials; it focuses attention on clinically relevant changes in individual patients and thus dovetails with the clinician's needs; and it forces trialists to choose a single outcome, not multiple outcomes.
The problem with standardizing trial outcomes is that such standardization may freeze progress, inhibiting development of new, creative approaches to outcome measurement. This problem is not inherent in trial standardization. The revised CONSORT statement (see above) suggests that standards may have to be updated to accommodate valuable new insights, and that serial generations of standards may be needed. To permit ongoing treatment comparisons, new standards may have to be compatible with old ones.
Despite increasing standardization of trial design and reporting, trial results naturally vary one from another, even if the trials were performed and reported similarly. For example, in recent trials comparing leflunomide and methotrexate in RA, which included similar patients and used the same definition of response, methotrexate response rates varied from 35% to 57%. Some of the persistent variation across arthritis trials occurs because the timing of response has not been consistently defined; this is an example of standardization not proceeding far enough. The use of different treatment regimens and different patients with varying likelihoods of response introduces variability into trial results. Variability is also increased when the measurement of outcome (e.g., joint counts) has not been standardized across observers. Furthermore, there is natural variation from patient to patient and from trial to trial. With these sources of variability, it is difficult to compare results for treatments across trials. Identifying and limiting causes of variability would make such comparisons less problematic. It seems unfortunate that, despite efforts to enhance trial uniformity, variability in the presentation of information in trial reports should add to the difficulty of interpreting whether results from different trials are genuinely different.
Verifying data analysis and avoiding multiplicity in trial reports.
Recognizing many of the problems identified above, leading medical editors recently promulgated a new policy to encourage trial data to be presented in an objective and dispassionate manner (32). Authors must sign statements indicating that they accept full responsibility for the conduct of the trial, had access to all data, and controlled the decision to publish. Although this is an excellent first step, many clinical trial investigators do not have the ability to reanalyze or evaluate trial data. Other solutions may be needed. These include making data available to all trial investigators (including those who have analytic capability) for reanalysis prior to publication. Even further steps to verify results and avoid multiplicity problems identified earlier should be considered.
As occurs at the Food and Drug Administration, journals should consider reanalyzing trial data to confirm the accuracy of the original analysis. Rigorous statistical oversight would limit multiple outcomes and selective reporting of outcome within publications. Also, editors should consider asking authors to provide the trial data that generated the analyses for postpublication electronic release to the scientific community, so that trial data can be reevaluated (33). Such electronic data could be limited to only the data that generated the results in the trial report, which would permit authors to take advantage of other trial data for novel publications. Medical editors need to exert authority to ensure that the data presented in trial reports accurately represent the data collected. The possibility of data reanalysis will force trialists to carefully evaluate the primary analysis and confirm its interpretation before publication.
Asking authors about other trial data.
While controlling the number of publications emanating from a single trial might not be in the best interest of the scientific community, identifying the trials that generate analyses is in the best interest of all of those interested in critically evaluating the quantity and quality of evidence supporting a therapy's efficacy and safety. The number of published reports of a trial cannot be well controlled, but identification of which data came from which trial can be better policed. At least one journal (Annals of Internal Medicine [www.annals.org/shared/author_info.html]) has a policy of requiring that authors provide “full details on any possible previous or duplicate publication of any content of the manuscript.” Previous publication of content does not preclude new publication of other parts of the study.
Editors could require, as a matter of course, that researchers submitting clinical trial reports (and perhaps other clinical research articles) provide details regarding whether data from the trial have been previously reported in other contexts. This information could be provided, along with funding sources, in the published manuscript or in the electronic version of the same report. The authors should be asked not only whether these trial data have been previously published in another form, but whether other manuscripts that contain data from this trial have been submitted.
Standardization of editorial policy regarding clinical trial reports.
As noted earlier, with the endorsement of the revised CONSORT recommendations, medical editors made an enormous leap toward standardization of clinical trial reporting. The recommendations described above will work only if medical editors act in concert. Trial publication inequities would arise if some journals asked for trial data electronically or reanalyzed trial data while others did not. This can be avoided if medical editors, especially editors of leading journals, adopt uniform policies. Because most rheumatology trials are published in a limited number of rheumatic disease journals (4), a discipline-wide policy would be feasible.
Although both the entrepreneurial spirit of individual investigators and development of new therapies by the pharmaceutical industry need vigorous encouragement, we can no longer rely completely on information provided by these sources to determine how well treatments work and which ones work best. Enumeration of problems described above leads naturally to suggested specific solutions; I acknowledge that medical editors, especially, have moved toward adoption of some of them. However, none of the suggested solutions has been broadly and successfully adopted, and without a concerted effort, especially by medical editors, these problems will remain, and information about the efficacy and safety of treatments will be incomplete and, in some cases, inaccurate.
Although the proposed solutions (Table 2) represent small steps toward the ideal state described earlier, most of them will not be easily accomplished, and other approaches certainly exist. Nonetheless, I believe that consideration of these issues is timely, and that a debate regarding possible solutions is warranted.
Table 2. Suggested solutions for optimizing information from trials*
Trial data analysis and multiplicity
1. Make trial data available to all authors; consider other means of verifying analysis results: editors require that electronic data be made available if the trial is published; journal statistician(s) review the report and have the opportunity to reanalyze the data
2. Instructions to authors require them to cite publications from the same trial; these data are made available to reviewers and readers
* CONSORT = Consolidated Standards for Reporting of Trials.
I am indebted to Drs. Robert Meenan, Norman Levinsky, Wilson Colucci, and Joseph Loscalzo, and the faculty of the Clinical Epidemiology Unit for editorial suggestions and to Kitty Bentzler for technical assistance.