Eligibility criteria and identification of studies. We considered all articles on RCTs in SSc. Trials were eligible if they were randomized and included at least 5 patients with SSc, regardless of whether patients with other diseases had been included as well. Pseudorandomized studies with alternate allocation of subjects were excluded. Duplicate publications pertaining to the same trial were screened and only 1 report was retained. We also excluded studies where randomization had been performed, but the eventual comparisons in the analyses of the data were not between the randomized arms. Only peer-reviewed publications were included in the database, and meeting abstracts were not considered.
We searched MEDLINE (last update December 2000) using combinations of the terms “scleroderma” and “systemic sclerosis” with “clinical trial” and “randomized controlled trial” and “randomize” and “random.” The two authors screened all the retrieved abstracts to decide if they might fulfill the criteria for inclusion in this study. For potentially eligible studies, the full articles were retrieved for further consideration. Finally, we screened the references of retrieved articles for additional studies.
Data extraction and replication. The following data were extracted from each trial report: journal, first author's name, author's country and location of patient enrollment, the year of publication, funding source(s), targeted disease manifestations (and in particular Raynaud's phenomenon), masking, parallel or cross-over design, randomization mode, allocation concealment, description of reasons for withdrawals, information on withdrawals per arm, percentage of withdrawals, total sample size, number of patients with SSc, number of randomized patients per arm, the proportion of women, mean (or median) age of patients, prior duration of disease, duration of the trial, performance of power calculations, number of arms, the treatment(s) in each arm, and whether the trial was a comparison of a treatment against placebo/no treatment or a comparative trial of different regimens. Moreover, we examined whether the study outcomes were well specified and whether there was a single outcome or multiple study outcomes. Finally, the results of each trial were categorized as showing significant efficacy, trend for efficacy, no effect, trend for harm, or significant harm for a tested experimental intervention versus the intervention of the control arm. This classification was based on the overall appraisal of the key outcomes' results, their level of statistical significance (P < 0.05 or P ≥ 0.05), and their interpretation by the trial authors in the abstract and the discussion of the study publication.
Data were extracted in duplicate by the two authors and differences were resolved by consensus. For the categorization of the trial results, the weighted kappa coefficient (4) between the two extractors was 0.8.
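As an illustration of the agreement statistic mentioned above, the following is a minimal sketch of Cohen's weighted kappa for the five ordered result categories. The linear disagreement weights are an assumption (the text does not state which weighting scheme was used), and the two rating lists are hypothetical, not data from this study.

```python
from collections import Counter

# Ordered result categories as described in the text (index 0 = worst, 4 = best).
CATEGORIES = [
    "significant harm",
    "trend for harm",
    "no effect",
    "trend for efficacy",
    "significant efficacy",
]


def weighted_kappa(rater_a, rater_b, n_categories=len(CATEGORIES)):
    """Cohen's weighted kappa with linear disagreement weights |i - j| / (k - 1).

    rater_a and rater_b are equal-length lists of category indices.
    The linear weighting is an assumption made for this sketch.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    k = n_categories
    observed = Counter(zip(rater_a, rater_b))  # joint counts per (i, j) cell
    marginal_a = Counter(rater_a)
    marginal_b = Counter(rater_b)

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += weight * (marginal_a[i] / n) * (marginal_b[j] / n)

    if expected_disagreement == 0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - observed_disagreement / expected_disagreement


# Hypothetical ratings for ten trials (category indices); not data from the study.
extractor_1 = [4, 4, 3, 2, 2, 2, 1, 0, 3, 2]
extractor_2 = [4, 3, 3, 2, 2, 1, 1, 0, 3, 2]
print(round(weighted_kappa(extractor_1, extractor_2), 2))
```

With linear weights, a disagreement between adjacent categories (for example, "trend for efficacy" versus "significant efficacy") is penalized less than a disagreement between distant ones.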
Statistical analyses. Descriptive statistics for continuous variables are presented as means with SDs or medians with interquartile ranges (IQRs), as appropriate. Descriptive statistics for discrete variables are frequencies and percentages, and separate estimates are given for trials focusing on Raynaud's phenomenon only versus other trials. We evaluated whether traditional attributes of study quality pertaining to the Jadad quality scale for RCTs (randomization, allocation concealment, double-blinding, information on withdrawals) (5) were related to the study sample size, duration of followup, and characteristics of the patient population. We also evaluated whether any of the trial design or quality characteristics mentioned above (under data extraction) were related to the reporting of significant efficacy by the trial publication. Comparisons used the Mann–Whitney U test, t-test, and one-way analysis of variance (ANOVA) for continuous variables; Fisher's exact test for 2 × 2 tables; the chi-square test adjusted for trend for categorical data; Spearman's coefficients for correlation analyses; and logistic regressions for multivariate analyses (6).
Finally, we evaluated whether binary trial characteristics that may reflect the design of the trials and/or their quality changed over time in more recent trials. The trials were categorized into 4 groups based on year of publication (pre-1986, 1986–1990, 1991–1995, and 1996–2000). The statistical analysis used exact inference adjusted for trend. We also evaluated whether the sample size was increasing in more recently published trials.
Analyses were conducted in SPSS 10 (SPSS, Chicago, IL) and StatXact 3 (Cytel Software, Cambridge, MA). All reported P-values are two-tailed.
Database. Seventy-eight reports were identified and 70 RCTs qualified for inclusion in this analysis (Table 1). Eight reports were excluded because of pseudorandomization (n = 2), duplicate publication (n = 4), or the use of nonrandomized comparisons for analyses of the data (n = 2). The eligible RCTs had been published in 29 different peer-reviewed journals: the leading contributing journals being Arthritis and Rheumatism, n = 9; the Annals of the Rheumatic Diseases, n = 9; the Journal of Rheumatology, n = 7; and the British Journal of Rheumatology (currently, Rheumatology), n = 5 (see Appendix 1). All studies were reported in the English language with 2 exceptions (1 reported in French and 1 in German).
Table 1. Characteristics of eligible trials*
|Characteristic||All trials (n = 70)||Raynaud's trials (n = 24)||Other trials (n = 46)|
|Sample size, median (IQR)||28 (17–43)||25 (15–51)||28 (20–41)|
|Sample size, SSc only, median (IQR)||25 (14–41)||20 (8–34)||28 (18–41)|
|Proportion of women, median (IQR)||0.82 (0.77–0.88)||0.84 (0.76–0.89)||0.82 (0.75–0.88)|
|Mean age, years, median (IQR)||48.1 (45.6–51.0)||48.1 (45.6–50.8)||48.0 (45.2–51.0)|
|Mean disease duration, years, median (IQR)||8.5 (5–11.3)||9 (8.3–12.5)†||7 (3.8–11.2)†|
|Duration of followup|
| Less than 2 months||12 (17.1%)||6 (25.0%)||6 (13.0%)|
| Two months to 1 year||50 (71.4%)||17 (70.8%)||33 (71.7%)|
| More than 1 year||8 (11.4%)||1 (4.2%)||7 (15.2%)|
|Country of patient enrollment|
| United States||25||8||17|
| United Kingdom||18||12||6|
|Year of publication, median (IQR)||1991 (1985–1997)||1988 (1985–1994)||1991 (1985–1997)|
|More than 2 arms||6 (8.6%)||2 (8.3%)||4 (8.7%)|
|Comparison against placebo/no treatment||58 (82.9%)||19 (79.2%)||39 (84.8%)|
|Sample size per arm not specified||15 (21.4%)||6 (25.0%)||9 (19.6%)|
|Randomization mode specified||16 (22.9%)||3 (12.5%)||13 (28.3%)|
|Allocation concealment specified||7 (10.0%)||0 (0%)||7 (15.2%)|
|Masking|
| Double blind||58 (82.9%)||20 (83.3%)||38 (82.6%)|
| Single blind||3 (4.3%)||2 (8.3%)||1 (2.2%)|
| Open label||9 (12.9%)||2 (8.3%)||7 (15.2%)|
|Percentage of withdrawals, median (IQR)||17.5 (8.7–31.6)||11.0 (7.0–21.5)‡||27.1 (13.0–34.4)‡|
|Withdrawals described per arm||37 (52.9%)||13 (54.2%)||24 (52.2%)|
|Power calculations mentioned||8 (11.4%)||1 (4.2%)||7 (15.2%)|
|Outcomes well specified||19 (27.1%)||4 (16.7%)||15 (32.6%)|
|Number of outcomes, median (IQR)||8 (5–14)||8 (4–10)||8 (5–16)|
|Results for the experimental intervention|
| Significant efficacy||21 (30.0%)||10 (41.7%)||11 (23.9%)|
| Trend for efficacy||17 (24.3%)||6 (25.0%)||11 (23.9%)|
| No effect||28 (40.0%)||8 (33.3%)||20 (43.5%)|
| Trend for harm||2 (2.9%)||0 (0%)||2 (4.3%)|
| Significant harm||2 (2.9%)||0 (0%)||2 (4.3%)|
A total of 49 interventions were evaluated, including iloprost (used in 12 arms of various RCTs), nifedipine (n = 9), ketanserin (n = 5), d-penicillamine (n = 4), cisapride (n = 3), cyclofenil (n = 3), enalapril (n = 3), cicaprost (n = 2), interferon gamma (n = 2), prazosin (n = 2), prostaglandin E1 (n = 2), relaxin (n = 2), alfa methyldopa and propranolol (n = 1), aminobenzoate potassium (n = 1), antacid (n = 1), antioxidant micronutrients (n = 1), autogenic training (n = 1), betaprost (n = 1), calcitriol (n = 1), chlorambucil (n = 1), cimetidine (n = 1), colchicine (n = 1), dexamethasone (n = 1), diltiazem (n = 1), dipyridamole with aspirin (n = 1), epoprostenol (n = 1), essential fatty acids (gamma linoleic acid; n = 1), extracorporeal photochemotherapy (n = 1), factor XIII (n = 1), finger temperature biofeedback (n = 1), fish-oil capsules (n = 1), frontalis biofeedback (n = 1), glyceryl trinitrate patches (n = 1), interferon alpha (n = 1), ketotifen (n = 1), losartan (n = 1), methotrexate (n = 1), N-acetylcysteine (n = 1), olive oil capsules (n = 1), photopheresis (n = 1), probucol (n = 1), ranitidine (n = 1), recombinant tissue plasminogen activator (n = 1), stanozolol (n = 1), total lymphoid irradiation (n = 1), urokinase (n = 1), 70% dimethyl sulfoxide (DMSO; n = 1), 2% DMSO (n = 1), and 5-fluorouracil (n = 1). As shown, with 4 exceptions, each of these interventions was tested in only 1 to 3 RCTs.
Trial characteristics. Comparisons between trials focusing only on Raynaud's phenomenon (n = 24) and the other trials (n = 46) showed significant differences in the percentage of withdrawals and prior duration of the disease (Table 1). Trials focusing on Raynaud's phenomenon tended to involve patients with longer disease duration and had fewer withdrawals on average. Perhaps this was partly a reflection of the fact that there were very few Raynaud's trials with long followup (only 1 had a followup of longer than 1 year), but long-term trials were rare and this comparison did not reach formal statistical significance. Otherwise, the two groups of trials were very similar in other aspects of study design and conduct (Table 1). The source of funding and whether the trial was placebo-controlled or comparative of different treatments were not significantly related to any of the design or quality characteristics or the efficacy results.
Only 5 studies had more than 100 patients and only 8 studies had more than 50 patients. The largest RCT had 308 patients. Women predominated in all studies. Five studies targeted patients with early disease (mean duration less than 3 years), and two of them had mean disease duration of less than 1 year. In addition, 23% of the studies did not mention the mean or median disease duration of the enrolled patients. A cross-over design was implemented in 27% of the studies. With the exception of 2 small studies performed in India and Mexico, all RCTs were conducted in developed countries.
Quality characteristics. Sixteen studies mentioned the randomization mode and 7 elaborated on the methods used to ensure allocation concealment, but most were double blind. One-third of the studies failed to describe withdrawals, and only about one-half of the RCTs offered withdrawal information per arm. Withdrawal rates were high in most trials that reported them. In 36 RCTs with available data, withdrawal rates exceeded 10%. Only 8 studies mentioned power calculations for specific outcomes, and 6 of them had at least 80% power for the specified difference in the main outcome. The outcomes were well specified in approximately one-quarter of the RCTs. With few exceptions, a multitude of outcomes were reported, the median being 8 outcomes.
Studies with longer duration of followup were more likely to report the mode of randomization and to provide information on withdrawals. The mode of randomization was specified in 3 of 8 trials with followup longer than 1 year, 13 of 50 trials with followup between 2 months and 1 year, and none of the 12 trials with less than 2 months of followup (P = 0.037, adjusted for trend). The respective rates for provision of information on withdrawals were 6 of 8, 37 of 50, and 3 of 12 (P = 0.008, adjusted for trend). However, a longer followup also increased the withdrawal rate. The mean withdrawal rate was 8% in RCTs with less than 2 months of followup, 21% in RCTs with followup between 2 months and 1 year, and 36% in RCTs with followup longer than 1 year (P = 0.030 by ANOVA).
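The trend comparisons above can be sketched with a Cochran–Armitage test for trend across the three ordered followup categories. This is only an illustration: it uses the large-sample normal approximation with equally spaced scores (both assumptions), whereas the analyses reported here used exact inference in StatXact, so the resulting P values need not match the reported ones exactly. The counts are taken from the paragraph above.

```python
import math


def cochran_armitage_trend(events, totals, scores=None):
    """Two-sided Cochran-Armitage trend test (normal approximation).

    events[i] / totals[i] is the proportion in ordered group i.
    Returns (z, p_value). Equally spaced scores 0, 1, 2, ... are an
    assumption when none are supplied.
    """
    if scores is None:
        scores = list(range(len(totals)))
    n_total = sum(totals)
    p_bar = sum(events) / n_total  # pooled proportion under the null

    # Score-weighted deviation of observed events from their expectation.
    t_stat = sum(s * (r - n * p_bar) for s, r, n in zip(scores, events, totals))

    sum_ns = sum(n * s for s, n in zip(scores, totals))
    sum_ns2 = sum(n * s * s for s, n in zip(scores, totals))
    variance = p_bar * (1 - p_bar) * (sum_ns2 - sum_ns ** 2 / n_total)

    z = t_stat / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value


# Groups ordered from shortest to longest followup:
# less than 2 months, 2 months to 1 year, longer than 1 year.
trials_per_group = [12, 50, 8]
randomization_mode_specified = [0, 13, 3]
withdrawal_information_provided = [3, 37, 6]

for label, counts in [
    ("randomization mode", randomization_mode_specified),
    ("withdrawal information", withdrawal_information_provided),
]:
    z, p = cochran_armitage_trend(counts, trials_per_group)
    print(f"{label}: z = {z:.2f}, P = {p:.3f}")
```

A positive z indicates that the characteristic was reported more often as followup lengthened.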
The withdrawal rate was also higher in studies with a larger number of SSc patients (Spearman's correlation coefficient 0.39, P = 0.009). Studies with a larger number of SSc patients were more likely to mention the mode of randomization, and provide information on allocation concealment, withdrawals, and power calculations (Spearman's correlation coefficients 0.37, 0.24, 0.27, and 0.31, respectively; P = 0.001, P = 0.042, P = 0.026, and P = 0.009, respectively).
None of the quality parameters were significantly related to characteristics of the studied patient population in each trial, including proportion of women, mean age, mean disease duration, or country of enrollment (P > 0.05 for all tests).
Efficacy of the experimental interventions. Overall harm was described in very few trials and more than half of the RCTs showed either significant efficacy or a trend for efficacy for the experimental intervention versus the control arm. Nevertheless, the significant efficacy findings were scattered across 16 different interventions, and for 13 of them there was only 1 trial with significant results. Only 4 interventions had been studied in at least 3 placebo-controlled trials, and none of them had been shown to be unequivocally and consistently effective (Figure 1). Iloprost had been shown to be significantly effective against placebo in 3 different Raynaud's trials (Figure 1) and against nifedipine in another trial. Still, oral iloprost did not show any significant efficacy in the largest placebo-controlled trial conducted to date (n = 308). Nifedipine and cisapride had been shown to be significantly effective in 2 placebo-controlled trials each (Figure 1). Nifedipine was also used for Raynaud's phenomenon. Cisapride was used for its effect on gastrointestinal motility, but the drug has since been withdrawn because of serious toxicity. Ketanserin was significantly effective in 1 of 5 placebo-controlled trials (Figure 1), and the indication again was Raynaud's phenomenon. Therefore, no intervention for non-Raynaud's systemic manifestations has been found to be significantly effective in more than 1 trial to date.
Figure 1. Sample size and presence or absence of significant efficacy in placebo-controlled studies for the 4 treatments with at least 3 placebo-controlled trials each. Cross denotes statistically significant efficacy, square denotes no statistically significant efficacy.
Reporting of significant efficacy was negatively correlated with a longer duration of disease, a larger percentage of withdrawals, and double masking (Spearman's correlation coefficients −0.33, −0.34, and −0.28, respectively; P = 0.015, P = 0.026, and P = 0.018, respectively). Significant efficacy was reported in 14 of 58 double-blind trials, whereas 7 of 12 trials without double blinding reported significant efficacy for the experimental intervention (odds ratio 0.23, P = 0.025). Furthermore, none of the 8 long-term trials with followup of longer than 1 year reported significant efficacy versus 21 of the 62 other trials (exact P = 0.095). In multivariate backward elimination logistic regression, double blinding and a longer duration of disease independently decreased the chances of reporting of significant efficacy (odds ratio 0.17 for double-blind studies, P = 0.029; odds ratio 0.82 per 1 year longer prior duration of disease, P = 0.036).
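The two 2 × 2 comparisons in the paragraph above can be worked through as follows. This is a sketch, not the original analysis: the odds ratio is the plain cross-product ratio, and the two-sided Fisher's exact P value sums the probabilities of all tables that are no more likely than the observed one, which is one common definition and an assumption here. The function names are illustrative.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Cross-product odds ratio for the 2 x 2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table. Integer
    arithmetic keeps the comparisons exact.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def weight(k):
        # Unnormalized probability of k events in the first row.
        return comb(row1, k) * comb(row2, col1 - k)

    observed = weight(a)
    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    tail = sum(weight(k) for k in range(k_min, k_max + 1) if weight(k) <= observed)
    return tail / comb(n, col1)


# Double-blind trials: 14 of 58 reported significant efficacy;
# trials without double blinding: 7 of 12.
print(round(odds_ratio(14, 58 - 14, 7, 12 - 7), 2))

# Followup longer than 1 year: 0 of 8 reported significant efficacy;
# all other trials: 21 of 62.
print(round(fisher_exact_two_sided(0, 8, 21, 62 - 21), 3))
```

The first table has double-blind trials in the first row, so an odds ratio below 1 means that double-blind trials were less likely to report significant efficacy.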
Changes over time. Changes in trial design and quality characteristics over time are shown in Table 2. There was a suggestion that some quality parameters may be improving in recent trials. In particular, the odds of the trial not specifying the sample size per arm, not specifying outcomes adequately, and not mentioning power calculations decreased significantly over time. However, even in the most recent trials those problems were still present. For example, looking only at the 21 trials published after 1995, primary outcomes were well specified in 12, power calculations were mentioned in 7, and 1 trial did not even report the number of patients per arm. On the other hand, double masking was becoming less frequent over time (P = 0.042), and this was particularly true for trials published after 1995, in which one-third were not double-masked. Finally, cross-over designs were becoming less popular over time (P = 0.029), and they were used only in 2 RCTs published after 1995.
Table 2. Changes in trial design, quality characteristics, and outcomes over time
|Characteristics||Pre-1986 (n = 19)||1986–1990 (n = 15)||1991–1995 (n = 15)||1996–2000 (n = 21)||P value|
|Focus on Raynaud's phenomenon only||12 (63%)||7 (47%)||10 (67%)||17 (81%)||0.143|
|Sample size per arm not specified||7 (37%)||4 (27%)||3 (20%)||1 (5%)||0.011|
|Mean disease duration <5 years||1 (5%)||4 (27%)||4 (27%)||4 (19%)||0.35|
|Followup longer than 1 year||2 (10%)||1 (7%)||1 (7%)||4 (19%)||0.383|
|Randomization mode specified||3 (16%)||3 (20%)||2 (13%)||8 (38%)||0.126|
|Allocation concealment specified||1 (5%)||2 (13%)||0 (0%)||4 (19%)||0.256|
|Masking: double-blind vs. single or none||17 (89%)||14 (93%)||13 (87%)||14 (67%)||0.049|
|Withdrawals described per arm||7 (37%)||7 (47%)||10 (67%)||13 (62%)||0.084|
|Power calculations specified||0 (0%)||0 (0%)||1 (7%)||7 (33%)||0.0003|
|Outcomes well specified||3 (16%)||1 (7%)||3 (20%)||12 (57%)||0.0014|
|Results showing significant efficacy||6 (31%)||4 (27%)||5 (33%)||6 (29%)||0.943|
|Results showing significant efficacy or trend for efficacy||9 (47%)||9 (60%)||9 (60%)||11 (52%)||0.816|
|Funding from industry||9 (47%)||14 (93%)||12 (80%)||13 (62%)||0.563|
|Comparison with placebo or no treatment||17 (90%)||13 (87%)||13 (87%)||15 (71%)||0.138|
A systematic appraisal of the RCTs performed on patients with SSc shows a diversity of study designs and tested interventions. Nevertheless, very few studies have evaluated patients with early disease, and long-term followup for the assessment of hard clinical outcomes is uncommon. Most studies in the field have been of small sample size and underpowered for assessing effectiveness. Despite lack of power, a surprisingly large number of trials have concluded that the tested interventions are effective, but replication of most of these promising findings is lacking, and some of them may be spurious. Double-blind studies have been more conservative in this regard, and claims of effectiveness are also seemingly curtailed in trials of more long-standing disease. Although double blinding has been widely adopted, the majority of RCTs in the field have failed on several traditional aspects of quality reporting, including the mode of randomization, allocation concealment, and provision of reasons for withdrawals; larger RCTs with longer followup have done better in this regard than small RCTs of more limited duration. Quality characteristics seem to improve over time, but deficiencies are still prominent in many recent trials.
Reports from small, underpowered trials with multiple outcomes that are not well specified can distort the clinical evidence produced by randomized research in this field. Small trials are known to be more vulnerable to publication lag (7) and publication bias (8). A multiplicity of not clearly specified outcomes also leads to confusion and provides an opportunity for multiple comparisons, data dredging, and data fishing (9), increasing the possibility of reaching spurious conclusions about the efficacy of an intervention. The large proportion of published trials that have claimed effectiveness for their tested interventions may be partly due to such biases. Of interest, no trial with followup longer than 1 year has claimed significant effectiveness. Scleroderma is a chronic disease, and long-term information would be important to obtain on all therapies that are considered to be effective based on short-term successes. Long-term followup would allow the assessment of hard clinical outcomes (2, 3) and would give a better picture of the long-term toxicity and tolerability of each intervention, which are equally important for deciding on its optimal use (10). The available trials showed alarmingly high rates of withdrawals, with rates averaging 36% in trials with followup of longer than 1 year. Enthusiasm may need to be tempered even for effective treatments if the withdrawal rates are that high. Further research should focus on optimizing the use and tolerability of, and compliance with, otherwise effective interventions.
While double blinding is widely implemented in SSc trials, most of these trials score relatively low on other parameters that are usually considered to be important markers of quality. Empirical evidence from other medical domains has shown that lack of masking may spuriously inflate the estimates of the treatment effects by 20%, and lack of adequate allocation concealment may lead to a 40% spurious inflation (11, 12). Of course it is difficult to score the quality of an RCT based on a published report, and several investigators have pointed out that quality ratings can be only rough indicators at best (13, 14). It is conceivable that although the reporting of some aspects of the study design may be suboptimal, the actual conduct of the trial might have taken these issues into consideration. Even so, some deficiencies in a few examined trial reports were glaring, such as the occasional failure to report even the number of patients randomized in each arm. The more widespread implementation of standardized reporting for RCTs may improve this situation in the future (15).
We identified a relatively large number of trials focusing on Raynaud's phenomenon, as compared with trials focusing on other systemic manifestations of SSc that usually are more clinically important. There is a need to focus on rigorously defined, serious clinical outcomes in the field (3). Consensus statements by experts (3) provide guidelines for the performance of long-term trials with clear objectives. Such trials should also consider targeting some patient populations that have been relatively neglected in randomized research to date. In particular, very few trials have enrolled patients with early disease. Such patients may have the best chances of reversibility of organ damage and favorable long-term outcomes. Patients with long-standing disease, established complications, and extensive fibrosis may be more difficult to treat. Our analysis also shows that effectiveness was less likely to be claimed in populations with more long-standing disease. This analysis is susceptible to potential ecologic bias, because individual patient data are not available. However, it suggests that potentially effective interventions may be rejected prematurely as being ineffective if they are evaluated only in patients with long-established disease.
Finally, there is encouraging evidence that the quality of RCTs in the field has been improving in recent years. Deficiencies in trial design have been described in other areas of rheumatology, such as rheumatoid arthritis (16, 17), but larger and better-designed trials have also appeared more frequently in these areas recently. Even though it may be difficult to implement the mega-trial paradigm in rheumatology as in other specialties, e.g., cardiology (18), randomized research should be encouraged because uncontrolled studies in the field may lead to even larger biases (19) than those described above for RCTs. A better understanding of the long-term outcomes, a standardization of definitions, and the development of multicenter collaborations may lead to better and more fruitful clinical research accomplishments in the future. Some of the largest recent trials in the field have led to disappointing results. For example, relatively large trials have failed to find any significant systemic benefit for high-dose d-penicillamine (20), aminobenzoate potassium (21), or iloprost (22). Most successes have been modest, controversial, or limited to the treatment of Raynaud's phenomenon. Nevertheless, it is equally important to find effective therapies and to disprove ineffective therapies that may even be harmful. We should acknowledge that until now, the few major accepted therapeutic advances in the field, such as the use of angiotensin-converting enzyme inhibitors for renal crisis, have relied on nonrandomized studies (23, 24). Nonrandomized studies may be useful, but there is a need for more randomized research of high standards. Regardless of the study outcomes, careful study design and avoidance of biases are essential for obtaining an unbiased picture of how to treat patients with SSc.