Abstract

Objective

Improved standards for the evaluation of therapeutic interventions in systemic lupus erythematosus (SLE) are needed. The purpose of this study by a committee of the American College of Rheumatology was to define clinically meaningful improvement, no change, or worsening in 6 existing clinical measures of SLE disease activity. This represents an important step in a disease in which some organ symptoms get better and others get worse. It is intended to help investigators develop sample size estimates based on meaningful effect sizes and to gauge the clinical relevance of any observed change in disease activity.

Methods

Medical records from 310 patients drawn from 3 sources were abstracted into a standard format. Each vignette included clinical and laboratory data obtained during 2–3 visits. Ratings on the following 6 instruments were obtained for the same patients during the visit or retrospectively: the British Isles Lupus Assessment Group (BILAG), the Systemic Lupus Erythematosus Disease Activity Index (SLEDAI), the revised Systemic Lupus Activity Measure (SLAM-R), the European Consensus Lupus Activity Measure (ECLAM), the Safety of Estrogens in Lupus Erythematosus: National Assessment (SELENA)–SLEDAI, and the Responder Index for Lupus Erythematosus (RIFLE). From this pool of vignettes, 5 common vignettes and 10 randomly selected vignettes were rated through a secure Web site by 88 international experts on SLE. The experts, who were blinded to the activity measure scores, were asked to rate each patient's clinical condition as worsened, improved, or unchanged relative to the previous visit. These ratings were transformed by statistical procedures into performance characteristic curves that related a change on a particular SLE activity measure to the physicians' agreement on whether that patient had worsened, improved, or remained the same clinically. These were discussed by the committee members, who were blinded to the actual instrument used. The committee then voted on what level of expert agreement would be used to determine clinically meaningful change.

Results

The physician ratings on the 5 common vignettes revealed considerable variation in their clinical appraisals. Overall, the 6 SLE activity measures showed excellent separation of clinical conditions as being worsened, improved, or the same. The committee voted to take 70% agreement by physicians as the point on the performance characteristic curves at which meaningful change in a score could be identified. For each instrument, we computed the units of change required to indicate improvement or worsening.

Conclusion

To our knowledge, these are the first response criteria in any disease where a clinically relevant change has been determined a priori and mapped to standardized measures. This criterion should aid the clinical evaluation of new therapies, improve comparability between trials, and facilitate innovative trial designs.

The treatment of systemic lupus erythematosus (SLE) has improved dramatically, but the morbidity, long-term course, treatment-associated morbidity, and refractory subsets of SLE still impose a considerable toll on patients (1). With the remarkable advances in biology and powerful new technologies directed toward identifying targets, a burgeoning number of new therapeutic possibilities have appeared. However, the means by which these possibilities will be evaluated for their impact on target organ systems and the patient's health and well-being is in its infancy and lags behind drug development.

The conduct of clinical trials of therapeutic agents in SLE is challenged by the relatively small numbers of patients who are eligible for such trials, the heterogeneity of the disease, and the lack of reliable markers of disease activity and organ damage (2). In addition, clinical response continues to be defined on ad hoc and post hoc bases. Standardized criteria for SLE trials would have enormous advantages for testing new agents (3, 4). They would provide a common basis for comparing treatment options and would eliminate post hoc analyses of effects. Standardizing response criteria would permit both qualitative and quantitative (meta-analysis) syntheses of different clinical trials. Criteria would also permit the use of innovative clinical trial designs, which might be more efficient than the standard randomized clinical trials.

Defining a minimally important clinical difference is critical for the conduct of rigorous and interpretable clinical trials. First, lupus is a disease in which the activity can improve in some organ systems and worsen in others. Physicians evaluating the same patient may differ in their assessment of the overall disease activity; this is demonstrated graphically in the present study. Use of composite measures of overall activity permits the calculation of a summary score, and defining a minimally important clinical difference for overall activity and for individual organ systems reduces the measurement error. Second, many trials in SLE enroll too few subjects to demonstrate differences or to be definitive (i.e., they lack adequate statistical power), and deciding on the number of subjects that are needed requires an estimate of an important effect size. Finally, a definition of minimally important differences allows one to interpret the clinical relevance of any observed difference in disease activity.

The American College of Rheumatology (ACR) charged the Ad Hoc Committee on SLE Response Criteria with the development of criteria by which interventions can be evaluated in SLE. This work builds on the work of the Systemic Lupus International Collaborating Clinics (SLICC) and Outcome Measures in Rheumatology (OMERACT) groups, who recommended that all clinical trials in SLE include measures of cumulative organ damage, SLE disease activity, health-related quality of life, and adverse events (5). This report details the empirical work that led to the definition of a minimally important clinical difference for 6 existing instruments that have documented metric properties or have been used in clinical trials. A report on suggested criteria for evaluating the steroid-sparing ability of interventions in SLE appears elsewhere in this issue of Arthritis & Rheumatism (6). Articles describing our work on response criteria for specific target organs are in preparation.

METHODS

The committee consisted of clinicians and trials methodologists from the ACR, SLICC, European League Against Rheumatism, Pan American League of Associations for Rheumatology, International League of Associations for Rheumatology, Food and Drug Administration, and OMERACT. It met at Schloss Mickeln, Heinrich-Heine-University, Düsseldorf, Germany, on May 9–12, 2002. No industry funding was accepted. All participants signed ACR Conflict of Interest statements and attested that they met the standards of ethical conduct as delineated by the ACR. Primary data analyses and all interpretations were performed blindly with regard to the identity of the individual patients and the specific instruments. After a period of open comment from persons responding to the advertisement on the ACR Web site and from consultants who were actively solicited (see Acknowledgments), this report was reviewed by the Committee on Research and endorsed by the Board of Directors of the ACR.

Identification of expert clinicians in SLE.

The committee reviewed the results of an international survey in which expert clinicians evaluated the case histories of actual SLE patients. A list of 338 “SLE expert clinicians” was assembled by inspecting the membership rolls of the SLICC, the Editorial Board of the journal Lupus, speakers and authors on SLE presenting at ACR meetings during 1997–1999, attendees of the Fifth International Conference on SLE, and the 1998 ACR Membership Directory.

To quantify the relationship between the SLE disease activity measures and the physicians' judgment of clinical changes, clinicians were instructed to evaluate a large number of actual cases and to decide whether the patient had improved, remained the same, or had worsened. These clinicians were blinded to the scores on the SLE disease activity measures. The level of expert agreement on a particular change was set by the committee (see Results). This allowed us to map a change in score on a given instrument to clinically relevant improvement, worsening, or no change.

Vignette construction.

The persons who organized the premeeting study did not participate in the study itself. Vignettes for the survey were abstracted by one physician (Andrew Moore) from the medical records of 310 patients. The case histories came from the Montreal General Hospital (n = 86), a multicenter trial of plasmapheresis and cyclophosphamide (n = 93) (7), and from European patients in a study of an SLE activity measure (n = 131) (8).

There were no data available from a clinical trial or an observational cohort in which all 6 measures were scored in real time. Therefore, to carry out the exercise, we used cases scored by different raters and, for some measures, scored retrospectively. The British Isles Lupus Assessment Group (BILAG) (9) and European Consensus Lupus Activity Measure (ECLAM) (8) scores were obtained prospectively in vignettes 180–310 and retrospectively in vignettes 1–179. The Systemic Lupus Erythematosus Disease Activity Index (SLEDAI) (10) and revised Systemic Lupus Activity Measure (SLAM-R) (11, 12) were rated prospectively by the physicians participating in each study. For all vignettes, the Safety of Estrogens in Lupus Erythematosus: National Assessment (SELENA)–SLEDAI (13) was rated retrospectively by Dr. Jill Buyon, the Responder Index for Lupus Erythematosus (RIFLE) (14) by Dr. Michelle Petri, and the BILAG by Drs. Sonya Abraham and David Isenberg. The ECLAM was rated retrospectively by Dr. Marta Mosca for vignettes 1–179. The BILAG and RIFLE are ordinal transition scales, and changes on these instruments had to be transformed into continuous data for the analyses.

Each vignette provided demographic information, history, symptoms, and/or physical findings. Unavailable information on the history, physical examination, or laboratory results was inferred from data on the SLE activity subscales. After their construction, each vignette was rescored with the SLAM-R and the SLEDAI to ensure an accurate backward and forward transformation. Vignettes 1–194 included data from the baseline, 2-month, and 6-month encounters; the rest had data from only the baseline and 2-month encounters.

Reproducibility of ratings.

To evaluate the reproducibility of the experts' ratings, 5 common vignettes were assigned to all survey respondents. To ensure a sufficient number of responses across a range of SLE activity, we divided the vignettes by 5 intervals of change in SLE activity between baseline and the 2-month evaluation, and then we randomly sampled vignettes from each group (Figure 1). Ten additional vignettes were sampled from the rest of the vignettes for each participant, and the order of presentation of the 15 vignettes was also randomized. Table 4 gives an example of a vignette.
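The assignment scheme described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the seed, and the use of simple random sampling for the 10 additional vignettes are ours, and the stratification by 5 intervals of change that was used to choose the common vignettes is not reproduced. The common vignette numbers are those listed in Table 1.

```python
import random

def assign_vignettes(common_ids, pool_ids, n_random=10, seed=None):
    """Return one participant's randomized list of vignette numbers.

    Hypothetical helper: the common vignettes go to everyone, the
    additional ones are drawn at random from the rest of the pool,
    and the presentation order of the whole set is then shuffled.
    """
    rng = random.Random(seed)
    common = list(common_ids)
    excluded = set(common)
    remaining = [v for v in pool_ids if v not in excluded]
    extra = rng.sample(remaining, n_random)  # sampling without replacement
    assigned = common + extra
    rng.shuffle(assigned)  # randomize the order of presentation
    return assigned

COMMON = [54, 89, 109, 137, 167]   # common vignettes, as listed in Table 1
POOL = range(1, 311)               # 310 abstracted vignettes
assignment = assign_vignettes(COMMON, POOL, seed=1)
```

Each participant would receive a different draw (here controlled by the assumed `seed` argument), while the 5 common vignettes appear in every assignment.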

Figure 1. Illustrative example of performance characteristics of a given systemic lupus erythematosus (SLE) activity measure. The figure shows the relationship between a change in disease activity score on a given disease activity measure and the probability that the expert physicians judged this as improved, no change, or worsened. The confidence intervals are illustrated at select points. For example, a 4-point decrease in activity score (vertical dotted line) corresponds to an 82% chance that the experts judged the patient as improved, a 12% chance of no change, and a 6% chance of worsened. Red curve = worsened; green = no change; blue = improved. ΔX = change in SLE activity measure (the number of points is indicated along the x-axis); P(response/ΔX) = percentage of physicians indicating “improved,” “no change,” or “worsened” for a given ΔX; 95% CI = 95% confidence interval.

Table 1. Between-physician variations in rating of “common” vignettes*

                            Physician responses, %
Vignette no.  Transition    Worse   No change   Improved   Total no.
54            M0 to M2       6.7      13.3        80.0        75
54            M2 to M6       6.7      45.3        48.0        75
89            M0 to M2       1.4       1.4        97.3        74
89            M2 to M6      14.9      23.0        62.2        74
109           M0 to M2      12.0      28.8        58.9        73
109           M2 to M6      11.0      26.0        63.0        73
137           M0 to M2      37.8      51.4        10.8        74
137           M2 to M6       6.8      58.1        35.1        74
167           M0 to M2      45.8      22.2        31.9        72
167           M2 to M6       5.6       6.9        87.5        72

* M0 = month 0 (baseline); M2 = month 2; M6 = month 6.

For the Internet survey, a secure relational database was constructed. Patient data were presented chronologically, and the response could not be changed. The vignettes were presented without scores from the SLE disease activity measures.

Goals of statistical procedures.

The statistical techniques detailed in Appendix A essentially mapped a clinician's appraisal of whether there had been a meaningful change in scores on the disease activity measures. Since the physicians' assessments were often in disagreement, it is more accurate to describe the probability of agreement that a given patient had improved, experienced no change, or worsened. Statistical procedures and computer simulations were used to produce performance characteristic curves relating this probability to scores on the activity measures, as well as to estimate confidence intervals, smooth curves, and adjust for the number of vignettes that were rated by the experts.
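As a rough illustration of what a performance characteristic curve summarizes, the sketch below tabulates, for each observed change in score, the proportion of expert judgments falling in each category. It is not the procedure of Appendix A: it omits the smoothing, the confidence intervals, and the adjustment for the number of vignettes rated, and the function name and example ratings are hypothetical.

```python
from collections import Counter, defaultdict

CATEGORIES = ("worsened", "no change", "improved")

def empirical_curves(ratings):
    """Tabulate raw judgment proportions per change in activity score.

    ratings: iterable of (delta_score, judgment) pairs, one per expert
    response, where judgment is one of CATEGORIES.
    Returns {delta_score: {category: proportion}}.
    """
    counts = defaultdict(Counter)
    for delta, judgment in ratings:
        counts[delta][judgment] += 1
    curves = {}
    for delta in sorted(counts):
        total = sum(counts[delta].values())
        curves[delta] = {c: counts[delta][c] / total for c in CATEGORIES}
    return curves

# Hypothetical responses, for illustration only (not study data).
example = [
    (-4, "improved"), (-4, "improved"), (-4, "improved"), (-4, "no change"),
    (0, "no change"), (0, "worsened"),
    (3, "worsened"),
]
curves = empirical_curves(example)
```

With real data, each point of such a table would correspond to the unsmoothed height of the three curves at a given change in score.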

There are no conventions for setting the level of agreement between physicians' ratings to establish quantitative categories of better, no change, and worse. The committee therefore voted on the level of agreement before the data were reviewed. Using this level of agreement, the performance characteristic curves were inspected for corresponding scores on a given disease activity measure.

RESULTS

Characteristics of the survey respondents.

From the initial list of 338 physicians, e-mail addresses were identified for 255 (75.4%). They were invited to participate via an ACR-issued e-mail that explained the project and gave them a user name and password. In all, 130 experts logged on to the survey between February 29 and April 4, 2000; 116 of them (78 men and 38 women; 45.5% of the 255 persons contacted by e-mail) answered the initial demographic queries. Of the nonresponders who could be contacted, 12 explained their reasons for not participating, and 19 had technical problems that were then corrected. The respondents who completed all the vignettes and gave permission to be acknowledged are listed in the Acknowledgments.

Of the 116 participants, 108 were from a teaching institution, 96 had 10 or more years of experience in managing lupus, and 24 countries were represented. The mean age of the participants was 46.3 years, compared with a mean age of 50.2 years for the nonparticipants.

Response patterns.

A total of 88 SLE experts completed at least 1 patient vignette, and 68 of them (77%) completed all 15 vignettes that were assigned. Twenty of the experts completed only a portion of their assigned vignettes, with the number of responses per physician varying between 1 and 11. The survey yielded a total of 1,090 responses. These responses covered 232 different vignettes.

The response patterns of the 20 experts who did not complete all of the survey were screened for evidence of nonrandom, biased selection. Such a bias could occur if, for example, a physician evaluated “easier” vignettes and omitted more difficult ones. Eighteen physicians who evaluated a part of the vignettes responded in a pattern consistent with random selection; that is, they completed the first vignettes in the prescribed sequence and omitted the last vignettes. Given that the order of the vignettes in the sequence was randomized, there was no indication of a bias in these responses, and the data were retained for the final analyses. Two of the 116 physicians appeared to selectively rate vignettes, and their responses were not included.
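The screening rule described above, that a partial responder is consistent with random selection if he or she completed the first vignettes of the prescribed sequence and omitted the last, can be expressed as a simple check. The function name and the example sequences are hypothetical; the study's actual screening may have used additional judgment.

```python
def completed_in_sequence(assigned, completed):
    """Return True if the completed vignettes are exactly the first k
    vignettes of the assigned (randomized) sequence, for some k."""
    done = set(completed)
    return done == set(assigned[:len(done)])

# Hypothetical assigned order for one participant.
assigned_order = [212, 54, 17, 89, 301, 109]

stopped_early = completed_in_sequence(assigned_order, [212, 54, 17])
skipped_some = completed_in_sequence(assigned_order, [212, 17, 301])
```

A participant who stopped early passes the check, whereas one who skipped vignettes within the sequence does not and would be flagged for closer review.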

Interphysician variation in assessments of 5 common vignettes.

All 5 common vignettes were evaluated by 68 physicians. Table 1 shows the results, which demonstrate that for some vignettes, there was impressive consistency, whereas for other vignettes, there was substantial interphysician variation. The committee discussed these vignettes in detail. Although there may have been variation based on occasional misunderstanding of the vignettes, the likely reason for the discrepancies was that clinicians differed in their weighting of manifestations, especially in circumstances where some manifestations improved and others worsened.
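To make the extent of this variation concrete, the percentages in Table 1 can be screened for the transitions in which the most frequent judgment reached the 70% level of agreement adopted by the committee. The sketch below only re-tabulates the published percentages; the variable and function names are ours.

```python
# Percentages (worse, no change, improved) transcribed from Table 1.
TABLE_1 = {
    (54, "M0 to M2"): (6.7, 13.3, 80.0),
    (54, "M2 to M6"): (6.7, 45.3, 48.0),
    (89, "M0 to M2"): (1.4, 1.4, 97.3),
    (89, "M2 to M6"): (14.9, 23.0, 62.2),
    (109, "M0 to M2"): (12.0, 28.8, 58.9),
    (109, "M2 to M6"): (11.0, 26.0, 63.0),
    (137, "M0 to M2"): (37.8, 51.4, 10.8),
    (137, "M2 to M6"): (6.8, 58.1, 35.1),
    (167, "M0 to M2"): (45.8, 22.2, 31.9),
    (167, "M2 to M6"): (5.6, 6.9, 87.5),
}
LABELS = ("worse", "no change", "improved")

def modal_judgment(percentages):
    """Return the most frequent judgment and its percentage."""
    top = max(range(len(LABELS)), key=lambda i: percentages[i])
    return LABELS[top], percentages[top]

# Transitions in which the modal judgment reached 70% agreement.
reaching_70 = [
    key for key, pct in TABLE_1.items() if modal_judgment(pct)[1] >= 70.0
]
```

By this tabulation, only 3 of the 10 transitions (vignette 54 and vignette 89 from M0 to M2, and vignette 167 from M2 to M6) reached 70% agreement on a single judgment.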

Relationship between changes in SLE disease activity scores and experts' assessments of overall change.

The analyses were based on 767 responses covering the transition from baseline to 2 months and on 529 responses covering the transition from 2 months to 6 months. Table 2 shows the distributions of the experts' responses used in the analyses. The relatively lower frequency of physicians reporting an increase in activity implies that the results are more precise with respect to estimating the probability of an improvement than with respect to estimating the probability of a worsening.

Table 2. Distribution of physician judgments*

                              Physician responses, no. (%)
Instrument      Transition    Worse         No change     Improved      Total no.
BILAG           M0 to M2      179 (23.7)    139 (18.4)    437 (57.9)    755
BILAG           M2 to M6       96 (18.2)    144 (27.2)    289 (54.6)    529
SLEDAI          M0 to M2      180 (23.5)    140 (18.3)    447 (58.3)    767
SLEDAI          M2 to M6       96 (18.2)    144 (27.2)    289 (54.6)    529
SLAM-R          M0 to M2      180 (23.5)    140 (18.3)    447 (58.3)    767
SLAM-R          M2 to M6       96 (18.2)    144 (27.2)    289 (54.6)    529
ECLAM           M0 to M2      180 (23.5)    140 (18.3)    447 (58.3)    767
ECLAM           M2 to M6       95 (18.3)    139 (26.7)    286 (55.0)    520
SELENA–SLEDAI   M0 to M2      164 (25.7)    114 (17.8)    361 (56.5)    639
SELENA–SLEDAI   M2 to M6       93 (17.7)    144 (27.4)    289 (54.9)    526
RIFLE           M0 to M2      179 (23.6)    138 (18.2)    442 (58.2)    759
RIFLE           M2 to M6       95 (18.0)    143 (27.1)    289 (54.8)    527

* BILAG = British Isles Lupus Assessment Group; M0 = month 0 (baseline); M2 = month 2; M6 = month 6; SLEDAI = Systemic Lupus Erythematosus Disease Activity Index; SLAM-R = revised Systemic Lupus Activity Measure; ECLAM = European Consensus Lupus Activity Measure; SELENA = Safety of Estrogens in Lupus Erythematosus: National Assessment; RIFLE = Responder Index for Lupus Erythematosus.

Figure 1 depicts hypothetical data on the performance characteristics of a given SLE activity measure. It shows the relationship between a change in disease activity score and the probability that the physicians would judge this as “improved,” “no change,” or “worsened,” with confidence intervals at selected points. The probability of “improved” is very low in the right part of the graph, where there is an increased activity score, and this probability increases rapidly when the activity score decreases. For example, a 4-point decrease in the score (indicated by the vertical dotted line) corresponds to an ∼82% chance that an expert will judge the patient as having improved, with a 12% and a 6% probability of judgments of no change and worsened, respectively. Thus, for the measure, a decrease of 4 or more points gives high confidence that the patient improved according to the overall assessment by the expert physicians. Indeed, the 95% confidence interval of 0.74–0.90 confirms that the probability of “improvement” at ΔX = −4 is at least 74%, and indicates satisfactory precision of the curves.

Figure 2 depicts the actual data for each of the instruments between baseline and 2 months. Similar plots were generated for the period between 2 months and 6 months (results not shown). These curves were used to determine the change in score on each SLE instrument that corresponded to ∼70% agreement for “better,” “no change,” and “worse.” If another level of agreement were to be chosen, the change for these categories could easily be determined from the curves.
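Reading a threshold off a curve for a chosen level of agreement amounts to finding the smallest change at which the estimated probability reaches that level. A minimal sketch follows, assuming the curve is available as a table of probabilities at integer changes in score and rises steadily as the decrease grows. The curve values are hypothetical apart from the 0.82 at a 4-point decrease, which is taken from the illustrative Figure 1; the function name is ours.

```python
def smallest_decrease_reaching(curve, level=0.70):
    """Return the decrease of smallest magnitude whose estimated
    probability of an "improved" judgment is at least `level`.

    curve: {delta_score: P(improved)}; only negative deltas are considered.
    Returns None if no tabulated decrease reaches the level.
    """
    candidates = [d for d, p in curve.items() if d < 0 and p >= level]
    return max(candidates) if candidates else None

# Hypothetical P(improved) values at each decrease in score.
improved_curve = {-1: 0.35, -2: 0.52, -3: 0.68, -4: 0.82, -5: 0.90}

at_70 = smallest_decrease_reaching(improved_curve, level=0.70)
at_50 = smallest_decrease_reaching(improved_curve, level=0.50)
```

The same lookup applied to the "worsened" curve, with positive changes in score, would give the corresponding threshold for worsening.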

Figure 2. Data plotted for 6 systemic lupus erythematosus (SLE) activity measures. Analogous to Figure 1, the x-axis depicts the change in units of an SLE disease activity measure and the 0 point of the instrument. The units are omitted for simplicity. The y-axis is the probability of the experts rating a change as improved (blue line), unchanged (green line), and worsened (red line). The probabilities are also omitted for simplicity. SELENA = Safety of Estrogens in Lupus Erythematosus: National Assessment; RIFLE = Responder Index for Lupus Erythematosus; SLEDAI = Systemic Lupus Erythematosus Disease Activity Index; BILAG = British Isles Lupus Assessment Group; SLAM = revised Systemic Lupus Activity Measure; ECLAM = European Consensus Lupus Activity Measure.

In the absence of any standard criterion, the committee voted that if 70% of the respondents agreed that a patient's clinical condition had improved, worsened, or stayed the same it would constitute significant agreement. Using the 70% criterion, the change in any given instrument score corresponding to a clinically important improvement or worsening was then computed from the performance characteristic curves (Table 3). All 6 instruments we studied showed good to excellent discriminatory properties and separated patients according to whether their condition had improved, worsened, or remained the same.

Table 3. Clinically meaningful differences for specific instruments*

Instrument      Reference   Improved   Worsened
BILAG           9           −7         +8
SLEDAI          10          −6         +8
SLAM-R          11, 12      −4         +6
ECLAM           8           −3         +4
SELENA–SLEDAI   13          −7         +8
RIFLE           14          −4         +3

* A “clinically meaningful” difference was defined as a minimum change in the systemic lupus erythematosus activity score that corresponds to a ≥0.70 estimated probability that experts would judge the patient as having “improved” or “worsened.” BILAG = British Isles Lupus Assessment Group; SLEDAI = Systemic Lupus Erythematosus Disease Activity Index; SLAM-R = revised Systemic Lupus Activity Measure; ECLAM = European Consensus Lupus Activity Measure; SELENA = Safety of Estrogens in Lupus Erythematosus: National Assessment; RIFLE = Responder Index for Lupus Erythematosus.
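One way the thresholds in Table 3 might be applied to an individual patient is sketched below. It assumes that a change at least as large as the tabulated value counts as meeting the criterion and that the change is computed as follow-up score minus baseline score; both are our reading of the table footnote rather than a stated rule, and the function name and example scores are hypothetical. Changes that meet neither threshold are reported as such, not as "no change."

```python
# (improvement threshold, worsening threshold) in instrument units, from Table 3.
THRESHOLDS = {
    "BILAG": (-7, 8),
    "SLEDAI": (-6, 8),
    "SLAM-R": (-4, 6),
    "ECLAM": (-3, 4),
    "SELENA-SLEDAI": (-7, 8),
    "RIFLE": (-4, 3),
}

def classify_change(instrument, baseline, follow_up):
    """Classify a change in score against the Table 3 thresholds."""
    improve_at, worsen_at = THRESHOLDS[instrument]
    delta = follow_up - baseline
    if delta <= improve_at:
        return "improved"
    if delta >= worsen_at:
        return "worsened"
    return "neither threshold met"

# Hypothetical patients, for illustration only.
result_a = classify_change("SLEDAI", baseline=12, follow_up=6)
result_b = classify_change("SLEDAI", baseline=12, follow_up=8)
result_c = classify_change("ECLAM", baseline=2, follow_up=6)
```

Here a SLEDAI score falling from 12 to 6 meets the improvement threshold, a fall from 12 to 8 does not, and an ECLAM score rising from 2 to 6 meets the worsening threshold.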

DISCUSSION

SLE is a complex disease that is clinically challenging to evaluate, and it has a pathobiology that is incompletely understood. That notwithstanding, we were able to derive empirically determined changes in overall measures of disease activity that constitute a minimally important improvement or worsening (Table 3). We believe this is the first time in SLE or any disease that a clinically meaningful difference has been defined first and then mapped to quantitative clinical scales. Having standardized end points, even though they are imperfect, will provide practical advantages to the field as well as to lupus patients themselves. The response criterion should be tested on primary data from appropriate randomized clinical trials.

Table 4. Illustrative survey vignette*

[Vignette presented as an image in the original.]

* SLE = systemic lupus erythematosus; BP = blood pressure; ESR = erythrocyte sedimentation rate; RBCs = red blood cells; WBCs = white blood cells; hpf = high-power field; CYC = cyclophosphamide.

If alternative cut points are to be used, they need to be established a priori. All 6 of the disease activity measures we studied demonstrated excellent performance characteristics in separating responses, which underscores their usefulness in quantitative studies.

In ∼10% of the patients in our data set, an organ manifestation improved while others worsened. This lends support to the use of global indices and, perhaps, explains the variation in the physicians' overall appraisals of patients that was observed in the exercise. This observation also undermines the assumption that there is a single molecular target in the pathogenesis of SLE.

There are some caveats, however. For practical reasons, we had to use abstracted vignettes. Both the abstracted vignettes and the information from clinical notes represent a series of interpretations and judgments about the actual patient. Nevertheless, all participants were presented with the same information, and neither the results nor the response criteria should have been affected.

Also, by necessity, the scoring of the SLE measures we evaluated was done by different physicians. Some were done by physicians who were involved with the actual care of the patient represented in the vignette, and others were done retrospectively, using only the information contained in the vignettes. Again, this introduces some systematic errors, but all participants saw the same information, so this should not have affected the results.

This exercise attempted to capture a physician's assessment of a “meaningful change.” This judgment of “meaningful” may be discordant with that of the patient (15, 16), whose appraisal may be driven by their dominant symptoms (in contrast to the most serious pathophysiology), by their priorities, and/or by the severity of their disease at baseline. The discordance between the physician's assessment and the patient's assessment of meaningful change, particularly with regard to patient-reported symptoms, should be studied. Investigators may wish to express the change in disease activity as both the absolute change and the percentage change, and to use transition questions to capture “meaningful change” in the individual patient (e.g., Have you experienced a change in your symptoms? Has this change made a difference to you? How much of a difference has this been?) (17).

Finally, one notes that the change in overall disease activity corresponding to a worsening of the patient's clinical condition would likely be larger for patients with high levels of disease activity at baseline and smaller for patients with lower levels of disease activity at baseline. Future research might explore whether examining the percentage change or the percentage decrease or using different cutoff points for patients entering a trial with lower baseline levels of disease activity would permit a more accurate depiction.

In summary, the committee recommended that controlled trials of therapy in SLE should use organ-specific measures, with response criteria that are defined a priori, and valid, reliable composite instruments for evaluating overall disease activity. Although composite indices reduce sample size requirements and have advantages for statistical analyses, they can, by their nature, mask worsening and responding organ systems. This makes it important to present both the overall activity and the activity in individual organs.

The 6 instruments we examined demonstrated discriminatory properties that were more than sufficient for use in clinical trials. It is likely that other validated measures would be useful as well. Investigators are urged to use one of the instruments and to calculate a sample size using the response criteria (or minimally important difference) for that particular disease activity measure. By implication, patients need to have a clinically important and sufficient level of disease activity prior to treatment in order to demonstrate a significant change (18). The choice of which activity measure to use should be based on the specific study, costs, convenience, and other factors beyond the scope of this analysis.
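As a hedged illustration of how a minimally important difference might enter a sample size calculation, the sketch below uses the standard normal-approximation formula for comparing two group means, n per group = 2σ²(z₁₋α/₂ + z₁₋β)²/Δ², with Δ set to the minimally important difference. Treating the Table 3 value as the between-group difference in mean change is our assumption, as are the standard deviation, the function name, and the default α and power; a trial statistician would choose the design and inputs appropriate to the study.

```python
import math
from statistics import NormalDist

def n_per_group(mid, sd, alpha=0.05, power=0.80):
    """Approximate subjects per arm to detect a difference of `mid`
    between two group means (two-sided test, equal allocation,
    normal approximation, common standard deviation `sd`)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * (sd * (z_alpha + z_beta) / mid) ** 2)

# SLEDAI improvement threshold from Table 3 is 6 points;
# the standard deviation of 8 points is a hypothetical value.
n_sledai = n_per_group(mid=6, sd=8)
```

Because the required number grows with the square of σ/Δ, a smaller minimally important difference or a more variable measure increases the sample size quickly.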

In addition to measuring overall disease activity, individualized organ-specific measures should be used. A priori response criteria for a specific organ/manifestation should be defined. A process was started in Düsseldorf, and the work will be the subject of future publications. There is also a need to test these criteria using actual clinical data sets. Identifying enough patients with specific organ involvement will be difficult, and it means that for all practical purposes in the foreseeable future, consensus will have to suffice.

Additional recommendations of the committee are as follows:

  1. The use of an independent end-points committee whose members are blinded to treatment status could be valuable for adjudicating the status of patients and for ensuring the internal validity of a trial.
  2. Procedures for ensuring the reliability and accuracy of the objective data (e.g., urinary sediment) and the subjective data collected from the subjects or clinician assessors (e.g., overall disease activity measures) in a study are an essential part of ensuring precision. The results of reliability tests that are performed during the trial should be reported.
  3. A strong program of research on the identification and testing of biologic, imaging, and clinical and laboratory markers of activity and organ damage, as well as of disease activity that leads to long-term organ damage, is needed.
  4. The published reporting standards for clinical trials (19) need to be supplemented by additional information in SLE trials in order to improve the quality and interpretation of the findings.

All science, in one sense, is about measurement, but not all measurement is science. Current measures of clinical phenomena are simply measures, nothing more, until the cause(s) of SLE and its subsets are elucidated. The committee acknowledges that these recommendations are but a beginning to what, in the final analysis, must be judged by new data and by their usefulness in furthering the treatment-discovery process and improving patient outcomes.

Acknowledgements

We gratefully acknowledge the invaluable contributions of Erika Chang, MSc, Elizabeth Concepcion, Kaleena Scamman, Mary Scamman, Victoria Gall, RPT, Jessica Tullar, Jennifer Akerblom, Connie Herndon, Sonya Abraham, and Roxane duBerger. Amy Miller coordinated and supported the meeting in Düsseldorf. We are indebted to Dr. Jeffrey Siegel, who attended the Düsseldorf meeting and commented on earlier drafts of the manuscript.

The following physicians participated in the Web-based survey: Sang-Cheol Bae, Gilles Boire, Larry Brent, Frank Buttgereit, Jill Buyon, Richard Cervera, Alf Cividino, Leslie Crofford, John Davis, Michal De Bandt, Raphael DeHoratius, R. H. W. Derksen, Pao-Hssii Feng, Barri Fessler, Alan Friedman, Azzudin Gharavi, Gary Gilkeson, Winfried Graninger, E. Gromnica-Ihle, Hiroshi Hashimoto, Marc Hochberg, Frederic Houssiau, Gabor Illei, Mariana Kaplan, Elizabeth Karlson, John Klippel, Masataka Kuwana, Michael Lockshin, Klaus Machold, Walter Maksymowych, Bernhard Manger, Thomas Medsger, Yair Molad, James Oates, Chaim Putterman, Rosalind Ramsey-Goldman, Morris Reichlin, John Reveille, Jane Salmon, Emilia Inoue Sato, Johann Schroeder, Robert Shmerling, Yeong Wook Song, Christof Specker, Gunnar Sturfelt, Deborah Symmons, Tsutomu Takeuchi, L. B. A. van de Putte, Carlos Vasconcelos, Asad Zoma, and Michel Zummer.

We are indebted to the committee's consultants, who reviewed earlier drafts of the manuscript. Their input sharpened the work considerably. They are Mee Leng Boey, Dimitrios Boumpas, Richard Brasington, Deh Ming Chang, Jefferson Doyle, Vern Farewell, Ellen Ginzler, Bevra Hahn, Jie Huang, Elizabeth Karlson, C. S. Lau, Joan Merrill, Ola Nived, Stanley R. Pillemer, Theresa Podrebarac, Janet Pope, Rosalind Ramsey-Goldman, Kristian Steinsson, Alan Tyndall, Dan Wallace, Michael Ward, and David Wofsy.

REFERENCES

  • 1
    Lockshin MD. Therapy for systemic lupus erythematosus. N Engl J Med 1991; 324: 189–91.
  • 2
    Liang MH, Corzillius M, Bae SC, Fortin P, Esdaile JN, Abrahamowicz M. A conceptual framework for clinical trials in SLE and other multisystem diseases. Lupus 1999; 8: 1–11.
  • 3
    Liang MH. Toward more informative pilot studies with new anti-rheumatic drugs. Agents Actions Suppl 1993; 44: 77–81.
  • 4
    Liang MH, Fortin PR. Response criteria for clinical trials in systemic lupus erythematosus [published erratum appears in Lupus 1997;6:619]. Lupus 1995; 4: 336–8.
  • 5
    Smolen JS, Strand V, Cardiel M, Edworthy S, Furst D, Gladman D, et al. Randomized clinical trials and longitudinal observational studies in systemic lupus erythematosus: consensus on a preliminary core set of outcome domains. J Rheumatol 1999; 26: 504–7.
  • 6
    Ad Hoc Working Group on Steroid-Sparing Criteria in Lupus. Criteria for steroid-sparing ability of interventions in systemic lupus erythematosus: report of a consensus meeting. Arthritis Rheum 2004; 50: 3427–31.
  • 7
    Euler HH, Guillevin L. Plasmapheresis and subsequent pulse cyclophosphamide in severe lupus erythematosus: an interim report of the Lupus Plasmapheresis Study Group. Ann Intern Med 1994; 145: 296–302.
  • 8
    Bencivelli W, Vitali C, Isenberg DA, Smolen JS, Snaith ML, Sciuto M, et al, and The European Consensus Study Group for Disease Activity in SLE. Disease activity in systemic lupus erythematosus: report of the Consensus Study Group of the European Workshop for Rheumatology Research. III. Development of a computerised clinical chart and its application to the comparison of different indices of disease activity. Clin Exp Rheumatol 1992; 10: 549–54.
  • 9
    Hay EM, Bacon PA, Gordon C, Isenberg DA, Maddison P, Snaith ML, et al. The BILAG index: a reliable and valid instrument for measuring clinical disease activity in systemic lupus erythematosus. QJM 1993; 86: 447–58.
  • 10
    Bombardier C, Gladman DD, Urowitz MB, Caron D, Chang DH, and the Committee on Prognosis Studies in SLE. Derivation of the SLEDAI: a disease activity index for lupus patients. Arthritis Rheum 1992; 35: 630–40.
  • 11
    Liang MH, Socher SA, Larson MG, Schur PH. Reliability and validity of six systems for the clinical assessment of disease activity in systemic lupus erythematosus. Arthritis Rheum 1989; 32: 1107–18.
  • 12
    Bae SC, Koh HK, Chang DK, Kim MH, Park JK, Kim SY. Reliability and validity of Systemic Lupus Activity Measure-Revised (SLAM-R) for measuring clinical disease activity in systemic lupus erythematosus. Lupus 2001; 10: 405–9.
  • 13
    Petri M, Buyon J, Kim M. Classification and definition of major flares in SLE clinical trials. Lupus 1999; 8: 685–91.
  • 14
    Petri M, Barr SG, Buyon J, Davis J, Ginzler E, Kalunian K, et al. RIFLE: Responder Index for Lupus Erythematosus [abstract]. Arthritis Rheum 2000; 43: S244.
  • 15
    Alarcon GS, McGwin G Jr, Brooks K, Roseman JM, Fessler BJ, Sanchez ML, et al, for the LUMINA Study Group. Systemic lupus erythematosus in three ethnic groups. XI. Sources of discrepancy in perception of disease activity: a comparison of physician and patient visual analog scale scores. Arthritis Rheum 2002; 47: 408–13.
  • 16
    Yen JC, Neville C, Fortin PR. Discordance between patients and their physicians in the assessment of lupus disease activity: relevance for clinical trials. Lupus 1999; 8: 660–70.
  • 17
    Liang MH, Lew RA, Stucki G, Fortin PR, Daltroy L. Measuring clinically important changes with patient-oriented questionnaires. Med Care 2002; 40 Suppl 4: II45–51.
  • 18
    Abrahamowicz M, Fortin PR, du Berger R, Nayak V, Neville C, Liang MH. The relationship between disease activity and physician's decision to start treatment in active systemic lupus erythematosus. J Rheumatol 1998; 25: 277–84.
  • 19
    Begg C, Cho M, Eastwood S, Horton R, Moher D, Olkin I, et al. Improving the quality of reporting of randomized controlled trials: the CONSORT statement. JAMA 1996; 276: 637–9.
  • 20
    Fortin PR, Abrahamowicz M, Clarke AE, Neville C, DuBerger R, Fraenkel L, et al. Do lupus disease activity measures detect clinically important change? J Rheumatol 2000; 27: 1421–8.
  • 21
    Ramsay JD, Abrahamowicz M. Binomial regression with monotone splines: a psychometric application. J Am Stat Assoc 1989; 84: 906–15.
  • 22
    Abrahamowicz M, Ramsay J. Multicategorical spline model for item response theory. Psychometrika 1992; 57: 5–27.
  • 23
    Hastie TJ, Tibshirani RJ. Generalized additive models. London: Chapman & Hall, 1990.
  • 24
    Efron B, Gong G. A leisurely look at the bootstrap, the jackknife, and cross-validation. Am Statistician 1983; 37: 36–48.

APPENDIX A

STATISTICAL METHODS

The mean number of responses for vignettes other than the 5 common vignettes was ∼3. To keep the 5 common vignettes from exerting excessive influence on the analyses, we reduced the number of responses used for each of them to 9, which corresponds to the 95th percentile of the distribution of the number of responses available for the other vignettes. This was achieved by randomly sampling 9 of the 75 responses available for a given common vignette, with sampling performed independently for each of the 5 vignettes. Only the 9 responses selected for a given vignette were retained for the analyses.
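The down-sampling step described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name, the data layout (a dictionary mapping a vignette identifier to its list of ratings), the rating labels, and the fixed seed are all hypothetical and are not taken from the study's software.

```python
import random

def cap_common_vignette_responses(responses, cap=9, seed=0):
    """Randomly keep at most `cap` responses for each vignette.

    `responses` maps a vignette id to the list of expert ratings it received
    (hypothetical layout). Sampling is without replacement and is performed
    independently for each vignette.
    """
    rng = random.Random(seed)
    return {
        vignette_id: rng.sample(ratings, cap) if len(ratings) > cap else list(ratings)
        for vignette_id, ratings in responses.items()
    }

# Hypothetical data: 5 common vignettes with 75 ratings each.
labels = ["improved", "unchanged", "worse"]
common = {f"common_{i}": [labels[j % 3] for j in range(75)] for i in range(1, 6)}
capped = cap_common_vignette_responses(common)
```

Each common vignette then contributes 9 ratings to the analysis instead of 75, which brings it in line with the most frequently rated of the other vignettes.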

The main analyses determined the relationship between the change in the score on each of the 6 SLE disease activity measures and the probability that an expert would assess the patient's overall SLE activity as 1) improved (less activity), 2) unchanged, or 3) worse (more activity). Given the relatively low frequencies of the extreme responses of “much better” and “much worse,” we pooled “much better” with “better,” and we pooled “much worse” with “worse.”
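The pooling of the 5 response options into 3 outcome categories amounts to a simple recoding, sketched below in Python. The exact wording of the middle response option and the label strings are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical label strings; the 5-to-3 pooling follows the text above.
POOLING = {
    "much better": "improved",
    "better": "improved",
    "same": "unchanged",
    "worse": "worse",
    "much worse": "worse",
}

def pool_rating(raw_rating):
    """Map one of the 5 raw response options to its pooled category."""
    return POOLING[raw_rating.strip().lower()]

# Hypothetical ratings from one vignette.
raw = ["Much better", "better", "same", "worse", "much worse", "better"]
pooled_counts = Counter(pool_rating(r) for r in raw)
```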

The analyses recognized several sources of variation. First, in SLE, different patterns of changes in organ-specific symptoms can yield the same disease activity score. Second, the physicians varied in their assessments of whether a patient had improved, stayed the same, or worsened, even when they assessed the same vignette. We therefore estimated the “average” probabilities (i.e., the probability that applies to the responses of a randomly selected “average” physician for a randomly selected “average” patient with a given change in score [18, 20]). The modeling also ensured that, for any change in score, the estimated probabilities of the 3 responses (i.e., worsened, no change, improved) summed to 1.0.
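One common way to guarantee that the 3 estimated probabilities sum to 1.0 at every change in score is a multinomial-logit form with a smooth function for each response category. The expression below is a sketch under that assumption and may differ from the exact parameterization used in the study. Here $x$ denotes the change in score on a given activity measure and $f_k$ is a hypothetical smooth (e.g., spline) function for response $k$:

```latex
P_k(x) \;=\; \frac{\exp\{f_k(x)\}}
                  {\sum_{j \in \{\text{worsened},\ \text{no change},\ \text{improved}\}} \exp\{f_j(x)\}},
\qquad
\sum_{k} P_k(x) \;=\; 1 \quad \text{for every } x .
```

Because each probability is a share of the same denominator, the constraint holds by construction and does not need to be imposed after fitting.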

To meet these requirements, a computationally intensive approach that combined different nonparametric methods was used. The approach was based on a modified flexible regression spline polytomous regression model (21, 22), using generalized additive models (23). The final stage of the analyses assessed the precision of the probability curves and estimated confidence intervals around the point estimates. The intervals were also adjusted for sources of interdependent observations: 1) the same physician assessed several vignettes, and 2) the same vignette was assessed by several physicians. To account for these, we used a modified bootstrap approach (24), which allowed for direct modeling of the sources of variation by repeated resampling of the original data. The 95% pointwise confidence intervals reported for a given change in score were based on the 2.5th and 97.5th percentiles of the empirical distribution of the 1,000 corresponding bootstrap-based estimates of the probability of a given response (14). Further details on the statistical approaches are available from Dr. Michal Abrahamowicz.
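The percentile step of the bootstrap can be sketched in Python as follows. This is a simplified illustration, not the modified bootstrap used in the study: it resamples physicians only (so it handles the first source of dependence but not the second), and the function names, record layout, and example data are hypothetical.

```python
import random

def percentile(sorted_values, q):
    """Nearest-rank percentile (q in [0, 100]) of an already sorted list."""
    idx = round(q / 100 * (len(sorted_values) - 1))
    return sorted_values[idx]

def bootstrap_ci(records, statistic, n_boot=1000, seed=0):
    """Percentile 95% CI from a cluster bootstrap that resamples physicians.

    `records` is a list of (physician_id, vignette_id, rating) tuples.
    All ratings from a drawn physician are kept together, so dependence
    among ratings by the same physician is preserved in each replicate.
    """
    rng = random.Random(seed)
    by_physician = {}
    for rec in records:
        by_physician.setdefault(rec[0], []).append(rec)
    physicians = sorted(by_physician)
    estimates = []
    for _ in range(n_boot):
        drawn = rng.choices(physicians, k=len(physicians))  # with replacement
        sample = [rec for p in drawn for rec in by_physician[p]]
        estimates.append(statistic(sample))
    estimates.sort()
    return percentile(estimates, 2.5), percentile(estimates, 97.5)

def prop_improved(sample):
    """Proportion of ratings in the sample that are 'improved'."""
    return sum(rec[2] == "improved" for rec in sample) / len(sample)

# Hypothetical data: 10 physicians each rating 5 vignettes.
records = [
    (f"dr_{p}", f"v_{v}", "improved" if (p + v) % 3 else "unchanged")
    for p in range(10)
    for v in range(5)
]
low, high = bootstrap_ci(records, prop_improved)
```

In the study the statistic would be the model-based probability of a given response at a given change in score, re-estimated on each of the 1,000 resampled data sets; the simple proportion used here only stands in for that estimate.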