How to report radiographic data in randomized clinical trials in rheumatoid arthritis: Guidelines from a roundtable discussion
Article first published online: 5 APR 2002
Copyright © 2002 by the American College of Rheumatology
Arthritis Care & Research
Volume 47, Issue 2, pages 215–218, April 2002
van der Heijde, D., Simon, L., Smolen, J., Strand, V., Sharp, J., Boers, M., Breedveld, F., Weisman, M., Weinblatt, M., Rau, R. and Lipsky, P. (2002), How to report radiographic data in randomized clinical trials in rheumatoid arthritis: Guidelines from a roundtable discussion. Arthritis & Rheumatism, 47: 215–218. doi: 10.1002/art.10181
- Issue published online: 5 APR 2002
- Manuscript Received: 5 SEP 2001
- Manuscript Accepted: 6 OCT 2001
Radiographs are important tools for assessing damage to articular structures in patients with rheumatoid arthritis. They are part of the core set of outcome data assessed in rheumatoid arthritis clinical trials and play a crucial role in determining the disease-modifying effect of therapy (1). Interest in radiographic damage is growing with the awareness that several treatment strategies have a beneficial effect in reducing radiographic progression. However, it is also clear that there is wide variability in the methods used to report radiographic data, which hampers comparisons between trials. A uniform set of radiographic data to report in all clinical trials would resolve this issue and enable meta-analyses. Such analyses may allow assessment of the relative impact of various therapeutic agents without the need to carry out expensive and administratively complex direct comparisons.
There is considerable variability in the accrual and reporting of radiographic data. In some studies both the hands and feet are evaluated, whereas in others only the hands are assessed. Many studies use the Sharp or modified Sharp scoring system, whereas others employ the Larsen index (2–6). There is no consistency in reporting mean or median data, and no uniformity in whether analyses are reported for groups of patients or for individuals. Some, but not all, studies report data based on abnormalities (e.g., erosions), on the number of joints (e.g., newly eroded joints), on patients (e.g., number of patients with erosions), or on severity (e.g., extent of erosion in individual joints). Importantly, there is no uniformity in the primary endpoint employed across studies. Adoption of a consistent approach to collecting and reporting data in all trials would facilitate objective publication and comprehension of uniform data sets, especially if this approach is subsequently required by journal editors.
The language used to report radiographic data is also not uniform, and many of the words employed carry different connotations for different individuals. This creates confusion and could imply that specific agents have effects different from others, even though review of the data suggests otherwise. Developing guidelines for the language used to describe comparable results could further improve clarity regarding the effects of different agents. Because of these concerns, it was decided to conduct a roundtable discussion in connection with a workshop on the “Interpretation of the impact of antirheumatic therapies on radiographic progression in rheumatoid arthritis,” held in Bethesda, Maryland, December 3–5, 2000. This provided the opportunity to convene a large, varied group of individuals with an interest in radiographic scoring methods and clinical trial design. Researchers from academia and pharmaceutical companies, as well as representatives from the Food and Drug Administration, were invited to participate. The 2 aims of the roundtable were: 1) to define a minimum set of radiographic data that should be presented for each clinical trial that includes radiographs as an endpoint, and 2) to recommend appropriate wording to describe radiographic data.
The roundtable discussion was developed by the 2 moderators (DvdH, LS), who selected the issues to be discussed. These were then presented with background information to the roundtable participants (the authors of this paper) and to the audience (DvdH). A short presentation on radiographic results from recent trials (VS) was followed by separate presentations of the discussion issues by the moderator (LS). Each issue was first opened for discussion by the roundtable participants and then by the audience. After consensus was reached on a subject, the next discussion point was presented. The roundtable developed a series of recommendations that were presented in final form at the end of the 2-day workshop, at which time all recommendations were again discussed and final modifications were made. The final consensus is presented here.
Minimum reporting set
Consistent with previous recommendations of the Outcome Measures in Rheumatology Clinical Trials (OMERACT) conference, it was agreed that 1) radiographs of hands and feet should be included in each trial with a duration exceeding 1 year; 2) erosions and joint space narrowing are both important because they represent different aspects of the biologic process underlying the development of structural damage; and 3) the smallest detectable difference (SDD), based on the 95% limits of agreement of the observers, should be reported as a form of quality control (7–11).
Number of observers.
The first issue discussed was the number of observers needed to score radiographs. It was determined that a minimum of 2 observers should read the films and that the average score of the 2 observers should be used in the analyses (12). To ensure quality control of the observations, information on the agreement between observers, such as a kappa statistic or the intraclass correlation coefficient (ICC), should be presented. When 2 observers are used, the SDD based on the interobserver agreement should be reported as a further means of establishing the quality of the readings. One observer might be acceptable, especially for non-drug trials such as epidemiologic studies. However, if only a single observer is used, the intraobserver agreement must be presented as a measure of the consistency of the results, and the interobserver agreement of this single reader with another experienced reader should be shown to ensure generalizability of the results. In this circumstance, the SDD based on the intraobserver agreement should be calculated.
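As a sketch of this quality-control step, averaging 2 observers' scores and hand-computing a two-way absolute-agreement ICC might look as follows. The scores and the choice of the ICC(2,1) formulation are illustrative assumptions, not prescribed by the roundtable:

```python
# Illustrative sketch: average of 2 observers' scores plus an ICC(2,1)
# (two-way random effects, absolute agreement) computed by hand.
# All scores below are hypothetical.

def icc_2_1(scores_a, scores_b):
    """ICC(2,1) for 2 raters scoring the same n subjects."""
    n, k = len(scores_a), 2
    rows = list(zip(scores_a, scores_b))
    grand = sum(scores_a + scores_b) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(scores_a) / n, sum(scores_b) / n]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)            # between-subject mean square
    msc = ss_cols / (k - 1)            # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))  # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

reader1 = [4, 10, 2, 15, 7, 0]   # hypothetical total scores, reader 1
reader2 = [5, 12, 2, 14, 8, 1]   # hypothetical total scores, reader 2

# Average score of the 2 observers, as recommended for the analyses.
average_scores = [(a + b) / 2 for a, b in zip(reader1, reader2)]
icc = icc_2_1(reader1, reader2)
```

In practice a statistics package would be used; the point is that both the averaged scores and an agreement statistic are reported together.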
It was widely agreed that it would be important to develop a standard set of radiographs to be used for quality control. This set could be scored by trial readers to test the quality of their assessments, and by reaching a certain standard, certification could be obtained. This recommendation follows previous suggestions developed at OMERACT 4 (7).
Absolute number versus percentage.
Various radiographic scoring methods have been adopted; although the Sharp and Larsen methods with their modifications are the most widely used, other validated systems to analyze radiologic data may also be used. These various methods have different score ranges that vary from a maximum score of 150 to 448. This may make comparisons between methods difficult. An increase of 5 scored with one method does not necessarily imply the same degree of progression as an increase of 5 scored by a different method. Therefore, it has been proposed to present data as a percentage of the maximum possible score (in the case of a method with a maximum score of 150, an increase of 5 would mean a 3.3% increase, whereas in a method with a maximum score of 448 this would mean an increase of 1.1%). Although using a percentage might increase the apparent comparability of the data, it presents inherent problems. For example, it assumes linearity along the complete range of scores and that the entire scale can be utilized with all scoring methods. The final consensus of the attendees was that the absolute numbers should be presented, because the comparison between therapies within a trial is the main goal. By presenting the absolute numbers and the maximum for the method applied, it is always possible to calculate the percentage of the maximum possible score for comparison across methods.
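As a minimal arithmetic sketch of the conversion described above (the maxima of 150 and 448 come from the text; the function name is ours):

```python
# Converting an absolute change in score to a percentage of the scoring
# method's maximum possible score, per the examples in the text.

def percent_of_maximum(change, max_score):
    """Express an absolute score change as % of the method's maximum."""
    return 100.0 * change / max_score

# A 5-point increase under two methods with different score ranges:
low_range = percent_of_maximum(5, 150)    # ~3.3%
high_range = percent_of_maximum(5, 448)   # ~1.1%
```

Reporting the absolute change together with the method's maximum lets any reader perform this conversion when comparing across methods.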
Primary and secondary endpoints.
An important issue is the definition of the primary and secondary endpoints. Because erosions and joint space narrowing give complementary information on different aspects of structural damage, it was considered important that the primary endpoint combine this information in a total score. For the (modified) Larsen system this pooled information is captured in the grading (although joint space narrowing makes no major contribution), whereas for the (modified) Sharp system it is combined in the total score (the sum of the erosion and joint space narrowing scores). For the Sharp system it was recommended that the erosion and joint space narrowing subscores be included as secondary endpoints. This split analysis could demonstrate whether some therapies have a differential effect on the various components of structural damage.
Analyses on treatment group versus individual patient level.
Data may be analyzed at the group level (e.g., the median or mean change in erosion score of the group) or at the individual patient level (e.g., the number of patients with an increase in erosions). Each type of analysis provides valuable but different information. This is especially true because radiographic data are frequently skewed, with a large number of patients showing no or minimal change and a small number showing a major change in radiographic damage. Analyses at the group level are more sensitive for detecting differences between treatments; it was therefore agreed that these should constitute the primary analysis, with patient-based analyses as secondary.
The skewed, non-normal distribution of the data makes it advantageous to present both the mean, with SE or SD, and the median with several percentiles, even though the median is generally regarded as the more appropriate statistic. Box-whisker plots showing the median, the 25th–75th and 10th–90th percentiles, and outliers are a useful and effective method of presenting the data.
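The group-level summaries recommended above can be sketched as follows, using a hypothetical right-skewed set of change scores and a simple nearest-rank percentile rule (one of several conventions):

```python
# Group-level summary statistics for a hypothetical, right-skewed set of
# radiographic change scores: mean with SD, median, and the percentiles
# that would appear in a box-whisker plot.
import statistics

changes = [0, 0, 0, 0.5, 0.5, 1, 1, 1.5, 2, 3, 4, 6, 11, 25]  # hypothetical

mean = statistics.mean(changes)
sd = statistics.stdev(changes)
median = statistics.median(changes)

def percentile(data, p):
    """Nearest-rank percentile (a simple convention; others exist)."""
    data = sorted(data)
    idx = max(0, min(len(data) - 1, round(p / 100 * len(data)) - 1))
    return data[idx]

p10, p25, p75, p90 = (percentile(changes, p) for p in (10, 25, 75, 90))
```

For skewed data like these, the mean is pulled well above the median by the few large progressors, which is exactly why both statistics are worth reporting.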
For the presentation of data on an individual patient basis, a cut-off value reflecting clinical importance must be chosen, and the information presented varies with the cut-off level selected. It was considered important to know the percentage of patients showing any progression. For example, a change of 0.5 could be used when the averaged score of 2 observers is used, and a cut-off level of 0 when there is only one observer. Because of inherent measurement error, however, such a cut-off lacks statistical validity, it is uncertain whether this approach will be useful, and patients may be misclassified: a patient could be classified as having progression even though the radiographic score did not truly increase, and vice versa. Another suggestion is to use the SDD, which differs from trial to trial, as a cut-off level. This is a purely statistical concept that characterizes the instrument and the readers applying it. The SDD corresponds to measurement error: a change smaller than the SDD cannot be distinguished from measurement error. Although a patient with a change larger than the SDD almost certainly shows real progression, patients with lesser degrees of true change may be missed. Because the SDD is usually rather large, it simultaneously serves to identify patients with relatively major progression. After much discussion, consensus was reached that both the percentage of patients with progression >0.5 (for 2 readers; >0 for 1 reader) and the percentage of patients with progression >SDD should be presented. In addition, especially in trials that include patients with early rheumatoid arthritis, it is useful to know the percentage of patients with a score of 0 at baseline and at followup or endpoint. Although all of these measures may be clinically valuable, only progression exceeding the SDD has statistical validity.
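Both patient-level cut-offs can be sketched numerically. The data are hypothetical, and the SDD here is taken as 1.96 times the SD of the between-observer differences (a common reading of the 95% limits of agreement; other conventions exist):

```python
# Hypothetical sketch: SDD from two observers' change scores, then the two
# recommended patient-level percentages (>0.5 cut-off and >SDD cut-off).
import statistics

# Change scores (follow-up minus baseline) per patient, per observer.
obs1 = [0, 1.0, 0.5, 3.0, 6.0, 0, 2.0, 9.0]
obs2 = [0.5, 0.5, 1.0, 2.5, 7.0, 0.5, 1.5, 8.0]

diffs = [a - b for a, b in zip(obs1, obs2)]
sd_diff = statistics.stdev(diffs)
sdd = 1.96 * sd_diff  # half-width of the 95% limits of agreement

avg_change = [(a + b) / 2 for a, b in zip(obs1, obs2)]
pct_any = 100 * sum(c > 0.5 for c in avg_change) / len(avg_change)
pct_sdd = 100 * sum(c > sdd for c in avg_change) / len(avg_change)
```

As the text notes, the >SDD percentage is necessarily no larger than the any-progression percentage, since the SDD cut-off is usually rather large.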
Progression >SDD likely would be more sensitive if more than 2 readers were employed to read every film, although this would increase the trial expense. The recommendations are summarized in Table 1.
Table 1. Recommendations for reporting radiographic data in clinical trials

- Radiographs of hands and feet
- SDD as quality control
- Preferably 2 or more observers
  - Kappa and/or ICC, and SDD for interobserver agreement
  - Average score of observers
- If one observer
  - Kappa and/or ICC for intra- and interobserver agreement
  - SDD for intraobserver agreement
- Presentation of absolute numbers
- Primary endpoint: total score (erosions and joint space narrowing combined)
- Secondary endpoints for Sharp methods: erosions, joint space narrowing
- Primary analysis: group level
  - Reporting of mean, SE/SD
  - Box-whisker plot (median, percentiles, outliers)
- Secondary analysis: patient level
  - % of patients with progression >0.5 for 2 observers, >0 for 1 observer
  - % of patients with progression >SDD
Two additional issues were considered: presentation of the number of newly damaged or newly eroded joints, and the predicted yearly progression rate. Although not every discussant considered these sufficiently important to include in the guidelines, it is acceptable to include these data. Some participants considered the predicted yearly progression rate useful, but it was agreed that it should not be used as a means of testing the effectiveness of therapy. Its main utility is as a baseline disease-severity characteristic for comparing treatment groups in a trial or subsets in the study.
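As a sketch of how such a baseline characteristic might be derived (the linear-accrual assumption and the values are ours, not the roundtable's):

```python
# Hypothetical sketch: estimated yearly progression rate at baseline,
# i.e., baseline damage divided by disease duration. This assumes damage
# accrued linearly before trial entry, and serves only as a baseline
# severity characteristic, not as an efficacy endpoint.

def estimated_yearly_progression(baseline_score, disease_duration_years):
    """Baseline damage score divided by disease duration in years."""
    return baseline_score / disease_duration_years

# E.g., a baseline total score of 12 after 3 years of disease:
rate = estimated_yearly_progression(baseline_score=12.0,
                                    disease_duration_years=3.0)
```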
Statistical testing and missing data.
It was agreed that statistical testing should be based on the distribution of the data; radiographic data are usually non-normally distributed. It is important to define a priori the tests to be used. Another special concern with radiographic analyses is missing data: complete films may be missing, and specific joints on available films may be unassessable. For radiographic analyses this is a more critical issue than for clinical data, because a “last observation carried forward” (LOCF) approach will most probably underestimate progression. Consider the hypothetical situation in which all patients in the placebo arm withdraw before followup radiographs are taken, while all patients in the (truly effective) active arm complete all radiographs. Applying LOCF to the clinical data would still detect the difference, but on the radiographic data the active treatment group would appear to perform worse than placebo, because it is unlikely that no actively treated patient would show any progression. Consequently, it is of major importance to try to obtain all followup radiographs from all randomized patients. Still, some radiographs will be unobtainable, and therefore a wide variety of sensitivity analyses should be performed to indicate the robustness of the data. For joints with no followup data, this can be accomplished by imputing data from other (adjoining) joints in the same patient, or by assigning the highest score of all joints in that patient. For radiographs that are completely missing, scores from patients within the comparator treatment arm may be substituted, typically the worst scores in the placebo or active control group for those missing in the study treatment group. At present there is no uniform way to handle missing data. Several examples have been described by Sharp et al and utilized in recent randomized controlled trials (13). Unfortunately, the handling of missing data remains a major unresolved issue.
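A minimal numeric sketch (entirely hypothetical scores, and a deliberately simplistic LOCF rule) of why carrying the last observation forward understates radiographic progression for dropouts:

```python
# Why LOCF underestimates radiographic progression: a patient whose
# follow-up film is missing contributes a change of 0 when the baseline
# film is carried forward, even if real progression occurred.

def locf_change(baseline, followup):
    """Change scores with LOCF: a missing follow-up (None) is replaced
    by the baseline value, so the imputed change is 0."""
    return [(f - b) if f is not None else 0.0
            for b, f in zip(baseline, followup)]

baseline = [2.0, 5.0, 1.0, 8.0]
# Two patients withdrew before the follow-up film (None = missing).
followup = [6.0, None, 4.0, None]

changes = locf_change(baseline, followup)
mean_change = sum(changes) / len(changes)  # dropouts contribute 0
```

Here the completers progressed by 4 and 3 points, yet the LOCF group mean is diluted to 1.75, which is why sensitivity analyses with other imputation rules are recommended.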
Comparison of results across trials.
Following the above guidelines will ensure that the same minimum amount of information is available for all clinical trials that include radiographic analyses. This is a basic prerequisite for comparing data across studies and treatments. Nonetheless, it will remain difficult to compare results directly across trials, and caution should be exercised when attempting to do so. Trials differ in study design, patient population, baseline damage (different parts of the scale of the scoring method), disease severity (only partly captured by detailing prognostic factors), observers, scoring method, and the sequence in which films are read. These differences make it hazardous to compare absolute progression numbers from one clinical trial to another.
The second aim of the roundtable discussion was to select appropriate and specific wording to be used when reporting the results of radiographic analyses. The following words (in alphabetical order) were selected from recent publications describing radiographic results: arrest, delay, halt, inhibit, prevent, reduce, retard, slow, and stop. The general opinion of the roundtable participants was to “not worry about words, present the data”: the data themselves best convey the results, and the descriptive wording is less important. Nevertheless, it was agreed that 2 groups of words might be applicable. The first group, “reduce, retard, slow, and delay,” should be used as a general description of results at the group level in a trial; these words can be used for any statistically significant result comparing one intervention with another at the group level. The word “delay” was judged somewhat different from “reduce, retard, and slow,” because it implies that radiographic progression may still occur in the future; its use was therefore less preferred. It was concluded that the words “arrest, halt, inhibit, prevent, and stop” should be reserved for descriptive information at the individual patient level, and that such a result would be unlikely to be observed in all patients. Consequently, it is important to report the percentage of patients achieving a given degree of absence of progression, for example: “progression was arrested in X% of patients, defined as progression ≤ cut-off level Y.” Use of these words in the appropriate setting, along with the recommendations above, would allow more uniform descriptions of radiographic results without imposing unacceptable restrictions. A summary is presented in Table 2.
Table 2. Recommended wording for describing radiographic results

|Words|Use|
|---|---|
|Reduce, retard, slow, delay|Group level: any statistically significant result compared with the comparator|
|Arrest, halt, inhibit, prevent, stop|Patient level: describes the percentage of patients with progression ≤ cut-off level Y|
Guidelines have been presented to improve uniform reporting of radiographic data in clinical trials. Their use should facilitate comparison of results across trials and therapies. In addition, a list of words to be utilized when describing radiographic results is provided in hopes that their consistent use will allow better comprehension of published results. Hopefully, these guidelines will aid investigators in the design, analysis, and reporting of radiographic data.
- 13. Leflunomide Rheumatoid Arthritis Investigators Group. Treatment with leflunomide slows radiographic progression of rheumatoid arthritis: results from three randomized controlled trials of leflunomide in patients with active rheumatoid arthritis. Arthritis Rheum 2000; 43: 495–505.