In assessing the disease-modifying potential of drugs used in the treatment of ankylosing spondylitis (AS), the demonstration of a reduction or termination of structural damage is essential. Structural damage in AS can be measured on radiographs of the spine and hips. A number of radiographic scoring methods are available for this purpose: the Bath Ankylosing Spondylitis Radiology Index (BASRI) (1), the Stoke Ankylosing Spondylitis Spine Score (SASSS) (2), and a modification of the SASSS (M-SASSS) (3). The BASRI exists in 2 forms: BASRI-spine and BASRI-total. The former excludes, and the latter includes, the hips. The BASRI and the SASSS have been published in peer-reviewed journals, and the M-SASSS has been published in thesis form only. All methods have been validated by their developers.
It would be desirable for one of these methods to be selected as the radiographic outcome assessment of choice for clinical trials, in order to ensure uniformity and allow data to be compared across trials in the future. The Assessments in Ankylosing Spondylitis Working Group is attempting to standardize the measurements in AS (4), and the selection of a method to assess radiographic progression is one of the important issues on its research agenda.
The validity of radiographic scoring methods in AS has rarely been investigated. Spoorenberg et al (5, 6) initiated method comparisons with a maximum followup of 2 years, primarily related to assessing damage scores. In these studies, some aspects of reliability (intra- and interobserver reliability of status scores) were established for all 3 methods. Moreover, agreement between 2 observers on progression in individual patients was assessed, but only with a strict definition of “agreement.” In clinical trials, however, the subject of interest is the change in radiographic damage, primarily at the group level, and not the absolute level of damage itself. In addition, according to the Outcome Measures in Rheumatology Clinical Trials (OMERACT), the discrimination (sensitivity to change), truth (construct validity), and feasibility of scoring methods should be investigated before a preference is expressed (7).
The main objective of the present study was therefore to test the radiographic scoring methods against all 3 aspects of the OMERACT filter over a followup period of 4 years, including an evaluation of the reliability of progression scores.
The M-SASSS seems to be the most appropriate method for scoring radiologic progression in AS patients. This conclusion is based on the following aspects of the OMERACT filter: truth, discrimination, and feasibility.
With regard to truth, a valid scoring system requires assessments of the cervical and lumbar spine. Inclusion of the SI joints and hips has no additional value for the detection of progression. Neither an AP view of the lumbar spine nor an assessment of the posterior site of the lumbar spine provides sufficient additional information about progression to justify the extra effort. An AP view does, however, provide additional information (and therefore better reflects the truth) if the level of damage rather than the progression of damage is the major concern. Consequently, the SASSS is not recommended because it does not take the cervical spine into account. The BASRI is recommended because of its AP view if the level of radiographic damage is the matter of interest, but this AP view does not supply valuable additional information if progression must be scored.
With regard to discrimination, the M-SASSS demonstrated superior interobserver reliability. In terms of sensitivity to change, this method classifies a higher proportion of patients as having progression than does the BASRI. It also appeared that the BASRI, in contrast with the M-SASSS, might be subject to a ceiling effect. With regard to feasibility, the BASRI takes less time for scoring and training but entails the highest radiation exposure for the patient. On the aspect of feasibility, therefore, no method is clearly preferred.
When the results of our comparison of the different scoring methods against the OMERACT filter are surveyed, the M-SASSS seems to be preferable for the evaluation of radiologic progression in clinical trials and cohort studies. Several studies related to the scoring of radiographs of AS patients have been published, mainly by the developers of the BASRI and the SASSS. It appears that our results are consistent with the results of those studies.
In our study, we could not find support for including the hips in a staging or a progression score. This finding is supported by the data of MacKay et al (1), who persuasively explain why the hips are not included in the BASRI-spine. Because hip disease affects only 18–37% of the AS population, the use of a global score for every AS patient, with a maximum score of 16 rather than 12, may inappropriately dilute the score of the majority of AS patients. Those with severe, or grade 4, spinal disease without hip arthritis would rate only 12 on a 16-point global scale despite having a bamboo spine, poor metrologic values, and poor function. It may be better to grade these populations separately, using the BASRI-spine for one and the BASRI-total for the other. Note that omission of the hips and SI joints from our scoring method does not imply that these joints are unimportant for prognosis in AS. For example, hip involvement emerged as an important predictor of severe disease (14).
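The dilution argument can be made concrete with a short calculation. The following is a minimal sketch that uses only the two maximum scores quoted above (12 and 16); the function name `fraction_of_maximum` is a hypothetical helper introduced for illustration and is not part of any published scoring procedure.

```python
# Maximum scores as quoted in the text: 12 without the hips, 16 with the hips.
SPINE_SCALE_MAX = 12
GLOBAL_SCALE_MAX = 16


def fraction_of_maximum(score: int, scale_max: int) -> float:
    """Return a score as a fraction of the maximum of its scale (hypothetical helper)."""
    if not 0 <= score <= scale_max:
        raise ValueError("score must lie between 0 and the scale maximum")
    return score / scale_max


# A patient with maximal spinal damage (score 12) and no hip involvement.
spine_only = fraction_of_maximum(12, SPINE_SCALE_MAX)
global_scale = fraction_of_maximum(12, GLOBAL_SCALE_MAX)

print(f"Spine-only scale: {spine_only:.0%} of maximum")
print(f"Global scale:     {global_scale:.0%} of maximum")
```

Under these assumptions, the same patient sits at the top of the spine-only scale but at only three-quarters of the global scale, which is the dilution that MacKay et al describe.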
The results reported by MacKay et al (1) are also consistent with our conclusion about the essential inclusion of the cervical spine. They presented data on the involvement of the cervical and lumbar spine, SI joints, and hips in a group of 470 patients (15). More than 80% of the patients showed involvement of the cervical or lumbar spine or both (43%), and 8% of the patients showed changes only in the cervical, but not the lumbar, spine. As for which view is needed for scoring the lumbar spine, our conclusion on the status scores is again supported by MacKay et al (1). They judged 58 sets of AP and lateral views of the lumbar spine, and scores for the AP view, the lateral view, and a combination score (the highest score of the 2 views) were obtained. The combination score differed from the AP or lateral score if syndesmophytes or fusion was seen at different levels on each projection. This occurred in 3 of the 58 patients. The combination score differed from the AP score alone in 9 of the 58 patients (15.5%) and from the lateral score alone in 21 patients (36%). Overall, the use of 2 projections changed the score in 46% of the cases. Assuming that the combination view provides the most truthful assessment, the sensitivity of the AP view alone is 0.83 and that of the lateral view alone is 0.73. For the aim of staging, both views are therefore necessary. Unfortunately, MacKay et al did not investigate whether both views are also necessary for assessing progression.
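The percentages quoted for the individual views follow directly from the counts given above. The sketch below recomputes only those two proportions from the 58 sets of radiographs; the helper `percent` is a hypothetical name, and the calculation does not attempt to reconstruct the overall 46% figure or the sensitivity values, whose derivation is not given here.

```python
# Counts taken from the text: 58 sets of AP and lateral lumbar views.
N_SETS = 58
DIFFERS_FROM_AP = 9        # combination score differed from the AP score alone
DIFFERS_FROM_LATERAL = 21  # combination score differed from the lateral score alone


def percent(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to one decimal (hypothetical helper)."""
    return round(100 * count / total, 1)


pct_ap = percent(DIFFERS_FROM_AP, N_SETS)
pct_lateral = percent(DIFFERS_FROM_LATERAL, N_SETS)

print(f"Combination differs from AP view alone:      {pct_ap}%")
print(f"Combination differs from lateral view alone: {pct_lateral}%")
```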
The thoracic spine is not included in any of the scoring methods. This is due to technical problems related to the anatomy of the chest with superimposed lung tissue. Another structure of the spine that has not been mentioned is the facet joints. In lateral views of the lumbar spine, these joints are difficult to assess with any degree of confidence even by an experienced musculoskeletal radiologist (2). On an AP view, these joints can be assessed. This is an advantage of the BASRI scoring method. All other methods ignore the posterior structures of the spine, classifying those who have only posterior element fusion as normal or as having mild radiographic changes, when in fact the spine may be completely fused (1). In Table 4, measurements of spinal mobility were compared with radiologic scores. We found a good correlation, and this relationship was as good for the BASRI-spine as for the other methods.
An important disadvantage of the BASRI in comparison with the SASSS methods is that it does not pick up minor radiologic change. The score does not change with each additional erosion or sclerosis, and will remain at grade 2 until there is fusion between 2 vertebrae or ≥3 syndesmophytes are identified. The developers of the BASRI and the SASSS evaluated the reliability and sensitivity to change of their methods. Inter- and intraobserver reliability of the BASRI was assessed on status scores (1) and was found to be good. After a period of 1 year, no change was observed. Over a 2-year period, the mean BASRI-spine value increased from 7.0 to 7.9 (in 40 patients), which was statistically significant. The radiographs in that study were blinded for chronology, confirming that the BASRI could determine “forward progression” (i.e., identify the earlier of 2 radiographs performed on the same individual). We found a progression from 6.5 to 6.9 over a 2-year time interval, even though our radiographs were read in chronological order, which often amplifies progression scores. The difference might be explained by the fact that the patient population in Bath, UK, differs from the population in the OASIS cohort with respect to disease severity.
The developers of the SASSS (16) also investigated the reliability of their method. They showed a good interobserver reliability but, unexpectedly, poorer intraobserver reliability. Sensitivity to change was assessed in 28 patients over a 12-month time interval, and the radiographs were read in known order. The SASSS increased by 4.1 (from 14.4 to 18.5), which was statistically significant. This increase is considerable in comparison with our results; after 4 years of observation, we observed a progression of only 3.5 points.
The cohort used in this study has been studied before, as mentioned above (5, 6). In contrast with the previous study, we observed a change after 1 year, but the order in which the radiographs were scored was known, while in Spoorenberg et al the order was unknown. This can markedly influence the results, as has been shown for rheumatoid arthritis (17). Moreover, in the Spoorenberg study, the average of the 2 observers' progression scores was used to determine whether a patient was classified as having progression, and the criteria for defining progression were much stricter than those applied in the present study. This was especially a disadvantage for the M-SASSS. With the 4-year data available, we observed that the minor changes after 1 and 2 years indeed forecast further progression after 4 years, which adds to the validity of these minor changes.
The differing results on progression across studies can also be explained by differences in the composition of the patient populations. There are 2 competing views on the course of radiographic and functional progression of AS during the first 10 years after disease onset. While 2 groups (18, 19) have reported that the most rapid progression occurred in this period, another group (20) recently reported that, in their patient population, radiographic progression was linear, with no significant changes between the decades.
This study may evoke some concerns. First, the conclusions of this study are based on findings in the OASIS cohort. Although this cohort represents the entire spectrum of AS patients, which adds to the external validity of the observations, the conclusions still need to be confirmed by other independent investigators examining a different cohort. Second, we did not investigate whether any of the measures are subject to spectrum bias, i.e., whether they perform differently in patients with early versus late AS. The group was too small to make subgroups for such an analysis. Third, most of the analyses in this study are based on the scores of one reader. Although interobserver reliability appeared to be satisfactory, future studies should include more readers in order to limit biases due to single observers.
None of the studies describing the measurement of radiologic change in AS patients provides data on the reliability of progression scores, which is important in clinical trials. We would therefore like to emphasize that future studies should pay attention to the reliability of these scores. As can be seen from our results, the reliability of progression scores can add important information to the reliability of status scores. In our study, change could be assessed reliably by the M-SASSS.
In summary, comparing the BASRI, the SASSS, and the M-SASSS with respect to their use in clinical trials, we have shown that the M-SASSS offers advantages in measurement properties. However, the BASRI is a feasible and user-friendly method that reliably detects damage in patients with AS, and can be used for that purpose in clinical practice.