The Bath Ankylosing Spondylitis Disease Activity Index (BASDAI) (1), the Bath Ankylosing Spondylitis Functional Index (BASFI) (2), and the Dougados Functional Index (DFI) (3) are well-established, widely used instruments for evaluating disease activity and functioning in patients with ankylosing spondylitis (AS). The BASDAI and BASFI are completed on a visual analog scale (VAS) and the DFI on a Likert scale.
The VAS is a commonly used scale consisting of a 100-mm horizontal line anchored at each end by an extreme. It has proven to be a valid and reliable measure of subjective feelings such as pain and function (4–8). Disadvantages of the VAS are that many patients have difficulty completing it and that it can be administered only in written form, which is a limitation for illiterate or visually impaired patients (9–15). Furthermore, there is a risk of measurement error: random errors can occur when measuring the distance to the mark on the line, and systematic errors can occur when VAS questionnaires are reproduced, because photocopying may alter the length of the line.
The Likert scale (or verbal rating scale) consists of several categories, most commonly 5 or 7, with adjectives representing degrees of, for instance, functional ability. Subjects mark the adjective that best describes their impairment. Advantages of the Likert scale are that it is easy to understand, is simple to complete, and can be administered in either written or verbal form (11). Disadvantages are the potential discrepancy between the patient's feelings and the descriptions on the scale, the different interpretations that can be attributed to the adjectives of the scale, and the unequal intervals between the categories (4).
Another type of scale is the numerical rating scale (NRS). The NRS is usually an 11-, 21-, or (rarely) 101-point scale, with numbers in boxes and an extreme anchoring each end. Subjects mark their answer by putting a cross through the appropriate number. The NRS is simple to complete and score, and can be administered in both written and verbal form (11).
Although no major differences in practical use of these answer modalities have been found, the NRS seems to be slightly preferred, since it is easy to complete and appropriate for all groups of patients (10, 11, 16, 17). Furthermore, the presumed high sensitivity of the VAS, attributed to its infinite number of possible answers, has been disproven by Jensen et al, who showed that little information was lost when a 101-point NRS was transformed to an 11- or 21-point NRS (18). The inherent problems of the NRS and the Likert scale, both ordinal scales, do not differ from those of a VAS because, in practice, responses on a VAS form clusters, which limits the actual number of distinct responses (12). Consequently, the VAS does not behave as a true continuous scale.
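The transformation from a 101-point NRS to a coarser scale can be illustrated with a minimal sketch. The recoding rule used by Jensen et al is not given here, so rounding each score to the nearest category is an assumption, and the function name `collapse_nrs101` and the response values are hypothetical.

```python
def collapse_nrs101(score: int, points: int) -> int:
    """Map a 0-100 NRS score to a category on an 11- or 21-point NRS.

    Assumption: each score is rounded to the nearest category (halves round
    up); the recoding rule used in the cited study is not stated in the text.
    """
    if points not in (11, 21):
        raise ValueError("points must be 11 or 21")
    if not 0 <= score <= 100:
        raise ValueError("score must lie between 0 and 100")
    step = 100 // (points - 1)  # 10 units per category for 11 points, 5 for 21
    return (score + step // 2) // step


# Hypothetical responses on a 101-point NRS.
responses = [0, 12, 47, 50, 88, 100]
on_11_point = [collapse_nrs101(s, 11) for s in responses]
on_21_point = [collapse_nrs101(s, 21) for s in responses]
print(on_11_point)
print(on_21_point)
```

Under this rule, nearby scores such as 47 and 50 fall into the same category on the 11-point scale but into adjacent categories on the 21-point scale, which shows the kind of detail that is given up when the scale is coarsened.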
Instruments for research in rheumatology should be valid in all their aspects. To standardize the nomenclature of validity, the Outcome Measures in Rheumatoid Arthritis Clinical Trials (OMERACT) filter has been proposed (19). The 3 domains of the OMERACT filter are truth (validity), discrimination (reproducibility and responsiveness), and feasibility. One of the criteria for feasibility is the appropriateness of the answer scales used in questionnaires. Because some patients may experience difficulties with VAS or Likert scales, and because the NRS is slightly preferred in the literature, we decided to assess the discrimination and feasibility properties of the BASDAI, BASFI, and DFI with an NRS. Our first objective was to study the agreement of scores on the original scales of the BASDAI, BASFI, and DFI with scores on an NRS. Second, the reproducibility and responsiveness of the BASDAI, BASFI, and DFI on the original answer scale and on the NRS were assessed. Attention was paid to both the scores on single items and the total scores of these questionnaires. Single items were examined to provide more insight into the properties of single-item questionnaires, which are sometimes used to assess certain aspects of a disease. To enhance the generalizability of the study results, we decided to investigate all objectives across different languages and cultures, and in clinical trial patients and outpatients with varying disease duration and disease activity.
In this study, 3 commonly used AS-specific questionnaires, administered with both their original answer modality and an NRS, were judged with respect to feasibility (appropriateness of the answer modalities) and discrimination (reproducibility and responsiveness) criteria of the OMERACT filter.
More patients were found to have difficulties in completing a VAS than either the Likert scale or NRS. Eighty-seven percent of the patients preferred to answer on either an NRS or a Likert scale, and only 9% on a VAS. A greater preference for the Likert scale was also described by Kremer et al in 57% of the patients studied (10).
Bland-Altman plots showed wide variation when the same question was answered on different scales (Figure 1). The variation was not solely due to the different answer modalities: the question that was answered twice on the NRS also showed substantial variability. The large differences in scores give the impression that some patients did not fully understand or properly read the anchors of the scales. From the Bland-Altman plots it can be deduced that the variability is random. The variability of the total scores of the questionnaires answered on different answer scales was less pronounced, as can be expected when different answers are aggregated into 1 score, but was still substantial. The ICCs of the total scores were relatively high, implying a high degree of concordance between scores on different answer modalities.
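For readers unfamiliar with the Bland-Altman approach, the following minimal sketch computes the quantities that such a plot displays: the per-patient mean and difference of two scores, the mean difference (bias), and the conventional 95% limits of agreement (bias ± 1.96 SD of the differences). The scores are hypothetical, and the use of 1.96 SD limits is the standard convention rather than a detail taken from this study.

```python
from statistics import mean, stdev


def bland_altman(scores_a, scores_b):
    """Return (bias, lower_limit, upper_limit, points) for paired scores.

    points holds one (mean of pair, difference of pair) tuple per patient,
    i.e. the x and y coordinates of a Bland-Altman plot.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    points = [((a + b) / 2, a - b) for a, b in zip(scores_a, scores_b)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample SD of the paired differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd, points


# Hypothetical answers to one question on a VAS (in cm) and on an 11-point NRS.
vas = [2, 4, 6, 8]
nrs = [1, 5, 5, 9]
bias, lower, upper, points = bland_altman(vas, nrs)
print(f"bias={bias:.2f}, limits of agreement=({lower:.2f}, {upper:.2f})")
```

A bias near zero combined with wide limits of agreement corresponds to the pattern described above: differences that are large for individual patients but show no systematic direction.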
The reproducibility of individual questions and total scores of the BASDAI, BASFI, and DFI was much lower than expected (Table 3). Large differences in all questionnaires were found between the scores obtained from the outpatients in Mexico and the control group of the spa therapy trial in the Netherlands. It is unclear whether age, cultural aspects, or level of education underlie these differences. In the Dutch group, the mean ICCs of both individual and total scores of the BASDAI were lower than the minimum of 0.75; in the Mexican group, none of the ICCs for either individual or total scores of the 3 questionnaires reached the minimum of 0.75. Only the total scores of the BASFI and DFI in the Dutch group showed acceptable ICCs. However, for all questionnaires, no major differences between the ICCs were found with respect to the different answer modalities.
The lack of reproducibility found in individual questions can have major implications for single-item questionnaires. For instance, the 2 questions on pain and patient's global, both single-item questions and selected as specific outcome instruments in research in AS patients, may lack a sufficient degree of responsiveness on an individual patient level, due to low reproducibility (30). This lack of reproducibility merits further investigation.
Garrett et al reported a test–retest reliability with a Pearson's correlation of 0.93 for the BASDAI (1). Pearson's correlations were also published by Calin et al for the BASFI (r = 0.89) and DFI (r = 0.96) (2). Whereas Pearson's correlation is a measure of association, the ICC gives information about the concordance of the results, which is the degree to which the same results are found in the same, stable subjects at repeated measurements. Consequently, Pearson's correlation coefficients tend to be higher than ICCs and are difficult to interpret as measures of reproducibility. The ICC is preferred in calculating reproducibility (28). Dougados et al reported an ICC of 0.86 for the DFI, which is similar to our result in the Dutch group (3).
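The distinction between association and concordance can be made concrete with a minimal sketch. The data are hypothetical, and the ICC form is an assumption: the sketch uses the two-way random-effects, absolute-agreement, single-measurement ICC, often written ICC(2,1), because the ICC model used in the study is not stated in this section.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson's product-moment correlation of two paired samples."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def icc_2_1(rows):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    rows holds one list per subject with that subject's k repeated scores.
    """
    n, k = len(rows), len(rows[0])
    grand = sum(sum(r) for r in rows) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(r[j] for r in rows) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # between occasions
    ss_total = sum((v - grand) ** 2 for r in rows for v in r)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical test and retest scores: every retest score is 2 points higher.
test = [1, 2, 3, 4, 5]
retest = [3, 4, 5, 6, 7]
r = pearson(test, retest)
icc = icc_2_1([list(pair) for pair in zip(test, retest)])
print(f"Pearson r = {r:.2f}, ICC(2,1) = {icc:.2f}")
```

In this constructed example the two measurements are perfectly associated, yet no subject obtains the same score twice, so the ICC is clearly lower than the Pearson coefficient.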
Responsiveness was assessed with 3 different methods. Moderate to large effects were found in the intervention group with all 3 responsiveness methods, independent of the answer modality used. The method described by Guyatt showed higher responsiveness than either the SRM or ES method in all questionnaires, with the exception of the DFI answered on the Likert scale. Ruof et al also found that the Guyatt method showed higher responsiveness scores in their comparative study of the BASFI and DFI, in which they also stated that the BASFI was more responsive than the DFI (31). The results of our study do not confirm the latter; the superiority of one of these questionnaires with respect to responsiveness appeared to depend on the responsiveness method applied.
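The 3 responsiveness statistics differ only in the standard deviation used as the denominator, as the following minimal sketch shows. It assumes the commonly used definitions: ES as mean change divided by the SD of baseline scores, SRM as mean change divided by the SD of the change scores, and the Guyatt statistic as mean change in the intervention group divided by the SD of change in stable (control) subjects. All scores are hypothetical, and the exact formulas applied in the study are not given in this section.

```python
from statistics import mean, stdev


def changes(baseline, follow_up):
    """Per-patient change scores (follow-up minus baseline)."""
    return [f - b for b, f in zip(baseline, follow_up)]


def effect_size(baseline, follow_up):
    """ES: mean change divided by the SD of the baseline scores."""
    return mean(changes(baseline, follow_up)) / stdev(baseline)


def standardized_response_mean(baseline, follow_up):
    """SRM: mean change divided by the SD of the change scores."""
    delta = changes(baseline, follow_up)
    return mean(delta) / stdev(delta)


def guyatt(baseline, follow_up, control_baseline, control_follow_up):
    """Guyatt: mean change in treated patients over SD of change in controls."""
    control_delta = changes(control_baseline, control_follow_up)
    return mean(changes(baseline, follow_up)) / stdev(control_delta)


# Hypothetical scores (lower is better), intervention and control groups.
base, post = [6, 5, 7, 4], [4, 4, 4, 2]
ctrl_base, ctrl_post = [5, 5, 5, 5], [6, 4, 6, 4]
es = effect_size(base, post)
srm = standardized_response_mean(base, post)
gy = guyatt(base, post, ctrl_base, ctrl_post)
print(f"ES = {es:.2f}, SRM = {srm:.2f}, Guyatt = {gy:.2f}")
```

Because the same mean change is divided by 3 different standard deviations, the ranking of two questionnaires can change with the statistic chosen, which is consistent with the method dependence noted above.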
Bolton and Wilkinson showed in their study that the responsiveness of measures was higher when using the NRS compared with VAS and Likert, although NRS and VAS were closely related (17). In the present study, minor differences between the scales were found, with only the BASDAI showing consistently higher scores on the VAS compared with the NRS. In general, the different scales seem to have reasonably similar properties with respect to responsiveness.
In conclusion, although major variability was found in individual questions answered on the original answer scales of the BASDAI, BASFI, and DFI compared with the NRS, total scores showed a high level of agreement. These results were found for all 3 countries, and for both clinical trial patients and outpatients. The variability between the scales may be explained entirely by the low reproducibility of individual questions found in each of these questionnaires. All 3 questionnaires showed good responsiveness on both the original scales and the NRS. The original answer modalities of the BASDAI, BASFI, and DFI can all be replaced by an NRS that maintains the properties of the original scales.