A comparison of response criteria to evaluate therapeutic response in patients with juvenile idiopathic arthritis treated with methotrexate and/or anti–tumor necrosis factor α agents

Abstract

Objective

There are no validated criteria to evaluate clinical response in juvenile idiopathic arthritis (JIA). The purpose of this study was to compare 4 sets of criteria (2 from the American College of Rheumatology [ACR] and 2 from the European League Against Rheumatism [EULAR]) for clinical response evaluation in JIA patients treated with methotrexate and/or anti–tumor necrosis factor α drugs.

Methods

Seventy-five patients with JIA were evaluated at baseline and after 6 months of therapy with second-line drugs. Mean age at study onset was 12.8 years (range 2–32.9 years). Diagnoses were systemic JIA (n = 16), rheumatoid factor–positive JIA (n = 5), rheumatoid factor–negative JIA (n = 9), persistent oligoarticular JIA (n = 10), extended oligoarticular JIA (n = 33), and psoriatic arthritis (n = 2). Clinical response was evaluated with the ACR Pediatric 30 criteria and the ACR 20% response criteria (ACR20), and with the EULAR Disease Activity Score (DAS) and 28-joint DAS (DAS28). Patients with EULAR criteria responses of “good” or “moderate” were classified as responders. Responders and nonresponders according to the different criteria were then compared.

Results

For patients younger than 16 years, Cohen's kappa varied between 0.51 and 0.72, with a good-to-excellent reproducibility index for all comparisons, except for the DAS28/ACR20 comparison. The best agreement was obtained by comparing the DAS and the ACR Pediatric 30. For patients older than 16 years, the reproducibility index was good or excellent in only 2 cases, i.e., comparing the DAS and the ACR Pediatric 30 and comparing the DAS and the DAS28 (as expected).

Conclusion

Our study shows a good agreement overall for the different criteria tested. The highest concordance was observed between the DAS and the ACR Pediatric 30, the lowest between the DAS28 and the ACR20. Our data suggest that the ACR Pediatric 30 criteria can also be used in adult patients affected by JIA, and that the original DAS can be an alternative to the ACR Pediatric 30 in both children and young adults with JIA.

There is no single “gold standard” quantitative measure to assess and monitor the clinical status of patients with juvenile idiopathic arthritis (JIA). The variability of the disease course and the marked phenotypic heterogeneity among the different disease types grouped under the umbrella term of JIA are both obstacles to the creation of a valid and uniform definition of clinical response. Indeed, the recent availability of effective therapeutic drugs such as methotrexate (MTX) (1), and more recently biologic agents (2, 3), has reinforced the need for valid, reproducible, and uniform criteria with which to define whether an individual patient has responded to a particular treatment during a drug trial, as well as in routine clinical practice.

In adult rheumatoid arthritis (RA), this problem was addressed more than a decade ago, when during the Outcome Measures in Rheumatology Clinical Trials conference, a core set of efficacy end points was first defined and subsequently developed by the European League Against Rheumatism (EULAR) (4–6). Both the American College of Rheumatology (ACR) and EULAR have worked intensively on this topic, producing criteria that are now used by rheumatologists, i.e., the ACR core set of disease activity measures (7), the ACR 20% response criteria (ACR20) (8), and the EULAR Disease Activity Score (DAS) (9). The DAS in 28 joints (DAS28) has been created as a further modification of the original EULAR criteria (10, 11).

The 2 sets of criteria (ACR and EULAR) are not totally interchangeable; indeed, there are substantial differences between them. First, they have been developed for different purposes (to distinguish between responses to active treatment and placebo [ACR criteria] and between high and low disease activity [EULAR criteria]). Second, a different number of core set variables are included (7 in the ACR criteria and 3 or 4 in the EULAR criteria), and third, the definition and classification of improvement are different. In fact, the ACR criteria define improvement only on the basis of a relative variation, while the EULAR criteria include the absolute variation as well as the level achieved. Finally, while the ACR criteria include only 2 outcome categories (responder/nonresponder), the EULAR response can be classified as good, moderate, or absent. Although these criteria have been designed with different methods, they perform similarly in identifying responders (9–15).

A preliminary core set of response variables in pediatric arthritis was defined, and a preliminary definition of improvement statistically verified, by Giannini et al (16). This definition, termed the ACR Pediatric 30, is similar to the ACR criteria for response in adult RA, and includes improvement of at least 30% from baseline in 3 of any 6 predefined variables (number of active joints, number of joints with limited motion, physician's assessment of disease activity, parent's assessment of well being, a validated measure of physical function, and a laboratory measure of inflammation). Moreover, to fulfill this definition of improvement, no more than 1 of the remaining variables can worsen by >30%. To our knowledge, however, this definition has not yet been prospectively validated. We therefore performed a study comparing 4 sets of response criteria (2 ACR and 2 EULAR) in JIA patients treated with MTX and/or anti–tumor necrosis factor α (anti-TNFα) drugs.

PATIENTS AND METHODS

Patients.

Seventy-five patients (14 male, 61 female) with JIA who were patients of the Gaetano Institute in Milan during 2001 were evaluated at baseline and after 6 months of therapy with MTX or anti-TNFα. The patients treated with MTX were enrolled from a clinical trial evaluating the efficacy of high-dose MTX in treating JIA. Patients treated with biologic agents were all receiving anti-TNFα drugs at the time this study was performed. In all patients, disease onset occurred before the age of 16 years. Twenty-one patients were young adults at the time of study entry, and 54 were considered to be pediatric patients (age <16 years).

Forty-one patients had been treated with MTX, at 10 mg/m²/week, for a period of 16–24 weeks (mean 20 weeks). Seventeen patients had received intravenous infliximab (3 mg/kg, up to 10 mg/kg in more severe cases, every 4–8 weeks as per protocol [3]) and 17 others had received subcutaneous etanercept (0.4–1 mg/kg twice per week). Diagnoses were as follows: systemic-onset JIA (16 patients), rheumatoid factor (RF)–positive polyarticular JIA (5 patients), RF-negative polyarticular JIA (9 patients), oligoarticular JIA (10 patients [all from the MTX-treated group]), extended oligoarticular JIA (33 patients), and psoriatic arthritis (2 patients). In the 2 age groups (<16 years and >16 years), there were no significant differences with regard to JIA subsets. The mean age at study entry was 12.8 years (range 2–32.9 years).

Methods.

Clinical response was evaluated by the following criteria: ACR20, ACR Pediatric 30, DAS, and DAS28, as originally described (12, 16). Apart from the ACR Pediatric 30, the criteria sets used in the current study have been validated only in adults with RA and were never designed or validated for use in children.

For each patient, a complete and detailed physical examination was performed at baseline and at the end of the study period. The variables recorded (necessary to calculate the ACR and EULAR criteria) were as follows: tender joint count, swollen joint count in 44 and 28 joints, limited joint count, Ritchie Articular Index (RAI) (range 0–78) (17), erythrocyte sedimentation rate (ESR; Westergren), pain evaluation by 100-mm visual analog scale (VAS) as reported by patient or parent/guardian, patient's and physician's global disease activity score by the same 100-mm VAS, and physical disability index of the Childhood Health Assessment Questionnaire (18). For each patient, the ACR20, the ACR Pediatric 30, the DAS, and the DAS28 were then calculated. On the basis of the results from the observation period, patients were classified as responders or nonresponders. Patients with a moderate or good response according to the EULAR criteria (DAS) were considered responders.
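The dichotomization of the 3-level EULAR response into responders and nonresponders can be sketched as follows. This is a minimal illustration, not the authors' procedure: the text does not state the EULAR cut-offs, so the improvement thresholds (1.2 and 0.6) and the attained-score thresholds (3.2 and 5.1) used here are the commonly cited DAS28 values and are an assumption. The function names and the `low`/`high` parameters are hypothetical; the original DAS uses different attained-score cut-offs, which could be passed through those parameters.

```python
def eular_response(baseline, followup, low=3.2, high=5.1):
    """Classify a EULAR response as 'good', 'moderate', or 'none'.

    baseline, followup: disease activity scores before and after treatment.
    low, high: attained-score cut-offs (assumed DAS28 values, see lead-in).
    """
    improvement = baseline - followup
    if improvement <= 0.6:
        return "none"
    if improvement > 1.2:
        # Large improvement counts as 'good' only if low activity is attained.
        return "good" if followup <= low else "moderate"
    # Improvement between 0.6 and 1.2: 'moderate' unless activity stays high.
    return "moderate" if followup <= high else "none"


def is_responder(baseline, followup):
    """Responder = 'good' or 'moderate' EULAR response, as in the text."""
    return eular_response(baseline, followup) in ("good", "moderate")
```

For example, a fall from 5.0 to 3.0 is classified as a good response, while a fall from 6.5 to 5.6 is classified as no response because disease activity remains high.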

Improvement criteria definitions.

For the ACR20, improvement was defined as a 20% improvement in tender and swollen joint counts and 20% improvement in 3 of the 5 remaining ACR core set measures (8). For the ACR Pediatric 30, improvement was defined as ≥30% improvement from baseline in 3 of any 6 variables in the core set, with no more than 1 of the remaining variables worsening by >30% (16). The DAS28 was determined as follows:

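\[ \mathrm{DAS28} = 0.56\sqrt{\mathrm{TJC}} + 0.28\sqrt{\mathrm{SJC}} + 0.70\,\ln(\mathrm{ESR}) + 0.014\,\mathrm{GH} \]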

where TJC = number of tender joints (28-joint count), SJC = number of swollen joints (28-joint count), and GH = general health. The DAS was determined as follows:

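\[ \mathrm{DAS} = 0.54\sqrt{\mathrm{RAI}} + 0.065\,\mathrm{SJC}_{44} + 0.33\,\ln(\mathrm{ESR}) + 0.0072\,\mathrm{GH} \]

where RAI = Ritchie Articular Index and SJC44 = number of swollen joints (44-joint count).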
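The improvement definitions and the two scores described in this section can be sketched in code. This is a hedged illustration rather than the authors' implementation. The coefficients are those of the commonly published DAS28 and DAS formulas and are an assumption here. All function names are hypothetical, as is the handling of a zero baseline value, which the text does not address. Every core set variable is assumed to be scored so that a lower value is better.

```python
import math


def das28(tjc28, sjc28, esr, gh):
    """DAS28 from 28-joint tender/swollen counts, ESR (mm/hour, > 0), and GH (0-100 VAS)."""
    return (0.56 * math.sqrt(tjc28) + 0.28 * math.sqrt(sjc28)
            + 0.70 * math.log(esr) + 0.014 * gh)


def das(rai, sjc44, esr, gh):
    """Original DAS from the Ritchie Articular Index, 44-joint swollen count, ESR, and GH."""
    return (0.54 * math.sqrt(rai) + 0.065 * sjc44
            + 0.33 * math.log(esr) + 0.0072 * gh)


def acr_pedi30_responder(baseline, followup):
    """ACR Pediatric 30: >=30% improvement in at least 3 of the 6 core set
    variables, with no more than 1 variable worsening by >30%.

    baseline, followup: sequences of the 6 core set values (lower is better).
    """
    improved = 0
    worsened = 0
    for b, f in zip(baseline, followup):
        if b == 0:
            # Relative change is undefined; assumption: any increase counts as worsening.
            if f > 0:
                worsened += 1
            continue
        change = (b - f) / b  # positive = improvement
        if change >= 0.30:
            improved += 1
        elif change < -0.30:
            worsened += 1
    return improved >= 3 and worsened <= 1
```

With the hypothetical baseline values `[10, 8, 60, 50, 1.5, 40]` and followup values `[5, 4, 30, 45, 1.4, 38]`, three variables improve by 50% and none worsens, so the patient is classified as a responder.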

Statistical analysis.

A common approach to evaluating agreement and reproducibility is to compare instruments intended to diagnose the same disorder. Our study was undertaken to examine how closely the ACR and EULAR criteria, which have 2 different operational definitions of the same theoretical concept, are related. Data (responders and nonresponders with the different criteria) were entered into 2 × 2 contingency tables. Response criteria were subsequently compared with one another for concordance by means of kappa statistics, and Cohen's kappa coefficient was calculated. Cohen's kappa is a valid measure of agreement between the evaluations of 2 observers (19–21). A value of 1 indicates perfect agreement, while a value of 0 indicates that agreement is no better than chance. Kappa is only valid for tables in which both variables use the same category values and both variables have the same number of categories; it expresses the proportionate reduction in error generated by a classification process, compared with the error of a completely random classification. It can be calculated with the formula κ = (Po − Pc)/(1 − Pc), where κ = coefficient of agreement, Po = observed proportion of concordant judgments, and Pc = expected proportion of randomly concordant judgments. Moreover, the Fleiss (22) and the Landis and Koch (23) indices, which are used to define the degree of agreement (Fleiss) and of reproducibility (Landis and Koch), were also assessed. Statistical analysis was performed with SPSS statistical package 11.01 for Windows (SPSS, Chicago, IL).
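As a worked illustration of the formula κ = (Po − Pc)/(1 − Pc), the following sketch computes Cohen's kappa from a 2 × 2 responder/nonresponder table. The function name, the cell labels `a` to `d`, and the counts in the example are hypothetical and are not taken from the study data.

```python
def cohen_kappa(a, b, c, d):
    """Cohen's kappa for a 2 x 2 table of two response criteria.

    a: responder by both criteria
    b: responder by criterion 1 only
    c: responder by criterion 2 only
    d: nonresponder by both criteria
    """
    n = a + b + c + d
    po = (a + d) / n  # observed proportion of concordant judgments
    # Expected chance concordance from the row and column totals.
    pc = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    # Undefined (division by zero) if pc == 1, i.e. all patients fall in one cell.
    return (po - pc) / (1 - pc)


# Hypothetical counts: Po = 0.70, Pc = 0.50, so kappa = 0.40.
print(round(cohen_kappa(20, 5, 10, 15), 2))
```

In this example the two criteria agree in 70% of patients, but half of that agreement would be expected by chance, so kappa is only 0.40.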

RESULTS

Overall, responder rates in the total population studied were as follows: 79% for the DAS and DAS28, 69% for the ACR20, and 80% for the ACR Pediatric 30. Among patients treated with biologic drugs, the responder rates were 79% for the DAS and DAS28, 64% for the ACR20, and 85% for the ACR Pediatric 30, while among patients treated with MTX, the responder rates were 78% for the DAS, 80% for the DAS28, 82% for the ACR20, and 78% for the ACR Pediatric 30. Chi-square testing showed no significant differences in responder rates between patients treated with MTX and patients treated with biologic agents. Concordance rates were evaluated both for the total group and for patients by age (i.e., >16 years and <16 years).

As seen in Table 1, the best agreement in the total group and in the pediatric group was obtained by comparing DAS and ACR Pediatric 30; in the young adults, the best agreement was found when comparing the DAS and the DAS28. In the total group, the Fleiss index (Table 2) was considered good to excellent in all cases except in the comparison of the DAS28 with the ACR20, where it was marginal. In younger children it was generally good, while in young adults it was inapplicable except in 2 comparisons (DAS with ACR Pediatric 30, and DAS with DAS28).

Table 1. Cohen's kappa values for all comparisons in the study groups*

| Comparison pair | All patients | Age <16 years | Age >16 years |
|---|---|---|---|
| DAS/ACR Ped 30 | 0.71 ± 0.1 | 0.72 ± 0.1 | 0.69 ± 0.2 |
| DAS28/DAS | 0.68 ± 0.1 | 0.65 ± 0.1 | 0.73 ± 0.1 |
| DAS28/ACR Ped 30 | 0.55 ± 0.1 | 0.61 ± 0.1 | 0.39 ± 0.2 |
| DAS/ACR20 | 0.53 ± 0.1 | 0.61 ± 0.1 | 0.21 ± 0.3 |
| ACR20/ACR Ped 30 | 0.53 ± 0.1 | 0.56 ± 0.1 | 0.33 ± 0.3 |
| DAS28/ACR20 | 0.38 ± 0.1 | 0.51 ± 0.1 | † |

* Values are the mean ± SEM. DAS = Disease Activity Score; ACR Ped 30 = American College of Rheumatology Pediatric 30; DAS28 = DAS in 28 joints; ACR20 = ACR 20% response criteria.
† Comparison invalid, because P > 0.05.

Table 2. Fleiss agreement index and Landis and Koch reproducibility index results for the total population*

| Comparison pair | Fleiss agreement index | Landis and Koch reproducibility index |
|---|---|---|
| DAS/ACR Ped 30 | Good/excellent | Substantial |
| DAS28/DAS | Good/excellent | Substantial |
| DAS28/ACR Ped 30 | Good | Moderate |
| DAS/ACR20 | Good | Moderate |
| ACR20/ACR Ped 30 | Good | Moderate |
| DAS28/ACR20 | Marginal/good | Slight |

* See Table 1 for definitions.

As shown in the tables, we replicated and confirmed these results by using other methods of statistical analysis, namely, the Landis and Koch reproducibility index (Table 2), the Cramér Φ (Table 3), the Somers' Δ (Table 4), and the Kruskal-Stuart τb (Table 5) (24–27). All of these methods showed comparable results and confirmed a good correlation between the DAS and both the ACR Pediatric 30 and the DAS28.

Table 3. Cramér Φ values for all comparisons in the study groups*

| Comparison pair | All patients | Age <16 years | Age >16 years |
|---|---|---|---|
| DAS/ACR Ped 30 | 0.75 | 0.71 | 0.73 |
| DAS28/DAS | 0.73 | 0.615 | 0.73 |
| DAS28/ACR Ped 30 | 0.39 | 0.38 | 0.41 |
| DAS/ACR20 | 0.34 | | |
| ACR20/ACR Ped 30 | 0.31 | | |
| DAS28/ACR20 | 0.34 | | |

* See Table 1 for definitions.
† Value not computable, because P > 0.05 (blank cells).

Table 4. Somers' Δ values for all comparisons in the study groups*

| Comparison pair | All patients | Age <16 years | Age >16 years |
|---|---|---|---|
| DAS/ACR Ped 30 | 0.75 ± 0.1 | 0.69 ± 0.1 | 0.72 ± 0.1 |
| DAS28/DAS | 0.73 ± 0.1 | 0.61 ± 0.1 | 0.73 ± 0.1 |
| DAS28/ACR Ped 30 | 0.39 ± 0.1 | 0.41 ± 0.1 | |
| DAS/ACR20 | 0.35 ± 0.1 | | |
| ACR20/ACR Ped 30 | 0.30 ± 0.1 | | |
| DAS28/ACR20 | 0.33 ± 0.1 | | |

* Values are the mean ± SEM. See Table 1 for definitions.
† Value not computable, because P > 0.05 (blank cells).

Table 5. Kruskal-Stuart τb values for all comparisons in the study groups*

| Comparison pair | All patients | Age <16 years | Age >16 years |
|---|---|---|---|
| DAS/ACR Ped 30 | 0.75 ± 0.1 | 0.610 ± 0.1 | 0.73 ± 0.1 |
| DAS28/DAS | 0.73 ± 0.09 | 0.615 ± 0.15 | 0.73 ± 0.15 |
| DAS28/ACR Ped 30 | 0.40 ± 0.1 | 0.387 ± 0.18 | |
| DAS/ACR20 | 0.34 ± 0.1 | | |
| ACR20/ACR Ped 30 | | | |
| DAS28/ACR20 | 0.34 ± 0.1 | | |

* Values are the mean ± SEM. See Table 1 for definitions.
† Value not computable, because P > 0.05 (blank cells).

Using the ACR Pediatric 30 as the gold standard, the best performance in the total population was obtained with the DAS (71% concordance), followed by the DAS28 (55% concordance), and by the ACR20 (53% concordance) (Figure 1). With regard to the ACR Pediatric 30, the DAS28 showed a sensitivity of 0.9 and a specificity of 0.66, the DAS showed a sensitivity of 0.93 and a specificity of 0.8, and the ACR20 showed a sensitivity of 0.81 and a specificity of 0.84. Receiver operating characteristic curves were then constructed, with the mean area under the curve being 0.702 for the DAS28, 0.735 for the DAS, and 0.562 for the ACR20. P values were <0.05 for the DAS28 and the DAS, but not for the ACR20 (Figure 2).
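Sensitivity and specificity against a reference criterion follow directly from the same kind of 2 × 2 table. The sketch below is illustrative only: the function name is hypothetical, and the counts are invented for a cohort of 75 patients; they are not the study's data.

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """Sensitivity and specificity of a test criterion against a reference.

    tp: responder by both the test and the reference criterion
    fn: responder by the reference only
    fp: responder by the test only
    tn: nonresponder by both
    """
    sensitivity = tp / (tp + fn)  # reference responders detected by the test
    specificity = tn / (tn + fp)  # reference nonresponders confirmed by the test
    return sensitivity, specificity


# Hypothetical counts: 60 reference responders, 15 reference nonresponders.
sens, spec = sensitivity_specificity(tp=56, fn=4, fp=3, tn=12)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Because the reference criterion is itself a preliminary definition rather than a true gold standard, such figures describe relative agreement, not diagnostic accuracy in an absolute sense.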

Figure 1.

Concordance between the American College of Rheumatology (ACR) Pediatric 30 and the other criteria, corrected by kappa statistics. Values are the mean ± SEM percent. DAS28 = Disease Activity Score in 28 joints; ACR20 = ACR 20% response criteria.

Figure 2.

Receiver operating characteristic curves for the other criteria against the ACR Pediatric 30 (ACR Ped 30). For the DAS versus the ACR Pediatric 30 and the DAS28 versus the ACR Pediatric 30, P < 0.05 versus the null hypothesis that the area under the curve (AUC) = 0.5; for the ACR20 versus the ACR Pediatric 30, P > 0.05 versus the null hypothesis. AUC values shown are the mean ± SEM. See Figure 1 for other definitions.

DISCUSSION

The course of JIA is highly variable, ranging from a mild, self-limiting form to a very aggressive form. Although response criteria have never been formally studied in JIA, several reports on responsiveness of outcome measures have been published (28–30).

Until the time of this study, no individual marker for disease activity had shown satisfactory specificity and sensitivity (31); thus, an index of disease activity combining several variables was needed. The criteria used for adult patients may not be applicable to children, since JIA is a different disease than RA; for example, some variables in the ACR core set have different scores in children than in adults, and the measurement of some of these variables can be affected by age-related factors (e.g., pain assessment). Despite the substantial improvement obtained with the preliminary definition criteria (ACR Pediatric 30), the evaluation of clinical response has yet to be standardized in JIA, and it is not known how the existing criteria used for RA would perform in a pediatric population. The accuracy of the preliminary definition for improvement (16) in the assessment of response to MTX treatment was retrospectively evaluated by Ruperto et al (32). In their study, data on 111 patients with JIA were obtained from an open-label, uncontrolled trial that was designed to investigate the efficacy of MTX. Approximately two-thirds of patients treated with low-dose MTX were found to have improved disease response, as assessed using the preliminary definition, a proportion similar to the one expected based on a previous controlled study of low-dose oral MTX. The authors concluded there was preliminary evidence of the validity of the definition of the ACR Pediatric 30.

However, the ACR Pediatric 30 definition criteria have not been tested prospectively against other external standards. To our knowledge, at the time of our study, comparisons among different sets of response criteria in JIA had not been published. We have compared the ACR Pediatric 30 criteria, which have been endorsed by both the ACR and the Food and Drug Administration, with 3 currently used RA response criteria, finding in most comparisons a good or excellent correlation. Since the EULAR criteria (DAS) classify response in 3 categories, we have grouped as “responders” patients who had a moderate or good response according to the DAS. In our cohort, maximal concordance was found when comparing the ACR Pediatric 30 with the original DAS, while the least concordance was found when comparing the DAS28 with the ACR20. We also subgrouped our patients on the basis of age (i.e., >16 years or <16 years) and analyzed the results separately to see if the criteria used for adult RA could be used for young adults with JIA. No significant differences in JIA subsets were found between these 2 groups. Outcomes in adults with JIA have been the subject of recent studies (33).

For statistical comparisons of the different criteria tests we did not use a chi-square test or a chi-square–based Pearson contingency coefficient, because these methods are not adequate for assessing interrater agreement. Chi-square analysis tests the null hypothesis of independence between variables; it therefore detects association, not concordance, and does not provide any information about the direction of concordance. Cohen's kappa coefficient was therefore chosen as the most appropriate test for our analysis, as in some previous studies (34, 35). Cohen's kappa is, in fact, a measure of agreement that compares the observed agreement with the agreement expected by chance if the observer ratings were independent. We also performed a more detailed analysis by calculating the Cramér Φ coefficient, a measure of the degree of association between 2 binary categorical nominal variables (which is similar to the correlation coefficient in its interpretation), and the Somers' Δ and Kruskal-Stuart τb, which differ in that they measure the degree of association between categorical ordinal variables. In general, results with all of these tests (24–27) confirmed a good correlation among existing criteria.

It is worth noting that, as opposed to other criteria, the ACR Pediatric 30 includes as a parameter the number of joints with limited motion. This is important since, in patients with short disease duration, this count can improve significantly through physical therapy, while patients with longstanding disease may have a number of joints with limited motion that cannot improve, despite significant disease activity improvement, due to mechanical deformities not related to the actual presence of inflammation. Moreover, with the ACR Pediatric 30 criteria, a patient can be designated as a responder even if 1 variable has worsened by >30%. The poor concordance between the DAS28 and the other criteria may be partly due to the fact that ankle, foot, and temporomandibular joints are not included in this simplified joint count. Of note, these are joints that are very frequently affected in JIA. We noted a difference in response as assessed with the DAS and the DAS28 even in the same patients; this is obviously due to the difference in joint count, since all other variables are the same.

In conclusion, our study shows that results obtained using the different criteria sets are largely similar, although the statistical testing revealed significant concordance between criteria sets in only 2 comparisons (DAS/DAS28 and DAS/ACR Pediatric 30). If this finding is confirmed in other series, the ACR Pediatric 30 criteria could be used in young adults with JIA, and the original DAS could be considered a valid alternative to the ACR Pediatric 30 for both children and young adults with JIA.
