An index of the three core data set patient questionnaire measures distinguishes efficacy of active treatment from that of placebo as effectively as the American College of Rheumatology 20% response criteria (ACR20) or the Disease Activity Score (DAS) in a rheumatoid arthritis clinical trial
To evaluate the capacity of a pooled index of only the 3 patient self-report questionnaire measures among the 7 American College of Rheumatology (ACR) core data set (Core Data Set) measures to distinguish efficacy of active treatment of rheumatoid arthritis (RA) with leflunomide or methotrexate versus placebo in a randomized, controlled clinical trial, and to compare the results with those obtained using the ACR 20% response criteria (ACR20), Disease Activity Score (DAS), and other pooled indices.
The 7 ACR Core Data Set measures of 1) joint swelling, 2) joint tenderness, 3) physician global assessment, 4) erythrocyte sedimentation rate (ESR), 5) functional disability, 6) pain, and 7) patient global assessment were combined into the following 5 pooled indices: “All Core Data Set” (all 7 measures), “Assessor Only” (measures 1–3), “Assessor + ESR” (measures 1–4), “Patient Only” (measures 5–7), and “Patient + ESR” (measures 4–7). The capacity of each of these 5 indices to detect differences between active treatment and placebo treatment was compared with that of the ACR20 and the DAS using 4 different analytic methods, each with its own advantages and limitations. Agreement of the indices with one another and with the ACR20 and the DAS was assessed with pairwise kappa statistics, and the capacity of each to discriminate active treatment from placebo was assessed with Z scores from multivariate logistic regression models.
Each of the 5 indices, including “Patient Only,” had a similar capacity to detect greater efficacy of leflunomide and methotrexate versus placebo in this clinical trial, according to each of 4 methods, at similar levels of statistical and clinical significance.
A pooled index of patient self-report questionnaire Core Data Set measures appears to be as informative as ACR20 responses, DAS scores, and pooled indices of all and assessor-derived Core Data Set measures for distinguishing between active treatment and placebo treatment in this RA clinical trial.
The American College of Rheumatology (ACR) core data set of 7 disease activity measures for rheumatoid arthritis (RA) (hereinafter referred to as the “Core Data Set”), used to assess outcomes in clinical trials of treatments for RA (1), and the Disease Activity Score (DAS) (2) have provided important quantitative advances in clinical research. The ACR/European League Against Rheumatism Core Data Set includes 3 assessor-derived measures (tender joint count, swollen joint count, and physician global assessment); 1 laboratory test (erythrocyte sedimentation rate [ESR] or C-reactive protein [CRP] level); and 3 patient self-report questionnaire measures (functional disability, pain, and patient global assessment). The Core Data Set also includes a radiograph for studies of 1 year or longer. Improvement of at least 20% in both tender and swollen joint counts and in at least 3 of the 5 additional measures is designated the ACR preliminary definition of improvement and is known as the ACR 20% response criteria (ACR20) (3). Higher thresholds for improvement have also been described, such as the ACR50 and the ACR70 (4). The DAS includes a tender joint count, swollen joint count, ESR or CRP level, and patient assessment of general health. Disease activity at baseline and followup is classified as low, moderate, or high, and responses to therapy as good, moderate, or none (2).
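The ACR20 rule described above can be written as a short function. The sketch below is illustrative only: the measure names are hypothetical, every measure is assumed to be scored so that lower values mean less disease activity, and a zero baseline is arbitrarily treated as no improvement. Changing `threshold` to 50 or 70 gives the corresponding ACR50 or ACR70 check.

```python
from typing import Dict

# Hypothetical keys for the 7 Core Data Set measures. For every measure, a
# lower score means less disease activity, so a decrease is an improvement.
JOINT_COUNTS = ("tender_joints", "swollen_joints")
OTHER_MEASURES = ("physician_global", "esr", "disability", "pain", "patient_global")


def percent_improvement(baseline: float, followup: float) -> float:
    """Percent improvement from baseline (positive = better)."""
    if baseline == 0:
        return 0.0  # assumption: no improvement is scored from a zero baseline
    return 100.0 * (baseline - followup) / baseline


def acr_response(baseline: Dict[str, float], followup: Dict[str, float],
                 threshold: float = 20.0) -> bool:
    """ACR20-style response: >= threshold % improvement in both joint counts
    and in at least 3 of the 5 remaining Core Data Set measures."""
    def improved(m: str) -> bool:
        return percent_improvement(baseline[m], followup[m]) >= threshold

    joints_ok = all(improved(m) for m in JOINT_COUNTS)
    others_ok = sum(improved(m) for m in OTHER_MEASURES) >= 3
    return joints_ok and others_ok


# Hypothetical patient: both joint counts improve 33-40%, and 3 of the
# other 5 measures (physician global, pain, patient global) improve >= 20%.
baseline = {"tender_joints": 20, "swollen_joints": 15, "physician_global": 6,
            "esr": 40, "disability": 1.5, "pain": 7, "patient_global": 6}
followup = {"tender_joints": 12, "swollen_joints": 10, "physician_global": 4,
            "esr": 38, "disability": 1.4, "pain": 5, "patient_global": 4}
print(acr_response(baseline, followup))        # True  (ACR20)
print(acr_response(baseline, followup, 50.0))  # False (ACR50)
```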
A randomized, controlled clinical trial in patients with active RA had indicated significantly greater efficacy of leflunomide or methotrexate versus placebo according to ACR20 and DAS responses over 1 or 2 years (5, 6). In patients who received active treatment, each of the 7 Core Data Set measures (which include the 4 DAS measures) improved. In patients who received placebo, mean scores for joint tenderness, joint swelling, physician global assessment, and patient pain improved, whereas mean ESR, functional disability, and patient global scores did not. The relative efficiencies of individual patient questionnaire measures were greater than those of physician-derived measures (5, 6). These observations suggested that a subset composite pooled index (7) that included only the 3 patient self-report questionnaire measures from the Core Data Set might perform as well as the ACR20, the DAS, and other subset indices from the Core Data Set in discriminating between active and placebo treatment in a clinical trial.
MATERIALS AND METHODS
US301 clinical trial.
A randomized, controlled clinical trial, US301, indicated greater efficacy of leflunomide or methotrexate versus placebo over 1 or 2 years according to the ACR20 and DAS; details of its design, conduct, and results have been reported previously (5, 6). The ACR20 and DAS were computed from last-observation-carried-forward values at 1 year relative to baseline.
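The last-observation-carried-forward convention used for the 1-year endpoint can be sketched as follows. The visit schedule and values are hypothetical; the function simply returns the most recent non-missing value, so a patient who withdraws early contributes the last score observed before withdrawal.

```python
from typing import Optional, Sequence


def last_observation_carried_forward(visits: Sequence[Optional[float]]) -> Optional[float]:
    """Return the last non-missing value in a chronologically ordered series."""
    last = None
    for value in visits:
        if value is not None:
            last = value  # carry the most recent observed value forward
    return last


# Hypothetical tender joint counts at baseline and 4 later visits; the patient
# withdrew after the third visit, so the 1-year endpoint uses that value.
tender = [30, 25, 18, None, None]
endpoint = last_observation_carried_forward(tender)
print(endpoint)                                    # 18
print(100.0 * (tender[0] - endpoint) / tender[0])  # 40.0 (percent improvement)
```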
New pooled indices.
Indices are mathematical combinations of multiple individual measures. The 7 ACR Core Data Set measures of 1) joint swelling, 2) joint tenderness, 3) physician global assessment, 4) erythrocyte sedimentation rate (ESR), 5) functional disability, 6) pain, and 7) patient global assessment were combined into the following 5 pooled indices: “All Core Data Set” (all 7 measures), “Assessor Only” (measures 1–3), “Assessor + ESR” (measures 1–4), “Patient Only” (measures 5–7), and “Patient + ESR” (measures 4–7).
Methods to assess new indices.
Treatment efficacy, based on the last observed values over 1 year relative to baseline, was calculated for each of the 5 indices using 4 analytic methods, each of which has advantages and limitations:
1. The “percent rescaled average” method rescales each Core Data Set measure from 0 to 100, averages the 3, 4, or 7 rescaled measures included in an index, and computes the percent change in this average from baseline to 1 year. Its advantages are that it is simple and intuitively understandable, weights changes in each rescaled measure equally, and provides a continuous scale that can be classified into favorable or unfavorable categories (although only 20% improvement is depicted as a favorable response in this report). Its major limitation is that a large change in one measure may distort the average; for example, if one rescaled measure fell from 80 to 10 while the other 6 remained unchanged at 45, the average for the “All Core Data Set” index (measures 1–7) would fall from 50 to 40, a 20% improvement.
2. The “average percent” method computes each of the 5 indices as the average of the percent changes from baseline in its 3, 4, or 7 included measures, without rescaling any measure. Percent worsening was capped at 100% (i.e., changes were restricted to no less than −100%) to limit the potentially excessive influence of very large percent worsening from small nonzero baselines. This method shares the advantages of the “percent rescaled average” method, with the possible additional advantage that no rescaling is needed; it likewise provides a continuous scale that can be classified into favorable or unfavorable categories, although only 20% improvement is depicted as a favorable response in this report. Its major limitation, as with the “percent rescaled average” method, is that the results may be unduly influenced by the percent change in a single measure among 3 or more, particularly one with a small baseline value. Moreover, differences in percent change in a given measure may not optimally reflect differences in clinical status. For example, a change in a modified Health Assessment Questionnaire (modified HAQ) (8) functional disability score from 1.4 to 1.1 would be a 21% improvement, whereas a change from 0.4 to 0.1 would be a 75% improvement; these would contribute very differently to a composite average, although both represent the same 0.3-unit absolute change and might be more similar clinically than the percentages imply.
3. The “categories” method is also based on the percent changes in the 3, 4, or 7 included Core Data Set measures, as in the “average percent” method, but the percent changes are transformed into 5 ordered categories: “−1” for more than 20% worsening from baseline, “0” for no more than 20% worsening or less than 20% improvement, “+1” for at least 20% but less than 50% improvement, “+2” for at least 50% but less than 70% improvement, and “+3” for at least 70% improvement. This classification of averages into categories is similar in spirit to the ACR20 and the DAS. Its principal advantage is lower sensitivity to a potentially excessive influence of large changes in only one or two measures. An important limitation, however, is that mean values for change are defined only in arbitrary categories rather than on an underlying continuous scale.
4. The “majority” method defines favorable responses according to 20% improvement in 2 of 3, 3 of 4, or 6 of 7 measures, in a manner similar to the ACR20. A favorable response for the “Assessor Only” or “Patient Only” index requires 20% improvement in at least 2 of its 3 components; for “Assessor + ESR” or “Patient + ESR,” 20% improvement in at least 2 of the 3 assessor or patient components and in the ESR; and for “All Core Data Set,” a favorable response on both the “Assessor Only” and “Patient Only” indices. The same specifications may be used to define at least 50% or 70% improvement from baseline for the composite indices. The principal advantages of this method are that it involves no averaging or rescaling of any measure and that it is the most similar to the ACR20. Its major limitation is that it converts continuous variables into only dichotomous or ordered categorical versions.
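The 4 methods can be sketched for a single patient as below. This is a minimal illustration under stated assumptions, not the analysis code used for US301: the 0–100 rescaling ranges in `RANGES` are hypothetical, the “categories” method is assumed to average per-measure categories (no threshold for a favorable mean category is applied here), and the “majority” rules follow the component-by-component specifications given above. For methods 1 and 2, a favorable response in this report corresponds to a value of at least 20.

```python
from typing import Dict, List

Measures = Dict[str, float]

# Hypothetical (min, max) ranges for 0-100 rescaling; the trial's scales may differ.
RANGES = {"swollen": (0, 28), "tender": (0, 28), "physician_global": (0, 10),
          "esr": (0, 150), "disability": (0, 3), "pain": (0, 10),
          "patient_global": (0, 10)}

ASSESSOR = ["swollen", "tender", "physician_global"]
PATIENT = ["disability", "pain", "patient_global"]
INDICES: Dict[str, List[str]] = {
    "Assessor Only": ASSESSOR,
    "Assessor + ESR": ASSESSOR + ["esr"],
    "Patient Only": PATIENT,
    "Patient + ESR": ["esr"] + PATIENT,
    "All Core Data Set": ASSESSOR + ["esr"] + PATIENT,
}


def pct_improvement(b: float, f: float) -> float:
    """Percent improvement (positive = better; higher scores are worse)."""
    return 0.0 if b == 0 else 100.0 * (b - f) / b


def rescale(m: str, x: float) -> float:
    lo, hi = RANGES[m]
    return 100.0 * (x - lo) / (hi - lo)


def percent_rescaled_average(base: Measures, fup: Measures, ms: List[str]) -> float:
    """Method 1: percent change in the mean of the 0-100 rescaled measures."""
    b = sum(rescale(m, base[m]) for m in ms) / len(ms)
    f = sum(rescale(m, fup[m]) for m in ms) / len(ms)
    return pct_improvement(b, f)


def average_percent(base: Measures, fup: Measures, ms: List[str]) -> float:
    """Method 2: mean of per-measure percent changes, worsening capped at -100%."""
    return sum(max(pct_improvement(base[m], fup[m]), -100.0) for m in ms) / len(ms)


def category(pct: float) -> int:
    """Method 3 coding of one percent improvement into -1, 0, +1, +2, or +3."""
    if pct < -20:
        return -1
    if pct < 20:
        return 0
    if pct < 50:
        return 1
    if pct < 70:
        return 2
    return 3


def mean_category(base: Measures, fup: Measures, ms: List[str]) -> float:
    """Method 3 (assumed aggregation): average of the per-measure categories."""
    return sum(category(pct_improvement(base[m], fup[m])) for m in ms) / len(ms)


def majority(base: Measures, fup: Measures, index: str, threshold: float = 20.0) -> bool:
    """Method 4: favorable response using the component rules from the text."""
    def ok(m: str) -> bool:
        return pct_improvement(base[m], fup[m]) >= threshold

    assessor = sum(ok(m) for m in ASSESSOR) >= 2
    patient = sum(ok(m) for m in PATIENT) >= 2
    rules = {"Assessor Only": assessor, "Patient Only": patient,
             "Assessor + ESR": assessor and ok("esr"),
             "Patient + ESR": patient and ok("esr"),
             "All Core Data Set": assessor and patient}
    return rules[index]


# Worked example from method 1 (values chosen so rescaled scores are 80 -> 10 and 45).
base = {"swollen": 12.6, "tender": 12.6, "physician_global": 8.0, "esr": 67.5,
        "disability": 1.35, "pain": 4.5, "patient_global": 4.5}
fup = dict(base, physician_global=1.0)
all_cds = INDICES["All Core Data Set"]
print(round(percent_rescaled_average(base, fup, all_cds), 6))  # 20.0
# Modified HAQ example from method 2: same 0.3-unit change, very different percents.
print(round(pct_improvement(1.4, 1.1), 1), round(pct_improvement(0.4, 0.1), 1))  # 21.4 75.0
```

In this example a single large change produces a 20% “response” under method 1, whereas the categorical and majority methods are far less affected by it.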
Analyses of subset indices.
Four types of analyses were used to compare results for the 5 composite indices, the ACR20, and the DAS. The first involved descriptive comparisons of the rates of 20% response, with their standard errors, for each index according to each of the 4 analytic methods and for the ACR20 and the DAS, computed for each of the 3 treatment groups. The second involved evaluation of agreement of the 5 composite indices with one another, with the ACR20 (3), and with the dichotomous DAS classification of “good or moderate” response (2), using pairwise kappa statistics (9). Kappa values near 0 correspond to chance agreement, and values near 1 correspond to almost perfect agreement.
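Pairwise kappa for two dichotomous response classifications compares observed agreement with the agreement expected by chance from the 2 marginal response rates. A minimal sketch of Cohen's kappa, with hypothetical responder flags for 8 patients:

```python
from typing import Sequence


def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two dichotomous classifications of the same patients."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    # Agreement expected by chance from the two marginal response rates.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical responder flags under the ACR20 and a "Patient Only" index.
acr20 = [True, True, True, False, False, False, True, False]
patient_only = [True, True, False, False, False, False, True, True]
print(cohens_kappa(acr20, patient_only))  # 0.5
```

Here the two classifications agree on 6 of 8 patients (75%), but 50% agreement is expected by chance, so kappa is 0.5.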
The third type of analysis used multivariate logistic regression models fitted through generalized estimating equations (9) to assess the sensitivity of each of the 5 indices from the 4 analytic methods, as well as that of the ACR20 and the DAS, for detecting differences between active treatments and placebo. The main results from each model were Z scores, whose relative magnitudes indicate how much discriminant validity each measure has for active drug versus placebo. Odds ratios with 95% confidence intervals were computed to describe the similarity of the indices in sensitivity to differences in efficacy among leflunomide, methotrexate, and placebo.
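For a single active-versus-placebo comparison without covariates, a logistic regression Wald Z reduces to the log odds ratio divided by its standard error, sqrt(1/a + 1/b + 1/c + 1/d), from the 2 × 2 table. The sketch below uses that simplification rather than the generalized estimating equation models described above, with responder counts back-calculated from the rounded Table 1 percentages (an assumption). For the ACR20, it gives Z values of about 4.36 for leflunomide and 3.32 for methotrexate, which agree with the ACR20 entries in Table 3 to 2 decimals.

```python
import math
from typing import Tuple


def wald_z_odds_ratio(resp_active: int, n_active: int, resp_placebo: int,
                      n_placebo: int) -> Tuple[float, float, Tuple[float, float]]:
    """Wald Z, odds ratio, and 95% CI for active vs placebo from a 2x2 table.

    Equivalent to an unadjusted logistic regression with one binary treatment
    covariate: the coefficient is the log odds ratio and its standard error is
    sqrt(1/a + 1/b + 1/c + 1/d). All four cells must be nonzero.
    """
    a, b = resp_active, n_active - resp_active
    c, d = resp_placebo, n_placebo - resp_placebo
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ci = (math.exp(log_or - 1.96 * se), math.exp(log_or + 1.96 * se))
    return log_or / se, math.exp(log_or), ci


# Counts back-calculated (an assumption) from the ACR20 rows of Table 1:
# 93/178 leflunomide, 82/180 methotrexate, 31/118 placebo.
z_lef, or_lef, ci_lef = wald_z_odds_ratio(93, 178, 31, 118)
z_mtx, or_mtx, ci_mtx = wald_z_odds_ratio(82, 180, 31, 118)
print(round(z_lef, 2), round(z_mtx, 2))  # 4.36 3.32
```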
The fourth approach analyzed the extent to which each of the dichotomous composite indices from the 4 methods accounted for the sensitivity of the ACR20 and the DAS to treatment differences. This was done using logistic regression models with a single response variable, the ACR20 or the DAS, and with the corresponding composite index and treatment as explanatory variables. Nonsignificant P values for treatment in these models (e.g., P > 0.10) indicate the extent to which the composite index accounts for differences among the 3 treatments in the ACR20 (or the DAS).
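The logic of the fourth analysis can be illustrated with a small logistic regression fitted by Newton-Raphson. This is a sketch on synthetic data (not trial data), with a single active arm instead of 3 treatment groups and a hand-written fitter standing in for standard statistical software. The data are constructed so that, within each stratum of the composite index, the ACR20 rate is the same in both arms; the treatment coefficient is then clearly nonzero without the index (Z of about 2.6) and essentially zero once the index is included.

```python
import math
from typing import List, Tuple


def _solve(a: List[List[float]], rhs: List[float]) -> List[float]:
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    m = [row[:] + [v] for row, v in zip(a, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def _score_info(x, y, beta):
    """Score vector X'(y - p) and information matrix X'WX at beta."""
    k = len(beta)
    score = [0.0] * k
    info = [[0.0] * k for _ in range(k)]
    for row, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, row))))
        for i in range(k):
            score[i] += (yi - p) * row[i]
            for j in range(k):
                info[i][j] += p * (1.0 - p) * row[i] * row[j]
    return score, info


def fit_logistic(x: List[List[float]], y: List[float],
                 iters: int = 30) -> Tuple[List[float], List[float]]:
    """Newton-Raphson maximum likelihood; returns coefficients and Wald Z scores."""
    beta = [0.0] * len(x[0])
    for _ in range(iters):
        score, info = _score_info(x, y, beta)
        beta = [b + s for b, s in zip(beta, _solve(info, score))]
    _, info = _score_info(x, y, beta)
    k = len(beta)
    # Diagonal of the inverse information matrix = coefficient variances.
    var = [_solve(info, [1.0 if i == j else 0.0 for i in range(k)])[j]
           for j in range(k)]
    return beta, [b / math.sqrt(v) for b, v in zip(beta, var)]


# Synthetic patients (not trial data): (treated, index response, ACR20 response, count).
# In each index stratum the ACR20 rate is identical in the two arms (70% or 10%).
cells = [(0, 1, 1, 21), (0, 1, 0, 9), (0, 0, 1, 7), (0, 0, 0, 63),
         (1, 1, 1, 42), (1, 1, 0, 18), (1, 0, 1, 4), (1, 0, 0, 36)]
rows = [c for c in cells for _ in range(c[3])]
y = [float(r[2]) for r in rows]

# Treatment alone, then treatment adjusted for the composite index.
_, z_unadj = fit_logistic([[1.0, float(r[0])] for r in rows], y)
beta_adj, z_adj = fit_logistic([[1.0, float(r[0]), float(r[1])] for r in rows], y)
print(f"unadjusted treatment Z: {z_unadj[1]:.2f}")
print(f"treatment coefficient with index: {beta_adj[1]:.6f}")
```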
The mean ± SEM percentages of patients with 20% favorable responses for each of the 5 composite indices, the ACR20, and the DAS, using the 4 analytic methods, ranged from 52% to 74% for leflunomide, from 45% to 72% for methotrexate, and from 18% to 43% for placebo (Table 1). The ACR20 tended to show somewhat smaller proportions of favorable responses with active treatment or placebo than the DAS or the composite indices, as well as somewhat smaller differences between active treatment and placebo. Conversely, a somewhat higher proportion of favorable responses was seen for the “Assessor Only” index, reflecting findings for individual physician/assessor-derived measures (5, 6). Each of the 5 composite indices had similar proportions of favorable responses using the 4 analytic methods to assess each of the 3 treatments, except for the stringent “majority” method for the “All Core Data Set” index, for which the proportions were somewhat smaller. These descriptive results suggest that the composite indices provide similar results for treatment comparisons.
Table 1. Estimated mean ± SEM percentages of patients with 20% improvement in a trial of leflunomide or methotrexate versus placebo, using 4 analytic methods for 5 composite indices, the ACR20, and the DAS*
| Index | Treatment | n | Percent rescaled average | Average percent | Categories | Majority |
|---|---|---|---|---|---|---|
| “Patient Only” | LEF | 178 | 62.9 ± 3.6 | 62.4 ± 3.6 | 58.4 ± 3.7 | 64.0 ± 3.6 |
| | Placebo | 118 | 32.2 ± 4.3 | 33.1 ± 4.3 | 27.1 ± 4.1 | 33.1 ± 4.3 |
| | MTX | 179 | 58.1 ± 3.7 | 50.8 ± 3.7 | 50.3 ± 3.7 | 56.4 ± 3.7 |
| “Patient + ESR” | LEF | 178 | 61.8 ± 3.7 | 57.3 ± 3.7 | 54.5 ± 3.7 | 55.1 ± 3.7 |
| | Placebo | 118 | 28.8 ± 4.2 | 26.3 ± 4.1 | 17.8 ± 3.5 | 23.7 ± 3.9 |
| | MTX | 179 | 57.5 ± 3.7 | 50.3 ± 3.7 | 44.7 ± 3.7 | 50.3 ± 3.7 |
| “Assessor Only” | LEF | 179 | 72.1 ± 3.4 | 72.6 ± 3.3 | 68.7 ± 3.5 | 73.7 ± 3.3 |
| | Placebo | 118 | 42.4 ± 4.6 | 42.4 ± 4.6 | 39.8 ± 4.5 | 43.2 ± 4.6 |
| | MTX | 180 | 72.2 ± 3.3 | 71.1 ± 3.4 | 68.3 ± 3.5 | 69.4 ± 3.4 |
| “Assessor + ESR” | LEF | 179 | 69.3 ± 3.5 | 67.6 ± 3.5 | 65.9 ± 3.6 | 63.1 ± 3.6 |
| | Placebo | 118 | 39.8 ± 4.5 | 38.1 ± 4.5 | 34.7 ± 4.4 | 32.2 ± 4.3 |
| | MTX | 180 | 72.2 ± 3.3 | 66.1 ± 3.5 | 65.6 ± 3.6 | 62.2 ± 3.6 |
| “All Core Data Set” | LEF | 178 | 70.8 ± 3.4 | 68.0 ± 3.5 | 62.9 ± 3.6 | 56.7 ± 3.7 |
| | Placebo | 118 | 34.7 ± 4.4 | 32.2 ± 4.3 | 28.8 ± 4.2 | 28.8 ± 4.2 |
| | MTX | 179 | 64.8 ± 3.6 | 63.1 ± 3.6 | 55.3 ± 3.7 | 49.2 ± 3.7 |
| ACR20 | LEF | 178 | | | | 52.2 ± 3.8 |
| | Placebo | 118 | | | | 26.3 ± 4.1 |
| | MTX | 180 | | | | 45.6 ± 3.7 |
| DAS | LEF | 182 | | 58.8 ± 3.7 | | |
| | Placebo | 118 | | 29.7 ± 4.2 | | |
| | MTX | 180 | | 58.9 ± 3.7 | | |
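The SEMs in Table 1 are those of a 0/1 favorable-response indicator. The sketch below assumes that the SEM uses the sample standard deviation (n − 1 divisor), so that it equals 100 × sqrt(p(1 − p)/(n − 1)) in percent. With responder counts back-calculated from the tabulated percentages (also an assumption), the two entries checked below round to the values shown in Table 1.

```python
import math
import statistics
from typing import Sequence, Tuple


def percent_mean_sem(responses: Sequence[int]) -> Tuple[float, float]:
    """Mean and SEM (in percent) of a 0/1 favorable-response indicator.

    Uses the sample standard deviation (n - 1 divisor), so the SEM equals
    100 * sqrt(p * (1 - p) / (n - 1)).
    """
    mean = statistics.fmean(responses)
    sem = statistics.stdev(responses) / math.sqrt(len(responses))
    return 100.0 * mean, 100.0 * sem


# Responder counts back-calculated (an assumption) from Table 1 percentages.
for label, responders, n in [("Patient Only, LEF", 112, 178),
                             ("Assessor Only, placebo", 50, 118)]:
    mean, sem = percent_mean_sem([1] * responders + [0] * (n - responders))
    print(f"{label}: {mean:.1f} ± {sem:.1f}")
# Patient Only, LEF: 62.9 ± 3.6
# Assessor Only, placebo: 42.4 ± 4.6
```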
The agreement of the ACR20, the DAS, and each of the 5 dichotomous composite indices using the 4 analytic methods was described with pairwise kappa statistics, which ranged from 0.57 to 0.80 (Table 2), indicating good agreement. As might be expected, kappa statistics for agreement with the ACR20 and the DAS tended to be somewhat higher for the “All Core Data Set” index than for the indices that included only 3 assessor- or 3 patient-derived measures or the 4-measure indices that add the ESR. Furthermore, the assessor-derived indices tended to have somewhat higher kappa statistics than the patient indices for agreement with the DAS, reflecting the fact that the DAS includes primarily assessor-derived measures. Nonetheless, these pairwise kappa statistics, as well as pairwise kappa statistics among the 5 composite indices according to each of the 4 analytic methods (data not shown), indicated good-to-excellent agreement among the 5 indices.
Table 2. Kappa statistics for agreement of composite indices from 4 analytic methods with ACR20 and DAS*
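| Index | Percent rescaled average, ACR20 | Percent rescaled average, DAS | Average percent, ACR20 | Average percent, DAS | Categories, ACR20 | Categories, DAS | Majority, ACR20 | Majority, DAS |
|---|---|---|---|---|---|---|---|---|
| “Patient + ESR” | 0.656 | 0.671 | 0.634 | 0.647 | 0.625 | 0.619 | 0.645 | 0.618 |
| “Assessor + ESR” | 0.594 | 0.742 | 0.640 | 0.759 | 0.679 | 0.776 | 0.735 | 0.735 |
| “All Core Data Set” | 0.679 | 0.779 | 0.689 | 0.784 | 0.748 | 0.798 | 0.770 | 0.698 |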
Multivariate logistic regression models comparing active and placebo treatments for the ACR20, the DAS, and the dichotomous composite indices (Table 3) yielded Z scores for the indices similar to those for the ACR20 and the DAS. Odds ratios for the indices were also similar to the corresponding odds ratios for the ACR20 and the DAS, and their confidence intervals generally overlapped (data not shown). In the models of the fourth analysis, each composite index, entered as a mediating variable, accounted for the otherwise clearly significant (P < 0.001) treatment differences in the ACR20 and the DAS. In these ways, each of the composite indices, including the “Patient Only” index, showed sensitivity similar to that of the ACR20 for detecting differences between leflunomide, methotrexate, and placebo in the US301 clinical trial.
Table 3. Z scores for comparisons between leflunomide and placebo and between methotrexate and placebo for 20% response according to composite indices from 4 analytic methods, the ACR20, and the dichotomous DAS*
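| Index | Percent rescaled average, LEF | Percent rescaled average, MTX | Average percent, LEF | Average percent, MTX | Categories, LEF | Categories, MTX | Majority, LEF | Majority, MTX |
|---|---|---|---|---|---|---|---|---|
| “Patient + ESR” | 5.43 | 4.77 | 5.13 | 4.06 | 6.03 | 4.64 | 5.20 | 4.48 |
| “Assessor + ESR” | 4.94 | 5.45 | 4.92 | 4.68 | 5.17 | 5.12 | 5.12 | 4.98 |
| “All Core Data Set” | 5.96 | 4.99 | 5.89 | 5.12 | 5.60 | 4.42 | 4.64 | 3.45 |
| ACR20 | | | | | | | 4.36 | 3.32 |
| DAS | | | 4.84 | 4.85 | | | | |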
These analyses indicate that 5 pooled indices derived from Core Data Set measures, including an index of “Patient Only” measures, showed greater efficacy of leflunomide or methotrexate versus placebo to an extent similar to that shown by the ACR20 or the DAS. These results are not surprising, since the comparisons involved measures that are included in the ACR20 or the DAS. However, rheumatologists tend to regard joint counts as more valuable than patient questionnaire measures (10), and in that sense the findings are unexpected, even though the relative efficiencies of the patient questionnaire measures were greater than those of the assessor measures among the 7 Core Data Set measures in this trial (5, 6).
All 4 analytic methods performed similarly in discriminating results of active versus placebo treatment, although each method presents advantages and disadvantages, as discussed in Materials and Methods. These findings provide confidence that any of these methods may be used to analyze results of the US301 clinical trial. The most desirable method may emerge from further studies of data sets from other clinical trials.
In addition to the specific limitations of each analytic method, this study has several limitations. A primary limitation, which must be emphasized, is that the data are derived from a single clinical trial; analyses of additional clinical trials are required to determine whether the results are generalizable. In addition, it is not clear how effectively an index that does not include joint counts would perform prospectively in assessing responses to therapy in clinical trials or routine clinical care. Furthermore, construct validity was analyzed in the development of the ACR20 (3) and the DAS (2) but was not examined here, since the indices were tested only for discriminant validity.
Our findings nonetheless raise the possibility that results of drug therapy in RA clinical trials, observational studies, and routine clinical care might be assessed effectively using only patient questionnaire data. Data from patient questionnaires correlate significantly with data from traditional joint counts, radiographs, and laboratory tests (11), and they explain other clinical information better than any other measure in RA (11). Patient questionnaire data are as effective as any other available data for documenting declines in functional status in RA and for predicting work disability, costs, and premature mortality (12, 13).
The 3 “Patient Only” Core Data Set measures are found on the HAQ (14), the clinical HAQ (15), and the multidimensional HAQ (16), which can be completed by a patient in less than 10 minutes in a waiting room. The 3 measures appear on one side of a single page of the multidimensional HAQ, which can be scored in less than 30 seconds. The introduction of a simple patient questionnaire into the infrastructure of rheumatology care (12) could allow assessment of all patients according to the same simple index, which might be as effective as the current Core Data Set for characterizing quantitatively the status of patients with RA and their responses to any therapy in clinical trials and routine care.
We thank Arthur F. Kavanaugh, MD, Karen M. Simpson, MD, and Ann K. Thompson, BSN, for helpful discussions concerning this research.