Dr. Shadick has received research grants from Crescendo Bioscience, MedImmune, Abbott, Amgen, and Genentech. Ms Iannaccone has received salary support from Biogen Idec, MedImmune, and Crescendo Bioscience.
Dr. Kvien has received consultant fees, speaking fees, and/or honoraria (less than $10,000 each) from Abbott, AstraZeneca, MSD/Schering-Plough, NicOx, Roche, UCB, and BMS, and (more than $10,000) from Pfizer/Wyeth, and has received research grants from Abbott, BMS, MSD/Schering-Plough, Pfizer/Wyeth, Roche, and UCB.
Michael E. Weinblatt,
Brigham and Women's Hospital, Boston, Massachusetts
Dr. Solomon has research contracts (more than $10,000 each) with Amgen and Lilly, has received research grants from Abbott, has run a course supported by a grant from Bristol-Myers Squibb, and serves as a consultant to the Consortium of Rheumatology Researchers of North America.
Matrix-based risk models have been proposed as a tool to predict rapid radiographic progression (RRP) in rheumatoid arthritis (RA), but the experience with such models is limited. We tested the performance of 3 risk models for RRP in an observational cohort.
Subjects from an observational RA cohort with hand radiographs and necessary predictor variables to be classified by the risk models were identified (n = 478). RRP was defined as a yearly change in the Sharp/van der Heijde score of ≥5 units. Patients were placed in the appropriate matrix categories, with a corresponding predicted risk of RRP. The mean predicted probability for cases and noncases, integrated discrimination improvement, Hosmer-Lemeshow statistics, and C statistics were calculated.
The median age was 59 years (interquartile range [IQR] 50–66 years), the median disease duration was 12 years (IQR 4–23 years), the median swollen joint count was 6 (IQR 2–13), 84% were women, and 86% had erosions at baseline. Twelve percent of patients (32 of 271) treated with synthetic disease-modifying antirheumatic drugs (DMARDs) at baseline and 10% of patients (21 of 207) treated with biologic DMARDs experienced RRP. Most of the predictor variables had a skewed distribution in the population. All models performed suboptimally when applied to this cohort, with C statistics of 0.59 (model A), 0.65 (model B), and 0.57 (model C), and Hosmer-Lemeshow chi-square P values of 0.06 (model A), 0.005 (model B), and 0.05 (model C).
Matrix risk models developed in clinical trials of patients with early RA had limited ability to predict RRP in this observational cohort of RA patients.
Rheumatoid arthritis (RA) is a chronic disease that can cause severe joint damage and disability. During recent decades, the number of therapeutic agents and the knowledge about treatment strategies for RA have increased substantially (1). This has left clinicians with more treatment choices, but also in need of tools to identify which patients to treat aggressively with more effective, but expensive, medications that carry potentially serious adverse events.
Risk model matrices have been proposed as clinical tools to identify RA patients at high risk of rapid radiographic progression (RRP) (2–5) or with a probable response to disease-modifying antirheumatic drug (DMARD) treatment (6). In addition, models to predict response to anti–tumor necrosis factor therapy in ankylosing spondylitis have been published recently (7), highlighting the interest in risk models within rheumatology. Clinicians might apply current risk models in their daily practice, but we have limited knowledge about whether the use of the models should be restricted to patient populations similar to the study populations used for the model development.
Risk models are common in cardiology, with the Framingham Risk Scores (8) and the Systematic Coronary Risk Evaluation (9) as examples of risk models for cardiovascular disease. Several publications have discussed the validation and development of such models, focusing on statistical methods to assess model fit and to compare the classification abilities of different models (10–14). Statistics have been developed to measure the degree of correct reclassification by a new model compared to a previous model, such as reclassification calibration statistics, net reclassification improvement, and integrated discrimination improvement (Table 1). These methods add information to traditional discriminatory abilities, for example, the C statistics.
Table 1. An overview of statistical methods to assess risk models (12, 14)
Log likelihood with a penalty for the number of variables included in the model
Fraction of the log likelihood explained by the predictors in the model, adjusted to a range of 0–1; analogous to the percentage of variation explained in linear regression models (R2)
Tests the goodness of fit of the model by comparing the observed and predicted number of cases within each decile (or another number of categories) of predicted risk in the model; a significant P value indicates that the model does not fit the observed data
Area under the receiver operating characteristic curve; rank based
Difference in the mean differences in predicted probabilities between cases and noncases
Compares the observed and expected number of events in each cell of a reclassification table; based on Hosmer-Lemeshow statistics, it requires 20 observations in each cell, and a significant test indicates a lack of fit
Compares the net increase and decrease in risk among cases to that of noncases
Risk models developed in clinical trials may not be directly applicable to daily clinical settings, since trials enroll selected patient groups, often with aggressive disease of short duration. In this study, we assessed the performance of 3 models for prediction of RRP in RA in an observational cohort representing a broad RA population. All 3 models were developed in clinical trial populations. We applied statistical methods previously used to assess risk models in other specialties.
Significance & Innovations
This study shows a limited value of risk models for rapid radiographic progression when applied to a broad population of patients with rheumatoid arthritis.
Development of risk models is challenging, especially when data materials are limited, as in rheumatology.
MATERIALS AND METHODS
Design and study cohort.
The Brigham Rheumatoid Arthritis Sequential Study (BRASS) is a single-center observational cohort consisting of 1,100 patients with RA (15). All patients in BRASS were diagnosed with RA by board-certified rheumatologists, and 96% fulfilled the 1987 American College of Rheumatology (ACR) classification criteria for RA at inclusion (16). A total of 478 BRASS patients had radiographic data available and received treatment with DMARDs, and were therefore eligible for the analyses. Baseline examinations took place between 2003 and 2006 and included patient-reported outcome measures, biochemical markers, and clinical examinations with swollen and tender joint counts. Treatment was given according to the clinical practice of the patient's physician, and visits with treatment adjustment could be scheduled when needed. The Brigham and Women's Hospital Institutional Review Board approved the study and all patients gave written informed consent for participation in the data collection.
Risk models for RRP.
We assessed 3 matrix-based risk models predicting RRP, all developed with multivariate logistic regression modeling. Model A was developed using data from the Active-Controlled Study of Patients Receiving Infliximab for the Treatment of Rheumatoid Arthritis of Early Onset (3). Patients who had never taken methotrexate were randomized to either methotrexate monotherapy or a combination of methotrexate and infliximab. Model B is based on data from the Behandelstrategieën voor Reumatoide Artritis (Treatment Strategies for Rheumatoid Arthritis) trial, a study of treatment strategies in RA patients with a disease duration ≤2 years (2). Model C was developed using data from the Swedish Farmacotherapy (SWEFOT) trial, studying the efficacy and safety of either a combination of hydroxychloroquine, sulfasalazine, and methotrexate or infliximab and methotrexate in patients failing initial methotrexate monotherapy (4). We chose to use the model from the second year of the SWEFOT trial based on the assumption that the second year of the trial would be more similar to the established disease seen in BRASS. Models A and B classify patients according to initial treatment, methotrexate monotherapy versus infliximab and methotrexate (model A) (3), or initial monotherapy versus initial combination with prednisone versus initial combination with infliximab (model B) (2).
The treatment variables in models A–C are more strictly defined than what is seen in clinical practice. To be able to include more of the observational data from BRASS, we grouped BRASS subjects based on whether they received either synthetic DMARD treatment at baseline (monotherapy or combinations of synthetic DMARDs) or biologic DMARD therapy (monotherapy or in combination with synthetic DMARDs), and used these variables instead of methotrexate and methotrexate/infliximab as originally described in the models. Corticosteroid use was not considered in the treatment classification.
Conventional radiographs of the bilateral hands and wrists were available at baseline and the 2-year followup visit. All radiographs were scored according to the modified Sharp/van der Heijde score by trained radiologists blinded to the sequence of the radiographs (17). For practical reasons, 4 different readers shared the work. The interreader correlation coefficient, calculated from 40 sets of radiographs scored by 2 of the readers, was 0.93 for the baseline score and 0.85 for the change score. The main outcome was defined as an annual change of ≥5 units in the total modified Sharp/van der Heijde score ([change in total modified Sharp/van der Heijde score during the followup period/length of followup in years] ≥5), which is the definition of RRP used by all of the models assessed in the analyses (2–4).
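The RRP definition above amounts to a simple annualized-change rule. The following is a minimal sketch (not the authors' analysis code); function and variable names are illustrative.

```python
# Hypothetical sketch of the RRP definition used in the text: an annualized
# change in the total modified Sharp/van der Heijde score of >= 5 units.

def annualized_change(baseline_score, followup_score, followup_years):
    """Change in total modified Sharp/van der Heijde score per year."""
    return (followup_score - baseline_score) / followup_years

def is_rrp(baseline_score, followup_score, followup_years, threshold=5.0):
    """Classify rapid radiographic progression (annualized change >= threshold)."""
    return annualized_change(baseline_score, followup_score, followup_years) >= threshold

# A patient progressing from 28 to 40 units over 2 years (6 units/year) meets
# the definition; one progressing from 28 to 33 (2.5 units/year) does not.
print(is_rrp(28, 40, 2.0), is_rrp(28, 33, 2.0))
```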
Anti–cyclic citrullinated peptide antibody (anti-CCP) was measured by a second-generation enzyme-linked immunosorbent assay (Inova Diagnostics), and subjects with a level ≥20 units/ml were classified as anti-CCP positive. Rheumatoid factor (RF) was assessed by immunoturbidimetric technique on the Cobas Integra 700 analyzer (Roche Diagnostics), with reagents and calibrators from Roche. A cutoff of ≥15 IU/ml was used for positive status. Reagents from Diasorin were used to measure high-sensitivity C-reactive protein (CRP) level. Experienced clinicians performed 28 swollen and tender joint counts.
Serologic status is included as RF level in model A and as combinations of RF and anti-CCP positivity in model B. The 28 swollen joint count is included in model A (<10, 10–17, >17). CRP level is a predictor in model A (<6 mg/liter, 6–30 mg/liter, >30 mg/liter) and in models B and C (<10 mg/liter, 10–35 mg/liter, >35 mg/liter for both). Erosion score (0, 1–4, >4) is included in model B, while erosion status (presence/absence) is one of the variables in model C. Model C also includes current smoker status (yes/no). Models A and B stratify subjects according to treatment.
All analyses were performed using SPSS for Windows, versions 15 and 19. Patient characteristics for all patients and patient groups according to treatment at baseline (inclusion into BRASS) were described by the median (25th, 75th percentiles) or percentages, as appropriate, and potential differences between the groups were assessed by Mann-Whitney U tests.
The distribution of the variables included in the 3 models was assessed in all patients and in the treatment groups separately. The univariate predictive value of the model variables was then tested in logistic regression models with RRP as the outcome variable. Variables with 3 levels were treated as categorical variables, both because a linear effect could not be assumed and because each variable's categories are included in the model, so each individual category should ideally have predictive value. As sensitivity analyses, univariate regression models were rerun with each variable treated as an ordinal variable.
The structure of the matrix risk models and the observed progression rates in BRASS were depicted by risk model matrix charts with color coding for predicted risk from the original publications, and observed values from BRASS as numerical values in the cells. The model first published, model A, is shown as an illustration in the main text (Figure 1), and the 2 other models are included as online supplements (Supplementary Figures 1 and 2, available in the online version of this article at http://onlinelibrary.wiley.com/doi/10.1002/acr.21870/abstract).
We compared the models by several measures. Discriminatory properties were tested by the C statistic (also known as the area under the receiver operating characteristic curve of the predictive model), which is the area under the plot of sensitivity versus 1 − specificity (18). The mean predicted probability was calculated for cases and noncases. The discriminatory abilities of the models were tested pairwise (models A versus B, C versus B, and A versus C) by the integrated discrimination improvement, calculated as (average predicted probability in cases − average predicted probability in noncases) for the new model minus the same difference for the old model; a positive value indicates that the new model is an improvement over the old model. P values were calculated as described by Pencina et al (18). The Hosmer-Lemeshow goodness-of-fit test examines the calibration of the model by comparing the expected and observed event rates in subgroups of the population, typically within deciles; a significant P value indicates that the model does not fit the observed data (12).
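As an illustration of the discrimination measures described above, the C statistic and the integrated discrimination improvement can be computed in a few lines. The sketch below uses invented predicted probabilities and only Python's standard library; it is not the analysis code used in the study.

```python
# Hypothetical illustration of the discrimination measures described above.
# All probabilities and outcomes are made up.

def c_statistic(probs, outcomes):
    """Probability that a randomly chosen case has a higher predicted
    probability than a randomly chosen noncase (ties count as 0.5)."""
    cases = [p for p, y in zip(probs, outcomes) if y == 1]
    noncases = [p for p, y in zip(probs, outcomes) if y == 0]
    concordant = 0.0
    for pc in cases:
        for pn in noncases:
            if pc > pn:
                concordant += 1.0
            elif pc == pn:
                concordant += 0.5
    return concordant / (len(cases) * len(noncases))

def discrimination(probs, outcomes):
    """Mean predicted probability in cases minus mean in noncases."""
    cases = [p for p, y in zip(probs, outcomes) if y == 1]
    noncases = [p for p, y in zip(probs, outcomes) if y == 0]
    return sum(cases) / len(cases) - sum(noncases) / len(noncases)

def idi(new_probs, old_probs, outcomes):
    """Integrated discrimination improvement: positive values favor the new model."""
    return discrimination(new_probs, outcomes) - discrimination(old_probs, outcomes)

# Invented predicted probabilities from two hypothetical models:
outcomes  = [1, 1, 0, 0, 0, 0]
model_old = [0.30, 0.20, 0.25, 0.15, 0.10, 0.20]
model_new = [0.40, 0.35, 0.38, 0.10, 0.15, 0.10]

print(c_statistic(model_new, outcomes))
print(idi(model_new, model_old, outcomes))
```

The pairwise all-pairs comparison makes the rank-based nature of the C statistic explicit, while the IDI works directly on the mean predicted probabilities.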
For each model, the observed probability of RRP was plotted for groups of subjects according to the predicted probability of RRP. The groups in models A and B were based on cutoffs for quartiles, while the groups for model C were based on tertiles. A large number of subjects sharing the same predicted probability in model C meant that the construction of 4 groups in this model would have led to large differences in the number of subjects in each group.
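The grouping just described can be sketched as follows, with invented data and quartile groups (as used for models A and B); the function name is illustrative.

```python
# Hypothetical sketch of binning subjects by quantile of predicted RRP
# probability and computing the observed proportion of RRP per bin.
# The probabilities and outcomes below are invented.

def observed_by_quantile(probs, outcomes, n_groups=4):
    """Observed event proportion within each quantile group of predicted
    probability, ordered from lowest to highest predicted risk."""
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    size = len(order) // n_groups
    proportions = []
    for g in range(n_groups):
        # The last group absorbs any remainder when len(probs) % n_groups != 0
        idx = order[g * size:] if g == n_groups - 1 else order[g * size:(g + 1) * size]
        proportions.append(sum(outcomes[i] for i in idx) / len(idx))
    return proportions

probs    = [0.02, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30, 0.40]
outcomes = [0,    0,    0,    1,    0,    0,    1,    1]
print(observed_by_quantile(probs, outcomes))
```

In a well-performing model the observed proportions would rise monotonically across groups; a non-monotone pattern, as in this toy example, mirrors the absence of a clear gradient reported for BRASS.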
RESULTS

The median age of all of the 478 patients included in the analyses was 59 years (25th, 75th percentiles 50, 66 years), with a median disease duration of 12 years at baseline (25th, 75th percentiles 4, 23 years). Eighty-four percent were women, 70% were anti-CCP positive, and 66% were RF positive. Patients treated with biologic DMARDs (median age 57 years [25th, 75th percentiles 48, 64 years]) were slightly younger than patients treated with synthetic DMARDs (median age 59 years [25th, 75th percentiles 51, 67 years]; P = 0.02 for comparison). Those receiving biologic DMARDs had a longer disease duration (median 15 years [25th, 75th percentiles 7, 27 years]) than those receiving synthetic DMARDs (median 10 years [25th, 75th percentiles 3, 20 years]; P < 0.001).
At baseline, the median Sharp/van der Heijde score was 28 (25th, 75th percentiles 5, 84 [mean ± SD 56 ± 64]) for all patients, 22 (25th, 75th percentiles 4, 57 [mean ± SD 44 ± 55]) in the synthetic DMARD group, and 49 (25th, 75th percentiles 8, 119 [mean ± SD 71 ± 72]) in the biologic DMARD group. Table 2 shows the distribution of the predictive variables included in the matrix risk models in BRASS, and in subgroups according to treatment. All model variables had a skewed distribution in the BRASS data set, with more subjects in the lower inflammatory marker categories, the higher erosion score categories, and the nonsmoker group, pointing toward a different patient population in BRASS than in the early RA clinical trials from which the matrix models were developed.
Table 2. Univariate models assessing the association between variables included in the 3 models, with rapid radiographic progression as the outcome variable*
All patients (n = 478)
Patients in synthetic DMARD group at baseline (n = 271)
Patients in biologic DMARD group at baseline (n = 207)
OR (95% CI)
P for univariate association
OR (95% CI)
P for univariate association
OR (95% CI)
P for univariate association
DMARD = disease-modifying antirheumatic drug; OR = odds ratio; 95% CI = 95% confidence interval; CRP = C-reactive protein; RF = rheumatoid factor; NR = not reported (too few cases with the outcome in the cell to perform the analyses); anti-CCP = anti–cyclic citrullinated peptide antibody.
Biologic DMARDs at baseline
Swollen joint count
CRP level (model A)
<6 mg/liter (ref.)
RF (model A)
<80 units/ml (ref.)
CRP level (model B)
<10 mg/liter (ref.)
+/− or −/+
Presence of erosions
The association between the predictive variables included in the matrix risk models and RRP was assessed in univariate logistic regression models, with RRP as the outcome. Twelve percent of patients (32 of 271) treated with synthetic DMARDs at baseline and 10% of patients (21 of 207) treated with biologic DMARDs were classified as having RRP. In all 478 patients, regardless of treatment, high levels of RF and combined RF/anti-CCP positivity were associated with RRP (Table 2). Some analyses could not be performed because the skewed distribution of the BRASS subjects across levels of the predictor variables left too few cases in some cells; for example, only 4% of the subjects fell in the highest CRP level category from model B (Table 2). Overall, most variables did not have a statistically significant association with RRP in the BRASS cohort. Similar results were observed in sensitivity analyses with ordinal variables instead of categorical variables.
Performance of models in BRASS.
The observed number of cases and subjects within each cell of the 3 models is shown in Figure 1 (see also Supplementary Figures 1 and 2, available in the online version of this article at http://onlinelibrary.wiley.com/doi/10.1002/acr.21870/abstract). The figures reveal no clear gradient of risk; i.e., the observed RRP across cells coded in different colors does not follow the originally described gradient, as would have been expected if the models performed well in the BRASS study population. Importantly, many of the cells were populated by few patients, with a tendency toward classification of subjects in the cells with lower swollen joint counts and CRP levels and few subjects in the upper right-hand “high-risk” corners.
All models had relatively low C statistics, indicating suboptimal discrimination (Table 3). In pairwise comparisons by the integrated discrimination improvement test, model B showed a larger difference in predicted probabilities between cases and noncases than models A and C, suggesting better discrimination between cases and noncases. Model A was the only model without a significant Hosmer-Lemeshow test, although its P value of 0.06 was borderline. The Hosmer-Lemeshow test indicates the fit of the model to the data, and a significant value means that the model fits poorly. Based on these results, further classification statistics, such as net reclassification improvement (Table 1), were not calculated (12).
Table 3. Discrimination and calibration statistics for the 3 models*
The Hosmer-Lemeshow goodness-of-fit test assesses the calibration of the model by comparing the expected and observed event rates in subgroups of the population, typically within deciles, and a significant P value indicates that the model does not fit the observed data (12).
Integrated discrimination improvement is calculated as (average predicted probability in cases − average predicted probability in controls) for the new model minus the same difference for the old model; a positive value indicates that the new model (mentioned second in the column headings) is an improvement over the old model.
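The Hosmer-Lemeshow-style calibration check used for Table 3 can be sketched minimally as follows. The groups below are invented (the text uses deciles of predicted risk), and only the chi-square statistic itself is computed; the P value would be obtained from a chi-square distribution with (number of groups − 2) degrees of freedom.

```python
# Hypothetical illustration of a Hosmer-Lemeshow-style statistic: for each
# risk group, compare the observed events with the number expected from the
# mean predicted risk, scaled by the binomial variance.

def hosmer_lemeshow_statistic(groups):
    """groups: list of (n_subjects, observed_events, mean_predicted_risk)."""
    stat = 0.0
    for n, observed, p in groups:
        expected = n * p
        stat += (observed - expected) ** 2 / (n * p * (1 - p))
    return stat

# Invented groups: (n, observed events, mean predicted risk)
groups = [(50, 4, 0.05), (50, 8, 0.12), (50, 15, 0.25)]
print(hosmer_lemeshow_statistic(groups))
```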
When grouping subjects according to their predicted probability of RRP, a clear gradient would ideally have been seen in the observed proportion of RRP between the groups. As shown in Figure 2, this was not the case in BRASS.
DISCUSSION

We assessed the performance of 3 matrix risk models for RRP in an observational RA cohort. The findings indicate a suboptimal ability of models developed in clinical trials to predict severe radiographic joint progression in a clinic-based RA population, and highlight potential challenges for application of risk models in rheumatology.
There are several issues that might have contributed to the somewhat disappointing findings in this study. First, the models were developed in patients with a reasonably short disease duration (2–4), while BRASS includes patients with any disease duration. Second, disease activity in the clinical trial patients was higher than in the BRASS patients, which led to a skewed distribution of BRASS subjects over the matrix cells. This observation raises the question of whether risk models should be developed for specific patient populations, or whether more variables with several levels should be taken into account, for example, as a calculator instead of a color-coded matrix. Third, treatment variables in the models were limited to methotrexate and methotrexate plus infliximab. The generalization of these variables to synthetic DMARDs and biologic DMARDs (including combinations with biologic DMARDs) in BRASS might have reduced the predictive abilities of the models. Although these issues are limitations of this study, they are also likely to be present if the risk models are applied in a clinical setting. Finally, 4 different readers scored the radiographs, possibly introducing some misclassification, even if interreader reliability was satisfactory.
Several points regarding the development of rheumatology risk model matrices warrant discussion. Risk model matrices for arthritic diseases are likely to be applied in situations where the patient has already received a diagnosis and some treatment, as a minority of patients manages without DMARDs. This treatment is confounded by indication: it is selected on the basis of the same predictive factors for worse disease outcome that we include in our models. This circularity complicates the application of the models. The success of risk models in cardiology rests on models developed to determine whether previously untreated patients should be treated, in contrast to whether preexisting treatment should be modified.
Models with several multilevel variables will stratify study subjects into numerous categories or strata. Since RRP is a relatively rare outcome, occurring in approximately 10–15% of RA patients, a large data set is needed to have sufficient information about cases and noncases in each stratum of the risk model. Typically, cardiology models have been developed in observational data sets consisting of several thousand subjects, and cohorts of this size with radiographic data are not currently available in rheumatology. Establishing such collaborative efforts should be a goal.
Another issue in the development process is choosing cutoffs to classify subjects into the correct risk prediction groups. It is not clear at what level of expected radiographic progression a patient should be classified as low risk (green) or high risk (red) for RRP. These are the groups with obvious treatment implications, while the intermediate group (yellow) is difficult to interpret and of limited value in a clinical setting. Without meaningful risk classification groups, the matrix will not be helpful for the clinician. Previous publications have discussed the importance of 3 risk level groups (low, moderate, and high risk), and ideally, the models should categorize a majority of patients into the high- and low-risk groups, with a minority in the clinically challenging intermediate group (11, 18, 19). Potential solutions to determine cutoffs include decision analyses or seeking experts' opinions. A lack of consensus on this question is shown by the differing cutoffs used in the 3 models included in this study, and by the fact that all models had more than 3 risk categories (5, 4, and 5 in models A, B, and C, respectively) (2–4).
The present models focus solely on radiographic joint damage. An alternative outcome could have been a combination of joint damage and functional status, as was done in the development of the ACR/European League Against Rheumatism remission criteria (20).
Statistical methods to assess risk models mainly test calibration or discrimination, and the most common tests are summarized in Table 1. Calibration assesses the degree of agreement between the predicted and observed probabilities. Discrimination examines whether the model separates cases and noncases. Ideally, a model should balance both: a patient who develops RRP should get a correct predicted probability for RRP (good calibration), and this should differ from the predicted probability for a subject who will not develop RRP (good discrimination). When comparing 2 risk models, the net reclassification index assesses the improvement in the correct classification of cases and noncases. In this study, we were not able to use reclassification statistics because the Hosmer-Lemeshow tests indicated a poor fit of the models in BRASS. There is also evidence that such methods should be limited to nested models (19).
We found that published matrix risk models for RRP developed from clinical trial material had limited value in this clinical observational cohort. This might partially be due to the difference in disease duration between the trial and cohort study groups, and indicates a need for model development in subjects with a broad range of disease activity and duration. Future research should aim to develop consensus on thresholds for risk classification to ensure clinically relevant risk categories. In conclusion, risk matrix models are potentially useful tools in rheumatology, but their development is challenging due to methodologic issues.
AUTHOR CONTRIBUTIONS

All authors were involved in drafting the article or revising it critically for important intellectual content, and all authors approved the final version to be published. Dr. Lillegraven had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Study conception and design. Lillegraven, Prince, Shadick, Haavardsholm, Kvien, Weinblatt, Solomon.
Analysis and interpretation of data. Lillegraven, Paynter, Prince, Shadick, Haavardsholm, Kvien, Weinblatt, Solomon.
ROLE OF THE STUDY SPONSOR
Biogen Idec, MedImmune, and Crescendo Bioscience supported the Brigham and Women's Hospital Rheumatoid Arthritis Sequential Study with grants. The sponsors did not influence the design, data collection, data analysis, or writing of the current study. The sponsors did not review the submitted manuscript, and publication of the article was not contingent on the approval of these sponsors.