Generalization and Extrapolation of Treatment Effects From Clinical Studies in Rheumatoid Arthritis




Pragmatic clinical trials have been proposed as a solution for nongeneralizability of randomized clinical trial (RCT) results. We investigated whether treatment effects of pragmatic clinical trials are indeed generalizable to clinical practice and how efficacy estimates from published RCTs can be translated to daily practice populations.


Data from pragmatic clinical trials of the Utrecht Rheumatoid Arthritis Cohort and the observational Nijmegen Early Rheumatoid Arthritis inception cohort were used. The treatment effects of methotrexate and hydroxychloroquine as opposed to the pyramid approach were compared between the trials and the observational study using a modified comprehensive cohort design analysis. The changes from baseline in disease activity (Disease Activity Score in 28 joints [DAS28]) and functional disability (Health Assessment Questionnaire [HAQ]) and European League Against Rheumatism (EULAR) response at 6 months were studied. In published RCT data, the influence of population and treatment characteristics on the American College of Rheumatology 50% improvement criteria response at 6 months, compared with control therapy, was assessed using the relative risk (RR) and risk difference (RD).


The DAS28 and HAQ generally improved more in patients in the pragmatic trials than in daily practice. However, using EULAR response as outcome, the treatment effect was not found to be different. In published RCT data, higher glucocorticoid use, disease duration, and cotreatment with disease-modifying antirheumatic drugs increased the RR. Use of glucocorticoids increased the RD, and higher values of baseline DAS28 and HAQ decreased the RR and RD.


Pragmatic clinical trials might be directly generalizable only regarding relative treatment response. In extrapolating published RCT results to daily practice, population characteristics associated with disease severity, disease duration, and treatment history or cotreatment need to be taken into account.


Randomized clinical trials (RCTs) are widely accepted as the gold standard to evaluate the effectiveness of treatment ([1]). This design ensures the internal validity of results, i.e., the results cannot be explained by confounding ([2]). The external validity or generalizability, however, might be less optimal. Usually, the treatment effects as estimated from RCTs are based on a specific treatment population that differs from the population in daily clinical practice (the domain). Patients that use the treatment in daily clinical practice might, for instance, have less severe disease or more often have comorbidities (due to exclusion criteria of trials) ([2-6]). In practice, clinicians have to translate results from RCTs to their individual patients and base their treatment recommendations on these results. Although RCTs are considered to be the gold standard, their results should be generalizable to the domain population in order to be implemented in daily clinical practice ([3, 4]).

Observational studies are often designed to have good generalizability because they include patients as treated in daily clinical practice with no strict inclusion or exclusion criteria. However, observational studies are often viewed as less (internally) valid because of confounding, especially confounding by indication. It has also been shown that well-designed observational studies might not systematically over- or underestimate the magnitude of the effects of treatment as compared to RCTs ([1-7]). Hence, we are still in need of a methodology for estimating the treatment effect that fits both the need for internal validity and for generalizability (also sometimes referred to as external validity).

Pragmatic clinical trials have been proposed as a solution to the problem of nongeneralizability of RCT results and the low internal validity of observational studies. In a pragmatic trial, all patients treated in daily practice are eligible for inclusion (i.e., no strict inclusion or exclusion criteria are used), increasing the generalizability. However, patients are randomized to different treatment arms, which is one of the main prerequisites for good internal validity ([5, 6, 8-10]). Another option to address the problem of nongeneralizability of treatment effects as estimated in RCTs is to adjust the treatment effect using a prediction model based on population and disease characteristics in relation to the treatment effect.

Traditionally, treatment for rheumatoid arthritis (RA) consisted of lifestyle alterations, pain medications (nonsteroidal antiinflammatory drugs [NSAIDs]), glucocorticoids, and the traditional disease-modifying antirheumatic drugs (DMARDs). Nowadays, treatment is primarily aimed at reducing disease activity, and with that, the prevention of resulting joint damage and functional disability. In the past decade, many new treatments have been introduced, mainly the so-called biologic agents ([11]). It has been recognized that the results of phase II/III RCTs for these drugs are not in line with the results as observed in daily practice for patients treated with these drugs ([12, 13]). No systematic evaluation regarding the population and disease characteristics of these RCTs in relation to the obtained treatment effect has been performed. Investigator-initiated pragmatic clinical trials have also been performed in RA ([14-16]), but it has not been systematically studied whether the results from these pragmatic clinical trials are indeed generalizable to daily practice. Therefore, the objectives of our study were to investigate if treatment effects estimated from pragmatic clinical trials in RA are indeed generalizable to daily clinical practice, to develop a general prediction model for treatment effects for daily clinical practice based on disease and patient population characteristics of published RCTs in RA, and to discuss the feasibility of such a prediction model in translating efficacy estimates from RCTs into effectiveness estimates for daily practice populations.

Box 1. Significance & Innovations

  • This study investigated 2 possible solutions for the important problem of nongeneralizability of treatment effects found in randomized clinical trials (RCTs) to daily clinical practice, specifically for the treatment of rheumatoid arthritis: pragmatic trial design (more closely related to daily clinical practice) and extrapolating RCT results using a prediction model.
  • Although pragmatic trials can partly be generalized to daily clinical practice, important (prognostic) differences in patient populations were observed that still need to be considered in generalizing treatment effects.
  • Preliminary prediction models were developed and evaluated to extrapolate the effect estimates from randomized clinical trials into daily clinical practice populations using several important prognostic characteristics.
  • Extrapolations of trial results have considerable potential to impact effectiveness and cost-effectiveness results, and this study provides possible solutions for extrapolating results and designing optimal research to study treatment effects.


Patient data and measurements

To study the generalizability of pragmatic clinical trials to clinical practice, data were used from such trials from the Utrecht Rheumatoid Arthritis Cohort (URAC) study group ([15]) and from a clinical practice cohort, the Nijmegen Early Rheumatoid Arthritis (NERA) inception cohort ([17]). Inclusion criteria for both studies were comparable and have been described before ([15, 17]). In brief, patients had RA according to the American College of Rheumatology (ACR) 1987 classification criteria ([18]), <1 year of disease duration, and no previous DMARD or glucocorticosteroid use. The URAC trials studied several DMARD treatments compared with the pyramid approach (starting with NSAIDs only). The NERA cohort registered the use and effects of different DMARD treatments in daily clinical practice.

Patient demographic data available in both cohorts were age, sex, and rheumatoid factor (RF) status. Disease activity, functional disability, and treatment use over time were assessed and registered in both studies. Functional disability was measured using the Health Assessment Questionnaire (HAQ) disability index ([19]). Disease activity was measured using the Disease Activity Score in 28 joints (DAS28), including counts for swelling and tenderness of 28 joints ([19]). In the pragmatic trials, a different joint count for swelling and tenderness was used, the Thompson score ([20]), which provides a count of joints that are both swollen and tender, including the same joints as the swollen joint count in 28 joints (SJC28)/tender joint count in 28 joints (TJC28). For this study, the DAS28 in the pragmatic trials was estimated from the corresponding Thompson score parameters instead of the SJC28/TJC28. The estimation algorithm was derived using data from the NERA cohort of ∼1,000 patients, in which both Thompson scores and 28-joint counts for swelling and tenderness were available. The correlation between the original DAS28 and the new algorithm was >0.95, with no systematic differences observed.
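For orientation, the widely used ESR-based DAS28 formula can be sketched as follows. This is the standard formula, not the Thompson-score variant derived for this study, whose coefficients are not given in the text. The function name and the example input values are illustrative assumptions.

```python
import math


def das28_esr(tjc28, sjc28, esr, general_health):
    """Standard ESR-based DAS28.

    tjc28, sjc28: tender/swollen joint counts (0-28)
    esr: erythrocyte sedimentation rate in mm/hour (must be > 0)
    general_health: patient's global assessment on a 0-100 visual analog scale
    """
    return (0.56 * math.sqrt(tjc28)
            + 0.28 * math.sqrt(sjc28)
            + 0.70 * math.log(esr)
            + 0.014 * general_health)


# Hypothetical patient, for illustration only.
score = das28_esr(tjc28=9, sjc28=4, esr=20, general_health=50)
```

The square-root and logarithm terms damp the influence of very high joint counts and ESR values, which is why the score is usually reported on a scale of roughly 0 to 10.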

The treatments that were compared between the pragmatic trial and the daily practice cohort were methotrexate (MTX), hydroxychloroquine (HCQ), and treatment according to the pyramid approach. In clinical practice, changes might occur more frequently because of the dynamic nature of treatment targeting disease activity, as compared with clinical studies aiming at evaluating the effect of the drug treatment. Treatment was therefore defined as constant use of the treatment over followup or stop/switch of treatment over followup. Common end points as used in RA clinical studies were used, namely the change in DAS28 score and the change in HAQ score from baseline and the European League Against Rheumatism (EULAR) response, both at 6 months.
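The EULAR response combines the change in DAS28 with the DAS28 level reached. A minimal sketch of the standard EULAR response criteria is given below; the function name and example values are illustrative assumptions, and the sketch is not taken from the study's own analysis code.

```python
def eular_response(das28_baseline, das28_followup):
    """Classify the EULAR response from baseline and follow-up DAS28.

    Good: improvement > 1.2 and DAS28 reached <= 3.2.
    Moderate: improvement > 1.2 with DAS28 reached > 3.2, or
              improvement > 0.6 (up to 1.2) with DAS28 reached <= 5.1.
    None: all other cases.
    """
    improvement = das28_baseline - das28_followup
    if improvement > 1.2:
        return "good" if das28_followup <= 3.2 else "moderate"
    if improvement > 0.6:
        return "moderate" if das28_followup <= 5.1 else "none"
    return "none"


# Hypothetical patients, for illustration only.
examples = [(6.1, 3.0), (6.1, 4.4), (6.1, 5.3), (5.0, 4.8)]
labels = [eular_response(b, f) for b, f in examples]
```

Because the classification depends on the level reached as well as on the change, two patients with the same decrease in DAS28 can fall into different response categories.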

To develop a prediction model for extrapolation of treatment effects from phase II/III RCTs to clinical practice populations, published data from RCTs evaluating treatments with biologic agents for RA compared with control therapy were used. A systematic literature search was performed using PubMed, EMBase, and the Cochrane trial register up to January 2012 to identify relevant RCTs (see Supplementary Appendix A, available in the online version of this article). The following inclusion criteria were used: 1) RCTs with at least 1 arm including treatment with biologic agents and 1 arm including control treatment (placebo/MTX/other DMARDs), 2) patients meeting either the ACR 1987 revised criteria for RA or the 2010 RA classification criteria ([18]), 3) assessment of the ACR 50% improvement criteria (ACR50) response at 6 months, and 4) the study not being an open-label extension study. When studies had multiple treatment arms, the arm with the most common dose regimen was used for analysis; in studies with multiple commonly used doses, different arms were treated as separate studies. Articles that met the selection criteria and were written in English, Dutch, or German were used (Figure 1). RefWorks software was used to manage references and to perform screening and shortlisting. Along with the ACR50 response and specific treatment and control treatment, the following demographic (baseline) data were extracted: percentage of women, RF positivity, whites and Asians, and mean disease duration, erythrocyte sedimentation rate, C-reactive protein level, DAS28, SJC28, TJC28, HAQ, patient's assessment of pain and of disease activity, physician's global assessment of disease activity, radiographic Sharp/van der Heijde score ([21]), and cotreatment with DMARDs. ACR50 outcomes from 22 weeks to 26 weeks were taken as 6-month outcomes.

Figure 1. Flow chart of the randomized clinical trials search. ACR50 = American College of Rheumatology 50% improvement criteria.

Statistical analysis

Generalizability of pragmatic trials to clinical practice

Patient, disease, and treatment characteristics were compared between the pragmatic trials and the daily clinical practice population using chi-square tests for binary variables and independent t-tests for continuous variables. To compare the treatment effect between the pragmatic trials and daily clinical practice, a modified comprehensive cohort design analysis was performed. In this design, all available eligible patients are enrolled whether or not they consent to randomized treatment assignment. The design is essentially a prospective cohort study with a randomized subcohort. A comprehensive cohort design is often used as a means of verifying the external validity of a trial ([4]). By analyzing these combined data using regression analysis, we investigated whether the treatment effect is different or modified between the patients in trials and patients not included in trials (i.e., included in a noninterventional or observational study), controlling for confounding factors using both the RCT and observational data. Logistic or linear regression (depending on the outcome studied) was used with a regression model of the following form: Y = α + β1*treatment + β2*trial + β3*(trial*treatment) + Σi βi*confounder_i, where Y refers to the outcome studied, treatment refers to the specific treatment comparison being evaluated, and trial refers to patients being randomized in a clinical trial or not (i.e., being a daily practice patient). Modification of the treatment effect by randomization (i.e., being in the trial) was tested in this model by the interaction term between treatment effect and randomization (trial*treatment). If this interaction term was statistically significant in the analyses, we concluded that pragmatic trials are not generalizable to daily practice.
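The role of the interaction term can be made concrete with a small sketch. In the model above, the treatment effect for a daily practice patient is β1, and for a trial patient it is β1 + β3, with confounders held fixed. The function name is an illustrative assumption; the example coefficients are the point estimates reported in Table 2 for HCQ versus the pyramid approach on the change in DAS28.

```python
def treatment_effect(beta_treatment, beta_interaction, in_trial):
    """Treatment effect implied by Y = a + b1*treatment + b2*trial + b3*(trial*treatment).

    For daily practice patients (trial = 0) the effect is b1;
    for trial patients (trial = 1) it is b1 + b3.
    """
    return beta_treatment + (beta_interaction if in_trial else 0.0)


# Point estimates from Table 2 (HCQ vs. pyramid, change from baseline in DAS28).
BETA_TREATMENT = 0.40
BETA_INTERACTION = -0.94

effect_practice = treatment_effect(BETA_TREATMENT, BETA_INTERACTION, in_trial=False)
effect_trial = treatment_effect(BETA_TREATMENT, BETA_INTERACTION, in_trial=True)
```

With these point estimates the implied effect is 0.40 in daily practice and −0.54 in the trials; the sketch ignores the confidence intervals, which are wide.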

To control for confounding of the treatment effect in the observational study and control for differences between the trial population and the observational study, possible confounders were added to the model (term confounder). A backward selection strategy of model building was used, where all variables and interactions were first added to the model and variables were then removed one by one. The final model included all variables that were either statistically significantly related to the outcome or changed the regression coefficient of the trial*treatment model term by >10%. All final models were corrected for calendar time (at the moment of entering the study) and for completion of treatment. A subgroup analysis for patients with complete treatment and patients with treatment changes over followup was also performed when significant modification of the treatment effect was observed. Baseline disease activity (DAS28), functional disability (HAQ), age, sex, and RF status were considered as possible confounders in the analyses. Analyses were performed separately for MTX and HCQ as compared to the pyramid approach.

Prediction model for extrapolation of treatment effects from phase II/III trials (RCTs) to clinical practice populations

A meta–regression analysis was performed to study the effect of patient and disease characteristics (predictor variables) on the treatment effect (outcome variable). To express the treatment effect, we used the risk difference (RD) as well as the relative risk (RR; calculated from data in the original articles) in separate analyses. For the meta–regression analysis with the RR as the outcome, the natural logarithm of RR (logRR) was used to obtain a normal distribution. Fixed-effects regression analysis was used and a multivariate analysis of selected variables was performed using a backward selection strategy based on the statistical significance and size of the regression coefficient. The final models included the type of biologic agent used, with golimumab used as a separate class because it has previously been found to have a different clinical efficacy from other tumor necrosis factor (TNF)–blocking agents ([13]), and the control treatment type.
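As a minimal sketch of how the three effect measures relate, the RR, logRR, and RD for a single trial can be computed from the responder counts in each arm. The function name and the counts below are hypothetical and are not taken from any of the included RCTs.

```python
import math


def effect_measures(responders_treat, n_treat, responders_ctrl, n_ctrl):
    """Return RR, logRR, and RD for one two-arm trial.

    Requires at least one responder in the control arm (RR is undefined otherwise).
    """
    risk_treat = responders_treat / n_treat
    risk_ctrl = responders_ctrl / n_ctrl
    rr = risk_treat / risk_ctrl
    return {"RR": rr, "logRR": math.log(rr), "RD": risk_treat - risk_ctrl}


# Hypothetical trial: 40/100 ACR50 responders on treatment, 10/100 on control.
measures = effect_measures(40, 100, 10, 100)
```

The same trial therefore yields an RR of 4 and an RD of 0.30; which of the two is more stable across populations is the question the meta–regression addresses.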

SAS, version 9.1.3 was used for all statistical analyses and a P value less than 0.05 was considered statistically significant. Meta-Analyst and Comprehensive Meta-Analysis software were used for making graphs.


Generalizability of pragmatic trials to clinical practice

Table 1 shows the baseline characteristics for the patients included in the clinical trials and the daily practice cohort. Of the 422 patients randomized to MTX, HCQ, or the pyramid approach, 398 had a visit at 6 months. At 6 months, for 90% and 78% of patients, outcome was available for the DAS28 and HAQ, respectively. In the daily practice cohort, 198 patients had a visit at 6 months; of these patients, DAS28 and HAQ outcome was present in 67% and 44%, respectively.

Table 1. Baseline patient characteristics from the pragmatic trials and daily clinical practice on disease-modifying antirheumatic drug treatment*
Characteristic | pRCTs (n = 398) | DCP (n = 198) | P
Age, years | 56.4 ± 13.9 | 54.7 ± 15.2 | 0.207
DAS28 score | 6.1 ± 1.3 | 5.0 ± 1.3 | <0.0001
HAQ score | 1.3 ± 0.7 | 0.8 ± 0.7 | <0.0001
Start year | 1994 ± 2.4 | 1996 ± 8.1 | 0.005
RF positive, no. (%) | 264 (66.8) | 121 (75.6) | 0.042
Women, no. (%) | 276 (69.4) | 98 (61.3) | 0.066
Treatment, no. (%) | | | <0.0001
  Pyramid | 61 (15.3) | 65 (32.8) |
  MTX | 167 (42.0) | 67 (33.8) |
  HCQ | 170 (42.7) | 66 (33.3) |
Treatment complete | 326 (81.9) | 123 (62.1) | <0.0001
Change in DAS28 | −1.7 ± 1.7 | −0.8 ± 1.2 | <0.0001
Change in HAQ | −0.3 ± 0.6 | −0.3 ± 0.6 | 0.554
EULAR good response, no. (%) | 79 (21.9) | 18 (13.5) | 0.038
EULAR moderate response, no. (%) | 257 (71.2) | 78 (58.7) | 0.008
* Values are the mean ± SD unless otherwise indicated. pRCTs = pragmatic randomized clinical trials; DCP = daily clinical practice; DAS28 = Disease Activity Score in 28 joints; HAQ = Health Assessment Questionnaire; RF = rheumatoid factor; MTX = methotrexate; HCQ = hydroxychloroquine; EULAR = European League Against Rheumatism.

The trial patients were somewhat older and more often women, had more active disease, and were less often RF positive. Because of the specific inclusion periods of the trials and the continuing inclusion in the Nijmegen inception cohort, the patients in the clinical trial were, on average, included earlier in time.

As expected, the distribution over the differing drug treatments was different and treatment was more often completed in trial patients (protocol adherence) at 6 months. Response to treatment with respect to disease activity, but not functional disability, was generally better in trial patients. EULAR good and moderate responses were also higher in trial patients at 6 months.

Change in disease activity (DAS28).

The treatment effect was more pronounced in trial patients, but modification of the treatment effect (the interaction term trial*treatment) was only statistically significant for the comparison of HCQ with the pyramid approach. Furthermore, a lower DAS28, a higher HAQ, and a positive RF reduced the decrease in DAS28 (as expected). For the comparison of MTX versus the pyramid approach, a younger age also was related to less reduction in DAS28, but not in HAQ (Table 2).

Table 2. Results of the linear regression analysis regarding generalizability of 6-month treatment effects on continuous outcomes*
Variable | Change in DAS28, MTX vs. pyramid, β (95% CI) | Change in DAS28, HCQ vs. pyramid, β (95% CI) | Change in HAQ, MTX vs. pyramid, β (95% CI) | Change in HAQ, HCQ vs. pyramid, β (95% CI)
Treatment | −0.69 (−1.22, −0.15) | 0.40 (−0.38, 1.18) | 0.02 (−0.26, 0.29) | −0.02 (−0.32, 0.28)
Trial | 0.20 (−0.36, 0.76) | 0.06 (−0.63, 0.75) | 0.46 (0.18, 0.75) | 0.34 (0.07, 0.61)
Trial*treatment | −0.10 (−0.77, 0.56) | −0.94 (−1.84, −0.05) | −0.39 (−0.72, −0.06) | −0.29 (−0.64, 0.07)
Older age | −0.01 (−0.03, −0.00) | | −0.01 (−0.01, −0.00) |
RF positive | 0.36 (0.03, 0.69) | 0.73 (0.35, 1.10) | | 0.15 (0.01, 0.29)
Higher baseline DAS28 | −0.58 (−0.69, −0.47) | −0.63 (−0.78, −0.47) | |
Higher baseline HAQ DI | | 0.41 (0.13, 0.69) | −0.48 (−0.58, −0.39) | −0.35 (−0.44, −0.25)
Subgroup analyses: constant treatment | | | |
  Treatment | −0.82 (−1.55, −0.09) | 0.08 (−0.94, 1.10) | 0.05 (−0.42, 0.53) | −0.00 (−0.48, 0.47)
  Trial | 0.39 (−0.34, 1.13) | 0.29 (−0.63, 1.20) | 0.54 (0.06, 1.02) | 0.38 (0.06, 0.82)
  Trial*treatment | −0.20 (−1.04, 0.64) | −1.00 (−2.11, 0.11) | −0.45 (−0.96, 0.07) | −0.42 (−0.94, 0.11)
Subgroup analyses: treatment changes | | | |
  Treatment | −0.42 (−1.38, 0.53) | 0.83 (−0.72, 2.38) | −0.05 (−0.49, 0.39) | −0.07 (−0.58, 0.43)
  Trial | −0.64 (−1.80, 0.51) | −1.57 (−3.11, −0.02) | 0.15 (−0.36, 0.66) | 0.07 (−0.48, 0.62)
  Trial*treatment | 0.63 (−0.82, 2.09) | 0.23 (−1.69, 2.15) | −0.16 (−0.85, 0.52) | 0.22 (−0.45, 0.90)
* DAS28 = Disease Activity Score in 28 joints; HAQ = Health Assessment Questionnaire; MTX = methotrexate; 95% CI = 95% confidence interval; HCQ = hydroxychloroquine; trial*treatment = interaction between being in the trial (trial) and the treatment effect (treatment); if this term is statistically significant, this indicates that results of pragmatic trials are not generalizable; RF = rheumatoid factor; DI = disability index.

In patients who continued their initial (randomized) treatment, the treatment effect was larger in trial patients, and in patients who discontinued treatment, the treatment effect was smaller in trial patients, although neither difference was statistically significant. When tested using a 3-way interaction, this difference between the subgroups was statistically significant (txcomplete*trial*MTX, P = 0.0238 and txcomplete*trial*HCQ, P = 0.0003).

Change in functional disability (HAQ).

The treatment effect was found to be more pronounced in trial patients for both treatment comparisons, but was only statistically significant for the comparison of MTX with the pyramid approach. Furthermore, a lower HAQ and younger age at baseline (and positive RF for HCQ versus the pyramid approach) decreased the HAQ reduction, as expected (Table 2).

Subgroup analyses for continuing initial treatment showed that modification of the treatment effect was mainly present for patients who completed treatment. When tested using a 3-way interaction, this was not statistically significant for MTX versus the pyramid approach (P = 0.496) but was statistically significant for HCQ versus the pyramid approach (P = 0.026).

EULAR moderate response

The treatment effect was not different for trial versus daily practice patients for both treatment comparisons. Only a higher baseline DAS28 increased the probability of response in MTX versus the pyramid approach, as well as in HCQ versus the pyramid approach (Table 3).

Table 3. Results of the logistic regression analysis regarding generalizability of 6-month treatment effects on binary outcomes*
Variable | EULAR moderate response, MTX vs. pyramid, OR (95% CI) | EULAR moderate response, HCQ vs. pyramid, OR (95% CI) | EULAR good response, MTX vs. pyramid, OR (95% CI) | EULAR good response, HCQ vs. pyramid, OR (95% CI)
Treatment | 7.39 (2.76, 19.77) | 4.10 (0.71, 23.60) | 9.49 (1.93, 46.72) | 1.13 (0.06, 19.72)
Trial | 1.63 (0.66, 4.04) | 4.10 (0.44, 9.86) | 3.00 (0.53, 16.94) | 4.22 (0.44, 40.80)
Trial*treatment | 0.31 (0.10, 1.03) | 2.08 (0.08, 3.63) | 0.32 (0.05, 2.06) | 3.35 (0.16, 70.85)
Older age | | | 1.03 (1.01, 1.05) |
RF positivity | | | 0.37 (0.19, 0.71) | 0.31 (0.15, 0.66)
Higher baseline DAS28 | 1.45 (1.17, 1.80) | 1.62 (1.20, 2.18) | 0.69 (0.54, 0.89) |
Higher baseline HAQ DI | | | | 0.43 (0.24, 0.77)
* EULAR = European League Against Rheumatism; MTX = methotrexate; OR = odds ratio; 95% CI = 95% confidence interval; HCQ = hydroxychloroquine; trial*treatment = interaction between being in the trial (trial) and the treatment effect (treatment); if this term is statistically significant, this indicates that results of pragmatic trials are not generalizable; RF = rheumatoid factor; DAS28 = Disease Activity Score in 28 joints; HAQ = Health Assessment Questionnaire; DI = disability index.

EULAR good response

The treatment effect was not modified by being in the trial. Older age, negative RF, and lower baseline DAS28 increased the probability of response in MTX versus the pyramid approach. For HCQ versus the pyramid approach, a negative RF and lower baseline HAQ increased the probability of response.

All final regression models using change in DAS28 or a combination of change in DAS28 and DAS28 reached (EULAR response) were corrected for baseline DAS28. The differences in treatment effect are therefore also related to the absolute levels of disease activity reached.

Prediction model for extrapolation of the treatment effects from phase II/III trials to clinical practice populations

The flow chart (Figure 1) shows the process and results of the literature search and selection procedure. Finally, 35 unique studies ([22-56]) that presented the ACR50 response at 6 months as an outcome measure and that were eligible for our analysis were selected. The assessment of the articles based on the Jadad scale (range 0–5, where 0 = worst quality and 5 = best quality) showed a score >3 for all studies, indicating good quality. Baseline population characteristics of these selected RCTs are shown in Table 4.

Table 4. Extracted baseline population characteristics of selected randomized clinical trials*
Characteristic | No. | Mean ± SD
% RF positive | 21 | 0.84 ± 0.095
% women | 35 | 0.79 ± 0.082
Disease duration, years | 32 | 9.13 ± 2.40
Age, years | 35 | 53 ± 2.21
DAS28 | 23 | 6.51 ± 0.49
VASGH | 19 | 62.41 ± 6.52
HAQ | 28 | 1.59 ± 0.21
% previous corticosteroid use | 21 | 0.61 ± 0.094
* RF = rheumatoid factor; DAS28 = Disease Activity Score in 28 joints; VASGH = visual analog scale for general health; HAQ = Health Assessment Questionnaire.

Table 5 shows the results of the meta–regression analysis. In the final model with the logRR as the outcome, the percentage of glucocorticoid use and percentage of cotreatment with DMARDs and also an older age were statistically significantly associated with a higher RR of achieving a response, as compared with control treatment. A higher mean baseline HAQ decreased the RR for response to treatment, as compared with control treatment.

Table 5. Prediction model for extrapolating response on (biologic) treatment as compared with control treatment on the logRR and RD scale from randomized clinical trials to a specific clinical practice population*
Predictor | logRR estimate | logRR lower 95% CI | logRR upper 95% CI | RD estimate | RD lower 95% CI | RD upper 95% CI
Treatment with anti-CD20 | 1.3384 | 0.3981 | 2.2787 | 0.1802 | 0.02080 | 0.3397
Treatment with anti–IL-1 | −0.5810 | −1.2409 | 0.0789 | | |
Treatment with anti–IL-6 | 0.6158 | 0.1323 | 1.0992 | 0.2401 | 0.06122 | 0.4190
Treatment with anti-TNFα | 1.2378 | 0.6856 | 1.7899 | 0.2086 | 0.0636 | 0.3530
Treatment with anti–T cell | −0.0059 | −0.4425 | 0.4306 | 0.0103 | −0.1255 | 0.1461
Treatment with golimumab† | −0.8861 | −1.3850 | −0.3873 | −0.1714 | −0.3309 | −0.0119
Treatment with infliximab | 0 | | | 0 | |
DMARD cotreatment | 0.7102 | 0.4220 | 0.9984 | 0.1256 | 0.0706 | 0.1805
Previous use of corticosteroid | 3.8095 | 1.3098 | 6.3093 | 1.0696 | 0.3018 | 1.8374
Mean age | 0.0838 | −0.01264 | 0.1804 | | |
Mean baseline HAQ | −2.1884 | −3.7897 | −0.5870 | −0.2598 | −0.5275 | 0.0080
Mean baseline DAS28 | | | | −0.1287 | −0.2459 | −0.0114
* Shown are the final meta–regression models with log relative risk (logRR) or risk difference (RD) for the American College of Rheumatology 50% improvement criteria response of index treatment compared with control treatment as the dependent variable. 95% CI = 95% confidence interval; anti–IL-1 = anti–interleukin-1; anti-TNFα = anti–tumor necrosis factor α; DMARD = disease-modifying antirheumatic drug; HAQ = Health Assessment Questionnaire; DAS28 = Disease Activity Score in 28 joints.
† Golimumab (a TNF blocker) showed a different treatment effect from the other TNF blockers (as also found earlier by Kievit et al ([13])), which led us to the decision to classify this drug separately.

For the final model with RD as the outcome, the percentage of glucocorticoid use and cotreatment with DMARDs increased the RD, and a higher mean baseline DAS28 and HAQ decreased the treatment effect. No modification of the effect of the predictors by treatment could be identified.

To explore the fit of the final models, the predicted versus observed treatment effects for the RCTs were plotted, showing a reasonable fit (see Supplementary Appendix B, available in the online version of this article).


In this study, we investigated 2 ways to obtain effect estimates for disease-modifying treatment of RA from study results that are more generalizable or representative of treatment effects in daily clinical practice. First, we examined whether so-called pragmatic RCTs might be generalizable to daily clinical practice populations. We showed that, regarding treatment effect on binary outcomes (i.e., EULAR response), the relative treatment effects (i.e., multiplicative, odds ratio) might indeed be comparable in trial patients and non–trial patients (generalizable). For continuous outcomes, we found that the treatment effect (i.e., absolute difference) might be different, with more pronounced treatment effects in the pragmatic trials.

Even in pragmatic trials, there seems to be a selection of the domain population. Because the pragmatic trial from Utrecht and the observational study from Nijmegen have comparable inclusion criteria and both are from large Dutch rheumatology centers, and because we selected patients in the same stage of their disease and using the same treatments, no large differences in patient characteristics were expected. However, through the selection of patients for participation in the study by physicians and patients themselves, differences in patient characteristics between the study populations seem to have occurred. In our analysis on the treatment effect, we adjusted for these baseline differences and compared treatment versus a common control treatment (i.e., treatment according to the pyramid approach). The remaining differences for the change scores in treatment effect might be related to other reasons, including unmeasured (and difficult to measure) patient, disease, or physician characteristics. However, irrespective of the reason for this remaining difference in treatment effect, the results should be regarded as not generalizable.

When the relative treatment effect (i.e., the relative odds of responding to treatment compared with control treatment), as calculated from pragmatic trials, is indeed generalizable, as we found, the selection of patients still needs to be taken into account when prognostic characteristics differ between RCTs and the daily practice population. For example, consider a patient with only a 20% chance of responding to control treatment and a patient with a 50% chance of responding to control treatment. In the first patient, a relative treatment effect of 1.5 means a 30% chance of responding to new treatment (an absolute difference of 10 percentage points), and in the second patient, this would mean a 75% chance of responding (an absolute difference of 25 percentage points). The same relative effect therefore yields a substantially different absolute benefit.
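The arithmetic of this example can be reproduced in a few lines. The first function treats the relative effect of 1.5 as a risk ratio, as the percentages in the example imply. The second function, which treats the same value as an odds ratio, is an added illustration and not part of the original example; all function names are assumptions.

```python
def risk_under_rr(control_risk, relative_risk):
    """Response probability on treatment when the effect is a risk ratio."""
    return control_risk * relative_risk


def risk_under_or(control_risk, odds_ratio):
    """Response probability on treatment when the effect is an odds ratio."""
    odds = control_risk / (1.0 - control_risk) * odds_ratio
    return odds / (1.0 + odds)


RELATIVE_EFFECT = 1.5
for control_risk in (0.20, 0.50):
    treated = risk_under_rr(control_risk, RELATIVE_EFFECT)
    difference = treated - control_risk
    print(f"control {control_risk:.2f}: treated {treated:.2f}, difference {difference:.2f}")
```

Under either interpretation the absolute benefit grows with the control response probability, so a generalizable relative effect still has to be combined with the prognosis of the target population.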

Our other objective was to develop a general prediction model to be used to translate RCT efficacy estimates to effect estimates for a daily clinical practice population. We found that, with RR as well as RD as the outcome, the treatment effect was influenced by certain prognostic factors, especially DMARD cotreatment, baseline HAQ, and previous use of glucocorticoids.

For example, these results can be used to extrapolate the absolute RD on ACR50 response of TNF-blocking treatment from the trial data to a specified hypothetical daily practice population (for specific calculations, including calculations of raw cost-effectiveness estimates, see Supplementary Appendix C, available in the online version of this article).
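A rough sketch of such an extrapolation is given below. It is not the calculation from the supplementary appendix. Table 5 reports no intercept, so the sketch computes only the shift in predicted RD between a trial population and a target population, using the RD point estimates from Table 5. The trial values are the mean baseline values from Table 4; the daily practice values, the variable names, and the treatment of DMARD cotreatment and corticosteroid use as proportions are assumptions made for illustration.

```python
# RD-scale point estimates from Table 5 (uncertainty is ignored in this sketch).
RD_COEFFICIENTS = {
    "dmard_cotreatment": 0.1256,        # assumed to be coded as a proportion
    "previous_corticosteroid": 1.0696,  # assumed to be coded as a proportion
    "mean_baseline_haq": -0.2598,
    "mean_baseline_das28": -0.1287,
}


def rd_shift(trial_population, target_population, coefficients=RD_COEFFICIENTS):
    """Change in predicted RD when moving from the trial to the target population."""
    return sum(
        coefficient * (target_population[name] - trial_population[name])
        for name, coefficient in coefficients.items()
    )


# Trial means from Table 4; DMARD cotreatment value is hypothetical.
trial = {"dmard_cotreatment": 1.0, "previous_corticosteroid": 0.61,
         "mean_baseline_haq": 1.59, "mean_baseline_das28": 6.51}
# Hypothetical daily practice population with milder disease.
practice = {"dmard_cotreatment": 1.0, "previous_corticosteroid": 0.61,
            "mean_baseline_haq": 1.0, "mean_baseline_das28": 5.0}

shift = rd_shift(trial, practice)
```

For this hypothetical population the predicted RD is about 0.35 higher than in the average trial, driven entirely by the lower baseline HAQ and DAS28; a linear extrapolation this far from the trial means should be read with caution.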

It has been suggested that, in patients who are eligible (according to inclusion/exclusion criteria) for RCTs from daily practice, the responses are similar to patients in RCTs ([13]). This suggests that using a prediction model with variables commonly used for inclusion/exclusion criteria to adjust results from RCTs to a specific daily practice situation can be feasible. In line with Wolfe et al ([13]), who concluded that the design in RCTs exaggerated the treatment effect because of the selection of patients, and Hyrich et al ([57]), we also found that a higher DAS28 and older age increased the probability of response, in addition to the treatment itself. Kievit et al ([13]) suggested that selection toward high disease activity and continued use of comedication in RCTs could be the probable explanations for the difference in effects of anti-TNF in clinical practice and in RCTs, in line with the findings of our regression analyses.

The finding from our meta-regression analysis that a lower baseline DAS28 increased the RR (and RD) of response to treatment seems to contradict the findings from studies in the literature and our own analysis of individual patient data, which showed higher response probabilities with a higher baseline severity of disease (i.e., DAS28 or HAQ). However, it should be noted that the individual patient data analysis investigated the influence of baseline disease activity/severity on the probability of response, whereas the meta-regression analysis investigated the effect of baseline disease activity on the ratio of, or difference in, response probabilities between index and control treatment. To examine whether this was indeed the reason for the apparent contradiction, we tested the influence of baseline disease severity (DAS28 and HAQ separately) on the treatment effect (i.e., the interaction between baseline DAS28/HAQ and treatment) in our daily practice population. Although the interaction was not found to be statistically significant, the sign of the terms was negative and the calculated RRs and RD from the model indeed decreased with higher baseline DAS28 or HAQ. Therefore, we consider the results of our meta-regression valid in this regard.
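How a higher baseline severity can raise the probability of response in both arms while lowering the RR and RD can be illustrated with a minimal sketch. It assumes a logistic model with a treatment indicator, baseline DAS28, and their interaction. The functional form, the coefficient values, and the function names are all hypothetical and chosen only to reproduce the qualitative pattern described above; they are not estimates from this study.

```python
import math

# Hypothetical coefficients on the logit scale (NOT estimates from the study):
# intercept, baseline DAS28 (positive), treatment (positive),
# and DAS28-by-treatment interaction (negative).
B0, B_SEV, B_TRT, B_INT = -3.0, 0.5, 2.0, -0.2


def response_prob(das28: float, treated: bool) -> float:
    """Response probability under the assumed logistic model."""
    t = 1.0 if treated else 0.0
    logit = B0 + B_SEV * das28 + B_TRT * t + B_INT * das28 * t
    return 1.0 / (1.0 + math.exp(-logit))


def effect_at(das28: float) -> tuple[float, float, float, float]:
    """Return (control probability, treated probability, RR, RD)."""
    p0 = response_prob(das28, treated=False)
    p1 = response_prob(das28, treated=True)
    return p0, p1, p1 / p0, p1 - p0


for das28 in (4.0, 5.0, 6.0):
    p0, p1, rr, rd = effect_at(das28)
    print(
        f"DAS28 {das28:.0f}: control {p0:.2f}, treated {p1:.2f}, "
        f"RR {rr:.2f}, RD {rd:.2f}"
    )
```

With these illustrative values, response probabilities rise with baseline DAS28 in both arms, while the RR and RD between arms fall, so the two findings are not in conflict.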

The finding that concomitant DMARD therapy increased the treatment effect was expected. Glucocorticoid treatment at baseline might be an indicator of disease activity, or the treatment itself might be of additional value, like cotreatment with a DMARD ([33, 34, 40, 42, 43]).

When generalizing results from RCTs, it is often assumed that the RR as a measure of treatment effect is more or less constant across patient populations and that the RD varies more between populations with a different baseline risk of response (i.e., prognosis). However, in our analysis of published RCT data, we observed that almost the same baseline predictors influenced the treatment effect whether it was expressed as RD or as RR. Our results therefore suggest that the assumption of a constant RR of treatment response across populations might not hold when the differences between the study population and the daily practice population are larger (i.e., in RCTs). In the pragmatic trials, we found that the relative treatment effect (i.e., odds ratio or RR) was indeed generalizable.

Our study has limitations. Generalizability was studied for only a limited number of DMARD treatments, because only these treatments were used as initial DMARD treatment in a sufficient number of patients in both cohorts. In the published reports of RCTs, only a limited number of patient/disease characteristics could be studied because of insufficient reporting and differences in trial design, and some studies had to be excluded from the analysis. This resulted in less power, but we did not expect it to result in bias. Our analysis assumed that the effect of population/disease characteristics on treatment effect is the same for different biologic agents. This is probably justified because very general (prognostic) factors were studied and no modification could be established. The results and the models developed should, however, be validated in other data sets.

Reassuringly, similar factors seemed to be important in the generalization of pragmatic trial results and in the extrapolation of RCT results. Missing data were also present in the data of the pragmatic trials and the observational study. We analyzed the missing data and found that disease activity outcomes and baseline characteristics were comparable in patients with complete data and patients with missing data, except for age, which was somewhat lower in patients with missing outcomes in trial data, and baseline DAS28, which was higher in the daily practice population.

Treatment effects as observed in pragmatic clinical trials might be only partly generalizable to daily practice directly, namely, with respect to the relative treatment effects (i.e., odds ratio or RR). In extrapolating RCT results to daily practice populations, factors associated with disease activity, disease duration, and treatment history or cotreatment need to be taken into account, regardless of whether the treatment effect is expressed as absolute (RD) or relative (RR). These results are important for the interpretation of clinical studies and for the planning of studies that aim to obtain a valid estimate of the effectiveness or cost-effectiveness of (new) treatments in daily practice, specifically for RA and more generally for other chronic diseases.


All authors were involved in drafting the article or revising it critically for important intellectual content, and all authors approved the final version to be published. Ms Nair had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

Study conception and design. Nair, Kievit, Bijlsma, Welsing.

Acquisition of data. Nair, Kievit, Janse, Bijlsma, Fransen, Lafeber, Welsing.

Analysis and interpretation of data. Nair, Kievit, Janse, Fransen, Welsing.


The authors thank all the participating rheumatologists and research nurses of the URAC Study Group and NERA inception cohort for their specific contribution.