Do efficacy results obtained from randomized controlled trials translate to effectiveness data from observational studies for relapsing–remitting multiple sclerosis?

Randomized controlled trials are considered the gold standard in regulatory decision making, as observational studies are known to have important methodological limitations. However, real‐world evidence may be helpful in specific situations. This review investigates how the effect estimates obtained from randomized controlled trials compare to those obtained from observational studies, using drug therapy for relapsing–remitting multiple sclerosis as an example.

• Multiple observational studies together may supplement additional pivotal randomized controlled trials in relapsing-remitting multiple sclerosis, for instance by facilitating the extrapolation of trial results to the broader patient population.

| INTRODUCTION
Randomized controlled trials are considered the gold standard for premarket assessment of efficacy of medicinal products. 1 By assigning treatment through randomization, exchangeability can be achieved, which can avoid forms of bias due to confounding and selection. Real-world evidence derived from observational studies can supplement evidence derived from experimental trials to show effectiveness of treatments in clinical practice, for example, to study long-term outcomes or effects in populations not evaluated in the pivotal studies.
With the increase in availability of electronic data, there has been growing interest in the usage of real-world evidence in recent years, 2,3 especially in areas where the application of randomized controlled trials is less feasible due to, among other reasons, ethical limitations. 3 The challenges of randomized trials and the availability of real-world evidence have sparked a discussion on whether data from observational studies can substitute randomized controlled trials and, if so, under which conditions. Methodological limitations of observational studies have withheld the use of real-world evidence as pivotal evidence for marketing authorization, except in very specific circumstances. 4-7 However, other threats, like the presence of unmeasured confounding, are harder or even impossible to deal with. 8-12 Further insight into the magnitude of potential heterogeneity between results from randomized controlled trials and observational studies, and possible sources for this heterogeneity, may facilitate a discussion on when randomized evidence is needed and when real-world evidence with more uncertainty may be acceptable or even beneficial.
Heterogeneity in results of real-world evidence and randomized controlled trials may be explained by the previously stated dissimilarity in study designs (consciously chosen or due to limitations in the data) and its associated threats to causal inference. By using network meta-analysis, we can identify how the effect estimates obtained from randomized controlled trials compare to those obtained from observational studies through the direct and indirect comparison of an effect estimate of interest (e.g., incidence rate ratio) for a multitude of treatment options, different outcomes and over a set of studies. 13 In this review, we have used the case of relapsing-remitting multiple sclerosis as an illustrative example to discuss the comparison of the results from randomized controlled trials with real-world evidence, looking at the annualized relapse rate for multiple disease-modifying therapies. Relapsing-remitting multiple sclerosis was chosen due to the number of treatment options available and the large number of experimental and observational studies published, as well as relevant literature for the comparison of our results. 14,15 Using these case studies, we introduce the application of posterior predictive p-values as a novel approach to compare the differences between effects estimated from randomized controlled trials and real-world studies. Moreover, we aimed to discuss the comparability, potential sources of heterogeneity, and the potential to adjust for these differences.
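The logic of an indirect comparison through a common comparator can be sketched as follows (a simplified illustration only, not the analysis pipeline used in this review; the treatment labels and rates are hypothetical):

```python
# Hypothetical annualized relapse rates (relapses per patient-year)
# for two treatments, A and B, each compared against a common
# comparator C in separate studies.
arr = {"A": 0.20, "B": 0.30, "C": 0.40}

# Direct rate ratios versus the common comparator C.
rr_a_vs_c = arr["A"] / arr["C"]  # 0.5
rr_b_vs_c = arr["B"] / arr["C"]  # 0.75

# Indirect comparison of A versus B through the common comparator:
# RR(A vs B) = RR(A vs C) / RR(B vs C).
rr_a_vs_b = rr_a_vs_c / rr_b_vs_c
print(round(rr_a_vs_b, 3))  # 0.667
```

A network meta-analysis generalizes this idea, pooling all direct and indirect paths between treatments while weighting by study precision.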
The protocol of this review has been published and is publicly available at the PROSPERO website (CRD42022354152).

| Search strategy and inclusion criteria
We performed a systematic search, restricted to the English language, in the online literature databases PubMed and Embase for both experimental and observational studies that evaluated the effectiveness of at least one disease-modifying treatment against an active control in patients with relapsing-remitting multiple sclerosis.
Articles published until March 2022 were included. The search strategy can be found in the Data S1. Details of the inclusion and exclusion criteria can be found in Table 1.
Studies obtained by the systematic search were first assessed for eligibility by title and abstract, and subsequently by full text, by two screening reviewers, Stefan Verweij and Wouter Ahmed. Disagreements were discussed between the reviewers. If no agreement was reached, a third reviewer, Eelko Hak, was consulted for a final decision. The search workflow is visualized using the PRISMA workflow in Table 2. 16

T A B L E 1 Inclusion and exclusion criteria of the systematic search performed in the PubMed and Embase databases.

Population
Included: • Patients with relapsing-remitting multiple sclerosis.
Excluded: • Specific populations of patients with relapsing-remitting multiple sclerosis (e.g., those with comorbidities or pregnant women).

Comparator
Included: • Placebo and/or disease-modifying therapies used as an active control.
Excluded: • Passive control in the form of no treatment. • Healthy controls.

Outcomes
Included: • The annualized relapse rate (or total number of relapses in combination with the number of patients and length of follow-up) at 2 years of follow-up.
Excluded: • No report of the annualized relapse rate as an outcome at 2 years.

Study designs
Included: • Randomized controlled trials and observational studies that report primary data on the effect of disease-modifying therapies on experiencing a relapse.
Excluded: • Studies based on a crossover design. • Post-hoc analyses, if explicitly stated.

Publication types
Included: • Original research articles.
Excluded: • Editorial letters.

Language
Included: • English language.
Excluded: • Non-English language.

T A B L E 2 PRISMA diagram of the systematic search.

| Outcome measures
The primary clinical outcome is the annualized relapse rate over a follow-up period of 2 years since study onset, where a relapse was defined as new or worsening neurological symptoms that last at least 24 h and could be attributed to relapsing-remitting multiple sclerosis, preceded by at least 30 days of clinical stability or improvement. We have chosen this outcome measure as it is the European Medicines Agency's preferred efficacy endpoint for treatments intended to modify the natural course of relapsing multiple sclerosis. 18 When reported by the study, we extracted covariate-adjusted effect estimates to reduce possible bias due to confounding. A threshold on the follow-up period, that is, 2 years, was chosen to support the transitivity assumption (the studies are on average similar in all important factors other than the intervention comparison being made) and to have follow-up of sufficient duration to allow for the evaluation of an effect on relapses. 19 Relative rate ratios were calculated to identify how annualized relapse rates compare between disease-modifying therapies.
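Concretely, the annualized relapse rate is the number of relapses divided by the accumulated patient-years of follow-up, and a rate ratio contrasts two arms. A minimal sketch with made-up numbers:

```python
def annualized_relapse_rate(n_relapses: int, patient_years: float) -> float:
    """Relapses per patient-year of follow-up."""
    return n_relapses / patient_years

# Hypothetical two-year trial: 120 patients per arm, followed 2 years each.
arr_treatment = annualized_relapse_rate(n_relapses=72, patient_years=240.0)   # 0.30
arr_comparator = annualized_relapse_rate(n_relapses=96, patient_years=240.0)  # 0.40

# A rate ratio below 1 favors the treatment over the comparator.
rate_ratio = arr_treatment / arr_comparator
print(round(rate_ratio, 2))  # 0.75
```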

| Risk of bias assessment
The bias assessment framework risk-of-bias 2 (RoB-2) was used to determine the methodological quality of randomized controlled trials. 20 For observational studies, the risk of bias in non-randomized studies of interventions (ROBINS-I) assessment framework was applied. 21 The bias was assessed by Stefan Verweij for all the included studies. Fifty percent of these studies (50% of the randomized controlled trials and 50% of the observational studies) were chosen at random to have their bias independently assessed by Guiling Zhou as an extra validation step. For both the RoB-2 and ROBINS-I frameworks, the two reviewers had to initially agree on the degree of bias for at least 90% of the studies assessed. When this threshold was not met, the second reviewer had to assess the remainder of the included studies as well to ensure reliable bias assessment. Any disagreement was discussed between the two reviewers. If no agreement was reached, a third reviewer, Eelko Hak, was consulted for a final decision. Robvis was used to visualize the risk of bias. 22

| Statistical analysis
As we aimed to compare relative effect estimates (i.e., annualized relapse rate ratios) between the network of randomized controlled trials and the real-world evidence network, a common comparator was needed to examine whether the annualized relapse rates of the Rebif populations in an experimental setting compare to the rates of the Rebif populations in the real-world setting. One of the interferon β-1a products (Rebif) was chosen as comparator as it was the most common active comparator in both observational and experimental studies. 24,25 When the pooled effect estimate between the two study designs was considered roughly similar, two network meta-analyses were performed using a frequentist random-effects model estimated by the "netmeta" R package on the annualized relapse rate to create rate ratios. 26 A random-effects model was conservatively chosen as we assumed the true effect is different for each study due to the heterogeneity between study populations. Robustness of the results was assessed by also employing a Bayesian random-effects model with an uninformative prior distribution for the heterogeneity variance, embedded in the "GeMTC" package in R. 27 100 000 iterations of Markov chain Monte Carlo were performed to sample from the posterior distributions, with a burn-in of 10 000 iterations and a thinning of one. Convergence of the MCMC sampler was assessed using the Gelman-Rubin diagnostics. 28 The annualized relapse rate was used as a rate variable to create ratios between the treatment comparisons with 95% confidence intervals. The magnitude of the ratios and the width of the confidence intervals were visualized separately for the randomized controlled trial and real-world evidence networks, and any apparent differences were described. Network heterogeneity was quantified and categorized using the I² statistic, as described in Chapter 10.10 of the Cochrane handbook, and the τ, estimated using the maximum likelihood approach. 29,30 Transitivity was evaluated by testing for inconsistency using the node-splitting approach. 31 When the transitivity within the network was violated (p < 0.05), the study causing the inconsistency was identified and examined for possible explanations for the violation. If transitivity was indeed doubtful, the study was removed from the analysis. Using the posterior predictive p-value approach, we assessed whether a treatment effect estimate arising from an observational study for a given treatment comparison was significantly different (p < 0.05) from what was expected, given the data from the network of randomized controlled trials. 32 We thereby assumed that the information (variance) of the real-world evidence effect estimate is similar to that of the pooled estimate from the network of randomized controlled trials. The posterior predictive p-values were obtained through the prediction intervals of the randomized controlled trial network reported by the "netmeta" R package. 26 Sensitivity analyses were performed by excluding studies graded with high risk (RoB-2) or serious or critical risk (ROBINS-I) of bias.
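The comparison of a real-world estimate against the trial-network prediction interval can be sketched under a normal approximation on the log rate-ratio scale. This is a simplified stand-in for the netmeta-based procedure, not the paper's exact computation, and all numbers below are hypothetical:

```python
import math

def two_sided_p_from_prediction(log_rr_rwe: float,
                                pred_lower: float,
                                pred_upper: float) -> float:
    """Two-sided tail probability of the real-world log rate ratio under a
    normal predictive distribution recovered from a 95% prediction interval."""
    mean = (pred_lower + pred_upper) / 2.0
    sd = (pred_upper - pred_lower) / (2.0 * 1.959964)  # half-width / z_0.975
    z = (log_rr_rwe - mean) / sd
    # Standard-normal two-sided tail probability via the error function.
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

# Hypothetical: the trial network predicts log RR in [-0.60, 0.10], while an
# observational study reports RR = 0.50 (log RR ≈ -0.693).
p = two_sided_p_from_prediction(math.log(0.5), -0.60, 0.10)
print(p < 0.05)  # True (p ≈ 0.013): the real-world estimate conflicts
```

A real-world estimate well inside the prediction interval yields a large p-value and is considered consistent with the trial network.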

| Search results
The PRISMA diagram of the screening process is given in Table 2 (Table S1).
On average, 324 and 231 relapses per 1000 patient years were observed in randomized controlled trials and observational studies, respectively (Table S2). Figure 1 shows the network graphs of the treatment comparisons, with the edge thickness representing the number of studies and the node size representing the total number of patient years per treatment population.
One of the included studies caused noticeable inconsistency in the network of randomized controlled trials for the treatment comparison of fingolimod with placebo (p = 0.001, Figure S1). We thereby identified a pediatric study examining this treatment comparison and, due to its pediatric nature, this study was removed from the analysis. 38 The causes for the violations of consistency for the treatment comparisons placebo-interferon β-1a (Avonex) and placebo-glatiramer could not be traced (Figure S1).

| Risk of bias assessment
The bias assessment was sufficiently aligned between the two reviewers. Four out of the 48 studies were graded as having a low risk of bias (Table S3). Nine randomized controlled trials were graded as having some concerns and another nine as having a high risk of bias.
None of the 26 observational studies were evaluated as having a low risk of bias; 15 were graded as having a moderate risk of bias and eight as having a serious risk or higher, mostly caused by bias in the measurement of the outcomes (Table 3). Three observational studies contained insufficient information to grade the bias (Table S3).

| Meta-analysis on annualized relapse rates
The magnitudes of the pooled outcome for the comparator, that is, the annualized relapse rates for Rebif, available in eight randomized

T A B L E 3 Summary of bias tables with (A) the summary of bias table for the randomized controlled trials using the RoB-2 framework and (B) the summary of bias table for the observational studies using the ROBINS-I framework.
F I G U R E 2 Forest plot for the meta-analysis on the annualized relapse "Rate" for the comparator (Rebif). The meta-analysis was performed with subgroups based on study design, that is, randomized controlled trials (RCT) versus observational studies (RWD). In the right column the 95% confidence intervals (CI) are given. "Events" are the number of relapses and "time" is the number of patient-years.
| Network meta-analysis on annualized relapse rates

| Network meta-analysis
The annualized relapse rate ratios of the treatment comparisons within both the network of randomized controlled trials and the real-world evidence network are presented in Table 4 and Figure 3.
The forest plots for the other treatment comparisons that were compared only in randomized controlled trials (cladribine, mitoxantrone, ocrelizumab, and ozanimod) or only in real-world evidence (peginterferon and teriflunomide) can be found in Figure S2 A,B, respectively. The Bayesian method provided results similar to the frequentist results (Figure S3). The confidence intervals of the second-line treatments natalizumab, fingolimod and dimethyl fumarate from the randomized controlled trial network had more than 95% overlap with their complementing confidence intervals from the real-world evidence network. Alemtuzumab and the first-line treatments interferon β-1b, glatiramer and Avonex had 52%, 100%,

Note:
The upper and lower limits of the 95% confidence intervals for the real-world evidence network (upper triangle, black font) and the network of randomized controlled trials (lower triangle, blue font). The rate ratio is calculated by dividing the annualized relapse rate estimate of the disease-modifying therapy in the header row by the annualized relapse rate estimate of the disease-modifying therapy in the index column. Significant effect estimates are shown in bold. Light red marked cells: the effect estimates between the two networks are opposing, while at least one of the estimates is not statistically significant. Light green marked cells: the estimates in the two networks favor the same treatment, while at least one of the estimates is not statistically significant. Dark green marked cells: the estimates in the two networks favor the same treatment, while both estimates are significant. Grey marked cells: the effect estimate of the treatment comparison is absent in at least one of the networks. Abbreviations: ALZ, alemtuzumab; Av, interferon β-1a (Avonex); Be, interferon β-1b; CLA, cladribine; DMF, dimethyl fumarate; FTY, fingolimod; GA, glatiramer; MTX, mitoxantrone; NAT, natalizumab; OCR, ocrelizumab; OZA, ozanimod; PEG, peginterferon; PLA, placebo; Re, interferon β-1a (Rebif); TERI, teriflunomide.
F I G U R E 3 Forest plot of the network meta-analysis on the annualized relapse rate ratio (IRR). In blue: the annualized relapse rate ratio estimates obtained from the network of randomized controlled trials. In red: the estimates obtained from the network of observational studies.
86%, and 32% overlap, respectively. The magnitudes of effect observed in the real-world evidence network are on average 21% larger than the effects from the randomized controlled trial network, ranging between 6% and 40%, with the only exception being Avonex (−32%). Moderate heterogeneity was observed in the randomized controlled trial network (I² = 45%, τ = 0.11), while substantial heterogeneity was observed in the real-world evidence network (I² = 76%, τ = 0.27).
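The paper does not spell out how interval overlap was defined, but one natural definition (the length of the intersection as a fraction of the trial-network interval) can be sketched as follows; the interval limits below are hypothetical:

```python
def ci_overlap_fraction(rct_ci: tuple[float, float],
                        rwe_ci: tuple[float, float]) -> float:
    """Length of the intersection of the two intervals divided by the
    length of the RCT interval (0 = disjoint, 1 = fully covered)."""
    lo = max(rct_ci[0], rwe_ci[0])
    hi = min(rct_ci[1], rwe_ci[1])
    return max(0.0, hi - lo) / (rct_ci[1] - rct_ci[0])

# Hypothetical rate-ratio intervals for one treatment comparison.
print(ci_overlap_fraction((0.60, 0.90), (0.55, 0.95)))            # 1.0 (RCT CI fully covered)
print(round(ci_overlap_fraction((0.60, 0.90), (0.80, 1.20)), 2))  # 0.33
```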

| Similarity between the randomized controlled trial and real-world evidence effect estimates
Of the 28 treatment comparisons, 20 had nonsignificant posterior predictive p-values (p > 0.05), indicating that the real-world evidence effect estimate was not conflicting given the prediction interval of the randomized controlled trial network (Table 5). Five of the conflicting comparisons included glatiramer; the two other comparisons including glatiramer also had small posterior predictive p-values. There were three conflicting treatment comparisons with alemtuzumab (including the comparison with glatiramer). Only one of the conflicting comparisons contained neither glatiramer nor alemtuzumab. The effect estimates and the prediction intervals from both the network of randomized controlled trials and the network of observational studies can be found in Figure S4.

| Sensitivity analysis
Excluding the 19 studies graded with a high risk of bias (RoB-2) or serious or critical risk of bias (ROBINS-I) resulted in similar relative effect estimates (Table S4 A,B). The exception is the relative effect estimate for glatiramer, which differed greatly in the real-world evidence network after excluding observational studies graded with serious or critical risk of bias. Heterogeneity in the randomized controlled trial network dropped from 45% to 5% when excluding trials graded with a high risk of bias. Moreover, exclusion of high-risk studies in general led to better predictability of the real-world evidence estimate given the prediction intervals from the network of randomized controlled trials, with the sole exception being treatment comparisons with alemtuzumab (Table S4 C).

| DISCUSSION
In this network meta-analysis, we showed that the relative effect estimates, that is, the annualized relapse rate ratios, obtained from observational studies are predominantly comparable to those obtained from randomized controlled trials given the posterior predictive distributions. However, the estimates obtained from observational studies tended to have greater magnitudes of effect combined with wider confidence intervals, possibly explained by increased levels of heterogeneity, lower data quality and sources of bias. 14,15 The meta-analysis on the absolute effect estimate, that is, the annualized relapse rate, of the common comparator (Rebif) showed comparable rates for the pooled effect estimates between the randomized controlled trial and real-world evidence subgroups. This implied that the relative effect estimates, that is, the annualized relapse rate ratios, between the randomized controlled trial network and real-world evidence network were on the same scale and therefore could be compared. However, one should take into account that heterogeneity between the studies included in the meta-analysis was large, which could be explained by the clinical and methodological differences between the studies. Sensitivity analysis on a subset excluding studies with high, serious or critical risk of bias was not feasible due to the limited number of Rebif studies remaining in the subset.
Greater magnitudes of effects for the annualized relapse rate ratios were observed for the observational study network as opposed to the network of randomized controlled trials. Comi et al. reported these greater magnitudes as well in observational relapsing-remitting multiple sclerosis studies. 14 In their paper, Comi and colleagues attributed this difference in magnitudes to methodological divergence in the real-world as opposed to the experimental settings. 14 We would like to add that this methodological divergence leads to different sources of bias, such as selection and confounding bias, which may explain the differences in magnitudes. Moreover, publication bias could have played a role, as it is argued that studies with a successfully proven hypothesis have a greater chance of being published. 80 Larger confidence intervals were observed for the pooled results in the real-world evidence network as compared to the network of randomized controlled trials (even though estimates from individual observational studies were more precise). The same observation was made in the study of Jenkins et al. 15 These increased confidence intervals could be explained by the increased heterogeneity of populations included in observational studies in the real-world evidence network or by issues with the observational data quality. 81 These confidence intervals generally stretched beyond the lower and higher confidence limits from the randomized controlled trial network, with only Avonex and alemtuzumab being clear exceptions. The discrepancies for Avonex, for both the confidence intervals and the magnitude of the effect estimates, could however be explained by the high annualized relapse rate of the Avonex population in the older and relatively small randomized controlled trial of Etemadifar et al. and the indirect evidence related to the high annualized relapse rate of the Avonex population in the randomized controlled trial of Durelli et al. 42,43 Alemtuzumab, in its turn, was only studied by Kalincik et al. in the real-world evidence network. 17 For 20 of the 28 treatment comparisons there was no significant conflict between the real-world evidence effect estimate and the posterior predictive distributions obtained from the randomized controlled trial network (p > 0.05). Here it should be noted that the employment of uncalibrated posterior predictive p-values as a conflict diagnostic is known to lack robustness. 82,83 The choice of 0.05 as a significance level is also arbitrary and does not necessarily reflect clinical significance in this context. The results should thus be treated as exploratory, and further methodological research is required before robust measures of agreement between two networks of evidence can be proposed.
Respectively, five and three of the eight conflicting comparisons included glatiramer and alemtuzumab. As previously explained, alemtuzumab was only studied in one observational study, which may have caused the conflict with the posterior predictive distribution obtained from the randomized controlled trial network (possibly due to heterogeneity between the sole observational alemtuzumab study and the experimental alemtuzumab trials). In turn, the conflicts for the glatiramer comparisons may be explained by bias in the observational studies, as sensitivity analysis showed a very different relative effect estimate for glatiramer when excluding high risk of bias studies.
Excluding 19 high risk of bias studies even led to more congruent results. The similarity in effect estimates we observed could be attributed to similar clinical and demographic characteristics of multiple sclerosis patients between observational studies and randomized controlled trials, as reported by Rojas et al. 84 The approval of most disease-modifying therapies in the European Union is based on at least two pivotal phase III experimental trials. Thus, given the similarity in effect estimates between randomized controlled trials and observational studies considering the posterior predictive distributions of the randomized controlled trial network, it is argued that multiple observational studies together may supplement additional pivotal randomized controlled trials in relapsing-remitting multiple sclerosis, for instance, by facilitating the extrapolation of trial results to the broader patient population. However, this is only applicable for parallel group design studies, as we excluded crossover designs, single arm studies and label extensions to avoid complicating the network of treatment interactions. Moreover, one should take into account the greater magnitudes of the effect estimates observed in observational studies, accompanied by less confidence in these estimates, as opposed to effect estimates obtained from randomized controlled trials. And lastly, bias may have played a role in our (network) meta-analysis and thereby may have distorted the observations made. Therefore, bias-reducing measures (e.g., adjustment, matching, and inverse weighting), high quality data sources, evidence grading (e.g., through the usage of quality assessment frameworks such as CINeMA) and clinical expertise are pivotal to generate more convincing observational effect estimates. 81,85,86 These steps help to better approach the true underlying effect in clinical practice for the treatment of relapses in relapsing-remitting multiple sclerosis.
While our observations confirm that efficacy estimates from randomized controlled trials indeed translate to effectiveness estimates from observational studies for the treatment of relapses in relapsing-remitting multiple sclerosis, the hypothesis should still be tested for other types of multiple sclerosis, other neurodegenerative diseases and other, more distinct therapeutic areas to support the use of observational studies in the post-approval efficacy/effectiveness assessment of regulatory authorization procedures.

controlled trial arms were similar to those observed in five arms from the network of observational studies (IR = 0.37, 95% CI = [0.30-0.47] vs. 0.30, 95% CI = [0.18-0.51], respectively, for the random-effects model) (Figure 2). However, heterogeneity had a significant role in this meta-analysis, with the I² of the randomized controlled trial, real-world evidence and combined networks being 95%, 99% and 98%, respectively (Figure 2). This is confirmed by the standard deviation of the distribution of true effect sizes (τ = √0…).

F I G U R E 1 Network diagrams of the three networks with (A) the randomized controlled trial network and (B) the real-world evidence network. The size of the node is relative to the number of patient years per treatment population. The edge thickness and labels represent the number of studies that directly compare the interventions. The interventions: ALZ, alemtuzumab; Av, interferon β-1a (Avonex); Be, interferon β-1b; CLA, cladribine; DMF, dimethyl fumarate; FTY, fingolimod; Ga, glatiramer; MTX, mitoxantrone; NAT, natalizumab; OCR, ocrelizumab; OZA, ozanimod; PLA, placebo; PEG, peginterferon; Re, interferon β-1a (Rebif); and TERI, teriflunomide.
Thus, for both Avonex and alemtuzumab, more research on their effects on lowering the 2-year annualized relapse rate is necessary (in the form of experimental trials and observational studies, respectively) to validate these indicative observations related to the magnitude of effect estimates and confidence intervals. Moreover, some observational studies applied multiple matching techniques, as earlier described in the methods and as was the case with the study of Kalincik et al., and were therefore split into separate parallel group design studies. The study populations in these separated studies could therefore contain an unknown fraction of overlap, which may have biased the results. Sensitivity analysis showed robust relative effect estimates for the disease-modifying therapies compared to Rebif when excluding high risk of bias studies, except for glatiramer.

• Multiple sclerosis populations of which at least 5% are not diagnosed with relapsing-remitting multiple sclerosis (i.e., primary or secondary progressive multiple sclerosis or clinically isolated syndrome) where the outcomes are only given for the entire multiple sclerosis population.

Table 2
The 48 studies in total covered 37 121 patients with 70 736 patient years of follow-up (25 536 and 45 200 patient years for the 22 randomized controlled trials and 26 observational studies, respectively), with 18 719 relapses observed (8273 and 10 446, respectively). The patients had mean disease durations varying from 1.1 to 11.8 years and were on average between 14 and 50 years old. Baseline characteristics of the included studies are reported in the Data S1 (Table
T A B L E 4 Table of the 28 treatment comparisons and their types of evidence in the network (direct, indirect or both), posterior predictive p-values, effect estimates (ARRR RWE and ARRR RCT) and upper and lower limits of the prediction intervals from the network of randomized controlled trials (UL RCT and LL RCT, respectively).