A two‐stage prediction model for heterogeneous effects of treatments

Treatment effects vary across different patients, and estimation of this variability is essential for clinical decision‐making. We aimed to develop a model estimating the benefit of alternative treatment options for individual patients, extending a risk modeling approach in a network meta‐analysis framework. We propose a two‐stage prediction model for heterogeneous treatment effects by combining prognosis research and network meta‐analysis methods where individual patient data are available. In the first stage, a prognostic model to predict the baseline risk of the outcome. In the second stage, we use the baseline risk score from the first stage as a single prognostic factor and effect modifier in a network meta‐regression model. We apply the approach to a network meta‐analysis of three randomized clinical trials comparing the relapses in Natalizumab, Glatiramer Acetate, and Dimethyl Fumarate, including 3590 patients diagnosed with relapsing‐remitting multiple sclerosis. We find that the baseline risk score modifies the relative and absolute treatment effects. Several patient characteristics, such as age and disability status, impact the baseline risk of relapse, which in turn moderates the benefit expected for each of the treatments. For high‐risk patients, the treatment that minimizes the risk of relapse in 2 years is Natalizumab, whereas Dimethyl Fumarate might be a better option for low‐risk patients. Our approach can be easily extended to all outcomes of interest and has the potential to inform a personalized treatment approach.

treatment effects to choose a given patient's best option. Prediction models aim to identify and estimate the impact of patient, intervention and setting characteristics on future health outcomes.
Effect modification and risk modeling approaches estimating heterogeneous treatment effects (HTE) are gaining ground in meta-analysis. [1][2][3] Effect modification predicts individualized treatment effects via a model that incorporates a term for treatment assignment and treatment by covariate interaction terms. 1,2,4 However, selecting effect modifiers in a meta-analysis context is challenging for many reasons. These include low power and overfitting, misleading estimates because of unreliable, exaggerated, and highly influential interaction terms and the risk of discovering false subgroup effects because of weak prior knowledge. 1,[5][6][7] In addition, guidance is missing about the model selection techniques and shrinkage methods in meta-regression models optimal to examine effect modification. Alternatively, modelers can take advantage of the fact that patients' baseline risk is often a determinant of HTE. 1,[7][8][9] A risk modeling approach predicts the risk for patients based on their baseline characteristics. It then uses this risk to predict HTE at the absolute scale, typically within a randomized clinical trial (RCT). 1,2,7,[10][11][12][13] In this sense, risk modeling deals better with dimensionality, low power and limited prior knowledge than an effect modification approach. However, its use constrains the model's flexibility, as all prognostic factors also act as effect modifiers via a single coefficient. 2 The baseline risk expresses the probability of experiencing the outcome of interest in the study. Models that link the baseline risk to patient characteristics have been referred to as prognostic or risk models. These models can be integrated with the risk modeling approach. The first step is to develop a multivariable prognostic model that predicts the probability of the studied outcome blinded to the treatment -this can be done using observational or RCT data. We will term this baseline risk from now on, and a transformation of this risk will be termed baseline risk score. Several established methods exist for developing a prognostic model. [14][15][16][17] In the second step, relative treatment effects within RCTs can be estimated as a function of the baseline risk score using a prediction model. 18 This methodology allows for heterogeneity in baseline risk, in the relative treatment effects and consequently in the absolute treatment effects. The risk modeling approach has recently gained ground for personalized predictions for a given treatment. 1,11 Multiple sclerosis is an autoimmune disease of the central nervous system with several subtypes. The most common subtype is relapsing-remitting multiple sclerosis (RRMS). 19 Patients with RRMS present with intense symptoms (relapses) followed by periods without symptoms (remission). 20 Several treatments are available, but patient responses are heterogeneous, and each treatment has a different safety profile. 21 The evidence on drugs for RRMS has been summarized using network meta-analysis. 22,23 These networks typically synthesize published aggregated data, and their ability to explore how patient characteristics influence treatment effects (relative or absolute) across different patients is limited. 24 More efficient analyses use individual patient data (IPD), considered the gold standard in evidence synthesis. 24 IPD are necessary for estimating HTE and making personalized predictions of expected outcomes. 24,25 This article aims to define a methodological framework that allows personalized predictions for the most likely outcome under several treatment options. To achieve this, we adapt the risk modeling approach for the context of meta-analysis, extending it to a network meta-analysis framework. We combine prognostic modeling ideas to estimate the baseline risk score and include this score in an IPD network meta-regression (NMR). We apply this method to a set of placebo-controlled trials of three drugs in patients with RRMS. We also examine how different prognostic models to estimate the baseline risk score influence the predictive model's results and the estimated absolute and relative treatment effects. 15,26 We present results primarily for the absolute treatment effects. These will vary across patient groups, even if heterogeneity is present only in the baseline risk but not in the relative treatment effects. We describe the general framework applicable to any type of data and network, along with the detailed methods for our application to drugs for RRMS.

METHODS
In this section, we present a general description of the two-stage model, where we first obtain the baseline risk score and then estimate outcomes' probabilities as a function of the score. The baseline risk score is determined using established methods for the predictors' selection (eg, prespecified, stepwise, and penalized methods), for the estimation and shrinkage of the coefficients (eg, uniform, elastic net, and penalized maximum likelihood estimation method), and for its validation and presentation. 15,26,27 In the second stage, we used an IPD NMR model with the baseline risk score, developed in stage 1, as prognostic factor and effect modifier of the outcome. Our approach assumes that the set of selected variables captures both prognosis and effect modification adequately. We describe the approach for a dichotomous outcome of interest, although continuous outcomes can also be modeled with minor modifications. Along with the general description of the framework, we describe an application of our methodology, which predicts relapses in 2 years for individuals diagnosed with RRMS. In section 2.1, we describe the data we used, and in section 2.2, we present the notation used in our statistical models. In sections 2.3 and 2.4, we present the first and second stages and the methods implemented in the example. Finally, we present in section 2.5 the software and the functions used. In our application, we chose to implement the first stage in a frequentist framework to take advantage of the shrinkage options readily available in software and the second stage in a Bayesian framework.

Data description
We analyzed IPD from three phase III RCTs: AFFIRM, 28 DEFINE, 29 and CONFIRM 30 on patients diagnosed with RRMS. Altogether, the trials included 3590 patients randomized to placebo, Natalizumab, Dimethyl Fumarate, and Glatiramer Acetate. The outcome of interest was relapse or not relapse in 2 years. Table 1 presents the aggregated-level data of the trial arms as well as some baseline characteristics. We also had access to IPD from 1083 patients with RRMS, randomized to placebo arms included in nine other clinical trials. The latter data were provided by the Clinical Path Institute (https://c-path.org/) and are also described in Table 1. We excluded variables with more than 50% missing values. We used complete case analysis for the remaining variables, assuming that any missingness does not depend on the risk of relapse. We think this is reasonable as all variables are measured at baseline, and the outcome is observed in a 2-year's time window. Between correlated variables (correlation coefficient larger than 70%), we retained those that were biologically plausibly associated with the outcome based on the literature, their distribution and the amount of missing values. Finally, we transformed some of the continuous variables to approximate the normal distribution and merged categories with very low frequencies in categorical variables.

Notation
Let Y ij denote the dichotomous outcome for individual i where i = 1, 2, … , n j in the j study out of ns trials. PF ijk is the k prognostic factor and np is the total number of prognostic factors. An individual can develop the outcome (Y ij = 1) or not (Y ij = 0) according to their risk at baseline, which is a function of the prognostic factors, and we denote it with R ij . Assume we have a set of treatments  each denoted by t ∈  where t = 1, 2, … , T. The probability p ijt is the probability of the outcome for the i individual in j study under treatment t and depends on the treatment, baseline risk score and the interaction between the risk score and the treatment.

Stage 1: Developing a baseline risk score model
We developed risk models for dichotomous outcomes using two different methods. The first model was selected via the LASSO (least absolute shrinkage and selection operator) method. The second used a prespecified risk model. 26 Observational or RCT data may be used for this purpose. For the application, only placebo-controlled RCTs were available. Following the PATH recommendation, 31 when developing a baseline risk score using RCTs, not only the placebo arms but the entire trial population blinded to the treatment should be used. 7,31,32 Using only placebo arms only decreases the effective sample size. It may also lead to differential model fit on trial arms, biasing treatment effect estimates across risk strata, and exaggerating HTE. 1,7,31,33 It can make completely ineffective treatments appear to be beneficial in high-risk patients and harmful in low-risk patients. 2 The logistic regression model is The regression coefficients and intercept can be independent (each b 0j is given a prior distribution), exchangeable across studies. For model selection, methods that include some form of penalization are preferred to stepwise selection. 14,15,26 The latter include LASSO. However, including a set of predictors informed by prior knowledge (either in the form of expert opinion or previously identified variables in prognostic studies) has conceptual and computational advantages. 26,27,34 The estimated effects of the selected covariates also need some form of penalization to avoid extreme predictions. 14,15 In the illustration of our empirical example, we discuss several possibilities.
In our empirical example, we developed a baseline risk model for relapse in 2 years. We first examined if the available sample size was enough for the development of a prognostic model. 35 We calculated the events per variable (EPV), accounting for categorical variables and nonlinear continuous variables. 35 We also used Riley et al's method to calculate the required minimum sample size for a logistic model. 36 We set Nagelkerke's R 2 = 0.15 (Cox-Snell's adjusted R 2 = 0.11) and the desired shrinkage equal to 0.9.
We then fitted two main prognostic models. In the first, we included predictors with nonzero coefficients in the LASSO. 37 We used the LASSO method both for the variable selection and for estimating the coefficients. We used 10-fold cross-validation to find the optimal penalty parameter that maximizes the area under the curve. The penalty parameter we chose is the one within one SE of the minimum parameter, as previously recommended. 15 The second prognostic model included previously identified prognostic factors. Pellegrini et al analyzed the annualized relapse rate in the DEFINE (training dataset) and CONFIRM (validation dataset) trials, 38 both of them included in our dataset as described in section 2.1. They used different modeling approaches, including a fully additive model, ridge regression, LASSO, and elastic net regression. They selected the additive model, including 14 prognostic factors based on its discriminative ability. We estimated the coefficients in each of these prognostic factors in our dataset (section 2.1), using penalized maximum likelihood estimation shrinkage method. 15,39 The penalty's optimal value was chosen as the one that maximizes a modified Akaike's information criterion. 15 Both models use common effects for the intercept and the regression coefficients (b 0j = 0 , b kj = k ). This decision was taken because all three trials were designed by the same company using a similar protocol, as described in section 2.1, and any differences in the included populations shall be captured by including the baseline risk in the NMR model.
Finally, validation is essential for evaluating the performance of a prognostic model. 16 As external data were not available, we performed internal validation only. We estimated the c-statistic and the calibration slope of the developed risk models to assess the discriminative performance and calibration. To account for optimism, which is particularly important when comparing various models, we used the bootstrap method. 15 We produced 500 bootstraps samples and reran the model selection process and estimation in each sample. Then, we assessed the performance of each bootstrap-based model in the original sample. 40,41

Stage 2: IPD NMR model
We used the baseline risk logit as a covariate in an IPD NMR model in the second stage. 42 Each study j has an arbitrarily chosen baseline treatment h j ∈  and then each individual i is randomised to any treatment t ∈  included in study j.
The meta-regression equation in study j with a baseline treatment h j will be: where logit(R ij ) j is the average of logit-risk in all individuals in study j and u j is the log-odds for the reference treatment arm when the logit-risk is equal to logit(R ij ) j . The nuisance parameters u j are then considered to be independent. The relative treatments effects are the log-odds ratios (ORs) d jh j t and can be random (d jh j t ∼ N(D h j t , 2 D )) or common (d jh j t = D h j t ) across studies. Then, assuming consistency, we set the constraint D h j t = t − h j and ref = 0 where t is the summary estimate for log-ORs for treatment t vs the overall reference treatment (denoted as ref). Parameter g 0j is the coefficient of the risk score (as a prognostic factor) and should be independent across studies (so that each g 0j is given a prior distribution) to avoid compromising randomization. In case of model nonconvergence or studies following the same protocol (as in our example), exchangeable (g 0j ∼ N( 0 , 2 0 ), or common (g 0j = 0 ) coefficients can be used. 43,44 Similarly, g jh j t refers to the treatment effect modification of the risk score, for treatment t vs study's baseline treatment h j , and can be random (g jh j t ∼ N(G h j t , 2 G )) or common (g jh j t = G h j t ). Similarly to the relative treatment effects, the regression coefficients G h j t between two active treatments are parametrized using basic parameters t (of each active treatment vs control), where G h j t = t − h j and ref = 0. Finally, exp( t ) is the ratio of two ORs of treatment t vs the reference: the OR of a group of people with baseline score x over the OR in a group of people with baseline risk score x − 1.
Assume an overall reference treatment (like placebo or the current standard treatment) for which predictions are less important. Then, consider a patient at the mean (logit) baseline population risk, R who is under the reference treatment. This logit-probability of the outcome is denoted with, say . To make predictions for a new patient with predicted risk logit(R i ) and in treatment t, we use the equation: Estimation of a and logit(R) depends on the context within which we plan to make predictions: one can use registry data, observational studies or RCT data. For example, logit(R) can be estimated as the mean of logit(R ij ) across all individuals in the (randomized or observational) studies. Similarly, a can be estimated from the synthesis of all untreated or placebo arms.
In our empirical example, we used an IPD NMR model for comparing three active treatments and placebo in RRMS patients, using the predicted risk obtained from the first stage (LASSO and prespecified model), logit(R i ). We assume that study-specific relative treatment effects do not have any residual heterogeneity beyond what is already captured by differences in baseline risk. Consequently, we employ a common effect IPD NMR model, both in the relative treatment effects d jb j t and for the treatment effect modification of the risk score. Note that the between studies variance could not be estimated with only three studies (d jh j t = D h j t = t − h j , ref = 0, g jh j t = G h j t = t − h j , ref = 0). We also assumed common coefficients for the risk score (g 0j = 0 ), as all three studies are very similar in terms of design characteristics.
To estimate the logit-probability ( ) of the outcome of a patient in placebo who has a baseline risk score equal to the average risk logit(R) we synthesised external IPD placebo-arm data.

Implementation and software
All our analyses were done in R, 45 using R 3.6.2 version and in JAGS 46 (called through R). We make the code available in a GitHub library: https://github.com/htx-r/Reproduce-results-from-papers/tree/master/ ATwoStagePredictionModelMultipleSclerosis. To develop the baseline risk model (2.3), we used the pmsampsize command to estimate if the available sample size was enough for the developed model. The LASSO model was developed using cv.glmnet. We first fitted the prespecified model using the lrm command and used the pentrace command for the penalized maximum likelihood estimation. For the bootstrap internal validation, we used self-programmed R-routines. The IPD NMR model (2.4) was fitted in a Bayesian framework, and we used programming routines in the R2Jags package. 47 We set a normal distribution (N(0, 1000)) as prior distributions for all of the model parameters. We simulated two chains of 10 000 samples, discarded the first 1000 samples and thinned for every 10 samples. This was deemed appropriate based on autocorrelation plots and the visualization of the chain convergence.

Stage 1: Developing the baseline risk score
A total of 57 candidate prognostic factors were available. After exclusion of variables with missing data and highly correlated data, we ended up with 31 candidate prognostic factors ( Figures A1 and A2). For the LASSO model, we used 2000 RRMS patients with complete data, 742 of whom relapsed in 2 years. The full model had 45 degrees of freedom, and the EPV was 16.5. The recommended sample size for a newly developed model is 3456 patients, which is more than the available sample size. For the prespecified model, which does not involve the selection of variables, the small number of degrees of freedom (14) led to a large EPV of 53 and a recommended minimum sample size of 1076, which is well below the available sample size. Table 2 shows the two models, their coefficients and their performance with internal validation. Both models have almost the same discriminative ability, but the prespecified model has a much better calibration slope.
Both models predict almost the same mean risk for patients in our data (about 37%), as shown in Figure 1. The variation in the estimated baseline risk score is much higher in the prespecified model, using the predictors of Pellegrini et al 38 Figure 1 also indicates that the baseline risk could be a prognostic factor for relapse, as the baseline risk score is higher for patients who relapsed than for patients who did not, using both models. However, the overlap is considerable, as also shown by the c-statistics in Table 2. Table 3 shows the estimated parameters from the NMR model using the two different scored developed from the LASSO model and prespecified model. Both models indicate the baseline risk as an important prognostic factor for relapsing at 2 years, as shown by the large values for 0 . The estimates of log ORs for each treatment vs placebo ( t ) are very similar with both models. However, they provide slightly different summary estimates for the coefficients of effect modification, that is, DF , GA , N . Overall, none of the coefficients DF , GA , N is large. Figure 2 shows the estimated predicted probabilities to relapse within 2 years depending on the estimated baseline risk, via LASSO and prespecified risk models, under the four available treatment options. Figure A3 presents the same results on the OR scale. Both models give almost the same results for the treatment-effects estimation: Glatiramer Acetate seems to have the same performance as Dimethyl Fumarate in the observed range of baseline risk; placebo results in the highest risk to relapse. Natalizumab is a drug initially considered less safe than the other two active options and associated with increased mortality. 48,49 Table 4 shows the estimated predicted probabilities and the ORs of relapsing under all three available active treatments, using both models separately, for TA B L E 2 Estimated LASSO (least absolute shrinkage and selection operator) shrunk coefficients and coefficients from the prespecified model together with penalized maximum likelihood estimation  all patients, for low-risk patients (baseline risk <30%) and for high-risk patients (baseline risk >50%). The benefit of all three treatments depends on the risk group. For high-risk patients, the absolute benefit of Natalizumab compared with Dimethyl Fumarate is 15% using the prespecified model and 10% for the LASSO model. These correspond into 7 and 10 patients, respectively that need to be treated with Natalizumab to prevent one relapse. For low-risk patients, the absolute benefit of Dimethyl Fumarate compared with Natalizumab is 3% for the prespecified model and 2% for the LASSO model. The absolute differences between the treatments for all risk-groups are smaller using LASSO compared with the (penalized) prespecified model. The predictions for the three drugs and placebo for RRMS have been implemented in an interactive R-Shiny application available at https://cinema.ispm.unibe.ch/shinies/ koms/.

TA B L E 4
Predicted % probabilities and odds ratios (ORs, relative benefits) of relapse in 2 years, using baseline risk scores developed with the LASSO (least absolute shrinkage and selection operator) and prespecified models

DISCUSSION
We developed a prediction model for HTE that combines risk modeling and network meta-analytical methods to make personalized predictions for an outcome of interest. As the treatment options for each condition are numerous and patient characteristics often play an important role in modifying treatment effects, this approach could contribute to personalized treatment decisions. We illustrated our method by comparing three active treatments and placebo in patients with RRMS. Only a few characteristics are required, and doctors and patients can enter these into our online tool (https://cinema. ispm.unibe.ch/shinies/koms/) to estimate the risk of relapse in 2 years under four treatment options. The application to RRMS shows the approach's potential but is not ready for use in clinical practice. Decision-making tools need external validation with new patients. They need to provide evidence about all available treatment options for patient-relevant outcomes (eg, long-term disability status 50 ) and also consider safety and costs. Unfortunately, long-term results are not available from RCTs, and observational data would need to be integrated for this purpose. We did not have access to such data, which would have also allowed us to validate the model externally. Because of the limited data availability (only three RCTs), we used common-effects IPD-NMR to facilitate model convergence. The common-effects assumption can be relaxed if more studies are available. Making personalized predictions using a random-effects model will increase the uncertainty, and the interpretation of results in the presence of large heterogeneity will be challenging. Another extension of our model could accommodate aggregate-level data to increase the relevant information and ensure that the findings are representative. This is particularly important when the analyst is interested in comparing all available treatments and making corresponding predictions. 42 We aimed to examine whether and by how much the risk modeling results are influenced by the method used to develop the baseline risk model (ie, stage 1). A prespecified model used variables previously identified as important prognostic factors. 38 In addition, we used a variable selection approach via LASSO. The two models in stage 1 differ in terms of included variables, however the choice of the model had only a small impact on the results of stage 2. Whether this is a general feature of the approach or this happen to be true in this particular dataset should be subject of further research. More applications and a simulation study would be needed to pinpoint the sensitivity of the final results to the choice of the model in the first stage. The models' discrimination was small but sufficient for our aim. Indeed, risk models with a low predictive ability (0.6-0.65) are often adequate to detect risk-based HTE. 11 The available sample size was sufficient for the prespecified model (as it did not involve variable selection), whereas it was not for the LASSO model. The prespecified model's discriminative ability was slightly better, and the calibration slope much better than the LASSO model. This finding corroborates previous guidance in the literature that suggests that use of prior evidence in model development is advantageous. 2,26 The approach has several limitations. Our framework requires at least one IPD dataset for each included intervention to estimate all model parameters. IPD data are not readily available: several papers have documented the difficulties encountered in the process. [51][52][53] When condensing all patient information into the risk score, we assume that the selected variables adequately capture both prognosis and effect modification. This assumption is difficult to evaluate unless the outcome is well studied, and many prognostic studies exist on the topic, which is rarely the case. Besides, it is possible that other study-level characteristics, such as the risk of bias and the year of randomization, may also impact the treatment effects. If the number of studies permits, such variables can be added to the meta-regression model. In addition to these limitations, the common challenges encountered in prognostic modeling apply. Some prognostic factors may not be available for some individuals or even in whole studies. In this case, multiple imputation methods may be used to improve precision. 54 Finally, numerous candidate prognostic factors might render the available sample size insufficient and model selection challenging. 15 Further work is needed to extending the model and enhance its flexibility. We developed the baseline risk (stage 1) in a frequentist framework to take advantage of the software's shrinkage options. However, this might render the results from stage two to be overprecise because the approach does not account for the uncertainty in the baseline risk prediction. An alternative approach would be to carry out a simultaneous estimation of both stages within a Bayesian paradigm. This would allow uncertainty in the estimation during the first stage to be propagated through the model and reflected in the second stage results. Finally, more work is needed to validate both stages of the model. We validated the risk score (stage 1) only internally, using the bootstrap validation method, but an internal-external validation could also be an option. The predictive accuracy of our two-step framework has not been validated at all. In future work, its performance needs to be validated not only by discrimination and calibration but also metrics related to the absolute benefit. 55 The proposed approach offers many methodological advantages and opportunities for further development. Model selection approaches and methods to shrink coefficients to avoid extreme predictions are not well established in the meta-analysis context. 1,2 Our proposal shifts the variable selection problem in the logistic regression model for which penalization methods both in Bayesian and frequentist framework are well established. NMR models can also include aggregated data from published studies, so our approach can be extended accordingly. 42 Observational data can also be integrated to develop the risk score, calibrate or update the risk score model, and externally validate the model. Such data may also inform the baseline effects, or the relative treatment effects and their interactions with the score using appropriate bias-adjusted modelling. 56,57 Methods to include single-arm trials and expert opinion are also available and could be incorporated to extend the model further. [58][59][60][61] Overall, our framework is flexible enough and combines useful features of predictive modeling and evidence synthesis. It can be applied to as many treatments as required and can be easily extended to include various outcomes. It can inform patients and their doctors, manufacturers, and HTA agencies about the most appropriate treatment for each patient or patients' subgroup and hence contribute to personalized medicine.