A Bayesian hierarchical meta-analytic method for modelling surrogate relationships that vary across treatment classes

Surrogate endpoints play an important role in drug development when they can be used to measure treatment effect early compared to the final clinical outcome and to predict clinical benefit or harm. Therefore, such endpoints must be assessed for their predictive value of clinical benefit by investigating the surrogate relationship between treatment effects on the surrogate and final outcomes using meta-analytic methods. When surrogate relationships vary across treatment classes, such validation may fail due to limited data within each treatment class. In this paper, two alternative Bayesian meta-analytic methods are introduced which allow for borrowing of information from other treatment classes when exploring the surrogacy in a particular class. The first approach extends a standard model for the evaluation of surrogate endpoints to a hierarchical meta-analysis model assuming full exchangeability of surrogate relationships across all the treatment classes, thus facilitating borrowing of information across the classes. The second method is able to relax this assumption by allowing for partial exchangeability of surrogate relationships across treatment classes to avoid excessive borrowing of information from distinctly different classes. We carried out a simulation study to assess the proposed methods in three data scenarios and compared them with subgroup analysis using the standard model within each treatment class. We also applied the methods to an illustrative example in colorectal cancer which led to obtaining the parameters describing the surrogate relationships with higher precision.


Introduction
Advanced colorectal cancer is the second most common cause of cancer-related mortality in developed countries [26]. New advances in science have led to discovering of promising therapies which often are targeted to specific patient populations, for example defined by a genetic biomarker. This leads to clinical trials of smaller size, whilst the increased effectiveness of these therapies reduces the number of events or deaths and consequently lead to measurement of treatment effect on overall survival (OS) with large uncertainty. Therefore, surrogate endpoints allowing the measurement of treatment effect with higher precision have been investigated to accelerate the availability of these treatments to the patients. These alternative endpoints often can be considered a cost effective replacement of final clinical outcome, as they are particularly useful when they can be measured earlier, easier, more frequently compared to the final clinical endpoint or if they require smaller sample size and shorter follow up times [6]. the surrogate endpoint. Study level association requires data from many randomised controlled trials (RCTs) and can be investigated carrying out a bivariate meta-analysis [13,8,5]. In this paper we focus on the third level of association only.
A bivariate meta-analytical method that was developed by Daniels & Hughes [13] can be used to validate a candidate surrogate endpoint by evaluating the association pattern between the treatment effects on the surrogate and the final outcomes and to predict treatment effects on the final clinical outcome from the effects on surrogate endpoint This method implemented in a Bayesian framework can be used to evaluate a surrogate endpoint in a disease area overall or in each treatment class separately through a subgroup analysis.
Traditionally, a surrogate relationship between treatment effects on a surrogate endpoint and treatment effects on a final outcome investigated in a disease area regardless of treatment classes. For instance, in advance colorectal cancer (aCRC) progression free survival (PFS), tumor response (TR) or time to progression (TTP) have been investigated as potential surrogate endpoints for OS [7,18,12,10]. In previous work, Buyse et al. [7] have found a strong association between treatment effects on PFS and OS in this disease area, including in their meta-analysis studies on one treatment class only (modern chemotherapy). More recently, Ciani et al. [12] investigated the surrogate relationships in a aCRC across all modern treatments, including a range of targeted therapies, which led to suboptimal surrogate relationships in this disease area. They concluded that in a aCRC the surrogacy patterns could vary across treatment classes and a strong surrogate relationship observed in a specific treatment class may not directly apply across other treatment classes or lines of treatment. This may be particularly important for targeted treatments used only in a subset of population. For example anti-EGFR treatments are recommended for patients without a KRAS/panRAS mutation as these mutations are associated with resistance to the anti-EGFR therapies [28,32] and the surrogacy pattern might be different for this particular treatment class in this subset of population with this unique characteristic.
Furthermore, Giessen et al. [18] who investigated the surrogate relationships in aCRC including all available treatments and subgroups of therapy, inferred that for validation of surrogacy in targeted treatments such as anti-EGFR receptor directed monoclonal antibodies or anti-VEGF treatments further research is required once more data become available. Consequently, the assumption that a surrogate relationship remains the same across different treatment classes or lines of treatment does not seem feasible in aCRC, which may be the case in other disease areas. Therefore potential differences in surrogate relationships across classes should be investigated. This can be achieved by performing subgroup analysis using a standard model (e.g. Daniels and Hughes model [13]) or extending the standard model by adding another level to the hierarchical structure of the model for a surrogate relationship accounting for differences between treatment classes. In this paper, we propose two new methods which allow different degrees of borrowing of information for surrogate relationships across treatment classes aiming to obtain estimates of surrogate relationships with higher precision by taking advantage of borrowing of information [15,25,20]. The first approach assumes full exchangeability of surrogate relationships exploiting the similarity of surrogate relationships and borrowing information across treatment classes. The second method is able to relax this assumption, by allowing for partial exchangeability [30] of surrogate relationships across treatment classes to avoid excessive borrowing of information from distinctly different treatment classes. In this model, the parameters describing surrogate relationships can be either exchangeable or non-exchangeable giving more flexibility when the assumption of exchangeability is not reasonable.
The modelling techniques were demonstrated using an example in advanced colorectal cancer where the surrogate relationships may vary across treatment classes [12]. To assess their performance and compare them with subgroup analysis we carried out a simulation study. In the remainder of this paper, we present the illustrative example in Section 2, the two proposed models are introduced in Section 4, the results of the illustrative example and the simulation study are demonstrated in Section 6 and 7 respectively. The paper concludes with a discussion in Section 8.

Application: Advanced colorectal cancer
The data of the illustrative example were obtained from a systematic review that was conducted by Ciani et al. [12]. It includes 101 RCTs from 2003 to 2013 in advanced or metastatic colorectal cancer evaluating multiple interventions.
The review consist of trials that report treatment effects on OS or/and on alternative endpoints such as progressionfree-survival or tumor response (PFS, TR). OS was defined as the time from randomisation to time of death, PFS was set as the time from randomisation to tumor progression or death from any cause. Tumor response was estimated using objective tumor measurements which are measured using imaging methods and determined according to the Response Evaluation Criteria in Solid Tumors guidelines [37] or the World Health Organization recommendations [31]. The RCTs in the systematic review contain five treatment classes, the class of chemotherapies, the anti-epidermal growth factor receptor (Anti-EGFR) monoclonal antibodies class, angiogenesis inhibitors, other molecular-targeted agents (MTA) and intrahepatic arterial (IHA) chemotherapies .
Using these data, Ciani et al. [12] investigated whether these surrogate endpoints establish a strong surrogate relationship between treatment effects on the endpoints and on OS. They showed that none of the endpoints they investigated were found to satisfy the criteria that has been set to validate a very strong surrogacy between them and OS in advance colorectal cancer. Furthermore, they stated that PFS is deemed acceptable surrogate for OS whereas, TR should not be used as a surrogate endpoint for this final outcome. They concluded that good surrogacy observed in previous studies may not apply directly across other classes of treatment. More details about the studies and how the systematic review was designed can be found on Ciani et al. [12]. We refer these data as 'Ciani data' in the remainder of this paper.
In our example, we focused on a subset of these data examining the surrogacy patterns between treatment effects on TR and PFS and treatment effects on PFS and OS including data from three treatment classes. We obtained data from 35 studies reporting treatment effect on PFS and OS where, 15 of them belonged to the chemotherapy treatment class, 9 of them investigated anti-EGFR therapies and 11 anti-angiogenic treatments. Subsequently, 35 studies reported treatment effects on TR and PFS, 17 of them investigated chemotherapies, 8 and 10 studies anti-EGFR and anti-angiogenic treatments respectively. TR can be evaluated as a surrogate endpoint to treatment effect on PFS, as treatment effects on TR is typically measured earlier compared to treatment effects on PFS. Figure 1 provides a graphical representation of the data set we used. It illustrates the surrogate relationships between the treatment effects across classes on each pair of outcomes. There is quite a lot overlap between treatment classes and it appears the data follows similar surrogacy patterns across treatment classes on each pair of outcomes.
For the PFS-OS pair, there is a weak positive relationship between the log hazard ratios on PFS and the log hazard ratios on OS while for TR-PFS pair the relationship between the log odds ratios on TR and the log hazard ratios on PFS is negative. Figure 1: Scatterplots of PFS-OS, TR-PFS Individual patient data (IPD) were available for four RCTs and they were used to estimate within-study correlations [1,34,9,21]. By applying a bootstrap method (see section 3.3) we estimated three within-study correlations, one for each treatment class. We assumed that within treatment classes within-study correlations are the same.

Scale of the outcomes
The treatment effects on OS and on the PFS were modelled on the log hazard ratio scale logHR(OS), logHR(P F S) whereas, the treatment effects on TR were modelled on log odds ratio logOR(T R) scale. We retrieved the corresponding standard errors on PFS and OS from the 95% confidence intervals by applying the following formula se(logHR) = log(HR U )−log(HR L ) 2 * 1.96 and by using the standard formulae for the standard errors of logOR(T R) .

Standard surrogacy model
To investigate surrogate relationships within treatment classes, we performed subgroup analysis adopting a standard surrogacy model that was introduced by Daniels and Hughes [13] for the evaluation of potential surrogate markers.
Equation (1) corresponds to the within-study model where Y1i, Y2i are the estimates of treatment effects on surrogate endpoint and on the final outcome. These effects follow a bivariate normal distribution with µ1i and µ2i corresponding to the true treatment effects on the surrogate and the final clinical outcome respectively while, σ1i, σ2i and ρwi are the within-study standard deviations for both outcomes and within-study correlations between the treatment effects on the two outcomes for each study i.

Y1i
Y2i ∼ N µ1i Equation (2) corresponds to the between-study level where, the true effects on the surrogate endpoint µ1i are considered to be fixed effects for each study, in contrast to the true effects on the final outcome µ2i which are assumed to be random effects, i.e. follow a common distribution. The relationship between the true effects on the surrogate µ1i and the true effects on the final endpoint µ2i is described with a simple linear model. This relationship plays a very important role as it can be used to predict µ2i from known µ1i in a new study i. The parameters λ0, λ1, ψ 2 correspond to the intercept, the slope and the conditional variance of the linear model and measure the shape of the relationship and the strength of association between the treatment effects on the surrogate endpoint and the effects on the final outcome.
In the Bayesian framework, the Daniels & Hughes model can be implemented by assuming no prior knowledge about surrogate relationship by using vague prior distributions. This allows the data to dominate the posterior distribution even if the dataset is relatively small. The following prior distributions can be used: By adapting this method in our research, we applied this standard model to subsets of data that consist only of one class of treatment examining the surrogate relationship of each subgroup separately, taking motivation from similar analyses in clinical trials [2,19]. This kind of analysis is very practical when surrogacy patterns in a given disease area are very different and the treatment classes consist of many studies. By performing subgroup analysis using the standard model, we explore potential differences in the surrogacy patterns across treatment classes and use them as a reference for results obtained with the newly developed methods.

Criteria for surrogacy
As we mentioned previously, the parameters λ0, λ1, ψ 2 play a very important role, as they are used to evaluate surrogacy. A good surrogate relationship should imply that λ1 = 0 as slope establishes the association between treatment effects on the surrogate and the final outcome. Subsequently, having ψ 2 = 0 implies that µ2i could be perfectly predicted given µ1i. The parameter λ0 corresponds to the intercept and is expected to be zero for a good surrogate relationship. This ensures that no treatment effect on the surrogate will imply no effect on the final outcome.
These three criteria proposed by Daniels & Hughes [13], will be referred as surrogacy criteria in the remainder of this paper. A simple way to examine these surrogacy criteria is to check whether or not zero is included in the 95% credible intervals (CrIs) of λ0, λ1 and to compute the Bayes factor for the hypothesis H1: ψ 2 = 0. The model with ψ 2 = 0 is a nested model within the standard model [23] and to compare these models Bayes factors can be computed using the Savage Dickey density ratio [38]. To implement the Savage Dickey density ratio proper prior distributions for ψ are needed. In our research a moderately informative half normal prior distribution N (0, 2)I(0, ) was used for the conditional standard deviation. A strong surrogate relationship requires zero to be included in the CrI of λ0, zero not to be included in the CrI of λ1 and the Bayes factor of ψ 2 to be greater than 3.3 [27].

Cross-validation
One of the main aims of this paper is to explore whether the two hierarchical methods, that we propose in the next section, improve the predictions of treatment effect on the final outcome (by reducing bias and/or uncertainty) compared to subgroup analysis using the standard model. To evaluate this, a cross-validation procedure was carried out. It is a similar to the 'leave-one-study-out' procedure that was described by Daniels & Hughes [13] and it is repeated as many times as the number of studies in the data set. In a simulated data scenario, this can be used to draw inferences about predicting the true effect on the final endpoint µ2i in a 'new' study i but, in a real data scenario true effects are unknown and therefore, we can only compare the observed values Y2i with their predicted intervals. For each study i(i = 1, .., N ), treatment effect on the final endpoint Y2i is omitted and assumed unknown. This effect is then predicted from the observed effect on the surrogate endpoint Y1i and by taking into account the treatment effects on both outcomes from the remaining studies. In a Bayesian framework it can be achieved by performing Markov chain Monte Carlo (MCMC) simulation.
Following this, we checked whether predictive intervals of Y2i contain the observed estimate Y2i. We predicted the estimatesμ2i with variance σ 2 2i + var(μ2i|Y1i, σ1i, Y 1(−i) , Y 2(−i) ) where Y 1,2(−i) denote the observed treatment effects from the remaining studies without the study that is omitted in i th iteration.

Bootstrapping method
A bootstrapping method was used to estimate the within-study correlations ρwi between the treatment effects on the surrogate and the final outcome by drawing 5000 bootstrap samples with replacement from the IPD [14]. The treatment effects on all outcomes (TR, PFS and OS) were estimated for each bootstrap sample by fitting Cox regression to data on PFS and OS and a logistic regression to data on TR. Pearson correlation coefficient were obtained and used as a measure of association for the two pairs of outcomes: TR-PFS and PFS-OS.

Methods for surrogate endpoint evaluation incorporating data from different treatment classes
When subgroup analysis is used to investigate surrogate relationships within treatment classes the validation process may fail due to limited data resulting in inaccurate posterior means and CrIs of the parameters describing surrogate relationships obtained with considerable uncertainty [2]. We propose two hierarchical models to investigate surrogate relationships within treatment classes allowing different degrees of borrowing of information about the relationships across classes, as alternative approaches to subgroup analysis with the standard model. These models can be applied to account for differences in surrogacy patterns whilst using data from multiple treatment classes or lines of treatment taking advantage of the attractive statistical properties of exchangeability [15,25,20].

Hierarchical model with full exchangeability (F-EX)
Our first approach extends the standard model accounting for differences in surrogacy patterns across different treatment classes [3,16,11]. Similarly as in the standard model, at the within-study level we assume that correlated and normally distributed observed treatment effects Y1ij and Y2ij in each study i estimate the true treatment effects µ1ij and µ2ij on the surrogate and final outcomes respectively. In addition, by introducing index j denoting treatment class j we account for the differences between the classes.

Y1ij
Y2ij ∼ N µ1ij The parameters σ1ij, σ2ij, ρwij correspond to the within-study variances and within-study correlations for each study i in treatment class j. Similarly as in the standard model, the true effects µ1ij on the surrogate endpoint are modelled as fixed effects.
In contrast to the standard surrogacy model, this method assumes unique surrogate relationships between true treatment effects on the surrogate endpoint and the final outcome for all treatment classes in a single model, allowing for borrowing of information across the treatment classes. Each relationship between the true effects on the surrogate endpoint µ1ij and the final outcome µ2ij is described by a linear model where, λ0j denotes the intercept of the j th treatment class and λ1j establishes the relationship between treatment effects on surrogate and final outcomes within the treatment class j. To evaluate whether a candidate endpoint is considered a valid surrogate endpoint in a given treatment class, all three surrogacy criteria need to be met for this particular class. Implementing this model in the Bayesian framework, we place non-informative prior distributions on the model parameters such as: F-EX model extends the standard model (described in section 3.1) by including an additional layer of hierarchy to the linear relationship between true effects on the surrogate and the final outcome, assuming that slopes and intercepts are fully exchangeable across treatment classes. This can be implemented by placing common normal distributions on λ0j and λ1j with means and variances β0, ξ 2 0 and β1, ξ 2 1 leading to borrowing of information across treatment classes. Hierarchical models have desirable statistical properties that allow us to improve our inferences taking advantage of borrowing of information from other treatment classes. The exchangeable estimates, however, are shrunk towards the means β0, β1 and the amount of shrinkage depends on the number of studies within each class, the between treatment class heterogeneity [30] and the number of treatment classes. Although these statistical properties are very attractive in terms of potential reduction of uncertainty around the parameters of interest, they are advantageous only when the assumption of exchangeability is reasonable, otherwise there is a danger of excessive shrinkage.

Hierarchical model with partial exchangeability (P-EX)
F-EX method can be extended into a method with partial exchangeability similarly as the method proposed by Neuenschwander et al. [30]. This model is able to relax the assumption of exchangeability allowing the parameters of interest for each class to be either exchangeable with all or some of the parameters from other treatment classes or non-exchangeable. The proposed method is more flexible compared to F-EX model, in particular in data scenarios where the assumption of exchangeability is not reasonable for all treatment classes.
The within and the between study level of this model is exactly the same as in the method with full exchangeability where, Y1ij, Y2ij are the treatment effects on the surrogate and final clinical outcomes and they follow a bivariate normal distribution with mean values corresponding to the true treatment effects µ1ij and µ2ij on the two outcomes in the hierarchical framework.

Y1ij
Y2ij ∼ N µ1ij However, the parameters of slopes are modelled in a different way compared to F-EX model. In this approach two possibilities arise for these parameters for each treatment class j, when pj = 1 the parameter λ1j can be exchangeable with some or all the parameters of the slopes from the other treatment classes via an exchangeable component where the parameter follows a common normal distribution as in F-EX model. On the other hand, when pj = 0 the slope can be non-exchangeable with any parameters from the other treatment classes. In this case a vague prior distribution can be placed on the parameter, as in the standard model. The method evaluates the degree of borrowing of information for each parameter λ1j by using these two components with respective mixture weights.
The main advantage of this method is that it allows the mixture weights to be inferred from the data. In each MCMC iteration, the sampler chooses between the two components by using a Bernoulli distribution pj ∼ Bernoulli(πj). By calculating the posterior mean of this Bernoulli distribution we derive the mixture weights of each treatment class. The hyper-parameters πj of the Bernoulli prior distribution can be either fixed or, in a fully Bayesian framework, they can follow a prior distribution for example, a Beta distribution πj ∼ Beta(1, 1). We have used fixed πj, since placing a prior distribution required longer chains to converge and provided almost the same results.
In a special case where pj = 1 for all treatment classes, P-EX model reduces to full exchangeability model as it uses only the exchangeable component. Having mixture weights pj = 0 for all treatment classes makes the P-EX model equivalent to subgroup analysis using the standard model as only the non-exchangeable component is used to estimate λ1j in this case. In a Bayesian framework vague prior distributions can be placed on the parameters β0, β1, ξ0, ξ1, µ1ij as in F-EX model.

Software Implementation and computing
All models were implemented in OpenBUGS [36] where posterior estimates were obtained using MCMC simulations performing 50000 iterations (after discarding 20000 iterations as burn-in period). Convergence was assessed visually by checking the history, chains and autocorrelation plots using graphical tools in OpenBUGS and R. All estimates are presented as means with corresponding 95% CrIs apart from the estimates of conditional variances where the median was used as a measure of central tendency since their posterior distributions were very skewed. The cross-validation procedure was performed in R using R2OpenBUGS [36] package to execute OpenBUGS code multiple times.

Results from advanced colorectal cancer example
Before starting our analysis of surrogate relationships in aCRC, within-study correlations for each treatment class were estimated using a bootstrapping method (as described in section 3.3). Table 1 presents the within-study correlations between treatment effects on each pair of outcomes for each of the treatment classes. The first aim of our analysis was to explore potential differences in surrogacy patterns across treatment classes.
Therefore, the two proposed models and subgroup analysis using standard model were applied deriving posterior dis- To be consistent with previous works [7,18], we explored the surrogacy patterns across treatment classes for PFS-OS and TR-PFS pairs of outcomes.

Results from subgroup analysis using the standard model
The results of subgroup analysis presented in Table 2 showed strong surrogacy between the treatment effects on PFS and the effects on OS in the class of chemotherapies and the anti-angiogenic treatment class since all three criteria for surrogacy were satisfied. In contrast, we can infer that the surrogacy between treatment effects on PFS and the effects on OS in the anti-EGFR treatment class was found to be weak, as the 95% CrI of the posterior distribution of the slope included zero.
Investigating the surrogacy on TR-PFS pair we found the same pattern, thus we can infer that there was an acceptable surrogate relationship between treatment effects on TR and PFS in the chemotherapy and the anti-angiogenic classes, as the 95% CrIs of λ01 and λ03 included zero, the 95% CrIs of λ11 and λ13 did not contain zero and there was substantial evidence using Bayes factors in favour of the hypotheses H1 : ψ 2 1 = 0, and H1 : ψ 2 3 = 0 (see details in the supplementary material C.2). The relationship was negative overall, since the slopes were negative across classes. On the other hand, the surrogacy criteria indicated poor surrogacy between the treatment effects on TR and the treatment effects on PFS for anti-EGFR class, since the 95% CrI of the slope λ12 included zero.
After estimating the surrogacy criteria across treatment classes, we carried out cross-validation procedure to predict the treatment effects µ2i on the final outcome. The results in Table 3 Table 4 shows results of applying the F-EX model to data for all treatment classes for the two surrogate relationships between the treatment effects on PFS and OS and between the effects on PFS and TR. For the PFS-OS pair of outcomes, the surrogacy patterns were very similar in the anti-angiogenic and chemotherapy treatment classes as both classes satisfied the surrogacy criteria and the slopes were of similar magnitude. The 95% CrIs of λ01 and λ03 included zero indicating that zero treatment effect on the surrogate implies zero treatment effect on the final outcome for these two classes. The intervals of λ11 and λ13 did not contain zero indicating positive association as the two slopes were positive. The conditional variances in these two classes were small indicating strong surrogate relationships which supported by substantial evidence in favour the hypotheses H1 : ψ 2 1 = 0, H1 : ψ 2 3 = 0 using Bayes factors (see details in the supplementary material C.2). On the other hand, the relationship was weak in the anti-EGFR treatment class failing to meet one of the criteria, as the 95% CrI of the slope λ12 includes zero.

Results from F-EX model
On the contrary, for TR-PFS pair of outcomes all three surrogacy criteria were satisfied across treatment classes taking advantage of the assumption of exchangeability for the parameters λ0j and λ1j. This implies that TR was an acceptable surrogate endpoint for PFS across treatment classes in this data set.
Moving to the results from the cross-validation procedure of F-EX are presented in Table 5, the model fitted the     indicating that TR was an acceptable surrogate for PFS across treatment classes in the Ciani data set.

Comparison of the results from F-EX, P-EX and those from subgroup analysis
Comparing the aforementioned methods in regards to the surrogacy criteria on the PFS-OS pair, we can conclude that F-EX model estimated the parameters of the surrogate relationships with reduced uncertainty compared to the subgroup analysis and P-EX model taking advantage of borrowing of information across classes. P-EX relaxes the assumption of exchangeability resulting in borrowing of information reduced on average by 3.6%. It gave narrower CrIs of the parameters of interest compared to subgroup analysis but larger than those obtained form F-EX model.

Furthermore, both F-EX and P-EX methods can distinguish between the different surrogacy patterns avoiding to
give over-shrunk estimates, although they allow different degrees of borrowing of information. In particular, this pair of outcomes (PFS-OS) illustrates well the impact of number of studies per class on the degree of borrowing of information. In general, borrowing of information is determined by the number of studies within treatment classes, between treatment classes heterogeneity as well as, the number of treatment classes. In this case, the fewer studies we have within a treatment class, the bigger is the impact of borrowing of information and the reduction in uncertainty of the estimates of surrogate relationships. This effect was particularly strong for the anti-EGFR treatment class. Figure 2: 95% Credible intervals of λ 1j and λ 0j for the PFS-OS pair of outcomes On the other hand, TR-PFS pair is a good example to visualise the performance of the hierarchical methods when between treatment class heterogeneity is relatively large. In this case, subgroup analysis performed equally well as the proposed methods in terms of uncertainty of the CrIs of the paramaters describing the surrogate relationships. For instance by using F-EX and P-EX models, we did not observe any decrease in uncertainty around λ1j and λ0j in the class of chemotherapies and the anti-EGFR treatment class and there was slightly more uncertainty around λ1j in the anti-angiogenic class. This is because the between treatment classes heterogeneity was not relatively large for TR-PFS pair and hence there was not much shrinkage. Furthermore using subgroup analysis, the surrogacy criteria were satisfied in the anti-angiogenic and the chemotherapy treatment classes (sample size was 17 and 10 for each class respectively), but failed in the anti-EGFR class (zero was included in the 95% CrI of the slope) due to the limited number of studies in this treatment class (only 8 studies available). By applying P-EX and F-EX models, we were able to draw different inferences for the surrogacy pattern in the anti-EGFR class as these methods allow for borrowing of information for the surrogate relationships from the other classes. As illustrated in Figure 3, both hierarchical models moved the 95% CrI of the slope towards to the direction of the CrIs of the other two classes satisfying the surrogacy criteria across all treatment classes.
By carrying out cross-validation, we wish to ensure that not only predictive intervals contain the observed values but also that they are sufficiently narrow. In general, adding a hierarchical structure to slopes and intercepts reduces the uncertainty and leads to more precise predictions compared to those obtained from subgroup analysis. For the PFS-OS pair of outcomes, the accuracy of the predictions was very similar across all methods (similar discrepancies) but the uncertainty varied depending of the level of borrowing of information. F-EX model gave on average the most precise estimates (μ2ij) having the narrowest 95% predictive intervals of the effect on the final outcome (smallest width ratio seen in Tables 3, 5, 7) reducing the overall uncertainly by 5%. The benefit was smaller in the chemotherapy class where the number of studies was much larger compared to the anti-EGFR treatment class where we had only 8 Figure 3: 95% Credible intervals of λ 1j and λ 0j for the TR-PFS pair of outcomes studies available. Overall, P-EX performed better than subgroup analysis and equally well with F-EX regarding the uncertainty of its predictions. This indicates that the assumption of full exchangeability seems to be plausible for this pair of outcomes and P-EX model was able to identify this.
In contrast, for the TR-PFS pair subgroup analysis using the standard model was a robust approach in terms of the accuracy of its predictions. Although the overall discrepancies were very similar across models, F-EX and P-EX had significantly higher discrepancies compared to subgroup analysis in the anti-angiogenic class. This implies that the posterior means of the true effects were to some extent 'overshrunk' due to excessive borrowing of information.
Similarly, there was no significant decrease in the degree of uncertainty of the estimatesμ2ij of F-EX and P-EX models. The results indicate that the hierarchical methods performed better compared to subgroup analysis in terms of uncertainty only in the class of chemotherapy and the anti-EGFR treatment class giving 1.5% and 3% narrower predictive intervals respectively. This kind of behaviour might be caused by the relatively large between treatment class heterogeneity and the small number of treatment classes for this pair.

Simulation study
The hierarchical methods developed in this paper allow different levels of borrowing of information for the parameters of interest. F-EX model assumes exchangeability for slopes and intercepts whilst, the P-EX model allows for partial exchangeability for these parameters. We carried out, a simulation study to assess the performance of the hierarchical methods and to compare them with subgroup analysis using the standard model.

Aims
As F-EX and P-EX models allow for different degrees of borrowing of information for the parameters of λ1j, the primary aim of our analysis was to estimate the performance of these parameters across three simulated scenarios and to compare the results of the proposed models and subgroup analysis. In addition we set out to investigate which method gives the best predictions of the true effects on the final outcome µ2ij. The last aim of our analysis was to examine which method predicts a strong surrogacy pattern better, checking whether the three surrogacy criteria (described in section 3.1) were satisfied or not across all simulations for each data scenario in the simulation study.

Methods
We simulated data under three different scenarios with 1000 simulations each, having 5 treatment classes and eight studies per class in a simulation for the first two scenarios and six studies for the third one. We assumed that data in each treatment class had a different heterogeneity pattern. Therefore, to have a control over such heterogeneity patterns when simulating the data we needed to make an assumption about the distribution of the true effects both on the surrogate and the final endpoints. The standard model by Daniels & Hughes assumes fixed effect for the true effects on the surrogate endpoint (no common distribution) so instead, we simulated data using a product normal formulation of BRMA (equation set 5), assuming normal random effects on the surrogate endpoint. Apart from this assumption, this method is the same as Daniels & Hughes model using a bivariate normal distribution to describe the within-study variability and a linear relationship to model the association between the surrogate and the final outcome. This decision however, can lead to results of the simulation obtained with slightly increased uncertainty because we fit the data with a model that makes fewer distributional assumptions than the model we used to simulate them.

Y1i
Y2i Furthermore, a cross-validation procedure was applied to each method across the simulated data scenarios. In the simulation study, the true effect on the final endpoint µ2ij was known, since it had been simulated ,therefore the cross-validation procedure was applied on the true effects (in real data scenarios we compare the predicted effect with the observed effect) by checking whether the simulated value of the true effectμ2ij was included in the predictive interval of µ2ij with the corresponding variance var(µ2ij|Y1ij, σ1ij, Y 1(−ij) , Y 2(−ij) ).

Design of scenarios
To test and compare the methods and to illustrate their applicability, we carried out a simulation study in three data scenarios assuming different patterns of surrogacy within treatment classes.
In the first scenario, where our aim was to visualise the properties of exchangeability, we simulated data in five treatment classes assuming high degree of similarity for their slopes and intercepts. Data in each treatment class were simulated separately assuming strong surrogacy for each individual class but weak overall. The slopes were generated using normal distributions having very similar means (but sufficiently different to ensure weak overall surrogacy pattern) and the same standard deviation: To achieve this, we generated three out of five treatment classes with strong surrogate relationship and the remaining two classes with a weak surrogate relationship. Each slope was drawn from a different normal distribution having similar means and the same standard deviation λ11∼ N (0.5, 0.03), λ12∼ N (0.7, 0.03), λ13∼ N (0.9, 0.03), λ14∼ N (1.1, 0.03), λ15∼ N (1.3, 0.03). The intercepts followed the same normal distribution as in the previous scenarios, while the conditional standard deviations were chosen to be very large for the second and the forth treatment classes ψ2 = 0.30, ψ4 = 0.33 creating weak surrogate relationships for these two classes. On the other hand, small conditional standard deviations were used for the first, the third and the fifth class: ψ1,3,5 = 0.06 creating strong surrogate relationships. The same pattern was also followed for the between-study correlations ρ b1 = 0.95, ρ b2 = 0.75, ρ b3 = 0.97, ρ b4 = 0.75, ρ b5 = 0.985.

Results
The following sections list the performance of the posterior means ofλ1j, the performance of the posterior means of µ2ij as well as the probabilities of predicting a strong surrogacy pattern for each class by each method. To evaluate the goodness of fit of the models, we calculated the coverage probability of the 95% CrIs of λ1j and the 95% predictive intervals of µ2ij. To integrate the bias and the variance ofλ1j andμ2ij in a single measure we monitored the root mean squared error (RMSE) of these estimates. In order to investigate potential decrease in the degree of uncertainty of the estimates, as a result of borrowing of information across treatment classes, we calculated ratios of the width of the 95% CrIs. The width ratio w λ F EX,(P EX)  Table 8 presents averages of the measures we monitored forλ1j over the five classes of treatment. All the models performed equally well in terms of the coverage probability of the 95% CrIs of λ1j giving acceptable probabilities, as more than 95% of the CrIs of λ1j contained the true value across scenarios. Monte Carlo errors were small for all the scenarios implying good accuracy of the Monte Carlo samples and that convergence was achieved for all the methods. In the first data scenario, where the treatment classes were very similar in terms of surrogacy patterns, F-EX was superior compared to subgroup analysis as it gave posterior means of slopes with the lowest RMSE and reduced uncertainty (narrower 95% CrIs) due to borrowing of information across classes, P-EX achieved almost the same level of borrowing of information as its mixtures weights were very close to 1 across treatment class (see details in the supplementary material C.1). In the second scenario when the exchangeability assumption was not reasonable for a particular class, P-EX model yielded the most robust results giving posterior means with the smallest RMSE, reducing the degree of borrowing of information for the class where the slope was distinctly different (the posterior of the mixture weight in this class was p1 = 0.6) whilst borrowing of information across the remaining classes was still guaranteed (p2, p3, p4, p5 ≈ 0.97). More specifically, the model gave a posterior means of the slopesλ1j with the smallest RMSE and the lowest degree of uncertainty across classes (see details in the supplementary material A.2). Additionally, F-EX and P-EX gave better posterior means of the slopes in terms of RMSE and uncertainty compared to subgroup analysis when data were limited and the surrogacy patterns varied across classes as simulated in the scenario.
The last column of Table 8 shows the probabilities of strong surrogacy for each scenario across models. F-EX and P-EX methods predicted a strong surrogacy pattern (based on the three surrogacy criteria) better compared to subgroup analysis across all scenarios. In the first data scenario where the surrogate relationship was designed to be strong for all the classes, F-EX and P-EX models managed to predicted strong surrogacy in the 89% simulations, whereas subgroup analysis predicted only the 82.6% of them. Similarly in the second scenario, where the surrogate relationships were also designed to be strong across all classes, P-EX and F-EX achieved more than 91% correct predictions about the surrogacy pattern, 4% more compared to subgroup analysis.  Table 9 presents the results from the last scenario, where the surrogate relationship varied across classes. F-EX and P-EX methods were able to predict a strong surrogacy pattern with higher probability compared to subgroup analysis in the classes where the surrogate relationship was designed to be strong. At the same time the methods identified classes for which the association was designed to be weak, despite the assumption of exchangeability of    In the first scenario, F-EX and P-EX models outperform subgroup analysis in terms of the RMSE and the uncertainty ofμ2ij, however, there is no winner between them as both had almost the same degree of borrowing of information. They yielded 17.5% narrower predictive intervals compared to subgroup analysis.
In the second scenario, P-EX yielded posterior means with the smallest RMSE and CrIs with the smallest width ratio across classes. Furthermore, P-EX method gave the most robust results for the 'extreme' treatment class relaxing The third scenario gave similar results as the first one in terms of the uncertainty and the RMSE ofμ2ij. F-EX P-EX models performed almost equally, whilst subgroup analysis with the standard model was the worst approach since it gave inflated predictive intervals, larger RMSE and worse MCE in all cases.

Discussion of the results
The aim of the simulation study was to illustrate and assess the performance of the methods under different data scenarios. All the methods gave slightly higher than 95% coverages probabilities. It means that they derived more conservative CrIs of parameters than expected. This is because, we did not use the same model to simulate and analyse the data. The model we used to simulate our scenarios makes distributional assumptions about the true treatment effects on the surrogate endpoint (being random effects) whilst the models we used to perform our analyses assumes fixed effects for true effects on the surrogate endpoints as explained in section 7.2. In the first scenario where the assumption of exchangeability is reasonable, F-EX and P-EX model performed equally well and better than subgroup analysis giving on average narrower 95% CrIs of λ1j and 95% predictive intervals of µ2ij. This indicates that P-EX model successfully identified the correct level of borrowing of information inferring that the mixture weights should be very close to 1. P-EX model was the best choice in the second scenario where there was a treatment class with distinctly different surrogacy pattern. It relaxed the degree of borrowing of information for the 'extreme' class giving the most precise posterior means of the slopes. Moreover, F-EX and P-EX models performed equally well in terms of predictions of the true effect on the final endpoint, reducing the width of predictive intervals by 22% compared to subgroup analysis. Last but not least, the proposed methods predicted a strong surrogate relationship better compared to subgroup analysis across all data scenarios. In the last scenario in particular, the proposed hierarchical methods were able to predict strong surrogacy significantly better compared to the subgroup analysis for the treatment classes where the surrogacy was designed to be strong. This illustrates well the benefits of using hierarchical methods when data are limited. Furthermore, F-EX and P-EX could easily distinguish between the different surrogacy patterns as they did not overestimate the strength of surrogacy for the classes where surrogacy was designed to be weak.

Discussion
We developed two hierarchical models allowing to account for distinct treatment classes when examining the sur- F-EX model is appropriate only when the degree of similarity of surrogate relationships is relatively high. It can offer substantial gains in precision, reduced RMSE of the posterior means of the parameters describing surrogate relationships and it can improve the predictions of the true effects on the final endpoint. For example, F-EX model gave posterior means of the slopes and predicted effects with reduced uncertainty (smaller credible intervals) compared to subgroup analysis for the first simulated data scenario and for the illustrative example on PFS-OS pair where the parameters describing the surrogate relationship were very similar and the assumption of full exchangeability was reasonable. These findings are consistent with the results from other hierarchical Bayesian methods which assume full exchangeability and were developed in other research areas [3,16]. However, P-EX model can achieve the same degree of borrowing of information in such data scenarios, where the similarity between classes is high. The model estimates the degree of borrowing of information using its exchangeable and non-exchangeable components with respective mixture weights. These weights, estimated by the model, quantify the degree of similarity of the treatment classes in the data which can lead to relaxing the assumption of exchangeability when it is necessary by using the non-exchangeable component with higher weight. For instance, when between treatment class heterogeneity is relatively large or there is a treatment class with distinctly different surrogacy pattern, P-EX model is the best option as it avoids the excessive borrowing of information, as illustrated in the second scenario of the simulation study. All the above illustrate the benefits of partial exchangeability, as described by Neuenschwander et al. [30] in their work.
Subgroup analysis using the standard model is a simple method which will perform well when there are sufficient data available for each class, but it will produce estimates with higher bias and uncertainty when data within a class are limited.
Although, the proposed methods provide additional robustness to the CrIs and the posterior means of the parameters describing the surrogate relationships compared to subgroup analysis, potential limitations should always be kept in mind. First, in real data scenarios it can be challenging to find data sets with sufficient number of treatment classes.
The small number of treatment classes can affect the performance of hierarchical methods substantially [29] reducing the impact of borrowing of information. For instance, applying P-EX model to the illustrative example (in aCRC with three treatment classes) led to a situating where in some of the MCMC iterations only one class was deemed exchangeable by the model which is not possible since there were no other classes to exchange information with.
However, in our example it did not affect the performance of the model as it occurred only in the 0.5% of the MCMC iterations.
Another limitation of the illustrative example is that treatment switching was applied in a subset of trials in this data set. Patients were allowed to switch from the treatment initially assigned to the other treatment arm, mostly from control to experimental arm, typically after progression, if there was sufficient evidence during the trial that the experimental treatment was better than control [24]. Treatment switching potentially diminishes the difference in treatment effects on OS when applying intention-to-treat analysis, leading to larger uncertainty and zero effect on the final endpoint. This makes the estimation of surrogacy patterns between treatment effects on the surrogate and treatment effects on the final outcome very challenging. Many adjustment methods have been proposed, however their validity is often questionable [24].
Furthermore, as it was mentioned in section 2.2, each treatment class consist of studies with multiple treatment comparisons. According to Daniels and Hughes [13] and Shanafelt et. al [35] different treatment comparisons and the use of active or inactive control interventions may influence the surrogate relationship. This could potentially be resolved by classifying treatment according the treatment class comparison (for example anti-angiogenic therapies versus chemotherapy) which potentially would lead to more treatment classes, but with reduced number of studies per class.
A possible extension of these methods is to add another layer of hierarchy accounting for the different treatments within a treatment class, however a reasonable number of studies for each treatment and number of treatments per class would be required . The models could also be extended by making an additional exchangeability assumption about conditional variances, however this may lead to over-parameterising the model or potentially to preventing the model from distinguishing between classes with a strong surrogate relationship from those with weak surrogacy patterns. Furthermore, taking advantage of the setting that Bujkiewicz et al. [4] have proposed, both hierarchical models can be extended to allow for modelling multiple surrogate endpoints (or the same surrogate endpoint but reported at multiple time points) at the same time. Further research is also needed to extend the proposed methodology to Binomial data or to time to event data where the assumption of normality is not plausible. Moreover, to overcome the convergence issues that vague priors on the hyper-parameter of the mixture weights cause, alternative prior distributions should be developed extending the P-EX in a similar way as proposed by Kaizer et al. [22].
In summary, we developed hierarchical Bayesian methods for evaluating surrogate relationships within treatment classes whilst borrowing of information for surrogate relationships across treatment classes. We believe that the proposed methods will improve the validation of surrogate endpoints in the era of personalized medicine, where the surrogacy patterns may depend on the mechanism of action of specific targeted therapies.
A Tables for the performance ofλ 1j for all different scenarios within treatment classes and across methods A.1 1st scenario