A comparison of Bayesian information borrowing methods in basket trials and a novel proposal of modified exchangeability‐nonexchangeability method

Recent innovation in trial design to improve study efficiency has led to the development of basket trials in which a single therapeutic treatment is tested on several patient populations, each of which forms a basket. In a common setting, patients across all baskets share a genetic marker and, as such, an assumption can be made that all patients may have a homogeneous response to treatment. Bayesian information borrowing procedures utilize this assumption to draw on information regarding the response in one basket when estimating the response rate in others. This can improve power and precision of estimates, particularly in the presence of small sample sizes, but can come at a cost of biased estimates and an inflation of error rates, bringing into question the validity of trial conclusions. We review and compare the performance of several Bayesian borrowing methods, namely: the Bayesian hierarchical model (BHM), calibrated Bayesian hierarchical model (CBHM), exchangeability‐nonexchangeability (EXNEX) model and a Bayesian model averaging procedure. A generalization of the CBHM is made to account for unequal sample sizes across baskets. We also propose a modification of the EXNEX model that allows for better control of the type I error rate. The proposed method uses a data‐driven approach to account for the homogeneity of the response data, measured through Hellinger distances. Through an extensive simulation study motivated by a real basket trial, for both equal and unequal sample sizes across baskets, we show that in the presence of a basket with a heterogeneous response, unlike the other methods discussed, this model can control type I error rates to a nominal level whilst yielding improved power.

Table S1: Calibrated $\Delta_\alpha$ values for the simulation study based on a planned sample size of 13 per basket. These cut-offs are also applied to the realized sample size scenario without re-calibration.

2 Simulation Study for Realized Sample Size With Re-Calibrated $\Delta_\alpha$ Values

In addition to those in the main text, a further simulation study was conducted on the realized sample size case of 20, 10, 8, 18 and 7 patients across the five baskets. In the previous study, the decision cut-off, $\Delta_\alpha$, was calibrated under a null scenario based on $n_k = 13$ patients in each basket and applied to the realized sample sizes. In this simulation, the $\Delta_\alpha$ values are re-calibrated based on the unequal sample sizes to again achieve a basket-specific type I error rate of 10% under the null scenario.
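To make the idea of calibrating a cut-off concrete, the following is a minimal sketch for an independent analysis only. It assumes a Beta(1, 1) prior on each response rate and a decision rule that rejects the null when the posterior probability that the response rate exceeds the null rate is above the cut-off; both are illustrative assumptions, and the function names are hypothetical. The borrowing models would instead require simulated posterior probabilities under the null scenario.

```python
from math import comb

def posterior_prob_exceeds(y, n, q0):
    """P(p > q0 | y responses in n patients) under an assumed Beta(1, 1) prior.

    The posterior is Beta(y + 1, n - y + 1); for integer parameters its upper
    tail equals the binomial CDF P(Bin(n + 1, q0) <= y).
    """
    m = n + 1
    return sum(comb(m, j) * q0**j * (1 - q0)**(m - j) for j in range(y + 1))

def binom_pmf(y, n, p):
    """Probability of y responses in n patients with response rate p."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

def type_one_error(cutoff, n, q0):
    """Exact probability of rejecting (posterior probability > cutoff) when p = q0."""
    return sum(binom_pmf(y, n, q0) for y in range(n + 1)
               if posterior_prob_exceeds(y, n, q0) > cutoff)

def calibrate_cutoff(n, q0, alpha=0.10):
    """Smallest attainable cut-off whose rejection rate under the null is <= alpha."""
    candidates = sorted(posterior_prob_exceeds(y, n, q0) for y in range(n + 1))
    for cutoff in candidates:
        if type_one_error(cutoff, n, q0) <= alpha:
            return cutoff
    return 1.0

# Re-calibrating per basket for the realized sample sizes quoted in the text.
cutoffs = {n: calibrate_cutoff(n, q0=0.15) for n in (20, 10, 8, 18, 7)}
```

Because the response counts are discrete, the achieved error rate in this sketch generally sits below the 10% target rather than exactly on it.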

Estimation Ability of Information Borrowing Models
Provided in Tables S7 (planned sample size scenario), S13 (realized sample size scenario) and S11 (realized sample size scenario with re-calibrated $\Delta_\alpha$) are the mean posterior point estimates of $p_k$ across the 10,000 simulations in the simulation study. In brackets are the standard deviations of these mean estimates.

[Table fragments, truncated in extraction. mEXNEX$_{1/13}$: 0.344 (0.106), 0.163 (0.098), 0.167 (0.106), 0.158 (0.076), 0.171 (0… ; mEXNEX$_{1/13}$: 0.163 (0.074), 0.435 (0.147), 0.436 (0.158), 0.165 (0.076), 0.434 (0…]

There are infinitely many data scenarios one could encounter when conducting a clinical trial analysis; the results presented in the previously mentioned simulation studies cover only a subset of these.
The data scenarios used in the simulation studies were selected to cover a wide range of cases; however, some important cases may not have been investigated.
To address this, a further simulation study was conducted in which, rather than fixing the true probability of success prior to the study, a new random truth vector, p, was generated for every simulation run and data were simulated from a Binomial distribution using these p values. To ensure equal chances of lying in the null and non-null case, p was selected with uniform probability across the ranges [0, 0.15] and [0.35, 0.5].
A total of 20,000 simulations for each borrowing method were run under the three simulation cases: planned sample size of $n_k = 13$ in each basket, realized sample sizes of 20, 10, 8, 18 and 7 without re-calibration of $\Delta_\alpha$, and the realized sample size case with re-calibration. For each method and setting, we find the following operating characteristics:
• Type I error rate: the percentage of times the null was rejected out of the cases where the null was in fact true. This is computed for each basket.
• Power: the percentage of times the null was rejected out of the cases where the true response rate was non-null. This is also computed for each basket.
• Percentage of all correct inference (% Correct): the percentage of times the correct decision regarding whether to accept/reject the null was made across all 5 baskets.
• Family-wise error rate (FWER): the percentage of times at least one type I error was made across the 5 baskets (excluding the global alternative cases where $p_k$ is non-null in all baskets).
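The four operating characteristics above can be sketched in a few lines. The code below is illustrative only: the function names are hypothetical, each element of p is assumed to fall in the null or non-null range with probability 1/2, and the decision rule in the usage lines is a toy threshold on the response count rather than any of the borrowing models.

```python
import random

NULL_RANGE = (0.0, 0.15)   # response rates treated as null (p_k <= q0)
ALT_RANGE = (0.35, 0.5)    # response rates treated as non-null

def draw_truth(num_baskets, rng):
    """Draw one truth vector: each p_k is null or non-null with probability 1/2."""
    truth = []
    for _ in range(num_baskets):
        low, high = NULL_RANGE if rng.random() < 0.5 else ALT_RANGE
        truth.append(rng.uniform(low, high))
    return truth

def simulate_responses(truth, sample_sizes, rng):
    """Binomial response counts, one per basket."""
    return [sum(rng.random() < p for _ in range(n))
            for p, n in zip(truth, sample_sizes)]

def percent(flags):
    """Percentage of True values; NaN when there is nothing to average."""
    return 100.0 * sum(flags) / len(flags) if flags else float("nan")

def operating_characteristics(truths, rejections, q0=0.15):
    """Summarise runs; truths[s][k] and rejections[s][k] index run s, basket k."""
    num_baskets = len(truths[0])
    runs = list(zip(truths, rejections))
    type1 = [percent([r[k] for t, r in runs if t[k] <= q0])
             for k in range(num_baskets)]
    power = [percent([r[k] for t, r in runs if t[k] > q0])
             for k in range(num_baskets)]
    all_correct = percent([all(rej == (p > q0) for p, rej in zip(t, r))
                           for t, r in runs])
    # FWER excludes global-alternative runs, where no type I error is possible.
    fwer = percent([any(rej and p <= q0 for p, rej in zip(t, r))
                    for t, r in runs if any(p <= q0 for p in t)])
    return {"type1": type1, "power": power,
            "all_correct": all_correct, "fwer": fwer}

rng = random.Random(1)
sizes = [13] * 5
sim_truths = [draw_truth(5, rng) for _ in range(200)]
# Toy rule (hypothetical): reject the null when at least 5 responses are seen.
sim_rejections = [[y >= 5 for y in simulate_responses(t, sizes, rng)]
                  for t in sim_truths]
summary = operating_characteristics(sim_truths, sim_rejections)
```

Swapping the toy rule for the posterior decision of a given borrowing model would leave the summary step unchanged, since it only needs the truth vector and the per-basket rejection indicators from each run.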

Planned Sample Size
Figure S6 presents results for the planned sample size case in which 13 patients were observed in each of the 5 baskets. The top section of the figure shows the type I error rate under each of the 7 methods. Similar to the results of the planned sample size simulation in the main manuscript, the BHM and BMA have the highest error rates at approximately 5.0% and 4.51% respectively. All methods have error rates less than or equal to the nominal 10% level. The reduced error rates arise because, in some cases, the true response rate lies well below the null level of 15%. The cut-off value $\Delta_\alpha$ was calibrated for each method under a null scenario where the true response rate is 0.15. When a basket has a true response rate less than 0.15, the $\Delta_\alpha$ value becomes conservative as it is easier to correctly identify that the treatment is ineffective. Under each of the borrowing models there is some degree of pull towards the common mean, which is most evident in the BHM and BMA case, and thus all have a higher error rate than under an independent analysis. The standard EXNEX and mEXNEX$_{1/13}$ models have almost identical error rates at around 3.2% each, whilst the mEXNEX$_0$ model has a lower error rate of 2.7%.

When considering power, those methods with higher error rates also demonstrate the greatest power. The BHM has power 6% higher than that of the independent model, but this comes with the inflation of error mentioned above. Again the EXNEX and mEXNEX$_{1/13}$ models perform almost identically with a power of 86%. The mEXNEX$_0$ model has lower power at 83.9% but this is still a 2% improvement over the independent model. The CBHM has very similar power and error rate to the independent analysis.
The percentage of times the correct conclusion was made across all 5 baskets is presented in the bottom left of Figure S6. The BMA approach alongside the BHM, EXNEX and mEXNEX$_{1/13}$ models all make the correct conclusion across baskets in 63% of the simulations. The modified EXNEX approach with c = 0 makes slightly fewer all-correct conclusions at 61.2%, but this is still greater than an independent analysis, which has a value of 57.6%.
A more substantial difference between methods is observed when looking at the family-wise error rate. Methods that demonstrated lower type I error rates also present lower FWERs, with the independent analysis giving the lowest error alongside the CBHM. This must be weighed against the lower percentage of all correct inference and power that these two methods possess. The BHM and BMA have much larger FWER values, as expected based on the inflated type I error rate.
To summarise, in the planned sample size case when the true response rate is varied, the BHM and BMA continue to display undesirable error rates whilst the independent analysis and CBHM lack power. The modified EXNEX model with c = 1/13 performs almost identically to the standard EXNEX model. When a more conservative cut-off value c = 0 is implemented, error rates are reduced by 0.5% from the standard EXNEX model with a 2.1% reduction in power, which is still a 2.4% improvement over an independent analysis.

Realized Sample Size
In the realized sample size case, basket sample sizes are equal to 20, 10, 8, 18 and 7 and $\Delta_\alpha$ is calibrated based on the planned sample size of $n_k = 13$ in each of the baskets. Results are presented in Figure S7. Similar error rates are observed as in the planned sample size case, with the BHM and BMA approach having inflated error rates, particularly in baskets where the sample size is small. In this case the mEXNEX$_0$ model performs almost identically to an independent analysis, since the discreteness of the data makes it impossible for a basket not to be analysed as independent in the first step of the mEXNEX$_c$ procedure. The EXNEX and mEXNEX$_{1/13}$ models again behave similarly in all metrics, thus little would be gained by using the modified EXNEX approach in this case (particularly for the choice of c made). The mEXNEX$_{1/13}$ model does generate a 0.3% higher probability of all correct inference across the baskets compared to the EXNEX model whilst giving the same FWER.
Looking at the percentage of all correct inference across the 5 baskets and the family-wise error rates as a whole, the performance of the methods is very similar to the planned sample size case but with uniformly lower values for the first metric. The exception is the mEXNEX$_0$ model, which is now identical to the independent approach for the aforementioned reasons.
Overall, we conclude from these results that a cut-off of c = 0 is again not appropriate in the realized sample size case; however, selecting c = 1/13 produces very similar results to the standard EXNEX model. Other values of c could be beneficial here but this was calibrated based on the planned sample size case. Power for smaller basket sizes tends to be considerably lower, so methods that borrow more strongly show clear benefits in power improvement. For example, the BHM improves power over an independent analysis by 7.2% in basket 5, where the sample size is just 7. The mEXNEX$_{1/13}$ model also demonstrates a power improvement over an independent analysis of 2.8%.

Realized Sample Size with Re-Calibrated $\Delta_\alpha$
The above took the decision cut-off value $\Delta_\alpha$, calibrated based on the planned sample size of 13 patients in each basket, and applied it to the realized sample sizes of 20, 10, 8, 18 and 7 for the 5 baskets. In this section these $\Delta_\alpha$ values are re-calibrated based on the realized sample sizes. Results are akin to the realized sample size without re-calibration case and hence discussion is omitted.

The specification of the mEXNEX$_c$ model outlined in Section 2.6 of the main text requires a two-step procedure:
• Step 1: Remove clearly heterogeneous baskets to be analyzed independently, setting $\pi_k = 0$ in the EXNEX model. This is conducted based on some pre-defined cut-off value c which is compared to the minimum pairwise difference in responses.
• Step 2: Of the remaining baskets, compute pairwise Hellinger distances and set the prior borrowing probability, $\pi_k$, to be the average of these distances (excluding the distance to itself).
Here we explore why the use of both of these steps is advantageous compared to making just one of these alterations to the standard EXNEX model. We consider four model settings:
1. The standard EXNEX model where $\pi_k = 0.5$ for all K baskets.
2. The EXNEX model with just step 2, i.e. no removal of heterogeneous baskets but Hellinger distances used to define the $\pi_k$ values. Denote this as EXNEX$_{Hell}$.
3. The EXNEX model with just step 1, i.e. removing heterogeneous baskets and assigning remaining baskets a borrowing probability of 0.5. Denote this as EXNEXR$_c$.
4. The mEXNEX$_c$ model as outlined in Section 2.6, which implements both steps.
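The two steps can be sketched as follows. This is a rough illustration under stated assumptions rather than the specification in the main text: each basket's response rate is given a Beta(1, 1) prior so that posteriors are Beta distributions with a closed-form Hellinger distance, and the weight averaged in step 2 is taken as one minus that distance so that similar baskets receive values near 1. The function names are hypothetical.

```python
from math import exp, lgamma, sqrt

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def hellinger_beta(a1, b1, a2, b2):
    """Hellinger distance between Beta(a1, b1) and Beta(a2, b2) (closed form)."""
    log_bc = (log_beta((a1 + a2) / 2, (b1 + b2) / 2)
              - 0.5 * (log_beta(a1, b1) + log_beta(a2, b2)))
    # Guard against tiny negative values caused by floating-point rounding.
    return sqrt(max(0.0, 1.0 - exp(log_bc)))

def mexnex_prior_weights(responses, sample_sizes, c):
    """Return one prior borrowing weight per basket (0 for removed baskets)."""
    num_baskets = len(responses)
    rates = [y / n for y, n in zip(responses, sample_sizes)]
    # Step 1: remove basket k if its rate differs from every other by more than c.
    removed = [all(abs(rates[k] - rates[j]) > c
                   for j in range(num_baskets) if j != k)
               for k in range(num_baskets)]
    kept = [k for k in range(num_baskets) if not removed[k]]
    # Assumed Beta(1, 1) prior, giving Beta(y + 1, n - y + 1) posteriors.
    post = [(y + 1, n - y + 1) for y, n in zip(responses, sample_sizes)]
    weights = [0.0] * num_baskets
    # Step 2: average similarity to the other retained baskets. A retained
    # basket always has at least one retained neighbour within c of it.
    for k in kept:
        sims = [1.0 - hellinger_beta(*post[k], *post[j]) for j in kept if j != k]
        weights[k] = sum(sims) / len(sims)
    return weights

example_weights = mexnex_prior_weights([1, 2, 2, 9, 1], [13] * 5, c=1 / 13)
```

In this example the fourth basket, with 9 responses out of 13, differs from all others by more than c and so receives a weight of 0, which corresponds to an independent analysis. Dropping step 1 or fixing the retained weights at 0.5 would give the single-step variants listed above.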

Simulation Study Based on the Motivating VE-BASKET trial
To make such a comparison we initially consider the simulation setting outlined in Section 3 of the main text, which is based on the motivating VE-BASKET trial. Results presented in Figure S9 are based on a planned and equal sample size of $n_k = 13$ patients in each of the baskets and a cut-off of c = 0 implemented for the methods that require removal of heterogeneous baskets. Figure S9 displays the percentage of simulated data sets in which the null hypothesis is rejected for each basket and model setting under the data scenarios outlined in Table 1 of the main text. When the null is true, the bars represent the basket's type I error rate; otherwise they represent the power.
From Figure S9 one can clearly see that when heterogeneous baskets are not removed from the borrowing component (i.e. in the EXNEX and EXNEX$_{Hell}$ models) an inflation in error rate is evident, whereas models that take this removal step have far better error control at the cost of slightly lower statistical power. We would therefore not recommend the EXNEX$_{Hell}$ model for use as its performance is inferior to the other proposed modifications.
Only minimal differences between the mEXNEX$_c$ and EXNEXR$_c$ models are observed here, with the mEXNEX$_c$ method giving consistently greater power than the EXNEXR$_c$ method but only by up to 0.5%. This increase occurs alongside insubstantial inflation in error rates. Although the difference between these two methods is only slight in this single study, we explore further trial settings to investigate the differences between the two proposals.

Varying the Study Design
When conducting simulation studies there are a few design parameters to consider. These include:
• The number of baskets, K, included in the study.
• The sample size, $n_k$, within each of the K baskets.
• The null and target response rate.
In the simulation study in Section 5.1, the design parameters are specified to have K = 5 baskets with $n_k = 13$ patients in each of the k = 1, ..., K baskets, while the null and target response rates are fixed at $q_0 = 0.15$ and $q_1 = 0.45$ respectively. These design parameters are in line with those of the VE-BASKET trial, which motivates the simulation. We now use this simulation as a reference setting and vary one of the three design parameters at a time to determine where differences between the mEXNEX$_c$ and EXNEXR$_c$ models become more prominent. The different settings of design parameter alterations are provided in Table S14. For the comparison of the design settings, the EXNEX$_{Hell}$ model was excluded due to its inferior performance in the previous simulations. For both models that remove heterogeneous baskets we opt for two values of c, c = 0.05 and c = 0.1. The data scenario applied is one in which we have a ratio of two baskets with a response rate of $q_1$ to three baskets with a response rate of $q_0$.

Plotted in Figure S10 are the rejection percentages for each basket and model under the 8 different simulation settings outlined in Table S14. The first two of these settings vary the number of baskets from a very small number to moderately large, whilst settings 3-5 cover different possible sample sizes per basket: a case where the number of patients is very small, a moderate number, and then in setting 5 a larger number of patients. The final two settings vary the target response rate, with one value being close to the null response rate and the second being fairly different from $q_0$.

Looking first at the reference model, an average 0.7% increase in power is observed with just a 0.3% increase in type I error when using mEXNEX$_c$ over EXNEXR$_c$ when c = 0.05. However, as the number of baskets, K, increases, the difference in type I error rate is more evident. The increased inflation in error rates across all methods as K increases comes about because fewer baskets are treated as independent at the removal step. To decide whether to treat a basket as independent we compute all pairwise differences in response rates, i.e. basket $i$ is treated as independent if $|X_i - X_j| > c$ for all $j \neq i$, where $X_i = Y_i / n_i$, with $Y_i$ being the number of responses observed in basket $i$, which consists of $n_i$ patients. If we consider the case where the response rates are IID, then the probability a basket is treated as independent is $P(|X_1 - X_2| > c)^{K-1}$, which decreases as K increases. As a result, when K = 10, borrowing will occur more frequently between heterogeneous baskets and hence the error rates inflate to anywhere between 12 and 18% depending on the method chosen (compared to 10.5 to 12.4% when K = 5). When K = 3, more baskets are treated as independent and thus differences in error rate and power are minimal.
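The effect of K on the removal step can be checked numerically. The sketch below assumes, purely for illustration, that all baskets share a common response rate p and sample size n, so that the counts are IID Binomial. It conditions on the count in basket 1 so that the K − 1 comparisons are independent given that count; the function name is hypothetical.

```python
from math import comb

def prob_treated_independent(num_baskets, n, p, c):
    """P(basket 1 differs from each of the other baskets by more than c).

    Assumes Y_i ~ Binomial(n, p) independently for every basket, X_i = Y_i / n.
    """
    pmf = [comb(n, y) * p**y * (1 - p)**(n - y) for y in range(n + 1)]
    total = 0.0
    for y1 in range(n + 1):
        # Probability that a single other basket is more than c away from y1 / n.
        far = sum(pmf[y] for y in range(n + 1) if abs(y1 - y) / n > c)
        # Given y1, the K - 1 comparisons are independent.
        total += pmf[y1] * far ** (num_baskets - 1)
    return total

removal_probs = {k: prob_treated_independent(k, n=13, p=0.15, c=0.05)
                 for k in (3, 5, 10)}
```

Each conditional term is at most 1, so the probability can only shrink as further baskets are added, in line with the behaviour described for K = 3, 5 and 10.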
More error rate inflation is observed when using mEXNEX$_c$ compared to EXNEXR$_c$ when K = 10 (up to a 2% increase when c = 0.05 and 3% when c = 0.1). This again comes down to fewer baskets being treated independently. Even so, the EXNEXR$_c$ model limits borrowing by fixing the prior probabilities at $\pi_k = 0.5$, whereas under the mEXNEX$_c$ model these probabilities tend to be higher, resulting in more borrowing from heterogeneous baskets and hence higher type I error rates.
Looking at the effect of sample size, we note that as the sample size increases, the Hellinger distance between two baskets decreases. These reduced Hellinger distances for both homogeneous and heterogeneous baskets, when averaged, result in smaller $\pi_k$ values and hence less borrowing. In fact, as $n_k$ increases to 100, these probabilities can actually fall below 0.5, so we have a lower chance of borrowing between homogeneous baskets under the mEXNEX$_c$ model compared to the EXNEXR$_c$ model, and hence we observe better error control (11.45% compared to 12.48%). However, as the sample size increases, the increased certainty in estimates results in power tending towards 100% regardless of method and hence no improvement in power is observed. Even when $n_k$ is small at just 5 patients in each basket, the difference in error rates is also relatively small at a 0.8% increase under the mEXNEX$_c$ model, whilst gaining on average 1.2% power when c = 0.05.
The final varied design parameter considered is the target response rate, $q_1$. When $q_1$ is closer to the null response rate, we observe minimal change in the type I error rate from the nominal 10% level, whereas when $q_1 = 0.7$ this error rate rises to above 11% for both methods. Inflation in error rates is caused by a pull away from the true mean towards the common mean. When the target response rate is close to the null, this pull is less substantial compared to larger $q_1$ values, which explains the minimal difference in type I error rate when $q_1 = 0.3$. Similar error rates are observed under both methods; however, the gain in power is greater under the mEXNEX$_c$ model at 53.2% compared to 52.5% under the EXNEXR$_{0.05}$ model when $q_1$ is closer to the null response rate. This is due to the Hellinger distance, and hence $\pi_k$ values, being closer to 1 under the mEXNEX$_c$ model whereas under the EXNEXR$_c$ model these are fixed at 0.5.
From the above we conclude that, in general, the mEXNEX$_c$ model performs more favourably than EXNEXR$_c$ when the sample size is very small or very large and when the target response rate is closer to the null response rate, whereas the EXNEXR$_c$ model is preferred when the number of baskets increases and when the target response rate is very different to the null response rate. Overall, we would recommend the two-stage mEXNEX$_c$ method over the EXNEXR$_c$ model as a trial with fewer baskets and a target response rate closer to the null is more realistic, although an argument could be made in some cases to use just the removal of heterogeneous baskets step.

Simulation Study for a Varied Number of Baskets
In Section 5.2 a simulation study was conducted with one parameter varied at a time. This section now focuses on just one of those parameters: the number of baskets, denoted K. Two values of K are considered, K = 3 and K = 10, with full simulation results for several data scenarios and under each of the borrowing methods provided in Sections 6.1 and 6.2.
For both cases a total of 10,000 simulation runs were used for each method and data scenario. The VE-BASKET trial remains the motivating example; as such, the sample size in each basket is fixed at $n_k = 13$ for all k = 1, ..., K with a null and target response rate of $q_0 = 0.15$ and $q_1 = 0.45$ respectively. Model specifications are consistent with those outlined in Appendix A of the main text.

Simulation Study for K = 3 Baskets
Under K = 3 baskets, 8 data scenarios are considered and outlined in Table S15. Scenarios 1-4 cover a varying number of effective baskets from none to all 3, whilst scenarios 5-8 consist of cases where some baskets are marginally effective with a true response rate of $p_k = 0.35$.
The modified EXNEX model was calibrated as outlined in the procedure in Section 2.6 of the main text and results for two values, c = 1/13 and c = 4/13, are presented here. The CBHM was also tuned, giving parameters a = −1.390 and b = 3.674. Calibrated $\Delta_\alpha$ values are provided in Table S16. Simulation results are presented in table form in Table S17, with rejection percentages also displayed in Figure S11.
Under data scenario 2, all methods give reasonably similar power values ranging from 86.9% to 88.4%. Both the BHM and BMA approach give the smallest power in this case, with the mEXNEX$_{1/13}$ model presenting the highest. Error rates tend to be significantly smaller in the 3 basket case compared to the previous 5 basket simulation, as does the difference in performance between all methods. The BHM and BMA approach have inflated error rates at around 13%. Almost identical inflation is observed under the EXNEX and mEXNEX$_{4/13}$ models but the modified EXNEX model with c = 1/13 has lower error rates at 10.8% (compared to 11.5%). So the mEXNEX$_{1/13}$ model is appealing here for both its error control and superior power value. Across all scenarios 1-8 the standard EXNEX model and mEXNEX$_{4/13}$ model continue to perform almost identically, with only marginal differences observed in terms of the type I error rate, power, FWER and percentage of all correct conclusions. However, when the cut-off is reduced to the more conservative value of c = 1/13, noticeable differences arise, including a slight reduction in error rates with a slight loss in power.
As in the previous simulation studies, the BHM and BMA procedure continue to inflate the type I error rate to an unacceptable level but in the 3 basket case, unlike the 5 basket case, tend to do little in terms of power improvement compared to the EXNEX models, which possess superior error control. For example, under scenario 3, in which 2 of the 3 baskets are effective, power under the BHM averages 90.9% with a type I error rate of 21.8%, whereas under the EXNEX model power is 90.5% with a type I error rate of 11.6%. Therefore, one could argue that the minute power improvement cannot justify the 10% increase in type I error rate. However, more improvement in power for the methods that borrow more strongly is observed in baskets with marginally effective response rates, most notably in scenario 8.
In a simulation akin to that in Section 4, a further study was conducted in which the true response rate, p, was randomly generated within each simulation run. The results are presented in Figure S12. Again, the standard EXNEX model and mEXNEX$_{4/13}$ model produce very similar results: both have a type I error rate of 2.8%, but the mEXNEX$_{4/13}$ model has slightly higher power at 85.9% compared to 85.7% under the EXNEX model. However, if the cut-off value c is reduced to 1/13 the error rate becomes 2.7%, a marginal decrease, with power of 85.4%. Weighing up error control and power improvement, one would favour the less conservative modified EXNEX approach or the standard EXNEX model, as the error rates are all relatively similar but with power improvement over an independent analysis of about 2.5%.
All methods bar an independent analysis give approximately a 76% chance of making the correct conclusions across all baskets; however, family-wise error rates vary. The BHM and BMA approach give FWER values of around 6.2% whilst the standard EXNEX and mEXNEX$_{4/13}$ models have a FWER of 4.7%, with the mEXNEX$_0$ model slightly lower at 4.6%.
To summarise, the effect of reducing the number of baskets is a slight reduction in power alongside a smaller inflation in error rates, particularly for the BHM and BMA approach; however, the conclusions regarding method comparison appear to be very similar to those in the planned sample size case with K = 5 baskets.

Simulation Study for K = 10 Baskets
The data scenarios considered for K = 10 baskets are outlined in Table S18. Scenarios 1-11 cover a varying number of effective baskets from none to all 10, whilst scenarios 12-15 consist of cases where some baskets are marginally effective with a true response rate of $p_k = 0.35$. The modified EXNEX model was calibrated as outlined in the procedure in Section 2.6 of the main text and results of two calibrated cut-off values are presented here: c = 0 and c = 1/13. The CBHM was also tuned, giving parameters a = −23.475 and b = 10.963. Calibrated $\Delta_\alpha$ values are provided in Table S19. Full results are presented in Tables S20, S21 and S22, with the hypothesis rejection percentages also displayed in Figures S13 and S14.
Consider scenario 2, where a single basket is heterogeneous and effective; the results are similar to those in the K = 5 basket case. Again, substantial inflation in the type I error rate is observed under the BHM and BMA approach, with averaged error rates of 17.2% and 12.7% respectively. Under the K = 5 basket case, a similar scenario with a single heterogeneous basket resulted in error rates of 16.9% and 13.2% respectively. These values are very similar, indicating that increasing the number of baskets does little to eliminate error rate inflation under the two models that demonstrate the worst performance, particularly as in these scenarios the power is substantially lower than under an independent analysis (82.2% under the BHM compared to 88.3%).
Looking at other methods under the same scenario, the CBHM also has inflated error rates at approximately 12%. This contradicts the calibration nature of this model, which takes a 'strong' definition of heterogeneity: if a single or multiple baskets have a heterogeneous response, then all are deemed heterogeneous and so analysed as independent. In this scenario there is clear heterogeneity, thus it would be expected that the CBHM performs similarly to the independent model. This is not the case, leading to the conclusion that perhaps the calibration was slightly off.
The EXNEX model under scenario 2 also has fairly substantial error rates at 11.9% with a power of 87%, whereas under the more conservative modified EXNEX approach with c = 0, error rates are around 10.5%, so close to the nominal level, with a power of 87.5%, an improvement over the standard EXNEX model. If the cut-off is increased to c = 1/13, error rates increase to 11.3% with 87.4% power. Bringing together these results, under scenario 2, of the borrowing models the mEXNEX$_0$ model has both the highest power and best error control. Moving on to cases in which multiple baskets are effective (as in scenarios 3-10), the pattern of results described above holds in that the BHM and BMA inflate error rates and the mEXNEX$_c$ models have better error control than the standard EXNEX model, particularly when c is more conservative, whilst still improving power over an independent analysis. These are also the same conclusions drawn from the 5 basket simulation study presented in the main text.
One can see that the error rate becomes far more substantial across all methods as the ratio of effective to ineffective baskets increases, with the inflation greater than that in the 5 basket case. For example, under K = 5 the maximum error rate for the BHM is 42.1%, whereas in the K = 10 case this is 76.8%; these occur under scenarios 5 and 10 respectively. In scenario 10 for K = 10, just one basket is ineffective with a true response rate of 0.15 whilst the other 9 are effective with a higher response rate of 0.45. The presence of 9 effective baskets compared to just 4 in the K = 5 case causes a larger pull up towards the common mean for the single heterogeneous basket, hence the greater inflation. This holds for all methods, with maximum error rates uniformly increasing from the K = 5 to the K = 10 case. However, although this error inflation is observed, power is improved across all methods. This is unsurprising given the additional certainty we gain from the extra 5 baskets included in the study.
Under the four scenarios in which a number of baskets have a marginally effective response rate of 35%, all borrowing methods show a substantial gain in power compared to an independent analysis, particularly for the marginally effective baskets. But inflation in error rates continues to be an issue.
As in Section 4 of the supplementary material, a further simulation was conducted in which the truth vector, p, is varied within each simulation. The results are provided in Figure S15. Unlike in the K = 3 basket case, more substantial differences are observed between the standard EXNEX model and the modified EXNEX approaches. The standard EXNEX model has error rates and power of 4.6% and 88.3% respectively, the mEXNEX$_0$ model has an average error rate of 3.3% and power of 86.2%, whilst the mEXNEX$_{1/13}$ model has a 4.4% error rate and 88.1% power. All have power improvement over an independent analysis, which has an average power of 83%, thus even under the most conservative modified EXNEX approach, power is improved by around 6.2%.
Looking at the percentage of times correct inference was made across all 10 baskets, the highest values were observed under the EXNEX model, mEXNEX$_{1/13}$ model, BMA approach and BHM at around 42%. The mEXNEX$_0$ model has a value of 41.1% for all correct inference but a 5% lower family-wise error rate compared to the standard EXNEX model. Weighing up both FWER and all correct inference, the mEXNEX$_0$ model appears optimal, with a FWER closer to that of the independent analysis and CBHM whilst giving a 3% higher percentage of correct inference compared to an independent analysis.
To summarise, this study confirms the conclusions drawn from the K = 5 basket simulation study presented in the main text whilst also highlighting that a larger number of baskets, although improving certainty of estimates and power, leads to even less favourable type I error rates.

Figure S1: The family-wise error rate (FWER) and percentage of simulated data sets within which correct inference is made across all baskets (% All Correct) for each method under each data scenario based on a planned sample size of 13 patients per basket.

Figure S2: The family-wise error rate (FWER) and percentage of simulated datasets in which the correct inference is made across all baskets (% All Correct) for each method under each data scenario based on realized sample sizes of 20, 10, 8, 18 and 7 across the 5 baskets.

Figure S4: Percentage of rejections of the null hypothesis for each method under data scenarios 11-16, based on the realized sample sizes of 20, 10, 8, 18 and 7, with re-calibration of $\Delta_\alpha$ to take into account unequal sample sizes.

Figure S5: The family-wise error rate (FWER) and percentage of times correct inference is made across all baskets (% All Correct) for each method under each data scenario based on realized sample sizes of 20, 10, 8, 18 and 7, with re-calibration of $\Delta_\alpha$ to take into account unequal sample sizes.

Figure S6: Operating characteristics under varied truths for the planned sample size case where 13 patients are observed in each basket.

Figure S7: Operating characteristics under varied truths for the realized sample size case where 20, 10, 8, 18 and 7 patients are observed in the 5 baskets and $\Delta_\alpha$ is not re-calibrated.

Figure S8: Operating characteristics under varied truths for the realized sample size case where 20, 10, 8, 18 and 7 patients are observed in the 5 baskets and $\Delta_\alpha$ is re-calibrated.

Figure S9: The percentage of simulated data sets in which the null hypothesis was rejected in each basket under the four model settings outlined above across the simulation settings provided in Table 1 of the main text.

Figure S10: The percentage of simulated data sets in which the null hypothesis was rejected in each basket under the eight comparison settings outlined in Table S14.

Figure S12: Operating characteristics under varied truths for the K = 3 basket simulation study

Figure S13: Percentage of rejections of the null hypothesis for each method under the 10 basket case across scenarios 1-8.

Figure S14: Percentage of rejections of the null hypothesis for each method under the 10 basket case across scenarios 9-15.

Figure S15: Operating characteristics under varied truths for the K = 10 basket simulation study

Table S3: Operating characteristics for a simulation based on the realized sample size of 20, 10, 8, 18 and 7 across the baskets under data scenarios 1-6, with re-calibration of $\Delta_\alpha$ to take into account the unequal sample sizes.

Table S4: Operating characteristics for a simulation based on the realized sample size of 20, 10, 8, 18 and 7 across the baskets under data scenarios 7-12, with re-calibration of $\Delta_\alpha$ to take into account the unequal sample sizes.

Table S5: Operating characteristics for a simulation based on the realized sample size of 20, 10, 8, 18 and 7 across the baskets under data scenarios 13-16, with re-calibration of $\Delta_\alpha$ to take into account the unequal sample sizes.

Figure S3: Percentage of rejections of the null hypothesis for each method under data scenarios 1-10, based on the realized sample sizes of 20, 10, 8, 18 and 7, with re-calibration of $\Delta_\alpha$ to take into account unequal sample sizes.

Table S6: Mean point estimates of $p_k$ across the simulations (standard deviations) based on a planned sample size of 13 per basket under scenarios 1-6.

Table S7: Mean point estimates of $p_k$ across the simulations (standard deviations) based on a planned sample size of 13 per basket under scenarios 7-10.

Table S9: Mean point estimates of $p_k$ across the simulations (standard deviations) based on realized sample sizes of 20, 10, 8, 18 and 7 patients across the 5 baskets for scenarios 7-12.

Table S10: Mean point estimates of $p_k$ across the simulations (standard deviations) based on realized sample sizes of 20, 10, 8, 18 and 7 patients across the 5 baskets for scenarios 13-16.

Table S11: Mean point estimates of $p_k$ across the simulations (standard deviations) based on realized sample sizes of 20, 10, 8, 18 and 7 patients across the 5 baskets with re-calibration of $\Delta_\alpha$ under scenarios 1-6.

Table S12: Mean point estimates of $p_k$ across the simulations (standard deviations) based on realized sample sizes of 20, 10, 8, 18 and 7 patients across the 5 baskets with re-calibration of $\Delta_\alpha$ under scenarios 7-12.

Table S13: Mean point estimates of $p_k$ across the simulations (standard deviations) based on realized sample sizes of 20, 10, 8, 18 and 7 patients across the 5 baskets with re-calibration of $\Delta_\alpha$ under scenarios 13-16.

4 Simulation Study in Which the Truth Vector, p, is Varied

Table S14: Simulation settings for comparing the modified EXNEX models where we vary a single design parameter at a time.

Table S15: Simulation study scenarios

Table S17: Operating characteristics for a simulation consisting of K = 3 baskets. Percentage of rejections of the null hypothesis for each method under the 3 basket case.

Table S18: Simulation study scenarios

Table S20: Operating characteristics for a simulation based on K = 10 baskets with a sample size of $n_k = 13$ in each