A Flexible Multi-Metric Bayesian Framework for Decision-Making in Phase II Multi-Arm Multi-Stage Studies

We propose a multi-metric flexible Bayesian framework to support efficient interim decision-making in multi-arm multi-stage phase II clinical trials. Multi-arm multi-stage phase II studies increase the efficiency of drug development, but early decisions regarding the futility or desirability of a given arm carry considerable risk since sample sizes are often low and follow-up periods may be short. Further, since intermediate outcomes based on biomarkers of treatment response are rarely perfect surrogates for the primary outcome and different trial stakeholders may have different levels of risk tolerance, a single hypothesis test is insufficient for comprehensively summarizing the state of the collected evidence. We present a Bayesian framework comprised of multiple metrics based on point estimates, uncertainty, and evidence towards desired thresholds (a Target Product Profile) for 1) ranking of arms and 2) comparison of each arm against an internal control. Using a large public-private partnership targeting novel TB arms as a motivating example, we find via simulation study that our multi-metric framework provides sufficient confidence for decision-making with sample sizes as low as 30 patients per arm, even when intermediate outcomes have only moderate correlation with the primary outcome. Our reframing of trial design and the decision-making procedure has been well-received by research partners and is a practical approach to more efficient assessment of novel therapeutics.


Introduction
Decision-making in phase II clinical trials carries risk and is far from straightforward.While only 18% of phase II studies establish sufficient evidence to advance a drug into phase III, it seems the wrong drug is often advanced resulting in a failure rate of 50% of phase III studies 1 .Current approaches are inefficient at differentiating good from poor regimens under phase II settings.Sample sizes tend to be considerably smaller in phase II trials than in phase III.Further, adaptive phase II trials tend to rely on intermediate outcomes for decision-making at interim analyses.While in some disease areas, phase II outcomes are the same as those in phase III 2 , it is common that alternative endpoints are used which may not have perfect correspondence with the primary outcome of interest.In addition to the complications of phase II designs, the typical estimands for decision-making are often suboptimal.Standard approaches in multiarm studies include selecting the k best performing arm(s) or more broadly advancing any arms "close" to the best performing arm 1 .A recent extension of Network Meta-Analysis highlighted the pitfalls of basing selection on ranking alone and authors provided recommendations for best practices that "[consider] not only the magnitude of relative effects but also their uncertainty and overlap of their confidence/credible intervals " 3 .An additional factor for regimen selection in phase II studies is ensuring sufficient evidence has been collected to have confidence that the regimen credibly meets a target product profile with respect to safety, efficacy, and general desirability.Frequentist approaches, such as significance testing and group sequential methods, can advance regimens where there is little to no potential to meet the target product profile [4][5][6] .Bayesian frameworks, using a single or a multi-level framework 5,6 , have recently been proposed to more directly address the critical question: "How likely is it that the [target product profile] is [fulfilled] based on my observed data?" 6 The aim of this paper is to present a Bayesian-supported decision framework which we have developed in the context of a phase II trial with an intermediate endpoint that is not a perfect surrogate and with limited outcome data.We propose a multi-metric approach for 1) ranking of arms and 2) comparison of each arm against a control, using a two-level target product profile.We demonstrate via simulations the potential for de-risking decision-making at interim analyses under a flexible decision framework comprised of metrics incorporating point estimates, estimate variability, and evidence towards desired performance thresholds (i.e., a target product profile).
2 Proposed Framework

Motivating example
This decision-making framework is motivated by UNITE4TB, a global public-private partnership with the objective of identifying, in phase IIb trials, new combinations of novel and existing compounds that perform better than the six-month standard of care, HRZE, for the treatment of tuberculosis (TB) when given for four months, thereby supporting evaluation of even shorter durations in a phase IIc trial 7 .The primary clinical outcome in UNITE4TB's PARADIGM4TB-01 trial is the number of unfavorable outcomes (treatment failure, relapse, or re-treatment) occurring within 52 weeks of follow-up.In addition, weekly sputum samples will be collected for twelve weeks post-randomization to monitor the change in time-to-positivity (TTP), defined as "the time [from inoculation in culture media] it takes for a given sputum sample to yield a positive mycobacteria growth indicator tube culture" 8 .This biomarker, while by no means a validated surrogate endpoint, is available much sooner than the primary endpoint, reflects the potency of the regimen in killing off drug-susceptible TB bacterium 9 , and is associated with the primary clinical endpoint such that a more potent regimen (one with a steeper change in TTP) is expected to have a lower rate of unfavorable outcomes than a less potent regimen 8,10 .We do not assume either formal individual-level or trial-level surrogacy but instead rely on TTP as a biomarker reflective of trial-level bactericidal behavior.

Framework components
The proposed framework is designed for decision-making during the interim analysis of a multi-stage multi-arm trial and evaluates clinical trial arms comprised of therapeutic regimens based on three critical components described in more detail below: 1) an arm-wise lack of benefit assessment based on the early accumulation of occurrences of the primary endpoint (number of unfavourable outcomes), 2) arm-wise performance based on the recommended decision from the application of the Bayesian multi-level target product profile framework proposed by Pulkstenis, Patra, and Zhang 6 as applied to the readily available intermediate endpoint (change in TTP), and 3) the arm-wise relative ranking based on estimation from an appropriate Bayesian model of the intermediate endpoint.Table 1 displays the specific proposed decision objectives, their triggers, and statistical estimands.
We propose a sequential application of the framework, as it is intuitive and better reflects natural decision-making in terms of predetermined hierarchies of risk tolerance.Figure 1 demonstrates such a stepwise decision-making framework.

Arm-wise lack of benefit
Our first objective is to identify and deprioritize sub-optimal arms early.Arms will be flagged for lack of benefit based on whether the number of observed unfavorable outcomes exceeds a set threshold, M .While unfavorable outcomes are a definitive clinical outcome and typically used as the primary outcome in phase III studies, they tend to occur after treatment completion 11 .As such, there are likely to be very few observed at the time of the interim analysis, thereby limiting our ability to perform statistical analysis.Instead, this can be thought of as an early screening for removal of arms with larger than acceptable anticipated unfavorable event rates.The remaining metrics rely on the intermediate outcome, TTP, as all patients will have TTP data by the time of the interim analysis.

Arm-wise performance
Arms are then assessed according to a pre-specified two-level target product profile based on the change in log 10 (TTP) slope relative to the control slope (θ, expressed as a percentage change).The quantities that must be pre-specified for the target product profile include the "target value" or level of efficacy corresponding to solid competitiveness, θ T V , the "minimum acceptable value" or minimal level of acceptable efficacy, θ M AV , the maximum allowable risk that an arm is issued a NO-GO decision when it has an unequivocal improvement in efficacy, τ T V , and the maximum allowable risk that an arm is advanced that does not reach the minimal level of acceptable efficacy, τ M AV .
Has the number of observed unfavorable outcomes exceeded the pre-determined threshold?
Is there an acceptable probability that it matches the target product profile?For each arm k, we can then issue a GO, NO-GO, or CONTINUE decision based on the posterior distribution of θ k .We issue a NO-GO decision if the probability that θ k exceeds the target value is sufficiently low (Pr θ (θ k ≥ θ T V ) ≤ τ T V ).We issue a GO decision if the probability that θ k exceeds our minimum acceptable value is sufficiently high (Pr θ (θ k ≥ θ M AV ) ≥ 1 − τ M AV ) and the probability that θ k exceeds our target value is not too low (Pr θ (θ k ≥ θ T V ) > τ T V ).If neither of these conditions is met, a CONTINUE decision will be issued.Pulkstenis, Patra, and Zhang 6 point out that "there is no universal way to characterize the desired efficacy in the [target product profile]", and so we refrain from offering general guidance as to specifying the risk and desired parameter values here and instead encourage stakeholders to identify what is appropriate based on historical data and knowledge of the disease area.

Arm-wise relative ranking
Finally, arms are ranked based on a suite of posterior probability estimands targeting their relative ranking and comparison with the control.We also report a credible estimate (median of the Bayesian posterior distribution) for the relative percent-change in log 10 (TTP) slope as compared to the control, along with a credible interval (confidence level: 1 − α).

Simulation Study Methods
We describe our simulation study using the the Aims, Data Generation, Estimand, Methods, and Performance Measures (ADEMP) framework outlined by Morris et al 12 .

Aims
Our overall aim is to evaluate how well the proposed framework can de-risk decision-making around arm selection for multi-arm phase II trials.

Data-generating mechanism
TTP.The weekly individual-level TTP data is simulated from a parametric linear mixed effects model using the approach described by Arnold et al 13 .Analysis of longitudinal TTP data from the REMoxTB phase III trial 14 motivated our choice.For individual i and visit j, let T ij denote the weeks since randomization at visit j.Let X i denote the assigned treatment arm for individual i, X i = 1, . . ., K where X = 1 denotes the control arm.Equation 1 allows for flexibility in individual-level intercepts and slopes.
We pre-specify the random intercept ), the correlation between the random effects ρ = Cor(β 0i , β 1i ) and the residual error e ij ∼ N (0, σ 2 e ).I{} is an indicator function, returning 1 when the condition is true and 0 otherwise.The parameter values used for data-generation are defined in Section A.1 of the Supplemental Material.
Unfavorable outcomes.Individual-level time to unfavorable outcomes, t i , measured from end of treatment, is simulated using a two parameter Weibull proportional hazards model (Equation 2).All individuals are assumed to complete treatment.Assuming there is no loss to follow-up, event times are censored at the end of 52 weeks of post-randomization follow-up if an unfavorable outcome does not occur before.We assume that an individual's hazard of unfavorable outcome depends only on their intervention assignment, not on their individual-level TTP trajectory; correlation between intermediate and final outcomes is therefore induced only at the level of allocated treatment arm.
The Weibull parameters are tuned such that approximately 75% of unfavorable outcomes occur within the first 13 weeks of post-intervention follow-up 15 (setting scale parameter p = 0.425) and such that unfavorable outcomes by the end of follow-up occur according to pre-specified rates.
Interim.Enrolment dates are randomly assigned such that a rate of ten patients are enrolled per week and randomized to one of five different arms.The first interim analysis occurs one week after complete TTP results are available for the sample size of interest and uses the full TTP data as well as any unfavorable outcome data accumulated up to that point in time.This simulation study only considers the operating characteristics at the first interim analysis.
All simulated datasets consist of one control and four novel arms.TTP and unfavorable outcomes were simulated according to the parameterizations in Table 2. Contrary to PARADIGM4TB's 12-week TTP collection plan, the simulated datasets only consist of TTP for 8 weeks post-randomization.This difference is due to changes made to the PARADIGM4TB study design after initial simulations were performed.We consider four settings for TTP slopes representing a null setting ('No Winners') where all TTP slopes are equivalent, evenly spaced slopes with a clear winner ('One Winner'), a mixture of steep and shallow slopes ('Two Winners'), and a setting were all four arms have similarly steep slopes ('Four Winners').We also consider four settings for unfavorable outcome rates whereby 2.5% unfavorable outcome is considered desirable, 5% is considered minimal and 10% is considered sub-optimal for treatment shortening in the context of a 4-month regimen.All possible combinations of TTP and unfavorable outcome were simulated for each possible sample size in 1,000 simulated datasets.This results in settings where the intermediate and final outcomes were well correlated (steep slopes and low unfavorable outcome rates correspond) and where they were poorly correlated (shallow slopes and low unfavorable outcome rates correspond, and vice versa).Results for any combinations not described here are available in the Supplemental Material and GitHub repository (https://github.com/sdufault15/tb-seamless-design).

Targets of analysis
The targets of analysis are the arm decision objectives as supported by the framework metrics (Table 1).Specifically, we aim to determine whether the framework, when used with standard phase II sample sizes, is sufficient to determine the appropriate arm(s) to flag for lack of benefit or progress, with an acceptable level of risk.

Analysis methods
The weekly log 10 (TTP) data are analyzed using a Bayesian linear mixed effects model with random intercept and random slope specified at the level of the individual and weakly informative priors.The model formula is reported in the Appendix (Eq. A.1), but echoes that used for data generation (Eq.1).Bayesian methods were chosen since they lend themselves to direct probability statements addressing the likelihood of arm success that better facilitate complex decision-making involving nonstatisticians 4,16,17 .Additionally, in this setting, Bayesian methods are desirable because of their ability to handle limit-censoring of the outcome variable 18 .The maximum recommended MGIT incubation time for a sputum sample is 42 days, resulting in a maximum observable TTP value of 42 days and right censoring of TTP values above this limit 8 .While alternative approaches exist to handling right censored outcome variables, likelihood-based approaches have been integrated into standard Bayesian statistical software and are readily available in the setting of non-linear mixed effects models.
Unfavorable outcomes are counted at the arm level and compared against count-based thresholds as described in Table 1.
Simulations and analyses are performed using R version 4.1.2(2021-11-01) "Bird Hippie" 19 .All code necessary to simulate the data, perform the analyses, and recreate the figures presented in this manuscript is available in a GitHub repository maintained by the first author (https://github.com/sdufault15/tb-seamless-design).Bayesian estimation was performed with the brms package 18,20 .

Performance measures
To assess the performance of the proposed multi-metric framework (Table 1), we examine how each of these metrics can contribute to decision-making when used simultaneously.While a common-sense approach should be taken to guide decision-making, cosidering all available data including safety data, these results are generated under a series of hypothetical, rigid decision-criteria in order to gain intuition into the operating characteristics of the framework.Because the relationship between TTP and unfavorable outcomes is not well understood, we additionally assess the performance of the framework as the correspondence between TTP slope and unfavorable outcomes becomes less well correlated.
We then investigate the performance of each framework component individually.For arm-wise lack of benefit, we examine the rates of deprioritization for desirable, minimal, and sub-optimal arms when the unfavorable outcome threshold is set at fewer than one, two or three unfavorable outcomes by the time of the first interim analysis.Arm-wise performance is evaluated by the proportion of simulations returning "GO", "NO-GO", and "Continue" decisions for an array of the log 10 (TTP) slopes and sample sizes.For arm-wise ranking, we focus on the proportion of simulations returning posterior probability estimates that favor the arm with the true steepest slope (θ (1) ) over the arm with the true second steepest slope (θ |X), in order to identify our ability to differentiate between top performers as the gap in their performance decreases from 10% to 2%.

Results
We first present results from a single simulated trial to illustrate how the framework is implemented (Section 4.1).Next, we report the performance results from the simulation study for the overall decision-making framework (Section 4.2).Finally, we report the performance of each framework component separately (Section 4.3).

Demonstration in a single simulated trial
To demonstrate what this framework will look like in practice, we provide the estimated framework metrics for a single simulated trial (Table 3) where TTP has been simulated according to the '1 Winner' setting and unfavorable outcomes were simulated according to the 'Mixed' setting (Table 2).As such, for Arm 5 the true relative improvement in TTP slope compared to Arm 1 is 40% and the true unfavorable outcome rate is a desirable 2.5%.As expected, in the simulated data very few unfavorable outcomes have accrued, suggesting there is not sufficient evidence of lack of benefit to stop any of the arms at this point.Evidence is accumulating that Arm 4 and Arm 5 will meet the target product profile set for TTP; at a sample size of 30 per arm, there is sufficient evidence that these arms are at least as good as the control in terms of their TTP slopes (Pr θ (θ k > θ M AV |X) > 99.1%) and a strong posterior probability that the slopes are at least 20% better than the control (Pr θ (θ k > θ T V |X) > 79.0%).Arm 2 and Arm 3 do not yet have the same strength of evidence suggesting their (in)ability to meet the target product profile.While it is likely that these arms have steeper slopes than control (ψ 1 ), they rank consistently lower than Arm 4 and Arm 5 (ψ 2 , ψ 3 ).If resources are available, it would be prudent to continue enrolling participants on these arms and collect more evidence.If resources are not available, the metrics provided by this framework will contribute to the thoughtful evaluation of which arms to continue and which to stop. Figure 2: Interim analysis decision probabilities for simulated arms with TTP slope relative to the control as specified in the panels (all corresponding to the '2 Winners' simulation setting) and unfavorable outcome rates of A) 10% (unfavorable), B) 5% (minimal), and C) 2.5% (desirable).Arms are flagged for a GO decision if they meet all the following conditions: fewer than 2 unfavorable outcomes, evidence of meeting the target product profile, and a posterior probability greater than 50% of ranking in the top two arms.A pattern is applied to differentiate the criterion responsible for NO-GO decisions.Arms are flagged for a NO-GO decision if they experience any of the following conditions: 2 or more unfavorable outcomes (stripes), do not have evidence of meeting the target product profile (dots), or meet both conditions (plain).All arms that do not meet the criteria for a GO or NO-GO decision, receive a "Continue" designation.Results are based on 1,000 simulated datasets per setting.

Evaluation of the proposed metrics as an overall package
First, we examine the operating characteristics for the simulated data where the TTP slope and the unfavorable outcome rate are in correspondence with each other.When the arms have sub-optimal (10%) unfavorable outcome rates and poor TTP slopes (Fig. 2A, Panel: -10%), there is a 45.3% probability of correctly flagging the arm for a "NO-GO" decision (n k =20); the probability of correctly flagging these sub-optimal arms increases as sample size increases.Further, none of the suboptimal arms received an erroneous "GO" decision, resulting in a false-go rate of 0% in this setting.When the arms have desirable (2.5%) unfavorable outcome rates and true TTP slopes above the target product profile target value (θ T V =20%) (Fig. 2C, Panels: 35%, 40%), the probability of a "NO-GO" decision is relatively rare.For example, when the TTP slope is 40% better than control (Fig. 2C, Panel: 40%), the probability of an erroneous "NO-GO" decision is less than 7.5% (n k = 40) while the probability of correctly flagging an arm for a "GO" decision is at least 88.9% (n k =40).In other words, if the TTP slope observed at the interim analysis corresponds well with the arm-level unfavorable outcome rates, the framework correctly classifies arm decisions with a level of efficiency not accessible through traditional single-metric means while simultaneously maintaining low error rates in terms of "false-go" and "false-no-go" rates.
When the correspondence between TTP slope and the primary endpoint diminishes, the effectiveness of the framework diminishes, but does not completely disappear.For example, if TTP slope is a relatively uninformative proxy for the primary endpoint, the framework will reliably flag arms with a "Continue" decision (Fig. 2, Panels: 10%).This allows decision-makers to continue enrolling participants, hopefully to the point where the primary endpoint will have sufficient power for decision-making.If TTP slope has a negative correspondence with the primary endpoint, such that an arm with a sub-optimal unfavorable outcome rate (10%) returns a relative TTP slope of 40% at the interim analysis, the risk of making a false-go decision increases to 69.6% (Fig. 2A, Panel: 40%, n k =30).Alternatively, if the arm has a desirable unfavorable outcome rate (2.5%) and a poor relative TTP slope (-10%), the false-no-go rate increases to the similarly high level of 73.7% (Fig. 2C, Panel: -10%, n k =40).However, such extremes are exceptionally unlikely.

Examining each framework component
Arm-wise lack of benefit.Figure 3 shows the impact of various count-based thresholds for flagging of lack of benefit during the interim analysis.A good decision threshold should result in a high probability for flagging sub-optimal arms and a low probability for desirable arms.It is immediately apparent that decision-makers must be sensitive to the sample sizes considered when pre-specifying the threshold they will use.At a sample size of n k = 30 per arm, an unfavorable outcome threshold of 2 is associated with a 22% probability of correctly flagging a sub-optimal arm while maintaining a low risk (3%) of incorrectly flagging a desirable arm for lack of benefit.If the sample size per arm can be increased to n k = 40, the efficiency in flagging sub-optimal arms based solely on early observation of unfavorable outcomes more than doubles (53%) while maintaining a relatively low risk of flagging a desirable arm (7%) given the same threshold.
Arm-wise performance.Our second step in arm assessment is based on characterizing a two-level target product profile on the log 10 (TTP) slope.Figure 4 displays the impact of assessing arm performance on the basis of the log 10 (TTP) slope against a multi-level target product profile with the specifed values of θ M AV = 0%, θ T V = 20%, τ M AV = τ T V = 0.025.In this setting, an arm with a 10% poorer slope than the control would be flagged for deprioritization (NO-GO) at least 44% of the time, even when the sample size is as low as 20 per arm.The probability of advancing (GO) promising arms, those with a log 10 (TTP) slope 20% greater than the control, is at least 25% with a sample size of 20 per arm and increases with increasing sample size.Notably, at a sample size of 40 patients per arm, a promising arm with a log 10 (TTP) slope 20% greater than the control is rarely stopped (by design, this proportion hovers around τ T V ) and is recommended for early advancement in nearly 50% of simulations.
Arm-wise ranking.Figure 5 demonstrates that the ability to properly rank the arm with the true steepest slope depends on sample size and competitiveness of the other arms.For clarity, we have restricted these figures to compare the arms with the true steepest and second steepest slopes in log 10 (TTP).Each density curve corresponds to the distribution of posterior probability estimates that a given arm is the steepest; ideally, the arm with the true steepest slope (θ (1) , blue curve) would have a posterior probability estimate of 1 in all simulations and the other arms would have posterior probability estimates of 0. Despite uncertainty in estimation in small sample sizes, the posterior probability estimates are often sufficiently higher for the arm with the true steepest slope than for its competitors (median, vertical lines), resulting in a sufficient metric for decision-making.For example, when θ (1) − θ (2) ≥ 10% ('1 Winner', Fig. 5A), a sample size of 30 per arm is sufficient to separate the posterior probability distributions in most simulated datasets.The posterior probabilities associated with ranking are also reliably responsive to the setting of '0 Winners' (Fig. 5D), suggesting this metric will not return deceptive ranking results when arms are truly similar in terms of TTP slope.

Discussion
Decision-making at any point along the clinical trial pathway is an inherent challenge.We have proposed a flexible, multi-metric framework to de-risk decision-making at interim analyses during phase II trials in TB and, with slight adaptation, other disease settings.Our framework combines innovation in both performance evaluation (multi-level target product profile frameworks) 6 Middle-development TB clinical trials have relied on a handful of commonly used candidate biomarkers (e.g., 14-day EBA, colony forming unit counts, proportion culture negative at 2 months, time to stable culture conversion) as well as novel biomarkers (e.g., MBLA, RS Ratio, gene signature, PET-CT, sputum LAM) to assess regimen efficacy.The relative utility of the various endpoints remains a topic of debate 9,10,[21][22][23][24][25] .Our work is based on TTP as the intermediate endpoint as it is the most commonly and readily available outcome in TB trials and appears somewhat promising in terms of trial-level correlation with the primary endpoint.In this setting, we are not using TTP on an individual level to predict or anticipate a single patient's likelihood of cure.Instead, we are assuming that, at the trial-level, the intermediate TTP slope and final outcomes are correlated and that the differences between arms that is observed on TTP is meaningfully correlated with the differences expected in terms of arm performance for the primary endpoint.In the presence of a positive individual level correlation (which may be a plausible assumption for existing drugs 26 and perhaps also for new drugs), we anticipate the operating characteristics of the framework to be even more favorable.As research progresses on this endpoint, general learnings about the relevance of TTP for regimen development can be used to adjust the target and minimum acceptable values.Our proposed framework, when applied with an appropriate model for the intermediate endpoint, can be extended or adapted to alternative biomarkers, should another option (or the inclusion of additional biomarkers) be of interest to decision-makers.
Bayesian methods for the evaluation of Phase II studies are growing in acceptability 17 and have been approved by regulatory agencies as the primary method of analysis 27,28 .One advantage of Bayesian estimation is the ability to explicitly state and incorporate prior information into the estimation procedure.In the setting of TB studies, there is a wealth of knowledge around the standard of care.Ignoring the decades of evidence that has been accumulated is inefficient and, perhaps, unethical when phase II studies are required to keep sample sizes low for equipoise.Though not explored here, future research and applications of this framework should consider the effect of incorporating prior information for the log 10 (TTP) slope for the standard of care.Following guidance generated by ongoing efforts to incorporate translational pre-clinical and clinical data to improve regimen evaluation (e.g., ACTG RAD-TB), such data sources could also be used to inform reasonable priors on novel regimens as well.Proper incorporation of informative priors should decrease estimator variability in the log 10 (TTP) slopes, ultimately 1) strengthening the ability to compare novel regimens against the standard of care, 2) improving confidence in ranking, particularly for novel regimens with small relative differences in slope, and 3) result in fewer "Continue" categorizations within the target product profile framework.Each of these changes will improve efficiency in the evaluation of regimen performance.Further, it is straightforward to perform sensitivity checks on the impact of the priors and can be an additional tool in guiding decision-makers 29 .
One concern with the use of Bayesian methods for the planning and analysis of clinical trials is its inability to strictly control the type I error rate.This is further complicated by our recommendation that the multi-metric framework be applied holistically, upon the close evaluation of all metrics to comprehensively evaluate a study arm's performance and promise.These concerns are worth investigating and future research will evaluate how more complex decision frameworks, such as the one proposed here, can be properly evaluated to limit this risk.One key advantage of our multi-metric framework includes a direct adaptability to decision-makers' level of risk tolerance.Instead of focusing on a strict frequentist type I error, we have shown that this framework has good operating characteristics for prioritizing arms with desirable performance and de-prioritizing sub-optimal arms which directly addresses the objectives of middle-development clinical trials.Further, strict control of the type I error rate may not be the driving determinant in study design for some trial settings.In UNITE4TB-01, this framework can be used to identify which arms advance from phase IIb to phase IIc, a period of further observation where the duration of the arm is also randomized.Evidence generated in this second phase will help to further elucidate which arms (and durations) should be advanced into large, definitive phase III trials.
In summary, we propose a Bayesian decision framework, building on the two-level target product profile 6 , for the setting of multi-arm middle development clinical trials using intermediate endpoints that are not perfect surrogates.We have shown that our flexible multi-metric framework has good operating characteristics and is a practical solution for de-risking drug development.

Disclaimer
This communication reflects the views of the authors and neither IMI nor the European Union and EFPIA are liable for any use that may be made of the information contained herein.

Figure 1 :
Figure 1: Example flowchart of the decision-making framework applied in a sequential manner.The third component (Does it rank well?) is in a dashed-line box as it is only relevant when more than one arm has successfully advanced through the first two decision-making steps.

Figure 2
Figure 2 demonstrates the strength of using each component of the framework in concert to inform decision-making at the time of the interim analysis.This figure only reflects the '2 Winners' TTP simulation setting.The other settings can be observed in the Supplemental Material (Figures A.3-A.5).

Figure 3 :
Figure 3: The proportion of simulations where an arm with a given unfavorable outcome rate (panels) would be flagged for deprioritization on the basis of collected unfavorable outcome counts at the first interim analysis given varying sample sizes per arm (n k ) and pre-specified unfavorable outcome thresholds.The first interim analysis is triggered by the complete collection of 8 weeks of post-randomization log 10 (TTP) data on n k patients per arm.Results are based on the evaluation of 1,000 simulated datasets.

Figure 4 :
Figure 4: The proportion of trials where an arm with a given percent change in log 10 (TTP) slope relative to the control (panels) would be assigned a particular decision at the first interim analysis given varying sample sizes per arm (n k ).Results are based on the evaluation of 1,000 simulated datasets and assume θ M AV = 0%, θ T V = 20%, τ M AV = τ T V = 0.025.

Table 1 :
Proposed quantities for the multi-metric decision-making framework.