Determining the minimum duration of treatment in tuberculosis: An order restricted non‐inferiority trial design

Tuberculosis (TB) is one of the biggest killers among infectious diseases worldwide. Beyond identifying drugs that benefit patients, a key challenge in TB is optimising the duration of these treatments. While the conventional duration of treatment in TB is 6 months, there is evidence that shorter durations might be as effective while being associated with fewer side effects and better adherence. Building on a recent proposal of an adaptive order‐restricted superiority design that exploits the ordering assumption among various durations of the same drug, we propose an adaptive non‐inferiority design (the type typically used in TB trials) that makes effective use of the order assumption. Together with the general construction of the hypothesis testing and expressions for type I and type II errors, we focus on how the novel design was proposed for a TB trial concept. We consider a number of practical aspects such as the choice of the design parameters, randomisation ratios, and timings of the interim analyses, and how these were discussed with the clinical team.


| INTRODUCTION
Tuberculosis remains one of the biggest infectious killers worldwide and global control of the disease is slow. 1 Detection and treatment of infectious pulmonary tuberculosis is a key component of public health strategies in high-burden countries, but the current standard of care for first-line therapies requires strong adherence to combination therapy for 6 months to ensure a stable cure. The current first-line regimen evolved over three decades of clinical trial activity and has remained unchanged since the early 1980s. 2 While novel anti-tuberculosis drugs are finally becoming available, new clinical trial evidence and re-examination of historical trials suggest that novel combinations of existing drugs could improve the efficacy of the current first-line standard tuberculosis treatment. 3 TB clinical trials in the past typically evaluated homogeneous populations of patients, often those with more severe disease able to provide repeated bacteriological specimens, and modern guidance does not recommend modification of the composition or duration of treatment regimens according to clinical characteristics. Recent data have, however, lent support to the concept of stratification of treatment on the basis of readily available prognostic factors including rapid microbiological tests and radiology. 4 Strong correlations have been observed in trials and observational studies between measures of baseline bacillary load and long-term outcomes. In a recent pooled re-analysis of three trials of first-line regimens containing fluoroquinolones, the TB-ReFLECT consortium showed that, despite the failure of these trials to show significant non-inferiority, patients on 4-month regimens with low-risk prognostic factors (sputum smear grade <3+ and absence of radiologically apparent cavitation) in fact achieved non-inferior results to the 6-month regimen. 4 These results were also validated externally using a previous clinical trial of prospective simple prognostic stratification, 5 suggesting that the concept of a unitary regimen for all TB patients may result in overlong treatment for some. Furthermore, since two-thirds of patients in these trials fulfilled low-risk prognostic criteria, such a stratified approach clearly has the potential to shorten treatment for a substantial number of patients.
Due to the excellent efficacy of the 6-month first-line regimen, the conventional designs in TB are non-inferiority trials that evaluate whether a novel treatment regimen may shorten treatment to a single reduced duration, estimated on the basis of preclinical and early phase clinical trial data. For instance, Study 31 6 has recently shown that a 4-month rifapentine-based regimen containing moxifloxacin was non-inferior to the standard 6-month regimen in the treatment of tuberculosis. However, the conventional designs are not able to determine the minimum possible duration for any given regimen, and their success rests heavily on correctly predicting the target duration from limited data. Such trials thus pose a risk to drug developers and do not provide the information of most interest to policymakers.
Adaptive designs have been proposed as a solution to the question of how to rapidly identify the most promising treatment in confirmatory trials. Multi-arm multi-stage (MAMS) designs have been argued to be a highly efficient approach to clinical trials [7][8][9] and they have been suggested 10 for improving the treatment of tuberculosis. In this disease area, MAMS designs have been proposed by Bratton et al. 11 and Cellamare et al., 12 and such a design was adopted in a completed trial by the Pan-African Consortium for the Evaluation of Antituberculosis Antibiotics (PanACEA). 13 In addition, the TRUNCATE-TB trial 14 is a MAMS trial recently implemented for drug-sensitive tuberculosis. These trials, however, evaluate different treatment regimens at only a single study duration.
New approaches to estimate the duration of therapy have become available which are capable of evaluating multiple nested durations in an efficient manner. 15,16 The design proposed by Quartagno et al. 15 relies on a parametric model for the duration-response relationship and it does inflate the type I error under certain scenarios where the steepness of the duration-response curve is increasing at the optimal duration. The work proposed by Serra et al., 16 instead, incorporates the order of treatment effects in the decision-making when no parametric duration-response model is assumed and it guarantees strong control of multiple testing.
Depending on the clinical setting and the disease area, several primary endpoints can be measured and used to address the objectives of the trial. The adaptive design by Serra et al. 16 was proposed considering only normally distributed endpoints in a superiority trial setting. In this work, we extend the design for a non-inferiority trial with a binary endpoint. We describe its application to a clinical trial in TB which aims to compare the efficacy of multiple treatment durations against a standard treatment regimen. We consider several practical aspects such as the choice of design's parameters, randomisation ratios and timing of the interim analyses and how all these aspects were discussed with the clinical team.
The rest of the manuscript continues as follows. The motivating clinical trial is introduced in Section 2 before a detailed description of the novel design is proposed in Section 3. Section 4 describes the choice of the design's parameters for the actual clinical trial before a suggestion for the clinical trial design is provided at the end of the section. A simulation study is described in Section 5 in order to compare the proposed design with a standard MAMS design. Section 6 analyses two different strategies that consider different timings of the interim analyses. We conclude with a discussion.

| MOTIVATING TRIAL: RESTRUCTURE TRIAL
The REStrUCTuRe trial is a Randomised Evaluation of Stratified Ultra-Short Combination Tuberculosis Regimens. This is a Phase III trial that aims to determine the shortest possible yet effective duration of treatment for people suffering from pulmonary tuberculosis with low-risk or high-risk prognostic factors using a dose-optimised fluoroquinolone-containing first-line regimen containing Rifampicin (RIF) 40 mg/kg, Pyrazinamide (PZA) 35 mg/kg and Levofloxacin (LFX) 15 mg/kg: $R_{40}Z_{35}L_{15}$. The primary aim is to compare the efficacy of $R_{40}Z_{35}L_{15}$ at durations of treatment from 2 to 4 months, while secondary aims are to compare the safety and tolerability of the investigational drug at different durations and to estimate the value of measures of culture conversion as predictors of long-term outcome. The schematic of the experimental arms planned to be considered, depending on the patients' prognostic factors, is given in Figure 1. Durations of 2, 3, and 4 months will be tested in the low-risk prognostic sub-population (no cavitation on CXR and/or Smear Grade < +++) and durations of 3 and 4 months in the high-risk prognostic sub-population (cavitation on CXR and/or Smear Grade ≥ +++). Standard of care $2HR_{10}Z_{25}E/4HR_{10}$ is administered daily for 6 months irrespective of participant prognostic score.
The primary outcome of "no durable cure" (NDC) is defined as a confirmed positive mycobacterial culture at the end of treatment with genetically identical isolates (on whole-genome sequencing) to their baseline isolate.
In the next sections, we illustrate and describe the extension of the order-restricted MAMS design 16 to a non-inferiority setting with a binary endpoint, and we consider a number of practical aspects.

| ORDER RESTRICTED DESIGN
In this section, we first summarise the testing procedure of the original multi-arm multi-stage superiority order-restricted design (ORD 16 ), and then propose the extension for the non-inferiority setting with a binary endpoint. Non-inferiority trials are designed to test whether new regimens are non-inferior in efficacy compared to the standard regimen currently used. These designs are more appropriate when new regimens may have practical advantages compared to the standard intervention and thus may be preferred in real-life settings even if the new regimen is modestly less efficacious. 3 The level of acceptance is defined by the non-inferiority margin. Given the high efficacy of currently recommended regimens, non-inferiority designs are necessary in the TB setting. 17 We start by describing a general design proposal for an arbitrary number of arms and stages. Then, given the different treatment durations studied in each sub-population of the motivating trial, we provide the design for each sub-population.
3.1 | Multi-arm multi-stage superiority order-restricted design for normally distributed endpoint
Consider a clinical trial with $K - 1$ active treatment arms, $T_1, \ldots, T_{K-1}$, against a control treatment $T_0$ and $J$ stages at which treatment arms can be dropped or the trial could be stopped for benefit or lack of efficacy. Assume that a patient's response follows a normal distribution with known common variance, be the observation of the $i$-th patient on treatment $k$ (the control arm is denoted by 0) and n We test the null hypotheses:

| Number of arms and stages for the REStrUCTuRe trial
Given the different treatment durations studied in each sub-population, for the high-risk sub-population we denote the vector of treatment effects by $\theta_{12} = \theta$ We define the set of the indices for the active treatment arms in this subgroup as $K_l = \{1, 2, 3\}$.
For both sub-populations, a single interim analysis is planned after half of the total planned maximum number of participants have completed their treatment. From the theory of group sequential design, 20 the design performs well in terms of operating characteristics when the interim analysis is planned after having observed between 30% and 70% of the total information. Thus, it was decided to consider a middle value among those and plan the interim analysis after having observed 50% of the total information. Thus in the next sections, we consider a 3-arm (for the high-risk sub-population) and a 4-arm (for the low-risk sub-population) 2-stage design.

| Family-wise error rate
For confirmatory clinical trials, control of the family-wise error rate (FWER) in the strong sense at level α (one-sided), that is the probability to reject at least one true null hypothesis, is typically required. 21 Using the rules described in Table A2, the FWER for the 3-arm 2-stage ORD can be written as $P(\text{rejecting at least one true } H_{0k},\ k \in K_h \mid H_{012}, p_{12}) =$ P Z ( ) l = 0. The primary objective of the trial is to identify the shortest possible treatment duration in each sub-population. Thus, the trial is to be powered to reject all correct hypotheses. However, alternative power strategies can also be considered, for example to reject at least one hypothesis or to reject the first and the second hypotheses only in the 4-arm trial. The equations that need to be satisfied to power the design are provided in Appendix A.

| No early stopping for declaring non-inferiority
As explained in Section 2, we look for the culture status at the end of the treatment. However, early success claims might be misleading if patients transition to a failed state after longer follow-up, given the slow-growing nature of Tuberculosis. To accommodate this, an alternative approach could be to consider an interim analysis with only futility bounds and then evaluate efficacy outcomes at the final analysis; the decision rules used in this case are summarised in Tables A5 and A6 for 3-arm and 4-arm designs, respectively. In this way, the interim analysis considers the primary endpoint evaluated at the end of the treatment, while the final analysis would take place some months (for example 6 or 12) post-randomization. All other design features considered above remain the same.
In the following section, we will explore the operating characteristics of the proposed design when applied to the TB trial.

| Setting
In the setting of the motivating trial we consider a study where patients are randomised with equal probability to receive either the current first-line regimen $2HR_{10}Z_{25}E/4HR_{10}$ at a fixed duration of 6 months or the experimental regimen $R_{40}Z_{35}L_{15}$ at a duration determined in two sub-populations defined by their baseline factors. Within each sub-population, the allocation between the arms will be balanced (low-risk 1:1:1:1, high-risk 1:1:1). Patients are enrolled and divided into sub-groups depending on their prognostic factors. The distribution in the screened population is expected to be approximately 2:1 in favour of the low-risk group and we consider around 10% losses to follow-up. This is to ensure that the total sample size is feasible for the trial. Note that losses to follow-up in the intervention arms in the low-risk stratum could be lower due to shorter treatment duration, but could also be higher due to safety or lack of efficacy in some of the lower duration groups.
The assumed true response rates in the control arm for the low-risk and high-risk sub-populations are 92% and 86%, respectively. One interim analysis is planned after half of the total population has their primary endpoints evaluated. The design is constructed to control the FWER at level $\alpha = 5\%$ and to reach 80% power under various power strategies (see details in Section 4.3).
Below, we specify the required parameters to design the study, such as the non-inferiority margin, the shape of the critical bounds, and the appropriate power configurations together with the rationale that was used in the discussion with the clinical team to justify them.

| Non-inferiority margin
In a non-inferiority setting, the determination of the non-inferiority margin is a critical step and it is often challenging. 23 In recent Phase III trials in TB, non-inferiority margins of between 6% and 12% have been accepted by regulators in different contexts (see Table 1). With the exception of STAND, trials evaluating relative reductions in the duration of treatment of a third have all used non-inferiority margins of 6%-8%, whereas trials with more ambitious goals have used 10% for a halving of duration (STREAM) and 12% for a reduction of two-thirds (TRUNCATE-TB). According to FDA guidance on setting non-inferiority margins, 23 and assuming cure rates of 30% and 90% in those receiving no treatment and standard of care, respectively, a margin of 6% ($M_2$ = the largest clinically acceptable difference (degree of inferiority) of the test drug compared to the active control) preserves 90% of the treatment effect of Standard of Care (SOC), while 12% ($M_1$ = the entire effect of the active control assumed to be present in the NI study) preserves 80%. The relative reduction in duration to be evaluated in REStrUCTuRe is between one third (in either prognostic sub-population) and two-thirds (in the low-risk prognostic sub-population). Selecting a non-inferiority margin of 10% for the trial was therefore thought to be consistent with the range used in previous recent clinical trials, and it preserves 83% of the estimated treatment effect of SOC. Thus, the choice of 10% lies between $M_2$ and $M_1$, and the risk is well balanced against the expected benefit in adherence and reduction in resistance. For this study, the NI margin is chosen to be equal in the two sub-populations to have consistent design parameters for the two separate trials. However, different non-inferiority margins could also be assumed for each trial.
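The percentages of preserved effect quoted above follow from simple arithmetic on the assumed cure rates. The sketch below reproduces them; the function name `preserved_fraction` is ours, introduced only for illustration, and the 30% and 90% cure rates are the assumptions stated in the text.

```python
def preserved_fraction(margin_pct, cure_soc_pct=90, cure_none_pct=30):
    """Fraction of the standard-of-care effect retained at a given NI margin.

    The effect of standard of care is the difference in cure rates between
    standard of care and no treatment (assumed 90% and 30% in the text).
    """
    effect_pct = cure_soc_pct - cure_none_pct  # 60 percentage points
    return 1 - margin_pct / effect_pct


# Margins discussed in the text: 6%, the chosen 10%, and 12%.
for margin in (6, 10, 12):
    print(f"margin {margin}% preserves {preserved_fraction(margin):.0%} of the SOC effect")
```

Under these assumptions the chosen 10% margin sits between the 6% and 12% margins in terms of preserved effect.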

| Type of power configuration and proposal for the design
The next step consists of evaluating various power strategies for the objective of the trial, taking into account the recruitment feasibility and the distribution of patients across sub-populations. We analyse the differences among the strategies for both sub-populations and we describe how the final decision on the design was reached with the clinical team. We consider the setting as described at the beginning of this section and a non-inferiority margin of 10% for both sub-populations. In order to ensure strong control of the FWER, 16 we consider the case where the same bounds are used for each arm, with $k \in K_h$ for the high-risk sub-population and $k \in K_l$ for the low-risk sub-population.
We derive the critical bounds separately for each sub-population as these are considered as separate trials and no combined hypothesis testing is planned. First, we determine analytically, relying on the asymptotic normal approximation of the vector of test statistics, the sample sizes and the critical values for each sub-population and power configuration. The normal approximation of the test statistics and the design's performance are then evaluated using simulations. The critical bounds and the sample size are tuned so that the probability to reject at least one hypothesis is as close as possible to 5% under the global null and reaches 80% under the alternative hypothesis. Specifically, if after the first round of simulations the pre-specified α-level is found to be inflated under the global null and/or the desired power level is not reached under the power configuration, then further simulations are run over a grid of values for the critical bounds and the sample sizes, constructed around the analytical values previously found. The pair of values (critical bounds and sample size) that satisfies the FWER and power requirements is finally selected.
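The idea of checking a normal approximation by simulation can be illustrated on a much simpler problem than the ORD itself. The sketch below is not the authors' procedure: it assumes a single experimental arm compared with control in one stage, a Wald statistic with an unpooled standard error, a hypothetical sample size of 150 per arm, and a one-sided critical value of 1.645. It estimates the rejection rate when the experimental arm sits exactly on the 10% margin (type I error) and when it matches the 92% control rate (power).

```python
import math
import random


def ni_z_statistic(x_exp, x_ctl, n, margin):
    """Wald Z for H0: p_exp - p_ctl <= -margin, with an unpooled standard error."""
    p_exp, p_ctl = x_exp / n, x_ctl / n
    se = math.sqrt(p_exp * (1 - p_exp) / n + p_ctl * (1 - p_ctl) / n)
    if se == 0:  # both arms all-success or all-failure: fall back to the sign
        return float("inf") if p_exp - p_ctl + margin > 0 else float("-inf")
    return (p_exp - p_ctl + margin) / se


def simulated_rejection_rate(p_ctl, p_exp, n, margin, critical=1.645,
                             reps=2000, seed=1):
    """Proportion of simulated two-arm trials in which non-inferiority is declared."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x_ctl = sum(rng.random() < p_ctl for _ in range(n))
        x_exp = sum(rng.random() < p_exp for _ in range(n))
        if ni_z_statistic(x_exp, x_ctl, n, margin) > critical:
            rejections += 1
    return rejections / reps


# Hypothetical n = 150 per arm; 92% control response and 10% margin as in the text.
type_one_error = simulated_rejection_rate(p_ctl=0.92, p_exp=0.82, n=150, margin=0.10)
power = simulated_rejection_rate(p_ctl=0.92, p_exp=0.92, n=150, margin=0.10)
print(f"estimated type I error: {type_one_error:.3f}, estimated power: {power:.3f}")
```

If the estimated type I error drifted above the nominal level, the critical value or sample size would be adjusted and the simulation repeated, which is the spirit of the grid-based tuning described above.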
The first six rows in Table 2 provide the results for each sub-population and power configuration under the global and alternative hypotheses for the ORD with triangular critical bounds (given in Table 1 in the Supplementary Materials). The remaining rows of this Table report the results for the competing approach that is described in Section 6.
One possibility for the TB trial is to use a configuration of "reject at least one" for the high-risk sub-population and "reject all hypotheses" in the low-risk sub-population. This strategy would require 624 patients in the low-risk sub-population and 522 in the high-risk sub-population, for a maximum sample size of 1146 and an expected sample size of 826 under the alternative hypothesis. Inflating the sample size by 10% for loss to follow-up (a standard inflation of the required sample size) gives a maximum sample size of 1261 and an expected sample size of 909. Under the null hypothesis, the sample sizes are expected to be 375 and 316 for the low-risk and high-risk sub-populations, leading to a total of 691 patients.
Alternatively, ensuring the rejection of at least one hypothesis in each sub-population with the same design parameters, the maximum sample sizes for the low-risk and high-risk sub-populations were 432 and 522, respectively, leading to a total maximum sample size of 954. However, under the alternative hypothesis, the expected sample sizes were 321 and 377, a total expected sample size of 698. Adjusting these for 10% loss to follow-up gives maximum and expected sample sizes of 1050 and 768, respectively. Under the null hypothesis, the sample sizes are expected to be 260 and 316 for the low-risk and high-risk sub-populations, leading to a total expected sample size of 576 patients.
Although the maximum and expected sample sizes under the alternative hypothesis are 20% and 18% larger in the suggested strategy than in the global "reject at least one" strategy, the first approach is more 'efficient' in the following sense: if we power the design to reject at least one hypothesis for the high-risk sub-population and to reject all hypotheses for the low-risk sub-population, the ratio of low-risk to high-risk participants is 1.2 instead of 0.83, which is closer to the 2:1 ratio (in favour of the low-risk sub-population) expected in the screened population. If instead the designs are both powered to reject at least one hypothesis, we are further from the expected ratio in the screened population. Thus, our suggestion for the design is to use a configuration of "reject at least one" for the high-risk sub-population and "reject all hypotheses" in the low-risk sub-population at a β level of 0.20 and a one-sided $\alpha = 0.05$.
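The comparison of the two strategies can be retraced with a few lines of arithmetic. The sketch below uses only the sample sizes quoted in the text; the helper `inflate_for_loss` and the dictionary layout are ours, and the 10% inflation is assumed to be a multiplication by 1.1 rounded up, which reproduces the quoted totals.

```python
def inflate_for_loss(n, loss_pct=10):
    """Inflate n by loss_pct percent and round up, using integer arithmetic."""
    return -(-n * (100 + loss_pct) // 100)


# Figures quoted in the text:
# (low-risk maximum, high-risk maximum, total expected size under the alternative)
strategies = {
    "reject all (low-risk) / at least one (high-risk)": (624, 522, 826),
    "reject at least one (both)": (432, 522, 698),
}

summary = {}
for name, (low_max, high_max, ess_alt) in strategies.items():
    total_max = low_max + high_max
    summary[name] = {
        "total_max": total_max,
        "total_max_inflated": inflate_for_loss(total_max),
        "ess_alt_inflated": inflate_for_loss(ess_alt),
        "low_to_high_ratio": low_max / high_max,
    }
    print(name, summary[name])

mixed = summary["reject all (low-risk) / at least one (high-risk)"]
at_least_one = summary["reject at least one (both)"]
# Relative increase of the suggested strategy over the "at least one" strategy.
extra_max = mixed["total_max"] / at_least_one["total_max"] - 1
extra_ess = 826 / 698 - 1
print(f"extra maximum size: {extra_max:.0%}, extra expected size: {extra_ess:.0%}")
```

The low-risk to high-risk ratios of about 1.2 and 0.83 are what the argument above relies on when comparing against the expected 2:1 screened population.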

| Critical boundaries' shape
One of the features of the proposed design is the shape of the critical boundaries used to test the hypotheses. In order to decide which shape of the bounds best suits the objective of the trial, we explore the operating characteristics of the design under different boundary shapes. As before, the same bounds are used for each arm, with $k \in K_h$ for the high-risk sub-population and $k \in K_l$ for the low-risk sub-population.
TABLE 1 Design parameters of recent non-inferiority trials in TB.
In this section, results are provided without tuning of the parameters. The expected sample sizes (ESS), that is the mean number of patients recruited to the trial before it is terminated, are also measured under the global and alternative hypotheses. The numerical results are found using R 27 and $10^6$ replicate simulations. Table 3 provides the maximum sample sizes, the expected sample sizes and the probabilities to reject at least one hypothesis, all hypotheses or at least the first two for each sub-population and using Pocock, O'Brien and Fleming, and Triangular boundaries. For each power configuration and for each sub-population, the design with O'Brien and Fleming critical bounds requires the smallest total maximum sample size. The efficiency of the proposed design, however, is measured by the ESS. Indeed, the two-stage design allows the maximum total sample size to be reduced by a fraction that depends on the critical bounds used at the interim and final analyses. For the O'Brien and Fleming bounds, the reduction in sample size (RSS), defined as $\mathrm{RSS} = 1 - \mathrm{ESS}/\mathrm{MaxSS}$, is quite small under the global and alternative hypotheses: a reduction of up to 1% under the global null for each design and up to 14% under the alternative hypothesis. Although triangular bounds require, for each design and sub-population, almost the highest total maximum sample size compared to the other critical bounds (which is still feasible to recruit given previous TB studies), this shape of bounds provides the largest reduction in sample size under the global null and alternative hypotheses: a reduction of around 40% under the global null for each design and up to 29% under the alternative hypothesis. Thus, in order to minimize the expected sample size under the global null and alternative hypotheses, we decided to consider triangular boundaries for the REStrUCTuRe trial.
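The reduction in sample size is straightforward to compute from any pair of expected and maximum sample sizes. As an illustration only, the sketch below applies the definition to the tuned triangular-bound figures quoted earlier in the text for the suggested strategy (not to the untuned Table 3 values, which are not reproduced here); the function name is ours.

```python
def reduction_in_sample_size(ess, max_ss):
    """RSS = 1 - ESS / MaxSS: expected saving relative to the maximum sample size."""
    return 1 - ess / max_ss


# Tuned triangular design, figures quoted in the text for the suggested strategy.
rss_low_null = reduction_in_sample_size(ess=375, max_ss=624)    # low-risk, global null
rss_high_null = reduction_in_sample_size(ess=316, max_ss=522)   # high-risk, global null
rss_total_alt = reduction_in_sample_size(ess=826, max_ss=1146)  # both, alternative

print(f"low-risk, null: {rss_low_null:.1%}")
print(f"high-risk, null: {rss_high_null:.1%}")
print(f"total, alternative: {rss_total_alt:.1%}")
```

These values are in line with the reductions of roughly 40% under the global null and somewhat under 30% under the alternative reported for triangular bounds.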

| Proposal for the design with only futility bounds at the interim analysis
In this section, results for the alternative approach described in Section 3.5 are provided. Table 4 provides the maximum sample sizes, the actual maximum sample sizes (that is, the mean number of patients recruited to the trial when all treatment arms proceed to the final analysis), the expected sample sizes and the probabilities to reject all, at least one or the first two hypotheses for each sub-population, using triangular boundaries and assuming that around 30 patients per month are recruited in each sub-population.
In this setting, we would require 450 and 544 patients (470 and 560 in the worst-case scenarios, respectively) in order to ensure 80% power to reject at least one and all hypotheses for the high-risk and the low-risk groups, respectively. The trial is expected to end after 15.0 and 17.6 months for the high-risk group and after 18.4 and 21.9 months for the low-risk group under the null and alternative hypotheses, respectively.
Note: The alternative hypothesis for the high-risk sub-population is: Cells coloured in blue correspond to the chosen power configurations for the TB trial. Results are provided using $10^6$ replications. Values in bold refer to the target probability (around 80%) for each power configuration.

| Competing approaches
In this section, we evaluate the operating characteristics of the proposed design under various treatment configurations. The proposed non-inferiority design is compared to the MAMS(m) design proposed by Magirr et al., 18 modified so that all treatment arms that have not crossed any critical bound continue to the next stage. This design has been adapted for the non-inferiority setting. The FWER expression for MAMS(m) is the same as derived by Magirr et al. 18 but the power expression changes. Tables A2-A4 in the Appendix highlight the differences in the decision rules between the ORD and MAMS(m) designs for 3-arm 2-stage and 4-arm 2-stage settings.

| Setting
We consider the setting as described in Section 4 and we evaluate the performance of the design under intermediate treatment effect configurations when the treatment effect on the first arm is fixed to be zero, θ For the high-risk sub-population, both designs are powered at 80% to reject at least one hypothesis, while for the low-risk sub-population the designs are powered at 80% to reject all hypotheses when all durations have the same response rates as the standard of care. In addition to the achieved power, the efficiency of the designs is measured by their expected sample size (ESS). However, the ESS depends on the recruitment rate and the time of the interim analysis, and it might be close or equal to the maximum sample size of the selected design. Another metric to look at is the time to first positive claim. A difference could be observed on this metric, but it would require assumptions on recruitment speed in order to gauge whether a sequential design has the potential to decrease the study cost. It is worth noting, however, that the time to first positive claim does not necessarily take into account the correct decision at the end of the trial (e.g., to reject two or more hypotheses when all arms are non-inferior to the control).
Note: The alternative hypothesis for the high-risk sub-population is: Results are provided using $10^6$ replications. Values in bold refer to the target probability (around 80%) for each power configuration.

| Numerical results
In this section, we provide the results of the simulation studies for the two separate sub-populations. Table 2 provides the results for each sub-population and power configuration under the global and alternative hypotheses for the chosen power configurations of the ORD (blue rows in Table 2) and the MAMS(m). The designs' performance under the considered non-null treatment configurations is shown in Figures 2 and 3, which give for each scenario the probability to reject the null hypotheses, the expected sample size, the duration of the trial and the time to first positive claim for the 3-arm and 4-arm designs, respectively. We consider a recruitment rate of around 30 patients per month for each sub-group, and where there is no rejection, the time to first positive claim is set to the end of the trial.
Note: The global null hypothesis for the high-risk sub-population is: The alternative hypothesis for the high-risk sub-population is: A recruitment rate of around 30 patients per month is assumed for both risk groups. Results are provided using $10^5$ replications. Values in bold refer to the target probability (around 80%) for each power configuration.

| High-risk sub-population
Total maximum sample sizes of 522 and 444 patients are required to reach a power of 80% to reject at least one null hypothesis under $\theta = (0, 0)$ for the ORD and the MAMS(m), respectively. The sample size is lower for the MAMS design because we are powering to reject at least one hypothesis (rather than all hypotheses): the MAMS design does not take into account the order in rejecting the hypotheses, and thus needs a smaller sample size than the ORD to reject at least one hypothesis. Both designs control the FWER under the global null at level α. Despite the differences in total maximum sample size, the ORD has a power of around 70% to reject all true hypotheses under $\theta = (0, 0)$, whereas the MAMS(m) design reaches only 52%. Figure 2 shows that the ORD has higher power to reject all hypotheses (which coincides with the probability to reject $H_{02}$) than the MAMS(m) for all considered scenarios, a difference of up to 18%. For all scenarios, the ORD has around 80% power to reject at least one hypothesis, while the MAMS(m) design has lower power when $\theta \le -0.01$, a difference of up to 13%. When $\theta > 0$, the MAMS(m) has higher power to reject at least one hypothesis compared to the ORD, a difference of at most 16%. The ORD also has higher power (by up to 10%) to reject the second hypothesis when $\theta < 0.02$ compared to the MAMS(m), but lower power (by up to 16%) when $\theta \ge 0.02$.
In terms of ESS, the average number of patients is expected to be at most 10% higher for the ORD compared to the MAMS(m), depending on the considered scenario. In terms of duration of the trial, the ORD is expected to be slightly longer compared to the MAMS for each scenario, and the time to first positive claim is smaller (by up to 4 months) for the MAMS design compared to the ORD in each scenario.
Note that the ORD_reject all and the ORD_reject H02 points overlap in the Figure. A recruitment rate of around 30 patients per month is assumed for the high-risk group. Results are provided using $5 \times 10^4$ replications.

| Low-risk sub-population
Total maximum sample sizes of 624 and 832 patients are required to reach a power of around 80% to reject all hypotheses under $\theta = (0, 0, 0)$ for the ORD and the MAMS(m), respectively. Here, we power to reject all hypotheses and thus we require more patients in MAMS compared to the ORD. Both designs control the FWER under the global null at level α. Figure 3 shows that the ORD has higher power to reject all hypotheses than the MAMS(m) for almost all considered scenarios, a difference of up to 5%. For all scenarios, the ORD has around 92% power to reject at least one hypothesis, while the MAMS(m) design has higher power for all scenarios, a difference of at most 6.7%. The ORD also has higher power to reject the first two hypotheses than the MAMS(m) for almost all considered scenarios, a difference of up to 7.7%.
In terms of ESS, the ORD can provide a reduction of up to 23%, depending on the scenario, compared to the MAMS(m). In terms of duration of the trial, the ORD is expected to be shorter (by up to 6 months) than the MAMS design for each scenario, with larger differences for scenarios with two negative treatment effects. The time to first positive claim is smaller (by up to 5 months) for the MAMS design compared to the ORD for scenarios where all treatment effects are above zero. Overall, the non-inferiority ORD can provide higher power to reject all hypotheses and smaller expected sample size compared to the non-inferiority MAMS(m) design when it is powered to reject all hypotheses and the order assumption among the treatment effects is satisfied. The MAMS design outperforms the ORD when the order assumption $\theta^{(1)} \ge \theta^{(2)} \ge \theta^{(3)}$ is not satisfied. This is expected because the MAMS design does not take into account the order in rejecting the hypotheses.

| TIMING OF THE INTERIM ANALYSES
In the original ORD 16 and in the results obtained above, the information time was assumed to be the same for all treatments. When considering different durations of treatment, however, the information for each arm accumulates at different times. For example, consider the high-risk sub-population, where the standard regimen has a fixed duration of 6 months, while the experimental regimens last 4 and 3 months, respectively. As represented in Figure 4, assume that we start to recruit patients on each arm at the same time. At time $t_1$ the first block of patients is recruited and allocated to the three treatment arms. Similarly, at times $t_2$ and $t_3$ the second and third blocks of patients are recruited and allocated to the treatment arms, respectively. The recruitment process continues in this way. However, while the patients in the first block, recruited at time $t_1$, have their efficacy outcomes evaluated after 6 months if they were allocated to the control arm, they have their efficacy outcomes evaluated after 4 or 3 months if they were allocated to the longest or shortest experimental duration arm, respectively. Thus, information accumulates at different times, and at the time of the interim analysis a different number of patients in each arm has completed their treatment. Hence, for this clinical trial setting, different strategies for the timing of the interim analysis can be evaluated. Below, we propose and examine two possible strategies. The first one consists of having the same amount of information (that is, the same number of patients with their primary endpoints evaluated) on each treatment duration at the time of the planned interim analysis. We refer to this strategy as SI (Same information at the Interim analysis). Thus, the interim analysis is performed when the same number of patients $n_1 = n_1^{(k)}$ have completed their treatment $k$. The recruitment of patients ends when the total number of patients in the control arm is equal to $2n_1$.
This means that at the end of the trial, more patients will be recruited in the longer duration arms that are still in the trial compared to the shorter durations. Indeed, in order to have the same number of patients in each arm at the interim analysis, each month we need to recruit more patients on the longer durations and fewer in the shorter ones. Thus, for this strategy, we want to modify the allocation ratio in order to recruit more patients under longer treatment durations (as they require more time in order to have their primary outcomes evaluated) and fewer patients on the shorter arms. Figure 5 shows a schematic of the SI strategy.

FIGURE 4 Schematic of the accumulation of information for the high-risk sub-population. C is the control arm, L is the longest active duration and S is the shortest active duration. Vectors indicate the length of the treatment. Patients are recruited and allocated to the treatment arms at times $t_i$, $i \in \{1, 2, \ldots, T\}$, with $T$ being the end of the trial.
In the second strategy, instead, the same amount of information is observed in the final analysis: the number of patients in each arm is equal to $n_2 = n_2^{(k)}$. We refer to this strategy as SF (Same information at the Final analysis). In this case, more information on the shortest durations will be accumulated at the interim analysis (as represented in Figure 4), which is done when half of the population in the control group has been observed. Note that at the end of the trial all accumulated data are used and included in the analysis. Thus, no data are thrown away and patients will still be followed up if shorter treatment arms have been dropped for futility at the interim analysis.
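The imbalance of information at the SF interim analysis can be illustrated with a small deterministic sketch. It is not the simulation used in the paper: it assumes equal allocation across arms, a constant recruitment rate, no dropout, and it interprets "half of the population in the control group has been observed" as $n_2/2$ control patients having completed treatment. The function name and the per-arm size of 168 are hypothetical, chosen only for illustration.

```python
def completers_at_sf_interim(durations, rate_per_month, n_final):
    """Completed patients per arm when the control arm reaches n_final / 2 completers.

    Assumptions (not from the paper): equal allocation, constant recruitment
    rate, no dropout. durations[0] is the control arm duration in months.
    """
    arm_rate = rate_per_month / len(durations)  # patients per arm per month
    # Calendar time at which half of the control arm has completed treatment:
    # the first patient completes after durations[0] months, then completers
    # accrue at the per-arm recruitment rate.
    interim_time = durations[0] + (n_final / 2) / arm_rate
    completed = [
        # Each arm recruits at most n_final patients, so cap the completers.
        min(float(n_final), max(0.0, arm_rate * (interim_time - d)))
        for d in durations
    ]
    return interim_time, completed


# Illustrative high-risk setting: control 6 months, experimental 4 and 3 months.
interim_time, completed = completers_at_sf_interim([6, 4, 3], 30, 168)
print(interim_time, completed)
```

Under these assumptions the shorter arms have more completed patients than the control arm at the interim analysis, which is the qualitative behaviour described for SF.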

| Strategy SI
Let us denote by $x_k$ the number of patients recruited each month in arm $k$, with $k \in \{0\} \cup \{K_h\}$ for the high-risk and $k \in \{0\} \cup \{K_l\}$ for the low-risk sub-population. Let $d_k$ be the expected number of months that are needed to recruit $x_k$ patients on treatment arm $k$, and let $D = (D_0, D_1, D_2, D_3)$ and $D = (D_0, D_1, D_2)$ be the vectors of durations (in months) of treatments that are tested for the low-risk and high-risk sub-populations, respectively. Let $R$ be the number of patients that are recruited per month in each sub-population and $n_1 = n_1^{(k)}$ be the number of patients in each arm up to stage 1. Thus, for the low-risk sub-population we satisfy the following and for the high-risk sub-population

FIGURE 5 Schematic of the SI strategy for the high-risk sub-population. C is the control arm, L is the longest active duration and S is the shortest active duration.
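One way to make the SI idea concrete is the following sketch. It is an illustrative simplification and not necessarily the system of equations used in the paper: it assumes a constant monthly recruitment rate $x_k$ on arm $k$ with $\sum_k x_k = R$, no dropout, and requires that every arm has exactly $n_1$ completed patients at a common interim time $t$, that is, $x_k (t - D_k) = n_1$ for all $k$. The function name and the value $n_1 = 80$ are hypothetical.

```python
def si_allocation(durations, rate_per_month, n_interim, tol=1e-10):
    """Solve for the interim time t and monthly arm rates x_k such that
    x_k * (t - D_k) = n_interim for every arm and sum(x_k) = rate_per_month.

    Assumptions (not from the paper): constant recruitment, no dropout.
    """
    n_arms = len(durations)
    longest = max(durations)

    def excess(t):
        # Total monthly recruitment implied by interim time t, minus R.
        return sum(n_interim / (t - d) for d in durations) - rate_per_month

    # Bracket the root: excess(lo) >= 0 and excess(hi) <= 0, and excess is
    # strictly decreasing in t for t > longest, so bisection applies.
    lo = longest + n_interim / rate_per_month
    hi = longest + n_arms * n_interim / rate_per_month
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
    t = (lo + hi) / 2
    rates = [n_interim / (t - d) for d in durations]
    return t, rates


# Illustrative high-risk setting: durations 6, 4 and 3 months, R = 30.
t_interim, rates = si_allocation([6, 4, 3], 30, 80)
print(round(t_interim, 2), [round(x, 2) for x in rates])
```

Under these assumptions the longest duration receives the largest monthly allocation and the shortest duration the smallest, matching the qualitative description of the SI strategy.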

| Numerical evaluation of the two strategies
We consider the same design settings as described in Section 4 and the suggested design proposed in Section 4.3. Based on the clinical team's experience, we provide the results considering a recruitment rate of around 30 patients per month, $R = 30$. Thus, we expect to recruit around one patient per day, and each patient is randomly allocated to a treatment arm $k$ with probability $p_k = x_k / R$. We consider the following treatment durations (in months): $D = (D_0, D_1, D_2, D_3) = (6, 4, 3, 2)$ and $D = (D_0, D_1, D_2) = (6, 4, 3)$ for the low-risk and high-risk sub-population, respectively. The numerical results are found using $10^5$ replicate simulations. Table 5 provides the operating characteristics of the two strategies under the global and alternative hypotheses for each sub-population, together with the theoretical maximum sample size, the actual total sample size (that is, the mean number of patients recruited to the trial when all treatment arms proceed to the final analysis) and the expected sample size.
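The allocation rule $p_k = x_k / R$ can be sketched as follows. This is only an illustration of the randomisation step, not the authors' simulation code; the monthly rates $x_k = (12, 10, 8)$, the number of patients and the seed are hypothetical values chosen so that the rates sum to $R = 30$.

```python
import random


def allocate_patients(arm_rates, n_patients, seed=0):
    """Randomly allocate patients to arms with probability p_k = x_k / R,
    where R is the sum of the monthly arm rates x_k."""
    total_rate = sum(arm_rates)  # R, the overall monthly recruitment rate
    probabilities = [x / total_rate for x in arm_rates]
    rng = random.Random(seed)  # fixed seed for a reproducible illustration
    arms = rng.choices(range(len(arm_rates)), weights=probabilities, k=n_patients)
    counts = [arms.count(k) for k in range(len(arm_rates))]
    return probabilities, counts


# Hypothetical monthly rates for control, longest and shortest durations.
probabilities, counts = allocate_patients([12.0, 10.0, 8.0], 360)
print(probabilities, counts)
```

Because allocation is random, the realised number of patients per arm varies around its expectation, which is one reason the actual sample size in a simulation can differ from the theoretical maximum.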
For the high-risk sub-population, a total of 476 and 504 patients (502 and 519 actual maximum sample sizes, respectively) are required for the SI and SF strategies, respectively, if the design parameters provided in Table 6 are used in the trial. It can be observed that the SF strategy requires a larger total sample size than the SI strategy in order to reach 80% power to reject at least one hypothesis. However, the ESS at the interim analysis for both strategies is still below the actual maximum sample size. Thus, on average, not all patients are recruited at the time of the interim analysis. The expected duration of the trial under the alternative hypothesis is estimated to be around 15.6 and 19.2 months for the SI and the SF strategies, respectively. This small difference in durations is due to the difference in the number of patients that are required to be observed in each arm for the first interim analysis: the interim analysis in SI is done when the same number of patients has completed their treatment in each arm, while the interim analysis in SF is done as soon as the last patient in the control arm has ended the treatment.
For the low-risk sub-population, a total of 566 and 584 patients (586 and 600 actual maximum sample sizes, respectively) are required for the SI and SF strategies, respectively, in order to reach 80% power to reject all hypotheses, if the design parameters provided in Table 6 are used in the trial. As for the high-risk sub-population, the ESS at the interim analysis for both strategies is still below the actual maximum sample size. Under the alternative hypothesis, the total duration of the trial is expected to be around 18.5 and 20.8 months for the SI and SF strategies, respectively. As for the high-risk sub-population, the difference in durations reflects the differences in the number of patients that are required to be observed in each arm for the first interim analysis.
Overall, the results suggest that the strategy that matches the sample size at the interim analysis (SI) minimizes the total maximum sample size under the considered simulation scenarios, while the SF strategy minimizes the expected sample size under the global null and alternative hypotheses. The two strategies have almost the same duration under the two hypotheses; small differences are due to the different sample sizes and different probabilities of stopping at the interim and final analyses. Thus, in order to minimize the expected sample size, the SF strategy is preferred.

| DISCUSSION
The aim of this work was to describe the application of the order restricted design proposed by Serra et al. 16 in the context of a tuberculosis trial. In this clinical trial setting, non-inferiority trials are the norm and hence an extension of the original design has been proposed. Practical considerations were provided regarding how to choose design parameters such as the non-inferiority margin and the shape of the critical bounds for the considered TB trial. Theoretical and practical considerations regarding several types of power configurations were provided, and two different strategies were proposed in order to take into account the fact that information accumulates at different times when multiple treatment durations are simultaneously tested in the same trial.
The primary objective of the trial was to identify the shortest possible treatment duration in each sub-population. Thus, the trial is to be powered to reject all correct hypotheses. However, alternative power strategies can be more feasible, for example when even some reduction in the duration of the treatment is of interest or when resources are limited. In these cases, one could consider powering the design to reject the correct hypotheses relating to a particular number of experimental arms (rather than all of them).
The ORD has been shown to be an efficient design that can be applied when multiple treatment durations are simultaneously tested in the same trial. Nevertheless, all the considerations provided in this work were specific to the REStrUCTuRe trial. The choice of the primary endpoint was driven by the clinical investigator in the team, and the design in the manuscript is proposed under the assumption that this endpoint is valid. This manuscript focuses on the question of how one could design such a study once the particular choice of the endpoint has been agreed. The methodology proposed in this work can be applied to different definitions of the primary endpoint, for example endpoints that consider a minimum follow-up for patients after they have ended their treatment. Indeed, even though culture might be negative, bacteria may be left and TB takes a long time to grow, such that recurrence could only be observed months after treatment completion (while most recurrences appear to happen relatively shortly after completion). Culture conversion at treatment completion is not a perfect surrogate for long-term success, which is why the STEP design 29 has been proposed to de-risk subsequent Phase III studies. The efficiency of the proposed adaptive design with another definition of the primary endpoint would be determined by the specific setting, specifically by the expected recruitment rate as well as the time of follow-up for every patient. This, however, holds for any adaptive design. 30 In this trial, we have assumed an interim analysis when half of the total maximum sample size has completed their treatment. However, depending on the recruitment speed, it might happen that all patients are already enrolled at the time of the interim analysis. Thus, other practical considerations and modifications of the design should be considered depending on the specific trial setting.

Note: A recruitment rate of around 30 patients per month is assumed for both risk groups. Results are provided using $10^5$ replications. Values in bold refer to the target probability (around 80%) for each power configuration.
In addition, one of the limitations of this work is that it relies on the asymptotic normal distribution of the test statistics. Especially when the sample size is small, the normal distribution could be a poor approximation for the test statistic. Moreover, alternative strategies for the different timings of the analyses can be explored, that is, staggering the opening of the treatment durations in order to get the same amount of information at the interim or final analyses.

AUTHOR CONTRIBUTIONS
All authors contributed equally to the presented work.
To power the design in order to reject the first and second treatments in a 4-arm 2-stage design at level $1 - \beta$, the following Equation (A3) needs to be satisfied: