Adaptive multiarm multistage clinical trials

Two methods for designing adaptive multiarm multistage (MAMS) clinical trials, originating from conceptually different group sequential frameworks, are presented, and their operating characteristics are compared. In both methods pairwise comparisons are made, stage-by-stage, between each treatment arm and a common control arm with the goal of identifying active treatments and dropping inactive ones. At any stage one may alter the future course of the trial through adaptive changes to the prespecified decision rules for treatment selection and sample size reestimation, and notwithstanding such changes, both methods guarantee strong control of the family-wise error rate. The stage-wise MAMS approach was historically the first to be developed and remains the standard method for designing inferentially seamless phase 2-3 clinical trials. In this approach, at each stage, the data from each treatment comparison are summarized by a single multiplicity adjusted P-value. These stage-wise P-values are combined by a prespecified combination function and the resultant test statistic is monitored with respect to the classical two-arm group sequential efficacy boundaries. The cumulative MAMS approach is a more recent development in which a separate test statistic is constructed for each treatment comparison from the cumulative data at each stage. These statistics are then monitored with respect to multiplicity adjusted group sequential efficacy boundaries. We compared the powers of the two methods for designs with two and three active treatment arms, under commonly utilized decision rules for treatment selection, sample size reestimation and early stopping. In our investigations, which were carried out over a reasonably exhaustive exploration of the parameter space, the cumulative MAMS designs were more powerful than the stage-wise MAMS designs, except for the homogeneous case of equal treatment effects, where a small power advantage was discernible for the stage-wise MAMS designs.

INTRODUCTION
Adaptive multiarm multistage (MAMS) clinical trials compare multiple treatment arms in pairwise fashion to a common control arm over two or more stages. These trials are characterized by interim looks at the accumulating data in order to either stop the trial early for overwhelming efficacy, stop the trial early for futility, or make mid-course adaptive changes such as dropping ineffective treatment arms or changing the sample size, the error spending function, and the number of future looks. Two approaches, originating from different conceptual frameworks, have evolved for constructing adaptive MAMS designs in a statistically valid manner. We refer to them, respectively, as stage-wise MAMS and cumulative MAMS, because of the manner in which the test statistic is constructed by each method. Although both methods may be viewed as multivariate extensions of the classical two-arm group sequential design, they differ in how they control the multiplicity inherent in an adaptive MAMS design. The stage-wise MAMS approach combines independent multiplicity adjusted P-values from the different stages of the trial in accordance with a prespecified combination function and utilizes closed testing 1 to ensure strong control of the family-wise error rate (FWER). It provides full flexibility, at the end of each stage, to make data-dependent adaptive changes, such as selecting a subset of the initial treatments or reestimating the sample size, for the remainder of the trial. Critical values for early efficacy stopping are obtained by applying the methods developed for classical two-arm group sequential designs. 2 Bauer and Köhne 3 introduced this idea for two-stage designs with multiple arms and Bauer and Kieser 4 elaborated it further to include treatment selection at the end of stage 1. 
Posch et al 5 introduced a larger family of multiplicity adjusted P-values for the two stages, proposed the inverse normal combination function for combining them, and discussed parameter estimation at the end of the trial. One can directly extend this approach to J > 2 stages, as was performed by Lehmacher and Wassmer 6 for the special case of two-arm trials and by Magirr, Stallard, and Jaki 7 (Section 3.1) for multiarm trials.
The cumulative MAMS approach extends the usual two-arm group-sequential efficacy boundaries 2 to the multiarm setting. A separate cumulative test statistic having an independent increments structure is obtained for the pairwise comparison of each treatment arm to a common control arm, and is monitored stage by stage. Efficacy can be claimed for any treatment arm whose statistic crosses an efficacy boundary. These efficacy boundaries are derived from the distribution of the maximum of the test statistics under the global null hypothesis that all treatment arms are ineffective. They provide strong control of the FWER. Magirr, Jaki and Whitehead 8 generated these boundaries for the maximum of the Wald statistics. Ghosh et al 9 reduced the computational complexity of this approach by using the maximum score statistic, in place of the maximum Wald statistic. In both these approaches, a futility boundary could be included for dropping nonperforming treatment arms at one or more stages. However, neither Reference 8 nor Reference 9 can allow for data-dependent adaptive changes such as treatment selection or sample size reestimation. To obtain this flexibility it is necessary to incorporate both closed testing 1 and conditional error rate methodology, 10,11 into the testing framework as was done by Koenig et al 12 for two-stage designs with no early stopping and by Magirr, Stallard and Jaki 7 (Section 3.2) more generally. This paper has two objectives. First, we show how to extend the cumulative MAMS approach of Ghosh et al 9 to permit adaptive dose selection and sample size reestimation by use of closed testing and preservation of conditional error rates. Our approach is similar to that of References 12 and 7, but presented within the group sequential framework of Reference 2. 
For completeness we also present the stage-wise MAMS approach within the group sequential framework of Reference 2, pointing out how it differs with respect to test statistics and group sequential boundaries from the cumulative MAMS approach. Second, we compare the operating characteristics of the cumulative MAMS and stage-wise MAMS approaches, both analytically and empirically, in several settings. It is seen that the cumulative MAMS designs outperform the stage-wise MAMS designs with respect to power in every setting but one, where there is a small, practically negligible, power advantage for the stage-wise MAMS design. While two-stage designs are by far the most common application of adaptive designs we have also included results for three-stage designs. These results were previously unavailable due to the heavy computational burden they impose. The computational methods developed by Ghosh et al 9 were essential for simulating the three-stage cumulative MAMS designs in a realistic amount of time and thereby evaluating their operating characteristics.
In Section 2 we introduce the cumulative MAMS approach, explain how the group sequential boundaries are obtained from the distribution of the maximum score statistic, and show how to incorporate adaptive treatment selection and sample size reestimation into the design. In Section 3 we review the stage-wise MAMS approach for making adaptive changes to an ongoing study. For ease of exposition we confine our discussion in these sections to two-stage designs, as this suffices to explain the main principles of cumulative MAMS and stage-wise MAMS adaptation. The more general case of J > 2 stages is discussed in the Appendix. In Section 4 we compare the power of the cumulative and stage-wise MAMS approaches, analytically for two active doses vs placebo and by simulation for three active doses vs placebo. A more general simulation-based comparison that incorporates treatment selection, early stopping, and sample size reestimation is presented in Section 5 for a recently completed cardiovascular trial. 13 We summarize our findings in Section 6 along with some recommendations for choosing between the two approaches.

THE CUMULATIVE MAMS APPROACH
Consider a trial in which $D$ treatment arms, indexed by $i = 1, 2, \dots, D$, are each compared to a common control arm indexed by $i = 0$. Patients are randomized to either treatment arm $i$ or to the control arm in accordance with a prespecified allocation ratio $\lambda_i$. We assume that a patient's response on arm $i$ is normal with mean $\mu_i$ and variance $\sigma_i^2$. Let $\delta_i = \mu_i - \mu_0$, $i = 1, 2, \dots, D$, represent the mean effect of treatment arm $i$ relative to the control arm. Let $H_0^i: \delta_i = 0$ denote the null hypothesis for treatment arm $i$ and let $H_0 = \cap_{i=1}^{D} H_0^i$ denote the global null hypothesis. In this section we will develop the cumulative MAMS approach for a two-stage adaptive design to test $H_0$ against the one-sided alternative that $\delta_i > 0$ for at least one $i$. The generalization to $J > 2$ stages is presented in Appendix A1.
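Let $j = 1, 2$ denote the first and second stages, respectively, and let $n_{ij}$ be the sample size of arm $i$ at stage $j$. Define the score statistic $W_{ij} = \hat{\delta}_{ij}\,\mathcal{I}_{ij}$, where $\hat{\delta}_{ij}$ is the maximum likelihood estimate of $\delta_i$ and $\mathcal{I}_{ij} = n_{0j}(\sigma_0^2 + \lambda_i^{-1}\sigma_i^2)^{-1}$ is its Fisher information from data up to and including stage $j$. Let $W_j = (W_{1j}, \dots, W_{Dj})$. Then $W_j$ has a multivariate normal distribution with $E(W_{ij}) = \delta_i\,\mathcal{I}_{ij}$ and $\mathrm{var}(W_{ij}) = \mathcal{I}_{ij}$.
The distributional results for these score statistics hold exactly if the patient level data are normally distributed and asymptotically otherwise. 14 Let $W_{i(2)} = \hat{\delta}_{i(2)}\,\mathcal{I}_{i(2)}$ be the score statistic for the incremental data accumulated between stage 1 and stage 2, where $\mathcal{I}_{i(2)} = n_{0(2)}(\sigma_0^2 + \lambda_i^{-1}\sigma_i^2)^{-1}$ and $n_{0(2)}$ is the incremental sample size of the control arm. Then $W_{(2)} = (W_{1(2)}, \dots, W_{D(2)})$ is independent of $W_1$ and has a multivariate normal distribution with $E(W_{i(2)}) = \delta_i\,\mathcal{I}_{i(2)}$, $\mathrm{var}(W_{i(2)}) = \mathcal{I}_{i(2)}$, and $\mathrm{cov}(W_{i_1(2)}, W_{i_2(2)}) = \Lambda_{i_1}\Lambda_{i_2}\sigma_0^2\, n_{0(2)}$, where $\Lambda_i = (\sigma_0^2 + \lambda_i^{-1}\sigma_i^2)^{-1}$. In practice, when evaluating these distributions, we will replace the unknown Fisher information quantities $\mathcal{I}_{i1}$, $\mathcal{I}_{i2}$ and $\mathcal{I}_{i(2)}$ by corresponding estimates, $\hat{\mathcal{I}}_{i1}$, $\hat{\mathcal{I}}_{i2}$, and $\hat{\mathcal{I}}_{i(2)}$, from the data. (See, for example, Equation (9).) The simulation results in Table 1 of Section 5 demonstrate that this second-order approximation preserves type-1 error even for relatively small sample sizes. Using computational methods discussed in Ghosh et al 9 for multivariate Brownian processes we can obtain level-$\alpha$ group sequential boundaries $(b_1, b_2)$ such that

$$P_0(\max\{W_1\} \ge b_1) = \alpha_1 \quad \text{and} \quad P_0(\max\{W_1\} < b_1,\ \max\{W_2\} \ge b_2) = \alpha - \alpha_1, \tag{1}$$

where $P_h(\cdot)$ denotes probability under $\delta = h$ and $\alpha_1$ is the portion of the prespecified allowable type-1 error $\alpha$ that is spent at stage 1.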
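As a small numerical illustration of the score statistic and its Fisher information, the sketch below evaluates them for a single treatment-versus-control comparison. It assumes known variances; the function names and the example inputs (an estimated effect of 0.4, 50 control patients, equal allocation, unit variances) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from math import sqrt


def fisher_information(n0, lam, var0, var_i):
    """Information n0 * (var0 + var_i / lam)^(-1) for one treatment-vs-control effect.

    n0 is the control-arm sample size, lam the allocation ratio of the
    treatment arm relative to control, var0 and var_i the response variances.
    """
    return n0 / (var0 + var_i / lam)


def score_statistic(delta_hat, n0, lam, var0, var_i):
    """Return (score statistic W, Fisher information, Wald statistic W / sqrt(info))."""
    info = fisher_information(n0, lam, var0, var_i)
    w = delta_hat * info
    return w, info, w / sqrt(info)


# Hypothetical inputs: estimated effect 0.4, 50 control patients, 1:1 allocation.
w, info, wald = score_statistic(delta_hat=0.4, n0=50, lam=1.0, var0=1.0, var_i=1.0)
print(f"W = {w:.2f}, information = {info:.2f}, Wald z = {wald:.2f}")
```

Dividing the score statistic by the square root of the information gives the familiar Wald z statistic, which is why boundaries can be quoted on either scale.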
We shall, throughout, denote observed values of random variables by lowercase letters. Thus $w_1$ denotes the observed value of $W_1$. We may reject any hypothesis $H_0^i$ for which the corresponding $w_{i1} \ge b_1$. The trial is then terminated for efficacy. If, however, $\max\{w_1\} < b_1$ the trial continues to stage 2, where again any hypothesis $H_0^i$ is rejected for which the corresponding $w_{i2} \ge b_2$. Due to the use of the max statistic this hypothesis testing procedure maintains strong control of the FWER. 8 It is important to recognize that the efficacy boundaries for a multiarm group sequential design must be stricter than the corresponding efficacy boundaries for a two-arm group sequential design, since the former have to adjust for the multiplicity due to testing more than one hypothesis at each look. For example, if $D = 4$ the multiarm group sequential boundaries for treatment $i$, derived from the Lan and DeMets 15 error spending function, are $b_1 = 3.3453\sqrt{\mathcal{I}_{i1}}$ and $b_2 = 2.4510\sqrt{\mathcal{I}_{i2}}$ for a one-sided test at $\alpha = 0.025$ and an interim look at 50% of the total information. In contrast the two-arm group sequential efficacy boundaries in this setting are $b_1 = 2.9626\sqrt{\mathcal{I}_{i1}}$ and $b_2 = 1.9686\sqrt{\mathcal{I}_{i2}}$. We consider two possible adaptations at the end of stage 1: (a) permit one or more treatment arms to be dropped; (b) alter the sample size of each treatment arm $i$ that will be proceeding to stage 2, while maintaining its allocation ratio $\lambda_i$. Strong control of FWER can be maintained without any adjustment to the group sequential design if (a) is the only adaptation. We can, optionally, improve the efficiency of the design by recomputing the stage 2 boundary in conjunction with closed testing. If, on the other hand, the adaptation includes (b) then it is essential to recompute the stage 2 boundary in conjunction with closed testing in order to maintain strong control of FWER. We next discuss how this is accomplished.
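Before turning to the adaptations, the four-arm boundaries quoted above (3.3453 and 2.4510 on the Wald scale) can be checked by a quick Monte Carlo experiment under the global null. This is only a rough sketch, not the exact computation of Ghosh et al: it assumes equal allocation to all arms, known unit variance, and two equally sized stages, and the function name `simulate_fwer` is hypothetical.

```python
import random
from math import sqrt


def simulate_fwer(b1, b2, n_arms=4, n_trials=50_000, seed=1):
    """Estimate P(reject at least one true null) for Wald-scale boundaries (b1, b2)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        # Standardized stage-wise sums for the control arm (index 0) and each treatment arm.
        s1 = [rng.gauss(0.0, 1.0) for _ in range(n_arms + 1)]
        s2 = [rng.gauss(0.0, 1.0) for _ in range(n_arms + 1)]
        # Stage 1 Wald statistics: difference of two unit-variance sums has variance 2.
        z1 = max((s1[i] - s1[0]) / sqrt(2.0) for i in range(1, n_arms + 1))
        if z1 >= b1:
            rejections += 1
            continue
        # Cumulative Wald statistics after stage 2: the cumulative difference has variance 4.
        z2 = max((s1[i] + s2[i] - s1[0] - s2[0]) / 2.0 for i in range(1, n_arms + 1))
        if z2 >= b2:
            rejections += 1
    return rejections / n_trials


fwer_multiarm = simulate_fwer(3.3453, 2.4510)  # multiplicity-adjusted boundaries
fwer_two_arm = simulate_fwer(2.9626, 1.9686)   # two-arm boundaries misapplied to four arms
print(f"multiarm boundaries: {fwer_multiarm:.4f}, two-arm boundaries: {fwer_two_arm:.4f}")
```

Under these assumptions the multiarm boundaries should hold the error rate near the nominal one-sided 0.025, whereas applying the two-arm boundaries to four comparisons inflates it well beyond that level.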
Let $\mathcal{D} = \{1, 2, \dots, D\}$ and let $S \subseteq \mathcal{D}$ denote the indices of the treatments selected for stage 2. At stage 2 we are interested in testing $H_0^i$ for all $i \in S$ while maintaining strong control of the FWER at level $\alpha$. To achieve this control, each $H_0^i$ must be tested by a closed level-$\alpha$ test. That is, $H_0^i$ may only be rejected if, for all $I \subseteq \mathcal{D}$ such that $i \in I$, $H_0^I = \cap_{g \in I} H_0^g$ is rejected with a valid local level-$\alpha$ test. 1 The valid local level-$\alpha$ test of $H_0^I$ is constructed in two steps.
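Step 1 Compute two-stage group sequential level-$\alpha$ boundaries $(b_{I1}, b_{I2})$ for making $|I|$ comparisons to a common control. These boundaries must satisfy

$$P_0(\max\{W_{I1}\} \ge b_{I1}) = \alpha_1 \quad \text{and} \quad P_0(\max\{W_{I1}\} < b_{I1},\ \max\{W_{I2}\} \ge b_{I2}) = \alpha - \alpha_1,$$

where $W_{Ij} = \{W_{gj};\ g \in I\}$, $j = 1, 2$. If $\max\{W_{I1}\} \ge b_{I1}$, $H_0^I$ is rejected. Otherwise we proceed to Step 2.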
Step 2 After examining the stage 1 data a subset $S \subseteq \mathcal{D}$ consisting of $|S|$ treatments is selected for testing at stage 2. Suppose that the incremental stage 2 sample size of the control arm is altered from $n_{0(2)}$ to $n^*_{0(2)}$, and suppose that the incremental stage 2 sample sizes of the $|S|$ treatment arms are correspondingly altered so as to preserve their respective allocation ratios relative to the control arm. Let $I_S = I \cap S$. In order to preserve the type-1 error of the trial we must replace the stage 2 boundary $b_{I2}$ with $b^*_{I2}$ such that

$$P_0(\max\{W^*_{I_S 2}\} \ge b^*_{I2} \mid w_1) = P_0(\max\{W_{I2}\} \ge b_{I2} \mid w_1), \tag{2}$$

where $W^*_{I_S 2} = \{W^*_{g2} : g \in I_S\}$ and the "*" indicates that the sample size of the stage 2 statistic $W^*_{g2}$ has been altered from $n_{g2}$ to $n^*_{g2} = n_{g1} + n^*_{0(2)}\lambda_g$. We reject $H_0^I$ if $\max\{w^*_{I_S 2}\} \ge b^*_{I2}$. Equation (2) is a consequence of the conditional error rate principle 11 which states that in order to preserve the overall type-1 error of the trial its conditional type-1 error after adaptation should not exceed the conditional type-1 error of the original trial, given the stage 1 data. Thereby $H_0^I$ is rejected by a valid level-$\alpha$ test.
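The conditional error rate principle is easiest to see in the special case of a single comparison, where the distribution of the stage 2 score statistic given the stage 1 data is univariate normal and the adapted boundary has a closed form. The sketch below covers only that simplified case (one treatment arm, known information); the general multiarm computation requires multivariate normal probabilities. The function names and numerical inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_N = NormalDist()


def conditional_error(w1, b2, info1, info2):
    """P_0(W_2 >= b2 | W_1 = w1) on the score scale.

    Under the null, W_2 - W_1 is N(0, info2 - info1) and independent of W_1.
    """
    return 1.0 - _N.cdf((b2 - w1) / sqrt(info2 - info1))


def adapted_boundary(w1, b2, info1, info2, new_info2):
    """Stage 2 boundary that keeps the conditional error unchanged when the
    cumulative stage 2 information is altered from info2 to new_info2."""
    scale = sqrt((new_info2 - info1) / (info2 - info1))
    return w1 + scale * (b2 - w1)


# Hypothetical design: information 25 at stage 1 and 50 at stage 2, with a
# two-arm stage 2 boundary of 1.9686 on the Wald scale.
info1, info2 = 25.0, 50.0
b2 = 1.9686 * sqrt(info2)
w1 = 5.0            # observed stage 1 score statistic (Wald z = 1.0)
new_info2 = 75.0    # stage 2 increment doubled after the interim look
b2_star = adapted_boundary(w1, b2, info1, info2, new_info2)

print(f"original conditional error: {conditional_error(w1, b2, info1, info2):.4f}")
print(f"adapted boundary b2*: {b2_star:.3f}")
print(f"conditional error after adaptation: {conditional_error(w1, b2_star, info1, new_info2):.4f}")
```

The two printed conditional errors coincide by construction, which is the single-arm analogue of requiring the conditional type-1 error after adaptation not to exceed that of the original design.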
Finally, rejection of $H_0^i$ requires that $H_0^I$ be rejected in the above manner for all possible subsets $I \subseteq \mathcal{D}$ that contain $i$. This will ensure that the test of $H_0^i$ is closed and will thereby guarantee strong control of FWER.
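The closure requirement can be sketched in a few lines of code. The example below enumerates every subset that contains a given treatment index and rejects the elementary hypothesis only if all of the corresponding intersection hypotheses are rejected. For brevity it uses a single-stage Bonferroni local test in place of the two-step group sequential local test described above; the function names and P-values are hypothetical.

```python
from itertools import combinations


def bonferroni_local_test(p_values, subset, alpha):
    """Local level-alpha test of the intersection hypothesis indexed by `subset`."""
    return min(p_values[i] for i in subset) <= alpha / len(subset)


def closed_test(p_values, alpha, local_test=bonferroni_local_test):
    """Return the indices whose elementary hypotheses are rejected by closed testing."""
    indices = list(p_values)
    rejected = []
    for i in indices:
        others = [g for g in indices if g != i]
        # The elementary hypothesis for i is rejected only if every intersection
        # hypothesis over a subset containing i is rejected by its local test.
        all_rejected = all(
            local_test(p_values, (i,) + extra, alpha)
            for r in range(len(others) + 1)
            for extra in combinations(others, r)
        )
        if all_rejected:
            rejected.append(i)
    return rejected


# Hypothetical unadjusted one-sided P-values for three treatment arms.
p_values = {1: 0.005, 2: 0.02, 3: 0.3}
print(closed_test(p_values, alpha=0.025))
```

Here arm 2 has a P-value below 0.025 but is not rejected, because the intersection hypothesis for arms 2 and 3 fails its local test. Any other valid local test can be passed in through the `local_test` argument.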

THE STAGE-WISE MAMS APPROACH
We recapitulate the two-stage method described by Reference 5, but present it in the classical group sequential framework of Reference 2, which facilitates generalization to $J > 2$ stages as given in Appendix A2. Recall from Section 2 that we can reject any elementary hypothesis $H_0^i$ only if the intersection hypothesis $H_0^I$ is rejected by a valid local level-$\alpha$ test for all subsets $I \subseteq \mathcal{D}$ that contain $i$. In stage-wise MAMS the test of $H_0^I$ utilizes multiplicity adjusted P-values computed from the incremental data at stages 1 and 2. Any valid multiplicity adjusted P-values may be utilized for this purpose. Popular candidates include the t-test based P-values adjusted for multiplicity by the nonparametric Bonferroni and Simes procedures, for which the appropriate formulae are given in Reference 5. However, in order to make a meaningful comparison between the cumulative and stage-wise MAMS approaches, we will utilize P-values that are derived from the maximum score statistic. In that case the multiplicity adjusted P-value for testing $H_0^I$ at stage $j$ is the single-stage Dunnett P-value 16

$$p_{I(j)} = P_0(\max\{W_{I(j)}\} \ge \max\{w_{I(j)}\}), \quad j = 1, 2, \tag{3}$$

where $W_{I(1)}$ and $W_{I(2)}$ are the score statistics based on the incremental data at stages 1 and 2, respectively. To evaluate Equation (3) exactly we define, for all $i \in I$, $t_{i(j)} = w_{i(j)} / \sqrt{\hat{\mathcal{I}}_{i(j)}}$, where $\hat{\mathcal{I}}_{i(j)}$ is the estimated Fisher information from the incremental data of stage $j$. Define $t_{I(j)} = \{t_{i(j)};\ i \in I\}$. Then the multiplicity adjusted Dunnett P-value can be computed exactly as

$$p_{I(j)} = P_0(\max\{T_{I(j)}\} \ge \max\{t_{I(j)}\}),$$

where $T_{I(j)}$ has a multivariate-T distribution with mean 0, $n_{0(j)} + \sum_{i \in I} n_{i(j)} - |I| - 1$ degrees of freedom, and a known covariance matrix that depends on the allocation ratios of the treatment arms to the control arm.
A two-stage level-$\alpha$ test of $H_0^I$ can now be constructed as follows. Define the test statistic for stage 1 as

$$Z_{I1} = \Phi^{-1}(1 - p_{I(1)}).$$

We will use the same type-1 error, $\alpha_1$, for stage 1 as was used in the cumulative MAMS approach. Thus for any $I \subseteq \mathcal{D}$ the stage 1 efficacy boundary $c_1$ satisfies $P_0(Z_{I1} \ge c_1) = \alpha_1$. The trial terminates for efficacy at stage 1 if there exists at least one $i \in \mathcal{D}$ such that for all $I \subseteq \mathcal{D}$ that contain $i$, $Z_{I1} \ge c_1$, for then $H_0^i$ can be rejected by a level-$\alpha_1$ closed test. If the trial does not terminate at stage 1 let $S \subseteq \mathcal{D}$ be the set of treatment indexes selected for stage 2 and $I_S = I \cap S$ be the set of treatments from $I$ that are carried forward to stage 2. Let $\max\{W_{I_S(2)}\} = \max(W_{q(2)};\ q \in I_S)$ denote the maximum incremental score statistic in the set $I_S$. Then the second-stage P-value for testing $H_0^I$ is

$$p_{I(2)} = P_0(\max\{W_{I_S(2)}\} \ge \max\{w_{I_S(2)}\}).$$

We now compute the test statistic for stage 2 as a weighted sum of inverse normal components

$$Z_{I2} = h_1 \Phi^{-1}(1 - p_{I(1)}) + h_2 \Phi^{-1}(1 - p_{I(2)}),$$

where $h_1$ and $h_2$ are prespecified weights whose sum of squares is 1. The statistics $Z_{I1}$ and $Z_{I2}$ are $N(0, 1)$ under $H_0^I$ and $Z_{I2} - Z_{I1}$ is independent of $Z_{I1}$. Thus one can readily obtain the efficacy boundary $c_2$ such that

$$P_0(Z_{I1} < c_1,\ Z_{I2} \ge c_2) = \alpha - \alpha_1$$

by the usual methods for two-arm group sequential designs. 2 We reject $H_0^i$ with strong control of FWER if $Z_{I2} \ge c_2$ for all possible $I \subseteq \mathcal{D}$ with $i \in I$. The generalization to $J > 2$ stages is given in Appendix A2.
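The two ingredients of this construction, the inverse normal combination and the two-arm group sequential boundaries, can be sketched numerically. The code below assumes the Lan-DeMets O'Brien-Fleming-type spending function with one interim look, weights equal to the square roots of the information fractions, and a simple trapezoid rule with bisection in place of specialized group sequential software; all function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_N = NormalDist()


def ldof_alpha_spent(t, alpha):
    """Lan-DeMets O'Brien-Fleming-type error spent at information fraction t (one-sided alpha)."""
    return 2.0 * (1.0 - _N.cdf(_N.inv_cdf(1.0 - alpha / 2.0) / sqrt(t)))


def two_stage_boundaries(t1, alpha, step=0.005, lower=-9.0):
    """Return (c1, c2) with P0(Z1 >= c1) = alpha1 and P0(Z1 < c1, Z2 >= c2) = alpha - alpha1."""
    alpha1 = ldof_alpha_spent(t1, alpha)
    c1 = _N.inv_cdf(1.0 - alpha1)
    n = int((c1 - lower) / step)
    h = (c1 - lower) / n

    def stage2_error(c2):
        # Z2 = sqrt(t1) * Z1 + sqrt(1 - t1) * (independent stage 2 increment);
        # integrate over Z1 < c1 with the trapezoid rule.
        total = 0.0
        for j in range(n + 1):
            z = lower + j * h
            weight = 0.5 if j in (0, n) else 1.0
            total += weight * _N.pdf(z) * (1.0 - _N.cdf((c2 - sqrt(t1) * z) / sqrt(1.0 - t1)))
        return total * h

    lo, hi = 0.0, 6.0
    for _ in range(50):  # bisection: stage2_error decreases as c2 increases
        mid = 0.5 * (lo + hi)
        if stage2_error(mid) > alpha - alpha1:
            lo = mid
        else:
            hi = mid
    return c1, 0.5 * (lo + hi)


def inverse_normal_statistic(p1, p2, h1, h2):
    """Weighted inverse normal combination of two stage-wise P-values."""
    return h1 * _N.inv_cdf(1.0 - p1) + h2 * _N.inv_cdf(1.0 - p2)


c1, c2 = two_stage_boundaries(t1=0.5, alpha=0.025)
z2 = inverse_normal_statistic(0.025, 0.025, sqrt(0.5), sqrt(0.5))
print(f"c1 = {c1:.4f}, c2 = {c2:.4f}, combined statistic = {z2:.4f}")
```

With an interim look at half of the total information and one-sided level 0.025, this reproduces the two-arm boundaries of roughly 2.96 and 1.97 quoted in Section 2 on the Wald scale.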
Note that the efficacy boundaries $(c_1, c_2)$ only protect against the multiplicity induced by testing the same hypothesis over two stages. In particular, they do not adjust for the multiplicity due to testing multiple treatment arms against a common control arm. The latter multiplicity adjustment is applied through the Dunnett P-values. In contrast the cumulative MAMS approach applies the adjustments for both sources of multiplicity directly through the efficacy boundaries. For example, if $D = 4$ the Lan-DeMets 15 efficacy boundaries for the stage-wise MAMS design are $c_1 = 2.9626$ and $c_2 = 1.9686$. These are the efficacy boundaries for comparing a single treatment arm to a control arm even though in fact four treatments are being compared to the same control. For the cumulative MAMS design, however, the Wald-scale boundaries for comparing four treatments to a common control would be $b_1 = 3.3453$ and $b_2 = 2.4510$, as given in Section 2.

CUMULATIVE MAMS VS STAGE-WISE MAMS
Our goal is to compare the cumulative and stage-wise MAMS approaches with respect to global power, defined here as the probability of rejecting $H_0^i$ for any treatment $i$, $i = 1, 2, \dots, D$. We will first make these comparisons for the special case of two active doses, no early stopping and no dose selection. In this ideal setting it is possible to make the comparisons analytically and thereby gain a deeper insight into the conditions under which one method has greater power than the other. We will then extend these comparisons to more general settings by simulation.

Analytical comparison with two active doses and two stages
Patients are randomized equally between the three arms of the study and each patient's response is normally distributed with $\sigma^2 = 1$. The control arm has a mean of zero and treatment $i$ has mean $\delta_i$, $i = 1, 2$. The null hypothesis corresponding to treatment $i$ is $H_0^i: \delta_i = 0$. We will test the global null hypothesis $H_0 = H_0^1 \cap H_0^2$ against the one-sided alternative that $\delta_i > 0$ for at least one $i = 1, 2$. Under the assumption of no early stopping, no dropping of treatments and no adaptive sample size reestimation, one can derive analytical power functions for the cumulative and stage-wise MAMS designs. Let $f_1(w_{11}, w_{21})$ be the probability density function of $W_1 = (W_{11}, W_{21})$, the stage 1 score statistics. Let $f_{(2)}(w_{1(2)}, w_{2(2)})$ be the probability density function of $W_{(2)} = (W_{1(2)}, W_{2(2)})$, the incremental stage 2 score statistics. (For notational convenience we have suppressed the dependence of these densities on $\delta$.) Let $b_2$ denote the critical value for declaring statistical significance at the end of stage 2. Then we have shown in Appendix A1 that P(CUMUL) and P(STAGE), the respective cumulative and stage-wise MAMS probabilities of rejecting $H_0$ when the true treatment effect is $\delta = (\delta_1, \delta_2)$, are given by

$$P(\mathrm{CUMUL}) = 1 - \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\left[\int_{-\infty}^{b_2 - w_{11}}\int_{-\infty}^{b_2 - w_{21}} f_{(2)}(w_{1(2)}, w_{2(2)})\,dw_{2(2)}\,dw_{1(2)}\right] f_1(w_{11}, w_{21})\,dw_{21}\,dw_{11} \tag{6}$$

and

$$P(\mathrm{STAGE}) = 1 - \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\left[\int_{-\infty}^{g}\int_{-\infty}^{g} f_{(2)}(w_{1(2)}, w_{2(2)})\,dw_{2(2)}\,dw_{1(2)}\right] f_1(w_{11}, w_{21})\,dw_{21}\,dw_{11}, \tag{7}$$

where $p_1 = P_0(\max\{W_1\} \ge \max\{w_1\})$ and $p_{(2)} = P_0(\max\{W_{(2)}\} \ge \max\{w_{(2)}\})$ are the multiplicity-adjusted P-values for the two stages, $z_p = \Phi^{-1}(1 - p)$, and $g$, the value of $\max\{w_{(2)}\}$ at which $h_1 z_{p_1} + h_2 z_{p_{(2)}} = z_\alpha$, is a function of the maximum of $(w_{11}, w_{21})$ through $p_1$. It is instructive to compare the two power functions (6) and (7). They differ only in the upper limits of the inner (or stage 2) integrals. In P(CUMUL) the stage 2 score statistics $(w_{1(2)}, w_{2(2)})$ are confined to the region $(-\infty, b_2 - w_{11}) \times (-\infty, b_2 - w_{21})$. Notice that this is the acceptance region for a test that rejects $H_0$ if either $w_{11} + w_{1(2)} \ge b_2$ or $w_{21} + w_{2(2)} \ge b_2$. Thus P(CUMUL) is derived from a test that is based on sufficient statistics. In contrast the stage 2 score statistics $(w_{1(2)}, w_{2(2)})$ in the expression for P(STAGE) are confined to the region $(-\infty, g) \times (-\infty, g)$. This is the acceptance region for a test that rejects $H_0$ if $h_1 z_{p_1} + h_2 z_{p_{(2)}} \ge z_\alpha$. Clearly this test is not based on sufficient statistics.
The impact on global power of nonadherence to the sufficiency principle is shown in Figure 1, where the two test methods are compared for $\delta_1$ and $\delta_2$ in the range 0 to 3, and in Figure 2, where the two test methods are compared with equal $\delta$ values over the range $\delta_1 = \delta_2 = 0$ to $\delta_1 = \delta_2 = 3$. We have chosen $\alpha = 0.05$ for both test methods, with total statistical information $\mathcal{I}_2 = 1$ for evaluating P(CUMUL), and stage-wise statistical information $\mathcal{I}_1 = \mathcal{I}_{(2)} = 0.5$ for evaluating P(STAGE). With these design parameters both designs achieve 0.95 power at $\delta_1 = \delta_2 = 3$ and FWER equal to 0.05 at $\delta_1 = \delta_2 = 0$. The following conclusions may be drawn:
1. Except for a small region near $\delta_1 = \delta_2 = 1.5$, P(CUMUL) exceeds P(STAGE) everywhere, with absolute power gains between 0% and 5%.
2. When $\delta_1 = \delta_2 = 1.5$ there is a tiny power loss, P(CUMUL) − P(STAGE) = −0.2%, which disappears rapidly as soon as $\delta_2$ moves away from $\delta_1$.
3. The power gain for P(CUMUL) is maximum when the two $\delta$ values differ by the greatest amount: $\delta_1 = 0$, $\delta_2 = 3$ or $\delta_1 = 3$, $\delta_2 = 0$.
4. The slight loss in power at $\delta_1 = \delta_2 = 1.5$ shown in Figure 1 suggests that similar losses might also occur at other values of $\delta_1 = \delta_2$. This is confirmed by an examination of Figure 2, where P(CUMUL) − P(STAGE) is plotted over the range $\delta_1 = \delta_2 = 0$ to $\delta_1 = \delta_2 = 3$. The power loss is zero at $\delta_1 = \delta_2 = 0$, grows gradually to its largest magnitude, P(CUMUL) − P(STAGE) = −0.002, at $\delta_1 = \delta_2 = 1.5$, and then declines, reaching zero once again at $\delta_1 = \delta_2 = 3$.
It is worth noting that, in this setting, the cumulative MAMS design has the property of consonance. When $H_0$ is rejected by the cumulative MAMS method we can, in addition to rejecting $H_0$, also reject either $H_0^1$ or $H_0^2$ or both of them, depending on which component(s) of $w_2$ crossed the efficacy boundary. For the P-value combination test, however, rejecting $H_0$ does not provide any additional information about the status of $H_0^1$ or $H_0^2$ individually. We need to further reject either $H_0^1$ or $H_0^2$ or both by local level-$\alpha$ tests before we can make an efficacy claim for these dose groups. These additional tests have not been factored into the analytical power calculations for the P-value combination approach. Therefore we can conclude that the actual power of the P-value combination approach to identify efficacious doses is even less than P(STAGE).
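The analytical comparison for two active doses can be approximated by simulation. The sketch below is a Monte Carlo stand-in for the integrals, not the authors' computation: it assumes equal allocation, known unit variance, no early stopping, Dunnett adjustments based on the bivariate normal distribution, and combination weights equal to the square roots of the information fractions. The function names, simulation size and seed are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

_N = NormalDist()
_STEP = 0.1
_NODES = [-8.0 + j * _STEP for j in range(161)]
_WEIGHTS = [_N.pdf(x) * _STEP for x in _NODES]


def dunnett_tail(m, k):
    """P_0(max of k Wald statistics >= m), equal allocation, known variance."""
    keep = sum(w * _N.cdf(sqrt(2.0) * m + x) ** k for x, w in zip(_NODES, _WEIGHTS))
    return min(max(1.0 - keep, 1e-12), 1.0 - 1e-12)  # keep inv_cdf arguments valid


def dunnett_critical_value(alpha, k):
    lo, hi = 0.0, 6.0
    for _ in range(60):  # bisection: the tail probability decreases in the cut-off
        mid = 0.5 * (lo + hi)
        if dunnett_tail(mid, k) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def compare_power(delta, alpha=0.05, info1=0.5, info_inc2=0.5, n_sim=6000, seed=7):
    """Return (cumulative power, stage-wise power) for rejecting the global null."""
    rng = random.Random(seed)
    k = len(delta)
    info2 = info1 + info_inc2
    h1, h2 = sqrt(info1 / info2), sqrt(info_inc2 / info2)
    b = dunnett_critical_value(alpha, k)  # Wald-scale cut-off for the cumulative max
    z_alpha = _N.inv_cdf(1.0 - alpha)
    hits_cumul = hits_stage = 0
    for _ in range(n_sim):
        z_stage = []
        for info in (info1, info_inc2):
            x0 = rng.gauss(0.0, 1.0)  # control-arm noise shared by all comparisons
            z_stage.append([d * sqrt(info) + (rng.gauss(0.0, 1.0) - x0) / sqrt(2.0)
                            for d in delta])
        # Cumulative Wald statistics are information-weighted sums of the stage-wise ones.
        z_cum = [h1 * a + h2 * c for a, c in zip(z_stage[0], z_stage[1])]
        hits_cumul += int(max(z_cum) >= b)
        # Stage-wise test: inverse normal combination of Dunnett-adjusted P-values.
        z_comb = sum(h * _N.inv_cdf(1.0 - dunnett_tail(max(z), k))
                     for h, z in zip((h1, h2), z_stage))
        hits_stage += int(z_comb >= z_alpha)
    return hits_cumul / n_sim, hits_stage / n_sim


power_cumul, power_stage = compare_power((0.0, 3.0))
print(f"P(CUMUL) ~ {power_cumul:.3f}, P(STAGE) ~ {power_stage:.3f}")
```

Both designs are evaluated on the same simulated data, so the estimated power difference is less noisy than either power estimate on its own. Calling `compare_power((1.5, 1.5))` explores the homogeneous case, where the difference is expected to be very small and may be dominated by simulation error.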

Simulation-based comparison with three active doses and selection
The analytical expressions in Equations (6) and (7) were derived in the idealized setting of two active doses, no early stopping and no dropping of treatment arms at the end of stage 1. We now consider the more realistic setting of three active doses in which nonperforming doses are dropped at the end of stage 1. Figure 4 is a 3D plot similar to Figure 3, with the same $\sigma^2$ and range of values for the $\delta$'s, but with a stricter criterion for dropping doses; here treatment $i$ is dropped if $\hat{\delta}_{i1} < -0.3$. Both plots are based on 10 000 simulated trials. By examining these plots one may draw three important conclusions about the power differential between the cumulative MAMS and stage-wise MAMS designs.
1. P(CUMUL) exceeds P(STAGE), with absolute power gains up to 9% when the cut-off for dropping doses is $\hat{\delta}_i < -0.1$ and up to 11% when the cut-off for dropping doses is $\hat{\delta}_i < -0.3$.
2. The gain in power of P(CUMUL) over P(STAGE) appears to depend on the degree of heterogeneity among the $\delta$ values. The greater the heterogeneity, the greater the power gain.
3. The power gain is greater in Figure 4 than in Figure 3 for every $(\delta_1, \delta_2, \delta_3)$ combination. As the only difference between the two figures is the value of $\hat{\delta}_i$ below which doses are dropped, it would appear that the stricter the criterion for dropping doses at the end of stage 1, the greater the power differential. We will revisit this conjecture in Section 5 in the context of an actual clinical trial.

THE SOCRATES-REDUCED TRIAL
SOCRATES-REDUCED was a multicenter, randomized, placebo-controlled trial which enrolled patients with worsening chronic heart failure after clinical stabilization. 13 Patients were randomized to three different dose groups (2.5, 5, and 10 mg) of oral vericiguat or placebo. The primary endpoint of the trial was change from baseline to week 12 in log-transformed N-terminal pro-B-type natriuretic peptide (NT-proBNP). The statistical analysis plan specified that for the analysis of the primary endpoint the patients from the three dose groups would be pooled and compared to the placebo arm. The trial was designed for 80% power to detect a difference of $\delta = 0.187$ between the pooled dose group and placebo, at one-sided $\alpha = 0.025$. In order to meet these design requirements, and assuming that $\sigma = 0.52$, a total of 260 patients (65/arm) were randomized to the study. This trial, however, failed to show statistical significance. The observed treatment effect for the pooled dose group relative to placebo was only 0.122 (P = .075, one-sided). The data from the trial showed a dose-response relationship with an observed difference from placebo of 0.248 for the 10-mg dose group (P = .024), 0.073 for the 5-mg dose group (P = .15), and 0.04 for the 2.5-mg dose group (P = .19). Pooling the three dose groups for the final analysis caused a dilution of the observed treatment effect and resulted in a failed trial even though the 10-mg dose appears to be clearly effective. We will use this example to display the operating characteristics of alternative cumulative and stage-wise MAMS designs that might have been used for identifying effective doses in a multiarm setting.
A single-stage four-arm design based on Dunnett's test in which $\sigma = 0.52$ and $\delta = 0.187$ for each dose vs placebo requires 388 patients (97/arm) for 80% power at one-sided $\alpha = 0.025$. Here power is defined as the probability that the null hypothesis $\delta = 0$ will be rejected for at least one dose group. In Table 1 we compare the operating characteristics of this single-stage Dunnett design with corresponding operating characteristics of stage-wise MAMS designs that utilize Bonferroni, Simes, or Dunnett adjusted P-values, and of the cumulative MAMS design, over two equally spaced stages in Table 1A and over three equally spaced stages in Table 1B. The adaptation occurs at the end of stage 1 and consists of early stopping if any dose group crosses an efficacy boundary, or dropping any dose group having an observed treatment effect that is worse than placebo. When doses are dropped their remaining sample sizes are reallocated in equal proportion to the remaining doses or placebo. The Bonferroni, Simes, and Dunnett stage-wise MAMS procedures combine multiplicity-adjusted P-values derived from the Student's t distribution in accordance with Equation (A8) of Appendix A2. All table entries are based on 10 000 simulated trials. The value of $\alpha_j$ spent at each stage $j$ to obtain the efficacy stopping boundaries is derived from the Lan and DeMets, O'Brien-Fleming type, error spending function. 15 For the stage-wise MAMS designs these are the usual two-arm group sequential boundaries, obtained as solutions to Equations (A11) and (A12) of Appendix A3. For the cumulative MAMS design, these are multiplicity adjusted multiarm group sequential boundaries, derived as shown in Equations (A5) and (A6) of Appendix A2. However, as recommended by Wason et al, 17 these multiarm boundaries, $b_j$, are further transformed to adjust for possible biases in small samples due to estimating the unknown $\sigma_i^2$ for each treatment $i$ in the computation of the test statistic. The transformation uses $\hat{\mathcal{I}}_{ij}$, the estimated Fisher information about $\delta_i$ at stage $j$; $\hat{\sigma}_i^2$, the estimated variance of the response to treatment $i$, based on cumulative data up to and including stage $j$; and $T^{-1}_{d_{ij}}$, the inverse of the Student's t distribution function with $d_{ij} = n_{0j} + n_{ij} - 1$ degrees of freedom. This adjustment to the boundaries allows us to use estimated Fisher information in place of the unknown actual Fisher information without inflating the type-1 error. The last rows of Table 1 show that this adjustment preserves the FWER, albeit slightly conservatively. We have verified that if the simulations are performed with the actual Fisher information, the FWER is exactly 0.025, thereby demonstrating that, in the absence of any large sample approximations, the adaptive cumulative MAMS design exhausts the entire $\alpha$. For the scenarios considered here, the adaptive cumulative MAMS design dominates the other designs with respect to power. Furthermore, among the three stage-wise MAMS methods displayed in Table 1, the methods that utilize the Bonferroni or Simes adjustments have considerably lower power than the method that utilizes the Dunnett adjustment. The power gains of the cumulative MAMS design over the other designs are more pronounced for heterogeneous treatment effects than for homogeneous treatment effects. For example, it is seen from Table 1A for two-stage designs where $\delta = (0, 0, 0.187)$ that the cumulative MAMS design produces 6% more power than the stage-wise MAMS design using Dunnett P-values, 13% more power than the stage-wise MAMS design using Simes P-values, 14% more power than the stage-wise MAMS design using Bonferroni P-values, and 7% more power than the single-stage Dunnett design.
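The sample size quoted above for the single-stage Dunnett design (97 patients per arm for 80% power) can be checked approximately. The sketch below assumes a known standard deviation of 0.52, equal allocation, and the same effect of 0.187 in all three dose groups, and it uses the multivariate normal distribution rather than the multivariate-T, so it is a large-sample check rather than a reproduction of the design calculation. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_N = NormalDist()


def prob_max_below(c, shift, k, step=0.05, lim=8.0):
    """P(max of k Wald statistics < c) when each has mean `shift`, unit variance,
    and pairwise correlation 0.5 induced by the shared control arm."""
    n = int(round(2 * lim / step))
    total = 0.0
    for j in range(n + 1):
        x = -lim + j * step  # integrate over the standardized control-arm term
        total += _N.pdf(x) * _N.cdf(sqrt(2.0) * (c - shift) + x) ** k
    return total * step


def dunnett_power(delta, sigma, n_per_arm, k, alpha):
    """Return (critical value, probability of rejecting at least one null hypothesis)."""
    lo, hi = 0.0, 6.0
    for _ in range(60):  # bisection for the one-sided Dunnett critical value
        mid = 0.5 * (lo + hi)
        if 1.0 - prob_max_below(mid, 0.0, k) > alpha:
            lo = mid
        else:
            hi = mid
    crit = 0.5 * (lo + hi)
    shift = delta / (sigma * sqrt(2.0 / n_per_arm))  # mean of each Wald statistic
    return crit, 1.0 - prob_max_below(crit, shift, k)


crit, power = dunnett_power(delta=0.187, sigma=0.52, n_per_arm=97, k=3, alpha=0.025)
print(f"Dunnett critical value ~ {crit:.3f}, power ~ {power:.3f}")
```

Under these assumptions the computed power should be close to the 80% target, which is consistent with the stated total of 388 patients.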
It is interesting to observe that even in the homogeneous case where $\boldsymbol{\delta} = (0.187, 0.187, 0.187)$ the stage-wise MAMS design using Dunnett P-values has 2% less power than the cumulative MAMS design. This would appear to contradict the results of Section 4, where there is essentially no difference in power between stage-wise and cumulative MAMS designs when the $\delta$ values are all equal. The explanation is that the designs in Section 4, unlike the SOCRATES-REDUCED designs, do not include early stopping. The presence of early stopping boundaries causes a loss of power for stage-wise MAMS relative to cumulative MAMS. Table 1B displays similar results for three-stage designs. Three-stage designs, however, have the additional advantage of lower average sample sizes due to the possibility of early stopping. This is seen in Table 2.

We noted at the end of Section 4.2 that the stricter the criterion for dropping doses at the end of stage 1, the greater the gain in power for cumulative MAMS over stage-wise MAMS designs. It would be interesting to determine whether this result holds also for the SOCRATES-REDUCED designs. In Table 3 we explore this conjecture for two-stage designs with three different configurations for $\boldsymbol{\delta}$. In Table 3A, $\boldsymbol{\delta} = (0.187, 0.187, 0.187)$. In Table 3B, $\boldsymbol{\delta} = (0, 0.187, 0.187)$. In Table 3C, $\boldsymbol{\delta} = (0, 0, 0.187)$. In each table we use three progressively stricter criteria for dropping treatments: any $\hat{\delta}_{i1} < 0$ in row 1, and successively more negative thresholds for $\hat{\delta}_{i1}$ in rows 2 and 3, the row 3 threshold being twice that of row 2.
In each table, for each design, a pattern emerges whereby P(CUMUL) − P(STAGE) increases in moving from row 1 to row 2 and then decreases in moving from row 2 to row 3. A similar pattern was observed for the three-stage designs. We are unable to find an explanation for this behavior. It is noteworthy, however, that the gains in power increase substantially with increasing heterogeneity of the $\delta$ values. For example, in Table 3C the value of P(CUMUL) − P(STAGE) can be as high as 21% for Bonferroni, 20.3% for Simes and 14.3% for Dunnett.

DISCUSSION
The usual practice in clinical drug development has been to first run a phase 2 trial with multiple doses, and then run a separate two-arm phase 3 trial in which the best dose from phase 2 is compared to a control arm. Adaptive designs combine phase 2 and phase 3 into a single integrated trial and thereby utilize fewer patient resources and shorten the time required to identify and market efficacious medical products. To be acceptable for regulatory submissions such designs must have strong control of FWER. Both the stage-wise MAMS and the cumulative MAMS designs have this property.

In stage-wise MAMS designs, FWER control is achieved by constructing the test statistic as a weighted combination of inverse normal multiplicity-adjusted P-values from the incremental data at each stage, and monitoring this statistic with respect to the classical two-arm group sequential boundaries. Since the weights are prespecified, this test statistic has the canonical distribution of the usual two-sample Wald or score statistic under the global null hypothesis, even if the sample size is reestimated in the course of the trial. Additionally, closed testing is implemented to identify the active treatment arms. In cumulative MAMS designs, strong FWER control is achieved by constructing a separate cumulative Wald or score statistic for each pairwise comparison and monitoring it with respect to group sequential boundaries that are adjusted for testing multiple treatment arms. Although these boundaries provide strong control of the FWER in the presence of arbitrary or unplanned treatment selection, they can be sharpened through step-down closed testing and preservation of conditional error rates as described in Section 2 and Appendix A2. The sharpened boundaries provide additional flexibility to alter the sample size.
Thus the stage-wise and cumulative MAMS designs provide the same degree of flexibility to make adaptive changes to an ongoing design. There is, however, a fundamental difference in the handling of multiplicity by the two methods. In stage-wise MAMS the multiplicity is incorporated into the adjusted P-values whereas in cumulative MAMS it is incorporated into the group sequential boundaries.
We have compared the stage-wise MAMS and cumulative MAMS approaches in a systematic manner under different configurations of the treatment effects and decision rules for dropping arms. Our first investigation, in Section 4.1, was for two treatment arms vs a common control arm with no treatment selection and no early stopping. In this simple setting it was possible to compare the two designs analytically and thus determine with great accuracy that only in the homogeneous case where $\delta_1 = \delta_2$ does the stage-wise MAMS design have greater power than the cumulative MAMS design. Moreover, the power differential for this configuration of $\boldsymbol{\delta}$ is at most 0.2%. For all other configurations the cumulative MAMS design has greater power, with the power differential increasing as the $\delta$ values separate and reaching 5% when the $\delta$ values are farthest apart. Next, in Section 4.2, we investigated the case of three treatment arms vs a common control arm, with treatment selection at the end of stage one but no early stopping. This investigation was by simulation and demonstrated greater power gains, up to 11%, for cumulative MAMS designs over stage-wise MAMS designs. As before, the power gains increased with greater heterogeneity among the $\delta$ values. Finally, in Section 5 we simulated two and three-stage designs with dose selection as well as sample size reestimation for the SOCRATES-REDUCED clinical trial. Here too the cumulative MAMS designs had greater power than the stage-wise MAMS designs, with power gains that increased substantially with greater heterogeneity among the $\delta$ values. For example, for $\boldsymbol{\delta} = (0, 0, 0.187)$ one could obtain a 14.3% power gain for cumulative MAMS over stage-wise MAMS with Dunnett-adjusted P-values, a 20.3% power gain over stage-wise MAMS with Simes-adjusted P-values and a 21% power gain over stage-wise MAMS with Bonferroni-adjusted P-values.
While the large power gains for cumulative MAMS designs over stage-wise MAMS designs shown here have not been shown previously, they are consistent with results published in Koenig et al, 12 Friede and Stallard, 18 and Magirr et al. 7 Koenig et al 12 and Friede and Stallard 18 showed a benefit for the adaptive Dunnett test over the P-value combination test for two-stage designs with treatment selection but no early stopping or sample size reestimation. Magirr et al 7 investigated two and three-stage designs with treatment selection, early stopping and sample size reestimation, and showed a benefit for the "CE-SB" and "CE-AP" designs that utilize cumulative statistics and recompute multiplicity-adjusted stopping boundaries through use of conditional error rates to control the FWER, over the "PC-SB" designs that control the FWER through inverse normal combination of adjusted P-values.
Even small gains in power can translate into large sample size savings for cumulative MAMS designs over stage-wise MAMS designs. For example, it is seen from Table 1B that, for a sample size of 388, if $\boldsymbol{\delta} = (0, 0, 0.187)$ the cumulative MAMS design has 64.7% power while the stage-wise MAMS design has 59.2% power. In order for the stage-wise MAMS design to also have 64.7% power, 448 subjects would be needed. Furthermore, as can be seen from Table 2, the average sample size of the cumulative MAMS design in this three-stage early-stopping setting is 343 subjects. We have determined in a separate simulation that the corresponding average sample size of the stage-wise MAMS design is 424 subjects.
It was conjectured by a reviewer that the power advantage of the cumulative MAMS design over the stage-wise MAMS design in Section 5 might be due to the specific sample-size increase rule utilized in our simulations. This rule, which might be termed "proportional upscaling," requires that the initially specified total sample size not be reduced when arms are dropped at an interim analysis. Instead the sample size that would have been assigned to the dropped arms is reallocated to continuing arms, in proportion to the original allocation ratios. To check the validity of this conjecture we resimulated the designs in Table 1A without proportional upscaling. In Table 4 we display power and sample size comparisons for the two-stage SOCRATES design in which the unallocated sample sizes of the dropped arm are not reassigned to the arms that continue. As can be seen, these results are qualitatively similar to those of Table 1A. Thus the power advantage of the cumulative MAMS design appears to hold with or without proportional upscaling.
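The reallocation rule described above can be sketched in a few lines. This is a minimal illustration rather than the simulation code used for the tables: the function name is hypothetical, the control arm is assumed to be one of the continuing arms, and rounding to whole patients is ignored.

```python
def reallocate_remaining(
    planned_remaining: dict[str, float], dropped: set[str]
) -> dict[str, float]:
    """Proportional upscaling of the remaining per-arm sample sizes.

    The remaining planned sample size of the dropped arms is spread over the
    continuing arms in proportion to their own planned allocation, so the
    total remaining sample size is unchanged.
    """
    continuing = {
        arm: n for arm, n in planned_remaining.items() if arm not in dropped
    }
    if not continuing:
        raise ValueError("at least one arm must continue")
    freed = sum(n for arm, n in planned_remaining.items() if arm in dropped)
    total_continuing = sum(continuing.values())
    return {
        arm: n + freed * n / total_continuing for arm, n in continuing.items()
    }


if __name__ == "__main__":
    # Illustrative numbers only: 50 patients per arm remain, dose1 is dropped.
    remaining = {"placebo": 50, "dose1": 50, "dose2": 50, "dose3": 50}
    print(reallocate_remaining(remaining, dropped={"dose1"}))
```

Without proportional upscaling, the continuing arms would simply keep their original planned sizes and the total sample size would shrink by the amount freed.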
The conclusions we draw from the results presented in this paper are as follows:
1. Cumulative MAMS designs appear to be more powerful than stage-wise MAMS designs except in the homogeneous case where all the $\delta$ values are the same.
2. For the special case of two active treatments, with no treatment selection or sample size increase, analytical comparisons were possible. They revealed that when $\delta_1 = \delta_2$ there is a small advantage for the stage-wise MAMS design over the cumulative MAMS design, but it disappears as the two $\delta$ values begin to diverge. It is thus entirely plausible that the same effect is present in the more complex setting of multiple doses, multiple looks and sample size reestimation considered in Sections 4.2 and 5. If present, however, the effect is too small to be detected in an experiment involving 10 000 simulated trials.
3. The magnitude of the power gain of cumulative MAMS designs over stage-wise MAMS designs can be substantial and increases with increasing heterogeneity of the $\delta$ values.
4. Our results are based on a reasonably exhaustive exploration of the parameter space for three active treatment arms under specific decision rules for treatment selection, sample size reestimation and early stopping. We cannot claim that they hold for all possible adaptive designs. Nevertheless the designs that we have considered here are ones that are likely to be adopted in practice. For other designs it is recommended to explore the operating characteristics of the two approaches by simulation using the tools we have discussed here.
We tried to ascertain why the cumulative MAMS approach was more powerful than the stage-wise MAMS approach. We have three conjectures.
1. For the special case of two active doses with no early stopping or dropping of doses we were able to obtain explicit power functions for the two methods in Section 4.1 and thereby demonstrate that the cumulative MAMS test, unlike the stage-wise MAMS test, is based on sufficient statistics.
2. When there is no sample size reestimation the multiplicity-adjusted cumulative MAMS boundaries are consonant. That is, although these boundaries have been constructed under the global null hypothesis $H_0$, any elementary hypothesis $H_0^i$ for which $w_{ij} \ge b_j$ can be rejected without loss of FWER control. In contrast, in order to reject $H_0^i$ in the stage-wise MAMS approach, one must always go through the entire closed testing procedure.
3. If treatments are dropped at an interim look in the cumulative MAMS design it is possible to gain efficiency through boundary recomputation in conjunction with closed testing. Specifically, in the two-stage cumulative MAMS design, the final critical value for testing $H_0^I$ is adjusted from $b_{I2}$ to $b^*_{I2}$ by imposing the Müller and Schäfer condition 11 through Equation (2). Although not shown here, we have verified that $b^*_{I2} \le b_{I2}$ so that this adjustment confers an advantage on the group sequential approach that is not available to the P-value combination approach.
We have not been able to explain why P(CUMUL) − P(STAGE) increases with increasing heterogeneity of the $\delta$ values. We are also unable to explain why P(CUMUL) − P(STAGE) first increases with increasing conservatism of the rule for dropping arms and then decreases. This phenomenon is manifest in every column of Table 3. We believe that this behavior is worth further investigation.
Throughout this paper we have utilized score statistics for monitoring the data and performing the hypothesis tests. We assumed in Section 2 that the scores are normally distributed with independent increments. These distributional properties hold exactly for normal data with known variance and asymptotically for all other settings in which the variance is estimated by maximum likelihood methods. 14 We showed in Section 5, Equations (8) and (9), how one might use the t-distribution to transform the cumulative MAMS boundaries and thereby obtain type-1 error control for the case of normal data with unknown variance. We did not examine the accuracy of the asymptotic distributions when the underlying data are binomial or have time-to-event endpoints. In this regard the stage-wise MAMS approach, though not as powerful as the cumulative MAMS approach, might be more robust since one can combine P-values that are adjusted for multiplicity by nonparametric methods like the Bonferroni and Simes methods rather than resort to normal approximations. On the other hand, if convergence of the score statistics to asymptotic normality with independent increments were in doubt, one could set the nominal type-1 error of the cumulative MAMS design to be smaller than the desired $\alpha$, say $\alpha/2$, so as to ensure that the actual type-1 error would be controlled at level $\alpha$. The large power advantage that the cumulative MAMS design enjoys over stage-wise MAMS designs that utilize multiplicity-adjusted nonparametric P-values, as evidenced by Table 3 of Section 5, would probably not be offset even by extreme conservatism in the choice of the nominal $\alpha$. This reasoning would not, however, be applicable if we were interested in testing multiple endpoints rather than testing multiple treatment arms. The multiarm problem is amenable to cumulative MAMS designs because the interarm correlation structure can be determined exactly from the treatment to control allocation ratio.
The correlations between multiple endpoints must be estimated from the data and hence are subject to sampling error. Thus for multiple endpoint problems the stage-wise MAMS methods that utilize the nonparametric Simes or Bonferroni adjustments to control the multiplicity might have an advantage over the cumulative MAMS methods that rely on large-sample approximations. This is a topic for further investigation.
Another topic for further investigation is parameter estimation at the end of the trial. Bias reduction methods were investigated by Posch et al 5 for stage-wise MAMS designs with dose selection but no sample size adaptation. For two-arm group sequential designs with adaptive sample size reestimation, methods have been developed by Gao et al, 19 Brannath et al, 20 and Mehta et al. 21 There has been some recent work on unbiased point estimates in phase 2-3 trials by Bowden and Glimm, 22 Robertson et al, 23 and Stallard and Kimani. 24 Magirr et al 25 have proposed simultaneous confidence intervals that are compatible with closed testing in adaptive designs. Further study is needed to understand how these methods may be incorporated into the general framework presented here.

APPENDIX A1. ANALYTICAL COMPARISON WITH TWO ACTIVE DOSES AND TWO STAGES
Patients are randomized equally between the three arms of the study and each patient's response is normally distributed with $\sigma^2 = 1$. The control arm has a mean of zero and treatment $i$ has mean $\delta_i$, $i = 1, 2$. The null hypothesis corresponding to treatment $i$ is $H_0^i : \delta_i = 0$. In this section we will test the global null hypothesis $H_0 = H_0^1 \cap H_0^2$ against the one-sided alternative that $\delta_i > 0$ for at least one $i = 1, 2$. There will be no early stopping for efficacy, no dropping of treatments and no adaptive sample size reestimation.
(a) Analytical Power for Cumulative MAMS
Denote by P(CUMUL) the probability of rejecting $H_0$ when the true treatment effect is $\boldsymbol{\delta} = (\delta_1, \delta_2)$. Since there is no early stopping, the first stage boundary $b_1$ is $\infty$. Let $b_2$ denote the second stage boundary. Let $f_1(w_{11}, w_{21})$ be the probability density function of $W_1 = (W_{11}, W_{21})$, the stage 1 score statistics. Let $f_{(2)}(w_{1(2)}, w_{2(2)})$ be the probability density function of $W_{(2)} = (W_{1(2)}, W_{2(2)})$, the incremental stage 2 score statistics. These densities are multivariate normal with means, variances and covariance structures that depend on $\boldsymbol{\delta}$ as specified in Section 2. For notational convenience, however, we do not express explicitly the dependence of these density functions on $\boldsymbol{\delta}$.
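One way to write P(CUMUL) from these definitions is sketched below. It is a reconstruction under stated assumptions, not necessarily the authors' exact display: it assumes that the cumulative stage 2 score for treatment $i$ is $W_{i2} = W_{i1} + W_{i(2)}$, and the subscript $\boldsymbol{\delta}$ on $P$ is notation introduced here to indicate the true treatment effect.

```latex
\begin{align*}
P(\mathrm{CUMUL})
 &= P_{\boldsymbol{\delta}}\bigl(\max\{W_{12}, W_{22}\} \ge b_2\bigr) \\
 &= 1 - P_{\boldsymbol{\delta}}\bigl(W_{11} + W_{1(2)} < b_2,\; W_{21} + W_{2(2)} < b_2\bigr) \\
 &= 1 - \int\!\!\int P_{\boldsymbol{\delta}}\bigl(W_{1(2)} < b_2 - w_{11},\; W_{2(2)} < b_2 - w_{21}
      \,\big|\, w_{11}, w_{21}\bigr)\, f_1(w_{11}, w_{21})\, dw_{21}\, dw_{11} \\
 &= 1 - \int\!\!\int \left[\int_{-\infty}^{\,b_2 - w_{11}}\!\int_{-\infty}^{\,b_2 - w_{21}}
      f_{(2)}(w_{1(2)}, w_{2(2)})\, dw_{2(2)}\, dw_{1(2)}\right]
      f_1(w_{11}, w_{21})\, dw_{21}\, dw_{11}
\end{align*}
```

In the last step the conditional distribution of the stage 2 increments given the stage 1 scores is replaced by their unconditional density $f_{(2)}$.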
The fourth line of the above equation utilizes the fact that $W_j$ has independent increments.

(b) Analytical Power for Stage-Wise MAMS
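We first evaluate the incremental P-values for the two stages. The stage 1 P-value, $p_1$, is the probability under $H_0$ that $\max\{W_{11}, W_{21}\}$ is at least as large as its observed value $\max\{w_{11}, w_{21}\}$. The stage 2 P-value, $p_{(2)}$, is computed in the same way from the incremental data obtained after the interim analysis, with $\max\{W_{(2)}\} = \max\{W_{1(2)}, W_{2(2)}\}$ in place of the stage 1 maximum. Since there is no early stopping, $H_0$ is rejected if the prespecified weighted inverse normal combination of $p_1$ and $p_{(2)}$ is at least $z_\alpha$, where $z_\alpha = \Phi^{-1}(1 - \alpha)$. The power of the P-value combination test to reject $H_0$, P(STAGE), is given by Equation (A2), in which the conditional probability of rejection given the stage 1 data is integrated against $f_1(w_{11}, w_{21})\,dw_{21}\,dw_{11}$. P(STAGE) can be further simplified for better comparison with P(CUMUL). Defining a univariate function leads to Equation (A3), and substituting Equation (A3) into Equation (A2) gives an expression in which the dependence on $(w_{11}, w_{21})$ is through their maximum, via $p_1$.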
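The stage-wise P-values above are tail probabilities of the maximum of correlated normal statistics, and a known-variance version is easy to compute numerically. The sketch below works on the standardized scale and assumes equicorrelated standard normal statistics with correlation 0.5, which is what equal allocation between each treatment arm and the control arm implies; the function name, grid size and integration range are arbitrary choices made for illustration.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

_PHI = NormalDist().cdf


def max_z_pvalue(
    z_max: float,
    num_arms: int = 2,
    rho: float = 0.5,
    grid: int = 2001,
    half_width: float = 8.0,
) -> float:
    """P_0(max_i Z_i >= z_max) for equicorrelated standard normal Z_1..Z_k.

    Uses the representation Z_i = sqrt(rho) * U + sqrt(1 - rho) * V_i with
    U, V_1, ..., V_k independent N(0, 1), and integrates over U with the
    trapezoid rule on [-half_width, half_width].
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must lie in [0, 1)")
    step = 2.0 * half_width / (grid - 1)
    total = 0.0
    for k in range(grid):
        u = -half_width + k * step
        weight = 0.5 if k in (0, grid - 1) else 1.0
        density = exp(-0.5 * u * u) / sqrt(2.0 * pi)
        # Given U = u, the Z_i are independent, each below z_max with this probability.
        inner = _PHI((z_max - sqrt(rho) * u) / sqrt(1.0 - rho))
        total += weight * density * inner ** num_arms
    return 1.0 - total * step


if __name__ == "__main__":
    # Hypothetical observed maximum standardized statistic at stage 1.
    print(max_z_pvalue(2.0, num_arms=2, rho=0.5))
```

With a single arm the function reduces to the ordinary one-sided normal P-value, which gives a quick sanity check on the integration.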

APPENDIX A2. EXTENDING CUMULATIVE MAMS TO J > 2 STAGES
Consider a $J$-stage multiarm group sequential design in which $\alpha_j$ is spent at stage $j$ in accordance with some $\alpha$-spending function such that $\sum_{j=1}^{J} \alpha_j = \alpha$. Then the corresponding efficacy boundaries $(b_1, b_2, \ldots, b_J)$ satisfy the requirements

$$P_0(\max\{W_1\} \ge b_1) = \alpha_1 \tag{A5}$$

and, for $j = 2, 3, \ldots, J$,

$$P_0\bigl(\max\{W_1\} < b_1, \ldots, \max\{W_{j-1}\} < b_{j-1}, \max\{W_j\} \ge b_j\bigr) = \alpha_j. \tag{A6}$$

We have shown in Reference 9 how to compute such boundaries. Now suppose we perform a one-time dose selection and sample size reestimation at some stage $q < J$. Let $\mathcal{D} = \{1, 2, \ldots, D\}$ denote the indices of the $D$ treatments and $S \subseteq \mathcal{D}$ denote the indices of the treatments selected for further testing at stages $q + 1, q + 2, \ldots, J$. We wish to test $H_0^i$ for all $i \in S$ while maintaining strong control of FWER at level $\alpha$. Therefore, based on the closed testing principle, each $H_0^i$ may only be rejected if, for all $I \subseteq \mathcal{D}$ such that $i \in I$, $H_0^I = \cap_{g \in I} H_0^g$ is rejected by a valid local level-$\alpha$ test. The following two-step procedure may be used to construct the local level-$\alpha$ test of $H_0^I$.
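Step 1 Compute $J$ new group sequential boundaries $(b_{I1}, b_{I2}, \ldots, b_{IJ})$ that are suitable for making $\|I\| \le D$ treatment comparisons to the common control arm. These boundaries must satisfy

$$P_0(\max\{W_{I1}\} \ge b_{I1}) = \alpha_1$$

and, for $j = 2, 3, \ldots, J$,

$$P_0\bigl(\max\{W_{I1}\} < b_{I1}, \ldots, \max\{W_{I,j-1}\} < b_{I,j-1}, \max\{W_{Ij}\} \ge b_{Ij}\bigr) = \alpha_j,$$

where $W_{Ij} = \{W_{ij}; i \in I\}$. If $\max\{w_{Iq}\} \ge b_{Iq}$, $H_0^I$ is rejected. Otherwise we proceed to Step 2.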
Step 2 After examining the stage $q$ data a subset $S \subseteq \mathcal{D}$ consisting of $\|S\|$ treatments is selected for testing at stages $q + 1, q + 2, \ldots, J$, possibly accompanied by an increase in the sample sizes of the selected doses. Let $I_S = I \cap S$. In order to obtain a valid level-$\alpha$ test of $H_0^I$ while accommodating this adaptation, we must replace the future boundaries $(b_{I,q+1}, b_{I,q+2}, \ldots, b_{IJ})$ with updated boundaries $(b^*_{I,q+1}, b^*_{I,q+2}, \ldots, b^*_{IJ})$ that satisfy the Müller and Schäfer criterion 11 for preserving the conditional type-1 error. Thus these updated boundaries must satisfy the relationship in Equation (A7), where $W^*_{I_S l} = \{W^*_{il} : i \in I_S\}$ and the superscript $*$ indicates that the sample size of the stage $l$ statistic $W^*_{I_S l}$ has been altered. We reject $H_0^I$ if, for any $l \in \{q + 1, q + 2, \ldots, J\}$, $\max\{w^*_{I_S l}\} \ge b^*_{Il}$. The method of Ghosh et al 9 can be applied to Equation (A7) to obtain $(b^*_{I,q+1}, b^*_{I,q+2}, \ldots, b^*_{IJ})$, possibly with a spending function for the remaining $\alpha$ that is different from the one that was selected initially. The details have been omitted for brevity.

APPENDIX A3. EXTENDING STAGE-WISE MAMS TO J > 2 STAGES
Recall that we can reject any elementary hypothesis $H_0^i$ only if the intersection hypothesis $H_0^I$ is rejected by a valid local level-$\alpha$ test for all subsets $I \subseteq \mathcal{D}$ that contain $i$. To test $H_0^I$ at any stage $j$ we require the multiplicity-adjusted P-values $P_{I(1)}, P_{I(2)}, \ldots, P_{I(j)}$ where each $P_{I(l)}$ utilizes only the incremental data of subjects enrolled between stages $l - 1$ and $l$. These P-values are transformed by the inverse normal function and combined with prespecified weights to form the stage-wise test statistic

$$Z_{Ij} = \sum_{l=1}^{j} h_{lj}\, \Phi^{-1}\bigl(1 - P_{I(l)}\bigr) \tag{A8}$$

where, for $l = 1, 2, \ldots, j$,

$$h_{lj} = \frac{\sqrt{N_{(l)}/N_J}}{\sqrt{(N_{(1)} + N_{(2)} + \cdots + N_{(j)})/N_J}},$$

$N_J$ is the preplanned total sample size of the trial, and $N_{(l)}$ is the preplanned incremental number of patients to be enrolled between stages $l - 1$ and $l$. Any valid multiplicity-adjusted P-values may be utilized in Equation (A8) for the test of $H_0^I$. Popular candidates include the t-test based P-values adjusted for multiplicity by the nonparametric Bonferroni and Simes procedures as shown in Reference 5. However, in order to make a meaningful comparison between the cumulative MAMS and the stage-wise MAMS approaches, we will utilize P-values that are derived from the maximum score statistic. In that case the multiplicity-adjusted P-values in Equation (A8) are given by

$$P_{I(l)} = P_0\bigl(\max\{W_{I(l)}\} \ge \max\{w_{I(l)}\}\bigr)$$

where, for $l = 1, 2, \ldots, j$, $W_{I(l)}$ is the vector of incremental score statistics contained in the subset $I \subseteq \mathcal{D}$ (or in the subset $I \cap S$ if a subset $S$ of treatments has been selected for further testing by stage $l$). Thereby Equation (A8) is a weighted sum of inverse normal P-values derived from Dunnett's test. 16 To evaluate $P_{I(l)}$ exactly we standardize the observed score to obtain $T_{I(l)}$, which has a multivariate t distribution with mean 0, $n_{0(l)} + \sum_{i \in I} n_{i(l)} - \|I\| - 1$ degrees of freedom, and a known correlation matrix that depends on the allocation ratios of the treatment arms to the control arm.
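The weighted inverse normal combination described above can be sketched as follows. The function names are hypothetical, and the stage-wise P-values are taken as given inputs, however they were adjusted for multiplicity; they must lie strictly between 0 and 1 for the inverse normal transform to be defined.

```python
from math import sqrt
from statistics import NormalDist

_INV_PHI = NormalDist().inv_cdf


def combination_weights(increments: list[float], j: int) -> list[float]:
    """Weights h_lj = sqrt(N_(l) / (N_(1) + ... + N_(j))) for l = 1..j.

    `increments` holds the preplanned incremental sample sizes N_(1), ..., N_(J).
    The squared weights sum to 1 by construction.
    """
    if not 1 <= j <= len(increments):
        raise ValueError("stage index out of range")
    cumulative = sum(increments[:j])
    return [sqrt(n / cumulative) for n in increments[:j]]


def inverse_normal_statistic(p_values: list[float], increments: list[float]) -> float:
    """Z_Ij: weighted sum of Phi^{-1}(1 - P_I(l)) over the stages observed so far."""
    j = len(p_values)
    weights = combination_weights(increments, j)
    return sum(h * _INV_PHI(1.0 - p) for h, p in zip(weights, p_values))


if __name__ == "__main__":
    # Illustrative two-stage design with equal preplanned increments.
    planned = [194, 194]
    print(combination_weights(planned, 2))
    print(inverse_normal_statistic([0.08, 0.01], planned))
```

Because the weights depend only on the preplanned increments, they stay fixed even if the actual stage 2 sample size is changed at the interim analysis.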
Since the P-values in Equation (A8) are computed from independent cohorts of patients and are combined with prespecified weights whose sum of squares is 1, the statistic $Z_{Ij}$ is $N(0, 1)$ with independent increments under the null hypothesis $H_0^I$ for all $j$ and all $I$. Thus one can readily obtain efficacy boundaries $c_1, c_2, \ldots, c_J$ such that

$$P_{H_0^I}(Z_{I1} \ge c_1) = \alpha_1 \tag{A11}$$

and, for $j = 2, 3, \ldots, J$,

$$P_{H_0^I}\bigl(Z_{I1} < c_1, \ldots, Z_{I,j-1} < c_{j-1}, Z_{Ij} \ge c_j\bigr) = \alpha_j \tag{A12}$$

by the usual methods for two-arm group-sequential designs. 2 The $\alpha_j$ values are obtained by specifying any suitable error spending function. The null hypothesis $H_0^I$ is rejected at the first $j$ such that $Z_{Ij} \ge c_j$. We reject $H_0^i$ with strong control of FWER if $H_0^I$ is rejected for all possible $I \subseteq \mathcal{D}$ with $i \in I$.
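The closed testing step can be sketched generically: an elementary hypothesis is rejected only if every intersection hypothesis containing it is rejected by its local test. In the sketch below the local test is passed in as a function; a single-stage Bonferroni rule is used purely as a stand-in for the group sequential comparison of $Z_{Ij}$ with $c_j$, and all names are hypothetical.

```python
from itertools import combinations


def closed_test(arms, local_test):
    """Return the arms whose elementary hypotheses are rejected.

    `local_test(I)` must return True when the intersection hypothesis for the
    frozenset of arms I is rejected by a valid local level-alpha test.
    """
    arms = tuple(arms)
    subsets = [
        frozenset(combo)
        for size in range(1, len(arms) + 1)
        for combo in combinations(arms, size)
    ]
    rejected = {s for s in subsets if local_test(s)}
    # Arm i is rejected only if every subset containing i is rejected.
    return [i for i in arms if all(s in rejected for s in subsets if i in s)]


def bonferroni_local_test(subset, p_values, alpha=0.025):
    """Stand-in local test: reject if the smallest P-value is at most alpha / |I|."""
    return min(p_values[i] for i in subset) <= alpha / len(subset)


if __name__ == "__main__":
    # Hypothetical one-sided P-values for three treatment arms.
    p = {1: 0.001, 2: 0.02, 3: 0.5}
    print(closed_test(p, lambda s: bonferroni_local_test(s, p)))
```

The number of intersection hypotheses grows as $2^D - 1$, which is manageable for the two and three arm designs considered here.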