Design and monitoring of multi‐arm multi‐stage clinical trials

Two-arm group sequential designs have been widely used for over 40 years, especially for studies with mortality endpoints. The natural generalization of such designs to trials with multiple treatment arms and a common control (MAMS designs) has, however, been implemented rarely. While the statistical methodology for this extension is clear, the main limitation has been the lack of an efficient way to perform the computations. Past efforts were hampered by algorithms that were computationally explosive. With the increasing interest in adaptive designs, platform designs, and other innovative designs that involve multiple comparisons over multiple stages, the importance of MAMS designs is growing rapidly. This article provides breakthrough algorithms that can compute MAMS boundaries rapidly, thereby making such designs practical. For designs with efficacy-only boundaries the computational effort increases linearly with the number of arms and the number of stages. For designs with both efficacy and futility boundaries the computational effort doubles with each successive increase in the number of stages.


Introduction
Group sequential design and monitoring of randomized clinical trials has played an important role in clinical drug development for over 40 years. The literature on group sequential methods is vast and has been presented in a unified manner in outstanding textbooks by Jennison and Turnbull (2000) and Proschan, Lan, and Wittes (2006). Throughout its 40-year history, the development and use of group sequential methodology has been largely confined to two-arm trials, in which a proposed new medical intervention is compared to a standard-of-care control arm with respect to a single primary endpoint. There is increasing interest, however, in extending the methodology to multiple pairwise comparisons with a common control, the so-called multi-arm multi-stage (MAMS) case. Parmar, Carpenter, and Sydes (2014) have argued persuasively that multi-arm trials allow more treatments to be assessed in less time than could ever be done in a series of two-arm trials. Equally important, for patients and policy makers, the multi-arm trial produces contemporaneous results for all research treatments. Jaki (2016) compared the sample size requirements of separate two-arm trials to those of multi-arm trials for Alzheimer's disease treatments and showed savings of up to 50%.
A major hurdle to adopting MAMS designs routinely is the computational effort of obtaining stopping boundaries that guarantee strong control of the type-1 error. This problem has not, so far, been handled satisfactorily. Attempts were made by, among others, Hughes (1993), Follmann, Proschan, and Geller (1994), Stallard and Todd (2003), Chen, DeMets, and Lan (2010), and Magirr, Jaki, and Whitehead (2012), with varying degrees of success, as we discuss in Section 5. MAMS designs face two major computational hurdles: unless their special structure is exploited, the computational complexity grows exponentially with the number of arms as well as with the number of stages. In all previous work on MAMS designs, the fact that the independent increments property, so successfully exploited for 2-arm trials (see, e.g., Armitage et al., 1969), could be generalized to the multi-arm setting was never recognized. Additionally, multi-arm designs involve multivariate normal integration, for which special techniques were developed only in the 1990s (see, e.g., Genz and Bretz, 2009), and hence were not available to the early proponents of MAMS designs. [Correction added on September 25, 2017, after first online publication: Acknowledgment of contribution by Professor Ralph D'Agostino, Sr. and edits to sentences in Section 4.2 under Look 2 and Look l.]
In this article, we present an algorithm that exploits independent increments and also utilizes the quasi-Monte Carlo methods of Genz and Bretz (2009, page 91) to perform multivariate normal integration efficiently. As a result, the computational effort required to create designs with efficacy-only boundaries is linear in both the number of arms and the number of stages, while the computational effort required to create designs with both efficacy and futility boundaries remains linear in the number of arms but doubles with each successive increase in the number of stages. Furthermore, because we utilize quasi-Monte Carlo methods, the error in the numerical integration can be quantified and the accuracy of the resulting stopping boundaries can be bounded to any desired level of confidence. This is not possible with the other methods discussed above, since they utilize numerical quadrature in their algorithms.
We formulate the problem in Section 2 and standardize it in Section 3 so as to eliminate dependence on the variance-covariance matrix of the test statistic. The numerical algorithm is presented in Section 4. Sections 5 and 6 discuss how the method compares with alternative methods that have been developed for MAMS designs. The method is applied, for illustrative purposes, to a multi-arm clinical trial of chronic obstructive pulmonary disease in Section 7. We end in Section 8 with some concluding remarks.

Stopping Boundaries for Normally Distributed Data
Consider a group sequential clinical trial in which D treatment arms, indexed by i = 1, 2, ..., D, are compared in a pairwise manner to a common control arm, indexed by i = 0. The trial comprises up to J looks at the accumulating data, indexed by j = 1, 2, ..., J. As subjects enter the trial they are randomized to the ith treatment arm or the control arm in accordance with a pre-specified allocation ratio λ_i. Let X_i ~ N(μ_i, σ_i^2) denote the response of a subject enrolled on arm i, and let δ_i = μ_i − μ_0. We shall be interested in testing the global null hypothesis H_0: δ_i = 0 for all i = 1, 2, ..., D against the one-sided alternative hypothesis H_1: δ_i > 0 for at least one i, through a group sequential hypothesis testing strategy that ensures strong control of the type-1 error. To that end, let n_i1 < n_i2 < ... < n_iJ be the cumulative sample sizes on arm i, let x̄_ij be the sample mean of the n_ij subjects on arm i, and let δ̂_ij = x̄_ij − x̄_0j be the maximum likelihood estimate of δ_i at look j. The Fisher information for δ_i at look j, denoted by I_ij, is estimated by the inverse square of the standard error of δ̂_ij. Thus

I_ij = (σ_i^2/n_ij + σ_0^2/n_0j)^{-1} = n_0j (σ_i^2/λ_i + σ_0^2)^{-1},

since n_ij = λ_i n_0j. Denote the efficient score statistic by W_ij = δ̂_ij I_ij and let W_j = (W_1j, W_2j, ..., W_Dj). Then W_j, j = 1, 2, ..., J, is a multivariate Brownian process with

E(W_ij) = δ_i I_ij and Cov(W_ij1, W_ij2) = I_i,min(j1,j2). (2.1)

Property (2.1) implies that the W_j process has independent increments; that is, W_j1 and W_j2 − W_j1 are independent for j1 < j2. Let δ = (δ_1, δ_2, ..., δ_D) and max{W_j} = max{W_ij: i = 1, 2, ..., D}. A J-look, level-α group sequential design with early stopping only for efficacy involves evaluation of efficacy boundaries (b_1, b_2, ..., b_J) satisfying

Σ_{j=1}^{J} P_0{max{W_l} < b_l, l = 1, ..., j − 1; max{W_j} ≥ b_j} = α, (2.2)

where P_h(·) denotes probability under δ = h. These efficacy boundaries are typically computed recursively. The overall type-1 error α is first split into J positive components (α_1, α_2, ..., α_J), in accordance with some pre-specified error spending function (Lan and DeMets, 1983), such that Σ_j α_j = α. Now suppose that b_1, ..., b_{j−1} have already been evaluated. Then b_j is obtained from the jth term of equation (2.2) by solving

P_0{max{W_l} < b_l, l = 1, ..., j − 1; max{W_j} ≥ b_j} = α_j. (2.3)

Notice that these efficacy boundaries have been evaluated under the complete null hypothesis δ = 0. It can be shown, however, that because they utilize max{W_j} at each look, they do provide strong control of the type-1 error (see Magirr et al., 2012).
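To make the recursion in (2.3) concrete, the sketch below computes efficacy boundaries by simulation and bisection on the standardized (Wald) scale rather than by the numerical integration developed later in this article; the equal allocation, the equally spaced looks, and the simulation-plus-root-finding approach are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np
from scipy.optimize import brentq

def simulate_z_paths(D, J, n_sims, seed=0):
    """Simulate standardized statistics Z_ij for D comparisons to a common
    control over J equally spaced looks under the global null. With equal
    allocation the shared control arm induces correlation 0.5 across arms."""
    rng = np.random.default_rng(seed)
    ctrl = rng.standard_normal((n_sims, J)).cumsum(axis=1)    # control sums
    trt = rng.standard_normal((n_sims, D, J)).cumsum(axis=2)  # treatment sums
    j = np.arange(1, J + 1)
    # Z_ij = (S_i - S_0)/sqrt(2 j): marginally N(0,1) at every look
    return (trt - ctrl[:, None, :]) / np.sqrt(2 * j)

def efficacy_boundaries(D, J, alpha_spend, n_sims=200_000):
    """Solve (2.3) look by look: choose b_j so that the probability of first
    crossing at look j, P0{max Z_l < b_l, l < j; max Z_j >= b_j}, equals the
    spent alpha_j."""
    Zmax = simulate_z_paths(D, J, n_sims).max(axis=1)  # max over arms, per look
    b, alive = [], np.ones(n_sims, dtype=bool)         # 'alive' = not yet stopped
    for jj, a_j in enumerate(alpha_spend):
        f = lambda bj: np.mean(alive & (Zmax[:, jj] >= bj)) - a_j
        b.append(brentq(f, 0.0, 6.0))
        alive &= Zmax[:, jj] < b[-1]
    return b

print(efficacy_boundaries(D=3, J=2, alpha_spend=[0.0125, 0.0125]))
```

Each b_j is solved so that the incremental crossing probability equals the spent α_j, with already-stopped paths excluded from subsequent looks.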
Once the efficacy boundaries have been evaluated, one can compute β, the type-2 error of the design at any alternative hypothesis, say δ = δ^(1), as

β = P_{δ^(1)}{max{W_j} < b_j for all j = 1, 2, ..., J}. (2.4)

One can also incorporate futility boundaries (a_1, a_2, ..., a_{J−1}, a_J = b_J) into the design, such that the trial may be stopped for futility if max{W_j} ≤ a_j at any look j, in which case the type-2 error is

β = Σ_{j=1}^{J} P_{δ^(1)}{a_l < max{W_l} < b_l, l = 1, ..., j − 1; max{W_j} ≤ a_j}. (2.5)

In practice, the futility boundaries are made non-binding by first evaluating the efficacy boundaries b_1, b_2, ..., b_J. Next, an arbitrary value is selected for β, the type-2 error at the pre-specified sample size, and is split into J components (β_1, β_2, ..., β_J), with Σ_{j=1}^{J} β_j = β, in accordance with some desired error spending function. Now suppose that a_1, ..., a_{j−1} have already been evaluated. Then at look j, a_j can be evaluated by equating the jth term of equation (2.5) to β_j. Thus

P_{δ^(1)}{a_l < max{W_l} < b_l, l = 1, ..., j − 1; max{W_j} ≤ a_j} = β_j. (2.6)

This entire procedure is iterated over different values of β until a_J = b_J; that is, the boundaries must meet at the last look. The final value of β thus evaluated is the actual type-2 error of the study.
We show in Appendix S1 of the Supplementary Materials that the probability equations (2.3) and (2.6) can be solved by repeated evaluation of "component probabilities" of the form

P_h{max{W_l} < c_l, l = 1, 2, ..., j}, (2.7)

where c_l is either a_l or b_l, depending on the context. We also show in Appendix S1 that at any look j it is necessary to evaluate 2^j such individual components, each being of order j × D.
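One way to see where the 2^j components come from: each continuation event {a_l < max{W_l} < b_l} can be expanded by inclusion-exclusion into signed probabilities of events {max{W_l} < c_l} with c_l ∈ {a_l, b_l}. The helper below is an illustrative enumeration of those signed limit vectors, not code from the paper.

```python
from itertools import product

def component_limits(a, b):
    """Expand the continuation event {a_l < max W_l < b_l, l = 1..j} by
    inclusion-exclusion into 2^j signed terms P{max W_l < c_l, l = 1..j}
    with c_l in {a_l, b_l}; the sign is (-1)^(number of a_l's chosen)."""
    j = len(a)
    terms = []
    for choice in product((0, 1), repeat=j):   # 1 -> pick b_l, 0 -> pick a_l
        c = [b[l] if pick else a[l] for l, pick in enumerate(choice)]
        sign = (-1) ** (j - sum(choice))       # parity of the a_l picks
        terms.append((sign, c))
    return terms

print(component_limits([1.0, 1.5], [2.5, 3.0]))  # 2^2 = 4 signed terms
```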

Generalization to Discrete and Time-to-Event Models
Although the above formulation for designing a MAMS group sequential design assumes that the underlying data are normally distributed, it is applicable more generally to clinical trials with continuous endpoints, binary endpoints, time-to-event endpoints, and even to regression models in which the test statistics are derived as efficient scores from the likelihood function, by a straightforward application of the methods of Jennison and Turnbull (1997) (JT) for two-arm group sequential designs. The basic result is Theorem 2 of JT, which states that for general parametric regression models in which δ denotes the coefficients of the covariates and c′δ is any linear contrast, the sequentially computed score statistics for c′δ at looks j = 1, 2, ..., J are multivariate normal with independent increments. Theorem 3 of JT presents an analogous result for the Cox proportional hazards model, in which δ is a vector of log hazard ratios. These results form the basis of most major software packages for group sequential inference with normal, binomial, and time-to-event endpoints. The East software package provides simulation tools to verify the accuracy of the results or make appropriate adjustments for small sample sizes.
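As an illustration of how the normal-theory recipe carries over to a binary endpoint, the snippet below forms the efficient score statistic W = δ̂·I with the information I estimated as the inverse squared standard error of δ̂, following the general recipe in the text (unpooled variance; an illustrative sketch, not the East implementation).

```python
def score_statistic_binary(x_trt, n_trt, x_ctl, n_ctl):
    """Efficient score statistic W = delta_hat * I for a binary endpoint:
    delta_hat is the difference in response rates and the information I is
    estimated as the inverse squared (unpooled) standard error of delta_hat."""
    p_t, p_c = x_trt / n_trt, x_ctl / n_ctl
    delta_hat = p_t - p_c
    se2 = p_t * (1 - p_t) / n_trt + p_c * (1 - p_c) / n_ctl
    info = 1.0 / se2
    return delta_hat * info, info

print(score_statistic_binary(30, 100, 20, 100))
```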

Repeated Confidence Intervals
The repeated confidence intervals proposed by Jennison and Turnbull (1989) for two-arm designs extend naturally to MAMS designs. Consider a D-arm, J-look MAMS design with level-α efficacy boundaries b_j, j = 1, 2, ..., J, computed as discussed above. These boundaries must satisfy the relationship

P_h{W_ij − h_i I_ij < b_j, i = 1, ..., D, j = 1, ..., J} ≥ 1 − α.

In other words, the event {W_ij − h_i I_ij < b_j, i = 1, ..., D, j = 1, ..., J} occurs with probability at least 1 − α. Since W_ij = δ̂_ij I_ij, it follows that for every comparison i at every look j, the event {δ̂_ij I_ij − h_i I_ij < b_j} occurs with probability at least 1 − α, whereupon the 1 − α lower repeated confidence bound for h_i is δ̂_ij − b_j/I_ij. By a symmetrical argument, the upper repeated confidence bound is δ̂_ij + b_j/I_ij.
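A minimal helper for the repeated confidence bounds just derived; the numeric inputs in the usage line are hypothetical interim values, not numbers from the paper.

```python
def repeated_ci(delta_hat, info, b):
    """1 - alpha repeated confidence interval for delta_i at look j, given
    the MLE delta_hat, its Fisher information info = I_ij, and the score-
    scale efficacy boundary b = b_j: delta_hat -/+ b / I_ij."""
    half_width = b / info
    return delta_hat - half_width, delta_hat + half_width

# hypothetical interim numbers: delta_hat = 0.16, I_ij = 100, b_j = 22.4
print(repeated_ci(0.16, 100.0, 22.4))
```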

Standardized Score Statistics
In order to evaluate component probabilities of the form (2.7), it is desirable to transform W_j such that the sample size affects the transformed statistic only through its mean and not through its covariance matrix, for then the efficacy boundaries will not depend on the sample size. Accordingly, we define a new stochastic process U_j = (U_1j, U_2j, ..., U_Dj), with U_ij = W_ij/√I_max, where I_max = max{I_iJ, i = 1, 2, ..., D} is the maximum information at the final look over all the δ_i parameters. Let t^(j) = I_ij/I_max denote the corresponding information fraction at look j (assumed common across the D comparisons), and define the "drift parameter" for treatment i by η_i = δ_i √I_max. Then U_j is a multivariate Brownian process with independent increments,

E(U_ij) = η_i t^(j) and Cov(U_ij1, U_ij2) = t^(min(j1,j2)).

Define the D × D matrix ρ as the correlation matrix of U_j induced by the comparisons to the common control arm. In terms of the standardized process, the component probability (2.7) becomes

P_η{max{U_l} < d_l, l = 1, 2, ..., j}, (3.9)

where d_l = c_l/√I_max. We will develop an efficient numerical algorithm for evaluating (3.9).
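Under the additional assumptions of equal variances and allocation ratios λ_i, the cross-arm correlations induced by the shared control arm take the familiar Dunnett form; the sketch below constructs such a ρ matrix. The closed form is an assumption of this illustration, derived from the covariance of δ̂_i1 and δ̂_i2 through the common control mean, and is not taken from the paper.

```python
import numpy as np

def dunnett_corr(lam):
    """Correlation matrix of the D standardized comparisons to a common
    control under equal variances: rho_{i1,i2} =
    sqrt(lam_i1 * lam_i2 / ((1 + lam_i1) * (1 + lam_i2))), where lam[i] is
    the allocation ratio of arm i relative to control."""
    lam = np.asarray(lam, dtype=float)
    w = np.sqrt(lam / (1.0 + lam))   # per-arm factor from the shared control
    rho = np.outer(w, w)
    np.fill_diagonal(rho, 1.0)
    return rho

print(dunnett_corr([1.0, 1.0, 1.0]))  # equal allocation: off-diagonals 0.5
```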

Integration Over a Unit Hypercube
The independent increments structure of U_j enables us to factor (3.9) into a product of nested integrals that can be solved recursively. To see this, observe that

P_η{max{U_l} < d_l, l = 1, 2, ..., j} = ∫_{u_1 < d_1} ··· ∫_{u_j < d_j} f(u_1, u_2, ..., u_j | t, η, ρ) du_j ··· du_1, (4.10)

where f(u_1, u_2, ..., u_j | t, η, ρ) is the joint density of (U_1, U_2, ..., U_j) and u_l < d_l means that u_il < d_l for all i = 1, 2, ..., D. Factoring f(·) as a product of conditional densities, we can rewrite (4.10) as a recursive integral (4.11) in which each factor is the density of an independent increment u^(l) = u_l − u_{l−1}, multivariate normal with mean t^(l) η and covariance matrix t^(l) ρ, where t^(l) denotes the information increment between looks l − 1 and l; or, more conveniently, as a recursive multivariate integral of the product of these j increment densities (4.12). The next step is to replace (4.12) with a product of univariate integrals. To this end, let ρ = CC′ be the Cholesky decomposition of ρ, where C is a lower triangular matrix. Using the linear transformation (u^(l) − t^(l) η)/√t^(l) = C y_l, l = 1, 2, ..., j, as discussed in Genz and Bretz (2009, page 29), the quadratic form in the exponent of the lth density reduces to y_l′ y_l, and the Jacobian of the transformation is J = √t^(l) C, so that det(J) = det(√t^(l) C) = √det(t^(l) ρ). The product of the j multivariate normal integrals (4.12) can therefore be rewritten as a product of D × j univariate normal integrals (4.13), in which, as shown in Appendix S2 of the Supplementary Materials, the upper limit of integration for y_il is of the form

e*_il = { (d_l − t_l η_i − Q_i,l−1)/√t^(l) − Σ_{m=1}^{i−1} C_im y_ml } / C_ii, (4.14)

for i = 1, 2, ..., D and l = 1, 2, ..., j, where t_l is the cumulative information fraction at look l, Q_i,l−1 is the contribution of looks 1, ..., l − 1 to the ith coordinate, and a summation is null if its lower limit exceeds its upper limit. We utilize the transformation Φ(y_il) = z_il, with e_il = Φ(e*_il), to convert (4.13) to the form (4.17) for integration over the unit hypercube [0, 1]^{D×j}. This relationship shows that although (4.17) appears to be a product of independent integrals on the unit hypercube, the integrands e_il are linked recursively: the limits at look l depend on the sampled values at looks 1, ..., l − 1 only through the running sums Q_i,l−1. Note this recursion for future reference. We shall evaluate (4.17) by quasi-Monte Carlo integration.
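The one-look special case of the sequence of transformations above is the well-known Genz separation-of-variables algorithm for multivariate normal probabilities. The sketch below implements that special case with crude Monte Carlo sampling on the unit hypercube; it illustrates the Cholesky and Φ-transformations only, and is not the paper's full multi-look recursion.

```python
import numpy as np
from scipy.stats import norm

def mvn_cdf_genz(d, rho, n_points=20_000, seed=0):
    """Estimate P(X_1 < d_1, ..., X_D < d_D) for X ~ N(0, rho) by mapping
    the integral onto the unit hypercube via the Cholesky factor of rho and
    the Phi-transformation (one-look special case, for illustration)."""
    C = np.linalg.cholesky(np.asarray(rho, dtype=float))  # rho = C C'
    D = len(d)
    rng = np.random.default_rng(seed)
    x = rng.random((n_points, D))          # uniform points in [0,1)^D
    f = np.ones(n_points)                  # running product of the e_i
    y = np.zeros((n_points, D))
    for i in range(D):
        # conditional upper limit for y_i given y_1, ..., y_{i-1}
        lim = (d[i] - y[:, :i] @ C[i, :i]) / C[i, i]
        e = norm.cdf(lim)
        f *= e
        y[:, i] = norm.ppf(e * x[:, i])    # sample y_i below its limit
    return f.mean()

print(mvn_cdf_genz([0.0, 0.0], np.eye(2)))  # independent case: true value 0.25
```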

Quasi-Monte Carlo Integration
Denote the multiple integral (4.17) by

I = ∫_{[0,1]^{D×j}} f(x) dx,

where, for any point x in the unit hypercube, the integrand f(x) is computed by evaluating each e_il recursively as described below.

Look l: At look l, each e_il, i = 1, 2, ..., D, is evaluated via (4.14) as a function of the running sums Q_m,l−1 saved from the previous look and of the preceding e_ml, m = 1, 2, ..., i − 1, at the current look. Then the terms Q_il = Φ^{-1}(e_il x_il) √t^(l) + Q_i,l−1, i = 1, 2, ..., D, are saved for use at look l + 1. These computations proceed look by look for l = 1, 2, ..., j, and f(x) is finally evaluated as the product of the individual e_il terms. Let x_l = (x_1l, ..., x_Dl) and e_l = e_1l e_2l ··· e_Dl. Then the above computations imply that f(x) = Π_{l=1}^{j} e_l. One can obtain a crude Monte Carlo estimate of I by sampling x uniformly from the D × j dimensional unit hypercube. Let x^(1), x^(2), ..., x^(N) be the N sampled values. Then

Î = N^{-1} Σ_{n=1}^{N} f(x^(n))

is an unbiased estimate of I with standard error of order O(N^{-1/2}). It is possible to improve the accuracy of the crude Monte Carlo estimate by a quasi-Monte Carlo (QMC) procedure that combines the above randomly generated points with a fixed sequence of points created by lattice rules. We shall use a QMC procedure proposed by Genz and Bretz (2009, pages 47-48). A rank-1 lattice is a point set of the form

{ (k υ_P / P) mod 1 : k = 0, 1, ..., P − 1 },

where P is a prime number and (in our setting) υ_P is a D × j dimensional vector of integers that depends on P. Many choices have been proposed for (P, υ_P). In order to achieve O(N^{-1+ε}) integration errors, good lattice parameters must be determined. Genz (2000, http://www.math.wsu.edu/faculty/genz/software/fort77/mvnpack.f) has created a table of (P, υ_P) combinations, based on the component-by-component algorithms of Dick and Kuo (2004) and Nuyens and Cools (2006), that are well suited to multivariate normal integration. Our computations utilize (P, υ_P) values from this table, a subset of which is displayed in Figure S1 of the Supplementary Materials. The QMC procedure is outlined below.
1. Select a (P, υ P ) combination and construct the lattice L P consisting of P points each of dimension D × j.
2. Generate M independent random shifts Δ^(1), ..., Δ^(M), each uniformly distributed on [0, 1)^{D×j}, and form the shifted lattices {k υ_P / P + Δ^(m)}, k = 0, 1, ..., P − 1, where {a} = a mod 1, applied componentwise.
3. Then a QMC estimate of I is given by

I_{P,M} = M^{-1} Σ_{m=1}^{M} P^{-1} Σ_{k=0}^{P−1} f({k υ_P / P + Δ^(m)}),

with a standard error estimated from the spread of the M independent shifted-lattice averages.
4. Further variance reduction is achieved by repeating steps 1 to 3 with additional (P, υ_P) combinations. Suppose we utilize K rank-1 lattices, generated by combinations (P_k, υ_{P_k}) and resulting in the estimates I_{P_k,M}, k = 1, 2, ..., K. Then the final estimate of I, due to LePage (1978), is a weighted average of the I_{P_k,M}, with weights inversely proportional to their estimated variances.
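The randomly shifted rank-1 lattice rule of steps 1 to 3 can be sketched as follows; the prime and generating vector used here are arbitrary illustrative choices, not the tabulated (P, υ_P) values of Genz (2000), and the test integrand is a simple smooth function with known integral.

```python
import numpy as np

def rank1_lattice_estimate(f, P, v, M=8, seed=0):
    """Randomly shifted rank-1 lattice rule: estimate the integral of f over
    [0,1)^dim by averaging the lattice {k v / P mod 1} under M independent
    uniform shifts; the spread of the M shifted averages gives a standard
    error (illustrative P and v, not the tabulated Genz values)."""
    rng = np.random.default_rng(seed)
    k = np.arange(P)[:, None]
    base = k * np.asarray(v, dtype=float)[None, :] / P   # k * v / P, row per point
    means = []
    for _ in range(M):
        shift = rng.random(len(v))        # random shift Delta in [0,1)^dim
        pts = (base + shift) % 1.0        # shifted lattice, mod 1 componentwise
        means.append(f(pts).mean())
    means = np.array(means)
    return means.mean(), means.std(ddof=1) / np.sqrt(M)

# smooth test integrand on [0,1)^2 with known integral 1/4
est, se = rank1_lattice_estimate(lambda x: x[:, 0] * x[:, 1], P=1021, v=[1, 331])
print(est, se)
```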

Accuracy
Standard error estimates of the crude Monte Carlo (MC) and quasi-Monte Carlo (QMC) procedures are displayed in Table 4.2 for various choices of D and J. The MC estimates are based on 50,000 independent samples. The QMC estimates are based on M = 6 and a value of K chosen so that Σ_{k=1}^{K} P_k ≈ 50,000. Thus, for the comparison of the two procedures with respect to accuracy, the number of sampled points is approximately the same. These results ensure that the 99.9% confidence interval for I based on QMC will be no wider than Î ± 0.001, even for a design with five comparisons to a common control and five looks at the accumulating data. Larger problems are unlikely to be encountered in the clinical trials setting.

A Non-Technical Explanation of Quasi-Monte Carlo
Multivariate normal probability computations typically involve semi-definite integration. For one-dimensional problems semi-definite integration is performed by quadrature methods. A multi-dimensional integral can be expressed as repeated one-dimensional integrals by applying Fubini's theorem. But the computational effort of using numerical quadrature grows exponentially with the number of dimensions. To avoid that, one usually considers Monte Carlo methods for multi-dimensional integration.
The Monte Carlo method requires the multivariate integral to be a definite integral. The transformations in Section 4.1 convert the semi-definite multivariate normal integral (4.10) into a definite integral over a hypercube. The regular Monte Carlo method averages the repeated evaluations of the integrand function over a random sequence of grid points. However, as we demonstrated in Table 4.2, using random sequences in this way is not sufficiently efficient for MAMS boundary computations. This led us to the quasi-Monte Carlo integration described in Section 4.2. In quasi-Monte Carlo integration some points are chosen deterministically, using low-discrepancy measures, to make the computations converge faster. These low-discrepancy points are combined in a specific way with random sequences of points so as to make the computation robust. The creation of these sequences of low-discrepancy points utilizes number-theoretic methods that are outside the scope of this article. In particular, we have used the Korobov (2002) sequence of low-discrepancy points and the Mersenne-Twister generator (Matsumoto and Nishimura, 1998) for combining them with random sequences.

Comparison with Alternative Algorithms
The most recent published method for generating stopping boundaries and computing sample size for MAMS designs is by Magirr, Jaki, and Whitehead (2012). An R program implementing this method, maintained by Jaki, is available at https://cran.r-project.org/web/packages/MAMS/index.html. Tables 2 and 3 compare the execution times of the new algorithm described in Section 4, hereafter referred to as NEW, with those of the R implementation of the Magirr, Jaki, and Whitehead (2012) method, hereafter referred to as MJW, for a range of treatment comparisons and looks at the accruing data. All computations were executed on a Lenovo ThinkPad, Model T440P, with an Intel i7 processor and 8 CPU cores. Table 2 displays results for designs with efficacy-only boundaries while Table 3 displays results for designs with both efficacy and futility boundaries.
It is evident from these tables that, for designs with efficacy-only boundaries, the computing times of NEW are linear in both D and J, while they increase as 2^J for designs with both efficacy and futility boundaries. In contrast, the computing times of MJW increase explosively with J. It is insightful to analyze why MJW, unlike NEW, is computationally explosive as the number of stages increases. Both algorithms must compute the probability of the very same event (5.21) for a normally distributed statistic T_ij and suitably standardized efficacy and futility boundaries l_j and u_j. The difference is that MJW uses the Wald statistic δ̂_ij √I_ij for T_ij, whereas NEW uses the score statistic δ̂_ij I_ij, when computing the event (5.21). This initial choice of test statistic dooms the MJW method, for it can no longer utilize the underlying stage-wise independent increments structure of the problem. Instead, the problem gets transformed into the form (5.22), where, for the ith treatment comparison, Φ_j{L_ij, U_ij} denotes the result of integrating the j-dimensional multivariate normal density over a region defined by a vector of lower limits L_ij and a vector of upper limits U_ij. Decomposition of (5.22) into a product of univariate normal integrals, such as we have obtained in (4.17), is clearly impossible; evaluation of (5.22) must proceed by numerical quadrature. Suppose each dimension of the outer integral of (5.22) is evaluated by G-point quadrature, with repeated calls to a function such as pmvnorm in the mvtnorm package (Genz et al., 2016), in which each call evaluates a region {L_ij, U_ij} of a j-dimensional multivariate normal density. This computation must be repeated for each i = 1, 2, ..., D and for j = 1, 2, ..., J. It follows that the MJW algorithm must make on the order of Σ_{j=1}^{J} G^{D×j} calls to pmvnorm to evaluate (5.22). Taking G = 20, as in the current implementation of MJW in R, one can see why the problem breaks down entirely for J > 3, even in the balanced case, where only Σ_{j=1}^{J} G^{j} calls are needed.
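The call-count arithmetic for the balanced case can be tallied directly, with G = 20 quadrature points per dimension as assumed above.

```python
# Tally of quadrature calls in the balanced case, Sum_{j=1}^{J} G^j, with
# G = 20 points per dimension as assumed for the R implementation of MJW.
def mjw_calls_balanced(J, G=20):
    """Number of multivariate normal evaluations needed by a G-point
    quadrature nested over J looks (balanced case)."""
    return sum(G**j for j in range(1, J + 1))

for J in range(1, 5):
    print(J, mjw_calls_balanced(J))  # grows by a factor of ~G per extra look
```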
In contrast, the NEW algorithm, by exploiting the independent increments structure of W_ij, is able to transform the problem into a simple product of univariate integrals of the form (4.17), from which the stopping boundaries, type-1 error, or power can be obtained by a recursive computation that is linear in both D and J when only efficacy boundaries are evaluated, and linear in D but increasing like 2^J when both efficacy and futility boundaries are evaluated.
Prior to MJW, methods to obtain group sequential boundaries for MAMS designs were proposed by Hughes (1993), Follmann, Proschan, and Geller (1994), Stallard and Todd (2003), and Chen, DeMets, and Lan (2010). All these methods utilized the Wald statistic for the computations and so suffer from the same limitation as MJW. Hughes (1993) simply utilizes the boundaries of a two-arm clinical trial, relying on binding futility rules, established via simulation, for dropping non-performing arms in mid-course and thereby preserving the type-1 error conservatively. There is no guarantee that this approach will provide strong control of the type-1 error. Follmann et al. (1994) start out by computing Bonferroni-based stopping boundaries and then adjust them by simulation. This approach is satisfactory for precomputing and tabulating stopping boundaries for specific α values, numbers of arms, and numbers of looks. It may not be as satisfactory when boundaries have to be re-computed via α-spending at interim analyses that do not adhere to the pre-specified design parameters. Stallard and Todd (2003) propose to select the dose with the maximum Wald statistic at the first interim analysis and drop the other doses, so that the remainder of the trial utilizes conventional two-arm boundaries. The option to carry more than one dose forward is not provided. Chen et al. (2010) utilize numerical quadrature when J × D ≤ 6 and recommend simulation when J × D is more than 6.

Comparison with Combination P-value Methods

An alternative method to test multiple treatment arms against a common control arm, with possible treatment selection at one or more interim analysis time points, is to combine p-values from the different stages with pre-specified weights and use a closed test for the final analysis. Since the p-values from the separate stages are independent and uniformly distributed under the null hypothesis (or stochastically larger than uniform in the discrete case), their combination with pre-specified weights is also uniformly distributed, so that valid level-α tests may be constructed. In this approach, p-values are required both to test elementary hypotheses of the form H_j: δ_j = 0 and to test intersection hypotheses of the form H_i ∩ H_j ∩ H_k: δ_i = δ_j = δ_k = 0. The latter p-values are adjusted for multiplicity by methods proposed by, among others, Dunnett, Bonferroni, and Simes. An excellent reference for the combination p-value method is the paper by Posch et al. (2005).
Although the MAMS and combination p-value approaches tackle essentially the same problem, the two approaches are fundamentally different. MAMS, having its roots in Markov processes, exploits the known correlation structure of the sequentially computed score statistic when computing the early stopping boundaries. The p-value combination approach, on the other hand, having its roots in multiple comparisons methodology, exploits closed testing (Marcus et al., 1976) to control the type-1 error. Moreover, since in this approach the independent incremental data from each stage are combined, correlations between sequentially computed cumulative statistics across stages are not exploited. It would thus be of interest to make power comparisons between the two methods. There are actually three definitions of power in a multi-arm setting: global power, disjunctive power, and conjunctive power. Global power is the probability that at least one treatment arm will attain statistical significance. Disjunctive power is the probability that at least one non-null treatment arm will attain statistical significance. Conjunctive power is the probability that all non-null treatment arms will attain statistical significance. Table 4 displays disjunctive and conjunctive power comparisons between the MAMS method and three commonly used combination p-value methods: Bonferroni, Simes, and Dunnett. These power comparisons are for 3, 4, and 5-arm designs with three equally spaced looks, 50 subjects per arm, Lan and DeMets (1983) O'Brien-Fleming (1979) type boundaries for early efficacy stopping, δ/σ = 0.5 for each treatment arm relative to the control arm, and a futility rule that drops any treatment arm whose estimated δ̂ < 0 at an interim look. All table entries are based on 100,000 simulations. The MAMS designs dominate all the combination p-value designs. Bonferroni is less powerful than Simes, which in turn is less powerful than Dunnett.
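The three power definitions can be computed mechanically from simulation output; given a matrix of per-arm rejection indicators from any simulation, the helper below (an illustrative utility, not tied to a particular design) returns all three.

```python
import numpy as np

def power_summaries(reject, nonnull):
    """Given an (n_sims x D) boolean array of per-arm rejections and a
    boolean mask marking the truly non-null arms, return the global,
    disjunctive, and conjunctive power estimates defined in the text."""
    reject = np.asarray(reject, dtype=bool)
    nn = reject[:, np.asarray(nonnull, dtype=bool)]
    return {
        "global": reject.any(axis=1).mean(),   # reject at least one arm
        "disjunctive": nn.any(axis=1).mean(),  # reject >= 1 non-null arm
        "conjunctive": nn.all(axis=1).mean(),  # reject every non-null arm
    }

print(power_summaries([[True, False], [False, False], [True, True]], [True, True]))
```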
Furthermore, as shown in Section 2.3, the MAMS approach can provide repeated confidence intervals that guarantee coverage of each δ at each look, albeit conservatively. Confidence intervals for treatment effects are not yet available by p-value combination methods.

The INHANCE Clinical Trial of COPD
Indacaterol to Help Achieve New COPD Treatment Excellence (INHANCE) was a randomized clinical trial for the treatment of chronic obstructive pulmonary disease in which four doses (75 mg, 150 mg, 300 mg, 500 mg) of inhaled indacaterol, a once-daily long-acting β2-agonist bronchodilator, were compared to placebo (Donohue et al., 2010). The primary efficacy objective was to show the superiority of at least one dose over placebo at week 12 with respect to 24-hour post-dose (trough) forced expiratory volume in 1 second (FEV1). The improvement in FEV1 for indacaterol versus placebo, denoted by δ, was expected to be between 0.14 and 0.18 liters, and the between-subject variability was assumed to be σ = 0.5. Although the actual trial had only two stages and utilized closed testing for preserving the type-1 error (Lawrence et al., 2014), we will use this setting to illustrate a MAMS design comprising three pairwise comparisons (150 mg, 300 mg, 500 mg) to placebo over up to four equally spaced looks at the accumulating data. The design will utilize the Lan and DeMets (1983) error spending function to generate O'Brien-Fleming (1979) type early stopping efficacy boundaries (LD(OF) boundaries). Table 5 displays these boundaries on the Wald scale and contrasts them with the corresponding boundaries for a conventional four-look group sequential design (GSD) with only one treatment arm versus placebo.
The 4-arm boundaries are stricter than the 2-arm ones because, with three comparisons to placebo, there are three times as many chances for declaring efficacy under the global null hypothesis. Suppose we wish the COPD trial to have 90% global power under the alternative hypothesis H_1: δ_1 = 0.14, δ_2 = 0.16, δ_3 = 0.18. We will adopt the LD(OF) boundaries for early efficacy stopping with type-1 error = 0.025 to reject the global null hypothesis H_0: δ_1 = δ_2 = δ_3 = 0. These efficacy boundaries will preserve the type-1 error conservatively if treatment arms are dropped in midstream for any reason, including excess toxicity. It is nevertheless desirable to incorporate formal futility boundaries into the design so as to have objective criteria for dropping non-performing arms at one or more interim analysis time points. Table 6 displays the maximum sample size, the expected sample size under H_1, and the expected sample size under H_0 for designs with between 1 and 4 looks, LD(OF) efficacy boundaries, and non-binding LD(OF) futility boundaries.
There are large savings in expected sample size, with diminishing returns as the number of looks increases. The efficacy and futility boundaries for the 4-look design are displayed on the Wald scale in Figure S2 of the Supplementary Materials.
Table 6. Maximum and expected sample sizes for the COPD designs with LD(OF) efficacy and non-binding LD(OF) futility boundaries:

Looks  Maximum sample size  Expected sample size under H_1  Expected sample size under H_0
1      624                  624                             624
2      644                  558                             446
3      668                  522                             420
4      684                  502                             401

The results in Table 6 have not taken into consideration the possibility of recomputing the efficacy boundary whenever a treatment arm is dropped. We now show that substantial additional efficiency gains can be obtained by boundary recomputation. Boundary re-computation when treatment arms are dropped is not a trivial problem, however; if done incorrectly, the type-1 error will be inflated. We utilize a generalization of the conditional error rate principle, proposed by Müller and Schäfer (2001) for adaptive designs, to recompute the efficacy boundary without inflating the type-1 error. For expository purposes, we will explain the principle in the context of a three-look, four-arm design from which a single arm is dropped at look 1. Let (W_1j, W_2j, W_3j) be the score statistics for the three treatment comparisons and let b_j be the corresponding efficacy boundary at look j, j = 1, 2, 3. Suppose that at look 1, with (W_11, W_21, W_31) = (w_11, w_21, w_31), treatment 1 is dropped. Denote the recomputed boundaries for looks 2 and 3 by b*_2 and b*_3, respectively. Define the conditional error rate of the original design as

ε(w_11, w_21, w_31) = P_0{max(W_12, W_22, W_32) ≥ b_2 or max(W_13, W_23, W_33) ≥ b_3 | (w_11, w_21, w_31)}

and the conditional error rate of the design with the modified boundaries as

ε*(w_21, w_31) = P_0{max(W_22, W_32) ≥ b*_2 or max(W_23, W_33) ≥ b*_3 | (w_21, w_31)}.

The Müller and Schäfer conditional error rate principle states that the type-1 error will be preserved if the procedure used to compute the modified boundaries (b*_2, b*_3) always satisfies the requirement

ε*(w_21, w_31) ≤ ε(w_11, w_21, w_31). (7.23)

Additional constraints similar to (7.23) are imposed for the other intersection hypotheses.
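A conditional error rate of the form ε(·) above can be estimated by straightforward simulation of the remaining increments. The sketch below works on a standardized scale and, for simplicity, simulates the increments independently across arms, ignoring the correlation induced by the shared control arm; it illustrates the principle rather than the exact computation, and the numeric inputs are hypothetical.

```python
import numpy as np

def conditional_error(w1, b, t, n_sims=100_000, seed=0):
    """Monte Carlo estimate of a conditional error rate for a 3-look design:
    P0{max_i W_i2 >= b[1] or max_i W_i3 >= b[2] | W_1 = w1}, on a
    standardized scale with cumulative information times t = (t1, t2, t3).
    Increments are simulated independently across arms (an approximation
    that ignores the shared-control correlation)."""
    rng = np.random.default_rng(seed)
    w1 = np.asarray(w1, dtype=float)
    inc2 = rng.normal(0.0, np.sqrt(t[1] - t[0]), (n_sims, len(w1)))
    inc3 = rng.normal(0.0, np.sqrt(t[2] - t[1]), (n_sims, len(w1)))
    W2 = w1 + inc2                 # independent increments: W_2 = W_1 + inc
    W3 = W2 + inc3
    crossed = (W2.max(axis=1) >= b[1]) | (W3.max(axis=1) >= b[2])
    return crossed.mean()

# hypothetical interim values and boundaries on the standardized scale
print(conditional_error([0.5, 1.2, 0.8], b=[1.8, 1.6, 1.4], t=(1/3, 2/3, 1.0)))
```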
To study the impact of recomputing the efficacy boundaries, we ran 100,000 simulations with and without boundary recomputation for two and three-look COPD designs with δ 1 = 0.14, δ 2 = 0.16, δ 3 = 0.18, and LD(OF) efficacy boundaries. At the first interim analysis the treatment arm with the greatest standardized treatment effect was selected and the other two treatment arms were dropped. The gains in power due to recomputing the boundaries are displayed in Table 7.
The numbers in column 6 indicate the additional patients one would have to enroll to achieve the same gain in power without boundary recomputation. For example, to boost the power from 90% to 94.4% in a 3-look design without boundary recomputation would require 160 additional patients, a 20% increase in sample size.

Discussion
We have shown how one may create MAMS designs efficiently, illustrated their application to trials involving multiple doses, and found them to have greater power than competing designs based on combining p-values from independent stages. On the face of it the computational problem appears intractable. If one attempts to solve it by numerical quadrature, as was attempted by Magirr, Jaki, and Whitehead (2012) (MJW), the complexity increases exponentially with the number of looks and the computation breaks down entirely for designs with more than 3 looks at the accumulating data. Our major contribution was to show that if the problem is viewed in the proper framework it can be transformed such that the computational complexity is linear in both number of looks and number of arms for designs with efficacy-only boundaries, and doubles with successive stages for designs with both efficacy and futility boundaries. Thus, for example, designs with six arms (including the control arm) and five stages can be created in under two minutes by the NEW algorithm. This type of performance frees the trial designer to experiment with different design options, including number of looks, types of stopping boundaries, and sample size, under alternative scenarios for the treatment effects. This is a crucially important consideration for optimizing trial design. If one had to wait several hours or days between scenarios, it is unlikely that one would consider more than one or two design options, and one might miss out on the best possible design for the situation under consideration.
The full potential of our algorithm has not yet been realized. We have seen from Table 2 that, once the design has more than one interim analysis, the computational effort of the MJW algorithm can increase by two orders of magnitude. For example, a 3-look design with efficacy-only boundaries involving three comparisons to a common control takes 1 second with the NEW algorithm but 148 seconds with the MJW algorithm. While 148 seconds might not be a limitation for a one-time computation of early-stopping boundaries, it can be an impassable barrier for evaluating the operating characteristics of more sophisticated MAMS designs. Increasingly, there is interest in efficient adaptive designs in which doses are dropped, the sample size is reestimated, and the stopping boundaries are recomputed at interim looks (see Magirr, Stallard, and Jaki, 2014; Gao, Liu, and Mehta, 2014). In such designs the type-1 error is preserved by matching the conditional error rates of the original and adapted designs, as was done for the INHANCE trial in Section 7. The operating characteristics of such designs can only be evaluated by simulation experiments in which the early stopping boundaries are repeatedly recomputed on the fly. The evaluation of even a single scenario involving 10,000 simulations (the bare minimum for a realistic evaluation of operating characteristics in an actual clinical trial) would become impractical for a design with three or more looks if it required 148 seconds per simulation. With the NEW algorithm, however, it was computationally feasible to run as many as 100,000 simulations per design and demonstrate the gains in power displayed in Table 7.
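The feasibility argument above reduces to simple arithmetic. The per-recomputation timings (1 second for NEW, 148 seconds for MJW) are taken from the text; the conversion to total simulation time is our own back-of-the-envelope calculation.

```python
# Total runtime for a simulation study in which every simulated trial
# recomputes the stopping boundary once. Timings per recomputation are
# those quoted in the text for a 3-look, 3-comparison design.

def total_hours(seconds_per_sim: float, n_sims: int) -> float:
    """Total wall-clock hours for n_sims simulations."""
    return seconds_per_sim * n_sims / 3600.0

print(total_hours(148.0, 10_000))  # MJW: roughly 411 hours (over two weeks)
print(total_hours(1.0, 10_000))    # NEW: under 3 hours
```

At 100,000 simulations per design, the gap widens to roughly 171 days versus 28 hours, which is why on-the-fly recomputation is practical only with the faster algorithm.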
Recently, there has been much discussion about so-called "platform trials," in which several promising treatment regimens from different companies are tested on a common platform. In such trials, multiple arms are monitored in group sequential fashion; losing arms are dropped and replaced by new arms. Strong control of type-1 error is a regulatory requirement of such studies. The STAMPEDE trial (Sydes et al., 2009) is one such trial. We are currently working on various examples of adaptive designs that will exploit the full power of our algorithm and will present these results in subsequent papers.
We end on a note of caution. As stated in Section 2.2, the early stopping boundaries are based on the asymptotic normality of the score statistic. We have found, by simulating t-statistics 100,000 times with 200 subjects/arm, that the actual type-1 error of a design with nominal 1-sided α = 0.025 is 0.026, regardless of the number of stages. With 400 subjects/arm, however, the actual type-1 error is preserved at 0.025 or less. This suggests that at the design stage of a confirmatory clinical trial one must verify through extensive simulation what nominal type-1 error will guarantee that the 1-sided actual type-1 error remains below 0.025. A systematic study of the rate of convergence to normality of the various test statistics commonly used for normal, binomial, and time-to-event data would be desirable, and is made possible by the ready availability of the relevant stopping boundaries through the present work.
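The kind of simulation check described above can be sketched in miniature. The following is our own simplified stand-in, not the authors' MAMS simulation: a single-look, single-comparison, one-sided two-sample t-test under the null, with skewed (exponential) data chosen to stress the normal approximation. The sample size, number of simulations, and data-generating distribution are all illustrative assumptions.

```python
# Estimate the actual one-sided type-1 error of a two-sample t-test at
# nominal alpha = 0.025 when the data are skewed and the sample size is
# moderate. A simplified, single-stage stand-in for the full MAMS check.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_arm, n_sims, alpha = 200, 20_000, 0.025

# Simulate all trials at once under the null (no treatment effect).
control = rng.exponential(1.0, size=(n_sims, n_per_arm))
treated = rng.exponential(1.0, size=(n_sims, n_per_arm))

# Pooled-variance two-sample t-statistic, one trial per row.
diff = treated.mean(axis=1) - control.mean(axis=1)
sp2 = (treated.var(axis=1, ddof=1) + control.var(axis=1, ddof=1)) / 2
t = diff / np.sqrt(sp2 * 2 / n_per_arm)

# Reject when t exceeds the upper-tail critical value.
crit = stats.t.ppf(1 - alpha, df=2 * n_per_arm - 2)
actual = (t > crit).mean()
print(actual)  # empirical type-1 error; compare with the nominal 0.025
```

In a confirmatory setting this check would be repeated with the actual group sequential test statistic, boundary, and endpoint distribution, across a range of per-arm sample sizes, as the text recommends.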

Supplementary Materials
Appendices 1, 2 and Figures S1, S2, referenced in Sections 2.1, 4.1, 4.2, and 7, respectively, are available with this article at the Biometrics website on Wiley Online Library.
Software for designing MAMS trials is implemented in the East® software package and is freely available to instructors and students for course work by contacting support@cytel.com.