Maximum type 1 error rate inflation in multiarmed clinical trials with adaptive interim sample size modifications

Sample size modifications in the interim analyses of an adaptive design can inflate the type 1 error rate if test statistics and critical boundaries are used in the final analysis as if no modification had been made. While this is already true for designs with an overall change of the sample size in a balanced treatment-control comparison, the inflation can be much larger if, in addition, a modification of allocation ratios is allowed. In this paper, we investigate adaptive designs with several treatment arms compared to a single common control group. Regarding modifications, we consider treatment arm selection as well as modifications of overall sample size and allocation ratios. The inflation is quantified for two approaches: a naive procedure that ignores not only all modifications, but also the multiplicity issue arising from the many-to-one comparison, and a Dunnett procedure that ignores modifications, but adjusts for the initially started multiple treatments. The maximum inflation of the type 1 error rate for such designs can be calculated by searching for the "worst case" scenarios, that is, sample size adaptation rules in the interim analysis that lead to the largest conditional type 1 error rate at any point of the sample space. To show the most extreme inflation, we initially assume unconstrained second stage sample size modifications, leading to a large inflation of the type 1 error rate. Furthermore, we investigate the inflation when putting constraints on the second stage sample sizes. It turns out that, for example, fixing the sample size of the control group leads to designs controlling the type 1 error rate.


Introduction
In the last decade, adaptivity in clinical trials with design modifications such as sample size reassessment or treatment selection at an interim analysis has gained increasing attention. One may argue that there have always been modifications when performing clinical trials, for example simply covered by amendments to the study protocols. However, it has been shown that if, after design modifications, the critical boundaries and test statistics for the corresponding fixed sample size design are used, then the type 1 error rate is inflated. For the comparison of the means of a normally distributed outcome with known variance between a single treatment and a control in parallel groups and balanced sample sizes, that is, equal sample sizes in the treatment and control group, Proschan and Hunsberger (1995) derived the maximum possible type 1 error rate inflation. They assumed that the experimenter, for any interim outcome, would choose the second stage sample sizes in such a way that the conditional type 1 error rate is maximized ("worst case scenario"). This strategy will also maximize the overall type 1 error rate. They showed that the type 1 error rate can be inflated from 0.05 to 0.11. Graf and Bauer (2011) extended these worst case arguments to the case of unbalanced sample size reassessment, showing that the maximum type 1 error rate increases to 0.19 when the allocation ratio is allowed to change at interim. However, in this unbalanced case the maximum of the conditional type 1 error rate can only occur if the experimenter knows the value of the nuisance parameter, the common mean under the null hypothesis. This may at least approximately apply for the control treatment if a large amount of data from previous experiments is available.
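For the single-treatment case, the worst-case calculation of Proschan and Hunsberger (1995) can be sketched numerically. The following Python snippet is our own illustration (not the authors' code): for an interim z-statistic t between 0 and the critical boundary c, the optimal second-to-first-stage ratio is r = c²/t² − 1, giving the closed-form worst-case conditional error 1 − Φ(√(c² − t²)); for t < 0 the supremum α is approached as r → ∞, and for t > c early rejection gives 1. Integrating over the interim distribution reproduces an overall maximum near 0.11 for a one-sided test at α = 0.05:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

alpha = 0.05
c = norm.ppf(1 - alpha)  # critical boundary of the fixed-sample one-sided test

def max_ce(t):
    # Worst-case conditional type 1 error rate for interim z-statistic t,
    # 0 < t < c: the maximizing ratio is r = c^2/t^2 - 1, which yields
    # CE = 1 - Phi(sqrt(c^2 - t^2)).
    return 1 - norm.cdf(np.sqrt(c ** 2 - t ** 2))

# t < 0: sup over r of CE equals alpha (approached as r -> infinity)
# t > c: CE = 1 (early rejection, r = 0)
middle, _ = quad(lambda t: max_ce(t) * norm.pdf(t), 0, c)
e_max = norm.cdf(0) * alpha + (1 - norm.cdf(c)) + middle
print(round(e_max, 3))  # roughly 0.115
```

The three summands correspond to the three regions of the interim sample space; the result agrees with the inflation from 0.05 to about 0.11 quoted above.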
Many methods for type 1 error control in adaptive designs are available for testing a single hypothesis (Bauer, 1989; Bauer and Koehne, 1994; Proschan and Hunsberger, 1995; Lehmacher and Wassmer, 1999; Mueller and Schaefer, 2001; Brannath et al., 2002; Mueller and Schaefer, 2004; Gao et al., 2013) and have been applied in clinical trials. Multiarmed selection designs have been proposed and have been extended to allow for adaptive design modifications (Bauer and Kieser, 1999; Koenig et al., 2008; Bretz et al., 2009; Bebu et al., 2013; Sugitani et al., 2013). With the rise of adaptive methods in clinical trials, the main emphasis has been on strict control of the type 1 error rate to maintain their strictly confirmatory nature (EMA, 2007; FDA, 2010; Wang et al., 2013).
However, there are complaints that the adaptive machinery has become too complicated, with "tests that resort to nonstandard adjustments and weightings appear mysterious to all but the specialist in adaptive design" (Mehta and Pocock, 2012). From an operational perspective, adaptations put a burden on data analysts, who have to clean data for interim decision making, and on drug supply managers, who have to deal with the possibility that doses may be added to or removed from the trial. Uncertainty at the planning stage about the total funds needed for the trial can also be a concern. From a statistical perspective, it has been argued by some experts that adaptive designs offer little advantage over more conventional group-sequential designs (Tsiatis and Mehta, 2003; Jennison and Turnbull, 2006; Levin et al., 2013) and that they use test statistics that may violate desirable principles like sufficiency (Burman and Sonesson, 2006). However, these criticisms of adaptive designs are not uncontroversial themselves (Brannath et al., 2006). In any case, such additional burden may prevent experimenters from using adaptive design methodology and lead them either to ignore the issue or to use seemingly simple adjustments like Bonferroni or Dunnett corrections. It is therefore desirable to investigate the maximum type 1 error inflation arising from such strategies. Regarding specific clinical trials, the precise quantification of the inflation can also be a guide to decide whether the implementation of the adaptive test machinery is really necessary, or whether a simpler adjustment might suffice, possibly after additional restrictions of the interim decision options, like upper and lower limits on the allowed sample size modifications.
In this work, we investigate the maximum type 1 error rate when k test treatments are compared to a single common control and when treatment selection is allowed at interim either with or without flexible sample size reassessment. Designs of multiarmed clinical trials with interim treatment selection have attracted a lot of research in the last decade (Zeymer et al., 2001;Gaydos et al., 2009;Barnes et al., 2010). Nevertheless, the number of conducted or started trials seems to be rather limited (Elsaesser et al., 2014;Morgen et al., 2014).
In Section 2, we give a motivating example of a clinical trial where the experimenters decided to use the conservative Bonferroni procedure instead of an adaptive approach. In Section 3, we introduce the hypothesis tests and the type of interim adaptations investigated to calculate the maximum type 1 error rate. In Section 4, we consider the situation when the treatment with the largest observed interim effect is always selected for the second stage. Furthermore, we investigate the maximum type 1 error rate when second stage sample sizes are restricted to range within a prefixed interval. In Section 5, we mainly focus on the case of k = 2 treatment arms, always proceeding with both treatments and the control to the second stage. In Section 6, we discuss our findings in the context of the motivating example and give some practical considerations. This is followed by concluding remarks in Section 7.

© 2014 The Author. Biometrical Journal published by WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim. www.biometrical-journal.com

Motivating example

Barnes et al. (2010) give a recent case study for a two-stage clinical trial on the drug indacaterol to treat chronic obstructive pulmonary disease (COPD). This study comprised a first stage for dose-finding with dose selection after 14 days of treatment, and a second stage evaluating efficacy and safety during 26 weeks of treatment. The dose-finding stage included seven randomized treatment arms: four doses of the study drug, placebo, and two further treatment groups with active comparators. At an interim analysis after the first stage, the indacaterol doses were selected using preset efficacy and safety criteria (Lawrence et al., 2014). A multiplicity correction using a Bonferroni adjustment with α/4 was applied, despite the fact that in the final efficacy analysis only the two selected indacaterol doses were to be compared individually against placebo, based on the pooled data of both stages with prefixed sample sizes.
This approach controls the type 1 error rate if the sample size, as in the given example, is prefixed. However, due to the overcorrection, this approach is conservative. The authors themselves acknowledge that the approach "is statistically somewhat conservative, but it has the merit of simplicity". The question arises whether for such a design sample size reassessment strategies would have been possible without inflating the type 1 error rate.

Trial design
In the following, we assume that a clinical trial is designed with k treatment arms and one control arm, and that a two-stage design is to be applied. In the first stage, the observed outcomes x^(1)_{j,i} of patients j = 1, …, n^(1)_i, randomized to one of the k + 1 groups, that is, to the control, denoted by index i = 0, or to one of the treatment groups, i = 1, …, k, are investigated. The outcome is assumed to be normally distributed with common known variance, X_{j,i} ∼ N(μ_i, σ²). Without loss of generality we set σ = 1. Having obtained at the end of the first stage n^(1)_0 observations in the control group and n^(1)_i = a_i n^(1)_0, i = 1, …, k, observations in the treatment groups, the sample means x̄^(1)_i, i = 0, …, k, are calculated. The a_i > 0 denote the prefixed first-stage allocation ratios between treatment group i and control. Based on the interim sample means, the experimenter may set the second stage sample sizes to n^(2)_i = r_i n^(1)_i = r_i a_i n^(1)_0 in the treatment groups and to n^(2)_0 = r_0 n^(1)_0 in the control group, with second-to-first-stage ratios 0 ≤ r_i ≤ ∞, i = 0, …, k.
In the final analysis, after the second stage, we test the hypotheses using the standardized mean differences T_i, pooling the data of both stages and comparing them to the critical boundary c_{1−α} as used for the fixed sample size design. This means that adaptivity is not accounted for, neither in the test statistics nor in the critical boundary. The test statistics are defined as

T_i = (x̄_i − x̄_0) / √(1/n_i + 1/n_0),  i = 1, …, k,

where n_i = n^(1)_i + n^(2)_i and x̄_i = (n^(1)_i x̄^(1)_i + n^(2)_i x̄^(2)_i) / n_i denote the total sample sizes and pooled sample means, with x̄^(2)_i, i = 0, …, k, denoting the second stage sample means. We obtain the worst case scenarios for each possible interim outcome by searching for the second-to-first-stage ratios r̂_i maximizing the conditional type 1 error rate. Generalizing the formula in Koenig et al. (2008), the conditional type 1 error rate is

CE_α = P( max_{i=1,…,k} T_i > c_{1−α} | Z^(1)_0, …, Z^(1)_k ),   (1)

where α is the preplanned level for the type 1 error rate, c_{1−α} is the critical boundary of the preplanned test, and the Z^(j)_i = √(n^(j)_i) (x̄^(j)_i − μ), i = 0, …, k, j = 1, 2, are defined as the standardized differences between the sample mean and the common true mean μ under the global null hypothesis of stage j = 1 (at interim) and j = 2, respectively (without loss of generality μ = 0). The cumulative distribution function and density of the standard normal distribution are denoted by Φ and φ, respectively. Note that the Z^(j)_i follow independent standard normal distributions. When second stage sample sizes are not constrained, the maximum type 1 error rate is given by

E*_α = E[ ĈE_α ].   (2)

Whereas CE_α is a function of r_0, …, r_k, ĈE_α is a function of r̂_0, …, r̂_k, the second-to-first-stage ratios leading to the maximum CE_α. The r̂_i are determined for a given interim outcome (Z^(1)_0, …, Z^(1)_k) and are therefore a function of the Z^(1)_i. Thus, ĈE_α does not depend on r_0, …, r_k. In the following, we use a quasi-Newton method provided by the R function optim for numerical optimization; for numerical integration we use the R function integrate (R Development Core Team, 2012). R programs to calculate the maximum type 1 error rate are available as Supplementary Information.
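Conditional on the first-stage outcomes, the event T_i > c_{1−α} depends only on the second-stage increments, which are jointly normal with a correlation induced by the shared control group; the conditional type 1 error rate is therefore a multivariate normal probability. The following Python sketch illustrates this evaluation (the paper itself provides R programs as Supplementary Information; the function name and parametrization here are ours):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def conditional_error(z1, r, a, n1_0, c):
    """Conditional type 1 error rate given first-stage standardized outcomes
    z1 = (Z1_0, ..., Z1_k), second-to-first-stage ratios r = (r_0, ..., r_k),
    first-stage allocation ratios a = (a_1, ..., a_k), first-stage control
    sample size n1_0, and critical boundary c."""
    z1, r = np.asarray(z1, float), np.asarray(r, float)
    k = len(z1) - 1
    n1 = n1_0 * np.concatenate(([1.0], np.asarray(a, float)))  # first-stage sizes
    n2 = r * n1                                                # second-stage sizes
    n = n1 + n2                                                # total sizes
    # Conditional on stage 1, T_i > c is equivalent to W_i > b_i, where
    # W_i = sqrt(n2_i) Z2_i / n_i - sqrt(n2_0) Z2_0 / n_0, the Z2's iid N(0,1).
    var = n2[1:] / n[1:] ** 2 + n2[0] / n[0] ** 2
    b = (c * np.sqrt(1.0 / n[1:] + 1.0 / n[0])
         - np.sqrt(n1[1:]) * z1[1:] / n[1:]
         + np.sqrt(n1[0]) * z1[0] / n[0]) / np.sqrt(var)
    # Correlation between the W_i comes from the shared control increment.
    cov = np.full((k, k), n2[0] / n[0] ** 2) + np.diag(n2[1:] / n[1:] ** 2)
    corr = cov / np.sqrt(np.outer(var, var))
    if k == 1:
        return 1.0 - norm.cdf(b[0])
    return 1.0 - multivariate_normal.cdf(b, mean=np.zeros(k), cov=corr)

# k = 1 consistency check against the univariate closed form, and a k = 2 call:
ce1 = conditional_error([0, 0], [1, 1], [1], 10, 1.96)
ce2 = conditional_error([0, 0, 0], [1, 1, 1], [1, 1], 10, 1.96)
```

For k = 2 with null interim data, ce2 lies between ce1 and 2·ce1, as expected for a union of two positively correlated rejection events.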
Remark 3.1. The critical boundary c_{1−α} of the preplanned test may be defined in different ways: (i) as the (1 − α)-quantile of the standard normal distribution, z_{1−α}, if no correction at all for multiplicity is applied, or (ii) as a Dunnett critical boundary (Dunnett, 1955) based on the preplanned first-stage allocation ratios a_i, i = 1, …, k, to adjust for multiplicity due to the treatment-control comparisons. Even strategy (ii) may not guarantee type 1 error control if additional sample size reassessment is performed at interim. Moreover, in case of sample size reassessment (and/or treatment selection) the Dunnett critical boundary would not be fixed a priori when calculated for the actual sample sizes in the final analysis. For simplicity, we will apply the prefixed Dunnett boundary, d_{1−α}, based on the preplanned first-stage allocation ratios a_i between treatment and control in the following. Remarks 4.1 and 4.2 discuss how results change if instead critical boundaries are based on actual (reassessed) total sample sizes in the final analysis.
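For reference, the prefixed one-sided Dunnett boundary for balanced allocation (pairwise correlation 1/2 between the treatment-control test statistics) solves Φ_{0,R}(d, …, d) = 1 − α. A Python sketch of this root-finding (the function name is ours):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import multivariate_normal, norm

def dunnett_boundary(k, alpha, rho=0.5):
    """One-sided Dunnett critical boundary d_{1-alpha} for k treatment-control
    comparisons; rho = 1/2 corresponds to balanced first-stage allocation."""
    corr = np.full((k, k), rho) + (1 - rho) * np.eye(k)
    f = lambda d: multivariate_normal.cdf(
        np.full(k, d), mean=np.zeros(k), cov=corr) - (1 - alpha)
    # The boundary is bracketed by the unadjusted and Bonferroni quantiles.
    return brentq(f, norm.ppf(1 - alpha), norm.ppf(1 - alpha / k))

d = dunnett_boundary(2, 0.025)
print(round(d, 3))
```

As the bracketing interval indicates, d_{1−α} always lies strictly between the unadjusted quantile z_{1−α} and the Bonferroni-adjusted quantile z_{1−α/k}.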

Remark 3.2.
For k ≥ 2 we calculate the maximum type 1 error rate under the global null hypothesis μ_i = μ_0, i = 1, …, k. A proof that the maximum type 1 error rate is attained under the global null hypothesis is given in Appendix A2.
Remark 3.3. For k = 1, Graf and Bauer (2011) showed, by numerical evaluation, that the maximum type 1 error rate in the case of balanced first stage sample sizes before the interim analysis (a_i = 1, i = 1, …, k) is an upper bound. For k ≥ 1 we will likewise set a_i = 1, since it is the most common scenario applied in practice. Note that for many-to-one comparisons, the scenario with a_i = 1/√k leads to the smallest required total sample size for a given power and significance level. Therefore we will also give some numerical results for this allocation ratio.

Selection of the most promising treatment at interim
We first consider that in the interim analysis the treatment group m with the largest observed interim effect is selected, that is, Z^(1)_m = max_{i=1,…,k} Z^(1)_i, and only this treatment and the control are carried on to the second stage. The second-to-first-stage ratios r_i, i ∈ {0, m}, may be set based on the interim results, 0 ≤ r_0, r_m ≤ ∞. In the final analysis, only the selected treatment group m is compared to the control group (using data of both stages). The corresponding null hypothesis H_0m is rejected if the final test statistic T_m exceeds the critical value c_{1−α}. Note that the maximum type 1 error rate for the case of always selecting the best treatment is an upper bound for the maximum type 1 error rate when in a particular trial another single treatment is selected, for example the treatment with the second largest observed effect at interim because of potential safety issues for the most effective treatment. Clearly, under the global null hypothesis and for balanced first stage sample sizes over the k treatments, selecting a treatment with an interim effect smaller than the largest observed interim effect will reduce the maximum type 1 error rate. Following the lines of Graf and Bauer (2011), the conditional type 1 error rate (1) for this scenario simplifies to

CE_α = 1 − Φ( [ c_{1−α} √( 1/(a_m(1+r_m)) + 1/(1+r_0) ) − Z^(1)_m / (√(a_m)(1+r_m)) + Z^(1)_0 / (1+r_0) ] / √( r_m/(a_m(1+r_m)²) + r_0/(1+r_0)² ) ).   (3)

Note that if the treatment with the largest observed interim effect is selected, m is random and therefore a_m is a random variable as well. In the following we set a_1 = … = a_k, so that a_m is no longer a random variable and the maximum type 1 error rate can be evaluated by

E*_α = ∫∫ ĈE_α(z_0, z_m) φ(z_0) k Φ(z_m)^{k−1} φ(z_m) dz_m dz_0,   (4)

where

ĈE_α(z_0, z_m) = max_{0 ≤ r_0, r_m ≤ ∞} CE_α.   (5)
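The inner maximization over (r_0, r_m) can be sketched numerically. The following Python code is our own illustration (the conditional error expression follows the lines of Graf and Bauer, 2011, for a prefixed allocation ratio a); it caps the ratios at a large finite value instead of infinity and uses several starting points to reduce the risk of local optima:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def ce(r, z0, zm, a=1.0, c=1.96):
    """Conditional type 1 error rate of the final pooled treatment-control test,
    given interim outcomes z0 (control) and zm (selected arm) and
    second-to-first-stage ratios r = (r0, rm); a is the allocation ratio."""
    r0, rm = r
    num = (c * np.sqrt(1 / (a * (1 + rm)) + 1 / (1 + r0))
           - zm / (np.sqrt(a) * (1 + rm)) + z0 / (1 + r0))
    den = np.sqrt(rm / (a * (1 + rm) ** 2) + r0 / (1 + r0) ** 2)
    return 1 - norm.cdf(num / den)

def worst_case_ce(z0, zm, a=1.0, c=1.96, r_max=50.0):
    """Search the ratios (r0, rm) maximizing CE, with ratios capped at r_max
    as a stand-in for the unconstrained supremum."""
    best = 0.0
    for start in [(0.5, 0.5), (1, 1), (5, 5), (0.5, 5), (5, 0.5)]:
        res = minimize(lambda r: -ce(r, z0, zm, a, c), start,
                       bounds=[(1e-6, r_max)] * 2, method="L-BFGS-B")
        best = max(best, -res.fun)
    return best

w = worst_case_ce(0.0, 1.0)
baseline = ce((1.0, 1.0), 0.0, 1.0)
```

By construction the worst-case value can never fall below the conditional error of the preplanned continuation r_0 = r_m = 1.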

Equal second-to-first-stage-ratios
Let r_0 = r_m = r with 0 ≤ r ≤ ∞, that is, only allowing for equal second-to-first-stage ratios, and let furthermore a_1 = … = a_k = 1, indicating balanced first stage sample sizes for the treatment and control groups. After the second stage, the selected treatment group is compared to the control group (using data of both stages), applying the critical value c_{1−α} of the preplanned test. Note that for this scenario the final test is balanced between both groups. In a slight modification of Proschan and Hunsberger (1995), the conditional type 1 error rate (3) of the final treatment-control comparison for r_0 = r_m = r and a_1 = … = a_k = 1 reduces to

CE_α = 1 − Φ( (c_{1−α} √(1+r) − T^(1)_m) / √r ),

where for notational convenience the first stage test statistic T^(1)_m = (x̄^(1)_m − x̄^(1)_0) √(n^(1)_0 / 2) = (Z^(1)_m − Z^(1)_0)/√2 is used. The conditional type 1 error rate does not depend on the unknown nuisance parameter μ.
Calculation of E * α in this balanced case follows the lines of Proschan and Hunsberger (1995). The essential difference is that the density of the maximum of k independent standard normal distributions has to be used in the integration. The subspaces of the interim sample space to perform separate optimizations remain the same (see Appendix A3).
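The integration just described can be sketched as follows. This Python snippet is ours, mirroring the structure of the paper's R approach: the worst-case conditional error over equal ratios is combined with the density of the maximum of k independent standard normals; the naive boundary z_{1−α} for k = 2 and the integration limits are illustrative choices:

```python
import numpy as np
from scipy.integrate import dblquad
from scipy.stats import norm

def worst_ce_equal(t, c):
    # Worst-case CE over equal ratios r as a function of the selected arm's
    # interim statistic t: r_hat = c^2/t^2 - 1 for 0 < t < c.
    if t <= 0:
        return 1 - norm.cdf(c)   # supremum, approached as r -> infinity
    if t >= c:
        return 1.0               # early rejection, r = 0
    return 1 - norm.cdf(np.sqrt(c ** 2 - t ** 2))

def e_star(k, c):
    """Maximum type 1 error rate when the best of k arms is selected at interim
    and only equal second-to-first-stage ratios are allowed."""
    def integrand(zm, z0):
        t = (zm - z0) / np.sqrt(2.0)  # first-stage statistic of the selected arm
        return (worst_ce_equal(t, c) * norm.pdf(z0)
                * k * norm.pdf(zm) * norm.cdf(zm) ** (k - 1))
    val, _ = dblquad(integrand, -8, 8, lambda z0: -8, lambda z0: 8)
    return val

val = e_star(2, norm.ppf(0.975))
print(val)  # exceeds the nominal 0.025
```

The factor k φ(z_m) Φ(z_m)^{k−1} is the density of the maximum of k independent standard normal variables, as required for the selected arm.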
The black lines in Fig. 1 show that if no correction for multiplicity is done (Fig. 1A), the type 1 error is highly inflated and increases with k. Using Dunnett boundaries for the k treatment-control comparisons (Fig. 1B), the overall type 1 error decreases with k; that means correcting for the multiplicity of all possible individual treatment-control comparisons leads to a smaller inflation of the overall type 1 error as compared to k = 1. For increasing k, the correction is done for an increasing number of k − 1 hypotheses not tested in the final analysis. Correcting for all possible individual treatment-control comparisons would be a conservative approach if the second stage sample sizes were fixed independently of the data, for example in the planning phase. Here the inflation of the maximum type 1 error rate is caused by the worst case sample size reassessment rule. For a direct comparison with the case of no treatment selection discussed later (see Section 5), the columns "equal" in Table 1 show the maximum overall type 1 error rate for k = 2 with and without correction for multiplicity as well as for the case of k = 1 (Proschan and Hunsberger, 1995).

Table 1: Maximum type 1 error rate for k = 2 with and without treatment selection, with or without adjustment for multiplicity, and with equal or flexible second-to-first-stage ratios, as compared to the case k = 1.

Flexible second-to-first-stage-ratios
"Flexible" second-to-first-stage ratios allow different sample size reassessments for the selected treatment and the control, for example a sample size decrease for the control, but a sample size increase for the selected treatment group. For each interim outcome, the worst case r̂_0 and r̂_m may differ. The sample size of the final treatment-control comparison may then be unbalanced between treatment arms. If we again assume balanced first stage sample sizes, the conditional type 1 error rate is now calculated by (3) setting a_m = 1. We use the independence of Z^(1)_0 and Z^(1)_m to get rid of the nuisance parameter μ. The conditional type 1 error rate cannot be written as a function of the test statistic T^(1)_m as in Section 4.1. As in Graf and Bauer (2011), the calculation of E*_α can be separated into several parts of the interim subspace using Z^(1)_m instead of Z^(1)_1. To evaluate the maximum type 1 error rate we partition the interim subspace in a way analogous to Graf and Bauer (2011) (see Section 1 in the Supplemental Materials).
The gray lines in Figs. 1A and B show that allowing for flexible second-to-first-stage ratios substantially increases the possible maximum type 1 error rate. Using d_{1−α} (Fig. 1B) in all scenarios leads to a nonmonotonic behavior with respect to the number of treatments k. An explanation for this is that the fixed boundaries are correct for the worst case scenarios where the overall sample size is balanced between treatment and control, whereas for the unbalanced worst case scenarios they are smaller than the critical boundaries based on the actual total sample sizes. For larger k, this difference in the correlation matrices extends to all of the k − 1 treatments dropped at interim, so that the differences between unbalanced and balanced critical boundaries tend to increase with increasing k, which in the end leads to an increase in the maximum type 1 error rate. Again, to allow a direct comparison to the other discussed scenarios, the columns "flexible" in Table 1 show the values for k = 2 for both choices of the critical boundary as well as for k = 1 (Graf and Bauer, 2011).
Remark 4.2. When using Dunnett critical boundaries based on the actual total sample sizes in the final analysis (as in Remark 4.1), the maximum type 1 error rate up to k = 10 (data not shown) is smaller than for Dunnett critical boundaries based on balanced sample sizes. The maximum type 1 error rate is decreasing in k, and hence the differences between the two approaches increase with k.

Constrained second stage sample size
Unconstrained sample size reassessment will of course hardly be used in practice. We therefore put constraints on the second-to-first-stage ratios r_i: r_{i,lo} ≤ r_i ≤ r_{i,up}, i ∈ {0, m}. The ranges for the maximization in formula (5) are therefore changed to r_{0,lo} ≤ r_0 ≤ r_{0,up} and r_{m,lo} ≤ r_m ≤ r_{m,up}. Figure 2 shows the maximum type 1 error rate E*_α, α = 0.025, for different constraints on sample size reassessment using the Dunnett critical boundary d_{1−α}:

I. r_{0,lo} = r_{m,lo} = 0, r_{0,up} = r_{m,up}, r_{m,up} = 1, 2, …, 10: Setting the lower boundary to 0 means that we allow for early rejection at interim. The solid lines in Fig. 2 show that E*_α increases with the upper boundary, flattening off for larger values. Allowing for flexible second-to-first-stage ratios (solid lines in Fig. 2B), the increase with the upper boundary is even steeper than for equal ratios (Fig. 2A). However, the results for k ≥ 3 are very similar.

II. r_{0,lo} = r_{m,lo} = 1, r_{0,up} = r_{m,up}, r_{m,up} = 1, 2, …, 10: In this scenario, the second stage sample size has to be at least as large as the first stage sample size for the selected treatment and the control. The dashed lines in Fig. 2A show that for k ≥ 4 the maximum type 1 error is always below the nominal α = 0.025. Calculations including numerical integration of E*_α for k = 4 and r_{i,up} = ∞ give a value of 0.02509. Therefore, for k = 4, when always selecting only one treatment and the control, such constraints may be safely applied in practice without a relevant inflation of the type 1 error rate. The reason is that there is a tradeoff between (i) the overcorrection from using Dunnett boundaries adjusting for treatment-control comparisons that are not carried over to the final test and (ii) the inflation due to the data-dependent choice of the final sample size of the selected treatment (equal ratios, total sample size per selected treatment at least twice the first stage sample size per group). The smaller the prefixed range for the second stage sample sizes, the smaller the impact of the latter effect. Similar results can be found for a nominal α of 0.05 and 0.01. For k = 4 and r_{i,up} = ∞, the values are E*_0.01 = 0.0106 and E*_0.05 = 0.0483. Note that in the scenario for k = 4 without any interim sample size reassessment, for example r_{0,lo} = r_{m,lo} = r_{0,up} = r_{m,up} = 1, the selection of one treatment and the control would happen quite late in the trial in terms of the total sample size over all groups (at a fraction of 5/7). Allowing for flexible second-to-first-stage ratios (Fig. 2B), E*_α does not exceed α only for smaller windows (smaller r_{0,up} and r_{m,up}). For example, for α = 0.025 and r_{0,up} = r_{m,up} = 2, the number of treatments has to be larger than 3 so that E*_α will always be below 0.025.

III. r_{0,lo} = r_{m,lo} = 1, r_{0,up} = 1, r_{m,up} = 1, 2, …, 10: In this case, the second-to-first-stage ratios are flexible by definition; the only option for sample size adaptation is the choice of a second stage sample size for the selected treatment that is at least as large as in the first stage and does not exceed r_{m,up} (see dotted lines in Fig. 2B). Such an adaptation may arise from a rare adverse event in the selected treatment group requiring additional information. It is interesting to note that for k > 2 the maximum type 1 error rate E*_α will never exceed the nominal level, even if the upper boundary is set to ∞. For k = 2 no inflation occurs with r_{m,up} = 2. Similar results can be found for a nominal α of 0.05 and 0.01. Note that Fig. 2B shows that the type 1 error rate is not inflated when Dunnett critical boundaries are used in case of an allocation ratio to control of 1/(k + 1) in both stages, that is, r_{m,lo} = r_{m,up} = k.
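For the equal-ratio case, the constrained worst-case search reduces to a bounded one-dimensional optimization. A minimal Python sketch (function names ours) illustrates that restricting the reassessment window can only reduce the worst-case conditional error:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def ce_equal(r, t, c):
    """CE for equal second-to-first-stage ratios r0 = rm = r, as a function of
    the selected arm's first-stage statistic t (balanced first-stage sizes)."""
    return 1 - norm.cdf((c * np.sqrt(1 + r) - t) / np.sqrt(r))

def worst_ce_constrained(t, c, r_lo, r_up):
    """Maximize CE over the constrained window [r_lo, r_up]."""
    res = minimize_scalar(lambda r: -ce_equal(r, t, c),
                          bounds=(max(r_lo, 1e-9), r_up), method="bounded")
    return -res.fun

c = norm.ppf(0.975)
t = 1.0  # example interim statistic; the unconstrained optimum is r = c^2/t^2 - 1
unconstrained = worst_ce_constrained(t, c, 1e-9, 1e6)
constrained = worst_ce_constrained(t, c, 1.0, 2.0)
print(constrained <= unconstrained + 1e-12)  # True
```

Here the unconstrained optimum r ≈ 2.84 falls outside the window [1, 2], so the constrained maximum is attained at the upper bound and is strictly smaller.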

No treatment selection at interim
Since selecting only the treatment with the largest interim effect is a natural strategy often discussed in the literature (Cohen and Sackrowitz, 1989; Bowden and Glimm, 2008; Friede and Stallard, 2008; Stallard et al., 2008; Bauer et al., 2010), we first elaborated on this in Section 4. However, if all initially planned treatment arms are further investigated in the second stage, under the global null hypothesis the maximum type 1 error rate is larger than in any scenario with treatment selection. The reason is that dropping treatments at the interim analysis can be viewed as a constrained sample size reassessment problem (with r_i = 0 enforced for each dropped treatment i), and this cannot produce a larger maximum of the conditional type 1 error rate than the unconstrained optimization problem.
For k > 1 we were not able to find a general closed-form solution for the maximum type 1 error rate (even if a single constant c_{1−α} is used as the critical boundary for all k standardized treatment vs. control test statistics). To put the above optimization problem into a manageable framework, we illustrate the calculation for the case of two experimental treatment arms (k = 2) in the following. For the less complex scenario of equal second-to-first-stage ratios, numerical results are also reported for k > 2.

Equal second-to-first-stage-ratios
As an extension of Proschan and Hunsberger (1995), we first investigate the case of equal second-to-first-stage ratios, setting r_0 = r_1 = r_2 = r. Assuming furthermore that the first stage sample sizes are balanced, that is, setting a_1 = a_2 = 1 (and therefore also that the final stage sample sizes are balanced between treatment arms), for k = 2 formula (1) simplifies to

CE_α = 1 − Φ_{0,Σ}( (c_{1−α}√(1+r) − T^(1)_1)/√r, (c_{1−α}√(1+r) − T^(1)_2)/√r ).

As in Section 4.1, for notational convenience, the first stage test statistics T^(1)_i for comparing treatment i to the control are used. The conditional type 1 error rate does not depend on the nuisance parameter μ. The cumulative distribution function of the bivariate normal distribution with two-dimensional zero mean vector 0 and covariance matrix Σ with elements σ_11 = σ_22 = 1 and covariance σ_12 = 1/2 is denoted by Φ_{0,Σ}(x). To calculate the worst case conditional type 1 error rate we have to partition the (T^(1)_1, T^(1)_2) sample space:

I. If T^(1)_1 < 0 and T^(1)_2 < 0, the largest conditional type 1 error rate is obtained by setting r̂ = ∞, r̂ denoting the worst case second-to-first-stage ratio. The second stage then overrules the negative interim effects, yielding CE_α = 1 − Φ_{0,Σ}(c_{1−α}, c_{1−α}), which is equal to α if the Dunnett boundary d_{1−α} is used. Since P[T^(1)_1 < 0, T^(1)_2 < 0] = 1/3 for the bivariate normal distribution with σ_12 = 1/2 (see, e.g., Kotz et al., 2000), the contribution of this subspace to the overall maximum type 1 error rate E*_α is (1/3)(1 − Φ_{0,Σ}(c_{1−α}, c_{1−α})).

II. If T^(1)_1 > c_{1−α} or T^(1)_2 > c_{1−α}, the largest conditional type 1 error rate CE_α = 1 is obtained (applying early rejection at interim and setting r̂ = 0). This leads to a contribution to E*_α of P[(T^(1)_1 > c_{1−α}) ∪ (T^(1)_2 > c_{1−α})].

In the remaining interim subspace we were not able to find a closed-form solution for the maximum of CE_α. Therefore, we used numerical optimization of the single parameter r. The "equal" columns of Table 1 show the results for the overall E*_α for the case of k = 2, with and without correction for multiplicity.
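The orthant probability used in subspace I can be checked numerically; the closed form 1/4 + arcsin(ρ)/(2π) for the negative quadrant of a standardized bivariate normal (see Kotz et al., 2000) gives exactly 1/3 at ρ = 1/2:

```python
import numpy as np
from scipy.stats import multivariate_normal

# P[T1 < 0, T2 < 0] for a standardized bivariate normal with correlation 1/2.
rho = 0.5
p = multivariate_normal.cdf([0.0, 0.0], mean=[0.0, 0.0],
                            cov=[[1.0, rho], [rho, 1.0]])
closed_form = 0.25 + np.arcsin(rho) / (2 * np.pi)
print(p, closed_form)  # both approximately 1/3
```

Since arcsin(1/2) = π/6, the closed form evaluates to 1/4 + 1/12 = 1/3.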
As is to be expected, applying the naive unadjusted critical boundary z_{1−α} may result in a further considerable type 1 error rate inflation as compared to k = 1. An interesting finding is that when using the Dunnett critical value, the values of E*_α are close to the results for k = 1. For k = 3, using Dunnett critical boundaries for α = 0.025, the maximum type 1 error rate is still inflated up to 0.0545, but interestingly the inflation is smaller than for k = 2. For k = 4 treatments, E*_0.025 flattens off at an inflated level of 0.0543. For α = 0.01 and 0.05 the same tendencies can be found.

Flexible second-to-first-stage-ratios
If we allow for flexible second-to-first-stage ratios, we again have to use the independent Z^(1)_i (instead of the test statistics T^(1)_i) to get rid of the nuisance parameter μ. If we assume balanced first stage sample sizes, the conditional type 1 error rate is now calculated by (1) setting k = 2 and a_1 = a_2 = 1. To explain the worst case scenarios in more detail, we will focus on the subspaces in terms of the interim outcome of the control group Z^(1)_0.
A. Subspace (Z^(1)_0 ≤ −c_{1−α}): CE_α = 1 is obtained by setting either r̂_1 or r̂_2 to ∞ and r̂_0 = 0. The contribution of this subspace to E*_α therefore is Φ(−c_{1−α}).

B. Subspace (Z^(1)_0 ≥ 0): The worst case choice is setting r̂_0 = ∞, so that in the final analysis we get two conditionally independent tests against the asymptotically fixed control mean μ = 0. Hence the conditional type 1 error rate reduces to

CE_α = 1 − Φ( (c_{1−α}√(1+r_1) − Z^(1)_1)/√r_1 ) Φ( (c_{1−α}√(1+r_2) − Z^(1)_2)/√r_2 ),

independent of Z^(1)_0. A detailed explanation of the calculation of the maximum type 1 error rate for this subspace B is given in Section 2 of the Supplemental Materials. Summing up the results for Z^(1)_0 ≥ 0, the contribution to the overall maximum type 1 error rate can be calculated by numerical integration of the maximized conditional type 1 error rate over this subspace.

C. Subspace (−c_{1−α} < Z^(1)_0 < 0): In this region the worst case conditional type 1 error rate depends on all three interim values of the control and treatment groups. If either Z^(1)_1 or Z^(1)_2 is larger than min(c_{1−α}√2 + Z^(1)_0, c_{1−α}), again a conditional type 1 error rate of 1 can be achieved. For the remaining regions we used numerical point-wise optimization and integration to calculate the contribution to the overall type 1 error rate E*_α.
The columns "flexible" for k = 2 of Table 1 show the total E*_α for flexible second-to-first-stage ratios applying the critical boundaries z_{1−α} or d_{1−α}. Without any correction for multiplicity (z_{1−α}), the maximum type 1 error is clearly increased as compared to the case k = 1. Interestingly, as for the results for equal second-to-first-stage ratios (see Section 5.1), when using the pre-specified Dunnett critical boundary, E*_α is close to the results for k = 1. Due to the numerical burden, we did not calculate the maximum type 1 error rate for k > 2. However, we expect similar findings as for the case of equal second-to-first-stage ratios, at least for k = 3 and 4, that is, the maximum type 1 error rate slightly decreasing when using a Dunnett-adjusted critical boundary.

Practical recommendations
The results presented for the case of selecting the most promising hypothesis at interim are of great practical interest, because they demonstrate that, given certain restrictions on the second stage sample size, naive strategies may even lead to an adequate control of the type 1 error rate. For example, if the sample size per treatment group in the second stage is at least as large as in the first stage and we only allow for equal second-to-first-stage ratios, no inflation of the type 1 error rate occurs for a number of treatments k ≥ 4 when simply using the Dunnett critical boundaries. For k = 3, no inflation occurs when restricting the second stage sample size to be at most 4 times the first stage sample size (see Fig. 2A). If we fix the overall sample size in the control group, allowing for any choice of the overall sample size in the selected treatment group that increases its first stage sample size more than twofold does not lead to an inflation of α for k ≥ 3 (see Fig. 2B). Therefore, if in the case study of Barnes et al. (2010) (see Section 2) only the selection of a single treatment group and control had been prespecified, the experimenter would have been permitted to do any balanced increase of the sample size, even when using the conventional test statistic and the less conservative Dunnett critical boundary (instead of the applied Bonferroni adjustment) for final testing. If a flexible sample size reassessment for the second stage had been allowed for (as in Section 4.2), no type 1 error inflation would have occurred if the second stage sample size had been constrained to lie between the first stage and twice the first stage sample size. However, it has to be noted that for realistic scenarios (e.g., an upper bound of twice the first stage sample size) and larger k, the obtained maximum type 1 error rate may be much smaller than α, so that even using the Dunnett critical boundaries would lead to conservative procedures.
Note that these results only apply when prespecified binding constraints are placed on the selection rules.
Allowing for early rejection at interim, the maximum type 1 error rate will always be inflated. In such scenarios, if the use of conventional test statistics is preferred, one may adjust the critical boundary so that the maximum type 1 error rate is controlled. As an example, assume that we only allow for equal second-to-first-stage ratios, setting the upper bound of the second stage sample size of the selected treatment and control to twice the first stage sample size. For k = 2, an adjusted level of 0.013 (instead of 0.025) has to be used to control the maximum type 1 error rate. In more detail, if we assume for both treatments an effect size of 0.5 times the standard deviation, a sample size of n = 65 per group would be needed to achieve 80% power. Compared to a fixed sample size test with Dunnett-adjusted critical boundaries, this would be a 20.4% increase of the per-group sample size. This required increase decreases only slightly with k: for k = 3 an increase of 18.8% and for k = 4 an increase of 17.0% of the per-group sample size is needed to control the maximum type 1 error rate when additionally allowing for the given sample size reassessment. To achieve a power of 90%, a slightly smaller increase in the per-group sample size is needed; that is, an increase of 16.4%, 16.7%, and 15.6% would be needed for k = 2, 3, and 4, respectively. All these examples show that adjusting for the worst case would be a rather conservative strategy and adaptive tests should be implemented instead (Koenig et al., 2008; Bretz et al., 2009).
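The order of magnitude of the sample size penalty quoted above can be checked with a back-of-the-envelope calculation. The sketch below uses the standard single-comparison normal-approximation formula n = 2((z_{1−α} + z_{1−β})/δ)² as a crude proxy; it ignores the Dunnett correlation and uses marginal rather than disjunctive power, so it only roughly tracks the exact figures reported in the text.

```python
from scipy.stats import norm

def n_per_group(level, power, delta=0.5):
    # Normal-approximation per-group sample size for a one-sided
    # two-sample z-test with standardized effect delta:
    #   n = 2 * ((z_{1-level} + z_{power}) / delta)^2
    return 2.0 * ((norm.ppf(1.0 - level) + norm.ppf(power)) / delta) ** 2

# Relative per-group sample size when moving from the nominal one-sided
# level 0.025 to the adjusted level 0.013 quoted above (80% power).
ratio = n_per_group(0.013, 0.80) / n_per_group(0.025, 0.80)
```

The resulting ratio of roughly 1.2 is close to the 20.4% increase reported for k = 2, even though the exact value in the text rests on the full multiarm computation.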

Discussion
In this paper, we have investigated the maximum type 1 error rate arising from the application of a nonadaptive test used by experimenters who freely adapt their ongoing trials. This problem has been addressed by Proschan and Hunsberger (1995) for the comparison of one treatment with a control and balanced sample sizes before and after the adaptive interim analysis. They also considered a restricted rule incorporating a stopping-for-futility criterion, which leads to procedures where the effect of adjusting for the adaptation of the sample size is no longer dramatic. Graf and Bauer (2011) have extended the worst case calculations allowing for unbalanced sample sizes. In this paper, a further level of complexity has been added by considering multiple comparisons of k treatments with a single control. For the case without selection of a treatment arm at interim, we calculate the maximum type 1 error rate for k = 2 in the case of equal and flexible second-to-first-stage ratios (assuming balanced first stage sample sizes). Not surprisingly, when applying uncorrected level α treatment-control comparisons, the worst case type 1 error rate is dramatically inflated. By using Dunnett-adjusted critical boundaries, the worst case inflation is still large. Interestingly, the inflation is very similar to the case of comparing k = 1 treatment to a control (Graf and Bauer, 2011). This means that when adjusting for the number of treatments, for k = 2 no noticeable further maximum inflation of the type 1 error rate occurs as compared to k = 1.
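The Dunnett-adjusted critical boundaries referred to throughout can be computed numerically. The following sketch (not the authors' code) solves for the one-sided boundary of the maximum of k equicorrelated standard normal test statistics, where the correlation 1/2 arises from the shared control group under balanced allocation:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import brentq

def dunnett_boundary(k, alpha=0.025):
    # One-sided Dunnett boundary for k treatments vs a common control
    # (known variance, balanced groups => equicorrelation 1/2):
    # solve P(max_i Z_i <= c) = 1 - alpha for an equicorrelated
    # k-variate standard normal vector.
    cov = np.full((k, k), 0.5) + 0.5 * np.eye(k)

    def excess(c):
        return multivariate_normal.cdf(np.full(k, c),
                                       mean=np.zeros(k), cov=cov) - (1.0 - alpha)

    # The boundary lies between the unadjusted and the Bonferroni quantile.
    return brentq(excess, norm.ppf(1.0 - alpha), norm.ppf(1.0 - alpha / k))

c2 = dunnett_boundary(2)  # for k = 2 at one-sided alpha = 0.025
```

For k = 2 this yields a boundary of about 2.21, slightly below the Bonferroni quantile of about 2.24, illustrating why the Dunnett boundary is the less conservative choice mentioned in the practical recommendations.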
The case of equal and flexible second-to-first-stage ratios was investigated for scenarios where only a single treatment and the control are selected at the interim analysis. In this scenario, there is a trade-off between inflation due to sample size reassessment and the overcorrection for the k − 1 treatments finally not selected and not tested in the statistical analysis. For equal ratios, the maximum type 1 error rate decreases monotonically with k toward a finite limit noticeably larger than the nominal level α. As expected, the impact of flexible ratios is more severe: the maximum inflation of the actual level α, though decreasing for small k, increases again for larger k.
There are several caveats to be mentioned here. First, for the case of flexible ratios the conditional error can only be calculated when the nuisance parameter, the common mean under the global null hypothesis, is known. Secondly, the maximum type 1 error rate only occurs if the experimenters apply the worst case sample size reassessment rule (maximizing the conditional type 1 error rate) at every point in the interim sample space. Thirdly, in some interim subspace, the maximum is attained only as some of the second stage sample sizes go to infinity. Although theoretically interesting, this of course means that these maximum type 1 error rates can never be reached in real clinical trials. Adjusting for these "unrestricted worst cases" would be an extremely conservative strategy and cannot be recommended for use in practice. Therefore, we also investigated maximum type 1 error rates that arise when the second stage sample sizes are constrained by upper and lower limits. Some of these results are practically interesting, because they demonstrate that in certain cases, when putting restrictions on the second stage sample sizes, naive strategies can control the type 1 error rate. Such calculations under constraints could replace simulations of the type 1 error rate in designs with adaptive selection rules, the latter being considered problematic by some researchers (Posch et al., 2011).
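The worst case reassessment rule is easy to verify numerically in the simplest setting of one treatment versus control with a balanced first stage. The sketch below (an illustration, not the authors' code) maximizes the conditional error over the second-to-first-stage sample size ratio r for a fixed interim statistic t and compares the result with the closed form of Proschan and Hunsberger (1995); the maximizing r corresponds to a total-to-first-stage ratio of (c/t)².

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def conditional_error(r, t, c):
    # Conditional type 1 error of the naive final z-test, given the
    # first-stage statistic t, second-to-first-stage sample size ratio r,
    # and critical boundary c (one treatment vs control, balanced case).
    return 1.0 - norm.cdf((c * np.sqrt(1.0 + r) - t) / np.sqrt(r))

c = norm.ppf(0.975)  # one-sided boundary for alpha = 0.025
t = 1.0              # an interim value with 0 < t < c

res = minimize_scalar(lambda r: -conditional_error(r, t, c),
                      bounds=(1e-6, 1e4), method="bounded")
ce_max = -res.fun
# Closed-form worst case (Proschan and Hunsberger, 1995):
ce_closed = 1.0 - norm.cdf(np.sqrt(c ** 2 - t ** 2))
```

For this t the maximized conditional error clearly exceeds the nominal 0.025, and the numerical maximizer agrees with the analytical rule.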
An open research problem at present is the unconstrained optimization for k > 2, which imposes a heavy burden of numerical integration and optimization. For the unconstrained scenario with k = 2, the optimization takes up to half a second for one grid point on an Intel(R) Core(TM) i5 CPU M540 processor at 2.53 GHz, and it is therefore still a time-consuming numerical challenge to derive a sufficiently narrow grid over the three-dimensional interim subspaces with sufficiently accurate values of the maximum conditional error functions to be integrated. Scenarios in which the selection of s, 1 < s < k, out of k treatment groups together with the control is prespecified are also of high interest.
As a conclusion, we do not recommend the use of unrestricted "worst case" adjustments, since they will be far too conservative for serious consideration. If limits on sample size modifications can be imposed, it is still important to compare the operating characteristics of adaptive designs with the maximum-type-1-error-based adjustments discussed here. Only then can we decide whether sample size limits can or should be imposed and how tight they need to be.
Bremen. She is grateful to Werner Brannath for hospitality at the Competence Center for Clinical Trials as well as for helpful comments. Peter Bauer's and Franz Koenig's research has received funding from the European Union Seventh Framework Programme [FP7/2007-2013] under grant agreement No. 602552. Furthermore, we thank the anonymous reviewers, the associate editor, and editor Lutz Edler for helpful suggestions, as well as Byron Jones for critical proofreading of the paper, all of which have improved the paper substantially.

A.1 Calculation of the conditional type 1 error rate
In the following we assume that the global null hypothesis applies (μ_i = μ for i = 0, . . . , k). The conditional type 1 error rate for rejecting at least one treatment-control comparison in the final analysis, given the interim data, can be calculated as follows. In the final analysis after the second stage, each test (comparing treatment i to the control group) is based on the overall test statistic

T_i = (x̄_i − x̄_0) / √( 1/(n_i^(1) + n_i^(2)) + 1/(n_0^(1) + n_0^(2)) ),

where x̄_i = (n_i^(1) x̄_i^(1) + n_i^(2) x̄_i^(2)) / (n_i^(1) + n_i^(2)) denotes the mean of treatment group i pooled over both stages, analogously x̄_0 = (n_0^(1) x̄_0^(1) + n_0^(2) x̄_0^(2)) / (n_0^(1) + n_0^(2)) for the control group, the variance is standardized to 1, and the stagewise statistics Z_i^(j) = (x̄_i^(j) − μ) √(n_i^(j)), i ≥ 0, j = 1, 2, have independent standard normal distributions. If the overall test statistic T_i is larger than the critical boundary c_{1−α}, we get a false positive decision. Since Z_i^(2) and Z_0^(2) have independent standard normal distributions for every set of values a_i, r_0, r_i, Z_0^(1), and Z_i^(1) (and hence are independent of these quantities), the conditional error can be written as in formula (1), with balanced allocation fraction 1/(1 + k), 0 here denoting the k-dimensional zero vector and Σ the k-dimensional covariance matrix with σ_ii = 1 and σ_ij = 1/2 for i ≠ j.

II. Within the subspace 0 < T_m^(1) < c_{1−α}, Proschan and Hunsberger (1995) showed that r̄ = (c_{1−α} / T_m^(1))² leads to the worst case conditional type 1 error rate CE_α(T_m^(1)) = 1 − Φ(√(c²_{1−α} − (T_m^(1))²)). We found no simplification of the two-dimensional integration in this subspace.
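Integrating the worst case conditional error over the interim sample space reproduces the figure quoted in the introduction for the single-treatment balanced case. The sketch below (an illustration under the one-sided level α = 0.05 used by Proschan and Hunsberger, not the multiarm computation of this paper) combines the three regimes of the worst case rule:

```python
from scipy.integrate import quad
from scipy.stats import norm

alpha = 0.05
c = norm.ppf(1.0 - alpha)

# Worst-case conditional error over the interim statistic t
# (one treatment vs control, balanced first stage):
#   t <= 0    : CE can be pushed arbitrarily close to alpha
#               (second stage grows without bound)
#   0 < t < c : CE = 1 - Phi(sqrt(c^2 - t^2))   (Proschan-Hunsberger)
#   t >= c    : CE approaches 1 (vanishing second stage)
mid, _ = quad(lambda t: norm.pdf(t) * (1.0 - norm.cdf((c ** 2 - t ** 2) ** 0.5)),
              0.0, c)
max_type1 = alpha * norm.cdf(0.0) + mid + (1.0 - norm.cdf(c))
```

The result is approximately 0.11, matching the inflation from 0.05 reported in the introduction.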